Set of WTF-8 strings, as 16-bit slices, that are ill-formed UTF-8

I am trying to define the set of 16-bit slices (Rust: &[u16]) that are valid WTF-8 (when re-encoded) but invalid UTF-8 (when re-encoded), so that I can generate such slices at random. This is part of an effort to generate all possible std::ffi::OsString values on a Windows machine that do not parse into a String.

The conversion &[u16] -> OsString is done via std::os::windows::ffi::OsStringExt::from_wide, which redirects to libstd/sys_common/wtf8.rs, where the operation is defined as:

/// Creates a WTF-8 string from a potentially ill-formed UTF-16 slice of 16-bit code units.
///
/// This is lossless: calling `.encode_wide()` on the resulting string
/// will always return the original code units.
pub fn from_wide(v: &[u16]) -> Wtf8Buf {
    let mut string = Wtf8Buf::with_capacity(v.len());
    for item in char::decode_utf16(v.iter().cloned()) {
        match item {
            Ok(ch) => string.push_char(ch),
            Err(surrogate) => {
                let surrogate = surrogate.unpaired_surrogate();
                // Surrogates are known to be in the code point range.
                let code_point = unsafe {
                    CodePoint::from_u32_unchecked(surrogate as u32)
                };
                // Skip the WTF-8 concatenation check,
                // surrogate pairs are already decoded by decode_utf16
                string.push_code_point_unchecked(code_point)
            }
        }
    }
    string
}
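
To make the encoding step concrete, here is a minimal standalone sketch of the same idea using only the public standard library. The name `wtf8_from_wide` is hypothetical and this is not the std implementation; it only assumes that an unpaired surrogate is written in the generalized three-byte form that WTF-8 uses.

```rust
/// Encodes potentially ill-formed UTF-16 as WTF-8 bytes.
/// Hypothetical helper for illustration, not the std implementation.
pub fn wtf8_from_wide(v: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(v.len());
    for item in std::char::decode_utf16(v.iter().cloned()) {
        match item {
            // Well-formed scalar values use their ordinary UTF-8 encoding.
            Ok(ch) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            // An unpaired surrogate (U+D800..=U+DFFF) gets the generalized
            // three-byte form 1110xxxx 10xxxxxx 10xxxxxx.
            Err(e) => {
                let s = e.unpaired_surrogate();
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}
```

A lone 0xD800 comes out as the bytes ED A0 80, which is exactly the pattern a strict UTF-8 validator such as `std::str::from_utf8` rejects.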

The conversion OsString -> Result<String, Wtf8Buf> is done via into_string, defined in the same file (libstd/sys_common/wtf8.rs):

/// Consumes the WTF-8 string and tries to convert it to UTF-8.
///
/// This does not copy the data.
///
/// If the contents are not well-formed UTF-8
/// (that is, if the string contains surrogates),
/// the original WTF-8 string is returned instead.
pub fn into_string(self) -> Result<String, Wtf8Buf> {
    match self.next_surrogate(0) {
        None => Ok(unsafe { String::from_utf8_unchecked(self.bytes) }),
        Some(_) => Err(self),
    }
}

with next_surrogate defined as:

#[inline]
fn next_surrogate(&self, mut pos: usize) -> Option<(usize, u16)> {
    let mut iter = self.bytes[pos..].iter();
    loop {
        let b = *iter.next()?;
        if b < 0x80 {
            pos += 1;
        } else if b < 0xE0 {
            iter.next();
            pos += 2;
        } else if b == 0xED {
            match (iter.next(), iter.next()) {
                (Some(&b2), Some(&b3)) if b2 >= 0xA0 => {
                    return Some((pos, decode_surrogate(b2, b3)))
                }
                _ => pos += 3
            }
        } else if b < 0xF0 {
            iter.next();
            iter.next();
            pos += 3;
        } else {
            iter.next();
            iter.next();
            iter.next();
            pos += 4;
        }
    }
}
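
The `decode_surrogate` helper is not shown above. The sketch below assumes it simply reassembles the 16-bit value from the two continuation bytes, and adds a hypothetical `first_surrogate` function that finds the same byte pattern with a sliding window instead of stepping sequence by sequence. The window scan relies on the input already being well-formed WTF-8, where the byte 0xED can only occur as a lead byte.

```rust
/// Rebuilds a surrogate from the second and third byte of its three-byte
/// WTF-8 form (assumed shape of the helper, not copied from std).
fn decode_surrogate(second_byte: u8, third_byte: u8) -> u16 {
    0xD800 | ((second_byte as u16 & 0x3F) << 6) | (third_byte as u16 & 0x3F)
}

/// Returns the byte position and value of the first surrogate, if any.
/// Hypothetical helper; assumes `wtf8` is well-formed WTF-8.
fn first_surrogate(wtf8: &[u8]) -> Option<(usize, u16)> {
    wtf8.windows(3).enumerate().find_map(|(pos, w)| {
        // 0xED followed by 0xA0..=0xBF encodes U+D800..=U+DFFF;
        // 0xED followed by 0x80..=0x9F is an ordinary character.
        if w[0] == 0xED && w[1] >= 0xA0 {
            Some((pos, decode_surrogate(w[1], w[2])))
        } else {
            None
        }
    })
}
```

Note that a lead byte of 0xED alone is not enough: U+D55C, for example, is encoded as ED 95 9C and is perfectly valid UTF-8, which is why the second byte has to be compared against 0xA0.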

What I want is an algorithm that produces a Vec<u16> such that OsString::from_wide(vec.as_slice()).into_string().unwrap_err() never panics and returns an OsString. Of course, the set of OsStrings should be maximal and should not rely on trivial constants.
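
Stated as a predicate, the requirement might look like the sketch below. `is_ill_formed_utf16` is a hypothetical, portable stand-in that only inspects the code units; `into_string_fails` is the Windows-only form of the exact expression above and is compiled out on other platforms.

```rust
/// True when `v` contains a surrogate outside a pair, i.e. is not valid UTF-16.
/// Hypothetical helper; works on any platform.
pub fn is_ill_formed_utf16(v: &[u16]) -> bool {
    std::char::decode_utf16(v.iter().cloned()).any(|r| r.is_err())
}

/// Windows-only form of the property: `into_string` must return `Err`,
/// so that `unwrap_err()` does not panic.
#[cfg(windows)]
pub fn into_string_fails(v: &[u16]) -> bool {
    use std::ffi::OsString;
    use std::os::windows::ffi::OsStringExt;
    OsString::from_wide(v).into_string().is_err()
}
```

On any platform the first function should agree with `String::from_utf16(v).is_err()`, which gives a quick way to test a generator without a Windows machine.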

To do this, and to simplify, we can define two operations, where Gen is a kind of monad for generating random data of the given type:

encode_wide : &[u8] -> &[u16]
valid_wtf8_invalid_utf8 : () -> Gen<Vec<u8>>

By mapping encode_wide over the functor given by valid_wtf8_invalid_utf8(), we get a Gen<Vec<u16>>, from which we can in turn get a Gen<OsString>.
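
Since Gen is abstract, the following is only one possible reading of it: a boxed function from a randomness source to a value, with a `map` method playing the role of the functor map. Every name here (`XorShift`, `Gen`, `lone_surrogate`, `ill_formed_wide`) is hypothetical, and the tiny xorshift generator is just a dependency-free stand-in for a real RNG.

```rust
/// Tiny xorshift64* generator; a stand-in for a real randomness source.
/// The seed must be non-zero, otherwise the state stays zero forever.
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u32(&mut self) -> u32 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        (self.0.wrapping_mul(0x2545_F491_4F6C_DD1D) >> 32) as u32
    }
}

/// Hypothetical `Gen<T>`: a boxed function from a randomness source to a `T`.
pub struct Gen<T>(Box<dyn Fn(&mut XorShift) -> T>);

impl<T: 'static> Gen<T> {
    pub fn new(f: impl Fn(&mut XorShift) -> T + 'static) -> Self {
        Gen(Box::new(f))
    }

    pub fn sample(&self, rng: &mut XorShift) -> T {
        (self.0)(rng)
    }

    /// Functor map: post-process every generated value with `f`.
    pub fn map<U: 'static>(self, f: impl Fn(T) -> U + 'static) -> Gen<U> {
        Gen::new(move |rng: &mut XorShift| f(self.sample(rng)))
    }
}

/// Generates one arbitrary surrogate code unit in 0xD800..=0xDFFF.
pub fn lone_surrogate() -> Gen<u16> {
    Gen::new(|rng: &mut XorShift| 0xD800 | (rng.next_u32() as u16 & 0x07FF))
}

/// Maps the generator above to one-element vectors, which are never valid
/// UTF-16 because a single surrogate cannot be part of a pair.
pub fn ill_formed_wide() -> Gen<Vec<u16>> {
    lone_surrogate().map(|s| vec![s])
}
```

This deliberately covers only a small corner of the target set; it is meant to show the shape of Gen and map, not a maximal generator.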

However, I am not sure how to define the operations encode_wide and valid_wtf8_invalid_utf8. Is there a more direct approach than inverting the logic of the functions given above?

Since Gen is abstract, I do not expect runnable code, but pseudocode or other high-level instructions would be neat. Thanks.

Source

2017-12-11 Centril

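It is not entirely clear to me whether you want to generate strings in WTF-16 \ UTF-16 or in WTF-8 \ UTF-8. I think it is easier to generate WTF-16 strings that are not valid UTF-16.

You only need to make sure that at least one (16-bit) "character" is a surrogate that is not part of a surrogate pair. (This example may also generate NUL characters in the string.)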

extern crate rand;

use rand::Rng;

// Written against the rand API of the time (`Rng::next_u32` and the
// two-argument `gen_range` with an exclusive upper bound).
pub fn gen_wtf16_invalid_utf16<R>(r: &mut R, len: usize) -> Vec<u16>
where
    R: Rng,
{
    assert!(len > 0);
    let mut buf = Vec::with_capacity(len);
    for _ in 0..len {
        buf.push(r.next_u32() as u16);
    }
    // make element at position `p` a surrogate that is not part
    // of a surrogate pair; the upper bound of `gen_range` is exclusive,
    // so `len` (not `len - 1`) lets every index be chosen and avoids
    // a panic for `len == 1`
    let p = r.gen_range(0, len);
    // if first elem or previous entry is not a leading surrogate
    let gen_trail = (0 == p) || (0xd800 != buf[p - 1] & 0xfc00);
    // if last element or succeeding entry is not a trailing surrogate
    let gen_lead = (p == len - 1) || (0xdc00 != buf[p + 1] & 0xfc00);
    let (force_bits_mask, force_bits_value) = if gen_trail {
        if gen_lead {
            // trailing or leading surrogate
            (0xf800, 0xd800)
        } else {
            // trailing surrogate
            (0xfc00, 0xdc00)
        }
    } else {
        // leading surrogate; if the successor happens to be a trailing
        // surrogate the two pair up, but then the leading surrogate at
        // `p - 1` is the one left unpaired
        (0xfc00, 0xd800)
    };
    debug_assert_eq!(0, (force_bits_value & !force_bits_mask));
    buf[p] = (buf[p] & !force_bits_mask) | force_bits_value;
    buf
}

fn main() {
    let s = gen_wtf16_invalid_utf16(&mut rand::thread_rng(), 10);
    for c in &s {
        println!("0x{:04x}", c);
    }
}

Source

2017-12-11 10:50:35 Stefan

As far as I understand it, `OsString` on Windows is actually backed by `Wtf8Buf`, but the only way to construct an `OsString` with that is `from_wide`. Some docs: https://doc.rust-lang.org/nightly/std/ffi/struct.OsString.html, in particular: "On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so." That is very confusing to me. Given a `vec: Vec<u16>` generated by your algorithm, does `OsString::from_wide(vec.as_slice()).into_string().unwrap_err()` never panic? – Centril

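I am sure the generated sequences are not valid UTF-16 (because they contain a surrogate outside a pair). Such a sequence cannot become valid UTF-8 when converted to WTF-8: the conversion is lossless, so converting the result back must return the original invalid UTF-16, whereas converting valid UTF-8 back to WTF-16 would yield valid UTF-16. – Stefan

Thanks for solving that right away. I tested it and it seems to work well. Cheers! – Centril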
