
Strange behavior with the nltk sentence tokenizer and special characters: I'm seeing odd behavior when using sent_tokenizer on German text.

Example code:

# -*- coding: utf-8 -*-
import nltk

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität. Tolles Teil."):
    print sent

This fails with the following error:

Traceback (most recent call last): 
for sent in sent_tokenize("Super Qualität. Tolles Teil."): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/__init__.py", line 82, in sent_tokenize 
    return tokenizer.tokenize(text) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1270, in tokenize 
    return list(self.sentences_from_text(text, realign_boundaries)) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1318, in sentences_from_text 
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)] 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1309, in span_tokenize 
    return [(sl.start, sl.stop) for sl in slices] 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1348, in _realign_boundaries 
    for sl1, sl2 in _pair_iter(slices): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter 
    prev = next(it) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1324, in _slices_from_text 
    if self.text_contains_sentbreak(context): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1369, in text_contains_sentbreak 
    for t in self._annotate_tokens(self._tokenize_words(text)): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1504, in _annotate_second_pass 
    for t1, t2 in _pair_iter(tokens): 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 354, in _pair_iter 
    prev = next(it) 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 621, in _annotate_first_pass 
    for aug_tok in tokens: 
    File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 586, in _tokenize_words 
    for line in plaintext.split('\n'): 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128) 
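Why position 6? The byte 0xc3 is the first byte of the UTF-8 encoding of the "ä" in the token Qualität (Q-u-a-l-i-t occupy indices 0-5, so 0xc3 lands at index 6). Punkt apparently ends up mixing the raw byte string with unicode objects internally, which in Python 2 triggers an implicit ASCII decode of the bytes. A minimal sketch reproducing the same error, assuming a UTF-8 encoded source file on Python 2.7:

# -*- coding: utf-8 -*-
word = "Qualität"    # a byte str: 'Qualit\xc3\xa4t' in a UTF-8 source file

# Any operation that forces the byte str through the default codec
# fails exactly like the traceback above:
word.decode('ascii')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6:
# ordinal not in range(128)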

Whereas this:

sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize("Super Qualität des Produktes. Tolles Teil."):
    print sent

works perfectly.


Are you missing an "r" at the end of the function name? `for sent in sent_tokenize("Super Qualität. Tolles Teil."):` – Mr.Polywhirl


@Mr.Polywhirl That's just a typo in the question :-). It wasn't the problem. – Chris


The problem is with sentences whose last word contains a non-ASCII character, but I don't know why. If you use "Super Qualität. Tolles Teil" like this, it works. –

Answers


I found the solution in the NLTK documentation:

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

So this:

# -*- coding: utf-8 -*-
import nltk

text = "Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
# decode the UTF-8 byte string to unicode before tokenizing
for sent in sent_tokenizer.tokenize(text.decode('utf8')):
    print sent

works like magic.
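As an alternative sketch (assuming the same NLTK setup on Python 2.7), you can write the literal as a unicode string up front, so no explicit decode step is needed; on Python 3, str is Unicode by default and this issue does not arise.

# -*- coding: utf-8 -*-
import nltk

# u"..." makes the literal a unicode object from the start,
# so Punkt never sees raw UTF-8 bytes.
text = u"Super Qualität. Tolles Teil."
sent_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
for sent in sent_tokenizer.tokenize(text):
    print sent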
