
Help me figure out the problem in my Python code. UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

The code:

import nltk 
import re 
import pickle 


raw = open('tom_sawyer_shrt.txt').read() 

### this is how the basic Punkt sentence tokenizer works 
#sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle') 
#sents = sent_tokenizer.tokenize(raw) 

### train & tokenize using the text itself 
sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw) 
sent_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(sent_trainer) 
# break into sentences 
sents = sent_tokenizer.tokenize(raw) 
# get sentence start/stop indexes 
# list() so the spans can be indexed below (span_tokenize can return a generator)
sentspan = list(sent_tokenizer.span_tokenize(raw)) 



### Remove \n in the middle of sentences, due to fixed-width formatting 
for i in range(0,len(sents)-1): 
    sents[i] = re.sub('(?<!\n)\n(?!\n)',' ',raw[sentspan[i][0]:sentspan[i+1][0]]) 

for i in range(1, len(sents)): 
    if (sents[i][0:3] == '"\n\n'): 
        # reattach a quote mark that was split onto the next sentence
        sents[i-1] = sents[i-1] + '"\n\n' 
        sents[i] = sents[i][3:] 


### Loop through each sentence, fitting each tweet to 140 characters 
i=0 
tweet=[] 
while (i < len(sents)): 
    if (len(sents[i]) > 140): 
        # too long for one tweet: split into roughly equal word chunks
        ntwt = int(len(sents[i])/140) + 1 
        words = sents[i].split(' ') 
        nwords = len(words) 
        for k in range(0, ntwt): 
            tweet = tweet + [ 
                re.sub('\A\s|\s\Z', '', ' '.join( 
                    words[int(k*nwords/float(ntwt)): 
                          int((k+1)*nwords/float(ntwt))] 
                ))] 
        i = i + 1 
    else: 
        if (i < len(sents)-1): 
            if (len(sents[i]) + len(sents[i+1]) < 140): 
                # short sentence: greedily merge following sentences while under 140
                nextra = 1 
                while (len(''.join(sents[i:i+nextra+1])) < 140): 
                    nextra = nextra + 1 
                tweet = tweet + [ 
                    re.sub('\A\s|\s\Z', '', ''.join(sents[i:i+nextra])) 
                ] 
                i = i + nextra 
            else: 
                tweet = tweet + [re.sub('\A\s|\s\Z', '', sents[i])] 
                i = i + 1 
        else: 
            tweet = tweet + [re.sub('\A\s|\s\Z', '', sents[i])] 
            i = i + 1 


### A last pass to clean up leading/trailing newlines/spaces. 
for i in range(0,len(tweet)): 
    tweet[i] = re.sub('\A\s|\s\Z','',tweet[i]) 

for i in range(0,len(tweet)): 
    tweet[i] = re.sub('\A"\n\n','',tweet[i]) 


### Save tweets to pickle file for easy reading later 
output = open('tweet_list.pkl','wb') 
pickle.dump(tweet,output,-1) 
output.close() 


listout = open('tweet_lis.txt','w') 
for i in range(0,len(tweet)): 
    listout.write(tweet[i]) 
    listout.write('\n-----------------\n') 

listout.close() 

This is the error I get when the text has some Unicode characters in it:

Traceback (most recent call last):
  File "twain_prep.py", line 13, in <module>
    sent_trainer = nltk.tokenize.punkt.PunktSentenceTokenizer().train(raw)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 1227, in train
    token_cls=self._Token).get_params()
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 649, in __init__
    self.train(train_text, verbose, finalize=True)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 713, in train
    self._train_tokens(self._tokenize_words(text), verbose)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 729, in _train_tokens
    tokens = list(tokens)
  File "/home/user/.local/lib/python2.7/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)

Answer


The error message shows that you're getting a UnicodeDecodeError. By default, Python 2 strings only handle ASCII values, so the text you're sending to the tokenizer must contain characters that aren't in the ASCII set.
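
For example (a minimal Python 2 sketch, not from the original post): mixing a byte string that contains non-ASCII bytes with a unicode string forces an implicit decode through the ascii codec, which is the same failure the tokenizer hits internally:

# -*- coding: utf-8 -*-
# byte string holding the UTF-8 bytes 0xe2 0x80 0x9c / 0x9d (curly quotes)
s = 'hello \xe2\x80\x9cworld\xe2\x80\x9d'

try:
    u = u'' + s   # str + unicode triggers an implicit ascii decode
except UnicodeDecodeError as e:
    print(e)      # 'ascii' codec can't decode byte 0xe2 in position 6 ...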

How can you fix it?

You can convert the text to ASCII characters and ignore the non-ASCII ('unicode') ones (this assumes the file is UTF-8 encoded):

raw = raw.decode('utf-8').encode('ascii', 'ignore')  # decode the bytes first, then drop non-ASCII characters 
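
Alternatively (a sketch assuming the file is UTF-8 encoded; this keeps the special characters instead of discarding them), you can read the file as Unicode up front so the tokenizer never touches the ascii codec:

import io

# read the file straight into a unicode object (the encoding is an assumption)
raw = io.open('tom_sawyer_shrt.txt', encoding='utf-8').read()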

You can also read this post on handling Unicode errors.
