2013-02-04 4 views
1

Is there an existing mapping from utf8 to latin-1 and to normalized non-accented characters? (Python)

I am getting the following error:

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u010d' in position 4: ordinal not in range(256) 

I have been resolving each of these errors by hand with the following code. Is there a better way to do this?

def prehunpos(sentence): 
    sentence = sentence.replace(u'\u2018',"'") # left single quote mark 
    sentence = sentence.replace(u'\u2019',"'") # right single quote mark 
    sentence = sentence.replace(u'\u201C','"') # left double quote mark 
    sentence = sentence.replace(u'\u201D','"') # right double quote mark 
    sentence = sentence.replace(u'\u2010',"-") # hyphen 
    sentence = sentence.replace(u'\u2011',"-") # non-break hyphen 
    sentence = sentence.replace(u'\u2012',"-") # figure dash 
    sentence = sentence.replace(u'\u2013',"-") # en dash 
    sentence = sentence.replace(u'\u2014',"-") # em dash 
    sentence = sentence.replace(u'\u2015',"-") # horizontal bar 
    sentence = sentence.replace(u'\u2017',"_") # double low line 
    sentence = sentence.replace(u'\u2016',"|") # double vertical line 
    sentence = sentence.replace(u'\u2024',".") # one dot leader 
    sentence = sentence.replace(u'\u2025',"..") # two dot leader 
    sentence = sentence.replace(u'\u2026',"...") # horizontal ellipsis 
    sentence = sentence.replace(u'\u039d\u0391\u03a4\u039f',u'NATO') # NATO spelled with Greek capitals (Nu, Alpha, Tau, Omicron) 

    sentence = sentence.replace(u'\u0391',"A") # Greek Capital Alpha 
    sentence = sentence.replace(u'\u0392',"B") # Greek Capital Beta 
    #sentence = sentence.replace(u'\u0393',"") # Greek Capital Gamma 
    #sentence = sentence.replace(u'\u0394',"") # Greek Capital Delta 
    sentence = sentence.replace(u'\u0395',"E") # Greek Capital Epsilon 
    sentence = sentence.replace(u'\u0396',"Z") # Greek Capital Zeta 
    sentence = sentence.replace(u'\u0397',"H") # Greek Capital Eta 
    #sentence = sentence.replace(u'\u0398',"") # Greek Capital Theta 
    sentence = sentence.replace(u'\u0399',"I") # Greek Capital Iota 
    sentence = sentence.replace(u'\u039a',"K") # Greek Capital Kappa 
    #sentence = sentence.replace(u'\u039b',"") # Greek Capital Lambda 
    sentence = sentence.replace(u'\u039c',"M") # Greek Capital Mu 
    sentence = sentence.replace(u'\u039d',"N") # Greek Capital Nu 
    #sentence = sentence.replace(u'\u039e',"") # Greek Capital Xi 
    sentence = sentence.replace(u'\u039f',"O") # Greek Capital Omicron 
    sentence = sentence.replace(u'\u03a1',"P") # Greek Capital Rho 
    #sentence = sentence.replace(u'\u03a3',"") # Greek Capital Sigma 
    sentence = sentence.replace(u'\u03a4',"T") # Greek Capital Tau 
    sentence = sentence.replace(u'\u03a5',"Y") # Greek Capital Upsilon 
    #sentence = sentence.replace(u'\u03a6',"") # Greek Capital Phi 
    sentence = sentence.replace(u'\u03a7',"X") # Greek Capital Chi 
    #sentence = sentence.replace(u'\u03a8',"") # Greek Capital Psi 
    #sentence = sentence.replace(u'\u03a9',"") # Greek Capital Omega 

    sentence = sentence.replace(u'\u03b1',"a") # Greek small alpha 
    sentence = sentence.replace(u'\u03b2',"b") # Greek small beta 
    #sentence = sentence.replace(u'\u03b3',"") # Greek small gamma 
    #sentence = sentence.replace(u'\u03b4',"") # Greek small delta 
    sentence = sentence.replace(u'\u03b5',"e") # Greek small epsilon 
    #sentence = sentence.replace(u'\u03b6',"") # Greek small zeta 
    #sentence = sentence.replace(u'\u03b7',"") # Greek small eta 
    #sentence = sentence.replace(u'\u03b8',"") # Greek small theta 
    sentence = sentence.replace(u'\u03b9',"i") # Greek small iota 
    sentence = sentence.replace(u'\u03ba',"k") # Greek small kappa 
    #sentence = sentence.replace(u'\u03bb',"") # Greek small lamda 
    sentence = sentence.replace(u'\u03bc',"u") # Greek small mu 
    sentence = sentence.replace(u'\u03bd',"v") # Greek small nu 
    #sentence = sentence.replace(u'\u03be',"") # Greek small xi 
    sentence = sentence.replace(u'\u03bf',"o") # Greek small omicron 
    #sentence = sentence.replace(u'\u03c0',"") # Greek small pi 
    sentence = sentence.replace(u'\u03c1',"p") # Greek small rho 
    sentence = sentence.replace(u'\u03c2',"c") # Greek small final sigma 
    #sentence = sentence.replace(u'\u03c3',"") # Greek small sigma 
    sentence = sentence.replace(u'\u03c4',"t") # Greek small tau 
    sentence = sentence.replace(u'\u03c5',"u") # Greek small upsilon 
    #sentence = sentence.replace(u'\u03c6',"") # Greek small phi 
    sentence = sentence.replace(u'\u03c7',"x") # Greek small chi 
    sentence = sentence.replace(u'\u03c8',"x") # Greek small psi 
    sentence = sentence.replace(u'\u03c9',"w") # Greek small omega 


    sentence = sentence.replace(u'\u0103',"a") # Latin a with breve 
    sentence = sentence.replace(u'\u0107',"c") # Latin c with acute 
    sentence = sentence.replace(u'\u010d',"c") # Latin c with caron 
    sentence = sentence.replace(u'\u0161',"s") # Latin s with caron 

    return sentence.strip() 
+0

[How to replace unicode characters by ascii characters in Python (perl script given)?](http://stackoverflow.com/questions/2700859/how-to-replace-unicode-characters-by-ascii-characters-in-python-perl-script-giv) –

+0

http://code.google.com/p/a2bot/source/browse/trunk/lib/unaccent.py –

+0

@MikeSamuel, that does not solve the problem. This is crazy utf8 punctuation that cannot be normalized. – alvas

Answers

1

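If you need a general means of converting non-Latin scripts to Latin, ICU transforms are the best choice. There is a Python wrapper for ICU, PyICU (http://pypi.python.org/pypi/PyICU). However, if you are targeting only a single script (you seem to be particularly interested in Greek), a mapping table might be the fastest solution. You could write it more concisely, though: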

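#!/usr/bin/python 
# -*- coding: utf-8 -*- 

# Map each Greek letter to a Latin replacement.
greek_to_latin = {u"Α": u"A", u"Β": u"B", u"Γ": u"G"} # ... 

greek_string = u"ΑΒΓ" # sample input
# .get(c, c) passes characters that are not in the table through unchanged
# instead of raising KeyError.
latin_string = u"".join(greek_to_latin.get(c, c) for c in greek_string) 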

Also, take a look at the `unicodedata` module: it lets you look up the category of a character, which gives you a way to identify non-ASCII punctuation.
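As a rough sketch of the `unicodedata` suggestion (written for Python 3, where `str` is already Unicode; the helper names `strip_accents` and `non_ascii_punctuation` are made up for illustration): NFKD normalization splits an accented letter into a base letter plus a combining mark, which can then be dropped, and `unicodedata.category` flags the punctuation that still needs a manual mapping.

```python
import unicodedata


def strip_accents(text):
    """Decompose with NFKD, then drop combining marks (e.g. U+010D -> 'c')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def non_ascii_punctuation(text):
    """Return the non-ASCII characters whose general category starts with 'P'."""
    return {
        ch for ch in text
        if ord(ch) > 127 and unicodedata.category(ch).startswith("P")
    }


# Curly quotes, c with caron, en dash, a with breve.
sample = "\u201cKova\u010d\u201d \u2013 \u0103"
print(strip_accents(sample))
print(sorted(non_ascii_punctuation(sample)))
```

Greek letters, curly quotes and dashes have no decomposition, so they pass through `strip_accents` unchanged and would still need a table. The second helper is only a way to discover which punctuation a corpus actually contains.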

+2

Use `unicode.translate` instead of the join. That way the result is simply `the_string.translate(greek_to_latin)`. You also do not need to include characters that stay the same; they are left as they are. – Bakuriu

+0

Sure, that works too, although you have to use `ord` for the keys of the translation map. –

+0

Using `ord(u"è")` instead of `u"è"` is not a big deal, and you get a big speed-up this way. – Bakuriu
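A minimal sketch of the `translate` approach discussed in these comments, assuming Python 3 (where `str.translate` plays the role of `unicode.translate`). The table name `to_latin1` and the function name `prehunpos_table` are hypothetical, and the table covers only a handful of the characters from the question:

```python
# Keys must be code points (ints), hence ord(); values may be strings of any length.
to_latin1 = {
    ord("\u2018"): "'",    # left single quotation mark
    ord("\u2019"): "'",    # right single quotation mark
    ord("\u201c"): '"',    # left double quotation mark
    ord("\u201d"): '"',    # right double quotation mark
    ord("\u2013"): "-",    # en dash
    ord("\u2026"): "...",  # horizontal ellipsis
    ord("\u0391"): "A",    # Greek capital alpha
    ord("\u0392"): "B",    # Greek capital beta
    ord("\u010d"): "c",    # Latin c with caron
}


def prehunpos_table(sentence):
    """Apply every replacement in one pass; unmapped characters are kept as-is."""
    return sentence.translate(to_latin1).strip()


cleaned = prehunpos_table("\u201c\u0391\u0392\u201d \u2013 \u010d\u2026 ")
print(cleaned)
# Raises UnicodeEncodeError if a character outside latin-1 is still present.
print(cleaned.encode("latin-1"))
```

Compared with a chain of `replace` calls, the string is scanned once, and adding a character means adding one table entry.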
