2013-03-16

XMLSyntaxError while using iterparse

I am parsing a large XML file in Python, using lxml's iterparse. The relevant part of the large XML file is as follows:

<?xml version="1.0" encoding="utf-8"?> 
<posthistory> 
    <row Id="1332647" PostHistoryTypeId="5" PostId="723397" RevisionGUID="cd3aafe8-47ee-497d-a4e8-948c2a769d7e" CreationDate="2009-04-06T22:27:07.567" UserId="40414" Comment="Added examples of articles and filenames" Text="I have a large number of text files (1000+) containing articles from academic journals. Each article's file contains a &quot;stub&quot; from the end of the previous article (at the beginning) and from the beginning of the next article (at the end). I need to remove these stubs in preparation for running a frequency analysis on the articles because the stubs are duplicate data. &#xD;&#xA;&#xD;&#xA;I know I can use diff to output the differences between the files, ignore whitespace and treat the files as text, and then compare them manually, but this doesn't male much sense when dealing with this much data. Accuracy does not have to be 100%, so a script that compared each file to the next file and then removed 1 copy of the duplicate text would be perfect. This seems like it would be a pretty common issue when programming so I am surprised that I haven't been able to find anything that does this.&#xD;&#xA;&#xD;&#xA;The file names sort in order, so a script that compares each file to the next sequentially should work. E.G.&#xD;&#xA;&#xD;&#xA;&lt;pre&gt;bul_9_5_181.txt&#xD;&#xA;bul_9_5_186.txt&#xD;&#xA;&lt;/pre&gt;&#xD;&#xA;&#xD;&#xA;are two articles, one starting on page 181 and the other on page 186. &#xD;&#xA;&#xD;&#xA;Note: I am an academic doing content analysis of old journal articles for a project in the history of psychology. I am no programmer, but I do have 10+ years experience with linux and can usually figure things out as I go. 
&#xD;&#xA;&#xD;&#xA;Thanks for your help&#xD;&#xA;&#xD;&#xA;&lt;b&gt;---example stub at beginning of file: everything before &quot;AFFECTIVE PHENOMENA — EXPERIMENTAL&quot; is duplicate from previous file----&lt;/b&gt;&#xD;&#xA;&#xD;&#xA;SYN&amp;STHESIA&#xD;&#xA;&#xD;&#xA;ISI&#xD;&#xA;&#xD;&#xA;the majority of Portugese words signifying black objects or ideas relating to black. This association is, admittedly, no true synsesthesia, but the author believes that it is only a matter of degree between these logical and spontaneous associations and genuine cases of colored audition.&#xD;&#xA;REFERENCES&#xD;&#xA;&#xD;&#xA;DOWNEY, JUNE E. A Case of Colored Gustation. Amer. J. of Psycho!., 1911, 22, S28-539MEDEIROS-E-ALBUQUERQUE. Sur un phenomene de synopsie presente par des millions de sujets./. de psychol. norm, et path., 1911, 8, 147-151. MYERS, C. S. A Case of Synassthesia. Brit. J. of Psychol., 1911, 4, 228-238.&#xD;&#xA;&#xD;&#xA;AFFECTIVE PHENOMENA — EXPERIMENTAL&#xD;&#xA;BY PROFESSOR JOHN F. .SHEPARD&#xD;&#xA;University of Michigan&#xD;&#xA;&#xD;&#xA;Three articles have appeared from the Leipzig laboratory during the year. Drozynski (2) objects to the use of gustatory and olfactory stimuli in the study of organic reactions with feelings, because of the disturbance of breathing that may be involved. He uses rhythmical auditory stimuli, and finds that when given at different rates and in various groupings,&#xD;&#xA;&#xD;&#xA;&lt;b&gt;---this is from the end of the same file, everything AFTER &quot;1911. Pp.39&quot; is duplicate from the next article---&lt;/b&gt;&#xD;&#xA;&#xD;&#xA;Pleasantness of Colors. Arner. J. of Psychol., 1911, 22, 578-579. 8. WASHBURN, M. F . and CRAWFORD, D . Fluctuations in the Affective Value of Colors During Fixation for One Minute. Amer. J. of Psychol., 1911, 22, 579-J82. 9. WELLS, F . L. and FORBES, A. On Certain Electrical Processes in the Human Body and their Relation to Emotional Reactions. (No. 16 of Archives of Psychology). 
New York: The Science Press, 1911. Pp. 39.&#xD;&#xA;&#xD;&#xA;AFFECTIVE PHENOMENA — DESCRIPTIVE AND THEORETICAL&#xD;&#xA;BY PROFESSOR H. N. GARDINER Smith College&#xD;&#xA;&#xD;&#xA;Fundamental questions are discussed systematically by Rehmke (18) in a second edition of a well-digested treatise, a characteristic feature of which is its attempt to relate feeling, emotion and mood.&#xD;&#xA;Feeling (Gefuhl) is defined as a Bestimmtheitsbesonderheit des zustdnd-&#xD;&#xA;&#xD;&#xA;lichen Bewusstseins. Consciousness being conceived as the individual soul, its state is assumed to be at any given moment simple and unique; hence the momentary feeling is always one of pleasure or displeasure, never &quot;mixed.&quot; It is determined, not by any one, but by the totality of the objective factors, those being massgebend which are in the focus of attention. A &quot; feeling,&quot; in the ordinary sense, is a complex of the affective state and the &quot;determining&quot; and &quot;accompanying&quot; objective components, the &quot;determining&quot; objects of attention giving the kind of feeling, the &quot;accompanying&quot; organic sensations being mainly responsible for its obscure &quot;coloring&quot; and its degree. Mood (Stimmung) appears in a certain contrast to &quot;feeling&quot; in that in it organic sensation is the &quot;determining&quot; factor and no particular object occupies the focus of attention. Emotion {Affeki) is not contrasted with &quot;feeling,&quot; but is &quot;feeling&quot; characterized by the intensity of the &quot;accompanying&quot; organic sensations, which are rightly included in the emotion; we must not, however, confuse, with James and Lange, the bodily changes which give rise&#xD;&#xA;&#xD;&#xA;&lt;b&gt;---end example---&lt;/b&gt;" /> 
</posthistory> 

I parse it with the following code:

'''
Function provides fast iteration of XML files via iterparse.
Source: Listing 5 at http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
'''
from lxml import etree

def fast_iter(context, func):
    try:
        for event, elem in context:
            func(elem)

            # Free memory: clear the element, then drop already-processed siblings.
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context
    except etree.XMLSyntaxError as e:
        print(e)

I get the following XMLSyntaxError, with this traceback:

Input is not proper UTF-8, indicate encoding ! 
Bytes: 0x97 0x20 0x45 0x58, line 3, column 1694 

Some important points:

[1] The XML file is large (3 GB or more), so using iterparse is mandatory.

[2] I have provided only the part of the XML file that raises the syntax error.

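My questions are as follows:

[1] Can I resolve this issue automatically? If so, how?

[2] Can I resolve this issue manually? If so, how?

[3] Can I ignore this issue and still parse the large XML file? If so, how?

I have looked at the following resources: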

How should I deal with an XMLSyntaxError in Python's lxml while parsing a large XML file?

Ignore encoding errors in Python (iterparse)?

Is there a way to recover iterparse on invalid Char values?

However, none of them answered my question.

Answers

1

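The problem is that you are telling the parser that the XML is encoded as UTF-8 when it is actually Windows CP-1252. The byte sequence

0x97 0x20 0x45 0x58

is, when interpreted as CP-1252,

emdash space E X

and if you look at the text of the offending line at the given offset, you can indeed see the emdash. The problem is that 0x97 is not a valid character in UTF-8: it is a continuation byte, which can only appear after the lead byte of a multi-byte sequence.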
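To see how widespread the problem is before choosing a fix, the file can be scanned in binary mode for bytes that do not decode as UTF-8. The following is only a sketch, assuming Python 3; the function name `find_invalid_utf8` and the `max_hits` parameter are made up for illustration. It reads line by line, so it assumes no single line is too large to hold in memory, and it reports only the first bad byte on each line.

```python
def find_invalid_utf8(path, max_hits=10):
    """Return (line_number, byte_column, sample_bytes) for lines that are not valid UTF-8."""
    hits = []
    with open(path, "rb") as f:
        # A newline byte (0x0A) never occurs inside a multi-byte UTF-8 sequence,
        # so each line can be decoded independently of the others.
        for lineno, line in enumerate(f, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError as exc:
                # exc.start is the 0-based byte index of the first bad byte in this line.
                hits.append((lineno, exc.start, line[exc.start:exc.start + 4]))
                if len(hits) >= max_hits:
                    break
    return hits
```

The reported column is a 0-based byte index, so it will not necessarily match the column number in the parser's message.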

In Unicode, the emdash character is &#x2014; (U+2014), and when encoded as UTF-8 it is represented by the three bytes 0xE2 0x80 0x94.
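The byte-level claims here can be checked with Python's built-in codecs. This is a small sketch, assuming Python 3; the variable names are illustrative only.

```python
raw = b"\x97\x20\x45\x58"  # the four bytes reported in the error message

# Interpreted as CP-1252, the first byte is U+2014 (the emdash).
as_cp1252 = raw.decode("cp1252")
is_em_dash = as_cp1252[0] == "\u2014"

# Interpreted as UTF-8, the same bytes cannot be decoded at all.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A correctly encoded emdash takes three bytes in UTF-8.
em_dash_utf8 = "\u2014".encode("utf-8")

print(is_em_dash, utf8_ok, em_dash_utf8.hex())  # True False e29480
```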

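The solution is either to make sure the data is correctly encoded as UTF-8, or to change the header to indicate the correct encoding (for example, encoding="windows-1252" in the XML declaration).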
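For a multi-gigabyte file, the first option can be done as a streaming re-encode, so the whole file never sits in memory. The following is a rough sketch, assuming Python 3 and assuming the entire file really is CP-1252; the function name `transcode_to_utf8` and its parameters are hypothetical. If the file mixes valid UTF-8 with CP-1252 bytes, this would garble the parts that were already correct.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="cp1252", chunk_size=1 << 20):
    """Copy src_path to dst_path, re-encoding from src_encoding to UTF-8 in chunks."""
    # newline="" disables newline translation, so line endings are copied unchanged.
    # errors="replace" substitutes U+FFFD for the few byte values that CP-1252
    # leaves undefined, instead of aborting halfway through a large file.
    with open(src_path, "r", encoding=src_encoding, errors="replace", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
```

After this step the existing encoding="utf-8" declaration matches the actual bytes, so the output file can be passed to iterparse without changing the header.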

In ISO-8859-1, the byte '0x97' represents the control character "END OF GUARDED AREA", not "EM DASH". In CP1252, however, '0x97' is indeed "EM DASH". – mzjn

You're right. In fact, according to [this page](http://en.wikipedia.org/wiki/ISO/IEC_8859-1), ISO-8859-1 does not define characters in the range '0x7F-0x9F'. –
