Parsing 2-dimensional text

I need to parse text files where the relevant information is often spread across multiple lines in a nonlinear way. An example:

1234 
1   IN THE SUPERIOR COURT OF THE STATE OF SOME STATE   
2    IN AND FOR THE COUNTY OF SOME COUNTY     
3      UNLIMITED JURISDICTION       
4       --o0o--         
5                  
6 JOHN SMITH AND JILL SMITH,  )        
             )        
7     Plaintiffs,  )        
             )        
8  vs.       )  No. 12345 
             )        
9 ACME CO, et al.,     )        
             )        
10     Defendants.  )        
    ___________________________________)

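I need to pull out the identities of the plaintiffs and defendants.

These transcripts come in a very wide variety of formats, so I can't always count on those nice parentheses being there, or on the plaintiff and defendant information being neatly boxed off, e.g.: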

1  SUPREME COURT OF THE STATE OF SOME OTHER STATE 
         COUNTY OF COUNTYVILLE 
2     First Judicial District 
        Important Litigation 
3 --------------------------------------------------X 
    THIS DOCUMENT APPLIES TO: 
4 
    JOHN SMITH, 
5       Plaintiff,   Index No. 
                2000-123 
6 
              DEPOSITION 
7     - against -    UNDER ORAL 
              EXAMINATION 
8            OF 
              JOHN SMITH, 
9           Volume I 

10 ACME CO, 
    et al, 
11       Defendants. 

12 --------------------------------------------------X

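The two constants are:

"Plaintiff" will always occur after the name of the plaintiff(s), but not necessarily on the same line.
Plaintiffs' and defendants' names will always be in upper case.

Any ideas?

Source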

2010-05-02 alexbw

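What are the numbers on the left? Did you add these, or are they part of the source? You say plaintiffs are upper case, but "JOHN SMITH and JILL SMITH" contains lower case. What are the possible characters between the plaintiff names and the "Plaintiffs" text? Purely white space, parentheses and commas?

Those are line numbers, and they are part of the source. I've corrected the plaintiffs' capitalization. Between the plaintiff names and "Plaintiff" there could really be anything; it is not guaranteed to be only non-alphabetic characters and whitespace. – alexbw

You could always use a neural network. They're great for text parsing. http://thedailywtf.com/Articles/No,_We_Need_a_Neural_Network.aspx

I like Martin's answer. Here is perhaps a more general approach using Python: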

import re 

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded) 
with open('paren.txt','r') as f: 
    paren = f.read() # example doc with parens 

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general 
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE) 

# join the list separating by whitespace 
str_of_matches = ' '.join(list_of_matches) 

# split string by digits (line numbers) 
tokens = re.split(r'\d',str_of_matches) 

# plaintiffs will be in 2nd-to-last group 
plaintiff = tokens[-2].strip()
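
To see why the second-to-last token is the one that holds the names, it may help to look at the intermediate values. This is only a sketch: the inline `sample` string is a shortened stand-in for the question's first transcript (it is not one of the files above), and `re.MULTILINE` is left out because it only changes how `^` and `$` behave, which this pattern doesn't use.

```python
import re

# Shortened stand-in for the first transcript (an assumption for illustration)
sample = (
    "6 JOHN SMITH AND JILL SMITH,  )\n"
    "             )\n"
    "7     Plaintiffs,  )\n"
    "             )\n"
    "8  vs.       )  No. 12345\n"
)

# Step 1: every run of word characters that has "Plaintiff" somewhere after it
words = re.findall(r'(\w+)(?=.*Plaintiff)', sample, re.DOTALL)
print(words)   # ['6', 'JOHN', 'SMITH', 'AND', 'JILL', 'SMITH', '7']

# Step 2: join, then split on the digits (the transcript's line numbers)
joined = ' '.join(words)
tokens = re.split(r'\d', joined)
print(tokens)  # ['', ' JOHN SMITH AND JILL SMITH ', '']

# Step 3: the last token is whatever follows the final line number (empty here),
# so the names sit in the second-to-last token
print(tokens[-2].strip())  # JOHN SMITH AND JILL SMITH
```

Nothing after "Plaintiffs" is captured, because the lookahead fails for any word that has no later "Plaintiff" following it.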

Test:

with open('paren.txt','r') as f: 
    paren = f.read() # example doc with parens 
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren, 
    re.DOTALL|re.MULTILINE) 
str_of_matches = ' '.join(list_of_matches) 
tokens = re.split(r'\d', str_of_matches) 
plaintiff = tokens[-2].strip() 
plaintiff 
# prints 'JOHN SMITH AND JILL SMITH' 

with open('no_paren.txt','r') as f: 
    no_paren = f.read() # example doc with no parens 
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren, 
    re.DOTALL|re.MULTILINE) 
str_of_matches = ' '.join(list_of_matches) 
tokens = re.split(r'\d', str_of_matches) 
plaintiff = tokens[-2].strip() 
plaintiff 
# prints 'JOHN SMITH'
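
The question also asks for the defendants, and the same idea seems to carry over if the keyword is made a parameter. A rough sketch under a few assumptions: `find_party` and `upper_only` are names I made up, the two inline samples are shortened versions of the question's transcripts, and the split uses `\d+` instead of `\d` so that a two-digit line number right before or after the names doesn't leave empty tokens behind.

```python
import re

def find_party(text, keyword, upper_only=False):
    """Return the words between the last line number and `keyword` (hypothetical helper)."""
    # every run of word characters that has `keyword` somewhere after it
    pattern = r'(\w+)(?=.*' + re.escape(keyword) + r')'
    words = re.findall(pattern, text, re.DOTALL)
    # split on whole runs of digits, so "10" counts as one line number
    tokens = re.split(r'\d+', ' '.join(words))
    if len(tokens) < 2:
        return None  # no line number found before the keyword
    party = tokens[-2].split()
    if upper_only:
        # the question says names are always upper case; this drops "et al"
        party = [w for w in party if w.isupper()]
    return ' '.join(party)

# Shortened stand-ins for the two transcript layouts (assumptions for illustration)
with_parens = (
    "6 JOHN SMITH AND JILL SMITH,  )\n"
    "             )\n"
    "7     Plaintiffs,  )\n"
    "8  vs.       )  No. 12345\n"
    "9 ACME CO, et al.,     )\n"
    "10     Defendants.  )\n"
)
without_parens = (
    "4 \n"
    "    JOHN SMITH, \n"
    "5       Plaintiff,   Index No. \n"
    "                2000-123 \n"
    "9           Volume I \n"
    "10 ACME CO, \n"
    "    et al, \n"
    "11       Defendants. \n"
)

for doc in (with_parens, without_parens):
    print(find_party(doc, 'Plaintiff'), '|', find_party(doc, 'Defendant', upper_only=True))
```

This still relies on a line number sitting on both sides of the names, and on the keyword not showing up again later in the transcript body, so treat it as a starting point rather than a robust parser.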

Source

2010-05-02 18:30:03 bernie
