Is there a way to remove duplicate strings from a string based on a pattern?

I am working with a file in this format:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

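As you can see, every SPEC line is different except for two, where the spectrum number is repeated. What I want to do is take every chunk of information between the =Cluster= patterns and check whether any lines repeat a spectrum value. If several lines are repeated, remove all but one.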

The output file should look like this:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 


=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

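I used groupby from the itertools module. I assume my input file is called f_input.txt and the output file is called new_file.txt. However, this script also removes the word SPEC, and I do not know what to change so that it does not.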

from itertools import groupby

data = (k.rstrip().split("=Cluster=") for k in open("f_input.txt", 'r'))
final = list(k for k, _ in groupby(list(data)))

with open("new_file.txt", 'a') as f:
    for k in final:
        if k == ['', '']:
            f.write("=Cluster=\n")
        elif k == ['']:
            f.write("\n\n")
        else:
            f.write("{}\n".join(k))

Edit: a new condition. In some cases, some of the numbers in a line can change. For example:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 
SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

In the last line, the PRD number has changed. One solution would be to check the spectrum number and remove lines based on the repeated spectrum.

This would be the result:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true

Source

2017-02-24 Enrique

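Your code works, so why are you asking whether some code would work?

You could do i = file.read().split('\n'), loop over the whole file, and check line by line whether i[1] also appears in other lines such as i[2] or i[3]. Then delete that i and repeat this one by one for the whole split string. But yes, that would be a lot of code. There must be a better solution.

The code works fine and I don't see any problem.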

Shortest Python solution :P

import os 
os.system("""awk 'line != $0; { line = $0 }' originalfile.txt > dedup.txt""")

Output:

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22491.xml;spectrum=1074 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=2950 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=1876 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3479 true 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true 

=Cluster= 
SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true

(If you are on Windows, awk can easily be installed with Gow.)
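Where awk is not available, the same idea (drop a line only when it is identical to the line directly before it) can be sketched in plain Python. This is a rough equivalent rather than an exact port: the names drop_adjacent_duplicates and dedupe_file are made up for illustration, and the file names in the comment are simply the ones from the command above.

```python
def drop_adjacent_duplicates(lines):
    """Yield each line unless it equals the line directly before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def dedupe_file(src_path, dst_path):
    """Stream src_path to dst_path, skipping adjacent repeated lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_adjacent_duplicates(src))


# Hypothetical call, using the file names from the awk command:
# dedupe_file("originalfile.txt", "dedup.txt")

demo = ["a\n", "a\n", "b\n", "a\n"]
print(list(drop_adjacent_duplicates(demo)))
```

Like the awk version, this compares whole lines exactly, so two lines that differ only in trailing whitespace are treated as different, and repeats that are not next to each other are kept.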

Source

2017-02-24 16:21:31

Great, easy solution. Thanks! – Enrique

This trick only works if the duplicate entries are adjacent.

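This opens the file containing your original data, as well as a new file to which the unique lines of each group are written.

seen is a set, which is good for checking whether something already exists.

data is a list and keeps track of the "=Cluster=" groups to iterate over.

Then you just go through each line of each group (called i) in data.

If the line is not in seen, it is added.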

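with open("input file", 'r') as in_file, open("output file", 'w') as out_file:
    # Read the whole file and split it into "=Cluster=" groups.
    data = in_file.read().split("=Cluster=")
    out_file.write(data[0])  # anything before the first marker
    for i in data[1:]:
        out_file.write("=Cluster=")
        seen = set()  # reset for every group
        for line in i.splitlines(keepends=True):
            key = line.strip()  # ignore trailing whitespace when comparing
            if key and key in seen:  # keep blank lines, skip repeated ones
                continue
            seen.add(key)
            out_file.write(line)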

Edit: moved seen = set() inside the for i in data loop so that the set is reset for every group; otherwise "=Cluster=" would always already be in the set and would not be printed for each group in data.
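To see the role of seen in isolation, here is a small sketch that removes repeated lines from a single group while keeping the first occurrence and the original order. unique_lines is a hypothetical helper name, and comparing stripped lines is an assumption made here so that trailing spaces do not hide a duplicate.

```python
def unique_lines(group):
    """Return the lines of one group with later repeats removed."""
    seen = set()
    result = []
    for line in group:
        key = line.strip()  # assumption: trailing whitespace is not significant
        if key in seen:
            continue
        seen.add(key)
        result.append(line)
    return result


group = [
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true ",
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true",
]
print(unique_lines(group))  # only the first of the two lines remains
```

A set is used because membership tests take roughly constant time, whereas checking against a list would get slower as a group grows.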

Source

2017-02-24 15:13:52 pstatix

Yes, nice. Have you tried the code?

The 'seen' set needs to be reset.

@Ev. Kounis I was updating it while you were posting this. I realized I had it wrong! – pstatix

This should do it.

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()
    for line in f_in:
        if '=Cluster=' in line or line.strip() == '':
            # Cluster marker or blank line: start a fresh set and copy the line
            seen_spectra = set()
            f_out.write(line)
        else:
            # Text after the last '=' up to the next space, e.g. '3785'
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue  # spectrum already written for this cluster
            else:
                f_out.write(line)
                seen_spectra.add(new_spectrum)

This is not a groupby solution, but it is one you can easily follow and debug if you have to. As mentioned in the comments, the file is 16 GB in size, and loading it into memory is probably not the best idea.
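Because a file object yields one line at a time, the loop above never holds more than the current cluster's spectrum numbers in memory. The following sketch wraps the same logic in a generator so that it can be tried on a small in-memory sample first. dedupe_clusters is a hypothetical name, and unlike the code above it resets the set only on =Cluster= lines, not on blank lines.

```python
import io


def dedupe_clusters(lines):
    """Yield lines, skipping SPEC lines whose spectrum number was
    already seen in the current =Cluster= block."""
    seen_spectra = set()
    for line in lines:
        if line.startswith("=Cluster="):
            seen_spectra = set()  # new cluster: forget old spectra
            yield line
        elif line.startswith("SPEC"):
            # Text after the last '=' up to the next space, e.g. "3785"
            spectrum = line.rstrip().split("=")[-1].split()[0]
            if spectrum not in seen_spectra:
                seen_spectra.add(spectrum)
                yield line
        else:
            yield line  # blank or unrelated line: pass through


sample = io.StringIO(
    "=Cluster=\n"
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"
    "SPEC PRD000682;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true\n"
    "\n"
    "=Cluster=\n"
    "SPEC PRD000681;PRIDE_Exp_Complete_Ac_22493.xml;spectrum=473 true\n"
)
result = list(dedupe_clusters(sample))
print("".join(result), end="")
```

For the real file, the open input file can be passed in place of sample and the yielded lines written with writelines on the output file, which keeps memory use flat regardless of file size.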

EDIT: "Each cluster has a specific spectrum. It is not possible to have one spec in one cluster and the same in another"

file_in = r'someFile.txt'
file_out = r'someOtherFile.txt'
with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    seen_spectra = set()  # shared by all clusters, never reset
    for line in f_in:
        if line.startswith('SPEC'):
            new_spectrum = line.rstrip().split('=')[-1].split()[0]
            if new_spectrum in seen_spectra:
                continue
            else:
                seen_spectra.add(new_spectrum)
                f_out.write(line)
        else:
            f_out.write(line)

Source

2017-02-24 15:17:36

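Yes. The code worked perfectly. Thank you! – Enrique

Hi Ev. Kounis. I talked to my boss, and he said that the pattern to check inside each =Cluster= should be spectrum=number. The PRD numbers (for example PRD0013 and PRD0014) can change, but the spectrum number cannot, so the script would not treat such lines as repeated. How can I change the script to take only the spectrum part into account? – Enrique

@Enrique I'm afraid I don't understand.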

A solution that uses the re.search() function and a custom spectrums set object to keep only the unique spectrum numbers:

import re

with open('f_input.txt') as oldfile, open('new_file.txt', 'w') as newfile:
    spectrums = set()
    for line in oldfile:
        m = re.search(r'spectrum=(\d+)', line)
        if m is None:
            # "=Cluster=", blank, or any other line without a spectrum number
            newfile.write(line)
        elif m.group(1) not in spectrums:
            spectrums.add(m.group(1))
            newfile.write(line)
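One thing to keep in mind with this approach: re.search() returns None when the pattern does not occur in a line, so calling .group(1) on the result directly raises AttributeError for any line without spectrum=<digits>. A hedged sketch of a small helper that makes this case explicit follows; spectrum_of and SPECTRUM_RE are made-up names used only for illustration.

```python
import re

SPECTRUM_RE = re.compile(r"spectrum=(\d+)")


def spectrum_of(line):
    """Return the spectrum number in `line` as a string, or None if absent."""
    match = SPECTRUM_RE.search(line)
    return match.group(1) if match else None


print(spectrum_of("SPEC PRD000681;PRIDE_Exp_Complete_Ac_22498.xml;spectrum=3785 true"))  # 3785
print(spectrum_of("=Cluster="))  # None
```

Because the helper looks only at the spectrum=<digits> part, two lines that differ in their PRD number but share a spectrum number produce the same key.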

Source

2017-02-24 15:33:11 RomanPerekhrest

I am getting this error: AttributeError: 'NoneType' object has no attribute 'group' – Enrique

@Enrique, what is the point? You have already accepted one answer. – RomanPerekhrest

I am comparing the different solutions to see which is the most efficient. – Enrique
