2016-11-04 1 views
0

How to extract a simple corpus with tags from NLTK?

NLTK has an interface to the Brown corpus and its POS tags, which can be accessed like this:

>>> from nltk.corpus import brown 
>>> brown.tagged_sents() 
[[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')], [(u'The', u'AT'), (u'jury', u'NN'), (u'further', u'RBR'), (u'said', u'VBD'), (u'in', u'IN'), (u'term-end', u'NN'), (u'presentments', u'NNS'), (u'that', u'CS'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'Executive', u'JJ-TL'), (u'Committee', u'NN-TL'), (u',', u','), (u'which', u'WDT'), (u'had', u'HVD'), (u'over-all', u'JJ'), (u'charge', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'election', u'NN'), (u',', u','), (u'``', u'``'), (u'deserves', u'VBZ'), (u'the', u'AT'), (u'praise', u'NN'), (u'and', u'CC'), (u'thanks', u'NNS'), (u'of', u'IN'), (u'the', u'AT'), (u'City', u'NN-TL'), (u'of', u'IN-TL'), (u'Atlanta', u'NP-TL'), (u"''", u"''"), (u'for', u'IN'), (u'the', u'AT'), (u'manner', u'NN'), (u'in', u'IN'), (u'which', u'WDT'), (u'the', u'AT'), (u'election', u'NN'), (u'was', u'BEDZ'), (u'conducted', u'VBN'), (u'.', u'.')], ...] 

brown.tagged_sents() is a list in which each element is a sentence, and each sentence is a list of tuples whose first element is the word and whose second element is the POS tag.
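To make that structure concrete, here is a minimal sketch. It uses a small hand-written sample instead of the real corpus so that it runs without the NLTK data; `sample_sents` is a hypothetical stand-in for `brown.tagged_sents()`, not something NLTK provides.

```python
# Hypothetical stand-in for brown.tagged_sents(): a list of sentences,
# each sentence being a list of (word, tag) tuples.
sample_sents = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('It', 'PPS'), ('works', 'VBZ'), ('.', '.')],
]

first_sent = sample_sents[0]   # one sentence: a list of tuples
word, tag = first_sent[1]      # one token: a (word, tag) pair

# Tuple unpacking inside the comprehension splits each pair as it is visited.
words = [w for w, t in first_sent]
tags = [t for w, t in first_sent]
```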

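The goal is to process the Brown corpus into a file in which each line is one sentence with two tab-separated columns: the first column holds the words of the sentence separated by spaces, and the second holds the corresponding tags, also separated by spaces. I tried this: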

from nltk.corpus import brown 
tagged_sents = brown.tagged_sents() 
fout = open('brown.txt', 'w') 
fout.write('\n'.join([' '.join(sent)+'\t'+' '.join(tags) 
         for sent, tags in 
         [zip(*tagged_sent) for tagged_sent in tagged_sents]])) 

It works, but there must be a better way to munge the corpus. The file it produces looks like this:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . AT NP-TL NN-TL JJ-TL NN-TL VBD NR AT NN IN NP$ JJ NN NN VBD `` AT NN '' CS DTI NNS VBD NN . 
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . AT NN RBR VBD IN NN NNS CS AT NN-TL JJ-TL NN-TL , WDT HVD JJ NN IN AT NN , `` VBZ AT NN CC NNS IN AT NN-TL IN-TL NP-TL '' IN AT NN IN WDT AT NN BEDZ VBN . 
The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. . AT NP NN NN HVD BEN VBN IN NP-TL JJ-TL NN-TL NN-TL NP NP TO VB NNS IN JJ `` NNS '' IN AT JJ NN WDT BEDZ VBN IN NN-TL NP NP NP . 

That is what I have tried so far.
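The attempt above could perhaps be tidied along the following lines. This is only a sketch under assumptions: `write_tagged` is a hypothetical helper name, it accepts any iterable of tagged sentences shaped like `brown.tagged_sents()`, and it writes one line at a time rather than joining the whole file into a single string first. An in-memory buffer and a hand-written sample are used so it runs without the corpus.

```python
import io

def write_tagged(tagged_sents, fout):
    """Write one 'words<TAB>tags' line per sentence to an open text file."""
    for tagged_sent in tagged_sents:
        if not tagged_sent:
            continue  # an empty sentence has nothing to unpack into two columns
        # zip(*pairs) turns [(w1, t1), (w2, t2)] into (w1, w2) and (t1, t2).
        words, tags = zip(*tagged_sent)
        fout.write(' '.join(words) + '\t' + ' '.join(tags) + '\n')

# Small hand-made sample standing in for brown.tagged_sents().
sample = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('It', 'PPS'), ('works', 'VBZ'), ('.', '.')],
]

buffer = io.StringIO()   # in-memory text file
write_tagged(sample, buffer)
output = buffer.getvalue()
```

With the corpus installed, the same helper would presumably be called inside `with open('brown.txt', 'w') as fout:` so the file is closed when the block ends. Unlike the original one-liner, this variant ends every line, including the last, with a newline.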

Answers

0
data = [[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR')]] 

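# Walk through every sentence in the data and collect its words.
def data_printer(data):
    # Each word is appended to this string, preceded by a space.
    string = ''
    for dat in data:
        for da in dat:
            string += ' ' + da[0]
    # print() with parentheses works on both Python 2 and Python 3.
    print(string)
    return string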

data_printer(data) 

There are better ways to go about this using the tuples. This is a minimal way of doing it with no imports.
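For comparison, a sketch of the same import-free idea that keeps the tags and returns its result instead of printing it. `format_sents` is a hypothetical name, and the unpacking in the inner loop assumes every token is a (word, tag) pair.

```python
def format_sents(data):
    """Return one 'words<TAB>tags' string per sentence, using no imports."""
    lines = []
    for sent in data:
        words = []
        tags = []
        for word, tag in sent:   # unpack each (word, tag) pair
            words.append(word)
            tags.append(tag)
        lines.append(' '.join(words) + '\t' + ' '.join(tags))
    return lines

data = [[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL')]]
lines = format_sents(data)
```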

+0

The tags are missing ;P Scroll right on the desired output in the question. – alvas

+0

Also, it shouldn't print, but that's okay =) – alvas

+0

Nice. Yes, I only had that as a sample. :) –
