2017-03-27

gensim memory-friendly corpus error

I am using Python 3.5. Based on the gensim samples, I created a project and added the following code to it:

class MyCorpus(object):
    def __iter__(self):
        for line in open('files/2/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())


corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
print(corpus_memory_friendly)

But after running it, I get this error in my PyCharm console:

Traceback (most recent call last): 
    File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 31, in <module> 
    for vector in corpus_memory_friendly: # load one vector into memory at a time 
    File "D:/Python-Workspace(s)/GensimSamples/2.Gensim_CorpusStreaming.py", line 17, in __iter__ 
    yield dictionary.doc2bow(line.lower().split()) 
AttributeError: module 'gensim.corpora.dictionary' has no attribute 'doc2bow' 

How can I fix this problem?

Answer

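The traceback shows that the name dictionary in your code refers to the module gensim.corpora.dictionary, not to a Dictionary object, and only a Dictionary object has doc2bow. Prepare the dictionary in advance and make it available to the MyCorpus class. A sample class that creates a memory-friendly corpus could look like this: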

import logging
from pprint import pprint

from gensim import corpora

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


class MyCorpus(object):
    def __init__(self, text_file='text_corpus.txt', dictionary=None):
        """
        Store the path of the corpus file and the dictionary to use.
        If no dictionary is given, build one from the file and save it to disk.
        """
        self.file_name = text_file
        if dictionary is None:
            self.prepare_dictionary()
        else:
            self.dictionary = dictionary

    def __iter__(self):
        # Stream the file so that only one document is in memory at a time.
        with open(self.file_name) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield self.dictionary.doc2bow(line.lower().split())

    def prepare_dictionary(self):
        stop_list = set('for a of the and to in'.split())  # Stop words; these could also be loaded from a file.

        # Build the dictionary from the text file using Gensim's Dictionary class.
        with open(self.file_name) as corpus_file:
            self.dictionary = corpora.Dictionary(line.lower().split() for line in corpus_file)

        # Collect the ids of the tokens that appear in the stop list.
        stop_ids = [self.dictionary.token2id[stop_word] for stop_word in stop_list
                    if stop_word in self.dictionary.token2id]

        # Collect the ids of the tokens that appear in only one document.
        once_ids = [token_id for token_id, doc_freq in self.dictionary.dfs.items() if doc_freq == 1]

        # Remove the unwanted tokens using the collected ids.
        self.dictionary.filter_tokens(stop_ids + once_ids)

        # Save the dictionary to disk for later use.
        self.dictionary.save('dictionary.dict')

my_memory_friendly_corpus = MyCorpus()

# Saving the corpus:
# corpora.MmCorpus.serialize('corpus.mm', my_memory_friendly_corpus)

# To load the saved corpus:
# corpus = corpora.MmCorpus('corpus.mm')

print('\t:::The dictionary::::')
pprint(my_memory_friendly_corpus.dictionary.token2id)
print(my_memory_friendly_corpus)
print('\n\t:::The corpus::::')
for vector in my_memory_friendly_corpus:
    print(vector)

Output (without the log information):

:::The dictionary:::: 
{'computer': 2, 
'eps': 8, 
'graph': 10, 
'human': 0, 
'interface': 1, 
'minors': 11, 
'response': 6, 
'survey': 3, 
'system': 5, 
'time': 7, 
'trees': 9, 
'user': 4} 
<__main__.MyCorpus object at 0x7fe0e9ac5c18> 

    :::The corpus:::: 
[(0, 1), (1, 1), (2, 1)] 
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)] 
[(1, 1), (4, 1), (5, 1), (8, 1)] 
[(0, 1), (5, 2), (8, 1)] 
[(4, 1), (6, 1), (7, 1)] 
[(9, 1)] 
[(9, 1), (10, 1)] 
[(9, 1), (10, 1), (11, 1)] 
[(3, 1), (10, 1), (11, 1)] 
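
To read the vectors above: each pair is (token id, count), so (5, 2) in the fourth vector means that the token with id 5 ('system') occurs twice in that document. Here is a rough sketch of the idea using only the standard library. It is not gensim's implementation; doc2bow_sketch and the sample document are hypothetical, and the token2id mapping is copied from the dictionary printed above.

```python
from collections import Counter

# token2id copied from the dictionary printed above.
token2id = {'computer': 2, 'eps': 8, 'graph': 10, 'human': 0, 'interface': 1,
            'minors': 11, 'response': 6, 'survey': 3, 'system': 5, 'time': 7,
            'trees': 9, 'user': 4}


def doc2bow_sketch(tokens, token2id):
    """Hypothetical stand-in for Dictionary.doc2bow: count known tokens only."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    # Return (token_id, count) pairs ordered by token id.
    return sorted(counts.items())


# Hypothetical document; 'and' and 'of' are not in the dictionary, so they are skipped.
tokens = 'system and human system of eps'.split()
vector = doc2bow_sketch(tokens, token2id)
print(vector)  # [(0, 1), (5, 2), (8, 1)]
```

Tokens that are missing from the dictionary simply do not show up in the vector, which is why the stop words filtered out earlier never appear in the corpus.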

As I am very new to both Gensim and Python, I ran into a similar problem too. The Gensim mailing list really helped me learn Gensim.
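
One more note on why the sample uses a class with __iter__ instead of a plain generator function: a generator can be consumed only once, while an object whose __iter__ reopens the file can be looped over again and again, which matters if whatever consumes the corpus makes several passes over it. Below is a small sketch with the standard library only. The names stream_once and StreamingCorpus and the two sample lines are made up for illustration.

```python
import os
import tempfile


def stream_once(path):
    """Generator function: yields tokenized lines, but can be consumed only once."""
    with open(path) as corpus_file:
        for line in corpus_file:
            yield line.lower().split()


class StreamingCorpus(object):
    """Hypothetical helper: reopens the file on every pass, so it can be iterated repeatedly."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as corpus_file:
            for line in corpus_file:
                yield line.lower().split()


# Write a tiny two-line corpus to a temporary file (the sample text is made up).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('Human computer interface\nGraph minors survey\n')
    corpus_path = tmp.name

generator = stream_once(corpus_path)
first_pass = list(generator)
second_pass = list(generator)  # the generator is already exhausted

corpus = StreamingCorpus(corpus_path)
class_first = list(corpus)
class_second = list(corpus)  # __iter__ runs again and reopens the file

os.remove(corpus_path)

print(len(first_pass), len(second_pass))    # 2 0
print(len(class_first), len(class_second))  # 2 2
```

In both cases only one line is held in memory at a time; the class just makes the stream restartable.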