Extracting the most frequent words from a corpus with Python

This may be a silly question, but I'm having trouble extracting the ten most frequent words from a corpus with Python. This is what I have so far:

import re 
import string 
from nltk.corpus import stopwords 
stoplist = stopwords.words('dutch') 

from collections import defaultdict 
from operator import itemgetter 

def toptenwords(mycorpus): 
    words = mycorpus.words() 
    no_capitals = set([word.lower() for word in words]) 
    filtered = [word for word in no_capitals if word not in stoplist] 
    no_punct = [s.translate(None, string.punctuation) for s in filtered] 
    wordcounter = {} 
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True) 
    return sorting

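When I print the result of this function for my corpus, I get a list of all the words, each with a '1' next to it. (BTW, I'm working with NLTK and reading a corpus that has two subcategories, each containing 10 .txt files.) It gives me a dictionary, but all my values are one. I know that, for example, the word 'baby' occurs five or six times in my corpus, and it still gives 'baby: 1'... so it doesn't work the way I want.
Can someone help me?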


2013-01-24 user2007220

What are you studying? I did exactly the same thing a few months ago. – Amberlamps

I'm studying linguistics. Did you get it solved? – user2007220

The problem is in your use of set.

A set contains no duplicates, so when you build a set of the lowercased words, you are left with only one occurrence of each word.

Let's say your words are:

['banana', 'Banana', 'tomato', 'tomato','kiwi']

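After the list comprehension that lowercases everything, you have:

['banana', 'banana', 'tomato', 'tomato','kiwi']

But then you do:

set(['banana', 'banana', 'tomato', 'tomato','kiwi'])

which returns (in arbitrary order, since sets are unordered):

set(['banana', 'tomato', 'kiwi'])

From that point on, your counting is based on the no_capitals set, where every word appears only once. Don't create a set, and your program will probably work just fine.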
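
A minimal sketch of that fix, assuming Python 3 (so str.maketrans and items() stand in for the Python 2 translate(None, ...) and iteritems()). The name top_words, the n parameter, and the tiny EXAMPLE_STOPLIST are made up for illustration; with NLTK you would presumably pass mycorpus.words() and stopwords.words('dutch') instead. It also uses itemgetter(1) as the sort key so that the pairs are ordered by count.

```python
import string
from collections import defaultdict
from operator import itemgetter

# Hypothetical stand-in for stopwords.words('dutch'); not the real NLTK list.
EXAMPLE_STOPLIST = {"de", "het", "een", "en"}


def top_words(words, stoplist=EXAMPLE_STOPLIST, n=10):
    # Lowercase into a list (not a set) so repeated words are kept.
    no_capitals = [word.lower() for word in words]
    filtered = [word for word in no_capitals if word not in stoplist]

    # Python 3 equivalent of s.translate(None, string.punctuation).
    strip_punct = str.maketrans("", "", string.punctuation)
    no_punct = [word.translate(strip_punct) for word in filtered]

    wordcounter = defaultdict(int)
    for word in no_punct:
        if word:  # skip tokens that consisted only of punctuation
            wordcounter[word] += 1

    # itemgetter(1) picks the count out of each (word, count) pair.
    ranked = sorted(wordcounter.items(), key=itemgetter(1), reverse=True)
    return ranked[:n]


print(top_words(["Baby", "baby", "de", "baby.", "kat", ",", "Kat"]))
```

The only change that matters for the counts is the list comprehension on the first line of the function; everything else is the same pipeline as in the question.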


2013-01-24 11:26:40 pcalcao

Thank you. Of course, that makes sense. – user2007220

Thanks for accepting the answer, so the question gets marked as closed :) – pcalcao

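If you are using NLTK anyway, first build a frequency distribution from your sample with FreqDist(samples). Then call its most_common(n) method to get the n most common words in the sample, sorted in descending order of frequency. Something like: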

from nltk.probability import FreqDist 
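# Count the corpus tokens (e.g. the question's no_punct list, built without set()),
# not the stop-word list itself.
fdist = FreqDist(no_punct)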
top_ten = fdist.most_common(10)


2014-07-29 10:36:20

A Pythonic way: here is one solution.

In [1]: from collections import Counter 

In [2]: words = ['hello', 'hell', 'owl', 'hello', 'world', 'war', 'hello', 'war'] 

In [3]: counter_obj = Counter(words) 

In [4]: counter_obj.most_common() #counter_obj.most_common(n=10) 
Out[4]: [('hello', 3), ('war', 2), ('hell', 1), ('world', 1), ('owl', 1)]
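
To connect this with the question, here is a rough sketch of the same Counter idea applied to raw text, assuming Python 3. The function name most_common_words, its parameters, and the choice of \w+ as the tokenizer are my own assumptions, not something from the question; an NLTK corpus already gives you tokens, so there you could skip the regex step.

```python
import re
from collections import Counter


def most_common_words(text, stopwords=(), n=10):
    # Lowercase first, then take runs of word characters as tokens;
    # this drops punctuation in the same step.
    tokens = re.findall(r"\w+", text.lower())
    # Count every occurrence (no set involved), skipping stop words.
    counts = Counter(token for token in tokens if token not in stopwords)
    return counts.most_common(n)


sample = "The baby saw the other baby. The baby laughed!"
print(most_common_words(sample, stopwords={"the"}, n=3))
```

Counter does the counting and the sorting, so the manual dictionary loop and the sorted(...) call are no longer needed.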


2017-07-07 04:23:09

This solution uses sets, as discussed in the earlier answers.

 

def token_words(tokn=10, s1_orig='hello i must be going'):
    # tokn is the number of most common words.
    # s1_orig is the text blob that needs to be checked.

    # logic
    # - clean the text - remove punctuation.
    # - make everything lower case
    # - replace common machine read errors.
    # - create a dictionary with orig words and changed words.
    # - create a list of unique clean words
    # - read the "clean" text and count the number of clean words
    # - sort and print the results

    # print('Number of tokens:', tokn)

    # create a dictionary that maps punctuation
    # (and line breaks) to spaces.
    punct_dict = {',': ' ',
                  '-': ' ',
                  '.': ' ',
                  '\n': ' ',
                  '\r': ' '}

    # dictionary for machine reading errors
    mach_dict = {'1': 'I', '0': 'O',
                 '6': 'b', '8': 'B'}

    # get rid of punctuation
    s1 = s1_orig
    for k, v in punct_dict.items():
        s1 = s1.replace(k, v)

    # create the original set of words.
    orig_list = set(s1.split())

    # for each word in the original set,
    # see if it has machine errors.
    # add error words to a dict.
    error_words = dict()
    for a_word in orig_list:
        a_w2 = a_word
        for k, v in mach_dict.items():
            a_w2 = a_w2.replace(k, v)

        # lower case the result.
        a_w2 = a_w2.lower()

        # add to error word dict (cleaned word -> original spellings).
        error_words.setdefault(a_w2, []).append(a_word)

    # get rid of machine errors in the full text.
    for k, v in mach_dict.items():
        s1 = s1.replace(k, v)

    # make everything lower case
    s1 = s1.lower()

    # split sentence into list.
    s1_list = s1.split()

    # consider only unique words
    s1_set = set(s1_list)

    # count the number of times
    # each word occurs in s1
    res_dict = dict()
    for a_word in s1_set:
        res_dict[a_word] = s1_list.count(a_word)

    # sort the result dictionary by count, then by word, descending
    print('--------------')
    ranked = sorted(res_dict.items(), reverse=True,
                    key=lambda kv: (kv[1], kv[0]))
    for key, value in ranked[:tokn]:
        # print results for the first tokn items
        # get all the original words that made up the key
        final_key = '|'.join(error_words[key])
        print("%s@%s" % (final_key, value))

    # close the function and return
    return True

#-------------------------------------------------------------
# main

# read the inputs from command line
num_tokens = input('Number of tokens desired: ')
raw_file = input('File name: ')

# read the file
try:
    if num_tokens == '':
        num_tokens = 10
    n_t = int(num_tokens)
    with open(raw_file, 'r') as f:
        raw_data = f.read()
    token_words(n_t, raw_data)
except (ValueError, OSError):
    print('Token or file error. Please try again.')
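
For comparison, a shorter sketch of the same idea (normalize machine-read errors, count the cleaned words, and remember which original spellings fed each one), assuming Python 3. The names count_normalized, MACH_TABLE and PUNCT_TABLE are invented here; the substitutions are copied from mach_dict and punct_dict above. It is not a drop-in replacement: ties are ordered by first appearance rather than alphabetically, and it returns the pairs instead of printing them.

```python
from collections import Counter, defaultdict

# Same substitutions as mach_dict: digits that are often misread letters.
MACH_TABLE = str.maketrans({'1': 'I', '0': 'O', '6': 'b', '8': 'B'})
# Same characters as punct_dict, each mapped to a space.
PUNCT_TABLE = str.maketrans({c: ' ' for c in ',-.\n\r'})


def count_normalized(text, n=10):
    variants = defaultdict(set)  # cleaned word -> original spellings seen
    counts = Counter()           # cleaned word -> number of occurrences
    for original in text.translate(PUNCT_TABLE).split():
        cleaned = original.translate(MACH_TABLE).lower()
        variants[cleaned].add(original)
        counts[cleaned] += 1
    # Join the original spellings with '|' as token_words does.
    return [('|'.join(sorted(variants[word])), count)
            for word, count in counts.most_common(n)]


print(count_normalized("going g0ing Going gone."))
```

Here the set is only used to collect spellings; the counting itself runs over every token, which is the point made in the accepted answer.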


2018-03-09 13:20:10 Nikb999
