Naive Bayes classifier for n-grams

I'm using the Ruby Classifier library to classify privacy policies. I've come to the conclusion that the simple bag-of-words approach built into this library is not enough. To increase classification accuracy, I want to train the classifier on n-grams in addition to individual words.

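I was wondering whether there is a library for preprocessing documents to extract the relevant n-grams (and deal with punctuation correctly). One thought was that I could preprocess the documents myself and feed pseudo-n-grams into the Ruby Classifier, like: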

wordone_wordtwo_wordthree

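Or maybe there is a better way to do this, such as a library that has n-gram-based Naive Bayes classification built in from the start. I'm open to using languages other than Ruby if they get the job done (Python seems like a good candidate if need be).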
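The pseudo-n-gram idea from the question can be sketched in plain Ruby. This is only an illustration under stated assumptions: `pseudo_ngrams` and `with_pseudo_ngrams` are hypothetical helper names, the tokenizer is a simple regex, and sentence boundaries are guessed from `.`, `!` and `?`. Whether a given classifier's own tokenizer keeps the underscore-joined tokens intact is also an assumption worth checking before relying on this.

```ruby
# Hypothetical helper (not part of the Classifier gem): underscore-joined
# n-grams that never span a sentence boundary.
def pseudo_ngrams(text, n)
  # Split on sentence-ending punctuation first, so "data. We" never forms a bigram.
  text.downcase.split(/[.!?]+/).flat_map do |sentence|
    words = sentence.scan(/[a-z0-9']+/) # keeps letters, digits, apostrophes; drops other punctuation
    words.each_cons(n).map { |gram| gram.join('_') }
  end
end

# Hypothetical helper: the original words followed by all 2..max_n pseudo-n-grams,
# joined into one string that a bag-of-words classifier could be trained on.
def with_pseudo_ngrams(text, max_n = 3)
  unigrams = text.downcase.scan(/[a-z0-9']+/)
  grams = (2..max_n).flat_map { |n| pseudo_ngrams(text, n) }
  (unigrams + grams).join(' ')
end

with_pseudo_ngrams("We collect data. We share it.", 2)
# => "we collect data we share it we_collect collect_data we_share share_it"
```

The resulting string would then be passed to the classifier's training call in place of the raw document, so that each pseudo-n-gram is counted like an ordinary word.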

Source

2012-04-09 babonk

If you're OK with Python, I'd say nltk would be perfect for you; the nltk.NgramModel session further down this page shows an example.

It even has nltk.NaiveBayesClassifier.

Source

2012-04-09 20:21:11

Nice answer, +1 – Yavar

NLTK looks amazing in many ways compared to what Ruby has to offer. Python wins, thanks! – babonk

@babonk My pleasure. I've found nltk to be a joy to use and incredibly powerful. :D –

>> s = "She sells sea shells by the sea shore" 
=> "She sells sea shells by the sea shore" 
>> s.split(/ /).each_cons(2).to_a.map {|x,y| x + ' ' + y} 
=> ["She sells", "sells sea", "sea shells", "shells by", "by the", "the sea", "sea shore"]

Ruby enumerables have a method called each_cons that yields each run of n consecutive items from an enumerable.

>>> # NLTK session from the Python answer; nltk.NgramModel is the NLTK 2.x API and was removed in NLTK 3
>>> import nltk
>>> s = "This is some sample data. Nltk will use the words in this string to make ngrams. I hope that this is useful.".split()
>>> model = nltk.NgramModel(2, s)
>>> model._ngrams
set([('to', 'make'), ('sample', 'data.'), ('the', 'words'), ('will', 'use'), ('some', 'sample'), ('', 'This'), ('use', 'the'), ('make', 'ngrams.'), ('ngrams.', 'I'), ('hope', 'that'), ('is', 'some'), ('is', 'useful.'), ('I', 'hope'), ('this', 'string'), ('Nltk', 'will'), ('words', 'in'), ('this', 'is'), ('data.', 'Nltk'), ('that', 'this'), ('string', 'to'), ('in', 'this'), ('This', 'is')])

With each_cons, generating n-grams is a simple one-liner.
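Once n-grams are that cheap to generate, training on them is mostly a matter of counting. The following is a minimal sketch of a multinomial naive Bayes classifier over unigram and bigram features, with add-one smoothing. It is not how the Classifier gem or NLTK implement it; `features` and `TinyNgramBayes` are hypothetical names made up for this illustration, and the training sentences are invented toy data.

```ruby
# Hypothetical helper: unigrams plus underscore-joined bigrams for one document.
def features(text)
  words = text.downcase.scan(/[a-z0-9']+/)
  words + words.each_cons(2).map { |pair| pair.join('_') }
end

# Illustrative multinomial naive Bayes with add-one (Laplace) smoothing.
class TinyNgramBayes
  def initialize
    @counts = Hash.new { |h, label| h[label] = Hash.new(0) } # label => feature => count
    @docs   = Hash.new(0)                                     # label => number of training documents
    @vocab  = {}                                              # every feature seen in training
  end

  def train(label, text)
    @docs[label] += 1
    features(text).each do |f|
      @counts[label][f] += 1
      @vocab[f] = true
    end
  end

  # Returns the label with the highest log posterior for the given text.
  def classify(text)
    feats = features(text)
    total_docs = @docs.values.sum
    @docs.keys.max_by do |label|
      total = @counts[label].values.sum
      log_prior = Math.log(@docs[label].to_f / total_docs)
      log_likelihood = feats.sum do |f|
        # Add-one smoothing keeps unseen n-grams from zeroing out the product.
        Math.log((@counts[label][f] + 1.0) / (total + @vocab.size))
      end
      log_prior + log_likelihood
    end
  end
end

nb = TinyNgramBayes.new
nb.train(:shares,  "We share your data with third parties")
nb.train(:shares,  "Your data is sold to third parties")
nb.train(:private, "We never share your data")
nb.train(:private, "Your data is not shared with anyone")
nb.classify("data sold to third parties") # => :shares
```

Treating each n-gram as just another feature keeps the model unchanged; the trade-off is that overlapping unigrams and bigrams are clearly not independent, so the naive Bayes assumption is stretched further than with single words.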

Source

2012-04-10 04:24:06

Thx. I had to use 'each_cons' instead of 'enum_cons'. – Dru

Dru: It looks like enum_cons is no longer available. I've changed my answer to use each_cons. Thanks! –
