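Adding training data to scikit-learn

I am looking at the training data that ships with sklearn, here. According to the documentation, it contains 20 classes of documents, based on a newsgroup collection. It does a fairly good job of classifying documents that belong to those categories. However, I need to add more articles for categories such as cricket, football, nuclear physics, etc.

I have a set of documents ready for each class, such as sports -> cricket, cooking -> French, etc. How do I add those documents and classes to sklearn, so that the interface that currently returns 20 classes returns those 20 plus the new ones? If some training is needed, via SVM or Naive Bayes, where do I do it before adding the documents to the dataset?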

Source

2016-07-22 SexyBeast

Could you please upload the code you have so far and say where you are stuck? –

I am not really stuck anywhere, and I have no code to show! I want to know how to add more training data (documents and their accompanying classes) to the 20 document classes that sklearn already provides. – SexyBeast

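Suppose your additional data has the following directory structure. If it does not, arranging it this way should be your first step, because it makes the data much easier to fetch with the sklearn API (see here):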

additional_data 
     | 
     |-> sports.cricket 
       | 
       |-> file1.txt 
       |-> file2.txt 
       |-> ... 
     | 
     |-> cooking.french 
       | 
       |-> file1.txt 
       |-> ... 
     ...
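
As a minimal sketch of why this layout helps: `load_files` treats each subdirectory name as a class label. The snippet below builds a throwaway copy of the layout in a temporary directory and loads it. The `toy_docs` contents and file names are made up for illustration and are not part of the original data.

```python
import os
import tempfile

from sklearn.datasets import load_files

# Hypothetical toy documents, keyed by the class folder they live in
toy_docs = {
    'sports.cricket': ['The bowler took a wicket.', 'A century at the crease.'],
    'cooking.french': ['Whisk the butter into the sauce.'],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in toy_docs.items():
        class_dir = os.path.join(root, label)
        os.makedirs(class_dir)
        for i, text in enumerate(texts, start=1):
            file_path = os.path.join(class_dir, 'file{}.txt'.format(i))
            with open(file_path, 'w', encoding='utf-8') as fh:
                fh.write(text)

    # Each subdirectory becomes one class; folder names end up in `target_names`
    bunch = load_files(container_path=root, encoding='utf-8')

print(bunch.target_names)
print(len(bunch.data), 'documents loaded')
```

The numeric labels in `bunch.target` index into `bunch.target_names`, so adding a class amounts to adding a folder.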

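Moving on to Python, load both datasets (assuming your additional data is in the format above and rooted at /path/to/additional_data):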

import os

import joblib
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Note: if you have a pre-defined training/testing split in your additional data,
# you would merge them with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')

# Both data objects are of type `Bunch` and can therefore be merged fairly easily.
# The Bunch object contains the following attributes:
# `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
# The interesting ones for our purposes are 'data' and 'filenames'
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames))  # filenames is a numpy array
all_data = news_data.data + additional_data.data  # data is a standard python list

merged_data_path = '/path/to/merged_data'

# The 20newsgroups data has filenames a la
# '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367'
# So depending on whether you want to keep the sub directory structure of the
# train/test splits or not, you would need either the last 2 or 3 parts of the path
for content, f in zip(all_data, all_filenames):
    # Extract the class directory and the file name
    sub_path, filename = f.split(os.sep)[-2:]

    # Create the output directory if it does not exist yet
    out_dir = os.path.join(merged_data_path, sub_path)
    os.makedirs(out_dir, exist_ok=True)

    # Write the document to file, with an explicit encoding so it can be read back reliably
    with open(os.path.join(out_dir, filename), 'w', encoding='utf-8') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch
# the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path, encoding='utf-8')

# all_data is yet another `Bunch` object:
#     * `data` contains the data
#     * `target_names` contains the label names
#     * `target` contains the labels in numeric format
#     * `filenames` contains the paths of each individual document
#
# thus, running a classifier over the data is straightforward
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# We want to create a train/test split for learning and evaluating a classifier
# (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate accuracy
y_predicted = mnb.predict(X_test)

print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectorizer and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the classifier to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get some more data later on, you can deserialise the model and run the new documents through the pipeline
loaded_pipeline = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')

docs_new = ['God is love', 'OpenGL on the GPU is fast']

y_predicted = loaded_pipeline.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))
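
If writing everything back to disk is not desirable, the two `Bunch` objects could probably also be merged in memory. The sketch below is an alternative to the approach above, not part of it: `merge_bunches` is a hypothetical helper, and the two toy `Bunch` objects stand in for `news_data` and `additional_data` so the example runs without downloading anything. The key step is remapping each dataset's numeric labels into one shared label space.

```python
import numpy as np
from sklearn.utils import Bunch


def merge_bunches(first, second):
    """Hypothetical helper: combine two Bunch datasets into one label space."""
    # Union of label names; sorted so the numeric ids are deterministic
    target_names = sorted(set(first.target_names) | set(second.target_names))
    name_to_id = {name: i for i, name in enumerate(target_names)}

    data, target = [], []
    for bunch in (first, second):
        data.extend(bunch.data)
        # Translate each old numeric label to its id in the merged label space
        target.extend(name_to_id[bunch.target_names[t]] for t in bunch.target)

    return Bunch(data=data, target=np.array(target), target_names=target_names)


# Toy stand-ins for `news_data` and `additional_data` (made-up contents)
news_like = Bunch(data=['puck on the ice', 'gpu shader code'],
                  target=np.array([1, 0]),
                  target_names=['comp.graphics', 'rec.sport.hockey'])
extra_like = Bunch(data=['wicket and bowler'],
                   target=np.array([0]),
                   target_names=['sports.cricket'])

merged = merge_bunches(news_like, extra_like)
print(merged.target_names)
print(merged.target)
```

The merged object exposes the same `data`, `target` and `target_names` attributes, so it can be passed to the vectorizer and classifier in the same way as the reloaded `all_data`.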

Source

2016-07-27 08:32:46 tttthomasssss

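Wow. That looks very promising. A couple of questions: what can I do with the final `all_data` variable? That is, this example, http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, shows how to classify documents. How do I use the `all_data` obtained above in that case? Secondly, the last part of your answer is a bit unclear. It would be great if you could explain it a little more. – SexyBeast

@AttitudeMonger I have updated my answer. Could you briefly say which part was unclear? – tttthomasssss

Wow. It keeps getting better! This deserves a bounty bigger than 50, and I will award more later. :) Coming back to it, there is one thing I do not understand: how can new data be tested against the existing dataset? For example, in the link given above, `docs_new = ['God is love', 'OpenGL on the GPU is fast']` loads the text data for which we want to find the matching categories, and we get a result for each text in the list. How is that done in your example? – SexyBeast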
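
To illustrate the per-document result asked about here, the following is a minimal sketch on a toy corpus. The documents, labels and `target_names` below are made up and stand in for the merged `all_data`. Each entry of the prediction is a numeric label, which is used as an index into the label names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for `all_data` (made-up documents and labels)
target_names = ['cooking.french', 'sports.cricket']
train_docs = ['french sauce butter wine', 'cricket bat ball wicket']
train_target = [0, 1]

# Same pipeline shape as in the answer: vectorizer followed by Naive Bayes
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])
pipeline.fit(train_docs, train_target)

# New, unseen texts: one predicted label per document
docs_new = ['the bowler took a wicket', 'butter and wine sauce']
predicted = pipeline.predict(docs_new)
predicted_names = [target_names[i] for i in predicted]

for doc, name in zip(docs_new, predicted_names):
    print('{!r} => {}'.format(doc, name))
```

With the real data, `target_names` would be `all_data.target_names` and the pipeline would be the one fitted on `all_data.data`.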
