Assuming your additional data has the following directory structure (if it doesn't, creating it should be your first step, because it makes importing the data via the sklearn API much easier):
additional_data
|
|-> sports.cricket
|     |
|     |-> file1.txt
|     |-> file2.txt
|     |-> ...
|
|-> cooking.french
|     |
|     |-> file1.txt
|     |-> ...
...
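If your additional data isn't in this layout yet, a short script like the one below can arrange it (this is only a sketch; the `docs_by_label` contents and file names are purely illustrative):

```python
import os

# Purely illustrative documents, keyed by the per-class directory name
docs_by_label = {
    'sports.cricket': ['the batsman hit a six', 'bowler took five wickets'],
    'cooking.french': ['whisk the eggs with butter'],
}

root = 'additional_data'
for label, docs in docs_by_label.items():
    label_dir = os.path.join(root, label)
    os.makedirs(label_dir, exist_ok=True)  # one sub-directory per class
    for i, text in enumerate(docs, start=1):
        # load_files treats each file in a class directory as one document
        with open(os.path.join(label_dir, 'file{}.txt'.format(i)), 'w') as f:
            f.write(text)
```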
Moving on to the Python side:
import os

import joblib
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Note: in older scikit-learn versions, `train_test_split` lived in `sklearn.cross_validation`
# and joblib was imported as `from sklearn.externals import joblib`; both are long deprecated

# Note: if you have a pre-defined training/testing split in your additional data,
# you would merge it with the corresponding 'train' and 'test' subsets of 20news
news_data = fetch_20newsgroups(subset='all')
additional_data = load_files(container_path='/path/to/additional_data', encoding='utf-8')

# Both data objects are of type `Bunch` and can therefore be merged relatively straightforwardly
'''
The Bunch object contains the following attributes: `dict_keys(['target_names', 'description', 'DESCR', 'target', 'data', 'filenames'])`
The interesting ones for our purposes are 'data' and 'filenames'
'''
all_filenames = np.concatenate((news_data.filenames, additional_data.filenames))  # filenames is a numpy array
all_data = news_data.data + additional_data.data  # data is a standard python list

merged_data_path = '/path/to/merged_data'
'''
A 20newsgroups filename looks like '/path/to/scikit_learn_data/20news_home/20news-bydate-test/rec.sport.hockey/54367',
so depending on whether you want to keep the sub-directory structure of the train/test splits or not,
you need either the last 2 or the last 3 parts of the path
'''
for content, f in zip(all_data, all_filenames):
    # Extract the class sub-path and the file name
    sub_path, filename = f.split(os.sep)[-2:]

    # Create the output directory if it doesn't exist
    p = os.path.join(merged_data_path, sub_path)
    if not os.path.exists(p):
        os.makedirs(p)

    # Write the document to file
    with open(os.path.join(p, filename), 'w') as out_file:
        out_file.write(content)

# Now that everything is stored at `merged_data_path`, we can use `load_files` to fetch
# the dataset again, which now includes everything from 20newsgroups and your additional data
all_data = load_files(container_path=merged_data_path)
'''
all_data is yet another `Bunch` object:
 * `data` contains the documents
 * `target_names` contains the label names
 * `target` contains the labels in numeric format
 * `filenames` contains the path of each individual document
Thus, running a classifier over the data is straightforward
'''
vec = CountVectorizer()
X = vec.fit_transform(all_data.data)

# Create a train/test split for learning and evaluating a classifier
# (supposing we haven't created a pre-defined train/test split encoded in the directory structure)
X_train, X_test, y_train, y_test = train_test_split(X, all_data.target, test_size=0.2)

# Create & fit the MNB model
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Evaluate accuracy
y_predicted = mnb.predict(X_test)
print('Accuracy: {}'.format(accuracy_score(y_test, y_predicted)))

# Alternatively, the vectorisation and learning can be packaged into a pipeline
# and serialised for later use
pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# Run the vectoriser and train the classifier on all available data
pipeline.fit(all_data.data, all_data.target)

# Serialise the fitted pipeline to disk
joblib.dump(pipeline, '/path/to/model_zoo/mnb_pipeline.joblib')

# If you get more data later on, you can deserialise the pipeline and run it on the new documents
p = joblib.load('/path/to/model_zoo/mnb_pipeline.joblib')
docs_new = ['God is love', 'OpenGL on the GPU is fast']
y_predicted = p.predict(docs_new)
print('Predicted labels: {}'.format(np.array(all_data.target_names)[y_predicted]))
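Beyond the single hold-out split above, a vectoriser-plus-classifier pipeline can also be scored with k-fold cross-validation. The sketch below uses a tiny made-up corpus standing in for the merged dataset (the documents and labels are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus: four 'cricket' and four 'cooking' documents
docs = ['the batsman hit a six', 'cricket match ended in a draw',
        'bowler took five wickets', 'umpire reviewed the catch',
        'whisk the eggs with butter', 'simmer the sauce with garlic',
        'bake the tart until golden', 'season the soup with thyme']
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([('vec', CountVectorizer()), ('mnb', MultinomialNB())])

# 4-fold cross-validation: each fold re-fits the vectoriser and classifier
# on its training portion only, so no vocabulary leaks from the held-out fold
scores = cross_val_score(pipeline, docs, labels, cv=4)
print('Mean accuracy: {:.3f}'.format(scores.mean()))
```

Passing the raw pipeline to `cross_val_score` (rather than pre-vectorised features) is what keeps each fold leakage-free.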
Could you please upload your code some time (given that your additional data is in the format above and rooted at
/path/to/additional_data
), loading the two datasets, and show where you're stuck? – I'm really not stuck anywhere, and I have no code to show! I want to know how to add more training data (documents and accompanying classes) to the 20 document classes that sklearn already provides. – SexyBeast
에 뿌리를두고있다 랬) 두 데이터 세트를로드 그리고 당신이 붙어있는 곳? –나는 정말 아무데도 붙어 있지 않다. 보여줄 코드가 없다! Skilearn이 이미 제공하는 20 가지 문서 클래스에 더 많은 교육 데이터 (문서 및 동반 클래스)를 추가하는 방법을 알고 싶습니다. – SexyBeast