2016-11-29 5 views
0

Python logistic regression

I have been at this for a few hours now and feel really stuck.

I am trying to use multiple columns from a CSV file, "ScoreBuckets.csv", to predict another column in the same file, "Score_Bucket". The problem is that my output makes no sense, and I don't know how to use multiple columns to predict the Score_Bucket column. I'm new to data mining, so I'm not 100% familiar with the code/syntax.

Here is the code I have so far:

import pandas as pd 
import numpy as np 
from sklearn import metrics 
from sklearn.linear_model import LogisticRegression 
from sklearn.cross_validation import KFold, cross_val_score 

dataset = pd.read_csv('ScoreBuckets.csv') 

CV = (dataset.Score_Bucket.reshape((len(dataset.Score_Bucket), 1))).ravel() 
data = (dataset.ix[:,'CourseLoad_RelativeStudy':'Sleep_Sex'].values).reshape(
      (len(dataset.Score_Bucket), 2)) 


# Create a logistic regression object 
LogReg = LogisticRegression() 

# Train the model using the training sets 
LogReg.fit(data, CV) 

# the model 
print('Coefficients (m): \n', LogReg.coef_) 
print('Intercept (b): \n', LogReg.intercept_) 

#predict the class for each data point 
predicted = LogReg.predict(data) 
print("Predictions: \n", np.array([predicted]).T) 

# predict the probability/likelihood of the prediction 
print("Probability of prediction: \n",LogReg.predict_proba(data)) 
modelAccuracy = LogReg.score(data,CV) 
print("Accuracy score for the model: \n", LogReg.score(data,CV)) 

print(metrics.confusion_matrix(CV, predicted, labels=["Yes","No"])) 

# Calculating 5 fold cross validation results 
LogReg = LogisticRegression() 
kf = KFold(len(CV), n_folds=5) 
scores = cross_val_score(LogReg, data, CV, cv=kf) 
print("Accuracy of every fold in 5 fold cross validation: ", abs(scores)) 
print("Mean of the 5 fold cross-validation: %0.2f" % abs(scores.mean())) 

print("The accuracy difference between model and KFold is: ", 
     abs(abs(scores.mean())-modelAccuracy)) 

ScoreBuckets.csv :

Score_Bucket,Healthy,Course_Load,Miss_Class,Relative_Study,Faculty,Sleep,Relation_Status,Sex,Relative_Stress,Res_Gym?,Tuition_Awareness,Satisfaction,Healthy_TuitionAwareness,Healthy_TuitionAwareness_MissClass,Healthy_MissClass_Sex,Sleep_Faculty_RelativeStress,TuitionAwareness_ResGym,CourseLoad_RelativeStudy,Sleep_Sex 
5,0.5,1,0,1,0.4,0.33,1,0,0.5,1,0,0,0.75,0.5,0.17,0.41,0.5,1,0.17 
2,1,1,0.33,0.5,0.4,0.33,0,0,1,0,0,0,0.5,0.44,0.44,0.58,0,0.75,0.17 
5,0.5,1,0,0.5,0.4,0.33,1,0,0.5,0,1,0,0.75,0.5,0.17,0.41,0.5,0.75,0.17 
4,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17 
5,0.5,1,0.33,0.5,0.4,0,1,1,1,0,1,0,0.75,0.61,0.61,0.47,0.5,0.75,0.5 
5,0.5,1,0,1,0.4,0.33,1,1,1,1,1,1,0.75,0.5,0.5,0.58,1,1,0.67 
5,0.5,1,0,0,0.4,0.33,0,0,0.5,0,1,0,0.25,0.17,0.17,0.41,0.5,0.5,0.17 
2,0.5,1,0.67,0.5,0.4,0,1,1,0.5,0,0,0,0.75,0.72,0.72,0.3,0,0.75,0.5 
5,0.5,1,0,1,0.4,0.33,0,1,1,0,1,1,0.25,0.17,0.5,0.58,0.5,1,0.67 
5,1,1,0,0.5,0.4,0.33,0,1,0.5,0,1,1,0.5,0.33,0.67,0.41,0.5,0.75,0.67 
0,0.5,1,0,1,0.4,0.33,0,0,0.5,0,0,0,0.25,0.17,0.17,0.41,0,1,0.17 
2,0.5,1,0,0.5,0.4,0.33,1,1,1,0,0,0,0.75,0.5,0.5,0.58,0,0.75,0.67 
5,0.5,1,0,1,0.4,0.33,0,0,1,1,1,0,0.25,0.17,0.17,0.58,1,1,0.17 
0,0.5,1,0.33,0.5,0.4,0.33,1,1,0.5,0,1,0,0.75,0.61,0.61,0.41,0.5,0.75,0.67 
5,0.5,1,0,0.5,0.4,0.33,0,0,0.5,0,1,1,0.25,0.17,0.17,0.41,0.5,0.75,0.17 
4,0,1,0.67,0.5,0.4,0.67,1,0,0.5,1,0,0,0.5,0.56,0.22,0.52,0.5,0.75,0.34 
2,0.5,1,0.33,1,0.4,0.33,0,0,0.5,0,1,0,0.25,0.28,0.28,0.41,0.5,1,0.17 
5,0.5,1,0.33,0.5,0.4,0.33,0,1,1,0,1,0,0.25,0.28,0.61,0.58,0.5,0.75,0.67 
5,0.5,1,0,1,0.4,0.33,0,0,0.5,1,1,0,0.25,0.17,0.17,0.41,1,1,0.17 
5,0.5,1,0.33,0.5,0.4,0.33,1,1,1,0,1,0,0.75,0.61,0.61,0.58,0.5,0.75,0.67 

Output:

Coefficients (m): 
[[-0.4012899 -0.51699939] 
[-0.72785212 -0.55622303] 
[-0.62116232 0.30564259] 
[ 0.04222459 -0.01672418]] 
Intercept (b): 
[-1.80383738 -1.5156701 -1.29452772 0.67672118] 
Predictions: 
[[5] 
[5] 
[5] 
[5] 
... 
[5] 
[5] 
[5] 
[5]] 
Probability of prediction: 
[[ 0.09302973 0.08929139 0.13621146 0.68146742] 
[ 0.09777325 0.10103782 0.14934111 0.65184782] 
[ 0.09777325 0.10103782 0.14934111 0.65184782] 
[ 0.10232068 0.11359509 0.16267645 0.62140778] 
... 
[ 0.07920945 0.08045552 0.17396476 0.66637027] 
[ 0.07920945 0.08045552 0.17396476 0.66637027] 
[ 0.07920945 0.08045552 0.17396476 0.66637027] 
[ 0.07346886 0.07417316 0.18264008 0.66971789]] 
Accuracy score for the model: 
0.671171171171 
[[0 0] 
[0 0]] 
Accuracy of every fold in 5 fold cross validation: 
    [ 0.64444444 0.73333333 0.68181818 0.63636364 0.65909091] 
Mean of the 5 fold cross-validation: 0.67 
The accuracy difference between model and KFold is: 0.00016107016107 

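I say the output makes no sense for two reasons:

1. No matter which columns I feed in, the prediction accuracy stays the same. That should not happen, because some columns are better predictors of the Score_Bucket column than others.
2. I cannot use multiple columns to predict the Score_Bucket column, because I am told the arrays must be the same size, yet multiple columns will obviously have a larger array size than the single Score_Bucket column.

What am I doing wrong?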

Answers

1

First, double-check whether your problem can be framed as a classification problem or whether it should be formulated as a regression problem.
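One quick way to look at this is to count how often each target value occurs. The sketch below uses only the Score_Bucket values from the 20 sample rows shown in the question; the variable names are illustrative, and the full file may be distributed differently.

```python
import pandas as pd

# Score_Bucket values copied from the 20 sample rows shown in the question.
score_bucket = pd.Series(
    [5, 2, 5, 4, 5, 5, 5, 2, 5, 5, 0, 2, 5, 0, 5, 4, 2, 5, 5, 5],
    name="Score_Bucket",
)

# How many rows fall into each class?
class_counts = score_bucket.value_counts().sort_index()
print(class_counts)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_counts.max() / len(score_bucket)
print(majority_baseline)
```

In these sample rows, class 5 accounts for 12 of 20 rows, so a model that always predicts 5 already reaches 60% accuracy. That may be part of why the predictions in the question are all 5 and the accuracy barely moves when the feature columns change.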

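Assuming you really do want to classify your data into the four unique classes found in the Score_Bucket column, why do you think you cannot use multiple columns as predictors? In fact, you are already using the last two columns in your example: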

X = dataset[["CourseLoad_RelativeStudy", "Sleep_Sex"]]  # features: 2-D DataFrame
y = dataset["Score_Bucket"]  # target: 1-D Series, the shape sklearn expects for y
logreg = LogisticRegression() 
logreg.fit(X, y) 
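The "same size" requirement only concerns the number of rows: X may have any number of columns as long as it has one row per entry in y. The following is a minimal sketch on a synthetic stand-in for ScoreBuckets.csv (the values are made up, only the column names are borrowed from the question) that fits on three columns and inspects the resulting shapes.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for ScoreBuckets.csv; the values are made up.
rng = np.random.default_rng(0)
n_rows = 60
dataset = pd.DataFrame({
    "Score_Bucket": np.repeat([0, 2, 4, 5], n_rows // 4),
    "Healthy": rng.random(n_rows),
    "Sleep": rng.random(n_rows),
    "Sleep_Sex": rng.random(n_rows),
})

feature_columns = ["Healthy", "Sleep", "Sleep_Sex"]
X = dataset[feature_columns]   # 2-D: (n_rows, n_features)
y = dataset["Score_Bucket"]    # 1-D: (n_rows,)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

print(X.shape, y.shape)        # row counts match, column counts need not
print(logreg.classes_)         # the distinct target values
print(logreg.coef_.shape)      # one row per class, one column per feature
```

This also explains the coefficient matrix printed in the question: four rows of two numbers correspond to four classes and two feature columns.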

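You can make your code a bit more readable if you take into account that sklearn methods work directly with pandas DataFrames (there is no need to convert to NumPy arrays), as the fit call above does. If you want to select more columns, you can use loc:

X = dataset.loc[:, "Healthy":"Sleep_Sex"] 

You can also select columns by index:

X = dataset.iloc[:, 1:] 

Regarding your second question, I do get different results from the cross-validation procedure depending on which columns I use as features. Note, however, that the number of samples (20) is very small, which makes the estimated prediction accuracy quite variable.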
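To see how the choice of feature columns can change the cross-validation scores, here is a small sketch on synthetic data. The column names "informative" and "noise", the helper cv_accuracy, and all values are assumptions made for illustration; it uses sklearn.model_selection, which replaced the older sklearn.cross_validation module imported in the question.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data, made up for illustration: one column tracks the class,
# the other is pure noise.
rng = np.random.default_rng(0)
y = pd.Series(np.repeat([0, 2, 4, 5], 15), name="Score_Bucket")
dataset = pd.DataFrame({
    "informative": y + rng.normal(scale=0.1, size=len(y)),
    "noise": rng.random(len(y)),
})

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def cv_accuracy(columns):
    """Return the per-fold accuracies of a model trained on `columns`."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, dataset[columns], y, cv=cv)


scores_informative = cv_accuracy(["informative"])
scores_noise = cv_accuracy(["noise"])

print("informative:", scores_informative.mean())
print("noise:      ", scores_noise.mean())
```

Shuffled, stratified folds are a design choice here rather than something the original answer prescribes; with few rows per class, the per-fold scores can still vary a lot from one split to the next.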

+0

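Thanks for the help! I am trying to split the Score_Bucket column into multiple columns and predict each of them. With the code above I get an error at residual_error = CV - predicted: ValueError: Wrong number of items passed 222, placement implies 1 – user2997307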

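+0

I don't understand what you mean by splitting the "Score_Bucket" column into multiple columns. The target 'y' has to be in a single column, so it should not be split into multiple columns. Also, since you are computing a residual error, I think you want to do regression. You cannot do that with logistic regression, because it is a classifier. – cbrnr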