2017-10-08 1 views
1

scikit-learn: StratifiedShuffleSplit KeyError with index

This is my pandas dataframe <code>lots_not_preprocessed_usd</code>:

<class 'pandas.core.frame.DataFrame'> 
Index: 78718 entries, 2017-09-12T18-38-38-076065 to 2017-10-02T07-29-40-245031 
Data columns (total 20 columns): 
created_year    78718 non-null float64 
price      78718 non-null float64 
........ 
decade     78718 non-null int64 
dtypes: float64(8), int64(1), object(11) 
memory usage: 12.6+ MB 

head(1):

artist_name_normalized house created_year description exhibited_in exhibited_in_museums height images max_estimated_price min_estimated_price price provenance provenance_estate_of sale_date sale_id sale_title style title width decade 
    key                    
    2017-09-12T18-38-38-076065 NaN c11 1862.0 An Album and a small Quantity of unframed Draw... NaN NaN NaN NaN 535.031166 267.515583 845.349242 NaN NaN 1998-06-21 8033 OILS, WATERCOLOURS & DRAWINGS FROM 18TH - 20TH... watercolor painting An Album and a small Quantity of unframed Draw... NaN 186 

This is my script:

from sklearn.model_selection import StratifiedShuffleSplit 

split = StratifiedShuffleSplit(n_splits=1, test_size =0.2, random_state=42) 
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']): 
    strat_train_set = lots_not_preprocessed_usd.loc[train_index] 
    strat_test_set = lots_not_preprocessed_usd.loc[test_index] 

I get this error message:

KeyError         Traceback (most recent call last) 
<ipython-input-224-cee2389254f2> in <module>() 
     3 split = StratifiedShuffleSplit(n_splits=1, test_size =0.2, random_state=42) 
     4 for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']): 
----> 5  strat_train_set = lots_not_preprocessed_usd.loc[train_index] 
     6  strat_test_set = lots_not_preprocessed_usd.loc[test_index] 

...... 

KeyError: 'None of [[32199 67509 69003 ..., 44204 2809 56726]] are in the [index]' 

There seems to be a problem with my index (e.g. 2017-09-12T18-38-38-076065), but I don't understand it. Where is the problem?

If I use a different split, it works as expected:

from sklearn.model_selection import train_test_split 

train_set, test_set = train_test_split(lots_not_preprocessed_usd, test_size=0.2, random_state=42) 
+0

Please explain more. – Dark

Answers

2

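With .loc you must pass labels that actually exist in the index as the row indexer, so use .iloc instead of .loc when you want to index with ordinary integer positions. In the for loop, train_index and test_index are not labels from your index, because split.split(X, y) returns arrays of shuffled integer positions.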

... 
for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']): 
    strat_train_set = lots_not_preprocessed_usd.iloc[train_index] 
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index] 
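
To make the difference concrete, here is a minimal sketch on a made-up frame. The column values and the `key-XX` string labels are assumptions chosen only to mimic a non-integer index like the one in the question. It shows that `.loc` with the positions from `split.split` raises a KeyError, that `.iloc` works, and that converting the positions to labels with `df.index[...]` is an equivalent route if you prefer `.loc`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: string labels as the index, two balanced "decade" classes.
df = pd.DataFrame(
    {"price": np.arange(20, dtype=float), "decade": [180, 190] * 10},
    index=[f"key-{i:02d}" for i in range(20)],
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(df, df["decade"]))

# split() yields integer positions (0..n-1), not index labels.
print(train_index.dtype)

# Label-based lookup fails: none of these integers is a label in the index.
try:
    df.loc[train_index]
    loc_failed = False
except KeyError:
    loc_failed = True

# Position-based lookup works.
strat_train_set = df.iloc[train_index]
strat_test_set = df.iloc[test_index]

# Equivalent alternative: translate positions to labels, then use .loc.
same_train_set = df.loc[df.index[train_index]]
```

The `.loc` variant only matches `.iloc` when the index labels are unique, so `.iloc` is probably the safer default here.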

Sample example:

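import numpy as np 
import pandas as pd 

# Toy frame with a DatetimeIndex; `split` is the StratifiedShuffleSplit instance from the question 
lots_not_preprocessed_usd = pd.DataFrame({'some': np.random.randint(5, 10, 100), 'decade': np.random.randint(5, 10, 100)}, index=pd.date_range('5-10-15', periods=100)) 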

for train_index, test_index in split.split(lots_not_preprocessed_usd, lots_not_preprocessed_usd['decade']): 

    strat_train_set = lots_not_preprocessed_usd.iloc[train_index] 
    strat_test_set = lots_not_preprocessed_usd.iloc[test_index] 

Sample output:

strat_train_set.head() 
 
      decade some 
2015-08-02  6  7 
2015-06-14  7  6 
2015-08-14  7  9 
2015-06-25  9  5 
2015-05-15  7  9 
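
As a side note, if a single stratified split is all you need, `train_test_split` accepts a `stratify` argument and returns the DataFrames directly, so there is no manual indexing at all. The sketch below is only an illustration under assumptions: the frame is made up, with five equally sized `decade` groups so the proportions are easy to check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, five "decade" groups of 20 rows each.
rng = np.random.RandomState(0)
df = pd.DataFrame(
    {"some": rng.randint(5, 10, 100), "decade": np.repeat([5, 6, 7, 8, 9], 20)},
    index=pd.date_range("2015-05-10", periods=100),
)

# stratify= keeps the class proportions of `decade` in both parts.
train_set, test_set = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["decade"]
)

# Each decade should contribute the same share to the test set.
print(test_set["decade"].value_counts().sort_index())
```

The original index labels are carried over to both parts, so `.loc` lookups by label keep working on the results.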

+1

I've added 'lots_not_preprocessed_usd.head()'. Thanks, it works. – zinyosrim