2016-10-06 6 views

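pandas: fuzzy group-by summary statistics

I have a DataFrame loaded from a CSV and want to compute basic summary statistics (mean, variance, ...) over the training runs of each model.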

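If I insert a model number column and group by it, everything works, but that is not a good solution. Grouping by modelName does not work because of the counter embedded in each name, so how can I get summary statistics per model (for the training runs only)?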

df.groupby(['modelName', 'typeOfRun'])['kappa'].mean() 

or

df[df.typeOfRun != 'validation'].describe() 

Neither of these gives the result I want.

AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun 
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training 
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training 
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training 
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation 
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training 
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training 
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training 
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation 

Answers


IIUC, you can use DataFrameGroupBy.describe:

print (df.groupby(['modelName', 'typeOfRun']).describe()) 

              f1_R  kappa 
modelName    typeOfRun        
00_testing-validation validation count 1.000000 1.000000 
            mean 0.832731 0.097878 
            std   NaN  NaN 
            min 0.832731 0.097878 
            25% 0.832731 0.097878 
            50% 0.832731 0.097878 
            75% 0.832731 0.097878 
            max 0.832731 0.097878 
00_testing_1-training training count 1.000000 1.000000 
            mean 0.831213 0.099879 
            std   NaN  NaN 
            min 0.831213 0.099879 
            25% 0.831213 0.099879 
            50% 0.831213 0.099879 
            75% 0.831213 0.099879 
            max 0.831213 0.099879 
00_testing_2-training training count 1.000000 1.000000 
            mean 0.839183 0.101592 
            std   NaN  NaN 
... 
...         
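
The NaN values in the std rows above follow from the group sizes: every (modelName, typeOfRun) pair occurs exactly once, and the sample standard deviation of a single value is undefined. A minimal sketch of that effect, assuming a three-row stand-in for the CSV with kappa values rounded from the question:

```python
import pandas as pd

# Stand-in for the training rows of model 00 (kappa rounded from the question).
df = pd.DataFrame({
    "kappa": [0.0999, 0.1016, 0.0979],
    "modelName": [
        "00_testing_1-training",
        "00_testing_2-training",
        "00_testing_3-training",
    ],
    "typeOfRun": ["training"] * 3,
})

grouped = df.groupby(["modelName", "typeOfRun"])

# Each full modelName is unique, so every group holds a single row ...
sizes = grouped.size()

# ... and the sample standard deviation (ddof=1) of one value is NaN.
per_name_std = grouped["kappa"].std()

# Pooling the three folds gives a defined standard deviation.
pooled_std = df["kappa"].std()

print(sizes)
print(per_name_std)
print(pooled_std)
```

The statistics only become meaningful once the per-fold counter is removed from the grouping key.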

But I think you can group by a Series, created with split and by selecting the first item of each list with str[0]:

print (df.modelName.str.split('_').str[0]) 
0 00 
1 00 
2 00 
3 00 
4 01 
5 01 
6 01 
7 01 
Name: modelName, dtype: object 

print (df.groupby([df.modelName.str.split('_').str[0]]).describe()) 
        AUC_R Accuracy Error;rate False;negative;rate \ 
modelName                
00  count 4.000000 4.000000 4.000000    4.000000 
      mean 0.768913 0.722449 0.277551    0.181730 
      std 0.004149 0.004924 0.004924    0.011270 
      min 0.765376 0.718242 0.270494    0.165198 
      25% 0.766368 0.719753 0.276280    0.179275 
      50% 0.767757 0.721024 0.278976    0.186147 
      75% 0.770302 0.723720 0.280247    0.188601 
      max 0.774762 0.729506 0.281758    0.189427 
01  count 4.000000 4.000000 4.000000    4.000000 
      mean 0.771151 0.733358 0.266642    0.188704 
      std 0.004452 0.003198 0.003198    0.011488 
      min 0.767635 0.730054 0.262587    0.171806 
      25% 0.768771 0.731325 0.264984    0.186674 
      50% 0.769662 0.732983 0.267017    0.192937 
      75% 0.772042 0.735016 0.268675    0.194968 
      max 0.777643 0.737413 0.269946    0.197137 
      ... 
      ... 
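
The output above still mixes training and validation rows (count is 4 per prefix). One possible way to restrict it to the training runs, as the question asks, is to filter on typeOfRun before grouping. This is a sketch, not a definitive solution: the small DataFrame stands in for the CSV, with only two metric columns and values rounded from the question. Note that recent pandas versions lay the describe statistics out as columns rather than as rows.

```python
import pandas as pd

# Stand-in for the CSV in the question (two metric columns, rounded values).
df = pd.DataFrame({
    "kappa": [0.0999, 0.1016, 0.0979, 0.0979, 0.1048, 0.1044, 0.1037, 0.1020],
    "f1_R": [0.8312, 0.8392, 0.8338, 0.8327, 0.8395, 0.8447, 0.8424, 0.8408],
    "modelName": [
        "00_testing_1-training", "00_testing_2-training",
        "00_testing_3-training", "00_testing-validation",
        "01_otherSet_1-training", "01_otherSet_2-training",
        "01_otherSet_3-training", "01_otherSet-validation",
    ],
    "typeOfRun": ["training"] * 3 + ["validation"]
                 + ["training"] * 3 + ["validation"],
})

# Keep only the training rows.
train = df[df["typeOfRun"] == "training"]

# Grouping key: the part of modelName before the first underscore.
prefix = train["modelName"].str.split("_").str[0].rename("model")

# Summary statistics per model prefix, training runs only.
described = train.groupby(prefix)[["kappa", "f1_R"]].describe()
print(described)
```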

But I do not want to group strictly by modelName, because it differs for each fold. Rather, I want to group only by the first part of the model name (the part that stays constant). –
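
Grouping by the constant part of the name can also be sketched with a regular expression that strips the fold counter. The pattern below assumes that training names always end in an underscore, a number, and "-training", as in the sample rows; the column name baseModel is a hypothetical label introduced here, and the kappa values are rounded from the question.

```python
import pandas as pd

# Stand-in for the CSV in the question (kappa rounded).
df = pd.DataFrame({
    "kappa": [0.0999, 0.1016, 0.0979, 0.0979, 0.1048, 0.1044, 0.1037, 0.1020],
    "modelName": [
        "00_testing_1-training", "00_testing_2-training",
        "00_testing_3-training", "00_testing-validation",
        "01_otherSet_1-training", "01_otherSet_2-training",
        "01_otherSet_3-training", "01_otherSet-validation",
    ],
    "typeOfRun": ["training"] * 3 + ["validation"]
                 + ["training"] * 3 + ["validation"],
})

train = df[df["typeOfRun"] == "training"]

# Drop the "_<counter>-training" suffix, leaving e.g. "00_testing".
# "baseModel" is a hypothetical name for the resulting key.
base = (
    train["modelName"]
    .str.replace(r"_\d+-training$", "", regex=True)
    .rename("baseModel")
)

# Mean, variance and fold count of kappa per base model.
summary = train.groupby(base)["kappa"].agg(["mean", "var", "count"])
print(summary)
```

Compared with splitting on the first underscore, this keeps the full base name, which may matter if two models share the same numeric prefix.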


Hmm, so you only need to group by 'modelName'? 'print (df.groupby(['modelName']).describe())'? Or can you explain more? – jezrael

I want to filter the data so that df contains only the "train" rows, and then group by model name. But I want to ignore the counter that differs between names such as '00_testing_1-training' and '00_testing_2-training'. –