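pandas: fuzzy group summary statistics
I have a data frame read from a CSV file and want to compute basic summary statistics (mean, variance, ...) over the training part of every model.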
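Inserting a model number and grouping by it works, but it is not a good solution. Because of the counter in the name, grouping by modelName does not work, so how can I get summary statistics per model (training runs only)?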
df.groupby(['modelName', 'typeOfRun'])['kappa'].mean()
or
df[df.typeOfRun != 'validation'].describe()
does not give the result I want.
AUC_R,Accuracy,Error rate,False negative rate,False positive rate,Lift value,Precision J,Precision N,Rate of negative predictions,Rate of positive predictions,Sensitivity (true positives rate),Specificity (true negatives rate),f1_R,kappa,modelName,typeOfRun
0.7747622323007851,0.7182416731216111,0.28175832687838887,0.16519823788546256,0.28527729751296715,2.769918376242967,0.08117369886485329,0.9930703132218424,0.029305447973147433,0.3013813581203202,0.8348017621145375,0.7147227024870328,0.8312130234716368,0.09987857210248623,00_testing_1-training,training
0.7688154033277225,0.7295055512522592,0.27049444874774076,0.1894273127753304,0.27294188056922464,2.807689674786938,0.08228060368921185,0.9921956531603068,0.029305447973147433,0.28869739220242707,0.8105726872246696,0.7270581194307754,0.8391825769931881,0.10159217699431862,00_testing_2-training,training
0.7653761718477654,0.7217918925897238,0.2782081074102763,0.1883259911894273,0.2809216651150419,2.737743031677203,0.08023078597866318,0.9921552436003304,0.029305447973147433,0.29647560030983733,0.8116740088105727,0.7190783348849581,0.8338281219878937,0.09791120175612114,00_testing_3-training,training
0.7666987721022418,0.7202566535628756,0.2797433464371244,0.18396711202466598,0.2826353437708505,2.7358921138891255,0.08018987022168358,0.9923159476282464,0.02931031885891585,0.2982693958700465,0.816032887975334,0.7173646562291496,0.8327314318650539,0.097878484924986,00_testing-validation,validation
0.7776426005660843,0.7300542215336948,0.2699457784663052,0.17180616740088106,0.2729086314669504,2.8639238514789174,0.08392857142857142,0.9929168180167091,0.029305447973147433,0.28918151303898787,0.8281938325991189,0.7270913685330496,0.8394625719769673,0.10476961017159536,01_otherSet_1-training,training
0.7691501646636157,0.737412858249419,0.26258714175058095,0.197136563876652,0.2645631067961165,2.8639098209585327,0.08392816025788626,0.9919723742039644,0.029305447973147433,0.2803382390911438,0.802863436123348,0.7354368932038835,0.8446557452170924,0.1044486077353842,01_otherSet_2-training,training
0.770174515310113,0.7342176607281178,0.2657823392718823,0.19162995594713655,0.26802101343263735,2.847815513920855,0.08345650938032974,0.9921582766235522,0.029305447973147433,0.283856183836819,0.8083700440528634,0.7319789865673627,0.8424375777288816,0.10367514449353035,01_otherSet_3-training,training
0.7676347850606817,0.7317488289428102,0.26825117105718976,0.19424460431654678,0.2704858255620898,2.8156062097690264,0.08252631578947368,0.9920241385858671,0.02931031885891585,0.2861747473378218,0.8057553956834532,0.7295141744379102,0.8407546494992847,0.10196584743637081,01_otherSet-validation,validation
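One possible approach, sketched here under some assumptions: keep only the training rows, strip the fold counter from modelName, and group by what is left. The regular expression assumes every training name ends in "_<digits>-training", as in the data above. The names csv_text, train, base_name and summary are made up for this illustration, and only the kappa column is copied from the data to keep the example short.

```python
import io

import pandas as pd

# Reduced copy of the data above: only kappa, modelName and typeOfRun.
csv_text = """kappa,modelName,typeOfRun
0.09987857210248623,00_testing_1-training,training
0.10159217699431862,00_testing_2-training,training
0.09791120175612114,00_testing_3-training,training
0.097878484924986,00_testing-validation,validation
0.10476961017159536,01_otherSet_1-training,training
0.1044486077353842,01_otherSet_2-training,training
0.10367514449353035,01_otherSet_3-training,training
0.10196584743637081,01_otherSet-validation,validation
"""
df = pd.read_csv(io.StringIO(csv_text))

# Keep only the training rows.
train = df[df["typeOfRun"] == "training"]

# Drop the "_<counter>-training" suffix so every fold of a model shares one key.
# Assumption: the counter is always "_<digits>" directly before "-training".
base_name = train["modelName"].str.replace(r"_\d+-training$", "", regex=True)

# Group by the derived key instead of the raw modelName column.
summary = train.groupby(base_name)["kappa"].agg(["mean", "var", "count"])
print(summary)
```

Passing a Series to groupby avoids adding a helper column to the frame. Note that "var" here is the sample variance (pandas uses ddof=1 by default). Replacing the agg call with .describe() should give the full set of summary statistics per group.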
However, I do not want to group strictly by modelName, because it is different for each fold. Instead, I would like to group only by the first part of the model name, which is constant. –
Hmm, so you only need to group by modelName: print(df.groupby(['modelName']).describe())? Or can you explain more? – jezrael
I want to filter the data so that only the training rows remain in df, and then group by model name. But the names differ, e.g. '00_testing_1-training' and '00_testing_2-training', and I want to ignore that counter. –