
I am working with Cloudera VM 5.2.0 and pandas 0.18.0, and I want to group by selected columns in pandas.

I have the following data:

import pandas as pd

adclicksDF = pd.read_csv('/home/cloudera/Eglence/ad-clicks.csv',
                         parse_dates=['timestamp'],
                         skipinitialspace=True).assign(adCount=1)

adclicksDF.head(n=5) 
Out[65]: 
            timestamp  txId  userSessionId  teamId  userId  adId   adCategory  adCount
0 2016-05-26 15:13:22  5974           5809      27     611     2  electronics        1
1 2016-05-26 15:17:24  5976           5705      18    1874    21       movies        1
2 2016-05-26 15:22:52  5978           5791      53    2139    25    computers        1
3 2016-05-26 15:22:57  5973           5756      63     212    10      fashion        1
4 2016-05-26 15:22:58  5980           5920       9    1027    20     clothing        1

I group on the timestamp field, and I would like to add more columns, userId and adCategory, to the grouped result agrupadoDF:

adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']] 

agrupadoDF = adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))['adCount'].agg(['count','sum']) 

agrupadoDF.head(n=5)  
Out[68]: 
                     count  sum
timestamp
2016-05-26 15:00:00     14   14
2016-05-26 16:00:00     24   24
2016-05-26 17:00:00     13   13
2016-05-26 18:00:00     16   16
2016-05-26 19:00:00     16   16

How can I perform the group by so that userId and adCategory are included in the result? What should I do?
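
To be clear about the shape I want: simply adding the extra columns as group keys would give one row per (hour, adCategory, userId) combination rather than one row per hour, roughly like this sketch (perCombo is just an illustrative name), which is not what I am after:

# sketch: grouping by the hour plus the extra columns splits each hourly
# bucket into one row per (hour, adCategory, userId) combination
perCombo = (adCategoryclicks
            .groupby([pd.Grouper(key='timestamp', freq='1H'), 'adCategory', 'userId'])['adCount']
            .agg(['count', 'sum']))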

Answer


You can use join like this: for each aggregated group it gives nicer output than listing the multiple userId and adCategory values separately. I changed the last dates in this sample so the data spans two hourly groups:

print (adclicksDF) 
            timestamp  txId  userSessionId  teamId  userId  adId   adCategory  adCount
0 2016-05-26 15:13:22  5974           5809      27     611     2  electronics        1
1 2016-05-26 15:17:24  5976           5705      18    1874    21       movies        1
2 2016-05-26 15:22:52  5978           5791      53    2139    25    computers        1
3 2016-05-26 16:22:57  5973           5756      63     212    10      fashion        1
4 2016-05-26 16:22:58  5980           5920       9    1027    20     clothing        1
# cast userId from int to str so the values can be concatenated with ', '.join
adclicksDF['userId'] = adclicksDF['userId'].astype(str)
adCategoryclicks = adclicksDF[['timestamp','adId','adCategory','userId','adCount']]

# wrap the chain in parentheses so the line continuation is valid
agrupadoDF = (adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))
                              .agg({'adCount': ['count','sum'],
                                    'userId': ', '.join,
                                    'adCategory': ', '.join}))

# flatten the aggregated columns to plain names (order matches the output below)
agrupadoDF.columns = ['adCategory','count','sum','userId']

print (agrupadoDF) 
                                         adCategory  count  sum           userId
timestamp
2016-05-26 15:00:00  electronics, movies, computers      3    3  611, 1874, 2139
2016-05-26 16:00:00               fashion, clothing      2    2        212, 1027
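
As a follow-up: the manual agrupadoDF.columns assignment above depends on the order in which agg happens to return the aggregated columns. A sketch of an order-independent way to flatten the resulting MultiIndex columns, assuming the same adCategoryclicks frame as above:

import pandas as pd

agrupadoDF = (adCategoryclicks.groupby(pd.Grouper(key='timestamp', freq='1H'))
                              .agg({'adCount': ['count', 'sum'],
                                    'userId': ', '.join,
                                    'adCategory': ', '.join}))

# agg with a dict of lists/functions returns MultiIndex columns such as
# ('adCount', 'count') or ('userId', 'join'); build flat names from each pair
agrupadoDF.columns = ['{}_{}'.format(col, func) for col, func in agrupadoDF.columns]

print(agrupadoDF)

That way a change in the aggregation order cannot silently mislabel a column.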