Count the number of unique ID overlaps between two strings

I have a dataset with two columns. The first column contains unique user IDs and the second column contains attributes associated with those IDs. For example:

------------------------ 
User ID  Attribute 
------------------------ 
1234  blond 
1235  brunette 
1236  blond 
1234  tall 
1235  tall 
1236  short 
------------------------

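What I want to know is how the attributes correlate with each other. In the example above, I want to know how many times a blond user is also tall. My desired output is: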

------------------------------ 
Attr 1  Attr 2  Overlap 
------------------------------ 
blond  tall   1 
blond  short  1 
brunette tall   1 
brunette short  0 
------------------------------
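To make the definition concrete, here is a minimal sketch in plain Python. It assumes "overlap" means the number of distinct user IDs that carry both attributes; the names `rows`, `ids_by_attribute` and `overlap` are illustrative only.

```python
from collections import defaultdict

# Sample rows from the table above: (user ID, attribute)
rows = [
    (1234, "blond"), (1235, "brunette"), (1236, "blond"),
    (1234, "tall"), (1235, "tall"), (1236, "short"),
]

# Map each attribute to the set of user IDs that carry it
ids_by_attribute = defaultdict(set)
for user_id, attribute in rows:
    ids_by_attribute[attribute].add(user_id)


def overlap(attr_1, attr_2):
    """Number of distinct user IDs that have both attributes."""
    return len(ids_by_attribute[attr_1] & ids_by_attribute[attr_2])


print(overlap("blond", "tall"))      # 1 (user 1234)
print(overlap("brunette", "short"))  # 0
```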

I tried using pandas to pivot the data and get this output, but my dataset has hundreds of attributes, so my current attempt is not feasible:

import pandas

df = pandas.read_csv('myfile.csv')

df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

My current output:

-------------------------------- 
Blond Brunette Short Tall 
-------------------------------- 
    0  1   0  1 
    1  0   0  1 
    1  0   1  0 
--------------------------------

Is there a way to get my desired output? Thanks in advance.

Source

2016-11-02 MARWEBIST

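I think your first step should be to put this into a better relational form. As it stands, the attributes make no logical distinction between hair color and height. – brianpck

Indeed! I tried to write an answer but could not make this distinction. –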
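One way to act on the comment above, as a rough sketch: tag every attribute with a category, reshape to one row per user, and cross-tabulate the two categories. The `category_of` lookup is hypothetical (the original data has no such column), and the sketch assumes each user has at most one value per category.

```python
import pandas as pd

df = pd.DataFrame(
    [[1234, "blond"], [1235, "brunette"], [1236, "blond"],
     [1234, "tall"], [1235, "tall"], [1236, "short"]],
    columns=["User ID", "Attribute"],
)

# Hypothetical lookup: the kind of attribute each value represents
category_of = {"blond": "hair", "brunette": "hair",
               "tall": "height", "short": "height"}
df["Category"] = df["Attribute"].map(category_of)

# One row per user, one column per category
wide = df.pivot(index="User ID", columns="Category", values="Attribute")

# Count users for every hair/height combination
overlap = pd.crosstab(wide["hair"], wide["height"])
print(overlap)
```

With the data in relational form, the hair/height table no longer needs a hand-written list of which attributes belong on which side.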

You could use itertools product to build every possible pair of attributes and then match the rows on each pair:

import pandas as pd
from itertools import product

# 1) creating pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) selecting the rows for each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) finding the ids that match both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) displaying the number of matches
            print(attribut1, attribut2, intersection)

which gives:

tall brunette 1 
tall blond 1 
brunette tall 1 
blond tall 1 
blond short 1 
short blond 1
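Every pair appears twice in that output because product yields both orderings. As a hedged variation (not the code above), `itertools.combinations` visits each unordered pair once; the sketch also precomputes the id set per attribute, and the names `ids_by_attribute` and `pairs` are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

# Build the set of ids for each attribute once, instead of filtering in the loop
ids_by_attribute = df.groupby("attribute")["id"].apply(set)

# combinations() yields each unordered pair exactly once
pairs = {}
for attribut1, attribut2 in combinations(sorted(ids_by_attribute.index), 2):
    intersection = len(ids_by_attribute[attribut1] & ids_by_attribute[attribut2])
    if intersection:
        pairs[(attribut1, attribut2)] = intersection

for (attribut1, attribut2), intersection in pairs.items():
    print(attribut1, attribut2, intersection)
```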

Edit: it is then easy to adapt this to get your desired output:

import pandas as pd
from itertools import product

# 1) creating pandas dataframe
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]

df = pd.DataFrame(df)
df.columns = ["id", "attribute"]

wanted_attribute_1 = ["blond", "brunette"]

# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        if attribut1 != attribut2:
            # 3) selecting the rows for each attribute
            df1 = df[df.attribute == attribut1]["id"]
            df2 = df[df.attribute == attribut2]["id"]
            # 4) finding the ids that match both attributes
            intersection = len(set(df1).intersection(set(df2)))
            # 5) displaying the number of matches
            print(attribut1, attribut2, intersection)

which gives:

brunette tall 1 
brunette short 0 
blond tall 1 
blond short 1

Source

2016-11-02 14:41:23

Thank you. This gives me the result I want. How can I export the result to a .csv file? – MARWEBIST

You need to create a [result] dataframe, which starts out empty, and append [attribut1, attribut2, intersection] to it inside the loop (for append, see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html). Pandas dataframes provide a [to_csv] method that lets you save them to a file. –
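A minimal sketch of that suggestion. `DataFrame.append` was removed in pandas 2.0, so this version collects the rows in a plain list and builds the dataframe once at the end. The column names and the file name `overlap.csv` are placeholders, not something given in the thread.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

wanted_attribute_1 = ["blond", "brunette"]
attributs = sorted(set(df.attribute))  # sorted for a stable row order

# Collect one [attribut1, attribut2, intersection] row per pair
rows = []
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        ids1 = set(df[df.attribute == attribut1]["id"])
        ids2 = set(df[df.attribute == attribut2]["id"])
        rows.append([attribut1, attribut2, len(ids1 & ids2)])

# Build the result dataframe in one go and write it to disk
result = pd.DataFrame(rows, columns=["Attr 1", "Attr 2", "Overlap"])
result.to_csv("overlap.csv", index=False)  # placeholder file name
```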

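From your pivot table, you can compute the transposed cross product of the table with itself and then convert the upper triangular part of the result to long format: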

import pandas as pd
import numpy as np

# df is the dataframe read from myfile.csv in the question
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)

tprod = mat.T.dot(mat)   # calculate the tcrossprod here
# extract the upper triangular part
result = tprod.where(np.triu(np.ones(tprod.shape, bool), 1), np.nan).stack().rename('value')
result.index.names = ['Attr1', 'Attr2']
result.reset_index().sort_values('value', ascending=False)
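For experimenting with the same idea on the sample data, here is a self-contained sketch. It is a variation under stated assumptions rather than the code above: it builds the 0/1 user-by-attribute matrix with `pd.crosstab` instead of `pivot_table`, clips counts to 1 in case a row is duplicated, and names the axes with `rename_axis`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1234, "blond"], [1235, "brunette"], [1236, "blond"],
     [1234, "tall"], [1235, "tall"], [1236, "short"]],
    columns=["User ID", "Attribute"],
)

# 0/1 indicator matrix: one row per user, one column per attribute
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# Entry (a, b) counts the users that have both attribute a and attribute b
tprod = mat.T.dot(mat).rename_axis(index="Attr1", columns="Attr2")

# Keep only the strict upper triangle so each pair appears once
upper = np.triu(np.ones(tprod.shape, dtype=bool), k=1)
result = (
    tprod.where(upper)   # cells outside the upper triangle become NaN
    .stack()
    .dropna()            # drop the masked cells
    .astype(int)
    .rename("value")
    .reset_index()
)
print(result.sort_values("value", ascending=False))
```

The matrix product scales to hundreds of attributes without an explicit Python loop over pairs, at the cost of also producing pairs within the same kind of attribute (for example blond/brunette).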

Source

2016-11-02 14:42:44 Psidom
