Interpreting the output of SciPy's hierarchical clustering dendrogram? (maybe I found a bug...)

I am trying to figure out how the output of scipy.cluster.hierarchy.dendrogram works. I thought I knew how it worked, and I was able to use the output to reconstruct the dendrogram, but now it seems that either I no longer understand it or there is a bug in the Python 3 version of this module.

This answer, how do I get the subtrees of dendrogram made by scipy.cluster.hierarchy, implies that the dendrogram output dictionary gives dict_keys(['icoord', 'ivl', 'color_list', 'leaves', 'dcoord']) w/ all of the same size, so you can zip them and pass them to plt.plot to reconstruct the dendrogram.

That seemed simple enough, and I did get it to work back when I used Python 2.7.11, but once I upgraded to Python 3.5.1 my old scripts no longer gave me the same results.

I started reworking my clusters into a very simple, repeatable example, and I think I may have found a bug in the Python 3.5.1 build of SciPy version 0.17.1-np110py35_1. I am going to use the Scikit-learn datasets b/c most people have that module from the conda distribution.

Why don't these line up, and why am I unable to reconstruct the dendrogram in this way?

attr_1	[ 15.  15.  25.  25.]	[ 0.  0.10333704  0.10333704  0. ]	g
attr_4	[ 55.  55.  65.  65.]	[ 0.  0.26150727  0.26150727  0. ]	r
attr_5	[ 45.  45.  60.  60.]	[ 0.  0.4917828  0.4917828  0.26150727]	r
attr_2	[ 35.  35.  52.5  52.5]	[ 0.  0.59107459  0.59107459  0.4917828 ]	b
attr_8	[ 20.  20.  43.75  43.75]	[ 0.10333704  0.65064998  0.65064998  0.59107459]	b
attr_6	[ 85.  85.  95.  95.]	[ 0.  0.60957062  0.60957062  0. ]	b
attr_7	[ 75.  75.  90.  90.]	[ 0.  0.68142114  0.68142114  0.60957062]	b
attr_0	[ 31.875  31.875  82.5  82.5 ]	[ 0.65064998  0.72066112  0.72066112  0.68142114]	b
attr_3	[ 5.  5.  57.1875  57.1875]	[ 0.  0.80554653  0.80554653  0.72066112]	b

# Init 
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns; sns.set() 

# Load data 
from sklearn.datasets import load_diabetes 

# Clustering 
from scipy.cluster.hierarchy import dendrogram, fcluster, leaves_list 
from scipy.spatial import distance 
from fastcluster import linkage # You can use SciPy one too 

%matplotlib inline 

# Dataset 
A_data = load_diabetes().data 
DF_diabetes = pd.DataFrame(A_data, columns = ["attr_%d" % j for j in range(A_data.shape[1])]) 

# Absolute value of correlation matrix, then subtract from 1 for disimilarity 
DF_dism = 1 - np.abs(DF_diabetes.corr()) 

# Compute average linkage 
A_dist = distance.squareform(DF_dism.values)  # .as_matrix() was removed in pandas 1.0
Z = linkage(A_dist,method="average") 

# I modded the SO code from the above answer for the plot function 
def plot_tree(D_dendro, ax): 
    # Set up plotting data 
    leaves = D_dendro["ivl"] 
    icoord = np.array(D_dendro['icoord']) 
    dcoord = np.array(D_dendro['dcoord']) 
    color_list = D_dendro["color_list"] 

    # Plot colors 
    for leaf, xs, ys, color in zip(leaves, icoord, dcoord, color_list):
        print(leaf, xs, ys, color, sep="\t")
        plt.plot(xs, ys, color)

    # Set min/max of plots 
    xmin, xmax = icoord.min(), icoord.max() 
    ymin, ymax = dcoord.min(), dcoord.max() 

    plt.xlim(xmin-10, xmax + 0.1*abs(xmax)) 
    plt.ylim(ymin, ymax + 0.1*abs(ymax)) 

    # Set up ticks 
    ax.set_xticks(np.arange(5, len(leaves) * 10 + 5, 10)) 
    ax.set_xticklabels(leaves, fontsize=10, rotation=45) 

    plt.show() 

fig, ax = plt.subplots() 
D1 = dendrogram(Z=Z, labels=DF_dism.index, color_threshold=None, no_plot=True) 
plot_tree(D_dendro=D1, ax=ax)

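Here is one w/o the labels, with just the icoord values on the x-axis.

So check it out: the colors are not mapping correctly. The output says [ 15. 15. 25. 25.] for the icoord goes with attr_1, but based on the values it looks like it goes with attr_4. Also, the plot does not go all the way to the last leaf (attr_9), and that is b/c the length of icoord and dcoord is 1 less than the number of ivl labels.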

print([len(x) for x in [leaves, icoord, dcoord, color_list]]) 
#[10, 9, 9, 9]

Asked 2016-07-03 by O.rka

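icoord, dcoord and color_list describe the links, not the leaves. icoord and dcoord give the coordinates of the "arches" (i.e. upside-down U or J shapes) for each link in the plot, and color_list holds the colors of those arches. In a full plot, the length of icoord, etc., will be one less than the length of ivl, as you have observed.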
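To make the link/leaf distinction concrete, here is a minimal sketch on made-up random data (not the diabetes data from the question). It assumes the default orientation, where SciPy places leaf j of ivl at x = 5 + 10*j, and it assumes no merge happens at height 0, so a leg of an arch that reaches y == 0 ends at a leaf. The helper name leaves_touched is hypothetical, not part of SciPy.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 6 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Z = linkage(X, method="average")
D = dendrogram(Z, no_plot=True, labels=["obs_%d" % i for i in range(6)])

# n leaves produce n - 1 links, so the per-link lists are one shorter than ivl
n_links = len(D["icoord"])
n_leaves = len(D["ivl"])

def leaves_touched(xs, ys, ivl):
    """Return the labels of the leaves that one arch attaches to directly."""
    touched = []
    # The two legs of an arch are its first and last points
    for x, y in ((xs[0], ys[0]), (xs[3], ys[3])):
        if y == 0.0:  # assumption: only leaves sit at height 0
            touched.append(ivl[int((x - 5) // 10)])
    return touched

for xs, ys, color in zip(D["icoord"], D["dcoord"], D["color_list"]):
    # ys[1] is the height of the merge that this arch represents
    print(leaves_touched(xs, ys, D["ivl"]), ys[1], color, sep="\t")
```

Applied to the output in the question, the arch [ 15. 15. 25. 25.] has both legs at height 0, so it joins the leaves at positions 1 and 2 of ivl, whatever label zip happened to print next to it.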

Don't try to line up the ivl list with the icoord, dcoord and color_list lists. They are associated with different things.
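A hedged sketch of a reconstruction that keeps the two groups apart: the per-link lists are zipped together to draw the arches, and ivl is used only for the tick labels. The data are made up, the Agg backend line is there only so the snippet runs without a display, and the tick positions again assume the default orientation with leaves at x = 5, 15, 25, and so on.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: 8 observations, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
Z = linkage(X, method="average")
D = dendrogram(Z, no_plot=True, labels=["obs_%d" % i for i in range(8)])

def plot_tree(D_dendro, ax):
    """Redraw a dendrogram from the dictionary returned by dendrogram()."""
    # One arch per link: zip only the per-link lists
    for xs, ys, color in zip(D_dendro["icoord"], D_dendro["dcoord"],
                             D_dendro["color_list"]):
        ax.plot(xs, ys, color=color)

    # One tick per leaf: ivl is used here and nowhere else
    ivl = D_dendro["ivl"]
    ax.set_xticks(5 + 10 * np.arange(len(ivl)))
    ax.set_xticklabels(ivl, fontsize=10, rotation=45)

    ax.set_xlim(0, 10 * len(ivl))
    ax.set_ylim(0, 1.1 * np.max(D_dendro["dcoord"]))
    return ax

fig, ax = plt.subplots()
plot_tree(D, ax)
```

With n leaves this draws n - 1 lines and n tick labels, which is why the lists cannot be zipped against each other.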

Answered 2016-07-03 14:07:08

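Do you know of any tutorials that explain how to reconstruct a "dendrogram plot" from a "dendrogram's output dictionary" and a "custom color dictionary", or how to use 'link_color_func' with said color dictionary '{key=leaf : value=color}'? There are a few tutorials that go along the lines of the method above. I think a canonical answer would help the community.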
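One possible way to go from a {leaf: color} dictionary to link_color_func, offered as a sketch rather than a canonical recipe: link_color_func is called with the id of each non-singleton cluster (n + i for row i of Z), so the leaf colors have to be propagated up the tree first. The rule used here (a link inherits a color only when both of its children share it, otherwise it falls back to a default) is an assumption chosen for illustration, and the data, leaf_colors and default_color are all made up.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
labels = ["obs_%d" % i for i in range(6)]
Z = linkage(X, method="average")
n = Z.shape[0] + 1  # number of leaves

# Hypothetical {leaf label: color} dictionary
leaf_colors = {"obs_0": "r", "obs_1": "r", "obs_2": "g",
               "obs_3": "g", "obs_4": "b", "obs_5": "b"}
default_color = "k"

# Cluster id -> color. Ids 0..n-1 are the original observations;
# id n + i is the cluster formed by row i of Z.
cluster_colors = {i: leaf_colors[labels[i]] for i in range(n)}
for i, (left, right) in enumerate(Z[:, :2].astype(int)):
    c_left, c_right = cluster_colors[left], cluster_colors[right]
    # Assumed rule: keep the color only if both children agree
    cluster_colors[n + i] = c_left if c_left == c_right else default_color

D = dendrogram(Z, labels=labels, no_plot=True,
               link_color_func=lambda k: cluster_colors[k])
print(D["color_list"])
```

The resulting color_list is still one entry per link, so it can be zipped with icoord and dcoord for a manual redraw, but not with ivl.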
