How are feature importances related to the forest structure in scikit-learn's RandomForestClassifier?

Here is a simple problem using the Iris dataset. I am puzzled when trying to understand how feature importances are computed and how this shows up when visualizing the forest of estimators with export_graphviz.

Here is my code:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

data = load_iris()
X = pd.DataFrame(data=data.data, columns=['sepallength', 'sepalwidth', 'petallength', 'petalwidth'])
y = pd.DataFrame(data=data.target)

# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=2, max_depth=1)
rf.fit(X_train, y_train.iloc[:, 0])

The classifier performs poorly (the score is 0.68) because the forest contains only 2 trees of depth 1, but that is not the issue here.
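The code above does not show how the score was obtained. A minimal sketch, assuming it was the mean test accuracy returned by rf.score, could look like this. The random_state=0 on the forest is an assumption added for repeatability and is not in the original, so the printed value need not match 0.68.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)

# random_state=0 is an assumption (not in the original) so reruns agree
rf = RandomForestClassifier(n_estimators=2, max_depth=1, random_state=0)
rf.fit(X_train, y_train)

# score() returns the mean accuracy on the given data
score = rf.score(X_test, y_test)
print("test accuracy: %.2f" % score)
```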

The feature importances are retrieved as follows:

importances = rf.feature_importances_ 
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1] 

print("Feature ranking:") 
for f in range(X.shape[1]): 
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))

The output is:

Feature ranking: 
1. feature sepallength (1.000000) 
2. feature sepalwidth (0.000000) 
3. feature petallength (0.000000) 
4. feature petalwidth (0.000000)

Now, when I show the structure of the trees that were built, using the following code:

from sklearn.tree import export_graphviz
export_graphviz(rf.estimators_[0],
       out_file='tree.dot',  # recent scikit-learn returns a string unless out_file is set
       feature_names=list(X.columns),
       filled=True,
       rounded=True)
!dot -Tpng tree.dot -o tree0.png
from IPython.display import Image
Image('tree0.png')

I obtain these two figures.

Export of tree #0:

Export of tree #1:
I can't understand how sepallength can have importance = 1 and yet not be used for node splitting in either tree (only petallength is used), as shown in the figures.

Source

2016-09-30 user6903745

You have a bug in

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[indices[f]]))

If you are permuting with indices = np.argsort(importances)[::-1], then you need to permute everything, not keep the labels in one order and the importances in another.

If you replace the above with

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, X.columns.tolist()[f], importances[f]))

then the forest and its trees all agree that the feature at index 2 (petallength) is the only one with any importance.
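If a sorted ranking is still wanted, the other way to fix this is to apply the same permutation to both the labels and the values. A minimal sketch follows. It fits on the full Iris data rather than the train split to stay self-contained, and the random_state=0 is an assumption that is not in the original.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

names = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
data = load_iris()

# random_state=0 is an assumption (not in the original) so reruns agree
rf = RandomForestClassifier(n_estimators=2, max_depth=1, random_state=0)
rf.fit(data.data, data.target)

importances = rf.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first

print("Feature ranking:")
for rank, idx in enumerate(indices, start=1):
    # the same index idx selects both the label and the value
    print("%d. feature %s (%f)" % (rank, names[idx], importances[idx]))
```

Here each printed name is looked up with the same index as its importance, so the label and the value cannot drift apart.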

Source

2016-09-30 10:08:36
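On the title question itself, the forest's importances can be cross-checked against its trees without Graphviz. The sketch below rests on two assumptions. First, that each fitted tree exposes its split features through tree_.feature, with negative values marking leaves. Second, that the forest's feature_importances_ is the average of the per-tree values when every tree has at least one split, as is the case in the scikit-learn versions I am aware of. The random_state=0 is also an assumption and is not in the original.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

names = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth']
data = load_iris()

# random_state=0 is an assumption (not in the original) so reruns agree
rf = RandomForestClassifier(n_estimators=2, max_depth=1, random_state=0)
rf.fit(data.data, data.target)

for i, tree in enumerate(rf.estimators_):
    # tree_.feature holds one feature index per node; leaves use a negative value
    used = sorted({names[j] for j in tree.tree_.feature if j >= 0})
    print("tree %d splits on %s" % (i, used))
    print("tree %d importances: %s" % (i, tree.feature_importances_))

# Average the per-tree importances and compare with the forest's value
per_tree_mean = np.mean([t.feature_importances_ for t in rf.estimators_], axis=0)
print("mean over trees:   ", per_tree_mean)
print("forest importances:", rf.feature_importances_)
```

A feature that no tree splits on gets zero importance in every tree and therefore in the forest, which is why a nonzero value for a feature that appears in none of the plots points to a labelling error rather than to the model.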
