Features importance

The features_importance method displays a bar chart representing the sum of the absolute contribution values of each feature.

This method can also compute this sum on a subset of the data and compare it with the sum computed on the total population.
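
Conceptually, the plotted value can be sketched in one line of pandas (a minimal illustration, assuming contributions is a DataFrame with one row per sample and one column per feature; this is not the actual Shapash implementation):

# Sum of absolute contributions per feature, sorted for the bar chart
importance = contributions.abs().sum().sort_values(ascending=True)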

This short tutorial presents the different parameters you can use.

Content:
- Classification case: specify the target modality to display
- selection parameter: display a subset
- max_features parameter: limit the number of features displayed

We used Kaggle’s Titanic dataset.

[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

Building a Supervised Model

Load Titanic data

[2]:
from shapash.data.data_loader import data_loading
titanic_df, titanic_dict = data_loading('titanic')
del titanic_df['Name']
y_df = titanic_df['Survived'].to_frame()
X_df = titanic_df[titanic_df.columns.difference(['Survived'])]
[3]:
titanic_df.head()
[3]:
             Survived       Pclass     Sex   Age  SibSp  Parch   Fare     Embarked  Title
PassengerId
1                   0  Third class    male  22.0      1      0   7.25  Southampton     Mr
2                   1  First class  female  38.0      1      0  71.28    Cherbourg    Mrs
3                   1  Third class  female  26.0      0      0   7.92  Southampton   Miss
4                   1  First class  female  35.0      1      0  53.10  Southampton    Mrs
5                   0  Third class    male  35.0      0      0   8.05  Southampton     Mr

Encode categorical features

[4]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df = encoder.transform(X_df)

Train / Test Split + model fitting

[5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=7)
[6]:
clf = ExtraTreesClassifier(n_estimators=200).fit(Xtrain, ytrain)

First step: declare and compile the SmartExplainer

[7]:
from shapash import SmartExplainer
[8]:
response_dict = {0: 'Death', 1: 'Survival'}
[9]:
xpl = SmartExplainer(
    model=clf,
    preprocessing=encoder,      # Optional: compile step can use inverse_transform method
    features_dict=titanic_dict, # Optional parameters
    label_dict=response_dict    # Optional parameters, dicts specify labels
)
[10]:
xpl.compile(x=Xtest)
Backend: Shap TreeExplainer

Display Feature Importance

[11]:
xpl.plot.features_importance()
../../_images/tutorials_plots_and_charts_tuto-plot03-features-importance_17_0.png

Multiclass: Select the target modality

For multiclass problems, features_importance sums and displays the absolute contributions for one target modality. You can change this modality with the label parameter:

xpl.plot.features_importance(label='Death')

With the label parameter you can specify the target value, its label, or its number.
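
For example (an illustrative sketch: with this binary model, the 'Death' label corresponds to the target value 0), the following calls select the same modality:

xpl.plot.features_importance(label=0)        # target value (also its number here)
xpl.plot.features_importance(label='Death')  # label defined in label_dict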

Focus and compare a subset

The selection parameter specifies the subset:

[12]:
sel = [581, 610, 524, 636, 298, 420, 568, 817, 363, 557,
       486, 252, 390, 505, 16, 290, 611, 148, 438, 23, 810,
       875, 206, 836, 143, 843, 436, 701, 681, 67, 10]
[13]:
xpl.plot.features_importance(selection=sel)
../../_images/tutorials_plots_and_charts_tuto-plot03-features-importance_21_0.png

Tune the number of features to display

Use the max_features parameter (default value: 20).

[15]:
xpl.plot.features_importance(max_features=3)
../../_images/tutorials_plots_and_charts_tuto-plot03-features-importance_23_0.png