Dictionary Encoding tutorial

This tutorial shows how to use simple Python dictionaries to reverse data preprocessing and display explicit labels.

Data from the Kaggle Titanic dataset

Content:
- Encode data with a dictionary
- Build a binary classifier (XGBoost)
- Use Shapash
- Show the inverse-transformed data
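Shapash can reverse a custom encoding when it is described as a list of dicts, one per encoded column, each holding a 'col' name, a 'mapping' (a pandas Series whose index contains the explicit labels and whose values contain the encoded codes) and a 'data_type'. As a preview, here is a minimal sketch using a hypothetical 'color' column (it is not part of the Titanic data; the real dicts are built below):

import numpy as np
import pandas as pd

# Hypothetical column, for illustration only
transfo_color = {'col': 'color',
                 'mapping': pd.Series(data=[1, 2, np.nan],
                                      index=['red', 'blue', 'missing']),  # label -> code
                 'data_type': 'object'}

# Shapash expects a list of such dicts, passed as `preprocessing`
encoder_preview = [transfo_color]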

[1]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

Load Titanic data

[2]:
from shapash.data.data_loader import data_loading

titan_df, titan_dict = data_loading('titanic')
del titan_df['Name']
[3]:
titan_df.head()
[3]:
            Survived       Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
1                  0  Third class    male  22.0      1      0   7.25  Southampton    Mr
2                  1  First class  female  38.0      1      0  71.28    Cherbourg   Mrs
3                  1  Third class  female  26.0      0      0   7.92  Southampton  Miss
4                  1  First class  female  35.0      1      0  53.10  Southampton   Mrs
5                  0  Third class    male  35.0      0      0   8.05  Southampton    Mr

Prepare data for the model

Create Target

[4]:
y = titan_df['Survived']
X = titan_df.drop('Survived', axis=1)

Build the dict transformation and its reversed dict

[5]:
# Construct a new variable grouping Southampton and Cherbourg
X['new_embarked'] = X.apply(lambda x: 1 if x.Embarked in ['Southampton', 'Cherbourg'] else 2 if x.Embarked == 'Queenstown' else 3, axis=1)
# Construct the reversed dict
transfo_embarked = {'col': 'new_embarked',
                    'mapping': pd.Series(data=[1, 2, np.nan], index=['Southampton-Cherbourg', 'Queenstown', 'missing']),
                    'data_type': 'object'}

# Construct a new variable binning ages into three groups
X['new_ages'] = X.apply(lambda x: 1 if x.Age <= 25 else 2 if x.Age <= 40 else 3, axis=1)
# Construct the reversed dict
transfo_age = {'col': 'new_ages',
               'mapping': pd.Series(data=[1, 2, 3, np.nan], index=['-25 years', '26-40 years', '+40 years', 'missing']),
               'data_type': 'object'}
[6]:
# Put the transformations into a list
encoder = [transfo_age, transfo_embarked]
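Before handing these to Shapash, you can sanity-check the reversed dicts by applying them manually; this small loop is only an illustration and is not part of the original notebook:

# Illustrative check: decode the new columns with the reversed mappings
# to preview the explicit labels Shapash will display.
for transfo in encoder:
    inverse = {int(code): label
               for label, code in transfo['mapping'].items()
               if pd.notna(code)}
    print(X[transfo['col']].map(inverse).head(3))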
[7]:
X.head(4)
[7]:
                  Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title  new_embarked  new_ages
PassengerId
1            Third class    male  22.0      1      0   7.25  Southampton    Mr             1         1
2            First class  female  38.0      1      0  71.28    Cherbourg   Mrs             1         2
3            Third class  female  26.0      0      0   7.92  Southampton  Miss             1         2
4            First class  female  35.0      1      0  53.10  Southampton   Mrs             1         2

Fit a model

[8]:
X = X[['new_embarked','new_ages','Fare','Parch','Age']]
[9]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.75, random_state=1)

clf = XGBClassifier(n_estimators=200, min_child_weight=2)
clf.fit(Xtrain, ytrain)
[9]:
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=2, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
              validate_parameters=False, verbosity=None)
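As an optional step that is not in the original notebook, a quick accuracy check on the held-out split can confirm the model is reasonable before explaining it:

from sklearn.metrics import accuracy_score

# Illustrative check: accuracy on the held-out test split
y_pred = clf.predict(Xtest)
print(f"Test accuracy: {accuracy_score(ytest, y_pred):.3f}")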

Using Shapash

[10]:
from shapash import SmartExplainer
[11]:
xpl = SmartExplainer(model=clf, preprocessing=encoder)
[12]:
xpl.compile(
    x=Xtest,
    y_target=ytest,  # Optional: allows displaying True Values vs Predicted Values
)
Backend: Shap TreeExplainer

Visualize data in pandas

[13]:
xpl.x_init.head(4)
[13]:
                      new_embarked     new_ages   Fare  Parch   Age
PassengerId
863          Southampton-Cherbourg    +40 years  25.93      0  48.0
224          Southampton-Cherbourg  26-40 years   7.90      0  29.5
85           Southampton-Cherbourg    -25 years  10.50      0  17.0
681                     Queenstown  26-40 years   8.14      0  29.5
[14]:
xpl.x_encoded.head(4)
[14]:
             new_embarked  new_ages   Fare  Parch   Age
PassengerId
863                     1         3  25.93      0  48.0
224                     1         2   7.90      0  29.5
85                      1         1  10.50      0  17.0
681                     2         2   8.14      0  29.5
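As a final illustrative check (not part of the original notebook), you can verify that x_init is simply x_encoded with the reversed dictionaries applied:

# Decode the encoded columns manually and compare with x_init (illustration only)
for transfo in encoder:
    inverse = {int(code): label
               for label, code in transfo['mapping'].items()
               if pd.notna(code)}
    decoded = xpl.x_encoded[transfo['col']].map(inverse)
    print(transfo['col'], (decoded == xpl.x_init[transfo['col']]).all())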