Dictionary Encoding tutorial

This tutorial shows how to use simple Python dictionaries to reverse data preprocessing and display explicit labels.

Data from the Kaggle Titanic dataset

Content:
- Encode data with a dictionary
- Build a binary classifier (XGBoost)
- Use Shapash
- Show the inverse-transformed data
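Shapash can reverse a custom encoding when it is described as a list of dicts, one per encoded column, each holding a 'col' name, a 'mapping' (a pandas Series whose index contains the explicit labels and whose values contain the encoded codes) and a 'data_type'. As a preview, here is a minimal sketch using a hypothetical 'color' column (it is not part of the Titanic data; the real dicts are built below):

import numpy as np
import pandas as pd

# Hypothetical column, for illustration only
transfo_color = {'col': 'color',
                 'mapping': pd.Series(data=[1, 2, np.nan],
                                      index=['red', 'blue', 'missing']),  # label -> code
                 'data_type': 'object'}

# Shapash expects a list of such dicts, passed as `preprocessing`
encoder_preview = [transfo_color]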

[1]:
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

Load Titanic data

[2]:
from shapash.data.data_loader import data_loading

titan_df, titan_dict = data_loading('titanic')
del titan_df['Name']
[3]:
titan_df.head()
[3]:
            Survived       Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
1                  0  Third class    male  22.0      1      0   7.25  Southampton    Mr
2                  1  First class  female  38.0      1      0  71.28    Cherbourg   Mrs
3                  1  Third class  female  26.0      0      0   7.92  Southampton  Miss
4                  1  First class  female  35.0      1      0  53.10  Southampton   Mrs
5                  0  Third class    male  35.0      0      0   8.05  Southampton    Mr

Prepare data for the model

Create Target

[4]:
y = titan_df['Survived']
X = titan_df.drop('Survived', axis=1)

Build the dict transformation and its reversed dict

[5]:
# Construct a new variable grouping Southampton and Cherbourg
X['new_embarked'] = X.apply(lambda x: 1 if x.Embarked in ['Southampton', 'Cherbourg'] else 2 if x.Embarked == 'Queenstown' else 3, axis=1)
# Construct the reversed dict
transfo_embarked = {'col': 'new_embarked',
                    'mapping': pd.Series(data=[1, 2, np.nan], index=['Southampton-Cherbourg', 'Queenstown', 'missing']),
                    'data_type': 'object'}

# Construct a new variable binning ages into three groups
X['new_ages'] = X.apply(lambda x: 1 if x.Age <= 25 else 2 if x.Age <= 40 else 3, axis=1)
# Construct the reversed dict
transfo_age = {'col': 'new_ages',
               'mapping': pd.Series(data=[1, 2, 3, np.nan], index=['-25 years', '26-40 years', '+40 years', 'missing']),
               'data_type': 'object'}
[6]:
# Put the transformations into a list
encoder = [transfo_age, transfo_embarked]
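Before handing these to Shapash, you can sanity-check the reversed dicts by applying them manually; this small loop is only an illustration and is not part of the original notebook:

# Illustrative check: decode the new columns with the reversed mappings
# to preview the explicit labels Shapash will display.
for transfo in encoder:
    inverse = {int(code): label
               for label, code in transfo['mapping'].items()
               if pd.notna(code)}
    print(X[transfo['col']].map(inverse).head(3))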
[7]:
X.head(4)
[7]:
                  Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title  new_embarked  new_ages
PassengerId
1            Third class    male  22.0      1      0   7.25  Southampton    Mr             1         1
2            First class  female  38.0      1      0  71.28    Cherbourg   Mrs             1         2
3            Third class  female  26.0      0      0   7.92  Southampton  Miss             1         2
4            First class  female  35.0      1      0  53.10  Southampton   Mrs             1         2

Fit a model

[8]:
X = X[['new_embarked','new_ages','Fare','Parch','Age']]
[9]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.75, random_state=1)

clf = XGBClassifier(n_estimators=200, min_child_weight=2)
clf.fit(Xtrain, ytrain)
[9]:
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=2, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
              validate_parameters=False, verbosity=None)
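As an optional step that is not in the original notebook, a quick accuracy check on the held-out split can confirm the model is reasonable before explaining it:

from sklearn.metrics import accuracy_score

# Illustrative check: accuracy on the held-out test split
y_pred = clf.predict(Xtest)
print(f"Test accuracy: {accuracy_score(ytest, y_pred):.3f}")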

Using Shapash

[10]:
from shapash import SmartExplainer
[11]:
xpl = SmartExplainer(model=clf, preprocessing=encoder)
[12]:
xpl.compile(
    x=Xtest,
    y_target=ytest,  # Optional: allows displaying True Values vs Predicted Values
)
Backend: Shap TreeExplainer

Visualize data in pandas

[13]:
xpl.x_init.head(4)
[13]:
                      new_embarked     new_ages   Fare  Parch   Age
PassengerId
863          Southampton-Cherbourg    +40 years  25.93      0  48.0
224          Southampton-Cherbourg  26-40 years   7.90      0  29.5
85           Southampton-Cherbourg    -25 years  10.50      0  17.0
681                     Queenstown  26-40 years   8.14      0  29.5
[14]:
xpl.x_encoded.head(4)
[14]:
             new_embarked  new_ages   Fare  Parch   Age
PassengerId
863                     1         3  25.93      0  48.0
224                     1         2   7.90      0  29.5
85                      1         1  10.50      0  17.0
681                     2         2   8.14      0  29.5
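As a final illustrative check (not part of the original notebook), you can verify that x_init is simply x_encoded with the reversed dictionaries applied:

# Decode the encoded columns manually and compare with x_init (illustration only)
for transfo in encoder:
    inverse = {int(code): label
               for label, code in transfo['mapping'].items()
               if pd.notna(code)}
    decoded = xpl.x_encoded[transfo['col']].map(inverse)
    print(transfo['col'], (decoded == xpl.x_init[transfo['col']]).all())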