Category_encoders tutorial

This tutorial shows how to use category_encoders with Shapash to reverse data preprocessing and display explicit labels.

We use Kaggle's Titanic dataset.

This tutorial:

- Encode data with category_encoders
- Build a binary classifier (XGBoost)
- Use Shapash
- Show the inverse-transformed data

[1]:
import numpy as np
import pandas as pd
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

Load Titanic data

[2]:
from shapash.data.data_loader import data_loading

titan_df, titan_dict = data_loading('titanic')
del titan_df['Name']  # drop the 'Name' column, which is not used in this tutorial
[3]:
titan_df.head()
[3]:
             Survived       Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
1                   0  Third class    male  22.0      1      0   7.25  Southampton    Mr
2                   1  First class  female  38.0      1      0  71.28    Cherbourg   Mrs
3                   1  Third class  female  26.0      0      0   7.92  Southampton  Miss
4                   1  First class  female  35.0      1      0  53.10  Southampton   Mrs
5                   0  Third class    male  35.0      0      0   8.05  Southampton    Mr

Prepare data for the model with category_encoders

Create Target

[4]:
y = titan_df['Survived']
X = titan_df.drop('Survived', axis=1)

Train the category encoders

[5]:
# Chain three encoders from category_encoders:
# - one-hot encode 'Pclass'
# - ordinal encode 'Embarked' and 'Title'
# - target encode 'Sex' using the label y
onehot = OneHotEncoder(cols=['Pclass']).fit(X)
result_1 = onehot.transform(X)
ordinal = OrdinalEncoder(cols=['Embarked', 'Title']).fit(result_1)
result_2 = ordinal.transform(result_1)
target = TargetEncoder(cols=['Sex']).fit(result_2, y)
result_3 = target.transform(result_2)
[6]:
# Ordered list of the fitted encoders, as applied above;
# Shapash uses it to reverse the whole preprocessing chain
encoder = [onehot, ordinal, target]
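As a quick aside (not part of the original notebook), the ordinal step can be reversed directly: category_encoders' OrdinalEncoder and OneHotEncoder both expose an inverse_transform method, which is the same idea Shapash relies on to recover explicit labels. A minimal sanity check, assuming the objects defined above:

# Illustrative sanity check: recover the string labels of the
# ordinal-encoded columns from their integer codes.
# (TargetEncoder is not reversed here; Shapash handles it internally.)
decoded = ordinal.inverse_transform(result_2)
decoded[['Embarked', 'Title']].head()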

Fit a model

[7]:
Xtrain, Xtest, ytrain, ytest = train_test_split(result_3, y, train_size=0.75, random_state=1)

# Fit an XGBoost classifier on the encoded training data
clf = XGBClassifier(n_estimators=200, min_child_weight=2)
clf.fit(Xtrain, ytrain)
[7]:
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=2, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
              validate_parameters=False, verbosity=None)
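
Optionally (not in the original notebook), a quick hold-out score gives a sense of model quality; XGBClassifier follows the scikit-learn API, so score returns mean accuracy:

# Illustrative check: mean accuracy on the held-out test set
clf.score(Xtest, ytest)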

Using Shapash

[8]:
from shapash import SmartExplainer
[9]:
xpl = SmartExplainer(
    model=clf,
    preprocessing=encoder,  # the ordered list of fitted encoders, used to inverse-transform the data
)
[10]:
xpl.compile(
    x=Xtest,
    y_target=ytest,  # optional: allows displaying True Values vs Predicted Values
)
Backend: Shap TreeExplainer
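
At this point the explainer is compiled; as an optional aside (not in the original notebook), Shapash's built-in plots can be produced directly from it, for example:

# Illustrative: plot global feature importance from the compiled explainer
xpl.plot.features_importance()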

Visualize data in pandas

x_init contains the inverse-transformed data with explicit labels, while x_encoded contains the encoded data exactly as the model sees it.

[11]:
xpl.x_init
[11]:
                  Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
863          First class  female  48.0      0      0  25.93  Southampton   Mrs
224          Third class    male  29.5      0      0   7.90  Southampton    Mr
85          Second class  female  17.0      0      0  10.50  Southampton  Miss
681          Third class  female  29.5      0      0   8.14   Queenstown  Miss
536         Second class  female   7.0      0      2  26.25  Southampton  Miss
624          Third class    male  21.0      0      0   7.85  Southampton    Mr
149         Second class    male  36.5      0      2  26.00  Southampton    Mr
4            First class  female  35.0      1      0  53.10  Southampton   Mrs
35           First class    male  28.0      1      0  82.17    Cherbourg    Mr
242          Third class  female  29.5      1      0  15.50   Queenstown  Miss
[12]:
xpl.x_encoded
[12]:
             Pclass_1  Pclass_2  Pclass_3       Sex   Age  SibSp  Parch   Fare  Embarked  Title
PassengerId
863                 0         1         0  0.742038  48.0      0      0  25.93         1      2
224                 1         0         0  0.188908  29.5      0      0   7.90         1      1
85                  0         0         1  0.742038  17.0      0      0  10.50         1      3
681                 1         0         0  0.742038  29.5      0      0   8.14         3      3
536                 0         0         1  0.742038   7.0      0      2  26.25         1      3
624                 1         0         0  0.188908  21.0      0      0   7.85         1      1
149                 0         0         1  0.188908  36.5      0      2  26.00         1      1
4                   0         1         0  0.742038  35.0      1      0  53.10         1      2
35                  0         1         0  0.188908  28.0      1      0  82.17         2      1
242                 1         0         0  0.742038  29.5      1      0  15.50         3      3
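
Finally, as an optional aside (not in the original notebook), local explanations can be exported to a DataFrame with the same explicit labels; to_pandas is part of the SmartExplainer API, and max_contrib caps the number of features reported per row:

# Illustrative: local contributions with inverse-transformed feature values
summary_df = xpl.to_pandas(max_contrib=3)
summary_df.head()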