Category_encoders tutorial

This tutorial shows how to use category_encoders with Shapash to reverse data preprocessing and display explicit labels.

We use Kaggle's Titanic dataset.

This tutorial:

- Encode data with category_encoders
- Build a binary classifier (XGBoost)
- Use Shapash
- Show the inverse-transformed data

[1]:
import numpy as np
import pandas as pd
from category_encoders import OneHotEncoder, OrdinalEncoder, TargetEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

Load Titanic data

[2]:
from shapash.data.data_loader import data_loading

titan_df, titan_dict = data_loading('titanic')
del titan_df['Name']  # drop the 'Name' column, which is not used in this tutorial
[3]:
titan_df.head()
[3]:
             Survived       Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
1                   0  Third class    male  22.0      1      0   7.25  Southampton    Mr
2                   1  First class  female  38.0      1      0  71.28    Cherbourg   Mrs
3                   1  Third class  female  26.0      0      0   7.92  Southampton  Miss
4                   1  First class  female  35.0      1      0  53.10  Southampton   Mrs
5                   0  Third class    male  35.0      0      0   8.05  Southampton    Mr

Prepare data for the model with category_encoders

Create Target

[4]:
y = titan_df['Survived']
X = titan_df.drop('Survived', axis=1)

Train the category encoders

[5]:
# Chain three encoders from category_encoders:
# - one-hot encode 'Pclass'
# - ordinal encode 'Embarked' and 'Title'
# - target encode 'Sex' using the label y
onehot = OneHotEncoder(cols=['Pclass']).fit(X)
result_1 = onehot.transform(X)
ordinal = OrdinalEncoder(cols=['Embarked', 'Title']).fit(result_1)
result_2 = ordinal.transform(result_1)
target = TargetEncoder(cols=['Sex']).fit(result_2, y)
result_3 = target.transform(result_2)
[6]:
# Ordered list of the fitted encoders, as applied above;
# Shapash uses it to reverse the whole preprocessing chain
encoder = [onehot, ordinal, target]
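As a quick aside (not part of the original notebook), the ordinal step can be reversed directly: category_encoders' OrdinalEncoder and OneHotEncoder both expose an inverse_transform method, which is the same idea Shapash relies on to recover explicit labels. A minimal sanity check, assuming the objects defined above:

# Illustrative sanity check: recover the string labels of the
# ordinal-encoded columns from their integer codes.
# (TargetEncoder is not reversed here; Shapash handles it internally.)
decoded = ordinal.inverse_transform(result_2)
decoded[['Embarked', 'Title']].head()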

Fit a model

[7]:
Xtrain, Xtest, ytrain, ytest = train_test_split(result_3, y, train_size=0.75, random_state=1)

# Fit an XGBoost classifier on the encoded training data
clf = XGBClassifier(n_estimators=200, min_child_weight=2)
clf.fit(Xtrain, ytrain)
[7]:
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=2, missing=nan, monotone_constraints=None,
              n_estimators=200, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
              validate_parameters=False, verbosity=None)
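
Optionally (not in the original notebook), a quick hold-out score gives a sense of model quality; XGBClassifier follows the scikit-learn API, so score returns mean accuracy:

# Illustrative check: mean accuracy on the held-out test set
clf.score(Xtest, ytest)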

Using Shapash

[8]:
from shapash import SmartExplainer
[9]:
xpl = SmartExplainer(
    model=clf,
    preprocessing=encoder,  # the ordered list of fitted encoders, used to inverse-transform the data
)
[10]:
xpl.compile(
    x=Xtest,
    y_target=ytest,  # optional: allows displaying True Values vs Predicted Values
)
Backend: Shap TreeExplainer
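
At this point the explainer is compiled; as an optional aside (not in the original notebook), Shapash's built-in plots can be produced directly from it, for example:

# Illustrative: plot global feature importance from the compiled explainer
xpl.plot.features_importance()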

Visualize data in pandas

x_init contains the inverse-transformed data with explicit labels, while x_encoded contains the encoded data exactly as the model sees it.

[11]:
xpl.x_init
[11]:
                  Pclass     Sex   Age  SibSp  Parch   Fare     Embarked Title
PassengerId
863          First class  female  48.0      0      0  25.93  Southampton   Mrs
224          Third class    male  29.5      0      0   7.90  Southampton    Mr
85          Second class  female  17.0      0      0  10.50  Southampton  Miss
681          Third class  female  29.5      0      0   8.14   Queenstown  Miss
536         Second class  female   7.0      0      2  26.25  Southampton  Miss
624          Third class    male  21.0      0      0   7.85  Southampton    Mr
149         Second class    male  36.5      0      2  26.00  Southampton    Mr
4            First class  female  35.0      1      0  53.10  Southampton   Mrs
35           First class    male  28.0      1      0  82.17    Cherbourg    Mr
242          Third class  female  29.5      1      0  15.50   Queenstown  Miss
[12]:
xpl.x_encoded
[12]:
             Pclass_1  Pclass_2  Pclass_3       Sex   Age  SibSp  Parch   Fare  Embarked  Title
PassengerId
863                 0         1         0  0.742038  48.0      0      0  25.93         1      2
224                 1         0         0  0.188908  29.5      0      0   7.90         1      1
85                  0         0         1  0.742038  17.0      0      0  10.50         1      3
681                 1         0         0  0.742038  29.5      0      0   8.14         3      3
536                 0         0         1  0.742038   7.0      0      2  26.25         1      3
624                 1         0         0  0.188908  21.0      0      0   7.85         1      1
149                 0         0         1  0.188908  36.5      0      2  26.00         1      1
4                   0         1         0  0.742038  35.0      1      0  53.10         1      2
35                  0         1         0  0.188908  28.0      1      0  82.17         2      1
242                 1         0         0  0.742038  29.5      1      0  15.50         3      3
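
Finally, as an optional aside (not in the original notebook), local explanations can be exported to a DataFrame with the same explicit labels; to_pandas is part of the SmartExplainer API, and max_contrib caps the number of features reported per row:

# Illustrative: local contributions with inverse-transformed feature values
summary_df = xpl.to_pandas(max_contrib=3)
summary_df.head()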