Category_encoder tutorial¶
This tutorial shows how to use category_encoders encoders with Shapash to reverse data preprocessing and display explicit labels.
We use Kaggle's Titanic dataset.

This tutorial covers:
- Encoding data with category_encoders
- Building a binary classifier (XGBoost)
- Using Shapash
- Displaying the inverse-transformed data
[1]:
import numpy as np
import pandas as pd
from category_encoders import OrdinalEncoder
from category_encoders import OneHotEncoder
from category_encoders import TargetEncoder
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
Load Titanic data¶
[2]:
from shapash.data.data_loader import data_loading
titan_df, titan_dict = data_loading('titanic')
del titan_df['Name']
[3]:
titan_df.head()
[3]:
| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | Third class | male | 22.0 | 1 | 0 | 7.25 | Southampton | Mr |
| 2 | 1 | First class | female | 38.0 | 1 | 0 | 71.28 | Cherbourg | Mrs |
| 3 | 1 | Third class | female | 26.0 | 0 | 0 | 7.92 | Southampton | Miss |
| 4 | 1 | First class | female | 35.0 | 1 | 0 | 53.10 | Southampton | Mrs |
| 5 | 0 | Third class | male | 35.0 | 0 | 0 | 8.05 | Southampton | Mr |
Prepare data for the model with Category Encoder¶
Create Target
[4]:
y = titan_df['Survived']
X = titan_df.drop('Survived', axis=1)
Train category encoder
[5]:
# Chain the encoders: one-hot for Pclass, ordinal for Embarked and Title,
# target encoding for Sex
onehot = OneHotEncoder(cols=['Pclass']).fit(X)
result_1 = onehot.transform(X)
ordinal = OrdinalEncoder(cols=['Embarked', 'Title']).fit(result_1)
result_2 = ordinal.transform(result_1)
target = TargetEncoder(cols=['Sex']).fit(result_2, y)
result_3 = target.transform(result_2)
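To see what TargetEncoder does to the Sex column, here is a pandas-only sketch of the idea on toy data (not category_encoders' actual implementation, which also applies smoothing): each category is replaced by the mean of the target within that category.

```python
import pandas as pd

# Toy data illustrating the idea behind target encoding
df = pd.DataFrame({"Sex": ["male", "female", "male", "female", "female"],
                   "Survived": [0, 1, 0, 1, 0]})

# Mean of the target per category, then substituted for the category label
means = df.groupby("Sex")["Survived"].mean()
encoded = df["Sex"].map(means)
print(means.to_dict())  # {'female': 0.666..., 'male': 0.0}
```

This is why, in the encoded data shown later, Sex holds a float rather than a 0/1 flag.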
[6]:
encoder = [onehot, ordinal, target]  # in the order the encoders were applied
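This encoder list is what lets Shapash undo the preprocessing, so it must contain the encoders in the order they were applied. A minimal pandas-only sketch of the idea, encoding in order and decoding in reverse (the mapping dicts are illustrative, not Shapash's internals):

```python
import pandas as pd

# Two illustrative encoding steps (hypothetical mappings)
title_map = {'Mr': 1, 'Mrs': 2, 'Miss': 3}
embarked_map = {'Southampton': 1, 'Cherbourg': 2, 'Queenstown': 3}

df = pd.DataFrame({'Title': ['Mr', 'Miss'], 'Embarked': ['Cherbourg', 'Southampton']})

# Forward pass: apply each step in order
encoded = df.copy()
encoded['Title'] = encoded['Title'].map(title_map)
encoded['Embarked'] = encoded['Embarked'].map(embarked_map)

# Inverse pass: walk the steps in reverse, restoring the original labels
decoded = encoded.copy()
decoded['Embarked'] = decoded['Embarked'].map({v: k for k, v in embarked_map.items()})
decoded['Title'] = decoded['Title'].map({v: k for k, v in title_map.items()})
```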
Fit a model¶
[7]:
Xtrain, Xtest, ytrain, ytest = train_test_split(result_3, y, train_size=0.75, random_state=1)
clf = XGBClassifier(n_estimators=200, min_child_weight=2).fit(Xtrain, ytrain)
[7]:
XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints=None,
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=2, missing=nan, monotone_constraints=None,
n_estimators=200, n_jobs=0, num_parallel_tree=1,
objective='binary:logistic', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
validate_parameters=False, verbosity=None)
Using Shapash¶
[8]:
from shapash import SmartExplainer
[9]:
xpl = SmartExplainer(
model=clf,
preprocessing=encoder,
)
[10]:
xpl.compile(x=Xtest,
y_target=ytest, # Optional: allows to display True Values vs Predicted Values
)
Backend: Shap TreeExplainer
Visualize data in pandas¶
[11]:
xpl.x_init
[11]:
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|
| 863 | First class | female | 48.0 | 0 | 0 | 25.93 | Southampton | Mrs |
| 224 | Third class | male | 29.5 | 0 | 0 | 7.90 | Southampton | Mr |
| 85 | Second class | female | 17.0 | 0 | 0 | 10.50 | Southampton | Miss |
| 681 | Third class | female | 29.5 | 0 | 0 | 8.14 | Queenstown | Miss |
| 536 | Second class | female | 7.0 | 0 | 2 | 26.25 | Southampton | Miss |
| 624 | Third class | male | 21.0 | 0 | 0 | 7.85 | Southampton | Mr |
| 149 | Second class | male | 36.5 | 0 | 2 | 26.00 | Southampton | Mr |
| 4 | First class | female | 35.0 | 1 | 0 | 53.10 | Southampton | Mrs |
| 35 | First class | male | 28.0 | 1 | 0 | 82.17 | Cherbourg | Mr |
| 242 | Third class | female | 29.5 | 1 | 0 | 15.50 | Queenstown | Miss |
[12]:
xpl.x_encoded
[12]:
| PassengerId | Pclass_1 | Pclass_2 | Pclass_3 | Sex | Age | SibSp | Parch | Fare | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|
| 863 | 0 | 1 | 0 | 0.742038 | 48.0 | 0 | 0 | 25.93 | 1 | 2 |
| 224 | 1 | 0 | 0 | 0.188908 | 29.5 | 0 | 0 | 7.90 | 1 | 1 |
| 85 | 0 | 0 | 1 | 0.742038 | 17.0 | 0 | 0 | 10.50 | 1 | 3 |
| 681 | 1 | 0 | 0 | 0.742038 | 29.5 | 0 | 0 | 8.14 | 3 | 3 |
| 536 | 0 | 0 | 1 | 0.742038 | 7.0 | 0 | 2 | 26.25 | 1 | 3 |
| 624 | 1 | 0 | 0 | 0.188908 | 21.0 | 0 | 0 | 7.85 | 1 | 1 |
| 149 | 0 | 0 | 1 | 0.188908 | 36.5 | 0 | 2 | 26.00 | 1 | 1 |
| 4 | 0 | 1 | 0 | 0.742038 | 35.0 | 1 | 0 | 53.10 | 1 | 2 |
| 35 | 0 | 1 | 0 | 0.188908 | 28.0 | 1 | 0 | 82.17 | 2 | 1 |
| 242 | 1 | 0 | 0 | 0.742038 | 29.5 | 1 | 0 | 15.50 | 3 | 3 |
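`x_init` is `x_encoded` with every encoding step undone. For the one-hot columns, the inversion amounts to finding which `Pclass_*` column is set in each row and mapping it back to a label. A pandas-only sketch of this step (the column-to-label mapping here is assumed for illustration, not read from the fitted encoder):

```python
import pandas as pd

# Three rows of a one-hot block like Pclass_1 / Pclass_2 / Pclass_3
onehot = pd.DataFrame({'Pclass_1': [0, 1, 0],
                       'Pclass_2': [1, 0, 0],
                       'Pclass_3': [0, 0, 1]})

# Map each one-hot column back to a label (mapping chosen for illustration)
col_to_label = {'Pclass_1': 'Third class',
                'Pclass_2': 'First class',
                'Pclass_3': 'Second class'}

# idxmax(axis=1) returns, for each row, the name of the column holding the 1
decoded = onehot.idxmax(axis=1).map(col_to_label)
print(decoded.tolist())  # ['First class', 'Third class', 'Second class']
```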