Postprocessing parameter in compile method

The compile method builds the explainer you need for your model. It has many parameters, and this tutorial covers one of them: postprocessing. This parameter lets you modify the displayed values of the dataset with several techniques, so that visualizations are easier to read. The tutorial presents the different ways you can modify the data and the syntax for each.

Contents:

  • Loading dataset and fitting a model.

  • Creating our SmartExplainer and compiling it without postprocessing.

  • New SmartExplainer with postprocessing parameter.

Data from Kaggle: Titanic

[1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

Building Supervised Model

First step: Importing our dataset

[2]:
from shapash.data.data_loader import data_loading
titanic_df, titanic_dict = data_loading('titanic')
y_df=titanic_df['Survived']
X_df=titanic_df[titanic_df.columns.difference(['Survived'])]
[3]:
titanic_df.head()
[3]:
PassengerId  Survived  Pclass       Name                                           Sex     Age   SibSp  Parch  Fare   Embarked     Title
1            0         Third class  Braund Owen Harris                             male    22.0  1      0      7.25   Southampton  Mr
2            1         First class  Cumings John Bradley (Florence Briggs Thayer)  female  38.0  1      0      71.28  Cherbourg    Mrs
3            1         Third class  Heikkinen Laina                                female  26.0  0      0      7.92   Southampton  Miss
4            1         First class  Futrelle Jacques Heath (Lily May Peel)         female  35.0  1      0      53.10  Southampton  Mrs
5            0         Third class  Allen William Henry                            male    35.0  0      0      8.05   Southampton  Mr

Second step: Encoding our categorical variables

[4]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df = encoder.transform(X_df)
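
To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind ordinal encoding and its inverse. It is not how category_encoders implements OrdinalEncoder; the helper names (fit_ordinal, transform, inverse_transform) and the choice of codes starting at 1 are assumptions made for illustration only.

```python
def fit_ordinal(values):
    """Assign an integer code to each distinct label, in order of first appearance."""
    mapping = {}
    for label in values:
        if label not in mapping:
            mapping[label] = len(mapping) + 1  # illustrative choice: codes start at 1
    return mapping


def transform(values, mapping):
    """Replace each label with its integer code."""
    return [mapping[label] for label in values]


def inverse_transform(codes, mapping):
    """Recover the original labels from the integer codes."""
    reverse = {code: label for label, code in mapping.items()}
    return [reverse[code] for code in codes]


embarked = ["Southampton", "Cherbourg", "Southampton", "Queenstown"]
mapping = fit_ordinal(embarked)
codes = transform(embarked, mapping)
restored = inverse_transform(codes, mapping)

print(codes)                  # [1, 2, 1, 3]
print(restored == embarked)   # True
```

The round trip is the point here: because the mapping can be reversed, an explainer that receives the fitted encoder can display the original labels instead of the integer codes.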

Third step: Train/test split and fitting our model

[5]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
[6]:
classifier = RandomForestClassifier(n_estimators=200).fit(Xtrain, ytrain)
[7]:
y_pred = pd.DataFrame(classifier.predict(Xtest), columns=['pred'], index=Xtest.index) # Predictions

Fourth step: Declaring our Explainer

[8]:
from shapash import SmartExplainer
[9]:
xpl = SmartExplainer(
    model=classifier,
    preprocessing=encoder, # Optional: compile step can use inverse_transform method
    features_dict=titanic_dict   # Optional parameter, dict specifies label for features name
)

Compiling without postprocessing parameter

After declaring our explainer, we need to compile it on our data so that it can compute the explanations.

[10]:
xpl.compile(x=Xtest, y_pred=y_pred)
Backend: Shap TreeExplainer

We can now use our explainer to understand model predictions, through plots or data. The original dataset, as it was before preprocessing, is also available through x_init.

[11]:
xpl.x_init
[11]:
PassengerId  Age   Embarked     Fare   Name                                           Parch  Pclass        Sex     SibSp  Title
863          48.0  Southampton  25.93  Swift Frederick Joel (Margaret Welles Barron)  0      First class   female  0      Mrs
224          29.5  Southampton  7.90   Nenkoff Christo                                0      Third class   male    0      Mr
85           17.0  Southampton  10.50  Ilett Bertha                                   0      Second class  female  0      Miss
681          29.5  Queenstown   8.14   Peters Katie                                   0      Third class   female  0      Miss
536          7.0   Southampton  26.25  Hart Eva Miriam                                2      Second class  female  0      Miss
...
507          33.0  Southampton  26.00  Quick Frederick Charles (Jane Richards)        2      Second class  female  0      Mrs
468          56.0  Southampton  26.55  Smart John Montgomery                          0      First class   male    0      Mr
741          29.5  Southampton  30.00  Hawksford Walter James                         0      First class   male    0      Mr
355          29.5  Cherbourg    7.22   Yousif Wazli                                   0      Third class   male    0      Mr
450          52.0  Southampton  30.50  Peuchen Arthur Godfrey                         0      First class   male    0      Major

223 rows × 9 columns

All the analyses you can run are covered in this tutorial: Tutorial

Compiling with postprocessing parameter

Here, however, we want to add postprocessing to our data to make it easier to read and to improve explainability.

The syntax for the postprocessing parameter is as follows:

postprocess = {
    'name_of_the_feature': {'type': 'type_of_modification', 'rule': 'rule_to_apply'},
    'second_name_of_features': {'type': 'type_of_modification', 'rule': 'rule_to_apply'},
    ...
}

There are five types of modification:

    1. prefix: adds a string at the beginning of each value. The syntax is:

{'feature_name': {'type': 'prefix',
                  'rule': 'Example : '}
}
    2. suffix: adds a string at the end of each value. The syntax is similar:

{'feature_name': {'type': 'suffix',
                  'rule': ' is an example'}
}
    3. transcoding: a mapping that replaces the values of a categorical variable. The syntax is:

{'feature_name': {'type': 'transcoding',
                  'rule': {'old_name1': 'new_name1',
                           'old_name2': 'new_name2',
                           ...
                          }
                 }
}

Values that are not listed in the mapping are left unchanged.

    4. regex: modifies strings with a regular expression, replacing whatever matches 'in' with 'out':

{'feature_name': {'type': 'regex',
                  'rule': {'in': '^M',
                           'out': 'm'
                          }
                 }
}
    5. case: changes the case of a feature's values, either to lowercase with 'rule': 'lower' or to uppercase with 'rule': 'upper'. The syntax is:

{'feature_name': {'type': 'case',
                  'rule': 'upper'}
}

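The effect of the five modification types on a single value can be sketched in plain Python. This is an illustration of the semantics described above, not Shapash's own code; the function name apply_rule is hypothetical, and converting numbers with str() before adding a prefix or suffix is an assumption.

```python
import re


def apply_rule(value, spec):
    """Return `value` modified according to one {'type': ..., 'rule': ...} spec."""
    kind, rule = spec["type"], spec["rule"]
    if kind == "prefix":
        return rule + str(value)
    if kind == "suffix":
        return str(value) + rule
    if kind == "transcoding":
        # Values missing from the mapping pass through unchanged.
        return rule.get(value, value)
    if kind == "regex":
        return re.sub(rule["in"], rule["out"], value)
    if kind == "case":
        if rule == "upper":
            return value.upper()
        if rule == "lower":
            return value.lower()
        raise ValueError(f"case rule must be 'lower' or 'upper', got {rule!r}")
    raise ValueError(f"unknown modification type: {kind!r}")


print(apply_rule(7.25, {"type": "prefix", "rule": "Example : "}))               # Example : 7.25
print(apply_rule(22.0, {"type": "suffix", "rule": " is an example"}))           # 22.0 is an example
print(apply_rule("male", {"type": "transcoding", "rule": {"male": "Man"}}))     # Man
print(apply_rule("female", {"type": "transcoding", "rule": {"male": "Man"}}))   # female
print(apply_rule("Mr", {"type": "regex", "rule": {"in": "^M", "out": "m"}}))    # mr
print(apply_rule("Southampton", {"type": "case", "rule": "upper"}))             # SOUTHAMPTON
```

The fourth call shows the partial-mapping case: 'female' is absent from the transcoding dict, so it comes back unchanged.
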
Of course, you don’t have to modify all features. Let’s give an example.

[12]:
postprocess = {
    'Age': {'type': 'suffix',
            'rule': ' years old' # Adding 'years old' at the end
           },
    'Sex': {'type': 'transcoding',
            'rule': {'male': 'Man',
                     'female': 'Woman'}
           },
    'Pclass': {'type': 'regex',
               'rule': {'in': ' class$',
                        'out': ''} # Deleting 'class' word at the end
              },
    'Fare': {'type': 'prefix',
             'rule': '$' # Adding $ at the beginning
            },
    'Embarked': {'type': 'case',
                 'rule': 'upper'
                }
}
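
Since the dict is written by hand, it can be worth checking its shape before using it. The sketch below is a hypothetical helper (check_postprocess is not part of Shapash, which may run its own checks); it only encodes the rules stated in this tutorial: five allowed types, a string rule for prefix and suffix, a dict for transcoding, a dict with 'in' and 'out' for regex, and 'lower' or 'upper' for case.

```python
ALLOWED_TYPES = ("prefix", "suffix", "transcoding", "regex", "case")


def check_postprocess(postprocess, columns):
    """Return a list of problems found; an empty list means the dict looks valid."""
    problems = []
    for feature, spec in postprocess.items():
        if feature not in columns:
            problems.append(f"{feature}: not a column of the dataset")
            continue
        if not isinstance(spec, dict) or set(spec) != {"type", "rule"}:
            problems.append(f"{feature}: expected exactly the keys 'type' and 'rule'")
            continue
        kind, rule = spec["type"], spec["rule"]
        if kind not in ALLOWED_TYPES:
            problems.append(f"{feature}: unknown type {kind!r}")
        elif kind in ("prefix", "suffix") and not isinstance(rule, str):
            problems.append(f"{feature}: {kind} rule must be a string")
        elif kind == "transcoding" and not isinstance(rule, dict):
            problems.append(f"{feature}: transcoding rule must be a dict")
        elif kind == "regex" and not (isinstance(rule, dict) and set(rule) == {"in", "out"}):
            problems.append(f"{feature}: regex rule must be a dict with 'in' and 'out'")
        elif kind == "case" and rule not in ("lower", "upper"):
            problems.append(f"{feature}: case rule must be 'lower' or 'upper'")
    return problems


columns = ["Age", "Embarked", "Fare", "Name", "Parch", "Pclass", "Sex", "SibSp", "Title"]

good = {
    "Age": {"type": "suffix", "rule": " years old"},
    "Embarked": {"type": "case", "rule": "upper"},
}
bad = {
    "Fare": {"type": "prefix"},                 # missing 'rule'
    "Sex": {"type": "case", "rule": "title"},   # unsupported case rule
    "Cabin": {"type": "prefix", "rule": "#"},   # not a column
}

print(check_postprocess(good, columns))   # []
problems = check_postprocess(bad, columns)
for problem in problems:
    print(problem)
```

Returning a list rather than raising on the first error lets you fix every mistake in the dict in one pass.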

You can now pass this postprocess dict through the postprocessing parameter (in the code below, it is given to the SmartExplainer constructor):

[13]:
xpl_postprocess = SmartExplainer(
    model=classifier,
    postprocessing=postprocess,
    preprocessing=encoder,       # Optional: compile step can use inverse_transform method
    features_dict=titanic_dict
)
[14]:
xpl_postprocess.compile(
    x=Xtest,
    y_pred=y_pred, # Optional
)
Backend: Shap TreeExplainer

You can now view the dataset with the modifications applied.

[15]:
xpl_postprocess.x_init
[15]:
PassengerId  Age             Embarked     Fare    Name                                           Parch  Pclass  Sex    SibSp  Title
863          48.0 years old  SOUTHAMPTON  $25.93  Swift Frederick Joel (Margaret Welles Barron)  0      First   Woman  0      Mrs
224          29.5 years old  SOUTHAMPTON  $7.9    Nenkoff Christo                                0      Third   Man    0      Mr
85           17.0 years old  SOUTHAMPTON  $10.5   Ilett Bertha                                   0      Second  Woman  0      Miss
681          29.5 years old  QUEENSTOWN   $8.14   Peters Katie                                   0      Third   Woman  0      Miss
536          7.0 years old   SOUTHAMPTON  $26.25  Hart Eva Miriam                                2      Second  Woman  0      Miss
...
507          33.0 years old  SOUTHAMPTON  $26.0   Quick Frederick Charles (Jane Richards)        2      Second  Woman  0      Mrs
468          56.0 years old  SOUTHAMPTON  $26.55  Smart John Montgomery                          0      First   Man    0      Mr
741          29.5 years old  SOUTHAMPTON  $30.0   Hawksford Walter James                         0      First   Man    0      Mr
355          29.5 years old  CHERBOURG    $7.22   Yousif Wazli                                   0      Third   Man    0      Mr
450          52.0 years old  SOUTHAMPTON  $30.5   Peuchen Arthur Godfrey                         0      First   Man    0      Major

223 rows × 9 columns

All plots also reflect the postprocessing modifications.

The main purpose of postprocessing is to make the data easier to understand, especially when values are not displayed under their feature's column name, as in the output of the to_pandas() method, which orders the features of each row by importance.

[17]:
xpl_postprocess.to_pandas()
to_pandas params: {'features_to_hide': None, 'threshold': None, 'positive': None, 'max_contrib': 20}
[17]:
pred feature_1 value_1 contribution_1 feature_2 value_2 contribution_2 feature_3 value_3 contribution_3 ... contribution_6 feature_7 value_7 contribution_7 feature_8 value_8 contribution_8 feature_9 value_9 contribution_9
863 1 Title of passenger Mrs 0.163479 Sex Woman 0.154309 Ticket class First 0.130221 ... 0.0406219 Name, First name Swift Frederick Joel (Margaret Welles Barron) -0.0381955 Port of embarkation SOUTHAMPTON -0.0147327 Relatives like children or parents 0 -0.00538103
224 0 Title of passenger Mr 0.094038 Sex Man 0.0696282 Age 29.5 years old 0.0658556 ... 0.0151605 Relatives such as brother or wife 0 -0.00855039 Relatives like children or parents 0 0.00124433 Name, First name Nenkoff Christo -0.000577095
85 1 Title of passenger Miss 0.190529 Sex Woman 0.135507 Ticket class Second 0.0809714 ... -0.025286 Relatives like children or parents 0 -0.0238222 Relatives such as brother or wife 0 0.0209045 Age 17.0 years old -0.00702283
681 1 Title of passenger Miss 0.237477 Port of embarkation QUEENSTOWN 0.143451 Sex Woman 0.127931 ... 0.0243567 Relatives like children or parents 0 0.0165205 Passenger fare $8.14 -0.0109633 Age 29.5 years old 0.00327866
536 1 Title of passenger Miss 0.210166 Ticket class Second 0.168247 Sex Woman 0.0876445 ... 0.0147503 Relatives like children or parents 2 0.0125069 Port of embarkation SOUTHAMPTON -0.0119119 Name, First name Hart Eva Miriam 0.00654165
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
507 1 Title of passenger Mrs 0.215332 Sex Woman 0.194419 Ticket class Second 0.166437 ... -0.0079185 Relatives like children or parents 2 0.00407485 Age 33.0 years old -0.00263589 Name, First name Quick Frederick Charles (Jane Richards) 0.00162901
468 0 Sex Man 0.100602 Passenger fare $26.55 -0.099794 Title of passenger Mr 0.0967768 ... 0.0243706 Port of embarkation SOUTHAMPTON 0.0124424 Relatives such as brother or wife 0 -0.0108301 Relatives like children or parents 0 -0.00332632
741 0 Title of passenger Mr 0.131861 Sex Man 0.110845 Age 29.5 years old 0.104878 ... 0.0339308 Relatives such as brother or wife 0 -0.00715564 Name, First name Hawksford Walter James 0.00165882 Relatives like children or parents 0 -0.00137946
355 0 Title of passenger Mr 0.12679 Sex Man 0.0933251 Age 29.5 years old 0.0717939 ... -0.0271103 Name, First name Yousif Wazli 0.0163174 Relatives such as brother or wife 0 -0.0108501 Relatives like children or parents 0 -0.000543508
450 0 Sex Man 0.13572 Title of passenger Major -0.0723023 Age 52.0 years old 0.0690373 ... 0.027384 Relatives such as brother or wife 0 -0.0134144 Relatives like children or parents 0 0.00256623 Name, First name Peuchen Arthur Godfrey 0.00229483

223 rows × 28 columns
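
Judging from the table above, each row appears to list its features by decreasing absolute contribution (for passenger 468: 0.100602, then -0.099794, then 0.0967768). The sketch below reproduces that ordering for one row in plain Python. It is an assumption about the ordering inferred from the output, not Shapash's implementation, and rank_contributions is a hypothetical helper; its max_contrib argument only mimics the parameter of the same name printed by to_pandas().

```python
def rank_contributions(values, contributions, max_contrib=None):
    """Return (feature, value, contribution) triples, largest absolute contribution first."""
    ranked = sorted(contributions, key=lambda name: abs(contributions[name]), reverse=True)
    if max_contrib is not None:
        ranked = ranked[:max_contrib]  # keep only the top features
    return [(name, values[name], contributions[name]) for name in ranked]


# Three features of passenger 468, taken from the output above.
values = {"Title of passenger": "Mr", "Passenger fare": "$26.55", "Sex": "Man"}
contributions = {"Title of passenger": 0.0967768, "Passenger fare": -0.099794, "Sex": 0.100602}

ranked = rank_contributions(values, contributions)
for name, value, contribution in ranked:
    print(name, value, contribution)
```

Because the feature shown in position 1, 2, 3 changes from row to row, a postprocessed value such as '$26.55' or '29.5 years old' stays readable without looking up which feature it belongs to.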