scatter_plot_prediction

scatter_plot_prediction is a method that displays a violin or scatter plot. The purpose of these representations is to visualise correct and wrong predictions. Provide the y_target argument in the compile() method to display this plot.

This tutorial presents the different parameters you can use in scatter_plot_prediction to tune the output, and shows the results on classification and regression models.

Contents:

- Building a classification model
- Building a regression model
- Display scatter plot prediction
- Focus on a subset
- Size of Random Sample

Data from Kaggle Titanic is used for classification; data from Kaggle House Prices for regression.

[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from xgboost import XGBClassifier
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

Building a Supervised Model for Classification and Compiling Shapash

[2]:
from shapash.data.data_loader import data_loading
titanic_df, titanic_dict = data_loading('titanic')
y_df=titanic_df['Survived'].to_frame()
X_df=titanic_df[titanic_df.columns.difference(['Survived'])]
[3]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df=encoder.transform(X_df)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=7)
[ ]:
clf = XGBClassifier(n_estimators=200,min_child_weight=2).fit(Xtrain,ytrain)

First step: you need to declare and compile SmartExplainer.

[5]:
from shapash import SmartExplainer
[6]:
response_dict = {0: 'Death', 1: 'Survival'}
[7]:
xpl_classifier = SmartExplainer(
    model=clf,
    preprocessing=encoder, # Optional: compile step can use inverse_transform method
    features_dict=titanic_dict, # Optional parameter, dict specifies labels for feature names
    label_dict=response_dict    # Optional parameter, dict specifies labels for target values
)

To display scatter_plot_prediction, you have to fill in y_target:

[8]:
xpl_classifier.compile(
    x=Xtest,
    y_target=ytest
)

Building a Supervised Model for Regression and Compiling Shapash

[9]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
y_df=house_df['SalePrice'].to_frame()
X_df=house_df[house_df.columns.difference(['SalePrice'])]
[10]:
from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True).fit(X_df)

X_df=encoder.transform(X_df)

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
[11]:
regressor = CatBoostRegressor(n_estimators=50).fit(Xtrain,ytrain,verbose=False)

First step: you need to declare and compile SmartExplainer.

[12]:
xpl_regressor = SmartExplainer(
    model=regressor,
    preprocessing=encoder, # Optional: compile step can use inverse_transform method
    features_dict=house_dict, # Optional parameter, dict specifies labels for feature names
)

To display scatter_plot_prediction, you have to fill in y_target:

[13]:
xpl_regressor.compile(
    x=Xtest,
    y_target=ytest
)

Display scatter plot prediction

[14]:
xpl_classifier.plot.scatter_plot_prediction()
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_21_0.png

On the x-axis we have the actual class of the samples. On the y-axis we have the probability, predicted by the classifier, of belonging to the positive class (Survival in this case). For a binary classification, we find a sort of confusion matrix: true negatives at the bottom left, true positives at the top right, false positives at the top left, and false negatives at the bottom right. The violin plot represents the distribution of all the samples.
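To relate the four areas of this plot to concrete counts, you can compute a confusion matrix from the model's predictions. Below is a minimal sketch using scikit-learn, reusing clf and the Titanic Xtest/ytest split (note that Xtest and ytest were overwritten in the regression section above, so re-run the Titanic train_test_split cell first):

[ ]:
from sklearn.metrics import confusion_matrix

# Assumes Xtest and ytest hold the Titanic split from the classification cells.
# Rows: actual class (0 = Death, 1 = Survival); columns: predicted class.
# The four cells correspond to the four areas of the plot above.
y_pred = clf.predict(Xtest)
print(confusion_matrix(ytest['Survived'], y_pred))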

[15]:
xpl_regressor.plot.scatter_plot_prediction()
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_23_0.png

On the x-axis we have the actual values of the samples; on the y-axis, the predicted values. If a point lies on the y = x line and is blue, the prediction error is low. The further a point deviates from this line, and the redder it is, the higher the prediction error. If no true value (y_target) is 0, as in this use case, the prediction error is calculated as \(\text{Prediction Error} = \frac{|\text{True Values} - \text{Predicted Values}|}{\text{True Values}}\); otherwise, \(\text{Prediction Error} = |\text{True Values} - \text{Predicted Values}|\).
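To make the formula concrete, here is a minimal numpy sketch of this prediction error, reusing the regressor, Xtest and ytest objects from the regression cells above (the variable names are ours, not Shapash internals):

[ ]:
import numpy as np

true_values = ytest['SalePrice'].to_numpy()
predicted_values = regressor.predict(Xtest)

if (true_values != 0).all():
    # No true value is 0: relative error
    prediction_error = np.abs(true_values - predicted_values) / true_values
else:
    # Otherwise: absolute error
    prediction_error = np.abs(true_values - predicted_values)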

Focus on a subset

With the selection parameter, you can specify a list of indices of the samples you want to focus on.

[16]:
# Indices of houses built after 1990
index = list(Xtest[xpl_regressor.x_init['YearBuilt'] > 1990].index.values)
[17]:
xpl_regressor.plot.scatter_plot_prediction(selection=index, width=900, height=600)
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_27_0.png

With the label parameter, you can specify an explicit label or a label number. This is very useful for multiclass classification.

[18]:
xpl_classifier.plot.scatter_plot_prediction(label='Death')
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_29_0.png
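Since response_dict maps 0 to 'Death', the same plot can also be requested with the label number:

[ ]:
xpl_classifier.plot.scatter_plot_prediction(label=0)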

Size of Random Sample

The plot.scatter_plot_prediction method uses a random sample to limit the number of points displayed. The default sample size is 2000, but you can change it with the max_points parameter:

[19]:
xpl_classifier.plot.scatter_plot_prediction(max_points=200)
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_31_0.png

Add y_target after compile

If compile() took a long time and you forgot to fill in y_target, you can also add it afterwards with the add method.

[20]:
xpl_regressor = SmartExplainer(
    model=regressor,
    preprocessing=encoder, # Optional: compile step can use inverse_transform method
    features_dict=house_dict, # Optional parameter, dict specifies labels for feature names
)
[21]:
xpl_regressor.compile(
    x=Xtest,
    # y_target=ytest
)
[22]:
xpl_regressor.add(y_target=ytest)
[23]:
xpl_regressor.plot.scatter_plot_prediction()
../../_images/tutorials_plots_and_charts_tuto-plot06-prediction_plot_36_0.png