{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compute Contributions with Shap - Summarize Them With Shapash\n", "\n", "Shapash uses the Shap backend to compute Shapley contributions
\n", "so that users in a hurry can display
\n", "results with just a few lines of code.\n", "\n", "For a deeper treatment of Shapley values, we recommend the excellent [Shap library](https://github.com/slundberg/shap).\n", "\n", "This tutorial shows how to use contributions precalculated with Shap in Shapash.\n", "\n", "Contents:\n", "- Build a Binary Classifier\n", "- Let Shapash choose the Shap explainer\n", "- Let Shap choose the explainer with a custom masker\n", "- Specify the Shap explainer explicitly\n", "- Pass precomputed contributions to the compile method\n", "\n", "We use Kaggle's [Titanic](https://www.kaggle.com/c/titanic) dataset." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from category_encoders import OrdinalEncoder\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "import shap" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from shapash.data.data_loader import data_loading" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "titan_df, titan_dict = data_loading('titanic')\n", "del titan_df['Name']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "<table>\n",
" <thead>\n",
" <tr><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked</th><th>Title</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>1</th><td>0</td><td>Third class</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.25</td><td>Southampton</td><td>Mr</td></tr>\n",
" <tr><th>2</th><td>1</td><td>First class</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.28</td><td>Cherbourg</td><td>Mrs</td></tr>\n",
" <tr><th>3</th><td>1</td><td>Third class</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.92</td><td>Southampton</td><td>Miss</td></tr>\n",
" <tr><th>4</th><td>1</td><td>First class</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.10</td><td>Southampton</td><td>Mrs</td></tr>\n",
" <tr><th>5</th><td>0</td><td>Third class</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.05</td><td>Southampton</td><td>Mr</td></tr>\n",
" </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare \\\n", "PassengerId \n", "1 0 Third class male 22.0 1 0 7.25 \n", "2 1 First class female 38.0 1 0 71.28 \n", "3 1 Third class female 26.0 0 0 7.92 \n", "4 1 First class female 35.0 1 0 53.10 \n", "5 0 Third class male 35.0 0 0 8.05 \n", "\n", " Embarked Title \n", "PassengerId \n", "1 Southampton Mr \n", "2 Cherbourg Mrs \n", "3 Southampton Miss \n", "4 Southampton Mrs \n", "5 Southampton Mr " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "titan_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Classification Model" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "y = titan_df['Survived']\n", "X = titan_df.drop('Survived', axis=1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "varcat=['Pclass','Sex','Embarked','Title']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "categ_encoding = OrdinalEncoder(cols=varcat, \\\n", " handle_unknown='ignore', \\\n", " return_df=True).fit(X)\n", "X = categ_encoding.transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Train Test split + Random Forest fit" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomForestClassifier(min_samples_leaf=3)
" ], "text/plain": [ "RandomForestClassifier(min_samples_leaf=3)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.75, random_state=1)\n", "\n", "rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=3)\n", "rf.fit(Xtrain, ytrain)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "ypred = pd.DataFrame(rf.predict(Xtest), columns=['pred'], index=Xtest.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use Shapash With Shapley Contributions" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from shapash import SmartExplainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Different ways to compute Shapley values with Shap" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let Shapash choose the method for you" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Shap explainer type - \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "ExactExplainer explainer: 224it [00:42, 4.62it/s] \n" ] } ], "source": [ "xpl = SmartExplainer(\n", "    model=rf,\n", "    backend='shap',\n", "    preprocessing=categ_encoding,\n", "    features_dict=titan_dict\n", ")\n", "xpl.compile(\n", "    y_pred=ypred,\n", "    y_target=ytest, # Optional: allows to display True Values vs Predicted Values\n", "    x=Xtest\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let Shap choose the method for you and provide your own masker" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Shap explainer type - \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "ExactExplainer explainer: 224it [00:36, 4.36it/s] \n" ] } ], "source": [ "xpl = 
SmartExplainer(\n", "    model=rf,\n", "    backend='shap',\n", "    explainer_args={'model': rf.predict_proba, 'masker': Xtest},\n", "    preprocessing=categ_encoding,\n", "    features_dict=titan_dict\n", ")\n", "xpl.compile(\n", "    y_pred=ypred,\n", "    y_target=ytest, # Optional: allows to display True Values vs Predicted Values\n", "    x=Xtest\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tell Shap which explainer to use" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "INFO: Shap explainer type - shap.explainers.PermutationExplainer()\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "PermutationExplainer explainer: 224it [03:04, 1.14it/s] \n" ] } ], "source": [ "xpl = SmartExplainer(\n", "    model=rf,\n", "    backend='shap',\n", "    explainer_args={'explainer': shap.explainers.PermutationExplainer, 'model': rf.predict_proba, 'masker': Xtest},\n", "    preprocessing=categ_encoding,\n", "    features_dict=titan_dict\n", ")\n", "xpl.compile(\n", "    y_pred=ypred,\n", "    y_target=ytest, # Optional: allows to display True Values vs Predicted Values\n", "    x=Xtest\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use the contributions parameter of the compile method to pass precomputed Shapley contributions" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "PermutationExplainer explainer: 224it [00:23, 5.51it/s] \n" ] } ], "source": [ "xpl = SmartExplainer(\n", "    model=rf,\n", "    preprocessing=categ_encoding,\n", "    features_dict=titan_dict\n", ")\n", "\n", "masker = pd.DataFrame(shap.kmeans(Xtest, 50).data, columns=Xtest.columns)\n", "explainer = shap.explainers.PermutationExplainer(model=rf.predict_proba, masker=masker)\n", "shap_contrib = explainer.shap_values(Xtest)\n", "\n", "xpl.compile(\n", "    contributions=shap_contrib, # Precomputed Shap contributions\n", "    y_pred=ypred,\n", "
y_target=ytest, # Optional: allows to display True Values vs Predicted Values\n", " x=Xtest\n", ")" ] } ], "metadata": { "celltoolbar": "Aucun(e)", "hide_input": false, "kernelspec": { "display_name": "myenv_39", "language": "python", "name": "myenv_39" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "vscode": { "interpreter": { "hash": "6dbaec60c0b0d722a3fa908c2fd7b738d946da6332c67fea5eea602801fdaf43" } } }, "nbformat": 4, "nbformat_minor": 4 }