{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Compile faster Lime and consistency of contributions \n", "\n", "You can compute your local contributions with the [Lime](https://github.com/marcotcr/lime) library and summarize them with Shapash.\n", "One of the limitations of using Lime is the speed of calculation.\n", "In this tutorial, we propose 2 ways to speed up the calculations.\n", "Then, we look impacts on the contributions of these accelerated calculations.\n", "\n", "Contents:\n", "- Build a Binary Classifier (Random Forest)\n", "- Create Explainer using Lime\n", "- Compile Shapash SmartExplainer\n", "- Use of multiprocessing\n", "- Changing setting of the num_samples option\n", "- Comparison of computing times\n", "- Consistency of contributions\n", "\n", "Data from Kaggle [Telco customer churn](https://www.kaggle.com/blastchar/telco-customer-churn)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from maif_datalab import utils\n", "utils.set_proxy()\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "import plotly\n", "plotly.io.renderers.default = 'png'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from category_encoders import OrdinalEncoder\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "import lime.lime_tabular\n", "\n", "from shapash import SmartExplainer\n", "from category_encoders import OrdinalEncoder\n", "import multiprocessing\n", "from collections import namedtuple\n", "from shapash.explainer.consistency import Consistency" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building Supervized Model\n", "\n", "Let's start by loading a dataset and building a model that we will try to explain right after.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": 
{}, "outputs": [], "source": [ "from shapash.data.data_loader import data_loading" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 92.9 ms, sys: 13.9 ms, total: 107 ms\n", "Wall time: 3.86 s\n" ] } ], "source": [ "%%time\n", "df = data_loading('telco_customer_churn')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "df = df.reset_index().drop('customerID', axis=1)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df['Churn'].replace('No', 0,inplace=True)\n", "df['Churn'].replace('Yes', 1,inplace=True)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "y_df = df['Churn']\n", "X_df = df.drop('Churn', axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Encoding Categorical Features " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']\n", "\n", "encoder = OrdinalEncoder(\n", " cols=categorical_features,\n", " handle_unknown='ignore',\n", " return_df=True).fit(X_df)\n", "\n", "X_df=encoder.transform(X_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Train / Test Split" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model Fitting" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomForestClassifier(min_samples_leaf=3)