{ "cells": [ { "cell_type": "markdown", "id": "furnished-present", "metadata": {}, "source": [ "# Shapash Report\n", "\n", "> The Shapash Report feature lets data scientists deliver to anyone interested in their project **a document that freezes different aspects of their work as a basis for an audit report**. This document can be easily shared across teams and requires nothing more than a working internet connection.\n", "\n", "The Shapash `generate_report` method generates a report of your project. \n", "The result is a standalone HTML file that does not require any external dependency or server to work. \n", "The only requirement for the document to display properly is an active internet connection. \n", "\n", "The report contains the following information:\n", "1. General information about the project\n", "2. Description of the dataset used\n", "3. Documentation about data preparation and feature engineering\n", "4. Details about the model used (library, parameters...)\n", "5. Exploration of the data with a focus on the difference between train and test sets\n", "6. Global explainability of the model\n", "7. Model performance\n", "\n", "> The first three points are generated from a YML file that the user fills in. An example is available [here](https://github.com/MAIF/shapash/blob/master/tutorial/report/utils/project_info.yml).\n", "\n", "This tutorial shows how to generate the Shapash Report. \n", "\n", "Content:\n", "- Set up an example project\n", "- Create and fill in the project information that will be displayed in the report\n", "- Generate the base Shapash Report\n", "- *Go further*: Generate a custom report\n", "\n", "Data from Kaggle [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)\n", "\n", "> Note: you may need to download the HTML report and open it locally in your browser; otherwise it may not display properly." 
] }, { "cell_type": "code", "execution_count": null, "id": "automatic-exemption", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from category_encoders import OrdinalEncoder\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "id": "familiar-charm", "metadata": {}, "source": [ "## Building a Supervised Model" ] }, { "cell_type": "code", "execution_count": null, "id": "subtle-sheet", "metadata": {}, "outputs": [], "source": [ "from shapash.data.data_loader import data_loading\n", "\n", "# Load the example dataset and the dictionary of feature labels\n", "house_df, house_dict = data_loading('house_prices')\n", "\n", "# Separate the target from the features\n", "y_df = house_df['SalePrice']\n", "X_df = house_df[house_df.columns.difference(['SalePrice'])]" ] }, { "cell_type": "code", "execution_count": null, "id": "colored-indie", "metadata": {}, "outputs": [], "source": [ "# Ordinal-encode every categorical (object-dtype) column\n", "categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']\n", "\n", "encoder = OrdinalEncoder(\n", "    cols=categorical_features,\n", "    handle_unknown='ignore',\n", "    return_df=True).fit(X_df)\n", "\n", "X_df = encoder.transform(X_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "hispanic-leave", "metadata": {}, "outputs": [], "source": [ "Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "increased-shuttle", "metadata": {}, "outputs": [], "source": [ "regressor = RandomForestRegressor(n_estimators=50).fit(Xtrain, ytrain)" ] }, { "cell_type": "code", "execution_count": null, "id": "floral-cleaners", "metadata": {}, "outputs": [], "source": [ "y_pred = pd.DataFrame(regressor.predict(Xtest), columns=['pred'], index=Xtest.index)" ] }, { "cell_type": "markdown", "id": "satisfactory-baptist", "metadata": {}, "source": [ "## Fill your project information\n", "\n", "**The next step is to create a YML file containing 
information about your project.** \n", "\n", "We will use the example file available [here](https://github.com/MAIF/shapash/blob/master/tutorial/report/utils/project_info.yml). \n", "**You are welcome to use this file as a template for your own report.** \n", "\n", "The cell below displays the information contained in the YML file:" ] }, { "cell_type": "code", "execution_count": null, "id": "searching-northern", "metadata": {}, "outputs": [], "source": [ "import yaml\n", "\n", "# Load the project information and print it in its original key order\n", "with open(r'utils/project_info.yml') as file:\n", "    project_info = yaml.full_load(file)\n", "\n", "print(yaml.dump(project_info, sort_keys=False))" ] }, { "cell_type": "markdown", "id": "northern-bahrain", "metadata": {}, "source": [ "---\n", "**If you want to create your own custom file:**\n", "\n", "The top-level keys of the YML file are the titles of the different sections in the report. \n", "The YML file must therefore follow this format:\n", "\n", "```yaml\n", "Title of section 1: \n", "    property1 name: property1 value \n", "    property2 name: property2 value \n", "    ...\n", "Title of section 2: \n", "    property1 name: property1 value \n", "    ...\n", "```\n", "> Note that the **date** can be computed automatically using the *auto* property value (see the example above)." ] }, { "cell_type": "markdown", "id": "optimum-description", "metadata": {}, "source": [ "## Generate your report" ] }, { "cell_type": "markdown", "id": "suited-honor", "metadata": {}, "source": [ "### Declare and compile the SmartExplainer object" ] }, { "cell_type": "code", "execution_count": null, "id": "automated-hundred", "metadata": {}, "outputs": [], "source": [ "from shapash import SmartExplainer" ] }, { "cell_type": "code", "execution_count": null, "id": "consecutive-venice", "metadata": {}, "outputs": [], "source": [ "xpl = SmartExplainer(\n", "    model=regressor,\n", "    preprocessing=encoder,  # Optional: lets the compile step use the encoder's inverse_transform method\n", "    features_dict=house_dict  # Optional: maps feature names to readable labels 
\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "rising-person", "metadata": {}, "outputs": [], "source": [ "xpl.compile(x=Xtest, y_pred=y_pred, y_target=ytest)" ] }, { "cell_type": "markdown", "id": "focal-liberty", "metadata": {}, "source": [ "At this point, the model can be checked and inspected using the various methods of the SmartExplainer object we just created. \n", "\n", "Please refer to the other tutorials for more information." ] }, { "cell_type": "markdown", "id": "talented-shadow", "metadata": {}, "source": [ "### Generate the base Shapash Report\n", "\n", "Next, we generate the report using the `generate_report` method of our SmartExplainer object.\n", "\n", "We pass the `x_train`, `y_train` and `y_test` parameters so that the report can explore the data used to train the model.\n", "\n", "Please refer to the documentation for a full description of the parameters.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "returning-effort", "metadata": {}, "outputs": [], "source": [ "xpl.generate_report(\n", "    output_file='output/report.html',\n", "    project_info_file='utils/project_info.yml',\n", "    x_train=Xtrain,\n", "    y_train=ytrain,\n", "    y_test=ytest,\n", "    title_story=\"House prices report\",\n", "    title_description=\"\"\"This document is a data science report of the Kaggle house prices tutorial project.\n", "        It was generated using the Shapash library.\"\"\",\n", "    # Each metric is given by its import path and a display name\n", "    metrics=[\n", "        {\n", "            'path': 'sklearn.metrics.mean_absolute_error',\n", "            'name': 'Mean absolute error',\n", "        },\n", "        {\n", "            'path': 'sklearn.metrics.mean_squared_error',\n", "            'name': 'Mean squared error',\n", "        }\n", "    ]\n", ")" ] }, { "cell_type": "markdown", "id": "central-karma", "metadata": {}, "source": [ "> Note: You might want to specify the Jupyter kernel used when generating the report.\n", "To do so, use the `kernel_name` parameter to indicate which kernel to use." 
] }, { "cell_type": "markdown", "id": "elementary-sympathy", "metadata": {}, "source": [ "## Customize your own report" ] }, { "cell_type": "markdown", "id": "grateful-terror", "metadata": {}, "source": [ "Now let's customize our report by adding some new sections.\n", "\n", "To do so:\n", "- First, **copy the base report notebook**, which you can find [here](https://github.com/MAIF/shapash/blob/master/shapash/report/base_report.ipynb). This is the notebook used to generate the Shapash report. It is executed and then converted to an HTML file. Only the output of each cell is kept; the code is removed.\n", "- Then, delete or add cells depending on what you want to change.\n", "- Finally, pass the parameter `notebook_path=\"path/to/your/custom/report.ipynb\"` to the `generate_report` method.\n", "\n", "> **Tip**: You can use the `working_dir` parameter to work inside your custom notebook before calling the `generate_report` method. This lets you load the parameters that papermill passes to the notebook. Replace the `dir_path` inside your custom notebook with your own `working_dir`, the directory where the instances used by the report are saved.\n", "\n", "For our simple example, we created [this notebook](https://github.com/MAIF/shapash/blob/master/tutorial/report/utils/custom_report.ipynb). \n", "- We removed the multivariate analysis by calling `report.display_dataset_analysis(multivariate_analysis=False)` (see the notebook utils/custom_report.ipynb for more information).\n", "- It includes the new sections **Relationship with target variable** and **Relationship between training variables**, in which we added simple new graphs for this example. 
\n", "- We also added new cells at the end of the **metrics** section.\n", "\n", "Next, we use this notebook to generate our new custom report:" ] }, { "cell_type": "code", "execution_count": null, "id": "played-vegetation", "metadata": {}, "outputs": [], "source": [ "xpl.generate_report(\n", "    output_file='output/custom_report.html',\n", "    project_info_file='utils/project_info.yml',\n", "    x_train=Xtrain,\n", "    y_train=ytrain,\n", "    y_test=ytest,\n", "    title_story=\"House prices report\",\n", "    title_description=\"\"\"This document is a data science report of the Kaggle house prices tutorial project.\n", "        It was generated using the Shapash library.\"\"\",\n", "    metrics=[\n", "        {\n", "            'path': 'sklearn.metrics.mean_absolute_error',\n", "            'name': 'Mean absolute error',\n", "        },\n", "        {\n", "            'path': 'sklearn.metrics.mean_squared_error',\n", "            'name': 'Mean squared error',\n", "        }\n", "    ],\n", "    working_dir='working',  # Directory where the instances used by the report are saved\n", "    notebook_path=\"utils/custom_report.ipynb\"  # Custom notebook used in place of the base report\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "direct-cheese", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3.9.13", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "vscode": { "interpreter": { "hash": "6dbaec60c0b0d722a3fa908c2fd7b738d946da6332c67fea5eea602801fdaf43" } } }, "nbformat": 4, "nbformat_minor": 5 }