Shapash Report¶
The Shapash Report feature allows data scientists to deliver to anyone who is interested in their project a document that freezes different aspects of their work as a basis of an audit report. This document can be easily shared across teams and does not require anything else than a working internet connexion.
generate_report
method allows to generate a report of your project.The report contains the following information : 1. General information about the project 2. Description of the dataset used 3. Documentation about data preparation and feature engineering 4. Details about your model used (library, parameters…) 5. Exploration of the data with a focus on the difference between train and test sets 6. Global explainability of the model 7. Model performance
The first three points are generated using a YML file that the user should fill. An example is available here.
This tutorial presents an example of how one can generate the Shapash Report.
Content: - Set up an example project - Create and fill your project information that will be displayed in the report - Generate the base Shapash Report - Go further: Generate a custom report
Data from Kaggle House Prices
Note : you may need to download the HTML report locally and open it in your browser otherwise it may not show properly.
[ ]:
import pandas as pd
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
Building Supervized Model¶
[ ]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
y_df=house_df['SalePrice']
X_df=house_df[house_df.columns.difference(['SalePrice'])]
[ ]:
from category_encoders import OrdinalEncoder
categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']
encoder = OrdinalEncoder(
cols=categorical_features,
handle_unknown='ignore',
return_df=True).fit(X_df)
X_df = encoder.transform(X_df)
[ ]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75, random_state=1)
[ ]:
regressor = RandomForestRegressor(n_estimators=50).fit(Xtrain, ytrain)
[ ]:
y_pred = pd.DataFrame(regressor.predict(Xtest),columns=['pred'], index=Xtest.index)
Fill your project information¶
The next step is to create a YML file containing information about your project.
We display the information contained in the YML file below :
[ ]:
import yaml
with open(r'utils/project_info.yml') as file:
project_info = yaml.full_load(file)
print(yaml.dump(project_info, sort_keys=False))
If you want to create your own custom file :
Title of section 1:
property1 name: property1 value
property2 name: property2 value
...
Title of section 2:
property1 name: property1 value
...
Note that the date can be computed automatically using the auto property value (see example above)
Generate your report¶
Declare and compile SmartExplainer object¶
[ ]:
from shapash import SmartExplainer
[ ]:
xpl = SmartExplainer(
model=regressor,
preprocessing=encoder, # Optional: compile step can use inverse_transform method
features_dict=house_dict # optional parameter, specifies label for features name
)
[ ]:
xpl.compile(x=Xtest, y_pred=y_pred, y_target=ytest)
At this step the model can be checked and inspected using different methods of the SmartExplainer object we just created.
Please refer to the other tutorials for more information.
Generate the base Shapash Report¶
Next we can generate the report using the generate_report
method of our SmartExplainer object.
We need to pass x_train
, y_train
and y_test
parameters in order to explore the data used when training the model.
Please refer to the documentation for a full description of the parameters.
[ ]:
xpl.generate_report(
output_file='output/report.html',
project_info_file='utils/project_info.yml',
x_train=Xtrain,
y_train=ytrain,
y_test=ytest,
title_story="House prices report",
title_description="""This document is a data science report of the kaggle house prices tutorial project.
It was generated using the Shapash library.""",
metrics=[
{
'path': 'sklearn.metrics.mean_absolute_error',
'name': 'Mean absolute error',
},
{
'path': 'sklearn.metrics.mean_squared_error',
'name': 'Mean squared error',
}
]
)
Note: You might want to specify the jupyter kernel used when generating the report. You should consider using the
kernel_name
parameter to indicate what kernel to use.
Customize your own report¶
Now let’s customize our report by adding some new sections.
To do so : - First, copy the base report notebook you can find here. This is the notebook that is used to generate the shapash report. It is executed and then converted to an HTML file. Only the output of each cell is kept and the code is deleted. - Then, delete or add cells depending on what you want to change. - Finally, add the parameter notebook_path="path/to/your/custom/report.ipynb"
in the
generate_report
method.
Tip : You can use the
working_dir
parameter to easily work inside your custom notebook before using thegenerate_report
method. This way you can load the parameters used inside the notebook by papermill. Replace thedir_path
inside your custom notebook with your ownworking_dir
where are saved the different instances used.
For our simple example, we created this notebook. - We removed the multivariate analysis using the report.display_dataset_analysis(multivariate_analysis=False)
(see notebook utils/custom_report.ipynb for more information) - It includes new sections Relashionship with target variable and Relashionship between training variables in which we included new simple graphs for this example. - We also
added new cells at the end of the metrics section.
Next, we use this notebook to generate our new custom report :
[ ]:
xpl.generate_report(
output_file='output/custom_report.html',
project_info_file='utils/project_info.yml',
x_train=Xtrain,
y_train=ytrain,
y_test=ytest,
title_story="House prices report",
title_description="""This document is a data science report of the kaggle house prices tutorial project.
It was generated using the Shapash library.""",
metrics=[
{
'path': 'sklearn.metrics.mean_absolute_error',
'name': 'Mean absolute error',
},
{
'path': 'sklearn.metrics.mean_squared_error',
'name': 'Mean squared error',
}
],
working_dir='working',
notebook_path="utils/custom_report.ipynb"
)
[ ]: