Groups of features

Shapash allows the use of groups of features.
Groups let you bundle related variables together and analyze them with the usual Shapash functions.
For example, if your model uses a lot of features, you may want to group the features that share a common theme.
This way you can visualize and compare the importance of these themes and how they are used by your model.

Contents of this tutorial:
- Build a model
- Construct groups of features
- Compile Shapash SmartExplainer with the groups
- Start Shapash WebApp
- Explore the functions of Shapash using groups
- Use groups of features in production with SmartPredictor object

Data from Kaggle House Prices

Motivation

In this use case, we have a lot of features that describe the house very precisely.

However, when analyzing the model, you may want more general insights into the themes that matter most in setting the price of a property.
This way, rather than checking the 6 features describing a garage one by one, you can get a general idea of how important the garage is by grouping these 6 features together. Shapash allows you to visualize the role of each group in the features importance plot.

Also, you may want to understand why your model predicted such a high price for a specific house. If many features describing the location of the house each contribute slightly more than usual to a higher price, the sheer number of features can hide the fact that the price is driven by the location. Grouping these variables together makes a specific prediction easier to understand. Shapash also allows you to group features together in local plots.

[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

Building a supervised model

Load House Prices data

[2]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
[3]:
house_df.head()
[3]:
MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 2-Story 1946 & Newer Residential Low Density 8450 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope College Creek ... 0 0 0 0 0 2 2008 Warranty Deed - Conventional Normal Sale 208500
2 1-Story 1946 & Newer All Styles Residential Low Density 9600 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Frontage on 2 sides of property Gentle slope Veenker ... 0 0 0 0 0 5 2007 Warranty Deed - Conventional Normal Sale 181500
3 2-Story 1946 & Newer Residential Low Density 11250 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope College Creek ... 0 0 0 0 0 9 2008 Warranty Deed - Conventional Normal Sale 223500
4 2-Story 1945 & Older Residential Low Density 9550 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Corner lot Gentle slope Crawford ... 272 0 0 0 0 2 2006 Warranty Deed - Conventional Abnormal Sale 140000
5 2-Story 1946 & Newer Residential Low Density 14260 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Frontage on 2 sides of property Gentle slope Northridge ... 0 0 0 0 0 12 2008 Warranty Deed - Conventional Normal Sale 250000

5 rows × 73 columns

[4]:
y = house_df['SalePrice']
X = house_df.drop('SalePrice', axis=1)

Encoding Categorical Features

[5]:
categorical_features = [col for col in X.columns if X[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True
).fit(X)

X = encoder.transform(X)

Train / Test Split

[6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1)

Model fitting

[7]:
regressor = LGBMRegressor(n_estimators=200).fit(X_train, y_train)
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001653 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2986
[LightGBM] [Info] Number of data points in the train set: 1095, number of used features: 66
[LightGBM] [Info] Start training from score 182319.757078

Construct groups of features

There are quite a lot of features used by the model and it can be hard to compare them.

We can regroup the features that share similarities in order to identify which topic is important.

In our example we constructed the following new groups:
- location: features related to the location of the house
- size: features that measure part of the house
- aspect: features that evaluate the style of any part of the house
- condition: features related to the quality of anything in the house
- configuration: features about the general configuration / shape of the house
- equipment: features that describe the equipment of the house (electricity, gas, heating…)
- garage: features related to the garage (style, …)
- sale: features related to the sale of the house

[8]:
# We construct the groups as a dictionary of string keys and list of string values
# All the features inside the list will belong to the same group
features_groups = {
    "location": ["MSZoning", "Neighborhood", "Condition1", "Condition2"],
    "size": [
        "LotArea",
        "MasVnrArea",
        "BsmtQual",
        "BsmtFinSF2",
        "BsmtUnfSF",
        "TotalBsmtSF",
        "1stFlrSF",
        "2ndFlrSF",
        "GrLivArea",
        "WoodDeckSF",
        "OpenPorchSF",
        "EnclosedPorch",
        "3SsnPorch",
        "ScreenPorch",
        "PoolArea",
        "BsmtFinSF1"
    ],
    "aspect": [
        "LotShape",
        "LandContour",
        "RoofStyle",
        "RoofMatl",
        "Exterior1st",
        "MasVnrType",
    ],
    "condition": [
        "OverallQual",
        "OverallCond",
        "ExterQual",
        "ExterCond",
        "BsmtCond",
        "BsmtFinType1",
        "BsmtFinType2",
        "HeatingQC",
        "KitchenQual"
    ],
    "configuration": ["LotConfig", "LandSlope", "BldgType", "HouseStyle"],
    "equipment": ["Heating", "CentralAir", "Electrical"],
    "garage": [
        "GarageType",
        "GarageYrBlt",
        "GarageFinish",
        "GarageArea",
        "GarageQual",
        "GarageCond",
    ],
    "sale": ["SaleType", "SaleCondition", "MoSold", "YrSold"]
}
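
Before passing such a dictionary to Shapash, it can be worth checking that every grouped feature is an actual column and that no feature is listed in two groups. The helper below is a hypothetical sketch (`check_groups` is not part of Shapash), shown on a small toy column list rather than on the real dataset:

```python
from collections import Counter


def check_groups(groups, columns):
    """Return (missing, duplicated) feature names for a groups dictionary.

    Hypothetical helper for illustration, not part of Shapash.
    """
    # Flatten all group members into a single list
    grouped = [feat for feats in groups.values() for feat in feats]
    known = set(columns)
    # Features named in a group but absent from the data
    missing = sorted(f for f in set(grouped) if f not in known)
    # Features that appear more than once across all groups
    duplicated = sorted(f for f, n in Counter(grouped).items() if n > 1)
    return missing, duplicated


toy_columns = ["LotArea", "GrLivArea", "GarageArea", "GarageType", "YrSold"]
toy_groups = {
    "size": ["LotArea", "GrLivArea", "GarageArea"],
    "garage": ["GarageType", "GarageArea"],  # GarageArea is also in "size"
    "sale": ["YrSold", "MoSold"],            # MoSold is not in toy_columns
}
missing, duplicated = check_groups(toy_groups, toy_columns)
print(missing, duplicated)
```

With the objects of this tutorial, the equivalent call would be `check_groups(features_groups, X.columns)`.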

Optional: we can also give labels to the group names

[9]:
groups_labels = {
    'location': 'Location of the property',
    'size' : 'Size of different elements in the house',
    'aspect': 'Aspect of the house',
    'condition': 'Quality of the materials and parts of the property',
    'configuration': 'Configuration of the house',
    'equipment': 'All equipments',
    'garage': 'Garage group of features',
    'sale': 'Sale information'
}
house_dict.update(groups_labels)
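
Because `dict.update` overwrites existing keys, a group name that matches a feature name already present in the labels dictionary would silently replace that feature's label. A minimal sketch of a guard, using a small stand-in dictionary instead of the real `house_dict`:

```python
# Stand-in for house_dict: feature name -> human-readable label (toy subset)
labels = {
    "LotArea": "Lot size square feet",
    "GarageArea": "Size of garage in square feet",
}
group_labels = {
    "garage": "Garage group of features",
    "size": "Size of different elements in the house",
}

# Keys present in both dictionaries would be overwritten by update()
collisions = sorted(set(labels) & set(group_labels))
if collisions:
    raise ValueError(f"Group names clash with feature names: {collisions}")

labels.update(group_labels)
print(labels["garage"])
```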

Compile Shapash SmartExplainer object using groups

[10]:
from shapash import SmartExplainer
xpl = SmartExplainer(
    model=regressor,
    preprocessing=encoder,
    features_groups=features_groups,
    features_dict=house_dict  # Optional: specifies labels for features and group names
)
[11]:
xpl.compile(
    x=X_test,
    y_target=y_test,  # Optional: used to display True Values vs Predicted Values
)
INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x7fcf441b0370>

Start WebApp

We can now start the webapp using the following cell.

The groups of features are visible by default on the features importance plot.
You can disable the groups using the groups switch button.

You can also click on a group’s bar to display the importance of the features inside that group.

[ ]:
app = xpl.run_app(title_story='House Prices')

Stop the WebApp after using it

[ ]:
app.kill()

Explore the functions of Shapash using groups

Features importance plot

Display the features importance plot that includes the groups and excludes the features inside each group. Here the selection parameter passes a subset of row indices (259 and 268) to the plot.

[14]:
xpl.plot.features_importance(selection=[259, 268])
[14]:
../../_images/tutorials_common_tuto-common01-groups_of_features_30_0.png

Display the features importance plot of the features inside one group

[15]:
xpl.plot.features_importance(group_name='size')
[15]:
../../_images/tutorials_common_tuto-common01-groups_of_features_32_0.png

Contribution plot

Plot the SHAP values of each observation for a group of features.
The feature values are projected onto the x axis using t-SNE.
The values of the features (top 4 features only) can be viewed in the hover text.
[16]:
xpl.plot.contribution_plot('size')
[16]:
../../_images/tutorials_common_tuto-common01-groups_of_features_35_0.png

Local plot

By default, Shapash will display the groups in the local plot.

You can directly see the impact of the different groups of features for the given observation.

[17]:
xpl.filter(max_contrib=8)
[18]:
xpl.plot.local_plot(index=629)
[18]:
../../_images/tutorials_common_tuto-common01-groups_of_features_39_0.png

You can also display the features without the groups using the following parameters:

[19]:
xpl.filter(max_contrib=6, display_groups=False)
[20]:
xpl.plot.local_plot(index=259, display_groups=False)
[20]:
../../_images/tutorials_common_tuto-common01-groups_of_features_42_0.png

Use groups of features in production with SmartPredictor object

[21]:
predictor = xpl.to_smartpredictor()

Create an input and use the add_input method of the SmartPredictor object

[22]:
sample_input = house_df.sample(4).drop('SalePrice', axis=1)
sample_input
[22]:
MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood ... OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
Id
436 2-Story 1946 & Newer Residential Low Density 10667 Paved Moderately Irregular Near Flat/Level All public Utilities (E,G,W,& S) Cul-de-sac Gentle slope College Creek ... 61 0 0 0 0 0 4 2009 Contract Low Down payment and low interest Normal Sale
901 1-Story 1946 & Newer All Styles Residential Low Density 7340 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope North Ames ... 0 0 0 0 0 0 6 2007 Warranty Deed - Conventional Normal Sale
1355 2-Story 1946 & Newer Residential Low Density 10316 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope College Creek ... 184 0 0 0 0 0 6 2008 Warranty Deed - Conventional Normal Sale
244 2-Story PUD-1946 & Newer Residential Low Density 10762 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Corner lot Gentle slope Sawyer West ... 28 0 0 0 0 0 4 2009 Warranty Deed - Conventional Normal Sale

4 rows × 72 columns

[23]:
predictor.add_input(sample_input)
INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x7fcf441b0370>

Get the detailed explainability associated with the predictions on this input.
By default the contributions contain the groups we created, but you can replace the groups with their corresponding features using the use_groups parameter.
[24]:
predictor.detail_contributions()
INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x7fcf441b0370>
[24]:
ypred MSSubClass Street Utilities YearBuilt YearRemodAdd Exterior2nd Foundation BsmtExposure LowQualFinSF ... PavedDrive MiscVal location size aspect condition configuration equipment garage sale
Id
436 204823.181059 2065.214959 0.0 0.0 2967.425942 1629.004923 579.479561 1239.153752 1213.533844 0.0 ... 43.823258 -2.583300 -657.282753 -7769.288367 2521.394322 12166.952238 2009.857231 13.213734 2170.635423 -1020.101898
901 118236.106783 656.087505 0.0 0.0 -132.488062 -1755.079708 -218.324403 -611.490088 -450.037449 0.0 ... 72.398246 -1.615371 -184.701377 -31556.344341 -215.744630 -31014.502704 -245.515921 410.619126 3793.634968 886.022752
1355 235004.669339 2998.973867 0.0 0.0 2729.870508 1982.290245 659.715052 736.225878 -772.037812 0.0 ... 35.369545 -2.293088 1037.882932 11574.270636 270.462779 12375.807920 19.729604 24.995704 16054.071349 83.051360
244 122412.294643 -1694.327005 0.0 0.0 179.740672 -2287.430112 -337.729828 -171.935819 -718.701115 0.0 ... 45.044229 -2.693414 -523.380963 -29962.163070 720.138010 -18221.657298 -307.088907 164.191467 -7443.766684 -862.593173

4 rows × 29 columns
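
As a rough reading aid, the eight group columns of the output above can be ranked by their mean absolute contribution over the four sampled houses. This is only an illustrative sketch: the numbers are copied from the table, four rows are far too few for a real importance estimate, and the mean-absolute-value formula is a common convention assumed here, not something taken from Shapash's source.

```python
# Group contributions copied from the detail_contributions() output above
group_names = ["location", "size", "aspect", "condition",
               "configuration", "equipment", "garage", "sale"]
rows = {
    436: [-657.282753, -7769.288367, 2521.394322, 12166.952238,
          2009.857231, 13.213734, 2170.635423, -1020.101898],
    901: [-184.701377, -31556.344341, -215.744630, -31014.502704,
          -245.515921, 410.619126, 3793.634968, 886.022752],
    1355: [1037.882932, 11574.270636, 270.462779, 12375.807920,
           19.729604, 24.995704, 16054.071349, 83.051360],
    244: [-523.380963, -29962.163070, 720.138010, -18221.657298,
          -307.088907, 164.191467, -7443.766684, -862.593173],
}

# Mean absolute contribution per group (assumed importance proxy)
mean_abs = {
    name: sum(abs(values[i]) for values in rows.values()) / len(rows)
    for i, name in enumerate(group_names)
}

# Groups sorted from most to least influential on these four rows
ranking = sorted(mean_abs, key=mean_abs.get, reverse=True)
for name in ranking:
    print(f"{name:>13}: {mean_abs[name]:10.2f}")
```

On these four houses the size, condition and garage groups come first, which is consistent with the summaries computed later in this tutorial.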

[25]:
# Replace groups of features we created with their corresponding features contributions
predictor.detail_contributions(use_groups=False)
INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x7fcf441b0370>
[25]:
ypred MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
Id
436 204823.181059 2065.214959 439.980549 1122.624757 0.0 2552.768486 28.023946 0.0 1843.333050 -8.438472 ... 1180.499098 45.200121 0.0 -275.513539 0.0 -2.583300 -234.466291 -946.579780 85.185242 75.758931
901 118236.106783 656.087505 304.793800 -1235.148056 0.0 371.868060 188.844755 0.0 -82.070475 -13.076696 ... -446.884475 -29.199097 0.0 -183.907781 0.0 -1.615371 637.739506 193.961044 -105.063845 159.386047
1355 235004.669339 2998.973867 230.518310 272.025317 0.0 111.483129 38.928399 0.0 -281.796337 -31.775538 ... 3254.710492 67.086326 0.0 -352.065611 0.0 -2.293088 233.357277 -70.085844 -164.508467 84.288395
244 122412.294643 -1694.327005 800.561360 977.557371 0.0 -1066.173916 56.128030 0.0 -201.989625 -36.830631 ... -383.480612 -74.102357 0.0 -200.089675 0.0 -2.693414 -906.830783 -378.732127 -169.873709 592.843446

4 rows × 73 columns
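
Comparing the two outputs suggests that the contribution of a group is the sum of the contributions of its features. The check below copies the four `sale` features (MoSold, YrSold, SaleType, SaleCondition) from the ungrouped output and compares their sum with the `sale` column of the grouped output. It is an observation on these four rows, offered as a sketch and not as a description of Shapash internals.

```python
import math

# Per-feature contributions of the "sale" group (use_groups=False output)
# Order: MoSold, YrSold, SaleType, SaleCondition
sale_features = {
    436: [-234.466291, -946.579780, 85.185242, 75.758931],
    901: [637.739506, 193.961044, -105.063845, 159.386047],
    1355: [233.357277, -70.085844, -164.508467, 84.288395],
    244: [-906.830783, -378.732127, -169.873709, 592.843446],
}
# "sale" column of the grouped output (default behaviour)
sale_group = {436: -1020.101898, 901: 886.022752,
              1355: 83.051360, 244: -862.593173}

for house_id, contribs in sale_features.items():
    total = sum(contribs)
    # Displayed values are rounded to 6 decimals, hence the tolerance
    assert math.isclose(total, sale_group[house_id], abs_tol=1e-5)
    print(house_id, round(total, 6), sale_group[house_id])
```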

Compute a summary of these contributions.
Configure the summary using the modify_mask method:
[26]:
predictor.modify_mask(max_contrib=4)

The output of the summarize method contains the contributions of the groups of features. For a group, the value_x column holds the values of all the features in that group as a dict.

[27]:
predictor.summarize()
[27]:
ypred feature_1 value_1 contribution_1 feature_2 value_2 contribution_2 feature_3 value_3 contribution_3 feature_4 value_4 contribution_4
436 204823.181059 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 7... 12166.952238 Size of different elements in the house {'Lot size square feet': 10667.0, 'Masonry ven... -7769.288367 Original construction date 1996 2967.425942 Aspect of the house {'General shape of property': 3, 'Flatness of ... 2521.394322
901 118236.106783 Size of different elements in the house {'Lot size square feet': 7340.0, 'Masonry vene... -31556.344341 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 4... -31014.502704 Garage group of features {'Garage location': 2.0, 'Year garage was buil... 3793.634968 Number of fireplaces 0 -1978.674665
1355 235004.669339 Garage group of features {'Garage location': 1.0, 'Year garage was buil... 16054.071349 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 7... 12375.80792 Size of different elements in the house {'Lot size square feet': 10316.0, 'Masonry ven... 11574.270636 Building Class 2-Story 1946 & Newer 2998.973867
244 122412.294643 Size of different elements in the house {'Lot size square feet': 10762.0, 'Masonry ven... -29962.16307 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 6... -18221.657298 Garage group of features {'Garage location': 1.0, 'Year garage was buil... -7443.766684 Remodel date 1980 -2287.430112
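
The first row can be reproduced by hand: keeping the four largest contributions in absolute value for house 436 gives the same columns as the summary. The sketch uses only the columns visible in the truncated `detail_contributions()` output, so it illustrates a top-k selection under that assumption and is not Shapash's implementation (`top_k` is a hypothetical name).

```python
def top_k(contributions, k):
    """Names of the k contributions with the largest absolute value."""
    ordered = sorted(contributions,
                     key=lambda name: abs(contributions[name]),
                     reverse=True)
    return ordered[:k]


# Visible contributions for house 436 (grouped output, hidden columns omitted)
house_436 = {
    "MSSubClass": 2065.214959, "Street": 0.0, "Utilities": 0.0,
    "YearBuilt": 2967.425942, "YearRemodAdd": 1629.004923,
    "Exterior2nd": 579.479561, "Foundation": 1239.153752,
    "BsmtExposure": 1213.533844, "LowQualFinSF": 0.0,
    "PavedDrive": 43.823258, "MiscVal": -2.583300,
    "location": -657.282753, "size": -7769.288367,
    "aspect": 2521.394322, "condition": 12166.952238,
    "configuration": 2009.857231, "equipment": 13.213734,
    "garage": 2170.635423, "sale": -1020.101898,
}

print(top_k(house_436, 4))
# → ['condition', 'size', 'YearBuilt', 'aspect']
```

`YearBuilt` is the column labelled "Original construction date" in the summary, as its contribution (2967.425942) shows.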
[28]:
# Remove the groups of features from the summary and replace them with their corresponding features
predictor.summarize(use_groups=False)
[28]:
ypred feature_1 value_1 contribution_1 feature_2 value_2 contribution_2 feature_3 value_3 contribution_3 feature_4 value_4 contribution_4
436 204823.181059 Overall material and finish of the house 7 8208.262746 Total square feet of basement area 799 -6415.71848 Original construction date 1996 2967.425942 General shape of property Moderately Irregular 2552.768486
901 118236.106783 Overall material and finish of the house 4 -32815.241448 Ground living area square feet 858 -20828.532247 Total square feet of basement area 858 -5998.344197 Size of garage in square feet 684 4455.152029
1355 235004.669339 Ground living area square feet 1865 21066.734064 Size of garage in square feet 839 16226.244604 Overall material and finish of the house 7 11732.738552 Total square feet of basement area 992 -6750.784976
244 122412.294643 Overall material and finish of the house 6 -16853.107942 Ground living area square feet 1217 -10930.60124 Total square feet of basement area 626 -8098.740953 Size of garage in square feet 288 -7579.75253