Groups of features

Shapash allows the use of groups of features.
With groups of features you can regroup variables together and use the different functions of Shapash to analyze these groups.
For example if your model uses a lot of features you may want to regroup features that share a common theme.
This way you can visualize and compare the importance of these themes and how they are used by your model.

Contents of this tutorial: - Build a model - Contruct groups of features - Compile Shapash SmartExplainer with the groups - Start Shapash WebApp - Explore the functions of Shapash using groups - Use groups of features in production with SmartPredictor object

Data from Kaggle House Prices

Motivation

In this use case, we have a lot of features that describe the house very precisely.

However, when analyzing our model, you may want to get more general insights of the themes that are most important in setting the price of a property.
This way, rather than having to check the 6 features describing a garage, you can have a more general idea of how important the garage is by grouping these 6 features together. Shapash allows you to visualize the role of each group in the features importance plot.

Also, you may want to understand why your model predicted such an important price for a specific house. If many features describing the location of the house are contributing slightly more than usual to a higher price, it may not be visible directly that the price is due to the location because of the number of features. But grouping these variables together allows to easily understand a specific prediction. Shapash also allows you to group features together in local plots.

[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split

Building a supervized model

Load House Prices data

[2]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
[3]:
house_df.head()
[3]:
MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood ... EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1 2-Story 1946 & Newer Residential Low Density 8450 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope College Creek ... 0 0 0 0 0 2 2008 Warranty Deed - Conventional Normal Sale 208500
2 1-Story 1946 & Newer All Styles Residential Low Density 9600 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Frontage on 2 sides of property Gentle slope Veenker ... 0 0 0 0 0 5 2007 Warranty Deed - Conventional Normal Sale 181500
3 2-Story 1946 & Newer Residential Low Density 11250 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope College Creek ... 0 0 0 0 0 9 2008 Warranty Deed - Conventional Normal Sale 223500
4 2-Story 1945 & Older Residential Low Density 9550 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Corner lot Gentle slope Crawford ... 272 0 0 0 0 2 2006 Warranty Deed - Conventional Abnormal Sale 140000
5 2-Story 1946 & Newer Residential Low Density 14260 Paved Slightly irregular Near Flat/Level All public Utilities (E,G,W,& S) Frontage on 2 sides of property Gentle slope Northridge ... 0 0 0 0 0 12 2008 Warranty Deed - Conventional Normal Sale 250000

5 rows × 73 columns

[4]:
y = house_df['SalePrice']
X = house_df.drop('SalePrice', axis=1)

Encoding Categorical Features

[5]:
categorical_features = [col for col in X.columns if X[col].dtype == 'object']

encoder = OrdinalEncoder(
    cols=categorical_features,
    handle_unknown='ignore',
    return_df=True
).fit(X)

X = encoder.transform(X)

Train / Test Split

[6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1)

Model fitting

[7]:
regressor = LGBMRegressor(n_estimators=200).fit(X_train, y_train)

Construct groups of features

There are quite a lot of features used by the model and it can be hard to compare them.

We can regroup the features that share similarities in order to identify which topic is important.

In our example we constructed the following new groups : - location: features related to the location of the house - size: features that measure part of the house - aspect: features that evaluate the style of any part of the house - condition: features related to the quality of anything in the house - configuration: features about the general configuration / shape of the house - equipment: features that describe the equipment of the house (electricity, gas, heating…) - garage: features related to the garage (style, …) - sale: features related to the sale of the house

[8]:
# We construct the groups as a dictionary of string keys and list of string values
# All the features inside the list will belong to the same group
features_groups = {
    "location": ["MSZoning", "Neighborhood", "Condition1", "Condition2"],
    "size": [
        "LotArea",
        "MasVnrArea",
        "BsmtQual",
        "BsmtFinSF2",
        "BsmtUnfSF",
        "TotalBsmtSF",
        "1stFlrSF",
        "2ndFlrSF",
        "GrLivArea",
        "WoodDeckSF",
        "OpenPorchSF",
        "EnclosedPorch",
        "3SsnPorch",
        "ScreenPorch",
        "PoolArea",
        "BsmtFinSF1"
    ],
    "aspect": [
        "LotShape",
        "LandContour",
        "RoofStyle",
        "RoofMatl",
        "Exterior1st",
        "MasVnrType",
    ],
    "condition": [
        "OverallQual",
        "OverallCond",
        "ExterQual",
        "ExterCond",
        "BsmtCond",
        "BsmtFinType1",
        "BsmtFinType2",
        "HeatingQC",
        "KitchenQual"
    ],
    "configuration": ["LotConfig", "LandSlope", "BldgType", "HouseStyle"],
    "equipment": ["Heating", "CentralAir", "Electrical"],
    "garage": [
        "GarageType",
        "GarageYrBlt",
        "GarageFinish",
        "GarageArea",
        "GarageQual",
        "GarageCond",
    ],
    "sale": ["SaleType", "SaleCondition", "MoSold", "YrSold"]
}

Optional : we can also give labels to groups names

[9]:
groups_labels = {
    'location': 'Location of the property',
    'size' : 'Size of different elements in the house',
    'aspect': 'Aspect of the house',
    'condition': 'Quality of the materials and parts of the property',
    'configuration': 'Configuration of the house',
    'equipment': 'All equipments',
    'garage': 'Garage group of features',
    'sale': 'Sale information'
}
house_dict.update(groups_labels)

Compile Shapash SmartExplainer object using groups

[10]:
from shapash import SmartExplainer
# optional parameter, specifies label for features and groups name
xpl = SmartExplainer(
    model=regressor,
    preprocessing=encoder,
    features_groups=features_groups,
    features_dict=house_dict
)
[ ]:
xpl.compile(x=X_test,
    y_target=y_test, # Optional: allows to display True Values vs Predicted Values
    )

Start WebApp

We can now start the webapp using the following cell.

The groups of features are visible by default on the features importance plot.
You can disable the groups using the groups switch button.

Also you can click on a group’s bar to display the features importance of the features inside the group.

[ ]:
app = xpl.run_app(title_story='House Prices')

Stop the WebApp after using it

[ ]:
app.kill()

Explore the functions of Shapash using groups

Features importance plot

Display the features importance plot that includes the groups and excludes the features inside each group

[15]:
xpl.plot.features_importance(selection=[259, 268])
../../_images/tutorials_common_tuto-common01-groups_of_features_30_0.png

Display the features importance plot of the features inside one group

[16]:
xpl.plot.features_importance(group_name='size')
../../_images/tutorials_common_tuto-common01-groups_of_features_32_0.png

Contribution plot

Plot the shap values of each observation of a group of features
The features values were projected on the x axis using t-SNE.
The values of the features (top 4 features only) can be visualized using the hover text.
[17]:
xpl.plot.contribution_plot('size')
../../_images/tutorials_common_tuto-common01-groups_of_features_35_0.png

Local plot

By default, Shapash will display the groups in the local plot.

You can directly see the impact of the different groups of features for the given observation.

[18]:
xpl.filter(max_contrib=8)
[19]:
xpl.plot.local_plot(index=629)
../../_images/tutorials_common_tuto-common01-groups_of_features_39_0.png

You can also display the features without the groups using the following parameters :

[20]:
xpl.filter(max_contrib=6, display_groups=False)
[21]:
xpl.plot.local_plot(index=259, display_groups=False)
../../_images/tutorials_common_tuto-common01-groups_of_features_42_0.png

Use groups of features in production with SmartPredictor object

[22]:
predictor = xpl.to_smartpredictor()

Create an imput and use add_input method of SmartPredictor object

[23]:
sample_input = house_df.sample(4).drop('SalePrice', axis=1)
sample_input
[23]:
MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope Neighborhood ... OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
Id
278 1-Story 1946 & Newer All Styles Residential Low Density 19138 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Corner lot Gentle slope Gilbert ... 0 0 0 0 0 0 6 2010 Warranty Deed - Conventional Normal Sale
82 1-Story PUD (Planned Unit Development) - 1946 ... Residential Medium Density 4500 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Frontage on 2 sides of property Gentle slope Mitchell ... 199 0 0 0 0 0 3 2006 Warranty Deed - Conventional Normal Sale
948 1-Story 1946 & Newer All Styles Residential Low Density 14536 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope Timberland ... 252 0 0 0 0 0 11 2007 Warranty Deed - Conventional Normal Sale
1262 1-Story 1946 & Newer All Styles Residential Low Density 9600 Paved Regular Near Flat/Level All public Utilities (E,G,W,& S) Inside lot Gentle slope North Ames ... 0 0 0 0 0 0 6 2009 Warranty Deed - Conventional Normal Sale

4 rows × 72 columns

[24]:
predictor.add_input(sample_input)
Get detailed explanability associated to the predictions on this input
The contributions will contain the groups we created by default but you can replace the groups by their corresponding features using the use_groups parameter
[25]:
predictor.detail_contributions()
[25]:
ypred MSSubClass Street Utilities YearBuilt YearRemodAdd Exterior2nd Foundation BsmtExposure LowQualFinSF ... PavedDrive MiscVal location size aspect condition configuration equipment garage sale
Id
278 130844.267709 513.659708 0.0 0.0 -663.898088 -3069.707556 107.280103 -147.979932 -320.832096 0.0 ... 114.775449 -8.115543 381.872846 -14396.089946 389.034058 -34936.984759 136.495324 388.903141 2581.918959 852.852788
82 172074.308012 -1794.251940 0.0 0.0 8556.953143 341.624509 280.749670 157.831718 -741.368822 0.0 ... 36.213574 -13.952196 695.534998 2654.099001 132.778319 -20731.306929 -429.927436 197.002092 866.600207 -113.465212
948 270609.542311 809.401100 0.0 0.0 2062.573431 2274.633986 338.187922 145.925325 804.144542 0.0 ... 36.671406 -8.789213 219.232590 27418.095044 -1192.718922 58309.691558 -244.246746 59.576711 -373.060204 -1823.935655
1262 130366.146017 586.962923 0.0 0.0 -1281.558493 -3554.663787 29.700023 -178.384582 -409.165873 0.0 ... 118.746022 -20.226779 378.014782 -16687.534163 -42.260808 -24222.527707 -141.320268 136.621081 -3446.240527 441.515370

4 rows × 29 columns

[26]:
# Replace groups of features we created with their corresponding features contributions
predictor.detail_contributions(use_groups=False)
[26]:
ypred MSSubClass MSZoning LotArea Street LotShape LandContour Utilities LotConfig LandSlope ... OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
Id
278 130844.267709 513.659708 198.903599 12691.377983 0.0 -581.800537 122.547776 0.0 156.758117 0.0 ... -992.956104 -46.012302 0.0 -261.322952 0.0 -8.115543 809.554609 -277.564120 -155.385399 476.247697
82 172074.308012 -1794.251940 -477.108026 -6492.520867 0.0 -410.318176 55.697919 0.0 -487.737357 0.0 ... 7542.237778 6.606935 0.0 -292.337551 0.0 -13.952196 -174.894504 403.112039 -463.495150 121.812402
948 270609.542311 809.401100 124.049805 7126.125020 0.0 -702.467025 -154.403730 0.0 -255.183036 0.0 ... 11792.032451 15.080870 0.0 -372.007690 0.0 -8.789213 -1078.558864 433.762031 -1192.828411 13.689590
1262 130366.146017 586.962923 342.502857 -263.440875 0.0 -501.285776 125.195420 0.0 -116.788988 0.0 ... -815.781795 -79.706597 0.0 -225.931164 0.0 -20.226779 269.715464 -325.912085 -96.440432 594.152423

4 rows × 73 columns

Compute a summary of these contributions
Configure the summary using the modify_mask method :
[27]:
predictor.modify_mask(max_contrib=4)

The summarize method will contain the groups of features contributions and the value_x columns will contain all the values of the features of the corresponding group as a dict.

[28]:
predictor.summarize()
[28]:
ypred feature_1 value_1 contribution_1 feature_2 value_2 contribution_2 feature_3 value_3 contribution_3 feature_4 value_4 contribution_4
278 130844.267709 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 4... -34936.984759 Size of different elements in the house {'Lot size square feet': 19138.0, 'Masonry ven... -14396.089946 Remodel date 1951 -3069.707556 Number of fireplaces 0 -2914.650743
82 172074.308012 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 6... -20731.306929 Original construction date 1998 8556.953143 Size of different elements in the house {'Lot size square feet': 4500.0, 'Masonry vene... 2654.099001 Building Class 1-Story PUD (Planned Unit Development) - 1946 ... -1794.25194
948 270609.542311 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 8... 58309.691558 Size of different elements in the house {'Lot size square feet': 14536.0, 'Masonry ven... 27418.095044 Total rooms above grade 9 -3238.097105 Remodel date 2003 2274.633986
1262 130366.146017 Quality of the materials and parts of the prop... {'Overall material and finish of the house': 5... -24222.527707 Size of different elements in the house {'Lot size square feet': 9600.0, 'Masonry vene... -16687.534163 Remodel date 1956 -3554.663787 Garage group of features {'Garage location': 1.0, 'Year garage was buil... -3446.240527
[29]:
# Removes the groups of features in the summary and replace them with their corresponding features
predictor.summarize(use_groups=False)
[29]:
ypred feature_1 value_1 contribution_1 feature_2 value_2 contribution_2 feature_3 value_3 contribution_3 feature_4 value_4 contribution_4
278 130844.267709 Overall material and finish of the house 4 -31398.569429 Ground living area square feet 864 -15548.964313 Lot size square feet 19138 12691.377983 Total square feet of basement area 864 -6352.627145
82 172074.308012 Overall material and finish of the house 6 -18709.409714 Original construction date 1998 8556.953143 Ground living area square feet 1337 -8081.588619 Open porch area in square feet 199 7542.237778
948 270609.542311 Overall material and finish of the house 8 66435.541722 Open porch area in square feet 252 11792.032451 Total square feet of basement area 1616 9210.438644 Type 1 finished square feet 1300 9098.524863
1262 130366.146017 Overall material and finish of the house 5 -23653.042515 Ground living area square feet 1050 -13069.750375 Remodel date 1956 -3554.663787 Size of garage in square feet 338 -3424.218904