Groups of features¶
Contents of this tutorial: - Build a model - Contruct groups of features - Compile Shapash SmartExplainer with the groups - Start Shapash WebApp - Explore the functions of Shapash using groups - Use groups of features in production with SmartPredictor object
Data from Kaggle House Prices
Motivation¶
In this use case, we have a lot of features that describe the house very precisely.
Also, you may want to understand why your model predicted such an important price for a specific house. If many features describing the location of the house are contributing slightly more than usual to a higher price, it may not be visible directly that the price is due to the location because of the number of features. But grouping these variables together allows to easily understand a specific prediction. Shapash also allows you to group features together in local plots.
[1]:
import pandas as pd
from category_encoders import OrdinalEncoder
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
Building a supervized model¶
Load House Prices data¶
[2]:
from shapash.data.data_loader import data_loading
house_df, house_dict = data_loading('house_prices')
[3]:
house_df.head()
[3]:
MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
1 | 2-Story 1946 & Newer | Residential Low Density | 8450 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | College Creek | ... | 0 | 0 | 0 | 0 | 0 | 2 | 2008 | Warranty Deed - Conventional | Normal Sale | 208500 |
2 | 1-Story 1946 & Newer All Styles | Residential Low Density | 9600 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Frontage on 2 sides of property | Gentle slope | Veenker | ... | 0 | 0 | 0 | 0 | 0 | 5 | 2007 | Warranty Deed - Conventional | Normal Sale | 181500 |
3 | 2-Story 1946 & Newer | Residential Low Density | 11250 | Paved | Slightly irregular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | College Creek | ... | 0 | 0 | 0 | 0 | 0 | 9 | 2008 | Warranty Deed - Conventional | Normal Sale | 223500 |
4 | 2-Story 1945 & Older | Residential Low Density | 9550 | Paved | Slightly irregular | Near Flat/Level | All public Utilities (E,G,W,& S) | Corner lot | Gentle slope | Crawford | ... | 272 | 0 | 0 | 0 | 0 | 2 | 2006 | Warranty Deed - Conventional | Abnormal Sale | 140000 |
5 | 2-Story 1946 & Newer | Residential Low Density | 14260 | Paved | Slightly irregular | Near Flat/Level | All public Utilities (E,G,W,& S) | Frontage on 2 sides of property | Gentle slope | Northridge | ... | 0 | 0 | 0 | 0 | 0 | 12 | 2008 | Warranty Deed - Conventional | Normal Sale | 250000 |
5 rows × 73 columns
[4]:
y = house_df['SalePrice']
X = house_df.drop('SalePrice', axis=1)
Encoding Categorical Features¶
[5]:
categorical_features = [col for col in X.columns if X[col].dtype == 'object']
encoder = OrdinalEncoder(
cols=categorical_features,
handle_unknown='ignore',
return_df=True
).fit(X)
X = encoder.transform(X)
Train / Test Split¶
[6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1)
Model fitting¶
[7]:
regressor = LGBMRegressor(n_estimators=200).fit(X_train, y_train)
Construct groups of features¶
There are quite a lot of features used by the model and it can be hard to compare them.
We can regroup the features that share similarities in order to identify which topic is important.
In our example we constructed the following new groups : - location
: features related to the location of the house - size
: features that measure part of the house - aspect
: features that evaluate the style of any part of the house - condition
: features related to the quality of anything in the house - configuration
: features about the general configuration / shape of the house - equipment
: features that describe the equipment of the house (electricity, gas, heating…) -
garage
: features related to the garage (style, …) - sale
: features related to the sale of the house
[8]:
# We construct the groups as a dictionary of string keys and list of string values
# All the features inside the list will belong to the same group
features_groups = {
"location": ["MSZoning", "Neighborhood", "Condition1", "Condition2"],
"size": [
"LotArea",
"MasVnrArea",
"BsmtQual",
"BsmtFinSF2",
"BsmtUnfSF",
"TotalBsmtSF",
"1stFlrSF",
"2ndFlrSF",
"GrLivArea",
"WoodDeckSF",
"OpenPorchSF",
"EnclosedPorch",
"3SsnPorch",
"ScreenPorch",
"PoolArea",
"BsmtFinSF1"
],
"aspect": [
"LotShape",
"LandContour",
"RoofStyle",
"RoofMatl",
"Exterior1st",
"MasVnrType",
],
"condition": [
"OverallQual",
"OverallCond",
"ExterQual",
"ExterCond",
"BsmtCond",
"BsmtFinType1",
"BsmtFinType2",
"HeatingQC",
"KitchenQual"
],
"configuration": ["LotConfig", "LandSlope", "BldgType", "HouseStyle"],
"equipment": ["Heating", "CentralAir", "Electrical"],
"garage": [
"GarageType",
"GarageYrBlt",
"GarageFinish",
"GarageArea",
"GarageQual",
"GarageCond",
],
"sale": ["SaleType", "SaleCondition", "MoSold", "YrSold"]
}
Optional : we can also give labels to groups names
[9]:
groups_labels = {
'location': 'Location of the property',
'size' : 'Size of different elements in the house',
'aspect': 'Aspect of the house',
'condition': 'Quality of the materials and parts of the property',
'configuration': 'Configuration of the house',
'equipment': 'All equipments',
'garage': 'Garage group of features',
'sale': 'Sale information'
}
house_dict.update(groups_labels)
Compile Shapash SmartExplainer object using groups¶
[10]:
from shapash import SmartExplainer
# optional parameter, specifies label for features and groups name
xpl = SmartExplainer(
model=regressor,
preprocessing=encoder,
features_groups=features_groups,
features_dict=house_dict
)
[ ]:
xpl.compile(x=X_test,
y_target=y_test, # Optional: allows to display True Values vs Predicted Values
)
Start WebApp¶
We can now start the webapp using the following cell.
groups
switch button.Also you can click on a group’s bar to display the features importance of the features inside the group.
[ ]:
app = xpl.run_app(title_story='House Prices')
Stop the WebApp after using it
[ ]:
app.kill()
Explore the functions of Shapash using groups¶
Features importance plot¶
Display the features importance plot that includes the groups and excludes the features inside each group
[15]:
xpl.plot.features_importance(selection=[259, 268])
Display the features importance plot of the features inside one group
[16]:
xpl.plot.features_importance(group_name='size')
Contribution plot¶
[17]:
xpl.plot.contribution_plot('size')
Local plot¶
By default, Shapash will display the groups in the local plot.
You can directly see the impact of the different groups of features for the given observation.
[18]:
xpl.filter(max_contrib=8)
[19]:
xpl.plot.local_plot(index=629)
You can also display the features without the groups using the following parameters :
[20]:
xpl.filter(max_contrib=6, display_groups=False)
[21]:
xpl.plot.local_plot(index=259, display_groups=False)
Use groups of features in production with SmartPredictor object¶
[22]:
predictor = xpl.to_smartpredictor()
Create an imput and use add_input method of SmartPredictor object
[23]:
sample_input = house_df.sample(4).drop('SalePrice', axis=1)
sample_input
[23]:
MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
278 | 1-Story 1946 & Newer All Styles | Residential Low Density | 19138 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Corner lot | Gentle slope | Gilbert | ... | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2010 | Warranty Deed - Conventional | Normal Sale |
82 | 1-Story PUD (Planned Unit Development) - 1946 ... | Residential Medium Density | 4500 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Frontage on 2 sides of property | Gentle slope | Mitchell | ... | 199 | 0 | 0 | 0 | 0 | 0 | 3 | 2006 | Warranty Deed - Conventional | Normal Sale |
948 | 1-Story 1946 & Newer All Styles | Residential Low Density | 14536 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | Timberland | ... | 252 | 0 | 0 | 0 | 0 | 0 | 11 | 2007 | Warranty Deed - Conventional | Normal Sale |
1262 | 1-Story 1946 & Newer All Styles | Residential Low Density | 9600 | Paved | Regular | Near Flat/Level | All public Utilities (E,G,W,& S) | Inside lot | Gentle slope | North Ames | ... | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2009 | Warranty Deed - Conventional | Normal Sale |
4 rows × 72 columns
[24]:
predictor.add_input(sample_input)
use_groups
parameter[25]:
predictor.detail_contributions()
[25]:
ypred | MSSubClass | Street | Utilities | YearBuilt | YearRemodAdd | Exterior2nd | Foundation | BsmtExposure | LowQualFinSF | ... | PavedDrive | MiscVal | location | size | aspect | condition | configuration | equipment | garage | sale | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
278 | 130844.267709 | 513.659708 | 0.0 | 0.0 | -663.898088 | -3069.707556 | 107.280103 | -147.979932 | -320.832096 | 0.0 | ... | 114.775449 | -8.115543 | 381.872846 | -14396.089946 | 389.034058 | -34936.984759 | 136.495324 | 388.903141 | 2581.918959 | 852.852788 |
82 | 172074.308012 | -1794.251940 | 0.0 | 0.0 | 8556.953143 | 341.624509 | 280.749670 | 157.831718 | -741.368822 | 0.0 | ... | 36.213574 | -13.952196 | 695.534998 | 2654.099001 | 132.778319 | -20731.306929 | -429.927436 | 197.002092 | 866.600207 | -113.465212 |
948 | 270609.542311 | 809.401100 | 0.0 | 0.0 | 2062.573431 | 2274.633986 | 338.187922 | 145.925325 | 804.144542 | 0.0 | ... | 36.671406 | -8.789213 | 219.232590 | 27418.095044 | -1192.718922 | 58309.691558 | -244.246746 | 59.576711 | -373.060204 | -1823.935655 |
1262 | 130366.146017 | 586.962923 | 0.0 | 0.0 | -1281.558493 | -3554.663787 | 29.700023 | -178.384582 | -409.165873 | 0.0 | ... | 118.746022 | -20.226779 | 378.014782 | -16687.534163 | -42.260808 | -24222.527707 | -141.320268 | 136.621081 | -3446.240527 | 441.515370 |
4 rows × 29 columns
[26]:
# Replace groups of features we created with their corresponding features contributions
predictor.detail_contributions(use_groups=False)
[26]:
ypred | MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
278 | 130844.267709 | 513.659708 | 198.903599 | 12691.377983 | 0.0 | -581.800537 | 122.547776 | 0.0 | 156.758117 | 0.0 | ... | -992.956104 | -46.012302 | 0.0 | -261.322952 | 0.0 | -8.115543 | 809.554609 | -277.564120 | -155.385399 | 476.247697 |
82 | 172074.308012 | -1794.251940 | -477.108026 | -6492.520867 | 0.0 | -410.318176 | 55.697919 | 0.0 | -487.737357 | 0.0 | ... | 7542.237778 | 6.606935 | 0.0 | -292.337551 | 0.0 | -13.952196 | -174.894504 | 403.112039 | -463.495150 | 121.812402 |
948 | 270609.542311 | 809.401100 | 124.049805 | 7126.125020 | 0.0 | -702.467025 | -154.403730 | 0.0 | -255.183036 | 0.0 | ... | 11792.032451 | 15.080870 | 0.0 | -372.007690 | 0.0 | -8.789213 | -1078.558864 | 433.762031 | -1192.828411 | 13.689590 |
1262 | 130366.146017 | 586.962923 | 342.502857 | -263.440875 | 0.0 | -501.285776 | 125.195420 | 0.0 | -116.788988 | 0.0 | ... | -815.781795 | -79.706597 | 0.0 | -225.931164 | 0.0 | -20.226779 | 269.715464 | -325.912085 | -96.440432 | 594.152423 |
4 rows × 73 columns
modify_mask
method :[27]:
predictor.modify_mask(max_contrib=4)
The summarize
method will contain the groups of features contributions and the value_x
columns will contain all the values of the features of the corresponding group as a dict.
[28]:
predictor.summarize()
[28]:
ypred | feature_1 | value_1 | contribution_1 | feature_2 | value_2 | contribution_2 | feature_3 | value_3 | contribution_3 | feature_4 | value_4 | contribution_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
278 | 130844.267709 | Quality of the materials and parts of the prop... | {'Overall material and finish of the house': 4... | -34936.984759 | Size of different elements in the house | {'Lot size square feet': 19138.0, 'Masonry ven... | -14396.089946 | Remodel date | 1951 | -3069.707556 | Number of fireplaces | 0 | -2914.650743 |
82 | 172074.308012 | Quality of the materials and parts of the prop... | {'Overall material and finish of the house': 6... | -20731.306929 | Original construction date | 1998 | 8556.953143 | Size of different elements in the house | {'Lot size square feet': 4500.0, 'Masonry vene... | 2654.099001 | Building Class | 1-Story PUD (Planned Unit Development) - 1946 ... | -1794.25194 |
948 | 270609.542311 | Quality of the materials and parts of the prop... | {'Overall material and finish of the house': 8... | 58309.691558 | Size of different elements in the house | {'Lot size square feet': 14536.0, 'Masonry ven... | 27418.095044 | Total rooms above grade | 9 | -3238.097105 | Remodel date | 2003 | 2274.633986 |
1262 | 130366.146017 | Quality of the materials and parts of the prop... | {'Overall material and finish of the house': 5... | -24222.527707 | Size of different elements in the house | {'Lot size square feet': 9600.0, 'Masonry vene... | -16687.534163 | Remodel date | 1956 | -3554.663787 | Garage group of features | {'Garage location': 1.0, 'Year garage was buil... | -3446.240527 |
[29]:
# Removes the groups of features in the summary and replace them with their corresponding features
predictor.summarize(use_groups=False)
[29]:
ypred | feature_1 | value_1 | contribution_1 | feature_2 | value_2 | contribution_2 | feature_3 | value_3 | contribution_3 | feature_4 | value_4 | contribution_4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
278 | 130844.267709 | Overall material and finish of the house | 4 | -31398.569429 | Ground living area square feet | 864 | -15548.964313 | Lot size square feet | 19138 | 12691.377983 | Total square feet of basement area | 864 | -6352.627145 |
82 | 172074.308012 | Overall material and finish of the house | 6 | -18709.409714 | Original construction date | 1998 | 8556.953143 | Ground living area square feet | 1337 | -8081.588619 | Open porch area in square feet | 199 | 7542.237778 |
948 | 270609.542311 | Overall material and finish of the house | 8 | 66435.541722 | Open porch area in square feet | 252 | 11792.032451 | Total square feet of basement area | 1616 | 9210.438644 | Type 1 finished square feet | 1300 | 9098.524863 |
1262 | 130366.146017 | Overall material and finish of the house | 5 | -23653.042515 | Ground living area square feet | 1050 | -13069.750375 | Remodel date | 1956 | -3554.663787 | Size of garage in square feet | 338 | -3424.218904 |