Shapash + TabICL for Regression¶

This tutorial shows how to: - train a TabICLRegressor on a regression problem - compute SHAP values with tabicl.shap - convert these contributions into a format compatible with Shapash - explore global and local explainability in a notebook - optionally launch the Shapash WebApp

The dataset used here is house_prices, already provided in Shapash.

1. Prerequisites¶

TabICL does not return Shapley values directly from predict(). However, the project does provide a tabicl.shap module that computes SHAP values by leveraging the native column-wise NaN masking.

Typical installation: pip install tabicl[shap]

On some macOS Intel environments, you may need to install PyTorch before tabicl.

[ ]:

# Uncomment if needed
# !pip install tabicl[shap]

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from shapash import SmartExplainer
from shapash.data.data_loader import data_loading
from tabicl import TabICLRegressor
from tabicl.shap import get_shap_values, plot_shap

2. Load the Data¶

[2]:

house_df, house_dict = data_loading('house_prices')

y = house_df['SalePrice']
X = house_df.drop(columns=['SalePrice'])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    train_size=0.75,
    random_state=42,
)

X_train.head()

[2]:

	MSSubClass	MSZoning	LotArea	Street	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	...	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
1024	1-Story PUD (Planned Unit Development) - 1946 ...	Residential Low Density	3182	Paved	Regular	Near Flat/Level	All public Utilities (E,G,W,& S)	Inside lot	Gentle slope	Bloomington Heights	...	20	0	0	0	0	0	5	2008	Warranty Deed - Conventional	Normal Sale
811	1-Story 1946 & Newer All Styles	Residential Low Density	10140	Paved	Regular	Near Flat/Level	All public Utilities (E,G,W,& S)	Inside lot	Gentle slope	Northwest Ames	...	0	0	0	0	648	0	1	2006	Warranty Deed - Conventional	Normal Sale
1385	1-1/2 Story Finished All Ages	Residential Low Density	9060	Paved	Regular	Near Flat/Level	All public Utilities (E,G,W,& S)	Inside lot	Gentle slope	Edwards	...	0	0	0	0	0	0	10	2009	Warranty Deed - Conventional	Normal Sale
627	1-Story 1946 & Newer All Styles	Residential Low Density	12342	Paved	Slightly irregular	Near Flat/Level	All public Utilities (E,G,W,& S)	Inside lot	Gentle slope	North Ames	...	0	36	0	0	0	600	8	2007	Warranty Deed - Conventional	Normal Sale
814	1-Story 1946 & Newer All Styles	Residential Low Density	9750	Paved	Regular	Near Flat/Level	All public Utilities (E,G,W,& S)	Inside lot	Gentle slope	North Ames	...	0	275	0	0	0	500	4	2007	Court Officer Deed/Estate	Normal Sale

5 rows × 72 columns

3. Train the TabICL Model¶

In this version, the data is encoded numerically before fit (ordinal encoding of categorical variables), and missing values are imputed before training.

The model is then trained on X_train_num and evaluated on X_test_num.

[3]:

# Replace missing values then encode categoricals to numeric before fit
num_cols = X_train.select_dtypes(include=['number', 'bool']).columns
cat_cols = X_train.columns.difference(num_cols)

X_train_filled = X_train.copy()
X_test_filled = X_test.copy()

num_fill_values = X_train_filled[num_cols].median()
X_train_filled[num_cols] = X_train_filled[num_cols].fillna(num_fill_values)
X_test_filled[num_cols] = X_test_filled[num_cols].fillna(num_fill_values)

if len(cat_cols) > 0:
    cat_fill_values = X_train_filled[cat_cols].mode(dropna=True).iloc[0]
    X_train_filled[cat_cols] = X_train_filled[cat_cols].fillna(cat_fill_values)
    X_test_filled[cat_cols] = X_test_filled[cat_cols].fillna(cat_fill_values)

    ordinal_encoder = OrdinalEncoder(
        handle_unknown='use_encoded_value',
        unknown_value=-1,
        encoded_missing_value=-1,
    )
    X_train_cat = pd.DataFrame(
        ordinal_encoder.fit_transform(X_train_filled[cat_cols]),
        index=X_train_filled.index,
        columns=cat_cols,
    )
    X_test_cat = pd.DataFrame(
        ordinal_encoder.transform(X_test_filled[cat_cols]),
        index=X_test_filled.index,
        columns=cat_cols,
    )
else:
    X_train_cat = pd.DataFrame(index=X_train_filled.index)
    X_test_cat = pd.DataFrame(index=X_test_filled.index)

X_train_num = pd.concat([X_train_filled[num_cols].astype(float), X_train_cat.astype(float)], axis=1)
X_test_num = pd.concat([X_test_filled[num_cols].astype(float), X_test_cat.astype(float)], axis=1)

# Keep the original feature order
X_train_num = X_train_num.reindex(columns=X_train.columns).fillna(0.0)
X_test_num = X_test_num.reindex(columns=X_train.columns).fillna(0.0)

regressor = TabICLRegressor(
    n_estimators=4,
    kv_cache=True,
    device='cpu',
    random_state=42,
)

regressor.fit(X_train_num, y_train)

[3]:

TabICLRegressor(device='cpu', kv_cache=True, n_estimators=4)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

[4]:

y_pred = pd.DataFrame(regressor.predict(X_test_num), columns=['pred'], index=X_test_num.index)

metrics = pd.Series({
    'MAE': mean_absolute_error(y_test, y_pred['pred']),
    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred['pred'])),
    'R2': r2_score(y_test, y_pred['pred']),
})

metrics.round(4)

[4]:

MAE     13479.4658
RMSE    21565.9094
R2          0.9336
dtype: float64

4. Compute SHAP Values with TabICL¶

We only explain a subset of the test set to keep the notebook responsive.

[5]:

X_explain_columns = X_test_num.columns.tolist()

# X_explain = X_explain.fillna(0.0).astype(float)

shap_values = get_shap_values(
    estimator=regressor,
    X_test=X_test_num,
    attribute_names=X_explain_columns,
)

type(shap_values)

PermutationExplainer explainer: 366it [07:53,  1.32s/it]

[5]:

shap._explanation.Explanation

5. Convert to a Shapash-Compatible Format¶

Shapash accepts user-provided local contributions with SmartExplainer.compile(..., contributions=...). For regression, you simply provide an (n_samples, n_features) matrix aligned with x.

[6]:

contributions = pd.DataFrame(
    shap_values.values,
    columns=X_explain_columns,
    index=X_test_num.index,
)

contributions.head()

[6]:

	MSSubClass	MSZoning	LotArea	Street	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	...	OpenPorchSF	EnclosedPorch	3SsnPorch	ScreenPorch	PoolArea	MiscVal	MoSold	YrSold	SaleType	SaleCondition
Id
893	1134.341146	1165.591146	-4267.718750	-243.335938	-846.552083	-1180.197917	-129.291667	-722.570312	-959.122396	-718.528646	...	-2731.927083	-1290.799479	-2229.411458	384.627604	-3934.169271	324.828125	-1374.604167	-55.851562	-1159.580729	-3525.583333
1106	2210.226562	1919.864583	4093.882812	-616.252604	54.406250	-2897.070312	-54.440104	-442.393229	-2222.575521	9676.208333	...	-975.208333	-317.333333	-2853.255208	580.546875	-7612.604167	736.776042	-1093.338542	-518.557292	-1599.414062	-5408.106771
414	-1800.076823	-1522.001302	-2821.118490	-107.744792	-535.144531	-1283.110677	-81.026042	-792.263021	-646.457031	-2638.473958	...	-2467.863281	333.020833	-2093.165365	191.466146	-3519.006510	585.395833	-988.803385	-290.632812	-970.368490	-3135.895833
523	1554.544271	-2411.419271	-11190.130208	-137.018229	-1000.174479	-2037.104167	172.817708	95.820312	-1223.960938	4175.825521	...	-1240.661458	-105.760417	-2577.388021	366.585938	-5503.609375	846.067708	301.609375	-58.408854	-1523.947917	-4321.684896
1037	801.145833	1113.877604	4766.515625	-670.895833	-154.882812	3451.523438	-229.979167	-1795.559896	-2748.388021	5653.872396	...	-2727.916667	-969.276042	-3819.666667	271.013021	-9041.513021	94.385417	709.822917	-470.518229	-2645.000000	-6211.145833

5 rows × 72 columns

6. Compile SmartExplainer with TabICL Outputs¶

[7]:

xpl = SmartExplainer(
    model=regressor,
    features_dict=house_dict,
    title_story='House Prices - TabICL Regressor',
)

xpl.compile(
    x=X_test_num,
    contributions=contributions,
    y_pred=y_pred,
    y_target=y_test,
)

7. Explainability in the Notebook¶

[8]:

xpl.plot.features_importance()

../../_images/tutorials_domain_examples_tuto-domain03-tabicl-regression-explainability_15_0.png

[9]:

example_index = X_test_num.index[0]
xpl.plot.local_plot(index=example_index)

../../_images/tutorials_domain_examples_tuto-domain03-tabicl-regression-explainability_16_0.png

[10]:

summary_df = xpl.to_pandas(max_contrib=4)
summary_df.head()

[10]:

	pred	feature_1	value_1	contribution_1	feature_2	value_2	contribution_2	feature_3	value_3	contribution_3	feature_4	value_4	contribution_4
893	145425.593750	Ground living area square feet	1068.0	-17761.609375	Exterior materials' quality	0.0	-6394.335938	Number of fireplaces	0.0	-6384.5625	Lot size square feet	8414.0	-4267.71875
1106	338848.625000	Ground living area square feet	2622.0	63718.979167	Overall material and finish of the house	8.0	30047.625	Second floor square feet	1122.0	10486.434896	Number of fireplaces	2.0	9939.911458
414	106041.742188	Ground living area square feet	1028.0	-16405.161458	Overall material and finish of the house	5.0	-11118.700521	Proximity to various conditions	2.0	-10017.960938	Exterior materials' quality	0.0	-5917.255208
523	157701.843750	Lot size square feet	5000.0	-11190.130208	Proximity to various conditions	3.0	-10930.195312	Exterior materials' quality	0.0	-9621.018229	Number of fireplaces	2.0	9555.598958
1037	344184.750000	Overall material and finish of the house	9.0	62309.341146	Original construction date	2007.0	15426.658854	Kitchen quality	0.0	12653.299479	Total square feet of basement area	1620.0	12174.908854

8. Optionally Launch the WebApp¶

As in the other Shapash tutorials, you can launch the WebApp directly from the notebook.

[ ]:

app = xpl.run_app(port=8050)
# Then use: app.kill()

INFO:root:Your Shapash application run on http://PMP01087:8050/
INFO:root:Use the method .kill() to down your app.

	n_estimators	4
	norm_methods	None
	feat_shuffle_method	'latin'
	outlier_threshold	4.0
	batch_size	8
	kv_cache	True
	model_path	None
	allow_auto_download	True
	checkpoint_version	'tabicl-regressor-v2-20260212.ckpt'
	device	'cpu'
	use_amp	'auto'
	use_fa3	'auto'
	offload_mode	'auto'
	disk_offload_dir	None
	random_state	42
	n_jobs	None
	verbose	False
	inference_config	None