Shapash - Time Series Tabular Forecasting

This notebook illustrates tabular forecasting on a daily time series: it builds lag and calendar features, trains a random forest regressor, then interprets the model globally and locally with Shapash.

[ ]:

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

from shapash import SmartExplainer

1. Build a synthetic daily demand signal

[2]:

rng = np.random.default_rng(42)
date_index = pd.date_range(start="2021-01-01", periods=900, freq="D")

trend = np.linspace(0, 25, len(date_index))
weekly = 10 * np.sin(2 * np.pi * np.arange(len(date_index)) / 7)
yearly = 20 * np.sin(2 * np.pi * np.arange(len(date_index)) / 365)
noise = rng.normal(0, 4, len(date_index))

target = 120 + trend + weekly + yearly + noise
ts_df = pd.DataFrame({"date": date_index, "target": target})
ts_df.head()

[2]:

	date	target
0	2021-01-01	121.218868
1	2021-01-02	124.030454
2	2021-01-03	133.495133
3	2021-01-04	129.216916
4	2021-01-05	109.344305

2. Engineer tabular features (lags + calendar)

[3]:

df = ts_df.copy()
df["lag_1"] = df["target"].shift(1)
df["lag_7"] = df["target"].shift(7)
df["rolling_mean_7"] = df["target"].shift(1).rolling(7).mean()
df["rolling_std_7"] = df["target"].shift(1).rolling(7).std()
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

model_df = df.dropna().reset_index(drop=True)
feature_cols = [
    "lag_1",
    "lag_7",
    "rolling_mean_7",
    "rolling_std_7",
    "day_of_week",
    "month",
    "is_weekend",
]

X = model_df[feature_cols]
y = model_df[["target"]]

split_idx = int(len(model_df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
date_train = model_df["date"].iloc[:split_idx]
date_test = model_df["date"].iloc[split_idx:]

X_train.head()

[3]:

	lag_1	lag_7	rolling_mean_7	rolling_std_7	day_of_week	month	is_weekend
0	114.921933	121.218868	119.875422	9.963461	4	1	0
1	121.333851	124.030454	119.891848	9.966140	5	1	1
2	130.719155	133.495133	120.847377	10.721123	6	1	1
3	129.673558	129.216916	120.301437	10.045764	0	1	0
4	131.560379	109.344305	120.636218	10.424313	1	1	0
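Applying shift(1) before rolling(7) in the cell above is what keeps the rolling statistics free of same-day information. One way to check this is to tamper with "future" values and confirm that earlier feature rows do not move. The sketch below is illustrative only: `build_lag_features` and `leaky_rolling_mean` are hypothetical helpers that are not part of the notebook, and a short random series stands in for the demand signal.

```python
import numpy as np
import pandas as pd


def build_lag_features(target: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: the lag and rolling features used in this notebook."""
    past = target.shift(1)  # values known strictly before day t
    return pd.DataFrame(
        {
            "lag_1": past,
            "lag_7": target.shift(7),
            "rolling_mean_7": past.rolling(7).mean(),
            "rolling_std_7": past.rolling(7).std(),
        }
    )


def leaky_rolling_mean(target: pd.Series) -> pd.Series:
    """Hypothetical counter-example: no shift, so day t sees its own target."""
    return target.rolling(7).mean()


# Stand-in series (an assumption; any numeric daily series would do).
rng = np.random.default_rng(0)
series = pd.Series(rng.normal(100, 10, 60))

# Corrupt every value from `cutoff` onward, then compare rows 0..cutoff.
cutoff = 40
tampered = series.copy()
tampered.iloc[cutoff:] += 1000.0

safe_ok = np.allclose(
    build_lag_features(series).iloc[: cutoff + 1],
    build_lag_features(tampered).iloc[: cutoff + 1],
    equal_nan=True,
)
leaky_ok = np.allclose(
    leaky_rolling_mean(series).iloc[: cutoff + 1],
    leaky_rolling_mean(tampered).iloc[: cutoff + 1],
    equal_nan=True,
)

print(f"shifted features unaffected by future values: {safe_ok}")
print(f"unshifted rolling mean unaffected by future values: {leaky_ok}")
```

The shifted features pass the check because the row for day t depends only on targets up to day t-1. The unshifted rolling mean fails because its window includes the target of day t itself.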

3. Train and evaluate

[4]:

model = RandomForestRegressor(n_estimators=400, random_state=42, n_jobs=-1)
model.fit(X_train, y_train.iloc[:, 0])

pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

metrics = pd.DataFrame(
    {
        "MAE": [
            mean_absolute_error(y_train.iloc[:, 0], pred_train),
            mean_absolute_error(y_test.iloc[:, 0], pred_test),
        ],
        "R2": [
            r2_score(y_train.iloc[:, 0], pred_train),
            r2_score(y_test.iloc[:, 0], pred_test),
        ],
    },
    index=["train", "test"],
)
metrics

[4]:

	MAE	R2
train	1.511569	0.985118
test	5.771074	0.535060
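To put the test MAE in context, it can help to compare it with naive forecasts that need no model at all: repeating yesterday's value (D-1) or the value from one week earlier (D-7). The sketch below is an assumption-laden illustration, not part of the original notebook. It rebuilds the same synthetic series so that it runs on its own, and the names `baseline_df`, `naive_d1`, `naive_d7` and `baseline_mae` are introduced here for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Rebuild the synthetic demand signal from section 1 (same seed, same formula).
rng = np.random.default_rng(42)
n = 900
t = np.arange(n)
target = (
    120
    + np.linspace(0, 25, n)
    + 10 * np.sin(2 * np.pi * t / 7)
    + 20 * np.sin(2 * np.pi * t / 365)
    + rng.normal(0, 4, n)
)
s = pd.Series(target, index=pd.date_range("2021-01-01", periods=n, freq="D"))

# Naive forecasts: yesterday's value and the value seven days earlier.
baseline_df = pd.DataFrame(
    {"target": s, "naive_d1": s.shift(1), "naive_d7": s.shift(7)}
).dropna()

# Same chronological 80/20 split as the model: the first 7 rows are dropped
# in both cases, so the test window covers the same dates.
split_idx = int(len(baseline_df) * 0.8)
test = baseline_df.iloc[split_idx:]

baseline_mae = {
    col: mean_absolute_error(test["target"], test[col])
    for col in ["naive_d1", "naive_d7"]
}
print(baseline_mae)
```

If the model's test MAE is not clearly below these values, the lag and calendar features are probably adding little beyond what simple persistence already captures.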

[5]:

compare_df = pd.DataFrame(
    {
        "date": date_test.values,
        "actual": y_test.iloc[:, 0].values,
        "predicted": pred_test,
    }
)

plot_df = compare_df.set_index("date").tail(60)

ax = plot_df["actual"].plot(
    figsize=(12, 4),
    color="#FECB2F",   # Shapash-style yellow
    linewidth=2.4,
    label="Actual",
)
plot_df["predicted"].plot(
    ax=ax,
    color="#7A7A7A",   # grey
    linewidth=2.2,
    linestyle="--",
    label="Predicted",
)

ax.set_title("Actual vs Predicted demand")
ax.grid(alpha=0.25)
ax.legend(frameon=False)

[5]:

<matplotlib.legend.Legend at 0x1218a0110>

[Figure: "Actual vs Predicted demand" line plot over the last 60 days of the test set]

4. Explain forecasting drivers with Shapash

[6]:

feature_dict = {
    "lag_1": "Demand D-1",
    "lag_7": "Demand D-7",
    "rolling_mean_7": "7-day rolling mean",
    "rolling_std_7": "7-day rolling volatility",
    "day_of_week": "Day of week",
    "month": "Month",
    "is_weekend": "Weekend",
}

xpl = SmartExplainer(
    model=model,
    features_dict=feature_dict,
    title_story="Demand forecasting with tabular features",
)

y_pred_test_df = pd.DataFrame(pred_test, columns=["target"], index=X_test.index)

xpl.compile(
    x=X_test,
    y_pred=y_pred_test_df,
    y_target=y_test,
    additional_data=model_df.loc[X_test.index, ["date"]],
)

xpl.plot.features_importance()

INFO: Shap explainer type - <shap.explainers._tree.TreeExplainer object at 0x121857410>

[Figure: Shapash features importance plot]

[7]:

worst_error_idx = (y_test.iloc[:, 0] - pred_test).abs().idxmax()
xpl.plot.local_plot(index=worst_error_idx)

[Figure: Shapash local explanation plot for the test row with the largest absolute error]

5. Leakage checklist for temporal models

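Use a strict time-based split (never random shuffling), as done with split_idx in section 2.
Compute rolling statistics from past values only, for example by applying shift(1) before rolling(7).
Verify that calendar and lag features are available at inference time (lag_1 assumes yesterday's actual demand is already known).
Monitor drift in lag feature distributions in production.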
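Two items of this checklist lend themselves to a short sketch: the time-based split and the drift monitoring. The example below is a minimal illustration under stated assumptions. It uses scikit-learn's `TimeSeriesSplit` as one possible way to get chronological folds, and a two-sample Kolmogorov-Smirnov test from SciPy as one possible drift signal. Neither is used in the notebook itself, and the sample arrays (`reference_lag`, `stable_lag`, `shifted_lag`) are simulated stand-ins for a lag feature.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import TimeSeriesSplit

# 1) Expanding-window splits: every training index precedes every test index.
n_rows = 100  # illustrative number of chronologically ordered rows
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(np.arange(n_rows)))
for fold, (train_idx, test_idx) in enumerate(folds):
    assert train_idx.max() < test_idx.min()  # no future rows in training
    print(
        f"fold {fold}: train 0..{train_idx.max()}, "
        f"test {test_idx.min()}..{test_idx.max()}"
    )

# 2) Drift check on one lag feature: compare a reference sample (training
#    period) with recent samples. All three samples are simulated here.
rng = np.random.default_rng(0)
reference_lag = rng.normal(120, 10, 500)  # stand-in for lag_1 at training time
stable_lag = rng.normal(120, 10, 500)     # recent data, same distribution
shifted_lag = rng.normal(135, 10, 500)    # recent data, mean shifted upward

stable_check = ks_2samp(reference_lag, stable_lag)
shifted_check = ks_2samp(reference_lag, shifted_lag)
print(f"stable : KS statistic={stable_check.statistic:.3f}")
print(f"shifted: KS statistic={shifted_check.statistic:.3f}")
```

A larger KS statistic (and a smaller p-value) on the shifted sample indicates that the lag feature no longer follows the distribution seen during training. The alerting threshold is a project-specific choice and is left open here.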