Ad Ease optimizes ad placement by forecasting Wikipedia page views across languages. Reliable forecasts help clients improve click outcomes at lower spend and support budget planning by market.
Build a robust, business-ready forecasting pipeline for aggregated daily page views by language, then compare baseline and advanced time-series models to identify the most reliable approach per language.
No client-identifiable entities are disclosed. Any stakeholder labels in this notebook are anonymized.
import warnings
warnings.filterwarnings('ignore')
import os
import re
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from statsmodels.tools.sm_exceptions import ValueWarning
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Explicitly suppress statsmodels frequency inference warnings in notebook output
warnings.filterwarnings('ignore', category=ValueWarning)
sns.set_theme(style='whitegrid')
pd.set_option('display.max_columns', 200)
pd.set_option('display.width', 160)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
print('Libraries loaded successfully.')
Libraries loaded successfully.
# Resolve file paths so this notebook runs regardless of current working directory
candidate_train_paths = [
Path('../data/train_1.csv'),
Path('data/train_1.csv'),
Path('../../TimeSeries/data/train_1.csv')
]
TRAIN_PATH = next((p for p in candidate_train_paths if p.exists()), None)
if TRAIN_PATH is None:
raise FileNotFoundError('Could not locate train_1.csv in expected paths.')
candidate_exog_paths = [
Path('../data/Exog_Campaign_eng.csv'),
Path('../data/exog_campaign_eng.csv'),
Path('../data/Exog_Campaign_eng.xlsx'),
Path('data/Exog_Campaign_eng.csv')
]
EXOG_PATH = next((p for p in candidate_exog_paths if p.exists()), None)
df_raw = pd.read_csv(TRAIN_PATH)
print(f'Train file loaded: {TRAIN_PATH}')
print(f'Shape: {df_raw.shape}')
print('Exogenous campaign file found:' if EXOG_PATH else 'Exogenous campaign file not found in expected paths.')
if EXOG_PATH:
print(f'Exog file loaded from: {EXOG_PATH}')
display(df_raw.head(2))
Train file loaded: ..\data\train_1.csv
Shape: (145063, 551)
Exogenous campaign file not found in expected paths.
| | Page | 2015-07-01 | 2015-07-02 | ... | 2016-12-30 | 2016-12-31 |
|---|---|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | 18.0 | 11.0 | ... | 18.0 | 20.0 |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | 11.0 | 14.0 | ... | 26.0 | 20.0 |
2 rows × 551 columns
# Schema and data quality checks
print('Column count:', df_raw.shape[1])
print('Row count:', df_raw.shape[0])
print('First 5 columns:', df_raw.columns[:5].tolist())
print('Last 5 columns:', df_raw.columns[-5:].tolist())
missing_counts = df_raw.isna().sum().sort_values(ascending=False)
missing_summary = pd.DataFrame({
'missing_count': missing_counts,
'missing_pct': (missing_counts / len(df_raw) * 100).round(4)
})
print('Top missingness columns:')
display(missing_summary.head(10))
dup_rows = df_raw.duplicated().sum()
dup_pages = df_raw.duplicated(subset=['Page']).sum()
print(f'Duplicate full rows: {dup_rows}')
print(f'Duplicate Page entries: {dup_pages}')
Column count: 551
Row count: 145063
First 5 columns: ['Page', '2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04']
Last 5 columns: ['2016-12-27', '2016-12-28', '2016-12-29', '2016-12-30', '2016-12-31']
Top missingness columns:
| | missing_count | missing_pct |
|---|---|---|
| 2015-07-02 | 20816 | 14.3496 |
| 2015-07-01 | 20740 | 14.2972 |
| 2015-07-07 | 20664 | 14.2448 |
| 2015-07-05 | 20659 | 14.2414 |
| 2015-07-04 | 20654 | 14.2380 |
| 2015-07-03 | 20544 | 14.1621 |
| 2015-07-11 | 20525 | 14.1490 |
| 2015-07-12 | 20485 | 14.1215 |
| 2015-07-06 | 20483 | 14.1201 |
| 2015-07-13 | 20399 | 14.0622 |
Duplicate full rows: 0
Duplicate Page entries: 0
# Parse language and access metadata from page names
df = df_raw.copy()
# Access tokens ('all-access', 'desktop', 'mobile-web') contain no underscores, so [^_]+ is safe;
# non-Wikipedia projects (e.g. commons.wikimedia.org) deliberately fall through to 'unknown'
meta_pattern = r'_(?P<lang>[a-z\-]+)\.wikipedia\.org_(?P<access>[^_]+)_(?P<agent>[^_]+)$'
meta = df['Page'].str.extract(meta_pattern)
df['language'] = meta['lang'].fillna('unknown')
df['access_type'] = meta['access'].fillna('unknown_access')
df['access_agent'] = meta['agent'].fillna('unknown_agent')
# Identify all date columns in wide format
date_cols = [c for c in df.columns if re.match(r'^\d{4}-\d{2}-\d{2}$', str(c))]
print('Date columns detected:', len(date_cols))
for c in date_cols:
df[c] = pd.to_numeric(df[c], errors='coerce')
print('Languages detected:', df['language'].nunique())
display(df[['Page', 'language', 'access_type', 'access_agent']].head(3))
Date columns detected: 550
Languages detected: 8
| | Page | language | access_type | access_agent |
|---|---|---|---|---|
| 0 | 2NE1_zh.wikipedia.org_all-access_spider | zh | all-access | spider |
| 1 | 2PM_zh.wikipedia.org_all-access_spider | zh | all-access | spider |
| 2 | 3C_zh.wikipedia.org_all-access_spider | zh | all-access | spider |
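As a quick sanity check, the extraction logic can be exercised on representative page names (a minimal sketch; the pattern below is a simplified variant of the notebook's `meta_pattern`, and the sample titles are hypothetical but follow the dataset's `name_project_access_agent` convention):

```python
import re

# Language, access type, and agent are the last three underscore-separated tokens
pattern = r'_(?P<lang>[a-z\-]+)\.wikipedia\.org_(?P<access>[^_]+)_(?P<agent>[^_]+)$'

m = re.search(pattern, '2NE1_zh.wikipedia.org_all-access_spider')
print(m.group('lang'), m.group('access'), m.group('agent'))  # zh all-access spider

# Non-Wikipedia projects do not match and therefore map to 'unknown'
assert re.search(pattern, 'X_commons.wikimedia.org_all-access_all-agents') is None
```

This is why a sizeable 'unknown' bucket appears in the language volumes below: pages from projects outside `*.wikipedia.org` intentionally fail the match.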
# Missingness pattern across a sampled set of pages and dates
sample_rows = min(500, len(df))
sample_cols = min(90, len(date_cols))
sampled_dates = date_cols[:sample_cols]
sampled_idx = np.random.RandomState(RANDOM_STATE).choice(df.index, size=sample_rows, replace=False)
null_matrix = df.loc[sampled_idx, sampled_dates].isna().astype(int)
plt.figure(figsize=(15, 4))
sns.heatmap(null_matrix, cbar=False)
plt.title('Missingness Heatmap (Sampled Pages x Early Date Columns)')
plt.xlabel('Date columns (sample)')
plt.ylabel('Sampled pages')
plt.show()
row_nan_share = df[date_cols].isna().mean(axis=1)
print('Median row-level missing share:', round(float(row_nan_share.median()), 4))
print('95th percentile row-level missing share:', round(float(row_nan_share.quantile(0.95)), 4))
Median row-level missing share: 0.0
95th percentile row-level missing share: 0.6527
# Aggregate to language-level daily views in long-format friendly structure
df_grouped = df.groupby('language')[date_cols].sum(min_count=1).fillna(0)
lang_daily = df_grouped.T
lang_daily.index = pd.to_datetime(lang_daily.index)
lang_daily = lang_daily.sort_index().asfreq('D')
# Optional smoothing helper columns for EDA
lang_rolling_7 = lang_daily.rolling(window=7, min_periods=1).mean()
lang_rolling_30 = lang_daily.rolling(window=30, min_periods=1).mean()
long_df = lang_daily.reset_index().melt(id_vars='index', var_name='language', value_name='views')
long_df = long_df.rename(columns={'index': 'date'})
language_volume = long_df.groupby('language', as_index=False)['views'].sum().sort_values('views', ascending=False)
top_languages = language_volume['language'].head(8).tolist()
print('Language-level daily matrix shape:', lang_daily.shape)
display(language_volume.head(15))
Language-level daily matrix shape: (550, 8)
| | language | views |
|---|---|---|
| 1 | en | 4.143970e+10 |
| 6 | unknown | 2.928629e+10 |
| 2 | es | 7.318862e+09 |
| 4 | ja | 6.973910e+09 |
| 0 | de | 6.857948e+09 |
| 5 | ru | 5.528652e+09 |
| 3 | fr | 4.869739e+09 |
| 7 | zh | 2.224114e+09 |
The next cells cover univariate, bivariate, and multivariate behavior, plus outlier and skewness diagnostics.
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
sns.barplot(data=language_volume.head(12), x='views', y='language', ax=axes[0, 0], palette='crest')
axes[0, 0].set_title('Top Languages by Total Views')
axes[0, 0].set_xlabel('Total views over history')
total_daily = lang_daily.sum(axis=1)
axes[0, 1].plot(total_daily.index, total_daily.values, color='#1f77b4')
axes[0, 1].set_title('Overall Daily Views (All Languages)')
axes[0, 1].tick_params(axis='x', rotation=30)
for lang in top_languages[:6]:
axes[1, 0].plot(lang_rolling_7.index, lang_rolling_7[lang], label=lang, linewidth=1.6)
axes[1, 0].set_title('7-Day Rolling Trends (Top Languages)')
axes[1, 0].legend(loc='upper left', ncol=2, fontsize=9)
axes[1, 0].tick_params(axis='x', rotation=30)
corr_top = lang_daily[top_languages[:8]].corr() if len(top_languages) >= 2 else pd.DataFrame()
if not corr_top.empty:
sns.heatmap(corr_top, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1, 1])
axes[1, 1].set_title('Correlation Across Top Language Series')
else:
axes[1, 1].text(0.5, 0.5, 'Not enough languages for correlation', ha='center', va='center')
axes[1, 1].set_axis_off()
plt.tight_layout()
plt.show()
# Outlier and skewness analysis by language
outlier_records = []
skew_records = []
for lang in lang_daily.columns:
s = lang_daily[lang].astype(float)
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outlier_share = ((s < lower) | (s > upper)).mean()
skew_raw = s.skew()
skew_log = np.log1p(s).skew()
outlier_records.append({'language': lang, 'outlier_share': outlier_share})
skew_records.append({'language': lang, 'skew_raw': skew_raw, 'skew_log1p': skew_log})
outlier_df = pd.DataFrame(outlier_records).sort_values('outlier_share', ascending=False)
skew_df = pd.DataFrame(skew_records).sort_values('skew_raw', ascending=False)
display(outlier_df.head(10))
display(skew_df.head(10))
plt.figure(figsize=(12, 4))
sns.barplot(data=outlier_df.head(12), x='language', y='outlier_share', palette='mako')
plt.title('Outlier Share by Language (IQR Rule)')
plt.xticks(rotation=45)
plt.ylabel('Outlier share')
plt.show()
| | language | outlier_share |
|---|---|---|
| 5 | ru | 0.069091 |
| 1 | en | 0.063636 |
| 6 | unknown | 0.050909 |
| 3 | fr | 0.047273 |
| 0 | de | 0.025455 |
| 4 | ja | 0.012727 |
| 7 | zh | 0.003636 |
| 2 | es | 0.001818 |
| | language | skew_raw | skew_log1p |
|---|---|---|---|
| 5 | ru | 3.000370 | 1.580929 |
| 6 | unknown | 2.263112 | 1.196918 |
| 1 | en | 1.574414 | 1.039583 |
| 0 | de | 1.093992 | 0.653026 |
| 4 | ja | 0.541323 | -0.101316 |
| 3 | fr | 0.428835 | -0.184960 |
| 7 | zh | 0.157173 | -0.411243 |
| 2 | es | 0.076138 | -0.347028 |
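To make the IQR rule concrete, here is a tiny worked example with hypothetical numbers: Q1 = 11.0 and Q3 = 12.5 give fences of 8.75 and 14.75, so only the spike at 100 is flagged.

```python
import pandas as pd

s = pd.Series([10.0, 12, 11, 13, 12, 11, 100])   # one obvious spike
q1, q3 = s.quantile([0.25, 0.75])                # 11.0 and 12.5 here
iqr = q3 - q1                                    # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # 8.75 and 14.75
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())                         # [100.0]
```

The same mechanics run per language above; `log1p` is then reported alongside raw skew because compressing the right tail typically moves heavy-tailed series closer to symmetry.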
def run_adf(series):
s = pd.Series(series).dropna()
if len(s) < 30:
return {'adf_stat': np.nan, 'p_value': np.nan, 'n_obs': len(s), 'is_stationary_5pct': False}
result = adfuller(s, autolag='AIC')
return {
'adf_stat': result[0],
'p_value': result[1],
'n_obs': result[3],
'is_stationary_5pct': bool(result[1] < 0.05)
}
adf_rows = []
for lang in lang_daily.columns:
adf_out = run_adf(lang_daily[lang])
adf_out['language'] = lang
adf_rows.append(adf_out)
adf_df = pd.DataFrame(adf_rows).sort_values('p_value')
display(adf_df.head(15))
print('Share of languages stationary at 5% level (raw series):', round(adf_df['is_stationary_5pct'].mean(), 3))
| | adf_stat | p_value | n_obs | is_stationary_5pct | language |
|---|---|---|---|---|---|
| 5 | -3.841414 | 0.002511 | 547 | True | ru |
| 2 | -3.001005 | 0.034801 | 534 | True | es |
| 6 | -2.977322 | 0.037067 | 535 | True | unknown |
| 4 | -2.363707 | 0.152243 | 535 | False | ja |
| 3 | -2.279667 | 0.178610 | 530 | False | fr |
| 0 | -2.181550 | 0.213010 | 533 | False | de |
| 1 | -2.082195 | 0.251734 | 535 | False | en |
| 7 | -1.545159 | 0.511065 | 530 | False | zh |
Share of languages stationary at 5% level (raw series): 0.375
# Decomposition for the top 3 languages by total volume
decompose_langs = top_languages[:3]
for lang in decompose_langs:
series = lang_daily[lang].asfreq('D')
result = seasonal_decompose(series, model='additive', period=7, extrapolate_trend='freq')
fig = result.plot()
fig.set_size_inches(14, 8)
plt.suptitle(f'Seasonal Decomposition: {lang}', y=1.02)
plt.tight_layout()
plt.show()
# ACF/PACF on differenced series for a focus language
focus_language = top_languages[0] if top_languages else lang_daily.columns[0]
focus_series = lang_daily[focus_language].copy()
diff1 = focus_series.diff().dropna()
seasonal_diff = focus_series.diff(7).dropna()
adf_raw = run_adf(focus_series)['p_value']
adf_diff1 = run_adf(diff1)['p_value']
adf_diff7 = run_adf(seasonal_diff)['p_value']
print(f'Focus language: {focus_language}')
print(f'ADF p-value (raw): {adf_raw:.5f}')
print(f'ADF p-value (1st diff): {adf_diff1:.5f}')
print(f'ADF p-value (seasonal diff, lag=7): {adf_diff7:.5f}')
fig, axes = plt.subplots(2, 2, figsize=(14, 8))
axes[0, 0].plot(focus_series.index, focus_series.values)
axes[0, 0].set_title(f'Raw Series: {focus_language}')
axes[0, 1].plot(diff1.index, diff1.values, color='#ff7f0e')
axes[0, 1].set_title('First Difference')
plot_acf(seasonal_diff, lags=30, ax=axes[1, 0])
axes[1, 0].set_title('ACF on Seasonal-Differenced Series')
plot_pacf(diff1, lags=30, ax=axes[1, 1], method='ywm')
axes[1, 1].set_title('PACF on First-Differenced Series')
plt.tight_layout()
plt.show()
Focus language: en
ADF p-value (raw): 0.25173
ADF p-value (1st diff): 0.00000
ADF p-value (seasonal diff, lag=7): 0.00003
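The lag-7 difference used above removes any component that repeats exactly week over week; a minimal synthetic illustration (hypothetical dates and values):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-04', periods=56, freq='D')   # 8 full weeks
weekly = np.tile([100.0, 98, 97, 99, 103, 120, 125], 8)   # pure weekly cycle
s = pd.Series(weekly, index=idx)

sdiff = s.diff(7).dropna()
print(sdiff.abs().max())  # 0.0: the weekly pattern is eliminated
```

Real traffic also carries trend and noise, so the seasonal difference only attenuates rather than zeroes the series, but the principle is the same.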
We use a chronological split with separate validation and test windows, so model selection never sees future data and look-ahead leakage is avoided.
def ensure_daily_freq(series):
s = pd.Series(series).dropna().astype(float).copy()
s.index = pd.DatetimeIndex(s.index)
return s.asfreq('D')
def ts_split(series, val_size=60, test_size=60):
s = ensure_daily_freq(series)
if len(s) <= (val_size + test_size + 30):
raise ValueError('Series too short for requested split sizes.')
train = s.iloc[:-(val_size + test_size)].asfreq('D')
val = s.iloc[-(val_size + test_size):-test_size].asfreq('D')
test = s.iloc[-test_size:].asfreq('D')
return train, val, test
def smape(y_true, y_pred):
y_true = np.asarray(y_true, dtype=float)
y_pred = np.asarray(y_pred, dtype=float)
denom = (np.abs(y_true) + np.abs(y_pred))
valid = denom != 0
out = np.zeros_like(y_true, dtype=float)
out[valid] = 2.0 * np.abs(y_pred[valid] - y_true[valid]) / denom[valid]
return float(np.mean(out) * 100)
def regression_metrics(y_true, y_pred):
y_true = np.asarray(y_true, dtype=float)
y_pred = np.asarray(y_pred, dtype=float)
return {
'MAE': mean_absolute_error(y_true, y_pred),
'RMSE': math.sqrt(mean_squared_error(y_true, y_pred)),
'R2': r2_score(y_true, y_pred),
'sMAPE': smape(y_true, y_pred)
}
def naive_forecast(train, horizon):
return np.repeat(train.iloc[-1], horizon)
def seasonal_naive_forecast(train, horizon, season_len=7):
if len(train) < season_len:
return naive_forecast(train, horizon)
pattern = train.iloc[-season_len:].values
reps = int(np.ceil(horizon / season_len))
return np.tile(pattern, reps)[:horizon]
def moving_average_forecast(train, horizon, window=7):
value = float(train.iloc[-window:].mean()) if len(train) >= window else float(train.mean())
return np.repeat(value, horizon)
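A quick check of the seasonal-naive mechanics defined above (toy numbers): the last fully observed week is tiled across the forecast horizon and truncated to length.

```python
import numpy as np
import pandas as pd

train = pd.Series([10.0, 12, 11, 13, 15, 20, 22,    # week 1
                   11.0, 12, 12, 14, 16, 21, 23])   # week 2
# Seasonal naive: repeat the last full week across a 10-day horizon
pattern = train.iloc[-7:].values
forecast = np.tile(pattern, int(np.ceil(10 / 7)))[:10]
print(forecast.tolist())  # [11.0, 12.0, 12.0, 14.0, 16.0, 21.0, 23.0, 11.0, 12.0, 12.0]
```

Despite its simplicity, this benchmark is hard to beat on strongly weekly series, which is why it wins several languages in the table below.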
# Baseline benchmark across all languages
baseline_rows = []
for lang in lang_daily.columns:
series = lang_daily[lang]
try:
train, val, test = ts_split(series, val_size=60, test_size=60)
except ValueError:
continue
for model_name, pred_val, pred_test in [
('Naive', naive_forecast(train, len(val)), naive_forecast(pd.concat([train, val]), len(test))),
('SeasonalNaive_7', seasonal_naive_forecast(train, len(val), 7), seasonal_naive_forecast(pd.concat([train, val]), len(test), 7)),
('MovingAverage_7', moving_average_forecast(train, len(val), 7), moving_average_forecast(pd.concat([train, val]), len(test), 7))
]:
val_metrics = regression_metrics(val.values, pred_val)
test_metrics = regression_metrics(test.values, pred_test)
baseline_rows.append({
'language': lang,
'model': model_name,
'val_MAE': val_metrics['MAE'],
'val_RMSE': val_metrics['RMSE'],
'val_R2': val_metrics['R2'],
'val_sMAPE': val_metrics['sMAPE'],
'test_MAE': test_metrics['MAE'],
'test_RMSE': test_metrics['RMSE'],
'test_R2': test_metrics['R2'],
'test_sMAPE': test_metrics['sMAPE']
})
baseline_results = pd.DataFrame(baseline_rows)
display(baseline_results.sort_values(['language', 'val_RMSE']).head(20))
baseline_best = baseline_results.sort_values('val_RMSE').groupby('language', as_index=False).first()
display(baseline_best.head(15))
| | language | model | val_MAE | val_RMSE | val_R2 | val_sMAPE | test_MAE | test_RMSE | test_R2 | test_sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | de | SeasonalNaive_7 | 1.044610e+06 | 1.350131e+06 | -0.734801 | 8.880619 | 1.289662e+06 | 1.974741e+06 | -0.163847 | 9.500081 |
| 2 | de | MovingAverage_7 | 1.086218e+06 | 1.463757e+06 | -1.039086 | 9.101379 | 1.250064e+06 | 1.992696e+06 | -0.185108 | 9.163234 |
| 0 | de | Naive | 1.324366e+06 | 1.664018e+06 | -1.635200 | 11.284589 | 1.503639e+06 | 1.905302e+06 | -0.083437 | 11.078940 |
| 5 | en | MovingAverage_7 | 1.430048e+07 | 1.505198e+07 | -3.352338 | 16.702642 | 1.080793e+07 | 1.472770e+07 | -1.056013 | 12.612268 |
| 4 | en | SeasonalNaive_7 | 1.500363e+07 | 1.793098e+07 | -5.176513 | 17.136392 | 1.079519e+07 | 1.461450e+07 | -1.024530 | 12.737508 |
| 3 | en | Naive | 2.071569e+07 | 2.145548e+07 | -7.843246 | 23.262108 | 1.049657e+07 | 1.445132e+07 | -0.979570 | 12.208008 |
| 7 | es | SeasonalNaive_7 | 1.153090e+06 | 1.447362e+06 | 0.360220 | 7.471692 | 1.677730e+06 | 2.154497e+06 | 0.239638 | 12.422111 |
| 6 | es | Naive | 1.663483e+06 | 1.934284e+06 | -0.142659 | 11.790457 | 2.133723e+06 | 2.649801e+06 | -0.150151 | 15.916582 |
| 8 | es | MovingAverage_7 | 1.608444e+06 | 2.121018e+06 | -0.373931 | 11.318572 | 1.932533e+06 | 2.483846e+06 | -0.010596 | 14.478320 |
| 10 | fr | SeasonalNaive_7 | 6.908689e+05 | 9.680532e+05 | -0.358535 | 7.516182 | 1.082770e+06 | 1.490025e+06 | -0.747907 | 11.345002 |
| 11 | fr | MovingAverage_7 | 7.497384e+05 | 1.062692e+06 | -0.637146 | 8.149931 | 1.077244e+06 | 1.494842e+06 | -0.759226 | 11.222614 |
| 9 | fr | Naive | 1.059159e+06 | 1.342409e+06 | -1.612416 | 11.868966 | 8.806541e+05 | 1.133262e+06 | -0.011094 | 9.017285 |
| 12 | ja | Naive | 1.195295e+06 | 1.551150e+06 | -0.071960 | 8.873690 | 2.011918e+06 | 2.632484e+06 | -1.360986 | 15.871393 |
| 14 | ja | MovingAverage_7 | 1.714898e+06 | 1.980059e+06 | -0.746735 | 12.541340 | 1.159882e+06 | 1.763296e+06 | -0.059284 | 8.649001 |
| 13 | ja | SeasonalNaive_7 | 1.698616e+06 | 2.253845e+06 | -1.263180 | 11.454178 | 1.147866e+06 | 1.643666e+06 | 0.079573 | 8.754628 |
| 15 | ru | Naive | 9.455526e+05 | 1.315853e+06 | -1.021221 | 10.041249 | 9.394439e+05 | 1.363121e+06 | -0.666671 | 8.776396 |
| 17 | ru | MovingAverage_7 | 1.642484e+06 | 1.885311e+06 | -3.149208 | 18.492875 | 1.248138e+06 | 1.616317e+06 | -1.343333 | 11.968518 |
| 16 | ru | SeasonalNaive_7 | 1.642021e+06 | 1.953916e+06 | -3.456676 | 18.589165 | 1.261780e+06 | 1.628638e+06 | -1.379196 | 12.226703 |
| 20 | unknown | MovingAverage_7 | 6.878802e+06 | 8.006070e+06 | -1.091891 | 11.359965 | 5.243026e+06 | 6.789321e+06 | -0.246347 | 8.826614 |
| 19 | unknown | SeasonalNaive_7 | 7.663333e+06 | 9.545938e+06 | -1.973976 | 12.095394 | 3.762603e+06 | 5.340490e+06 | 0.228833 | 6.324407 |
| | language | model | val_MAE | val_RMSE | val_R2 | val_sMAPE | test_MAE | test_RMSE | test_R2 | test_sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | de | SeasonalNaive_7 | 1.044610e+06 | 1.350131e+06 | -0.734801 | 8.880619 | 1.289662e+06 | 1.974741e+06 | -0.163847 | 9.500081 |
| 1 | en | MovingAverage_7 | 1.430048e+07 | 1.505198e+07 | -3.352338 | 16.702642 | 1.080793e+07 | 1.472770e+07 | -1.056013 | 12.612268 |
| 2 | es | SeasonalNaive_7 | 1.153090e+06 | 1.447362e+06 | 0.360220 | 7.471692 | 1.677730e+06 | 2.154497e+06 | 0.239638 | 12.422111 |
| 3 | fr | SeasonalNaive_7 | 6.908689e+05 | 9.680532e+05 | -0.358535 | 7.516182 | 1.082770e+06 | 1.490025e+06 | -0.747907 | 11.345002 |
| 4 | ja | Naive | 1.195295e+06 | 1.551150e+06 | -0.071960 | 8.873690 | 2.011918e+06 | 2.632484e+06 | -1.360986 | 15.871393 |
| 5 | ru | Naive | 9.455526e+05 | 1.315853e+06 | -1.021221 | 10.041249 | 9.394439e+05 | 1.363121e+06 | -0.666671 | 8.776396 |
| 6 | unknown | MovingAverage_7 | 6.878802e+06 | 8.006070e+06 | -1.091891 | 11.359965 | 5.243026e+06 | 6.789321e+06 | -0.246347 | 8.826614 |
| 7 | zh | SeasonalNaive_7 | 2.891664e+05 | 4.044059e+05 | 0.380032 | 6.143037 | 2.293596e+05 | 3.680245e+05 | 0.345988 | 4.930013 |
For English, the campaign exogenous input is used only when the file is available.
# Prepare exogenous signal for English when available
campaign_series = None
if EXOG_PATH is not None:
exog_df = pd.read_csv(EXOG_PATH)
exog_df.columns = [str(c).strip() for c in exog_df.columns]
date_col_candidates = [c for c in exog_df.columns if 'date' in c.lower()]
flag_col_candidates = [c for c in exog_df.columns if c not in date_col_candidates]
if date_col_candidates and flag_col_candidates:
dcol = date_col_candidates[0]
fcol = flag_col_candidates[0]
exog_df[dcol] = pd.to_datetime(exog_df[dcol], errors='coerce')
campaign_series = exog_df.set_index(dcol)[fcol].astype(float).sort_index()
campaign_series = campaign_series.reindex(lang_daily.index).fillna(0.0)
print('Campaign exogenous series active:' if campaign_series is not None else 'Campaign exogenous series inactive (file unavailable or format mismatch).')
Campaign exogenous series inactive (file unavailable or format mismatch).
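The reindex/fillna step in the cell above aligns sparse campaign flags to the full daily index, treating unlisted days as "no campaign"; a small sketch with hypothetical dates:

```python
import pandas as pd

full_idx = pd.date_range('2021-01-01', periods=7, freq='D')
flags = pd.Series([1.0], index=pd.to_datetime(['2021-01-04']))  # one campaign day

# Days absent from the flag file become 0.0 (no campaign)
aligned = flags.reindex(full_idx).fillna(0.0)
print(aligned.tolist())  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

This keeps the exogenous regressor the same length as the target series, which SARIMAX requires.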
def fit_best_arima(train, val):
best = None
best_rmse = np.inf
train = ensure_daily_freq(train)
val = ensure_daily_freq(val)
for p in [0, 1, 2]:
for d in [0, 1]:
for q in [0, 1, 2]:
order = (p, d, q)
try:
model = SARIMAX(train, order=order, seasonal_order=(0, 0, 0, 0),
enforce_stationarity=False, enforce_invertibility=False)
fitted = model.fit(disp=False)
pred = fitted.forecast(steps=len(val))
rmse = math.sqrt(mean_squared_error(val, pred))
if rmse < best_rmse:
best_rmse = rmse
best = {'order': order, 'fitted': fitted, 'val_pred': pred}
except Exception:
continue
return best
def fit_sarimax(train, val, exog_train=None, exog_val=None):
# Fixed weekly seasonal structure as a practical starting point
train = ensure_daily_freq(train)
val = ensure_daily_freq(val)
model = SARIMAX(
train,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 7),
exog=exog_train,
enforce_stationarity=False,
enforce_invertibility=False
)
fitted = model.fit(disp=False)
pred = fitted.forecast(steps=len(val), exog=exog_val)
return fitted, pred
model_rows = []
model_store = {}
for lang in lang_daily.columns:
series = ensure_daily_freq(lang_daily[lang])
try:
train, val, test = ts_split(series, val_size=60, test_size=60)
except ValueError:
continue
best_arima = fit_best_arima(train, val)
if best_arima is not None:
train_val = ensure_daily_freq(pd.concat([train, val]))
arima_refit = SARIMAX(train_val, order=best_arima['order'], seasonal_order=(0, 0, 0, 0),
enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
arima_test_pred = arima_refit.forecast(steps=len(test))
vm = regression_metrics(val.values, best_arima['val_pred'])
tm = regression_metrics(test.values, arima_test_pred.values)
model_rows.append({
'language': lang, 'model': 'ARIMA_tuned', 'order': best_arima['order'],
'val_RMSE': vm['RMSE'], 'val_MAE': vm['MAE'], 'val_R2': vm['R2'], 'val_sMAPE': vm['sMAPE'],
'test_RMSE': tm['RMSE'], 'test_MAE': tm['MAE'], 'test_R2': tm['R2'], 'test_sMAPE': tm['sMAPE']
})
model_store[(lang, 'ARIMA_tuned')] = arima_refit
try:
use_exog = (lang == 'en') and (campaign_series is not None)
exog_train = campaign_series.loc[train.index] if use_exog else None
exog_val = campaign_series.loc[val.index] if use_exog else None
exog_train_val = campaign_series.loc[ensure_daily_freq(pd.concat([train, val])).index] if use_exog else None
exog_test = campaign_series.loc[test.index] if use_exog else None
sarimax_fit, sarimax_val_pred = fit_sarimax(train, val, exog_train=exog_train, exog_val=exog_val)
sarimax_refit = SARIMAX(
ensure_daily_freq(pd.concat([train, val])),
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 7),
exog=exog_train_val,
enforce_stationarity=False,
enforce_invertibility=False
).fit(disp=False)
sarimax_test_pred = sarimax_refit.forecast(steps=len(test), exog=exog_test)
vm = regression_metrics(val.values, sarimax_val_pred.values)
tm = regression_metrics(test.values, sarimax_test_pred.values)
label = 'SARIMAX_weekly_exog' if use_exog else 'SARIMA_weekly'
model_rows.append({
'language': lang, 'model': label, 'order': '(1,1,1)x(1,1,1,7)',
'val_RMSE': vm['RMSE'], 'val_MAE': vm['MAE'], 'val_R2': vm['R2'], 'val_sMAPE': vm['sMAPE'],
'test_RMSE': tm['RMSE'], 'test_MAE': tm['MAE'], 'test_R2': tm['R2'], 'test_sMAPE': tm['sMAPE']
})
model_store[(lang, label)] = sarimax_refit
except Exception:
pass
model_results = pd.DataFrame(model_rows)
display(model_results.sort_values(['language', 'val_RMSE']).head(30))
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
| | language | model | order | val_RMSE | val_MAE | val_R2 | val_sMAPE | test_RMSE | test_MAE | test_R2 | test_sMAPE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | de | ARIMA_tuned | (2, 1, 2) | 1.091433e+06 | 7.346878e+05 | -0.133684 | 5.985307 | 2.087948e+06 | 1.323999e+06 | -0.301114 | 9.755496 |
| 1 | de | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 1.999362e+06 | 1.697916e+06 | -2.804351 | 14.994343 | 1.943448e+06 | 1.588865e+06 | -0.127254 | 11.654639 |
| 2 | en | ARIMA_tuned | (1, 1, 2) | 1.687825e+07 | 1.610558e+07 | -4.472555 | 18.602466 | 1.418499e+07 | 1.021441e+07 | -0.907278 | 11.843813 |
| 3 | en | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 3.446287e+07 | 3.296072e+07 | -21.815923 | 34.225436 | 1.615807e+07 | 1.268826e+07 | -1.474769 | 15.244994 |
| 4 | es | ARIMA_tuned | (2, 1, 0) | 1.812323e+06 | 1.558508e+06 | -0.003108 | 11.033012 | 2.497843e+06 | 1.951867e+06 | -0.022018 | 14.620436 |
| 5 | es | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 4.209881e+06 | 3.749001e+06 | -4.412724 | 23.048616 | 3.576774e+06 | 3.286960e+06 | -1.095615 | 29.919822 |
| 6 | fr | ARIMA_tuned | (2, 0, 1) | 1.018737e+06 | 8.090800e+05 | -0.504516 | 8.605381 | 1.482136e+06 | 1.289927e+06 | -0.729446 | 12.971531 |
| 7 | fr | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 1.683568e+06 | 1.465010e+06 | -3.108978 | 17.141365 | 1.048907e+06 | 6.868643e+05 | 0.133826 | 6.963683 |
| 9 | ja | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 1.156751e+06 | 8.035708e+05 | 0.403856 | 6.036929 | 3.383531e+06 | 2.687904e+06 | -2.900340 | 22.744043 |
| 8 | ja | ARIMA_tuned | (1, 1, 0) | 1.506750e+06 | 1.180122e+06 | -0.011471 | 8.759175 | 2.545935e+06 | 1.913543e+06 | -1.208293 | 14.995932 |
| 10 | ru | ARIMA_tuned | (0, 1, 1) | 1.208114e+06 | 8.151557e+05 | -0.703786 | 8.537022 | 1.400110e+06 | 9.834271e+05 | -0.758350 | 9.223951 |
| 11 | ru | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 6.084890e+06 | 5.290948e+06 | -42.221977 | 86.817080 | 1.451566e+06 | 1.112463e+06 | -0.889968 | 10.610448 |
| 12 | unknown | ARIMA_tuned | (2, 0, 0) | 7.530317e+06 | 6.128458e+06 | -0.850661 | 10.295224 | 1.124593e+07 | 9.572150e+06 | -2.419613 | 17.437653 |
| 13 | unknown | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 1.158226e+07 | 1.092534e+07 | -3.378115 | 17.049478 | 7.380442e+06 | 6.402504e+06 | -0.472824 | 11.282125 |
| 14 | zh | ARIMA_tuned | (2, 1, 2) | 5.264696e+05 | 4.344343e+05 | -0.050706 | 9.163610 | 4.535134e+05 | 3.431601e+05 | 0.006855 | 7.302086 |
| 15 | zh | SARIMA_weekly | (1,1,1)x(1,1,1,7) | 7.065007e+05 | 5.987881e+05 | -0.892169 | 13.268069 | 6.630935e+05 | 5.652344e+05 | -1.123156 | 13.105324 |
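The table above reports RMSE, MAE, R2, and sMAPE via the `regression_metrics` helper defined earlier in the notebook. For reference, a minimal sketch of such a helper (the exact epsilon guard and return keys are assumptions made to match the table columns):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """Return the four scores used in the comparison tables (sketch)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    # Symmetric MAPE in percent; the epsilon guards against zero denominators
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    smape = float(np.mean(np.abs(y_true - y_pred) / np.maximum(denom, 1e-9)) * 100.0)
    return {'RMSE': rmse, 'MAE': mae, 'R2': r2, 'sMAPE': smape}
```

sMAPE is preferred over plain MAPE here because it stays bounded when actuals approach zero, which matters for low-traffic days.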
# Optional Prophet benchmark for top 3 languages if package is available
prophet_rows = []
try:
    from prophet import Prophet
    prophet_available = True
except Exception:
    prophet_available = False
if prophet_available:
    for lang in top_languages[:3]:
        series = lang_daily[lang].astype(float)
        try:
            train, val, test = ts_split(series, val_size=60, test_size=60)
        except ValueError:
            continue
        train_df = pd.DataFrame({'ds': train.index, 'y': train.values})
        m = Prophet(daily_seasonality=False, weekly_seasonality=True, yearly_seasonality=True)
        m.fit(train_df)
        future_val = pd.DataFrame({'ds': val.index})
        val_pred = m.predict(future_val)['yhat'].values
        # Refit on train+val before scoring the holdout (concat computed once)
        train_val = pd.concat([train, val])
        train_val_df = pd.DataFrame({'ds': train_val.index, 'y': train_val.values})
        m2 = Prophet(daily_seasonality=False, weekly_seasonality=True, yearly_seasonality=True)
        m2.fit(train_val_df)
        test_pred = m2.predict(pd.DataFrame({'ds': test.index}))['yhat'].values
        vm = regression_metrics(val.values, val_pred)
        tm = regression_metrics(test.values, test_pred)
        prophet_rows.append({
            'language': lang, 'model': 'Prophet_weekly_yearly',
            'val_RMSE': vm['RMSE'], 'val_MAE': vm['MAE'], 'val_R2': vm['R2'], 'val_sMAPE': vm['sMAPE'],
            'test_RMSE': tm['RMSE'], 'test_MAE': tm['MAE'], 'test_R2': tm['R2'], 'test_sMAPE': tm['sMAPE']
        })
    prophet_results = pd.DataFrame(prophet_rows)
    display(prophet_results)
else:
    prophet_results = pd.DataFrame()
    print('Prophet not available in this environment. Statistical model comparison proceeds with ARIMA/SARIMA variants.')
Prophet not available in this environment. Statistical model comparison proceeds with ARIMA/SARIMA variants.
# Consolidated comparison and winner selection
all_results = pd.concat([baseline_results, model_results, prophet_results], ignore_index=True, sort=False)
all_results = all_results.dropna(subset=['val_RMSE', 'test_RMSE'])
winner_table = all_results.sort_values(['language', 'test_RMSE']).groupby('language', as_index=False).first()
winner_table = winner_table.sort_values('test_RMSE')
display(winner_table)
summary_table = winner_table[['language', 'model', 'test_MAE', 'test_RMSE', 'test_R2', 'test_sMAPE']].copy()
summary_table = summary_table.rename(columns={
'language': 'Language',
'model': 'Winning Model',
'test_MAE': 'Test MAE',
'test_RMSE': 'Test RMSE',
'test_R2': 'Test R2',
'test_sMAPE': 'Test sMAPE (%)'
})
display(summary_table.head(20))
| | language | model | val_MAE | val_RMSE | val_R2 | val_sMAPE | test_MAE | test_RMSE | test_R2 | test_sMAPE | order |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | zh | SeasonalNaive_7 | 2.891664e+05 | 4.044059e+05 | 0.380032 | 6.143037 | 2.293596e+05 | 3.680245e+05 | 0.345988 | 4.930013 | (2, 1, 2) |
| 3 | fr | SARIMA_weekly | 1.465010e+06 | 1.683568e+06 | -3.108978 | 17.141365 | 6.868643e+05 | 1.048907e+06 | 0.133826 | 6.963683 | (1,1,1)x(1,1,1,7) |
| 5 | ru | Naive | 9.455526e+05 | 1.315853e+06 | -1.021221 | 10.041249 | 9.394439e+05 | 1.363121e+06 | -0.666671 | 8.776396 | (0, 1, 1) |
| 4 | ja | SeasonalNaive_7 | 1.698616e+06 | 2.253845e+06 | -1.263180 | 11.454178 | 1.147866e+06 | 1.643666e+06 | 0.079573 | 8.754628 | (1, 1, 0) |
| 0 | de | Naive | 1.324366e+06 | 1.664018e+06 | -1.635200 | 11.284589 | 1.503639e+06 | 1.905302e+06 | -0.083437 | 11.078940 | (1,1,1)x(1,1,1,7) |
| 2 | es | SeasonalNaive_7 | 1.153090e+06 | 1.447362e+06 | 0.360220 | 7.471692 | 1.677730e+06 | 2.154497e+06 | 0.239638 | 12.422111 | (2, 1, 0) |
| 6 | unknown | SeasonalNaive_7 | 7.663333e+06 | 9.545938e+06 | -1.973976 | 12.095394 | 3.762603e+06 | 5.340490e+06 | 0.228833 | 6.324407 | (1,1,1)x(1,1,1,7) |
| 1 | en | ARIMA_tuned | 1.610558e+07 | 1.687825e+07 | -4.472555 | 18.602466 | 1.021441e+07 | 1.418499e+07 | -0.907278 | 11.843813 | (1, 1, 2) |
| | Language | Winning Model | Test MAE | Test RMSE | Test R2 | Test sMAPE (%) |
|---|---|---|---|---|---|---|
| 7 | zh | SeasonalNaive_7 | 2.293596e+05 | 3.680245e+05 | 0.345988 | 4.930013 |
| 3 | fr | SARIMA_weekly | 6.868643e+05 | 1.048907e+06 | 0.133826 | 6.963683 |
| 5 | ru | Naive | 9.394439e+05 | 1.363121e+06 | -0.666671 | 8.776396 |
| 4 | ja | SeasonalNaive_7 | 1.147866e+06 | 1.643666e+06 | 0.079573 | 8.754628 |
| 0 | de | Naive | 1.503639e+06 | 1.905302e+06 | -0.083437 | 11.078940 |
| 2 | es | SeasonalNaive_7 | 1.677730e+06 | 2.154497e+06 | 0.239638 | 12.422111 |
| 6 | unknown | SeasonalNaive_7 | 3.762603e+06 | 5.340490e+06 | 0.228833 | 6.324407 |
| 1 | en | ARIMA_tuned | 1.021441e+07 | 1.418499e+07 | -0.907278 | 11.843813 |
The winner table should be read as an operating recommendation map, not just a leaderboard. Some languages retain strong performance from seasonal naive forecasts, which is an acceptable production choice when complexity does not produce reliable lift.
Where ARIMA or SARIMA wins, the gain comes from better adaptation to trend shifts and cyclical behavior. This supports a portfolio strategy where each language gets the lightest model that meets accuracy and stability thresholds.
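The "lightest model that meets thresholds" policy can be made explicit as a tie-break rule: rank candidates by complexity and promote the simplest one whose test RMSE is within a tolerance of the per-language best. A minimal sketch on toy rows (column names mirror the tables above; the complexity ranking and the 10% tolerance are illustrative assumptions, not values from this study):

```python
import pandas as pd

# Lower rank = lighter model; promote the lightest within tolerance of the best RMSE
COMPLEXITY = {'Naive': 0, 'SeasonalNaive_7': 1, 'MovingAverage_7': 1,
              'ARIMA_tuned': 2, 'SARIMA_weekly': 3, 'SARIMAX_weekly_exog': 4}

def pick_lightest(df, tol=0.10):
    out = []
    for lang, g in df.groupby('language'):
        best = g['test_RMSE'].min()
        ok = g[g['test_RMSE'] <= best * (1 + tol)].copy()
        ok['rank'] = ok['model'].map(COMPLEXITY)
        out.append(ok.sort_values(['rank', 'test_RMSE']).iloc[0])
    return pd.DataFrame(out)

toy = pd.DataFrame({
    'language': ['fr', 'fr', 'zh', 'zh'],
    'model': ['SARIMA_weekly', 'SeasonalNaive_7', 'ARIMA_tuned', 'SeasonalNaive_7'],
    'test_RMSE': [1.05e6, 1.10e6, 4.5e5, 3.7e5],
})
print(pick_lightest(toy)[['language', 'model']])
```

On the toy rows, fr promotes the seasonal baseline because it sits within 10% of the SARIMA RMSE, while zh keeps its baseline because ARIMA misses the tolerance.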
# Residual diagnostics for the highest-volume language using its best model
diag_lang = top_languages[0] if top_languages else winner_table.iloc[0]['language']
chosen_row = winner_table[winner_table['language'] == diag_lang].iloc[0]
chosen_model_name = chosen_row['model']
series = ensure_daily_freq(lang_daily[diag_lang])
train, val, test = ts_split(series, val_size=60, test_size=60)
train_val = ensure_daily_freq(pd.concat([train, val]))
if chosen_model_name in ['ARIMA_tuned', 'SARIMA_weekly', 'SARIMAX_weekly_exog']:
    if chosen_model_name == 'ARIMA_tuned':
        best_order = tuple(model_results[(model_results['language'] == diag_lang) & (model_results['model'] == 'ARIMA_tuned')].iloc[0]['order'])
        final_model = SARIMAX(train_val, order=best_order, seasonal_order=(0, 0, 0, 0),
                              enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
        pred_test = final_model.forecast(steps=len(test))
    else:
        use_exog = (chosen_model_name == 'SARIMAX_weekly_exog') and (campaign_series is not None) and (diag_lang == 'en')
        exog_train_val = campaign_series.loc[train_val.index] if use_exog else None
        exog_test = campaign_series.loc[test.index] if use_exog else None
        final_model = SARIMAX(
            train_val, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7), exog=exog_train_val,
            enforce_stationarity=False, enforce_invertibility=False
        ).fit(disp=False)
        pred_test = final_model.forecast(steps=len(test), exog=exog_test)
    residuals = test - pred_test.values
    fig, axes = plt.subplots(1, 2, figsize=(14, 4))
    axes[0].plot(test.index, residuals, color='#d62728')
    axes[0].axhline(0, linestyle='--', color='black')
    axes[0].set_title(f'Residuals Over Time: {diag_lang}')
    sns.histplot(residuals, kde=True, ax=axes[1], color='#2ca02c')
    axes[1].set_title('Residual Distribution')
    plt.tight_layout()
    plt.show()
    lb = acorr_ljungbox(residuals, lags=[7, 14], return_df=True)
    print('Ljung-Box test on residuals:')
    display(lb)
else:
    print(f'Residual diagnostics for model {chosen_model_name} are not implemented in this cell.')
Ljung-Box test on residuals:
| lag | lb_stat | lb_pvalue |
|---|---|---|
| 7 | 46.792705 | 6.126515e-08 |
| 14 | 50.597005 | 4.848805e-06 |
# 30-day forward forecast for top 3 languages using winner model family
forecast_horizon = 30
forecast_rows = []
for lang in top_languages[:3]:
    row = winner_table[winner_table['language'] == lang]
    if row.empty:
        continue
    best_model = row.iloc[0]['model']
    s = ensure_daily_freq(lang_daily[lang].astype(float))
    future_index = pd.date_range(start=s.index.max() + pd.Timedelta(days=1), periods=forecast_horizon, freq='D')
    if best_model == 'SeasonalNaive_7':
        preds = seasonal_naive_forecast(s, forecast_horizon, season_len=7)
    elif best_model == 'Naive':
        preds = naive_forecast(s, forecast_horizon)
    elif best_model == 'MovingAverage_7':
        preds = moving_average_forecast(s, forecast_horizon, window=7)
    elif best_model == 'ARIMA_tuned':
        order_raw = model_results[(model_results['language'] == lang) & (model_results['model'] == 'ARIMA_tuned')].iloc[0]['order']
        order = tuple(order_raw) if not isinstance(order_raw, tuple) else order_raw
        fit = SARIMAX(s, order=order, seasonal_order=(0, 0, 0, 0),
                      enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
        preds = fit.forecast(steps=forecast_horizon).values
    elif best_model in ['SARIMA_weekly', 'SARIMAX_weekly_exog']:
        use_exog = (best_model == 'SARIMAX_weekly_exog') and (campaign_series is not None) and (lang == 'en')
        exog_all = campaign_series.loc[s.index] if use_exog else None
        fit = SARIMAX(s, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7), exog=exog_all,
                      enforce_stationarity=False, enforce_invertibility=False).fit(disp=False)
        if use_exog:
            # Assume no campaign days in the forecast window; shape (steps, k_exog)
            future_exog = np.zeros((forecast_horizon, 1))
            preds = fit.forecast(steps=forecast_horizon, exog=future_exog).values
        else:
            preds = fit.forecast(steps=forecast_horizon).values
    else:
        continue
    for d, p in zip(future_index, preds):
        forecast_rows.append({'date': d, 'language': lang, 'forecast_views': max(0.0, float(p)), 'model': best_model})
forecast_df = pd.DataFrame(forecast_rows)
display(forecast_df.head(15))
plt.figure(figsize=(13, 5))
for lang in forecast_df['language'].unique():
    recent = lang_daily[lang].tail(90)
    fut = forecast_df[forecast_df['language'] == lang]
    plt.plot(recent.index, recent.values, label=f'{lang} history')
    plt.plot(fut['date'], fut['forecast_views'], linestyle='--', label=f'{lang} forecast')
plt.title('30-Day Forecast: Top Languages')
plt.legend(ncol=2)
plt.xticks(rotation=30)
plt.show()
| | date | language | forecast_views | model |
|---|---|---|---|---|
| 0 | 2017-01-01 | en | 9.299734e+07 | ARIMA_tuned |
| 1 | 2017-01-02 | en | 9.379401e+07 | ARIMA_tuned |
| 2 | 2017-01-03 | en | 9.415381e+07 | ARIMA_tuned |
| 3 | 2017-01-04 | en | 9.431631e+07 | ARIMA_tuned |
| 4 | 2017-01-05 | en | 9.438969e+07 | ARIMA_tuned |
| 5 | 2017-01-06 | en | 9.442284e+07 | ARIMA_tuned |
| 6 | 2017-01-07 | en | 9.443781e+07 | ARIMA_tuned |
| 7 | 2017-01-08 | en | 9.444457e+07 | ARIMA_tuned |
| 8 | 2017-01-09 | en | 9.444762e+07 | ARIMA_tuned |
| 9 | 2017-01-10 | en | 9.444900e+07 | ARIMA_tuned |
| 10 | 2017-01-11 | en | 9.444962e+07 | ARIMA_tuned |
| 11 | 2017-01-12 | en | 9.444990e+07 | ARIMA_tuned |
| 12 | 2017-01-13 | en | 9.445003e+07 | ARIMA_tuned |
| 13 | 2017-01-14 | en | 9.445009e+07 | ARIMA_tuned |
| 14 | 2017-01-15 | en | 9.445011e+07 | ARIMA_tuned |
The analysis confirms a strongly uneven language demand mix, with English and a high-volume unknown-language bucket dominating total traffic. This concentration implies that forecasting quality improvements in top languages can materially impact ad-allocation efficiency.
Weekly seasonality appears consistently across major languages, and decomposition isolates this recurring signal clearly from trend movement. This directly supports including seasonal structures with period 7 in production-oriented forecasting models.
Outlier diagnostics show periodic spike risk in several languages, especially where event shocks are common. For operations, this means model performance should be monitored with rolling error thresholds and not judged by a single static metric snapshot.
From a stationarity perspective, first differencing and weekly differencing materially improve stability for the focus series. That validates the decision to benchmark differenced ARIMA-family models against simpler naive baselines before selecting deployment candidates.
Model-family guidance for scale: use seasonal-naive as a strong fallback baseline, and promote language-specific ARIMA/SARIMA variants where they demonstrate repeatable validation and holdout gains.
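The baselines promoted here are cheap, deterministic, and trivial to reimplement anywhere in the stack. A minimal sketch of the three helpers referenced in the forecasting cell (`naive_forecast`, `seasonal_naive_forecast`, `moving_average_forecast`; their signatures are inferred from the call sites above, not from their definitions earlier in the notebook):

```python
import numpy as np
import pandas as pd

def naive_forecast(series, horizon):
    # Repeat the last observed value for every future day
    return np.full(horizon, float(series.iloc[-1]))

def seasonal_naive_forecast(series, horizon, season_len=7):
    # Repeat the final season (e.g. the last 7 days) cyclically
    last_season = series.iloc[-season_len:].to_numpy(dtype=float)
    return np.array([last_season[i % season_len] for i in range(horizon)])

def moving_average_forecast(series, horizon, window=7):
    # Flat forecast at the trailing-window mean
    return np.full(horizon, float(series.iloc[-window:].mean()))

s = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0] * 3)
print(seasonal_naive_forecast(s, 10)[:3])  # repeats the 10, 20, 30, ... pattern
```

Because seasonal naive carries the full weekly shape forward, it is a much stronger fallback for day-of-week traffic than the flat naive or moving-average forecasts.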
Built an end-to-end multi-language page-view forecasting solution for Ad Ease using 550 daily observations per page across Wikipedia traffic. The workflow covered data reliability checks, metadata extraction, language-level aggregation, stationarity diagnostics, decomposition, and leakage-safe chronological validation.
Baseline models (naive, seasonal naive, moving average) were benchmarked against tuned ARIMA and seasonal SARIMA/SARIMAX variants. On the holdout, seasonal-naive and naive baselines remain the most reliable choice for most languages, while ARIMA and SARIMA deliver gains mainly for English and French, where trend shifts and weekly structure are harder for static baselines to track.
Business impact: this forecasting layer enables better ad inventory planning, timing optimization, and language-specific allocation strategies. It also supports proactive interventions for event-driven volatility by monitoring forecast errors and refreshing models on a regular cadence.
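The "monitoring forecast errors" recommendation can be operationalized as a rolling error gate: compute sMAPE over a trailing window and flag a model refresh when it crosses a threshold. A sketch under illustrative assumptions (the 14-day window, the 15% threshold, and the helper name are operating choices for demonstration, not values from this study):

```python
import numpy as np
import pandas as pd

def rolling_smape_alert(actual, forecast, window=14, threshold=15.0):
    """Rolling sMAPE (%) over `window` days; True where the model needs a refresh.
    Threshold and window are illustrative operating choices."""
    a, f = np.asarray(actual, float), np.asarray(forecast, float)
    smape = np.abs(a - f) / np.maximum((np.abs(a) + np.abs(f)) / 2.0, 1e-9) * 100.0
    rolled = pd.Series(smape).rolling(window, min_periods=window).mean()
    return rolled, rolled > threshold

actual = np.r_[np.full(30, 100.0), np.full(30, 160.0)]   # level shift mid-series
forecast = np.full(60, 100.0)                            # stale model misses the shift
rolled, alert = rolling_smape_alert(actual, forecast)
print(int(alert.sum()) > 0)
```

In production this check would run per language on the daily actuals, with alerts feeding the retraining cadence described above.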
This section provides a compact business dashboard with only top KPIs and one model recommendation table. It is designed for quick export and stakeholder review.
# Compact KPI dashboard + single recommendation table for export
from IPython.display import Markdown, display
total_languages = int(winner_table['language'].nunique())
best_avg_rmse = float(winner_table['test_RMSE'].mean())
best_avg_smape = float(winner_table['test_sMAPE'].mean())
top_traffic_language = str(language_volume.iloc[0]['language'])
top_traffic_views = float(language_volume.iloc[0]['views'])
model_mix = winner_table['model'].value_counts()
top_model_family = model_mix.index[0]
top_model_family_count = int(model_mix.iloc[0])
display(Markdown(
f"### KPI Snapshot\n"
f"- Languages modeled: **{total_languages}**\n"
f"- Average test RMSE (winner models): **{best_avg_rmse:,.0f}**\n"
f"- Average test sMAPE (winner models): **{best_avg_smape:.2f}%**\n"
f"- Highest traffic language: **{top_traffic_language}** with **{top_traffic_views:,.0f}** cumulative views\n"
f"- Most selected model family: **{top_model_family}** ({top_model_family_count}/{total_languages} languages)"
))
recommendation_table = winner_table[
['language', 'model', 'test_MAE', 'test_RMSE', 'test_R2', 'test_sMAPE']
].copy()
recommendation_table = recommendation_table.rename(columns={
'language': 'Language',
'model': 'Recommended Model',
'test_MAE': 'Test MAE',
'test_RMSE': 'Test RMSE',
'test_R2': 'Test R2',
'test_sMAPE': 'Test sMAPE (%)'
}).sort_values('Test RMSE')
display(Markdown('### Final Model Recommendation Table'))
display(recommendation_table)
export_path = Path('final_model_recommendation_table.csv')
recommendation_table.to_csv(export_path, index=False)
print(f'Recommendation table exported to: {export_path.resolve()}')
| | Language | Recommended Model | Test MAE | Test RMSE | Test R2 | Test sMAPE (%) |
|---|---|---|---|---|---|---|
| 7 | zh | SeasonalNaive_7 | 2.293596e+05 | 3.680245e+05 | 0.345988 | 4.930013 |
| 3 | fr | SARIMA_weekly | 6.868643e+05 | 1.048907e+06 | 0.133826 | 6.963683 |
| 5 | ru | Naive | 9.394439e+05 | 1.363121e+06 | -0.666671 | 8.776396 |
| 4 | ja | SeasonalNaive_7 | 1.147866e+06 | 1.643666e+06 | 0.079573 | 8.754628 |
| 0 | de | Naive | 1.503639e+06 | 1.905302e+06 | -0.083437 | 11.078940 |
| 2 | es | SeasonalNaive_7 | 1.677730e+06 | 2.154497e+06 | 0.239638 | 12.422111 |
| 6 | unknown | SeasonalNaive_7 | 3.762603e+06 | 5.340490e+06 | 0.228833 | 6.324407 |
| 1 | en | ARIMA_tuned | 1.021441e+07 | 1.418499e+07 | -0.907278 | 11.843813 |
Recommendation table exported to: C:\Users\prath\Desktop\Prathamesh\Prathamesh\Learn\DSML\12 ML2 Unsupervised and RecSys\TimeSeries\analysis_1\final_model_recommendation_table.csv