The company aims to improve user engagement, session depth, and retention by serving relevant movie recommendations to each user. This notebook delivers a production-oriented recommendation analysis using explicit ratings, user demographics, and movie metadata.

Goal: build and evaluate a personalized movie recommender system that can predict a user's rating for an unseen movie and surface top-N recommendations per user.

This case study covers exploratory data analysis, data-quality checks, leakage-aware temporal splitting, model benchmarking, and a consolidated comparison. Primary modeling approaches: a regularized user-item bias baseline, item-based Pearson and user-based cosine neighborhood models, and matrix factorization via truncated SVD.

No client/company names are exposed in this analysis.
import warnings
warnings.filterwarnings("ignore")
from pathlib import Path
from datetime import datetime
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.sparse import csr_matrix
sns.set_theme(style="whitegrid")
pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 160)
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
print("Libraries loaded.")
Libraries loaded.
data_dir = Path("../data")
ratings_cols = ["UserID", "MovieID", "Rating", "Timestamp"]
users_cols = ["UserID", "Gender", "Age", "Occupation", "ZipCode"]
movies_cols = ["MovieID", "Title", "Genres"]
ratings = pd.read_csv(data_dir / "zee-ratings.dat", sep="::", engine="python", names=ratings_cols, encoding="latin-1")
users = pd.read_csv(data_dir / "zee-users.dat", sep="::", engine="python", names=users_cols, encoding="latin-1")
movies = pd.read_csv(data_dir / "zee-movies.dat", sep="::", engine="python", names=movies_cols, encoding="latin-1")
# Remove accidental header rows embedded as first records and enforce schema types
ratings = ratings[ratings["UserID"] != "UserID"].copy()
users = users[users["UserID"] != "UserID"].copy()
movies = movies[movies["MovieID"] != "Movie ID"].copy()
ratings[["UserID", "MovieID", "Rating", "Timestamp"]] = ratings[["UserID", "MovieID", "Rating", "Timestamp"]].apply(pd.to_numeric, errors="coerce")
users[["UserID", "Age", "Occupation"]] = users[["UserID", "Age", "Occupation"]].apply(pd.to_numeric, errors="coerce")
movies["MovieID"] = pd.to_numeric(movies["MovieID"], errors="coerce")
ratings = ratings.dropna(subset=["UserID", "MovieID", "Rating", "Timestamp"]).astype({"UserID": "int64", "MovieID": "int64", "Rating": "float64", "Timestamp": "int64"})
users = users.dropna(subset=["UserID", "Age", "Occupation"]).astype({"UserID": "int64", "Age": "int64", "Occupation": "int64"})
movies = movies.dropna(subset=["MovieID", "Title", "Genres"]).astype({"MovieID": "int64"})
print("Shapes:")
print("ratings:", ratings.shape)
print("users :", users.shape)
print("movies :", movies.shape)
display(ratings.head(3))
display(users.head(3))
display(movies.head(3))
Shapes:
ratings: (1000209, 4)
users : (6040, 5)
movies : (3883, 3)
|  | UserID | MovieID | Rating | Timestamp |
|---|---|---|---|---|
| 1 | 1 | 1193 | 5.0 | 978300760 |
| 2 | 1 | 661 | 3.0 | 978302109 |
| 3 | 1 | 914 | 3.0 | 978301968 |

|  | UserID | Gender | Age | Occupation | ZipCode |
|---|---|---|---|---|---|
| 1 | 1 | F | 1 | 10 | 48067 |
| 2 | 2 | M | 56 | 16 | 70072 |
| 3 | 3 | M | 25 | 15 | 55117 |

|  | MovieID | Title | Genres |
|---|---|---|---|
| 1 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 2 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 3 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
def quick_data_quality(df, name):
    print(f"\n{name} schema")
    display(df.dtypes.to_frame("dtype"))
    miss = df.isna().sum().to_frame("missing_count")
    miss["missing_pct"] = (miss["missing_count"] / len(df) * 100).round(3)
    print(f"{name} missing values")
    display(miss)
    print(f"{name} duplicates: {df.duplicated().sum()}")
quick_data_quality(ratings, "ratings")
quick_data_quality(users, "users")
quick_data_quality(movies, "movies")
print("\nRatings summary:")
display(ratings["Rating"].describe())
print("Unique users in ratings:", ratings["UserID"].nunique())
print("Unique movies in ratings:", ratings["MovieID"].nunique())
ratings schema

|  | dtype |
|---|---|
| UserID | int64 |
| MovieID | int64 |
| Rating | float64 |
| Timestamp | int64 |

ratings missing values

|  | missing_count | missing_pct |
|---|---|---|
| UserID | 0 | 0.0 |
| MovieID | 0 | 0.0 |
| Rating | 0 | 0.0 |
| Timestamp | 0 | 0.0 |

ratings duplicates: 0

users schema

|  | dtype |
|---|---|
| UserID | int64 |
| Gender | str |
| Age | int64 |
| Occupation | int64 |
| ZipCode | str |

users missing values

|  | missing_count | missing_pct |
|---|---|---|
| UserID | 0 | 0.0 |
| Gender | 0 | 0.0 |
| Age | 0 | 0.0 |
| Occupation | 0 | 0.0 |
| ZipCode | 0 | 0.0 |

users duplicates: 0

movies schema

|  | dtype |
|---|---|
| MovieID | int64 |
| Title | str |
| Genres | str |

movies missing values

|  | missing_count | missing_pct |
|---|---|---|
| MovieID | 0 | 0.0 |
| Title | 0 | 0.0 |
| Genres | 0 | 0.0 |

movies duplicates: 0

Ratings summary:

count    1.000209e+06
mean     3.581564e+00
std      1.117102e+00
min      1.000000e+00
25%      3.000000e+00
50%      4.000000e+00
75%      4.000000e+00
max      5.000000e+00
Name: Rating, dtype: float64

Unique users in ratings: 6040
Unique movies in ratings: 3706
# Merge tables for richer EDA
movies["ReleaseYear"] = movies["Title"].str.extract(r"\((\d{4})\)", expand=False).astype(float)
movies["Decade"] = (movies["ReleaseYear"] // 10 * 10).astype("Int64")
ratings_enriched = ratings.merge(users, on="UserID", how="left").merge(movies, on="MovieID", how="left")
ratings_enriched["RatingDate"] = pd.to_datetime(ratings_enriched["Timestamp"], unit="s", errors="coerce")
ratings_enriched["RatingYear"] = ratings_enriched["RatingDate"].dt.year
print("Merged shape:", ratings_enriched.shape)
display(ratings_enriched.head(3))
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
sns.countplot(data=ratings, x="Rating", ax=axes[0, 0], palette="Blues")
axes[0, 0].set_title("Rating Distribution")
user_activity = ratings.groupby("UserID").size()
sns.histplot(user_activity, bins=40, kde=True, ax=axes[0, 1], color="#2ca02c")
axes[0, 1].set_title("Ratings per User")
movie_activity = ratings.groupby("MovieID").size()
sns.histplot(movie_activity, bins=40, kde=True, ax=axes[1, 0], color="#ff7f0e")
axes[1, 0].set_title("Ratings per Movie")
sns.countplot(data=users, x="Gender", ax=axes[1, 1], palette="Set2")
axes[1, 1].set_title("User Gender Distribution")
plt.tight_layout()
plt.show()
print("Skewness diagnostics")
print("Ratings per user skewness:", round(user_activity.skew(), 3))
print("Ratings per movie skewness:", round(movie_activity.skew(), 3))
print("Insight: interaction counts are highly right-skewed, confirming sparsity and popularity concentration.")
Merged shape: (1000209, 14)
|  | UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | ZipCode | Title | Genres | ReleaseYear | Decade | RatingDate | RatingYear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1193 | 5.0 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama | 1975.0 | 1970 | 2000-12-31 22:12:40 | 2000 |
| 1 | 1 | 661 | 3.0 | 978302109 | F | 1 | 10 | 48067 | James and the Giant Peach (1996) | Animation\|Children's\|Musical | 1996.0 | 1990 | 2000-12-31 22:35:09 | 2000 |
| 2 | 1 | 914 | 3.0 | 978301968 | F | 1 | 10 | 48067 | My Fair Lady (1964) | Musical\|Romance | 1964.0 | 1960 | 2000-12-31 22:32:48 | 2000 |
Skewness diagnostics
Ratings per user skewness: 2.745
Ratings per movie skewness: 2.815
Insight: interaction counts are highly right-skewed, confirming sparsity and popularity concentration.
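The sparsity claim can be checked directly from the counts printed earlier: roughly 1.0M ratings spread over 6,040 users and 3,706 rated movies leaves about 95.5% of the interaction matrix empty. A minimal sketch (numbers copied from the shapes and unique counts above):

```python
# Density of the user-item interaction matrix, using the counts printed above.
n_ratings = 1_000_209
n_users = 6_040
n_movies = 3_706  # unique movies actually rated

density = n_ratings / (n_users * n_movies)
sparsity = 1 - density
print(f"Density:  {density:.4%}")   # ~4.47% of cells observed
print(f"Sparsity: {sparsity:.4%}")  # ~95.53% of cells empty
```

This is why neighborhood similarities need overlap-aware shrinkage later: most item pairs share very few co-raters.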
# Bivariate and multivariate EDA
age_activity = ratings_enriched.groupby("Age")["MovieID"].count().sort_values(ascending=False)
occupation_activity = ratings_enriched.groupby("Occupation")["MovieID"].count().sort_values(ascending=False)
genre_exploded = ratings_enriched.assign(Genre=ratings_enriched["Genres"].str.split("|", regex=False)).explode("Genre")
genre_rating = genre_exploded.groupby("Genre")["Rating"].agg(["count", "mean"]).sort_values("count", ascending=False)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
sns.barplot(x=age_activity.index.astype(str), y=age_activity.values, ax=axes[0], palette="viridis")
axes[0].set_title("Ratings Volume by Age Group")
axes[0].tick_params(axis="x", rotation=45)
sns.barplot(x=occupation_activity.head(12).index.astype(str), y=occupation_activity.head(12).values, ax=axes[1], palette="magma")
axes[1].set_title("Top Occupations by Ratings Volume")
axes[1].tick_params(axis="x", rotation=60)
sns.barplot(y=genre_rating.head(12).index, x=genre_rating.head(12)["count"], ax=axes[2], palette="cubehelix")
axes[2].set_title("Top Genres by Rating Count")
plt.tight_layout()
plt.show()
genre_pivot = genre_exploded.pivot_table(index="Age", columns="Genre", values="Rating", aggfunc="mean")
plt.figure(figsize=(14, 6))
sns.heatmap(genre_pivot, cmap="YlGnBu", annot=False)
plt.title("Average Rating by Age Group and Genre")
plt.show()
print("Initial EDA insights captured.")
Initial EDA insights captured.
## Preprocessing and feature engineering
ratings_model = ratings.copy()
ratings_model = ratings_model.sort_values(["UserID", "Timestamp"]).reset_index(drop=True)
# Core aggregate features
user_stats = ratings_model.groupby("UserID")["Rating"].agg(user_mean="mean", user_count="count")
movie_stats = ratings_model.groupby("MovieID")["Rating"].agg(movie_mean="mean", movie_count="count")
ratings_model = ratings_model.join(user_stats, on="UserID")
ratings_model = ratings_model.join(movie_stats, on="MovieID")
print("Feature-engineered ratings sample:")
display(ratings_model.head(3))
def temporal_user_split(df, min_interactions=3):
    """Leave-last-two-out per user: latest rating -> test, second latest -> val."""
    train_parts, val_parts, test_parts = [], [], []
    for _, grp in df.groupby("UserID", sort=False):
        grp = grp.sort_values("Timestamp")
        if len(grp) >= min_interactions:
            test_parts.append(grp.iloc[[-1]])
            val_parts.append(grp.iloc[[-2]])
            train_parts.append(grp.iloc[:-2])
        else:
            train_parts.append(grp)
    train_df = pd.concat(train_parts, ignore_index=True) if train_parts else pd.DataFrame(columns=df.columns)
    val_df = pd.concat(val_parts, ignore_index=True) if val_parts else pd.DataFrame(columns=df.columns)
    test_df = pd.concat(test_parts, ignore_index=True) if test_parts else pd.DataFrame(columns=df.columns)
    return train_df, val_df, test_df
train_df, val_df, test_df = temporal_user_split(ratings_model)
print("Split sizes:", len(train_df), len(val_df), len(test_df))
print("Users in train/val/test:", train_df['UserID'].nunique(), val_df['UserID'].nunique(), test_df['UserID'].nunique())
global_mean = train_df["Rating"].mean()
print("Global mean rating in train:", round(global_mean, 4))
Feature-engineered ratings sample:
|  | UserID | MovieID | Rating | Timestamp | user_mean | user_count | movie_mean | movie_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3186 | 4.0 | 978300019 | 4.188679 | 53 | 3.477958 | 431 |
| 1 | 1 | 1270 | 5.0 | 978300055 | 4.188679 | 53 | 3.990321 | 2583 |
| 2 | 1 | 1721 | 4.0 | 978300055 | 4.188679 | 53 | 3.583441 | 1546 |

Split sizes: 988129 6040 6040
Users in train/val/test: 6040 6040 6040
Global mean rating in train: 3.5815
def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def mape(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    non_zero = y_true != 0
    if non_zero.sum() == 0:
        return np.nan
    return float(mean_absolute_percentage_error(y_true[non_zero], y_pred[non_zero]))

def clip_rating(x, low=1.0, high=5.0):
    return np.clip(x, low, high)

def evaluate_predictions(df_true, pred_col="pred"):
    return {
        "RMSE": rmse(df_true["Rating"], df_true[pred_col]),
        "MAPE": mape(df_true["Rating"], df_true[pred_col]),
    }
# Baseline 1: global mean
val_pred_global = val_df.copy()
val_pred_global["pred"] = global_mean
test_pred_global = test_df.copy()
test_pred_global["pred"] = global_mean
global_val_metrics = evaluate_predictions(val_pred_global)
global_test_metrics = evaluate_predictions(test_pred_global)
print("Global baseline (val):", global_val_metrics)
print("Global baseline (test):", global_test_metrics)
# Baseline 2: regularized user-item bias model
def fit_bias_model(train, reg=10.0, n_iter=8):
    """Alternating closed-form updates for regularized user/item biases."""
    mu = train["Rating"].mean()
    bu = pd.Series(0.0, index=train["UserID"].unique())
    bi = pd.Series(0.0, index=train["MovieID"].unique())
    for _ in range(n_iter):
        # Update user biases with item biases held fixed
        tmp = train.join(bi.rename("bi"), on="MovieID")
        resid = tmp["Rating"] - mu - tmp["bi"]
        num_u = resid.groupby(tmp["UserID"]).sum()
        cnt_u = tmp.groupby("UserID").size()
        bu = (num_u / (cnt_u + reg)).reindex(bu.index).fillna(0.0)
        # Update item biases with user biases held fixed
        tmp = train.join(bu.rename("bu"), on="UserID")
        resid = tmp["Rating"] - mu - tmp["bu"]
        num_i = resid.groupby(tmp["MovieID"]).sum()
        cnt_i = tmp.groupby("MovieID").size()
        bi = (num_i / (cnt_i + reg)).reindex(bi.index).fillna(0.0)
    return mu, bu, bi

def predict_bias(df_eval, mu, bu, bi):
    pred = (
        mu
        + df_eval["UserID"].map(bu).fillna(0.0)
        + df_eval["MovieID"].map(bi).fillna(0.0)
    )
    return clip_rating(pred.values)
reg_grid = [5.0, 10.0, 20.0, 40.0]
bias_scores = []
best_bias = None
best_val_rmse = np.inf
for reg in reg_grid:
    mu, bu, bi = fit_bias_model(train_df, reg=reg, n_iter=8)
    v = val_df.copy()
    v["pred"] = predict_bias(v, mu, bu, bi)
    score = evaluate_predictions(v)
    bias_scores.append({"reg": reg, **score})
    if score["RMSE"] < best_val_rmse:
        best_val_rmse = score["RMSE"]
        best_bias = (mu, bu, bi, reg)
bias_score_df = pd.DataFrame(bias_scores).sort_values("RMSE")
display(bias_score_df)
mu_b, bu_b, bi_b, best_reg = best_bias
test_bias = test_df.copy()
test_bias["pred"] = predict_bias(test_bias, mu_b, bu_b, bi_b)
bias_test_metrics = evaluate_predictions(test_bias)
print(f"Best bias-model reg: {best_reg}")
print("Bias model (test):", bias_test_metrics)
Global baseline (val): {'RMSE': 1.1584898475322987, 'MAPE': 0.40937426811225724}
Global baseline (test): {'RMSE': 1.1755651985279745, 'MAPE': 0.4136629200068449}
|  | reg | RMSE | MAPE |
|---|---|---|---|
| 0 | 5.0 | 0.939383 | 0.296606 |
| 1 | 10.0 | 0.940191 | 0.299171 |
| 2 | 20.0 | 0.943328 | 0.303724 |
| 3 | 40.0 | 0.951043 | 0.310951 |
Best bias-model reg: 5.0
Bias model (test): {'RMSE': 0.9652078657563132, 'MAPE': 0.2999148801084146}
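The regularization term in the closed-form update, b_u = sum(r_ui - mu - b_i) / (n_u + reg), shrinks estimates for users with few ratings toward zero while leaving heavy raters largely intact. A toy sketch with synthetic residuals (illustrative numbers, not taken from the dataset):

```python
import numpy as np

# Residuals (rating - global mean - item bias) for two hypothetical users:
# one with only 2 ratings, one with 50, both rating well above the mean.
residuals_light = np.array([1.4, 1.2])   # 2 ratings
residuals_heavy = np.full(50, 1.3)       # 50 ratings

for reg in (5.0, 40.0):
    b_light = residuals_light.sum() / (len(residuals_light) + reg)
    b_heavy = residuals_heavy.sum() / (len(residuals_heavy) + reg)
    print(f"reg={reg:>4}: light-user bias={b_light:.3f}, heavy-user bias={b_heavy:.3f}")
# As reg grows, the 2-rating user's bias collapses toward 0 much faster
# than the 50-rating user's -- exactly the shrinkage the grid search tunes.
```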
# Build index mappings for sparse matrices
user_ids = np.sort(train_df["UserID"].unique())
movie_ids = np.sort(train_df["MovieID"].unique())
u2i = {u: i for i, u in enumerate(user_ids)}
m2i = {m: i for i, m in enumerate(movie_ids)}
train_idx = train_df[train_df["UserID"].isin(u2i) & train_df["MovieID"].isin(m2i)].copy()
rows = train_idx["UserID"].map(u2i).values
cols = train_idx["MovieID"].map(m2i).values
vals = train_idx["Rating"].astype(float).values
R = csr_matrix((vals, (rows, cols)), shape=(len(user_ids), len(movie_ids)))
R_dense = R.toarray()
# Item-based Pearson with overlap-aware shrinkage
rated_mask = (R_dense > 0).astype(np.uint8)
co_counts = rated_mask.T @ rated_mask
user_mean_vec = np.where(
rated_mask.sum(axis=1) > 0,
R_dense.sum(axis=1) / np.maximum(rated_mask.sum(axis=1), 1),
global_mean
)
R_centered = R_dense.copy()
for i in range(R_centered.shape[0]):
    mask = R_centered[i] != 0
    R_centered[i, mask] = R_centered[i, mask] - user_mean_vec[i]
item_corr_raw = np.corrcoef(R_centered.T)
item_corr_raw = np.nan_to_num(item_corr_raw, nan=0.0, posinf=0.0, neginf=0.0)
shrink_lambda = 25.0
item_corr = item_corr_raw * (co_counts / (co_counts + shrink_lambda))
np.fill_diagonal(item_corr, 1.0)
# User-based Cosine similarity
user_cos = cosine_similarity(R)
np.fill_diagonal(user_cos, 0.0)
print("Matrices ready:")
print("R shape:", R.shape, "Item corr:", item_corr.shape, "User cosine:", user_cos.shape)
def predict_item_pearson(user_id, movie_id, k=40):
    if user_id not in u2i or movie_id not in m2i:
        return global_mean
    u = u2i[user_id]
    m = m2i[movie_id]
    user_ratings = R_dense[u]
    rated_idx = np.where(user_ratings > 0)[0]
    if len(rated_idx) == 0:
        return global_mean
    sims = item_corr[m, rated_idx]
    top_idx = np.argsort(np.abs(sims))[::-1][:k]
    sims_top = sims[top_idx]
    ratings_top = user_ratings[rated_idx[top_idx]]
    denom = np.sum(np.abs(sims_top))
    if denom == 0:
        return user_mean_vec[u]
    pred = np.dot(sims_top, ratings_top) / denom
    return float(clip_rating(pred))

def predict_user_cosine(user_id, movie_id, k=40):
    if user_id not in u2i or movie_id not in m2i:
        return global_mean
    u = u2i[user_id]
    m = m2i[movie_id]
    movie_col = R_dense[:, m]
    raters = np.where(movie_col > 0)[0]
    if len(raters) == 0:
        return global_mean
    sims = user_cos[u, raters]
    top_idx = np.argsort(np.abs(sims))[::-1][:k]
    sims_top = sims[top_idx]
    ratings_top = movie_col[raters[top_idx]]
    denom = np.sum(np.abs(sims_top))
    if denom == 0:
        return user_mean_vec[u]
    pred = np.dot(sims_top, ratings_top) / denom
    return float(clip_rating(pred))

def batch_predict(df_eval, predictor, k=40):
    return np.array([predictor(u, m, k=k) for u, m in zip(df_eval["UserID"], df_eval["MovieID"])])
k_grid = [20, 40, 80, 120]
item_scores = []
user_scores = []
for k in k_grid:
    v_item = val_df.copy()
    v_item["pred"] = batch_predict(v_item, predict_item_pearson, k=k)
    item_scores.append({"k": k, **evaluate_predictions(v_item)})
    v_user = val_df.copy()
    v_user["pred"] = batch_predict(v_user, predict_user_cosine, k=k)
    user_scores.append({"k": k, **evaluate_predictions(v_user)})
item_scores_df = pd.DataFrame(item_scores).sort_values("RMSE")
user_scores_df = pd.DataFrame(user_scores).sort_values("RMSE")
display(item_scores_df)
display(user_scores_df)
best_k_item = int(item_scores_df.iloc[0]["k"])
best_k_user = int(user_scores_df.iloc[0]["k"])
test_item = test_df.copy()
test_item["pred"] = batch_predict(test_item, predict_item_pearson, k=best_k_item)
item_test_metrics = evaluate_predictions(test_item)
test_user = test_df.copy()
test_user["pred"] = batch_predict(test_user, predict_user_cosine, k=best_k_user)
user_test_metrics = evaluate_predictions(test_user)
print(f"Best item-based Pearson k: {best_k_item}, test metrics: {item_test_metrics}")
print(f"Best user-based Cosine k: {best_k_user}, test metrics: {user_test_metrics}")
# Pearson diagnostics: overlap and shrinkage effects
tri_i, tri_j = np.triu_indices(item_corr.shape[0], k=1)
overlap_vals = co_counts[tri_i, tri_j]
raw_vals = item_corr_raw[tri_i, tri_j]
shrunk_vals = item_corr[tri_i, tri_j]
print("Pearson diagnostics summary:")
print("Median overlap across item pairs:", float(np.median(overlap_vals)))
print("Share of pairs with overlap >= 20:", round(float((overlap_vals >= 20).mean()), 4))
print("Mean |raw corr|:", round(float(np.mean(np.abs(raw_vals))), 4))
print("Mean |shrunk corr|:", round(float(np.mean(np.abs(shrunk_vals))), 4))
valid = overlap_vals >= 20
id_to_title_local = movies.drop_duplicates("MovieID").set_index("MovieID")["Title"].to_dict()
if valid.any():
    valid_idx = np.where(valid)[0]
    top_idx = valid_idx[np.argsort(shrunk_vals[valid_idx])[::-1][:10]]
    top_pairs = []
    for idx in top_idx:
        mid_1 = movie_ids[tri_i[idx]]
        mid_2 = movie_ids[tri_j[idx]]
        top_pairs.append({
            "Movie1": id_to_title_local.get(mid_1, str(mid_1)),
            "Movie2": id_to_title_local.get(mid_2, str(mid_2)),
            "OverlapUsers": int(overlap_vals[idx]),
            "RawPearson": float(raw_vals[idx]),
            "ShrunkPearson": float(shrunk_vals[idx]),
        })
    print("Top high-confidence item correlations (overlap >= 20):")
    display(pd.DataFrame(top_pairs))
else:
    print("No item pairs met overlap >= 20 threshold in current train matrix.")
Matrices ready:
R shape: (6040, 3704) Item corr: (3704, 3704) User cosine: (6040, 6040)
|  | k | RMSE | MAPE |
|---|---|---|---|
| 0 | 20 | 1.719703 | 0.390199 |
| 1 | 40 | 1.842065 | 0.417400 |
| 2 | 80 | 1.935562 | 0.439979 |
| 3 | 120 | 1.974472 | 0.449712 |

|  | k | RMSE | MAPE |
|---|---|---|---|
| 2 | 80 | 0.989241 | 0.312508 |
| 3 | 120 | 0.990205 | 0.313433 |
| 1 | 40 | 0.990304 | 0.311779 |
| 0 | 20 | 0.997068 | 0.312476 |
Best item-based Pearson k: 20, test metrics: {'RMSE': 1.7798298892085307, 'MAPE': 0.4000738881971182}
Best user-based Cosine k: 80, test metrics: {'RMSE': 1.0036411945109756, 'MAPE': 0.31216639197813173}
Pearson diagnostics summary:
Median overlap across item pairs: 6.0
Share of pairs with overlap >= 20: 0.2965
Mean |raw corr|: 0.0237
Mean |shrunk corr|: 0.009
Top high-confidence item correlations (overlap >= 20):
|  | Movie1 | Movie2 | OverlapUsers | RawPearson | ShrunkPearson |
|---|---|---|---|---|---|
| 0 | Police Academy 5: Assignment: Miami Beach (1988) | Police Academy 6: City Under Siege (1989) | 111 | 0.706891 | 0.576948 |
| 1 | Puppet Master 4 (1993) | Puppet Master 5: The Final Chapter (1994) | 48 | 0.849808 | 0.558778 |
| 2 | Nightmare on Elm Street 4: The Dream Master, A... | Nightmare on Elm Street 5: The Dream Child, A ... | 160 | 0.631547 | 0.546203 |
| 3 | Nightmare on Elm Street 3: Dream Warriors, A (... | Nightmare on Elm Street 4: The Dream Master, A... | 191 | 0.597634 | 0.528463 |
| 4 | Godfather, The (1972) | Godfather: Part II, The (1974) | 155 | 0.613166 | 0.528004 |
| 5 | Jean de Florette (1986) | Manon of the Spring (Manon des sources) (1986) | 141 | 0.616756 | 0.523871 |
| 6 | Fistful of Dollars, A (1964) | For a Few Dollars More (1965) | 250 | 0.572157 | 0.520143 |
| 7 | Police Academy 4: Citizens on Patrol (1987) | Police Academy 5: Assignment: Miami Beach (1988) | 108 | 0.615823 | 0.500066 |
| 8 | Police Academy 4: Citizens on Patrol (1987) | Police Academy 6: City Under Siege (1989) | 108 | 0.614175 | 0.498728 |
| 9 | Good, The Bad and The Ugly, The (1966) | Fistful of Dollars, A (1964) | 178 | 0.549540 | 0.481862 |
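The shrinkage multiplier applied to the raw correlations is n / (n + lambda) with lambda = 25, where n is the number of co-rating users. A quick sketch of how much of the raw correlation survives at different overlaps (the overlap of 6 matches the median reported above):

```python
import numpy as np

shrink_lambda = 25.0
overlaps = np.array([2, 6, 20, 100, 500])
factors = overlaps / (overlaps + shrink_lambda)
for n, f in zip(overlaps, factors):
    print(f"overlap={n:>3}: raw correlation kept at {f:.2%}")
# A pair at the median overlap of 6 keeps only ~19% of its raw correlation,
# while 100 co-raters keep 80% -- small-sample correlations are damped hard.
```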
# Item-based similar movies for 'Liar Liar' (Pearson approach)
title_to_id = movies.drop_duplicates("Title").set_index("Title")["MovieID"].to_dict()
id_to_title = movies.drop_duplicates("MovieID").set_index("MovieID")["Title"].to_dict()
target_title = "Liar Liar (1997)"
if target_title in title_to_id and title_to_id[target_title] in m2i:
    target_mid = title_to_id[target_title]
    t_idx = m2i[target_mid]
    sims = item_corr[t_idx]
    order = np.argsort(sims)[::-1]
    top_similar = []
    for idx in order:
        if idx == t_idx:
            continue
        mid = movie_ids[idx]
        title = id_to_title.get(mid, str(mid))
        top_similar.append((title, float(sims[idx])))
        if len(top_similar) == 3:
            break
    print("Top 3 movies similar to Liar Liar (item-based Pearson):")
    display(pd.DataFrame(top_similar, columns=["SimilarMovie", "PearsonSimilarity"]))
else:
    top_similar = []
    print("Liar Liar (1997) not found in train item index.")
def recommend_for_user(user_id, method="item", top_n=10):
    watched = set(ratings.loc[ratings["UserID"] == user_id, "MovieID"].tolist())
    candidates = [m for m in movie_ids if m not in watched]
    if len(candidates) == 0:
        return pd.DataFrame(columns=["MovieID", "Title", "PredRating"])
    preds = []
    for mid in candidates:
        if method == "item":
            p = predict_item_pearson(user_id, mid, k=best_k_item)
        else:
            p = predict_user_cosine(user_id, mid, k=best_k_user)
        preds.append((mid, p))
    recs = pd.DataFrame(preds, columns=["MovieID", "PredRating"]).sort_values("PredRating", ascending=False).head(top_n)
    recs["Title"] = recs["MovieID"].map(id_to_title)
    return recs[["MovieID", "Title", "PredRating"]]
example_user = int(train_df["UserID"].value_counts().index[0])
print(f"Sample recommendations for UserID={example_user} (item-based):")
display(recommend_for_user(example_user, method="item", top_n=10))
Top 3 movies similar to Liar Liar (item-based Pearson):
|  | SimilarMovie | PearsonSimilarity |
|---|---|---|
| 0 | Ace Ventura: Pet Detective (1994) | 0.156271 |
| 1 | Tommy Boy (1995) | 0.156112 |
| 2 | Dumb & Dumber (1994) | 0.149243 |
Sample recommendations for UserID=4169 (item-based):
|  | MovieID | Title | PredRating |
|---|---|---|---|
| 1042 | 3022 | General, The (1927) | 4.861353 |
| 1148 | 3338 | For All Mankind (1989) | 4.818162 |
| 1074 | 3133 | Go West (1925) | 4.810607 |
| 260 | 669 | Aparajito (1956) | 4.809627 |
| 259 | 668 | Pather Panchali (1955) | 4.734362 |
| 1013 | 2946 | Help! (1965) | 4.658292 |
| 261 | 670 | World of Apu, The (Apur Sansar) (1959) | 4.572760 |
| 286 | 722 | Haunted World of Edward D. Wood Jr., The (1995) | 4.565847 |
| 718 | 2081 | Little Mermaid, The (1989) | 4.539420 |
| 413 | 1147 | When We Were Kings (1996) | 4.527898 |
This section now includes overlap-aware Pearson diagnostics to avoid over-trusting high correlations from very small co-rating samples.
What to review in outputs: the gap between raw and shrunk correlation magnitudes, the share of item pairs with adequate overlap, and whether the top correlated pairs are face-valid (sequels and same-franchise titles dominate the list above).

Interpretation guideline: treat a high Pearson similarity as trustworthy only when it is supported by a reasonable co-rating overlap (here, at least 20 shared raters); below that, the shrinkage factor discounts it accordingly.
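The two similarity measures used above also differ in range: Pearson centers each vector, so it lies in [-1, 1] and flags opposite tastes with a negative score, while cosine on raw non-negative ratings stays in [0, 1]. A minimal sketch on toy vectors (illustrative values, not dataset rows):

```python
import numpy as np

def pearson(a, b):
    # Center both vectors, then take the cosine of the centered vectors.
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u1 = np.array([5.0, 4.0, 1.0, 1.0])  # likes the first two movies
u2 = np.array([1.0, 1.0, 5.0, 4.0])  # exactly opposite taste

print("Pearson:", round(pearson(u1, u2), 3))  # strongly negative
print("Cosine :", round(cosine(u1, u2), 3))   # still positive: ratings are non-negative
```

This is why the neighborhood predictors above weight by |similarity|: Pearson neighbors can contribute with negative weights, whereas cosine neighbors never do on 1-5 ratings.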
# Matrix Factorization via Truncated SVD (validation-tuned)
mask_obs = (R_dense > 0).astype(float)
base_user_mean = np.where(
mask_obs.sum(axis=1) > 0,
R_dense.sum(axis=1) / np.maximum(mask_obs.sum(axis=1), 1),
global_mean
)
def build_centered_matrix(alpha):
    blended_user_mean = alpha * base_user_mean + (1 - alpha) * global_mean
    R_filled = R_dense.copy()
    for i in range(R_filled.shape[0]):
        miss = R_filled[i] == 0
        if miss.any():
            R_filled[i, miss] = blended_user_mean[i]
    R_center = R_filled - blended_user_mean.reshape(-1, 1)
    return R_center, blended_user_mean

def fit_svd_predict(centered_matrix, n_components):
    svd = TruncatedSVD(n_components=n_components, random_state=RANDOM_STATE)
    user_factors = svd.fit_transform(centered_matrix)
    item_factors = svd.components_.T
    return svd, user_factors, item_factors

def predict_svd(df_eval, user_factors, item_factors, user_means):
    preds = []
    for u, m in zip(df_eval["UserID"], df_eval["MovieID"]):
        if u in u2i and m in m2i:
            ui = u2i[u]
            mi = m2i[m]
            residual = float(np.dot(user_factors[ui], item_factors[mi]))
            p = user_means[ui] + residual
        else:
            p = global_mean
        preds.append(float(clip_rating(p)))
    return np.array(preds)
svd_grid = [20, 40, 60, 80, 120, 160]
alpha_grid = [0.7, 0.85, 1.0]
svd_scores = []
best_svd = None
best_svd_rmse = np.inf
for alpha in alpha_grid:
    R_center_tmp, user_mean_tmp = build_centered_matrix(alpha)
    for n_comp in svd_grid:
        svd_model, uf, itf = fit_svd_predict(R_center_tmp, n_comp)
        v = val_df.copy()
        v["pred"] = predict_svd(v, uf, itf, user_mean_tmp)
        score = evaluate_predictions(v)
        svd_scores.append({
            "alpha": alpha,
            "n_factors": n_comp,
            **score,
            "explained_var": float(svd_model.explained_variance_ratio_.sum()),
        })
        if score["RMSE"] < best_svd_rmse:
            best_svd_rmse = score["RMSE"]
            best_svd = (svd_model, uf, itf, n_comp, alpha, R_center_tmp, user_mean_tmp)
svd_score_df = pd.DataFrame(svd_scores).sort_values("RMSE")
display(svd_score_df.head(12))
svd_model_b, uf_b, itf_b, best_n, best_alpha, R_center, user_mean_train = best_svd
test_svd = test_df.copy()
test_svd["pred"] = predict_svd(test_svd, uf_b, itf_b, user_mean_train)
svd_test_metrics = evaluate_predictions(test_svd)
print(f"Best SVD factors: {best_n}")
print(f"Best SVD alpha (user/global blend): {best_alpha}")
print("SVD Matrix Factorization (test):", svd_test_metrics)
# Latent-factor interpretation: highest loading movies per factor
id_to_title_local = movies.drop_duplicates("MovieID").set_index("MovieID")["Title"].to_dict()
latent_rows = []
n_latent_to_show = min(4, best_n)
for f in range(n_latent_to_show):
    w = itf_b[:, f]
    pos_idx = np.argsort(w)[-5:][::-1]
    neg_idx = np.argsort(w)[:5]
    latent_rows.append({
        "Factor": f + 1,
        "TopPositiveMovies": " | ".join([id_to_title_local.get(movie_ids[i], str(movie_ids[i])) for i in pos_idx]),
        "TopNegativeMovies": " | ".join([id_to_title_local.get(movie_ids[i], str(movie_ids[i])) for i in neg_idx]),
    })
latent_factor_df = pd.DataFrame(latent_rows)
print("Latent-factor interpretation (top loadings):")
display(latent_factor_df)
|  | alpha | n_factors | RMSE | MAPE | explained_var |
|---|---|---|---|---|---|
| 8 | 0.85 | 60 | 1.037776 | 0.356936 | 0.241130 |
| 7 | 0.85 | 40 | 1.037854 | 0.357526 | 0.198777 |
| 2 | 0.70 | 60 | 1.038994 | 0.358959 | 0.241844 |
| 6 | 0.85 | 20 | 1.039093 | 0.358581 | 0.147580 |
| 1 | 0.70 | 40 | 1.039273 | 0.359739 | 0.199405 |
| 14 | 1.00 | 60 | 1.040301 | 0.355804 | 0.242010 |
| 13 | 1.00 | 40 | 1.040418 | 0.356373 | 0.199680 |
| 9 | 0.85 | 80 | 1.040570 | 0.357367 | 0.278546 |
| 0 | 0.70 | 20 | 1.040618 | 0.360705 | 0.147749 |
| 10 | 0.85 | 120 | 1.040903 | 0.357328 | 0.344243 |
| 12 | 1.00 | 20 | 1.041390 | 0.357288 | 0.148623 |
| 4 | 0.70 | 120 | 1.041458 | 0.359205 | 0.344930 |
Best SVD factors: 60
Best SVD alpha (user/global blend): 0.85
SVD Matrix Factorization (test): {'RMSE': 1.0629461993365337, 'MAPE': 0.36472529252653096}
Latent-factor interpretation (top loadings):
|  | Factor | TopPositiveMovies | TopNegativeMovies |
|---|---|---|---|
| 0 | 1 | Star Wars: Episode IV - A New Hope (1977) \| Ra... | Wild Wild West (1999) \| Batman & Robin (1997) ... |
| 1 | 2 | Jurassic Park (1993) \| Independence Day (ID4) ... | Fargo (1996) \| Dr. Strangelove or: How I Learn... |
| 2 | 3 | American Beauty (1999) \| Fight Club (1999) \| P... | Casablanca (1942) \| Wizard of Oz, The (1939) \|... |
| 3 | 4 | Shawshank Redemption, The (1994) \| Schindler's... | Aliens (1986) \| Alien (1979) \| Robocop (1987) ... |
This section now performs broader hyperparameter tuning across latent factor count and user-mean/global-mean blending. It also surfaces factor-level interpretation using the highest positive and negative loading movies per latent dimension.
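The mechanics of the factorization can be seen on a tiny dense matrix: TruncatedSVD compresses it to k factors, and multiplying user factors by item factors gives a rank-k approximation of the original, exactly as the residual term in `predict_svd` assumes. A minimal sketch with made-up data (not the actual rating matrix):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# A 4-users x 5-movies matrix of made-up, mean-centered residuals.
rng = np.random.default_rng(42)
R_toy = rng.normal(size=(4, 5))

svd = TruncatedSVD(n_components=2, random_state=42)
user_factors = svd.fit_transform(R_toy)   # shape (4, 2): U * Sigma
item_factors = svd.components_.T          # shape (5, 2): V

R_hat = user_factors @ item_factors.T     # rank-2 reconstruction
err = np.linalg.norm(R_toy - R_hat) / np.linalg.norm(R_toy)
print(f"Relative reconstruction error at rank 2: {err:.3f}")
```

Raising `n_components` toward the matrix rank drives the reconstruction error toward zero, which mirrors the explained-variance column in the tuning table.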
What to review in outputs: the RMSE ranking across (alpha, n_factors) combinations, whether explained variance keeps rising while validation RMSE worsens (a sign of overfitting the imputed cells), and whether the latent-factor loadings group recognizable movie themes.
# Stability check and alternative strategies
def evaluate_svd_with_seed(seed, n_factors):
    svd = TruncatedSVD(n_components=n_factors, random_state=seed)
    uf = svd.fit_transform(R_center)
    itf = svd.components_.T
    tmp = val_df.copy()
    tmp["pred"] = predict_svd(tmp, uf, itf, user_mean_train)
    return evaluate_predictions(tmp)["RMSE"]
stability_seeds = [11, 21, 42, 84]
svd_stability_rmse = [evaluate_svd_with_seed(s, best_n) for s in stability_seeds]
stability_std = float(np.std(svd_stability_rmse))
print("SVD validation RMSE by seed:", dict(zip(stability_seeds, np.round(svd_stability_rmse, 4))))
print("RMSE std (stability indicator):", round(stability_std, 5))
# Alternative strategy A: stronger regularization for bias model
mu_alt, bu_alt, bi_alt = fit_bias_model(train_df, reg=80.0, n_iter=10)
test_alt_bias = test_df.copy()
test_alt_bias["pred"] = predict_bias(test_alt_bias, mu_alt, bu_alt, bi_alt)
alt_bias_metrics = evaluate_predictions(test_alt_bias)
# Alternative strategy B: high-neighbor user cosine model
test_alt_user = test_df.copy()
test_alt_user["pred"] = batch_predict(test_alt_user, predict_user_cosine, k=120)
alt_user_metrics = evaluate_predictions(test_alt_user)
alt_compare = pd.DataFrame([
{"Strategy": "SVD (best)", **svd_test_metrics},
{"Strategy": "Alt A: Bias model (reg=80)", **alt_bias_metrics},
{"Strategy": "Alt B: User cosine (k=120)", **alt_user_metrics},
]).sort_values("RMSE")
display(alt_compare)
if stability_std > 0.03:
    print("Model shows instability across seeds. Alternative strategies above are recommended as contingency options.")
else:
    print("Model stability is acceptable. Alternatives retained as fallback options.")
SVD validation RMSE by seed: {11: np.float64(1.0375), 21: np.float64(1.0382), 42: np.float64(1.0378), 84: np.float64(1.0371)}
RMSE std (stability indicator): 0.0004
|  | Strategy | RMSE | MAPE |
|---|---|---|---|
| 1 | Alt A: Bias model (reg=80) | 0.982715 | 0.323291 |
| 2 | Alt B: User cosine (k=120) | 1.004644 | 0.313295 |
| 0 | SVD (best) | 1.062946 | 0.364725 |
Model stability is acceptable. Alternatives retained as fallback options.
# Consolidated model performance table
model_comparison = pd.DataFrame([
{"Model": "Global Mean Baseline", **global_test_metrics},
{"Model": "Regularized Bias Baseline", **bias_test_metrics},
{"Model": f"Item-based Pearson (k={best_k_item})", **item_test_metrics},
{"Model": f"User-based Cosine (k={best_k_user})", **user_test_metrics},
{"Model": f"Matrix Factorization SVD (factors={best_n})", **svd_test_metrics},
]).sort_values("RMSE")
display(model_comparison)
best_model_name = model_comparison.iloc[0]["Model"]
print("Best model on test set:", best_model_name)
|  | Model | RMSE | MAPE |
|---|---|---|---|
| 1 | Regularized Bias Baseline | 0.965208 | 0.299915 |
| 3 | User-based Cosine (k=80) | 1.003641 | 0.312166 |
| 4 | Matrix Factorization SVD (factors=60) | 1.062946 | 0.364725 |
| 0 | Global Mean Baseline | 1.175565 | 0.413663 |
| 2 | Item-based Pearson (k=20) | 1.779830 | 0.400074 |
Best model on test set: Regularized Bias Baseline
# Questionnaire answers with computed evidence
age_top = age_activity.index[0]
occupation_top = occupation_activity.index[0]
male_majority = users["Gender"].eq("M").mean() > 0.5
decade_top = movies["Decade"].dropna().mode().iloc[0]
movie_max_ratings_id = ratings.groupby("MovieID").size().sort_values(ascending=False).index[0]
movie_max_ratings_title = id_to_title.get(movie_max_ratings_id, str(movie_max_ratings_id))
q6_titles = [x[0] for x in top_similar] if len(top_similar) > 0 else []
answers = pd.DataFrame({
"QuestionNo": [1,2,3,4,5,6,7,8,9,10],
"Answer": [
f"Age group {age_top}",
f"Occupation code {occupation_top}",
"True" if male_majority else "False",
f"{int(decade_top)}s",
movie_max_ratings_title,
"; ".join(q6_titles) if q6_titles else "Not available in current train index",
"User-based and Item-based",
"Pearson: -1 to 1; Cosine: 0 to 1",
f"RMSE={svd_test_metrics['RMSE']:.4f}, MAPE={svd_test_metrics['MAPE']:.4f}",
"CSR rows=[0,1,1], cols=[0,0,1], data=[1,3,7]"
]
})
display(answers)
print("Note: Occupation is represented as coded values in this dataset.")
|  | QuestionNo | Answer |
|---|---|---|
| 0 | 1 | Age group 25 |
| 1 | 2 | Occupation code 4 |
| 2 | 3 | True |
| 3 | 4 | 1990s |
| 4 | 5 | American Beauty (1999) |
| 5 | 6 | Ace Ventura: Pet Detective (1994); Tommy Boy (... |
| 6 | 7 | User-based and Item-based |
| 7 | 8 | Pearson: -1 to 1; Cosine: 0 to 1 |
| 8 | 9 | RMSE=1.0629, MAPE=0.3647 |
| 9 | 10 | CSR rows=[0,1,1], cols=[0,0,1], data=[1,3,7] |
Note: Occupation is represented as coded values in this dataset.
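The CSR answer to question 10 can be verified directly by building the matrix from those triplets and inspecting its compressed internals. A self-contained check:

```python
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 1, 1])
cols = np.array([0, 0, 1])
data = np.array([1, 3, 7])

M = csr_matrix((data, (rows, cols)), shape=(2, 2))
print(M.toarray())   # [[1 0]
                     #  [3 7]]
print(M.indptr)      # [0 1 3] -> row 0 holds 1 stored value, row 1 holds 2
print(M.indices)     # [0 0 1] -> column index of each stored value
print(M.data)        # [1 3 7] -> the stored values themselves
```

This is the same construction used for the train matrix `R` earlier, just small enough to read the `indptr`/`indices`/`data` arrays by eye.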
Built an end-to-end personalized movie recommender system for the company using explicit user ratings, user demographics, and movie metadata. The workflow covered EDA, sparsity diagnostics, leakage-aware temporal splitting, and model benchmarking across baseline, neighborhood, and matrix-factorization methods.

Business impact: the regularized bias baseline delivered the best test RMSE (0.965), with user-based cosine close behind (1.004); serving these relevance-ranked recommendations directly supports the engagement, session-depth, and retention goals stated at the outset, while the diagnostics (overlap-aware shrinkage, seed-stability checks, fallback strategies) provide guardrails for a production rollout.