I was sprawled on my couch, deep into a Netflix binge, when their "Because you watched..." feature caught my eye. It was uncanny—almost like Netflix had a direct line to my movie tastes. That sparked a question: Could I build something like that?
Fast forward to enrolling in the Machine Learning Specialization on Coursera by DeepLearning.AI, and my project, YMovies, was born—a web app designed to recommend movies tailored to your preferences. Here's how I turned that spark into a working recommendation system, complete with code, challenges, and lessons learned.
The Netflix Spark and a Pivot in Plans
The Coursera course introduced me to collaborative filtering—think of it as matchmaking based on user behavior. If you and I both love Inception, and I rave about Interstellar, the system might nudge you toward it. It's powerful, but there's a catch: it needs a ton of user data. Starting fresh with YMovies, I didn't have that luxury.
So, I pivoted to content-based filtering. This method looks at a movie's features—genres, plot, cast—and finds similar ones based on what you've liked. It's perfect for a new app, especially with the TMDB API handing me rich movie metadata. For anyone kicking off a project with limited user data, this is a great place to start.
Crafting the Content-Based Recommender
The core of YMovies is a content-based recommender built with TF-IDF vectorization and cosine similarity. Here's the gist: I mashed up movie features—genres, overview, cast, keywords—into a single text string per movie. TF-IDF (Term Frequency-Inverse Document Frequency) turns that text into vectors, highlighting key terms across the movie catalog. Then, cosine similarity measures how close two movies are in this vector space.
For example, if you liked The Dark Knight, it might suggest Batman Begins based on shared genres (Action, Drama) and cast (Christian Bale). Check out this snippet from content_based_recommender.py, where I weighted features like genres and directors more heavily:
def _combine_features(self, row):
    features = []
    if 'genres' in row and row['genres']:
        genres = row['genres'].split()
        features.extend([g for g in genres for _ in range(3)])  # Triple weight for genres
    if 'overview' in row and row['overview']:
        features.append(row['overview'])
    if 'cast' in row and row['cast']:
        cast_list = row['cast'].split()[:5]
        features.extend([c for c in cast_list for _ in range(2)])  # Double weight for cast
    if 'director' in row and row['director']:
        features.extend([row['director']] * 3)  # Triple weight for director
    if 'keywords' in row and row['keywords']:
        features.append(row['keywords'])
    return ' '.join(features)
This powers the get_similar_movies function, which finds movies close to your favorites and forms the backbone of the "Because you liked..." feature.
Mixing It Up with a Hybrid Recommender
Content-based filtering was a solid start, but I wanted YMovies to feel personal. Enter the hybrid recommender in hybrid_recommender.py. It blends content similarity with user preferences from a quiz (genres, eras, durations) and interaction history (liked movies, watch history). For new users, the quiz tackles the cold start problem: those first recommendations when there's no data to lean on.
Here's how it works in action:
def get_recommendations(self, user_data, n=20):
    # liked_movie_ids, recent_liked_ids, exclude_ids and the quiz_* values
    # come from user_data; that unpacking is not shown in this excerpt
    recommendations = []
    if liked_movie_ids:
        for movie_id in recent_liked_ids:
            similar_movies = self.get_because_you_liked_recommendations(movie_id, n=10)
            # Filter out watched movies and add to recommendations
    if quiz_genres:
        quiz_recs = self._get_quiz_based_recommendations(
            quiz_genres, quiz_year_range, quiz_duration, exclude_ids, n=20
        )
        # Add quiz-based picks
    return recommendations
If you've liked Inception, it pulls similar films. If you said "Sci-Fi" and "recent" in the quiz, it narrows down to movies like Interstellar. It's a combo of math and user insight.
The "Because You Liked..." Magic
That Netflix-inspired "Because you liked..." section? It's alive in YMovies. The get_because_you_liked_recommendations method uses content similarity to suggest movies, tagging each with a reason like "Because you liked Inception." I made sure to exclude movies you've already watched or added to your watchlist, which keeps it fresh and relevant.
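Here's a minimal sketch of that tag-and-filter step. It assumes a hypothetical similar_fn callable that returns dicts with id and title keys, best match first; the real method's signature and return shape aren't shown in this post, so treat the names below as placeholders.

```python
def because_you_liked(liked_title, liked_id, similar_fn, watched_ids, watchlist_ids, n=10):
    """Tag similar movies with a reason and drop anything already seen or saved.

    similar_fn is a hypothetical stand-in for the content-based lookup; it is
    assumed to return a list of {'id': ..., 'title': ...} dicts, best match first.
    """
    excluded = set(watched_ids) | set(watchlist_ids) | {liked_id}
    picks = []
    for movie in similar_fn(liked_id):
        if movie['id'] in excluded:
            continue
        picks.append({**movie, 'reason': f'Because you liked {liked_title}'})
        if len(picks) == n:
            break
    return picks


def fake_similar(movie_id):
    """Toy data standing in for real similarity results."""
    return [
        {'id': 2, 'title': 'Interstellar'},
        {'id': 3, 'title': 'Tenet'},
        {'id': 4, 'title': 'The Prestige'},
    ]


recs = because_you_liked('Inception', 1, fake_similar, watched_ids=[3], watchlist_ids=[])
for rec in recs:
    print(rec['title'], '-', rec['reason'])
```

Building the exclusion set once up front keeps each membership check cheap, which matters once watch histories get long.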
Solving the Cold Start with a Quiz
For newbies, the quiz is a lifesaver. It asks about genres (e.g., Action, Romance), year ranges (recent or classic), and runtime preferences (short, medium, long). The _get_quiz_based_recommendations function filters the movie pool accordingly:
- Recent: Last 5 years
- Classic: Pre-2000
- Short: Under 100 minutes
It then ranks by popularity and ratings, ensuring solid picks from day one. If you're building a system, a quiz like this is a quick win for onboarding users.
Technical Hurdles and Fixes
Building YMovies wasn't all smooth sailing. Here's what I wrestled with:
- Data Wrangling: TMDB data was a goldmine, but messy: genre IDs needed mapping, and overviews had gaps. I cleaned it up in app.py's load_movie_data function.
- Speed: Calculating similarities for thousands of movies bogged things down. Sparse matrices for TF-IDF vectors saved the day, which is my pro tip for handling big datasets.
- Balancing Act: How much should quiz answers weigh versus liked movies? I tweaked it with a diversity_factor to mix things up without being repetitive.
Serving It Up with Flask
The Flask app in app.py ties it all together, offering endpoints like:
- /recommendations/similar/<movie_id>: Content-based picks
- /recommendations/personalized: Full hybrid recommendations
- /recommendations/quiz-based: Quiz-driven suggestions
It pulls movie data from TMDB, caches it, and initializes the recommenders on startup. Deployed on Vercel with Neon's serverless PostgreSQL, it scales like a dream.
What I Learned
After three iterations, YMovies taught me a ton:
- Content-Based is Your Friend: Great for starting with item metadata.
- Hybrid Wins: Combining approaches beats any single method.
- Data Quality Matters: Clean, rich data fuels better recommendations.
- User Experience is Key: It's not just about accuracy—surprise and delight matter too.
YMovies isn't perfect, but when I see "Because you liked The Matrix" pop up with Blade Runner 2049, I know it's working. Want to build your own? Start with what data you have, add a personal touch, and iterate. Who knows—you might just recommend the next big hit!
Deep Dive: The Technical Implementation
Let me walk you through the core files that power YMovies and how they work together to create that Netflix-like experience.
content_based_recommender.py: The Movie Matchmaker
Imagine you're at a party trying to introduce movies to each other. "Hey, Inception, have you met Interstellar? You've got a lot in common!" That's what this file does—it figures out which movies are similar based on their DNA: genres, overviews, cast, directors, and keywords.
Step 1: Building a Movie's "Profile"
First, there's the _combine_features method, which takes a movie's raw data and smushes it into one big text string. But I'm not just throwing everything in the blender; I'm picky about what gets emphasis. Genres and directors get triple the weight, cast gets double, and overviews and keywords get a single pass. Why? Because I figured genres (like "sci-fi" or "comedy") and directors (like "Nolan" or "Spielberg") are huge clues about a movie's vibe.
Here's what that looks like in code:
def _combine_features(self, row):
    features = []
    if 'genres' in row and row['genres']:
        genres = row['genres'].split()
        features.extend([g for g in genres for _ in range(3)])  # Triple weight for genres
    if 'overview' in row and row['overview']:
        features.append(row['overview'])
    if 'cast' in row and row['cast']:
        cast_list = row['cast'].split()[:5]  # First 5 whitespace-separated cast tokens
        features.extend([c for c in cast_list for _ in range(2)])  # Double weight for cast
    if 'director' in row and row['director']:
        features.extend([row['director']] * 3)  # Triple weight for director
    if 'keywords' in row and row['keywords']:
        features.append(row['keywords'])
    return ' '.join(features)
Let's say we've got The Dark Knight. Its genres might be "action crime thriller," its overview is something like "Batman fights the Joker in Gotham," its cast includes "Christian Bale" and "Heath Ledger," and its director is "Christopher Nolan." After this method, the string might look like:
action action action crime crime crime thriller thriller thriller Batman fights the Joker in Gotham Christian Christian Bale Bale Heath Heath Ledger Ledger Christopher Nolan Christopher Nolan Christopher Nolan
See how "action" and "Nolan" show up three times? That's the weighting at work—it's like shouting, "This movie is REALLY about action and Nolan's style!" Meanwhile, the overview keeps its natural flow, giving context without overcomplicating things.
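If you want to poke at the weighting yourself, here's a standalone copy of the same logic that takes a plain dict instead of a DataFrame row (that swap is my simplification for the demo). It counts how often a few tokens land in the final profile string.

```python
from collections import Counter


def combine_features(row):
    """Standalone copy of the weighting logic, taking a plain dict."""
    features = []
    if row.get('genres'):
        features.extend(g for g in row['genres'].split() for _ in range(3))
    if row.get('overview'):
        features.append(row['overview'])
    if row.get('cast'):
        features.extend(c for c in row['cast'].split()[:5] for _ in range(2))
    if row.get('director'):
        features.extend([row['director']] * 3)
    if row.get('keywords'):
        features.append(row['keywords'])
    return ' '.join(features)


dark_knight = {
    'genres': 'action crime thriller',
    'overview': 'Batman fights the Joker in Gotham',
    'cast': 'Christian Bale Heath Ledger',
    'director': 'Christopher Nolan',
}

profile = combine_features(dark_knight)
counts = Counter(profile.split())
print(counts['action'], counts['Bale'], counts['Nolan'], counts['Joker'])  # 3 2 3 1
```

One thing this makes visible: because the cast string is split on whitespace, the [:5] slice keeps the first five tokens, not five full names. If you want whole names treated as units, a delimiter like a comma (or joining names with underscores before building the string) would be one way to go.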
Step 2: Turning Words into Numbers with TF-IDF
Now, you've got these big text strings for every movie, but computers don't speak English—they speak math. So, I use TF-IDF (Term Frequency-Inverse Document Frequency) to turn those strings into vectors. TF-IDF is like a bouncer at a club: it lets in words that matter (like "superhero" in action movies) and tones down ones that are too common (like "the" or "and").
Here's how I set it up:
def fit(self, movies_df):
    self.movies_df = movies_df
    self.tfidf = TfidfVectorizer(
        stop_words='english',  # Skip boring words like "the"
        ngram_range=(1, 2),    # Grab single words and pairs like "science fiction"
        min_df=2,              # Ignore words that show up in fewer than 2 movies
        max_features=5000      # Keep it to 5,000 features max
    )
    combined_features = self.movies_df.apply(self._combine_features, axis=1)
    self.tfidf_matrix = self.tfidf.fit_transform(combined_features)
- Stop words: Ditching "the" and "is" keeps the focus on meaningful terms.
- N-grams: Grabbing "science fiction" as a pair is way smarter than just "science" or "fiction" alone—it captures the genre's essence.
- Min_df and max_features: These are your performance guards. min_df=2 means a word has to appear in at least two movies to count (no one-off weirdos), and max_features=5000 stops your matrix from ballooning out of control.
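For The Dark Knight, TF-IDF might give "Nolan" a high score because it's repeated within the movie and rare across the catalog. "action" is repeated too, but it shows up in so many movies that its inverse document frequency pulls the weight down somewhat. "Batman" gets a solid score, but not as high, since it appears only once here and pops up in other Batman movies too. The result? A sparse matrix where each movie is a row, and each column is a weighted word or phrase.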
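To see why repetition and rarity both matter, here's a tiny pure-Python sketch of the weighting. It uses the smoothed IDF formula that scikit-learn applies by default, but it is not the library's implementation: there are no stop words, bigrams, min_df, or max_features here, and the three "movies" are made-up strings.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = [
    'action action action nolan nolan nolan batman',
    'action action action spielberg spielberg spielberg shark',
    'action action action cameron cameron cameron ship',
]
vec = tfidf_vectors(docs)[0]
print(vec['nolan'] > vec['action'] > vec['batman'])  # True
```

Both "action" and "nolan" appear three times in the first string, but "action" is in every document, so it ends up with less weight. "batman" is just as rare as "nolan" but only appears once, so it lands at the bottom.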
Step 3: Measuring Similarity with Cosine
Okay, now every movie's a vector in this high-dimensional space. How do you find the ones that are "close" to each other? Cosine similarity! It's like measuring the angle between two arrows—smaller angle, more similar. I compute this for every pair of movies and store it in a big similarity matrix:
self.similarity_matrix = cosine_similarity(self.tfidf_matrix)
For The Dark Knight, this might show Batman Begins with a similarity of 0.85 (same director, similar genres) and Iron Man at 0.60 (action and superhero vibes, but different flavor). The get_similar_movies method grabs the top matches:
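def get_similar_movies(self, movie_id, top_n=10):
    # Find the row for this movie; assumes movies_df has a default 0..n-1 index
    matches = self.movies_df.index[self.movies_df['id'] == movie_id].tolist()
    if not matches:
        return []  # Unknown movie_id: nothing to recommend
    movie_idx = matches[0]
    sim_scores = list(enumerate(self.similarity_matrix[movie_idx]))
    # Sort by similarity, then skip the first entry (the movie itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    return self.movies_df.iloc[movie_indices][['id', 'title']].to_dict('records')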
This is your "If you liked X, try Y" engine. Simple, but powerful!
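For intuition, here's the same idea without any libraries: cosine similarity over sparse {term: weight} dicts, plus a top-N lookup for one movie. The weights below are toy numbers I made up for illustration, not real TF-IDF output, and the top_similar helper is a hypothetical name rather than something from the YMovies code.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_similar(target_id, vectors, top_n=10):
    """Return the top_n (movie_id, score) pairs most similar to target_id."""
    target = vectors[target_id]
    scored = (
        (cosine(target, vec), movie_id)
        for movie_id, vec in vectors.items()
        if movie_id != target_id
    )
    return [(movie_id, score) for score, movie_id in heapq.nlargest(top_n, scored)]


# Toy weights, invented for the example
vectors = {
    'The Dark Knight': {'action': 3.0, 'nolan': 5.0, 'batman': 2.0},
    'Batman Begins': {'action': 3.0, 'nolan': 5.0, 'batman': 2.0, 'origin': 1.0},
    'Iron Man': {'action': 3.0, 'favreau': 5.0, 'stark': 2.0},
    'Notting Hill': {'romance': 3.0, 'london': 2.0},
}

for title, score in top_similar('The Dark Knight', vectors, top_n=3):
    print(f'{title}: {score:.2f}')
```

Scoring one movie against the catalog on demand like this trades a bit of per-request work for not holding a full movie-by-movie matrix in memory, which may be worth considering as the catalog grows.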
hybrid_recommender.py: Getting Personal
Content-based is cool, but I wanted YMovies to feel like it knows the user, not just the movies. Enter the hybrid recommender—it's like a chef mixing content similarity with user preferences from a quiz, liked movies, and watch history. Let's dig into how I made it personal.
Cold Start Savior: The Quiz
New users are tricky: no data, no recommendations. My fix? A quiz! In _get_quiz_based_recommendations, I ask about genres, recency, and runtime, then filter movies accordingly:
def _get_quiz_based_recommendations(self, quiz_answers):
    filtered_df = self.movies_df.copy()
    if 'genres' in quiz_answers:
        genre_list = quiz_answers['genres']
        filtered_df = filtered_df[filtered_df['genres'].apply(
            lambda x: any(g in x.split() for g in genre_list))]
    if 'recency' in quiz_answers:
        current_year = datetime.now().year
        if quiz_answers['recency'] == 'recent':
            filtered_df = filtered_df[filtered_df['release_year'] >= current_year - 5]
        elif quiz_answers['recency'] == 'classic':
            filtered_df = filtered_df[filtered_df['release_year'] < 2000]
    if 'duration' in quiz_answers:
        if quiz_answers['duration'] == 'short':
            filtered_df = filtered_df[filtered_df['runtime'] < 100]
        elif quiz_answers['duration'] == 'medium':
            filtered_df = filtered_df[(filtered_df['runtime'] >= 100) & (filtered_df['runtime'] <= 150)]
        else:  # long
            filtered_df = filtered_df[filtered_df['runtime'] > 150]
    return filtered_df.sort_values(by=['popularity', 'vote_average'], ascending=False).head(10)
Say someone picks "action" and "sci-fi," "recent," and "medium." I'd filter for movies from the last five years (2019 onward if you run it in 2024), 100-150 minutes long, with "action" or "sci-fi" in the genres, then sort by popularity and rating. Maybe Tenet pops up. Boom, instant recs without any prior user data.
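To watch the filters work without pandas, here's a plain-Python sketch of roughly the same rules on a toy catalog. The movies, years, runtimes, and scores are placeholder values I invented, and I pass current_year in as a parameter (my addition) so the result doesn't depend on when you run it.

```python
def quiz_filter(movies, quiz_answers, current_year, n=10):
    """Filter a list of movie dicts by quiz answers, then rank the survivors."""
    def keep(movie):
        genres = quiz_answers.get('genres')
        if genres and not any(g in movie['genres'].split() for g in genres):
            return False
        recency = quiz_answers.get('recency')
        if recency == 'recent' and movie['release_year'] < current_year - 5:
            return False
        if recency == 'classic' and movie['release_year'] >= 2000:
            return False
        duration = quiz_answers.get('duration')
        if duration == 'short' and not movie['runtime'] < 100:
            return False
        if duration == 'medium' and not 100 <= movie['runtime'] <= 150:
            return False
        if duration == 'long' and not movie['runtime'] > 150:
            return False
        return True

    kept = [m for m in movies if keep(m)]
    # Most popular first, with rating as the tie-breaker
    kept.sort(key=lambda m: (m['popularity'], m['vote_average']), reverse=True)
    return kept[:n]


# Placeholder catalog, not real TMDB data
catalog = [
    {'title': 'Movie A', 'genres': 'action sci-fi', 'release_year': 2021, 'runtime': 130, 'popularity': 80, 'vote_average': 7.5},
    {'title': 'Movie B', 'genres': 'romance', 'release_year': 2022, 'runtime': 110, 'popularity': 90, 'vote_average': 7.0},
    {'title': 'Movie C', 'genres': 'action', 'release_year': 1995, 'runtime': 120, 'popularity': 85, 'vote_average': 8.1},
    {'title': 'Movie D', 'genres': 'sci-fi', 'release_year': 2020, 'runtime': 170, 'popularity': 70, 'vote_average': 7.9},
    {'title': 'Movie E', 'genres': 'action', 'release_year': 2023, 'runtime': 105, 'popularity': 95, 'vote_average': 6.9},
]

answers = {'genres': ['action', 'sci-fi'], 'recency': 'recent', 'duration': 'medium'}
picks = quiz_filter(catalog, answers, current_year=2024)
print([m['title'] for m in picks])  # ['Movie E', 'Movie A']
```

Movie B fails the genre check, Movie C is too old, and Movie D runs too long, so only E and A survive, ordered by popularity.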
Mixing in Liked Movies
Once users like a few movies, I level up. The get_recommendations method combines content-based similarity with their tastes, adding a diversity twist:
def get_recommendations(self, user_id, top_n=10):
    user_data = self.user_data.get(user_id, {})
    liked_movies = user_data.get('liked_movies', [])
    watched_movies = user_data.get('watched_movies', [])
    all_similar_movies = {}
    for movie_id in liked_movies[-3:]:  # Last 3 liked movies
        similar_movies = self.content_recommender.get_similar_movies(movie_id, top_n=20)
        for movie in similar_movies:
            if movie['id'] in watched_movies or movie['id'] in liked_movies:
                continue
            movie['similarity'] = self.content_recommender.similarity_matrix[
                self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0],
                self.movies_df.index[self.movies_df['id'] == movie['id']].tolist()[0]
            ]
            if movie['id'] in all_similar_movies:
                current_score = all_similar_movies[movie['id']]['relevance_score']
                new_score = current_score + (movie['similarity'] * (1 - current_score * self.diversity_factor))
                all_similar_movies[movie['id']]['relevance_score'] = new_score
            else:
                movie['relevance_score'] = movie['similarity']
                all_similar_movies[movie['id']] = movie
    sorted_movies = sorted(all_similar_movies.values(), key=lambda x: x['relevance_score'], reverse=True)
    return sorted_movies[:top_n]
Here's the cool part: that diversity_factor. If someone likes The Matrix and Inception, I don't just suggest ten sci-fi mind-benders; I mix it up so they don't get bored. When a candidate is similar to more than one liked movie, the formula adds the new similarity at a discount that grows with the score the candidate already has, so no single movie runs away with the ranking just because it matched several times. It's like curating a playlist with variety, not just repeats of the same beat.
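To make the arithmetic concrete, here's the update rule pulled out on its own. The default of 0.1 matches the HybridRecommender constructor in the full listing; the similarity values 0.8 and 0.6 are made up, and combine_score is just a name I picked for the demo.

```python
def combine_score(current_score, similarity, diversity_factor=0.1):
    """Add a new similarity to an existing score, discounted by that score."""
    return current_score + similarity * (1 - current_score * diversity_factor)


first_match = 0.8  # similarity to the first liked movie
second_match = 0.6  # similarity to a second liked movie

plain_sum = first_match + second_match
mild = combine_score(first_match, second_match)  # 0.8 + 0.6 * (1 - 0.08)
strong = combine_score(first_match, second_match, diversity_factor=0.5)  # 0.8 + 0.6 * (1 - 0.4)

print(round(plain_sum, 3), round(mild, 3), round(strong, 3))  # 1.4 1.352 1.16
```

With the default factor the discount is small (1.352 versus a plain sum of 1.4), so a larger diversity_factor is the knob to turn if repeated matches still dominate the list.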
app.py: Serving It Up
Finally, app.py is where it all comes together: a Flask app that delivers recommendations to users. I've got endpoints like:
- /recommendations/similar/<movie_id>: Pure content-based matches.
- /recommendations/personalized: Hybrid recs for logged-in users.
- /recommendations/quiz-based: Quiz-driven suggestions.
I kick things off by loading data and initializing recommenders:
app = Flask(__name__)
movies_df, content_recommender, hybrid_recommender = None, None, None

def load_movie_data():
    global movies_df, content_recommender, hybrid_recommender
    movies_df = pd.read_csv('tmdb_movies.csv')  # Or fetch from TMDB API
    content_recommender = ContentBasedRecommender()
    content_recommender.fit(movies_df)
    hybrid_recommender = HybridRecommender(content_recommender)

load_movie_data()

@app.route('/recommendations/similar/<int:movie_id>')
def get_similar(movie_id):
    recs = content_recommender.get_similar_movies(movie_id)
    return jsonify(recs)

@app.route('/recommendations/personalized/<int:user_id>')
def get_personalized(user_id):
    recs = hybrid_recommender.get_recommendations(user_id)
    return jsonify(recs)

if __name__ == '__main__':
    app.run(debug=True)
I'm caching TMDB data (smart, since APIs hate being spammed) and pre-computing the similarity matrix so requests are fast. It's like pre-baking a cake—when someone orders, you just slice and serve.
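The caching layer itself isn't shown in this post, so here's one simple way it could look: an in-process cache using functools.lru_cache. The fetch_movie_from_tmdb function is a hypothetical stand-in that returns canned data instead of calling the real API, and the call counter only exists to show the cache doing its job.

```python
from functools import lru_cache

calls = {'count': 0}


def fetch_movie_from_tmdb(movie_id):
    """Hypothetical stand-in for a real TMDB request; returns canned data."""
    calls['count'] += 1
    return {'id': movie_id, 'title': f'Movie {movie_id}'}


@lru_cache(maxsize=1024)
def get_movie(movie_id):
    """Fetch a movie once, then serve repeat lookups from memory."""
    return fetch_movie_from_tmdb(movie_id)


get_movie(27205)
get_movie(27205)  # served from the cache, no second fetch
get_movie(155)
print(calls['count'])  # 2
```

Two caveats with this approach: lru_cache hands back the same dict object each time, so callers shouldn't mutate it, and an in-process cache disappears whenever the process restarts, so a shared store may be a better fit if restarts are frequent.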
Wrapping Up: The Journey
So, there you have it! From a Netflix "aha!" moment to a full-blown recommendation system, I've built something awesome. content_based_recommender.py finds movie twins, hybrid_recommender.py adds the user's personality, and app.py delivers it with a bow. I've tackled sparse data, performance hiccups, and user experience like a pro.
Next time I'm tweaking this, maybe I'll play with weighting the hybrid inputs more dynamically—could be a fun experiment! Building YMovies taught me that the best recommendations aren't just about accuracy—they're about surprise and delight too.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from flask import Flask, jsonify
from datetime import datetime
# content_based_recommender.py
class ContentBasedRecommender:
    def __init__(self):
        self.movies_df = None
        self.tfidf = None
        self.tfidf_matrix = None
        self.similarity_matrix = None

    def _combine_features(self, row):
        features = []
        if 'genres' in row and row['genres']:
            genres = row['genres'].split()
            features.extend([g for g in genres for _ in range(3)])
        if 'overview' in row and row['overview']:
            features.append(row['overview'])
        if 'cast' in row and row['cast']:
            cast_list = row['cast'].split()[:5]
            features.extend([c for c in cast_list for _ in range(2)])
        if 'director' in row and row['director']:
            features.extend([row['director']] * 3)
        if 'keywords' in row and row['keywords']:
            features.append(row['keywords'])
        return ' '.join(features)

    def fit(self, movies_df):
        self.movies_df = movies_df
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            ngram_range=(1, 2),
            min_df=2,
            max_features=5000
        )
        combined_features = self.movies_df.apply(self._combine_features, axis=1)
        self.tfidf_matrix = self.tfidf.fit_transform(combined_features)
        self.similarity_matrix = cosine_similarity(self.tfidf_matrix)

    def get_similar_movies(self, movie_id, top_n=10):
        # Find the row for this movie; assumes movies_df has a default 0..n-1 index
        matches = self.movies_df.index[self.movies_df['id'] == movie_id].tolist()
        if not matches:
            return []  # Unknown movie_id: nothing to recommend
        movie_idx = matches[0]
        sim_scores = list(enumerate(self.similarity_matrix[movie_idx]))
        # Sort by similarity, then skip the first entry (the movie itself)
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
        movie_indices = [i[0] for i in sim_scores]
        return self.movies_df.iloc[movie_indices][['id', 'title']].to_dict('records')
# hybrid_recommender.py
class HybridRecommender:
    def __init__(self, content_recommender, diversity_factor=0.1):
        self.content_recommender = content_recommender
        self.diversity_factor = diversity_factor
        self.movies_df = content_recommender.movies_df
        self.user_data = {}

    def _get_quiz_based_recommendations(self, quiz_answers):
        filtered_df = self.movies_df.copy()
        if 'genres' in quiz_answers:
            genre_list = quiz_answers['genres']
            filtered_df = filtered_df[filtered_df['genres'].apply(
                lambda x: any(g in x.split() for g in genre_list))]
        if 'recency' in quiz_answers:
            current_year = datetime.now().year
            if quiz_answers['recency'] == 'recent':
                filtered_df = filtered_df[filtered_df['release_year'] >= current_year - 5]
            elif quiz_answers['recency'] == 'classic':
                filtered_df = filtered_df[filtered_df['release_year'] < 2000]
        if 'duration' in quiz_answers:
            if quiz_answers['duration'] == 'short':
                filtered_df = filtered_df[filtered_df['runtime'] < 100]
            elif quiz_answers['duration'] == 'medium':
                filtered_df = filtered_df[(filtered_df['runtime'] >= 100) & (filtered_df['runtime'] <= 150)]
            else:
                filtered_df = filtered_df[filtered_df['runtime'] > 150]
        return filtered_df.sort_values(by=['popularity', 'vote_average'], ascending=False).head(10)

    def get_recommendations(self, user_id, top_n=10):
        user_data = self.user_data.get(user_id, {})
        liked_movies = user_data.get('liked_movies', [])
        watched_movies = user_data.get('watched_movies', [])
        all_similar_movies = {}
        for movie_id in liked_movies[-3:]:
            similar_movies = self.content_recommender.get_similar_movies(movie_id, top_n=20)
            for movie in similar_movies:
                if movie['id'] in watched_movies or movie['id'] in liked_movies:
                    continue
                movie['similarity'] = self.content_recommender.similarity_matrix[
                    self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0],
                    self.movies_df.index[self.movies_df['id'] == movie['id']].tolist()[0]
                ]
                if movie['id'] in all_similar_movies:
                    current_score = all_similar_movies[movie['id']]['relevance_score']
                    new_score = current_score + (movie['similarity'] * (1 - current_score * self.diversity_factor))
                    all_similar_movies[movie['id']]['relevance_score'] = new_score
                else:
                    movie['relevance_score'] = movie['similarity']
                    all_similar_movies[movie['id']] = movie
        sorted_movies = sorted(all_similar_movies.values(), key=lambda x: x['relevance_score'], reverse=True)
        return sorted_movies[:top_n]
# app.py
app = Flask(__name__)
movies_df, content_recommender, hybrid_recommender = None, None, None

def load_movie_data():
    global movies_df, content_recommender, hybrid_recommender
    movies_df = pd.read_csv('tmdb_movies.csv')  # Placeholder for TMDB API data
    content_recommender = ContentBasedRecommender()
    content_recommender.fit(movies_df)
    hybrid_recommender = HybridRecommender(content_recommender)

load_movie_data()

@app.route('/recommendations/similar/<int:movie_id>')
def get_similar(movie_id):
    recs = content_recommender.get_similar_movies(movie_id)
    return jsonify(recs)

@app.route('/recommendations/personalized/<int:user_id>')
def get_personalized(user_id):
    recs = hybrid_recommender.get_recommendations(user_id)
    return jsonify(recs)

if __name__ == '__main__':
    app.run(debug=True)