Strap in as we take this long-awaited journey into how I, a human being, created one of the greatest sites to ever launch on the internet. (Had to say that, lol!)
Well, it started just like any other story: I was on the couch watching Netflix.
And while I was at this great task, their "Because you watched..." feature on the home page caught my eye.
I mean, it’s not the first time I’ve seen a recommendation system; I’m a data science student, after all.
Nevertheless, it was quite uncanny to see how accurate it was. (I know people on Reddit complain about how inaccurate it is, lol)
This whole thing sparked a question: Could I build something like that?
Fast forward to taking the Machine Learning course on Coursera, which helped me better understand how I could actually build a recommender for this app.
In this not-so-small blog, I will try to explain how I created this app.
Hopefully, I will inspire you to create something magical like it.
Check out the source code, or hit the live demo button to see it in action.
Building The Recommendation Engine
1. From Collaborative to Content-Based:
The Machine Learning course introduced me to collaborative filtering; in basic terms, it is matchmaking based on user behavior.
Say you and I both love Inception, and I also praise Interstellar: the system might suggest Interstellar to you.
It's quite powerful; however, it needs a ton of user data.
Starting fresh with YMovies, I didn't have that luxury.
And so, I turned to content-based filtering, which looks at a movie's features (genres, plot, cast...) and finds similar ones based on what you've liked.
That makes it perfect for a new app, especially with the TMDB API providing rich movie metadata.
For anyone starting off a project with limited user data, it's a great place to start.
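For contrast, here is a minimal sketch of the collaborative idea described above, which makes the data problem easy to see. The users, ratings, and function names are made up for illustration; none of this is YMovies code.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale (made-up users and values).
ratings = {
    "me": {"Inception": 5, "Interstellar": 5, "Titanic": 2},
    "you": {"Inception": 5, "Titanic": 1},
    "sam": {"Inception": 1, "Titanic": 5, "The Notebook": 5},
}

def user_similarity(a, b):
    """Cosine similarity over the movies both users rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    dot = sum(ratings[a][m] * ratings[b][m] for m in shared)
    norm_a = sqrt(sum(ratings[a][m] ** 2 for m in shared))
    norm_b = sqrt(sum(ratings[b][m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)

def suggest_for(user):
    """Score unseen movies by similarity-weighted ratings from other users."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = user_similarity(user, other)
        for movie, rating in ratings[other].items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)

print(suggest_for("you"))  # ['Interstellar', 'The Notebook']
```

Every score here depends on other people's ratings, so with no users there is nothing to compute. That is the gap content-based filtering sidesteps.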
2. Creating the Content-Based Recommender
At the core of YMovies is a content-based recommender built with TF-IDF vectorization and cosine similarity.
In short, I combined each movie’s key details (genres, synopsis, cast, director, and keywords) into a single text string.
TF-IDF turns that text into vectors that weight terms distinctive to a movie more heavily than terms common across the whole catalog. Then, cosine similarity measures how close two movies are in this vector space.
For example, if you liked The Dark Knight, it might suggest Batman Begins based on shared genres (Action, Drama) and cast (Christian Bale).
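To make the TF-IDF and cosine similarity steps concrete, here is a minimal pure-Python sketch. It is a stand-in for illustration, not the code from content_based_recommender.py: the mini-catalog, the function names, and the smoothed IDF variant are all assumptions.

```python
import math
from collections import Counter

# Hypothetical mini-catalog; each value plays the role of a combined feature string.
catalog = {
    "The Dark Knight": "action drama bale gotham joker",
    "Batman Begins": "action drama bale gotham origin",
    "The Notebook": "romance drama gosling letters",
}

def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = {title: text.split() for title, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many movies does each term appear?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    vectors = {}
    for title, tokens in tokenized.items():
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): rare terms get larger weights.
        vectors[title] = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tfidf_vectors(catalog)
batman = cosine(vectors["The Dark Knight"], vectors["Batman Begins"])
notebook = cosine(vectors["The Dark Knight"], vectors["The Notebook"])
print(f"Batman Begins: {batman:.2f}, The Notebook: {notebook:.2f}")
```

The two Batman films share several fairly rare terms, so they score much closer to each other than either does to The Notebook, which only shares the catalog-wide term "drama".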
Check out this section from content_based_recommender.py that emphasizes certain features, like genres and directors, by giving them higher weights:
def _combine_features(self, row):
    features = []
    if 'genres' in row and row['genres']:
        genres = row['genres'].split()
        features.extend([g for g in genres for _ in range(3)])  # Triple weight for genres
    if 'overview' in row and row['overview']:
        features.append(row['overview'])
    if 'cast' in row and row['cast']:
        cast_list = row['cast'].split()[:5]
        features.extend([c for c in cast_list for _ in range(2)])  # Double weight for cast
    if 'director' in row and row['director']:
        features.extend([row['director']] * 3)  # Triple weight for director
    if 'keywords' in row and row['keywords']:
        features.append(row['keywords'])
    return ' '.join(features)
This enables the get_similar_movies function to highlight movies similar to your favorites, essentially forming the backbone of the "Because you liked..." feature.
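As a rough illustration of that lookup step, here is a sketch of how a function like get_similar_movies could rank candidates once similarity scores exist. The signature and the scores are assumptions for illustration; the real method may look different.

```python
# Hypothetical precomputed cosine similarities for one movie (made-up values).
similarity = {
    "The Dark Knight": {
        "The Dark Knight": 1.0,
        "Batman Begins": 0.68,
        "Inception": 0.41,
        "The Notebook": 0.11,
    },
}

def get_similar_movies(title, n=2):
    """Return the n most similar titles, skipping the movie itself."""
    scores = similarity[title]
    candidates = [other for other in scores if other != title]
    # Highest similarity first.
    candidates.sort(key=lambda other: scores[other], reverse=True)
    return candidates[:n]

print(get_similar_movies("The Dark Knight"))  # ['Batman Begins', 'Inception']
```

Skipping the movie itself matters because every movie has a similarity of 1.0 with itself and would otherwise always rank first.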
3. Mixing It Up with a Hybrid Recommender
Content-based filtering was a good start, but I wanted YMovies to feel personal.
This is where the hybrid recommender in hybrid_recommender.py came in handy. It mixes content similarity with user preferences from a quiz and interaction history.
For new users, the quiz solves the cold start problem: making those first recommendations when there is no data to lean on.
Here is how it works:
def get_recommendations(self, user_data, n=20):
    # Excerpt: liked_movie_ids, recent_liked_ids, exclude_ids, and the quiz_* values
    # are derived from user_data in code omitted here.
    recommendations = []
    if liked_movie_ids:
        for movie_id in recent_liked_ids:
            similar_movies = self.get_because_you_liked_recommendations(movie_id, n=10)
            # Filter out watched movies and add to recommendations
    if quiz_genres:
        quiz_recs = self._get_quiz_based_recommendations(
            quiz_genres, quiz_year_range, quiz_duration, exclude_ids, n=20
        )
        # Add quiz-based picks
    return recommendations
If you have liked Inception, it suggests similar films.
If you said "Sci-Fi" and "recent" in the quiz, it narrows down to movies like Interstellar. It's a combination of math and user insight.
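The merging step is only hinted at in the excerpt, so here is one way it could work, as a sketch under my own assumptions: interleave the two ranked lists, drop duplicates, and skip anything already watched. The helper name merge_recommendations and the movie IDs are hypothetical.

```python
from itertools import zip_longest

def merge_recommendations(liked_based, quiz_based, watched_ids, n=5):
    """Interleave two ranked lists of movie IDs, skipping duplicates and watched movies."""
    merged = []
    seen = set(watched_ids)
    # zip_longest keeps going after the shorter list runs out (padding with None).
    for pair in zip_longest(liked_based, quiz_based):
        for movie_id in pair:
            if movie_id is not None and movie_id not in seen:
                seen.add(movie_id)
                merged.append(movie_id)
    return merged[:n]

liked_based = [101, 102, 103]  # from "because you liked" lookups
quiz_based = [102, 201, 202]   # from quiz answers
print(merge_recommendations(liked_based, quiz_based, watched_ids={103}))
# [101, 102, 201, 202]
```

Interleaving gives both sources a voice near the top of the list; a weighted score would be another reasonable design.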
The "Because You Liked..." section
The Netflix-inspired section is one of the main focuses of YMovies.
The get_because_you_liked_recommendations method uses content similarity to suggest movies, tagging each with a reason like "Because you liked Inception."
I made sure to exclude movies you've already added to your watchlist, just to keep it fresh and relevant.
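Here is a small sketch of the tagging and watchlist filtering described above. The function name, arguments, and sample titles are assumptions for illustration, and the list of similar titles is assumed to come from the content-based step.

```python
def because_you_liked(liked_title, similar_titles, watchlist, n=3):
    """Tag each suggestion with the liked movie that triggered it."""
    reason = f"Because you liked {liked_title}"
    # Drop the liked movie itself and anything already on the watchlist.
    picks = [t for t in similar_titles if t != liked_title and t not in watchlist]
    return [{"title": t, "reason": reason} for t in picks[:n]]

recs = because_you_liked(
    "Inception",
    similar_titles=["Interstellar", "Tenet", "The Prestige", "Memento"],
    watchlist={"Tenet"},
)
for rec in recs:
    print(f'{rec["title"]}: {rec["reason"]}')
```

Carrying the reason along with each pick lets the front end group suggestions under the right "Because you liked..." row without recomputing anything.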
Solving the Cold Start with a Quiz
For newbies, the quiz is an actual lifesaver.
It asks about genres (e.g., Action, Romance), year ranges (recent or classic), and runtime preferences (short, medium, long).
The _get_quiz_based_recommendations function filters the movie pool accordingly:
- Recent: Last 5 years
- Classic: Pre-2000
- Short: Under 100 minutes
It then ranks by popularity and ratings, ensuring solid picks from the start.
Quick Note: If you're building a system, a quiz like this is a quick win for onboarding users.
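The quiz filters above can be sketched in a few lines. This is a simplified stand-in, not the real _get_quiz_based_recommendations: the record fields, the made-up movies, and the way popularity and rating are combined for ranking are assumptions. Only the thresholds listed above are implemented; medium and long runtimes are left out because their cutoffs aren't given.

```python
from datetime import date

# Hypothetical movie records; titles and field names are made up.
movies = [
    {"title": "Star Drift", "genres": {"Sci-Fi"}, "year": 2023, "runtime": 95, "popularity": 80.0, "rating": 7.1},
    {"title": "Moon Heist", "genres": {"Sci-Fi", "Action"}, "year": 2022, "runtime": 130, "popularity": 95.0, "rating": 6.8},
    {"title": "Old Orbit", "genres": {"Sci-Fi"}, "year": 1985, "runtime": 98, "popularity": 40.0, "rating": 8.0},
    {"title": "Love Letters", "genres": {"Romance"}, "year": 2024, "runtime": 90, "popularity": 70.0, "rating": 6.5},
]

def quiz_recommendations(movies, genres, era=None, duration=None, current_year=None, n=20):
    """Filter by quiz answers, then rank by popularity with rating as a tie-breaker."""
    current_year = current_year or date.today().year
    picks = []
    for movie in movies:
        if genres and not (movie["genres"] & set(genres)):
            continue  # no overlap with the chosen genres
        if era == "recent" and movie["year"] < current_year - 5:
            continue  # recent: last 5 years
        if era == "classic" and movie["year"] >= 2000:
            continue  # classic: pre-2000
        if duration == "short" and movie["runtime"] >= 100:
            continue  # short: under 100 minutes
        picks.append(movie)
    picks.sort(key=lambda m: (m["popularity"], m["rating"]), reverse=True)
    return [m["title"] for m in picks[:n]]

print(quiz_recommendations(movies, ["Sci-Fi"], era="recent", current_year=2025))
# ['Moon Heist', 'Star Drift']
```

Passing current_year explicitly keeps the example reproducible; left unset, it falls back to today's year.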
Technical Problems and Fixes
Building YMovies wasn't all smooth sailing. Here's what I struggled with:
- Data Wrangling: TMDB data was a goldmine, but messy: genre IDs needed mapping, overviews had gaps... I cleaned it in app.py's load_movie_data function.
- Speed: Calculating similarities for thousands of movies slowed things down a lot. Sparse matrices for TF-IDF vectors saved the day; pro tip for handling big datasets.
- Balancing Act: How much should quiz answers weigh versus liked movies? I tuned it with a diversity_factor to mix things up without being repetitive.
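To give a feel for the data wrangling point, here is a minimal cleaning sketch. It is not the actual load_movie_data function: the helper name clean_movie, the sample record, and the choice to remove spaces from genre names are assumptions. The genre map is only a small subset of TMDB's genre list.

```python
# A few TMDB genre IDs; the full mapping comes from TMDB's genre list.
GENRE_NAMES = {28: "Action", 18: "Drama", 878: "Science Fiction"}

def clean_movie(raw):
    """Map genre IDs to names and fill gaps so later text processing never sees None."""
    names = [GENRE_NAMES[g] for g in raw.get("genre_ids", []) if g in GENRE_NAMES]
    return {
        "id": raw["id"],
        "title": (raw.get("title") or "").strip(),
        # Remove inner spaces so a later .split() keeps "ScienceFiction" as one token.
        "genres": " ".join(name.replace(" ", "") for name in names),
        "overview": (raw.get("overview") or "").strip(),
    }

# Made-up record shaped like a TMDB result; 99999 stands in for an unmapped genre ID.
raw = {"id": 1, "title": " Inception ", "genre_ids": [28, 878, 99999], "overview": None}
print(clean_movie(raw))
# {'id': 1, 'title': 'Inception', 'genres': 'Action ScienceFiction', 'overview': ''}
```

Turning missing overviews into empty strings means the feature-combining step can simply skip them instead of crashing on None.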
Production Results
I'd be lying if I said I was a hundred percent sure I'd finish this project. But now that I'm here, I'm glad I did it. It was a slow start, but I've learned a lot.
If you're just joining us, here is what happens in YMovies: content_based_recommender.py finds movie twins, hybrid_recommender.py adds the user's personality, and app.py delivers it with a bow.
Next time I change this, maybe I'll build something with the ByteDance open source recommendation algorithm.
PS: I didn't use the quiz feature after all; still, I do recommend it.