YMovies: Building a Movie Recommendation Web App

00:12:46:40

I was sprawled on my couch, deep into a Netflix binge, when their "Because you watched..." feature caught my eye. It was uncanny—almost like Netflix had a direct line to my movie tastes. That sparked a question: Could I build something like that?

Fast forward to enrolling in the Machine Learning Specialization on Coursera by DeepLearning.AI, and my project, YMovies, was born—a web app designed to recommend movies tailored to your preferences. Here's how I turned that spark into a working recommendation system, complete with code, challenges, and lessons learned.

View on GitHub

Live Demo

The Netflix Spark and a Pivot in Plans

The Coursera course introduced me to collaborative filtering—think of it as matchmaking based on user behavior. If you and I both love Inception, and I rave about Interstellar, the system might nudge you toward it. It's powerful, but there's a catch: it needs a ton of user data. Starting fresh with YMovies, I didn't have that luxury.

So, I pivoted to content-based filtering. This method looks at a movie's features—genres, plot, cast—and finds similar ones based on what you've liked. It's perfect for a new app, especially with the TMDB API handing me rich movie metadata. For anyone kicking off a project with limited user data, this is a great place to start.

Crafting the Content-Based Recommender

The core of YMovies is a content-based recommender built with TF-IDF vectorization and cosine similarity. Here's the gist: I mashed up movie features—genres, overview, cast, keywords—into a single text string per movie. TF-IDF (Term Frequency-Inverse Document Frequency) turns that text into vectors, highlighting key terms across the movie catalog. Then, cosine similarity measures how close two movies are in this vector space.

For example, if you liked The Dark Knight, it might suggest Batman Begins based on shared genres (Action, Drama) and cast (Christian Bale). Check out this snippet from content_based_recommender.py where I weighted features like genres and directors more heavily:

python

def _combine_features(self, row):
    features = []
    if 'genres' in row and row['genres']:
        genres = row['genres'].split()
        features.extend([g for g in genres for _ in range(3)])  # Triple weight for genres
    if 'overview' in row and row['overview']:
        features.append(row['overview'])
    if 'cast' in row and row['cast']:
        cast_list = row['cast'].split()[:5]
        features.extend([c for c in cast_list for _ in range(2)])  # Double weight for cast
    if 'director' in row and row['director']:
        features.extend([row['director']] * 3)  # Triple weight for director
    if 'keywords' in row and row['keywords']:
        features.append(row['keywords'])
    return ' '.join(features)

This powers the get_similar_movies function, which finds movies close to your favorites—forming the backbone of the "Because you liked..." feature.

Mixing It Up with a Hybrid Recommender

Content-based filtering was a solid start, but I wanted YMovies to feel personal. Enter the hybrid recommender in hybrid_recommender.py. It blends content similarity with user preferences from a quiz (genres, eras, durations) and interaction history (liked movies, watch history). For new users, the quiz tackles the cold start problem—those first recommendations when there's no data to lean on.

Here's how it works in action:

python

def get_recommendations(self, user_data, n=20):
    recommendations = []
    if liked_movie_ids:
        for movie_id in recent_liked_ids:
            similar_movies = self.get_because_you_liked_recommendations(movie_id, n=10)
            # Filter out watched movies and add to recommendations
    if quiz_genres:
        quiz_recs = self._get_quiz_based_recommendations(
            quiz_genres, quiz_year_range, quiz_duration, exclude_ids, n=20
        )
        # Add quiz-based picks
    return recommendations

If you've liked Inception, it pulls similar films. If you said "Sci-Fi" and "recent" in the quiz, it narrows down to movies like Interstellar. It's a combo of math and user insight.

The "Because You Liked..." Magic

That Netflix-inspired "Because you liked..." section? It's alive in YMovies. The get_because_you_liked_recommendations method uses content similarity to suggest movies, tagging each with a reason like "Because you liked Inception." I made sure to exclude movies you've already watched or added to your watchlist—keeping it fresh and relevant.

Solving the Cold Start with a Quiz

For newbies, the quiz is a lifesaver. It asks about genres (e.g., Action, Romance), year ranges (recent or classic), and runtime preferences (short, medium, long). The _get_quiz_based_recommendations function filters the movie pool accordingly:

Recent: Last 5 years
Classic: Pre-2000
Short: Under 100 minutes

It then ranks by popularity and ratings, ensuring solid picks from day one. If you're building a system, a quiz like this is a quick win for onboarding users.

Technical Hurdles and Fixes

Building YMovies wasn't all smooth sailing. Here's what I wrestled with:

Data Wrangling: TMDB data was a goldmine, but messy—genre IDs needed mapping, overviews had gaps. I cleaned it up in app.py's load_movie_data function.
Speed: Calculating similarities for thousands of movies bogged things down. Sparse matrices for TF-IDF vectors saved the day—pro tip for handling big datasets.
Balancing Act: How much should quiz answers weigh versus liked movies? I tweaked it with a diversity_factor to mix things up without being repetitive.

Serving It Up with Flask

The Flask app in app.py ties it all together, offering endpoints like:

/recommendations/similar/<movie_id>: Content-based picks
/recommendations/personalized: Full hybrid recommendations
/recommendations/quiz-based: Quiz-driven suggestions

It pulls movie data from TMDB, caches it, and initializes the recommenders on startup. Deployed on Vercel with Neon's serverless PostgreSQL, it scales like a dream.

What I Learned

After three iterations, YMovies taught me a ton:

Content-Based is Your Friend: Great for starting with item metadata.
Hybrid Wins: Combining approaches beats any single method.
Data Quality Matters: Clean, rich data fuels better recommendations.
User Experience is Key: It's not just about accuracy—surprise and delight matter too.

YMovies isn't perfect, but when I see "Because you liked The Matrix" pop up with Blade Runner 2049, I know it's working. Want to build your own? Start with what data you have, add a personal touch, and iterate. Who knows—you might just recommend the next big hit!

Deep Dive: The Technical Implementation

Let me walk you through the core files that power YMovies and how they work together to create that Netflix-like experience.

content_based_recommender.py: The Movie Matchmaker

Imagine you're at a party trying to introduce movies to each other. "Hey, Inception, have you met Interstellar? You've got a lot in common!" That's what this file does—it figures out which movies are similar based on their DNA: genres, overviews, cast, directors, and keywords.

Step 1: Building a Movie's "Profile"

First, you've got this _combine_features method that takes a movie's raw data and smushes it into one big text string. But you're not just throwing everything in the blender—you're picky about what gets emphasis. Genres and directors get triple the weight, cast gets double, and overviews and keywords get a single pass. Why? Because I figured genres (like "sci-fi" or "comedy") and directors (like "Nolan" or "Spielberg") are huge clues about a movie's vibe.

Here's what that looks like in code:

python

def _combine_features(self, row):
    features = []
    if 'genres' in row and row['genres']:
        genres = row['genres'].split()
        features.extend([g for g in genres for _ in range(3)])  # Triple weight for genres
    if 'overview' in row and row['overview']:
        features.append(row['overview'])
    if 'cast' in row and row['cast']:
        cast_list = row['cast'].split()[:5]  # Top 5 cast members
        features.extend([c for c in cast_list for _ in range(2)])  # Double weight for cast
    if 'director' in row and row['director']:
        features.extend([row['director']] * 3)  # Triple weight for director
    if 'keywords' in row and row['keywords']:
        features.append(row['keywords'])
    return ' '.join(features)

Let's say we've got The Dark Knight. Its genres might be "action crime thriller," its overview is something like "Batman fights the Joker in Gotham," its cast includes "Christian Bale" and "Heath Ledger," and its director is "Christopher Nolan." After this method, the string might look like:

text

action action action crime crime crime thriller thriller thriller Batman fights the Joker in Gotham Christian Christian Bale Bale Heath Heath Ledger Ledger Christopher Christopher Christopher Nolan Nolan Nolan

See how "action" and "Nolan" show up three times? That's the weighting at work—it's like shouting, "This movie is REALLY about action and Nolan's style!" Meanwhile, the overview keeps its natural flow, giving context without overcomplicating things.

Step 2: Turning Words into Numbers with TF-IDF

Now, you've got these big text strings for every movie, but computers don't speak English—they speak math. So, I use TF-IDF (Term Frequency-Inverse Document Frequency) to turn those strings into vectors. TF-IDF is like a bouncer at a club: it lets in words that matter (like "superhero" in action movies) and tones down ones that are too common (like "the" or "and").

Here's how I set it up:

python

def fit(self, movies_df):
    self.movies_df = movies_df
    self.tfidf = TfidfVectorizer(
        stop_words='english',       # Skip boring words like "the"
        ngram_range=(1, 2),         # Grab single words and pairs like "science fiction"
        min_df=2,                   # Ignore words that show up in fewer than 2 movies
        max_features=5000           # Keep it to 5,000 features max
    )
    combined_features = self.movies_df.apply(self._combine_features, axis=1)
    self.tfidf_matrix = self.tfidf.fit_transform(combined_features)

Stop words: Ditching "the" and "is" keeps the focus on meaningful terms.
N-grams: Grabbing "science fiction" as a pair is way smarter than just "science" or "fiction" alone—it captures the genre's essence.
Min_df and max_features: These are your performance guards. min_df=2 means a word has to appear in at least two movies to count (no one-off weirdos), and max_features=5000 stops your matrix from ballooning out of control.

For The Dark Knight, TF-IDF might give "action" and "Nolan" high scores because they're repeated and distinctive, while "Batman" gets a solid score but not as high since it pops up in other Batman movies too. The result? A sparse matrix where each movie is a row, and each column is a weighted word or phrase.

Step 3: Measuring Similarity with Cosine

Okay, now every movie's a vector in this high-dimensional space. How do you find the ones that are "close" to each other? Cosine similarity! It's like measuring the angle between two arrows—smaller angle, more similar. I compute this for every pair of movies and store it in a big similarity matrix:

python

self.similarity_matrix = cosine_similarity(self.tfidf_matrix)

For The Dark Knight, this might show Batman Begins with a similarity of 0.85 (same director, similar genres) and Iron Man at 0.60 (action and superhero vibes, but different flavor). The get_similar_movies method grabs the top matches:

python

def get_similar_movies(self, movie_id, top_n=10):
    movie_idx = self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0]
    sim_scores = list(enumerate(self.similarity_matrix[movie_idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
    movie_indices = [i[0] for i in sim_scores]
    return self.movies_df.iloc[movie_indices][['id', 'title']].to_dict('records')

This is your "If you liked X, try Y" engine. Simple, but powerful!

hybrid_recommender.py: Getting Personal

Content-based is cool, but I wanted YMovies to feel like it knows the user, not just the movies. Enter the hybrid recommender—it's like a chef mixing content similarity with user preferences from a quiz, liked movies, and watch history. Let's dig into how I made it personal.

Cold Start Savior: The Quiz

New users are tricky—no data, no recommendations. My fix? A quiz! In _get_quiz_based_recommendations, I ask about genres, recency, and runtime, then filter movies accordingly:

python

def _get_quiz_based_recommendations(self, quiz_answers):
    filtered_df = self.movies_df.copy()
    if 'genres' in quiz_answers:
        genre_list = quiz_answers['genres']
        filtered_df = filtered_df[filtered_df['genres'].apply(
            lambda x: any(g in x.split() for g in genre_list))]
    if 'recency' in quiz_answers:
        current_year = datetime.now().year
        if quiz_answers['recency'] == 'recent':
            filtered_df = filtered_df[filtered_df['release_year'] >= current_year - 5]
        elif quiz_answers['recency'] == 'classic':
            filtered_df = filtered_df[filtered_df['release_year'] < 2000]
    if 'duration' in quiz_answers:
        if quiz_answers['duration'] == 'short':
            filtered_df = filtered_df[filtered_df['runtime'] < 100]
        elif quiz_answers['duration'] == 'medium':
            filtered_df = filtered_df[(filtered_df['runtime'] >= 100) & (filtered_df['runtime'] <= 150)]
        else:  # long
            filtered_df = filtered_df[filtered_df['runtime'] > 150]
    return filtered_df.sort_values(by=['popularity', 'vote_average'], ascending=False).head(10)

Say someone picks "action" and "sci-fi," "recent," and "medium." I'd filter for movies from 2019-2024, 100-150 minutes long, with "action" or "sci-fi" in the genres, then sort by popularity and rating. Maybe Tenet pops up—boom, instant recs without any prior user data.

Mixing in Liked Movies

Once users like a few movies, I level up. The get_recommendations method combines content-based similarity with their tastes, adding a diversity twist:

python

def get_recommendations(self, user_id, top_n=10):
    user_data = self.user_data.get(user_id, {})
    liked_movies = user_data.get('liked_movies', [])
    watched_movies = user_data.get('watched_movies', [])
    all_similar_movies = {}
    
    for movie_id in liked_movies[-3:]:  # Last 3 liked movies
        similar_movies = self.content_recommender.get_similar_movies(movie_id, top_n=20)
        for movie in similar_movies:
            if movie['id'] in watched_movies or movie['id'] in liked_movies:
                continue
            movie['similarity'] = self.content_recommender.similarity_matrix[
                self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0],
                self.movies_df.index[self.movies_df['id'] == movie['id']].tolist()[0]
            ]
            if movie['id'] in all_similar_movies:
                current_score = all_similar_movies[movie['id']]['relevance_score']
                new_score = current_score + (movie['similarity'] * (1 - current_score * self.diversity_factor))
                all_similar_movies[movie['id']]['relevance_score'] = new_score
            else:
                movie['relevance_score'] = movie['similarity']
                all_similar_movies[movie['id']] = movie
    
    sorted_movies = sorted(all_similar_movies.values(), key=lambda x: x['relevance_score'], reverse=True)
    return sorted_movies[:top_n]

Here's the cool part: that diversity_factor. If someone likes The Matrix and Inception, I don't just suggest ten sci-fi mind-benders—I mix it up so they don't get bored. The formula tweaks scores to favor movies that haven't been over-represented yet. It's like curating a playlist with variety, not just repeats of the same beat.

app.py: Serving It Up

Finally, app.py is where it all comes together—a Flask app that delivers recommendations to users. I've got endpoints like:

/recommendations/similar/<movie_id>: Pure content-based matches.
/recommendations/personalized: Hybrid recs for logged-in users.
/recommendations/quiz-based: Quiz-driven suggestions.

I kick things off by loading data and initializing recommenders:

python

app = Flask(__name__)
movies_df, content_recommender, hybrid_recommender = None, None, None

def load_movie_data():
    global movies_df, content_recommender, hybrid_recommender
    movies_df = pd.read_csv('tmdb_movies.csv')  # Or fetch from TMDB API
    content_recommender = ContentBasedRecommender()
    content_recommender.fit(movies_df)
    hybrid_recommender = HybridRecommender(content_recommender)

load_movie_data()

@app.route('/recommendations/similar/<int:movie_id>')
def get_similar(movie_id):
    recs = content_recommender.get_similar_movies(movie_id)
    return jsonify(recs)

@app.route('/recommendations/personalized/<int:user_id>')
def get_personalized(user_id):
    recs = hybrid_recommender.get_recommendations(user_id)
    return jsonify(recs)

if __name__ == '__main__':
    app.run(debug=True)

I'm caching TMDB data (smart, since APIs hate being spammed) and pre-computing the similarity matrix so requests are fast. It's like pre-baking a cake—when someone orders, you just slice and serve.

Wrapping Up: The Journey

So, there you have it! From a Netflix "aha!" moment to a full-blown recommendation system, I've built something awesome. content_based_recommender.py finds movie twins, hybrid_recommender.py adds the user's personality, and app.py delivers it with a bow. I've tackled sparse data, performance hiccups, and user experience like a pro.

Next time I'm tweaking this, maybe I'll play with weighting the hybrid inputs more dynamically—could be a fun experiment! Building YMovies taught me that the best recommendations aren't just about accuracy—they're about surprise and delight too.

python

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from flask import Flask, jsonify
from datetime import datetime

# content_based_recommender.py
class ContentBasedRecommender:
    def __init__(self):
        self.movies_df = None
        self.tfidf = None
        self.tfidf_matrix = None
        self.similarity_matrix = None

    def _combine_features(self, row):
        features = []
        if 'genres' in row and row['genres']:
            genres = row['genres'].split()
            features.extend([g for g in genres for _ in range(3)])
        if 'overview' in row and row['overview']:
            features.append(row['overview'])
        if 'cast' in row and row['cast']:
            cast_list = row['cast'].split()[:5]
            features.extend([c for c in cast_list for _ in range(2)])
        if 'director' in row and row['director']:
            features.extend([row['director']] * 3)
        if 'keywords' in row and row['keywords']:
            features.append(row['keywords'])
        return ' '.join(features)

    def fit(self, movies_df):
        self.movies_df = movies_df
        self.tfidf = TfidfVectorizer(
            stop_words='english',
            ngram_range=(1, 2),
            min_df=2,
            max_features=5000
        )
        combined_features = self.movies_df.apply(self._combine_features, axis=1)
        self.tfidf_matrix = self.tfidf.fit_transform(combined_features)
        self.similarity_matrix = cosine_similarity(self.tfidf_matrix)

    def get_similar_movies(self, movie_id, top_n=10):
        movie_idx = self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0]
        sim_scores = list(enumerate(self.similarity_matrix[movie_idx]))
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]
        movie_indices = [i[0] for i in sim_scores]
        return self.movies_df.iloc[movie_indices][['id', 'title']].to_dict('records')

# hybrid_recommender.py
class HybridRecommender:
    def __init__(self, content_recommender, diversity_factor=0.1):
        self.content_recommender = content_recommender
        self.diversity_factor = diversity_factor
        self.movies_df = content_recommender.movies_df
        self.user_data = {}

    def _get_quiz_based_recommendations(self, quiz_answers):
        filtered_df = self.movies_df.copy()
        if 'genres' in quiz_answers:
            genre_list = quiz_answers['genres']
            filtered_df = filtered_df[filtered_df['genres'].apply(
                lambda x: any(g in x.split() for g in genre_list))]
        if 'recency' in quiz_answers:
            current_year = datetime.now().year
            if quiz_answers['recency'] == 'recent':
                filtered_df = filtered_df[filtered_df['release_year'] >= current_year - 5]
            elif quiz_answers['recency'] == 'classic':
                filtered_df = filtered_df[filtered_df['release_year'] < 2000]
        if 'duration' in quiz_answers:
            if quiz_answers['duration'] == 'short':
                filtered_df = filtered_df[filtered_df['runtime'] < 100]
            elif quiz_answers['duration'] == 'medium':
                filtered_df = filtered_df[(filtered_df['runtime'] >= 100) & (filtered_df['runtime'] <= 150)]
            else:
                filtered_df = filtered_df[filtered_df['runtime'] > 150]
        return filtered_df.sort_values(by=['popularity', 'vote_average'], ascending=False).head(10)

    def get_recommendations(self, user_id, top_n=10):
        user_data = self.user_data.get(user_id, {})
        liked_movies = user_data.get('liked_movies', [])
        watched_movies = user_data.get('watched_movies', [])
        all_similar_movies = {}
        for movie_id in liked_movies[-3:]:
            similar_movies = self.content_recommender.get_similar_movies(movie_id, top_n=20)
            for movie in similar_movies:
                if movie['id'] in watched_movies or movie['id'] in liked_movies:
                    continue
                movie['similarity'] = self.content_recommender.similarity_matrix[
                    self.movies_df.index[self.movies_df['id'] == movie_id].tolist()[0],
                    self.movies_df.index[self.movies_df['id'] == movie['id']].tolist()[0]
                ]
                if movie['id'] in all_similar_movies:
                    current_score = all_similar_movies[movie['id']]['relevance_score']
                    new_score = current_score + (movie['similarity'] * (1 - current_score * self.diversity_factor))
                    all_similar_movies[movie['id']]['relevance_score'] = new_score
                else:
                    movie['relevance_score'] = movie['similarity']
                    all_similar_movies[movie['id']] = movie
        sorted_movies = sorted(all_similar_movies.values(), key=lambda x: x['relevance_score'], reverse=True)
        return sorted_movies[:top_n]

# app.py
app = Flask(__name__)
movies_df, content_recommender, hybrid_recommender = None, None, None

def load_movie_data():
    global movies_df, content_recommender, hybrid_recommender
    movies_df = pd.read_csv('tmdb_movies.csv')  # Placeholder for TMDB API data
    content_recommender = ContentBasedRecommender()
    content_recommender.fit(movies_df)
    hybrid_recommender = HybridRecommender(content_recommender)

load_movie_data()

@app.route('/recommendations/similar/<int:movie_id>')
def get_similar(movie_id):
    recs = content_recommender.get_similar_movies(movie_id)
    return jsonify(recs)

@app.route('/recommendations/personalized/<int:user_id>')
def get_personalized(user_id):
    recs = hybrid_recommender.get_recommendations(user_id)
    return jsonify(recs)

if __name__ == '__main__':
    app.run(debug=True)

Photo by Felix Mooneeram on Unsplash