Live Demo • Report Bug • Request Feature
- Overview
- Features
- Demo
- Architecture
- Tech Stack
- Quick Start
- Installation
- Dataset
- Model Training
- Project Structure
- Performance
- How to Use
- API Reference
- FAQ
- Contributing
- License
- Contact
A content-based movie recommendation system that suggests films based on similarity in genres, keywords, cast, crew, and plot. Built with Streamlit and powered by machine learning, it provides personalized recommendations with rich metadata from TMDB API.
- Content-Based Filtering using NLP and cosine similarity
- Real-Time Data from TMDB API (posters, trailers, cast, ratings)
- Fast Recommendations with pre-computed similarity matrix
- 4,800+ Movies in the catalog
- Interactive UI with trending movies, random suggestions, and viewing history
| Feature | Description |
|---|---|
| Movie Search | Search from 4,800+ movies and get instant recommendations |
| Surprise Me | Random movie discovery with full details |
| Trending Movies | Weekly trending films from TMDB |
| Rich Metadata | Cast, crew, budget, revenue, ratings, runtime, trailers |
| Viewing History | Track and revisit recently viewed movies |
| Responsive Design | Mobile-friendly interface |
Launch Live Demo
What you can do:
- Search through 4,800+ movies
- Get 5 similar movie recommendations instantly
- View detailed information (cast, crew, budget, ratings, trailers)
- Discover trending movies weekly
- Get random movie suggestions
┌─────────────────┐
│   User Input    │
│  (Movie Title)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Streamlit App  │
│   (Frontend)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Recommender   │
│     Engine      │
│  (Cosine Sim.)  │
└────────┬────────┘
         │
   ┌─────┴────────────┬───────────────┐
   ▼                  ▼               ▼
┌──────────────┐  ┌──────────┐  ┌──────────────┐
│  Similarity  │  │ TMDB API │  │ Local Cache  │
│ Matrix (pkl) │  │  (Live)  │  │  (Session)   │
└──────────────┘  └──────────┘  └──────────────┘
- Text Vectorization: Convert movie features (genres, keywords, cast, crew, overview) into vectors using CountVectorizer (5000 features)
- Similarity Computation: Calculate cosine similarity between all movie pairs (4806 × 4806 matrix)
- Recommendation: For a given movie, retrieve top 5 most similar movies based on cosine similarity scores
Cosine Similarity Formula:
similarity(A, B) = (A · B) / (||A|| × ||B||)
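For intuition, here is a small standalone Python sketch (not taken from app.py) that checks the formula against scikit-learn on two toy count vectors:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors for two movies (counts over a tiny shared vocabulary)
A = np.array([[2, 0, 1, 1, 0]])
B = np.array([[1, 1, 1, 0, 0]])

# Manual formula: (A · B) / (||A|| × ||B||)
manual = (A @ B.T)[0, 0] / (np.linalg.norm(A) * np.linalg.norm(B))
print(round(manual, 4))                         # 0.7071
print(round(cosine_similarity(A, B)[0, 0], 4))  # 0.7071, same value via scikit-learn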
| Category | Technologies |
|---|---|
| Framework | Streamlit |
| ML/NLP | scikit-learn, NLTK (PorterStemmer) |
| Data Processing | Pandas, NumPy, Pickle |
| API | TMDB API, Requests |
| Deployment | Streamlit Cloud |
# Clone the repository
git clone https://github.com/hk-kumawat/Movie-Recommender-System.git
cd Movie-Recommender-System
# Install dependencies
pip install -r requirements.txt
# Set up TMDB API key (see below)
mkdir .streamlit
printf '[tmdb]\napi_key = "YOUR_API_KEY"\n' > .streamlit/secrets.toml
# Run the application
streamlit run app.py
Access the app at: http://localhost:8501
- Python 3.8 or higher
- pip package manager
- TMDB API key (Get one here)
1. Clone the Repository
git clone https://github.com/hk-kumawat/Movie-Recommender-System.git
cd Movie-Recommender-System
2. Create Virtual Environment (Recommended)
python -m venv venv
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
3. Install Dependencies
pip install -r requirements.txt
4. Configure TMDB API Key
Create .streamlit/secrets.toml:
[tmdb]
api_key = "your_tmdb_api_key_here"
How to get TMDB API Key:
- Sign up at themoviedb.org
- Go to Settings → API
- Request API Key (select "Developer")
- Copy your API key
5. Run the Application
streamlit run app.py
| Issue | Solution |
|---|---|
| `ModuleNotFoundError` | Run `pip install -r requirements.txt` |
| API Key Error | Check `.streamlit/secrets.toml` format |
| Port Already in Use | Use `streamlit run app.py --server.port 8502` |
| NLTK Data Missing | Run `python -m nltk.downloader punkt stopwords` |
Source: TMDb 5000 Movie Dataset (Kaggle)
| File | Records | Description |
|---|---|---|
| `tmdb_5000_movies.csv` | 4,803 | Movie metadata (title, overview, genres, keywords, budget, revenue) |
| `tmdb_5000_credits.csv` | 4,803 | Cast and crew information |
Key Statistics:
- Movies: 4,806 (after preprocessing)
- Features: 5,000 (CountVectorizer)
- Genres: 20 unique genres
- Time Period: 1916-2017
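The genre count above can be reproduced directly from the raw CSV: the genres column stores JSON-like lists of dicts, the same structure the training code parses with ast.literal_eval. A quick standalone sketch:

import ast
import pandas as pd

movies = pd.read_csv('Dataset/tmdb_5000_movies.csv')

# Each row of 'genres' looks like: [{"id": 28, "name": "Action"}, ...]
unique_genres = {g['name'] for row in movies['genres'] for g in ast.literal_eval(row)}
print(len(unique_genres))   # expected: 20
print(sorted(unique_genres))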
Raw Data
   ↓
Merge movies + credits
   ↓
Extract features (genres, keywords, cast, crew, overview)
   ↓
Text preprocessing (lowercase, remove spaces)
   ↓
Stemming (PorterStemmer)
   ↓
Combine into "tags" column
   ↓
Vectorize (CountVectorizer, max_features=5000)
   ↓
Compute cosine similarity matrix (4806 × 4806)
   ↓
Save model (movie_list.pkl, similarity.pkl)
The recommendation model is trained using a content-based filtering approach. Here's how it works:
1. Data Collection & Preprocessing
import pandas as pd

# Load datasets
movies = pd.read_csv('Dataset/tmdb_5000_movies.csv')
credits = pd.read_csv('Dataset/tmdb_5000_credits.csv')
# Merge on title
movies = movies.merge(credits, on='title')
# Extract relevant features
movies = movies[['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']]
2. Feature Engineering
import ast

# Extract top 3 cast members
def convert_cast(text):
    return [actor['name'] for actor in ast.literal_eval(text)[:3]]

# Extract director from crew
def fetch_director(text):
    for person in ast.literal_eval(text):
        if person['job'] == 'Director':
            return [person['name']]
    return []

# Apply transformations
movies['cast'] = movies['cast'].apply(convert_cast)
movies['crew'] = movies['crew'].apply(fetch_director)
movies['genres'] = movies['genres'].apply(lambda x: [genre['name'] for genre in ast.literal_eval(x)])
movies['keywords'] = movies['keywords'].apply(lambda x: [kw['name'] for kw in ast.literal_eval(x)])
3. Text Processing
# Split overview into a word list so it can be concatenated with the other list columns
movies['overview'] = movies['overview'].apply(lambda x: x.split())
# Combine all features into tags
movies['tags'] = movies['overview'] + movies['genres'] + movies['keywords'] + movies['cast'] + movies['crew']
# Convert to string and lowercase
movies['tags'] = movies['tags'].apply(lambda x: ' '.join(x).lower())
# Apply stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()
movies['tags'] = movies['tags'].apply(lambda x: ' '.join([ps.stem(word) for word in x.split()]))
4. Vectorization & Similarity Computation
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create count vectorizer
cv = CountVectorizer(max_features=5000, stop_words='english')
vectors = cv.fit_transform(movies['tags']).toarray()
# Compute cosine similarity matrix
similarity = cosine_similarity(vectors)
# Save models
pickle.dump(movies, open('model_files/movie_list.pkl', 'wb'))
pickle.dump(similarity, open('model_files/similarity.pkl', 'wb'))
| Parameter | Value | Description |
|---|---|---|
| `max_features` | 5000 | Maximum vocabulary size for CountVectorizer |
| `stop_words` | 'english' | Remove common English words |
| `similarity_metric` | Cosine Similarity | Measure of similarity between vectors |
| `top_n_recommendations` | 5 | Number of recommendations to return |
- Notebook: `Movie Recommender System.ipynb`
- Training Time: ~2 minutes (on standard CPU)
- Model Size: 184 MB (similarity matrix)
- Libraries: scikit-learn, NLTK, Pandas, NumPy
Movie-Recommender-System/
│
├── app.py                              # Main Streamlit application
├── Movie Recommender System.ipynb      # Data preprocessing & model training
├── requirements.txt                    # Python dependencies
├── .gitignore                          # Git ignore rules
├── LICENSE                             # MIT License
│
├── Dataset/                            # Raw movie data
│   ├── tmdb_5000_movies.csv
│   └── tmdb_5000_credits.csv
│
├── model_files/                        # Trained models
│   ├── movie_list.pkl                  # Movie data (4806 movies)
│   └── similarity.pkl                  # Cosine similarity matrix (4806×4806)
│
└── .streamlit/                         # Configuration (not in repo)
    └── secrets.toml                    # TMDB API key
| Metric | Value |
|---|---|
| Movies in Catalog | 4,806 |
| Feature Dimensions | 5,000 |
| Similarity Matrix Size | 4,806 × 4,806 |
| Average Recommendation Time | <2 seconds |
| Model Size | 184 MB (similarity.pkl) |
- API Response Time: ~1.2s (TMDB)
- Recommendation Generation: ~0.8s
- Memory Usage: ~500MB
- Concurrent Users: 100+
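The "Local Cache (Session)" box in the architecture diagram is what keeps repeated TMDB lookups off the hot path. A minimal sketch of one way to do this in Streamlit, using the built-in st.cache_data decorator; the actual caching strategy in app.py may differ, and fetch_movie_json is an illustrative name:

import requests
import streamlit as st

TMDB_API_KEY = st.secrets["tmdb"]["api_key"]

@st.cache_data(ttl=3600, show_spinner=False)
def fetch_movie_json(movie_id: int) -> dict:
    """Fetch raw TMDB details once per movie_id per hour; repeat calls hit the cache."""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}"
    resp = requests.get(url, params={"api_key": TMDB_API_KEY}, timeout=10)
    resp.raise_for_status()
    return resp.json()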
- Search Mode: Select a movie from the dropdown and click "Show Details & Recommendations"
- Surprise Mode: Click "Surprise Me!" for a random movie suggestion
- Trending: View weekly trending movies at the top
- History: Access recently viewed movies from the sidebar
import pickle
import pandas as pd
# Load models
movies = pickle.load(open('model_files/movie_list.pkl', 'rb'))
similarity = pickle.load(open('model_files/similarity.pkl', 'rb'))
# Get recommendations
def recommend(movie):
    index = movies[movies['title'] == movie].index[0]
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
    recommendations = []
    for i in distances[1:6]:
        recommendations.append(movies.iloc[i[0]].title)
    return recommendations
# Example
print(recommend('Avatar'))
# Output: ['Guardians of the Galaxy', 'Star Wars', 'Star Trek', ...]
Returns top 5 similar movies based on content similarity.
Parameters:
- `movie_title` (str): Title of the movie (must exist in dataset)
Returns:
- List of dictionaries containing recommended movies with poster URLs and trailers
Example:
recommendations = recommend('The Dark Knight')
# Returns: [
# {'title': 'The Dark Knight Rises', 'poster': '...', 'trailer': '...'},
# {'title': 'Batman Begins', 'poster': '...', 'trailer': '...'},
# ...
# ]
Fetches comprehensive movie information from TMDB API.
Parameters:
- `movie_id` (int): TMDB movie ID
Returns:
- Dictionary with rating, cast, crew, budget, revenue, genres, etc.
Example:
details = get_movie_details(19995) # Avatar
# Returns: {
# 'rating': 7.2,
# 'cast': [...],
# 'director': 'James Cameron',
# 'budget': '$237,000,000',
# ...
# }
Gets current trending movies from TMDB API.
Returns:
- List of top 5 trending movies with posters and IDs
Fetches movie poster URL from TMDB API.
Parameters:
- `movie_id` (int): TMDB movie ID
Returns:
- Full URL to movie poster (500px width)
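A hedged sketch of what this poster lookup typically looks like against TMDB; the /movie/{movie_id} endpoint and the w500 image base URL are standard TMDB, but the function name and signature here are illustrative rather than copied from app.py:

from typing import Optional
import requests

def fetch_poster(movie_id: int, api_key: str) -> Optional[str]:
    """Return a 500px-wide poster URL for a TMDB movie id, or None if no poster exists."""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}"
    data = requests.get(url, params={"api_key": api_key}, timeout=10).json()
    poster_path = data.get("poster_path")  # e.g. "/poster.jpg"
    return f"https://image.tmdb.org/t/p/w500{poster_path}" if poster_path else None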
Fetches YouTube trailer URL from TMDB API.
Parameters:
- `movie_id` (int): TMDB movie ID
Returns:
- YouTube URL to official trailer (if available)
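Likewise, the trailer lookup usually goes through TMDB's /movie/{movie_id}/videos endpoint and picks the first official YouTube trailer; a sketch under the same assumptions (illustrative name and signature):

from typing import Optional
import requests

def fetch_trailer(movie_id: int, api_key: str) -> Optional[str]:
    """Return a YouTube trailer URL for a TMDB movie id, or None if none is listed."""
    url = f"https://api.themoviedb.org/3/movie/{movie_id}/videos"
    videos = requests.get(url, params={"api_key": api_key}, timeout=10).json().get("results", [])
    for video in videos:
        if video.get("site") == "YouTube" and video.get("type") == "Trailer":
            return f"https://www.youtube.com/watch?v={video['key']}"
    return None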
Environment Variables:
TMDB_API_KEY = st.secrets["tmdb"]["api_key"]  # From .streamlit/secrets.toml
Session State:
st.session_state.history # Recently viewed movies (list of IDs)
st.session_state.mode # Current mode: 'search' or 'surprise'
st.session_state.selected_movie  # Currently selected movie title
How does the recommendation system work?
It uses content-based filtering with cosine similarity. Movies are represented as vectors based on genres, cast, crew, keywords, and plot. Similar movies have vectors close together in this multi-dimensional space.
How many movies are in the database?
4,806 movies from the TMDb 5000 dataset, spanning 1916-2017.
Can I add my own movies?
Not directly. You would need to retrain the model with new data. See the Jupyter notebook for the training process.
Why do I need a TMDB API key?
The API key is required to fetch real-time data like posters, trailers, cast information, and ratings from The Movie Database.
What algorithm is used for recommendations?
Content-based filtering using CountVectorizer for text features and cosine similarity for computing movie similarity scores.
How accurate are the recommendations?
Accuracy depends on user preference, but the system achieves good results by considering multiple features (genres, cast, crew, plot, keywords).
Can this handle collaborative filtering?
No, this is purely content-based. It doesn't use user ratings or behavior data.
What if a movie title has special characters?
Use the exact title as it appears in the dropdown menu. The system is case-sensitive.
How often is the trending section updated?
Trending movies are fetched in real-time from TMDB API every time you load the page.
Can I deploy this on my own server?
Yes! It works on any platform that supports Streamlit (Streamlit Cloud, Heroku, AWS, etc.).
What are the system requirements?
Python 3.8+, ~500MB RAM, and the libraries in `requirements.txt`.
How do I update the movie database?
Download a new dataset, retrain the model using the Jupyter notebook, and replace the `.pkl` files.
Is there a rate limit on TMDB API?
Yes, TMDB has rate limits. The app uses retry logic with exponential backoff to handle this gracefully.
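One common way to implement retries with exponential backoff is to mount urllib3's Retry onto a requests session; a minimal sketch, not necessarily the exact approach used in app.py:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,                                      # at most 3 retries per request
    backoff_factor=1,                             # exponentially growing wait between attempts
    status_forcelist=[429, 500, 502, 503, 504],   # 429 = TMDB "too many requests"
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Example: weekly trending movies (the endpoint behind the Trending section)
resp = session.get(
    "https://api.themoviedb.org/3/trending/movie/week",
    params={"api_key": "YOUR_API_KEY"},
    timeout=10,
)
resp.raise_for_status()
print([m["title"] for m in resp.json()["results"][:5]])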
Contributions are welcome! Here's how:
- Fork the repository
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit changes (`git commit -m 'Add YourFeature'`)
- Push to branch (`git push origin feature/YourFeature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Harshal Kumawat

