IMDB makes this dataset available with a number of licensing terms. The long and short of it is, you can only use the data for your own, non-commercial use.
Source and licensing: https://developer.imdb.com/non-commercial-datasets/
Data files: https://datasets.imdbws.com/
You'll need SQL Server 2022 and up to 20 GB of free disk space to do this.
If you want to run it on a lower version, you'll have to rebuild/replace the STRING_SPLIT
function calls in the last script.
- Create a new SQL Server database.
- I've called it IMDB, but you can name it whatever you want.
- I strongly recommend setting the database to simple recovery model.
- Create the database schema by running
Create IMDB-schema.sql
in the database. - Download and import the IMDB dataset using the
Load-ImdbStaging.ps1
script. The script will download the files to your current working directory and perform the ingest from there. The download is just a few GB, but inserting the data may take up to a few hours depending on your environment. - Populate the relational tables using the
Load IMDB relational tables.sql
script. - If you want, you can now drop the staging tables in the
Raw
schema.
- dbo.Attributes: 171 rows
- dbo.Episodes: 8M rows, 95 MB
- dbo.Genres: 28 rows
- dbo.Professions: 48 rows
- dbo.PrimaryProfessions: 14M rows, 160 MB
- dbo.Principals: 13M rows, 350 MB
- dbo.Titles: 10M rows, 255 MB
- dbo.TitleCharacters: 5.3M rows, 96 MB
- dbo.TitleGenres: 16M rows, 173 MB
- dbo.TitleNames: 37M rows, 1.1 GB
- dbo.TitleNameAttributes: 500k rows, 9 MB
- dbo.TitlePrincipals: 28M rows, 460 MB
- dbo.TitleTypes: 12 rows