Provide script that downloads Netflix prize data and puts it in the right place #43

Open
audiodude opened this issue Nov 8, 2024 · 1 comment · May be fixed by #56
@audiodude
Collaborator

For our scripts to work, we need to download the Netflix prize data. As this file is ~500 MB and the licensing terms are dubious, we decided early on not to include it in the repo. It is, however, available on the Internet Archive: https://archive.org/details/nf_prize_dataset.tar

We should write a script that downloads this tarfile, decompresses it, and puts the contents in /data/.

Specifically, everything in the download directory that the tarfile produces should end up directly in the /data/ directory. So if there is a download/movie_titles.txt, it should become /data/movie_titles.txt.

In general, this would be a good candidate for a bash script that does wget -> tar -xzf -> rename directory. However, it would be better to do it in Python for portability reasons (some team members use Windows).

We can definitely download the file with requests, but it might not be the perfect fit. We should research whether there is a wget-style library for Python that we could use.
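One option that avoids a third-party dependency altogether is the standard library's urllib.request, streamed to disk so the ~500 MB file never sits in memory. This is only a sketch, not a decision: the function name `download_file` and the `.part` temporary-file convention are made up for illustration, and the direct download URL is left as a parameter because it still has to be looked up from the archive.org page.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Stream `url` to `dest`; do nothing if `dest` already exists.

    `download_file` is a hypothetical helper, not an existing project function.
    """
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    partial.replace(dest)
    return dest
```

Swapping in requests later would only change the body of this one function, since callers just pass a URL and a destination path.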

We can use the tarfile library to extract the files: https://docs.python.org/3/library/tarfile.html
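A rough sketch of the extraction step with tarfile, assuming the archive is a (possibly gzip-compressed) tar file. `extract_archive` is a hypothetical name. The `filter="data"` argument only exists on newer Python releases, so the sketch checks for it rather than assuming a version.

```python
import tarfile
from pathlib import Path


def extract_archive(archive: Path, target: Path) -> list[str]:
    """Extract `archive` into `target` and return the member names.

    Hypothetical helper for illustration only.
    """
    target.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(archive, "r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Safer extraction where supported: rejects absolute paths,
            # links that escape the target directory, and similar.
            tar.extractall(target, filter="data")
        else:
            tar.extractall(target)
    return names
```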

Finally, renaming can be done with os or shutil: https://docs.python.org/3/library/os.html#os.rename
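For the final step, here is a small sketch that moves everything out of the extracted download directory into the data directory. It uses shutil.move rather than os.rename because os.rename can fail when source and destination are on different filesystems. `flatten_into` is a hypothetical name, and the sketch assumes the data directory does not already contain entries with the same names.

```python
import shutil
from pathlib import Path


def flatten_into(source_dir: Path, data_dir: Path) -> None:
    """Move every entry of `source_dir` into `data_dir`, then remove `source_dir`.

    Hypothetical helper; assumes no name clashes inside `data_dir`.
    """
    data_dir.mkdir(parents=True, exist_ok=True)
    # Materialise the listing first so we are not iterating a directory
    # while its contents are being moved away.
    for entry in list(source_dir.iterdir()):
        shutil.move(str(entry), str(data_dir / entry.name))
    source_dir.rmdir()  # only succeeds if the directory is now empty
```

With this, download/movie_titles.txt ends up as movie_titles.txt directly inside the data directory, as described above.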

@jhanley634
Collaborator

@JamesKohlsRepo, I see some nice code in that PR 45.

nit: Modern code usually prefers lowercase list[str] and str | None over the old-fashioned List[str] and Optional[str]. Linters like ruff can flag that if desired.

Caching the giant Netflix tarball somewhere in the filesystem is certainly a possibility, and then you have to settle on a destination directory that works for everyone. Another possible design would be to rely on the requests-cache library, which I've had very good results with. It manages a local file, in SQLite format, which the app doesn't need to know or care about. Specify an expiration period of more than 30 days, and the effect is that your app does a GET exactly once, the bits are cached on disk, and subsequent GET attempts are satisfied from disk rather than from the network, for nice snappy performance.
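Roughly what that could look like, as a sketch only. The cache name, the one-year expiry, and the `fetch_cached` function are all assumptions picked for illustration; requests-cache is a third-party package, so it is imported inside the function and only needed when the function is actually called.

```python
from datetime import timedelta

# Hypothetical values: the SQLite backend keeps its cache in a local
# file derived from this name; any expiry above 30 days fits the idea.
CACHE_NAME = "netflix_prize_cache"
CACHE_EXPIRY = timedelta(days=365)


def fetch_cached(url: str) -> bytes:
    """GET `url`, served from the on-disk cache after the first call."""
    import requests_cache  # third-party: pip install requests-cache

    session = requests_cache.CachedSession(
        CACHE_NAME, backend="sqlite", expire_after=CACHE_EXPIRY
    )
    response = session.get(url)
    response.raise_for_status()
    # response.from_cache is True when the body came from disk.
    return response.content
```

One trade-off to keep in mind: `response.content` holds the whole body in memory, which matters for a file of this size.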
