Provide script that downloads Netflix prize data and puts it in the right place #43

audiodude · 2024-11-08T03:46:24Z

For our scripts to work, we need to download the Netflix prize data. As this file is ~500 MB and the licensing terms are dubious, we decided early on not to include it in the repo. It is, however, available on the Internet Archive here: https://archive.org/details/nf_prize_dataset.tar

We should write a script that downloads this tarfile, decompresses it, and puts the contents in /data/.

Specifically, everything that is in the download directory that the tarfile produces should be in the /data/ directory. So if there is download/movie_titles.txt it should be /data/movie_titles.txt.

In general, this would be a good candidate for a bash script that does wget -> tar -xzf -> rename directory. But it would be better to do it in Python for portability reasons (some team members use Windows).

We can definitely download the file with requests, but it might not be the best fit. We should check whether there is a wget-style library for Python that we could use instead.
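As one possible starting point, here is a minimal sketch of a streaming download that uses only the standard library (`urllib.request` plus `shutil.copyfileobj`) rather than requests or a wget-style package. The function name `download_file`, the `.part` suffix, and the chunk size are illustrative assumptions. The direct file URL is deliberately left as a parameter, since the link above points at the Archive's details page rather than at the tarfile itself.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Stream `url` to `dest` without holding the whole file in memory.

    `download_file` is a hypothetical helper name, not an existing API.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download never
    # leaves a truncated file at the final path.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        # Copy in fixed-size chunks; a ~500 MB file never sits fully in RAM.
        shutil.copyfileobj(response, out, length=chunk_size)
    # Path.replace overwrites an existing destination on Windows and POSIX.
    partial.replace(dest)
    return dest
```

The same chunked-copy idea carries over to requests (`stream=True` with `iter_content`) if the team prefers that dependency.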

We can use the tarfile library to extract the files: https://docs.python.org/3/library/tarfile.html
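A minimal sketch of the extraction step with `tarfile`, assuming the archive has already been downloaded to a local path. The helper name `extract_tarball` is hypothetical. The `"r:*"` mode lets `tarfile` detect the compression itself, which avoids hard-coding whether the download is gzipped.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive: Path, dest: Path) -> None:
    """Extract every member of `archive` into the directory `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" opens for reading with transparent compression detection
    # (gzip, bz2, xz, or an uncompressed tar).
    with tarfile.open(archive, "r:*") as tar:
        if hasattr(tarfile, "data_filter"):
            # Python versions with extraction filters can reject unsafe
            # members such as absolute paths or ".." components.
            tar.extractall(dest, filter="data")
        else:
            # Older Python versions: plain extraction, no filtering.
            tar.extractall(dest)
```

Checking for `tarfile.data_filter` keeps the sketch usable on Python versions that predate extraction filters.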

Finally, renaming can be done with os or shutil: https://docs.python.org/3/library/os.html#os.rename
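The "everything in download/ ends up in /data/" step could look roughly like the sketch below. It uses `shutil.move` instead of `os.rename`, on the assumption that the data directory may already exist or sit on a different filesystem. The helper name `move_contents` and the choice to fail on name collisions rather than overwrite are illustrative assumptions.

```python
import shutil
from pathlib import Path


def move_contents(src_dir: Path, data_dir: Path) -> None:
    """Move every entry in `src_dir` into `data_dir`, then remove `src_dir`."""
    data_dir.mkdir(parents=True, exist_ok=True)
    # Snapshot the listing first, since entries are moved during the loop.
    for entry in list(src_dir.iterdir()):
        target = data_dir / entry.name
        if target.exists():
            # Refuse to clobber existing data; the caller decides what to do.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(entry), str(target))
    # The source directory is empty at this point, so rmdir succeeds.
    src_dir.rmdir()
```

With this, `download/movie_titles.txt` becomes `movie_titles.txt` inside whichever data directory is passed in.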


jhanley634 · 2024-11-08T19:44:38Z

@JamesKohlsRepo, I see some nice code in that PR 45.

nit: Modern code usually prefers lowercase list[str] and str | None over the old-fashioned List[str] and Optional[str]. Linters like ruff can flag that if desired.
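To illustrate the nit, a small sketch of the newer annotation style. The function `find_title` is a made-up example, not code from the PR.

```python
# Lets the newer annotation syntax be used on Python versions before 3.10,
# because annotations are then not evaluated at function definition time.
from __future__ import annotations


def find_title(titles: list[str], prefix: str) -> str | None:
    """Return the first title starting with `prefix`, or None if none match.

    Older style would be: List[str], Optional[str], imported from typing.
    """
    for title in titles:
        if title.startswith(prefix):
            return title
    return None
```

No `typing` import is needed for either annotation in this style.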

Caching the giant Netflix tarball somewhere in the filesystem is certainly a possibility, but then you have to settle on a destination directory that works for everyone. Another possible design would be to rely on the requests-cache library, which I've had very good results with. It manages a local file, in sqlite format, which the app doesn't need to know or care about. Specify an expiration period of more than 30 days, and the effect is that your app does a GET exactly once, the bits are cached on disk, and subsequent GET attempts are satisfied from disk rather than from the network, for nice snappy performance.
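A rough sketch of what that could look like, assuming requests-cache is installed. The cache name `netflix_cache`, the 60-day expiry, and the helper names are hypothetical choices for illustration. Note that `response.content` holds the whole body in memory, which may matter for a file of this size.

```python
from datetime import timedelta

# Hypothetical value, chosen only because it exceeds 30 days.
EXPIRE_AFTER = timedelta(days=60)


def make_session(cache_name: str = "netflix_cache"):
    """Build a session whose GET responses are cached in a local sqlite file."""
    # Imported here so this module still loads where the optional
    # requests-cache dependency is not installed.
    import requests_cache

    return requests_cache.CachedSession(
        cache_name,
        backend="sqlite",
        expire_after=EXPIRE_AFTER,
    )


def fetch(url: str) -> bytes:
    """GET `url`; within the expiry window, repeat calls are served from disk."""
    session = make_session()
    response = session.get(url)
    response.raise_for_status()
    return response.content
```

On a cached response, requests-cache sets `response.from_cache` to `True`, which is a convenient way to confirm that a second run did not touch the network.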

audiodude assigned JamesKohlsRepo Nov 8, 2024

audiodude linked a pull request Dec 13, 2024 that will close this issue

Provide script that downloads netflix prize data and puts it in the right place #56

Open
