For our scripts to work, we need to download the Netflix prize data. As this file is ~500 MB and the licensing terms are dubious, we decided early on not to include it as part of the repo. It is however available on the Internet Archive here: https://archive.org/details/nf_prize_dataset.tar
We should write a script that downloads this tarfile, decompresses it, and puts the contents in /data/.
Specifically, everything that is in the download directory that the tarfile produces should be in the /data/ directory. So if there is download/movie_titles.txt it should be /data/movie_titles.txt.
In general, this would be a good candidate for a bash script that does wget -> tar -xzf -> rename directory. But it would be better to do it in Python for portability reasons (some team members use Windows).
We can definitely download the file with requests, but it might not be the perfect fit. We should research whether there is a wget-style library for Python that we could use.
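As a starting point before that research, plain requests can stream a large download straight to disk without holding ~500 MB in memory. A minimal sketch (the URL, filename, and chunk size below are illustrative, not final choices):

```python
import requests

# Illustrative URL; the real dataset lives at the Internet Archive
# link above, and the exact download URL still needs to be confirmed.
URL = "https://archive.org/download/nf_prize_dataset.tar/nf_prize_dataset.tar"


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Stream a large file to disk one chunk at a time."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
```

The stdlib's urllib.request.urlretrieve would also cover the basics if we want to avoid a third-party dependency.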
nit: Modern code usually prefers lowercase list[str] and str | None over the old-fashioned List[str] and Optional[str]. Linters like ruff can flag that if desired.
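Side by side, the two spellings look like this (the function name is hypothetical; the `__future__` import keeps the modern syntax working on older interpreters):

```python
from __future__ import annotations

# Modern spellings (PEP 585 / PEP 604):
def movie_titles(path: str | None = None) -> list[str]:
    ...

# Older equivalent requiring typing imports:
from typing import List, Optional

def movie_titles_old(path: Optional[str] = None) -> List[str]:
    ...
```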
Caching the giant Netflix tarball somewhere in the filesystem is certainly a possibility, but then you have to settle on a destination directory that works for everyone. Another possible design would be to rely on the requests-cache library, which I've had very good results with. It manages a local sqlite file that the app doesn't know or care about. Specify an expiration period of more than 30 days, and the effect is that your app does a GET exactly once: the bits are cached on disk, and subsequent GET attempts are satisfied from disk rather than from the network, for nice snappy performance.
We can use the tarfile library to extract the files: https://docs.python.org/3/library/tarfile.html

Finally, renaming can be done with os or shutil: https://docs.python.org/3/library/os.html#os.rename
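Putting those two pieces together, a stdlib-only sketch of the extract-then-flatten step (the data_dir default is illustrative; the real script would point it at /data/):

```python
import shutil
import tarfile
from pathlib import Path


def extract_and_flatten(archive: str, data_dir: str = "data") -> None:
    """Extract the tarball, then move everything out of the
    top-level 'download' directory it produces into data_dir,
    so download/movie_titles.txt ends up as data_dir/movie_titles.txt."""
    dest = Path(data_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)  # yields data_dir/download/...
    download_dir = dest / "download"
    for item in download_dir.iterdir():
        shutil.move(str(item), str(dest / item.name))
    download_dir.rmdir()  # now empty
```

shutil.move handles both files and subdirectories, and works across filesystems, which plain os.rename does not.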