About how to download raw PDF files using 2012_manifest.tsv #111

Open
kevindryz opened this issue Dec 2, 2024 · 0 comments

I don't fully understand the intended download process, and my company does not allow direct downloads from cloud storage. However, based on the dc_slug column in the TSV file, I think I have worked out how to derive the original PDF URL.

For example, for a dc_slug like 456300-sept-17-23-2012-11953-13474707086771-_-pdf, I can use the split function to split at the first hyphen and then construct the URL as follows:

doc_id, name = dc_slug.split('-', 1)  # '456300', 'sept-17-23-2012-11953-13474707086771-_-pdf'
url = f'https://s3.amazonaws.com/s3.documentcloud.org/documents/{doc_id}/{name}.pdf'

This way, I can directly access the PDF!
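
For anyone who wants to apply this to the whole manifest, here is a minimal sketch of the idea. It assumes the manifest is tab-separated with a header row containing a dc_slug column, and that the URL pattern above holds for every row, which I have only inferred from the example and not confirmed with the maintainers. The names build_pdf_url and urls_from_manifest are my own, not part of this project.

```python
import csv

# URL prefix taken from the example above; whether it holds for every
# document is an assumption, not something the project documents.
BASE_URL = "https://s3.amazonaws.com/s3.documentcloud.org/documents"


def build_pdf_url(dc_slug: str) -> str:
    """Split a dc_slug at its first hyphen and build the guessed PDF URL."""
    doc_id, _, name = dc_slug.partition("-")
    if not doc_id.isdigit() or not name:
        raise ValueError(f"unexpected dc_slug format: {dc_slug!r}")
    return f"{BASE_URL}/{doc_id}/{name}.pdf"


def urls_from_manifest(tsv_lines):
    """Yield one URL per manifest row.

    tsv_lines is any iterable of text lines (for example an open file).
    Assumes a header row with a 'dc_slug' column.
    """
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    for row in reader:
        yield build_pdf_url(row["dc_slug"])


example = "456300-sept-17-23-2012-11953-13474707086771-_-pdf"
print(build_pdf_url(example))
```

To run it over the real file, open 2012_manifest.tsv with open(path, newline="", encoding="utf-8") and pass the file object to urls_from_manifest. The ValueError is there so that any row that does not match the "numeric id, hyphen, name" shape is reported instead of silently producing a bad URL.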
