About how to download raw PDF files using 2012_manifest.tsv #111

kevindryz · 2024-12-02T09:02:30Z

I couldn't follow the download process, and my company does not allow direct downloads from cloud storage. However, based on the dc_slug column in the TSV file, I have a general idea of how to find the original PDF URL.

For example, given a dc_slug like 456300-sept-17-23-2012-11953-13474707086771-_-pdf, I can split it at the first hyphen into the leading numeric ID and the rest of the slug, and then construct the URL as follows:

doc_id, slug = dc_slug.split('-', 1)  # '456300', 'sept-17-23-2012-11953-13474707086771-_-pdf'
url = f'https://s3.amazonaws.com/s3.documentcloud.org/documents/{doc_id}/{slug}.pdf'

This way, I can directly access the PDF!
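
For anyone who wants to do this for the whole manifest, here is a minimal sketch of the same idea. It assumes the TSV has a header row with a column named dc_slug, and that the URL pattern above holds for every row, which I have only checked on a few examples. The function names `pdf_url_from_slug` and `iter_pdf_urls` are my own and not part of the repository.

```python
import csv

# URL prefix taken from the example above; assumed to apply to all rows.
BASE_URL = "https://s3.amazonaws.com/s3.documentcloud.org/documents"


def pdf_url_from_slug(dc_slug):
    """Build the PDF URL from a dc_slug by splitting at the first hyphen."""
    doc_id, sep, slug = dc_slug.partition("-")
    if not sep or not doc_id or not slug:
        raise ValueError(f"unexpected dc_slug format: {dc_slug!r}")
    return f"{BASE_URL}/{doc_id}/{slug}.pdf"


def iter_pdf_urls(lines, column="dc_slug"):
    """Yield one PDF URL per manifest row that has a non-empty dc_slug.

    `lines` is any iterable of TSV lines, for example an open file.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        dc_slug = (row.get(column) or "").strip()
        if dc_slug:
            yield pdf_url_from_slug(dc_slug)
```

Usage would look something like `with open("2012_manifest.tsv", newline="", encoding="utf-8") as f: urls = list(iter_pdf_urls(f))`, after which each URL can be fetched with whatever HTTP client your network policy allows.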
