A script for scraping PDFs from a government website, merging them, and running OCR on the result.
A Python script using Jupyter Lab to bulk download, merge, and OCR the recently released Martin Luther King Jr. assassination records from the National Archives. This prepares the documents for powerful text analysis tools like Google's NotebookLM.
Manually downloading hundreds or thousands of individual PDF documents and then converting them into a searchable format can be incredibly time-consuming. This script automates that entire process, transforming scattered, potentially non-searchable PDFs into a single, cohesive, and fully OCR'd document ready for advanced research.
- Automated Download: Scrapes the National Archives website to automatically download the first 100 available PDF files related to the MLK assassination.
- PDF Merging: Combines all downloaded individual PDFs into a single, comprehensive document.
- Optical Character Recognition (OCR): Applies OCR to the merged PDF, creating a hidden text layer that makes the entire document searchable and its content extractable for text analysis.
- NotebookLM Ready: The final OCR'd PDF can be uploaded directly to Google's NotebookLM for summarization, Q&A, and exploration of complex historical data.
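The download step starts by finding PDF links on the page. Below is a minimal sketch of that idea, not the notebook's actual code. The names PdfLinkCollector and first_pdf_links and the sample HTML (including its file paths) are hypothetical, and the sketch assumes the PDFs are exposed as plain anchor tags whose href ends in .pdf.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of <a href="...pdf"> links, in page order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def first_pdf_links(html, base_url, limit=100):
    """Return up to `limit` unique PDF URLs, preserving first-seen order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    unique = list(dict.fromkeys(collector.links))  # dedupe, keep order
    return unique[:limit]


# Hypothetical sample page; the real page structure and paths may differ.
sample_html = """
<a href="/files/research/mlk/doc-001.pdf">Doc 1</a>
<a href="/files/research/mlk/doc-001.pdf">Doc 1 (duplicate)</a>
<a href="/research/mlk/about">About</a>
<a href="https://www.archives.gov/files/research/mlk/doc-002.PDF">Doc 2</a>
"""
links = first_pdf_links(sample_html, "https://www.archives.gov/research/mlk")
print(links)
```

In a real run, the HTML would come from fetching the National Archives page, and each returned URL would then be downloaded into the output folder.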
Follow these steps to set up and run the script on your local machine.
You need Python 3.8+ and a few system-level tools:
- Python Libraries: Install the necessary Python packages using pip. It's recommended to do this in a virtual environment.
  pip install -r requirements.txt
  (Note: If you are running in Jupyter Lab and want a live progress bar, you may also want to install ipywidgets, enable the extension, and then restart Jupyter Lab:
  pip install ipywidgets
  jupyter nbextension enable --py widgetsnbextension
  jupyter labextension install @jupyter-widgets/jupyterlab-manager
  )
- Tesseract OCR Engine (System-Level): ocrmypdf relies on Tesseract.
  - Windows: Download the installer from the Tesseract-OCR GitHub page (choose the 64-bit version).
  - macOS (Homebrew):
    brew install tesseract
  - Linux (Debian/Ubuntu):
    sudo apt install tesseract-ocr
- Ghostscript (System-Level): Crucial for PDF processing (especially on Windows).
  - Windows: Download the latest 64-bit AGPL release from Ghostscript's official website. During installation, ensure it is added to your system's PATH, or manually add its bin directory (e.g., C:\Program Files\gs\gs10.05.1\bin) to your system's PATH environment variable. Verify by opening a new command prompt and typing gswin64c -v.
  - macOS (Homebrew):
    brew install ghostscript
  - Linux (Debian/Ubuntu):
    sudo apt install ghostscript
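Before running the notebook, it can help to confirm that the system-level tools are actually visible on PATH. The following is a small sketch using only the standard library. The helper find_tools is hypothetical, and the executable names are assumptions: Ghostscript is typically gs on macOS/Linux and gswin64c on 64-bit Windows.

```python
import shutil
import sys


def find_tools(candidates):
    """Map each tool label to the path of the first executable found, or None."""
    found = {}
    for label, names in candidates.items():
        found[label] = next(
            (path for path in map(shutil.which, names) if path), None
        )
    return found


# Executable names are assumptions; adjust them for your platform if needed.
TOOLS = {
    "tesseract": ["tesseract"],
    "ghostscript": ["gs", "gswin64c"],
}

if sys.version_info < (3, 8):
    print("Python 3.8+ is required")

for label, path in find_tools(TOOLS).items():
    print(f"{label}: {path or 'NOT FOUND on PATH'}")
```

If a tool is reported as not found right after installation, opening a new terminal (so the updated PATH is picked up) is usually the first thing to try.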
- Clone the Repository:
  git clone https://github.com/darredo1/mlkdocrelease2025.git
  cd mlkdocrelease2025
- Open in Jupyter Lab:
  jupyter lab
- Run the Notebook: Open the merge_mlk_pdfs.ipynb notebook.
  - Option 1 (Run All): Go to Run -> Run All Cells.
  - Option 2 (Step-by-Step): Run each cell sequentially. The notebook is structured to guide you through folder setup, web scraping, downloading, merging, and OCR.
Upon successful execution, the script will:
- Create a folder named mlk_pdfs in the same directory as your notebook.
- Download the first 100 unique PDF files from the National Archives into mlk_pdfs.
- Create merged_mlk_records.pdf (a single PDF containing all downloaded documents) inside mlk_pdfs.
- Create searchable_mlk_records.pdf (the OCR'd, searchable version of the merged file) inside mlk_pdfs.
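The merge and OCR steps that produce these two files can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual implementation: it assumes pypdf is used for merging (the notebook may use a different library), it calls the ocrmypdf command-line tool rather than its Python API, and the helper names source_pdfs, merge_pdfs, ocr_command, and run_pipeline are hypothetical.

```python
from pathlib import Path
import shutil
import subprocess

OUTPUT_DIR = Path("mlk_pdfs")
MERGED = OUTPUT_DIR / "merged_mlk_records.pdf"
SEARCHABLE = OUTPUT_DIR / "searchable_mlk_records.pdf"


def source_pdfs(folder):
    """Downloaded PDFs in name order, excluding the two generated files."""
    generated = {MERGED.name, SEARCHABLE.name}
    return sorted(p for p in Path(folder).glob("*.pdf") if p.name not in generated)


def merge_pdfs(inputs, output):
    """Append every input PDF to one file (assumes pypdf is installed)."""
    from pypdf import PdfWriter  # imported lazily: third-party dependency

    writer = PdfWriter()
    for pdf in inputs:
        writer.append(str(pdf))
    with open(output, "wb") as handle:
        writer.write(handle)
    writer.close()


def ocr_command(merged, searchable):
    """Build an ocrmypdf CLI call; --skip-text skips pages that already have text."""
    return ["ocrmypdf", "--skip-text", str(merged), str(searchable)]


def run_pipeline(folder=OUTPUT_DIR):
    """Merge the downloaded PDFs, then OCR the merged file."""
    pdfs = source_pdfs(folder)
    if not pdfs:
        raise FileNotFoundError(f"no PDFs found in {folder}")
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    merge_pdfs(pdfs, MERGED)
    subprocess.run(ocr_command(MERGED, SEARCHABLE), check=True)


print(ocr_command(MERGED, SEARCHABLE))
```

Calling run_pipeline() after the download step would write both output files into mlk_pdfs. Excluding the generated files in source_pdfs keeps a second run from merging the previous merged output back into itself.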
- Go to NotebookLM and create a new Notebook.
- Click "Add Source" or "Upload".
- Upload the mlk_pdfs/searchable_mlk_records.pdf file.
You can now ask questions, summarize, and explore the content of thousands of pages of MLK assassination records effortlessly!
If you find this script useful for your own work or if it inspires further projects, please consider citing this repository:
- APA Style (adapted for software): Darredo1. (2025). MLK Assassination Records Processing & Analysis Script (Version 1.0) [Computer software]. GitHub. https://github.com/darredo1/mlkdocrelease2025
- MLA Style (adapted for software): Darredo1. MLK Assassination Records Processing & Analysis Script. Version 1.0, 2025, GitHub, https://github.com/darredo1/mlkdocrelease2025.
Note on NotebookLM: As NotebookLM environments are currently private, direct public linking is not possible. You can, however, describe its use in your methodology: "This research utilized Google's NotebookLM (alpha/beta version) for interactive analysis and summarization of the processed documents."
The original documents processed by this script are from the National Archives and Records Administration (NARA). Please cite the original source as:
- APA Style: National Archives and Records Administration. (n.d.). Records Related to the Assassination of the Reverend Dr. Martin Luther King, Jr. Retrieved from https://www.archives.gov/research/mlk
- MLA Style: National Archives and Records Administration. "Records Related to the Assassination of the Reverend Dr. Martin Luther King, Jr." National Archives, https://www.archives.gov/research/mlk. Accessed July 21, 2025.
This project is licensed under the MIT License - see the LICENSE file for details.