Extracting text from PDFs using pdfminer.six and PyPDF2
pip install -r requirements.txt
python pdf_extract.py
By default, the script parses every PDF in the 'samples' folder and saves the resulting .txt files to 'output'. To parse a different folder, pass its path with --path_to_folder; to change the output folder, pass --out_path.
For example:
python pdf_extract.py --path_to_folder /Users/user/my_pdfs --out_path /Users/documents/parsed_pdfs
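The folder-in, folder-out behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in pdf_extract.py: the helper name `plan_outputs` is hypothetical, and it assumes each output file takes the PDF's base name with a .txt extension, which the script itself may or may not do.

```python
from pathlib import Path


def plan_outputs(path_to_folder="samples", out_path="output"):
    """Pair each PDF in path_to_folder with a .txt path inside out_path.

    Hypothetical helper for illustration; the real script may name
    its output files differently.
    """
    src = Path(path_to_folder)
    dst = Path(out_path)
    # Compare the extension case-insensitively so "B.PDF" is included.
    pdfs = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Keep the PDF's base name and swap the extension for .txt.
    return [(pdf, dst / (pdf.stem + ".txt")) for pdf in pdfs]
```

Calling `plan_outputs()` with no arguments mirrors the defaults named above ('samples' and 'output'); non-PDF files in the folder are skipped.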
usage: pdf_extract.py [-h] [--path_to_folder PATH_TO_FOLDER]
                      [--out_path OUT_PATH] [-nf] [--size SIZE]

CLI for PDFextract - extracts plaintext from PDF files

optional arguments:
  -h, --help            show this help message and exit
  --path_to_folder PATH_TO_FOLDER
                        Path to folder containing pdfs
  --out_path OUT_PATH   Output location for final .txt file
  -nf, --no_filter      turn off cleaning & filtering resulting txt files
  --size SIZE           Do not process files larger than this size per page in
                        bytes (mostly images) - default 300000
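For readers who want to see how these options fit together, an argparse parser along the following lines would accept the same flags. It is a sketch reconstructed from the help text, not the script's actual source: the function names `build_parser` and `too_large` are hypothetical, `--size` is assumed to be an integer, and the per-page check is assumed to be file size divided by page count.

```python
import argparse


def build_parser():
    """Build a parser matching the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_extract.py",
        description="CLI for PDFextract - extracts plaintext from PDF files",
    )
    parser.add_argument("--path_to_folder", default="samples",
                        help="Path to folder containing pdfs")
    parser.add_argument("--out_path", default="output",
                        help="Output location for final .txt file")
    parser.add_argument("-nf", "--no_filter", action="store_true",
                        help="turn off cleaning & filtering resulting txt files")
    # Assumption: the limit is parsed as an integer number of bytes.
    parser.add_argument("--size", type=int, default=300000,
                        help="Do not process files larger than this size "
                             "per page in bytes (mostly images) - default 300000")
    return parser


def too_large(file_size_bytes, num_pages, size_limit=300000):
    """Return True when the average bytes per page exceeds size_limit.

    Assumption: "size per page" means total file size / page count.
    """
    # Treat a zero or negative page count as one page to avoid dividing by zero.
    pages = max(num_pages, 1)
    return file_size_bytes / pages > size_limit
```

With this sketch, `build_parser().parse_args(["-nf", "--size", "500000"])` turns filtering off and raises the per-page limit to 500000 bytes, while a 1,000,000-byte, two-page PDF would be skipped under the default limit.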