PDFextract

Extracts plain text from PDFs using pdfminer.six and PyPDF2.

Setup

pip install -r requirements.txt

Usage

python pdf_extract.py

By default, this parses all PDFs in 'samples' and saves the output .txt files to 'output'. To parse a different folder of PDFs, pass --path_to_folder; to change the output folder, pass --out_path.

For example:

python pdf_extract.py --path_to_folder /Users/user/my_pdfs --out_path /Users/documents/parsed_pdfs
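
The folder-to-folder behaviour described above can be pictured with a small sketch. This is not the code in pdf_extract.py; the function name `plan_outputs` and the rule of naming each output after the PDF's stem are assumptions made for illustration only.

```python
from pathlib import Path


def plan_outputs(path_to_folder="samples", out_path="output"):
    """Pair every PDF in path_to_folder with a .txt path under out_path.

    Hypothetical helper: the real script may name or order outputs differently.
    """
    folder = Path(path_to_folder)
    out_dir = Path(out_path)
    # Keep only regular files whose extension is .pdf (case-insensitive).
    pdfs = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Assumed naming rule: report.pdf -> <out_path>/report.txt
    return [(pdf, out_dir / (pdf.stem + ".txt")) for pdf in pdfs]
```

Calling `plan_outputs()` with no arguments mirrors the documented defaults ('samples' and 'output'); passing two paths mirrors --path_to_folder and --out_path.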

Full usage details:

usage: pdf_extract.py [-h] [--path_to_folder PATH_TO_FOLDER]
                      [--out_path OUT_PATH] [-nf] [--size SIZE]

CLI for PDFextract - extracts plaintext from PDF files

optional arguments:
  -h, --help            show this help message and exit
  --path_to_folder PATH_TO_FOLDER
                        Path to folder containing pdfs
  --out_path OUT_PATH   Output location for final .txt file
  -nf, --no_filter      turn off cleaning & filtering resulting txt files
  --size SIZE           Do not process files larger than this size per page in
                        bytes (mostly images) - default 300000
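
For readers who want to see how these options fit together, here is a minimal argparse sketch reconstructed from the help text above. It is an approximation, not the script's actual source: the `int` type for --size, the helper name `too_large`, and the reading of "size per page" as file size divided by page count are all assumptions.

```python
import argparse


def build_parser():
    """Rebuild the documented CLI; defaults are taken from this README."""
    parser = argparse.ArgumentParser(
        prog="pdf_extract.py",
        description="CLI for PDFextract - extracts plaintext from PDF files",
    )
    parser.add_argument("--path_to_folder", default="samples",
                        help="Path to folder containing pdfs")
    parser.add_argument("--out_path", default="output",
                        help="Output location for final .txt file")
    parser.add_argument("-nf", "--no_filter", action="store_true",
                        help="turn off cleaning & filtering resulting txt files")
    # Assumption: the size limit is parsed as an integer number of bytes.
    parser.add_argument("--size", type=int, default=300000,
                        help="Do not process files larger than this size "
                             "per page in bytes (mostly images)")
    return parser


def too_large(file_size_bytes, num_pages, size_limit=300000):
    """Hypothetical check: skip a file whose average bytes per page exceed the limit."""
    pages = max(num_pages, 1)  # guard against a zero page count
    return file_size_bytes / pages > size_limit
```

Under this reading, a 3 MB file with 5 pages averages 600000 bytes per page and would be skipped at the default limit, while the same file with 10 pages would be processed.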
