Real-World Data Prep for LLMs

This is a companion repo for the Real-World Data Prep for LLMs talk available at: https://www.youtube.com/live/YfW5vVwgbyo

Agenda

We look at how to handle real-world data extraction from PDFs, covering the following:

We will compare the performance of different OCR tools and techniques for each of these scenarios.
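As a rough illustration of what such a comparison can look like, the sketch below scores each tool's extracted text against a hand-typed reference using only the Python standard library. It is not the method used in the talk: the function names (`normalize`, `similarity`, `rank_extractors`), the tool labels, and the sample strings are all hypothetical, and character-level similarity is only one of several reasonable metrics.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse whitespace runs and lowercase, so layout differences do not dominate the score."""
    return " ".join(text.split()).lower()


def similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between extractor output and the reference text."""
    return SequenceMatcher(None, normalize(extracted), normalize(reference)).ratio()


def rank_extractors(outputs: dict[str, str], reference: str) -> list[tuple[str, float]]:
    """Score every extractor's output against the reference and sort best-first."""
    scores = [(name, similarity(text, reference)) for name, text in outputs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up strings for illustration only; these are not results from any real tool.
    reference = "Invoice No: 12345\nTotal due: $980.00"
    outputs = {
        "tool_a": "Invoice No: 12345\nTotal due: $980.00",
        "tool_b": "lnvoice N0: 12345 Tota1 due: S980.OO",
    }
    for name, score in rank_extractors(outputs, reference):
        print(f"{name}: {score:.2f}")
```

In practice, `outputs` would hold the text each extractor produced for the same PDF, and `reference` would be a manually verified transcription of that document.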

Repository contents

clean
forms
handwritten
smartphone_cam
.gitignore
LICENSE
README.md
camelot_extract.py
llamaparse_extract.py
llmwhisperer_extract.py
pdf_plumber_extract.py
requirements.txt
run_llamaparse.sh
run_llmwhisperer.sh
run_oss_extractors.sh
run_unstructured.sh
sample.env
tabula_extract.py
unstructured_extract.py