This is the companion repo for the talk "Real-World Data Prep for LLMs", available at: https://www.youtube.com/live/YfW5vVwgbyo
We look at extracting data from real-world PDFs and cover the following scenarios:
- Native text / clean PDFs
- Scanned PDFs
- Handwritten text and hand-filled forms
- Tables in PDFs
- Smartphone-captured images
We will compare the performance of different OCR tools and techniques for each of these scenarios.
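
One common way to put a number on such a comparison is character error rate (CER): the edit distance between a tool's output and a hand-checked reference transcript, divided by the reference length. The sketch below is only an illustration of that idea, not necessarily the metric used in the talk or in this repo. The function names, the sample strings, and the `tool_a` / `tool_b` labels are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length. Lower is better; 0.0 is a perfect match."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR outputs, for illustration only.
    reference = "Invoice total: 120.50"
    outputs = {
        "tool_a": "Invoice total: 120.50",
        "tool_b": "Invoice tota1: 12O.5O",
    }
    for name, text in outputs.items():
        print(f"{name}: CER = {character_error_rate(reference, text):.3f}")
```

CER needs a trusted reference transcript for each document, so it suits a small hand-labelled sample per scenario. It also ignores layout, so for tables a structure-aware comparison may be more informative.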