Real-World Data Prep for LLMs

This is a companion repo for the Real-World Data Prep for LLMs talk available at: https://www.youtube.com/live/YfW5vVwgbyo

Agenda

We look at how to extract data from real-world PDFs, covering the following cases:

  • Native text / clean PDFs
  • Scanned PDFs
  • Handwritten text and hand-filled forms
  • Tables in PDFs
  • Smartphone-captured images

We will compare the performance of different OCR tools and techniques for each of these scenarios.
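
As a rough illustration of what such a comparison can look like, the sketch below scores OCR output against a hand-checked reference using character error rate (CER), a common OCR metric. This is an assumption for illustration only: the talk may use a different metric, and the tool names and sample strings here are hypothetical placeholders, not results from the talk.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a character from ref
                curr[j - 1] + 1,         # insert a character into ref
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (lower is better)."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR outputs, for illustration only.
    reference = "Invoice total: 1,250.00"
    outputs = {
        "tool_a": "Invoice total: 1,250.00",
        "tool_b": "lnvoice tota1: 1,25O.OO",
    }
    ranked = sorted(outputs, key=lambda n: character_error_rate(reference, outputs[n]))
    for name in ranked:
        print(f"{name}: CER = {character_error_rate(reference, outputs[name]):.3f}")
```

Running the same scoring over one reference transcript per scenario (clean, scanned, handwritten, tables, smartphone photos) gives a per-scenario ranking of tools rather than a single overall number.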