Conversation


floschne commented on Jun 7, 2025

This PR fixes the import of large PDFs via fixed page-based chunking and improves PDF import in general by using Docling.

STRATEGY:

  1. Check whether the PDF needs to be chunked, i.e., whether it has more than N pages (5 by default).

YES:

  1. Chunk the PDF into fixed page-based chunks (see the sketch below).
  2. Stop preprocessing for this cargo (not the whole PPJ!).
  3. Create a new PPJ from the chunks.
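
A minimal sketch of the fixed page-based chunking, assuming pypdf as the splitting library (the helper name and chunk-file naming scheme are illustrative, not necessarily what the PR implements):

```python
from pathlib import Path

from pypdf import PdfReader, PdfWriter

MAX_PAGES = 5  # N: chunk any PDF with more than this many pages


def chunk_pdf(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Split a PDF into consecutive chunks of at most MAX_PAGES pages each."""
    reader = PdfReader(pdf_path)
    chunk_paths: list[Path] = []
    for start in range(0, len(reader.pages), MAX_PAGES):
        writer = PdfWriter()
        for page in reader.pages[start : start + MAX_PAGES]:
            writer.add_page(page)
        # zero-pad the chunk index so lexicographic order matches page order
        chunk_path = out_dir / f"{pdf_path.stem}.chunk{start // MAX_PAGES:03d}.pdf"
        with chunk_path.open("wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths
```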

NO:

  1. Continue extracting the PDF's content as HTML, including images, via Docling through the RayModelService (see the sketch below).
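
For reference, a minimal sketch of the Docling conversion step using the public Docling API; the exact pipeline options used behind the RayModelService are my assumption:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

# enable picture extraction so the exported HTML can include the images
pipeline_options = PdfPipelineOptions(generate_picture_images=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("example.chunk000.pdf")  # hypothetical chunk file
# embed the extracted images directly into the HTML output
html = result.document.export_to_html(image_mode=ImageRefMode.EMBEDDED)
```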

Open Questions:

  • How can we properly link the chunks with respect to page order and SDoc links, so that users can navigate between them in the UI?
  • Could we have some sort of parent SDoc (not an Adoc!) that links the chunk SDocs?

TODOs:

  • Write an end-to-end test! I already uploaded a good example file to ltdata. Since multiple PPJs are spawned from the chunks and their images, one naive strategy would be: upload the file, set a maximum time limit (about 10 minutes on GPU), poll the PPJs, and assert that all PPJs finish within the limit.
  • Implement improved chunking strategies (semantic, structured, etc.).
  • Improve throughput via batched Docling conversion (see the sketch below; https://docling-project.github.io/docling/examples/batch_convert/).
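
For the last TODO, the linked Docling example essentially amounts to passing all chunk paths to DocumentConverter.convert_all instead of converting one file at a time; roughly:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
chunk_paths = sorted(Path("chunks").glob("*.pdf"))  # hypothetical chunk directory

# convert_all reuses the initialized models across documents instead of paying
# the pipeline setup cost per chunk; raises_on_error=False lets the remaining
# chunks proceed even if a single conversion fails
for result in converter.convert_all(chunk_paths, raises_on_error=False):
    print(result.input.file.name, result.status)
```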

bigabig merged commit 5bf4a31 into main on Jun 9, 2025
4 checks passed
bigabig deleted the pdf-import branch on July 21, 2025, 14:16