Conversation


floschne commented on Jun 7, 2025

This PR fixes the import of large PDFs via fixed page-based chunking and improves PDF import in general by using Docling.

STRATEGY:

  1. Check whether the PDF needs to be chunked, i.e., whether it has more than N pages (5 by default).

YES:

  1. Chunk the PDF into fixed page-based chunks (see the sketch below).
  2. Stop preprocessing for this cargo (not the whole PPJ!).
  3. Create a new PPJ from the chunks.
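
A minimal sketch of the fixed page-based chunking, assuming pypdf as the splitting library (the helper name and chunk-file naming scheme are illustrative, not necessarily what the PR implements):

```python
from pathlib import Path

from pypdf import PdfReader, PdfWriter

MAX_PAGES = 5  # N: chunk any PDF with more than this many pages


def chunk_pdf(pdf_path: Path, out_dir: Path) -> list[Path]:
    """Split a PDF into consecutive chunks of at most MAX_PAGES pages each."""
    reader = PdfReader(pdf_path)
    chunk_paths: list[Path] = []
    for start in range(0, len(reader.pages), MAX_PAGES):
        writer = PdfWriter()
        for page in reader.pages[start : start + MAX_PAGES]:
            writer.add_page(page)
        # zero-pad the chunk index so lexicographic order matches page order
        chunk_path = out_dir / f"{pdf_path.stem}.chunk{start // MAX_PAGES:03d}.pdf"
        with chunk_path.open("wb") as f:
            writer.write(f)
        chunk_paths.append(chunk_path)
    return chunk_paths
```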

NO:

  1. Continue extracting the PDF's content as HTML, including images, via Docling through the RayModelService (see the sketch below).
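
For reference, a minimal sketch of the Docling conversion step using the public Docling API; the exact pipeline options used behind the RayModelService are my assumption:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

# enable picture extraction so the exported HTML can include the images
pipeline_options = PdfPipelineOptions(generate_picture_images=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("example.chunk000.pdf")  # hypothetical chunk file
# embed the extracted images directly into the HTML output
html = result.document.export_to_html(image_mode=ImageRefMode.EMBEDDED)
```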

Open Questions:

  • How can we properly link the chunks with respect to page order and SDoc links, so that users can navigate between them in the UI?
  • Could we have some sort of parent SDoc (not an Adoc!) that links the chunk SDocs?

TODOs:

  • Write an end-to-end test! I already uploaded a good example file to ltdata. Since multiple PPJs are spawned from the chunks and their images, one naive strategy would be: upload the file, set a maximum time limit (about 10 minutes on GPU), poll the PPJs, and assert that all PPJs finish within the limit.
  • Implement improved chunking strategies (semantic, structured, etc.).
  • Improve throughput via batched Docling conversion (see the sketch below; https://docling-project.github.io/docling/examples/batch_convert/).
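
For the last TODO, the linked Docling example essentially amounts to passing all chunk paths to DocumentConverter.convert_all instead of converting one file at a time; roughly:

```python
from pathlib import Path

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
chunk_paths = sorted(Path("chunks").glob("*.pdf"))  # hypothetical chunk directory

# convert_all reuses the initialized models across documents instead of paying
# the pipeline setup cost per chunk; raises_on_error=False lets the remaining
# chunks proceed even if a single conversion fails
for result in converter.convert_all(chunk_paths, raises_on_error=False):
    print(result.input.file.name, result.status)
```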

bigabig merged commit 5bf4a31 into main on Jun 9, 2025
4 checks passed
bigabig deleted the pdf-import branch on July 21, 2025, 14:16