Prefer tesserocr over easyocr, if available (backport #369) #391

Merged
nathan-weinberg merged 2 commits into release-v0.5 from mergify/bp/release-v0.5/pr-369
Nov 15, 2024

mergify[bot]
Contributor

@mergify mergify bot commented Nov 15, 2024

When setting up our ingestion pipeline, explicitly check if tesserocr is available and Docling can load it. If so, prefer that. Otherwise, attempt the same for EasyOCR. If neither can load, log an error and disable optical character recognition.

Fixes #352
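
The selection order described above can be illustrated with a minimal sketch. This is not the code from the PR: the helper names `_can_import` and `choose_ocr_backend` are hypothetical, and the check here only tests whether the Python module imports, whereas the PR also verifies that Docling can load the engine.

```python
import importlib
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger(__name__)


def _can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly (hypothetical helper)."""
    try:
        importlib.import_module(module_name)
    except Exception:  # a broken native dependency can raise more than ImportError
        return False
    return True


def choose_ocr_backend(
    candidates: Sequence[str] = ("tesserocr", "easyocr"),
    can_load: Callable[[str], bool] = _can_import,
) -> Optional[str]:
    """Return the first usable OCR backend in preference order, or None."""
    for name in candidates:
        if can_load(name):
            return name
    # Neither backend is usable: report it and let the caller disable OCR.
    logger.error("No OCR backend could be loaded (%s); disabling OCR.", ", ".join(candidates))
    return None
```

A caller would treat a `None` result as "run the pipeline with OCR turned off". Passing `can_load` as a parameter keeps the preference logic testable without installing either OCR package.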


This is an automatic backport of pull request #369 done by Mergify.

When setting up our ingestion pipeline, explicitly check if tesserocr
is available and Docling can load it. If so, prefer that. Otherwise,
attempt the same for EasyOCR. If neither can load, log an error and
disable optical character recognition.

Fixes #352

Signed-off-by: Ben Browning <[email protected]>
(cherry picked from commit ba00454)

This borrows and adapts the `leanimports.py` script and test from the
InstructLab CLI repository to ensure within SDG we're not prematurely
loading the entirety of Torch into memory.

The CLI repo noticed we were doing this, and since this PR would
actually have exacerbated this by attempting to load the tesseract and
easyocr modules even earlier, this felt like the right time to address
this. The overall imports are all the same, but now we only import
specific docling pieces as needed when we're actually going to run
chunking vs triggering the whole PyTorch import chain as soon as
someone imports SDG.

Signed-off-by: Ben Browning <[email protected]>
(cherry picked from commit 791fc7f)
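
The kind of check the second commit describes can be sketched roughly as follows. This is an illustration of the idea only, not the borrowed `leanimports.py` script: the function name `find_unwanted_imports` is hypothetical, and the approach (import the package in a fresh interpreter, then inspect `sys.modules`) is an assumption about how such a check could work.

```python
import subprocess
import sys
from typing import Iterable, List


def find_unwanted_imports(package: str, unwanted: Iterable[str]) -> List[str]:
    """Import `package` in a fresh interpreter and report which of the
    `unwanted` modules were loaded as a side effect (hypothetical helper)."""
    probe = (
        "import importlib, sys\n"
        f"importlib.import_module({package!r})\n"
        "print('\\n'.join(sorted(sys.modules)))\n"
    )
    # A subprocess gives a clean sys.modules, unaffected by the caller's imports.
    result = subprocess.run(
        [sys.executable, "-c", probe],
        capture_output=True,
        text=True,
        check=True,
    )
    loaded = set(result.stdout.splitlines())
    return sorted(name for name in unwanted if name in loaded)
```

A test could then assert that importing the SDG package (for example `find_unwanted_imports("instructlab.sdg", ["torch"])`, where the import name is an assumption) returns an empty list. The fix on the library side is the one the commit message states: import the specific docling pieces inside the functions that run chunking, not at module top level.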
@mergify mergify bot added the testing, release-branch, dependencies, and one-approval labels Nov 15, 2024
@mergify mergify bot removed the one-approval label Nov 15, 2024
@nathan-weinberg nathan-weinberg merged commit b58f904 into release-v0.5 Nov 15, 2024
18 checks passed
@nathan-weinberg nathan-weinberg deleted the mergify/bp/release-v0.5/pr-369 branch November 15, 2024 16:23