Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prefer tesserocr over easyocr, if available #369

Merged
merged 2 commits into from
Nov 14, 2024

Conversation

bbrowning
Copy link
Contributor

When setting up our ingestion pipeline, explicitly check if tesserocr is available and Docling can load it. If so, prefer that. Otherwise, attempt the same for EasyOCR. If neither can load, log an error and disable optical character recognition.

Fixes #352

@mergify mergify bot added testing Relates to testing dependencies Pull requests that update a dependency file ci-failure labels Nov 12, 2024
When setting up our ingestion pipeline, explicitly check if tesserocr
is available and Docling can load it. If so, prefer that. Otherwise,
attempt the same for EasyOCR. If neither can load, log an error and
disable optical character recognition.

Fixes instructlab#352

Signed-off-by: Ben Browning <[email protected]>
@mergify mergify bot removed the ci-failure label Nov 12, 2024
@bbrowning bbrowning requested a review from a team November 12, 2024 19:23
@mergify mergify bot added the one-approval label Nov 13, 2024
@bbrowning bbrowning added the hold In-progress PR. Tag should be removed before merge. label Nov 13, 2024
@bbrowning
Copy link
Contributor Author

Thanks for the approval! After following some discussion elsewhere about being careful when we import anything that imports all of torch, I'm going to add an additional test to this and defer some of the docling/easyocr imports to not import transformers or torch until they're actually needed. Just a small change, but realized that should go in as part of this because otherwise we're loading all of torch fairly early in our import chain.

This borrows and adapts the `leanimports.py` script and test from the
InstructLab CLI repository to ensure within SDG we're not prematurely
loading the entirety of Torch into memory.

The CLI repo noticed we were doing this, and since this PR would
actually have exacerbated this by attempting to load the tesseract and
easyocr modules even earlier, this felt like the right time to address
this. The overall imports are all the same, but now we only import
specific docling pieces as needed when we're actually going to run
chunking vs triggering the whole PyTorch import chain as soon as
someone imports SDG.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning bbrowning removed the hold In-progress PR. Tag should be removed before merge. label Nov 13, 2024
@bbrowning
Copy link
Contributor Author

Ok, removing the hold now that we're not importing all of Pytorch as soon as someone imports SDG. Instead, we defer that until Docling actually needs torch loaded by moving some of our imports of docling bits further down into the code. And, the added test ensures we don't accidentally regress on that as we do future docling work here.

@khaledsulayman
Copy link
Member

Thanks for taking care of this, Ben! 😁

@mergify mergify bot merged commit 399d5aa into instructlab:main Nov 14, 2024
22 checks passed
@mergify mergify bot removed the one-approval label Nov 14, 2024
@nathan-weinberg
Copy link
Member

@Mergifyio backport release-v0.5

Copy link
Contributor

mergify bot commented Nov 15, 2024

backport release-v0.5

✅ Backports have been created

nathan-weinberg added a commit that referenced this pull request Nov 15, 2024
Prefer tesserocr over easyocr, if available (backport #369)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prefer tesserocr vs easyocr for Docling integration, when available
4 participants