Upgrade docling, expand chunking testing #349
be4ef14 to f76d1a1
This functional test has already discovered two bugs: one that is already fixed in this PR, and one involving easyocr on macOS in our CI environment, which is the cause of the current CI failure.
2729665 to df57c59
This adds pytest-based functional tests to our repo with a basic PDF chunking test. While writing that test I discovered a minor bug in DocumentChunker, resulting in an additional unit test and a minor change to handle the case where we're given no documents and were previously not instantiating any chunker. Signed-off-by: Ben Browning <[email protected]>
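The actual DocumentChunker change is not shown in this thread, so the following is only a minimal sketch of the kind of guard and unit test the commit message describes. `make_chunker`, `NoOpChunker`, and `TextChunker` are hypothetical stand-ins, not the repo's real classes or signatures.

```python
from typing import List, Optional


class NoOpChunker:
    """Hypothetical fallback used when there are no documents to chunk."""

    def chunk_documents(self) -> List[str]:
        return []


class TextChunker:
    """Hypothetical chunker that splits each document into fixed-size pieces."""

    def __init__(self, documents: List[str], chunk_size: int = 20):
        self.documents = documents
        self.chunk_size = chunk_size

    def chunk_documents(self) -> List[str]:
        chunks = []
        for doc in self.documents:
            for start in range(0, len(doc), self.chunk_size):
                chunks.append(doc[start:start + self.chunk_size])
        return chunks


def make_chunker(documents: Optional[List[str]]):
    """Always return a usable chunker, even when no documents are given."""
    if not documents:
        # Assumed shape of the bug: with no documents, nothing was
        # instantiated at all, so callers had no chunker to call.
        return NoOpChunker()
    return TextChunker(documents)


def test_empty_documents_returns_no_chunks():
    # pytest-style unit test covering the "no documents" case.
    assert make_chunker([]).chunk_documents() == []
    assert make_chunker(None).chunk_documents() == []
```

Returning a no-op object rather than `None` keeps the calling code free of special cases; the real fix may well take a different shape.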
The pdf chunking functional tests exposed a bug on Macs with MPS-enabled Torch distributions where easyocr was crashing. The newer docling version uses CPU-only (instead of MPS) when running on Macs to avoid this. Signed-off-by: Ben Browning <[email protected]>
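As a rough illustration of the CPU-instead-of-MPS behavior described above, a device-selection guard might look like the sketch below. This is not docling's or easyocr's code; `pick_ocr_device` and its parameters are hypothetical, and the availability flags are passed in rather than queried from Torch so the example stays self-contained.

```python
def pick_ocr_device(cuda_available: bool, mps_available: bool,
                    allow_mps: bool = False) -> str:
    """Choose a device name for OCR (hypothetical helper).

    MPS is skipped by default, mirroring the idea of falling back to CPU
    on Macs where the MPS backend caused crashes.
    """
    if cuda_available:
        return "cuda"
    if mps_available and allow_mps:
        return "mps"
    # Default: CPU, including on Macs that report MPS as available.
    return "cpu"


# A Mac with MPS available still gets the CPU by default.
device = pick_ocr_device(cuda_available=False, mps_available=True)
```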
We no longer need qna.yaml files on disk to chunk documents. Signed-off-by: Ben Browning <[email protected]>
bebe9a7 to e558258
Tagging you in for a review @nathan-weinberg since I'm editing the CI files, adding functional tests in addition to our unit tests. Those are wired into CI much like the CLI repo, where we add a …
Testing aspects LGTM!
Thanks @bbrowning. This is great! Love the new functional tests 🚢
Thanks for this!
Semi-related, but it seems like 2 outstanding testing todos around this are:
- More thorough unit testing running the chunkers
- (up for discussion) updates to the e2e to test PDFs
I have #346 as a placeholder issue for chunking testing; I can flesh it out more and/or make it an epic to track those efforts if we want to do those in follow-up PRs.
could you move this to testdata/sample_documents?
same as with the .md
@khaledsulayman we would def welcome some updates to the E2E CI to add PDF testing!
I have some other functional tests I want to iterate on as well, not just specific to chunking, so I would love to get this basic set of functional tests in and then open separate PRs for new tests. Some of the others aren't related to chunking at all, but cover things like testing that we don't cause llama_cpp to throw an assert error when doing batching, and similar issues we need to fix.