Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade docling, expand chunking testing #349

Merged
merged 3 commits into from
Nov 8, 2024

Conversation

bbrowning
Copy link
Contributor

This adds pytest-based functional tests to our repo with a basic PDF chunking test.

While writing that test I discovered a minor bug in DocumentChunker, resulting in an additional unit test and a minor change to handle the case where we're given no documents and were previously not instantiating any chunker.

@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing ci-failure labels Nov 8, 2024
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 8, 2024
@bbrowning
Copy link
Contributor Author

This functional test has already discovered two bugs - one which is already fixed in this PR, and one around easyocr on macos in our CI environment that is the cause of the current CI failure.

@mergify mergify bot added documentation Improvements or additions to documentation ci-failure and removed ci-failure labels Nov 8, 2024
@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file and removed ci-failure labels Nov 8, 2024
@bbrowning bbrowning marked this pull request as ready for review November 8, 2024 17:10
@bbrowning bbrowning changed the title Expand chunking testing, including new functional tests Upgrade docling, expand chunking testing Nov 8, 2024
This adds pytest-based functional tests to our repo with a basic PDF
chunking test.

While writing that test I discovered a minor bug in DocumentChunker,
resulting in an additional unit test and a minor change to handle the
case where we're given no documents and were previously not
instantiating any chunker.

Signed-off-by: Ben Browning <[email protected]>
The pdf chunking functional tests exposed a bug on Macs with
MPS-enabled Torch distributions where easyocr was crashing. The newer
docling version uses CPU-only (instead of MPS) when running on Macs to
avoid this.

Signed-off-by: Ben Browning <[email protected]>
We no longer need qna.yaml files on disk to chunk documents.

Signed-off-by: Ben Browning <[email protected]>
@bbrowning
Copy link
Contributor Author

Tagging you in for a review @nathan-weinberg since I'm editing the CI files, adding functional tests in addition to our unit tests. Those are wired into CI much like the CLI repo, where we add a tox -e py3-functional and have the functional tests run immediately after the unit tests in the same CI job.

Copy link
Member

@nathan-weinberg nathan-weinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing aspects LGTM!

@mergify mergify bot added the one-approval label Nov 8, 2024
Copy link
Member

@aakankshaduggal aakankshaduggal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bbrowning. This is great! Love the new functional tests 🚢

@mergify mergify bot removed the one-approval label Nov 8, 2024
Copy link
Member

@khaledsulayman khaledsulayman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!

Semi-related, but it seems like 2 outstanding testing todos around this are:

  1. More thorough unit testing running the chunkers
  2. (up for discussion) updates to the e2e to test PDFs

I have #346 as a placeholder issue for Chunking testing, I can flesh it out more and/or make it an epic to track those efforts if we want to do those in follow-up PRs.

tests/functional/test_chunkers.py Outdated Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this to testdata/sample_documents?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as with the .md

@nathan-weinberg
Copy link
Member

@khaledsulayman we would def welcome some updates to the E2E CI to add PDF testing!

@bbrowning
Copy link
Contributor Author

I have some other functional tests I want to iterate on as well, not just specific to chunking, so would love to get this basic set of functional tests in and then open separate PRs for new tests. Some of the others aren't related to chunking at all, but are for things like testing that we don't cause llama_cpp to throw an assert error when doing batching and similar stuff we need to fix.

@mergify mergify bot merged commit e0698d6 into instructlab:main Nov 8, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Affects CI/CD configuration dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants