Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Context-Aware Chunking and PDF Support #284

Merged
merged 11 commits into from
Nov 7, 2024

Conversation

khaledsulayman
Copy link
Member

@khaledsulayman khaledsulayman commented Sep 23, 2024

Add context-aware document chunking for PDFs.

Checking by filetype to decide whether to use the context-aware chunker (pdf) or text-splitter (md).
Currently uses docking v1 json format.

Resolves: #271

@mergify mergify bot added ci-failure and removed ci-failure labels Sep 23, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch 2 times, most recently from 5ca1f89 to 43d9414 Compare September 24, 2024 15:11
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 24, 2024
@bbrowning
Copy link
Contributor

I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time.

@mergify mergify bot added the dependencies Pull requests that update a dependency file label Sep 25, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from 10cda56 to 1146df3 Compare September 25, 2024 14:37
@aakankshaduggal aakankshaduggal force-pushed the ks-integrate-docprocessor branch from 2e224b6 to 521d5e0 Compare September 25, 2024 17:30
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from a4209d6 to bec5daf Compare September 26, 2024 20:25
@mergify mergify bot added ci-failure and removed ci-failure labels Sep 26, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch 4 times, most recently from 2471ba9 to 82fa5c2 Compare September 27, 2024 15:49
Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an initial once-over - may not have caught everything, and some of the things I caught I think will be obvious once we can run tests (including end-to-end CI tests) on this.

We'll want to add a pdf file and data generation from it to our existing end-to-end tests to exercise that code-path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.

I tried to focus mostly on the issues that will either cause packaging headaches (keeping our dependencies as slim as possible), user-impacting issues running this across multiple systems (file path handling), or breakage to things like existing markdown functionality. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.

src/instructlab/sdg/utils/chunking.py Outdated Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
requirements.txt Show resolved Hide resolved
src/instructlab/sdg/utils/docprocessor.py Outdated Show resolved Hide resolved
src/instructlab/sdg/utils/docprocessor.py Outdated Show resolved Hide resolved
src/instructlab/sdg/utils/docprocessor.py Outdated Show resolved Hide resolved
src/instructlab/sdg/utils/taxonomy.py Outdated Show resolved Hide resolved
src/instructlab/sdg/utils/docprocessor.py Outdated Show resolved Hide resolved
src/instructlab/sdg/utils/docprocessor.py Outdated Show resolved Hide resolved
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch 9 times, most recently from f86f99a to d1e076e Compare September 27, 2024 21:28
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 6, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from e5d35f8 to 462d4bc Compare November 6, 2024 20:56
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 6, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from 462d4bc to 00d4d1f Compare November 7, 2024 04:11
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 7, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from d31f130 to 757632d Compare November 7, 2024 04:18
@mergify mergify bot removed the ci-failure label Nov 7, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from 757632d to 182a31b Compare November 7, 2024 04:20
@mergify mergify bot added the ci-failure label Nov 7, 2024
Signed-off-by: Khaled Sulayman <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from 182a31b to 46d7366 Compare November 7, 2024 15:06
@mergify mergify bot added ci-failure and removed ci-failure labels Nov 7, 2024
@khaledsulayman khaledsulayman force-pushed the ks-integrate-docprocessor branch from 46d7366 to f06e7f4 Compare November 7, 2024 15:13
@mergify mergify bot removed the ci-failure label Nov 7, 2024
Copy link
Contributor

@bbrowning bbrowning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this and getting things to a better place! I think this is good enough to get in, knowing that we already have a number of follow-up issues (#333, #334, #335) that we want to tackle after this merges. Some of those follow-up issues will reduce the extensive chunking code here, so while I agree with the comments from @jwm4 about the chunking logic itself being hard to follow, I think we can quickly reduce the complexity of that with the expected move to docling v2.

I'd also like to see more unit, functional, and/or e2e tests here and will contribute to those myself once this merges so that we can divide and conquer the remaining work to be done versus everything needed to serialize behind this PR.

@mergify mergify bot added the one-approval label Nov 7, 2024
@aakankshaduggal aakankshaduggal requested a review from jwm4 November 7, 2024 15:47
@mergify mergify bot removed the one-approval label Nov 7, 2024
Copy link

@jwm4 jwm4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't like this chunker. It is confusing, idiosyncratic, and there are hardly any comments. However, my easily-fixable comments have been addressed and I understand it is time to move on for now. So I am approving this as-is for now with the understanding that it will undergo a major rework very soon. There is a discussion at DS4SD/docling#191 about better approaches for chunking using Docling (v2) outputs, and I am hoping that discussion will influence the direction this goes in the future.

@mergify mergify bot merged commit 4c82c05 into instructlab:main Nov 7, 2024
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file testing Relates to testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

context-aware chunking
5 participants