Integrate Context-Aware Chunking and PDF Support #284

khaledsulayman · 2024-09-23T17:18:43Z

Add context-aware document chunking for PDFs.

The file type determines which chunker is used: the context-aware chunker for PDFs (.pdf) or the text splitter for Markdown (.md).
Currently uses the docling v1 JSON format.

Resolves: #271
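
The file-type check described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `select_chunker` and the returned strategy labels are hypothetical, and the only behavior taken from the description is that `.pdf` files go to the context-aware chunker and `.md` files go to the text splitter.

```python
from pathlib import Path


def select_chunker(path):
    """Pick a chunking strategy from the file extension.

    Hypothetical helper for illustration only; the names and return
    values are assumptions, not the API introduced by this PR.
    """
    # Path(...).suffix keeps the leading dot; lower() makes the check
    # case-insensitive so "report.PDF" is treated like "report.pdf".
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "context_aware"
    if suffix == ".md":
        return "text_splitter"
    # Fail loudly on anything else rather than guessing a chunker.
    raise ValueError(f"Unsupported document type: {suffix or '(none)'}")
```

Using `pathlib` for the extension check also avoids hand-rolled string splitting on paths, which is one of the cross-platform concerns raised later in the review.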

bbrowning · 2024-09-25T00:18:05Z

I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time.

bbrowning

I did an initial once-over - may not have caught everything, and some of the things I caught I think will be obvious once we can run tests (including end-to-end CI tests) on this.

We'll want to add a pdf file and data generation from it to our existing end-to-end tests to exercise that code-path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.

I tried to focus mostly on the issues that will either cause packaging headaches (keeping our dependencies as slim as possible), user-impacting issues running this across multiple systems (file path handling), or breakage to things like existing markdown functionality. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.

src/instructlab/sdg/utils/chunking.py

requirements.txt

src/instructlab/sdg/utils/docprocessor.py

src/instructlab/sdg/utils/taxonomy.py

src/instructlab/sdg/utils/docprocessor.py

Signed-off-by: Khaled Sulayman <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]>

Signed-off-by: Khaled Sulayman <[email protected]>

bbrowning

Thanks for iterating on this and getting things to a better place! I think this is good enough to get in, knowing that we already have a number of follow-up issues (#333, #334, #335) that we want to tackle after this merges. Some of those follow-up issues will reduce the extensive chunking code here, so while I agree with the comments from @jwm4 about the chunking logic itself being hard to follow, I think we can quickly reduce the complexity of that with the expected move to docling v2.

I'd also like to see more unit, functional, and/or e2e tests here and will contribute to those myself once this merges, so that we can divide and conquer the remaining work rather than serializing everything behind this PR.

jwm4

I still don't like this chunker. It is confusing, idiosyncratic, and there are hardly any comments. However, my easily-fixable comments have been addressed and I understand it is time to move on for now. So I am approving this as-is for now with the understanding that it will undergo a major rework very soon. There is a discussion at DS4SD/docling#191 about better approaches for chunking using Docling (v2) outputs, and I am hoping that discussion will influence the direction this goes in the future.

mergify bot added ci-failure and removed ci-failure labels Sep 23, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch 2 times, most recently from 5ca1f89 to 43d9414 Compare September 24, 2024 15:11

mergify bot added ci-failure and removed ci-failure labels Sep 24, 2024

mergify bot added the dependencies Pull requests that update a dependency file label Sep 25, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch from 10cda56 to 1146df3 Compare September 25, 2024 14:37

aakankshaduggal force-pushed the ks-integrate-docprocessor branch from 2e224b6 to 521d5e0 Compare September 25, 2024 17:30

khaledsulayman force-pushed the ks-integrate-docprocessor branch 3 times, most recently from a4209d6 to bec5daf Compare September 26, 2024 20:25

mergify bot added ci-failure and removed ci-failure labels Sep 26, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch 4 times, most recently from 2471ba9 to 82fa5c2 Compare September 27, 2024 15:49

bbrowning previously requested changes Sep 27, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch 9 times, most recently from f86f99a to d1e076e Compare September 27, 2024 21:28

mergify bot added ci-failure and removed ci-failure labels Nov 6, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch from e5d35f8 to 462d4bc Compare November 6, 2024 20:56

mergify bot added ci-failure and removed ci-failure labels Nov 6, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch from 462d4bc to 00d4d1f Compare November 7, 2024 04:11

mergify bot added ci-failure and removed ci-failure labels Nov 7, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch from d31f130 to 757632d Compare November 7, 2024 04:18

mergify bot removed the ci-failure label Nov 7, 2024

khaledsulayman force-pushed the ks-integrate-docprocessor branch from 757632d to 182a31b Compare November 7, 2024 04:20

mergify bot added the ci-failure label Nov 7, 2024

Update testing for Document Chunker classes

e7b1666

Signed-off-by: Khaled Sulayman <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]>

khaledsulayman force-pushed the ks-integrate-docprocessor branch from 182a31b to 46d7366 Compare November 7, 2024 15:06

mergify bot added ci-failure and removed ci-failure labels Nov 7, 2024

Change prints in build_chunks_from_docling_json to debug messages

f06e7f4

Signed-off-by: Khaled Sulayman <[email protected]>

khaledsulayman force-pushed the ks-integrate-docprocessor branch from 46d7366 to f06e7f4 Compare November 7, 2024 15:13

mergify bot removed the ci-failure label Nov 7, 2024

bbrowning approved these changes Nov 7, 2024

mergify bot added the one-approval label Nov 7, 2024

aakankshaduggal requested a review from jwm4 November 7, 2024 15:47

aakankshaduggal approved these changes Nov 7, 2024

mergify bot removed the one-approval label Nov 7, 2024

jwm4 approved these changes Nov 7, 2024

mergify bot merged commit 4c82c05 into instructlab:main Nov 7, 2024
22 checks passed

KodieGlosserIBM mentioned this pull request Nov 20, 2024

Sdg v0.6.0+ multiple knowledge sources fails to clone #404

Closed

mairin mentioned this pull request Dec 17, 2024

InstructLab Maintainer nomination instructlab/community#417

Open
