-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate Context-Aware Chunking and PDF Support #284
Integrate Context-Aware Chunking and PDF Support #284
Conversation
5ca1f89
to
43d9414
Compare
I'm keeping track of the work here, but withholding any feedback until you get to a point where you think it's ready. If you need any help tracking down CI or other test failures while working on this, let me know and I'll make time. |
10cda56
to
1146df3
Compare
2e224b6
to
521d5e0
Compare
a4209d6
to
bec5daf
Compare
2471ba9
to
82fa5c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an initial once-over - may not have caught everything, and some of the things I caught I think will be obvious once we can run tests (including end-to-end CI tests) on this.
We'll want to add a pdf file and data generation from it to our existing end-to-end tests to exercise that code-path, as there's a lot of new logic here that we'll want to ensure works in the CI environment.
I tried to focus mostly on the issues that will either cause packaging headaches (keeping our dependencies as slim as possible), user-impacting issues running this across multiple systems (file path handling), or breakage to things like existing markdown functionality. We can clean up other bits later once we get clean end-to-end tests working for markdown and PDFs.
f86f99a
to
d1e076e
Compare
e5d35f8
to
462d4bc
Compare
462d4bc
to
00d4d1f
Compare
d31f130
to
757632d
Compare
757632d
to
182a31b
Compare
Signed-off-by: Khaled Sulayman <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]>
182a31b
to
46d7366
Compare
Signed-off-by: Khaled Sulayman <[email protected]>
46d7366
to
f06e7f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating on this and getting things to a better place! I think this is good enough to get in, knowing that we already have a number of follow-up issues (#333, #334, #335) that we want to tackle after this merges. Some of those follow-up issues will reduce the extensive chunking code here, so while I agree with the comments from @jwm4 about the chunking logic itself being hard to follow, I think we can quickly reduce the complexity of that with the expected move to docling v2.
I'd also like to see more unit, functional, and/or e2e tests here and will contribute to those myself once this merges so that we can divide and conquer the remaining work to be done versus everything needed to serialize behind this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't like this chunker. It is confusing, idiosyncratic, and there are hardly any comments. However, my easily-fixable comments have been addressed and I understand it is time to move on for now. So I am approving this as-is for now with the understanding that it will undergo a major rework very soon. There is a discussion at DS4SD/docling#191 about better approaches for chunking using Docling (v2) outputs, and I am hoping that discussion will influence the direction this goes in the future.
Add context-aware document chunking for PDFs.
Checking by filetype to decide whether to use the context-aware chunker (pdf) or text-splitter (md).
Currently uses docking v1 json format.
Resolves: #271