Refactor Document Chunker to always use docling #430
Signed-off-by: Khaled Sulayman <[email protected]>
# class DocumentChunker:
Should we keep this commented out, or would it make more sense to delete it? It will always be in the git history if we need to find it again.
Ah sorry, I forgot to remove these.
I might set this PR to draft until I'm able to get the tests to pass, and then do one final linting cleanup.
self.server_ctx_size = server_ctx_size
self.chunk_word_count = chunk_word_count
self.output_dir = output_dir
if len(document_paths) == 0:
Just an FYI, in Python you can write if not document_paths
and it will cover both cases: document_paths is None and len(document_paths) == 0
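To illustrate the suggestion, here is a minimal sketch of the truthiness check. The function name validate_document_paths and the error message are hypothetical and are not taken from the PR.

```python
def validate_document_paths(document_paths):
    """Return the paths as a list, or raise if none were provided.

    Hypothetical helper for illustration only; not part of the PR.
    """
    # `not document_paths` is True for None and for any empty
    # sequence, so no separate `is None` or `len(...) == 0` check
    # is needed.
    if not document_paths:
        raise ValueError("No document paths were provided")
    return list(document_paths)
```

One caveat: the check treats every falsy value the same way, so an empty string or 0 passed by mistake would also raise here.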
The old DocumentChunker was a factory class that called the text-splitter on Markdown files and docling on PDFs. What we actually want is to call docling first and then use the text-splitter, for all document types. This change refactors the DocumentChunker class to always call docling, as long as the provided documents are supported filetypes.
Resolves: #334
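The flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: SUPPORTED_SUFFIXES, select_supported, chunk_words, and chunk_documents are hypothetical names, the convert callable stands in for the docling conversion step, and a naive word-count splitter stands in for the real text-splitter.

```python
from pathlib import Path

# Assumption: an illustrative set of suffixes, not the project's real list.
SUPPORTED_SUFFIXES = {".md", ".pdf"}


def select_supported(document_paths):
    """Keep only the paths whose suffix is in SUPPORTED_SUFFIXES."""
    return [
        Path(p)
        for p in document_paths
        if Path(p).suffix.lower() in SUPPORTED_SUFFIXES
    ]


def chunk_words(text, chunk_word_count):
    """Split text into chunks of at most chunk_word_count words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_word_count])
        for i in range(0, len(words), chunk_word_count)
    ]


def chunk_documents(document_paths, convert, chunk_word_count):
    """Run one conversion step on every supported file, then split.

    `convert` is a stand-in for the docling conversion; it takes a
    Path and returns the document text as a string.
    """
    chunks = []
    for path in select_supported(document_paths):
        text = convert(path)
        chunks.extend(chunk_words(text, chunk_word_count))
    return chunks
```

The point of the sketch is that there is a single code path: every supported file goes through the same conversion and the same splitter, with no branching on file type.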