[FEAT] Support Hierarchical Chunking with Semantic Chunking as a secondary #91
Comments
Hey @theoden8! Thanks for submitting an issue! 😊 Please let me explain why this is expected behaviour with Chonkie Chonkie's So the first chunk comes out to be 932 < 1024 ( In most cases, the maximum matters more, since embedding model context has a hard limit. And you can easily augment the chunk with So, this is expected behaviour at the moment, but if this doesn't work for you, could you tell me how this could be made better? 🙏 Thanks
Hey, thank you for the detailed explanation! The hard limit on the maximum chunk length does make sense, as anything beyond it would be truncated. However, there are some reasons to want bigger chunks:
So, basically, in hybrid splitting, structure and determinism take precedence for me, and semantic splitting is a nice extra that keeps the splits from being completely arbitrary.
Thanks for the valuable feedback, @theoden8! 😄🫂 Totally valid, makes sense! In fact, I am working towards a hierarchical/sequential chunking solution with benefits on Markdown, with Semantic second, and then making sure that the chunks are not too small as well. It's definitely on the roadmap. I'd have to ask you to be patient with Chonkie, since it's planned for release with version Thanks 😊 (P.S. Since I can't "resolve" the issue at the moment, I will keep this open but change it to a feature request instead, if you don't mind... And I will ping you here when I am done with the feature.)
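For readers following the thread, here is a minimal sketch of the kind of "not too small" post-processing pass discussed above. It is not Chonkie's implementation: `count_tokens` and `merge_small_chunks` are hypothetical helpers, and the whitespace word count is only a stand-in for a real tokenizer. The idea is to merge an undersized chunk into its neighbour whenever the result still fits under the hard maximum.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Hypothetical stand-in tokenizer: counts whitespace-separated words.
    return len(text.split())


def merge_small_chunks(
    chunks: List[str],
    min_tokens: int,
    max_tokens: int,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily merge adjacent chunks so fewer of them fall below min_tokens.

    The maximum always wins: a merge is skipped if it would exceed max_tokens,
    so a chunk below the minimum can still remain in the output.
    """
    merged: List[str] = []
    for chunk in chunks:
        # Try a merge when either the previous or the current chunk is too small.
        if merged and (
            counter(merged[-1]) < min_tokens or counter(chunk) < min_tokens
        ):
            candidate = merged[-1] + " " + chunk
            if counter(candidate) <= max_tokens:
                merged[-1] = candidate
                continue
        merged.append(chunk)
    return merged
```

Because the maximum takes precedence in this sketch, the minimum is a best-effort target rather than a guarantee, which matches the trade-off described in the first reply.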
Describe the bug
When I run the semantic/SDPM chunker, min_chunk_size is not respected.
To Reproduce
Expected behavior
The minimum chunk size must be >= 256 tokens.
Additional context
Related to recently implemented #40