Describe the bug
When using the chunk_batch() method, the resulting Chunks have a wrong end_index. The indices seem to be counted in tokens instead of string characters. This does not happen when using the single .chunk() method.
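To Reproduce
Suppose that text_ds is a list of str that contains the texts you want to chunk.
Printing chunks[0][0].end_index for the output of chunk_batch() on text_ds gives 300, or whatever chunk_size is.
Expected behavior
chunks[0][0].end_index should return a greater int value.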
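To make the difference between the two units concrete, here is a minimal toy sketch. It uses a whitespace tokenizer and hypothetical helper names (token_spans, first_chunk_indices); it is not Chonkie's code, only an illustration of why an end index counted in tokens equals chunk_size while a character offset is larger.

```python
import re


def token_spans(text):
    """Return (start, end) character offsets of each whitespace-delimited token.

    Toy tokenizer for illustration only, not the tokenizer Chonkie uses.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def first_chunk_indices(text, chunk_size):
    """Return the end of the first chunk in token units and in character units."""
    spans = token_spans(text)[:chunk_size]
    end_in_tokens = len(spans)                     # matches the value reported above
    end_in_chars = spans[-1][1] if spans else 0    # what end_index is expected to hold
    return end_in_tokens, end_in_chars


text = "the quick brown fox jumps over the lazy dog"
end_in_tokens, end_in_chars = first_chunk_indices(text, chunk_size=4)
print(end_in_tokens, end_in_chars)   # 4 19
print(repr(text[:end_in_chars]))     # 'the quick brown fox'
```

With real subword tokenizers the ratio differs, but the character offset is still generally larger than the token count, which is why an end_index equal to chunk_size looks wrong.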
Additional context
I could not check the other chunkers. I have the same issue as #73. I tried to look in the code; maybe it originates from the _process_batch method of the TokenChunker class? I'll try to dig deeper if I have time.
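For anyone who wants to check other chunkers, a small helper along these lines might do. It is a sketch that assumes each chunk exposes text, start_index and end_index attributes (only end_index is confirmed in this report), and it uses hypothetical stand-in objects rather than real Chonkie output.

```python
from types import SimpleNamespace


def find_bad_spans(text, chunks):
    """Return the chunks whose indices do not slice back to their own text.

    Assumes character-based start_index / end_index attributes on each chunk.
    """
    return [c for c in chunks if text[c.start_index:c.end_index] != c.text]


# Hypothetical stand-ins for chunk objects, for illustration only.
text = "hello world, hello chunks"
good = SimpleNamespace(text="hello world,", start_index=0, end_index=12)
bad = SimpleNamespace(text="hello world,", start_index=0, end_index=3)  # token-like index

print(len(find_bad_spans(text, [good, bad])))  # 1
```

Running the same check on the output of .chunk() and on each list returned by chunk_batch() should show whether only the batch path is affected.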
Anyway, thanks for your time and for the great package!
Cheers!
I understand the issue you are seeing, and I'll look into reproducing it so I can add a patch to Chonkie as soon as possible. Thanks to the detailed reproduction steps, I should be able to do that quickly.
I will post updates on the progress of this bug here.