[BUG] TokenChunker Batch_chunking gives wrong end_index #84

Open
CharlesMoslonka opened this issue Dec 9, 2024 · 1 comment
Labels: bug (Something isn't working), in progress (Actively looking into the issue)

Comments

@CharlesMoslonka

Describe the bug

When using the chunk_batch() method, the resulting chunks have a wrong end_index. The indices seem to be counted in tokens rather than in characters of the source string. This does not happen with the single-text .chunk() method.

To Reproduce

Suppose that text_ds is a list of str that contains the texts you want to chunk.

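from chonkie import TokenChunker

# `tokenizer` is whichever tokenizer you pass to TokenChunker (not specified in this report).
chunker = TokenChunker(
    tokenizer=tokenizer,
    chunk_size=300,
    chunk_overlap=20,
)

chunks = chunker.chunk_batch(text_ds)
print(chunks[0][0].end_index)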

This prints 300, or whatever chunk_size is set to.

Expected behavior

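chunks[0][0].end_index should be the character offset at which the first chunk ends in the source text. Since most tokens span several characters, that is normally an int greater than chunk_size.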
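To make the token-versus-character distinction concrete, here is a minimal sketch that does not use Chonkie at all. It assumes a toy whitespace tokenizer, and the helper names `token_spans` and `chunk_end_offsets` are hypothetical, introduced only for illustration. It shows how far apart the two units can be for the same chunk.

```python
import re


def token_spans(text):
    """Toy whitespace tokenizer (an assumption, not Chonkie's tokenizer).

    Returns the (start, end) character span of each token in `text`.
    """
    return [m.span() for m in re.finditer(r"\S+", text)]


def chunk_end_offsets(text, chunk_size):
    """Return the end of the first chunk in token units and in character units."""
    spans = token_spans(text)
    n_tokens = min(chunk_size, len(spans))
    token_end = n_tokens  # what the report observes: equals chunk_size
    char_end = spans[n_tokens - 1][1] if n_tokens else 0  # what is expected
    return token_end, char_end


text = "Chonkie splits long documents into smaller chunks"
token_end, char_end = chunk_end_offsets(text, chunk_size=3)
print(token_end, char_end)   # 3 19
print(text[:char_end])       # Chonkie splits long
```

With a real subword tokenizer the ratio would differ, but the character offset would still generally exceed the token count.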

Additional context

I could not check the other chunkers. I have the same issue as #73. I tried looking at the code; maybe it originates from the _process_batch method of the TokenChunker class? I'll try to dig deeper if I have time.
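For anyone wanting to check other chunkers, a small sanity check along these lines might help. It is only a sketch: it assumes chunk objects expose `text`, `start_index` and `end_index`, and that character-based indices imply `end_index - start_index == len(text)`. `StubChunk` and `suspect_chunks` are hypothetical names used here so the example runs without Chonkie installed.

```python
from dataclasses import dataclass


@dataclass
class StubChunk:
    """Stand-in for a chunk object (hypothetical; mirrors only the fields used below)."""
    text: str
    start_index: int
    end_index: int


def suspect_chunks(chunks):
    """Return chunks whose index span does not match the length of their text.

    Assumption: with character-based indices, end_index - start_index
    should equal len(chunk.text).
    """
    return [c for c in chunks if c.end_index - c.start_index != len(c.text)]


good = [StubChunk("Chonkie splits long", 0, 19)]
bad = [StubChunk("Chonkie splits long", 0, 3)]  # end_index counted in tokens

print(suspect_chunks(good))  # []
print(len(suspect_chunks(bad)))  # 1
```

Running the same check over each list returned by chunk_batch() should flag the chunks described in this report. Since .chunk() is reported as unaffected, calling it once per text may serve as a stopgap in the meantime.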

Anyway, thanks for your time and for the great package!
Cheers!

@CharlesMoslonka added the bug (Something isn't working) label on Dec 9, 2024
@bhavnicksm
Collaborator

Hey @CharlesMoslonka!

Thanks for opening an issue and the kind words 😊

I understand the issue you are seeing. I'll look into reproducing it so that I can add a patch to Chonkie as soon as possible. Thanks to the detailed reproduction steps, I should be able to do that quickly.

I'll post updates on the progress of this bug here.

Thanks! ☺️

@shreyashnigam added the in progress (Actively looking into the issue) label on Dec 16, 2024