Regex FSM fails with some tokenizers #762
Looks like such models use different encoding for incomplete UTF-8 sequences. I will do my best to try and find some time to take a look in the near future. The ideal and universal solution for this would be for huggingface tokenizers to expose a `decode_bytes` function, which would simply decode incomplete sequences as raw bytes (I assume this is what the underlying tokenizers do anyway, followed by decoding the bytes from UTF-8). In the absence of that, we have to resort to hacks that are specific to tokenizer types.
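The point about incomplete UTF-8 sequences can be illustrated in plain Python (a minimal sketch, not outlines code): decoding a multi-byte character that has been cut mid-sequence with `errors="replace"` yields the U+FFFD replacement character, which is exactly the `�` that shows up in these vocabularies.

```python
# Minimal sketch (not outlines code): why incomplete UTF-8 sequences
# surface as the U+FFFD replacement character.
token_bytes = "é".encode("utf-8")   # b'\xc3\xa9', a two-byte UTF-8 sequence
partial = token_bytes[:1]           # b'\xc3' on its own is incomplete
decoded = partial.decode("utf-8", errors="replace")
print(decoded)  # '�' (U+FFFD)
```

A `decode_bytes`-style API, as suggested above, would avoid this lossy round-trip entirely.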
This PR is an extension of #763, related to extending the `re_replacement_seq` regex. The new [NorwAI models](https://huggingface.co/NorwAI) use a tokenizer that has the token `�.`, which leads to the same error as was described in the previous issue #762. This PR extends the fix from #763 to deal with this case, as well as adding a unit test to test various tokenizers, and a comment describing why we need the prefix and suffix in the regex.
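The shape of the fix can be sketched as follows (a hypothetical pattern for illustration only, not the exact `re_replacement_seq` used in outlines): the regex has to tolerate both a prefix, such as the SentencePiece word-boundary marker `▁`, and a trailing character, such as the `.` in NorwAI's `�.` token, around the run of replacement characters.

```python
import re

# Hypothetical illustration of the idea behind the fix; the actual
# re_replacement_seq in outlines may differ.
re_replacement_seq = re.compile(r"^▁*�+.?$")

# Tokens of these shapes appear in the affected tokenizers.
for token in ["�", "▁�", "�.", "▁��"]:
    assert re_replacement_seq.fullmatch(token)
```

Without the optional prefix and suffix, tokens like `▁�` and `�.` slip past the check and later break `reduced_vocabulary`.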
Describe the issue as clearly as possible:
When running a regex FSM with some tokenizers, the `reduced_vocabulary` function fails due to the existence of a token `▁�` in the vocabulary. This includes, but is not limited to, the following model types:

The following models are not affected by this:
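A quick way to check whether a given tokenizer is affected (a sketch using a stand-in vocabulary dict; against a real model one would scan `tokenizer.get_vocab()` instead) is to look for tokens containing U+FFFD:

```python
# Stand-in vocabulary; a real check would iterate tokenizer.get_vocab().
vocab = {"hello": 0, "▁�": 1, "�.": 2, "world": 3}

# Tokens containing the U+FFFD replacement character are the ones that
# trip reduced_vocabulary.
suspect = sorted(t for t in vocab if "\ufffd" in t)
print(suspect)  # ['▁�', '�.']
```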
This was implemented in this commit as part of this PR. Tagging @ksvladimir here as he might be able to help resolve this 🙂
Note that this bug only occurs when installing from the main branch, not from any release; I'm filing it here so it can (ideally) be fixed before the next release.
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Version information
Context for the issue:
Prevents the use of `RegexGuide` with the above-mentioned models.