Fix generation of multi-token unicode characters #738

ai-and-i · 2024-03-12T00:35:04Z

This PR fixes generation of Unicode characters that can only be represented by sequences of multiple tokens (closes #725).

Currently, outlines prevents any such characters from being generated at all, even with a '.*' regex. This is a major problem when generating text in non-Latin languages or special characters such as emojis.

This PR addresses this problem by converting character-level regex FSMs to byte-level FSMs. More precisely, it augments the FSM by adding byte-by-byte transitions that can be triggered by sub-character tokens generated by the LLM. The full-character transitions are kept as-is, so the performance of generation for normal tokens isn't impacted.
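To illustrate the augmentation described above, here is a minimal sketch on a toy FSM. It is not the code from this PR: the dict-based transition table and the names `add_byte_transitions` and `walk_bytes` are hypothetical, chosen only for illustration. Character transitions are keyed by `str`, byte transitions by `int`, and the original full-character transitions are left untouched.

```python
from typing import Dict, Optional, Tuple, Union

# Hypothetical toy representation: (state, symbol) -> next state.
# A str symbol is one full character; an int symbol is one UTF-8 byte.
Transitions = Dict[Tuple[int, Union[str, int]], int]


def add_byte_transitions(
    char_transitions: Transitions, num_states: int
) -> Tuple[Transitions, int]:
    """Add byte-by-byte paths next to every full-character transition.

    Returns the augmented table and the new total number of states.
    """
    augmented: Transitions = dict(char_transitions)  # keep char transitions as-is
    next_state = num_states
    for (state, ch), target in char_transitions.items():
        encoded = ch.encode("utf-8")
        current = state
        # All bytes except the last lead to intermediate states, which are
        # shared between characters that start with the same bytes.
        for b in encoded[:-1]:
            key = (current, b)
            if key not in augmented:
                augmented[key] = next_state
                next_state += 1
            current = augmented[key]
        # The last byte lands on the original target state.
        augmented[(current, encoded[-1])] = target
    return augmented, next_state


def walk_bytes(transitions: Transitions, state: int, data: bytes) -> Optional[int]:
    """Follow byte transitions; return the final state, or None if rejected."""
    current: Optional[int] = state
    for b in data:
        current = transitions.get((current, b))
        if current is None:
            return None
    return current


# Toy FSM: from state 0 or 1, the emoji or (from state 0 only) "a" leads to state 1.
char_fsm: Transitions = {(0, "😀"): 1, (1, "😀"): 1, (0, "a"): 1}
byte_fsm, total_states = add_byte_transitions(char_fsm, num_states=2)
```

In this sketch a partial character such as b'\xf0\x9f' ends in an intermediate state rather than being rejected, which is what lets a sub-character token make progress through the FSM.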

I considered other design choices, including keeping the logic for handling such tokens in the RegexGuide class. I don't think that can work, at least for GPT2-like tokenizers (used by gpt2, phi, qwen and other models). Such tokenizers have tokens that combine full UTF-8 characters with the first bytes of the next character (for example, b'\x20\xf0'), and deciding whether to accept such a token requires walking the FSM.
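A minimal sketch of why such tokens require walking the FSM, assuming a toy byte-level FSM that matches only a space followed by one emoji. The transition table, the vocabulary, and the name `allowed_tokens` are hypothetical and are not taken from outlines.

```python
from typing import Dict, List, Tuple

# Hypothetical byte-level FSM for the string " 😀": state i reads byte i.
TARGET = " 😀".encode("utf-8")  # b'\x20\xf0\x9f\x98\x80'
transitions: Dict[Tuple[int, int], int] = {
    (i, b): i + 1 for i, b in enumerate(TARGET)
}

# Hypothetical vocabulary. b" \xf0" is a full character (space) followed by
# the first byte of the next character, as in GPT2-like tokenizers.
vocabulary: List[bytes] = [b" ", b" \xf0", b"\xf0\x9f", b"\x9f\x98\x80", b"a"]


def allowed_tokens(
    transitions: Dict[Tuple[int, int], int], state: int, vocabulary: List[bytes]
) -> Dict[bytes, int]:
    """Map each acceptable token to the state reached after consuming it."""
    allowed: Dict[bytes, int] = {}
    for token in vocabulary:
        current = state
        for b in token:
            current = transitions.get((current, b))
            if current is None:
                break  # some byte has no transition: reject the token
        else:
            allowed[token] = current  # every byte was consumed
    return allowed


from_start = allowed_tokens(transitions, 0, vocabulary)
after_mixed_token = allowed_tokens(transitions, from_start[b" \xf0"], vocabulary)
```

Here the token b' \xf0' can only be judged by stepping through both of its bytes from the current state: it is accepted from state 0 and leaves the FSM in the middle of the emoji, where only b'\x9f\x98\x80' can continue.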

rlouf · 2024-03-14T12:56:23Z

We knew we would need something like this at some point. Thank you so much for implementing it!

ai-and-i · 2024-03-14T18:49:32Z

Great, thanks for merging, it unblocks me from using outlines in my project. Thanks for the great tool!

ksvladimir added 2 commits March 11, 2024 17:00

Add a function to convert utf8 regexps into regexps that operates over individual bytes

c406299

Support generating multi-byte utf8 characters

8922151

rlouf merged commit 043117f into dottxt-ai:main Mar 14, 2024
5 checks passed

saattrupdan mentioned this pull request Mar 20, 2024

Regex FSM fails with some tokenizers #762

Closed

brandonwillard mentioned this pull request Apr 20, 2024

Use a trie for scanning during index construction #507

Closed

rlouf mentioned this pull request Apr 21, 2024

Encountering RuntimeError: Cannot convert token � (29333) to bytes: � for some model vocabularies when using llama.cpp #820

Closed

milesial mentioned this pull request Jun 27, 2024

Emojis unsupported (vLLM integration) noamgat/lm-format-enforcer#116

Open
