fix: handle incomplete UTF-8 sequences and add test for reproduction #1166
I'd like to start by thanking you for releasing such an amazing TUI framework 🧋.
This pull request introduces improvements to the way we handle input data by detecting incomplete UTF-8 sequences and addressing them appropriately.
Background
Currently, the input handling that produces tea.KeyMsg emits an unknownInputByteMsg when a read ends in the middle of a multibyte UTF-8 character. As a result, the character is corrupted and cannot be input correctly. Fortunately, the first byte of a UTF-8 sequence tells us how many bytes the sequence contains, so we can determine whether more bytes are needed. We believe this can be resolved by invoking an additional read to complete the sequence.
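To illustrate the point about the first byte, here is a minimal sketch of how the expected sequence length can be derived from a lead byte. The helper name `utf8SeqLen` is hypothetical and is not taken from Bubble Tea or from this PR's diff; it only mirrors the lead-byte ranges defined by UTF-8.

```go
package main

import "fmt"

// utf8SeqLen is a hypothetical helper for illustration only. It returns the
// total number of bytes in the UTF-8 sequence that starts with lead byte b,
// or 0 if b cannot start a sequence (a continuation byte or an invalid lead).
func utf8SeqLen(b byte) int {
	switch {
	case b < 0x80: // ASCII
		return 1
	case b >= 0xC2 && b <= 0xDF: // lead byte of a 2-byte sequence
		return 2
	case b >= 0xE0 && b <= 0xEF: // lead byte of a 3-byte sequence
		return 3
	case b >= 0xF0 && b <= 0xF4: // lead byte of a 4-byte sequence
		return 4
	default: // continuation byte (0x80-0xBF) or invalid lead byte
		return 0
	}
}

func main() {
	// 0xe4 starts a 3-byte sequence, so a buffer ending in a lone 0xe4
	// still needs two more bytes.
	fmt.Println(utf8SeqLen(0xe4))
}
```

If a buffer ends with fewer bytes than the lead byte calls for, the reader knows the character is incomplete rather than unknown.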
Reproduction
This issue can occasionally be reproduced by repeatedly inputting multiple multibyte characters using the code below. My environment is macOS 14.6.1, go version go1.23.1 darwin/arm64, tmux 3.4.
The log during reproduction is as follows.
I am repeatedly inputting 一二三 (representing one, two, three in Japanese). After several inputs, it is detected as unknownInputByteMsg. This is because the expected 9 bytes (3 chars x 3 bytes) are read in two parts, like 0xe4, 0xb8, 0x80, 0xe4 and 0xba, 0x8c, 0xe4, 0xb8, 0x89.
Fix
This fix resolves the missing-character issue.
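The diff itself is not reproduced in this description, so the following is only a rough sketch of the general idea, not the actual change. It assumes a hypothetical helper, `splitIncomplete`, that holds back a trailing incomplete sequence so it can be joined with the next read; it uses `utf8.RuneStart` and `utf8.FullRune` from the Go standard library and the two byte chunks from the reproduction above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitIncomplete is a hypothetical helper for illustration only. It splits
// buf into the bytes that can be decoded now and a trailing, incomplete
// UTF-8 sequence (if any) that should be kept until more input arrives.
func splitIncomplete(buf []byte) (complete, pending []byte) {
	// Look back at most utf8.UTFMax bytes for the start of the last rune.
	for i := len(buf) - 1; i >= 0 && i >= len(buf)-utf8.UTFMax; i-- {
		if utf8.RuneStart(buf[i]) {
			if !utf8.FullRune(buf[i:]) {
				return buf[:i], buf[i:]
			}
			break
		}
	}
	return buf, nil
}

func main() {
	// The two reads observed in the reproduction.
	first := []byte{0xe4, 0xb8, 0x80, 0xe4}
	second := []byte{0xba, 0x8c, 0xe4, 0xb8, 0x89}

	complete, pending := splitIncomplete(first)
	fmt.Printf("decoded: %s, held back: % x\n", complete, pending)

	// Prepend the held-back byte to the next read before decoding.
	joined := append(append([]byte{}, pending...), second...)
	complete, pending = splitIncomplete(joined)
	fmt.Printf("decoded: %s, held back: % x\n", complete, pending)
}
```

With this approach the lone 0xe4 at the end of the first read is carried over, and the second read decodes to 二三 instead of producing an unknown byte.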