-
Notifications
You must be signed in to change notification settings - Fork 20
fix: Multi-byte UTF-8 character input #798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug where multi-byte UTF-8 characters (Japanese, Chinese, emoji, etc.) were displayed as garbled text in text prompts. The root cause was that the Terminal's readCharacter() method only read one byte at a time and treated it as a complete character.
Key changes:
- Introduced UTF8Reader utility to properly decode multi-byte UTF-8 sequences by determining byte count from the first byte's bit pattern
- Updated Terminal.readCharacter() to use UTF8Reader for correct multi-byte character handling
- Added comprehensive test suite covering ASCII, 2-byte, 3-byte, 4-byte UTF-8 sequences and invalid inputs
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| cli/Sources/Noora/Utilities/Terminal.swift | Refactored readCharacter() to use new UTF8Reader and added UTF8Reader struct to handle multi-byte UTF-8 decoding |
| cli/Tests/NooraTests/Utilities/UTF8ReaderTests.swift | Added comprehensive test suite for UTF8Reader covering valid sequences (1-4 bytes), invalid sequences, and consecutive character reading |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| public func readCharacter() -> Character? { | ||
| if let char = readRawCharacter() { | ||
| return Character(UnicodeScalar(UInt8(char))) | ||
| let reader = UTF8Reader { | ||
| guard let rawChar = readRawCharacter() else { return nil } | ||
| return UInt8(truncatingIfNeeded: rawChar) | ||
| } | ||
| return nil | ||
| return reader.readCharacter() |
Copilot
AI
Dec 26, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creating a new UTF8Reader instance on every call to readCharacter() is inefficient. The closure captures readRawCharacter and gets wrapped each time. Consider making UTF8Reader a stored property of Terminal or caching it to avoid repeated allocation and closure creation overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the struct is lightweight (just a closure) and is created only on user keypress, so there's no measurable overhead.
pepicrft
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment about an alternative approach to solve the issue that I believe it's more robust and handles more scenarios. Let me know what you think.
| } | ||
|
|
||
| /// A reader that decodes UTF-8 encoded bytes into characters. | ||
| struct UTF8Reader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swift has the primities to handle this covering more scenarios like Grapheme clusters. Have you tried to do something like this instead?
struct UTF8Reader {
private var iterator: AnyIterator<UInt8>
private var codec = Unicode.UTF8()
private var buffer = ""
init(readByte: @escaping () -> UInt8?) {
self.iterator = AnyIterator(readByte)
}
mutating func readCharacter() -> Character? {
while true {
switch codec.decode(&iterator) {
case .scalarValue(let scalar):
buffer.unicodeScalars.append(scalar)
// When we have more than one grapheme cluster,
// we know the first one is complete
if buffer.count > 1 {
return buffer.removeFirst()
}
case .emptyInput:
// No more input, return whatever is buffered
return buffer.isEmpty ? nil : buffer.removeFirst()
case .error:
return nil
}
}
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the Unicode.UTF8 codec, but it doesn't work for interactive terminal input.
The codec tries to read the next byte even after a complete UTF-8 sequence. For file processing this is fine since EOF returns nil, but for terminal input, getchar() blocks waiting for the user's next input.
To use the codec, we need to determine the byte length first and read those bytes upfront. But at that point, the UTF-8 decoding is essentially done, so String(bytes:encoding:) is all we need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. A fix would be to avoid the streaming codec for TTY input: read the first byte, determine the UTF-8 sequence length from that, then read exactly that many bytes (non-blocking or blocking), and finally decode with String(bytes:encoding:). That prevents getchar() from blocking for an extra byte and keeps multi-byte characters intact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example sketch:
// Read 1st byte, determine sequence length, then read exactly that many bytes.
guard let first = readRawCharacter() else { return nil }
let firstByte = UInt8(truncatingIfNeeded: first)
guard let length = UTF8Sequence.length(forFirstByte: firstByte) else { return nil }
var bytes: [UInt8] = [firstByte]
for _ in 1..<length {
guard let next = readRawCharacter() else { return nil }
let nextByte = UInt8(truncatingIfNeeded: next)
guard UTF8Sequence.isContinuationByte(nextByte) else { return nil }
bytes.append(nextByte)
}
return String(bytes: bytes, encoding: .utf8)?.firstThis avoids the streaming decoder and prevents getchar() from blocking for an extra byte on interactive input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pepicrft Thanks for the feedback. I believe my current implementation already follows the approach you suggested. it reads the first byte, determines the sequence length, reads exactly that many bytes, and decodes with String(bytes:encoding:). The only difference is that I've split the logic into helper methods.
Do you see any issues with the current implementation?
- Overlong encoding (0xC0 0x80) - Codepoint exceeding Unicode range (0xF5+) - Invalid continuation byte (0x00 instead of 0x80-0xBF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
- 0xC0-0xC1: overlong encodings (should use 1-byte ASCII) - 0xF5-0xF7: exceed valid Unicode range (U+10FFFF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Continuation bytes must match pattern 10xxxxxx (0x80-0xBF). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Hi, I fixed an issue where typing multi-byte UTF-8 characters (e.g., Japanese, Chinese) in
textPromptwould display garbled text.The previous Terminal.readCharacter() implementation read only 1 byte and treated it as a complete character. For multi-byte UTF-8 characters like "こ" ([E3 81 93]), this caused each byte to be interpreted as a separate character, resulting in garbled output.
To fix this, I introduced UTF8Reader, which determines the expected byte count from the first byte's bit pattern and reads exactly that many bytes before decoding.
Before
2025-12-26.5.12.12.mov
After
2025-12-26.5.15.06.mov