Skip to content

Conversation

@Ryu0118
Copy link
Collaborator

@Ryu0118 Ryu0118 commented Dec 25, 2025

Hi, I fixed an issue where typing multi-byte UTF-8 characters (e.g., Japanese, Chinese) in textPrompt would display garbled text.

The previous Terminal.readCharacter() implementation read only 1 byte and treated it as a complete character. For multi-byte UTF-8 characters like "こ" ([E3 81 93]), this caused each byte to be interpreted as a separate character, resulting in garbled output.

public func readCharacter() -> Character? {
    if let char = readRawCharacter() {
        return Character(UnicodeScalar(UInt8(char)))
    }
    return nil
}

To fix this, I introduced UTF8Reader, which determines the expected byte count from the first byte's bit pattern and reads exactly that many bytes before decoding.

Before

2025-12-26.5.12.12.mov

After

2025-12-26.5.15.06.mov

@Ryu0118 Ryu0118 requested a review from a team as a code owner December 25, 2025 20:38
@Ryu0118 Ryu0118 requested review from cschmatzler and fortmarek and removed request for a team December 25, 2025 20:38
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. changelog:fixed labels Dec 25, 2025
@cschmatzler cschmatzler requested a review from pepicrft December 25, 2025 22:29
@pepicrft pepicrft changed the title Fix multi-byte UTF-8 character input fix: Multi-byte UTF-8 character input Dec 26, 2025
@pepicrft pepicrft requested a review from Copilot December 26, 2025 08:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug where multi-byte UTF-8 characters (Japanese, Chinese, emoji, etc.) were displayed as garbled text in text prompts. The root cause was that the Terminal's readCharacter() method only read one byte at a time and treated it as a complete character.

Key changes:

  • Introduced UTF8Reader utility to properly decode multi-byte UTF-8 sequences by determining byte count from the first byte's bit pattern
  • Updated Terminal.readCharacter() to use UTF8Reader for correct multi-byte character handling
  • Added comprehensive test suite covering ASCII, 2-byte, 3-byte, 4-byte UTF-8 sequences and invalid inputs

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
cli/Sources/Noora/Utilities/Terminal.swift Refactored readCharacter() to use new UTF8Reader and added UTF8Reader struct to handle multi-byte UTF-8 decoding
cli/Tests/NooraTests/Utilities/UTF8ReaderTests.swift Added comprehensive test suite for UTF8Reader covering valid sequences (1-4 bytes), invalid sequences, and consecutive character reading

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 145 to +150
public func readCharacter() -> Character? {
if let char = readRawCharacter() {
return Character(UnicodeScalar(UInt8(char)))
let reader = UTF8Reader {
guard let rawChar = readRawCharacter() else { return nil }
return UInt8(truncatingIfNeeded: rawChar)
}
return nil
return reader.readCharacter()
Copy link

Copilot AI Dec 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating a new UTF8Reader instance on every call to readCharacter() is inefficient. The closure captures readRawCharacter and gets wrapped each time. Consider making UTF8Reader a stored property of Terminal or caching it to avoid repeated allocation and closure creation overhead.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the struct is lightweight (just a closure) and is created only on user keypress, so there's no measurable overhead.

Copy link
Contributor

@pepicrft pepicrft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a comment about an alternative approach to solve the issue that I believe it's more robust and handles more scenarios. Let me know what you think.

}

/// A reader that decodes UTF-8 encoded bytes into characters.
struct UTF8Reader {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swift has the primities to handle this covering more scenarios like Grapheme clusters. Have you tried to do something like this instead?

struct UTF8Reader {
    private var iterator: AnyIterator<UInt8>
    private var codec = Unicode.UTF8()
    private var buffer = ""
    
    init(readByte: @escaping () -> UInt8?) {
        self.iterator = AnyIterator(readByte)
    }
    
    mutating func readCharacter() -> Character? {
        while true {
            switch codec.decode(&iterator) {
            case .scalarValue(let scalar):
                buffer.unicodeScalars.append(scalar)
                // When we have more than one grapheme cluster,
                // we know the first one is complete
                if buffer.count > 1 {
                    return buffer.removeFirst()
                }
            case .emptyInput:
                // No more input, return whatever is buffered
                return buffer.isEmpty ? nil : buffer.removeFirst()
            case .error:
                return nil
            }
        }
    }
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the Unicode.UTF8 codec, but it doesn't work for interactive terminal input.

The codec tries to read the next byte even after a complete UTF-8 sequence. For file processing this is fine since EOF returns nil, but for terminal input, getchar() blocks waiting for the user's next input.

To use the codec, we need to determine the byte length first and read those bytes upfront. But at that point, the UTF-8 decoding is essentially done, so String(bytes:encoding:) is all we need.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. A fix would be to avoid the streaming codec for TTY input: read the first byte, determine the UTF-8 sequence length from that, then read exactly that many bytes (non-blocking or blocking), and finally decode with String(bytes:encoding:). That prevents getchar() from blocking for an extra byte and keeps multi-byte characters intact.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example sketch:

// Read 1st byte, determine sequence length, then read exactly that many bytes.
guard let first = readRawCharacter() else { return nil }
let firstByte = UInt8(truncatingIfNeeded: first)

guard let length = UTF8Sequence.length(forFirstByte: firstByte) else { return nil }
var bytes: [UInt8] = [firstByte]
for _ in 1..<length {
  guard let next = readRawCharacter() else { return nil }
  let nextByte = UInt8(truncatingIfNeeded: next)
  guard UTF8Sequence.isContinuationByte(nextByte) else { return nil }
  bytes.append(nextByte)
}

return String(bytes: bytes, encoding: .utf8)?.first

This avoids the streaming decoder and prevents getchar() from blocking for an extra byte on interactive input.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pepicrft Thanks for the feedback. I believe my current implementation already follows the approach you suggested. it reads the first byte, determines the sequence length, reads exactly that many bytes, and decodes with String(bytes:encoding:). The only difference is that I've split the logic into helper methods.

Do you see any issues with the current implementation?

Ryu0118 and others added 3 commits December 27, 2025 18:04
- Overlong encoding (0xC0 0x80)
- Codepoint exceeding Unicode range (0xF5+)
- Invalid continuation byte (0x00 instead of 0x80-0xBF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- 0xC0-0xC1: overlong encodings (should use 1-byte ASCII)
- 0xF5-0xF7: exceed valid Unicode range (U+10FFFF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Continuation bytes must match pattern 10xxxxxx (0x80-0xBF).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@Ryu0118 Ryu0118 requested a review from pepicrft December 27, 2025 09:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog:fixed size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants