fix: Multi-byte UTF-8 character input #798

Ryu0118 · 2025-12-25T20:38:32Z

Hi, I fixed an issue where typing multi-byte UTF-8 characters (e.g., Japanese, Chinese) in textPrompt would display garbled text.

The previous Terminal.readCharacter() implementation read only 1 byte and treated it as a complete character. For multi-byte UTF-8 characters like "こ" ([E3 81 93]), this caused each byte to be interpreted as a separate character, resulting in garbled output.

public func readCharacter() -> Character? {
    if let char = readRawCharacter() {
        return Character(UnicodeScalar(UInt8(char)))
    }
    return nil
}
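To make the failure concrete, here is a minimal sketch that applies the same per-byte conversion to the three bytes of "こ" and compares it with decoding the bytes together. The array stands in for terminal input; that substitution is an assumption made for illustration.

```swift
// The three UTF-8 bytes of "こ" (U+3053).
let bytes: [UInt8] = [0xE3, 0x81, 0x93]

// Old behaviour: every byte is turned into its own Character,
// i.e. the code points U+00E3, U+0081 and U+0093.
let perByte = bytes.map { Character(UnicodeScalar($0)) }

// Intended behaviour: decode the three bytes as one UTF-8 sequence.
let decoded = String(decoding: bytes, as: UTF8.self)

print(perByte.count) // 3 separate characters instead of 1
print(decoded)       // こ
```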

To fix this, I introduced UTF8Reader, which determines the expected byte count from the first byte's bit pattern and reads exactly that many bytes before decoding.
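The idea can be sketched roughly as follows. This is not the PR's actual code: `SketchUTF8Reader` and its members are hypothetical names, and the byte source is an in-memory array rather than the terminal.

```swift
import Foundation

/// Hypothetical sketch of a first-byte-driven reader; not the PR's UTF8Reader.
struct SketchUTF8Reader {
    /// Supplies the next input byte, or nil when no input is available.
    let readByte: () -> UInt8?

    /// Expected sequence length, derived from the lead byte's high bits.
    static func sequenceLength(firstByte: UInt8) -> Int? {
        if firstByte & 0b1000_0000 == 0 { return 1 }            // 0xxxxxxx
        if firstByte & 0b1110_0000 == 0b1100_0000 { return 2 }  // 110xxxxx
        if firstByte & 0b1111_0000 == 0b1110_0000 { return 3 }  // 1110xxxx
        if firstByte & 0b1111_1000 == 0b1111_0000 { return 4 }  // 11110xxx
        return nil // stray continuation byte or invalid lead byte
    }

    func readCharacter() -> Character? {
        guard let first = readByte(),
              let length = Self.sequenceLength(firstByte: first) else { return nil }
        var bytes = [first]
        // Read exactly the remaining bytes of this sequence, and no more.
        while bytes.count < length {
            guard let next = readByte() else { return nil }
            bytes.append(next)
        }
        // String(bytes:encoding:) returns nil for malformed UTF-8.
        return String(bytes: bytes, encoding: .utf8)?.first
    }
}

// Usage: feed the bytes of "こa" from an array instead of a terminal.
var input: [UInt8] = Array("こa".utf8)
let reader = SketchUTF8Reader { input.isEmpty ? nil : input.removeFirst() }
let first = reader.readCharacter()
let second = reader.readCharacter()
```

Because the reader never asks for more bytes than the lead byte announces, a blocking byte source is only consulted for input that belongs to the current character.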

Before

2025-12-26.5.12.12.mov

After

2025-12-26.5.15.06.mov

Copilot

Pull request overview

This PR fixes a critical bug where multi-byte UTF-8 characters (Japanese, Chinese, emoji, etc.) were displayed as garbled text in text prompts. The root cause was that the Terminal's readCharacter() method only read one byte at a time and treated it as a complete character.

Key changes:

- Introduced UTF8Reader utility to properly decode multi-byte UTF-8 sequences by determining byte count from the first byte's bit pattern
- Updated Terminal.readCharacter() to use UTF8Reader for correct multi-byte character handling
- Added comprehensive test suite covering ASCII, 2-byte, 3-byte, 4-byte UTF-8 sequences and invalid inputs

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
cli/Sources/Noora/Utilities/Terminal.swift	Refactored readCharacter() to use new UTF8Reader and added UTF8Reader struct to handle multi-byte UTF-8 decoding
cli/Tests/NooraTests/Utilities/UTF8ReaderTests.swift	Added comprehensive test suite for UTF8Reader covering valid sequences (1-4 bytes), invalid sequences, and consecutive character reading



Copilot · 2025-12-26T08:51:20Z

cli/Sources/Noora/Utilities/Terminal.swift

    public func readCharacter() -> Character? {
-        if let char = readRawCharacter() {
-            return Character(UnicodeScalar(UInt8(char)))
+        let reader = UTF8Reader {
+            guard let rawChar = readRawCharacter() else { return nil }
+            return UInt8(truncatingIfNeeded: rawChar)
        }
-        return nil
+        return reader.readCharacter()


Creating a new UTF8Reader instance on every call to readCharacter() is inefficient. The closure captures readRawCharacter and gets wrapped each time. Consider making UTF8Reader a stored property of Terminal or caching it to avoid repeated allocation and closure creation overhead.

I think the struct is lightweight (just a closure) and is created only on user keypress, so there's no measurable overhead.


pepicrft

Left a comment about an alternative approach that I believe is more robust and handles more scenarios. Let me know what you think.

pepicrft · 2025-12-26T08:51:10Z

cli/Sources/Noora/Utilities/Terminal.swift

 }
+
+/// A reader that decodes UTF-8 encoded bytes into characters.
+struct UTF8Reader {


Swift has the primitives to handle this, covering more scenarios such as grapheme clusters. Have you tried something like this instead?

struct UTF8Reader {
    private var iterator: AnyIterator<UInt8>
    private var codec = Unicode.UTF8()
    private var buffer = ""

    init(readByte: @escaping () -> UInt8?) {
        self.iterator = AnyIterator(readByte)
    }

    mutating func readCharacter() -> Character? {
        while true {
            switch codec.decode(&iterator) {
            case .scalarValue(let scalar):
                buffer.unicodeScalars.append(scalar)
                // When we have more than one grapheme cluster,
                // we know the first one is complete
                if buffer.count > 1 {
                    return buffer.removeFirst()
                }
            case .emptyInput:
                // No more input, return whatever is buffered
                return buffer.isEmpty ? nil : buffer.removeFirst()
            case .error:
                return nil
            }
        }
    }
}
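For context on the grapheme-cluster point, a small sketch of why one UTF-8 sequence is not always one user-visible character: a Swift `Character` can span several Unicode scalars. The sample string is chosen purely for illustration.

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT renders as a single "é".
let decomposed = "e\u{301}"

print(decomposed.utf8.count)           // 3 bytes: 1 for "e", 2 for U+0301
print(decomposed.unicodeScalars.count) // 2 scalars
print(decomposed.count)                // 1 Character (grapheme cluster)
```

A reader that returns after each UTF-8 sequence would presumably hand such input back as two separate values, which is the scenario the buffering in the suggestion above is meant to cover.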

I tried the Unicode.UTF8 codec, but it doesn't work for interactive terminal input.

The codec tries to read the next byte even after a complete UTF-8 sequence. For file processing this is fine since EOF returns nil, but for terminal input, getchar() blocks waiting for the user's next input.
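One way to observe this is to wrap the byte source in a counting iterator and decode a single scalar. This is a sketch against an in-memory array, so nothing blocks here; with a blocking source such as getchar(), every extra pull would wait for a keypress.

```swift
// Count how many bytes the streaming codec pulls to decode one scalar.
var pulled = 0
var source = Array("こab".utf8).makeIterator()
var counting = AnyIterator<UInt8> {
    guard let byte = source.next() else { return nil }
    pulled += 1
    return byte
}

var codec = Unicode.UTF8()
var scalars: [Unicode.Scalar] = []
if case .scalarValue(let scalar) = codec.decode(&counting) {
    scalars.append(scalar)
}

// "こ" is encoded in 3 bytes; any count above 3 means the codec read ahead.
print("decoded \(scalars), pulled \(pulled) bytes")
```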

To use the codec, we need to determine the byte length first and read those bytes upfront. But at that point, the UTF-8 decoding is essentially done, so String(bytes:encoding:) is all we need.

Good point. A fix would be to avoid the streaming codec for TTY input: read the first byte, determine the UTF-8 sequence length from that, then read exactly that many bytes (non-blocking or blocking), and finally decode with String(bytes:encoding:). That prevents getchar() from blocking for an extra byte and keeps multi-byte characters intact.

Example sketch:

// Read 1st byte, determine sequence length, then read exactly that many bytes.
guard let first = readRawCharacter() else { return nil }
let firstByte = UInt8(truncatingIfNeeded: first)
guard let length = UTF8Sequence.length(forFirstByte: firstByte) else { return nil }

var bytes: [UInt8] = [firstByte]
for _ in 1..<length {
    guard let next = readRawCharacter() else { return nil }
    let nextByte = UInt8(truncatingIfNeeded: next)
    guard UTF8Sequence.isContinuationByte(nextByte) else { return nil }
    bytes.append(nextByte)
}
return String(bytes: bytes, encoding: .utf8)?.first

This avoids the streaming decoder and prevents getchar() from blocking for an extra byte on interactive input.

@pepicrft Thanks for the feedback. I believe my current implementation already follows the approach you suggested. It reads the first byte, determines the sequence length, reads exactly that many bytes, and decodes with String(bytes:encoding:). The only difference is that I've split the logic into helper methods.

Do you see any issues with the current implementation?

- Overlong encoding (0xC0 0x80)
- Codepoint exceeding Unicode range (0xF5+)
- Invalid continuation byte (0x00 instead of 0x80-0xBF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>


Ryu0118 added 5 commits December 26, 2025 04:40

Fix UTF-8 multi-byte character decoding in Terminal.readCharacter()

777b2e3

Add comprehensive tests for UTF8Reader

8a1259c

Revert unrelated formatting change in readCharacterNonBlocking()

5ec0efa

Fix a compile error

f300243

Fix SwiftFormat lint errors in Terminal.swift and UTF8ReaderTests.swift

7a01d06

Ryu0118 requested a review from a team as a code owner December 25, 2025 20:38

Ryu0118 requested review from cschmatzler and fortmarek and removed request for a team December 25, 2025 20:38

dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and changelog:fixed labels Dec 25, 2025

cschmatzler requested a review from pepicrft December 25, 2025 22:29

pepicrft changed the title ~~Fix multi-byte UTF-8 character input~~ fix: Multi-byte UTF-8 character input Dec 26, 2025

pepicrft requested a review from Copilot December 26, 2025 08:47

Copilot started reviewing on behalf of pepicrft December 26, 2025 08:47

Copilot AI reviewed Dec 26, 2025


pepicrft requested changes Dec 26, 2025


Ryu0118 and others added 3 commits December 27, 2025 18:04

Reject invalid UTF-8 start bytes in sequenceLength

3183ee9

- 0xC0-0xC1: overlong encodings (should use 1-byte ASCII)
- 0xF5-0xF7: exceed valid Unicode range (U+10FFFF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
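A range-based classification along these lines might look like the sketch below. The function name and parameter label are hypothetical, and the body is an assumption based on the commit message rather than the committed code.

```swift
/// Hypothetical helper: expected UTF-8 sequence length for a lead byte,
/// or nil when the byte can never start a valid sequence.
func sequenceLength(forLeadByte byte: UInt8) -> Int? {
    switch byte {
    case 0x00...0x7F: return 1 // ASCII
    case 0xC2...0xDF: return 2 // 0xC0 and 0xC1 would be overlong encodings
    case 0xE0...0xEF: return 3
    case 0xF0...0xF4: return 4 // 0xF5 and above would exceed U+10FFFF
    default: return nil        // continuation bytes and rejected lead bytes
    }
}

print(sequenceLength(forLeadByte: 0xE3) as Any) // Optional(3), lead byte of "こ"
print(sequenceLength(forLeadByte: 0xC0) as Any) // nil
```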

Validate UTF-8 continuation bytes

f6fda1b

Continuation bytes must match pattern 10xxxxxx (0x80-0xBF).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
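The 10xxxxxx check reduces to masking the top two bits. A minimal sketch follows; the function name is hypothetical and not necessarily what the commit uses.

```swift
/// Hypothetical helper: true when the byte has the form 10xxxxxx (0x80...0xBF).
func isContinuationByte(_ byte: UInt8) -> Bool {
    (byte & 0b1100_0000) == 0b1000_0000
}

// The second and third bytes of "こ" (E3 81 93) pass; 0x00 does not.
print(isContinuationByte(0x81)) // true
print(isContinuationByte(0x93)) // true
print(isContinuationByte(0x00)) // false
```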

Ryu0118 requested a review from pepicrft December 27, 2025 09:31
