ICU-23314 UnicodeSet: extended name escapes by aryanraj45 · Pull Request #3850 · unicode-org/icu

aryanraj45 · 2026-01-29T23:22:42Z

Implements support for extended name escapes in UnicodeSet patterns as specified in ICU-23314.

Changes

This PR adds support for the \N{hex:name} syntax in UnicodeSet patterns, allowing users to specify both the hexadecimal code point and its Unicode name for validation purposes.

Implementation Details

Extended the applyPropertyPattern() method in UnicodeSet.java to parse the new hex:name format
When a colon is detected in \N{...}, the format is parsed as hex:name
The hex value is parsed and validated to be a valid code point (0-0x10FFFF)
The actual Unicode name is retrieved and compared with the provided name
If names don't match, an IllegalArgumentException is thrown
Backward compatibility maintained: standard \N{name} syntax continues to work
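
The parsing steps above can be sketched roughly as follows. This is a minimal standalone illustration, not the code in this PR: the class `ExtendedNameEscapeSketch` and its `resolve` method are hypothetical names, and the name lookup uses the JDK's `Character.getName(int)` as a stand-in for whatever name lookup ICU4J performs internally. It handles only the body of a `\N{...}` escape that contains a colon.

```java
// Hypothetical sketch: validates the "hex:name" body of a \N{...} escape.
// Uses java.lang.Character.getName as an assumed stand-in for ICU's name data.
final class ExtendedNameEscapeSketch {
    private ExtendedNameEscapeSketch() {}

    /** Returns the code point for a body such as "0041:LATIN CAPITAL LETTER A". */
    static int resolve(String body) {
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected hex:name, got: " + body);
        }
        String hex = body.substring(0, colon).trim();
        String name = body.substring(colon + 1).trim();

        // Step 1: parse the hexadecimal part.
        int codePoint;
        try {
            codePoint = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid hex: " + hex, e);
        }

        // Step 2: check that it is a valid code point (0 to 0x10FFFF).
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("code point out of range: " + hex);
        }

        // Step 3: compare the actual character name with the supplied one.
        String actual = Character.getName(codePoint);
        if (actual == null || !actual.equals(name)) {
            throw new IllegalArgumentException(
                    "name mismatch for U+" + hex + ": expected " + actual + ", got " + name);
        }
        return codePoint;
    }
}
```

With this sketch, `ExtendedNameEscapeSketch.resolve("0041:LATIN CAPITAL LETTER A")` returns `0x41`, while `resolve("0041:WRONG NAME")` throws an `IllegalArgumentException`.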

Testing

Added comprehensive test cases in UnicodeSetTest.TestExtendedNameEscapes()
Tests cover: valid hex:name format, name mismatch errors, invalid hex errors, out-of-range errors, and backward compatibility
All existing UnicodeSet tests (55 tests) continue to pass

Example Usage

// Valid usage
UnicodeSet set = new UnicodeSet("[\\N{0041:LATIN CAPITAL LETTER A}]"); // Works
UnicodeSet emoji = new UnicodeSet("[\\N{1F4A9:PILE OF POO}]"); // Works

// Invalid usage - throws exception
UnicodeSet invalid = new UnicodeSet("[\\N{0041:WRONG NAME}]"); // Error: name mismatch

Checklist

Required: Issue filed: ICU-23314
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

eggrobin · 2026-01-30T00:00:27Z

icu4j/main/core/src/test/java/com/ibm/icu/dev/test/lang/UnicodeSetTest.java

+                e.getMessage().contains("out of range"));
+        }
+
+        // Test that standard \N{name} still works (backward compatibility)


As noted on JIRA, this PR is moot, but since I am looking at it: This isn’t for backward compatibility; these just serve different purposes. Sometimes you just want to refer to the character LATIN CAPITAL LETTER A and you don’t actually care what the code point is; the name alone is more readable.

In other cases, you care about the code point (what motivated that is tooling used to develop the Unicode Character Database; for new characters we absolutely want to check that we are putting them in the right place).

Likewise for the hex:literal:name version: sometimes you might want to illustrate what the character is, other times you might not need to (or it might be impractical, e.g., for control characters).

aryanraj45 · 2026-01-30

Thanks for the explanation! I understand now: since the parser is being rewritten, my changes would conflict with that work.

I appreciate you taking the time to review and explain the context. I'll close this PR and look for other issues where I can contribute more effectively.

Sorry for not catching this earlier!

ICU-23314 UnicodeSet: extended name escapes

773fd31

eggrobin reviewed Jan 30, 2026

aryanraj45 wants to merge 1 commit into unicode-org:main from aryanraj45:ICU-23314-unicode-set-extended-name-escapes
