
ICU-23314 UnicodeSet: extended name escapes #3850

Open
aryanraj45 wants to merge 1 commit into unicode-org:main from aryanraj45:ICU-23314-unicode-set-extended-name-escapes


@aryanraj45

Implements support for extended name escapes in UnicodeSet patterns as specified in ICU-23314.

Changes

This PR adds support for the \N{hex:name} syntax in UnicodeSet patterns, allowing users to specify both the hexadecimal code point and its Unicode name for validation purposes.

Implementation Details

  • Extended the applyPropertyPattern() method in UnicodeSet.java to parse the new hex:name format
  • When a colon is detected in \N{...}, the format is parsed as hex:name
  • The hex value is parsed and validated to be a valid code point (0-0x10FFFF)
  • The actual Unicode name is retrieved and compared with the provided name
  • If names don't match, an IllegalArgumentException is thrown
  • Backward compatibility maintained: standard \N{name} syntax continues to work
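The validation steps listed above can be sketched roughly as follows. This is not the code from the PR: the class and method names (`ExtendedNameEscapeSketch`, `parseHexName`) are hypothetical, and the JDK's `Character.getName(int)` stands in for whatever name lookup `UnicodeSet.java` actually uses. Exact, case-sensitive name comparison is also an assumption.

```java
public final class ExtendedNameEscapeSketch {
    /**
     * Parses the body of a \N{hex:name} escape (the text between the braces)
     * and returns the code point if the name matches. Hypothetical helper.
     */
    static int parseHexName(String body) {
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected hex:name, got: " + body);
        }
        String hex = body.substring(0, colon).trim();
        String expectedName = body.substring(colon + 1).trim();

        int codePoint;
        try {
            codePoint = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            // Also reached when the hex value overflows an int.
            throw new IllegalArgumentException("invalid hex value: " + hex, e);
        }
        if (codePoint < 0 || codePoint > Character.MAX_CODE_POINT) {
            throw new IllegalArgumentException("code point out of range: " + hex);
        }

        // Assumption: the JDK name table is used here in place of ICU's.
        // Character.getName returns null for unassigned code points.
        String actualName = Character.getName(codePoint);
        if (!expectedName.equals(actualName)) {
            throw new IllegalArgumentException(
                "name mismatch for U+" + hex + ": expected \"" + expectedName
                    + "\", actual \"" + actualName + "\"");
        }
        return codePoint;
    }
}
```

In this sketch the hex value is checked before the name, so a malformed or out-of-range code point is reported before any name lookup is attempted.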

Testing

  • Added comprehensive test cases in UnicodeSetTest.TestExtendedNameEscapes()
  • Tests cover: valid hex:name format, name mismatch errors, invalid hex errors, out-of-range errors, and backward compatibility
  • All existing UnicodeSet tests (55 tests) continue to pass

Example Usage

// Valid usage
UnicodeSet set = new UnicodeSet("[\\N{0041:LATIN CAPITAL LETTER A}]"); // Works
UnicodeSet emoji = new UnicodeSet("[\\N{1F4A9:PILE OF POO}]"); // Works

// Invalid usage - throws exception
UnicodeSet invalid = new UnicodeSet("[\\N{0041:WRONG NAME}]"); // Error: name mismatch

Checklist

  • Required: Issue filed: ICU-23314
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

e.getMessage().contains("out of range"));
}

// Test that standard \N{name} still works (backward compatibility)
Member


As noted on JIRA, this PR is moot, but since I am looking at it: This isn’t for backward compatibility; these just serve different purposes. Sometimes you just want to refer to the character LATIN CAPITAL LETTER A and you don’t actually care what the code point is; the name alone is more readable.

In other cases, you care about the code point (what motivated that is tooling used to develop the Unicode Character Database; for new characters we absolutely want to check that we are putting them in the right place).

Likewise for the hex:literal:name version: sometimes you might want to illustrate what the character is, other times you might not need to (or it might be impractical, e.g., for control characters).

Author


Thanks for the explanation! I understand now: since the parser is being rewritten, my changes would conflict with that work.

I appreciate you taking the time to review and explain the context. I'll close this PR and look for other issues where I can contribute more effectively.

Sorry for not catching this earlier!
