collation order vs encoding order in range matching #88

@bbolker

The TRE documentation defines a range as

Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence.

(here in the repository)

However, testing with the Estonian locale (in R's imported version of TRE) shows that T is matched by [A-Z], which contradicts this definition because Z sorts before T in Estonian collation order ... this comment says

/* XXX - Should use collation order instead of encoding values in character ranges. */

Would it be correct to change the documentation to say

The characters to include are determined by Unicode code point ordering.

as in the ICU documentation ... ?
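
To make the difference concrete, here is a minimal sketch (in Python, not TRE or R code) comparing the two interpretations of [A-Z]. The `ESTONIAN_ORDER` string and both helper functions are hypothetical names introduced for illustration; the ordering is a simplified assumption restricted to basic Latin letters, with Z placed between S and T and the accented letters left out.

```python
import re

# Assumption: simplified Estonian uppercase order, basic Latin letters only
# (accented letters are omitted). The relevant point is that Z sorts before T.
ESTONIAN_ORDER = "ABCDEFGHIJKLMNOPQRSZTUVWXY"


def in_range_by_code_point(ch, lo, hi):
    """Range test on encoding values, the behaviour reported for TRE."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_by_collation(ch, lo, hi, order=ESTONIAN_ORDER):
    """Range test on position in a collating sequence, as the docs describe."""
    rank = {c: i for i, c in enumerate(order)}
    return rank[lo] <= rank[ch] <= rank[hi]


if __name__ == "__main__":
    print(in_range_by_code_point("T", "A", "Z"))  # True: U+0054 lies between U+0041 and U+005A
    print(in_range_by_collation("T", "A", "Z"))   # False: T sorts after Z in the assumed order
    # Python's own re module also orders ranges by code point.
    print(bool(re.fullmatch("[A-Z]", "T")))       # True
```

Under the collation reading, T falls outside [A-Z] in this assumed ordering; under the code point reading it falls inside, which matches the observed behaviour and the proposed wording.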
