collation order vs encoding order in range matching #88

The [TRE documentation](https://laurikari.net/tre/documentation/regex-syntax/) defines a range as

> Two characters separated by -. This is shorthand for the full range of characters between those two (inclusive) in the collating sequence. 

([here](https://github.com/laurikari/tre/blob/494b1c9c6827a3205f162f047c3d4bd0a681405d/doc/tre-syntax.html#L245-L248) in the repository)

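However, testing with the Estonian locale (in R's imported version of TRE) shows that `T` is matched by `[A-Z]`. In Estonian collation Z sorts before T, so under the documented behaviour `T` should fall outside that range. The cause appears to be that ranges are compared by encoding value; [this comment](https://github.com/r-devel/r-svn/blob/66813d4c830d830a2e0acb0e2f53f522b8e2dc37/src/extra/tre/tre-parse.c#L300) says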

>  /* XXX - Should use collation order instead of encoding values in character ranges. */
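
To make the difference concrete, here is a minimal sketch of the two ways a range such as `[A-Z]` could be interpreted. It does not call TRE or any locale API: the uppercase Estonian alphabet is hard-coded as an assumption, and the names `ESTONIAN_UPPER`, `in_range_by_code_point`, and `in_range_by_collation` are hypothetical helpers made up for illustration.

```python
# Uppercase Estonian alphabet in its conventional order. Hard-coded here as an
# assumption for illustration; a real implementation would consult the locale.
ESTONIAN_UPPER = list("ABCDEFGHIJKLMNOPQRSŠZŽTUVWÕÄÖÜXY")


def in_range_by_code_point(ch, lo, hi):
    """Range test on encoding values (what the XXX comment describes)."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_by_collation(ch, lo, hi, alphabet=ESTONIAN_UPPER):
    """Range test on position in a collating sequence (what the docs describe)."""
    position = {c: i for i, c in enumerate(alphabet)}
    if ch not in position or lo not in position or hi not in position:
        return False
    return position[lo] <= position[ch] <= position[hi]


if __name__ == "__main__":
    for ch in "STŠZ":
        print(ch,
              in_range_by_code_point(ch, "A", "Z"),
              in_range_by_collation(ch, "A", "Z"))
```

With this hard-coded ordering, `T` is inside `A`-`Z` by code point but outside it by collation, while `Š` is the reverse.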

Would it be correct to change the documentation to say

>  The characters to include are determined by Unicode code point ordering.

as in the [ICU documentation](https://unicode-org.github.io/icu/userguide/strings/regexp.html)?



