Query for unicode range \u036E-\u036F returns non-matching results #688

Open
ZeLonewolf opened this issue Apr 15, 2023 · 7 comments
@ZeLonewolf

The following query for a range of two consecutive Unicode values returns 5,747 city nodes. However, none of the returned results appear to contain either character.

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"[\u036E-\u036F]"];
out;

Querying for each character individually returns zero results:

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"\u036E"];
out;

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"\u036F"];
out;
@mmd-osm
Contributor

mmd-osm commented Apr 15, 2023

#332 is probably related...

Also note that \u needs a bit more escaping here: \\u

@1ec5

1ec5 commented Apr 15, 2023

A query for node[place=city][name~"[\u1ebf]"] (with just one backslash) does return two cities that contain this combining character (because editors and imports at the time didn’t normalize the text to NFC). Expanding the range to U+0300 to U+036F correctly returns this node.
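To make the NFC point concrete, here is a minimal Python sketch (not Overpass code) showing how U+1EBF relates to the combining marks in U+0300 to U+036F. The sample name "Hu\u1ebf" and the helper `has_combining_mark` are only illustrations, and the sketch assumes a regex engine that treats the range as a range of code points, which is exactly what this issue calls into question for Overpass.

```python
import re
import unicodedata

precomposed = "\u1ebf"  # LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE
decomposed = unicodedata.normalize("NFD", precomposed)

# Code points of each form.
print([f"U+{ord(c):04X}" for c in precomposed])  # ['U+1EBF']
print([f"U+{ord(c):04X}" for c in decomposed])   # ['U+0065', 'U+0302', 'U+0301']

# Code-point range over the Combining Diacritical Marks block.
combining = re.compile("[\u0300-\u036f]")

def has_combining_mark(name: str) -> bool:
    """True if the name contains any character in U+0300..U+036F."""
    return combining.search(name) is not None

# Hypothetical sample name, stored once in NFC and once in NFD.
name_nfc = "Hu\u1ebf"
name_nfd = unicodedata.normalize("NFD", name_nfc)

print(has_combining_mark(name_nfc))  # False: only the precomposed letter
print(has_combining_mark(name_nfd))  # True: contains U+0302 and U+0301
```

So a name only falls into the U+0300 to U+036F range if it was stored without NFC normalization.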

@1ec5

1ec5 commented Apr 15, 2023

Oh, I just got lucky because the city names happened to contain some of the letters in the hexadecimal numbers in the range. Never mind me.
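A rough Python sketch of what "got lucky" could mean here, assuming the pattern reaches a POSIX-style regex engine as the literal text `\u1ebf` rather than as the character U+1EBF. In a POSIX bracket expression a backslash is an ordinary character, so the class is just a set of ASCII characters. The helper `bracket_members` is hypothetical and simplified (no negation, no character classes, code-point ordering for ranges); it is not how Overpass is implemented.

```python
def bracket_members(body: str) -> set[str]:
    """Expand the inside of a bracket expression, treating backslash as a literal."""
    members: set[str] = set()
    i = 0
    while i < len(body):
        # "a-b" is a range unless the hyphen is the last character.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            if lo > hi:
                raise ValueError("invalid range in bracket expression")
            members.update(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            members.add(body[i])
            i += 1
    return members

def matches(body: str, name: str) -> bool:
    """True if any character of name is in the expanded bracket expression."""
    members = bracket_members(body)
    return any(ch in members for ch in name)

# Literal text \u1ebf: the set is backslash, u, 1, e, b, f.
print(sorted(bracket_members(r"\u1ebf")))
print(matches(r"\u1ebf", "Berlin"))  # True, because of the "e"
print(matches(r"\u1ebf", "Oslo"))    # False

# Literal text \u036E-\u036F: "E-\" becomes a range from "E" to backslash,
# which covers the uppercase letters E through Z.
print(matches(r"\u036E-\u036F", "Tokyo"))   # True, because of the "T"
print(matches(r"\u036E-\u036F", "Berlin"))  # False
```

Under that assumption, any name containing one of those ASCII characters matches, whether or not it contains the intended character.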

@mmd-osm
Contributor

mmd-osm commented Apr 15, 2023

So based on U+1EBF, I'm getting the following three place=city nodes (with proper unicode regex support):

  <node id="369487050"/>
  <node id="369487099"/>
  <node id="3140507587"/>

@ZeLonewolf
Author

I note that even with the escaping fixed, I still get (different) nonsensical results:

[out:csv(::id, name)][timeout:2500];
node[place=city][name~"[\\u036E-\\u036F]"];
out;

@mmd-osm
Contributor

mmd-osm commented Apr 15, 2023

Right, I've noticed the missing backslash when revisiting #332. In the end it doesn't make a whole lot of a difference, since the underlying regular expression implementation doesn't handle ranges as expected.
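One plausible way a range can misbehave like this, sketched in Python rather than Overpass code: if the pattern and the names are compared as raw UTF-8 bytes instead of code points, the range endpoints are single bytes, not characters. This is an assumption about the failure mode, not a confirmed description of the Overpass implementation; the names `byte_class`, `single`, and `byte_match` are made up for the illustration.

```python
import re

lo = "\u036e".encode("utf-8")  # b"\xcd\xae"
hi = "\u036f".encode("utf-8")  # b"\xcd\xaf"

# Byte-wise, the class body is CD AE 2D CD AF, which reads as:
# the byte CD, the byte range AE-CD, and the byte AF.
byte_class = re.compile(b"[" + lo + b"-" + hi + b"]")

# Without brackets, the two bytes must appear in sequence.
single = re.compile(re.escape(lo))

def byte_match(pattern: re.Pattern, name: str) -> bool:
    """Search the UTF-8 encoding of name with a bytes pattern."""
    return pattern.search(name.encode("utf-8")) is not None

print(byte_match(byte_class, "Zürich"))  # True: "ü" is C3 BC, both in AE-CD
print(byte_match(byte_class, "Berlin"))  # False: ASCII only
print(byte_match(single, "Zürich"))      # False: no CD AE sequence
print(byte_match(single, "a\u036e"))     # True: contains U+036E itself
```

In this reading the bracketed range matches many names with non-ASCII characters, while the single-character queries match only names that really contain the character, which is consistent with the counts reported at the top of the issue.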

I hope you received some link to a github gist to try out another implementation that works a bit better.

@drolbr added the bug label Apr 20, 2023
@NeatNit

NeatNit commented Nov 20, 2024

I've encountered this issue myself. It's baffling that character ranges can match characters outside the range. How does this happen?
