Skip to content

ICU-22284 dump Numeric_Value property in icuexportdata.cpp#3751

Open
m4rch3n1ng wants to merge 1 commit intounicode-org:mainfrom
m4rch3n1ng:numeric-value
Open

ICU-22284 dump Numeric_Value property in icuexportdata.cpp#3751
m4rch3n1ng wants to merge 1 commit intounicode-org:mainfrom
m4rch3n1ng:numeric-value

Conversation

@m4rch3n1ng
Copy link

@m4rch3n1ng m4rch3n1ng commented Oct 27, 2025

currently, there is no way to get the numeric value of a character from the icuexportdata, so this exports the values into a nv.toml file. this is a value, that icu4x would like to be able to provide (unicode-org/icu4x#3014).

for the new nv.toml export, i added a new type of property, a [[value_property]]. a value property is similar to an [[enum_property]], but it doesn't have the values key for the enum variants and it doesn't have a name field for each of the range maps. similar to the bmg.toml, this exports a [[enum_property]], but without the values and without the name field in each of the ranges.

i was a little unsure, of what value to export, as there were two options: exporting it as a double or exporting the raw numeric type value (via GET_NUMERIC_TYPE_VALUE(u_getMainProperties(c))). i have decided on the second, both for being smaller (a double vs an int32_t) and for being more accurate (floating point numbers cannot accurately represent some fractions and the highest number that unicode provides is higher than the max safe integer of a double). it is also more flexible, potentially allowing languages with native support for fractions to actually consume them as fractions. this does put the burden of reinterpreting the value again on the consumer side, but i think, that is a fine tradeoff.

i have also made a icu4x branch, where i provide this new property: https://github.com/m4rch3n1ng/icu4x/tree/numeric-value. you can also see how the new nv.toml file looks like there.

Checklist

  • Required: Issue filed: ICU-22284
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/tools/icuexportdata/icuexportdata.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@m4rch3n1ng
Copy link
Author

i noticed (a little late), that what i was doing here previously was essentially just what bmg already does, but using a new [[value_property]], while the bmg.toml is just a "normal" [[enumerated_property]], so i switched to do that too.

@markusicu
Copy link
Member

@sffc @robertbastian @hsivonen does this look like what you would want for ICU4X?

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry I didn't see this sooner

LocalUMutableCPTriePointer builder(umutablecptrie_open(0, 0, status));

for(UChar32 c = UCHAR_MIN_VALUE; c <= UCHAR_MAX_VALUE; c++) {
int32_t ntv = static_cast<int32_t>(GET_NUMERIC_TYPE_VALUE(u_getMainProperties(c)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GET_NUMERIC_TYPE_VALUE seems to be an internal function. Could we stick with the public function
u_getNumericValue?

Copy link
Author

@m4rch3n1ng m4rch3n1ng Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but that is lossy, and it would be impossible to "recover" the original lossless value on the icu4x side, if we ever wanted to.

like i said in the post:

floating point numbers cannot accurately represent some fractions and the highest number that unicode provides is higher than the max safe integer of a double

doing just the most basic "convert 1/3 to a float and then back to a fraction" (using the num-rational rust crate) gives me 6004799503160661/18014398509481984 and it would be nice, if sometime in the future the actual 1/3 value could be somehow extracted, for example for programming languages that natively support fractionals or similar.

additionally, the current representation fits nicely into a 32-bit uint32_t, while the other would need a 64-bit double.

i am not sure if there is a better way to do this than relying on an internal function though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok

sffc
sffc previously approved these changes Jan 28, 2026
LocalUMutableCPTriePointer builder(umutablecptrie_open(0, 0, status));

for(UChar32 c = UCHAR_MIN_VALUE; c <= UCHAR_MAX_VALUE; c++) {
int32_t ntv = static_cast<int32_t>(GET_NUMERIC_TYPE_VALUE(u_getMainProperties(c)));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, ok

@m4rch3n1ng
Copy link
Author

i'm really not sure why the ci is failing. the valgrind one seems spurious, but the rest are complaining about not finding u_getMainProperties and i am not sure where i have to change something, so that it finds it, especially because it only seems to happen for some configurations and i cannot reproduce it locally. i'll look into it later again (going to sleep now), but i've not been able to fix it so far.

@sffc
Copy link
Member

sffc commented Jan 28, 2026

CI says:

/tmp/lto-llvm-f73411.o:ld-temp.o:function dumpNumericValue(_IO_FILE*): error: undefined reference to 'u_getMainProperties'

You need to export u_getMainProperties from the dylib. You can do this without making it a "public" API. Search for U_EXPORT.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/uchar.cpp is now changed in the branch
  • icu4c/source/common/uprops.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • icu4c/source/common/uchar.cpp is different
  • icu4c/source/common/uprops.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@m4rch3n1ng
Copy link
Author

i changed the U_CFUNC to U_CAPI (so U_CFUNC U_EXPORT), but i am unsure about it, as U_CAPI is documented "to declare a function as a public ICU C API", which i don't think i want? but i am unsure.

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be fine; there are already lots and lots of functions with U_EXPORT that aren't in the public headers. @markusicu or @richgillam can confirm.

@markusicu markusicu self-assigned this Jan 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants