
update for Unicode 16.0.0 #271

Open · wants to merge 1 commit into master
Conversation

stevengj (Member)

Draft PR to update our data tables to the upcoming Unicode 16.0.0 standard.

The data_generator.jl script is currently failing with:

julia --project=. data_generator.jl > utf8proc_data.c.new
ERROR: LoadError: AssertionError: !(haskey(comb_indices, dm1))
Stacktrace:
 [1] top-level scope
   @ ~/Documents/Code/utf8proc/data/data_generator.jl:325
in expression starting at /Users/stevenj/Documents/Code/utf8proc/data/data_generator.jl:293

@c42f, since you wrote/ported this script in #258, can you help?

inkydragon (Contributor) commented Oct 9, 2024

The error occurs while processing the Character Decomposition Mapping data.
Here, dm0 and dm1 are the two characters produced by decomposing a character.

The assert fails when dm0 == dm1.
I tried the old Ruby script, and it fails at the same assertion.

    @assert !haskey(comb_indices, dm0)
    comb_indices[dm0] = cumoffset
    cumoffset += last - first + 1 + 2
end
offset = 0
for dm1 in comb2nd_indices_sorted_keys
    @assert !haskey(comb_indices, dm1)

In other words, the script assumes that a character will never decompose into two identical characters.

But Unicode 16 introduces a new character, KIRAT RAI VOWEL SIGN AI (U+16D68), which decomposes into two copies of KIRAT RAI VOWEL SIGN E (U+16D67).

(See Figure 13-16, p. 683, The Unicode Standard, Version 16.0 – Core Specification.)

I'm not sure whether the current compressed table (xref: #68) can represent this type of mapping.
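
For illustration, here is a minimal sketch (not the actual data_generator.jl logic) that scans a local copy of UnicodeData.txt for codepoints appearing both as the first (dm0) and the second (dm1) element of a two-character canonical decomposition; these are exactly the cases that trip the disjointness assert. The file path is an assumption.

# Minimal sketch (not the real data_generator.jl): list codepoints that occur
# both as dm0 and as dm1 in a two-character canonical decomposition.
# Assumes a local copy of UnicodeData.txt from UCD 16.0.0.
firsts = Set{UInt32}()
seconds = Set{UInt32}()
for line in eachline("UnicodeData.txt")
    fields = split(line, ';')
    dm = fields[6]                                   # decomposition field
    (isempty(dm) || startswith(dm, "<")) && continue # skip compatibility mappings
    parts = split(dm)
    length(parts) == 2 || continue                   # only pairs feed the combining tables
    dm0, dm1 = parse.(UInt32, parts, base=16)
    push!(firsts, dm0)
    push!(seconds, dm1)
end
for c in sort!(collect(intersect(firsts, seconds)))
    println("U+", uppercase(string(c, base=16)), " is both a dm0 and a dm1")
end

Under the 16.0.0 data this should flag at least U+16D67 and U+113C2.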

stevengj (Member, Author) commented Oct 9, 2024

Naively commenting out the assert doesn't work; it fails the normalization test for U+113C5, another such character introduced in Unicode 16.0 that decomposes into two identical characters, U+113C2 + U+113C2.

It looks like we'll have to special-case the tables somehow for this. It would be unfortunate to have to add an extra table just for this, but I'm not sure I see a way around it yet.
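
As a quick, hedged check of the U+113C5 case via Julia's Unicode stdlib (which wraps utf8proc): this should only print true once the underlying utf8proc build carries Unicode 16.0.0 tables, and assuming U+113C5 is not composition-excluded.

# Per Unicode 16, U+113C2 U+113C2 should recompose to U+113C5 under NFC
# (assuming U+113C5 is not on the composition-exclusion list).
using Unicode
s = "\U113C2\U113C2"
println(Unicode.normalize(s, :NFC) == "\U113C5")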
