Unicode string types #25

pbackus · 2024-03-01T13:55:39Z

pbackus
Mar 1, 2024

Instead of using string, wstring, and dstring to represent Unicode strings, Phobos v3 should include library types representing Unicode strings encoded in UTF-8, UTF-16, and UTF-32.

These types should validate their data upon construction, and their public interfaces should be designed to ensure that their data remains valid at all times (at least in @safe code). For example, the UTF-8 and UTF-16 types must not allow slicing in the middle of a code point.

This approach is an application of the principles described in "Parse, don't validate."
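A minimal sketch of what "validate upon construction, never slice mid code point" could look like for the UTF-8 case. The name `UTF8View` and its members are hypothetical and not part of any Phobos v3 design; only `std.utf.validate` and `UTFException` are existing Phobos symbols.

```d
import std.utf : validate, UTFException;

/// Hypothetical validated UTF-8 view; name and API are illustrative only.
struct UTF8View
{
    private const(char)[] data;

    /// Validates once, at construction. Throws UTFException on malformed input.
    this(const(char)[] s) @safe pure
    {
        validate(s);
        data = s;
    }

    /// Length in code units, not code points.
    size_t length() const @safe pure nothrow @nogc { return data.length; }

    /// True if `i` is the start of a code point or the end of the data.
    /// UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    private bool isBoundary(size_t i) const @safe pure nothrow @nogc
    {
        return i == data.length || (data[i] & 0xC0) != 0x80;
    }

    /// Slices by code-unit offsets, rejecting cuts through a code point.
    UTF8View opSlice(size_t lo, size_t hi) const @safe pure
    {
        if (lo > hi || hi > data.length || !isBoundary(lo) || !isBoundary(hi))
            throw new UTFException("slice does not fall on code point boundaries");
        UTF8View result;
        result.data = data[lo .. hi]; // a boundary-aligned slice of valid data stays valid
        return result;
    }

    const(char)[] toString() const @safe pure nothrow @nogc { return data; }
}

unittest
{
    import std.exception : assertThrown;

    auto v = UTF8View("héllo");           // 'é' occupies code units 1 and 2
    assert(v[1 .. 3].toString == "é");
    assertThrown!UTFException(v[0 .. 2]); // would cut 'é' in half
}
```

Because the only ways to obtain a `UTF8View` are the validating constructor, a checked slice, or the empty `.init` value, code receiving one does not need to re-validate it.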

crazymonkyyy · 2024-03-01T14:18:59Z

crazymonkyyy
Mar 1, 2024

Types as in the famous examples of when languages added more formal types, like StringBuilder or Python 3? Or Voldemort types, where it's inlined as a "range starter"?

0 replies

crazymonkyyy · 2024-03-01T14:20:48Z

crazymonkyyy
Mar 1, 2024

Why support UTF-16 and UTF-32 at all? ASCII and Unicode are the main choices.

0 replies

rikkimax · 2024-03-01T15:35:44Z

rikkimax
Mar 1, 2024
Collaborator

Oh goodness, this is a hell of a topic.

I was genuinely hoping that we would never do this.

First let's start with the different storage representations:

Read only slice
  A slice wrapped up.
  Its internal representation may not match the encoding it exposes externally.
  Can accept any string that the encoding is able to represent.
Appender
  We already know this one, it's just an append-only dynamic array.
  No slicing or anything else interesting.
Builder
  If you have never known me to swear, because I chose not to do so generally, well let me just preface this by saying it is entirely earned here.
  FUCK.
  An unrolled linked list that is sliceable, removal-safe, concurrency-safe, uses a different encoding internally than externally, and iterates correctly.
  FUCK, it's a constant stream of bugs wrt. the iterator.
  I'm still not confident I got mine working correctly.

I didn't bother much with Appender, although it does have behavior specific to characters and to my string types.

My read only strings and builders are one source file each, with some replacements done on it to form three of each.
I can't do it with templates, because I need the guarantee that the binary the code ends up in is the shared library.

Don't forget we'll need normalization support for comparison (note: having no default form is the correct choice here, even if everyone uses NFC today).
And then normalization will need to be combined with case folding to make all of that not allocate by default.
The current status of std.uni doesn't do case folding comparisons that support tailoring.
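To make the allocation concern concrete, here is a sketch of comparing strings under normalization plus case folding using only what `std.uni` offers today. The helper name `equalNormalizedIgnoreCase` is hypothetical, and normalizing to NFC before folding is a simplification assumed for illustration; it is not the full canonical caseless match defined by the Unicode standard, and it does no tailoring.

```d
import std.uni : icmp, normalize, NFC;

/// Hypothetical helper: equality after NFC normalization and full case folding.
/// Each normalize call may allocate a new string, which is the cost a fused
/// normalize-and-fold comparison would avoid.
bool equalNormalizedIgnoreCase(string a, string b)
{
    return icmp(normalize!NFC(a), normalize!NFC(b)) == 0;
}

unittest
{
    assert("\u00E9" != "e\u0301");                            // code units differ
    assert(equalNormalizedIgnoreCase("\u00E9", "e\u0301"));   // same text after NFC
    assert(equalNormalizedIgnoreCase("\u00C9", "e\u0301"));   // 'É' vs 'e' + combining acute
    assert(equalNormalizedIgnoreCase("Rußland", "Russland")); // full case folding
    assert(!equalNormalizedIgnoreCase("a", "b"));
}
```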

10 replies

rikkimax Mar 1, 2024
Collaborator

This discussion has nothing to do with that at all.

This is about representation for manipulation, completely outside of ranges.

pbackus Mar 1, 2024
Author

@rikkimax I assume the internal representation would either be a slice (simple) or a range of code units (complex but more generic). So, either like this:

struct UTF8
{
    private char[] data;
    // ...
}

Or like this:

struct UTF8(R)
if (isForwardRange!(R, char))
{
    private R data;
    // ...
}

I don't think hard-coding a dependency on any specific data structure other than a slice (like an Appender, or a linked list) would make sense here.

If we implement Unicode algorithms in terms of ranges then either representation will work equally well. If we implement them in terms of slices, then the slice representation is probably a better choice. I have no experience in this area, so I can't really judge which would be better.

rikkimax Mar 1, 2024
Collaborator

Because no mutation happens in place, yes a slice is the way to go for a read only string type, up until memory fragmentation is a concern.

What I did was make my iterators range-first in terms of API, with opApply then wrapping the range.

One thing I do want to mention is my opApply doesn't have an index.

If you want to calculate how much has been consumed you must use .length to figure out how much is left as you go.
This handles all the fun encoding problems because external doesn't match internal encoding.

Multiple representations can exist with differing storage each.
Same with iteration, it doesn't have to be just byChar.

You really have to decide just how far down this path you want to go.
The further you go, the harder the problems you solve for people.
But as a result, you will invent some new swear words.
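The indexless iteration described above can be sketched as a small range over UTF-8 data. This is an assumption-laden illustration, not rikkimax's implementation: `CodePoints` is a hypothetical name, and the remaining-length property is called `remaining` here rather than `.length`, because a range whose `length` is not its element count would confuse generic Phobos algorithms.

```d
import std.utf : decode;

/// Hypothetical code point range over UTF-8 data, with no index exposed.
struct CodePoints
{
    private const(char)[] rest;

    this(const(char)[] s) @safe pure nothrow @nogc { rest = s; }

    bool empty() const @safe pure nothrow @nogc { return rest.length == 0; }

    /// Decodes the next code point without consuming it.
    dchar front() const @safe pure
    {
        size_t i = 0;
        return decode(rest, i);
    }

    /// Skips exactly one code point, however many code units it spans.
    void popFront() @safe pure
    {
        size_t i = 0;
        decode(rest, i); // advances i past one code point
        rest = rest[i .. $];
    }

    /// Code units not yet consumed.
    size_t remaining() const @safe pure nothrow @nogc { return rest.length; }
}

unittest
{
    auto text = "aé€"; // 1 + 2 + 3 code units
    size_t[] offsets;
    for (auto r = CodePoints(text); !r.empty; r.popFront())
        offsets ~= text.length - r.remaining; // consumed so far
    assert(offsets == [0, 1, 3]);
}
```

The caller derives "how much has been consumed" from the total and the remainder, so the iterator never has to promise that an index means the same thing in the internal and external encodings.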

pbackus Mar 1, 2024
Author

I guess the biggest decision is whether the Unicode string type is a container (which owns and manages memory) or just a view of data owned by something else ("read-only").

If we start with the read-only/view version, it should be possible to go back and add the container version later (possibly via a dub package), so I would suggest that as the first step.

rikkimax Mar 1, 2024
Collaborator

They are both containers, it's just that one cannot be mutated without going into the second form first.

One of my regrets is that my read only slice (templated for anything), dynamic array (templated for anything) and read only strings, do not have the same state object so as to prevent memory copying when changing type as things get constructed.

I suggest that the first step is instead to consider boxing slices generically as a container, and how it all fits together.

ichordev · 2024-03-03T17:47:06Z

ichordev
Mar 3, 2024

Meh. What's the point of this? I always thought that constantly checking string validity was a bit slow and pedantic, and wished that the std.uni code was just nothrow. I would like a function for checking string validity when I feel that it's necessary. A type is not the solution.

The existing built-in types are more interoperable with existing D code, already represent UTF-8/16/32 strings, and can have syntactic sugar that's obviated at compile-time, and so on.
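For comparison, the "function for checking string validity" alternative can be sketched on top of the existing `std.utf.validate`. The name `isValidUTF` is hypothetical. Note that this version is `nothrow` but not `@nogc`, since `validate` still allocates an exception internally when the input is malformed.

```d
import std.utf : validate;

/// Hypothetical helper: reports validity as a bool instead of throwing.
bool isValidUTF(const(char)[] s) @safe pure nothrow
{
    try
    {
        validate(s);
        return true;
    }
    catch (Exception)
    {
        return false;
    }
}

unittest
{
    assert(isValidUTF("héllo"));

    char[] truncated = ['a', cast(char) 0xC3]; // lead byte with no continuation byte
    assert(!isValidUTF(truncated));
}
```

With this approach the built-in `string` type is kept, and it is up to the caller to remember where the check has or has not been done.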

9 replies

ichordev Mar 6, 2024

When did I say UTF-16? I think UTF-16 need only cover conversion to/from UTF-8/32. Otherwise, UTF-16 is a waste of time.
I’m not sure of any corresponding examples for UTF-8, so my bad if there are any.

rikkimax Mar 6, 2024
Collaborator

No, UTF-16 surrogates are characters that can be encoded in UTF-8 and UTF-32. Although it is an error to do so.

dukc Mar 6, 2024

> A type which can only represent a valid UTF-8 codepoint.

You're essentially proposing a wrapper type over dchar that would reject illegal (from Unicode point of view) values.

> No, UTF-16 surrogates are characters that can be encoded in UTF-8 and UTF-32. Although it is an error to do so.

Exactly. Those are invalid Unicode. If you need to encode data other than legal Unicode points, use raw arrays of wchar or ushort as you always have done. The proposed new types are for when your data can contain only legal Unicode and nothing else.

rikkimax Mar 6, 2024
Collaborator

Realistically, any wrapped type for Unicode is going to be quite complex to do the validation as mutation occurs if you take into account everything the standard disallows.

I haven't implemented it in my stuff, and I'm not recommending it here.

dukc Mar 6, 2024

> if you take into account everything the standard disallows.

Not really. I have read the Unicode standard, and it has a huge simplification of design for this case. That is, no sequence of legal code points is ever illegal. If the transmission format is legal, the resulting string of code points is. There are exactly two ways UTF-32 can be illegal:

it's a surrogate code point
it's outside Unicode space, meaning over U+10FFFF.

Those two, plus anything that goes wrong when translating UTF-8 or UTF-16 to UTF-32, are what make Unicode illegal. Everything else is legal. Yes, there are still a lot of senseless things outside these - think code points from the reserved-for-future-use space, unpaired country codes, null characters in the middle of a string, and such things - but none of those are illegal from the standard's POV.

Granted, we probably will not want to allow mutating individual elements of a protected UTF-8 or -16 string in place because that would be quite hairy. But concatenating two protected strings to one, or iterating over parts of the string, why not?
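The two rules above are small enough to write out directly. A sketch, with the hypothetical name `isLegalCodePoint`; Phobos already ships an equivalent check as `std.utf.isValidDchar`, which the unittest uses as a cross-check.

```d
/// The two rules from the comment above, applied to a single UTF-32 code unit.
bool isLegalCodePoint(dchar c) @safe pure nothrow @nogc
{
    immutable bool isSurrogate = c >= 0xD800 && c <= 0xDFFF;
    immutable bool inUnicodeSpace = c <= 0x10FFFF;
    return inUnicodeSpace && !isSurrogate;
}

unittest
{
    import std.utf : isValidDchar;

    assert(isLegalCodePoint('A'));
    assert(isLegalCodePoint('\0'));                  // odd in the middle of a string, but legal
    assert(!isLegalCodePoint(cast(dchar) 0xD800));   // surrogate
    assert(!isLegalCodePoint(cast(dchar) 0x110000)); // outside Unicode space

    // Agrees with the existing Phobos check at the boundaries.
    foreach (uint v; [0u, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000])
        assert(isLegalCodePoint(cast(dchar) v) == isValidDchar(cast(dchar) v));
}
```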

dukc · 2024-03-06T22:12:56Z

dukc
Mar 6, 2024

A type-level guarantee against invalid UTF would be nice per se, but I feel this is pretty little gain compared to the effort required. I don't think we should have any other work waiting for the type to happen. But it can be done, the string functions can continue to parse ranges of characters regardless of their source, including good old arrays of them.

Then we can devise a checked string type any time we wish and it can work with existing functions - even V2 ones - out of the box.

0 replies

jmdavis · 2024-03-07T09:09:37Z

jmdavis
Mar 7, 2024
Maintainer

As far as the general Phobos API goes, I don't think that a string type is particularly relevant. The vast majority of the code will be written to work on ranges of characters (and probably just ranges of char in most cases, since it will simplify things if we can just not worry about wchar or dchar except in the cases where we actually need to - like the std.utf replacement or underneath the hood with the system APIs). Some code will probably be written to operate specifically on strings (e.g. doing some of the kind of stuff that std.string currently does), but most of it can just deal with ranges of char.

It's the stuff where you actually store a string where it becomes an issue, and that doesn't come up all that often in Phobos. So, with something like that, we would need to decide on what string type to use, but it'll be the simplest to just use string in most of those cases.

That being the case, a string type that handles string comparisons and normalization and whatnot (and potentially does stuff like have small string optimizations) might be useful to have for folks who want that sort of thing, but Phobos as a whole wouldn't need to know or care. Anything involving string building would be up to the string type itself to deal with, and for the rest, you just need to be able to get a range of char from it, and it will work with pretty much everything (the main exception being the functions that do stuff with string that doesn't involve the range API - mostly string-building stuff found in std.string).

So, I don't know how good or bad an idea it is to create a new string type for Phobos (and there are certainly arguments in favor and against), but I think that it's the kind of thing that can mostly be restricted to its own module without it needing to affect the rest of Phobos, and as such, I don't think that it's terribly relevant to much of the string-handling in Phobos.
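A sketch of the separation described above: a string type only needs to hand out a range of `char`, and generic range-based code works on it without knowing the type. `MyString` and `byChar` are hypothetical names; `byCodeUnit`, `count`, and `startsWith` are existing Phobos v2 functions, used here as stand-ins for whatever the v3 equivalents turn out to be.

```d
import std.algorithm.searching : count, startsWith;
import std.utf : byCodeUnit;

/// Hypothetical string type; only the "give me a range of char" part matters here.
struct MyString
{
    private string data;

    this(string s) { data = s; }

    /// Range of char (code units), with no auto-decoding to dchar.
    auto byChar() { return data.byCodeUnit; }
}

unittest
{
    import std.range.primitives : ElementType;
    import std.traits : Unqual;

    auto s = MyString("hello, world");
    static assert(is(Unqual!(ElementType!(typeof(s.byChar()))) == char));

    // Generic algorithms neither know nor care about MyString itself.
    assert(s.byChar.count('o') == 2);
    assert(s.byChar.startsWith('h'));
}
```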

0 replies
