Editorial: revamp the way we deal with code points and bytes #247

Draft
wants to merge 2 commits into
base: main
from

Conversation

annevk
Member

@annevk annevk commented Nov 2, 2020

This is WIP, mainly since I'm a little unsure we want to go this far, but I also kinda like it.

@andreubotella @ricea @domenic @aphillips thoughts?


@domenic
Member

domenic commented Nov 2, 2020

Seems reasonable to me.

@ricea
Collaborator

ricea commented Nov 2, 2020

Am I understanding correctly that the purpose is to disambiguate code-point and byte conversions?

If so, my concern is that the extra rigor creates extra opportunities for errors, and may not be pulling its weight.

However, if you prefer it, that's good enough for me.

Contributor

@aphillips aphillips left a comment

I like this generally as a direction, but made some "food for thought" comments below.

@@ -1915,7 +1916,7 @@ constructor steps are:
<p class=note>{{DOMString}}, as well as an <a for=/>I/O queue</a> of code units rather than scalar
values, are used here so that a surrogate pair that is split between chunks can be reassembled into
the appropriate scalar value. The behavior is otherwise identical to {{USVString}}. In particular,
-lone surrogates will be replaced with U+FFFD.
+lone surrogates will be replaced with U+FFFD (�).
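The chunk-boundary behaviour this note describes can be sketched roughly as follows. This is a minimal Python illustration, not the standard's algorithm: the class name `CodeUnitToScalar` and its `push`/`flush` methods are hypothetical, and each chunk is modelled as a list of 16-bit code unit integers.

```python
from typing import List, Optional

REPLACEMENT = 0xFFFD


class CodeUnitToScalar:
    """Hypothetical helper: turns chunks of UTF-16 code units into scalar values."""

    def __init__(self) -> None:
        # A lead surrogate seen earlier, still waiting for its trail surrogate.
        self.pending_lead: Optional[int] = None

    def push(self, chunk: List[int]) -> List[int]:
        out: List[int] = []
        for unit in chunk:
            if self.pending_lead is not None:
                lead, self.pending_lead = self.pending_lead, None
                if 0xDC00 <= unit <= 0xDFFF:
                    # Lead + trail: reassemble the supplementary scalar value.
                    out.append(0x10000 + ((lead - 0xD800) << 10) + (unit - 0xDC00))
                    continue
                # The lead was not followed by a trail, so it is a lone surrogate.
                out.append(REPLACEMENT)
            if 0xD800 <= unit <= 0xDBFF:
                self.pending_lead = unit  # hold until the next code unit arrives
            elif 0xDC00 <= unit <= 0xDFFF:
                out.append(REPLACEMENT)   # trail surrogate with no preceding lead
            else:
                out.append(unit)
        return out

    def flush(self) -> List[int]:
        # End of input: a still-pending lead surrogate is a lone surrogate.
        if self.pending_lead is None:
            return []
        self.pending_lead = None
        return [REPLACEMENT]
```

Holding the lead surrogate across `push` calls is what requires code units rather than scalar values at the input: a chunk ending in 0xD83D followed by a chunk starting with 0xDE00 yields the single scalar value 0x1F600 instead of two replacement characters.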
Contributor

In Charmod we often followed the convention:

� [U+FFFD REPLACEMENT CHARACTER]

(with the [U+xxxx character name] part styled distinctly). I say "often" because I willfully ignored the convention whenever it reduced clarity, particularly with long sequences used in this or that example. For examples like this, you might consider something similar, since it makes the text unambiguous?

OTOH, I find this pretty clear and am not sure that the charmod style adds that much. I like quoting the character like this when it's printable.

Member Author

We made up our own convention in https://infra.spec.whatwg.org/#code-points since we found the one in Charmod a bit too verbose, iirc.


-<li><p>If <var>byte</var> is an <a>ASCII byte</a>, return
-a code point whose value is <var>byte</var>.
+<li><p>Let <var>byteValue</var> be <var>byte</var>'s <a for=byte>value</a>.
Contributor

Is byteValue really needed, versus just saying things like:

If byte is an ASCII byte, then return a code point whose value is byte's value.

I realize that "code point's value" is a different integer type than "byte's value", but we mean the number in any case.

<a for="code point">value</a> is <var>byteValue</var>.

<li><p>Return a <a>code point</a> whose <a for="code point">value</a> is
0xF780 + <var>byteValue</var> &minus; 0x80.
Contributor

I see the problem. You don't want prose here. But can't we just say 0xF780 + byte - 0x80?

Is there a reason I'm not seeing for why we don't just make the number 0xF700? Is the reason to emphasize that we're trying to get to/from bytes >= 0x80?
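On the arithmetic itself: the two forms give the same number, since 0xF780 − 0x80 = 0xF700. A quick Python check, treating bytes and code points as plain integers (the function names are hypothetical and used only for illustration):

```python
def decode_as_written(byte: int) -> int:
    # Form used in the diff: offset measured from the first non-ASCII byte, 0x80.
    return 0xF780 + byte - 0x80


def decode_folded(byte: int) -> int:
    # Same formula with the two constants folded into one.
    return 0xF700 + byte


# Every non-ASCII byte maps to the same code point value under both forms.
assert all(decode_as_written(b) == decode_folded(b) for b in range(0x80, 0x100))
print(hex(decode_as_written(0x80)), hex(decode_as_written(0xFF)))  # 0xf780 0xf7ff
```

The unfolded form keeps both endpoints visible (byte 0x80 lands on U+F780, byte 0xFF on U+F7FF), which may be the reason it is written that way.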

Member Author

@annevk annevk Nov 3, 2020

We've had some cases where we want to distinguish bytes from numbers. So the question is whether we want to do that here as well. And I guess in some sense we do since we want to return code points or bytes, but a lot of the calculations are on numbers.

I think we could use byte in the calculation directly (as we already did), but it wouldn't really be logically consistent with how we talk about bytes and numbers elsewhere in the web platform.

(I guess another way would be that we say that in equations they are cast to their value.)

Member

We could define implicit conversions code point → number and byte → number (whatwg/infra#319) and perhaps the other way around too. But even if we don't, we could use short algorithmic phrases inside the formula: "0xF780 + (byte's value) − 0x80".

There are other formulas in the standard that use byte or code point values directly, though, and they should be changed accordingly. (Interestingly, there are formulas dealing with code units around TextEncoder and TextEncoderStream, which don't have this problem because code units seem to be defined directly as a number type.)
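To make the distinction under discussion concrete, here is a rough Python sketch in which bytes and code points are distinct types that must be unwrapped to a number before any arithmetic. The `Byte` and `CodePoint` classes and the `decode_byte` function are hypothetical stand-ins for the Infra concepts, not anything defined by either standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Byte:
    value: int  # 0x00 to 0xFF

    def __post_init__(self) -> None:
        if not 0x00 <= self.value <= 0xFF:
            raise ValueError("byte value out of range")


@dataclass(frozen=True)
class CodePoint:
    value: int  # 0x0000 to 0x10FFFF

    def __post_init__(self) -> None:
        if not 0x0000 <= self.value <= 0x10FFFF:
            raise ValueError("code point value out of range")


def decode_byte(byte: Byte) -> CodePoint:
    # "Let byteValue be byte's value": the conversion to a number is explicit.
    byte_value = byte.value
    if byte_value <= 0x7F:  # ASCII byte
        return CodePoint(byte_value)
    # "0xF780 + (byte's value) - 0x80"
    return CodePoint(0xF780 + byte_value - 0x80)
```

Under this modelling, writing `0xF780 + byte - 0x80` with a `Byte` directly raises a `TypeError`, so the formula only works once the value is extracted or an implicit conversion is defined.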

Member

FWIW I intuitively like making the code point <—> byte/number conversions explicit, and don't see as much of a need for distinguishing bytes and numbers. (I'd be OK defining bytes as a subtype of numbers, if we ever make progress on defining numbers.)

-<li><p>If <var>code point</var> is in the range U+F780 to U+F7FF, inclusive, return
-a byte whose value is <var>code point</var> &minus; 0xF780 + 0x80.
+<li><p>If <var>codePointValue</var> is in the range 0xF780 to 0xF7FF, inclusive, then return a
+<a>byte</a> whose <a for=byte>value</a> is <var>codePointValue</var> &minus; 0xF780 + 0x80.
Contributor

And so on.
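The encoder step in this hunk is the inverse of the decoder step in the earlier hunk. A small Python sketch using plain integers for byte and code point values: the function names are hypothetical, the ASCII branch of `encode` is an assumption that mirrors the decoder, and returning `None` is a placeholder for the error path, which this hunk does not show.

```python
from typing import Optional


def decode(byte_value: int) -> int:
    # Decoder side: ASCII bytes map to themselves,
    # other bytes map into the range 0xF780 to 0xF7FF.
    if byte_value <= 0x7F:
        return byte_value
    return 0xF780 + byte_value - 0x80


def encode(code_point_value: int) -> Optional[int]:
    # Assumption: ASCII code points map to the same byte value, mirroring decode().
    if code_point_value <= 0x7F:
        return code_point_value
    # From the hunk: 0xF780 to 0xF7FF, inclusive, map back to bytes 0x80 to 0xFF.
    if 0xF780 <= code_point_value <= 0xF7FF:
        return code_point_value - 0xF780 + 0x80
    # Placeholder for the error path, which is not part of this hunk.
    return None


# Every byte value survives a decode/encode round trip.
assert all(encode(decode(b)) == b for b in range(0x100))
```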

Base automatically changed from master to main January 15, 2021 07:36

5 participants