String handling notes #1

aaaaaa123456789 · 2018-08-22T09:30:05Z

This pull adds several comments regarding string handling to the specification. Namely, it addresses the following:

Forbids the use of invalid UTF-8 strings, requiring a fatal error in all cases
Actually points to the UTF-8 spec (i.e., the RFC) instead of leaving it implied
Specifies the acceptable behaviors of valid UTF-8 strings: what characters must be supported by the engine, and which ones can be substituted
Requires that any valid UTF-8 string is accepted, including those containing control characters (despite those control characters need not be actually displayed)
Elaborates on character/byte limits for strings, detailing how an engine that imposes such limits must behave
Adds a few notes that address corner cases regarding invalid UTF-8 strings
Indicates that arguments to bufchar above 0x1fffff are reserved and may actually have some purpose in future versions of the spec (since they are completely outside the range of Unicode)

This will most likely be version 0.6.0 once I merge it; the version number isn't assigned yet to allow for changes before this reaches master.

… that point to it

fstirlitz · 2018-08-27T06:22:11Z

specification.md

-(`'-,.;:#%&!?/()[]`) and the space character. Passing a value that isn't a valid Unicode codepoint (`0x000000` to 
-`0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error.
+The `bufchar` instruction appends a single Unicode character to the message buffer. Passing a value that isn't a valid
+Unicode codepoint (`0x000000` to `0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error; values above `0x1fffff`


A surrogate is a valid code point. You're thinking of Unicode scalar values. I know it's been here before, but if you're going over this part anyway, you might want to fix this as well.

I'll probably write this in a more wordy way since I'm sure most people aren't even aware of the valid range of codepoints, let alone which ones are surrogates. But thanks for the correction.

fstirlitz · 2018-08-27T06:22:39Z

specification.md

+
+Valid strings in the BSP itself must be in UTF-8 format, as specified by [RFC 3629][rfc3629], regardless of the
+effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones
+such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded).


Any UTF-8 encoding error is recoverable, you just substitute U+FFFD for the invalid byte. Conversely, just because there's an 'obvious' thing you can do to 'fix' overlong/surrogate encodings doesn't make them any more 'valid'.

So, do you suggest I simply remove the word "recoverable" from the writing?

fstirlitz · 2018-08-27T06:22:59Z

specification.md

+effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones
+such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded).
+
+Despite the engine must accept any valid UTF-8 string, it isn't required to be able to effectively display any Unicode


'Despite' is followed by a noun phrase. 'Although' (or 'despite that') would be better.

While I've seen "despite" followed by non-noun phrases (e.g., "despite the door was locked, they could open it"), I'll admit my choice of wording is less than perfect here. Maybe I'm just imagining things; I'm not a native speaker, after all.

Will fix.

fstirlitz · 2018-08-27T06:23:16Z

specification.md

+
+The engine may enforce a similar limit on the number of bytes and/or characters to be printed by a single `print`
+instruction, as well as a maximum length for option labels for the `menu` instruction. Any text exceeding these limits
+must be silently truncated as given in the previous paragraph.


You haven't specified a minimum number of characters an engine must be able to handle; theoretically, even an implementation that ignores all attempts to print anything will be compliant. Not sure if that was intentional.

Also, silent truncation may not be such a great idea either.

While I did think about adding a minimum number of characters, ultimately I couldn't come up with a valid number: if I require N characters to be printed, would a patch using N+1 characters be valid? Why not require 2N or N/2 instead?

I'll elaborate on this later.

fstirlitz · 2018-08-27T06:24:04Z

specification.md

+	db 0xa0, 0
+```
+
+may either append a `0x0000a0` codepoint to the message or cause a fatal error.


The part about encoding errors seems self-contradictory. On one hand, you write 'Valid strings in the BSP itself must be in UTF-8 format' and 'Any UTF-8 decoding errors must be treated as fatal', but on the other, you want to allow valid UTF-8 strings to be built by concatenating fragmentary encodings. However, a fragmentary encoding is an error like any other.

I assume the intention is to allow both 'strict' implementations that validate strings before appending them to the buffer and 'lax' ones that maintain the buffer as a plain array of bytes and only validate its encoding upon an attempt to print it out. However, by merely allowing the latter, you will be effectively mandating it, since this will allow patch scripts to be written that will only successfully run under a 'lax' implementation. Thus 'lax' implementations will be more interoperable and therefore more popular, pressuring 'strict' implementations to adopt 'lax' behaviour as well. This is the same mistake that made HTML the bloated mess it is today.

This is probably the issue that took me the longest to consider. The point seems valid, though, so I'll fix it.

Thinking about this again, I'm not sure if it's worth burdening implementations with validating every part of the buffer?

Considering I can't see valid patches attempt to do this for any good reason, is it really worth it? Would any reasonable tool generate a patch that only runs in a lax implementation?

aaaaaa123456789 · 2018-08-27T13:36:09Z

More on the string size limit:

First of all, while silent truncation might not be the best idea, what are the options? A fatal error seems excessive.

Also, in terms of not requiring a specific minimum, I simply couldn't choose a meaningful value. We're used to 80 for terminals, but that's completely arbitrary and doesn't make much sense outside of a terminal. And while zero doesn't seem like a good limit, how much good would it do to require the limit to be at least one or two?

In the end, I expect a result similar to what you described for strict/lax implementations (regarding fragmentary UTF-8 encodings), where implementations with useful limits simply push out unreasonable ones. The ideal limit is no limit, after all; the spec simply allows for implementation-defined limits so it will be usable in resource-constrained environments, where no unlimited implementations will ever be available anyway.

fstirlitz · 2018-08-28T18:50:57Z

specification.md

+The `bufchar` instruction appends a single Unicode character to the message buffer. The value passed to the `bufchar`
+instruction as an argument must represent a valid non-surrogate Unicode codepoint (i.e., it must be between `0x000000`
+and `0x00d7ff`, or `0x00e000` and `0x10ffff`); passing a value outside of those ranges is a fatal error. Values above
+`0x1fffff` are reserved for further versions of the specification.


Tangential musing: currently BSP lacks a dedicated opcode guaranteed to generate a fatal error under any circumstances in all future versions of the specification. Before I read this, I thought bufchar 0xdeadc0de would be a good candidate for such a thing.

Why even have one? For catching assertion failures, for one thing. (Compare __builtin_trap.)

Division by zero is guaranteed to fail.

What if someone implements an interpreter in Pony?

More seriously though, bufchar with an invalid USV feels cleaner, as it doesn't have to choose any 'output' register. But there's another reason I'm not fond of stuffing useful behaviours into invalid cases, which I'm going to elaborate on when I finally finish that email.

I was thinking of divide #0, 0, 0, which is a valid instruction. Also, what's Pony?

EDIT: I see what you mean about output registers.

How about bufchar 0xdead? It's a guaranteed fatal error because it's a surrogate codepoint, and the Unicode standard guarantees it will always be one.

fstirlitz · 2018-08-28T18:51:27Z

Failure to display the entire message to the user may entail a failure to convey vital information, as in my example. Bailing out entirely is not that an unreasonable response.

You're correct that any choice of a minimum buffer length is ultimately arbitrary. But some arbitrary choices are better than others. 80 characters is not only the standard terminal width, it's also about the length of an average English sentence (judging by cursory Web searches about the topic). I think it's a perfectly reasonable choice.

If you aren't comfortable with outright mandating it as a minimum length, you can always make it a SHOULD-clause instead of a MUST-clause.

aaaaaa123456789 · 2018-08-31T07:06:46Z

That seems like a reasonable idea, although I'd rather introduce RFC 2119 (and switch the entire spec to it) in a later pull.

aaaaaa123456789 · 2018-09-24T23:59:28Z

It's been nearly a month since this had any activity and I kinda forgot, so it's a good time to merge it. I'll introduce RFC 2119 in a later version.

aaaaaa123456789 added 5 commits August 22, 2018 05:11

Added 'forked off' note to version in specification

b925b89

Initial string handling section

3edaee5

Point to the new section and simplify the description of instructions…

63e2cfe

… that point to it

Add a note regarding control characters

1e2bf7f

Another note for the menu instruction

3a9e84f

aaaaaa123456789 mentioned this pull request Aug 22, 2018

Missing tests fstirlitz/c-bsp#2

Open

8 tasks

fstirlitz reviewed Aug 27, 2018

View reviewed changes

Fix various writing errors and Unicode inaccuracies

d6d9e4a

fstirlitz reviewed Aug 28, 2018

View reviewed changes

Assign version number and date

4224b8e

aaaaaa123456789 merged commit d295958 into master Sep 25, 2018

aaaaaa123456789 deleted the string-handling-notes branch September 25, 2018 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

String handling notes #1

String handling notes #1

aaaaaa123456789 commented Aug 22, 2018

fstirlitz Aug 27, 2018

aaaaaa123456789 Aug 27, 2018

fstirlitz Aug 27, 2018

aaaaaa123456789 Aug 27, 2018

fstirlitz Aug 27, 2018

aaaaaa123456789 Aug 27, 2018

fstirlitz Aug 27, 2018

aaaaaa123456789 Aug 27, 2018

fstirlitz Aug 27, 2018

aaaaaa123456789 Aug 27, 2018

aaaaaa123456789 Aug 28, 2018

aaaaaa123456789 commented Aug 27, 2018 •

edited

Loading

fstirlitz Aug 28, 2018

aaaaaa123456789 Aug 28, 2018

fstirlitz Aug 28, 2018 •

edited

Loading

aaaaaa123456789 Aug 28, 2018 •

edited

Loading

aaaaaa123456789 Aug 28, 2018

fstirlitz commented Aug 28, 2018

aaaaaa123456789 commented Aug 31, 2018

aaaaaa123456789 commented Sep 24, 2018

String handling notes #1

String handling notes #1

Conversation

aaaaaa123456789 commented Aug 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aaaaaa123456789 commented Aug 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fstirlitz Aug 28, 2018 • edited Loading

Choose a reason for hiding this comment

aaaaaa123456789 Aug 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fstirlitz commented Aug 28, 2018

aaaaaa123456789 commented Aug 31, 2018

aaaaaa123456789 commented Sep 24, 2018

aaaaaa123456789 commented Aug 27, 2018 •

edited

Loading

fstirlitz Aug 28, 2018 •

edited

Loading

aaaaaa123456789 Aug 28, 2018 •

edited

Loading