-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String handling notes #1
Conversation
specification.md
Outdated
(`'-,.;:#%&!?/()[]`) and the space character. Passing a value that isn't a valid Unicode codepoint (`0x000000` to | ||
`0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error. | ||
The `bufchar` instruction appends a single Unicode character to the message buffer. Passing a value that isn't a valid | ||
Unicode codepoint (`0x000000` to `0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error; values above `0x1fffff` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A surrogate is a valid code point. You're thinking of Unicode scalar values. I know it's been here before, but if you're going over this part anyway, you might want to fix this as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll probably write this in a more wordy way since I'm sure most people aren't even aware of the valid range of codepoints, let alone which ones are surrogates. But thanks for the correction.
specification.md
Outdated
|
||
Valid strings in the BSP itself must be in UTF-8 format, as specified by [RFC 3629][rfc3629], regardless of the | ||
effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones | ||
such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any UTF-8 encoding error is recoverable, you just substitute U+FFFD for the invalid byte. Conversely, just because there's an 'obvious' thing you can do to 'fix' overlong/surrogate encodings doesn't make them any more 'valid'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, do you suggest I simply remove the word "recoverable" from the writing?
specification.md
Outdated
effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones | ||
such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded). | ||
|
||
Despite the engine must accept any valid UTF-8 string, it isn't required to be able to effectively display any Unicode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Despite' is followed by a noun phrase. 'Although' (or 'despite that') would be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I've seen "despite" followed by non-noun phrases (e.g., "despite the door was locked, they could open it"), I'll admit my choice of wording is less than perfect here. Maybe I'm just imagining things; I'm not a native speaker, after all.
Will fix.
|
||
The engine may enforce a similar limit on the number of bytes and/or characters to be printed by a single `print` | ||
instruction, as well as a maximum length for option labels for the `menu` instruction. Any text exceeding these limits | ||
must be silently truncated as given in the previous paragraph. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You haven't specified a minimum number of characters an engine must be able to handle; theoretically, even an implementation that ignores all attempts to print anything will be compliant. Not sure if that was intentional.
Also, silent truncation may not be such a great idea either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I did think about adding a minimum number of characters, ultimately I couldn't come up with a valid number: if I require N characters to be printed, would a patch using N+1 characters be valid? Why not require 2N or N/2 instead?
I'll elaborate on this later.
db 0xa0, 0 | ||
``` | ||
|
||
may either append a `0x0000a0` codepoint to the message or cause a fatal error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part about encoding errors seems self-contradictory. On one hand, you write 'Valid strings in the BSP itself must be in UTF-8 format' and 'Any UTF-8 decoding errors must be treated as fatal', but on the other, you want to allow valid UTF-8 strings to be built by concatenating fragmentary encodings. However, a fragmentary encoding is an error like any other.
I assume the intention is to allow both 'strict' implementations that validate strings before appending them to the buffer and 'lax' ones that maintain the buffer as a plain array of bytes and only validate its encoding upon an attempt to print it out. However, by merely allowing the latter, you will be effectively mandating it, since this will allow patch scripts to be written that will only successfully run under a 'lax' implementation. Thus 'lax' implementations will be more interoperable and therefore more popular, pressuring 'strict' implementations to adopt 'lax' behaviour as well. This is the same mistake that made HTML the bloated mess it is today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the issue that took me the longest to consider. The point seems valid, though, so I'll fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this again, I'm not sure if it's worth burdening implementations with validating every part of the buffer?
Considering I can't see valid patches attempt to do this for any good reason, is it really worth it? Would any reasonable tool generate a patch that only runs in a lax implementation?
More on the string size limit: First of all, while silent truncation might not be the best idea, what are the options? A fatal error seems excessive. Also, in terms of not requiring a specific minimum, I simply couldn't choose a meaningful value. We're used to 80 for terminals, but that's completely arbitrary and doesn't make much sense outside of a terminal. And while zero doesn't seem like a good limit, how much good would it do to require the limit to be at least one or two? In the end, I expect a result similar to what you described for strict/lax implementations (regarding fragmentary UTF-8 encodings), where implementations with useful limits simply push out unreasonable ones. The ideal limit is no limit, after all; the spec simply allows for implementation-defined limits so it will be usable in resource-constrained environments, where no unlimited implementations will ever be available anyway. |
The `bufchar` instruction appends a single Unicode character to the message buffer. The value passed to the `bufchar` | ||
instruction as an argument must represent a valid non-surrogate Unicode codepoint (i.e., it must be between `0x000000` | ||
and `0x00d7ff`, or `0x00e000` and `0x10ffff`); passing a value outside of those ranges is a fatal error. Values above | ||
`0x1fffff` are reserved for further versions of the specification. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tangential musing: currently BSP lacks a dedicated opcode guaranteed to generate a fatal error under any circumstances in all future versions of the specification. Before I read this, I thought bufchar 0xdeadc0de
would be a good candidate for such a thing.
Why even have one? For catching assertion failures, for one thing. (Compare __builtin_trap
.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Division by zero is guaranteed to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if someone implements an interpreter in Pony?
More seriously though, bufchar
with an invalid USV feels cleaner, as it doesn't have to choose any 'output' register. But there's another reason I'm not fond of stuffing useful behaviours into invalid cases, which I'm going to elaborate on when I finally finish that email.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of divide #0, 0, 0
, which is a valid instruction. Also, what's Pony?
EDIT: I see what you mean about output registers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about bufchar 0xdead
? It's a guaranteed fatal error because it's a surrogate codepoint, and the Unicode standard guarantees it will always be one.
Failure to display the entire message to the user may entail a failure to convey vital information, as in my example. Bailing out entirely is not that an unreasonable response. You're correct that any choice of a minimum buffer length is ultimately arbitrary. But some arbitrary choices are better than others. 80 characters is not only the standard terminal width, it's also about the length of an average English sentence (judging by cursory Web searches about the topic). I think it's a perfectly reasonable choice. If you aren't comfortable with outright mandating it as a minimum length, you can always make it a SHOULD-clause instead of a MUST-clause. |
That seems like a reasonable idea, although I'd rather introduce RFC 2119 (and switch the entire spec to it) in a later pull. |
It's been nearly a month since this had any activity and I kinda forgot, so it's a good time to merge it. I'll introduce RFC 2119 in a later version. |
This pull adds several comments regarding string handling to the specification. Namely, it addresses the following:
bufchar
above0x1fffff
are reserved and may actually have some purpose in future versions of the spec (since they are completely outside the range of Unicode)This will most likely be version 0.6.0 once I merge it; the version number isn't assigned yet to allow for changes before this reaches master.