String handling notes #1
Changes from 5 commits
b925b89
3edaee5
63e2cfe
1e2bf7f
3a9e84f
d6d9e4a
4224b8e
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,13 @@ | ||
# BSP specification | ||
|
||
([version](versions.md): 0.5.2, rev. 37) | ||
(forked off [version](versions.md): 0.5.2, rev. 37) | ||
|
||
* [Introduction](#introduction) | ||
* [Execution model](#execution-model) | ||
* [Opcodes](#opcodes) | ||
* [Instruction set](#instruction-set) | ||
* [Instruction description](#instruction-description) | ||
* [String handling](#string-handling) | ||
|
||
## Introduction | ||
|
||
|
@@ -671,12 +672,7 @@ This document does not specify how the engine will display the message; however, | |
(or an environment that behaves in a similar fashion), it is recommended that the engine prints a newline character | ||
after the message. | ||
|
||
If the message is not valid UTF-8, the engine may choose to display the message anyway (handling the invalid characters | ||
in any way it can) or to treat it as a fatal error. | ||
|
||
An engine incapable of handling the full Unicode character set may choose to use a reduced character set and replace | ||
the remaining characters with a suitable substitution character; however, an engine must at least support the Latin | ||
letters (A-Z, a-z), digits (0-9), spaces, and the following punctuation characters: `'-,.;:#%&!?/()[]`. | ||
Further considerations regarding message strings are given in the [String handling](#string-handling) section. | ||
|
||
### Manipulating the message buffer | ||
|
||
|
@@ -693,11 +689,9 @@ The first three instructions concatenate data at the end of the message buffer. | |
The `bufstring` instruction concatenates a string (in the same format as for the `print` instruction) at the end of | ||
the message buffer. No separator is inserted before or after the string. | ||
|
||
The `bufchar` instruction appends a single Unicode character to the message buffer. An engine incapable of handling the | ||
full Unicode character set may choose to use a reduced character set and replace the remaining characters with suitable | ||
substitutes; it must however support at least the letters (A-Z, a-z), numbers (0-9), basic punctuation characters | ||
(`'-,.;:#%&!?/()[]`) and the space character. Passing a value that isn't a valid Unicode codepoint (`0x000000` to | ||
`0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error. | ||
The `bufchar` instruction appends a single Unicode character to the message buffer. Passing a value that isn't a valid | ||
Unicode codepoint (`0x000000` to `0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error; values above `0x1fffff` | ||
are reserved for further versions of the specification. | ||
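
Illustration (not part of the diff): the accepted `bufchar` ranges quoted above can be written as a small check. This is only a sketch in Python; the function name `is_valid_bufchar_argument` is hypothetical, and treating the reserved values above `0x1fffff` as invalid for now is an assumption rather than something the text states.

```python
def is_valid_bufchar_argument(value: int) -> bool:
    """Return True if value lies in one of the two ranges the text accepts.

    Surrogates (0xd800 to 0xdfff) and anything above 0x10ffff are rejected.
    Assumption: reserved values above 0x1fffff are rejected like any other
    out-of-range value.
    """
    return 0x000000 <= value <= 0x00D7FF or 0x00E000 <= value <= 0x10FFFF
```

An engine following the text would raise its fatal error whenever this check returns `False`.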
|
||
The `bufnumber` instruction appends the decimal representation of a number to the message buffer. The number is | ||
treated as a 32-bit unsigned value and converted to decimal, and printed using the regular digit characters (0-9, | ||
|
@@ -708,6 +702,8 @@ The `printbuf` instruction prints the contents of the message buffer as a messag | |
`print` instruction) and clears the buffer, resetting it to the empty string. The `clearbuf` instruction resets the | ||
message buffer to the empty string without printing it. | ||
|
||
Further considerations regarding the message buffer are given in the [String handling](#string-handling) section. | ||
|
||
### Option menus | ||
|
||
``` | ||
|
@@ -744,6 +740,9 @@ Options: | |
If the list of pointers is empty (i.e., if the first pointer is `0xffffffff`), no menu is shown to the user, and the | ||
variable is set to `0xffffffff`. | ||
|
||
Further considerations regarding the text used as option labels are given in the [String handling](#string-handling) | ||
section. | ||
|
||
Note that a menu with just one option must still be shown to the user, as it is possible to use such a menu to give the | ||
user the possibility of aborting the process by stopping the BSP engine. | ||
|
||
|
@@ -1086,3 +1085,66 @@ child to the parent. | |
If the child's execution triggers a fatal error, this fatal error must be propagated to the parent; in other words, | ||
a fatal error at any depth must halt the whole engine. Execution of the parent must **not** be resumed after a fatal | ||
error occurs in the child. | ||
|
||
## String handling | ||
|
||
Several instructions in this specification deal with strings — namely, the [`print`][print] and [`menu`][menu] | ||
instructions, as well as [those that manipulate the message buffer][msgbuffer]. This section specifies how the engine | ||
must behave when handling strings, and which part of the functionality is implementation-dependent. | ||
|
||
Valid strings in the BSP itself must be in UTF-8 format, as specified by [RFC 3629][rfc3629], regardless of the | ||
effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones | ||
such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded). | ||
Any UTF-8 encoding error is recoverable, you just substitute U+FFFD for the invalid byte. Conversely, just because there's an 'obvious' thing you can do to 'fix' overlong/surrogate encodings doesn't make them any more 'valid'.

So, do you suggest I simply remove the word "recoverable" from the writing?
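
Illustration (not part of the diff): the two behaviours discussed here, treating every decoding error as fatal versus substituting U+FFFD, can be compared with Python's built-in UTF-8 codec, which rejects overlong encodings and encoded surrogates in strict mode. The names `FatalError`, `decode_bsp_string` and `is_fatal` are hypothetical and exist only for this sketch.

```python
class FatalError(Exception):
    """Stands in for the engine's fatal error (hypothetical name)."""


def decode_bsp_string(raw: bytes) -> str:
    """Decode strictly, turning any UTF-8 decoding error into a fatal error."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as error:
        raise FatalError(f"invalid UTF-8: {error.reason}") from error


def is_fatal(raw: bytes) -> bool:
    """Return True if strict decoding of raw would be a fatal error."""
    try:
        decode_bsp_string(raw)
    except FatalError:
        return True
    return False


OVERLONG_NUL = b"\xc0\x80"         # overlong encoding of U+0000
ENCODED_SURROGATE = b"\xed\xa0\x80"  # UTF-8-style encoding of U+D800

# The lenient alternative mentioned in the review substitutes U+FFFD instead.
lenient = OVERLONG_NUL.decode("utf-8", errors="replace")
```

Under the strict rule both sample byte strings are fatal; the lenient decode yields replacement characters instead of failing.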
||
|
||
Despite the engine must accept any valid UTF-8 string, it isn't required to be able to effectively display any Unicode | ||
'Despite' is followed by a noun phrase. 'Although' (or 'despite that') would be better.

While I've seen "despite" followed by non-noun phrases (e.g., "despite the door was locked, they could open it"), I'll admit my choice of wording is less than perfect here. Maybe I'm just imagining things; I'm not a native speaker, after all. Will fix.
||
character; an engine incapable of handling the full Unicode character set may choose to use a reduced one and replace | ||
characters not in its reduced set with zero or more suitable substitution characters. However, an engine is required | ||
to support at least Latin letters (A-Z, a-z), digits (0-9), spaces, and the following punctuation characters: | ||
`'-,.;:#%&!?/()[]`. All of these characters are encoded as single UTF-8 bytes, and belong to the following ranges: | ||
`0x20` - `0x21`, `0x23`, `0x25` - `0x29`, `0x2c` - `0x3b`, `0x3f`, `0x41` - `0x5b`, `0x5d`, and `0x61` - `0x7a`. | ||
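
Illustration (not part of the diff): the byte ranges listed above can be checked against the character list in the same paragraph. The Python sketch below uses hypothetical names, and the `"?"` substitute is an assumption; the text allows zero or more substitution characters of the engine's choosing.

```python
import string

# Inclusive ranges exactly as listed in the paragraph above.
REQUIRED_RANGES = [
    (0x20, 0x21), (0x23, 0x23), (0x25, 0x29), (0x2C, 0x3B),
    (0x3F, 0x3F), (0x41, 0x5B), (0x5D, 0x5D), (0x61, 0x7A),
]

# The same set spelled out as characters: letters, digits, space, punctuation.
REQUIRED_TEXT = string.ascii_letters + string.digits + " " + "'-,.;:#%&!?/()[]"


def is_required_character(codepoint: int) -> bool:
    """Return True if every engine must be able to display this codepoint."""
    return any(low <= codepoint <= high for low, high in REQUIRED_RANGES)


def substitute_unsupported(text: str, replacement: str = "?") -> str:
    """Sketch of a minimal engine: keep required characters, replace the rest."""
    return "".join(
        ch if is_required_character(ord(ch)) else replacement for ch in text
    )
```

Expanding the ranges gives the same 79 characters as the list, so the two descriptions in the paragraph agree.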
|
||
Control characters in strings must be accepted, as they are valid UTF-8 characters; they are also valid arguments to | ||
the `bufchar` instruction. (In particular, `0` is a valid argument to `bufchar`, and therefore must not be treated as | ||
a string terminator in that context.) However, since they are not in the ranges listed in the previous paragraph, | ||
engines are not required to support them; control characters may be ignored (i.e., substituted by nothing) when the | ||
string (or the message buffer) is displayed to the user, or handled in any other appropriate way. | ||
|
||
The engine may enforce a limit on the number of bytes and/or characters that the message buffer can accept; this limit | ||
may also be dynamically determined during execution. If such a limit is enforced, characters and/or bytes in excess | ||
must be silently discarded without error; the engine must take care to discard multibyte characters as a whole, and | ||
not only some of their bytes. (For instance, if the last character to be added to the buffer is codepoint `0x0000a0`, | ||
encoded as `0xc2` `0xa0`, the engine may keep both bytes or discard them both, but it must not discard just the last | ||
byte.) If any of the instructions that append to the message buffer (i.e., `bufstring`, `bufchar` or `bufnumber`) | ||
cause some data to be silently discarded due to the buffer being full, any further such instructions must be silently | ||
ignored (i.e., wholly discarded) until the buffer is cleared via the `printbuf` or `clearbuf` instructions. | ||
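
Illustration (not part of the diff): one way to read the truncation rules above is a byte-limited buffer that drops characters whole and then ignores further appends until it is cleared. This is a sketch only; the class and method names are hypothetical, the limit is fixed rather than dynamic, and the input is assumed to be already-validated text.

```python
class MessageBuffer:
    """Byte-limited message buffer (illustrative; names are hypothetical)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.data = bytearray()
        self.overflowed = False

    def append(self, text: str) -> None:
        """Append text, discarding whatever does not fit."""
        # Once anything has been discarded, later appends are ignored whole.
        if self.overflowed:
            return
        for character in text:
            encoded = character.encode("utf-8")
            if len(self.data) + len(encoded) > self.max_bytes:
                # Drop this character and the rest of the text; a multibyte
                # character is never split.
                self.overflowed = True
                return
            self.data += encoded

    def clear(self) -> None:
        """Reset the buffer, like clearbuf."""
        self.data.clear()
        self.overflowed = False

    def take(self) -> str:
        """Return the contents and reset the buffer, like printbuf."""
        message = self.data.decode("utf-8")
        self.clear()
        return message
```

With a 4-byte limit, appending `"abc"` and then U+00A0 (two bytes) leaves only `abc` in the buffer, and a following one-byte append is ignored until the buffer is cleared.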
|
||
The engine may enforce a similar limit on the number of bytes and/or characters to be printed by a single `print` | ||
instruction, as well as a maximum length for option labels for the `menu` instruction. Any text exceeding these limits | ||
must be silently truncated as given in the previous paragraph. | ||
You haven't specified a minimum number of characters an engine must be able to handle; theoretically, even an implementation that ignores all attempts to print anything will be compliant. Not sure if that was intentional. Also, silent truncation may not be such a great idea either.

While I did think about adding a minimum number of characters, ultimately I couldn't come up with a valid number: if I require N characters to be printed, would a patch using N+1 characters be valid? Why not require 2N or N/2 instead? I'll elaborate on this later.
||
|
||
Invalid UTF-8 strings given as arguments to the `bufstring` instruction cause a fatal error; this error may occur at | ||
the time of executing that instruction, or when executing any further instruction that manipulates the message buffer, | ||
up to the point where the message buffer is printed via the `printbuf` instruction. If the message buffer is never | ||
printed, the error may occur up to the point where the message buffer is cleared (either via the `clearbuf` | ||
instruction or due to terminating execution) or not at all; this is implementation-defined. | ||
|
||
Multibyte UTF-8 characters appended to the message buffer via `bufstring` instructions must be fully contained within | ||
a single string; if two or more consecutive instructions append parts of a multibyte UTF-8 character that build up to | ||
a valid character, the engine may accept those parts as a whole character or trigger a fatal error. For instance, the | ||
following snippet: | ||
|
||
``` | ||
bufstring .first | ||
bufstring .second | ||
; ... | ||
|
||
; UTF encoding of U+00A0: 0xc2, 0xa0 | ||
.first | ||
db 0xc2, 0 | ||
.second | ||
db 0xa0, 0 | ||
``` | ||
|
||
may either append a `0x0000a0` codepoint to the message or cause a fatal error. | ||
The part about encoding errors seems self-contradictory. On one hand, you write 'Valid strings in the BSP itself must be in UTF-8 format' and 'Any UTF-8 decoding errors must be treated as fatal', but on the other, you want to allow valid UTF-8 strings to be built by concatenating fragmentary encodings. However, a fragmentary encoding is an error like any other. I assume the intention is to allow both 'strict' implementations that validate strings before appending them to the buffer and 'lax' ones that maintain the buffer as a plain array of bytes and only validate its encoding upon an attempt to print it out. However, by merely allowing the latter, you will be effectively mandating it, since this will allow patch scripts to be written that will only successfully run under a 'lax' implementation. Thus 'lax' implementations will be more interoperable and therefore more popular, pressuring 'strict' implementations to adopt 'lax' behaviour as well. This is the same mistake that made HTML the bloated mess it is today.

This is probably the issue that took me the longest to consider. The point seems valid, though, so I'll fix it.

Thinking about this again, I'm not sure if it's worth burdening implementations with validating every part of the buffer? Considering I can't see valid patches attempt to do this for any good reason, is it really worth it? Would any reasonable tool generate a patch that only runs in a lax implementation?
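
Illustration (not part of the diff): the 'strict' and 'lax' behaviours discussed in this thread differ only in when validation happens. The Python sketch below applies both to the two fragments from the snippet (without their terminators); all names are hypothetical.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 under strict rules."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


FIRST = b"\xc2"   # contents of .first
SECOND = b"\xa0"  # contents of .second


def strict_engine_accepts(fragments: list[bytes]) -> bool:
    """A 'strict' engine validates each string before appending it."""
    return all(is_valid_utf8(fragment) for fragment in fragments)


def lax_engine_accepts(fragments: list[bytes]) -> bool:
    """A 'lax' engine validates the whole buffer only when printing it."""
    return is_valid_utf8(b"".join(fragments))
```

Each fragment is invalid on its own while their concatenation is a valid encoding of U+00A0, so the same patch fails on the strict engine and succeeds on the lax one.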
||
|
||
[rfc3629]: https://tools.ietf.org/html/rfc3629 |
A surrogate is a valid code point. You're thinking of Unicode scalar values. I know it's been here before, but if you're going over this part anyway, you might want to fix this as well.

I'll probably write this in a more wordy way since I'm sure most people aren't even aware of the valid range of codepoints, let alone which ones are surrogates. But thanks for the correction.