String handling notes #1

Merged
merged 7 commits into from
Sep 25, 2018
88 changes: 76 additions & 12 deletions specification.md
@@ -1,12 +1,13 @@
# BSP specification

([version](versions.md): 0.5.2, rev. 37)
([version](versions.md): 0.6.0, rev. 38)

* [Introduction](#introduction)
* [Execution model](#execution-model)
* [Opcodes](#opcodes)
* [Instruction set](#instruction-set)
* [Instruction description](#instruction-description)
* [String handling](#string-handling)

## Introduction

@@ -671,12 +672,7 @@ This document does not specify how the engine will display the message; however,
(or an environment that behaves in a similar fashion), it is recommended that the engine prints a newline character
after the message.

If the message is not valid UTF-8, the engine may choose to display the message anyway (handling the invalid characters
in any way it can) or to treat it as a fatal error.

An engine incapable of handling the full Unicode character set may choose to use a reduced character set and replace
the remaining characters with a suitable substitution character; however, an engine must at least support the Latin
letters (A-Z, a-z), digits (0-9), spaces, and the following punctuation characters: `'-,.;:#%&!?/()[]`.
Further considerations regarding message strings are given in the [String handling](#string-handling) section.

### Manipulating the message buffer

@@ -693,11 +689,10 @@ The first three instructions concatenate data at the end of the message buffer.
The `bufstring` instruction concatenates a string (in the same format as for the `print` instruction) at the end of
the message buffer. No separator is inserted before or after the string.

The `bufchar` instruction appends a single Unicode character to the message buffer. An engine incapable of handling the
full Unicode character set may choose to use a reduced character set and replace the remaining characters with suitable
substitutes; it must however support at least the letters (A-Z, a-z), numbers (0-9), basic punctuation characters
(`'-,.;:#%&!?/()[]`) and the space character. Passing a value that isn't a valid Unicode codepoint (`0x000000` to
`0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error.
The `bufchar` instruction appends a single Unicode character to the message buffer. The argument passed to `bufchar`
must be a valid non-surrogate Unicode codepoint (i.e., it must be between `0x000000` and `0x00d7ff`, or between
`0x00e000` and `0x10ffff`); passing a value outside of those ranges is a fatal error. Values above `0x1fffff` are
reserved for future versions of the specification.
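
As an illustration only (not part of the specification text), the range check for `bufchar` could be sketched in Python as follows. The names `FatalError` and `bufchar_encode` are hypothetical, and the sketch assumes that an engine for this version treats reserved values (above `0x1fffff`) as fatal, like any other out-of-range argument.

```python
class FatalError(Exception):
    """Hypothetical stand-in for a BSP fatal error."""


def bufchar_encode(value: int) -> bytes:
    """Return the UTF-8 bytes for a valid bufchar argument, or raise FatalError."""
    # Only non-surrogate Unicode codepoints are valid arguments.
    if 0x000000 <= value <= 0x00D7FF or 0x00E000 <= value <= 0x10FFFF:
        return chr(value).encode("utf-8")
    # Surrogates, values past U+10FFFF and reserved values all end up here.
    raise FatalError(f"invalid bufchar argument: {value:#x}")
```

Under these assumptions, `bufchar_encode(0xA0)` yields the two bytes `0xc2 0xa0`, `bufchar_encode(0)` yields a single zero byte, and `bufchar_encode(0xDEAD)` raises `FatalError`.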

Tangential musing: currently BSP lacks a dedicated opcode guaranteed to generate a fatal error under any circumstances in all future versions of the specification. Before I read this, I thought `bufchar 0xdeadc0de` would be a good candidate for such a thing.

Why even have one? For catching assertion failures, for one thing. (Compare `__builtin_trap`.)

Owner Author

Division by zero is guaranteed to fail.

@fstirlitz fstirlitz Aug 28, 2018

What if someone implements an interpreter in Pony?

More seriously though, `bufchar` with an invalid USV feels cleaner, as it doesn't have to choose any 'output' register. But there's another reason I'm not fond of stuffing useful behaviours into invalid cases, which I'm going to elaborate on when I finally finish that email.

Owner Author

@aaaaaa123456789 aaaaaa123456789 Aug 28, 2018

I was thinking of `divide #0, 0, 0`, which is a valid instruction. Also, what's Pony?

EDIT: I see what you mean about output registers.

Owner Author

How about `bufchar 0xdead`? It's a guaranteed fatal error because it's a surrogate codepoint, and the Unicode standard guarantees it will always be one.


The `bufnumber` instruction appends the decimal representation of a number to the message buffer. The number is
treated as a 32-bit unsigned value and converted to decimal, and printed using the regular digit characters (0-9,
@@ -708,6 +703,8 @@ The `printbuf` instruction prints the contents of the message buffer as a messag
`print` instruction) and clears the buffer, resetting it to the empty string. The `clearbuf` instruction resets the
message buffer to the empty string without printing it.

Further considerations regarding the message buffer are given in the [String handling](#string-handling) section.

### Option menus

```
[...]
```
@@ -744,6 +741,9 @@ Options:
If the list of pointers is empty (i.e., if the first pointer is `0xffffffff`), no menu is shown to the user, and the
variable is set to `0xffffffff`.

Further considerations regarding the text used as option labels are given in the [String handling](#string-handling)
section.

Note that a menu with just one option must still be shown to the user, as it is possible to use such a menu to give the
user the possibility of aborting the process by stopping the BSP engine.

@@ -1086,3 +1086,67 @@ child to the parent.
If the child's execution triggers a fatal error, this fatal error must be propagated to the parent; in other words,
a fatal error at any depth must halt the whole engine. Execution of the parent must **not** be resumed after a fatal
error occurs in the child.

## String handling

Several instructions in this specification deal with strings — namely, the [`print`][print] and [`menu`][menu]
instructions, as well as [those that manipulate the message buffer][msgbuffer]. This section specifies how the engine
must behave when handling strings, and which part of the functionality is implementation-dependent.

Valid strings in the BSP itself must be in UTF-8 format, as specified by [RFC 3629][rfc3629], regardless of the
effective output format of the engine. Any UTF-8 decoding errors (such as overlong encodings) must be treated as
fatal, without attempting any recovery; surrogate codepoints (i.e., those between `0x00d800` and `0x00dfff`) must be
treated as fatal errors as well.
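
As a rough illustration of the strictness described above, the following Python sketch relies on the language's built-in strict UTF-8 decoder, which rejects overlong forms, encoded surrogates, truncated sequences and values above `0x10ffff`. The helper name `is_valid_bsp_string` is hypothetical; a real engine would raise a fatal error rather than return a flag.

```python
def is_valid_bsp_string(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 in the RFC 3629 sense."""
    try:
        # "strict" makes the decoder raise instead of attempting recovery.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A few representative inputs (all chosen for this sketch):
samples = {
    "plain text": b"caf\xc3\xa9",
    "overlong NUL": b"\xc0\x80",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "truncated sequence": b"\xc2",
}
results = {name: is_valid_bsp_string(raw) for name, raw in samples.items()}
```

Only the first sample is accepted; the other three are the kinds of input that the paragraph above requires to be fatal.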

Although the engine must accept any valid UTF-8 string, it isn't required to be able to display every Unicode
character; an engine incapable of handling the full Unicode character set may choose to use a reduced one and
replace characters not in its reduced set with zero or more suitable substitution characters. However, an engine is
required to support at least Latin letters (A-Z, a-z), digits (0-9), spaces, and the following punctuation characters:
`'-,.;:#%&!?/()[]`. All of these characters are encoded as single UTF-8 bytes, and belong to the following ranges:
`0x20` - `0x21`, `0x23`, `0x25` - `0x29`, `0x2c` - `0x3b`, `0x3f`, `0x41` - `0x5b`, `0x5d`, and `0x61` - `0x7a`.
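
The byte ranges listed above can be cross-checked against the character list with a short Python sketch. All names here (`REQUIRED_CHARS`, `REQUIRED_RANGES`, `in_required_ranges`) are illustrative and do not come from the specification.

```python
# Characters every engine must support, as enumerated in the text.
REQUIRED_CHARS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    " "
    "'-,.;:#%&!?/()[]"
)

# Inclusive byte ranges, copied from the text.
REQUIRED_RANGES = [
    (0x20, 0x21), (0x23, 0x23), (0x25, 0x29), (0x2C, 0x3B),
    (0x3F, 0x3F), (0x41, 0x5B), (0x5D, 0x5D), (0x61, 0x7A),
]


def in_required_ranges(byte: int) -> bool:
    """Return True if byte falls in one of the listed ranges."""
    return any(low <= byte <= high for low, high in REQUIRED_RANGES)


from_ranges = {b for b in range(0x80) if in_required_ranges(b)}
from_chars = {ord(c) for c in REQUIRED_CHARS}
ranges_match_list = from_ranges == from_chars
```

Both sets describe the same characters, so an engine can test membership with either representation.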

Control characters in strings must be accepted, as they are valid UTF-8 characters; they are also valid arguments to
the `bufchar` instruction. (In particular, `0` is a valid argument to `bufchar`, and therefore must not be treated as
a string terminator in that context.) However, since they are not in the ranges listed in the previous paragraph,
engines are not required to support them; control characters may be ignored (i.e., substituted by nothing) when the
string (or the message buffer) is displayed to the user, or handled in any other appropriate way.
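
One way an engine with a reduced character set might render a string is sketched below in Python. This is only an assumption about one acceptable behavior: control characters are dropped (substituted by nothing), and any other unsupported character becomes a single `?`. The function name `render_reduced` and the choice of `?` are hypothetical.

```python
import string
import unicodedata

# The minimum set the text requires every engine to support.
SUPPORTED = set(string.ascii_letters + string.digits + " " + "'-,.;:#%&!?/()[]")


def render_reduced(text: str, substitute: str = "?") -> str:
    """Render text using only SUPPORTED characters."""
    out = []
    for ch in text:
        if ch in SUPPORTED:
            out.append(ch)
        elif unicodedata.category(ch) == "Cc":
            # Control characters (including NUL) are accepted but not shown.
            continue
        else:
            out.append(substitute)
    return "".join(out)


rendered = render_reduced("Hi\x00 caf\u00e9!\n")
```

Here `rendered` is `"Hi caf?!"`: the zero byte and the newline are dropped, and `é` is replaced by the substitution character.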

The engine may enforce a limit on the number of bytes and/or characters that the message buffer can accept; this limit
may also be dynamically determined during execution. If such a limit is enforced, characters and/or bytes in excess
must be silently discarded without error; the engine must take care to discard multibyte characters as a whole, and
not only some of their bytes. (For instance, if the last character to be added to the buffer is codepoint `0x0000a0`,
encoded as `0xc2` `0xa0`, the engine may keep both bytes or discard them both, but it must not discard just the last
byte.) If any of the instructions that append to the message buffer (i.e., `bufstring`, `bufchar` or `bufnumber`)
cause some data to be silently discarded due to the buffer being full, any further such instructions must be silently
ignored (i.e., wholly discarded) until the buffer is cleared via the `printbuf` or `clearbuf` instructions.
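
The overflow rules above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the reference behavior: the limit is counted in bytes, the engine stops at the first character that does not fit, and the class and method names (`MessageBuffer`, `append`, `clear`) are hypothetical.

```python
class MessageBuffer:
    """Byte-limited message buffer (hypothetical engine-side helper)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.data = bytearray()
        self.overflowed = False  # set once any data has been discarded

    def append(self, text: str) -> None:
        """Append text, as bufstring, bufchar or bufnumber would."""
        if self.overflowed:
            return  # ignore further appends until the buffer is cleared
        for ch in text:
            encoded = ch.encode("utf-8")
            if len(self.data) + len(encoded) > self.max_bytes:
                # Discard this character as a whole, and everything after it.
                self.overflowed = True
                return
            self.data += encoded

    def clear(self) -> None:
        """Reset the buffer, as clearbuf (or printbuf after printing) would."""
        self.data.clear()
        self.overflowed = False
```

With a 4-byte limit, appending `"abc"` followed by codepoint `0x0000a0` keeps only `abc`: the two-byte character is dropped whole, and a later append of `"d"` is ignored until `clear()` is called.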

The engine may enforce a similar limit on the number of bytes and/or characters to be printed by a single `print`
instruction, as well as a maximum length for option labels for the `menu` instruction. Any text exceeding these limits
must be silently truncated as given in the previous paragraph.

You haven't specified a minimum number of characters an engine must be able to handle; theoretically, even an implementation that ignores all attempts to print anything will be compliant. Not sure if that was intentional.

Also, silent truncation may not be such a great idea either.

Owner Author

While I did think about adding a minimum number of characters, ultimately I couldn't come up with a valid number: if I require N characters to be printed, would a patch using N+1 characters be valid? Why not require 2N or N/2 instead?

I'll elaborate on this later.


Invalid UTF-8 strings given as arguments to the `bufstring` instruction cause a fatal error; this error may occur at
the time of executing that instruction, or when executing any further instruction that manipulates the message buffer,
up to the point where the message buffer is printed via the `printbuf` instruction. If the message buffer is never
printed, the error may occur up to the point where the message buffer is cleared (either via the `clearbuf`
instruction or due to terminating execution) or not at all; this is implementation-defined.

Multibyte UTF-8 characters appended to the message buffer via `bufstring` instructions must be fully contained within
a single string; if two or more consecutive instructions append parts of a multibyte UTF-8 character that build up to
a valid character, the engine may accept those parts as a whole character or trigger a fatal error. For instance, the
following snippet:

```
bufstring .first
bufstring .second
; ...

; UTF-8 encoding of U+00A0: 0xc2, 0xa0
.first
db 0xc2, 0
.second
db 0xa0, 0
```

may either append a `0x0000a0` codepoint to the message or cause a fatal error.
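
The two permitted outcomes correspond to two validation strategies, sketched here in Python. Both function names are hypothetical: `strict_accepts` validates each `bufstring` argument on its own, while `lax_accepts` validates only the concatenated buffer, as an engine that defers checking until `printbuf` might do.

```python
def strict_accepts(fragments: list[bytes]) -> bool:
    """Strict engine: every bufstring argument must be valid UTF-8 by itself."""
    try:
        for fragment in fragments:
            fragment.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def lax_accepts(fragments: list[bytes]) -> bool:
    """Lax engine: only the whole buffer is validated, at print time."""
    try:
        b"".join(fragments).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The two strings from the snippet above, without their terminators.
fragments = [b"\xc2", b"\xa0"]
strict_result = strict_accepts(fragments)
lax_result = lax_accepts(fragments)
```

For this input the strict check fails and the lax check succeeds, so a patch that splits a character this way runs only on engines of the second kind.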

The part about encoding errors seems self-contradictory. On one hand, you write 'Valid strings in the BSP itself must be in UTF-8 format' and 'Any UTF-8 decoding errors must be treated as fatal', but on the other, you want to allow valid UTF-8 strings to be built by concatenating fragmentary encodings. However, a fragmentary encoding is an error like any other.

I assume the intention is to allow both 'strict' implementations that validate strings before appending them to the buffer and 'lax' ones that maintain the buffer as a plain array of bytes and only validate its encoding upon an attempt to print it out. However, by merely allowing the latter, you will be effectively mandating it, since this will allow patch scripts to be written that will only successfully run under a 'lax' implementation. Thus 'lax' implementations will be more interoperable and therefore more popular, pressuring 'strict' implementations to adopt 'lax' behaviour as well. This is the same mistake that made HTML the bloated mess it is today.

Owner Author

This is probably the issue that took me the longest to consider. The point seems valid, though, so I'll fix it.

Owner Author

Thinking about this again, I'm not sure if it's worth burdening implementations with validating every part of the buffer?

Considering I can't see valid patches attempt to do this for any good reason, is it really worth it? Would any reasonable tool generate a patch that only runs in a lax implementation?


[rfc3629]: https://tools.ietf.org/html/rfc3629
18 changes: 17 additions & 1 deletion versions.md
@@ -36,7 +36,23 @@ specification. However, this does not convey as much information as the full ver

## Changelog

#### [Version 0.5.2, rev. 37 (19 August 2018)](https://github.com/aaaaaa123456789/bsp/blob/master/specification.md)
#### [Version 0.6.0, rev. 38 (25 September 2018)](https://github.com/aaaaaa123456789/bsp/blob/master/specification.md)

Adds a section about characters and strings (referenced from all instructions that handle text data), including
changes such as:

* Forbidding the use of invalid UTF-8 strings, requiring a fatal error in all cases
* Actually pointing to the UTF-8 spec (i.e., the RFC) instead of leaving it implied
* Specifying the acceptable behaviors of valid UTF-8 strings: what characters must be supported by the engine, and
which ones can be substituted
* Requiring that any valid UTF-8 string is accepted, including those containing control characters (even though those
control characters need not actually be displayed)
* Elaborating on character/byte limits for strings, detailing how an engine that imposes such limits must behave
* Adding a few notes that address corner cases regarding invalid UTF-8 strings
* Indicating that arguments to bufchar above `0x1fffff` are reserved and may actually have some purpose in future
versions of the spec (since they are completely outside the range of Unicode)

#### [Version 0.5.2, rev. 37 (19 August 2018)](https://github.com/aaaaaa123456789/bsp/blob/92d13c851899eeb06d26ce346ee4f6ab46123ee7/specification.md)

* Adds a version number to the specification and a link to the (newly-written) changelog
