Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String handling notes #1

Merged
merged 7 commits into from
Sep 25, 2018
Merged

String handling notes #1

merged 7 commits into from
Sep 25, 2018

Conversation

aaaaaa123456789
Copy link
Owner

This pull adds several comments regarding string handling to the specification. Namely, it addresses the following:

  • Forbids the use of invalid UTF-8 strings, requiring a fatal error in all cases
  • Actually points to the UTF-8 spec (i.e., the RFC) instead of leaving it implied
  • Specifies the acceptable behaviors of valid UTF-8 strings: what characters must be supported by the engine, and which ones can be substituted
  • Requires that any valid UTF-8 string is accepted, including those containing control characters (despite those control characters need not be actually displayed)
  • Elaborates on character/byte limits for strings, detailing how an engine that imposes such limits must behave
  • Adds a few notes that address corner cases regarding invalid UTF-8 strings
  • Indicates that arguments to bufchar above 0x1fffff are reserved and may actually have some purpose in future versions of the spec (since they are completely outside the range of Unicode)

This will most likely be version 0.6.0 once I merge it; the version number isn't assigned yet to allow for changes before this reaches master.

specification.md Outdated
(`'-,.;:#%&!?/()[]`) and the space character. Passing a value that isn't a valid Unicode codepoint (`0x000000` to
`0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error.
The `bufchar` instruction appends a single Unicode character to the message buffer. Passing a value that isn't a valid
Unicode codepoint (`0x000000` to `0x00d7ff` and `0x00e000` to `0x10ffff`) is a fatal error; values above `0x1fffff`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A surrogate is a valid code point. You're thinking of Unicode scalar values. I know it's been here before, but if you're going over this part anyway, you might want to fix this as well.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll probably write this in a more wordy way since I'm sure most people aren't even aware of the valid range of codepoints, let alone which ones are surrogates. But thanks for the correction.

specification.md Outdated

Valid strings in the BSP itself must be in UTF-8 format, as specified by [RFC 3629][rfc3629], regardless of the
effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones
such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any UTF-8 encoding error is recoverable, you just substitute U+FFFD for the invalid byte. Conversely, just because there's an 'obvious' thing you can do to 'fix' overlong/surrogate encodings doesn't make them any more 'valid'.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, do you suggest I simply remove the word "recoverable" from the writing?

specification.md Outdated
effective output format of the engine. Any UTF-8 decoding errors must be treated as fatal (including recoverable ones
such as overlong encodings or surrogate characters (codepoints between `0x00d800` and `0x00dfff`) being encoded).

Despite the engine must accept any valid UTF-8 string, it isn't required to be able to effectively display any Unicode

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Despite' is followed by a noun phrase. 'Although' (or 'despite that') would be better.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I've seen "despite" followed by non-noun phrases (e.g., "despite the door was locked, they could open it"), I'll admit my choice of wording is less than perfect here. Maybe I'm just imagining things; I'm not a native speaker, after all.

Will fix.


The engine may enforce a similar limit on the number of bytes and/or characters to be printed by a single `print`
instruction, as well as a maximum length for option labels for the `menu` instruction. Any text exceeding these limits
must be silently truncated as given in the previous paragraph.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You haven't specified a minimum number of characters an engine must be able to handle; theoretically, even an implementation that ignores all attempts to print anything will be compliant. Not sure if that was intentional.

Also, silent truncation may not be such a great idea either.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I did think about adding a minimum number of characters, ultimately I couldn't come up with a valid number: if I require N characters to be printed, would a patch using N+1 characters be valid? Why not require 2N or N/2 instead?

I'll elaborate on this later.

db 0xa0, 0
```

may either append a `0x0000a0` codepoint to the message or cause a fatal error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The part about encoding errors seems self-contradictory. On one hand, you write 'Valid strings in the BSP itself must be in UTF-8 format' and 'Any UTF-8 decoding errors must be treated as fatal', but on the other, you want to allow valid UTF-8 strings to be built by concatenating fragmentary encodings. However, a fragmentary encoding is an error like any other.

I assume the intention is to allow both 'strict' implementations that validate strings before appending them to the buffer and 'lax' ones that maintain the buffer as a plain array of bytes and only validate its encoding upon an attempt to print it out. However, by merely allowing the latter, you will be effectively mandating it, since this will allow patch scripts to be written that will only successfully run under a 'lax' implementation. Thus 'lax' implementations will be more interoperable and therefore more popular, pressuring 'strict' implementations to adopt 'lax' behaviour as well. This is the same mistake that made HTML the bloated mess it is today.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the issue that took me the longest to consider. The point seems valid, though, so I'll fix it.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, I'm not sure if it's worth burdening implementations with validating every part of the buffer?

Considering I can't see valid patches attempt to do this for any good reason, is it really worth it? Would any reasonable tool generate a patch that only runs in a lax implementation?

@aaaaaa123456789
Copy link
Owner Author

aaaaaa123456789 commented Aug 27, 2018

More on the string size limit:

First of all, while silent truncation might not be the best idea, what are the options? A fatal error seems excessive.

Also, in terms of not requiring a specific minimum, I simply couldn't choose a meaningful value. We're used to 80 for terminals, but that's completely arbitrary and doesn't make much sense outside of a terminal. And while zero doesn't seem like a good limit, how much good would it do to require the limit to be at least one or two?

In the end, I expect a result similar to what you described for strict/lax implementations (regarding fragmentary UTF-8 encodings), where implementations with useful limits simply push out unreasonable ones. The ideal limit is no limit, after all; the spec simply allows for implementation-defined limits so it will be usable in resource-constrained environments, where no unlimited implementations will ever be available anyway.

The `bufchar` instruction appends a single Unicode character to the message buffer. The value passed to the `bufchar`
instruction as an argument must represent a valid non-surrogate Unicode codepoint (i.e., it must be between `0x000000`
and `0x00d7ff`, or `0x00e000` and `0x10ffff`); passing a value outside of those ranges is a fatal error. Values above
`0x1fffff` are reserved for further versions of the specification.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tangential musing: currently BSP lacks a dedicated opcode guaranteed to generate a fatal error under any circumstances in all future versions of the specification. Before I read this, I thought bufchar 0xdeadc0de would be a good candidate for such a thing.

Why even have one? For catching assertion failures, for one thing. (Compare __builtin_trap.)

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Division by zero is guaranteed to fail.

Copy link

@fstirlitz fstirlitz Aug 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if someone implements an interpreter in Pony?

More seriously though, bufchar with an invalid USV feels cleaner, as it doesn't have to choose any 'output' register. But there's another reason I'm not fond of stuffing useful behaviours into invalid cases, which I'm going to elaborate on when I finally finish that email.

Copy link
Owner Author

@aaaaaa123456789 aaaaaa123456789 Aug 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of divide #0, 0, 0, which is a valid instruction. Also, what's Pony?

EDIT: I see what you mean about output registers.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about bufchar 0xdead? It's a guaranteed fatal error because it's a surrogate codepoint, and the Unicode standard guarantees it will always be one.

@fstirlitz
Copy link

Failure to display the entire message to the user may entail a failure to convey vital information, as in my example. Bailing out entirely is not that an unreasonable response.

You're correct that any choice of a minimum buffer length is ultimately arbitrary. But some arbitrary choices are better than others. 80 characters is not only the standard terminal width, it's also about the length of an average English sentence (judging by cursory Web searches about the topic). I think it's a perfectly reasonable choice.

If you aren't comfortable with outright mandating it as a minimum length, you can always make it a SHOULD-clause instead of a MUST-clause.

@aaaaaa123456789
Copy link
Owner Author

That seems like a reasonable idea, although I'd rather introduce RFC 2119 (and switch the entire spec to it) in a later pull.

@aaaaaa123456789
Copy link
Owner Author

It's been nearly a month since this had any activity and I kinda forgot, so it's a good time to merge it. I'll introduce RFC 2119 in a later version.

@aaaaaa123456789 aaaaaa123456789 merged commit d295958 into master Sep 25, 2018
@aaaaaa123456789 aaaaaa123456789 deleted the string-handling-notes branch September 25, 2018 00:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants