Performance of decodeUtf8 on non-ascii text #1

andrewthad · 2018-03-07T17:59:22Z

The performance of decodeUtf8 is excellent on ascii text, since the check can be vectorized to operate on a full machine word at a time, and nearly all branch predictions are correct. However, for non-ascii text, I haven't put much effort into optimizing it. There's no way it could ever compete with the decoding of ascii text, but it could probably be much better than it is now.
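For readers unfamiliar with the word-at-a-time trick, here is a minimal sketch of the idea, not the library's actual code. It works on a plain list of bytes so that it needs only `base`; the names `packWord`, `highBits`, and `allAscii` are hypothetical. A real implementation would read whole words directly from the byte array instead of packing them by hand.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word64, Word8)

-- | Pack up to eight bytes into one Word64. Hypothetical helper: a real
-- implementation would load a machine word straight from the buffer.
packWord :: [Word8] -> Word64
packWord = foldl (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- | Mask selecting the high bit of each of the eight byte lanes.
highBits :: Word64
highBits = 0x8080808080808080

-- | True when every byte is below 0x80. Each step tests up to eight
-- bytes with a single AND and a single comparison.
allAscii :: [Word8] -> Bool
allAscii [] = True
allAscii bs =
  let (chunk, rest) = splitAt 8 bs
  in packWord chunk .&. highBits == 0 && allAscii rest
```

On ascii input the comparison succeeds for every word, which is why the branch is so predictable.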


chessai · 2018-03-30T05:29:18Z

Do you have any ideas for how it could be optimised? Why can you not reach the efficiency of decodeUtf8 from Data.Text?

I think it can be achieved with sheer willpower alone.

andrewthad · 2018-03-30T12:59:47Z

It could certainly match the performance of decodeUtf8 from Data.Text. Right now, I have some bounds checks that are redundant. It may be possible to eliminate them. Alternatively, it may improve things to make the code more concise. I think we could instead have a single helper function that handles two-byte, three-byte, and four-byte characters.
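As an illustration of what a single helper for two-byte, three-byte, and four-byte characters might look like, here is a sketch under stated assumptions. It is not the code in this repository: it uses a list of `Word8` rather than the library's byte type, and `leadInfo` and `decodeMulti` are hypothetical names. The idea is that the lead byte yields the number of continuation bytes, the initial payload bits, and the smallest code point allowed at that length, so one code path covers all three cases.

```haskell
import Control.Monad (guard)
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Classify a lead byte as (continuation bytes expected, payload bits
-- of the lead byte, smallest code point that needs this length).
leadInfo :: Word8 -> Maybe (Int, Int, Int)
leadInfo b
  | b .&. 0xE0 == 0xC0 = Just (1, fromIntegral (b .&. 0x1F), 0x80)
  | b .&. 0xF0 == 0xE0 = Just (2, fromIntegral (b .&. 0x0F), 0x800)
  | b .&. 0xF8 == 0xF0 = Just (3, fromIntegral (b .&. 0x07), 0x10000)
  | otherwise = Nothing

-- | Decode one multi-byte sequence from the front of the input and
-- return the character together with the remaining bytes.
-- Ascii lead bytes are rejected here; they belong to the fast path.
decodeMulti :: [Word8] -> Maybe (Char, [Word8])
decodeMulti [] = Nothing
decodeMulti (b : bs) = do
  (n, start, minCp) <- leadInfo b
  let (conts, rest) = splitAt n bs
  -- One length check per character instead of one per byte.
  guard (length conts == n && all isCont conts)
  let cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) start conts
  -- Reject overlong encodings, surrogates, and values above U+10FFFF.
  guard (cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF))
  pure (chr cp, rest)
  where
    isCont c = c .&. 0xC0 == 0x80
```

Whether this is faster than three specialised branches would need benchmarking; the sketch only shows that the three cases share the same shape.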

chessai · 2018-03-30T13:16:29Z

Is it possible to write something like 'isUtf8' (which we know can be made relatively efficient), and if that function returns true, make a pass over the text?

By the way, the willpower thing was a joke.

chessai · 2018-03-30T13:17:53Z

We have to perform the check for utf8 and then decode the character anyway, so maybe that would be an OK solution.

andrewthad · 2018-03-30T13:19:55Z

Actually, that's already what it does. It just passes over the Bytes and checks to see if it is UTF-8. So, it's zero-copy unless there are disallowed code points present. In that case, we have to clean them up, which requires allocating a new bytearray.
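The validate-first strategy described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: it models the buffer as a list of `Word8`, the names `validLen`, `isUtf8`, `Decoded`, `decodeLenient`, and `scrub` are hypothetical, and replacing each bad byte with U+FFFD is an assumption about what "cleaning up" means. The `Borrowed` constructor stands in for the zero-copy case, where the input is reused unchanged.

```haskell
import Data.Word (Word8)

inR :: Word8 -> Word8 -> Word8 -> Bool
inR lo hi x = x >= lo && x <= hi

cont :: Word8 -> Bool
cont = inR 0x80 0xBF

-- | Length of the well-formed sequence at the head of the input,
-- or 0 if it is ill-formed (ranges follow the Unicode standard's
-- table of well-formed UTF-8 byte sequences).
validLen :: [Word8] -> Int
validLen bs = case bs of
  a : _ | a < 0x80 -> 1
  a : b : _ | inR 0xC2 0xDF a, cont b -> 2
  a : b : c : _
    | a == 0xE0, inR 0xA0 0xBF b, cont c -> 3
    | inR 0xE1 0xEC a, cont b, cont c -> 3
    | a == 0xED, inR 0x80 0x9F b, cont c -> 3
    | inR 0xEE 0xEF a, cont b, cont c -> 3
  a : b : c : d : _
    | a == 0xF0, inR 0x90 0xBF b, cont c, cont d -> 4
    | inR 0xF1 0xF3 a, cont b, cont c, cont d -> 4
    | a == 0xF4, inR 0x80 0x8F b, cont c, cont d -> 4
  _ -> 0

isUtf8 :: [Word8] -> Bool
isUtf8 [] = True
isUtf8 bs = case validLen bs of
  0 -> False
  n -> isUtf8 (drop n bs)

-- | Borrowed reuses the input as is; Cleaned holds a freshly built copy.
data Decoded = Borrowed [Word8] | Cleaned [Word8]
  deriving (Eq, Show)

decodeLenient :: [Word8] -> Decoded
decodeLenient bs
  | isUtf8 bs = Borrowed bs
  | otherwise = Cleaned (scrub bs)

-- | Assumption: each offending byte becomes U+FFFD (EF BF BD).
scrub :: [Word8] -> [Word8]
scrub [] = []
scrub bs = case validLen bs of
  0 -> [0xEF, 0xBF, 0xBD] ++ scrub (drop 1 bs)
  n -> take n bs ++ scrub (drop n bs)
```

The slow path only runs when validation fails, so well-formed input pays for a single pass and no allocation.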
