Improving the text decode upon undefined encoding #1657

Ousret · 2021-05-28T06:09:00Z

Ousret
May 28, 2021

Hi,

I have an idea of how to improve the text decoder default behavior without trying any detection by confidence.

While the assertion for "UTF-8 is prevalent in the WWW HTML content is True" it is not the case in non-HTML content. Even in the TOP 1000 websites, there are still servers that do not disclose the charset in headers.

        if self.decoder is None:
            # If this is the first decode pass then we need to determine which
            # encoding to use by attempting UTF-8 and raising any decode errors.
            attempt_utf_8 = codecs.getincrementaldecoder("utf-8")(errors="strict")
            try:
                attempt_utf_8.decode(data)
            except UnicodeDecodeError:
                # Could not decode as UTF-8. Use Windows 1252.
                self.decoder = codecs.getincrementaldecoder("cp1252")(errors="replace")
            else:
                # Can decode as UTF-8. Use UTF-8 with lenient error settings.
                self.decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

Currently, httpx does:

Decode using utf_8 codec with strict err policy
Fallback to cp1252 with replc err policy

There is one main thing that needs to be addressed as-is:

This code shows that the payload is going to be decoded twice at least
A small performance issue is going to happen over large payloads.

I propose to change the default behavior by:

Checking for BOM or SIG widely recognized (UTF-8, 16, 32, etc..) and act accordingly
If none found, decode using a strict UTF-8 decoder (keep what is decoded in a var)
If utf-8 fail, use latin-1 with the "replace" err policy instead of cp1252 because the latin-1 has a native decoder.
If latin-1 is used and the decoded payload is having too much control char/unprintable char. Return an empty string. It is wiser.

tomchristie · 2021-06-01T13:00:33Z

tomchristie
Jun 1, 2021
Maintainer

So, I think a good place to start with any review on the behaviour here would have to be evidence-led.

There's a bunch of different possible charset decoding strategies and they all have different trade-offs.

Perhaps a smart thing to do here would be to find a "top-1000" list from someplace, and start by determining:

What character encoding (if any) each of the sites responds with.
Of sites that don't include a character encoding, which decode okay with utf-8.
Of sites that don't include a character encoding, what character encoding chardet guesses the content as.

1 reply

Ousret Jul 6, 2021
Author

Hi,

Here are some interesting data. https://github.com/potiuk/test-charset-normalizer

Regards,

Ousret · 2021-08-08T21:01:55Z

Ousret
Aug 8, 2021
Author

Hi,

Reboot

I am going to assume this was not much to proceed with.

any review on the behaviour here would have to be evidence-led.

Yes, you are absolutely right. But something does not seem right with how you decided that the detection was going the change.
Based on #1269 you said that UTF-8 or CP1252 was prevalent in most cases.

to be evidence-led I may certainly say that your assumptions are most likely false.

Those are dangerous assumptions. Looking at https://w3techs.com/technologies/overview/character_encoding may
seem like encoding detection is a thing of the past but not really. Solo based on 33k websites, you will find
3,4k responses without predefined encoding. 1,8k websites were not UTF-8, merely half!

This statistic does not offer any ponderation, so one should not read it as "I have 97 % chance of hitting UTF-8 content on HTML content".

(2021 Top 1000 sites from 80 countries in the world according to Data for SEO) https://github.com/potiuk/test-charset-normalizer

Neither httpx, chardet, or charset-normalizer are dedicated to HTML content.
The detection concern every text document, like SubRip Subtitle files for instance. And by my own experiences, I never had
a single database using full utf-8, many translated subtitles are from another era and never updated.

It is so hard to find any stats at all regarding this matter. Users' usages can be very dispersed, so making
assumptions are unwise.

The real debate is to state if the detection is an HTTP client matter or not. That is more complicated and not my field.

Initial thoughts

What I was saying, in the beginning, was very simple. Httpx decodes the content at least twice. That is bad, period.

No matter what strategy httpx opt for, we all agree that it won't be without some tradeoff.

What I would suggest today is to solo try out UTF-8 using a strict mode for errors handling. If it fails, either raise a warning, exception or just return bytes. In addition to guiding users on how they should handle this matter.
Recently, I had a thought on whenever a HTTP client can permit itself to alter rendered text when using the replace err strategy; my opinion is that it should not!

Or reintroduce, optionally a detection if any engine is available.

Regards,

6 replies

tomchristie Aug 9, 2021
Maintainer

Okay, so swapping out chardet for charset_normalizer I can immediately see equivalent guesses, but much faster.

Ousret Aug 9, 2021
Author

Yes, since the v2+ charset_normalizer became a very attractive alternative to chardet.

I've taken a dataset of top URLs from https://gtmetrix.com/top1000.html

Nice, these are valuable data. Using the other dataset URLs show that some countries are more concerned than others, so peoples around the globe might experience it quite differently.

Default to whatever charset is specified, using errors='replace'.

I strongly discourage that. Right under our noses, from raw.githubusercontent.com. Should be stricter I think. I don't get why but I saw that more than once... Could be modulated, more flexible, but I don't think that this behavior should remain completely unchecked.

curl -v https://raw.githubusercontent.com/Ousret/char-dataset/master/euc_jp/_ude_1.txt
*   Trying 185.199.110.133:443...
* TCP_NODELAY set
* Connected to raw.githubusercontent.com (185.199.110.133) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=GitHub, Inc.; CN=www.github.com
*  start date: May  6 00:00:00 2020 GMT
*  expire date: Apr 14 12:00:00 2022 GMT
*  subjectAltName: host "raw.githubusercontent.com" matched cert's "*.githubusercontent.com"
*  issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert SHA2 High Assurance Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x563ae7b06ea0)
> GET /Ousret/char-dataset/master/euc_jp/_ude_1.txt HTTP/2
> Host: raw.githubusercontent.com
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 200 
< cache-control: max-age=300
< content-security-policy: default-src 'none'; style-src 'unsafe-inline'; sandbox
< content-type: text/plain; charset=utf-8
.....

Also, it's important that the Chinese character set GB2312 should always be coerced to the superset GBK8,

Yes, I completely agree, safe move.

tomchristie Aug 9, 2021
Maintainer

Right under our noses, from raw.githubusercontent.com. Should be stricter I think. I don't get why but I saw that more than once...

We'd be getting the same output that you get when visiting that page with Chrome or Safari.
I think we probably ought to prefer that to loud errors here.

tomchristie Aug 10, 2021
Maintainer

Right further investigation...

Was interested to figure out if the proposed "fallback to utf-8 first" was a good policy or not, so dig some digging into the 192 cases where no charset was specified, but utf-8 decoding works successfully.

Of these cases what we're interested in is are there cases where content.decode(charset_normalizer_guess) != content.decode("utf-8"). Turns out there's exactly 12 of these cases. In all of those 12 cases, the reason is because the encoding is actually utf-8-sig (Includes a leading BOM)

Anyways, what's the actual upshot of that?...

We can keep our charset decoding policy nice & simple...

Default to whatever charset is specified, using errors='replace'.
If no charset is specified, use the charset_normalizer guess.

tomchristie Aug 10, 2021
Maintainer

The most awkward question here is what to do for the streaming text case...

In that case the ideal behaviour would be to wait until something like the first 10k? 64k? bytes have been consumed, then make a best guess based on the incomplete content, and return text decoding based on that result.

As it happens charset_normalizer doesn't appear to support an incremental decoding API, which chardet does.

I had a look at what results we just get from a guess based on a hard-cutoff of the leading 10k, 100k bytes on the top-1000 repo, but they differ too significantly from the guess based on the entire content. My assumption is that final trailing byte(s) being sometimes truncated is skewing the results too much. Perhaps a charset_normalizer feature could be a detect(content, is_incomplete=True) which would indicate that the detection is running against a portion of the content rather than the entirety, and that a result should be returned without taking the final character coherence into consideration.

Ousret · 2021-08-10T13:01:32Z

Ousret
Aug 10, 2021
Author

We can keep our charset decoding policy nice & simple...

Always the best way of doing things.

As it happens charset_normalizer doesn't appear to support an incremental decoding API, which chardet does.

I did not implement it, indeed. I think that Chardet did implement it out of performance concerns.

The most awkward question here is what to do for the streaming text case...
like the first 10k? 64k?..

The most content, the better. But one thing to keep in mind is that charset-normalizer protects the end-user from having a UnicodeDecodeError raised along the way. So I would recommend cutting off where it's relatively safe, like a space, carriage return, line feed, or even tab characters. I/O iter_line?

Another thing would be to use the public charset_normalizer.from_bytes to fine-tune the stream case detection. Using the kwargs steps and chunk_size.

Perhaps a charset_normalizer feature could be a detect(content, is_incomplete=True) which would indicate that the detection is running against a portion of the content rather than the entirety

I am open to ease the detection process in that case by providing a proper implementation.

0 replies

vitidev · 2021-08-12T19:09:38Z

vitidev
Aug 12, 2021

I propose to change the default behavior by:

Also need the ability to set the encoding and errors = 'replace/ignore' manually.
Now one can only set encoding name, but not the error handling scheme - and as result it will use fallback encoding 1252

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving the text decode upon undefined encoding #1657

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Improving the text decode upon undefined encoding #1657

Ousret May 28, 2021

Replies: 4 comments · 7 replies

tomchristie Jun 1, 2021 Maintainer

Ousret Jul 6, 2021 Author

Ousret Aug 8, 2021 Author

Reboot

Initial thoughts

tomchristie Aug 9, 2021 Maintainer

Ousret Aug 9, 2021 Author

tomchristie Aug 9, 2021 Maintainer

tomchristie Aug 10, 2021 Maintainer

tomchristie Aug 10, 2021 Maintainer

Ousret Aug 10, 2021 Author

vitidev Aug 12, 2021

Ousret
May 28, 2021

Replies: 4 comments 7 replies

tomchristie
Jun 1, 2021
Maintainer

Ousret Jul 6, 2021
Author

Ousret
Aug 8, 2021
Author

tomchristie Aug 9, 2021
Maintainer

Ousret Aug 9, 2021
Author

tomchristie Aug 9, 2021
Maintainer

tomchristie Aug 10, 2021
Maintainer

tomchristie Aug 10, 2021
Maintainer

Ousret
Aug 10, 2021
Author

vitidev
Aug 12, 2021