Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteString should not be considered Text #107

Open
dustin opened this issue Oct 31, 2022 · 13 comments
Open

ByteString should not be considered Text #107

dustin opened this issue Oct 31, 2022 · 13 comments

Comments

@dustin
Copy link

dustin commented Oct 31, 2022

The ToField instance of ByteString uses Escape which is meant for text-like strings which is a clearly invalid assumption for ByteString.

The Binary wrapper can be used to treat ByteString as non-text data, but that's redundant. The default causes character encoding issues when storing bytea values that may not appear in trivial tests. I found this by means of a property test that reported bad utf-8 characters in jpeg thumbnails I was storing.

@phadej
Copy link
Collaborator

phadej commented Oct 31, 2022

Yes. I'm not a fan of that decision made a decade ago. But changing that would be a big breaking change, which wouldn't cause type-errors, so I'm hesitant to "just" do it.

If anything, I'd first remove ByteString instances completely, leaving only Binary ByteString for a while.

@dustin
Copy link
Author

dustin commented Oct 31, 2022

It's pretty clearly incorrect right now. I'm sure something might break, but having it assume a ByteString is in some specific character set for the binding to work is surprising in at least two ways, but I'm having trouble imagining a way that this could break code that wasn't already buggy. The goal is to return any arbitrary ByteString you stored, which it won't do currently. It would still return the subset that it would've returned with the current code, wouldn't it?

I think it'd be fine to remove the instances and/or document how to handle ByteStrings, though. It's very not obvious currently and the documentation around escaping is at best misleading since it says Escape is "for all text-like types" which ByteString is not meant to represent.

@phadej
Copy link
Collaborator

phadej commented Oct 31, 2022

but I'm having trouble imagining a way that this could break code that wasn't already buggy.

Using ByteString for textual values. E.g. doing something like query conn ... (Base16.encode value, ...) where converting to something else (e.g. Text) is redundant as Textis anyway encoded intoByteString`.

@dustin
Copy link
Author

dustin commented Oct 31, 2022

Why would that break anything? My understanding is that it can only store a subset of valid ByteString values. Would storing all of them invalidate your example?

@phadej
Copy link
Collaborator

phadej commented Oct 31, 2022

Inserting as binary

*Database.PostgreSQL.Simple Data.ByteString> execute conn "INSERT INTO foo(bar) VALUES (?);" (Only (Binary ("foo" :: ByteString)))
1

querying as text

*Database.PostgreSQL.Simple Data.ByteString> query_ conn "SELECT * from foo;" :: IO [Only ByteString]
[Only {fromOnly = "\\x666f6f"}]

doesn't roundtrip.

Also other way around doesn't work. Insertting as text

*Database.PostgreSQL.Simple Data.ByteString> execute conn "INSERT INTO foo(bar) VALUES (?);" (Only ("foo" :: ByteString))
1

but querying as binary:

*Database.PostgreSQL.Simple Data.ByteString> query_ conn "SELECT * from foo;" :: IO [Only (Binary ByteString)]
*** Exception: Incompatible {errSQLType = "text", errSQLTableOid = Just (Oid 27717), errSQLField = "bar", errHaskellType = "Binary ByteString", errMessage = "types incompatible"}

with

create table foo(bar text);

as the schema.

@dustin
Copy link
Author

dustin commented Oct 31, 2022

OK, but didn't you throw that exception on the last one?

In any case, the example is obviously flawed, though someone probably has written that code somewhere. It'd be fine with making it not compile as it would've at least not told me that I had a bad UTF-8 character in my ByteString which isn't very encouraging.

@phadej
Copy link
Collaborator

phadej commented Oct 31, 2022

OK, but didn't you throw that exception on the last one?

That example illustrates that if there is an existent software which uses textual bytestrings, it successfully inserts and queries the data.
If the ByteString instance is changed to behave like Binary ByteString now, that application will suddenly start throwing exceptions (and insert bogus data due implicit bytea to text conversion on postgresql server side).

It's true that those oid-mismatch exceptions are thrown by postgresql-simple, and they will be thrown in the future, as we do runtime type checking when possible to catch logic errors early.

As I said, that design wart is made over a decade ago, and it's not easy to unwrap. The type-checker doesn't help with migration, except if the instance is removed, but that will cause inconveniences too. (The Base16.encode example is real code).

@googleson78
Copy link

googleson78 commented Oct 2, 2024

Now that deprecated instances are accepted and implemented, can we revisit this with the plan of adding a deprecation and eventually removing and then adding back the proper instance?

@epoberezkin
Copy link

We're fighting the same issue here.

If anything, I'd first remove ByteString instances completely, leaving only Binary ByteString for a while.

It is only needed for ToField, as FromField works correctly. It seems that's the least disruptive option, as it can be simply fixed in downstream code by either defining orphan instance (for expediency) or by moving to newtypes. Existence of ByteString instance creates risk to code breaking intermittently, and it's very hard to debug.

Aeson also does not define instances for ByteString, and it's a more sane decision than any alternative.

If there would be an agreement to eventually remove it, we'd just fork it now and use fork till it's merged.

@Daniel-Diaz
Copy link

Daniel-Diaz commented Jan 6, 2025

I just ran into this. Inserting bytestring fails while attempting to decode it as UTF-8 (!). I agree it doesn't make sense to assume a ByteString value to have UTF-8 encoding. What's the current workaround? If the correct behavior won't be implemented, at least it'd be nice to have a newtype or something.

@Daniel-Diaz
Copy link

This works:

import Data.ByteString (ByteString)
import Data.ByteString qualified as ByteString
import Data.Binary.Builder qualified as Builder
import Database.PostgreSQL.Simple.ToField qualified as SQL
import Numeric (showHex)

newtype SQLByteString = SQLByteString ByteString

instance SQL.ToField SQLByteString where
  toField (SQLByteString bytes) = SQL.Plain $ SQL.inQuotes $
    "\\x" <> ByteString.foldr
      (\byte r -> (if byte <= 0xF then Builder.putCharUtf8 '0' else mempty)
               <> Builder.putStringUtf8 (showHex byte [])
               <> r
        ) mempty bytes

I can add this in a PR if there's interest. The newtype can be named whatever you like. I'm bad at names. I guess it could go in Database.PostgreSQL.Simple.Newtypes. Either that, or making this the official instance for ByteString.

@dustin
Copy link
Author

dustin commented Jan 6, 2025

There's already a type Binary which does this.

Treating ByteString as UTF-8 Text is objectively wrong but will continue to be a source of bugs indefinitely.

@Daniel-Diaz
Copy link

There's already a type Binary which does this.

Treating ByteString as UTF-8 Text is objectively wrong but will continue to be a source of bugs indefinitely.

I see! I had totally missed this. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants