Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unions with logicalTypes json encoded incorrectly #252

Open
kklipsch opened this issue Jul 15, 2022 · 10 comments
Open

Unions with logicalTypes json encoded incorrectly #252

kklipsch opened this issue Jul 15, 2022 · 10 comments

Comments

@kklipsch
Copy link
Contributor

kklipsch commented Jul 15, 2022

The spec states "A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute."

But goavro encodes the name of unions as "type.logicalType". The codec will not accept a record where the union is typed as anything else and it outputs this directly to json but that is incorrect. Binary encoding appears correct.

func TestNullableLogicalJSON(t *testing.T) {
		schema := `{
		"type": "record",
		"name": "nester",
		"namespace": "test",
		"fields" : [
			{"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
		]
	}`

	codec, err := goavro.NewCodec(schema)
	require.NoError(t, err)

	bs, err := codec.TextualFromNative(nil, map[string]interface{}{
		"a": goavro.Union("long.timestamp-millis", time.Date(2006, 1, 2, 15, 04, 05, 565000000, time.UTC)),
	})
	require.NoError(t, err)

	/*
	This is the bug.  This is how the reference implementation encodes the timestamp in a nullable union.
	Which matches the spec that states:
	"A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute."
	
	goavro encodes this as {"a":{"long.timestamp-millis":1136214245565}}
	*/
	require.Equal(t, `{"a":{"long":1136214245565}}`, string(bs))
@mihaitodor
Copy link
Collaborator

I am also looking at logical types and comparing Goavro to the Java implementation and this code that I managed to put together seems to confirm your findings:

// > export JAVA_HOME=/usr/local/opt/openjdk
// > export PATH="${JAVA_HOME}/bin:$PATH"
// > java --version
// openjdk 18.0.1 2022-04-19
// OpenJDK Runtime Environment Homebrew (build 18.0.1+0)
// OpenJDK 64-Bit Server VM Homebrew (build 18.0.1+0, mixed mode, sharing)
// > java -cp avro_1.11/avro-1.11.0.jar:avro_1.11/jackson-core-2.12.5.jar:avro_1.11/jackson-annotations-2.12.5.jar:avro_1.11/jackson-databind-2.12.5.jar:avro_1.11/slf4j-api-1.7.32.jar Main.java

import java.math.BigInteger;
import java.util.Arrays;
import java.io.*;
import org.apache.avro.Schema;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.*;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.Conversions;
import org.apache.avro.file.DataFileReader;
import java.util.HexFormat;

public class Main {
    static byte[] fromJsonToAvro(String json, Schema schema) throws Exception {
        InputStream input = new ByteArrayInputStream(json.getBytes());
        DataInputStream din = new DataInputStream(input);

        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);

        DatumReader<Object> reader = new GenericDatumReader<Object>(schema);
        Object datum = reader.read(null, decoder);

        GenericDatumWriter<Object>  w = new GenericDatumWriter<Object>(schema);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();

        Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);

        w.write(datum, e);
        e.flush();

        return outputStream.toByteArray();
    }

    public static String avroToJson(Schema schema, byte[] avroBinary) throws IOException {
        DatumReader<Object> datumReader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(avroBinary, null);
        Object avroDatum = datumReader.read(null, decoder);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, baos, false);
        writer.write(avroDatum, encoder);
        encoder.flush();
        baos.flush();
        return new String(baos.toByteArray());
    }

    static String schemaJSON = """
{
    "type": "record",
    "name": "nester",
    "namespace": "test",
    "fields" : [
        {"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
    ]
}
""";

    static String inputJSON = """
{"a":{"long":1136214245565}}
""";

    public static void main(String[] args) {
        try {
            Schema schema = new Schema.Parser().parse(schemaJSON);

            byte[] avroByteArray = fromJsonToAvro(inputJSON, schema);

            String outputJson = avroToJson(schema, avroByteArray);

            System.out.println(outputJson);
        }
        catch (Exception e) {
            System.out.println(e);
        }
    }
}

It outputs {"a":{"long":1136214245565}}. Also, if I change the inputJSON to {"a":{"long.timestamp-millis":1136214245565}}, it fails with org.apache.avro.AvroTypeException: Unknown union branch long.timestamp-millis, which makes sense given the output format...

Disclaimer: I'm not really experienced with Java and this is some copy/paste code from StackOverflow that I modified a bit....

@xmcqueen
Copy link
Contributor

Thanks for this issue. This will take some thinking over here. I'm not sure if this means the current implementation is simply broken, or if its just in violation of the spec (which can be regarded as a form of being broken too). Also I'm not sure is fixing this specific problem would cause breakage in the existing user base. It does sound like a risky fix. Please provide your comments if you know about those topics.

@kklipsch
Copy link
Contributor Author

From my point of view this means that JSON encoding is broken. You won't be able to use json encoded with goavro with other implementations of avro.

It likely will break any use cases where a current user is a) using json encoding with logicalTypes b) reading and writing exclusively with goavro (because if they had another client in the mix it would already be broken).

@mihaitodor
Copy link
Collaborator

Given how long this library has been around, a new major version should be released if this change is made. My understanding is that logical types are not used used by many people, so I think most users won't have any issues when upgrading to the new major version, which they'd need to do manually anyway. Adding and maintaining a new API for this seems more painful long term...

@xmcqueen
Copy link
Contributor

@kklipsch it sounds like you are in favor of getting this one going, and we would need to make a major version bump since it is expected to be a breaking change. Is that right? Will such a breaking change open up a more bright future or will it simply annoy the existing user base? It DOES look like its actually a small (but breaking) change and definitely seems to be the correct operation per the spec. I suppose it should be fine as long as both are kept up-to-date for a reasonable support period. Given that changes are infrequent here, that should not be a big maintenance problem.

Thoughts from you @kklipsch

@xmcqueen
Copy link
Contributor

I don't see a PR linked here. Is this one done already?

@kklipsch
Copy link
Contributor Author

I think it definitely needs to be fixed. I'm less opinionated on the version issue. To my mind this has always been broken. So its not a breaking change to fix it. But I can see the other side of the issue as well.

I did not actually attempt to fix this in a PR. I can take a crack at it if you like, but the earliest I'd be able to look at it would be next week.

@xmcqueen
Copy link
Contributor

I think it would be great if you'd fix it. We appreciate contributions!

@kklipsch
Copy link
Contributor Author

kklipsch commented Feb 6, 2023

Just a note that I won't be able to work on this for at least another week. I put an affirmative message if i take it on and anyone else should feel free to take it until I add that message.

@xmcqueen
Copy link
Contributor

xmcqueen commented Feb 7, 2023

Thanks, we'll see if anybody jumps at it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants