[Pitch] Apache avro serialization #74

@jspaezp

Hi y'all!

I started a (VERY EARLY) prototype that implements serialization to Apache Avro.
I think it would be a good alternative to JSON, with more efficient disk usage.

https://github.com/jspaezp/avrospeclib

I am still implementing the schema using pydantic and deriving the Avro
schema from it.
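
To make the "derive the Avro schema from the model" idea concrete, here is a minimal sketch. It is not the code in avrospeclib: it uses stdlib `dataclasses` as a stand-in for pydantic so it runs without dependencies, the `Peak` and `Spectrum` models are hypothetical, and only a few annotation kinds are handled (primitives, `List`, `Optional`, nested records).

```python
import dataclasses
import json
import typing

# Python primitive types mapped to Avro primitive type names.
_PRIMITIVES = {str: "string", int: "long", float: "double",
               bool: "boolean", bytes: "bytes"}


def avro_type(tp):
    """Translate one Python type annotation into an Avro schema fragment."""
    if tp in _PRIMITIVES:
        return _PRIMITIVES[tp]
    if dataclasses.is_dataclass(tp):
        return record_schema(tp)
    origin = typing.get_origin(tp)
    args = typing.get_args(tp)
    if origin is list:
        return {"type": "array", "items": avro_type(args[0])}
    if origin is typing.Union:
        # Optional[X] becomes an Avro union with "null" listed first.
        non_null = [a for a in args if a is not type(None)]
        return ["null"] + [avro_type(a) for a in non_null]
    raise TypeError(f"unsupported annotation: {tp!r}")


def record_schema(cls):
    """Build an Avro record schema (as a dict) from a dataclass."""
    hints = typing.get_type_hints(cls)
    return {
        "type": "record",
        "name": cls.__name__,
        "fields": [{"name": f.name, "type": avro_type(hints[f.name])}
                   for f in dataclasses.fields(cls)],
    }


# Hypothetical models, for illustration only.
@dataclasses.dataclass
class Peak:
    mz: float
    intensity: float


@dataclasses.dataclass
class Spectrum:
    peptide: str
    charge: int
    peaks: typing.List[Peak]
    comment: typing.Optional[str] = None


schema = record_schema(Spectrum)
print(json.dumps(schema, indent=2))
```

The resulting dict is plain JSON, which is the form Avro schemas are declared in. Field defaults and repeated use of the same named record are left out of this sketch.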

Some disk usage metrics on a reasonably large speclib I have:

    # ~ 50MB  binary speclib file from diann
    #  552M   tmp/speclib_out.tsv
    #  448M   tmp/speclib_out.mzlib.json # using mzspeclib
    #  148M   tests/data/test.mzlib.avro

Read/write speeds:

    avro write: 4.832904
    avro read: 6.133625
    json write: 6.304285
    json read: 4.992042
    pydantic validation: 19.415933 # not needed for Avro because the schema is enforced on write
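
For anyone who wants to reproduce this kind of comparison, a minimal timing harness could look like the sketch below. It is an assumption about how such numbers might be collected, not the script used for the figures above: it only covers the JSON side with the stdlib, the records are hypothetical toy data, and an Avro writer and reader would be timed by passing them to the same `time_call` helper.

```python
import json
import os
import tempfile
import time


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def write_json(records, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh)


def read_json(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical toy records standing in for a real spectral library.
records = [
    {"peptide": f"PEPTIDE{i}", "charge": 2, "mzs": [100.0 + i, 200.0 + i]}
    for i in range(1000)
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "speclib.json")
    _, write_s = time_call(write_json, records, path)
    loaded, read_s = time_call(read_json, path)
    size_bytes = os.path.getsize(path)

print(f"json write: {write_s:.6f} s")
print(f"json read:  {read_s:.6f} s")
print(f"file size:  {size_bytes} bytes")
```

A single run is noisy; repeating each call several times and reporting the minimum or median would give steadier numbers.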

Let me know if there is any interest in adopting it!
Best!
