wip: Variant blog post #739
Conversation
Preview URL: https://friendlymatthew.github.io/arrow-site. If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Force-pushed from 13068a2 to 3d6ee40
alamb left a comment
This looks great @friendlymatthew -- thank you so much.
I left some comments. How else can I help? I could, for example, make some diagrams.
> This article explains the limitations of JSON, how Variant solves these problems, and why you should be excited about it by analyzing performance characteristics on a real-world dataset.
>
> ## What's the problem with JSON?
It might help here to also point out the benefits of JSON. Namely:
- You don't have to define your schema up front before loading it into the OLAP system (the biggest one, I think)
- It can tolerate different structures row by row (the messy part)

Supporting JSON/semi-structured types makes OLAP systems applicable to many tasks traditionally done before loading into the data system (like ETL), enabling users to take advantage of the many great OLAP engines.
_posts/2025-11-25-variant.md (Outdated)
> There are many problems with this approach.
>
> _Read performance degrades quickly._ Evaluating any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. Even multi-megabyte documents must be completely decoded just to inspect a single field.
It might also be worth emphasizing again that single-field reads require decoding the string, then parsing that string into an in-memory structure, and then walking that in-memory structure, for each row.
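A minimal sketch of that per-row cost, assuming a plain JSON string column and `serde_json` (the `user_id` field is just for illustration):

```rust
use serde_json::Value;

/// Counts rows whose "user_id" equals `target`. With a plain JSON string
/// column, every row pays the full cost: parse the whole document into an
/// in-memory DOM, then walk it -- even though we only need one field.
fn count_matching(rows: &[String], target: i64) -> usize {
    rows.iter()
        .filter(|raw| {
            // Parse the ENTIRE payload, however large it is...
            let doc: Value = match serde_json::from_str(raw) {
                Ok(v) => v,
                Err(_) => return false,
            };
            // ...just to inspect a single field.
            doc.get("user_id").and_then(Value::as_i64) == Some(target)
        })
        .count()
}
```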
> Variant is a data type designed to solve JSON's performance problems in OLAP systems, and its initial implementation has been released in Arrow 57.0.0.
I suggest adding links (e.g. to the crates.io arrow page)
_posts/2025-11-25-variant.md (Outdated)
> ### Variant has efficient serialization through a 2-column design
> JSON columns have naive serialization, which leads to unnecessary encoding overhead. For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.
This is a good example. I think we could make a nice diagram that illustrates the point nicely: a bunch of JSON documents with repeated fields, showing that when encoded as JSON the strings get repeated over and over.
Then we can add a "here is how you would encode it as Variant" showing both the single copy of the field names as well as the split into two different fields 🤔
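In the meantime, a toy sketch of the split (hand-rolled for illustration; the real spec defines an exact binary layout for both columns):

```rust
/// Splits rows of (field name, value) pairs into the two Variant columns:
/// a shared metadata dictionary of unique, sorted field names, and per-row
/// values that refer to fields by integer id instead of repeating the
/// name strings thousands of times.
fn split_columns(rows: &[Vec<(&str, &str)>]) -> (Vec<String>, Vec<Vec<(u32, String)>>) {
    // Metadata column: unique, sorted field names, stored once.
    let mut names: Vec<String> = rows
        .iter()
        .flat_map(|row| row.iter().map(|(k, _)| k.to_string()))
        .collect();
    names.sort();
    names.dedup();

    // Value column: each row re-encoded against the dictionary.
    let values = rows
        .iter()
        .map(|row| {
            row.iter()
                .map(|(k, v)| {
                    let id = names.binary_search(&k.to_string()).unwrap() as u32;
                    (id, v.to_string())
                })
                .collect()
        })
        .collect();

    (names, values)
}
```

For the `{"user_id": 123, "timestamp": "2025-04-21"}` example, the metadata column holds `["timestamp", "user_id"]` exactly once, and every row's value stores field ids 0 and 1. Keeping the dictionary unique and sorted is also what enables the fast field lookups discussed below.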
> Even with fast field lookups, Variant still requires full deserialization to access any field, as the entire value must be decoded just to read a single field. This wastes CPU and memory on data that you don't need.
>
> Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. For example, if you frequently query a timestamp field called `start_timestamp` or a 64-bit integer field called `user_id`, they can be shredded into dedicated timestamp and integer columns alongside the Variant column.
It would also be good to add a link to the actual shredding spec for people to refer to if they want more details.
> The metadata dictionary can similarly be optimized for faster search performance by making the list of field names unique and sorted.
> ### Variant can leverage file format capabilities
I know this is a general capability applicable to all (columnar) file formats, but since this blog is about the Variant that was added to Parquet, I recommend you focus this part on Parquet ("Variant can leverage Parquet's columnar architecture" or something).
> Zone maps can skip entire row groups where `start_timestamp` falls outside the query range. Dictionary encoding can compress repeated `user_id` values efficiently. Bloom filters can rule out row groups without your target user. More importantly, you only deserialize the shredded columns you need, not the Variant columns.
> Shredding makes the trade-off explicit: extract frequently queried fields into optimized columns at write time, and keep everything else flexible. You get columnar performance for common access patterns and schema flexibility for everything else.
I think it would help to show how this would work in practice:

```sql
SELECT count(*) FROM events
WHERE event["start_timestamp"] > '2025-01-01' AND event["user_id"] = 12345;
```

becomes

```sql
SELECT count(*) FROM events
WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id = 12345;
```

You would have to then explain that predicate pushdown works for the typed column.
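On the write side, a sketch of opting in to the Parquet features mentioned above when writing the shredded columns, assuming the Rust `parquet` crate (builder method availability can vary by version):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Writer configuration enabling the features discussed above: row-group
/// min/max statistics (the "zone maps"), dictionary encoding for repeated
/// values, and bloom filters for point lookups like user_id = 12345.
fn shredded_writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_dictionary_enabled(true)
        .set_bloom_filter_enabled(true)
        .build()
}
```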
> Our benchmarks focus on three key metrics: write performance when serializing data to Parquet files, storage efficiency comparing both compressed and uncompressed file sizes, and query performance across common access patterns.
> For query execution, we use DataFusion, a popular query engine. To work with Variant in DataFusion, we use [datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant), a library that implements native Variant type support.
this looks great -- I can't wait for this section!
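For readers who want to try it, a minimal sketch of running such a query with DataFusion (the file and table names are made up, and any Variant-specific functions would come from datafusion-variant):

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Register a Parquet file of events as a table, then query it with SQL.
    // With shredding, the predicate below can be evaluated against the typed
    // `typed_value.user_id` column rather than decoding every Variant value.
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default())
        .await?;

    ctx.sql("SELECT count(*) FROM events WHERE typed_value.user_id = 12345")
        .await?
        .show()
        .await?;
    Ok(())
}
```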
Co-authored-by: Andrew Lamb <[email protected]>
This PR contains a blog post that explains what Variants are and why we should care.
Please note, this is still a work in progress, specifically the last section, which involves benchmark code to be written in tandem with the prose.
cc @alamb