wip: Variant blog post #739
Conversation
Preview URL: https://friendlymatthew.github.io/arrow-site. If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Force-pushed from 13068a2 to 3d6ee40
alamb left a comment
This looks great @friendlymatthew -- thank you so much.
I left some comments. How else can I help? I could, for example, make some diagrams.
> This article explains the limitations of JSON, how Variant solves these problems, and why you should be excited about it by analyzing performance characteristics on a real-world dataset.
>
> ## What's the problem with JSON?
It might help here to also point out the benefits of JSON. Namely:
- You don't have to define your schema up front before loading it into the OLAP system (the biggest one, I think)
- It can tolerate different structures row by row (the messy part)

Supporting JSON/semi-structured types makes OLAP systems applicable to many tasks traditionally done before loading into the data system (like ETL), enabling users to take advantage of the many great OLAP engines.
_posts/2025-11-25-variant.md (Outdated)
> There are many problems with this approach.
>
> _Read performance degrades quickly._ Evaluating any predicate requires scanning every row and deserializing the entire JSON payload into memory, no matter its size. Even multi-megabyte documents must be completely decoded just to inspect a single field.
It might also be worth emphasizing again that single-field reads require decoding the string, then parsing that string into an in-memory structure, and then walking that in-memory structure, for each row.
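A minimal sketch of that per-row cost, assuming a plain JSON string column and `serde_json` (the `user_id` field is just for illustration):

```rust
use serde_json::Value;

/// Counts rows whose "user_id" equals `target`. With a plain JSON string
/// column, every row pays the full cost: parse the whole document into an
/// in-memory DOM, then walk it -- even though we only need one field.
fn count_matching(rows: &[String], target: i64) -> usize {
    rows.iter()
        .filter(|raw| {
            // Parse the ENTIRE payload, however large it is...
            let doc: Value = match serde_json::from_str(raw) {
                Ok(v) => v,
                Err(_) => return false,
            };
            // ...just to inspect a single field.
            doc.get("user_id").and_then(Value::as_i64) == Some(target)
        })
        .count()
}
```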
> Variant is a data type designed to solve JSON's performance problems in OLAP systems, and its initial implementation has been released in Arrow 57.0.0.
I suggest adding links (e.g. to the crates.io arrow page)
_posts/2025-11-25-variant.md (Outdated)
> ### Variant has efficient serialization through a 2-column design
> JSON columns have naive serialization, which leads to unnecessary encoding overhead. For example, when storing `{"user_id": 123, "timestamp": "2025-04-21"}` thousands of times, the field names `user_id` and `timestamp` are written out in full for each row, even though they're identical across rows.
This is a good example. I think we could make a nice diagram that illustrates the point nicely: a bunch of JSON documents with repeated fields, showing that when encoded as JSON the strings get repeated over and over.
Then we can add a "here is how you would encode it as Variant" showing both the single copy of the field names as well as the split into two different fields 🤔
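In the meantime, a toy sketch of the split (hand-rolled for illustration; the real spec defines an exact binary layout for both columns):

```rust
/// Splits rows of (field name, value) pairs into the two Variant columns:
/// a shared metadata dictionary of unique, sorted field names, and per-row
/// values that refer to fields by integer id instead of repeating the
/// name strings thousands of times.
fn split_columns(rows: &[Vec<(&str, &str)>]) -> (Vec<String>, Vec<Vec<(u32, String)>>) {
    // Metadata column: unique, sorted field names, stored once.
    let mut names: Vec<String> = rows
        .iter()
        .flat_map(|row| row.iter().map(|(k, _)| k.to_string()))
        .collect();
    names.sort();
    names.dedup();

    // Value column: each row re-encoded against the dictionary.
    let values = rows
        .iter()
        .map(|row| {
            row.iter()
                .map(|(k, v)| {
                    let id = names.binary_search(&k.to_string()).unwrap() as u32;
                    (id, v.to_string())
                })
                .collect()
        })
        .collect();

    (names, values)
}
```

For the `{"user_id": 123, "timestamp": "2025-04-21"}` example, the metadata column holds `["timestamp", "user_id"]` exactly once, and every row's value stores field ids 0 and 1. Keeping the dictionary unique and sorted is also what enables the fast field lookups discussed below.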
> Even with fast field lookups, Variant still requires full deserialization to access any field, as the entire value must be decoded just to read a single field. This wastes CPU and memory on data that you don't need.
>
> Variant solves this problem with its own shredding specification. The specification defines how to extract frequently accessed fields from a Variant value and store them as separate typed columns in the file format. For example, if you frequently query a timestamp field called `start_timestamp` or a 64-bit integer field called `user_id`, they can be shredded into dedicated timestamp and integer columns alongside the Variant column.
It would also be good to add a link to the actual shredding spec for people to refer to if they want more details.
> The metadata dictionary can similarly be optimized for faster search performance by making the list of field names unique and sorted.
> ### Variant can leverage file format capabilities
I know this is a general capability applicable to all (columnar) file formats, but since this blog is about the Variant that was added to Parquet, I recommend you focus this part on Parquet ("Variant can leverage Parquet's columnar architecture" or something).
> Zone maps can skip entire row groups where `start_timestamp` falls outside the query range. Dictionary encoding can compress repeated `user_id` values efficiently. Bloom filters can rule out row groups without your target user. More importantly, you only deserialize the shredded columns you need, not the Variant columns.
> Shredding makes the trade-off explicit: extract frequently queried fields into optimized columns at write time, and keep everything else flexible. You get columnar performance for common access patterns and schema flexibility for everything else.
I think it would help to show how this would work in practice:

```sql
SELECT count(*) FROM events
WHERE event["start_timestamp"] > '2025-01-01' AND event["user_id"] = 12345;
```

becomes

```sql
SELECT count(*) FROM events
WHERE typed_value.start_timestamp > '2025-01-01' AND typed_value.user_id = 12345;
```

You would have to then explain that predicate pushdown works for the typed column.
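On the write side, a sketch of opting in to the Parquet features mentioned above when writing the shredded columns, assuming the Rust `parquet` crate (builder method availability can vary by version):

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Writer configuration enabling the features discussed above: row-group
/// min/max statistics (the "zone maps"), dictionary encoding for repeated
/// values, and bloom filters for point lookups like user_id = 12345.
fn shredded_writer_props() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .set_dictionary_enabled(true)
        .set_bloom_filter_enabled(true)
        .build()
}
```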
> Our benchmarks focus on three key metrics: write performance when serializing data to Parquet files, storage efficiency comparing both compressed and uncompressed file sizes, and query performance across common access patterns.
> For query execution, we use DataFusion, a popular query engine. To work with Variant in DataFusion, we use [datafusion-variant](https://github.com/datafusion-contrib/datafusion-variant), a library that implements native Variant type support.
this looks great -- I can't wait for this section!
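For readers who want to try it, a minimal sketch of running such a query with DataFusion (the file and table names are made up, and any Variant-specific functions would come from datafusion-variant):

```rust
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    // Register a Parquet file of events as a table, then query it with SQL.
    // With shredding, the predicate below can be evaluated against the typed
    // `typed_value.user_id` column rather than decoding every Variant value.
    let ctx = SessionContext::new();
    ctx.register_parquet("events", "events.parquet", ParquetReadOptions::default())
        .await?;

    ctx.sql("SELECT count(*) FROM events WHERE typed_value.user_id = 12345")
        .await?
        .show()
        .await?;
    Ok(())
}
```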
Co-authored-by: Andrew Lamb <[email protected]>
This PR contains a blog post that explains what Variants are and why we should care.
Please note, this is still a work in progress, specifically the last section, which involves benchmark code to be written in tandem with the prose.
cc @alamb