Reduce allocations in reading

This is a follow up of #193.

My understanding is that this crate is optimized for reading to (the owning enum) `Value`, that encapsulates all types, like e.g. serde-json does.

This design makes it expensive to deserialize it to other formats. Specifically, the main challenge is that given a rows of strings, each row will have to allocate a new `String` because `Value` owns `String`, even when the values were already allocated into a `Vec<u8>` during buffering + decompression (See `Reader` and `Block`). This happens to all non-copy types (`String`, `Bytes`, `Fixed` and `Enum`).

Formats such as arrow have their own way of storing strings. Thus, currently, using `Value::String(String)` adds a large overhead (I can quantify them if we need to) to interoperate. Specifically, the data path from a `Read` of a uncompressed file with 1 column of strings to Value is:

```
Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> String - [move] -> Value
```

This yields an extra allocation per row per column of non-copy types (`String`, `Bytes`, `Fixed` and `Enum`).

Usually, it is more efficient to read it to a `Vec<u8>` and produce non-owned values from it, in this case this would be

```
Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> &'a str -> Value<'a>
```

(there are other tricks to avoid re-allocating a `Vec<u8>` per Block + decompression which can be discussed in separate).

One way to address this is to declare a "non-owned" `Value` (e.g. `ValueRef::String(&'a str)`), and have the decoder to be of the form `decode<'a>(&'a[u8]) -> Result<(ValueRef<'a>, usize)>` where `usize` is the number of read bytes.

Since each row block has a declared number of bytes upfront and we read them to a `buf: Vec<u8>` as part of the buffering in `Reader` + decompression, we can leverage the existence of this `buf` to decode them to non-owning `ValueRef`, thereby avoiding the allocation per row per column.

We can offer `impl<'a> From<ValueRef<'a>> for Value` that calls `to_string()` (and the like to other types such as `Bytes`), if the user wishes to remove the lifetime at the cost of an allocation.

This is a pretty large, though, because we would need to also change the signatures of `zag_i64` and the like to read from `&[u8]` instead of `R: Read`, since they must return how many bytes were consumed to "advance pointer of `&[u8]`".

I recently did this type of exercise as part of a complete re-write of the `parquet` crate [parquet2](https://github.com/jorgecarleitao/parquet2), where this exact same problem emerges: a column is composed by pages (avro "blocks"), that are decompressed and decoded individually; it is advantageous to read each page to `Vec<u8>` before decompressing and decoding them, but we should not perform further allocations when not needed, specially for strings, bytes, etc.

The result of this re-write was a [5-10x speedup](https://docs.google.com/spreadsheets/d/12Sj1kjhadT-l0KXirexQDOocsLg-M4Ao1jnqXstCpx0/edit#gid=1919295045), which is why I am cautiously confident that this effort may be worth here too.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce allocations in reading #195

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Reduce allocations in reading #195

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions