Skip to content

Reduce allocations in reading #195

@jorgecarleitao

Description

@jorgecarleitao

This is a follow up of #193.

My understanding is that this crate is optimized for reading to (the owning enum) Value, that encapsulates all types, like e.g. serde-json does.

This design makes it expensive to deserialize it to other formats. Specifically, the main challenge is that given a rows of strings, each row will have to allocate a new String because Value owns String, even when the values were already allocated into a Vec<u8> during buffering + decompression (See Reader and Block). This happens to all non-copy types (String, Bytes, Fixed and Enum).

Formats such as arrow have their own way of storing strings. Thus, currently, using Value::String(String) adds a large overhead (I can quantify them if we need to) to interoperate. Specifically, the data path from a Read of a uncompressed file with 1 column of strings to Value is:

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> String - [move] -> Value

This yields an extra allocation per row per column of non-copy types (String, Bytes, Fixed and Enum).

Usually, it is more efficient to read it to a Vec<u8> and produce non-owned values from it, in this case this would be

Read - [read] -> Vec<u8> - [decompress] -> Vec<u8> - [decode] -> &'a str -> Value<'a>

(there are other tricks to avoid re-allocating a Vec<u8> per Block + decompression which can be discussed in separate).

One way to address this is to declare a "non-owned" Value (e.g. ValueRef::String(&'a str)), and have the decoder to be of the form decode<'a>(&'a[u8]) -> Result<(ValueRef<'a>, usize)> where usize is the number of read bytes.

Since each row block has a declared number of bytes upfront and we read them to a buf: Vec<u8> as part of the buffering in Reader + decompression, we can leverage the existence of this buf to decode them to non-owning ValueRef, thereby avoiding the allocation per row per column.

We can offer impl<'a> From<ValueRef<'a>> for Value that calls to_string() (and the like to other types such as Bytes), if the user wishes to remove the lifetime at the cost of an allocation.

This is a pretty large, though, because we would need to also change the signatures of zag_i64 and the like to read from &[u8] instead of R: Read, since they must return how many bytes were consumed to "advance pointer of &[u8]".

I recently did this type of exercise as part of a complete re-write of the parquet crate parquet2, where this exact same problem emerges: a column is composed by pages (avro "blocks"), that are decompressed and decoded individually; it is advantageous to read each page to Vec<u8> before decompressing and decoding them, but we should not perform further allocations when not needed, specially for strings, bytes, etc.

The result of this re-write was a 5-10x speedup, which is why I am cautiously confident that this effort may be worth here too.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions