
explanation of Arrow.Stream vs. Arrow.Table seems ambiguous #472

Open
bdklahn opened this issue Jul 2, 2023 · 3 comments


bdklahn commented Jul 2, 2023

In addition to `Arrow.Table`, the Arrow.jl package also provides `Arrow.Stream` for processing arrow data. While `Arrow.Table` will iterate all record batches in an arrow file/stream, concatenating columns, `Arrow.Stream` provides a way to *iterate* through record batches, one at a time. Each iteration yields an `Arrow.Table` instance, with columns/data for a single record batch. This allows, if so desired, "batch processing" of arrow data, one record batch at a time, instead of creating a single long table via `Arrow.Table`.
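For what it's worth, the behavior that passage describes can be sketched like this (a minimal sketch; the file name `data.arrow` is hypothetical, and `Arrow.Table`/`Arrow.Stream` are used as the Arrow.jl docs above describe them):

```julia
using Arrow

# Eager: Arrow.Table iterates every record batch up front and presents
# the columns of all batches concatenated into one logical table
tbl = Arrow.Table("data.arrow")        # hypothetical file

# Lazy: Arrow.Stream is an iterator; each iteration yields an
# Arrow.Table holding the columns of a single record batch
for batch in Arrow.Stream("data.arrow")
    # `batch` supports the same column access as Arrow.Table,
    # but only covers one record batch
end
```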

Italicizing iterate for the second instance vs. the first doesn't seem to illustrate any unambiguous difference.

Does "Table" iterate batch units, while "Stream" iterates record units?

As I understand it, a stream might be more like "on-demand" iteration, like a generator, which only advances when necessary or able (space opens in a fixed buffer), while "Table" might "eagerly" load an entire table (batch) at a time.

It's not clear to me what the units of each iteration are, or what the difference is if both are iterating a table.

Maybe it would be simpler to see, if I just look at the code. :-)

@bdklahn bdklahn changed the title explanation of Arrow.Stream vs. Arrow.Table ambiguous explanation of Arrow.Stream vs. Arrow.Table seems ambiguous Jul 2, 2023
Contributor

Moelf commented Jul 2, 2023

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

this is saying it iterates AND concatenates, so it holds the entire table in RAM (especially a problem if you're dealing with a compressed file).

Author

bdklahn commented Jul 6, 2023

While Arrow.Table will iterate all record batches in an arrow file/stream, concatenating columns

this is saying it iterates AND concatenates, so it holds the entire table in RAM (especially a problem if you're dealing with a compressed file).

That's part of the problem: The documentation is NOT saying that. You are.

After re-reading a dozen times, I think I might understand what this means to say.

I think it is partly confusing to use the word "iterate". That is a verb, which indicates something is happening now (e.g. when a Table or Stream object is instantiated).
It might make more sense to say "iterator" (noun). That is an object, like a generator or a file handle, which points to a position on disk (or in memory) and has some state knowledge of how far to jump ahead for each iteration.

I believe when a Table is instantiated, it presents a view where all the batches appear as if they were a single "batch" (one table). In this case, an iterator might be constructed to have a step size of only one record (e.g. row). When a Stream object is constructed, it looks like each step will produce all the records in a batch and wrap them as a Table. Each iteration will produce a new table (as the docs indicate).
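If that reading is right, it could be checked with something like this (a sketch; `data.arrow` and the column name `x` are hypothetical):

```julia
using Arrow

tbl = Arrow.Table("data.arrow")
nrows_total = length(tbl.x)            # all batches, one concatenated column

nrows_batches = 0
for batch in Arrow.Stream("data.arrow")
    nrows_batches += length(batch.x)   # rows in this batch only
end

# same data either way, just different iteration granularity
@assert nrows_batches == nrows_total
```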

I understand (from deduction and experience) that having compressed arrow will require an additional "buffer" of the actual binary format, because compressed bits aren't a memmap-optimized data form. So, yes, if you use compressed arrow, loading it in batch by batch can mitigate this. But if you are using Arrow at all, it almost doesn't make sense to compress (even with, e.g., the lz4-compressed feather format); if you are doing any compression, maybe just use parquet. So, given normal (uncompressed) Arrow, an "entire table in RAM" should not be an issue. That's one of the main purposes of using Arrow in the first place: "out of memory" processing. I would think the memory-loaded schema size for Table would not be significantly bigger than that of each tabular batch (if at all).

Anyway, in terms of the Table interface, it might even be bad practice to mention iteration at all. Typically you want to encourage thinking in terms of vectorization (e.g. broadcasting over as many rows at once) vs. any implication of processing anything row by row, at least for DataFrame-style interfacing. For Stream, it makes sense to mention iteration (for the comparison context), because the "processing" here is the actual process of loading (and unloading) batches of records in and out of memory.
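To illustrate that distinction (again a sketch with a hypothetical file and column name):

```julia
using Arrow

# Table: think column-wise / vectorized, not row-by-row
tbl = Arrow.Table("data.arrow")
doubled = 2 .* tbl.x                   # broadcast over the whole column

# Stream: iteration is the point -- load, process, and release one
# record batch's worth of data at a time
for batch in Arrow.Stream("data.arrow")
    doubled_batch = 2 .* batch.x
end
```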

I'm not sure if something a little more like this would make sense (and is accurate):

". . . While Arrow.Table provides an interface where all record batches appear vertically concatenated into a single table, Arrow.Stream creates an iterator, where each iteration produces a separate table from each batch. A Stream can be helpful for large compressed files (e.g. lz4-compressed feather files), where decompressed arrow data will need to be buffered in memory. The buffer would only need to accommodate one batch at a time, vs. all the batches at once, as would be the case with Arrow.Table."

That could probably be simplified. I left out "concatenating columns" because I'm not sure how relevant or distracting that might be in this context. I mean . . . that's already implicit in the definition of an Arrow batch.

Contributor

Moelf commented Jul 6, 2023

That's part of the problem: The documentation is NOT saying that. You are.

Beats me. Ever since this project was merged into the Apache monorepo, it's been impossible to get anything through in a responsive manner; in fact, the "doc is not rendering up to date" issue took what felt like almost a year to address. So sorry, I can only add information in GitHub issues :)
