
datalake: arrow_to_iobuf interface #23375

Merged

Conversation

Contributor

@jcipar jcipar commented Sep 18, 2024

This adds an arrow_to_iobuf interface that converts Arrow data to iobufs representing Parquet files that can be written to disk. There are two components:

  1. An implementation of arrow::io::OutputStream that collects data in iobufs
  2. A class that creates a parquet::arrow::FileWriter using that output stream and allows the caller to extract the generated iobufs.

This allows us to separate the compute side of generating Parquet, which still occurs in the Arrow library, from the file I/O, which can now be made seastar-friendly.
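For orientation, a minimal sketch of the two components' shape, assembled from the snippets quoted later in this review; member names like _current_iobuf, _position, and _writer are assumptions for illustration, not the merged code verbatim:

#include "bytes/iobuf.h"

#include <arrow/io/interfaces.h>
#include <parquet/arrow/writer.h>

#include <memory>

namespace datalake {

// Component 1: an arrow::io::OutputStream that appends everything Arrow
// writes onto an iobuf instead of a file.
class iobuf_output_stream : public arrow::io::OutputStream {
public:
    arrow::Status Write(const void* data, int64_t nbytes) override;
    arrow::Status Close() override;
    arrow::Result<int64_t> Tell() const override;
    bool closed() const override;

    // Take the data accumulated so far and clear the internal state.
    iobuf take_iobuf();

private:
    iobuf _current_iobuf;
    int64_t _position{0};
    bool _closed{false};
};

// Component 2: owns a parquet::arrow::FileWriter that targets the stream
// above; callers feed it Arrow arrays and pull the resulting Parquet bytes
// back out as iobufs.
class arrow_to_iobuf {
public:
    explicit arrow_to_iobuf(const arrow::Schema& schema);

    void add_arrow_array(std::shared_ptr<arrow::Array> data);
    iobuf take_iobuf();
    iobuf close_and_take_iobuf();

private:
    std::shared_ptr<iobuf_output_stream> _ostream;
    std::unique_ptr<parquet::arrow::FileWriter> _writer;
};

} // namespace datalake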

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@@ -86,7 +29,7 @@ test_int: int32
 test_long: int64
 test_float: float
 test_double: double
-test_decimal: decimal128(8, 16)
+test_decimal: decimal128(16, 8)
Contributor Author

This was a bug in the previous versions of the test code. Precision must be greater than scale, but it was reversed before. It didn't matter earlier because we weren't validating that we could translate the data to Parquet, but now that we are, this needs to be correct.
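For reference, a minimal illustration of the corrected type, assuming the test builds it with Arrow's decimal128 factory (the helper name below is made up for the example):

#include <arrow/type.h>

#include <memory>

// decimal128 takes (precision, scale); Parquet requires precision >= scale.
std::shared_ptr<arrow::DataType> make_test_decimal_type() {
    // was arrow::decimal128(8, 16): precision 8, scale 16, which is invalid
    return arrow::decimal128(/*precision=*/16, /*scale=*/8);
}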

@vbotbuildovich
Collaborator

new failures in https://buildkite.com/redpanda/redpanda/builds/54698#0192071e-9aea-4b40-8a3f-cfa5af82c2fe:

"rptest.tests.delete_records_test.DeleteRecordsTest.test_delete_records_concurrent_truncations.cloud_storage_enabled=True.truncate_point=at_high_watermark"

#include <parquet/arrow/reader.h>
#include <parquet/type_fwd.h>

TEST(ParquetWriter, DoesNothing) {
Contributor Author

Forgot to rename this test after actually writing a test. fix coming...

//// METHODS SPECIFIC TO IOBUF OUTPUT STREAM ////
iobuf take_iobuf();
Member

a comment that describes the method?

Contributor Author

Removed ;-)

Comment on lines -294 to +243
- -- child 5 type: decimal128(8, 16)
+ -- child 5 type: decimal128(16, 8)
  [
- 0.E-16,
- 0.E-16,
- 0.E-16,
- 0.E-16,
- 0.E-16
+ 0.E-8,
+ 0.E-8,
+ 0.E-8,
+ 0.E-8,
+ 0.E-8
Member

seems like this should be in a separate commit explaining the issue

Contributor Author

Done

Contributor

@andrwng andrwng left a comment

Mostly nits. I take it there will be an additional abstraction for IO in parquet_writer that isn't included here?

Comment on lines -34 to +29
void add_arrow_array(std::shared_ptr<arrow::Array> data);
iobuf take_iobuf();
iobuf close_and_take_iobuf();
Contributor

nit: could you add some light documentation about what these are and their relationship with one another?

Contributor Author

Done

Comment on lines 19 to 20
#include <filesystem>
#include <utility>
Contributor

nit: probably not needed in the header?

Contributor Author

Removed

Comment on lines 22 to 24
namespace arrow {
class Array;
}
Contributor

nit: add comment on the ending brace?

Also, if we have this, I'm wondering whether we still need to include arrow/io/memory.h? Or is that included for something else?

Contributor Author

I'm not totally sure what's going on. When I forward declare this and move #include <parquet/arrow/writer.h> to the cc file it fails on an incomplete type, but if I keep that include in the header it works. Same problem if I #include <parquet/type_fwd.h>
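One plausible explanation (an assumption, not verified against this build): a std::unique_ptr<parquet::arrow::FileWriter> member needs the complete FileWriter type wherever the class's destructor is instantiated, so the forward declaration only compiles if the destructor is declared in the header and defined in the .cc after the full include. A sketch:

// parquet_writer.h (sketch)
#include <memory>

namespace parquet::arrow {
class FileWriter; // forward declaration; the heavy header stays in the .cc
} // namespace parquet::arrow

namespace datalake {
class arrow_to_iobuf {
public:
    arrow_to_iobuf();
    ~arrow_to_iobuf(); // declared here, defined in the .cc where
                       // <parquet/arrow/writer.h> supplies the full type

private:
    std::unique_ptr<parquet::arrow::FileWriter> _writer;
};
} // namespace datalake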

#include <stdexcept>

namespace datalake {
class iobuf_output_stream : public arrow::io::OutputStream {
Contributor

nit: maybe stick this in an anonymous namespace? Also could you move the implementation of the methods above the implementation of the arrow_to_iobuf methods? Just so it's easier to read this class together

Alternatively just inline everything in this definition, given the methods are all tiny

Contributor Author

I put it in an anonymous namespace. I have a slight preference for keeping the methods out of the class definition, but if you have a strong opinion on it I can change that.

Member

@dotnwat dotnwat Sep 19, 2024

The anonymous namespace is about avoiding external linkage where possible. I think Andrew's point is that iobuf_output_stream is only used in this one translation unit.

> preference for keeping the methods out of the class definition

Not sure what this was referring to w.r.t. anonymous namespace?

Contributor Author

> Not sure what this was referring to w.r.t. anonymous namespace?

It's not about anonymous namespaces. I interpreted the last line of Andrew's comment to be about putting the method definitions in the class definition instead of after the class.

Contributor

> I have a slight preference for keeping the methods out of the class definition, but if you have a strong opinion on it I can change that.

Not a strong preference, but I typically inline if the code is small, just so it's easier on the eyes. Feel free to leave it as is; thanks for moving it!

}

void arrow_to_iobuf::add_arrow_array(std::shared_ptr<arrow::Array> data) {
arrow::ArrayVector data_av = {data};
Contributor

{std::move(data)}?

Contributor Author

Done

Comment on lines 12 to 13
#include "datalake/data_writer_interface.h"

Contributor

nit: I might be missing something, what do we need this here for? Same in the cc

Contributor

Oh probably that this isn't the final state of parquet_writer, since this isn't doing IO yet!

Contributor Author

Actually, I think you're right. I was thinking this would implement that interface, but the next PR is a higher level wrapper for this.

Also, that interface will have to change to make it futurized.

Comment on lines 68 to 70
auto vbegin = iobuf::byte_iterator(
full_result.cbegin(), full_result.cend());
auto vend = iobuf::byte_iterator(full_result.cend(), full_result.cend());
std::string full_result_string;
// Byte iterators don't work with the string constructor.
while (vbegin != vend) {
full_result_string += *vbegin;
++vbegin;
}
Contributor

nit: wondering if bytes_to_iobuf() from bytes/bytes.h works here?

Contributor Author

This is converting an iobuf to bytes; bytes_to_iobuf() goes the other direction.


datalake::arrow_to_iobuf writer(*schema_translator.build_arrow_schema());

// The first write is a special case because it is 4 bytes longer.
Contributor

Curious, is this explaining why we're using EXPECT_NEAR instead of EXPECT_EQ? Generally wondering where the 4 is showing up in this test?

Contributor Author

No, I had previously been checking the exact value. When I was only doing 2 batches they were consistent, but when I switched to more batches I noticed that they are not.

Comment on lines 49 to 57
LIBRARIES
v::application
v::features
v::gtest_main
v::kafka_test_utils
v::datalake
v::model_test_utils
v::iceberg_test_utils
LABELS storage
Contributor

nit: probably only need gtest, datalake, and iceberg test utils? Also, the labels in this file are off.

Contributor Author

Done

iobuf close_and_take_iobuf();

private:
std::shared_ptr<iobuf_output_stream> _outfile;
Contributor

nit: maybe _ostream? given this isn't a file?

Contributor Author

Done

Comment on lines 66 to 70
// Check that the data is a valid parquet file. Convert the iobuf to a
// single buffer then import that into an arrow::io::BufferReader
auto vbegin = iobuf::byte_iterator(
full_result.cbegin(), full_result.cend());
auto vend = iobuf::byte_iterator(full_result.cend(), full_result.cend());
std::string full_result_string;
// Byte iterators don't work with the string constructor.
while (vbegin != vend) {
full_result_string += *vbegin;
++vbegin;
}
Member

Does this work:

auto b = iobuf_to_bytes()
std::string(b.c_str(), b.size())

Or add a helper for this case somewhere, like bytes/string.h; you can model it after iobuf_to_bytes in bytes.h.

Contributor Author

Better yet, the Arrow BufferReader can accept a pointer and length directly.
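A sketch of that resolution, assuming the Arrow build in use still offers the non-owning (data, size) BufferReader constructor; the helper name and the use of iobuf_to_bytes are illustrative:

#include "bytes/bytes.h"
#include "bytes/iobuf.h"

#include <arrow/io/memory.h>

// Linearize the iobuf and hand Arrow a pointer and length directly. The
// backing bytes must outlive the reader, so both live in the same scope.
void read_back_parquet(const iobuf& full_result) {
    bytes linearized = iobuf_to_bytes(full_result);
    arrow::io::BufferReader reader(
      linearized.c_str(), static_cast<int64_t>(linearized.size()));
    // ... open with parquet::arrow::OpenFile() and assert on the resulting
    // table, as the test does.
}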

@jcipar jcipar force-pushed the jcipar/seastar-friendly-arrow-writer branch 3 times, most recently from 65203de to e87f74b Compare September 20, 2024 17:29

namespace datalake {

namespace {
Member

Hmm, I'm not sure it makes sense to have an anonymous namespace in a header.

Contributor

+1, I think the code should work without wrapping the forward decl?

Contributor

I think there is a clang-tidy warning for not using anonymous namespaces in headers, but can't find it with a quick search. Let's remove 👍

src/v/datalake/parquet_writer.h (resolved)
// virtual Status Write(const std::shared_ptr<Buffer>& data);

// Take the data from the iobuf and clear the internal state.
iobuf take_iobuf();
Member

@dotnwat dotnwat Sep 20, 2024

Looks like this should be r-value qualified, i.e. iobuf take_iobuf() &&, since _current_iobuf will be left in a moved-from state after this call.

EDIT: see later comment in test regarding this and writer reuse.

Comment on lines 44 to 48
datalake::arrow_to_iobuf writer(*schema_translator.build_arrow_schema());

for (int i = 0; i < 10; i++) {
writer.add_arrow_array(result);
iobuf serialized = writer.take_iobuf();
Member

It looks like there is a use-after-move issue here. When you call writer.take_iobuf, that proxies to iobuf iobuf_output_stream::take_iobuf() { return std::move(_current_iobuf); }, but iobuf doesn't formally specify its moved-from state. It works because the move happens to leave it empty, but I think we should not depend on this. If you r-value qualified iobuf_output_stream::take_iobuf() per my other comment, then I think you'd have a cascade of changes resulting in a use-after-move clang-tidy warning here.

I see two options. One is to not re-use the writer. This is the "cleanest" option, and if the writer is lightweight (it looks like it is) then it probably makes the most sense.

The other option would be to not r-value qualify take_iobuf, and call it something like iobuf reset(), which would move the current iobuf out of the ostream and explicitly reset it.

Contributor Author

The writer includes a parquet::arrow::FileWriter, which is stateful, so I don't think it would work to create a new writer, but resetting the iobuf in take_iobuf should work.

Contributor Author

I made a copy of the iobuf so I could reset the _current_iobuf and return the copy. How can I return that by rvalue reference when it is allocated on the stack?

Contributor

> I made a copy of the iobuf so I could reset the _current_iobuf and return the copy. How can I return that by rvalue reference when it is allocated on the stack?

Does it work to do something like

iobuf take_iobuf() {
    iobuf b = std::move(_current_iobuf);
    _current_iobuf = {}; // or like _current_iobuf.clear() or somesuch
    return b;
}

Contributor

iobuf take_iobuf() { return std::exchange(_current_iobuf, {}); }

Member

+1 to Tyler and Andrew; neither suggestion makes a copy.

Contributor Author

Done
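For reference, a sketch of where this thread landed, continuing the class shapes sketched near the top of the PR and using the std::exchange suggestion; the close_and_take_iobuf flow is an assumption based on the snippets above:

#include <tuple>
#include <utility>

// Hand back the accumulated bytes and leave the stream holding a fresh,
// well-defined empty iobuf (no reliance on iobuf's moved-from state).
iobuf iobuf_output_stream::take_iobuf() {
    return std::exchange(_current_iobuf, {});
}

// Assumed flow: closing the parquet::arrow::FileWriter flushes the Parquet
// footer into the stream before the final bytes are taken.
iobuf arrow_to_iobuf::close_and_take_iobuf() {
    std::ignore = _writer->Close(); // error handling elided in this sketch
    return _ostream->take_iobuf();
}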

Contributor

@andrwng andrwng left a comment

LGTM, pending changes from Noah's comments about iobufs

@jcipar jcipar force-pushed the jcipar/seastar-friendly-arrow-writer branch 2 times, most recently from 254961e to fe9d575 Compare September 24, 2024 13:54
Comment on lines +21 to +23
explicit arrow_to_iobuf(const arrow::Schema& schema);

void add_arrow_array(std::shared_ptr<arrow::Array> data);
Contributor

For this array, how do we know what elements of the array map to specific schema elements in the Schema? Does an arrow Array have a pointer to its schema element, or is there some ID mapping?

For example, what happens if we reverse all the arrays before calling add_arrow_array?

Contributor Author

The array contains a pointer to its data type, yes. This includes both the types and column names.
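A small illustration of what that means in Arrow's C++ API, assuming a struct-typed array like the ones the tests here build (illustrative only):

#include <arrow/api.h>

#include <iostream>
#include <memory>

// Every arrow::Array carries its DataType; for a struct array that type also
// holds the field (column) names, so arrays can be matched to schema elements
// without positional bookkeeping.
void print_columns(const std::shared_ptr<arrow::Array>& data) {
    auto type = data->type();
    if (type->id() == arrow::Type::STRUCT) {
        for (const auto& field : type->fields()) {
            std::cout << field->name() << ": " << field->type()->ToString()
                      << "\n";
        }
    }
}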

Contributor

@rockwotj rockwotj left a comment

A question on the arrow interfaces. Looking good!

@jcipar jcipar force-pushed the jcipar/seastar-friendly-arrow-writer branch from fe9d575 to 80ec67b Compare September 24, 2024 20:42
The parameters to the decimal type were incorrect in the test code.
Precision must be greater than scale, but it was reversed before. It
didn't matter because we weren't validating that we could translate the
data to Parquet, but once we start translating data to Parquet, this
will generate an error.
@jcipar jcipar force-pushed the jcipar/seastar-friendly-arrow-writer branch from 80ec67b to 061e378 Compare September 24, 2024 21:13
This adds an arrow_to_iobuf interface that converts Arrow data to iobufs
representing Parquet files that can be written to disk. There are two
components:
1. An implementation of arrow::io::OutputStream that collects data in
iobufs
2. A class that creates a parquet::arrow::FileWriter using that output
stream and allows the caller to extract the generated iobufs.

This allows us to separate the compute side of generating Parquet, which
still occurs in the Arrow library, from the file I/O, which can now be
made seastar-friendly.
@jcipar jcipar force-pushed the jcipar/seastar-friendly-arrow-writer branch from 061e378 to e943417 Compare September 25, 2024 15:19
iobuf close_and_take_iobuf();

private:
class iobuf_output_stream;
Member

👍

@andrwng andrwng merged commit 2c3fc7d into redpanda-data:dev Sep 26, 2024
17 checks passed