Native Protobuf Record Payload #13867
Replies: 4 comments 3 replies
- The payload API has been deprecated since 1.1; what you really want to propose here is a new file format encoded using
- @gudladona We plan to move towards RecordMerger and the 1.x merge modes to handle data using engine-native POJOs. As long as protobuf can be converted into a Spark InternalRow at entry, would that address your concern? Or are you proposing storing records at rest in protobuf?
- Thanks for the feedback; I think I need to familiarize myself with the RecordMerger. Historically, the (Avro) data itself is stored in the payload as bytes. See the base class below. Here we store the Avro as bytes anyway to use the
- The whole payload stuff has been deprecated since 1.1:
-
Hello,
I'd like to pitch the idea of making the Hudi record payload support native protobuf bytes. Protobuf has become a very popular wire format thanks to its small per-message payload size, particularly over streaming. Protobuf data can also be represented as a Spark InternalRow without first converting to an Avro record. Additionally, Parquet supports a proto-native writer (ProtoParquetWriter) that can take a DynamicMessage as its source, and a DynamicMessage can be built easily from raw proto bytes. Doing this would alleviate the expensive conversion chain Proto --> Avro --> HoodieRecordPayload (which contains re-serialized Avro bytes) followed by an Avro Parquet writer during the file write process.
Kindly let me know your thoughts; I am happy to start a PR if the idea (improvement) sounds legit.