-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(arrow/ipc): implement lazy loading/zero-copy for IPC files #216
base: main
Are you sure you want to change the base?
Conversation
@vtk9 can you give this a try and confirm that it addresses your issue? I added unit tests which confirm that the pointers match and that we're avoiding allocations but it would be great to confirm on your end that this reduces the memory usage you were seeing. |
@lidavidm @kou @joellubi would one of you be able to look this over and give a review? I wanna merge this and then kick off an RC as requested from #218 (comment) |
I'll try to get to it on Monday |
func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
var err error | ||
|
||
if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a variable for len(Magic)*2+4
because we use this minimum size multiple times in this file?
func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
var err error | ||
|
||
if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give 4
a name something like footerSizeLen
or something?
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be int32 not uint32: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
<FOOTER SIZE: int32>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
metaBytes := buf[:blk.meta] | ||
|
||
prefix := 0 | ||
switch binary.LittleEndian.Uint32(metaBytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int32 not uint32?
https://github.com/apache/arrow/blob/313d11aa94c2be71142b55e3d8bb166d780c19c7/format/File.fbs#L45
metaDataLength: int;
|
||
func (r *mappedReaderImpl) getFooterEnd() (int64, error) { return int64(len(r.data)), nil } | ||
|
||
func (r *mappedReaderImpl) readFooter(f *footerBlock) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify more codes in this function with basicReaderImple.readFooter()
?
It seems that they have many similar codes.
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
|
||
var r io.Reader = sr | ||
// check for an uncompressed buffer | ||
if int64(uncompressedSize) != -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: while this is existing code, maybe it would be safer to have uncompressedSize := int64(...)
so you don't have to remember to convert it on use and so it's consistent with above
Rationale for this change
closes #207
What changes are included in this PR?
Adding new method
NewMappedFileReader
to ipc package which accepts a byte slice instead of aReaderAtSeeker
. UpdatesipcSource
to reference the raw byte slices from the input directly instead of wrapping withbytes.NewReader
which forces copies viaRead
,ReadFull
, etc.Are these changes tested?
Unit tests added to confirm that the pointers match and that we aren't allocating unnecessarily.
Are there any user-facing changes?
Shouldn't be any user-facing changes other than a reduction in memory usage when reading non-compressed IPC data.