-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(arrow/ipc): implement lazy loading/zero-copy for IPC files #216
Conversation
@vtk9 can you give this a try and confirm that it addresses your issue? I added unit tests which confirm that the pointers match and that we're avoiding allocations but it would be great to confirm on your end that this reduces the memory usage you were seeing. |
@lidavidm @kou @joellubi would one of you be able to look this over and give a review? I wanna merge this and then kick off an RC as requested from #218 (comment) |
I'll try to get to it on Monday |
arrow/ipc/file_reader.go
Outdated
func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
var err error | ||
|
||
if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a variable for len(Magic)*2+4
because we use this minimum size multiple times in this file?
arrow/ipc/file_reader.go
Outdated
func (r *basicReaderImpl) readFooter(f *footerBlock) error { | ||
var err error | ||
|
||
if f.offset <= int64(len(Magic)*2+4) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give 4
a name something like footerSizeLen
or something?
arrow/ipc/file_reader.go
Outdated
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this should be int32 not uint32: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
<FOOTER SIZE: int32>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lidavidm is correct, only the unsigned versions are available and you are expected to convert to signed integer yourself if you need.
arrow/ipc/file_reader.go
Outdated
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
metaBytes := buf[:blk.meta] | ||
|
||
prefix := 0 | ||
switch binary.LittleEndian.Uint32(metaBytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int32 not uint32?
https://github.com/apache/arrow/blob/313d11aa94c2be71142b55e3d8bb166d780c19c7/format/File.fbs#L45
metaDataLength: int;
arrow/ipc/file_reader.go
Outdated
|
||
func (r *mappedReaderImpl) getFooterEnd() (int64, error) { return int64(len(r.data)), nil } | ||
|
||
func (r *mappedReaderImpl) readFooter(f *footerBlock) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we unify more codes in this function with basicReaderImple.readFooter()
?
It seems that they have many similar codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit hard to unify these more since the core part of it (reading the bytes) is where the change is but I'll see what I can do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated by adding a getBytes
method to the impls and having a single implementation for readFooter
that uses it, unifying the implementations.
arrow/ipc/file_reader.go
Outdated
return errNotArrowFile | ||
} | ||
|
||
size := int64(binary.LittleEndian.Uint32(buf[:4])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is because Golang only has the unsigned versions, expecting you to convert to signed integer yourself
arrow/ipc/file_reader.go
Outdated
|
||
var r io.Reader = sr | ||
// check for an uncompressed buffer | ||
if int64(uncompressedSize) != -1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: while this is existing code, maybe it would be safer to have uncompressedSize := int64(...)
so you don't have to remember to convert it on use and so it's consistent with above
@kou can you take another look and let me know if I covered what you were thinking? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Rationale for this change
closes #207
What changes are included in this PR?
Adding new method
NewMappedFileReader
to ipc package which accepts a byte slice instead of aReaderAtSeeker
. UpdatesipcSource
to reference the raw byte slices from the input directly instead of wrapping withbytes.NewReader
which forces copies viaRead
,ReadFull
, etc.Are these changes tested?
Unit tests added to confirm that the pointers match and that we aren't allocating unnecessarily.
Are there any user-facing changes?
Shouldn't be any user-facing changes other than a reduction in memory usage when reading non-compressed IPC data.