feat(arrow/ipc): implement lazy loading/zero-copy for IPC files #216

Open: wants to merge 1 commit into base: main

zeroshade
Member

Rationale for this change

closes #207

What changes are included in this PR?

Adds NewMappedFileReader to the ipc package, which accepts a byte slice instead of a ReaderAtSeeker. Also updates ipcSource to reference the raw byte slices of the input directly instead of wrapping them with bytes.NewReader, which forces copies via Read, ReadFull, etc.
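
To illustrate the difference described above, here is a minimal sketch (not the code from this PR) contrasting a reader-based read, which copies into a new buffer, with slicing the input directly. The helper names readCopy, viewSlice, and sharesMemory are hypothetical and exist only for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readCopy mimics the reader-based path: the requested bytes are copied
// out of the source into a freshly allocated buffer.
func readCopy(data []byte, off, n int) ([]byte, error) {
	buf := make([]byte, n)
	r := bytes.NewReader(data)
	if read, err := r.ReadAt(buf, int64(off)); read < n {
		return nil, err
	}
	return buf, nil
}

// viewSlice mimics the mapped path: the result aliases the input's memory,
// so no bytes are copied. The capacity is capped so that appends to the
// view cannot overwrite the bytes that follow it in the input.
func viewSlice(data []byte, off, n int) ([]byte, error) {
	if off < 0 || n < 0 || off+n > len(data) {
		return nil, io.ErrUnexpectedEOF
	}
	return data[off : off+n : off+n], nil
}

// sharesMemory reports whether two non-empty slices start at the same address.
func sharesMemory(a, b []byte) bool {
	return len(a) > 0 && len(b) > 0 && &a[0] == &b[0]
}

func main() {
	data := []byte("ARROW1..payload..")
	copied, _ := readCopy(data, 8, 7)
	view, _ := viewSlice(data, 8, 7)
	fmt.Println("copy shares memory:", sharesMemory(copied, data[8:]))
	fmt.Println("view shares memory:", sharesMemory(view, data[8:]))
}
```

With memory-mapped or shared-memory input, the second form is what lets the reader hand out buffers without touching the heap.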

Are these changes tested?

Unit tests added to confirm that the pointers match and that we aren't allocating unnecessarily.
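
For readers unfamiliar with this kind of check, the following is a rough sketch of how an allocation assertion can be written with testing.AllocsPerRun. It does not reproduce the tests in this PR; sliceField, copyField, and allocsPer are hypothetical helpers used only for illustration.

```go
package main

import (
	"fmt"
	"testing"
)

// sliceField is a hypothetical accessor that returns a window into data
// without copying.
func sliceField(data []byte, off, n int) []byte { return data[off : off+n] }

// copyField is the copying counterpart: it allocates a new buffer each call.
func copyField(data []byte, off, n int) []byte {
	out := make([]byte, n)
	copy(out, data[off:off+n])
	return out
}

// sink keeps results alive so the compiler cannot optimize the calls away.
var sink []byte

// allocsPer reports the average number of heap allocations per call of f.
func allocsPer(f func()) float64 { return testing.AllocsPerRun(100, f) }

func main() {
	data := make([]byte, 1<<16)
	zeroCopy := allocsPer(func() { sink = sliceField(data, 64, 4096) })
	copying := allocsPer(func() { sink = copyField(data, 64, 4096) })
	fmt.Printf("slice: %.0f allocs/op, copy: %.0f allocs/op\n", zeroCopy, copying)
}
```

Inside a real test the same measurement would typically be followed by an assertion that the zero-copy path reports 0 allocations.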

Are there any user-facing changes?

Shouldn't be any user-facing changes other than a reduction in memory usage when reading non-compressed IPC data.

@zeroshade
Member Author

@vtk9 can you give this a try and confirm that it addresses your issue? I added unit tests which confirm that the pointers match and that we're avoiding allocations but it would be great to confirm on your end that this reduces the memory usage you were seeing.

@zeroshade
Member Author

@lidavidm @kou @joellubi would one of you be able to look this over and give a review?

I wanna merge this and then kick off an RC as requested from #218 (comment)

@lidavidm
Member

I'll try to get to it on Monday

func (r *basicReaderImpl) readFooter(f *footerBlock) error {
var err error

if f.offset <= int64(len(Magic)*2+4) {
Member

Can we add a variable for len(Magic)*2+4 because we use this minimum size multiple times in this file?

func (r *basicReaderImpl) readFooter(f *footerBlock) error {
var err error

if f.offset <= int64(len(Magic)*2+4) {
Member

Can we give 4 a name, something like footerSizeLen?
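
A possible shape for the two naming suggestions (a shared minimum-size value and a name for the literal 4) is sketched below. This is only an illustration: magic stands in for the package's Magic value, and minFileSize and checkFooterOffset are hypothetical names, as is the assumption that the check returns errNotArrowFile.

```go
package main

import (
	"errors"
	"fmt"
)

// magic stands in for the package's Magic value; "ARROW1" is the magic
// string of the Arrow IPC file format.
var magic = []byte("ARROW1")

// footerSizeLen is the width in bytes of the footer size field.
const footerSizeLen = 4

// minFileSize covers the leading magic, the trailing magic, and the footer
// size field. It is a var rather than a const because len of a slice
// variable is not a constant expression in Go.
var minFileSize = int64(len(magic)*2 + footerSizeLen)

var errNotArrowFile = errors.New("not an Arrow file")

// checkFooterOffset rejects inputs too small to hold any footer at all
// (assumed behavior, mirroring the condition quoted in the review).
func checkFooterOffset(offset int64) error {
	if offset <= minFileSize {
		return errNotArrowFile
	}
	return nil
}

func main() {
	fmt.Println(minFileSize, checkFooterOffset(16), checkFooterOffset(17))
}
```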

return errNotArrowFile
}

size := int64(binary.LittleEndian.Uint32(buf[:4]))
Member

I think that this should be int32 not uint32: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format

<FOOTER SIZE: int32>

Member

I think this is because Go's encoding/binary ByteOrder only provides the unsigned variants (Uint16, Uint32, Uint64) and expects you to convert to a signed integer yourself

https://pkg.go.dev/encoding/binary#ByteOrder
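
To make the int32 versus uint32 point concrete, the sketch below decodes the same four bytes both ways. The helper names are hypothetical; the only library call is binary.LittleEndian.Uint32. Converting the uint32 straight to int64 never yields a negative value, whereas going through int32 first preserves the sign that the format's int32 field implies.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// footerSizeUnsigned widens the raw uint32 directly, so the result is
// always non-negative.
func footerSizeUnsigned(buf []byte) int64 {
	return int64(binary.LittleEndian.Uint32(buf[:4]))
}

// footerSizeSigned reinterprets the bits as int32 before widening, so a
// negative on-disk value stays negative and can be rejected explicitly.
func footerSizeSigned(buf []byte) int64 {
	return int64(int32(binary.LittleEndian.Uint32(buf[:4])))
}

func main() {
	buf := []byte{0xff, 0xff, 0xff, 0xff}
	fmt.Println(footerSizeUnsigned(buf)) // 4294967295
	fmt.Println(footerSizeSigned(buf))   // -1
}
```

For well-formed files the two agree; they differ only when the high bit of the size field is set.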

return errNotArrowFile
}

size := int64(binary.LittleEndian.Uint32(buf[:4]))
Member

ditto.

metaBytes := buf[:blk.meta]

prefix := 0
switch binary.LittleEndian.Uint32(metaBytes) {
Member


func (r *mappedReaderImpl) getFooterEnd() (int64, error) { return int64(len(r.data)), nil }

func (r *mappedReaderImpl) readFooter(f *footerBlock) error {
Member

Can we share more code between this function and basicReaderImpl.readFooter()?
They seem to have a lot of similar code.

return errNotArrowFile
}

size := int64(binary.LittleEndian.Uint32(buf[:4]))
Member

I think this is because Go's encoding/binary ByteOrder only provides the unsigned variants (Uint16, Uint32, Uint64) and expects you to convert to a signed integer yourself

https://pkg.go.dev/encoding/binary#ByteOrder


var r io.Reader = sr
// check for an uncompressed buffer
if int64(uncompressedSize) != -1 {
Member

nit: while this is existing code, maybe it would be safer to have uncompressedSize := int64(...) so you don't have to remember to convert it on use and so it's consistent with above
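
The nit above could be addressed roughly as follows. This is a sketch with hypothetical names (splitCompressedBuffer, isUncompressed), not the PR's code. It relies on the Arrow IPC body compression rule that each buffer begins with an 8-byte little-endian signed length, where -1 means the bytes that follow are not compressed.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// prefixLen is the width of the uncompressed-length prefix in bytes.
const prefixLen = 8

// splitCompressedBuffer converts the length prefix to int64 once, at the
// point of decoding, so later comparisons need no further casts. The
// returned body aliases buf rather than copying it.
func splitCompressedBuffer(buf []byte) (uncompressedSize int64, body []byte, err error) {
	if len(buf) < prefixLen {
		return 0, nil, errors.New("buffer too short for length prefix")
	}
	uncompressedSize = int64(binary.LittleEndian.Uint64(buf[:prefixLen]))
	return uncompressedSize, buf[prefixLen:], nil
}

// isUncompressed reports whether the prefix marks the body as raw bytes.
func isUncompressed(size int64) bool { return size == -1 }

func main() {
	raw := append(bytes8(0xff), "raw"...)
	size, body, _ := splitCompressedBuffer(raw)
	fmt.Println(size, isUncompressed(size), string(body))
}

// bytes8 returns eight copies of b; it only builds sample input.
func bytes8(b byte) []byte {
	out := make([]byte, prefixLen)
	for i := range out {
		out[i] = b
	}
	return out
}
```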


Successfully merging this pull request may close these issues.

[Feature Request] Support lazy load/zero copy when using shared memory
3 participants