-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(parquet): Move footerOffset into FileMetaData #217
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea, just some nitpicks
parquet/metadata/metadata_test.go
Outdated
@@ -114,7 +114,7 @@ func TestBuildAccess(t *testing.T) { | |||
require.NoError(t, err) | |||
serialized, err := faccessor.SerializeString(context.Background()) | |||
assert.NoError(t, err) | |||
faccessorCopy, err := metadata.NewFileMetaData([]byte(serialized), nil) | |||
faccessorCopy, err := metadata.NewFileMetaData([]byte(serialized), 0, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be invalid to use 0 for the footer offset?
parquet/metadata/file.go
Outdated
@@ -275,6 +278,9 @@ func NewFileMetaData(data []byte, fileDecryptor encryption.FileDecryptor) (*File | |||
// Size is the length of the raw serialized metadata bytes in the footer | |||
func (f *FileMetaData) Size() int { return f.metadataLen } | |||
|
|||
// SourceSz is the total size of the source file | |||
func (f *FileMetaData) SourceSz() int64 { return f.footerOffset } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the FileMetadata can potentially be separate from the file itself, it might make more sense to name this differently or at least use a different description for the comment.
parquet/metadata/file.go
Outdated
} | ||
|
||
// NewFileMetaData takes in the raw bytes of the serialized metadata to deserialize | ||
// and will attempt to decrypt the footer if a decryptor is provided. | ||
func NewFileMetaData(data []byte, fileDecryptor encryption.FileDecryptor) (*FileMetaData, error) { | ||
func NewFileMetaData(data []byte, footerOffset int64, fileDecryptor encryption.FileDecryptor) (*FileMetaData, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I'm not a fan of here is that this is a breaking change. Now, I don't think there are many (if any) packages that are importing and using this function directly (as creating new file metadata would primarily be done via the Writer rather than done directly) but it's still a concern that I'd rather not create a breaking change if we can avoid it.
Is there a way we can avoid this being a breaking change at all? If 0 is valid, maybe we create a setter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I forgot the backwards compatibility. Can we name the setter/getter function like:
// SetSourceFileSize get the total size of the source file from meta data.
func (f *FileMetaData) SetSourceFileSize() int64 { return f.sourceFileSize }
// GetSourceFileSize set the total size of the source file in meta data.
func (f *FileMetaData) GetSourceFileSize() int64 { return f.sourceFileSize }
// file.go
type FileMetaData struct {
....
// sourceFileSize is not a part of FileMetaData, but it is mainly used to parse meta data.
// Users can manually set this value and they are responsible for the validity of it.
sourceFileSize int64
}
parquet/metadata/file.go
Outdated
// SetSourceFileSize get the total size of the source file from meta data. | ||
func (f *FileMetaData) GetSourceFileSize() int64 { return f.sourceFileSize } | ||
|
||
// GetSourceFileSize set the total size of the source file in meta data. | ||
func (f *FileMetaData) SetSourceFileSize(sourceFileSize int64) { f.sourceFileSize = sourceFileSize } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc strings are reversed currently, gotta swap those 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM thanks!
Rationale for this change
After looking into the code, I found that
footerOffset
is mainly used in parsing the meta data of parquet file.I think we can store it in
FileMetaData
, so we don't need to getfooterOffset
again when we useWithMetadata
.What changes are included in this PR?
Move
Reader.footerOffset
intoFileMetadata
Are these changes tested?
Are there any user-facing changes?