Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(parquet): Move footerOffset into FileMetaData #217

Merged
merged 4 commits into from
Dec 13, 2024

Conversation

joechenrh
Copy link
Contributor

Rationale for this change

After looking into the code, I found that footerOffset is mainly used in parsing the meta data of parquet file.

I think we can store it in FileMetaData, so we don't need to get footerOffset again when we use WithMetadata.

What changes are included in this PR?

Move Reader.footerOffset into FileMetadata

Are these changes tested?

Are there any user-facing changes?

@joechenrh joechenrh requested a review from zeroshade as a code owner December 10, 2024 03:37
Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, just some nitpicks

@@ -114,7 +114,7 @@ func TestBuildAccess(t *testing.T) {
require.NoError(t, err)
serialized, err := faccessor.SerializeString(context.Background())
assert.NoError(t, err)
faccessorCopy, err := metadata.NewFileMetaData([]byte(serialized), nil)
faccessorCopy, err := metadata.NewFileMetaData([]byte(serialized), 0, nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be invalid to use 0 for the footer offset?

@@ -275,6 +278,9 @@ func NewFileMetaData(data []byte, fileDecryptor encryption.FileDecryptor) (*File
// Size is the length of the raw serialized metadata bytes in the footer
func (f *FileMetaData) Size() int { return f.metadataLen }

// SourceSz is the total size of the source file
func (f *FileMetaData) SourceSz() int64 { return f.footerOffset }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the FileMetadata can potentially be separate from the file itself, it might make more sense to name this differently or at least use a different description for the comment.

}

// NewFileMetaData takes in the raw bytes of the serialized metadata to deserialize
// and will attempt to decrypt the footer if a decryptor is provided.
func NewFileMetaData(data []byte, fileDecryptor encryption.FileDecryptor) (*FileMetaData, error) {
func NewFileMetaData(data []byte, footerOffset int64, fileDecryptor encryption.FileDecryptor) (*FileMetaData, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I'm not a fan of here is that this is a breaking change. Now, I don't think there are many (if any) packages that are importing and using this function directly (as creating new file metadata would primarily be done via the Writer rather than done directly) but it's still a concern that I'd rather not create a breaking change if we can avoid it.

Is there a way we can avoid this being a breaking change at all? If 0 is valid, maybe we create a setter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I forgot the backwards compatibility. Can we name the setter/getter function like:

// SetSourceFileSize get the total size of the source file from meta data.
func (f *FileMetaData) SetSourceFileSize() int64 { return f.sourceFileSize }

// GetSourceFileSize set the total size of the source file in meta data.
func (f *FileMetaData) GetSourceFileSize() int64 { return f.sourceFileSize }

// file.go
type FileMetaData struct {
    ....
    
    // sourceFileSize is not a part of FileMetaData, but it is mainly used to parse meta data.
    // Users can manually set this value and they are responsible for the validity of it.
    sourceFileSize int64
}

Comment on lines 282 to 286
// SetSourceFileSize get the total size of the source file from meta data.
func (f *FileMetaData) GetSourceFileSize() int64 { return f.sourceFileSize }

// GetSourceFileSize set the total size of the source file in meta data.
func (f *FileMetaData) SetSourceFileSize(sourceFileSize int64) { f.sourceFileSize = sourceFileSize }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doc strings are reversed currently, gotta swap those 😄

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

@zeroshade zeroshade merged commit 244d47c into apache:main Dec 13, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants