Skip to content

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Sep 1, 2025

Which issue does this PR close?

Rationale for this change

Do not compress v2 data page when compress is bad quality ( compressed size is greater or equal to uncompressed_size )

What changes are included in this PR?

Discard compression when it's too large

Are these changes tested?

Covered by existing

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 1, 2025
cmpr.compress(&values_data.buf, &mut buffer)?;
if uncompressed_size <= buffer.len() - buffer_len {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can also be a "score", like if uncompressed_size <= (buffer.len() - buffer_len) * 0.9 {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest maybe going the opposite direction. If compression doesn't give say a 10% size reduction, then don't compress...it will be faster to read. 10% might be too much...maybe we could make it configurable.

Anyway, that's probably out of scope for this PR. Let's just go with the simple test for now...it's still an improvement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I do think that would need to configurable as I think some people would treat a 10% size increase as a regression, even if they got a faster reader

A good follow on ticket for sure

@mapleFU
Copy link
Member Author

mapleFU commented Sep 1, 2025

This need add some test, I found a "missing required field PageHeader.uncompressed_page_size" bug here

@mapleFU mapleFU marked this pull request as draft September 1, 2025 14:02
@mapleFU mapleFU marked this pull request as ready for review September 2, 2025 02:12
@mapleFU
Copy link
Member Author

mapleFU commented Sep 2, 2025

This need add some test, I found a "missing required field PageHeader.uncompressed_page_size" bug here

This is caused by another bug in user side, it's not related to this problem :-(

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes a lot of sense, and seems like the intended use of the is_compressed flag.

@mapleFU mapleFU force-pushed the page-v2-no-compress-if-size-too-lage branch from c35ae85 to 0e198e9 Compare September 5, 2025 14:43
@mapleFU
Copy link
Member Author

mapleFU commented Sep 5, 2025

I've added a test for this, would you mind take a look again?

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test looks good! Thanks

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me -- thank you @mapleFU and @etseidl

cmpr.compress(&values_data.buf, &mut buffer)?;
if uncompressed_size <= buffer.len() - buffer_len {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I do think that would need to configurable as I think some people would treat a 10% size increase as a regression, even if they got a faster reader

A good follow on ticket for sure

get_test_column_writer::<ByteArrayType>(Box::new(page_writer), 0, 0, Arc::new(props));

// Create small, simple data that Snappy compression will likely increase in size
// due to compression overhead for very small data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parquet: Do not compress v2 data page when compress is bad quality
3 participants