Spark: Derive Stats From Manifest on the Fly #11615

Open
wants to merge 2 commits into main
Conversation

saitharun15
Contributor

This PR derives min, max, and numOfNulls statistics on the fly from manifest files and reports them back to Spark.

Currently only NDV is calculated and reported back to the Spark engine, which leads to inaccurate plans on the Spark side, since min, max, and nullCount are returned as NULL.

As there is still an ongoing discussion on whether to store stats at the partition level or the table level, calculating them either way would be an issue, as per this comment in discussion #10791.

These changes enable on-the-fly collection of the stats via a table property or a session conf (false by default).
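For illustration, enabling this per table could look like the sketch below; the property key string is hypothetical, since the exact key is defined in TableProperties in this PR.

// hypothetical sketch: the property key shown is illustrative, not the
// exact string this PR defines
table.updateProperties()
    .set("read.derive-stats-from-manifest.enabled", "true")
    .commit();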

cc @guykhazma @jeesou

@saitharun15
Contributor Author

Hi @huaxingao @karuppayya @aokolnychyi @RussellSpitzer, can you help review this PR?

.tableProperty(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED)
.defaultValue(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT)
.parse();
}
Contributor Author

When this table-level property is turned off, it takes precedence over the session configuration, so users can derive statistics only for specific tables.

@@ -388,4 +388,8 @@ private TableProperties() {}
public static final int ENCRYPTION_DEK_LENGTH_DEFAULT = 16;

public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

public static final String DERIVE_STATS_FROM_MANIFEST_ENABLED =
Member

These properties don't affect any engines except Spark, so they probably need a prefix.
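For example, Spark-only session settings already live under the spark.sql.iceberg. prefix in SparkSQLProperties; a prefixed key could look like the sketch below (the exact name is an assumption, not taken from this PR).

// illustrative only: a Spark-scoped key following the existing
// spark.sql.iceberg.* convention; the exact name is an assumption
public static final String DERIVE_STATS_FROM_MANIFEST =
    "spark.sql.iceberg.derive-stats-from-manifest.enabled";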

}

// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
Member

This may produce incorrect results if any delete files are present or if there are any non-file-covering predicates in the query.

Member

I think we may also have issues if column stats for a particular column are not present
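A minimal guard for that case might look like this sketch; it assumes a loop over DataFile bounds and simply skips columns whose stats were never collected (for example due to write.metadata.metrics.* settings).

// sketch: treat a column with no collected bounds as unknown rather than
// deriving a wrong min/max for it
Map<Integer, ByteBuffer> lowerBounds = dataFile.lowerBounds();
if (lowerBounds == null || !lowerBounds.containsKey(fieldId)) {
  continue; // leave this column's min unset instead of reporting a bad value
}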


// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
boolean isMin, Map<String, Map<Integer, ByteBuffer>> distinctDataFilesBounds) {
Member

I'm not a big fan of parameterizing "isMin"; I'd probably just use two different functions that call a more generic version, so that the code at the original calling site is clear and you don't have to know that "true" means "min".
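A sketch of that suggestion, with two thin wrappers over a shared implementation so call sites stay readable (names are illustrative):

// illustrative names: callers no longer pass a bare boolean
private Map<Integer, Object> calculateMin(
    Map<String, Map<Integer, ByteBuffer>> bounds) {
  return calculateBound(true /* isMin */, bounds);
}

private Map<Integer, Object> calculateMax(
    Map<String, Map<Integer, ByteBuffer>> bounds) {
  return calculateBound(false /* isMin */, bounds);
}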

return nullCount;
}

private Object toSparkType(Type type, Object value) {
Member

I feel like we must have this in a helper function somewhere; I know we have to do similar tricks with UTF8.
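Iceberg does ship a converter for bound buffers, Conversions.fromByteBuffer, and Spark's string columns expect UTF8String; a sketch assuming those are the pieces meant here:

// sketch: decode an Iceberg bound buffer, wrapping strings for Spark
// (org.apache.iceberg.types.Conversions, org.apache.spark.unsafe.types.UTF8String)
Object javaValue = Conversions.fromByteBuffer(type, buffer);
if (type.typeId() == Type.TypeID.STRING) {
  return UTF8String.fromString(javaValue.toString());
}
return javaValue;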

List<StatisticsFile> files = table.statisticsFiles();
if (!files.isEmpty()) {
List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();

if (readConf.deriveStatsFromManifestSessionConf()
|| readConf.deriveStatsFromManifestTableProperty()) {
Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();
Member

what does "distinct" in this context mean?

@@ -183,6 +193,7 @@ public Statistics estimateStatistics() {
return estimateStatistics(SnapshotUtil.latestSnapshot(table, branch));
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
Member

Let's try to reduce the complexity rather than suppressing the warning

Generally this means:

- Removing nesting
- Extracting sub-functions
- Avoiding early exits
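For instance, each stat derivation could move into a named helper so the main method reads as a flat sequence. A sketch with hypothetical helper names:

// sketch, hypothetical helpers: the main method stays flat
private Statistics deriveStatistics(Snapshot snapshot) {
  Map<Integer, Object> mins = deriveMins(snapshot);
  Map<Integer, Object> maxes = deriveMaxes(snapshot);
  Map<Integer, Long> nullCounts = deriveNullCounts(snapshot);
  return buildStatistics(mins, maxes, nullCounts);
}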

.parse();
}

public boolean deriveStatsFromManifestTableProperty() {
Member

Why do we need two functions here? You should be able to add all the options into the same conf parser while still maintaining this hierarchy
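A sketch of the single-parser approach, following the SparkConfParser chain already used in this file (the SparkSQLProperties constant is an assumption):

// sketch: one parser resolves the session conf and table property together
public boolean deriveStatsFromManifest() {
  return confParser
      .booleanConf()
      .sessionConf(SparkSQLProperties.DERIVE_STATS_FROM_MANIFEST) // assumed key
      .tableProperty(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED)
      .defaultValue(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT)
      .parse();
}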

for (ScanTaskGroup<?> taskGrp : taskGroups()) {
for (ScanTask task : taskGrp.tasks()) {
if (task.isFileScanTask()) {
FileScanTask fileScanTask = task.asFileScanTask();
Member

Could we just make a set of files being used?

Stream.of(tasks).filter(fl.isFileScanTask).map(file).collectAsSet

Then we could just work on the set for all of our min and max functions
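In idiomatic Java streams that pseudocode might read roughly as the sketch below, assuming taskGroups() yields ScanTaskGroup<?> as in the surrounding code:

// sketch: collect the distinct data files once, then derive every stat from
// this set instead of re-walking the task groups
Set<DataFile> files =
    taskGroups().stream()
        .flatMap(group -> group.tasks().stream())
        .filter(ScanTask::isFileScanTask)
        .map(task -> task.asFileScanTask().file())
        .collect(Collectors.toSet());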

@RussellSpitzer (Member) left a comment

I have some overall worries about the accuracy of our stats reporting here. I know that because of truncation / collection settings we may not be providing accurate stats for all columns, and of course if delete vectors or equality deletes are present the stats will be incorrect.

@huaxingao do you have any thoughts on this? I know you have dealt with similar issues before on the Aggregate pushdowns.

@saitharun15
Contributor Author

@RussellSpitzer, thanks for the review comments, I will address them soon. As per @huaxingao's implementation here, aggregate pushdown is skipped when row-level deletes are detected; I have applied a similar change here as well.
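For reference, a minimal version of such a guard could check the snapshot summary before deriving anything; this sketch uses the standard "total-delete-files" summary property, and the fallback helper name is hypothetical.

// sketch: fall back to NDV-only stats when the snapshot has delete files
Map<String, String> summary = snapshot.summary();
String deleteFiles =
    summary == null ? "0" : summary.getOrDefault(SnapshotSummary.TOTAL_DELETE_FILES_PROP, "0");
if (Long.parseLong(deleteFiles) > 0) {
  return estimateStatisticsWithoutDerivation(snapshot); // hypothetical fallback
}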
