Spark: Derive Stats From Manifest on the Fly #11615
base: main
Conversation
Hi @huaxingao @karuppayya @aokolnychyi @RussellSpitzer, can you help review this PR?
      .tableProperty(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED)
      .defaultValue(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT)
      .parse();
}
This table-level property takes effect even when the session configuration is turned off, enabling users to derive statistics only for a specific table.
@@ -388,4 +388,8 @@ private TableProperties() {}
  public static final int ENCRYPTION_DEK_LENGTH_DEFAULT = 16;

  public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

  public static final String DERIVE_STATS_FROM_MANIFEST_ENABLED =
These properties don't affect any engines except Spark, so they probably need a prefix.
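For illustration only, a Spark-scoped constant could look something like this (the key and its location are placeholders; the final naming is up to the author):

```java
// Hypothetical Spark-prefixed property key, not the name used in this PR.
public static final String SPARK_DERIVE_STATS_FROM_MANIFEST_ENABLED =
    "read.spark.derive-stats-from-manifest.enabled";
public static final boolean SPARK_DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT = false;
```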
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScan.java
}

// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
This may produce errors if any delete files are present or if the query has any non-file-covering predicates.
I think we may also have issues if column stats for a particular column are not present
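A minimal sketch of the kind of guard this implies, assuming the derivation walks the `DataFile` bound maps (class and method names are illustrative, not code from this PR):

```java
import java.nio.ByteBuffer;
import java.util.Map;
import org.apache.iceberg.DataFile;

class ColumnStatsGuard {
  // Only derive a min/max for a column when the data file actually carries both
  // bounds; metrics can be dropped or truncated via write.metadata.metrics.* settings.
  static boolean hasBounds(DataFile file, int fieldId) {
    Map<Integer, ByteBuffer> lower = file.lowerBounds();
    Map<Integer, ByteBuffer> upper = file.upperBounds();
    return lower != null
        && upper != null
        && lower.containsKey(fieldId)
        && upper.containsKey(fieldId);
  }
}
```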
// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
    boolean isMin, Map<String, Map<Integer, ByteBuffer>> distinctDataFilesBounds) {
I'm not a big fan of parameterizing "isMin"; I'd probably just use two different functions that call a more generic version so that the code at the original calling location is clear and you don't have to know that "true" means "min".
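Roughly like this, keeping the generic version (here called `calculateBound`, a placeholder name) private behind two named wrappers, based on the signature quoted above:

```java
// Callers read as calculateMin(...)/calculateMax(...) instead of passing a bare flag.
private Map<Integer, Object> calculateMin(
    Map<String, Map<Integer, ByteBuffer>> distinctDataFilesBounds) {
  return calculateBound(true /* min */, distinctDataFilesBounds);
}

private Map<Integer, Object> calculateMax(
    Map<String, Map<Integer, ByteBuffer>> distinctDataFilesBounds) {
  return calculateBound(false /* max */, distinctDataFilesBounds);
}
```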
  return nullCount;
}

private Object toSparkType(Type type, Object value) {
I feel like we must have this in a helper function somewhere, I know we have to do similar tricks with UTF8
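For reference, this is the kind of conversion such a helper ends up doing on the Spark side (a standalone sketch, not a pointer to an existing Iceberg utility):

```java
import org.apache.spark.unsafe.types.UTF8String;

class SparkValueSketch {
  // Spark's internal row format expects UTF8String for string columns; similar
  // conversions apply to decimal and binary values.
  static Object toSparkInternal(Object value) {
    return value instanceof CharSequence ? UTF8String.fromString(value.toString()) : value;
  }
}
```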
List<StatisticsFile> files = table.statisticsFiles();
if (!files.isEmpty()) {
  List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();

if (readConf.deriveStatsFromManifestSessionConf()
    || readConf.deriveStatsFromManifestTableProperty()) {
  Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();
what does "distinct" in this context mean?
@@ -183,6 +193,7 @@ public Statistics estimateStatistics() {
    return estimateStatistics(SnapshotUtil.latestSnapshot(table, branch));
  }

  @SuppressWarnings("checkstyle:CyclomaticComplexity")
Let's try to reduce the complexity rather than suppressing the warning. Generally this means:
- Removing nesting
- Extracting sub-functions
- Avoiding early exits
      .parse();
}

public boolean deriveStatsFromManifestTableProperty() {
Why do we need two functions here? You should be able to add all the options into the same conf parser while still maintaining this hierarchy
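A single parser chain in SparkReadConf could cover both, for example (key names are placeholders, following the prefix discussion above):

```java
// One entry point: the session conf and the table property feed the same parser,
// with a false default; the key strings here are illustrative.
public boolean deriveStatsFromManifestEnabled() {
  return confParser
      .booleanConf()
      .sessionConf("spark.sql.iceberg.derive-stats-from-manifest-enabled")
      .tableProperty("read.spark.derive-stats-from-manifest.enabled")
      .defaultValue(false)
      .parse();
}
```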
for (ScanTaskGroup<?> taskGrp : taskGroups()) {
  for (ScanTask task : taskGrp.tasks()) {
    if (task.isFileScanTask()) {
      FileScanTask fileScanTask = task.asFileScanTask();
Could we just make a set of files being used?
Stream.of(tasks).filter(fl.isFileScanTask).map(file).collectAsSet
Then we could just work on the set for all of our min and max functions
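Expanding that pseudocode against the ScanTask API, something along these lines inside SparkScan (a sketch; keyed by path so split tasks over the same file are deduplicated):

```java
// Collect each planned data file once; the min/max/null-count derivation can
// then iterate this map instead of re-walking every task group.
Map<String, DataFile> plannedFiles =
    taskGroups().stream()
        .flatMap(group -> group.tasks().stream())
        .filter(ScanTask::isFileScanTask)
        .map(task -> task.asFileScanTask().file())
        .collect(
            Collectors.toMap(file -> file.path().toString(), file -> file, (a, b) -> a));
```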
I have some overall worries about inaccuracy in our stats reporting here. I know that based on truncation / collection we may not be providing accurate stats for all columns, and of course if delete vectors or equality deletes are present the stats will be incorrect.
@huaxingao do you have any thoughts on this? I know you have dealt with similar issues before on the Aggregate pushdowns.
@RussellSpitzer, thanks for the review comments, I will address them soon. As per @huaxingao's implementation here, aggregate pushdown is skipped when row-level deletes are detected; I have applied a similar change here as well.
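For reference, the check being mirrored from the aggregate-pushdown path is roughly of this shape (a method sketch, not the exact code in this PR):

```java
// Skip manifest-based stats derivation when any planned task carries delete
// files, since min/max/null counts would otherwise overcount deleted rows.
private boolean hasRowLevelDeletes(List<FileScanTask> tasks) {
  return tasks.stream().anyMatch(task -> !task.deletes().isEmpty());
}
```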
This PR derives min, max, and numOfNulls statistics on the fly from manifest files and reports them back to Spark.
Currently only NDV is calculated and reported back to the Spark engine, which leads to inaccurate plans on the Spark side since min, max, and nullCount are returned as NULL.
As there is still an ongoing discussion about whether to store stats at the partition level or the table level, calculating them either way would run into the issue described in this comment in discussion #10791.
These changes enable on-the-fly collection of the stats via a table property or a session conf (disabled by default), as sketched below.
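Turning it on would then look roughly like this (both keys are placeholders pending the naming discussion above; `spark` is a SparkSession):

```java
// Session-level opt-in (key name illustrative):
spark.conf().set("spark.sql.iceberg.derive-stats-from-manifest-enabled", "true");

// Or per table via a table property (key name illustrative):
spark.sql(
    "ALTER TABLE db.tbl SET TBLPROPERTIES "
        + "('read.spark.derive-stats-from-manifest.enabled'='true')");
```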
cc @guykhazma @jeesou