Spark: Derive Stats From Manifest on the Fly #11615

Open
wants to merge 2 commits into main
Conversation

saitharun15
Contributor

This PR derives min, max, and numOfNulls statistics on the fly from manifest files and reports them back to Spark.

Currently only NDV is calculated and reported back to the Spark engine, which leads to inaccurate plans on the Spark side, since min, max, and nullCount are returned as NULL.

As there is still an ongoing discussion on whether to store stats at the partition level or the table level, calculating them either way would be an issue, as per this comment in discussion #10791.

These changes enable on-the-fly collection of the stats via a table property or a session conf (false by default).
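For illustration, enabling this per table could look like the sketch below; the property key string is hypothetical, since the exact key is defined in TableProperties in this PR.

// hypothetical sketch: the property key shown is illustrative, not the
// exact string this PR defines
table.updateProperties()
    .set("read.derive-stats-from-manifest.enabled", "true")
    .commit();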

cc @guykhazma @jeesou

@saitharun15
Contributor Author

Hi @huaxingao @karuppayya @aokolnychyi @RussellSpitzer, can you help review this PR?

.tableProperty(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED)
.defaultValue(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT)
.parse();
}
Contributor Author

When this table-level property is turned off, it takes precedence over the session configuration, so users can derive statistics only for specific tables.

@@ -388,4 +388,8 @@ private TableProperties() {}
public static final int ENCRYPTION_DEK_LENGTH_DEFAULT = 16;

public static final int ENCRYPTION_AAD_LENGTH_DEFAULT = 16;

public static final String DERIVE_STATS_FROM_MANIFEST_ENABLED =
Member

These properties don't affect any engines except Spark, so they probably need a prefix.
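For example, Spark-only session settings already live under the spark.sql.iceberg. prefix in SparkSQLProperties; a prefixed key could look like the sketch below (the exact name is an assumption, not taken from this PR).

// illustrative only: a Spark-scoped key following the existing
// spark.sql.iceberg.* convention; the exact name is an assumption
public static final String DERIVE_STATS_FROM_MANIFEST =
    "spark.sql.iceberg.derive-stats-from-manifest.enabled";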

}

// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
Member

This may produce incorrect results if any delete files are present or if there are any non-file-covering predicates in the query.

Member

I think we may also have issues if column stats for a particular column are not present
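A minimal guard for that case might look like this sketch; it assumes a loop over DataFile bounds and simply skips columns whose stats were never collected (for example due to write.metadata.metrics.* settings).

// sketch: treat a column with no collected bounds as unknown rather than
// deriving a wrong min/max for it
Map<Integer, ByteBuffer> lowerBounds = dataFile.lowerBounds();
if (lowerBounds == null || !lowerBounds.containsKey(fieldId)) {
  continue; // leave this column's min unset instead of reporting a bad value
}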


// extract min/max values from the manifests
private Map<Integer, Object> calculateMinMax(
boolean isMin, Map<String, Map<Integer, ByteBuffer>> distinctDataFilesBounds) {
Member

I'm not a big fan of parameterizing "isMin"; I'd probably just use two different functions that call a more generic version, so that the code at the original calling site is clear and you don't have to know that "true" means "min".
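A sketch of that suggestion, with two thin wrappers over a shared implementation so call sites stay readable (names are illustrative):

// illustrative names: callers no longer pass a bare boolean
private Map<Integer, Object> calculateMin(
    Map<String, Map<Integer, ByteBuffer>> bounds) {
  return calculateBound(true /* isMin */, bounds);
}

private Map<Integer, Object> calculateMax(
    Map<String, Map<Integer, ByteBuffer>> bounds) {
  return calculateBound(false /* isMin */, bounds);
}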

return nullCount;
}

private Object toSparkType(Type type, Object value) {
Member

I feel like we must have this in a helper function somewhere; I know we have to do similar tricks with UTF8.
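Iceberg does ship a converter for bound buffers, Conversions.fromByteBuffer, and Spark's string columns expect UTF8String; a sketch assuming those are the pieces meant here:

// sketch: decode an Iceberg bound buffer, wrapping strings for Spark
// (org.apache.iceberg.types.Conversions, org.apache.spark.unsafe.types.UTF8String)
Object javaValue = Conversions.fromByteBuffer(type, buffer);
if (type.typeId() == Type.TypeID.STRING) {
  return UTF8String.fromString(javaValue.toString());
}
return javaValue;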

List<StatisticsFile> files = table.statisticsFiles();
if (!files.isEmpty()) {
List<BlobMetadata> metadataList = (files.get(0)).blobMetadata();

if (readConf.deriveStatsFromManifestSessionConf()
|| readConf.deriveStatsFromManifestTableProperty()) {
Map<String, Map<Integer, Long>> distinctDataFilesNullCount = Maps.newHashMap();
Member

what does "distinct" in this context mean?

@@ -183,6 +193,7 @@ public Statistics estimateStatistics() {
return estimateStatistics(SnapshotUtil.latestSnapshot(table, branch));
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
Member

Let's try to reduce the complexity rather than suppressing the warning

Generally this means:

- Removing nesting
- Extracting sub-functions
- Avoiding early exits
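For instance, each stat derivation could move into a named helper so the main method reads as a flat sequence. A sketch with hypothetical helper names:

// sketch, hypothetical helpers: the main method stays flat
private Statistics deriveStatistics(Snapshot snapshot) {
  Map<Integer, Object> mins = deriveMins(snapshot);
  Map<Integer, Object> maxes = deriveMaxes(snapshot);
  Map<Integer, Long> nullCounts = deriveNullCounts(snapshot);
  return buildStatistics(mins, maxes, nullCounts);
}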

.parse();
}

public boolean deriveStatsFromManifestTableProperty() {
Member

Why do we need two functions here? You should be able to add all the options into the same conf parser while still maintaining this hierarchy
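A sketch of the single-parser approach, following the SparkConfParser chain already used in this file (the SparkSQLProperties constant is an assumption):

// sketch: one parser resolves the session conf and table property together
public boolean deriveStatsFromManifest() {
  return confParser
      .booleanConf()
      .sessionConf(SparkSQLProperties.DERIVE_STATS_FROM_MANIFEST) // assumed key
      .tableProperty(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED)
      .defaultValue(TableProperties.DERIVE_STATS_FROM_MANIFEST_ENABLED_DEFAULT)
      .parse();
}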

for (ScanTaskGroup<?> taskGrp : taskGroups()) {
for (ScanTask task : taskGrp.tasks()) {
if (task.isFileScanTask()) {
FileScanTask fileScanTask = task.asFileScanTask();
Member

Could we just make a set of files being used?

Stream.of(tasks).filter(fl.isFileScanTask).map(file).collectAsSet

Then we could just work on the set for all of our min and max functions
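In idiomatic Java streams that pseudocode might read roughly as the sketch below, assuming taskGroups() yields ScanTaskGroup<?> as in the surrounding code:

// sketch: collect the distinct data files once, then derive every stat from
// this set instead of re-walking the task groups
Set<DataFile> files =
    taskGroups().stream()
        .flatMap(group -> group.tasks().stream())
        .filter(ScanTask::isFileScanTask)
        .map(task -> task.asFileScanTask().file())
        .collect(Collectors.toSet());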

@RussellSpitzer (Member) left a comment

I have some overall worries about the accuracy of our stats reporting here. I know that because of truncation / collection settings we may not be providing accurate stats for all columns, and of course if delete vectors or equality deletes are present the stats will be incorrect.

@huaxingao do you have any thoughts on this? I know you have dealt with similar issues before on the Aggregate pushdowns.

@saitharun15
Contributor Author

@RussellSpitzer, thanks for the review comments, I will address them soon. As per @huaxingao's implementation here, aggregate pushdown is skipped when row-level deletes are detected; I have applied a similar change here as well.
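For reference, a minimal version of such a guard could check the snapshot summary before deriving anything; this sketch uses the standard "total-delete-files" summary property, and the fallback helper name is hypothetical.

// sketch: fall back to NDV-only stats when the snapshot has delete files
Map<String, String> summary = snapshot.summary();
String deleteFiles =
    summary == null ? "0" : summary.getOrDefault(SnapshotSummary.TOTAL_DELETE_FILES_PROP, "0");
if (Long.parseLong(deleteFiles) > 0) {
  return estimateStatisticsWithoutDerivation(snapshot); // hypothetical fallback
}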
