fix: Support Databricks Spark runtime#18256

Open
yihua wants to merge 1 commit into apache:branch-0.x from yihua:fix-databricks-spark

Conversation


@yihua yihua commented Feb 26, 2026

Describe the issue this Pull Request addresses

Cherry-picking #13129 to branch-0.x.

Summary and Changelog

This PR makes Hudi tables queryable on Databricks Spark runtime. With the fix, the datasource reader can read Hudi tables leveraging SparkHoodieTableFileIndex without using the metadata table.

The root cause of the issue is that FileStatusCache on the Databricks Spark runtime has a different API than on OSS Spark, leading to the NoSuchMethodError below:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileStatusCache.putLeafFiles(Lorg/apache/hadoop/fs/Path;[Lorg/apache/hadoop/fs/FileStatus;)V
	at org.apache.hudi.SparkHoodieTableFileIndex$$anon$1.put(SparkHoodieTableFileIndex.scala:516)
	at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPathFiles$13(BaseHoodieTableFileIndex.java:410)
	at java.util.HashMap.forEach(HashMap.java:1290)
	at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPathFiles(BaseHoodieTableFileIndex.java:408)

FileStatusCache's declared methods on Databricks Spark runtime:

classOf[FileStatusCache].getDeclaredMethods
res11: Array[java.lang.reflect.Method] = Array(
	public abstract java.util.UUID org.apache.spark.sql.execution.datasources.FileStatusCache.clientId(),
	public static void org.apache.spark.sql.execution.datasources.FileStatusCache.resetForTesting(),
	public scala.Option org.apache.spark.sql.execution.datasources.FileStatusCache.getLeafFiles(org.apache.hadoop.fs.Path),
	public abstract void org.apache.spark.sql.execution.datasources.FileStatusCache.putLeafFiles(org.apache.hadoop.fs.Path,org.apache.spark.sql.execution.datasources.SerializableFileStatus[]),
	public static org.apache.spark.sql.execution.datasources.FileStatusCache org.apache.spark.sql.execution.datasources.FileStatusCache.getOrCreate(org.apache.spark.sql.internal.SQLConf,scala.Option),
	public static org.apache.spark.sql.execution.datasources.FileStatusCache org.apache.spark.sql.execution.datasources.FileStatusCache.getOrCreate(org.apache.spark.sql.SparkSession),
	public abstract void org.apache.spark.sql.execution.datasources.FileStatusCache.invalidateAll())

As seen above, the second parameter of putLeafFiles is of type SerializableFileStatus[], which differs from the FileStatus[] used in OSS Spark. Hudi is compiled against the OSS signature, so the call site cannot link against the Databricks one at runtime.
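The mismatch can be reproduced in miniature with stand-in classes (all names below are illustrative, not Spark's actual types): the JVM resolves a method by its exact parameter types, so a call site compiled against one array type cannot link against the other, producing NoSuchMethodError.

```scala
// Stand-ins for the two runtimes' file-status types (hypothetical names):
class OssFileStatus
class DbrSerializableFileStatus

// The shape Hudi is compiled against (OSS Spark):
trait OssCache {
  def putLeafFiles(path: String, files: Array[OssFileStatus]): Unit
}
// The shape present at runtime on Databricks (different second parameter):
trait DbrCache {
  def putLeafFiles(path: String, files: Array[DbrSerializableFileStatus]): Unit
}

// Same method name, but distinct JVM signatures -- linkage keys on the
// exact parameter types, hence NoSuchMethodError at the call site:
val ossParam = classOf[OssCache].getDeclaredMethods.head.getParameterTypes.last
val dbrParam = classOf[DbrCache].getDeclaredMethods.head.getParameterTypes.last
assert(ossParam != dbrParam)
```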

Two alternatives were explored:

  • Using reflection: invoking putLeafFiles via Java reflection requires an actual SerializableFileStatus[] argument to satisfy the reflective type check; passing Object[] instead, which is the only option without the class, fails with the IllegalArgumentException below. Since SerializableFileStatus is no longer available in OSS Spark, the reflection route is not possible.
IllegalArgumentException: argument type mismatch
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
  • Avoiding the cache: this is the fix adopted in this PR. By not using the file status cache at all, putLeafFiles is never invoked, which eliminates the NoSuchMethodError.
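The reflective type-safety failure from the first alternative can be reproduced with stand-in classes (all names below are hypothetical, not Spark's): passing Object[] where the exact array type is required throws IllegalArgumentException, and building the correctly typed array is impossible when the element class is absent from the classpath.

```scala
// Stand-in (hypothetical name) for Databricks' SerializableFileStatus:
class SerializableStatusStandIn
// Stand-in for the Databricks-style cache:
class CacheStandIn {
  def putLeafFiles(path: String, files: Array[SerializableStatusStandIn]): Unit = ()
}

val m = classOf[CacheStandIn].getMethod(
  "putLeafFiles", classOf[String], classOf[Array[SerializableStatusStandIn]])
val cache = new CacheStandIn

// Passing Object[] where SerializableStatusStandIn[] is required fails the
// reflective type check, mirroring the "argument type mismatch" quoted above:
val failed =
  try { m.invoke(cache, "some/path", Array[AnyRef]()); false }
  catch { case _: IllegalArgumentException => true }
assert(failed)

// Only an array of the exact element type passes -- which cannot be built
// when that class does not exist on the classpath (the OSS Spark situation):
m.invoke(cache, "some/path", Array(new SerializableStatusStandIn))
```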

A COW table generated by the Spark Quick Start was validated using spark.read.format("org.apache.hudi.Spark3DefaultSource").option("hoodie.metadata.enable", "false").load(tablePath).
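The cache-avoidance approach amounts to something like the following no-op cache sketch (the interface and names here are illustrative, not Hudi's or Spark's actual code): every lookup misses and every put is dropped, so no runtime-specific put method is ever reached.

```scala
// Illustrative-only cache interface; names are assumptions:
trait FileStatusCacheLike {
  def getLeafFiles(path: String): Option[Array[String]]
  def putLeafFiles(path: String, files: Array[String]): Unit
  def invalidateAll(): Unit
}

// A cache that never stores anything. File listings are simply recomputed
// on each query instead of being cached:
object NoopCache extends FileStatusCacheLike {
  def getLeafFiles(path: String): Option[Array[String]] = None
  def putLeafFiles(path: String, files: Array[String]): Unit = ()
  def invalidateAll(): Unit = ()
}

NoopCache.putLeafFiles("part=1", Array("f1.parquet"))
assert(NoopCache.getLeafFiles("part=1").isEmpty) // put was a no-op
```

The trade-off is repeated file listing per query, in exchange for never touching the runtime-specific cache API.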

Impact

Fixes reading Hudi tables leveraging SparkHoodieTableFileIndex without using the metadata table.

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2026
@yihua yihua added this to the release-0.15.1 milestone Feb 26, 2026
@yihua yihua force-pushed the fix-databricks-spark branch from 3f18780 to e5e9a9f on February 27, 2026 18:41
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

