fix: Support Databricks Spark runtime#18256

Open
yihua wants to merge 1 commit into apache:branch-0.x from yihua:fix-databricks-spark

Conversation


@yihua yihua commented Feb 26, 2026

Describe the issue this Pull Request addresses

Cherry-picking #13129 to branch-0.x.

Summary and Changelog

This PR makes Hudi tables queryable on Databricks Spark runtime. With the fix, the datasource reader can read Hudi tables leveraging SparkHoodieTableFileIndex without using the metadata table.

The root cause of the issue is that FileStatusCache on the Databricks Spark runtime has a different API than on OSS Spark, leading to the NoSuchMethodError below:

java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileStatusCache.putLeafFiles(Lorg/apache/hadoop/fs/Path;[Lorg/apache/hadoop/fs/FileStatus;)V
	at org.apache.hudi.SparkHoodieTableFileIndex$$anon$1.put(SparkHoodieTableFileIndex.scala:516)
	at org.apache.hudi.BaseHoodieTableFileIndex.lambda$listPartitionPathFiles$13(BaseHoodieTableFileIndex.java:410)
	at java.util.HashMap.forEach(HashMap.java:1290)
	at org.apache.hudi.BaseHoodieTableFileIndex.listPartitionPathFiles(BaseHoodieTableFileIndex.java:408)

FileStatusCache's declared methods on Databricks Spark runtime:

classOf[FileStatusCache].getDeclaredMethods
res11: Array[java.lang.reflect.Method] = Array(
	public abstract java.util.UUID org.apache.spark.sql.execution.datasources.FileStatusCache.clientId(),
	public static void org.apache.spark.sql.execution.datasources.FileStatusCache.resetForTesting(),
	public scala.Option org.apache.spark.sql.execution.datasources.FileStatusCache.getLeafFiles(org.apache.hadoop.fs.Path),
	public abstract void org.apache.spark.sql.execution.datasources.FileStatusCache.putLeafFiles(org.apache.hadoop.fs.Path,org.apache.spark.sql.execution.datasources.SerializableFileStatus[]),
	public static org.apache.spark.sql.execution.datasources.FileStatusCache org.apache.spark.sql.execution.datasources.FileStatusCache.getOrCreate(org.apache.spark.sql.internal.SQLConf,scala.Option),
	public static org.apache.spark.sql.execution.datasources.FileStatusCache org.apache.spark.sql.execution.datasources.FileStatusCache.getOrCreate(org.apache.spark.sql.SparkSession),
	public abstract void org.apache.spark.sql.execution.datasources.FileStatusCache.invalidateAll())

As seen above, the second parameter of putLeafFiles is of type SerializableFileStatus[], which differs from the FileStatus[] used in OSS Spark. Hudi is compiled against the OSS signature, so the call site cannot link against the Databricks one at runtime.
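The mismatch can be reproduced in miniature with stand-in classes (all names below are illustrative, not Spark's actual types): the JVM resolves a method by its exact parameter types, so a call site compiled against one array type cannot link against the other, producing NoSuchMethodError.

```scala
// Stand-ins for the two runtimes' file-status types (hypothetical names):
class OssFileStatus
class DbrSerializableFileStatus

// The shape Hudi is compiled against (OSS Spark):
trait OssCache {
  def putLeafFiles(path: String, files: Array[OssFileStatus]): Unit
}
// The shape present at runtime on Databricks (different second parameter):
trait DbrCache {
  def putLeafFiles(path: String, files: Array[DbrSerializableFileStatus]): Unit
}

// Same method name, but distinct JVM signatures -- linkage keys on the
// exact parameter types, hence NoSuchMethodError at the call site:
val ossParam = classOf[OssCache].getDeclaredMethods.head.getParameterTypes.last
val dbrParam = classOf[DbrCache].getDeclaredMethods.head.getParameterTypes.last
assert(ossParam != dbrParam)
```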

Two alternatives were explored:

  • Using reflection: invoking putLeafFiles via Java reflection requires an actual SerializableFileStatus[] argument to satisfy the reflective type check; passing Object[] instead, which is the only option without the class, fails with the IllegalArgumentException below. Since SerializableFileStatus is no longer available in OSS Spark, the reflection route is not possible.
IllegalArgumentException: argument type mismatch
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
  • Avoiding the cache: this is the fix adopted in this PR. By not using the file status cache at all, putLeafFiles is never invoked, which eliminates the NoSuchMethodError.
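The reflective type-safety failure from the first alternative can be reproduced with stand-in classes (all names below are hypothetical, not Spark's): passing Object[] where the exact array type is required throws IllegalArgumentException, and building the correctly typed array is impossible when the element class is absent from the classpath.

```scala
// Stand-in (hypothetical name) for Databricks' SerializableFileStatus:
class SerializableStatusStandIn
// Stand-in for the Databricks-style cache:
class CacheStandIn {
  def putLeafFiles(path: String, files: Array[SerializableStatusStandIn]): Unit = ()
}

val m = classOf[CacheStandIn].getMethod(
  "putLeafFiles", classOf[String], classOf[Array[SerializableStatusStandIn]])
val cache = new CacheStandIn

// Passing Object[] where SerializableStatusStandIn[] is required fails the
// reflective type check, mirroring the "argument type mismatch" quoted above:
val failed =
  try { m.invoke(cache, "some/path", Array[AnyRef]()); false }
  catch { case _: IllegalArgumentException => true }
assert(failed)

// Only an array of the exact element type passes -- which cannot be built
// when that class does not exist on the classpath (the OSS Spark situation):
m.invoke(cache, "some/path", Array(new SerializableStatusStandIn))
```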

A COW table generated by the Spark Quick Start was validated using spark.read.format("org.apache.hudi.Spark3DefaultSource").option("hoodie.metadata.enable", "false").load(tablePath).
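The cache-avoidance approach amounts to something like the following no-op cache sketch (the interface and names here are illustrative, not Hudi's or Spark's actual code): every lookup misses and every put is dropped, so no runtime-specific put method is ever reached.

```scala
// Illustrative-only cache interface; names are assumptions:
trait FileStatusCacheLike {
  def getLeafFiles(path: String): Option[Array[String]]
  def putLeafFiles(path: String, files: Array[String]): Unit
  def invalidateAll(): Unit
}

// A cache that never stores anything. File listings are simply recomputed
// on each query instead of being cached:
object NoopCache extends FileStatusCacheLike {
  def getLeafFiles(path: String): Option[Array[String]] = None
  def putLeafFiles(path: String, files: Array[String]): Unit = ()
  def invalidateAll(): Unit = ()
}

NoopCache.putLeafFiles("part=1", Array("f1.parquet"))
assert(NoopCache.getLeafFiles("part=1").isEmpty) // put was a no-op
```

The trade-off is repeated file listing per query, in exchange for never touching the runtime-specific cache API.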

Impact

Fixes reading Hudi tables leveraging SparkHoodieTableFileIndex without using the metadata table.

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2026
@yihua yihua added this to the release-0.15.1 milestone Feb 26, 2026
@yihua yihua force-pushed the fix-databricks-spark branch from 3f18780 to e5e9a9f on February 27, 2026 18:41
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

