
add_files with RestCatalog, S3FileIO #11558

Open
DongSeungLee opened this issue Nov 15, 2024 · 2 comments
Labels
question Further information is requested

Comments

DongSeungLee commented Nov 15, 2024

Query engine

Spark 3.5.3

Question

For study, I run a standalone Spark cluster locally, and I have developed my own Iceberg REST catalog based on the Iceberg 1.6.1 spec.
When I run the `add_files` procedure provided by Spark, like below:

CALL iceberg.system.add_files(
table => 'yearly_month_clicks',
source_table => '`parquet`.`s3a://dataquery-warehouse/iceberg/data`'
);

the following error occurs:

Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: s3://dataquery-warehouse/iceberg/dataquery/yearly_month_clicks/metadata/stage-31-task-1619-manifest-855c8009-c073-48b0-9fd7-e12c1daf8930.avro
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
	at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
	at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:97)
	at org.apache.iceberg.spark.SparkTableUtil.buildManifest(SparkTableUtil.java:368)
	at org.apache.iceberg.spark.SparkTableUtil.lambda$importSparkPartitions$1e94a719$1(SparkTableUtil.java:796)
	at org.apache.spark.sql.Dataset.$anonfun$mapPartitions$1(Dataset.scala:3414)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:198)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)

From my point of view, Spark tries to create staging metadata under the location stored in the Iceberg table's metadata.
Here, the Iceberg metadata location starts with `s3`, so the scheme is fixed as `s3`.
Spark then tries to resolve the file system through Hadoop, but the `s3` scheme is not supported; `s3a` should be the right scheme.
How can I overcome this issue?
Thanks, sincerely.

@DongSeungLee DongSeungLee added the question Further information is requested label Nov 15, 2024
RussellSpitzer (Member) commented:
This is actually related to #11541. `add_files` uses some Hadoop FileSystem classes under the hood, and because of this you currently must have a fully set up Hadoop config in your runtime to run `add_files`. With #11541 completed, we should be able to fix this for `add_files` and use `S3FileIO` instead of the Hadoop FileSystem classes.
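Until that lands, one common workaround (a sketch, not confirmed in this thread) is to tell Hadoop to handle the `s3://` scheme with `S3AFileSystem`, so that paths written with `s3` still resolve. The bucket name and endpoint below are the ones from this issue; the exact credential settings depend on your setup:

```shell
# Workaround sketch: map the "s3" scheme to S3AFileSystem so Hadoop can
# resolve s3:// paths produced from the Iceberg metadata location.
spark-sql \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```

The same two properties can go into `spark-defaults.conf` (as `spark.hadoop.fs.s3.impl=...`) or directly into `core-site.xml` (as `fs.s3.impl`). This only affects how Hadoop resolves the scheme; it does not change the locations stored in the table metadata.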

DongSeungLee (Author) commented:
I appreciate your sincere answer.
