
add_files with RestCatalog, S3FileIO #11558

Open
DongSeungLee opened this issue Nov 15, 2024 · 2 comments
Labels
question Further information is requested

Comments

DongSeungLee commented Nov 15, 2024

Query engine

Spark 3.5.3

Question

For study, I run a standalone Spark cluster locally, and I have developed my own Iceberg REST catalog based on the Iceberg 1.6.1 spec.
When I run the `add_files` procedure provided by Spark, like below:

CALL iceberg.system.add_files(
table => 'yearly_month_clicks',
source_table => '`parquet`.`s3a://dataquery-warehouse/iceberg/data`'
);

the following error occurs:

Caused by: org.apache.iceberg.exceptions.RuntimeIOException: Failed to get file system for path: s3://dataquery-warehouse/iceberg/dataquery/yearly_month_clicks/metadata/stage-31-task-1619-manifest-855c8009-c073-48b0-9fd7-e12c1daf8930.avro
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:58)
	at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
	at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:97)
	at org.apache.iceberg.spark.SparkTableUtil.buildManifest(SparkTableUtil.java:368)
	at org.apache.iceberg.spark.SparkTableUtil.lambda$importSparkPartitions$1e94a719$1(SparkTableUtil.java:796)
	at org.apache.spark.sql.Dataset.$anonfun$mapPartitions$1(Dataset.scala:3414)
	at org.apache.spark.sql.execution.MapPartitionsExec.$anonfun$doExecute$3(objects.scala:198)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
	at org.apache.spark.scheduler.Task.run(Task.scala:141)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
	at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3443)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.iceberg.hadoop.Util.getFs(Util.java:56)

From my point of view, Spark tries to create staging metadata under the location stored in the Iceberg table's metadata.
Here, the Iceberg metadata location starts with `s3`, so the scheme is fixed as `s3`.
Spark then tries to resolve the file system through Hadoop, but the `s3` scheme is not supported; `s3a` should be the right scheme.
How can I overcome this issue?
Thanks, sincerely.

@DongSeungLee DongSeungLee added the question Further information is requested label Nov 15, 2024
RussellSpitzer (Member) commented:
This is actually related to #11541. `add_files` uses some Hadoop FileSystem classes under the hood, and because of this you currently must have a fully set up Hadoop config in your runtime to run `add_files`. With #11541 completed, we should be able to fix this for `add_files` and use `S3FileIO` instead of the Hadoop FileSystem classes.
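Until that lands, one common workaround (a sketch, not confirmed in this thread) is to tell Hadoop to handle the `s3://` scheme with `S3AFileSystem`, so that paths written with `s3` still resolve. The bucket name and endpoint below are the ones from this issue; the exact credential settings depend on your setup:

```shell
# Workaround sketch: map the "s3" scheme to S3AFileSystem so Hadoop can
# resolve s3:// paths produced from the Iceberg metadata location.
spark-sql \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```

The same two properties can go into `spark-defaults.conf` (as `spark.hadoop.fs.s3.impl=...`) or directly into `core-site.xml` (as `fs.s3.impl`). This only affects how Hadoop resolves the scheme; it does not change the locations stored in the table metadata.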

DongSeungLee (Author) commented:
I appreciate your sincere answer.
