[BUG] Spark Excel does not work on AWS EMR (even with thin assembly) #985

@christianknoepfle

Description

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Until now I have been using my own compiled spark-excel jar, because the default one did not work on AWS EMR (latest version: 7.9.0).

After the "thin" assembly was introduced, I ran a first test and it worked very well for reading Excel files. But occasionally we also write Excel files, and there it fails with a NoSuchMethod error.

Expected Behavior

The main reason for the failure is the presence of preinstalled Hadoop 3.4.1 jars on AWS EMR. Hadoop 3.4.1 (the latest version) uses commons-compress 1.22. POI also needs commons-compress and relies on an API introduced in 1.25 (it currently depends on 1.27). When running spark-submit, the class loader picks up the Hadoop-provided commons-compress first, so the required API method is missing and the job crashes.
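To confirm which copy of a library wins on a given cluster, a small probe along the following lines could be run on the driver (or inside a task). This is only a sketch using plain JDK reflection: the class name `ClasspathProbe` is made up for illustration, and the default class `org.apache.commons.compress.archivers.zip.ZipFile` is merely an assumed example of a commons-compress class, not necessarily the one involved in the failure.

```java
import java.io.IOException;
import java.net.URL;
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClasspathProbe {

    // Use the context class loader when available, since Spark sets it for user code.
    private static ClassLoader loader() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return cl != null ? cl : ClassLoader.getSystemClassLoader();
    }

    /** Returns the jar or directory the class was actually loaded from. */
    static String loadedFrom(String className) throws ClassNotFoundException {
        // 'false' avoids running static initializers of the probed class.
        Class<?> cls = Class.forName(className, false, loader());
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap/unknown)"; // JDK classes have no code source
        }
        return src.getLocation().toString();
    }

    /** Lists every location visible to the loader that contains the class file. */
    static List<String> allCopies(String className) throws IOException {
        String resource = className.replace('.', '/') + ".class";
        List<String> hits = new ArrayList<>();
        for (URL url : Collections.list(loader().getResources(resource))) {
            hits.add(url.toString());
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumed example class; pass another fully qualified name as the first argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.compress.archivers.zip.ZipFile";
        try {
            System.out.println("Loaded from: " + loadedFrom(name));
        } catch (ClassNotFoundException e) {
            System.out.println("Class not found on the classpath: " + name);
        }
        for (String copy : allCopies(name)) {
            System.out.println("Also visible at: " + copy);
        }
    }
}
```

If the first line points at a jar in the Hadoop installation rather than at the spark-excel assembly, the cluster-provided commons-compress is the one in use.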

I assume this issue holds true for any other "Spark Cluster service provided by your favourite cloud provider".

Now we could try to force a different class-loading order (I haven't investigated that) or patch the EMR installation (which will be tricky), but I guess "shading" would be the best option here. Or does someone have a better idea of how to deal with it?

If shading is the way to go, I would suggest also offering a classifier for "emr" (or a more generic name, because this issue will come up in other cluster environments too). @nightscape, what are your thoughts on this?

BR

Christian

Steps To Reproduce

Here are some details on the issue.

The error message:
[Image: screenshot of the error message]

The change in commons-compress (materialized in 1.25.0):
[Image: screenshot of the change]
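Since the screenshots are not reproduced as text, the missing method is not named here. As a rough way to check on a cluster whether the loaded commons-compress already has a given method, something like the following sketch could help. `MethodProbe` is a hypothetical helper; the class and method names must be taken from the actual stack trace. It only matches public methods by name, so a changed parameter list or return type, or a missing constructor, would not be detected.

```java
import java.lang.reflect.Method;

public class MethodProbe {

    /**
     * Returns true if the class can be loaded and exposes at least one
     * public method with the given name (signature is not compared).
     */
    static boolean hasPublicMethod(String className, String methodName) {
        try {
            ClassLoader cl = Thread.currentThread().getContextClassLoader();
            if (cl == null) {
                cl = ClassLoader.getSystemClassLoader();
            }
            // 'false' avoids running static initializers of the probed class.
            Class<?> cls = Class.forName(className, false, cl);
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class is missing or could not be linked in this environment.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: MethodProbe <fully.qualified.Class> <methodName>");
            return;
        }
        System.out.println(args[0] + "#" + args[1] + " present: "
                + hasPublicMethod(args[0], args[1]));
    }
}
```

A result of `false` for the method named in the NoSuchMethod error would be consistent with an older commons-compress being picked up ahead of the one POI was compiled against.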

Environment

- Spark version: 3.5.5
- Spark-Excel version: 3.5.6_0.31.2
- OS: Amazon Linux 2023
- Cluster environment: AWS EMR 7.9

Anything else?

No response
