
[Task]: Issue with Running Hop Pipeline on Spark Standalone Cluster #4835

Closed
Raja10D opened this issue Jan 27, 2025 · 12 comments

@Raja10D commented Jan 27, 2025

What needs to happen?

I am writing to report an issue we are encountering when running a Hop pipeline on a Spark Standalone cluster.

Our Spark-Submit command works perfectly in a local environment with the following configuration:

spark-submit \
  --master local[4] \
  --class org.apache.hop.beam.run.MainBeam \
  --driver-java-options '-DPROJECT_HOME=/home/decoders/hop_test_2_8_0_backup/hop/config/projects/default' \
  /home/decoders/hop_test_2_8_0_backup/hop/fat-jar.jar \
  /home/decoders/hop_test_2_8_0_backup/hop/Creating_5_payoutAmt_source_Dec23.hpl \
  /home/decoders/hop_test_2_8_0_backup/hop/metadata.json \
  beam_runner_spark

However, when we replace local[4] with the Spark Standalone cluster URL (spark://10d154:7077), the pipeline fails to run in our real-time production environment. We have verified that all necessary dependencies are included in the fat-jar.jar and reviewed the configuration settings, but the issue persists.

Could you please provide guidance on any additional files or settings that might be required to run the Hop pipeline successfully on a Spark Standalone cluster?

Thank you for your assistance.

Issue Priority

Priority: 1

Issue Component

Component: Other, Component: Hop Run

@mattcasters (Contributor)

Hi @Raja10D, please note that you need to submit this on the master of your production cluster, and that you need to specify the master, typically as spark://host:port, also in the Hop metadata element beam_runner_spark.
HTH
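
For illustration, a minimal sketch of what the cluster submission might look like, assuming the default standalone master port of 7077 (the host name and paths here are placeholders; the authoritative URL is the one shown in your Spark master web UI):

# Run on (or from a machine with network access to) the production master.
# --master must be the standalone cluster URL, not local[4]; the same
# spark://host:port value should also be set in the beam_runner_spark
# run configuration serialized into metadata.json.
spark-submit \
  --master spark://<master-host>:7077 \
  --class org.apache.hop.beam.run.MainBeam \
  --driver-java-options '-DPROJECT_HOME=<project-home>' \
  <path-to>/fat-jar.jar <path-to>/pipeline.hpl <path-to>/metadata.json beam_runner_spark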

@Raja10D (Author) commented Jan 27, 2025

> Hi @Raja10D, please note that you need to submit this on the master of your production cluster, and that you need to specify the master, typically as spark://host:port, also in the Hop metadata element beam_runner_spark. HTH

So in spark://host:port we have to mention our port, right? If our master is running on 8080 (the master port), does that mean spark://host:8080?
Am I right?

@mattcasters (Contributor)

You can see the address at the top of the Spark web console.
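
For reference, a hedged way to confirm the cluster URL from the shell on the master node (the log path assumes a default standalone setup started via the sbin scripts; the exact file name varies by user and host):

# The web UI (default port 8080) is not the cluster URL. The spark://
# URL shown at the top of that UI (default port 7077) is also printed
# in the master's startup log.
grep "Starting Spark master at" "$SPARK_HOME"/logs/spark-*-org.apache.spark.deploy.master.Master-*.out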

@mattcasters (Contributor)

More tips can be found in the documentation of the Hop Spark pipeline engine.

@Raja10D (Author) commented Jan 28, 2025

[Screenshots: Spark master web UI, beam_runner_spark run configuration, spark-submit command]

I have attached screenshots of my Spark master web UI and my beam_runner_spark configuration, along with my spark-submit command:

spark-submit \
  --master spark://10d154:7077 \
  --class org.apache.hop.beam.run.MainBeam \
  --driver-java-options '-DPROJECT_HOME=/home/decoders/hop_test_2_8_0_backup/hop/config/projects/default' \
  /home/decoders/hop_test_2_8_0_backup/hop/fat-jar.jar \
  /home/decoders/hop_test_2_8_0_backup/hop/Creating_5_payoutAmt_source_Dec23.hpl \
  /home/decoders/hop_test_2_8_0_backup/hop/metadata.json \
  beam_runner_spark

I followed the documentation step by step. Can you guide me on where I am making a mistake?
It would be very helpful.
Thank you.

@bamaer (Contributor) commented Jan 28, 2025

What do the relevant entries in the /etc/hosts files on your master and other cluster hosts look like?

@Raja10D (Author) commented Jan 28, 2025

My /etc/hosts:


127.0.0.1 localhost
127.0.1.1 decoders-Latitude-7480

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.125.1.140 master-node worker-node

My master and worker are on the same machine.
Hop: 2.8.0
Beam: 2.50
Spark: 3.4.4
Java: 11
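
(For completeness, the Spark and Java versions can be confirmed from the shell with the standard commands below; these are generic CLI calls, not Hop-specific.)

spark-submit --version   # prints the Spark build, e.g. 3.4.4
java -version            # prints the JVM version, e.g. openjdk 11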

@mattcasters (Contributor)

So no entry for hostname 10d154?

@Raja10D (Author) commented Jan 29, 2025

> So no entry for hostname 10d154?

Adding the hostname entry to the /etc/hosts file resolved the problem, and the Hop pipeline is now running smoothly on our Spark Standalone cluster.

Actual content of my /etc/hosts:

127.0.0.1 localhost
127.0.1.1 decoders-Latitude-7480

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.125.1.140 master-node worker-node

192.125.1.140 10d154
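
As a quick sanity check after editing /etc/hosts, name resolution for the master hostname can be verified from every node that submits work (standard Linux commands, nothing Hop-specific):

getent hosts 10d154   # should print: 192.125.1.140   10d154
ping -c 1 10d154      # confirms the host answers at that address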


Your guidance was invaluable, and I appreciate your support. Thank you once again for your assistance!

Thanks, all!

@mattcasters (Contributor)

You're welcome @Raja10D, I'm glad you managed to run on Spark. Can you resolve this issue and the next?

@Raja10D (Author) commented Jan 29, 2025

> You're welcome @Raja10D, I'm glad you managed to run on Spark. Can you resolve this issue and the next?

I'm glad to report that our next step is to run the pipeline in the production environment (on the server). We're preparing to ensure that everything runs smoothly there as well.

Are there any specific configurations or considerations we should be aware of before moving forward with the production deployment? Your insights would be invaluable in helping us achieve a seamless transition.

@mattcasters (Contributor)

Just look at the transform-specific limitations detailed here: https://hop.apache.org/manual/latest/pipeline/beam/getting-started-with-beam.html#_universal_transforms

In general, if it runs on your local 1-node Spark, it should run on a larger cluster as well.
Good luck!
