PySpark in Oozie

Just curious to know if anyone has tried to schedule a PySpark module in Oozie.
What values do I need to enter for the fields below? These fields are required for the Spark Action in Oozie.

  1. Spark Master
  2. Mode
  3. Main class
  4. Jars/py files

I have not yet tried it in Oozie, but here is my guess.

  1. Spark Master should be similar to the --master argument. The value can be “yarn” if running on Hadoop
  2. Mode would be either cluster or client
  3. Main Class is the fully qualified name of the entry point class
  4. Jars/py files is the location of the Java/Scala jar or the Python program

Please note that these arguments are very similar to spark-submit arguments.
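
To make the mapping concrete, here is a rough sketch of the equivalent spark-submit command; the application class, jar path, and arguments below are placeholders, not values from this thread:

  # 1. Spark Master   -> --master
  # 2. Mode           -> --deploy-mode
  # 3. Main class     -> --class (not used for Python applications)
  # 4. Jars/py files  -> the application jar or .py file, followed by its arguments
  spark-submit \
    --master yarn \
    --deploy-mode client \
    --class com.example.MyApp \
    /path/to/my-app.jar \
    arg1 arg2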

See more at https://spark.apache.org/docs/latest/submitting-applications.html
or in the Running on Cluster section of our course: https://cloudxlab.com/course/specialization/3/

This just worked perfectly fine:
Master: local[*]
Mode: client
AppName: Anything
Main class: org.apache.spark.examples.SparkPi
Jars: /usr/hdp/current/spark-client/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar
Arguments: 10
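
For comparison, those field values translate to roughly the following spark-submit command (a sketch assembled from the values above, not a command taken from the post):

  spark-submit \
    --master 'local[*]' \
    --deploy-mode client \
    --name Anything \
    --class org.apache.spark.examples.SparkPi \
    /usr/hdp/current/spark-client/lib/spark-examples-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar \
    10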


I just created a 10-minute video. Here is the video of Spark running on Oozie:


Thanks for this. This is for Java/Scala with Spark, right?
But when we are submitting PySpark, what should I fill in the Main Class field?
I created a PySpark job and it works perfectly fine when submitted through spark-submit. But when I try it through Oozie, it fails. I suspect the values I entered in the fields are the issue. These fields are required for the Spark Action in Oozie.
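
A submission along these lines works for a PySpark job outside Oozie (the script path and arguments here are placeholders); note that there is no --class option, since for a Python application the .py file itself is the entry point:

  # hypothetical PySpark submission; my_job.py and its arguments are placeholders
  spark-submit \
    --master yarn \
    --deploy-mode client \
    /path/to/my_job.py \
    arg1 arg2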
[Screenshot attachment: Capture1]

Can you share your files?