Hive with Spark engine issue

Hi @abhinav,

Can you please help us resolve the issue below?

For Hive, I'm able to run queries on different engines like MapReduce and Tez, and both engines are running fine.
But when I execute with Spark, it throws the issue below:

Below is the parameter I've set up for the Spark engine in Hive:

set hive.execution.engine=spark;
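
For reference, the Hive on Spark "Getting Started" guide also lists a few properties that are usually set alongside the engine switch; the values below are only placeholders for illustration, not taken from our cluster:

set spark.master=<Spark master URL, e.g. yarn-client>;
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<HDFS folder for Spark event logs, must already exist>;
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;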

Below is the stack trace from the Hive shell:

hive (ravi_practice)> CREATE TABLE widgets_avro ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS AVRO AS select * from widgets;
Query ID = ravitejarockon1712_20171228101110_f297411b-39ff-47bb-9674-663111a56b79
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.generateSparkConf(HiveSparkClientFactory.java:160)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient.<init>(RemoteHiveSparkClient.java:89)
at org.apache.hadoop.hive.ql.exec.spark.HiveSparkClientFactory.createHiveSparkClient(HiveSparkClientFactory.java:65)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.open(SparkSessionImpl.java:55)
at org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionManagerImpl.getSession(SparkSessionManagerImpl.java:116)
at org.apache.hadoop.hive.ql.exec.spark.SparkUtilities.getSparkSession(SparkUtilities.java:112)
at org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:101)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1703)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1460)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1237)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1101)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1091)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:739)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
… 26 more
FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. org/apache/spark/SparkConf

I think it is a Spark & Hive configuration issue.
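
From what I can tell, java.lang.NoClassDefFoundError: org/apache/spark/SparkConf usually means the Spark jars are not visible on Hive's classpath. A rough check (the paths are only examples for an HDP-style layout, not verified on our cluster):

# confirm the Spark assembly jar actually exists
ls /usr/hdp/current/spark-client/lib/spark-assembly*.jar

# expose it to Hive before starting the shell (or ADD JAR it inside the session)
export HIVE_AUX_JARS_PATH=/usr/hdp/current/spark-client/lib/spark-assembly-<version>.jar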

Hi @raviteja

I just tried setting Spark as the execution engine and it worked fine. Please find the steps below:

  • set spark.home=/usr/hdp/2.3.4.0-3485/spark;
  • set spark.master=yarn;
  • SET hive.execution.engine=spark;
  • ADD JAR /usr/hdp/2.3.4.0-3485/spark/lib/spark-assembly-1.5.2.2.3.4.0-3485-hadoop2.7.1.2.3.4.0-3485.jar;
  • select * from transactionsgot limit 10;

Hope this helps

Thanks

Hi @abhinav,

I have just tried your settings in Hive, but I'm facing the same issue mentioned above.

I think the query you're running doesn't use any execution engine, because it just reads and prints the file without launching a job. Can you please try the query below in your environment and test it:

select sum(cost) from products;

This query will definitely use the Spark execution engine.
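
A quick way to confirm whether a statement really launches an engine job is EXPLAIN, using the tables already mentioned in this thread:

EXPLAIN select * from widgets limit 10;    -- normally answered by a plain Fetch Operator, no job is launched
EXPLAIN select sum(cost) from products;    -- needs a real job; with hive.execution.engine=spark the plan should show a Spark stage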

Hi @raviteja,

Just curious if you found anything on Google regarding the error. Please keep me updated with your findings and I will quickly deploy the configuration changes.

Thanks

@abhinav Sure, I will let you know.

Hi @abhinav,

Update:
Hive on Spark is not officially supported on HDP (see: HDP - Hive on Spark).
As per the official Hive documentation (Hive on Spark: Getting Started):

Note that you must have a version of Spark which does not include the Hive jars, i.e. one which was not built with the Hive profile.
To remove Hive jars from the installation, simply use the following command under your Spark repository:

Prior to Spark 2.0.0:

./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"
Since Spark 2.0.0:

./dev/make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.7,parquet-provided"

To solve this issue, Spark needs to be built without Hive jars.
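
If Spark is rebuilt that way, Hive still has to be pointed at the new build. A rough sketch with illustrative paths (the spark-assembly symlink is the approach the Hive on Spark guide describes for older Hive releases):

-- from the Hive CLI
set spark.home=/path/to/spark-without-hive;

# from the shell, so Hive can pick up the Spark classes
ln -s /path/to/spark-without-hive/lib/spark-assembly.jar $HIVE_HOME/lib/spark-assembly.jar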

As per my research, I came to know that the Cloudera distribution officially supports Hive on Spark.

I hope this solves the issue. Please let me know if this is fine, and I will close this issue.