Spark-submit with Python throws error

I am getting the error below when submitting a Spark job.

Did you set your default Python path to python3? That is usually when you get this error.
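A quick way to check is to run a few lines with the interpreter you expect Spark to pick up (a minimal sketch; run it once with your default python and once with the Anaconda one to compare):

```python
# Run as: python check_env.py   (check_env.py is just a hypothetical name;
# run it again with /usr/local/anaconda/bin/python to compare the two)
from __future__ import print_function

import os
import sys

print(sys.executable)  # which Python binary is actually running
print(sys.version)     # and its version

# Variables that control which interpreter the Spark driver and workers launch
for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON", "SPARK_HOME", "PYTHONPATH"):
    print(var, "=", os.environ.get(var))
```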

I tested running the same command in a fresh session on your account. It works fine.

Attached is an image of how I am submitting the PySpark job.

Below is the error log.

Can you please let me know which environment variables you are setting?

I tried adding these lines too, but it still fails with the same error:
```python
import os
import sys

os.environ["SPARK_HOME"] = "/usr/spark2.4.3"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/local/anaconda/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/local/anaconda/bin/python"

sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
```
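For reference, here is a quick way to check which pyspark package the Anaconda interpreter actually resolves (a diagnostic sketch; the interpreter path is taken from the error log below, and the script name is hypothetical):

```python
# Run as: /usr/local/anaconda/bin/python check_pyspark.py
# If pyspark.__file__ points at a pip-installed copy rather than the one in
# /usr/spark2.4.3/python/lib/pyspark.zip, two installations are conflicting,
# which can surface as "module 'pyspark' has no attribute '__path__'".
import pyspark

print(pyspark.__file__)
print(getattr(pyspark, "__version__", "unknown"))
```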

```
(base) [sureshkumarmandapati1357@cxln4 ~]$ /usr/spark2.4.3/bin/spark-submit --master local[*] /usr/spark2.4.3/examples/src/main/python/pi.py
21/06/25 10:09:04 INFO spark.SparkContext: Running Spark version 2.4.3
21/06/25 10:09:04 INFO spark.SparkContext: Submitted application: PythonPi
21/06/25 10:09:04 INFO spark.SecurityManager: Changing view acls to: sureshkumarmandapati1357
21/06/25 10:09:04 INFO spark.SecurityManager: Changing modify acls to: sureshkumarmandapati1357
21/06/25 10:09:04 INFO spark.SecurityManager: Changing view acls groups to:
21/06/25 10:09:04 INFO spark.SecurityManager: Changing modify acls groups to:
21/06/25 10:09:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sureshkumarmandapati1357); groups with view permissions: Set(); users with modify permissions: Set(sureshkumarmandapati1357); groups with modify permissions: Set()
21/06/25 10:09:04 INFO util.Utils: Successfully started service 'sparkDriver' on port 37224.
21/06/25 10:09:04 INFO spark.SparkEnv: Registering MapOutputTracker
21/06/25 10:09:04 INFO spark.SparkEnv: Registering BlockManagerMaster
21/06/25 10:09:04 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/06/25 10:09:04 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/06/25 10:09:04 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-a8ea5821-3518-4563-9c6c-e196d38dae5a
21/06/25 10:09:04 INFO memory.MemoryStore: MemoryStore started with capacity 93.3 MB
21/06/25 10:09:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
21/06/25 10:09:04 INFO util.log: Logging initialized @2947ms
21/06/25 10:09:04 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
21/06/25 10:09:05 INFO server.Server: Started @3067ms
21/06/25 10:09:05 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
21/06/25 10:09:05 INFO server.AbstractConnector: Started ServerConnector@63bacafd{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
21/06/25 10:09:05 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c90dfeb{/jobs,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46eeaeca{/jobs/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3acf81a2{/jobs/job,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2ad0c350{/jobs/job/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19fba03e{/stages,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f7fe032{/stages/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@51b0a38d{/stages/stage,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@19369c13{/stages/stage/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4dcef873{/stages/pool,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@151d212{/stages/pool/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7cef7431{/storage,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5194b9bb{/storage/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6ba3de9a{/storage/rdd,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@646d50be{/storage/rdd/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@63a4eefe{/environment,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1709ca50{/environment/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@622b9125{/executors,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6ccdd57e{/executors/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28da1507{/executors/threadDump,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@71d81106{/executors/threadDump/json,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4f4414a1{/static,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3de3281d{/,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@541cefbf{/api,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c1792a4{/jobs/job/kill,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@27f45fc{/stages/stage/kill,null,AVAILABLE,@Spark}
21/06/25 10:09:05 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://cxln4.c.thelab-240901.internal:4041
21/06/25 10:09:05 INFO executor.Executor: Starting executor ID driver on host localhost
21/06/25 10:09:05 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45896.
21/06/25 10:09:05 INFO netty.NettyBlockTransferService: Server created on cxln4.c.thelab-240901.internal:45896
21/06/25 10:09:05 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/06/25 10:09:05 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, cxln4.c.thelab-240901.internal, 45896, None)
21/06/25 10:09:05 INFO storage.BlockManagerMasterEndpoint: Registering block manager cxln4.c.thelab-240901.internal:45896 with 93.3 MB RAM, BlockManagerId(driver, cxln4.c.thelab-240901.internal, 45896, None)
21/06/25 10:09:05 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, cxln4.c.thelab-240901.internal, 45896, None)
21/06/25 10:09:05 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, cxln4.c.thelab-240901.internal, 45896, None)
21/06/25 10:09:05 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@75e5fce{/metrics/json,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO scheduler.EventLoggingListener: Logging events to hdfs:/spark2-history/local-1624615745252
21/06/25 10:09:07 INFO internal.SharedState: loading hive config file: file:/usr/spark2.4.3/conf/hive-site.xml
21/06/25 10:09:07 INFO internal.SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('/apps/hive/warehouse').
21/06/25 10:09:07 INFO internal.SharedState: Warehouse path is '/apps/hive/warehouse'.
21/06/25 10:09:07 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@47cbdc0b{/SQL,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@27fd9857{/SQL/json,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@55b9087{/SQL/execution,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@71e607d9{/SQL/execution/json,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@55290319{/static/sql,null,AVAILABLE,@Spark}
21/06/25 10:09:07 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/06/25 10:09:08 INFO spark.SparkContext: Starting job: reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Got job 0 (reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44) with 2 output partitions
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44)
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Parents of final stage: List()
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Missing parents: List()
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44), which has no missing parents
21/06/25 10:09:08 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 6.1 KB, free 93.3 MB)
21/06/25 10:09:08 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.2 KB, free 93.3 MB)
21/06/25 10:09:08 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on cxln4.c.thelab-240901.internal:45896 (size: 4.2 KB, free: 93.3 MB)
21/06/25 10:09:08 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44) (first 15 tasks are for partitions Vector(0, 1))
21/06/25 10:09:08 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
21/06/25 10:09:08 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7852 bytes)
21/06/25 10:09:08 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7852 bytes)
21/06/25 10:09:08 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
21/06/25 10:09:08 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
21/06/25 10:09:08 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.SparkException:
Error from python worker:
/usr/local/anaconda/bin/python: Error while finding module specification for 'pyspark.daemon' (AttributeError: module 'pyspark' has no attribute '__path__')
PYTHONPATH was:
/usr/spark2.4.3/python/lib/pyspark.zip:/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip:/usr/spark2.4.3/jars/spark-core_2.11-2.4.3.jar:/usr/spark2.4.3/python/:/python/:
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
21/06/25 10:09:08 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
/usr/local/anaconda/bin/python: Error while finding module specification for 'pyspark.daemon' (AttributeError: module 'pyspark' has no attribute '__path__')
PYTHONPATH was:
/usr/spark2.4.3/python/lib/pyspark.zip:/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip:/usr/spark2.4.3/jars/spark-core_2.11-2.4.3.jar:/usr/spark2.4.3/python/:/python/:
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

21/06/25 10:09:08 ERROR scheduler.TaskSetManager: Task 1 in stage 0.0 failed 1 times; aborting job
21/06/25 10:09:08 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
21/06/25 10:09:08 INFO scheduler.TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
21/06/25 10:09:08 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
21/06/25 10:09:08 INFO executor.Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
21/06/25 10:09:08 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44) failed in 0.478 s due to Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
/usr/local/anaconda/bin/python: Error while finding module specification for 'pyspark.daemon' (AttributeError: module 'pyspark' has no attribute '__path__')
PYTHONPATH was:
/usr/spark2.4.3/python/lib/pyspark.zip:/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip:/usr/spark2.4.3/jars/spark-core_2.11-2.4.3.jar:/usr/spark2.4.3/python/:/python/:
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
21/06/25 10:09:08 INFO executor.Executor: Executor interrupted and killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled
21/06/25 10:09:08 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): TaskKilled (Stage cancelled)
21/06/25 10:09:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/06/25 10:09:08 INFO scheduler.DAGScheduler: Job 0 failed: reduce at /usr/spark2.4.3/examples/src/main/python/pi.py:44, took 0.566778 s
Traceback (most recent call last):
File "/usr/spark2.4.3/examples/src/main/python/pi.py", line 44, in <module>
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
File "/usr/spark2.4.3/python/lib/pyspark.zip/pyspark/rdd.py", line 844, in reduce
File "/usr/spark2.4.3/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/spark2.4.3/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
/usr/local/anaconda/bin/python: Error while finding module specification for 'pyspark.daemon' (AttributeError: module 'pyspark' has no attribute '__path__')
PYTHONPATH was:
/usr/spark2.4.3/python/lib/pyspark.zip:/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip:/usr/spark2.4.3/jars/spark-core_2.11-2.4.3.jar:/usr/spark2.4.3/python/:/python/:
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException:
Error from python worker:
/usr/local/anaconda/bin/python: Error while finding module specification for 'pyspark.daemon' (AttributeError: module 'pyspark' has no attribute '__path__')
PYTHONPATH was:
/usr/spark2.4.3/python/lib/pyspark.zip:/usr/spark2.4.3/python/lib/py4j-0.10.7-src.zip:/usr/spark2.4.3/jars/spark-core_2.11-2.4.3.jar:/usr/spark2.4.3/python/:/python/:
org.apache.spark.SparkException: No port number in pyspark.daemon's stdout
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:204)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:122)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:95)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:108)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
… 1 more

21/06/25 10:09:08 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on cxln4.c.thelab-240901.internal:45896 in memory (size: 4.2 KB, free: 93.3 MB)
21/06/25 10:09:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
21/06/25 10:09:08 INFO server.AbstractConnector: Stopped Spark@63bacafd{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
21/06/25 10:09:08 INFO ui.SparkUI: Stopped Spark web UI at http://cxln4.c.thelab-240901.internal:4041
21/06/25 10:09:08 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/06/25 10:09:08 INFO memory.MemoryStore: MemoryStore cleared
21/06/25 10:09:08 INFO storage.BlockManager: BlockManager stopped
21/06/25 10:09:08 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
21/06/25 10:09:08 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/06/25 10:09:08 INFO spark.SparkContext: Successfully stopped SparkContext
21/06/25 10:09:08 INFO util.ShutdownHookManager: Shutdown hook called
21/06/25 10:09:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-88838af5-a836-434f-9588-1585c56a5c4b
21/06/25 10:09:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-8c8f49c2-8800-4361-88c4-11a8acb785ab
21/06/25 10:09:08 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-8c8f49c2-8800-4361-88c4-11a8acb785ab/pyspark-69fcb3f6-6313-4f5a-b17f-3f2401d50a03
```

I tried the same code in a notebook; it throws the same error.

Can I have an update on this?

I was able to run the example in the following way without setting any environment variables:

/usr/spark2.4.3/bin/run-example SparkPi 10

```
....
21/06/29 05:25:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
21/06/29 05:25:11 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.828 s
21/06/29 05:25:11 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.912119 s
Pi is roughly 3.147039147039147
21/06/29 05:25:11 INFO server.AbstractConnector: Stopped Spark@759d81f3{HTTP/1.1,[http/1.1]}{0.0.0.0:4048}
21/06/29 05:25:11 INFO ui.SparkUI: Stopped Spark web UI at http://cxln4.c.thelab-240901.internal:4048
21/06/29 05:25:11 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/06/29 05:25:11 INFO memory.MemoryStore: MemoryStore cleared
21/06/29 05:25:11 INFO storage.BlockManager: BlockManager stopped
....
```
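If the Scala example runs but the Python one does not, a stripped-down PySpark job can help confirm whether the failure is specific to the Python worker setup. Below is a sketch that mirrors what examples/src/main/python/pi.py does (the file name pi_check.py is hypothetical):

```python
# pi_check.py -- minimal Monte-Carlo Pi, mirroring the bundled pi.py example.
# Submit with: /usr/spark2.4.3/bin/spark-submit --master local[*] pi_check.py
from __future__ import print_function

import random
from operator import add

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PiCheck").getOrCreate()

partitions = 2
n = 100000 * partitions

def inside(_):
    # Sample a point in the unit square; count it if it lands in the circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
             .map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))

spark.stop()
```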

Can you please run this in YARN mode? It is failing for me.

Can I have a response, please?

Hi Suresh,

The default installation of Spark works well with YARN. The following runs fine:

run-example --master yarn SparkPi 10

Also, without YARN, all versions seem to run fine:

  • /usr/spark2.4.3/bin/run-example SparkPi 10
  • run-example SparkPi 10

However, the non-default installation does not seem to work with YARN:

/usr/spark2.4.3/bin/run-example --master yarn SparkPi 10

It throws an exception about a missing class:
java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig

I am trying to figure out how YARN can work with other versions of Spark.
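In the meantime, a quick way to see which Jersey jars each side ships (a sketch; the Hadoop library path is an assumption about this cluster's layout):

```python
# The YARN timeline client in Hadoop expects Jersey 1.x classes such as
# com.sun.jersey.api.client.config.ClientConfig, while Spark 2.4 bundles
# Jersey 2.x, so listing the jars on both sides shows where they diverge.
import glob

for pattern in ("/usr/spark2.4.3/jars/*jersey*.jar",
                "/usr/hdp/current/hadoop-client/lib/*jersey*.jar"):  # assumed HDP layout
    print(pattern)
    for jar in sorted(glob.glob(pattern)):
        print("  " + jar)
```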

Hi Suresh,

The main reason for this error is that run-example depends on a specific version of Jersey, and there is a version conflict.

In your own projects, however, this error should not occur, because you can bundle your dependencies in the POM file.

You can take a look at how to write a Spark application here: