[prassadht2483@cxln5 ~]$ cat s_b1.py
import sys
from pyspark import SparkContext
if __name__ == "__main__":
    sc = SparkContext("local", "word count")
    sc.setLogLevel("ERROR")
    # Read the input file, split each line into words, and count each word
    lines = sc.textFile("file.txt")
    words = lines.flatMap(lambda s: s.split(" "))
    wordCounts = words.countByValue()
    for word, count in wordCounts.items():
        print("{} : {}".format(word, count))
[prassadht2483@cxln5 ~]$ spark-submit s_b1.py
SPARK_MAJOR_VERSION is set to 2, using Spark2
21/02/11 03:11:22 INFO SparkContext: Running Spark version 2.1.1.2.6.2.0-205
21/02/11 03:11:22 INFO SecurityManager: Changing view acls to: prassadht2483
21/02/11 03:11:23 INFO SecurityManager: Changing modify acls to: prassadht2483
21/02/11 03:11:23 INFO SecurityManager: Changing view acls groups to:
21/02/11 03:11:23 INFO SecurityManager: Changing modify acls groups to:
21/02/11 03:11:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(prassadht2483); groups with view permissions: Set(); users with modify permissions: Set(prassadht2483); groups with modify permissions: Set()
21/02/11 03:11:23 INFO Utils: Successfully started service 'sparkDriver' on port 40527.
21/02/11 03:11:23 INFO SparkEnv: Registering MapOutputTracker
21/02/11 03:11:23 INFO SparkEnv: Registering BlockManagerMaster
21/02/11 03:11:23 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/02/11 03:11:23 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/02/11 03:11:23 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-ad25ba0e-91d4-4244-8510-30a45e30c8ff
21/02/11 03:11:23 INFO MemoryStore: MemoryStore started with capacity 93.3 MB
21/02/11 03:11:23 INFO SparkEnv: Registering OutputCommitCoordinator
21/02/11 03:11:23 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/02/11 03:11:23 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.142.0.5:4040
21/02/11 03:11:23 INFO SparkContext: Added file file:/home/prassadht2483/s_b1.py at file:/home/prassadht2483/s_b1.py with timestamp 1613013083976
21/02/11 03:11:23 INFO Utils: Copying /home/prassadht2483/s_b1.py to /tmp/spark-3b965f23-dc8b-4734-8ece-086e9e86c79c/userFiles-c9785ad1-dd00-4161-950e-ede990557125/s_b1.py
21/02/11 03:11:24 INFO Executor: Starting executor ID driver on host localhost
21/02/11 03:11:24 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43874.
21/02/11 03:11:24 INFO NettyBlockTransferService: Server created on 10.142.0.5:43874
21/02/11 03:11:24 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/02/11 03:11:24 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.142.0.5, 43874, None)
21/02/11 03:11:24 INFO BlockManagerMasterEndpoint: Registering block manager 10.142.0.5:43874 with 93.3 MB RAM, BlockManagerId(driver, 10.142.0.5, 43874, None)
21/02/11 03:11:24 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.142.0.5, 43874, None)
21/02/11 03:11:24 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.142.0.5, 43874, None)
21/02/11 03:11:24 INFO EventLoggingListener: Logging events to hdfs:///spark2-history/local-1613013084023
Traceback (most recent call last):
  File "/home/prassadht2483/s_b1.py", line 16, in <module>
    wordCounts = words.countByValue()
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 1246, in countByValue
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 834, in reduce
  File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/rdd.py", line 808, in collect
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cxln1.c.thelab-240901.internal:8020/user/prassadht2483/file.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:53)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1968)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
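
The traceback shows the actual problem: sc.textFile("file.txt") resolves a bare path against the cluster's default filesystem, so Spark looks for hdfs://cxln1.c.thelab-240901.internal:8020/user/prassadht2483/file.txt, which does not exist. Two ways to fix it, sketched below assuming file.txt sits in the local home directory next to s_b1.py:

# Option 1: copy the input to the HDFS path Spark is resolving to
hdfs dfs -put file.txt /user/prassadht2483/file.txt

# Option 2: read from the local filesystem with an explicit file:// URI
lines = sc.textFile("file:///home/prassadht2483/file.txt")

With either change, rerunning spark-submit s_b1.py should print the per-word counts instead of the InvalidInputException.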