Unable to launch pyspark on console

[sarithadsr217850@cxln5 ~]$ export PATH=/usr/local/anaconda/bin:$PATH
[sarithadsr217850@cxln5 ~]$ pyspark
SPARK_MAJOR_VERSION is set to 2, using Spark2
File "/bin/hdp-select", line 232
    print "ERROR: Invalid package - " + name
                                      ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
Fatal Python error: Py_Initialize: can't initialize sys standard streams
Traceback (most recent call last):
  File "/usr/local/anaconda/lib/python3.6/io.py", line 52, in <module>
  File "/home/sarithadsr217850/abc.py", line 2, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/__init__.py", line 40, in <module>
  File "/usr/local/anaconda/lib/python3.6/functools.py", line 20, in <module>
ImportError: cannot import name 'get_cache_token'
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Fatal Python error: Py_Initialize: can't initialize sys standard streams
Traceback (most recent call last):
  File "/usr/local/anaconda/lib/python3.6/io.py", line 52, in <module>
  File "/home/sarithadsr217850/abc.py", line 2, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/__init__.py", line 40, in <module>
  File "/usr/local/anaconda/lib/python3.6/functools.py", line 20, in <module>
ImportError: cannot import name 'get_cache_token'
Aborted

Hi, Saritha.

Can you check it now? pyspark is running perfectly.
Kindly ignore the error messages; they are caused by some bin scripts that run in the backend.

Just run the last 6 lines of the PySpark snippet from the article below.

All the other environment variables have already been set.
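
Once you are at the pyspark prompt, a quick end-to-end sanity check (a minimal sketch; it assumes nothing beyond the sc SparkContext that pyspark creates for you) is:

    >>> sc.parallelize(range(10)).sum()   # exercises the cluster end to end
    45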

All the best!

Spark Streaming's Kafka libraries not found in class path. Try one of the following.

  1. Include the Kafka library and its dependencies with in the
     spark-submit command as

     $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.4.3 ...

  2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
     Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.4.3.
     Then, include the jar in the spark-submit command as

     $ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
     12 ssc = StreamingContext(sc, 5)
     13
---> 14 lines = KafkaUtils.createStream(ssc, 'localhost:2181', "spark-streaming-consumer", {'saritha_kafka_test':1})
     15
     16 # Split each line in each batch into words

/usr/spark2.4.3/python/lib/pyspark.zip/pyspark/streaming/kafka.py in createStream(ssc, zkQuorum, groupId, topics, kafkaParams, storageLevel, keyDecoder, valueDecoder)
     76             raise TypeError("topics should be dict")
     77         jlevel = ssc._sc._getJavaStorageLevel(storageLevel)
---> 78         helper = KafkaUtils._get_helper(ssc._sc)
     79         jstream = helper.createStream(ssc._jssc, kafkaParams, topics, jlevel)
     80         ser = PairDeserializer(NoOpSerializer(), NoOpSerializer())

/usr/spark2.4.3/python/lib/pyspark.zip/pyspark/streaming/kafka.py in _get_helper(sc)
    215     def _get_helper(sc):
    216         try:
--> 217             return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
    218         except TypeError as e:
    219             if str(e) == "'JavaPackage' object is not callable":

TypeError: 'JavaPackage' object is not callable

Jupyter notebook is working, but pyspark is not working in the CLI.

Hi Saritha,

pyspark doesn't work with Python 3 here. So, while running pyspark, you don't need to run "export PATH=/usr/local/anaconda/bin:$PATH", because that makes Python 3 the default.
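
If you have already run that export in your current session, either reconnect to get a fresh shell or strip the Anaconda entry back out of PATH. A minimal sketch, assuming the export above is the only Anaconda entry in PATH:

    $ export PATH=$(echo "$PATH" | sed 's|/usr/local/anaconda/bin:||')   # drop the Anaconda entry
    $ which python    # should now resolve to the system Python 2
    $ pyspark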

Check this for Spark Streaming (a condensed sketch of it follows below): https://github.com/cloudxlab/bigdata/tree/master/spark/examples/streaming/word_count_kafka
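
For quick reference, the heart of that word-count-over-Kafka example looks roughly like the following. This is a sketch, not the exact repository code: the ZooKeeper address, consumer group, and topic name are copied from your traceback above, and the job must be launched with the Kafka package on the classpath via one of the two spark-submit options quoted earlier.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaWordCount")
    ssc = StreamingContext(sc, 5)  # 5-second batch interval, as in your snippet

    # createStream(ssc, zkQuorum, groupId, {topic: numPartitions})
    lines = KafkaUtils.createStream(ssc, 'localhost:2181',
                                    'spark-streaming-consumer',
                                    {'saritha_kafka_test': 1})

    # Each element is a (key, message) pair; count the words in the messages.
    counts = lines.map(lambda kv: kv[1]) \
                  .flatMap(lambda line: line.split(' ')) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()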

How to make a spark-submit program using Scala: https://github.com/cloudxlab/bigdata/tree/master/spark/examples/streaming/word_count_sbt