Reading SequenceFile in Spark-Shell

Hi,
I have created a sequence file "tstSeqFile" in PySpark.

rdd = sc.parallelize([("k1", 1.0), ("k2", 3.2), ("k3", 2.1)], 2)
rdd.saveAsSequenceFile("tstSeqFile")

When I tried to read it from spark-shell (Scala), I got a "NotSerializableException".

Can you please help me understand why I am receiving this error, and how can I resolve it?

Thanks
Sairam Srinivas. V

It seems permissions to save files to disk may have been restricted, since many users may be creating files and the disk space may have filled up. Your command itself is correct, though, so you can try it on your local setup.

You can also just import the whole package, which covers all the Writable classes:

import org.apache.hadoop.io._
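
For example, with that import in scope, reading your tstSeqFile could look roughly like this. It is only a sketch: it assumes PySpark saved the keys as Text and the float values as DoubleWritable, and it converts both to plain Scala types before collecting (collecting the raw Writables is usually what triggers a NotSerializableException):

// Rough sketch; requires: import org.apache.hadoop.io._ (as above).
// Assumed classes: Text keys, DoubleWritable values (PySpark floats).
// Converting to String/Double before collect avoids serializing Writables.
val data = sc.sequenceFile("tstSeqFile", classOf[Text], classOf[DoubleWritable])
  .map { case (k, v) => (k.toString, v.get) }
  .collect()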

You can also read this for more details:

http://dmtolpeko.com/category/sequencefile/

No, there is no permission issue.

Hi Sairam,

There are two ways to deal with Sequence Files.

[Please launch spark-shell and run the following commands in it.]

  1. Spark Way (Preferred)

    // Here we are writing the files to "seq-dir" in HDFS

    val RDD = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
    RDD.saveAsSequenceFile("seq-dir")

This is what I did:

[sandeep@cxln5 ~]$ spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.142.0.5:4040
Spark context available as 'sc' (master = local[*], app id = local-1604952037647).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/

Using Scala version 2.11.8 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val RDD = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
RDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> RDD.saveAsSequenceFile("seq-dir")

scala>

  2. Old Way (Painful)

    // Here we are loading the data saved in the previous step.

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.io.IntWritable
    val sequence_data = sc.sequenceFile("seq-dir/*", classOf[Text], classOf[IntWritable]).map{case (x, y) => (x.toString, y.get())}.collect

Here is what I did:

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable

scala> val sequence_data = sc.sequenceFile("seq-dir/*", classOf[Text], classOf[IntWritable]).map{case (x, y) => (x.toString, y.get())}.collect
sequence_data: Array[(String, Int)] = Array((a,1), (b,2), (c,3))
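
The .map step above is what matters for the original error: Hadoop Writables such as Text and IntWritable do not implement java.io.Serializable, so collecting them as-is is typically what throws the NotSerializableException. Converting each record to plain Scala types first avoids it. If you would rather not deal with the Writable classes at all, spark-shell also accepts a typed call that lets Spark's implicit converters do this per record; a minimal sketch against the same seq-dir data:

    // Typed read: Spark converts Text/IntWritable to String/Int record by record,
    // so nothing non-serializable reaches collect.
    val typed_data = sc.sequenceFile[String, Int]("seq-dir/*").collect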

You can exit spark-shell by typing :q and then check whether the files were created using:

hadoop fs -ls seq-dir

This is what I did:

scala> :q
[sandeep@cxln5 ~]$ hadoop fs -ls seq-dir
Found 17 items
-rw-r--r-- 3 sandeep hdfs 0 2020-11-09 20:00 seq-dir/_SUCCESS
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00000
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00001
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00002
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00003
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00004
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00005
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00006
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00007
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00008
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00009
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00010
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00011
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00012
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00013
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00014
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00015
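
If you want to peek at the decoded records from the command line, hadoop fs -text can read SequenceFiles (plain -cat would only show the binary), for example:

hadoop fs -text seq-dir/part-00005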
