Hi Sairam,
There are two ways to deal with Sequence Files.
[Please launch spark-shell and run the following commands in it.]
-
Spark Way (Preferred)
// Here we are writing the files to "seq-dir" in HDFS
val RDD = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
RDD.saveAsSequenceFile("seq-dir")
This is what I did:
[sandeep@cxln5 ~]$ spark-shell
SPARK_MAJOR_VERSION is set to 2, using Spark2
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://10.142.0.5:4040
Spark context available as 'sc' (master = local[*], app id = local-1604952037647).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1.2.6.2.0-205
      /_/
Using Scala version 2.11.8 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val RDD = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)))
RDD: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> RDD.saveAsSequenceFile("seq-dir")
scala>
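Reading the data back is just as easy the Spark way: sc.sequenceFile has a typed variant that converts the Hadoop Writables to plain Scala types through implicit converters, so you never touch Text or IntWritable yourself. A minimal sketch, assuming the "seq-dir" written above:
// Spark supplies implicit Writable converters for common types
// such as String and Int, so no Hadoop classes are needed here.
val readBack = sc.sequenceFile[String, Int]("seq-dir/*").collect
// Expect Array((a,1), (b,2), (c,3)), possibly in a different order.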
-
Old way (Painful)
// Here we are loading the data saved in the previous step.
// The keys and values come back as Hadoop Writables (Text, IntWritable),
// so we convert them to plain Scala types before collecting.
import org.apache.hadoop.io.Text
import org.apache.hadoop.io.IntWritable
val sequence_data = sc.sequenceFile("seq-dir/*", classOf[Text], classOf[IntWritable]).map{case (x, y) => (x.toString, y.get())}.collect
Here is what I did:
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.IntWritable
scala> val sequence_data = sc.sequenceFile("seq-dir/*", classOf[Text], classOf[IntWritable]).map{case (x, y) => (x.toString, y.get())}.collect
sequence_data: Array[(String, Int)] = Array((a,1), (b,2), (c,3))
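If you need control over the read parallelism, sequenceFile also accepts a minimum partition count as its fourth argument. A minimal sketch against the same "seq-dir":
// Ask Spark for at least 4 input partitions when reading.
val parts = sc.sequenceFile("seq-dir/*", classOf[Text], classOf[IntWritable], 4)
  .map { case (x, y) => (x.toString, y.get()) }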
You can exit spark-shell by typing :q
and then check whether the files were created using:
hadoop fs -ls seq-dir
This is what I did:
scala> :q
[sandeep@cxln5 ~]$ hadoop fs -ls seq-dir
Found 17 items
-rw-r--r-- 3 sandeep hdfs 0 2020-11-09 20:00 seq-dir/_SUCCESS
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00000
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00001
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00002
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00003
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00004
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00005
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00006
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00007
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00008
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00009
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00010
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00011
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00012
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00013
-rw-r--r-- 3 sandeep hdfs 85 2020-11-09 20:00 seq-dir/part-00014
-rw-r--r-- 3 sandeep hdfs 99 2020-11-09 20:00 seq-dir/part-00015
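A note on the listing: the 16 part files match the default number of partitions (with master = local[*], that is usually one per core). If you would rather end up with a single file, coalesce before saving; a minimal sketch, where "seq-dir-single" is just an example output path:
// Shrink to one partition so the save produces a single part file.
// "seq-dir-single" is a hypothetical output directory.
RDD.coalesce(1).saveAsSequenceFile("seq-dir-single")
You can also peek inside a part file straight from the shell with hadoop fs -text seq-dir/part-00000, which knows how to decode sequence files.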