After a lot of digging around, I think I have finally found the issue. It is a very interesting case that I had never thought of before.
When we create the RDD, Spark creates it with 16 partitions:
scala> var arr = 1 to 100
arr: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)
scala> val nums = sc.parallelize(arr)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:26
scala> nums.getNumPartitions
res5: Int = 16
This is because there are 16 processors on the machine. When running in local mode, the default number of partitions is equal to the number of processor cores (see the quick check after the listing below):
[kmrsherma3845@cxln4 ~]$ cat /proc/cpuinfo|grep pro
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11
processor : 12
processor : 13
processor : 14
processor : 15
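In local mode the default parallelism tracks this core count; you can confirm it in the same spark-shell session (the res index below is illustrative, not from the original run):
scala> sc.defaultParallelism
res6: Int = 16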
Now, since there are 16 partitions, HDFS would try to reserve around HDFS block size * number of partitions * replication factor = 128 MB * 16 * 3 = 6144 MB = 6 GB against the space quota. But the quota is 4 GB. Hence, you get this error.
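If you want to check the space quota and the space already consumed for your directory, one way is via the Hadoop FileSystem API from the same spark-shell. A minimal sketch, assuming a standard HDFS setup; replace the path with your own HDFS home directory:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)            // filesystem from the Spark Hadoop config
val cs = fs.getContentSummary(new Path("/user/kmrsherma3845"))  // example path, use your own
println(s"Space quota: ${cs.getSpaceQuota} bytes, consumed: ${cs.getSpaceConsumed} bytes")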
I confirmed this by clearing the quota on one account and then running the script. The output folder had 16 part files, one per partition:
[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls doublevals_28jan2020_2
Found 17 items
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 0 2020-01-28 12:33 doublevals_28jan2020_2/_SUCCESS
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 195 2020-01-28 12:33 doublevals_28jan2020_2/part-00000
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 252 2020-01-28 12:33 doublevals_28jan2020_2/part-00001
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 248 2020-01-28 12:33 doublevals_28jan2020_2/part-00002
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 252 2020-01-28 12:33 doublevals_28jan2020_2/part-00003
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 248 2020-01-28 12:33 doublevals_28jan2020_2/part-00004
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 252 2020-01-28 12:33 doublevals_28jan2020_2/part-00005
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 248 2020-01-28 12:33 doublevals_28jan2020_2/part-00006
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 253 2020-01-28 12:33 doublevals_28jan2020_2/part-00007
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 310 2020-01-28 12:33 doublevals_28jan2020_2/part-00008
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 315 2020-01-28 12:33 doublevals_28jan2020_2/part-00009
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 310 2020-01-28 12:33 doublevals_28jan2020_2/part-00010
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 315 2020-01-28 12:33 doublevals_28jan2020_2/part-00011
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 310 2020-01-28 12:33 doublevals_28jan2020_2/part-00012
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 315 2020-01-28 12:33 doublevals_28jan2020_2/part-00013
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 310 2020-01-28 12:33 doublevals_28jan2020_2/part-00014
-rw-r--r-- 3 sandeepgiri9034 sandeepgiri9034 315 2020-01-28 12:33 doublevals_28jan2020_2/part-00015
So, what can you do? Create the RDD with a predefined number of partitions. The following code should work:
val arr = 1 to 100
val nums = sc.parallelize(arr, 2)              // explicitly request 2 partitions
def multiplyByTwo(x: Int): Int = x * 2
val dbls = nums.map(multiplyByTwo)
dbls.saveAsTextFile("doublevals_28jan2020_3")  // writes 2 part files instead of 16
Notice the second argument to sc.parallelize.
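If you already have an RDD with too many partitions, another option is to shrink it before saving. A minimal sketch of the same job using coalesce, which merges partitions without a full shuffle (the output path here is hypothetical):
def multiplyByTwo(x: Int): Int = x * 2
val nums = sc.parallelize(1 to 100)             // defaults to 16 partitions on this machine
val dbls = nums.map(multiplyByTwo).coalesce(2)  // shrink to 2 partitions, no shuffle
dbls.saveAsTextFile("doublevals_28jan2020_4")   // hypothetical output folder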