Error: Quota exceeded while running Spark job but haven't used much disk space [Solved]

I was following the free Spark tutorial available on your site, which uses a very small program.

>     var arr = 1 to 1000
>     val nums = sc.parallelize(arr)
>     def multiplyByTwo(x:Int):Int = x*2
>     var dbls = nums.map(multiplyByTwo)
>     dbls.saveAsTextFile("doublevals")

When I run this, it gives the following quota error.

Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.DSQuotaExceededException): The DiskSpace quota of /user/kmrsherma3845 is exceeded: quota = 4294967296 B = 4 GB but diskspace consumed = 4564046624 B = 4.25 GB

I’ve deleted all files in my Hadoop file system as well as all the Hive tables.
I’ve also checked the used disk space, and it is in the MB range:

hdfs dfs -du -h /user/kmrsherma3845
0       /user/kmrsherma3845/.Trash
21.6 M  /user/kmrsherma3845/.staging
1.8 K   /user/kmrsherma3845/doubles
1.9 K   /user/kmrsherma3845/doublevals
302     /user/kmrsherma3845/hive
0       /user/kmrsherma3845/tmp

I can also copy folders to my HDFS user directory without any error; this happens only when running this small Scala program.

Can someone help me understand what’s going on here? Thanks.

Can you please search for this question on the forum? It has been discussed here in a lot of questions. Deleted files will have to be purged.

Can you please run the purge command?

I’ve already tried that with the following command:

hdfs dfs -expunge

but it gave an ‘Access denied’ error:

20/01/23 06:07:35 WARN hdfs.DFSClient: Cannot get all encrypted trash roots
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Access denied for user kmrsherma3845. Superuser privilege is required

So I don’t know what else I can do.

Moreover, if you look at the -du -h command output I gave above, .Trash is 0 bytes, so it can’t be due to trash files.

After a lot of digging around, I think I have finally found the issue. It is a very interesting case that I had never thought of before.

When we create the RDD, Spark creates it with 16 partitions:

scala> var arr = 1 to 100
arr: scala.collection.immutable.Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

scala> val nums = sc.parallelize(arr)
nums: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:26

scala> nums.getNumPartitions
res5: Int = 16

scala>

This is because there are 16 processors on the machine. When running in local mode, the default number of partitions is equal to the number of processors (you can also confirm this from the Spark shell, as sketched after the listing below):

[kmrsherma3845@cxln4 ~]$ cat /proc/cpuinfo|grep pro
processor	: 0
processor	: 1
processor	: 2
processor	: 3
processor	: 4
processor	: 5
processor	: 6
processor	: 7
processor	: 8
processor	: 9
processor	: 10
processor	: 11
processor	: 12
processor	: 13
processor	: 14
processor	: 15
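
To double-check this on your own cluster, here is a minimal sketch in the Spark shell; the values in the comments are what you would expect on this 16-core machine, and sc is the shell's SparkContext:

// Minimal sketch (spark-shell, local mode): the default parallelism follows the core count,
// but an explicit partition count overrides it.
println(sc.defaultParallelism)           // 16 on this 16-core machine
val few = sc.parallelize(1 to 100, 2)    // ask for 2 partitions explicitly
println(few.getNumPartitions)            // 2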

Now, since there are 16 partitions, HDFS would try to reserve around HDFS block size × number of partitions × replication factor = 128 MB × 16 × 3 = 6144 MB = 6 GB while writing the output. But the quota is 4 GB. Hence, you get this error.
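
As a quick back-of-the-envelope check (assuming the default 128 MB HDFS block size, and the replication factor of 3 visible in the -ls output below):

// Rough estimate of the space HDFS reserves while the 16 part files are open for writing.
val blockSizeMB = 128        // default HDFS block size assumed here
val numPartitions = 16       // one part file per partition
val replicationFactor = 3    // matches the "3" column in the -ls output below
val reservedMB = blockSizeMB * numPartitions * replicationFactor
println(s"$reservedMB MB reserved vs. 4096 MB quota")   // 6144 MB reserved vs. 4096 MB quota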

I found this out by clearing the quota on one account and then running the script. The output folder had 16 part files:

[sandeepgiri9034@cxln4 ~]$ hadoop fs -ls doublevals_28jan2020_2
Found 17 items
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034          0 2020-01-28 12:33 doublevals_28jan2020_2/_SUCCESS
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        195 2020-01-28 12:33 doublevals_28jan2020_2/part-00000
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        252 2020-01-28 12:33 doublevals_28jan2020_2/part-00001
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        248 2020-01-28 12:33 doublevals_28jan2020_2/part-00002
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        252 2020-01-28 12:33 doublevals_28jan2020_2/part-00003
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        248 2020-01-28 12:33 doublevals_28jan2020_2/part-00004
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        252 2020-01-28 12:33 doublevals_28jan2020_2/part-00005
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        248 2020-01-28 12:33 doublevals_28jan2020_2/part-00006
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        253 2020-01-28 12:33 doublevals_28jan2020_2/part-00007
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        310 2020-01-28 12:33 doublevals_28jan2020_2/part-00008
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        315 2020-01-28 12:33 doublevals_28jan2020_2/part-00009
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        310 2020-01-28 12:33 doublevals_28jan2020_2/part-00010
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        315 2020-01-28 12:33 doublevals_28jan2020_2/part-00011
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        310 2020-01-28 12:33 doublevals_28jan2020_2/part-00012
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        315 2020-01-28 12:33 doublevals_28jan2020_2/part-00013
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        310 2020-01-28 12:33 doublevals_28jan2020_2/part-00014
-rw-r--r--   3 sandeepgiri9034 sandeepgiri9034        315 2020-01-28 12:33 doublevals_28jan2020_2/part-00015

So, what can you do? Create the RDD with a predefined, smaller number of partitions. The following code should work:

var arr = 1 to 100
val nums = sc.parallelize(arr, 2)
def multiplyByTwo(x:Int):Int = x*2
var dbls = nums.map(multiplyByTwo)
dbls.saveAsTextFile("doublevals_28jan2020_3")

Notice the second argument to sc.parallelize.
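
If you already have an RDD with too many partitions, another option (not covered above, just a sketch) is to reduce the partition count right before writing, for example with coalesce:

// Alternative sketch: keep the existing RDD but merge its partitions before saving,
// so only 2 part files (and hence far fewer reserved blocks) are created.
// "doublevals_coalesced" is just an illustrative output path.
dbls.coalesce(2).saveAsTextFile("doublevals_coalesced")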


Thanks for this update in the forum.