Processing Big Files using Spark (Locally)

Hi,
I have a 4GB CSV file which I am trying to process with PySpark on a Mac with 8GB RAM. I get the error below when I try to cache the RDD.

WARN MemoryStore: Not enough space to cache rdd_xxx in memory

I believe some configuration needs to be updated, but I couldn't figure out which.

Smaller files are processed successfully.

What settings need to be updated to process this file?

Could you share your script?

Here you go. I used this code in a Jupyter notebook.

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local", "App Name")
spark = SparkSession(sc)

# read the file as lines of text, cache it, and trigger the read with count()
rdd = spark.read.text(filepath + 'Bigfile.csv')
rdd.cache()
rdd.count()

The memory setting changes below didn't help:
SparkContext.setSystemProperty('spark.executor.memory', '3g')
SparkContext.setSystemProperty('spark.driver.memory', '3g')
SparkContext.setSystemProperty('spark.memory.fraction', '1')
SparkContext.setSystemProperty('spark.python.worker.memory', '1g')

Sometimes it errored out with:
java.lang.OutOfMemoryError: Java heap space
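
Should these properties be set on a SparkConf before the SparkContext is created instead? Something like the sketch below is what I mean (the values are just example numbers for an 8GB machine):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

# build the configuration before creating the context so the properties
# are picked up when the JVM is launched; in local mode the executor runs
# inside the driver JVM, so spark.driver.memory should be the relevant knob
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("App Name")
        .set("spark.driver.memory", "6g"))

sc = SparkContext(conf=conf)
spark = SparkSession(sc)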

rdd.cache()

When you call rdd.cache(), Spark builds the in-memory data structures for the RDD along with the actual data, so a 2GB file takes up more memory than it does on disk.

I remember a similar instance where I was loading 1GB of data into a hash table and it took up 4GB of RAM.

Do you really need to cache that data at all? Also, try using persist with MEMORY_AND_DISK, as in the sketch below.
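
Something along these lines, reusing the variable name from your script (a sketch only):

from pyspark import StorageLevel

# MEMORY_AND_DISK keeps the partitions that fit in memory and spills the
# rest to local disk instead of failing with an out-of-memory error
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()

And if you only read through the data once, you can probably skip caching entirely and let Spark stream through the file.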

Regards,
Sandeep Giri

I am also facing the same kind of problem and still haven't found a proper solution. Please help.