My friend asked me these questions about RDDs. Could you please correct me if my understanding is wrong!
- I create an RDD from a file in HDFS. The HDFS file itself is
partitioned and distributed across nodes in the cluster. So when an RDD
is created, is a copy made of the data in HDFS and persisted on disk or
loaded into memory?
When you read a file from an HDFS location, the RDD is created immediately, but it is only a description of the computation; no copy of the data is made, and nothing is persisted or loaded at that point. Spark loads the data lazily, only when an action (such as count() or collect()) is executed; transformations alone do not trigger a read. By default the RDD's partitioning follows the HDFS block layout, roughly one Spark partition per HDFS block.
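A minimal sketch of this in spark-shell (Scala), assuming a hypothetical HDFS path and the SparkContext `sc` that the shell provides:

```scala
// Hypothetical input path; nothing is read from HDFS on this line.
val rdd = sc.textFile("hdfs:///data/input.txt")

// Inspecting the partitions only looks at the file splits (still no data read);
// by default there is roughly one partition per HDFS block.
println(rdd.getNumPartitions)

// Only an action like count() triggers the actual read of the file.
val numLines = rdd.count()
```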
- If I repartition the data while creating an RDD (or afterwards), what happens?
The same lazy behavior answers this: repartitioning is itself a transformation. Calling repartition() (or passing a minPartitions hint to textFile()) only records that the data should be redistributed; the shuffle that actually moves records between nodes happens when an action is executed, as in the sketch below.
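For example, continuing the sketch above:

```scala
// Lazy: this only records that a shuffle into 8 partitions is required.
val repartitioned = rdd.repartition(8)

// The shuffle actually runs here, when the action forces evaluation.
repartitioned.count()
```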
- In Q1, if the data on each node gets loaded into the memory of that node, what happens if the data does not fit into memory?
Spark does not rely on the operating system's virtual memory for this; it manages the spill itself. During processing, an RDD is computed one partition at a time, so the whole dataset never has to fit in RAM at once, and shuffle data is spilled to the executor's local disk when needed. For cached data the behavior depends on the storage level: with the default MEMORY_ONLY, partitions that do not fit are simply not cached and are recomputed from the lineage when needed again; with MEMORY_AND_DISK, the overflow is written to local disk instead.
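A sketch of requesting the spill-to-disk behavior explicitly (same `rdd` as above):

```scala
import org.apache.spark.storage.StorageLevel

// Partitions that fit stay in RAM; the rest are written to the executor's
// local disk instead of being dropped and recomputed.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count() // the action that actually computes and stores the partitions
```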
- In Q1, if data gets loaded into memory when an RDD is created, then what happens when I apply cache() on an RDD? How is this different from just creating the RDD?
As Spark uses lazy evaluation, an RDD is not stored in RAM when it is created; its data is loaded into memory only when an action is applied. cache() does not change that timing: it only marks the RDD so that, the first time an action computes it, the resulting partitions are kept in memory (at the MEMORY_ONLY storage level). Later actions then reuse the cached data instead of recomputing the RDD from HDFS. The cached blocks live until the Spark application exits, until unpersist() is called, or until they are evicted to make room; they are not shared with the next Spark job.
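A sketch showing why cache() matters when several actions reuse the same RDD (hypothetical path again):

```scala
// Without cache(), each action below would re-read the file from HDFS
// and redo the flatMap.
val words = sc.textFile("hdfs:///data/input.txt")
  .flatMap(_.split(" "))
  .cache() // only marks the RDD; nothing is stored yet

words.count()            // first action: computes AND caches the partitions
words.distinct().count() // reuses the cached partitions; no second HDFS read

words.unpersist()        // releases the cached blocks explicitly
```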