I have some basic questions regarding RDDs in Spark.
Q1. I create an RDD from a file in HDFS. The HDFS file itself is partitioned and distributed across the nodes in the cluster. So when the RDD is created, is a copy of the HDFS data made and persisted to disk, or is it loaded into memory?
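To make this concrete, here is a minimal sketch of what I mean, as run in spark-shell (where `sc` is the predefined SparkContext; the HDFS path is a placeholder for my actual file):

```scala
// In spark-shell, sc is the predefined SparkContext.
// Placeholder path; the real file is split into blocks across HDFS datanodes.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

// textFile alone is lazy; does anything get copied or loaded here,
// or only when an action such as this runs?
lines.count()
```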
Q2. Following on from Q1: if the data on each node is loaded into that node's memory, what happens when the data does not fit into memory?
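For instance (same spark-shell setup; assume the file is larger than the total memory available across executors, and the path and file name are placeholders):

```scala
// Assume very_large_file.txt exceeds the cluster's combined executor memory.
val big = sc.textFile("hdfs://namenode:8020/data/very_large_file.txt")

// When an action scans the whole dataset, do all partitions have to fit
// in memory at once, or are they processed and discarded as they go?
val totalBytes = big.map(_.length.toLong).reduce(_ + _)
```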
Q3. If I repartition the data while creating the RDD (or afterwards), what happens? See the sketch below.
a. Are the HDFS data blocks redistributed across the nodes? Or
b. Are copies made of the HDFS blocks and then distributed?
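Concretely, I mean either of these two cases (same placeholder path; 8 is an arbitrary target partition count):

```scala
// (a) Asking for a partition count at creation time:
val rdd1 = sc.textFile("hdfs://namenode:8020/data/input.txt", minPartitions = 8)

// (b) Repartitioning an existing RDD afterwards (this triggers a shuffle):
val rdd2 = sc.textFile("hdfs://namenode:8020/data/input.txt").repartition(8)
```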
Q4. If, as in Q1, the data is loaded into memory when an RDD is created, what happens when I apply cache() on the RDD? How is this different from just creating the RDD?
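In other words, what is the practical difference between these two cases (same placeholder path as above)?

```scala
// Case 1: plain RDD, two actions.
val plain = sc.textFile("hdfs://namenode:8020/data/input.txt")
plain.count()
plain.count() // is the file read from HDFS again here?

// Case 2: the same RDD with cache() applied before the actions.
val cached = sc.textFile("hdfs://namenode:8020/data/input.txt").cache()
cached.count() // first action: does this materialize and store the partitions?
cached.count() // what exactly does cache() change for this second action?
```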