I have some basic questions regarding RDDs in Spark.
Q1. I create an RDD from a file in HDFS. The HDFS file itself is partitioned and distributed across the nodes in the cluster. So when the RDD is created, is a copy of the HDFS data made and persisted to disk, or is it loaded into memory?
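To make this concrete, here is a minimal sketch of what I mean, as run in spark-shell (where `sc` is the predefined SparkContext; the HDFS path is a placeholder for my actual file):

```scala
// In spark-shell, sc is the predefined SparkContext.
// Placeholder path; the real file is split into blocks across HDFS datanodes.
val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")

// textFile alone is lazy; does anything get copied or loaded here,
// or only when an action such as this runs?
lines.count()
```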
Q2. Following on from Q1: if the data on each node is loaded into that node's memory, what happens when the data does not fit into memory?
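For instance (same spark-shell setup; assume the file is larger than the total memory available across executors, and the path and file name are placeholders):

```scala
// Assume very_large_file.txt exceeds the cluster's combined executor memory.
val big = sc.textFile("hdfs://namenode:8020/data/very_large_file.txt")

// When an action scans the whole dataset, do all partitions have to fit
// in memory at once, or are they processed and discarded as they go?
val totalBytes = big.map(_.length.toLong).reduce(_ + _)
```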
Q3. If I repartition the data while creating the RDD (or afterwards), what happens? See the sketch below.
a. Are the HDFS data blocks redistributed across the nodes? Or
b. Are copies made of the HDFS blocks and then distributed?
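Concretely, I mean either of these two cases (same placeholder path; 8 is an arbitrary target partition count):

```scala
// (a) Asking for a partition count at creation time:
val rdd1 = sc.textFile("hdfs://namenode:8020/data/input.txt", minPartitions = 8)

// (b) Repartitioning an existing RDD afterwards (this triggers a shuffle):
val rdd2 = sc.textFile("hdfs://namenode:8020/data/input.txt").repartition(8)
```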
Q4. If, as in Q1, the data is loaded into memory when an RDD is created, what happens when I apply cache() on the RDD? How is this different from just creating the RDD?
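In other words, what is the practical difference between these two cases (same placeholder path as above)?

```scala
// Case 1: plain RDD, two actions.
val plain = sc.textFile("hdfs://namenode:8020/data/input.txt")
plain.count()
plain.count() // is the file read from HDFS again here?

// Case 2: the same RDD with cache() applied before the actions.
val cached = sc.textFile("hdfs://namenode:8020/data/input.txt").cache()
cached.count() // first action: does this materialize and store the partitions?
cached.count() // what exactly does cache() change for this second action?
```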