Spark - ILT Batch

malavikam · November 6, 2018, 12:52pm

In the Spark class of 25/03/2018, you said that if half of a line is stored on one node and the other half on another node, the second node will communicate with the first node and transmit the second half.

But I have heard that HDFS takes care of this by checking for EOF, so that a complete record is always stored on one node.

sgiri · November 17, 2018, 1:11pm

HDFS does not take care of that. HDFS cuts files (internally) into smaller chunks called blocks. These blocks are stored on various datanodes and have a replication factor. HDFS doesn’t care about the format of the file, so a block boundary can fall in the middle of a line.

It is the InputFormat class that takes care of understanding the raw data, including where one record ends and the next begins. In fact, the InputFormat classes are part of the MapReduce package.
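
To make the idea concrete, here is a rough sketch in plain Python (not Hadoop or Spark code) of how a line-oriented reader can deal with lines that straddle block boundaries. The function names `split_file` and `read_split` and the tiny 10-byte block size are made up for illustration. The rule it follows is the one commonly described for line-based readers: every split except the first skips its first (possibly partial) line, and every split keeps reading past its own end until it finishes the line it is on.

```python
def split_file(data: bytes, block_size: int):
    """Cut the byte range into fixed-size (start, end) splits, ignoring line boundaries."""
    return [(start, min(start + block_size, len(data)))
            for start in range(0, len(data), block_size)]


def read_split(data: bytes, start: int, end: int):
    """Return the complete lines that belong to the split [start, end)."""
    pos = start
    if start != 0:
        # Skip the first (possibly partial) line: the reader of the
        # previous split is responsible for emitting it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1

    records = []
    # A line is ours if it starts at or before the end of the split,
    # even if its bytes continue into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode("utf-8"))
        pos = line_end + 1
    return records


# Hypothetical sample data; the 10-byte "block" forces lines to be cut in half.
data = b"alpha,1\nbravo,2\ncharlie,3\ndelta,4\n"
splits = split_file(data, block_size=10)

all_records = []
for start, end in splits:
    records = read_split(data, start, end)
    print(f"split [{start}, {end}) -> {records}")
    all_records.extend(records)
```

Even though the blocks cut lines at arbitrary byte offsets, each line comes out exactly once and in one piece. In a real cluster, the bytes past the end of a split may live on another datanode, which is why reading the tail of a line can involve fetching data over the network.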