Meetup: Big Data and AI

Hi,

I wanted to ask: when Sandeep said that we transport the logic to different machines rather than the data, how does each machine get access to the data it needs to run that logic on? I want to understand the flow.

Hi Abhishek,

Good question.

Say we are doing the processing using Apache Spark. We first define our data by creating something called an RDD. We then define our logic on top of each RDD in the form of transformations. Once we are done defining all the transformations, we execute the entire logic by calling an action. When we call an action, Spark takes our code and runs it on the NodeManagers (YARN) that are closest to the DataNodes (HDFS) holding the blocks (or their replicas) of the data.
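For example, here is a minimal PySpark sketch of that flow. The HDFS path and the "ERROR" filter are made-up placeholders for illustration; the point is that the RDD and the transformation only describe the work, and nothing runs on the cluster until the action (count) is called.

```python
from pyspark import SparkContext

sc = SparkContext(appName="LogicToDataDemo")

# Defining the data: an RDD backed by HDFS blocks that are already
# spread across the DataNodes of the cluster. (Hypothetical path.)
lines = sc.textFile("hdfs:///data/logs.txt")

# Defining the logic: a transformation is only recorded, not executed yet.
errors = lines.filter(lambda line: "ERROR" in line)

# Calling an action: only now does YARN launch tasks on NodeManagers
# close to the DataNodes holding each block, so the logic travels to
# the data instead of the data travelling to the logic.
print(errors.count())

sc.stop()
```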

A similar thing happens when we submit a Hadoop MapReduce job. While submitting the job by way of the driver, we specify our input data folders or files upfront, along with our mapper and reducer code. The map logic is then executed by the MapReduce framework on NodeManagers, inside containers located near the DataNodes that hold the data.
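To keep the example in Python, here is a hedged word-count sketch using Hadoop Streaming rather than the native Java API. The input/output paths, the script name, and the streaming jar location are placeholders, and the exact submit command varies by distribution; what matters is that the input folder is named upfront and the framework then runs the mapper inside YARN containers near the blocks of that input.

```python
#!/usr/bin/env python3
# wordcount.py - one script that acts as the mapper or the reducer
# depending on its first argument. A submission looks roughly like
# (jar location and paths are placeholders):
#
#   hadoop jar hadoop-streaming.jar \
#     -files wordcount.py \
#     -input /data/books -output /out/wordcount \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce"
import sys

def run_mapper():
    # Each mapper instance receives the lines of one input split on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def run_reducer():
    # The framework sorts the mapper output by key before it reaches us,
    # so equal words arrive as consecutive lines and we can just sum them.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    run_mapper() if sys.argv[1] == "map" else run_reducer()
```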

This works because the data on the DataNodes is already spread out over the cluster: files are cut into blocks of 128 MB (or 64 MB in older versions), and these blocks are distributed across multiple machines. So the framework simply runs the logic on machines that already hold the blocks it needs.
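A quick back-of-the-envelope sketch of how many pieces a single file turns into, and therefore how many machines can hold a part of it. The file size and replication factor here are made up for illustration.

```python
import math

file_size_mb = 1024    # a hypothetical 1 GB file
block_size_mb = 128    # default HDFS block size
replication = 3        # typical default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(f"blocks: {blocks}")                               # 8 blocks
print(f"block replicas stored: {blocks * replication}")  # 24 replicas

# Each replica can live on a different DataNode, so the scheduler has
# many candidate machines on which to place a task right next to the data.
```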

For more information, please go through our HDFS, YARN, MapReduce, and Basics of RDD topics at http://cloudxlab.com