When to use a ZooKeeper-like architecture and when to use an HDFS-like architecture?

I am a bit confused between the ZooKeeper master-child architecture and the HDFS NameNode/DataNode architecture.
Which one is used when, and how?
Can we use both concepts in one design, or do we have to use both concepts in every design?

Good question. Let me try to give a brief answer.

In ZooKeeper, a group of machines forms what is called an ensemble. Any machine in the ensemble can become the leader, depending on factors such as which one has the latest data. No one designates the leader; it is chosen through an election within the ensemble, and it may step down under various circumstances. Every node keeps a copy of the same data, both on disk and in memory. So if you have to store 2 GB of data and 20 machines are running ZooKeeper, total storage consumption is 40 GB (2 GB on each of the 20 machines).
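To make the two ideas above concrete, here is a rough Python sketch. It is not ZooKeeper's actual election protocol (which also involves epochs and quorum voting); it only assumes that the server with the newest data wins and that ties go to the higher server id. The names `elect_leader`, `total_storage_gb` and the sample transaction ids are made up for illustration.

```python
def elect_leader(last_txn_ids):
    """Pick the server with the newest data (highest transaction id).

    Simplifying assumption: ties are broken by the higher server id.
    last_txn_ids maps server id -> last transaction id seen by that server.
    """
    return max(last_txn_ids, key=lambda sid: (last_txn_ids[sid], sid))


def total_storage_gb(data_gb, servers):
    """Every server holds a full copy, so storage grows with ensemble size."""
    return data_gb * servers


# Hypothetical three-server ensemble: servers 2 and 3 have the newest data.
ensemble = {1: 104, 2: 107, 3: 107}
leader = elect_leader(ensemble)
print("leader:", leader)

# The example from the answer: 2 GB of data on each of 20 machines.
print("total storage (GB):", total_storage_gb(2, 20))
```

The point of the second function is that adding ZooKeeper servers adds availability, not capacity: every extra machine stores the same data again.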

In HDFS, the NameNode (master) and DataNodes (slaves) are designated by the system administrator at installation time. There is no election. Also, each file is split into blocks that are spread across multiple machines, so no single machine has to hold all the data.
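As a rough illustration of the block idea, the Python sketch below splits a file into fixed-size blocks and hands them out to DataNodes in round-robin order. This is a simplification under stated assumptions: real HDFS placement also replicates each block and considers rack layout, which is ignored here. The 128 MB block size, the helper names and the node names `dn1`/`dn2` are assumptions chosen for the example.

```python
import math


def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the size of each block; only the last block may be smaller."""
    if file_size_mb <= 0:
        return []
    count = math.ceil(file_size_mb / block_size_mb)
    sizes = [block_size_mb] * (count - 1)
    sizes.append(file_size_mb - block_size_mb * (count - 1))
    return sizes


def place_blocks(block_sizes, datanodes):
    """Assign blocks to DataNodes round-robin (no replication in this sketch)."""
    placement = {node: [] for node in datanodes}
    for index, size in enumerate(block_sizes):
        node = datanodes[index % len(datanodes)]
        placement[node].append((index, size))
    return placement


# A hypothetical 300 MB file stored on two DataNodes.
blocks = split_into_blocks(300)
placement = place_blocks(blocks, ["dn1", "dn2"])
print(blocks)
print(placement)
```

Unlike the ZooKeeper case, each machine here stores only part of the file, which is why this layout scales to very large data.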

So, if you need a highly available system but have only a small amount of data (such as configuration) to store, use ZooKeeper. If you have to store huge amounts of data in the form of files, HDFS is the better choice.

If you need both high availability and storage for huge amounts of data, you can combine the two architectures. Examples are MongoDB (whose architecture is like ZooKeeper's) and Kafka (which stores data across multiple machines like HDFS but keeps its coordination information in ZooKeeper).


Thanks, you explained the differences and subtleties between ZooKeeper and HDFS very nicely. Sharing information on alternatives (MongoDB/Kafka) also helps put things in perspective.

Thanks
