Setting up the cluster

Hi,

I have a few questions regarding cluster setup. Can you help?

  1. In your experience, is Ubuntu 16.04 not a good release for setting up standalone HDFS, Spark, and Hive?

  2. I got almost everything to work (installed manually, not from Cloudera bundles), but when I type SHOW TABLES in Hive, a system with 32 GB of RAM grinds to a halt.

  3. I liked Cloudera because I was able to upload CSV files to HDFS, create tables mapped to those files, and run some queries before I jumped into Spark.

  4. Are there any other bundled solutions that can give me that? I also do not mind owning the technology.

  5. If I put together a cluster of four motherboards, each with 16 GB RAM, a 128 GB SSD, and 8 cores, running CentOS, do you think I can build a reliable cluster?

  6. It seems CentOS is the preferred Linux distro for Cloudera?

Hi,

Please find my responses below.

+ In your experience, is Ubuntu 16.04 not a good release for setting up standalone HDFS, Spark, and Hive?

I do not see any problems with Ubuntu 16.04. Why do you think it is not a good release?

+ I got almost everything to work (installed manually, not from Cloudera bundles), but when I type SHOW TABLES in Hive, a system with 32 GB of RAM grinds to a halt.

I think there is a problem with your Hive configuration. Since you have manually configured Hadoop, please check the Hive logs to get the relevant information.
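
As a starting point for reading those logs, here is a minimal sketch in Python that counts suspicious lines in a Hive log. It assumes the log is a plain text file; a stock Hive log4j configuration usually writes to /tmp/<user>/hive.log, but a manual setup may point somewhere else. The pattern list and the name summarize_log are only illustrative, not part of Hive.

```python
import re
import sys
from collections import Counter

# Illustrative patterns only; adjust to whatever your logs actually contain.
PATTERNS = {
    "error": re.compile(r"\bERROR\b"),
    "out_of_memory": re.compile(r"OutOfMemoryError"),
    "metastore": re.compile(r"metastore", re.IGNORECASE),
}

def summarize_log(lines, keep_last=5):
    """Count lines matching each pattern and return the last few ERROR lines."""
    counts = Counter()
    errors = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
        if PATTERNS["error"].search(line):
            errors.append(line.rstrip())
    return counts, errors[-keep_last:]

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python hive_log_summary.py /tmp/<user>/hive.log  (path is an assumption)
    with open(sys.argv[1], errors="replace") as fh:
        counts, errors = summarize_log(fh)
    print(dict(counts))
    print("\n".join(errors))
```

If the out_of_memory or metastore counts are high around the time SHOW TABLES hangs, the metastore connection and heap settings would be the first places to look.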

+ I liked Cloudera because I was able to upload CSV files to HDFS, create tables mapped to those files, and run some queries before I jumped into Spark.

Yes, that is the best part of choosing vendors like Cloudera and Hortonworks.
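
The same workflow is possible on a manual install: copy the CSV into an HDFS directory (for example with hdfs dfs -put) and create an external Hive table over that directory. The sketch below, in Python, only generates the HiveQL statement from a CSV header. The table name, HDFS path, and sample data are made up for illustration, and every column is typed STRING as a simplifying assumption.

```python
import csv
import io

def external_table_ddl(table, hdfs_dir, csv_text, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement from the header row of a CSV sample.

    Every column is typed STRING; refine the types by hand or cast in queries.
    """
    header = next(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    columns = ",\n".join(f"  `{name.strip()}` STRING" for name in header)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{columns}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'\n"
        f"TBLPROPERTIES ('skip.header.line.count'='1');"
    )

# Hypothetical table name, HDFS directory, and data.
sample = "id,name,amount\n1,alice,9.50\n"
print(external_table_ddl("sales", "/user/demo/sales", sample))
```

The printed statement can be pasted into the Hive shell. Note that the plain delimited format does not handle quoted fields containing commas; for such files a CSV SerDe would be a better fit.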

+ Are there any other bundled solutions that can give me that? I also do not mind owning the technology.

I guess you have checked Hue from Cloudera. What do you mean by owning technology?

+ If I put together a cluster of four motherboards, each with 16 GB RAM, a 128 GB SSD, and 8 cores, running CentOS, do you think I can build a reliable cluster?

Reliability depends on how much data you want to churn and how much RAM is available for Hadoop components.
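
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes four identical nodes that all store data, the HDFS default replication factor of 3, 25% of disk held back for the OS and temporary files, and 4 GB of RAM per node reserved for the OS and Hadoop daemons. The last two figures are guesses for illustration, not recommendations.

```python
def usable_hdfs_gb(nodes, disk_gb_per_node, replication=3, reserved_fraction=0.25):
    """Rough usable HDFS capacity: raw disk minus a reserve, divided by replication."""
    raw = nodes * disk_gb_per_node
    available = raw * (1 - reserved_fraction)
    return available / replication

def yarn_memory_gb(nodes, ram_gb_per_node, reserved_gb_per_node=4):
    """Rough memory left for YARN containers after the per-node reserve."""
    return nodes * max(ram_gb_per_node - reserved_gb_per_node, 0)

print(usable_hdfs_gb(4, 128))  # 512 GB raw -> 384 GB after reserve -> 128.0 GB usable
print(yarn_memory_gb(4, 16))   # 4 nodes x 12 GB -> 48 GB for containers
```

Under these assumptions the cluster holds roughly 128 GB of data and has about 48 GB of memory for jobs, so it suits learning and small datasets more than heavy workloads.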

+ It seems CentOS is the preferred Linux distro for Cloudera?

I prefer CentOS over Ubuntu in a production environment, as it is more stable than Ubuntu.

Hope this helps.

Quick update here:

After going through multiple documents and blogs, I now have my own cluster that was set up manually: Hadoop (HDFS, YARN), Spark, and Hive.

I felt really good seeing it work. I have now expanded it across 5 EC2 instances and am very happy. I am thinking of doing a small write-up giving the reader just the essentials to get it going. Amazingly, it is so easy, but the information is so scattered.

A follow-up question: unless I were running a business where customer response time was important, I don’t believe I should use any GUI tools like Hue.

Are there any other essentials, apart from the three things I mentioned in my previous reply, that I should pay attention to?

Hi Nayana,

Good to know that you have set up the cluster manually :)

Yes, a write-up will definitely help other users set up a cluster on their own.

We can also publish your write-up on the CloudxLab blog; it will help you reach a wider audience.

Thanks

A GUI like Hue will give you a nice interface to interact with HDFS, Hive, Pig, and other components.

It is mostly a matter of personal choice between the command line and a GUI.

Also, cluster managers from Hortonworks and Cloudera ease the task of setting up new hosts and managing the cluster.