Hadoop Architecture for a Data Warehousing Solution

I am currently working on proposing a data warehousing solution using the Hadoop ecosystem.

  1. Data ingestion (data sources - structured, unstructured, feeds, etc.) - NiFi
  2. Data lake - HDFS
  3. Data store - HDFS (processed data?)
  4. Data processing - Spark (see the sketch below)
  5. BI - Apache Zeppelin/BIRT (for reports, dashboards, real-time reports, etc.)
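
For the Spark processing step, this is roughly what I have in mind - a minimal PySpark sketch where the paths, file format and column names are only placeholders, not a final design:

```python
# Minimal sketch of the processing layer: read raw files from the HDFS data
# lake, clean/aggregate them, and write processed data to the data-store zone.
# Paths, format and column names below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-processing").getOrCreate()

# Raw zone: files landed by the ingestion tool into the data lake
raw = spark.read.option("header", True).csv("hdfs:///datalake/raw/orders/")

# Example transformation: cast the amount and aggregate per day
processed = (raw
    .withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("daily_amount")))

# Processed zone: columnar format that the BI layer can query efficiently
processed.write.mode("overwrite").parquet("hdfs:///datalake/processed/orders_daily/")
```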

There is a need for master data management. Shall I consider HDFS itself for this purpose, or should I consider another DB for the final data storage (the target data warehouse system)?

Let me know if the above architecture and tools are good enough. Any suggestions would be greatly appreciated.

Thanks,
Srinivasan

Hi Srini,
I am also learning Hadoop and Spark with CloudxLab. I don't have direct answers to your questions, but I will try to help.

Can you please let me know the following:

  • How much data is flowing into your application?
  • How fast is the data being ingested?
  • How fast do the results need to be shown back to the user? If it is real time, it is not possible with Hive or HDFS.
  • What kind of processing are you looking at?

Regards,
Alok


Thanks Alok.

  1. To reiterate the requirement: existing legacy data systems will be moved into a Hadoop-based data warehouse.
  2. Data volume is initially 300-500 GB, and then there will be a couple of GB of daily growth.
  3. Since this is the first changeover, ingestion speed is not baselined yet.
  4. Data consumption will be reports, dashboards, etc.; they can be offline or runtime reports and dashboards.

I remember a similar requirement from my project.

  • For data ingestion, we used a multi-threaded script to pick up the data and move it to Elasticsearch (HBase can also be used) after parsing. Our use case involved many files being ingested simultaneously. You can use Sqoop to pick up the data and Oozie for orchestration.
  • Use Kafka for sending the data after parsing to HBase or Elasticsearch.
  • Data also needs to be backed up to HDFS, which will be your data lake.
  • For real-time reports, data needs to be picked up from HBase or Elasticsearch. For batch reports, Spark can be used over Hive and HDFS (see the sketch after this list).
  • For user authentication or storing metadata, you can use any RDBMS like MySQL or Postgres.
  • I did some research on BI tools; you can check which one fits your scheme of things.
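
For the batch-report path, something like this minimal Spark-over-Hive sketch is what I mean. The database, table and column names are made up for illustration, not from our project:

```python
# Sketch of a batch report: Spark SQL over Hive tables stored in HDFS.
# enableHiveSupport() lets Spark read tables from the Hive metastore;
# dw.orders_daily and its columns are invented names for illustration.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("batch-reports")
         .enableHiveSupport()
         .getOrCreate())

# A typical offline report over the processed data
report = spark.sql("""
    SELECT region, month, SUM(daily_amount) AS total_amount
    FROM dw.orders_daily
    GROUP BY region, month
""")

# Write it where Zeppelin/BIRT can pick it up for dashboards
report.write.mode("overwrite").parquet("hdfs:///datalake/reports/orders_monthly/")
```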

The above had been running well until recently. One issue we found was that copying was taking a lot of time, as our files are 2-3 GB each. We are replacing the threaded copy services with a Spark job that will use our custom parser.
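
Roughly, that Spark job could look like the sketch below. parse_line and the paths are placeholders, not our actual parser; the point is that Spark splits the large files into blocks so the copy and parse run in parallel without us managing threads:

```python
# Sketch of a Spark job replacing the threaded copy service: read the large
# raw files in parallel, apply a custom parser to each line, and write the
# result to HDFS. parse_line and the paths are placeholders.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("copy-and-parse").getOrCreate()

def parse_line(line):
    # Stand-in for the project's custom parsing logic
    fields = line.split("|")
    return Row(record_id=fields[0], payload=fields[1])

# textFile splits the 2-3 GB files into blocks, so parsing runs in parallel
parsed_rdd = spark.sparkContext.textFile("hdfs:///landing/incoming/*.dat").map(parse_line)

parsed_df = spark.createDataFrame(parsed_rdd)
parsed_df.write.mode("append").parquet("hdfs:///datalake/raw_parsed/")
```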

Let Sandeep also review this to see if it makes sense. Please do update after your discussion with Sandeep.

Thanks Alok.

What would be the master database here?

Would the master database sit inside HDFS, or outside in something like an RDBMS/NoSQL store?