Hadoop

Use Cases

Example

Major Users

Hadoop Components

Hadoop Versions Comparison

Version                       Hadoop 1    Hadoop 2                  Hadoop 3
Release                       Dec 2011    May 2012 (2.0.0 Alpha)    Dec 2017
Core modules (all versions)   Hadoop Common, HDFS, MapReduce
Resource management           N/A         YARN                      YARN
NameNode                      Single      Multiple                  Multiple
Containers (Docker-like)      N/A         N/A                       Yes
Erasure coding                N/A         N/A                       Yes
GPU support for processing    N/A         N/A                       Yes

YARN

HDFS

MapReduce

Data Integration Comparison Matrix

Data Ingest        Flume                                     Sqoop                                    Kafka
Release            Jan 2019 (v1.9.0)                         Jan 2019 (v1.99.7)                       Mar 2019 (v2.2.0)
Use                Ingest large volumes of log data          Move data between Hadoop and other      Build real-time data pipelines and streaming applications
                   generated by application servers          databases; transfers data in parallel
                   into HDFS at high speed                   for performance
Use case           Online analytics                          Bulk data transfer                       Event-streaming platform (MQ, publish/subscribe producers and consumers, process and manipulate data)
Scalable           Horizontally                              Horizontally
Data source        Log data generated by application         RDBMS, e.g. Oracle, MariaDB              Any real-time streaming application data
                   or web servers
Data destination   HDFS / HBase                              Hadoop (HDFS, HBase, etc.)               Any: Hadoop, Oracle, Twitter, etc.
Requires agent     Yes, installed on the application         No                                       No
                   and web servers

Information updated as of Apr 2019
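
To make the Kafka column above concrete, here is a minimal producer sketch using the Kafka Java client (org.apache.kafka.clients.producer). The broker address localhost:9092, the topic name "web-logs", and the sample log message are illustrative assumptions, not values taken from the comparison.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address; replace with the cluster's bootstrap servers.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one log event to a hypothetical "web-logs" topic.
                producer.send(new ProducerRecord<>("web-logs", "host-01", "GET /index.html 200"));
            }
        }
    }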

Database comparison matrix

Hadoop Architecture

Hadoop Hardware Components

Name Node

  • maintains the HDFS namespace tree
  • maps file blocks to DataNodes
  • only one active NameNode at a time, with a passive standby NameNode

Data Node

  • stores data in the Hadoop cluster
  • DataNode hardware should be uniform so that a lower-spec DataNode does not slow down (jam) the cluster

Edge Node / Gateway Node

  • interface between the Hadoop cluster and the outside network
  • runs client applications and cluster administration tools (e.g. Sqoop, Oozie, Hue, Hive)
  • staging space for data

Journal Node

  • synchronizes the active and standby NameNodes
  • stores the HDFS file system edits (the edit log)
  • at least 3 JournalNodes are required to avoid a NameNode split-brain scenario

HDFS

What is HDFS?

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

How HDFS Works

A file's data is divided into fixed-size blocks of 64 MB or 128 MB (the configured block size).

[Figure: a file split into blocks B1, B2 and B3; each block is at most the configured block size of 64 MB or 128 MB]
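
As a hedged worked example (assuming the common 128 MB block size): a 300 MB file is stored as three blocks of 128 MB, 128 MB and 44 MB; every block except the last one is exactly the configured block size.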

Writing to HDFS

Reading from HDFS
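
A minimal sketch of both operations above, writing a file to HDFS and reading it back, using the Hadoop Java FileSystem API; the file path /user/demo/hello.txt and the default cluster configuration are assumptions for illustration. The comments note where the NameNode and DataNodes participate in each step.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path used only for this sketch.
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the block data through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode,
            // then reads each block directly from a DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }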

Types of Faults and Their Detection in HDFS

Types of Faults

Detection Methods

Handling Reading & Writing Failure

Handling Data Node Failure

Replica Placement Strategy

Usable Space Calculation
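
A rough, hedged rule of thumb (assuming the default replication factor of 3 and ignoring space reserved for temporary and intermediate data): usable HDFS space ≈ raw disk capacity ÷ replication factor. For example, 10 DataNodes × 12 TB of raw disk each = 120 TB raw, giving roughly 120 TB ÷ 3 = 40 TB of usable space.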

By Wan Razali