Hadoop

Use Cases

Example

Major Users

Hadoop Components

Hadoop Versions Comparison

Version                       Hadoop 1    Hadoop 2                  Hadoop 3
Release                       Dec 2011    May 2012 (2.0.0 Alpha)    Dec 2017
Core modules (all versions)   Hadoop Common, HDFS, MapReduce
Resource management           N/A         YARN                      YARN
NameNode                      Single      Multiple                  Multiple
Containers (Docker-like)      N/A         N/A                       Yes
Erasure coding                N/A         N/A                       Yes
GPU support for processing    N/A         N/A                       Yes

YARN

HDFS

MapReduce

Data Integration Comparison Matrix

Data Ingest        Flume                                     Sqoop                                    Kafka
Release            Jan 2019 (v1.9.0)                         Jan 2019 (v1.99.7)                       Mar 2019 (v2.2.0)
Use                Ingest large volumes of log data          Move data between Hadoop and other      Build real-time data pipelines and streaming applications
                   generated by application servers          databases; transfers data in parallel
                   into HDFS at high speed                   for performance
Use case           Online analytics                          Bulk data transfer                       Event-streaming platform (MQ, publish/subscribe producers and consumers, process and manipulate data)
Scalable           Horizontally                              Horizontally
Data source        Log data generated by application         RDBMS, e.g. Oracle, MariaDB              Any real-time streaming application data
                   or web servers
Data destination   HDFS / HBase                              Hadoop (HDFS, HBase, etc.)               Any: Hadoop, Oracle, Twitter, etc.
Requires agent     Yes, installed on the application         No                                       No
                   and web servers

Information updated as of Apr 2019
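
To make the Kafka column above concrete, here is a minimal producer sketch using the Kafka Java client (org.apache.kafka.clients.producer). The broker address localhost:9092, the topic name "web-logs", and the sample log message are illustrative assumptions, not values taken from the comparison.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogEventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Assumed broker address; replace with the cluster's bootstrap servers.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one log event to a hypothetical "web-logs" topic.
                producer.send(new ProducerRecord<>("web-logs", "host-01", "GET /index.html 200"));
            }
        }
    }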

Database comparison matrix

Hadoop Architecture

Hadoop Hardware Components

Name Node

  • maintains the HDFS namespace tree
  • maps file blocks to DataNodes
  • only one active NameNode at a time, with a passive standby NameNode

Data Node

  • stores data in the Hadoop cluster
  • DataNode hardware should be uniform so that a lower-spec DataNode does not slow down (jam) the cluster

Edge Node / Gateway Node

  • interface between the Hadoop cluster and the outside network
  • runs client applications and cluster administration tools (e.g. Sqoop, Oozie, Hue, Hive)
  • staging space for data

Journal Node

  • synchronizes the active and standby NameNodes
  • stores the HDFS file system edits (the edit log)
  • at least 3 JournalNodes are required to avoid a NameNode split-brain scenario

HDFS

What is HDFS?

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

How HDFS Works

A file's data is divided into fixed-size blocks of 64 MB or 128 MB (the configured block size).

[Figure: a file split into blocks B1, B2 and B3; each block is at most the configured block size of 64 MB or 128 MB]
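
As a hedged worked example (assuming the common 128 MB block size): a 300 MB file is stored as three blocks of 128 MB, 128 MB and 44 MB; every block except the last one is exactly the configured block size.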

Writing to HDFS

Reading from HDFS
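
A minimal sketch of both operations above, writing a file to HDFS and reading it back, using the Hadoop Java FileSystem API; the file path /user/demo/hello.txt and the default cluster configuration are assumptions for illustration. The comments note where the NameNode and DataNodes participate in each step.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path used only for this sketch.
            Path file = new Path("/user/demo/hello.txt");

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the block data through the DataNode pipeline.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode,
            // then reads each block directly from a DataNode.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }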

Types of Faults and Their Detection in HDFS

Types of Faults

Detection Methods

Handling Reading & Writing Failure

Handling Data Node Failure

Replica Placement Strategy

Usable Space Calculation
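
A rough, hedged rule of thumb (assuming the default replication factor of 3 and ignoring space reserved for temporary and intermediate data): usable HDFS space ≈ raw disk capacity ÷ replication factor. For example, 10 DataNodes × 12 TB of raw disk each = 120 TB raw, giving roughly 120 TB ÷ 3 = 40 TB of usable space.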

By Wan Razali