What is Hadoop?

  • Open-source programming framework
  • Distributed storage
  • Processing of large amounts of data on commodity hardware

Why Hadoop?

  • Store and manage large amounts of data efficiently
  • Process this data to obtain meaningful information
  • Allow the system to scale for future expansion

Hadoop Architecture

  • The Hadoop framework is written in Java, but its modular architecture lets it support many other languages through plugins

  • It is designed to run on multiple machines, each with its own memory, storage and processing power

  • When data is loaded into Hadoop, the framework divides it into multiple pieces, which are spread and replicated across the available machines
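
The splitting and replication step can be sketched as follows. This is a toy illustration only: the function names, the 4-byte block size and the round-robin placement are assumptions made for the example, not how Hadoop itself chooses block sizes or locations.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the input into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, machines: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct machines (simple round-robin)."""
    if replication > len(machines):
        raise ValueError("replication factor exceeds number of machines")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            machines[(block_id + r) % len(machines)] for r in range(replication)
        ]
    return placement


# Hypothetical input and machine names, chosen only for illustration.
data = b"abcdefghij"
blocks = split_into_blocks(data, block_size=4)
placement = place_blocks(len(blocks), ["node1", "node2", "node3"], replication=2)

for block_id, nodes in placement.items():
    print(block_id, blocks[block_id], nodes)
```

Each piece ends up on two different machines, so losing any single machine leaves at least one copy of every piece.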

Hadoop Architecture

  • To run a job on the data, each piece of the data is processed individually on the machine it is stored on, instead of retrieving, combining and working on a single large dataset

  • This processes the data in parallel, reducing the time required to generate output
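
A minimal sketch of this idea, assuming a word-count job: each piece is counted on its own and only the small partial results are combined. Threads stand in for separate machines here, and the helper names and sample pieces are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(piece: str) -> Counter:
    """Work done locally on one piece of the data."""
    return Counter(piece.split())


def run_job(pieces: list[str]) -> Counter:
    """Process every piece independently, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_words, pieces))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each string plays the role of a piece stored on a different machine.
pieces = ["big data big", "data on commodity hardware", "big clusters"]
result = run_job(pieces)
print(result.most_common(2))
```

Only the per-piece counts travel to the combining step; the pieces themselves never have to be gathered into one large dataset.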

Components of Hadoop

Clients

Masters

Slaves

Components of Hadoop

Clients

  • Clients are users of the Hadoop system: they submit data and jobs, and retrieve the output once a job completes

Components of Hadoop

Masters

  • Masters are the machines (servers) that coordinate the cluster; depending on their role they are called NameNodes (storage) or JobTrackers (job scheduling)
  • A secondary NameNode is recommended to help with recovery after a failure

Components of Hadoop

Slaves

  • Slaves perform the actual storage and processing of data

  • Sometimes called DataNodes or TaskTrackers

  • Adding more DataNodes increases the available parallel processing power

What is HDFS?

  • Hadoop Distributed File System
  • A Java-based file system
  • Provides scalable and reliable data storage
  • Designed to span large clusters of commodity servers

HDFS Architecture

  • A master/slave architecture, with a NameNode as the master and DataNodes as slaves

What does the NameNode do?

  • manages the file system namespace and regulates access to files by clients
  • keeps track of the file metadata:
  1. Which portion of a file is saved in which part of the cluster
  2. Last access time for files
  3. Access control lists
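
The metadata listed above can be pictured as a small in-memory catalogue. The sketch below is a simplified assumption for illustration: the class names, field names and permission strings are hypothetical and do not mirror the real NameNode's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    # block id -> names of the DataNodes holding a copy of that block
    blocks: dict[int, list[str]] = field(default_factory=dict)
    last_access_time: float = 0.0
    # user -> permission string such as "r" or "rw"
    acl: dict[str, str] = field(default_factory=dict)


class ToyNameNode:
    """Keeps the namespace (path -> metadata) and checks access before answering."""

    def __init__(self) -> None:
        self.namespace: dict[str, FileMetadata] = {}

    def add_file(self, path: str, blocks: dict[int, list[str]], acl: dict[str, str]) -> None:
        self.namespace[path] = FileMetadata(blocks=blocks, acl=acl)

    def locate(self, path: str, user: str, now: float) -> dict[int, list[str]]:
        meta = self.namespace[path]
        if "r" not in meta.acl.get(user, ""):
            raise PermissionError(f"{user} may not read {path}")
        meta.last_access_time = now
        return meta.blocks


# Hypothetical file, users and DataNode names.
namenode = ToyNameNode()
namenode.add_file(
    "/logs/app.log",
    blocks={0: ["dn1", "dn2"], 1: ["dn2", "dn3"]},
    acl={"alice": "rw"},
)
locations = namenode.locate("/logs/app.log", user="alice", now=1000.0)
print(locations)
```

Note that the catalogue stores only where the blocks are, never the file contents; a client uses the returned locations to contact the DataNodes directly.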

What does the secondary NameNode do?

  • Despite its name, it is NOT a backup of the main NameNode
  • It does NOT mirror the content of the main NameNode
  • It acts as a checkpoint node: it periodically merges the main NameNode's log of recent changes into a fresh snapshot of the file system metadata, so that recovery after a failure is faster
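
Why checkpointing speeds up recovery can be shown with a small sketch. In HDFS the snapshot is called the fsimage and the change log the edit log; the dictionary and tuple layout below are assumptions made for the example, not the real on-disk formats.

```python
def apply_edits(image: dict[str, int], edits: list[tuple[str, str, int]]) -> dict[str, int]:
    """Replay logged operations on top of a namespace snapshot (path -> size)."""
    result = dict(image)
    for op, path, size in edits:
        if op == "create":
            result[path] = size
        elif op == "delete":
            result.pop(path, None)
    return result


# Hypothetical snapshot and the changes recorded since it was taken.
image = {"/a.txt": 10}
edit_log = [("create", "/b.txt", 20), ("delete", "/a.txt", 0), ("create", "/c.txt", 5)]

# Checkpoint: fold the first two edits into a new snapshot and drop them from the log.
checkpoint = apply_edits(image, edit_log[:2])
remaining = edit_log[2:]

# After a restart, only the short remaining log has to be replayed.
recovered = apply_edits(checkpoint, remaining)
print(recovered)
```

The recovered state is the same as replaying the whole log from the old snapshot, but the work left for restart time shrinks with every checkpoint.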

What does the DataNode do?

  • responsible for serving read and write requests from clients
  • performs block creation, deletion and replication based upon instructions from a NameNode
  • runs the TaskTracker to receive job instructions from masters

How does HDFS store data?

  • The NameNode decides which DataNodes each block will be stored on
  • The NameNode is also responsible for directing the replication of the blocks

How are failures detected?

  • Every DataNode sends a periodic message, called a heartbeat, to the NameNode
  • Upon loss of recent heartbeats, a NameNode may decide that a DataNode is dead
  1. No further I/O requests will be sent to the dead DataNode
  2. Blocks that were stored on the dead DataNode will be replicated again on other available DataNodes
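
The detection and repair steps above could look roughly like this. It is a sketch under stated assumptions: the class and function names, the 30-second timeout and the node names are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Remembers when each DataNode was last heard from."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> set[str]:
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


def re_replicate(placement: dict[int, list[str]], dead: set[str], live: list[str]) -> dict[int, list[str]]:
    """Replace replicas lost on dead nodes with live nodes not already holding the block."""
    repaired = {}
    for block, nodes in placement.items():
        survivors = [n for n in nodes if n not in dead]
        missing = len(nodes) - len(survivors)
        candidates = [n for n in live if n not in survivors]
        repaired[block] = survivors + candidates[:missing]
    return repaired


monitor = HeartbeatMonitor(timeout=30.0)
for node in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(node, now=0.0)
# dn2 stops reporting; dn1 and dn3 keep sending heartbeats.
monitor.heartbeat("dn1", now=40.0)
monitor.heartbeat("dn3", now=40.0)

dead = monitor.dead_nodes(now=45.0)
live = sorted(set(monitor.last_seen) - dead)
placement = {0: ["dn1", "dn2"], 1: ["dn2", "dn3"]}
repaired = re_replicate(placement, dead, live)
print(dead, repaired)
```

After the repair, no block lists the dead node any more, and each block is back to its original number of copies.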

Thank you

Please take out your phone and open Kahoot.

Let's play!

Hadoop

By shirlin1028
