What is Hadoop?

Why Hadoop?

Hadoop is written in Java, but can support many different languages through the use of plugins

because of its modular architecture

each with its own memory, storage and processing power

When data is loaded into Hadoop, the framework divides that data into multiple pieces,

which are then spread and replicated across the available machines

To run a job on the data, each piece of the data is processed individually on the machine it is stored on

instead of retrieving, combining and working on a single large dataset.

This provides parallel processing of the data, thus reducing the time required to generate output
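
The flow above (divide, spread and replicate, process each piece where it lives, combine only the results) can be sketched in a few lines of Python. This is an illustration only, not Hadoop code: the helper names `split_into_blocks`, `place_replicas` and `count_words`, the machine names, and the round-robin placement are all assumptions made for the example, and threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Divide the input into fixed-size pieces (blocks)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def place_replicas(num_blocks, machines, replication):
    """Assign each block to `replication` distinct machines, round-robin.

    Assumes replication <= len(machines); real placement policies differ.
    """
    return {
        block_id: [machines[(block_id + r) % len(machines)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def count_words(block):
    """The work done on one piece, independently of all other pieces."""
    return Counter(word for line in block for word in line.split())


records = ["big data", "big cluster", "data node", "name node"]
machines = ["node-a", "node-b", "node-c"]  # hypothetical machine names

blocks = split_into_blocks(records, block_size=2)
placement = place_replicas(len(blocks), machines, replication=2)

# Each block is processed separately; threads stand in for separate machines.
with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(count_words, blocks))

# Only the small per-block results are combined, never the raw dataset.
total = sum(partial_counts, Counter())
print(placement)
print(total)
```

The point of the design is in the last step: the raw data never moves, and only the small per-block counts are merged.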

Clients

Masters

Slaves

Clients

Clients are users of the Hadoop system: they submit data and jobs, and retrieve the output once a job completes

Masters

Masters are the machines (servers) that coordinate the cluster; they run the NameNode (storage) and the JobTracker (job scheduling)
A secondary NameNode is always recommended to provide for disaster recovery.
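
Why a checkpoint helps recovery can be sketched as follows. This is a simplified illustration, not Hadoop's implementation: it assumes the NameNode's state is a dictionary of file paths, that every change is appended to an edit log, and that a checkpoint folds that log into a saved snapshot. The names `apply_edits` and `checkpoint` are hypothetical.

```python
def apply_edits(state, edits):
    """Replay logged changes on top of a copy of the file-system state."""
    state = dict(state)
    for op, path in edits:
        if op == "create":
            state[path] = True
        elif op == "delete":
            state.pop(path, None)
    return state


def checkpoint(snapshot, edit_log):
    """Fold the edit log into the snapshot and start a fresh, empty log."""
    return apply_edits(snapshot, edit_log), []


snapshot = {}
edit_log = [("create", "/a"), ("create", "/b"), ("delete", "/a")]

# Periodic checkpoint: the job this sketch assigns to the secondary NameNode.
snapshot, edit_log = checkpoint(snapshot, edit_log)

# Changes made after the checkpoint go into the new, short log.
edit_log.append(("create", "/c"))

# After a failure, only the short log has to be replayed on the snapshot.
recovered = apply_edits(snapshot, edit_log)
print(sorted(recovered), len(edit_log))
```

Without the checkpoint, recovery would have to replay every edit since the start. With it, only the edits made since the last checkpoint remain.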

The secondary NameNode works almost like a backup to the main NameNode, but it is NOT a backup
It does NOT mirror the content of the main NameNode
It acts as a checkpoint node that periodically updates its copy of the state from the running main NameNode
so that recovery after a failure is faster

Slaves

Slaves (DataNodes) are responsible for serving read and write requests from clients
They perform block creation, deletion and replication based upon instructions from a NameNode
They run the TaskTracker to receive job instructions from masters

Every DataNode sends a periodic message, called a heartbeat, to the NameNode
If no recent heartbeats arrive, the NameNode may decide that a DataNode is dead

No further I/O requests will be sent to the dead DataNode
Blocks that were stored on the dead DataNode will be replicated again on other available DataNodes
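
The heartbeat check and the re-replication that follows can be sketched like this. It is a minimal illustration under stated assumptions, not how Hadoop is implemented: the 30-second timeout, the node and block names, and the functions `find_dead_nodes` and `re_replicate` are all made up for the example.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; an assumed value for this sketch


def find_dead_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose most recent heartbeat is older than `timeout`."""
    return {node for node, seen in last_heartbeat.items() if now - seen > timeout}


def re_replicate(block_locations, dead, live_nodes, replication):
    """Drop replicas on dead nodes and copy affected blocks to other live nodes."""
    new_locations = {}
    for block, nodes in block_locations.items():
        alive = [n for n in nodes if n not in dead]
        if not alive:
            # No surviving copy is left to replicate from.
            new_locations[block] = []
            continue
        candidates = [n for n in live_nodes if n not in alive]
        missing = max(0, replication - len(alive))
        new_locations[block] = alive + candidates[:missing]
    return new_locations


# Time of the last heartbeat received from each DataNode (hypothetical values).
last_heartbeat = {"dn1": 100.0, "dn2": 58.0, "dn3": 99.0}
block_locations = {
    "blk_1": ["dn1", "dn2"],
    "blk_2": ["dn2", "dn3"],
    "blk_3": ["dn1", "dn3"],
}

now = 100.0
dead = find_dead_nodes(last_heartbeat, now)
live = sorted(set(last_heartbeat) - dead)
block_locations = re_replicate(block_locations, dead, live, replication=2)
print(dead)
print(block_locations)
```

Here `dn2` has been silent for 42 seconds, so it is treated as dead, and the two blocks it held are copied to the remaining nodes to restore two replicas each.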
