Hadoop

A. Fraga, V. Ostertag, A. Ruiz Rodriguez
Friday 16th of November 2017
Cranfield University

Hadoop presentation
Plan

INTRODUCTION
HISTORY
SOME BASICS
HOW IT WORKS
IMPLEMENTATION
CONCLUSION
Introduction

Hadoop

Open-source
Developed in Java
Data storage: HDFS
Processing of big datasets: MapReduce (+ YARN since 2.0)

Goals & uses

Handle hardware failures

Used for:
Marketing analysis
Machine learning
Data mining
Image processing

History

Chronology

2006
Hadoop released by Doug Cutting and Mike Cafarella

2008
Yahoo moves its web index to Hadoop

Fastest system to sort a terabyte (209 seconds)

2009
Sorts a terabyte in 62 seconds

2012
YARN is introduced

Success

Widely used.
Some basics

Map

Apply a function to every element.
Example: map using the square function
[1, 2, 3, 4, 5, 6] -> [1, 4, 9, 16, 25, 36]
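
The map example can be sketched in plain Java with the standard streams API. This is only an illustration of the idea, not Hadoop code; the class name `MapExample` and the method `squareAll` are made up for this sketch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: plain Java streams, no Hadoop involved.
class MapExample {
    // Apply the square function to every element of the list.
    static List<Integer> squareAll(List<Integer> values) {
        return values.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Prints [1, 4, 9, 16, 25, 36]
        System.out.println(squareAll(List.of(1, 2, 3, 4, 5, 6)));
    }
}
```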

Reduce
Reduce a set of data to one value using an accumulator.
Example: reduce using the sum accumulator
[1, 2, 3, 4, 5, 6] -> 21
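
The reduce example can be sketched the same way in plain Java. Again this is not Hadoop code; `ReduceExample` and `sum` are names chosen for this illustration.

```java
import java.util.List;

// Illustrative only: plain Java streams, no Hadoop involved.
class ReduceExample {
    // Fold the list into one value: the accumulator starts at 0
    // and each element is added to it in turn.
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        // Prints 21
        System.out.println(sum(List.of(1, 2, 3, 4, 5, 6)));
    }
}
```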

In Hadoop, we use MapReduce
So, what is it?

How it works

Nodes

Master nodes (manager): Name node, Job tracker
Slave nodes (worker): Data node, Task tracker

HDFS


The client stores example.txt, a 150 MB file, through the name node.
The file is split into blocks spread over the data nodes:
Data node 1: 64 MB
Data node 2: 64 MB
Data node 3: 22 MB
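
The block arithmetic for the 150 MB file can be checked with a small sketch. It assumes the 64 MB block size shown on the slide; the class `BlockSplitExample` and its method are hypothetical and only reproduce the arithmetic, not what HDFS does internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reproduces the block sizes, not the HDFS implementation.
class BlockSplitExample {
    // Split a file of fileSizeMb into blocks of at most blockSizeMb.
    static List<Integer> splitIntoBlocks(int fileSizeMb, int blockSizeMb) {
        if (blockSizeMb <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        List<Integer> blocks = new ArrayList<>();
        int remaining = fileSizeMb;
        while (remaining > 0) {
            int block = Math.min(blockSizeMb, remaining);
            blocks.add(block);
            remaining -= block;
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Prints [64, 64, 22]
        System.out.println(splitIntoBlocks(150, 64));
    }
}
```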

Name node: "example.txt" = Data node 1 + Data node 2 + Data node 3
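
One way to picture this bookkeeping is a lookup table from file name to the data nodes holding its blocks. The sketch below is a deliberately simplified model; `NameNodeIndexExample`, `register` and `locate` are hypothetical names, not part of the Hadoop API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the name node's file index.
class NameNodeIndexExample {
    // File name -> data nodes holding its blocks, in block order.
    private final Map<String, List<String>> blockLocations = new LinkedHashMap<>();

    // Record where the blocks of a file were stored.
    void register(String fileName, List<String> dataNodes) {
        blockLocations.put(fileName, List.copyOf(dataNodes));
    }

    // Return the data nodes to contact for a file (empty if unknown).
    List<String> locate(String fileName) {
        return blockLocations.getOrDefault(fileName, List.of());
    }

    public static void main(String[] args) {
        NameNodeIndexExample index = new NameNodeIndexExample();
        index.register("example.txt", List.of("Data node 1", "Data node 2", "Data node 3"));
        System.out.println(index.locate("example.txt"));
    }
}
```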

Backup

Name node and secondary name node
By default, each block is saved 3 times, on different data nodes:
Rack 1: Data Node 1.2, Data Node 1.5
Rack 2: Data Node 2.1
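
To illustrate why copies end up on more than one rack, here is a toy placement routine that hands out replicas rack by rack. It is a simplification written for this example and is not the placement policy HDFS actually implements; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the real HDFS replica placement policy.
class ReplicaPlacementExample {
    // Pick up to `replicas` data nodes for one block, visiting the racks
    // in round-robin order so that the copies are spread over several racks.
    static List<String> placeReplicas(Map<String, List<String>> nodesByRack, int replicas) {
        List<String> chosen = new ArrayList<>();
        int round = 0;
        boolean added = true;
        while (chosen.size() < replicas && added) {
            added = false;
            for (List<String> nodes : nodesByRack.values()) {
                if (round < nodes.size() && chosen.size() < replicas) {
                    chosen.add(nodes.get(round));
                    added = true;
                }
            }
            round++;
        }
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> racks = new LinkedHashMap<>();
        racks.put("Rack 1", List.of("Data Node 1.2", "Data Node 1.5"));
        racks.put("Rack 2", List.of("Data Node 2.1"));
        // Prints [Data Node 1.2, Data Node 2.1, Data Node 1.5]
        System.out.println(placeReplicas(racks, 3));
    }
}
```

If one rack fails in this toy setup, at least one copy of the block is still reachable on the other rack.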










Jobs


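The client submits jobs to the Job tracker.
The Job tracker creates the task on a slave node's Task tracker.
The Task tracker runs Map and Reduce and sends a heartbeat to the Job tracker.
Data is written and read through the Name node.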

To do

Code the Mapper class
Code the Reducer class
Code the main
No need to worry about anything else!

YARN
What if I don't want to use MapReduce?

YARN: a framework that lets us use data processing models other than MapReduce
No more limitations!

Implementation

What we want to do

Count how many times each word appears in a text
We will code in Java using Hadoop
Let's see how easy it is!

First step: main

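We'll start by initializing the problem at hand:
1. WordCount: the program's name
2. Text (String): the input, ready to be processed

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            // New configuration
            Configuration conf = new Configuration();
            // I want to create a new job
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            ...
        }
        ...
    }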

Mapper

Now, we want to use a mapper to count the words:
Text (String) -> Map -> ("word", 1), ("word", 1), ("word", 1), ("word", 1)
3. Result of the mapper for each processor

    ...
    // My job will convert unstructured data into structured data
    job.setMapperClass(TokenizerMapper.class); // 3. Custom Mapper to create
    ...
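
What the custom mapper has to produce can be simulated in plain Java, without the Hadoop API. In a real job this logic would live in a class extending Hadoop's `Mapper`; the class `MapPhaseExample` below is a hypothetical stand-in that only shows the (word, 1) pairs being emitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java simulation of the map step; not the Hadoop Mapper API.
class MapPhaseExample {
    // Emit one (word, 1) pair per whitespace-separated token of the input text.
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            pairs.add(Map.entry(tokens.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Four pairs, each with key "word" and value 1.
        System.out.println(map("word word word word"));
    }
}
```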

Reduce

Then, we need to sum all the results:
Map: ("word", 1), ("word", 1), ("word", 1), ("word", 1)
4. Hadoop gathers the data: ("word", (1, 1, 1, 1))
5. Reduce: ("word", 4)
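
The gather and reduce steps can also be simulated in plain Java. In a real job Hadoop does the gathering itself and the summing would live in the custom Reducer class; `ShuffleReduceExample` is a hypothetical stand-in that only mirrors the data flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the gather and reduce steps; not the Hadoop API.
class ShuffleReduceExample {
    // Gather: collect all values emitted for the same key,
    // e.g. four ("word", 1) pairs become ("word", (1, 1, 1, 1)).
    static Map<String, List<Integer>> gather(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values of each key, e.g. ("word", (1, 1, 1, 1)) becomes ("word", 4).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("word", 1), Map.entry("word", 1),
                Map.entry("word", 1), Map.entry("word", 1));
        // Prints {word=4}
        System.out.println(reduce(gather(pairs)));
    }
}
```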

    ...
    // Then it will combine local results
    job.setCombinerClass(IntSumReducer.class);
    // Then it will reduce everything (IntSumReducer is the custom Reducer class)
    job.setReducerClass(IntSumReducer.class);
    // But do it with this format
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    ...

Output

    ...
    // Defining input and output paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // OK job, do what I want you to do
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Example

Conclusion




MPI
Hadoop

