Hadoop 101

Big data for the masses

Marcin Stożek "Perk" / @marcinstozek

Why?

how long would it take to sort 15 PB of data on one computer?
(hint: in days rather than hours)
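
A back-of-the-envelope check of the hint. The 100 MB/s sustained throughput is an assumption made up for illustration, not a figure from the talk, and the estimate counts only one sequential pass over the data; a real sort needs more than one pass, so treat the result as a rough lower bound.

```python
# Rough lower bound: time for ONE sequential pass over the data.
# ASSUMPTION: 100 MB/s sustained throughput (illustrative, not a measurement).
DATA_BYTES = 15 * 10**15              # 15 PB, decimal petabytes
THROUGHPUT_BYTES_PER_S = 100 * 10**6  # assumed single-machine throughput


def single_pass_days(data_bytes: int, bytes_per_second: int) -> float:
    """Days needed to stream data_bytes once at the given throughput."""
    seconds = data_bytes / bytes_per_second
    return seconds / 86_400  # 86,400 seconds in a day


days = single_pass_days(DATA_BYTES, THROUGHPUT_BYTES_PER_S)
print(f"{days:.0f} days for a single pass")
```

Under that assumption a single pass alone takes years' worth of days, which is why one computer is not an option.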

why not distribute the data, sort chunk by chunk and just merge the result?
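
A minimal single-process sketch of that idea: split the data, sort each chunk independently, then merge the sorted chunks. The function name `sort_in_chunks` and the chunk size are hypothetical; on a real cluster each chunk would be sorted on a different machine.

```python
import heapq
from typing import List


def sort_in_chunks(data: List[int], chunk_size: int) -> List[int]:
    """Sort data by sorting fixed-size chunks and merging the results."""
    # Split the input into chunks (each could live on a different machine).
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Sort every chunk independently of the others.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # k-way merge of the already sorted chunks into one sorted list.
    return list(heapq.merge(*sorted_chunks))


result = sort_in_chunks([9, 3, 12, 1, 7, 5, 11, 2, 8, 4, 10, 6], chunk_size=4)
print(result)
```

The per-chunk sorts do not depend on each other, so they can run in parallel; only the final merge has to see all the chunks.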

map

sqr k = k * k

map sqr list -> [1, 4, 9, 16]

map sqr list2 -> [25, 36, 49, 64]

map sqr list3 -> [81, 100, 121, 144]

reduce

add [a, ..., z] = a + ... + z

add (map sqr list) -> 30

add (map sqr list2) -> 174

add (map sqr list3) -> 446

add [30, 174, 446] -> 650

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

list = [1, 2, 3, 4]
list2 = [5, 6, 7, 8]
list3 = [9, 10, 11, 12]
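
The same map and reduce steps as a runnable sketch in plain Python. It uses no Hadoop API and only mirrors the numbers on the slides; the variable names `chunks`, `mapped` and `partial_sums` are illustrative.

```python
from functools import reduce
from operator import add


def sqr(k: int) -> int:
    """The map function from the slide: square a single value."""
    return k * k


data = list(range(1, 13))                    # [1, 2, ..., 12]
chunks = [data[0:4], data[4:8], data[8:12]]  # list, list2, list3

# map: apply sqr to every element, chunk by chunk.
mapped = [[sqr(k) for k in chunk] for chunk in chunks]

# reduce: sum each mapped chunk, then sum the partial results.
partial_sums = [reduce(add, squares) for squares in mapped]
total = reduce(add, partial_sums)

print(mapped)
print(partial_sums)  # [30, 174, 446]
print(total)         # 650
```

Because addition is associative, adding the per-chunk sums gives the same total as summing all the squares in one go.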

Why Hadoop?

Apache Open Source Community Project
big players use it
scalability (petabyte scale)
fault tolerance designed in
alternatives are really expensive

Who uses it and what for?

Yahoo
Facebook
Amazon
IBM
Intel
Microsoft
Twitter
...

http://wiki.apache.org/hadoop/PoweredBy

What is Hadoop?

... distributed processing

of large data sets

across clusters of computers...

Hadoop is a computing framework

http://hadoop.apache.org

... distributed processing

HDFS
MapReduce
YARN
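
To give a feel for the MapReduce programming model, here is a word count simulated in plain Python. This is a sketch, not the Hadoop API: the function names `map_phase`, `shuffle_phase` and `reduce_phase` are hypothetical, and everything runs in one process instead of on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data for the masses", "big clusters for big data"])
print(counts)
```

In a real job the map and reduce functions run on many nodes, with input read from HDFS and resources scheduled by YARN.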

of large data sets

you don't have a big data problem
- tiny.cc/0j62sx
"anything" from 5 TB
- tiny.cc/cn62sx

across clusters of computers...

runs on commodity hardware
- 4 - 8 GB RAM (2008)
- 48 - 64 GB RAM (2014+)
dual processor / core
up to 4 000 nodes

http://wiki.apache.org/hadoop/FAQ

Distributions

Cloudera
Hortonworks
BigInsights
MapR
...

You can think of them as you would of Linux distributions

Components

Common
YARN
HDFS
MapReduce

Ecosystem

Hive / Impala / BigSQL
HCatalog
Pig
HBase
Spark
Storm
Solr
Ambari
Oozie
ZooKeeper
Sqoop...

Security

authentication:
- Kerberos
- User / Pass (LDAP)
- none aka Simple
authorization:
- HDFS file permissions
- HBase table / column ACLs
audit
data protection

Performance

SSDs
some services are faster than others
- an MPP engine like Impala instead of Hive
and many other options

Internet search is your friend here

Shut up and show me a live demo

Questions

Thank you

Marcin Stożek "Perk" / @marcinstozek
