Hadoop 101
Big data for the masses
Marcin Stożek "Perk" / @marcinstozek
Why?
how long would it take to sort 15 PB of data on one computer?
(hint: in days rather than hours)
why not distribute the data, sort chunk by chunk and just merge the result?
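The idea above can be sketched on a single machine. This is only an illustration, not how Hadoop itself is implemented: the helper names (`split_into_chunks`, `distributed_sort`) and the chunk size are assumptions made for the example, and every "node" here is just a list in memory.

```python
import heapq
import random


def split_into_chunks(data, chunk_size):
    """Cut the input into consecutive chunks of at most chunk_size items."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def distributed_sort(data, chunk_size):
    """Sort each chunk independently, then merge the sorted chunks."""
    chunks = split_into_chunks(data, chunk_size)
    # In a real cluster each chunk would be sorted on a different machine.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # heapq.merge lazily combines already sorted inputs into one sorted stream.
    return list(heapq.merge(*sorted_chunks))


random.seed(0)  # fixed seed so the run is repeatable
data = [random.randint(0, 999) for _ in range(20)]
result = distributed_sort(data, chunk_size=5)
print(result)
```

The merge step is cheap because every chunk arrives already sorted, so the expensive work is what gets spread across machines.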
map
sqr k = k * k
sqr list -> [1, 4, 9, 16]
sqr list2 -> [25, 36, 49, 64]
sqr list3 -> [81, 100, 121, 144]
reduce
add [a, .., z] = a + .. + z
add (sqr list) -> 30
add (sqr list2) -> 174
add (sqr list3) -> 446
add [30, 174, 446] -> 650
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
- list = [1, 2, 3, 4]
- list2 = [5, 6, 7, 8]
- list3 = [9, 10, 11, 12]
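The map and reduce slides can be replayed in Python as a minimal sketch. The names `sqr` and `add` come from the slides; `chunks`, `mapped`, `partial_sums` and `total` are made up for the illustration, and everything runs in one process rather than on a cluster.

```python
from functools import reduce


def sqr(k):
    """Map function from the slides: square a single value."""
    return k * k


def add(values):
    """Reduce function from the slides: sum a list of values."""
    return reduce(lambda a, b: a + b, values, 0)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Split the data the same way the slides do: list, list2, list3.
chunks = [data[0:4], data[4:8], data[8:12]]

# Map phase: each chunk is squared independently of the others.
mapped = [[sqr(k) for k in chunk] for chunk in chunks]

# Reduce phase, step 1: one partial sum per chunk.
partial_sums = [add(chunk) for chunk in mapped]

# Reduce phase, step 2: combine the partial sums into the final result.
total = add(partial_sums)

print(mapped)        # [[1, 4, 9, 16], [25, 36, 49, 64], [81, 100, 121, 144]]
print(partial_sums)  # [30, 174, 446]
print(total)         # 650
```

Since addition is associative, the partial sums can be computed in any order and on any machine before being combined.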
Why Hadoop?
- Apache Open Source Community Project
- big players use it
- scalability (petabyte scale)
- fault tolerance designed in
- alternatives are really expensive
Who uses it and what for?
What is Hadoop?
... distributed processing
of large data sets
across clusters of computers...
Hadoop is a computing framework
... distributed processing
- HDFS
- MapReduce
- YARN
of large data sets
- you don't have a big data problem
- "anything" from 5 TB
across clusters of computers...
- runs on commodity hardware
- 4 - 8 GB RAM (2008)
- 48 - 64 GB RAM (2014+)
- dual processor / core
- up to 4 000 nodes
Distributions
- Cloudera
- Hortonworks
- BigInsights
- MapR
- ...
You can think of them as Linux distributions
Components
- Common
- YARN
- HDFS
- MapReduce
Ecosystem
- Hive / Impala / BigSQL
- HCatalog
- Pig
- HBase
- Spark
- Storm
- Solr
- Ambari
- Oozie
- ZooKeeper
- Sqoop...
Security
- authentication:
- Kerberos
- User / Pass (LDAP)
- none aka Simple
- authorization:
- HDFS file permissions
- HBase table / column ACLs
- audit
- data protection
Performance
- SSD
- some services are faster than others
- MPP engines like Impala instead of Hive
- and many others
Internet search is your friend here
Shut up and show me a live demo
Questions
Thank you
Marcin Stożek "Perk" / @marcinstozek