Hadoop 101

Big data for the masses

Marcin Stożek "Perk" / @marcinstozek

Why?

how long would it take to sort 15 PB of data on one computer?
(hint: in days rather than hours)

 

why not distribute the data, sort chunk by chunk and just merge the result?
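The idea above can be tried out on a single machine. The sketch below is only an illustration: the helper names `split_into_chunks` and `distributed_sort` are made up for this example, and the "nodes" are just list slices sorted one after another. On a real cluster each chunk would be sorted on a different machine.

```python
import heapq
import random


def split_into_chunks(data, n_chunks):
    """Split data into roughly n_chunks slices (one per hypothetical node)."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_sort(data, n_chunks=4):
    """Sort each chunk independently, then merge the sorted chunks."""
    chunks = split_into_chunks(data, n_chunks)
    # Each of these sorts is independent, so it could run on its own machine.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # heapq.merge lazily combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*sorted_chunks))


random.seed(0)
numbers = [random.randint(0, 1000) for _ in range(20)]
print(distributed_sort(numbers) == sorted(numbers))
```

The merge step only ever compares the heads of the sorted chunks, which is why it stays cheap even when the chunks themselves are large.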

map

sqr k = k * k

 

map sqr list -> [1, 4, 9, 16]

map sqr list2 -> [25, 36, 49, 64]

map sqr list3 -> [81, 100, 121, 144]

reduce

add [a, .., z] = a + .. + z

 

add (map sqr list) -> 30

add (map sqr list2) -> 174

add (map sqr list3) -> 446

 

add [30, 174, 446] -> 650

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

  • list = [1, 2, 3, 4]
  • list2 = [5, 6, 7, 8]
  • list3 = [9, 10, 11, 12]
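The map and reduce steps on these three chunks can be reproduced in a few lines of Python. This is a minimal single-process sketch, not how Hadoop itself runs jobs: `sqr` and `add` mirror the slide's definitions, while the names `chunks`, `mapped`, `partial_sums` and `total` are introduced here for illustration.

```python
from functools import reduce


def sqr(k):
    """The map function from the slides: square a single value."""
    return k * k


def add(values):
    """The reduce function from the slides: sum all values in a list."""
    return reduce(lambda a, b: a + b, values, 0)


data = list(range(1, 13))                    # [1, 2, ..., 12]
chunks = [data[0:4], data[4:8], data[8:12]]  # list, list2, list3

# Map phase: every chunk is squared independently of the others.
mapped = [list(map(sqr, chunk)) for chunk in chunks]

# Reduce phase: one partial sum per chunk, then a final sum of the partials.
partial_sums = [add(m) for m in mapped]
total = add(partial_sums)

print(mapped)        # [[1, 4, 9, 16], [25, 36, 49, 64], [81, 100, 121, 144]]
print(partial_sums)  # [30, 174, 446]
print(total)         # 650
```

Because addition is associative, summing the partial results gives the same answer as summing all twelve squares at once, which is what makes the work safe to split.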

Why Hadoop?

  • Apache Open Source Community Project
  • big players use it
  • scalability (petabyte scale)
  • fault tolerance designed in
  • alternatives are really expensive

Who uses it and what for?

  • Yahoo
  • Facebook
  • Amazon
  • IBM
  • Intel
  • Microsoft
  • Twitter
  • ...

 

http://wiki.apache.org/hadoop/PoweredBy

What is Hadoop?

... distributed processing

of large data sets

across clusters of computers...

 

Hadoop is a computing framework

 

http://hadoop.apache.org

... distributed processing

  • HDFS
  • MapReduce
  • YARN

of large data sets

across clusters of computers...

  • runs on commodity hardware
    • 4 - 8 GB RAM (2008)
    • 48 - 64 GB RAM (2014+)
  • dual processor / dual core
  • up to 4 000 nodes

 

http://wiki.apache.org/hadoop/FAQ

Distributions

  • Cloudera
  • Hortonworks
  • BigInsights
  • MapR
  • ...

 

Think of them the way you think of Linux distributions

Components

  • Common
  • YARN
  • HDFS
  • MapReduce
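To give a feel for what a MapReduce job looks like, here is a word count written in the style of Hadoop Streaming, where the mapper and reducer are plain programs exchanging key/value pairs. It is a local simulation under stated assumptions: the function names `mapper`, `reducer` and `run_local` are hypothetical, and the `sorted` call stands in for the shuffle-and-sort that the framework performs between the two phases on a real cluster.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(sorted_pairs):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Run map, a simulated shuffle-and-sort, and reduce in one process."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


sample = ["big data for the masses", "Big data big clusters"]
print(run_local(sample))
```

In this picture HDFS would hold the input lines, MapReduce would supply the map/shuffle/reduce pipeline, and YARN would decide which machines run each task.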

Ecosystem

  • Hive / Impala / BigSQL
  • HCatalog
  • Pig
  • HBase
  • Spark
  • Storm
  • Solr
  • Ambari
  • Oozie
  • ZooKeeper
  • Sqoop...

Security

  • authentication:
    • Kerberos
    • User / Pass (LDAP)
    • none (aka Simple)
  • authorization:
    • HDFS file permissions
    • HBase table / column ACLs
  • audit
  • data protection

Performance

  • SSD
  • some services are faster than others
    • MPP engines such as Impala instead of Hive
  • and many others

 

Internet search is your friend here

Shut up and show me a live demo

Questions

Thank you

Marcin Stożek "Perk" / @marcinstozek
