Big Data

(An Introduction to Hadoop and Map/Reduce)

What's the Deal?

"Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization."

- Wikipedia

Apache Hadoop

Apache Hadoop is an open source software framework offering reusable components for Big Data work. It is written in Java. I'll go into more detail about it later, but I'm bringing it up now so you won't be confused when I reference it.
Fun fact: It is named for a toy elephant!

How Big?

"Big" data sizes are always a moving target
In the Hadoop documentation they cite 200-300GB as being on the small end of the scale
Facebook users generate ~ 500 TB of data per day! (http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/)
That's roughly 4,200 times the size of my MacBook Air hard drive (120 GB)

Dealing with so much data is kind of hard

An individual computer usually has only about 2-16 GB of memory
Therefore, we have to keep the data on disk and read it into memory in much smaller chunks
Even a large computer hard drive rarely exceeds 1TB
Therefore, we have to be able to synchronize multiple hard drives and computers working on this data
This is a hard problem
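
To make the "smaller chunks" idea concrete, here is a minimal sketch of reading a stream through a fixed-size buffer. The class name, the 4 KB chunk size, and the in-memory stand-in for a big file are all assumptions for illustration, not anything Hadoop-specific.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class ChunkedReadSketch {

  // Reads the stream through a fixed-size buffer and returns the total
  // number of bytes seen. Only chunkSize bytes are held in memory at once.
  static long countBytes(InputStream in, int chunkSize) {
    byte[] buffer = new byte[chunkSize];
    long total = 0;
    try {
      int read;
      while ((read = in.read(buffer)) != -1) {
        total += read; // a real job would process buffer[0..read) here
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return total;
  }

  public static void main(String[] args) {
    // Stand-in for a large file on disk: 10,000 bytes read 4 KB at a time.
    InputStream in = new ByteArrayInputStream(new byte[10_000]);
    System.out.println(countBytes(in, 4096)); // prints 10000
  }
}
```

The same loop works unchanged on a FileInputStream; memory use stays at one buffer no matter how large the file is.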

Why is it a hard problem, Nate?

"Whenever multiple machines are used in cooperation with one another, the probability of failures rises. In a single-machine environment, failure is not something that program designers explicitly worry about very often: if the machine has crashed, then there is no way for the program to recover anyway."

- Yahoo! Hadoop Tutorial
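
A quick way to see why the odds get worse with more machines: if each machine fails independently with probability p, the chance that at least one of n machines fails is 1 - (1 - p)^n. The sketch below just evaluates that formula. The independence assumption and the 1% figure are hypothetical, chosen only to show the trend.

```java
public class FailureOddsSketch {

  // Probability that at least one of `machines` fails, assuming each one
  // fails independently with probability p over the period of interest.
  static double atLeastOneFailure(double p, int machines) {
    return 1.0 - Math.pow(1.0 - p, machines);
  }

  public static void main(String[] args) {
    double p = 0.01; // hypothetical per-machine failure probability
    for (int n : new int[] {1, 10, 100, 1000}) {
      System.out.printf("%4d machines: %.3f%n", n, atLeastOneFailure(p, n));
    }
  }
}
```

Under these assumptions a single machine fails 1% of the time, but a 100-machine cluster sees at least one failure about 63% of the time.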

What could possibly go wrong?

Networks experience total or partial failure as switches and routers go down.
Data doesn't arrive at expected time due to network congestion.
Individual nodes crash, run out of disk space, overheat.
And so on...

So why use a distributed system at all?

Moore's Law: transistor density will double ~ every 2 years
But nowadays instead of speeding up single core processors we're cramming more cores on one chip. So, we're living in a parallel world anyway.
Besides, hard drive read/write speeds are the bottleneck.
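
Some back-of-the-envelope arithmetic shows why disk speed matters so much. This sketch assumes a hypothetical 100 MB/s of sequential throughput per drive and data split evenly across drives; real numbers will vary.

```java
public class ReadTimeSketch {

  // Seconds needed to read `bytes` when `drives` disks each stream
  // `bytesPerSecond` in parallel and the data is split evenly among them.
  static double secondsToRead(double bytes, double bytesPerSecond, int drives) {
    return bytes / (bytesPerSecond * drives);
  }

  public static void main(String[] args) {
    double oneTerabyte = 1e12; // decimal terabyte
    double throughput = 100e6; // hypothetical 100 MB/s per drive
    System.out.println(secondsToRead(oneTerabyte, throughput, 1));   // 10000.0 (about 2.8 hours)
    System.out.println(secondsToRead(oneTerabyte, throughput, 100)); // 100.0
  }
}
```

One drive needs hours to scan a terabyte; a hundred drives reading in parallel need under two minutes.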

Introducing the Hadoop Distributed File System (HDFS)

Hadoop provides a distributed filesystem to address some of these concerns
It allows for redundancy as well as splitting up a huge data workload over a cluster of many machines

Design goals

HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.
HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster (scale well).
HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.

Design weaknesses

Applications that use HDFS are optimized to perform long sequential streaming reads from files. This comes at the expense of random seek times to arbitrary positions in files.
Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported.
Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. (just re-read instead)

What's it like?

The data is broken up into blocks over the machines in the network (referred to as DataNodes). Each block is 64MB by default.
Each file on the filesystem is made up of several blocks, which aren't necessarily on the same machine. Access to a single file may require access to multiple machines, but in exchange a file can be much larger than what a single machine could store.
Copies of the same block are stored on multiple machines (3 by default) to account for the possibility of DataNode failure
Metadata about these files (which blocks are where, for instance) is stored in a master node called the NameNode (it's very bad if the NameNode crashes, but it's less likely)
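
The block arithmetic is easy to sketch. The snippet below uses the 64MB block size and replication factor of 3 quoted above; the 1 GB file and the class and method names are made up for illustration, and this is plain arithmetic rather than a call into the HDFS API.

```java
public class HdfsBlockSketch {

  static final long MB = 1024L * 1024L;

  // Number of blocks needed for a file: ceiling of fileBytes / blockBytes.
  static long blockCount(long fileBytes, long blockBytes) {
    return (fileBytes + blockBytes - 1) / blockBytes;
  }

  // Total block replicas stored across the cluster.
  static long replicaCount(long fileBytes, long blockBytes, int replication) {
    return blockCount(fileBytes, blockBytes) * replication;
  }

  public static void main(String[] args) {
    long fileBytes = 1024 * MB; // a hypothetical 1 GB file
    long blockBytes = 64 * MB;  // default block size
    System.out.println(blockCount(fileBytes, blockBytes));      // 16
    System.out.println(replicaCount(fileBytes, blockBytes, 3)); // 48
  }
}
```

So a 1 GB file becomes 16 blocks, and with three copies of each the cluster holds 48 block replicas spread over the DataNodes.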

HDFS Visualized

So what is map/reduce, then?

Map/Reduce is a programming model for processing large data sets in parallel across a cluster of machines.
It involves a Map() step that applies a transformation to each piece of the input data, and then a Reduce() step that aggregates the transformed data into something more meaningful.
If you're confused: try not to panic, we'll do an example.

Map/Reduce diagram

The "Hello World" of Map/Reduce

Say we have a huge work of literature and we want to count how many times each word is used, using Map/Reduce so we can easily parallelize the task.
We'll do the operation manually to get the hang of it.
Let's use this example quote:

"Some are born great, some achieve greatness, and some have greatness thrust upon them."

- William Shakespeare

Map step

We'll map each word to a tuple in the form

(word, numberOfTimesUsed)

We'll also lowercase each word and strip its punctuation so our Reduce step will recognize duplicates. numberOfTimesUsed will always be 1 in the Map step, since each mapped tuple reflects one occurrence.

e.g.

("some", 1)
("are", 1)
("born", 1)

Fully mapped

("some", 1)
("are", 1)
("born", 1)
("great", 1)
("some", 1)
("achieve", 1)
("greatness", 1)
("and", 1)
("some", 1)
("have", 1)
("greatness", 1)
("thrust", 1)
("upon", 1)
("them", 1)
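
Here is a minimal plain-Java sketch of that map step, with no Hadoop involved. The class name and the choice to keep apostrophes while stripping other punctuation are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class MapStepSketch {

  // Emits one (word, 1) pair per word, lowercased with punctuation stripped.
  static List<Map.Entry<String, Integer>> map(String text) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
      String word = token.replaceAll("[^a-z']", ""); // drop commas, periods, etc.
      if (!word.isEmpty()) {
        pairs.add(new SimpleEntry<>(word, 1));
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    String quote = "Some are born great, some achieve greatness, "
        + "and some have greatness thrust upon them.";
    for (Map.Entry<String, Integer> pair : map(quote)) {
      System.out.println("(\"" + pair.getKey() + "\", " + pair.getValue() + ")");
    }
  }
}
```

Run on the quote, it prints the same 14 tuples as the list above, in the order the words appear.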

Shuffle and sort

("achieve", 1)
("and", 1)
("are", 1)
("born", 1)
("great", 1)
("greatness", 1)
("greatness", 1)
("have", 1)
("some", 1)
("some", 1)
("some", 1)
("them", 1)
("thrust", 1)
("upon", 1)
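
In a real cluster the framework does the shuffle and sort for you, moving all the tuples with the same key to the same reducer. A rough single-machine stand-in, assuming a sorted map is a fair approximation of that grouping, could look like this (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleSortSketch {

  // Groups the 1s emitted by the map step under their word. A TreeMap keeps
  // the keys in sorted order, which mimics the sort.
  static Map<String, List<Integer>> shuffle(List<String> mappedWords) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String word : mappedWords) {
      grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
    }
    return grouped;
  }

  public static void main(String[] args) {
    List<String> mapped = Arrays.asList("some", "are", "born", "great", "some",
        "achieve", "greatness", "and", "some", "have", "greatness", "thrust",
        "upon", "them");
    System.out.println(shuffle(mapped));
  }
}
```

Each key ends up with the list of values emitted for it, so "some" maps to three 1s and "greatness" to two.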

Reduce

("achieve", 1)
("and", 1)
("are", 1)
("born", 1)
("great", 1)
("greatness", 2)
("have", 1)
("some", 3)
("them", 1)
("thrust", 1)
("upon", 1)
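
The reduce step itself is just a sum per key. A tiny plain-Java sketch, with the grouped values hard-coded as an assumption about what the shuffle delivers:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ReduceStepSketch {

  // Sums every count that the shuffle delivered for a single key.
  static int reduce(String key, Iterator<Integer> values) {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next();
    }
    return sum;
  }

  public static void main(String[] args) {
    List<Integer> someCounts = Arrays.asList(1, 1, 1);   // grouped values for "some"
    List<Integer> greatnessCounts = Arrays.asList(1, 1); // grouped values for "greatness"
    System.out.println("(\"some\", " + reduce("some", someCounts.iterator()) + ")");
    System.out.println("(\"greatness\", " + reduce("greatness", greatnessCounts.iterator()) + ")");
  }
}
```

Because each key is reduced independently, different keys can be handed to different machines, which is where the parallelism comes from.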

WordCountMapper.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Whitespace plus common punctuation, so "great," and "great" count as one word
  private static final String DELIMITERS = " \t\n\r\f.,;:!?\"()";

  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  // key: byte offset of the line in the input file; value: the line itself
  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line.toLowerCase(), DELIMITERS);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one); // emit (word, 1)
    }
  }
}

WordCountReducer.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Called once per distinct word with all of the counts emitted for it
  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {

    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // add this occurrence's count
    }

    output.collect(key, new IntWritable(sum)); // emit (word, total)
  }
}

WordCount.java

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {

  public static void main(String[] args) {
    JobConf conf = new JobConf(WordCount.class);

    // specify output types
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // specify input and output dirs
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    // specify a mapper
    conf.setMapperClass(WordCountMapper.class);

    // specify a reducer; it doubles as a combiner that pre-sums counts on each map node
    conf.setReducerClass(WordCountReducer.class);
    conf.setCombinerClass(WordCountReducer.class);

    // submit the job and block until it finishes
    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

It's not perfect, but it's pretty cool

Criticism: Map/Reduce is not novel and is too low-level
Also: Is your data truly "big"?
We didn't even cover HBase, Cassandra, Apache Pig, Storm, Hive, etc., which are cool things to look into if you are interested in Big Data.
Any questions?