Big Data Processing

INFO 253B: Backend Web Architecture

Kay Ashaolu

The Web is Used by Humans

Tool used to improve the lives of users
How to improve the tools we have?
Understand their use and their users

Understanding

Collect
Interpret
Understand

Big Data

Never easier to collect data
Advanced analytics tools available
Financial incentives for understanding

Terminology

Data
- Raw facts, represented in some way
Information
- Interpreted data with meaning
Knowledge
- Information used to achieve some purpose

Phonographic Records

Data
- Grooves in the record material
Information
- Sound heard by a human
Knowledge
- Enjoyment of a song

Web logs

Data
- Records of visits to a web page
Information
- Summary of user behavior
Knowledge
- Understanding shortcomings in a product

Web Frontier

Web particularly well suited for analysis
Easiest to instrument
Already requires high technology

Taxonomy of Data Science

Obtain
- Where do you get the data?
Scrub
- Clean the data for better exploration
Explore
- Looking through data to gain insights
Model
- Mathematical description of data
Interpret
- What knowledge to gain from information

Taxonomy of Data Engineering

Extract
- Prepping data to be processed
Transform
- Converting data from original form to desired form
Load
- Writing transformed data to desired location
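
The three steps above can be sketched in Python. This is a minimal illustration, not a production pipeline: the log format ("timestamp action code"), the sample lines, and the function names extract, transform, and load are all assumptions made for the example.

```python
import io
import json

# Hypothetical raw log: "<timestamp> <action> <short_code>" per line.
RAW_LOG = """\
2024-01-01T10:00:00 redirect abc123
2024-01-01T10:00:05 save def456
"""


def extract(source):
    """Extract: read raw lines from the source, dropping blank ones."""
    return [line.strip() for line in source if line.strip()]


def transform(lines):
    """Transform: convert each raw line into a structured record."""
    records = []
    for line in lines:
        timestamp, action, code = line.split()
        records.append({"timestamp": timestamp, "action": action, "code": code})
    return records


def load(records, destination):
    """Load: write one JSON object per line to the destination."""
    for record in records:
        destination.write(json.dumps(record) + "\n")


# An in-memory buffer stands in for a real file or database.
destination = io.StringIO()
load(transform(extract(io.StringIO(RAW_LOG))), destination)
print(destination.getvalue())
```

Keeping each step as its own function means the source or the destination can be swapped without touching the transformation logic.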

Tools

Languages
- Matlab, Python, R, SQL, Java, Scala
Tools
- Unix, MySQL, APIs, scikit-learn
Paradigms
- MapReduce, Functional Programming

MapReduce

Map
- Extract a property to summarize over
Reduce
- Summarize all items with a particular property
Simple: Each operation is stateless

Example

URL Shortener
How many actions have we seen?
Redirects: 200, Saves: 40, Loads: 60

Map

Input
- Key, Value
Output
- Keys, Values

Map Example

Input Key
- Log line number
Input Value
- Log line text
Output Key
- Action
Output Value
- Times this action has occurred on this line
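
A minimal sketch of this map step in Python. The slides do not give the real log format, so the layout "timestamp action code" and the name map_line are assumptions for illustration only.

```python
def map_line(line_number, line_text):
    """Map step: emit (action, 1) for the action recorded on one log line.

    Assumes a hypothetical format: "<timestamp> <action> <short_code>".
    The input key (line number) is not needed to produce the output.
    """
    fields = line_text.split()
    if len(fields) < 2:
        return []  # skip blank or malformed lines
    action = fields[1]
    return [(action, 1)]


print(map_line(1, "2024-01-01T10:00:00 redirect abc123"))
```

The function looks only at the line it is given, which is what makes the operation stateless.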

Status

load            1
save            1
redirect        1
redirect        1
load            1
redirect        1
load            1
save            1
redirect        1

Reduce

Input
- Key, Values
Output
- Keys, Values

Reduce Example

Input Key
- Action
Input Values
- Counts: [1, 1, 1, 1]
Output Key
- Action
Output Value
- Total Count
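
A minimal sketch of this reduce step in Python. The function name reduce_action is an assumption; the input values match the counts shown above.

```python
def reduce_action(action, counts):
    """Reduce step: sum every count that was emitted for one action."""
    return (action, sum(counts))


print(reduce_action("redirect", [1, 1, 1, 1]))  # ('redirect', 4)
```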

Example Output

Output Key
- Action
Output Value
- Total Count

"redirect"  4
"save"      2
"load"      3
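
The totals above can be reproduced with a small Python sketch that starts from the mapper output on the Status slide. The grouping step (called shuffle here) and all function names are assumptions for illustration; in a real framework such as Hadoop the grouping happens between the map and reduce phases.

```python
from collections import defaultdict

# Mapper output, copied from the Status slide.
mapped = [
    ("load", 1), ("save", 1), ("redirect", 1), ("redirect", 1), ("load", 1),
    ("redirect", 1), ("load", 1), ("save", 1), ("redirect", 1),
]


def shuffle(pairs):
    """Group values by key so each reducer call sees one key's values."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_action(action, counts):
    """Reduce step: sum the counts for one action."""
    return (action, sum(counts))


totals = dict(reduce_action(key, values) for key, values in shuffle(mapped).items())
print(totals)
```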

Point?

A lot of work for counting!
More complex calculations can be done this way, e.g., PageRank
Stateless constraint means it can be used across thousands of computers
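
As a rough illustration of a more complex calculation, one PageRank iteration can be written as a map step and a reduce step. This is a simplified sketch, not the full algorithm: the three-page link graph, the damping factor of 0.85, and the function names are assumptions, and pages without outgoing links are not handled.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed damping factor


def map_page(page, rank, outlinks):
    """Map: give each linked page an equal share of this page's rank."""
    yield (page, 0.0)  # ensures every page reaches the reducer
    for target in outlinks:
        yield (target, rank / len(outlinks))


def reduce_page(page, contributions, num_pages):
    """Reduce: combine the shares received by one page."""
    return (page, (1 - DAMPING) / num_pages + DAMPING * sum(contributions))


def pagerank_iteration(links, ranks):
    """Run one map, group, and reduce pass over the whole graph."""
    grouped = defaultdict(list)
    for page, outlinks in links.items():
        for key, value in map_page(page, ranks[page], outlinks):
            grouped[key].append(value)
    return dict(reduce_page(p, c, len(links)) for p, c in grouped.items())


# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1 / len(links) for page in links}
for _ in range(20):
    ranks = pagerank_iteration(links, ranks)
print(ranks)
```

Each map call needs only one page's rank and links, and each reduce call needs only the values for one key, so both steps can be spread across machines.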

Inputs

MapReduce distributes computing power by distributing input
Input is distributed by splitting on lines (records)
You cannot depend on lines being "together" in MapReduce
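
A small Python sketch of why this matters: if each line is processed on its own, the result does not depend on how the lines are split up. The log format, the sample lines, and the function names are assumptions for illustration, and the loop over splits stands in for separate machines.

```python
from collections import Counter


def count_actions(lines):
    """Map and reduce one split: count the action found on each line."""
    actions = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            actions.append(fields[1])
    return Counter(actions)


def run_on_splits(lines, num_splits):
    """Deal lines round-robin into splits, process each alone, then merge."""
    splits = [lines[i::num_splits] for i in range(num_splits)]
    total = Counter()
    for split in splits:  # each split could run on a different machine
        total += count_actions(split)
    return total


# Hypothetical log lines: "<timestamp> <action> <short_code>".
log_lines = [
    "t1 load abc", "t2 save abc", "t3 redirect abc",
    "t4 redirect def", "t5 load def", "t6 redirect abc",
]

print(run_on_splits(log_lines, 1))
print(run_on_splits(log_lines, 3))
```

Both calls produce the same counts because no line depends on its neighbors.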

Spark

Hadoop's implementation of MapReduce relies heavily on files on disk
Spark optimizes input shuffling by using memory instead of disk as much as possible
Pools the memory of all nodes in cluster to do this
Tries to process data "locally" as much as possible

Questions?