Big Data Processing

INFO 253B: Backend Web Architecture

Kay Ashaolu

The Web is Used by Humans

  • Tool used to improve the lives of users
  • How to improve the tools we have?
  • Understand their use and their users

Understanding

  • Collect
  • Interpret
  • Understand

Big Data

  • Never easier to collect data
  • Advanced analytics tools available
  • Financial incentives for understanding

Terminology

  • Data

    • Raw facts, represented in some way

  • Information

    • Interpreted data with meaning

  • Knowledge

    • Information used to achieve some purpose

Phonographic Records

  • Data

    • Grooves in the record material

  • Information

    • Sound heard by a human

  • Knowledge

    • Enjoyment of a song

Web logs

  • Data

    • Records of visits to a web page

  • Information

    • Summary of user behavior

  • Knowledge

    • Understanding shortcomings in a product


Web Frontier

  • Web particularly well suited for analysis
  • Easiest to instrument
  • Already requires high technology

Taxonomy of Data Science

  • Obtain
    • Where do you get the data?
  • Scrub
    • Clean the data for better exploration
  • Explore
    • Looking through data to gain insights
  • Model
    • Mathematical description of data
  • Interpret
    • What knowledge to gain from information

Taxonomy of Data Engineering

  • Extract
    • Prepping data to be processed
  • Transform
    • Converting data from original form to desired form
  • Load
    • Writing transformed data to desired location
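The three ETL steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the log-line format (a timestamp followed by an action) and the list used as a "destination" are assumptions made for the example.

```python
# Minimal ETL sketch. Record format and destination are hypothetical.

def extract(raw_lines):
    # Extract: prep the raw lines for processing (drop empties, trim)
    return [line.strip() for line in raw_lines if line.strip()]

def transform(lines):
    # Transform: convert "timestamp action" text into structured records
    records = []
    for line in lines:
        timestamp, action = line.split(" ", 1)
        records.append({"timestamp": timestamp, "action": action})
    return records

def load(records, destination):
    # Load: write transformed records to the desired location
    destination.extend(records)

raw = ["1001 redirect", "1002 save", "", "1003 load"]
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)
```

In a real system the destination would be a database or a distributed file system rather than a list, but the shape of the three functions stays the same.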

Tools

  • Languages
    • Matlab, Python, R, SQL, Java, Scala
  • Tools
    • Unix, MySQL, APIs, scikit-learn
  • Paradigms
    • MapReduce, Functional Programming

MapReduce

  • Map

    • Extract a property to summarize over

  • Reduce

    • Summarize all items with a particular property

  • Simple: Each operation is stateless

Example

  • URL Shortener
  • How many actions have we seen?
  • Redirects: 200, Saves: 40, Loads: 60

Map

  • Input

    • Key, Value

  • Output

    • Keys, Values


Map Example

  • Input Key

    • Log line number

  • Input Value

    • Log line text

  • Output Key

    • Action

  • Output Value

    • Times this action has occurred on this line

Status

load            1
save            1
redirect        1
redirect        1
load            1
redirect        1
load            1
save            1
redirect        1

Reduce

  • Input

    • Key, Values

  • Output

    • Keys, Values


Reduce Example

  • Input Key

    • Action

  • Input Values

    • Counts: [1, 1, 1, 1]

  • Output Key

    • Action

  • Output Value

    • Total Count
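The reducer is equally small: given one action and the list of counts the mappers emitted for it, it returns the total. A sketch, matching the slide's input and output:

```python
# Reducer sketch: (action, [counts]) -> (action, total count).

def reduce_action(action, counts):
    return (action, sum(counts))

print(reduce_action("redirect", [1, 1, 1, 1]))  # → ('redirect', 4)
```

Like the mapper, the reducer is stateless: it sees only one key and that key's values, so reducers for different actions can run on different machines.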

Example Output

  • Output Key

    • Action

  • Output Value

    • Total Count

"redirect"  4
"save"      2
"load"      3

Point?

  • A lot of work for counting!
  • More complex calculations can be done this way, e.g. PageRank
  • Stateless constraint means it can be used across thousands of computers

Inputs

  • MapReduce distributes computing power by distributing input
  • Input is distributed by splitting on lines (records)
  • You cannot depend on lines being "together" in MapReduce

Spark

  • Hadoop's implementation of MapReduce relies a lot on files on disk
  • Spark optimizes input shuffling by using memory instead of disk as much as possible
  • Pools the memory of all nodes in cluster to do this
  • Tries to process data "locally" as much as possible

Questions?