Big Data Processing
INFO 253B: Backend Web Architecture
Kay Ashaolu
The Web is Used by Humans
- A tool used to improve the lives of its users
- How do we improve the tools we have?
- By understanding their use and their users
Understanding
- Collect
- Interpret
- Understand
Big Data
- It has never been easier to collect data
- Advanced analytics tools are available
- There are financial incentives for understanding data
Terminology
- Data
  - Raw facts, represented in some way
- Information
  - Interpreted data with meaning
- Knowledge
  - Information used to achieve some purpose
Phonographic Records
- Data
  - Grooves in the record material
- Information
  - Sound heard by a human
- Knowledge
  - Enjoyment of a song
Web Logs
- Data
  - Records of visits to a web page
- Information
  - Summary of user behavior
- Knowledge
  - Understanding a shortcoming in a product
Web Frontier
- The Web is particularly well suited for analysis
- Easiest to instrument
- Already requires high technology
Taxonomy of Data Science
- Obtain
  - Where do you get the data?
- Scrub
  - Clean the data for better exploration
- Explore
  - Look through the data to gain insights
- Model
  - Build a mathematical description of the data
- Interpret
  - What knowledge do we gain from the information?
Taxonomy of Data Engineering
- Extract
  - Prepping data to be processed
- Transform
  - Converting data from its original form to the desired form
- Load
  - Writing transformed data to the desired location
Tools
- Languages
  - MATLAB, Python, R, SQL, Java, Scala
- Tools
  - Unix, MySQL, APIs, scikit-learn
- Paradigms
  - MapReduce, Functional Programming
MapReduce
- Map
  - Extract a property to summarize over
- Reduce
  - Summarize all items with a particular property
- Simple: each operation is stateless
Example
- URL Shortener
- How many actions have we seen?
- Redirects: 200, Saves: 40, Loads: 60
Map
- Input
  - Key, Value
- Output
  - Keys, Values
Map Example
- Input Key
  - Log line number
- Input Value
  - Log line text
- Output Key
  - Action
- Output Value
  - Number of times this action occurred on this line
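The map step above can be sketched in plain Python; the log format (the action name as the first field of each line) is an assumption for illustration:

```python
# A minimal sketch of the map step for the URL-shortener logs; the log
# format (action name as the first field of each line) is an assumption.
def map_actions(line_number, line_text):
    """Map: (log line number, log line text) -> (action, count on this line)."""
    action = line_text.split()[0]
    yield action, 1

log = ["load /abc", "save /abc", "redirect /abc", "redirect /xyz"]
pairs = [kv for i, line in enumerate(log) for kv in map_actions(i, line)]
# pairs: [("load", 1), ("save", 1), ("redirect", 1), ("redirect", 1)]
```

Note that the function never looks at any line other than the one it was given; that is the statelessness the paradigm depends on.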
Map Output
load 1
save 1
redirect 1
redirect 1
load 1
redirect 1
load 1
save 1
redirect 1
Reduce
- Input
  - Key, Values
- Output
  - Keys, Values
Reduce Example
- Input Key
  - Action
- Input Values
  - Counts: [1, 1, 1, 1]
- Output Key
  - Action
- Output Value
  - Total Count
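The reduce step is just as small to sketch; by the time it runs, the framework has already grouped every value emitted for one key into a single list:

```python
# A minimal sketch of the reduce step: all values for one key arrive together.
def reduce_counts(action, counts):
    """Reduce: (action, [1, 1, ...]) -> (action, total count)."""
    return action, sum(counts)

redirect_total = reduce_counts("redirect", [1, 1, 1, 1])  # ("redirect", 4)
```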
Example Output
- Output Key
  - Action
- Output Value
  - Total Count
"redirect" 4
"save" 2
"load" 3
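The whole pipeline on the nine log lines can be simulated in plain Python; the in-memory dict standing in for the framework's shuffle is only an illustration, not how a real cluster moves data:

```python
from collections import defaultdict

# Map and reduce as described in the slides, run end to end.
def map_actions(line_number, line_text):
    yield line_text, 1  # assume each line is just the action name

def reduce_counts(action, counts):
    return action, sum(counts)

log = ["load", "save", "redirect", "redirect", "load",
       "redirect", "load", "save", "redirect"]

# Shuffle: group every mapped value by its key, as the framework
# would between the map and reduce stages.
groups = defaultdict(list)
for i, line in enumerate(log):
    for action, count in map_actions(i, line):
        groups[action].append(count)

totals = dict(reduce_counts(a, c) for a, c in groups.items())
# totals: {"load": 3, "save": 2, "redirect": 4}
```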
Point?
- A lot of work for counting!
- More complex calculations can be done this way, e.g., PageRank
- The stateless constraint means MapReduce can run across thousands of computers
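As a sketch of a more complex calculation, one iteration of PageRank fits the same map/shuffle/reduce shape; the tiny three-page link graph and the 0.85 damping factor here are illustrative assumptions:

```python
from collections import defaultdict

# Tiny hypothetical link graph: page -> list of pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(graph) for page in graph}

def map_contribs(page, links):
    """Map: emit each neighbor's share of this page's current rank."""
    for dest in links:
        yield dest, ranks[page] / len(links)

def reduce_rank(page, contribs, damping=0.85):
    """Reduce: combine contributions into a new rank (standard damping)."""
    return page, (1 - damping) / len(graph) + damping * sum(contribs)

# One MapReduce iteration: map, shuffle by key, reduce.
shuffled = defaultdict(list)
for page, links in graph.items():
    for dest, share in map_contribs(page, links):
        shuffled[dest].append(share)
new_ranks = dict(reduce_rank(p, c) for p, c in shuffled.items())
```

Real PageRank repeats this iteration until the ranks converge, but each iteration is just another stateless map and reduce.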
Inputs
- MapReduce distributes computing power by distributing input
- Input is distributed by splitting on lines (records)
- You cannot depend on lines being "together" in MapReduce
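A small sketch of why this works: because map is stateless, the input records can be split into chunks, each chunk processed independently and in any order, and the partial results merged afterwards (plain Python with Counter standing in for the worker processes):

```python
from collections import Counter

log = ["load", "save", "redirect", "redirect", "load",
       "redirect", "load", "save", "redirect"]

# Split the input into record chunks, as a framework does across workers.
chunks = [log[0:3], log[3:6], log[6:9]]

# Count each chunk independently; reversed() shows order doesn't matter.
partials = [Counter(chunk) for chunk in reversed(chunks)]

# Merge the partial results; the totals match counting the whole log at once.
merged = sum(partials, Counter())
```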
Spark
- Hadoop's implementation of MapReduce relies heavily on files on disk
- Spark optimizes input shuffling by using memory instead of disk as much as possible
- It pools the memory of all nodes in the cluster to do this
- It tries to process data "locally" as much as possible
Questions?