Authors: K. Nguyen, K. Wang, et al.
Presented by David A.N
Data
Distributed Computing
Machine Learning
Big Data
Runtime Bloat
Count the number of occurrences of each word in a document
Using hash table
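A minimal single-machine sketch of the hash-table approach, assuming words are separated by whitespace. The class and method names (`WordCount`, `countWords`) are illustrative and do not come from the slides.

```java
import java.util.HashMap;
import java.util.Map;

class WordCount {
    // Counts occurrences of each whitespace-separated word using a hash table.
    static Map<String, Integer> countWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document.trim().split("\\s+")) {
            if (word.isEmpty()) continue; // an empty document splits into one empty token
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("to be or not to be"));
    }
}
```

Both the document and the resulting table live in the memory of one process here, which is the limitation the next slides turn to.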
What if the document is really, really big?
Problem?
Results have to fit on one machine
Using Divide and Conquer!
Google MapReduce (2004)
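The divide-and-conquer idea can be sketched in a single process: split the document into chunks, count each chunk independently (map), then merge the partial counts (reduce). This is only an illustration of the pattern under that assumption, not Google's MapReduce API. All names (`MapReduceWordCount`, `mapChunk`, `reduce`, `wordCount`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapReduceWordCount {
    // Map phase: each chunk is counted on its own, so chunks could live on different machines.
    static Map<String, Integer> mapChunk(String chunk) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : chunk.trim().split("\\s+")) {
            if (!word.isEmpty()) partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce phase: merge the partial tables by summing the counts for each word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    // Runs the map phase over every chunk sequentially, then reduces the results.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String chunk : chunks) partials.add(mapChunk(chunk));
        return reduce(partials);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or", "not to be")));
    }
}
```

In a distributed setting the reduce step is itself partitioned by key, so no single machine has to hold the whole result. The sketch keeps it in one table for brevity.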
MapReduce for sorting: which word is used most?
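As a follow-up job, the word counts can be sorted by frequency to find the most used word. The sketch below assumes the counts already fit in one in-memory map. `TopWord`, `sortByCount`, and `mostUsed` are hypothetical names, and the order among words with equal counts is unspecified.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopWord {
    // Returns the (word, count) entries sorted by count, highest first.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    // Returns the most frequent word, or null when there are no words at all.
    static String mostUsed(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> sorted = sortByCount(counts);
        return sorted.isEmpty() ? null : sorted.get(0).getKey();
    }

    public static void main(String[] args) {
        System.out.println(mostUsed(Map.of("to", 2, "be", 3, "or", 1)));
    }
}
```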
A real pipeline runs several iterative jobs in a distributed execution
Disk I/O is very slow, so the solution is to "keep more data in memory"
[of the paper]
Big Data frameworks are written in Java and Scala for quick development and rich community resources
The managed runtime of Java has a high cost [runtime bloat] that cannot be amortized by increasing the number of data-processing machines
Excessive use of pointers and references [high space overhead] and frequent GC runs [which hurt scalability]
GC time accounts for up to 50% of overall execution time
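One rough way to observe the GC share of a run on a given JVM is through the standard `java.lang.management` beans. The sketch below is not the measurement method of the paper. The workload and the names `GcShare`, `totalGcMillis`, and `gcFraction` are assumptions for illustration, and the reported times are approximate and collector-dependent.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcShare {
    // Sums the accumulated collection time (ms) over all collectors of this JVM.
    // A collector may report -1 when the value is undefined, so those are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) total += t;
        }
        return total;
    }

    // Fraction of the elapsed time attributed to GC; 0 when nothing was measured.
    static double gcFraction(long gcMillis, long elapsedMillis) {
        return elapsedMillis <= 0 ? 0.0 : (double) gcMillis / elapsedMillis;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long gcBefore = totalGcMillis();
        long checksum = 0;
        // Hypothetical workload: many small, short-lived heap objects.
        for (int i = 0; i < 5_000_000; i++) {
            Integer[] pair = { i, i + 1 };
            checksum += pair[0] + pair[1];
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        long gc = totalGcMillis() - gcBefore;
        System.out.println("checksum=" + checksum + ", GC fraction=" + gcFraction(gc, elapsed));
    }
}
```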
Data & Control
Control: organizes tasks in pipelines and performs optimizations
Data: representation and manipulation of the data
S: cardinality of dataset
t: # of threads
n: #data types
p: #page objects used to store data
p cannot be bounded statically as n can, but the page size can be controlled
[Figure: FACADE takes program P, iteration info, and Java classes, and produces P']
Managed Heap (bounded)
Native Memory (unbounded)
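The split between a bounded managed heap and unbounded native memory can be illustrated with a small sketch: records are stored in a page of off-heap memory, and one reusable heap object reads and writes them by offset. This is a hand-written illustration under stated assumptions, not the code FACADE generates. The record layout (`Point` with two ints) and the names `PageSketch`, `Page`, `PointFacade`, and `sumOfX` are hypothetical.

```java
import java.nio.ByteBuffer;

class PageSketch {
    // One "page": a fixed-size block of native (off-heap) memory.
    static final class Page {
        final ByteBuffer buf;
        int used = 0;

        Page(int bytes) { buf = ByteBuffer.allocateDirect(bytes); }

        // Reserves space for one record and returns its offset within the page.
        int allocate(int bytes) {
            if (used + bytes > buf.capacity()) throw new IllegalStateException("page full");
            int ref = used;
            used += bytes;
            return ref;
        }
    }

    // Reusable accessor for a hypothetical record type Point { int x; int y; }.
    static final class PointFacade {
        static final int SIZE = 8; // two 4-byte ints
        Page page;
        int ref;

        // Points this accessor at the record stored at offset ref in the given page.
        PointFacade bind(Page page, int ref) { this.page = page; this.ref = ref; return this; }
        int x() { return page.buf.getInt(ref); }
        int y() { return page.buf.getInt(ref + 4); }
        void set(int x, int y) { page.buf.putInt(ref, x); page.buf.putInt(ref + 4, y); }
    }

    // Stores `records` points off-heap, then sums their x fields through one accessor.
    static long sumOfX(int records) {
        Page page = new Page(records * PointFacade.SIZE);
        PointFacade facade = new PointFacade(); // a single heap object serves every record
        for (int i = 0; i < records; i++) {
            facade.bind(page, page.allocate(PointFacade.SIZE)).set(i, -i);
        }
        long sum = 0;
        for (int i = 0; i < records; i++) {
            sum += facade.bind(page, i * PointFacade.SIZE).x();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumOfX(1000));
    }
}
```

The number of heap objects in this sketch does not grow with the number of records: only the page and the accessor are heap-allocated, while the record data sits in native memory.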
[Figure: data classes and non-data classes; HashMap (data class) and HashMapFacade]
Every class in a data class's hierarchy must also be a data class
Class Hierarchy
Instruction
Resolving Types