Gabor Ratky
CTO at Secret Sauce Partners
Climbing the slope of enlightenment
EMR is the largest Hadoop distribution by market share
EMR 4 is based on Apache Bigtop
Integration with AWS services (EC2, S3, Redshift)
Ephemeral clusters
Intelligent resizing
EMR Sandbox (Zeppelin)
Spark 1.5
Not for everyone (cloud lock-in)
We write distributed applications
Right level of abstraction to reason about
"Write once, run everywhere"
Huge momentum (also hype)
Frameworks on top of primitives (DataFrame, MLlib)
80% of data science/engineering is data munging
Document datasets, share steps and results
Killer app to work with and collaborate on datasets
Familiar REPL experience