Gabor Ratky
CTO at Secret Sauce Partners

EMR 101

Future of Data: Budapest

3/21/2017

Secret Sauce Partners

Format of this talk

Q&A

Do you need Hadoop?

Probably not.

Do you need Hadoop?

  • Does your workload fit on a single machine?
    • You can provision an RDS instance with 32 vCPU, 244GB memory, and up to 6TB of storage
    • Your working set is probably even smaller and the rest can be 150μs away on SSDs [1]
  • Are you hitting any limits?
    • Is CPU, memory, disk I/O, or network I/O saturated?
    • Does processing take too long?
  • Distributed software is complex and always comes with overhead ("distributed tax")

Is EMR a service?

Yes.

Is EMR a service?

  • Introduced by Amazon in April 2009
    • Cloudera founded in 2008, CDH1 in March 2009
    • Hortonworks founded in 2011, HDP 1.0 in June 2012
  • Launches and configures Hadoop clusters in EC2
  • Provides its own API to submit jobs to EMR clusters
  • Supports auto scaling based on YARN metrics
  • Supports spot instances for Task nodes (vs Core nodes)
  • Supports MapR M3, M5, M7 distributions on top of EMR
  • No Hadoop Management Console (Ambari, CM)
  • No High Availability for Master Node
  • Costs money on top of EC2 costs ("EMR tax")

Is EMR a distribution?

Yes*.

* Since EMR 4.x, releases are packaged using Apache Bigtop.

Is EMR a distribution?

  • Packaged using Apache Bigtop
  • EMR 5.4.0
    • Hadoop 2.7.3
    • Hive 2.1.1 (1.1.0 in CDH, 1.2.1, 2.1* TP in HDP)
    • Hue 3.11.0 (3.10.0 in CDH)
    • Presto 0.166
    • Spark 2.1.0 (1.6.x in CDH/HDP, 2.0* TP in HDP)
    • Ganglia 3.7.2
    • Zeppelin 0.7.0 (0.6.0 in HDP)
  • Additional components
    • Proprietary access to S3 + s3-dist-cp
    • EMRFS (consistent S3 using DynamoDB)
    • DynamoDB, Kinesis connectors

Is EMR open source?

No.

Is EMR expensive?

It depends.

Is EMR expensive?

  • Price is on top of any EC2 cost incurred
  • The EMR fee is charged per instance-hour by instance type, from $0.011 (m1.small, +25% over the EC2 price) to $0.27 (r4.16xlarge, +6.34%)
  • The fee maxes out at $0.27, so larger instances are more cost-effective
  • Reserved and Spot Instances can greatly reduce total cost (~40% Reserved, ~90% Spot Instances)
  • Compared with other services:
    • Azure HDInsight: $0.08 - $1.48
    • Databricks: $0.40 - $3.20 (1-8 DBUs)
    • Qubole: $0.01375 - $0.337 (0.125 - 3.0681 QCUHs)

Can I run Hadoop in EC2 without EMR?

Absolutely.

What is EMR's use case?

Ephemeral clusters

What is EMR's use case?

  • Prerequisites
    • Your applications and data are already in AWS (EC2, S3, RDS, etc.)
    • You have batch jobs (data processing, ETL, ML) that run on a schedule (hourly, daily, weekly)
  • Ephemeral clusters
    • Spin up EMR cluster, submit jobs, run, terminate
    • Decouples storage from compute: you pay for compute only while you use it (high utilization)
    • Multiple EMR clusters can access the data (S3, Hive Metastore)
    • Interactive workloads (Hive, Presto, Spark, Zeppelin) on on-demand clusters
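
The "spin up, submit jobs, run, terminate" cycle can be sketched as a single request to the EMR API via boto3's `run_job_flow`. This is a minimal sketch, not a recommended configuration: the cluster name, S3 URIs, instance types, counts, bid price, and region are all hypothetical placeholders, and the default IAM role names are assumed to exist in the account. Task nodes use the Spot market while master and core nodes stay on-demand.

```python
def build_ephemeral_cluster_request(name, script_s3_uri, log_s3_uri):
    """Build run_job_flow arguments for a cluster that terminates after its steps."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.4.0",
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m4.xlarge", "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m4.xlarge", "InstanceCount": 2},
                # Task nodes hold no HDFS data, so losing Spot capacity is tolerable.
                {"Name": "task", "InstanceRole": "TASK", "Market": "SPOT",
                 "BidPrice": "0.10",  # hypothetical bid, USD per hour
                 "InstanceType": "m4.xlarge", "InstanceCount": 4},
            ],
            # Ephemeral: shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="eu-west-1"):
    """Submit the request; needs boto3 and AWS credentials (region is a placeholder)."""
    import boto3  # imported lazily so building the request needs no AWS SDK
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

A scheduler (cron, Airflow, or similar) would call `launch(build_ephemeral_cluster_request(...))` for each run; input and output live in S3, so nothing is lost when the cluster terminates.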

Who uses EMR?

Thanks!

Questions?

 

@rgabo

gabor@secretsaucepartners.com
