Getting Started with Apache Spark
Shagun Sodhani
Big Data Training Program, IIT Roorkee
Agenda
- Introduction to Spark
- How is it different from MapReduce?
- Components
- RDD
- DataFrames
- GraphX
- PySpark
- Sample Applications
What is Apache Spark?
- Apache Spark™ is a fast and general engine for large-scale data processing
- Originally developed in the AMPLab at UC Berkeley (2009), open-sourced in 2010, and donated to the Apache Software Foundation in 2013
- Claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Advantages Over MapReduce
- Speed.
- Ease of Use - Scala, Java, Python, R.
- Generality - Not just Map and Reduce.
- Runs Everywhere - Hadoop, Mesos, standalone, etc.
Spark Components
- Spark RDD
- Spark SQL and DataFrames
- Spark Machine Learning (MLlib)
- GraphX
- Spark Streaming
RDD
Resilient Distributed Dataset
- Immutable partitioned collections.
- Lazy and ephemeral
- Building block for other libraries
- Transformations vs. actions (see the sketch below)
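A minimal PySpark sketch (not from the original slides) of these ideas: transformations such as map and filter are lazy, and only an action such as collect or count triggers computation. The local master and app name below are illustrative.

# Transformations vs. actions on an RDD
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

nums = sc.parallelize(range(1, 11))             # RDD: immutable, partitioned collection
squares = nums.map(lambda x: x * x)             # transformation: lazy, nothing runs yet
evens = squares.filter(lambda x: x % 2 == 0)    # another lazy transformation

print(evens.collect())                          # action: triggers the actual computation
print(evens.count())                            # another action

sc.stop()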
SQL and DataFrames
- Structured data processing.
- Distributed SQL query engine.
- DataFrame is similar to a table in an RDBMS.
- No indexing (see the sketch below).
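A hedged sketch of the DataFrame and SQL APIs, assuming the Spark 1.x-era SQLContext that was current at the time of this talk; the file "people.json" and the "age"/"name" columns are hypothetical.

# Structured data processing with DataFrames and SQL
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "df-demo")
sqlContext = SQLContext(sc)

df = sqlContext.read.json("people.json")    # structured data loaded as a DataFrame
df.printSchema()                            # inferred schema, like a table in an RDBMS
df.filter(df.age > 21).show()               # DataFrame API

df.registerTempTable("people")              # expose the DataFrame to the SQL engine
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

sc.stop()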
GraphX and GraphFrames
- Graphs and graph-parallel computation.
- Property Graph Model (directed multigraph).
- Pregel API.
- GraphFrames: DataFrame-based graphs.
- Very new - announced less than a month ago (see the sketch below).
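A minimal GraphFrames sketch, assuming the graphframes package is available on the classpath (e.g. via spark-submit --packages); the vertex and edge data are made up for illustration.

# A property graph with GraphFrames (DataFrame-based graphs)
from pyspark import SparkContext
from pyspark.sql import SQLContext
from graphframes import GraphFrame

sc = SparkContext("local", "graph-demo")
sqlContext = SQLContext(sc)

vertices = sqlContext.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = sqlContext.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)      # property graph: attributes on vertices and edges
g.inDegrees.show()                   # simple graph query
g.pageRank(resetProbability=0.15, maxIter=5).vertices.show()  # graph-parallel algorithm

sc.stop()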
PySpark
Sample Apps
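As a hedged illustration of a sample application (not taken from the talk), the classic word count in PySpark; the input path "input.txt" is hypothetical.

# Word count in PySpark
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())    # split lines into words
            .map(lambda word: (word, 1))           # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))      # sum counts per word

for word, count in counts.take(10):                # action: triggers the job
    print(word, count)

sc.stop()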
Thank You