Introduction to PySpark

Shagun Sodhani


 

Agenda

  • Introduction to Spark.
  • How to use Spark with Python.
    • PySpark Shell
    • IPython/Jupyter Notebooks
    • Python Scripts
 

What is Apache Spark?

  • Apache Spark™ is a fast and general engine for large-scale data processing
  • Originally developed at the AMPLab at UC Berkeley (2009), open-sourced in 2010, and donated to the Apache Software Foundation in 2013
  • The project claims programs run up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
 

Advantages Over MapReduce

  • Speed.
  • Ease of Use - APIs in Scala, Java, Python, and R.
  • Generality - One engine for SQL, streaming, machine learning, and graph workloads.
  • Runs Everywhere - Hadoop, Mesos, standalone ...
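
To make the ease-of-use point concrete, here is word count, the usual MapReduce example, as a minimal PySpark sketch. It assumes `sc` is a live SparkContext, which the PySpark shell creates under that name. The helper names `tokenize` and `word_count` are illustrative and not part of Spark.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lowercase words."""
    return line.lower().split()


def word_count(sc, lines):
    """Count word occurrences with the RDD API.

    Assumption: `sc` is a live SparkContext (the PySpark shell
    provides one); `lines` is a plain Python list of strings.
    """
    counts = (
        sc.parallelize(lines)          # distribute the input lines
        .flatMap(tokenize)             # one record per word
        .map(lambda word: (word, 1))   # pair each word with a count of 1
        .reduceByKey(add)              # sum the counts for each word
    )
    return dict(counts.collect())      # bring the (small) result to the driver
```

In the PySpark shell this can be called directly as `word_count(sc, ["some text", "more text"])`. In a standalone script, a context would first be created with something like `SparkContext(appName="WordCount")`.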
 

Spark Components

  • Spark SQL and DataFrames
  • Spark Machine Learning (MLlib)
  • GraphX
  • Spark Streaming
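
As a small taste of the first component, the sketch below builds a DataFrame and queries it with Spark SQL. It assumes `spark` is a live SparkSession (available in Spark 2.0 and later, where the PySpark shell provides one under that name). The sample data and the function name `adult_names` are made up for illustration.

```python
# Illustrative sample data; the names and ages are invented for this sketch.
PEOPLE = [("Alice", 34), ("Bob", 15), ("Carol", 52)]


def adult_names(spark, rows=PEOPLE):
    """Return the names of people aged 18 or over, sorted by name.

    Assumption: `spark` is a live SparkSession (Spark 2.0+).
    """
    # Build a DataFrame with named columns from plain Python tuples.
    df = spark.createDataFrame(rows, ["name", "age"])
    # Register it so it can be queried with SQL.
    df.createOrReplaceTempView("people")
    result = spark.sql(
        "SELECT name FROM people WHERE age >= 18 ORDER BY name"
    )
    # Collect the small result to the driver as a plain list.
    return [row["name"] for row in result.collect()]
```

The same filter could be written with the DataFrame API instead of SQL, for example `df.filter(df.age >= 18).select("name")`; both forms run on the same engine.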
 

Talk is cheap. Show me the code.

Linus Torvalds

 

Thank You