Introduction to PySpark

Shagun Sodhani



  • Introduction to Spark.
  • How to use Spark with Python.
    • PySpark Shell
    • IPython/Jupyter Notebooks
    • Python Scripts

What is Apache Spark?

  • Apache Spark™ is a fast and general engine for large-scale data processing
  • Originally developed at the AMPLab at UC Berkeley (2009), open-sourced in 2010, donated to the Apache Software Foundation in 2013
  • Claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Advantages Over MapReduce

  • Speed.
  • Ease of Use - APIs in Scala, Java, Python, and R.
  • Generality - Supports different use cases.
  • Runs Everywhere - Hadoop, Mesos, standalone ...
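
To make the ease-of-use point concrete, here is a minimal word-count sketch using the PySpark RDD API. The names tokenize and word_count, the application name, and the local[*] master are illustrative assumptions, not taken from the slides. Running word_count assumes PySpark and a Java runtime are installed locally.

```python
import re


def tokenize(line):
    """Split a line into lowercase words (plain Python, no Spark required)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(lines, app_name="word-count-sketch"):
    """Count words in an iterable of strings with Spark running in local mode."""
    # Imported here so tokenize() stays usable on machines without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")      # assumption: run on all local cores
        .appName(app_name)
        .getOrCreate()
    )
    try:
        counts = (
            spark.sparkContext.parallelize(list(lines))
            .flatMap(tokenize)                   # one record per word
            .map(lambda word: (word, 1))         # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)     # sum the counts per word
        )
        return dict(counts.collect())
    finally:
        spark.stop()
```

With Spark available, word_count(["Talk is cheap", "Show me the code"]) should return a dictionary mapping each word to its number of occurrences. The same job in classic MapReduce typically needs separate mapper, reducer, and driver classes.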

Spark Components

  • Spark SQL and DataFrames
  • Spark Machine Learning (MLlib)
  • GraphX
  • Spark Streaming
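
As a rough illustration of the Spark SQL and DataFrames component, the sketch below runs the same aggregation through the DataFrame API and through a SQL query. The SALES data, the column names, and the functions reference_totals and spark_totals are hypothetical and exist only for this example. The Spark part assumes a local PySpark installation.

```python
# Hypothetical sample data: (region, amount) pairs.
SALES = [("north", 10.0), ("south", 4.5), ("north", 2.5)]


def reference_totals(rows):
    """Plain-Python version of the aggregation, handy for checking Spark's result."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals


def spark_totals(rows):
    """Sum amounts per region twice: once via the DataFrame API, once via SQL."""
    # Imported here so reference_totals() works without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")          # assumption: local mode
        .appName("dataframe-sketch")
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["region", "amount"])

        # DataFrame API: group and aggregate with column expressions.
        via_api = df.groupBy("region").agg(F.sum("amount").alias("total"))

        # SQL: register the DataFrame as a temporary view and query it.
        df.createOrReplaceTempView("sales")
        via_sql = spark.sql(
            "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
        )

        api_result = {row["region"]: row["total"] for row in via_api.collect()}
        sql_result = {row["region"]: row["total"] for row in via_sql.collect()}
        return api_result, sql_result
    finally:
        spark.stop()
```

Both dictionaries returned by spark_totals(SALES) should match reference_totals(SALES), since the two Spark queries express the same grouping and sum.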

Talk is cheap. Show me the code.

Linus Torvalds


Thank You