0 to 100 with

Jowanza Joseph

@jowanza

Origins

Basic Design

  • Inspired by the MapReduce paradigm
  • Focused on in-memory processing
  • Built primarily for batch jobs
  • Streaming as a special case
  • Written in Scala

Principles

  • Memory Efficiency
  • Parallelism
  • Lazy Evaluation* 

APIs

  • Scala, Java, Python and R APIs
  • Spark SQL
  • Structured Streaming
  • GraphX
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()


import spark.implicits._


val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
val spark: SparkSession = ...

// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

socketDF.isStreaming    // Returns True for DataFrames that have streaming sources

socketDF.printSchema

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")

Jobs

Ecosystem

Stream Processing

Deployment Options

Extras

Demo

Resources

Thanks

0 to 100 withApache Spark

By Jowanza Joseph

0 to 100 withApache Spark

  • 636