Spark and Scala

Tools setup

- JDK

 $ java -version

- sbt (Scala Build Tool)

 

 $ sbt about

 $ sbt compile

 $ sbt run

- VS Code

 Scala Syntax (official) and Scala (Metals) extensions
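
For `sbt compile` and `sbt run` to work, the project needs a `build.sbt` at its root. A minimal sketch, assuming a hypothetical project name and one possible pairing of Scala and Spark versions (check the Spark docs for the versions you actually need):

```scala
// build.sbt: minimal sketch; the name and all versions below are assumptions
name := "spark-scala-demo"
version := "0.1.0"
scalaVersion := "2.12.18"

// spark-sql brings in spark-core transitively;
// %% appends the Scala binary version (_2.12) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0"
```

With this file in place, `sbt run` looks for an object with a `main` method under `src/main/scala`.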

Tools setup

- Databricks Community Edition

 https://databricks.com/product/faq/community-edition

 https://community.cloud.databricks.com

 You will be getting emails!

- Jupyter Notebooks, run locally

- IntelliJ IDEA Community Edition

  Scala plugin 

Tools setup VS Code

Free Book

Concurrent work

  • Split the data
  • Concurrent work routine (threads)
  • Combine back when done
  • (Deal with errors)

one machine (or microservice)
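
The split / work / combine / errors steps above can be sketched on one machine with plain Scala `Future`s. This is only an illustration: `ConcurrentSum`, `parallelSum`, and the 10-second timeout are hypothetical choices, and summing numbers stands in for the real work routine.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success, Try}

// Hypothetical example object; not part of any library
object ConcurrentSum {
  // chunkSize must be positive
  def parallelSum(data: Vector[Int], chunkSize: Int): Try[Long] = {
    // Split the data
    val chunks = data.grouped(chunkSize).toVector
    // Concurrent work routine: each chunk is summed on the global thread pool
    val partials = chunks.map(chunk => Future(chunk.map(_.toLong).sum))
    // Combine back when done
    val combined = Future.sequence(partials).map(_.sum)
    // Deal with errors: a failed task or a timeout becomes a Failure
    Try(Await.result(combined, 10.seconds))
  }

  def main(args: Array[String]): Unit =
    parallelSum((1 to 1000).toVector, chunkSize = 100) match {
      case Success(total) => println(s"total = $total")
      case Failure(err)   => println(s"failed: ${err.getMessage}")
    }
}
```

Everything here shares one JVM and one memory space, which is what separates concurrent work from the distributed case.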

Distributed work

  • Split the data (or not) at Ingress
  • Distribute workflow 
  • Produce desired results at Egress
  • (Deal with errors)

microservices / ( serverless )

Distributed Data 

  • Data split across nodes
  • Nodes operate on data shards

Concurrent - Spark nodes
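
In Spark terms, the shards are partitions and each task works on one partition. A rough sketch, assuming a local session where `local[4]` simulates four workers with threads; `ShardDemo` and `shardSums` are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not part of Spark
object ShardDemo {
  // Returns one partial sum per partition (shard)
  def shardSums(spark: SparkSession, shards: Int): Array[Long] = {
    // Data split across partitions
    val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = shards)
    // Each task operates only on its own partition
    numbers.mapPartitions(it => Iterator(it.map(_.toLong).sum)).collect()
  }

  def main(args: Array[String]): Unit = {
    // local[4]: run in-process with 4 worker threads (assumption for a laptop demo)
    val spark = SparkSession.builder()
      .appName("shard-demo")
      .master("local[4]")
      .getOrCreate()

    val sums = shardSums(spark, shards = 4)
    println(s"partial sums = ${sums.mkString(", ")}")
    println(s"total = ${sums.sum}")

    spark.stop()
  }
}
```

On a real cluster the same code runs unchanged; only the master URL differs, and the partitions live on different nodes.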

Spark API

  • RDD: low level
  • DataFrame: not typed, Dataset[Row]
  • Dataset: typed, Dataset[Something] (from case class Something)

Catalyst optimizer
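
The three API levels can be compared side by side by running the same filter through each. A minimal sketch, assuming a local session; the fields of `Something` and the `ApiLevels` object are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical record type; must be top-level so Spark can derive an encoder
case class Something(name: String, value: Int)

// Hypothetical example object; not part of Spark
object ApiLevels {
  // Counts rows with value > 1 using the Dataset, DataFrame, and RDD APIs
  def counts(spark: SparkSession): (Long, Long, Long) = {
    import spark.implicits._
    val items = Seq(Something("a", 1), Something("b", 2), Something("c", 3))

    // Typed: field access is checked at compile time
    val ds: Dataset[Something] = items.toDS()
    val typed = ds.filter(s => s.value > 1)

    // Not typed: DataFrame is an alias for Dataset[Row]; columns resolve at runtime
    val df: DataFrame = ds.toDF()
    val untyped: Dataset[Row] = df.filter($"value" > 1)

    // Low level: plain RDD of objects, not planned by Catalyst
    val rdd: RDD[Something] = ds.rdd
    val lowLevel = rdd.filter(_.value > 1)

    (typed.count(), untyped.count(), lowLevel.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-levels")
      .master("local[*]")
      .getOrCreate()
    println(counts(spark))
    spark.stop()
  }
}
```

Calling `explain()` on the DataFrame or Dataset result prints the plan Catalyst produced; RDDs have no equivalent because they are executed as written.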

Spark Docs

DataFrame / Dataset

Functions

spark-scala

By Cosmin P


spark scala resources
