7 Tips for Getting Started with Spark
Nastasia Saby
@saby_nastasia
1. Use the spark-shell
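The fastest way to experiment: the shell pre-defines a SparkSession named spark and a SparkContext named sc, so you can try things immediately (a minimal sketch; the flag shown is a standard spark-shell option, not from the deck):

$ ./bin/spark-shell --master "local[*]"

scala> spark.range(5).count()   // spark and sc come pre-defined
res0: Long = 5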
2. Know the difference between transformations and actions
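A minimal sketch you can paste into the spark-shell (the data is mine, not from the deck): transformations like filter and map are lazy and only build an execution plan; actions like count and show actually run it.

import spark.implicits._

// Transformations are lazy: these lines only build an execution plan.
val numbers = (1 to 100).toDS()
val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

// Actions trigger the real computation on the executors.
println(doubledEvens.count()) // 50
doubledEvens.show(5)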
3. Learn the basics of Scala
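A sketch of the Scala basics that matter most for Spark (the example is mine): case classes, anonymous functions, and collection combinators carry over directly to the Dataset API.

case class Diamond(cut: String, price: Int)

// Plain Scala collections: the same combinator style you will use on Datasets.
val diamonds = List(Diamond("Ideal", 1200), Diamond("Fair", 700))
val cuts = diamonds.map(_.cut)                  // List(Ideal, Fair)
val expensive = diamonds.filter(_.price > 1000) // keeps only the Ideal one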
3. Know your infrastructure
[Diagram: HDFS — a NameNode plus DataNodes spread across Node 1 to Node 4]
[Diagram: YARN — a Resource Manager plus a NodeManager on each of Node 1 to Node 4]
[Diagram: HDFS and YARN together — each node runs a DataNode alongside a NodeManager]
YARN
MESOS
KUBERNETES
[Diagram: Spark architecture — the Driver Program asks a resource negotiator (the cluster manager) for resources; each Worker Node runs an Executor that executes Tasks]
# Run locally while developing:
spark-submit \
  --class org.apache.spark.MyApplication \
  --master local \
  /path/to/spark.jar

# Run on a YARN cluster:
spark-submit \
  --class org.apache.spark.MyApplication \
  --master yarn \
  /path/to/spark.jar
4. Learn and unlearn RDDs
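One reading of this tip, as a sketch (data and names are mine; a SparkSession named spark is assumed, as elsewhere in the deck): know what the RDD API looks like, but unlearn the reflex to reach for it, since Datasets go through the Catalyst optimizer.

import spark.implicits._

// The historical RDD API: explicit, but invisible to the optimizer.
val pricesRDD = spark.sparkContext.
  parallelize(Seq(("Ideal", 1200), ("Fair", 700))).
  map { case (_, price) => price }

// The same logic with the Dataset API, which Catalyst can optimize.
val pricesDS = Seq(("Ideal", 1200), ("Fair", 700)).toDS().map(_._2)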
5. Dive back into SQL
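Your SQL pays off immediately: any Dataset can be registered as a temporary view and queried with plain SQL (a small sketch; the data is mine):

import spark.implicits._

val diamonds = Seq(("Ideal", 1200), ("Fair", 700)).toDF("cut", "price")

// Register the DataFrame as a SQL view, then query it with plain SQL.
diamonds.createOrReplaceTempView("diamonds")
spark.sql("SELECT cut, price FROM diamonds WHERE price > 1000").show()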
6. Don't try to do everything with UDFs
import org.apache.spark.sql.functions.{col, udf}

// A plain Scala function...
val upper: String => String = _.toUpperCase
// ...registered as a Spark UDF and applied column-wise.
val upperUDF = udf(upper)
diamonds.withColumn("upperCut", upperUDF(col("cut"))).show
Keep UDFs as simple as possible
Keep them pure
Test them
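Because upper above is a plain pure function, it can be unit-tested without starting Spark at all (a sketch using specs2, which the tests later in the deck also use; the spec name is mine):

import org.specs2.mutable.Specification

class UpperSpec extends Specification {
  "upper" should {
    "uppercase a string without needing a SparkSession" in {
      upper("ideal") must beEqualTo("IDEAL")
    }
  }
}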
Performance
⇒ Avoid UDFs: combine Spark SQL's built-in functions instead
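For example, the UDF above can be replaced by the built-in upper column function, which stays visible to the Catalyst optimizer:

import org.apache.spark.sql.functions.{col, upper}

// Same result as upperUDF, but with a built-in function Spark can optimize.
diamonds.withColumn("upperCut", upper(col("cut"))).show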
7. Open your mind to testing with Spark
case class Diamond(cut: String, price: Int)
import spark.implicits._

val diamonds = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("diamonds.csv").
  as[Diamond]

// Map with an anonymous function, inlined at the call site.
diamonds.map(diamond => {
  diamond.cut
})
case class Diamond(cut: String, price: Int)
import spark.implicits._

val diamonds = spark.read.
  option("header", "true").
  option("inferSchema", "true").
  csv("diamonds.csv").
  as[Diamond]

// The same map, but with a named function that can be tested on its own.
def selectCut(diamond: Diamond) = {
  diamond.cut
}

diamonds.map(selectCut)
Diamonds

color | price |
---|---|
Green | 1200 |
Red | 700 |

TrendyColors

color | trendyScore |
---|---|
Green | 7 |
Red | 4 |
case class Diamond(color: String, price: Int)
case class TrendyColor(color: String, trendyScore: Int)
import spark.implicits._

val diamonds = spark.read.
  parquet("diamonds.parquet").
  as[Diamond]
val trendyColors = spark.read.
  parquet("trendyColors.parquet").
  as[TrendyColor]

// Inner join on the shared "color" column, then keep the trendy ones.
val diamondsJoinedWithTrendyColors = diamonds.join(
  trendyColors,
  Seq("color"),
  "inner"
)
val diamondsWithHighTrendyScores = diamondsJoinedWithTrendyColors.
  filter("trendyScore > 5")
diamondsWithHighTrendyScores.select("price")
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

def priceOfDiamondsWithTrendyColors(
  diamonds: Dataset[Diamond],
  trendyColors: Dataset[TrendyColor]
)(implicit spark: SparkSession): DataFrame = {
  import spark.implicits._
  val diamondsJoinedWithTrendyColors = diamonds.join(
    trendyColors,
    Seq("color"),
    "inner"
  )
  val diamondsWithHighTrendyScores = diamondsJoinedWithTrendyColors.
    filter("trendyScore > 5")
  diamondsWithHighTrendyScores.select("price")
}
import org.apache.spark.sql.SparkSession

// A shared, lazily created local SparkSession for the test suite.
trait SparkSessionTestWrapper {
  implicit lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark test")
      .getOrCreate()
  }
}
import org.specs2.mutable.Specification

class TestSpec extends Specification with SparkSessionTestWrapper {
  "Test" should {
    import spark.implicits._
    "test" in {
      val diamonds = createDataset ...
      val trendyColors = createDataset ...
      val expected = expectedList ...
      val result: DataFrame = priceOfDiamondsWithTrendyColors(
        diamonds,
        trendyColors
      )
      result.collectAsList must beEqualTo(expected)
    }
  }
}
val result = priceOfDiamondsWithTrendyColors(
  diamonds,
  trendyColors
)
// Careful: collect and count are two actions, so Spark runs the job twice.
result.collect must beEqualTo(expected)
result.count must beEqualTo(3)
import org.apache.spark.sql.Row

val result = priceOfDiamondsWithTrendyColors(
  diamonds,
  trendyColors
)
// Better: a single action; collect once, then assert on the in-memory array.
val good: Array[Row] = result.collect
good must beEqualTo(expected)
good.size must beEqualTo(3)
Thank you, and happy Halloween!
Nastasia Saby
Questions?