7 tips for getting started with Spark
Nastasia Saby
@saby_nastasia
1. Use the spark-shell, it's a good friend
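Launch it with spark-shell --master local[*]; inside, a SparkSession named spark is already in scope, so you can experiment with no build step. A minimal sketch (the numbers are only an illustration):

// Inside the spark-shell, `spark` is already available.
val numbers = spark.range(1, 1000000)

// Quick experiment: count the even ids.
numbers.filter("id % 2 = 0").count()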
2. Know the difference between transformations and actions
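A small sketch of the difference: transformations only build an execution plan, actions actually run it.

val ds = spark.range(1, 100)

// Transformation: lazy, nothing is executed yet.
val evens = ds.filter("id % 2 = 0")

// Actions: they trigger the actual computation.
evens.count()
evens.show(5)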
3. Learn the basics of Scala, whatever language you use
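Even if you work in PySpark or Java, most documentation and examples are written in Scala, so a few idioms go a long way. A plain Scala sketch, no Spark involved (the data is made up):

// Case classes, anonymous functions and the placeholder syntax
// show up in almost every Spark example.
case class People(age: Int, name: String)

val peoples = List(People(30, "Ada"), People(40, "Grace"))
val ages = peoples.map(people => people.age)   // anonymous function
val adults = peoples.filter(_.age >= 18)       // placeholder syntax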
4. Understanding the infrastructure you run on will help you understand Spark BETTER
[Diagram: an HDFS cluster — four nodes, each running a DataNode; node 2 also hosts the NameNode]
[Diagram: a YARN cluster — the ResourceManager on node 2, with a NodeManager on each of the four nodes]
[Diagram: HDFS and YARN co-located — every node runs a DataNode and a NodeManager; node 2 also hosts the NameNode and the ResourceManager]
YARN
MESOS
KUBERNETES
[Diagram: Spark runtime — the Driver Program asks a resource negotiator (the cluster manager) for resources; Executors on Worker Nodes each run Tasks]
spark-submit \
--class org.apache.spark.MyApplication \
--master local \
/path/to/spark.jar
spark-submit \
--class org.apache.spark.MyApplication \
--master yarn \
/path/to/spark.jar
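Submitting to the other cluster managers follows the same pattern; for example on Kubernetes (the API server address and the container image are placeholders):

spark-submit \
  --class org.apache.spark.MyApplication \
  --master k8s://https://<kubernetes-api-server>:443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  /path/to/spark.jar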
5. Learn and unlearn RDDs
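RDDs are worth learning because everything in Spark ultimately runs on them, and worth unlearning because the Dataset/DataFrame API usually expresses the same thing more simply and lets Catalyst optimise it. A rough side-by-side sketch with made-up data:

import org.apache.spark.sql.functions.sum
import spark.implicits._

case class People(age: Int, name: String)
val data = Seq(People(30, "Ada"), People(40, "Grace"))

// RDD style: explicit functions over raw objects, no optimizer help.
val peopleRDD = spark.sparkContext.parallelize(data)
val totalAge = peopleRDD.map(_.age).reduce(_ + _)

// Dataset style: same result, expressed declaratively.
val peopleDS = data.toDS()
peopleDS.agg(sum("age")).show()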
6. Dive back into SQL, it's a great travel companion
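Anything you can write with the DataFrame API can also be written as plain SQL over a temporary view, which is often the quickest way back into a familiar mindset. A sketch reusing the peoples.parquet file from the testing slides:

val peopleDF = spark.read.parquet("peoples.parquet")

// Expose the DataFrame to the SQL engine under a view name.
peopleDF.createOrReplaceTempView("peoples")

spark.sql("""
  SELECT name, age
  FROM peoples
  WHERE age > 30
  ORDER BY age DESC
""").show()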
7. Don't try to do everything with UDFs
val upper: String => String = _.toUpperCase

import org.apache.spark.sql.functions.{col, udf}
val upperUDF = udf(upper)

dataset.withColumn("upper", upperUDF(col("text"))).show()
Be as simple as possible
Be pure
Tests
Performance
=> Avoid UDFs
Combinations of Spark SQL built-in functions
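For instance, the upperUDF above isn't needed at all: upper already exists as a built-in, and built-ins stay visible to the Catalyst optimizer. A minimal sketch reusing the same dataset:

import org.apache.spark.sql.functions.{col, upper}

// Same result as upperUDF, without leaving the optimizer's reach.
dataset.withColumn("upper", upper(col("text"))).show()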
Bonus: Open your mind to other ways of testing with Spark
Unit tests with Spark
case class People(age: Int, name: String)

import spark.implicits._
val peopleDS = spark.read.parquet("peoples.parquet").as[People]

// Anonymous function inlined in the map: hard to test on its own.
peopleDS.map(people => people.age)
Unit tests with Spark
case class People(age: Int, name: String)

import spark.implicits._
val peopleDS = spark.read.parquet("peoples.parquet").as[People]

// Named, pure function: testable without a SparkSession.
def selectAge(people: People): Int = people.age

peopleDS.map(selectAge(_))
Tests with Spark
case class People(age: Int, name: String)
case class Revenue(age: Int, revenue: Int)

import spark.implicits._
val peopleDS = spark.read.parquet("peoples.parquet").as[People]
val revenueDS = spark.read.parquet("revenue.parquet").as[Revenue]

// Everything inline: impossible to test without reading the real files.
val joined = peopleDS.join(revenueDS, Seq("age"), "inner")
val selected = joined.select("name", "revenue")
selected.filter("revenue > 87000")
Tests with Spark
case class People(age: Int, name: String)
case class Revenue(age: Int, revenue: Int)

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import spark.implicits._

val peopleDS = spark.read.parquet("peoples.parquet").as[People]
val revenueDS = spark.read.parquet("revenue.parquet").as[Revenue]

// Extracted into a function: a test can call it with small in-memory Datasets.
def nameOfRichPeople(
  peopleDS: Dataset[People],
  revenueDS: Dataset[Revenue]
)(implicit spark: SparkSession): DataFrame = {
  val joined = peopleDS.join(revenueDS, Seq("age"), "inner")
  val selected = joined.select("name", "revenue")
  selected.filter("revenue > 87000")
}

nameOfRichPeople(peopleDS, revenueDS)
Tests with Spark: a wrapper
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  implicit lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark test")
      .getOrCreate()
  }
}
Tests with Spark: a wrapper
class TestSpec extends Specification with SparkSessionTestWrapper {

  "Test" should {
    import spark.implicits._

    "test" in {
      val revenueDS = createDataset ...
      val peopleDS = createDataset ...
      val expected = expectedList ...

      val result: DataFrame = nameOfRichPeople(peopleDS, revenueDS)

      result.collectAsList must beEqualTo(expected)
    }
  }
}
Tests with Spark: performance
val result = myFunction(ds1, ds2)

// Two actions: the whole pipeline is computed twice.
result.collect must beEqualTo(expected)
result.count must beEqualTo(3)
Tests with Spark: performance
val result: DataFrame = myFunction(ds1, ds2)

// A single action: collect once, then assert on the in-memory array.
val good: Array[Row] = result.collect
good must beEqualTo(expected)
good.size must beEqualTo(3)
Thank you
Nastasia Saby
Questions?