Scala: Big Data swiss army knife
Airton Libório
Data Analytics Lead @ McKinsey Digital Labs
Short Bio
- MSc Computer Science (Distributed Systems) PUC-Rio
- Previous work in:
Telco: call logs pipeline, self organising networks, geoprocessing
Energy: network optimization, genetic algorithms
Startups: PSafe (Antivirus, cloud storage)
- Data Analytics Lead @ McKinsey Digital Labs Latam
- Big Data, large scale systems, service oriented architectures, open source, backend
Agenda
Scala
DEMO!
What is Functional Programming about
- Functions (not objects or procedures) are used as the fundamental building blocks of a program
- Assignment-less programming
- Functions don't have side effects
- Functions are first-class citizens
- Functions are higher order
- Correct and expressive programs
- Programs broken down into smaller pieces
- FP evaluates expressions, whereas imperative programming (IP) executes statements that modify global state
- Referential transparency: if y = f(x) and g = h(y, y), then g = h(f(x), f(x))
- Distributed systems love immutable data
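A minimal sketch of the points above, using only the Scala standard library. The names `square` and `applyTwice` are made up for illustration: a pure function, a higher-order function, and a check that a name can be swapped for the expression it stands for.

```scala
object FpBasics {
  // A pure function: same input, same output, no side effects.
  def square(x: Int): Int = x * x

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Referential transparency: `y` can be replaced by `square(3)` anywhere
  // without changing the result of the program.
  val y: Int = square(3)
  val viaName: Int = y + y
  val viaExpression: Int = square(3) + square(3)
}
```

Because `square` has no side effects, `viaName` and `viaExpression` are guaranteed to be equal.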
Scala
- = (Object oriented + functional paradigm) / JVM
- Seamless Java interoperation
- Statically typed with dynamic features
- Higher-order functions
- Read-Eval-Print Loop (REPL)
- Duck typing (structural typing)
- Deeply extensible
- DSL Support
- Strict
(scalable language)
Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way
Immutability
Prefer vals, immutable objects, and methods without side effects. Reach for them first
- Decoupling and isolation
- Function arguments are val by default
- Parallelism and multithreading
- Accessing a mutable object from separate threads requires locking
- Scalability!
// Bad
var x: ExpressionType = null
if (myBoolean) x = expr1 else x = expr2
// Good
val x = if (myBoolean) expr1 else expr2
Mutable x Immutable
// Mutable: java.util.Date + Calendar
val firstDate: Date = ...
val c = Calendar.getInstance()
c.setTime(firstDate)
c.add(Calendar.DATE, -1)
val dayAgo = c.getTime()
// Immutable: Joda-Time
val dateTime = new DateTime(firstDate)
val dayBefore = dateTime.minusDays(1)
Mutability
// Not everything should be solved with immutability though...
class Item{ ... }
class Player(var health: Int = 100,
val items: mutable.Buffer[Item] = mutable.Buffer.empty)
val player = new Player()
// Mutability for performance is OK
def getFibs(n: Int): Seq[Int] = {
val fibs = mutable.ArrayBuffer(1, 1)
while(fibs.length < n){
fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
}
fibs.toSeq // Seq is immutable by default since Scala 2.13
}
// But try to avoid this
def getFibs(n: Int, fibs: mutable.ArrayBuffer[Int]): Unit = {
fibs.clear()
fibs.append(1)
fibs.append(1)
while(fibs.length < n){
fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
}
fibs
}
// Mutate either the variable or the value
val myList = new mutable.ArrayBuffer[Int]()
var myList = immutable.List[Int](1, 2, 3)
// Bad!
var myList = mutable.ArrayBuffer[Int]()
Type inference x Explicit typing
val sum = 34 + 4 * 2
val list = List(1, 2, 3)
val map = Map("hey" -> list)
def succ(x: Int) = x + 1
val sum: Int = 34 + 4 * 2
val list: List[Int] = List(1, 2, 3)
val map: Map[String, List[Int]] = Map("hey" -> list)
def succ(x: Int): Int = x + 1
Inner Functions
def someComplexFunction(p: Parameter) = {
  def theFirstStep = {
    // do something, using parameter
  }
  def anotherStep = {
    // do something else
  }
  theFirstStep + anotherStep
}
def someComplexFunction(p: Parameter) =
theFirstStep(p) + anotherStep(p)
private def theFirstStep(p: Parameter) = {
...
}
private def anotherStep(p: Parameter) = {
...
}
Conciseness
public class MyUser {
String myStr;
Integer myInt;
public MyUser(String myStr, Integer myInt) {
this.myStr = myStr;
this.myInt = myInt;
}
public String getMyStr() {
return this.myStr;
}
public Integer getMyInt() {
return this.myInt;
}
}
case class MyUser(myStr: String, myInt: Int)
Case classes
Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching
- A Point class
- A constructor taking at least two parameters
- Can be invoked with named parameters
- Getters for x, y and z
- a hashCode method
- an equals method
- A method for creating copies of immutable instances
case class Point(x: Int, y: Int, z: Int = 0)
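As a rough illustration of what that single line provides, the sketch below exercises the generated features listed above. The wrapper object `PointDemo` and the value names are only for this example.

```scala
object PointDemo {
  case class Point(x: Int, y: Int, z: Int = 0)

  // Named parameters in any order; z falls back to its default of 0.
  val p = Point(y = 2, x = 1)

  // copy creates a new instance; p itself is left untouched.
  val q = p.copy(z = 5)

  // equals compares field by field, not by reference.
  val sameValue: Boolean = p == Point(1, 2, 0)

  // Decomposition via pattern matching.
  val described: String = q match {
    case Point(x, y, z) => s"$x,$y,$z"
  }
}
```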
Pattern matching
(...) tests whether a given value (or sequence of values) has the shape defined by a pattern, and, if it does, binds the variables in the pattern to the corresponding components of the value (or sequence of values)
// Mixed
case class Player(name: String, score: Int)
def message(player: Player) = player match {
case Player(_, score) if score > 100000 => "Get a job, dude!"
case Player(name, _) => "Hey " + name + ", nice to see you again!"
}
// With collections
val list = List(0, 4, 5)
list match {
case List(0, _, _) => println("found it")
case _ =>
}
// Value matching
val sign = ch match {
case '+' => 1
case '-' => -1
case _ => 0
}
// Type matching
obj match {
case x: Int => x
case s: String => Integer.parseInt(s)
case _: BigInt => Int.MaxValue
case _ => 0
}
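// Careful: type arguments are erased at runtime, so this matches any Map
// (the compiler emits an "unchecked" warning)
def isIntIntMap(x: Any) = x match {
case m: Map[Int, Int] => true
case _ => false
}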
Pattern matching
sealed abstract class Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape
case class Triangle(base: Double, height: Double) extends Shape
def area(shape: Shape): Double = {
shape match {
case Circle(radius) => math.Pi * math.pow(radius, 2.0)
case Rectangle(1, height) => height
case Rectangle(width, 1) => width
case Rectangle(width, height) => width * height
case Triangle(0, _) | Triangle(_, 0) => 0
case Triangle(base, height) => height * base / 2
}
}
The Option[T] type
case class User(id: Int, name: String, age: Int, gender: Option[String])
object UserRepository {
private val users = Map(1 -> User(1, "John Doe", 32, Some("male")),
2 -> User(2, "Johanna Doe", 30, None))
users(2).gender match {
case Some(gender) => println("Gender: " + gender)
case None => println("Gender: not specified")
}
def findById(id: Int): Option[User] = users.get(id)
def findAll = users.values
}
UserRepository.findById(2).foreach(user => println(user.age)) // prints 30
for {
User(_, _, _, Some(gender)) <- UserRepository.findAll
} yield gender
- Container for an optional value of type T, i.e. values that may be present or not
- If the value of type T is present, Option[T] is an instance of Some[T]
- If the value is absent, Option[T] is the object None
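A small sketch of working with Option without pattern matching, using only the standard library. The helpers `describe` and `parseAge` are hypothetical and exist only for this example.

```scala
import scala.util.Try

object OptionDemo {
  // map transforms the value if present; getOrElse supplies the fallback.
  def describe(gender: Option[String]): String =
    gender.map(g => s"Gender: $g").getOrElse("Gender: not specified")

  // Wraps a parse that may fail into an Option instead of throwing.
  def parseAge(s: String): Option[Int] = Try(s.trim.toInt).toOption

  // A for comprehension yields Some only if every step is present.
  val total: Option[Int] = for {
    a <- parseAge("32")
    b <- parseAge("30")
  } yield a + b

  val missing: Option[Int] = for {
    a <- parseAge("32")
    b <- parseAge("n/a")
  } yield a + b
}
```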
Laziness
class X { val x = { Thread.sleep(2000); 15 } }
class Y { lazy val y = { Thread.sleep(2000); 13 } }
new X // we have to wait two seconds for the result
new Y // returns instantly; y is computed on first access
// Expression only evaluated if needed (by-name parameter)
def logMsg(str: => String): Unit = { ... }
def expensive: String = { ... }
logMsg(s"Some $expensive message!")
- Some languages (like Haskell) are lazy: every expression’s evaluation waits for its (first) use
- Scala is strict by default, but lazy if explicitly specified for given variables or parameters
- Laziness is made of lambdas: anonymous functions closed over their lexical scope
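To make the last point concrete, here is a sketch with a by-name parameter (`=> String`) and a `lazy val`. A counter records when the expensive expression actually runs. The counter, the `enabled` flag and the `Config` class are assumptions added for the example.

```scala
object LazinessDemo {
  // Counts how many times an "expensive" expression was really evaluated.
  var evaluations = 0

  def expensive: String = { evaluations += 1; "expensive" }

  // By-name parameter: the argument is passed as an unevaluated thunk
  // and only runs if `msg` is referenced inside the body.
  def logMsg(enabled: Boolean)(msg: => String): Option[String] =
    if (enabled) Some(msg) else None

  class Config {
    // Evaluated on first access, then cached.
    lazy val value: Int = { evaluations += 1; 13 }
  }
}
```

With logging disabled, `logMsg(false)(s"Some $expensive message!")` never calls `expensive`, which is the behaviour the slide's `logMsg` example is after.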
Collections
- Scala has a rich collections library
- Containers can be sequences or other groupings of items, such as List, Tuple, Option, Map, etc.
- Collections may have an arbitrary number of elements or be bounded to zero or one element (e.g., Option)
- Either strict or lazy, mutable or immutable
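A short sketch of those two axes (mutable vs immutable, strict vs lazy) using standard library collections. The value names and the `mapCalls` counter are illustrative only.

```scala
import scala.collection.mutable

object CollectionsDemo {
  // Immutable: prepending returns a new list, the original is unchanged.
  val immutableList = List(1, 2, 3)
  val extended = 0 :: immutableList

  // Mutable: the buffer is modified in place.
  val buffer = mutable.ArrayBuffer(1, 2, 3)
  buffer += 4

  // Lazy: an Iterator over an unbounded sequence only computes
  // as many elements as the consumer asks for.
  var mapCalls = 0
  val firstEvenSquare: Option[Int] =
    Iterator.from(1).map { n => mapCalls += 1; n * n }.find(_ % 2 == 0)
}
```

The iterator stops as soon as `find` succeeds, so only the first two squares are ever computed.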
Parallelism
- Concurrency != Parallelism
- Avoid concurrency like the plague it is!!!
- Parallelism is about speeding up a program by using multiple processors
- "Use parallelism if you can, concurrency otherwise" (Haskell wiki)
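One way to sketch parallelism in this sense is to split a pure computation into independent chunks and combine the partial results. The example below uses standard library Futures on the global execution context. The function `parallelSum` and its chunking strategy are assumptions for illustration, not something shown in the talk.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSumDemo {
  // Splits the input into roughly `chunks` pieces, sums each piece in its
  // own Future, then adds the partial sums. No shared mutable state is
  // involved, so the result is deterministic.
  def parallelSum(xs: Vector[Long], chunks: Int): Long = {
    val size = math.max(1, math.ceil(xs.length.toDouble / chunks).toInt)
    val partials = xs.grouped(size).toVector.map(chunk => Future(chunk.sum))
    Await.result(Future.sequence(partials), 10.seconds).sum
  }
}
```

Each chunk is summed independently, so there is nothing to lock and the answer matches the sequential `xs.sum`.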
Currying
- Process of transforming a function that takes multiple arguments into a sequence of functions that each have only a single parameter
- Partial application is different: it fixes some of a function's arguments and returns a new function that takes only the remaining ones
f: ( X x Y x Z ) -> N
currying produces...
curry(f): X -> (Y -> (Z -> N ))
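In Scala the transformation above is available as `curried` on function values. The sketch below contrasts it with partial application. `add3` and `scale` are made-up names for the example.

```scala
object CurryingDemo {
  // f: (X x Y x Z) -> N
  val add3: (Int, Int, Int) => Int = (x, y, z) => x + y + z

  // curry(f): X -> (Y -> (Z -> N))
  val curried: Int => Int => Int => Int = add3.curried
  val addOne: Int => Int => Int = curried(1)

  // Partial application: fix two arguments, leave the third open.
  val partial: Int => Int = add3(1, 2, _)

  // Methods can also be declared with multiple parameter lists.
  def scale(factor: Int)(x: Int): Int = factor * x
  val twice: Int => Int = scale(2)
}
```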
One liners
(1 to 10) map { _ * 2 }
(1 to 1000).reduceLeft( _ + _ )
(1 to 1000).sum
val fileText = Source.fromFile("file.txt").mkString
val fileLines = Source.fromFile("file.txt").getLines.toList
List(14, 35, -7, 46, 98).reduceLeft ( _ min _ )
List(14, 35, -7, 46, 98).min
// Check whether any of the words occurs in a String
val wordList = List("scala", "akka", "play framework", "sbt", "typesafe")
val tweet = "This is an example tweet talking about scala and sbt."
(wordList.foldLeft(false)( _ || tweet.contains(_) ))
wordList.exists(tweet.contains)
val pangram = "The quick brown fox jumps over the lazy dog"
(pangram split " ") filter (_ contains 'o')
val m = pangram filter (_.isLetter) groupBy (_.toLower) mapValues (_.size)
m.toSeq sortBy (_._2)
m.toSeq sortWith (_._2 > _._2)
m.filter(_._2 > 1).toSeq sortWith (_._2 > _._2) mkString "\n"
Source.fromURL("https://github.com/humans.txt").take(335).mkString
Misc
Programs must be written for people to read, and only incidentally for machines to execute
-- Harold Abelson
Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration
-- Stan Kelly-Bootle
A programming language is low level when its programs require attention to the irrelevant
-- Alan J. Perlis
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live
-- Martin Golding
Tools and frameworks: Akka
- A toolkit for building highly concurrent message systems
- Asynchronous, Resilient and Distributed by Design
- High-level abstractions like Actors, Streams and Futures
- Adaptive cluster management, load balancing, routing, partitioning and sharding
- Highly performant, elastic and decentralized
- Futures support
import akka.actor.Actor

// HttpRequest, DbResponse, WsResponse and the helper calls are application-defined
class MyActor extends Actor {
  // receive is a partial function over incoming messages
  def receive = {
    case HttpRequest(request) =>
      val query = buildQuery(request)
      dbCall(query)
    case DbResponse(dbResponse) =>
      val wsRequest = buildWebServiceRequest(dbResponse)
      wsCall(wsRequest)
    case WsResponse(wsResponse) => sendReply(wsResponse)
  }
}
Tools and frameworks: Slick
- Functional-relational mapping for Scala
- Compile time static type checking
- Seamless manipulation in a collections manner
- Query compiler for different databases (MySQL, PostgreSQL,...)
// Definition of the COFFEES table
class Coffees(tag: Tag) extends Table[(String, Int, Double, Int, Int)](tag, "COFFEES") {
def name = column[String]("COF_NAME", O.PrimaryKey)
def supID = column[Int]("SUP_ID")
def price = column[Double]("PRICE")
def sales = column[Int]("SALES")
def total = column[Int]("TOTAL")
def * = (name, supID, price, sales, total)
// A reified foreign key relation that can be navigated to create a join
def supplier = foreignKey("SUP_FK", supID, suppliers)(_.id)
}
val coffees = TableQuery[Coffees]
coffees ++= Seq(
("Colombian", 101, 7.99, 0, 0),
("French_Roast_Decaf", 49, 9.99, 0, 0)
)
// Coffee is assumed to be an alias for the table's row type (String, Int, Double, Int, Int)
def fetchAll() = { for (c <- coffees) yield c }
def fetch(name: String) = { coffees.filter(_.name === name).result }
def insert(c: Coffee) = { coffees += c }
def insert(cs: Seq[Coffee]) = { (coffees ++= cs).transactionally }
def delete(supID: Int) = { coffees.filter(_.supID === supID).delete }
for {
c <- coffees if c.price < 9.0
s <- suppliers if s.id === c.supID
} yield (c.name, s.name)
Tools and frameworks: Play Framework
- Lightweight, stateless web framework
- Built on top of Akka
- Non-blocking I/O
- RESTful by default
class HomeController @Inject() (computerService: ComputerService,
companyService: CompanyService,
val messagesApi: MessagesApi)
extends Controller with I18nSupport {
val computerForm = Form(
mapping(
"id" -> ignored(None:Option[Long]),
"name" -> nonEmptyText,
"introduced" -> optional(date("yyyy-MM-dd")),
"discontinued" -> optional(date("yyyy-MM-dd")),
"company" -> optional(longNumber)
)(Computer.apply)(Computer.unapply)
)
def edit(id: Long) = Action {
computerService.findById(id).map { computer =>
Ok(html.editForm(id, computerForm.fill(computer), companyService.options))
}.getOrElse(NotFound)
}
def list(page: Int, orderBy: Int, filter: String) = Action { implicit request =>
Ok(html.list(
computerService.list(page = page, orderBy = orderBy, filter = ("%"+filter+"%")),
orderBy, filter
))
}
}
Tools and frameworks: Spray
- REST/HTTP-based integration layers on top of Scala and Akka
- DSL for routes spec
- Fully asynchronous, non-blocking, actor and Future based
- Lightweight and modular (loosely coupled)
val listRoute = pathPrefix(PREFIX) {
(path("list") & get) {
respondWithMediaType(MediaTypes.`application/json`) {
onComplete(projectDS.fetchAll()) {
case Success(f) => complete(f)
case Failure(ex) => complete(InternalServerError, ex.getMessage)
}
}
}
}
val deleteRoute = pathPrefix(PREFIX) {
(path("delete" / IntNumber) & delete) { pid =>
respondWithMediaType(MediaTypes.`application/json`) {
complete(projectDS.delete(pid))
}
}
}
val routes = listRoute ~ addRoute ~ deleteRoute
Big data
A couple of applications
- Smart Cities – traffic analytics, congestion prediction and travel time
- Oil & Gas – automated actions to avoid potential equipment failures
- Cybersecurity – network packet analysis for intrusion detection
- Industrial automation – offering online analytics and predictive actions for patterns of manufacturing plant issues and quality problems
- Telecoms – call rating, fraud detection, QoS monitoring from CDR and network performance data, root cause analysis
- Cloud infrastructure and web clickstream analysis for IT Operations
- Credit cards - fraud detection, online alerts, anomaly detection
Big data: a term for data sets that are so large or complex that traditional data processing applications are inadequate
Big data pipeline
What we want from a Big data pipeline system
- Scale to the order of magnitude of billions of events per day
- Guarantee message delivery even in the face of errors to avoid inconsistencies, with fault-tolerance mechanisms
- Process data in parallel, for us to take advantage of multi-core or multiprocessor architectures
- Be runnable on commodity hardware
- Provide near linear scalability
- Allow arbitrary analytics
Queue system
What we want from a Big data queue / message system
- Send and resend messages in a high level fashion
- Retain messages for a configurable period of time (day, week, month)
- Dispatch persisted messages when needed, to multiple clients in a distributed manner
- Be scalable and distributed
- Process data in parallel
- Be runnable on commodity hardware
- Provide near linear scalability
Apache Kafka
- Developed at LinkedIn because RabbitMQ did not scale to the order of billions of messages
- Distributed by design, offers strong durability and fault-tolerance guarantees
- A single broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
- Designed to allow a single cluster to serve as the central data backbone for a large organization
- Elastically and transparently expanded without downtime
Apache Kafka
- 789k messages per second using three commodity nodes
- At LinkedIn: 7M writes / sec, 35M reads / sec
- Well tested, open source benchmarks available
- Linear scalability, throughput is not affected by increasing data size
Stream processing
What we want from a stream processing system
- Process, enrich, geolocate, aggregate tons of events/messages/logs
- In-memory, continuous processing of time-series data streams
- Scalability through efficient distributed execution over multiple cores and multiple servers
- Allow fast, arbitrary analytics
- Ideally the tool should allow the compute units to be implemented in a range of programming languages
- Computation units should be close to the idea of a CEP system or the actor model
analysis of data in motion
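As a toy illustration of the "enrich, then aggregate continuously" requirement, the sketch below consumes events one at a time from an Iterator and keeps only running totals in memory. It is plain Scala, not a real streaming engine. The `Event` shape, the per-minute window and all names are assumptions made for the example.

```scala
object StreamSketch {
  case class Event(timestampMs: Long, user: String, bytes: Long)
  case class Enriched(event: Event, minute: Long)

  // Enrichment step: derive the one-minute window the event belongs to.
  def enrich(e: Event): Enriched = Enriched(e, e.timestampMs / 60000)

  // Aggregation step: fold over the stream, holding only the totals
  // per window rather than the events themselves.
  def bytesPerMinute(events: Iterator[Event]): Map[Long, Long] =
    events.map(enrich).foldLeft(Map.empty[Long, Long]) { (acc, e) =>
      acc.updated(e.minute, acc.getOrElse(e.minute, 0L) + e.event.bytes)
    }
}
```

A production system would distribute this across cores and servers and handle late or out-of-order events, which this sketch ignores.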
Apache Spark
- A fast, open source and general engine for large-scale data processing
- Provides an API centered on resilient distributed datasets (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way
- Immutable datasets equipped with functional transformers
- map, flatMap, filter, reduce, union, intersection, aggregate, sum...
- Spark is lazy, Scala is strict
The Ultimate Scala Collections
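The transformer names above are the same ones found on ordinary Scala collections. The sketch below uses only the standard library (no Spark dependency) to show the style: a word count written with `flatMap` and `groupBy`, and a `view` standing in, loosely, for the way Spark defers transformations until a result is requested. The data and names are illustrative.

```scala
object RddStyleDemo {
  val lines = List("scala spark kafka", "spark kafka", "spark")

  // Word count with functional transformers on an immutable collection.
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  // A view records the transformation without running it.
  var evaluated = 0
  val pipeline = (1 to 1000).view.map { n => evaluated += 1; n * 2 }

  // Forcing the view (comparable to a Spark action) triggers evaluation,
  // and only for the elements that are actually needed.
  def firstThree: List[Int] = pipeline.take(3).toList
}
```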
Elasticsearch
- Store document-based, timestamp-oriented data
- Out-of-the-box aggregation functionalities
- Index and search text messages using a wide range of operators (equality, operators, regexes, booleans)
- Distributed, easily scalable, highly available (near) realtime search engine
- RESTful API (usable from any programming language)
- Open source, multitenant
- Fault-tolerant, replicated
- Several client implementations
- Schema on demand
Demo
Questions??
Thanks!
github.com/airtonjal
airtonjal@gmail.com
airton_liborio@mckinsey.com
Scala: Big Data swiss army knife
Scala (scalable language) is a programming language born out of the combination of the object oriented and functional paradigms, executed on top of the Java Virtual Machine. In this talk we'll present how powerful mechanisms implemented in the language (lambdas, currying, immutability, laziness, pattern matching) support the development of Big Data applications. We will show an end-to-end ingestion and data processing architecture with Kafka, Spark and Elasticsearch, with a mini demo.