Scala: Big Data swiss army knife

Airton Libório

Data Analytics Lead @ McKinsey Digital Labs

Short Bio

  • MSc in Computer Science (Distributed Systems), PUC-Rio
  • Previous work in:
    Telco: call logs pipeline, self-organising networks, geoprocessing
    Energy: network optimization, genetic algorithms
    Startups: PSafe (Antivirus, cloud storage)
  • Data Analytics Lead @ McKinsey Digital Labs Latam
  • Big Data, large-scale systems, service-oriented architectures, open source, backend

Agenda

Scala

DEMO!

What is Functional Programming about

  • Functions (not objects or procedures) are used as the fundamental building blocks of a program
  • Assignment-less programming
  • Functions don't have side effects
  • Functions are first-class citizens
  • Functions are higher order
  • Correct and expressive programs
  • Programs broken down into smaller pieces
  • FP evaluates expressions, whereas imperative programming (IP) executes statements that modify global state
  • Referential transparency: if y = f(x) and g = h(y, y), then g = h(f(x), f(x))
  • Distributed systems love immutable data
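
A minimal sketch of the points above. All names (`FpSketch`, `square`, `applyTwice`, `h`) are hypothetical and chosen only for illustration; the sketch shows functions as values, a higher-order function, and the referential transparency substitution.

```scala
object FpSketch {
  // A pure function: the result depends only on the argument, no side effects.
  def square(x: Int): Int = x * x

  // Functions are first-class values: they can be stored in a val.
  val increment: Int => Int = _ + 1

  // A higher-order function: takes a function and returns a new one.
  def applyTwice(f: Int => Int): Int => Int = x => f(f(x))

  // Referential transparency: y can be replaced by square(3) without changing the result.
  def h(a: Int, b: Int): Int = a + b
  val y: Int = square(3)
  val g1: Int = h(y, y)
  val g2: Int = h(square(3), square(3))
}
```

Because `square` has no side effects, `g1` and `g2` are interchangeable, which is what makes this style easy to reason about and to distribute.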

Scala

  • = (Object oriented + functional paradigm) / JVM
  • Seamless Java interoperation
  • Statically typed with dynamic features
  • Higher-order functions
  • Read Eval Print Loop
  • Duck typing (structural typing)
  • Deeply extensible
  • DSL Support
  • Strict

(scalable language)

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way

Immutability

Prefer vals, immutable objects, and methods without side effects. Reach for them first

  • Decoupling and isolation
  • Function arguments are val by default
  • Parallelism and multithreading
  • Accessing a mutable object from separate threads requires locking
  • Scalability!
// Bad
var x: ExpressionType = null
if (myBoolean) x = expr1 else x = expr2

// Good
val x = if (myBoolean) expr1 else expr2

Mutable vs. Immutable

// java.util.Date + Calendar (mutable)
val firstDate: Date = ...
val c = Calendar.getInstance()
c.setTime(firstDate)
c.add(Calendar.DATE, -1)
val dayAgo = c.getTime()

// Joda-Time (immutable)
val dateTime = new DateTime(firstDate)
val dayBefore = dateTime.minusDays(1)

Mutability

// Not everything should be solved with immutability though...
class Item{ ... }
class Player(var health: Int = 100,
             val items: mutable.Buffer[Item] = mutable.Buffer.empty)
val player = new Player()

// Mutability for performance is OK
def getFibs(n: Int): Seq[Int] = {
  val fibs = mutable.ArrayBuffer(1, 1)
  while (fibs.length < n) {
    fibs.append(fibs(fibs.length - 1) + fibs(fibs.length - 2))
  }
  fibs.toSeq // return as Seq (required on Scala 2.13, where Seq is immutable)
}

// But try to avoid this
def getFibs(n: Int, fibs: mutable.ArrayBuffer[Int]): Unit = {
  fibs.clear()
  fibs.append(1)
  fibs.append(1)
  while(fibs.length < n){
    fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
  }
}

// Mutate either the variable or the value
val myList = new mutable.ArrayBuffer[Int]()
var myList = immutable.List[Int](1, 2, 3)

// Bad! Both the variable and the value are mutable
var myList = mutable.ArrayBuffer[Int]()

Type inference vs. Explicit typing

// Type inference
val sum = 34 + 4 * 2

val list = List(1, 2, 3)

val map = Map("hey" -> list)

def succ(x: Int) = x + 1

// Explicit typing
val sum: Int = 34 + 4 * 2

val list: List[Int] = List(1, 2, 3)

val map: Map[String, List[Int]] = Map("hey" -> list)

def succ(x: Int): Int = x + 1

// Inner (nested) functions can use the enclosing parameter directly
def someComplexFunction(p: Parameter) = {
  def theFirstStep = {
    // do something, using parameter
  }
  def anotherStep = {
    // do something else
  }

  theFirstStep + anotherStep
}

Inner Functions

// The same logic with private helper methods instead of inner functions
def someComplexFunction(p: Parameter) =
  theFirstStep(p) + anotherStep(p)
 
private def theFirstStep(p: Parameter) = { 
  ...
}
 
private def anotherStep(p: Parameter) = {
  ...
}

Conciseness

// Java
public class MyUser {
  String myStr;
  Integer myInt;

  public MyUser(String myStr, Integer myInt) {
    this.myStr = myStr;
    this.myInt = myInt;
  }

  public String getMyStr() {
    return this.myStr;
  }
  
  public Integer getMyInt() {
    return this.myInt;
  }
}

// Scala
case class MyUser(myStr: String, myInt: Int)

Case classes

Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching

 

  • A Point class
  • A constructor taking at least two parameters
  • Can be invoked with named parameters
  • Getters for x, y and z
  • A hashCode method
  • An equals method
  • A copy method for creating modified copies of immutable instances
case class Point(x: Int, y: Int, z: Int = 0)
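
The features listed above can be exercised directly. This is a small sketch; the wrapper object `CaseClassSketch` and the value names are hypothetical.

```scala
object CaseClassSketch {
  case class Point(x: Int, y: Int, z: Int = 0)

  val p = Point(1, 2)           // no `new`; z falls back to its default value
  val q = Point(y = 2, x = 1)   // named parameters, in any order
  val moved = p.copy(z = 5)     // copy with one field changed; p itself is untouched

  val sameValue = p == q                    // structural equality from the generated equals
  val sameHash = p.hashCode == q.hashCode   // consistent with equals
  val text = p.toString                     // "Point(1,2,0)"
}
```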

Pattern matching

(...) tests whether a given value (or sequence of values) has the shape defined by a pattern, and, if it does, binds the variables in the pattern to the corresponding components of the value (or sequence of values)

// Mixed
case class Player(name: String, score: Int)
def message(player: Player) = player match {
  case Player(_, score) if score > 100000 => "Get a job, dude!"
  case Player(name, _) => "Hey " + name + ", nice to see you again!"
}

// With collections
val list = List(0, 4, 5)
list match {
  case List(0, _, _) => println("found it")
  case _ =>
}

// Value matching
val sign = ch match {
  case '+' => 1
  case '-' => -1
  case _   => 0
}

// Type matching
obj match {
  case x: Int => x
  case s: String => Integer.parseInt(s)
  case _: BigInt => Int.MaxValue
  case _ => 0
}

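// Caution: type arguments are erased at runtime, so this returns true for any Map
// (the compiler reports an unchecked warning)
def isIntIntMap(x: Any) = x match {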
  case m: Map[Int, Int] => true
  case _ => false
}

Pattern matching

sealed abstract class Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape
case class Triangle(base: Double, height: Double) extends Shape

def area(shape: Shape): Double = {
  shape match {
    case Circle(radius) => math.Pi * math.pow(radius, 2.0)
    case Rectangle(1, height) => height
    case Rectangle(width, 1) => width
    case Rectangle(width, height) => width * height
    case Triangle(0, _) | Triangle(_, 0) => 0
    case Triangle(base, height) => height * base / 2
  }
}

The Option[T] type

case class User(id: Int, name: String, age: Int, gender: Option[String])

object UserRepository {
  private val users = Map(1 -> User(1, "John Doe", 32, Some("male")),
                          2 -> User(2, "Johanna Doe", 30, None))

  users(2).gender match {
    case Some(gender) => println("Gender: " + gender)
    case None => println("Gender: not specified")
  }

  def findById(id: Int): Option[User] = users.get(id)
  def findAll = users.values
}

UserRepository.findById(2).foreach(user => println(user.age)) // prints 30

for {
  User(_, _, _, Some(gender)) <- UserRepository.findAll
} yield gender
  • Container for an optional value of type T, i.e. values that may be present or not
  • If the value of type T is present, Option[T] is an instance of Some[T]
  • If the value is absent, Option[T] is the object None

Laziness

class X { val x = { Thread.sleep(2000); 15 } }
class Y { lazy val y = { Thread.sleep(2000); 13 } }

new X  // we have to wait two seconds for the result
new Y  // returns instantly; y is computed on first access

// Expression only evaluated if needed (by-name parameter)
def logMsg(str: => String): Unit = { ... }
def expensive: String = { ... }
logMsg(s"Some $expensive message!")
  • Some languages (like Haskell) are lazy: every expression’s evaluation waits for its (first) use

  • Scala is strict by default, but lazy if explicitly specified for given variables or parameters

  • Laziness is made of lambdas – anonymous functions closed over their lexical scope
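
A small sketch of both forms of explicit laziness, assuming a hypothetical counter `evaluations` to make the evaluation order observable. `LazinessSketch`, `expensive` and this variant of `logMsg` are illustrative names, not part of any library.

```scala
object LazinessSketch {
  var evaluations = 0   // counts how often the expensive body actually runs

  def expensive(): String = { evaluations += 1; "expensive" }

  // By-name parameter: msg is a hidden lambda, evaluated only if (and each time) it is used.
  def logMsg(enabled: Boolean, msg: => String): Option[String] =
    if (enabled) Some(msg) else None

  // lazy val: evaluated on first access, then cached.
  lazy val cached: String = expensive()
}
```

Calling `logMsg(false, expensive())` never runs `expensive()`, while reading `cached` twice runs it only once.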

Collections

  • Scala has a rich collections library
  • Containers can be sequenced, linear sets of items such as List, Tuple, Option, Map, etc.
  • Collections may have an arbitrary number of elements or be bounded to zero or one element (e.g., Option)
  • Either strict or lazy, mutable or immutable
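
The two axes in the last bullet can be sketched as follows. The names (`CollectionsSketch`, `strictCalls`, `lazyCalls`) are hypothetical, and the counters exist only to show how much work each variant performs.

```scala
import scala.collection.mutable

object CollectionsSketch {
  // Immutable: "adding" returns a new list; the original is unchanged.
  val base = List(1, 2, 3)
  val extended = 0 :: base

  // Mutable: the buffer is updated in place.
  val buffer = mutable.ArrayBuffer(1, 2, 3)
  buffer += 4

  // Strict: map runs over all 1000 elements before head is taken.
  var strictCalls = 0
  val strictFirst = (1 to 1000).map { n => strictCalls += 1; n * 2 }.head

  // Lazy: a view defers the mapping, so head only computes the first element.
  var lazyCalls = 0
  val lazyFirst = (1 to 1000).view.map { n => lazyCalls += 1; n * 2 }.head
}
```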

Parallelism

  • Concurrency != Parallelism

  • Avoid concurrency like the plague it is!!!

  • Parallelism is about speeding up a program by using multiple processors

use Parallelism if you can, Concurrency otherwise
(Haskell wiki)
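
One possible way to get parallelism without shared mutable state is to split a pure computation into independent chunks and run each in a Future. This is only a sketch under that assumption; `ParallelismSketch`, `sumOfSquares` and `parallelTotal` are hypothetical names, and blocking with `Await` is used just to keep the example short.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelismSketch {
  // A pure, CPU-bound function: safe to run on several cores at once.
  def sumOfSquares(range: Range): Long =
    range.foldLeft(0L)((acc, n) => acc + n.toLong * n)

  // One Future per chunk; the partial results are combined at the end.
  def parallelTotal(chunks: Seq[Range]): Long = {
    val partials: Future[Seq[Long]] =
      Future.traverse(chunks)(chunk => Future(sumOfSquares(chunk)))
    Await.result(partials, 10.seconds).sum
  }
}
```

Since the chunks share nothing, no locking is needed and the result is the same as the sequential computation.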

Currying

  • Currying: transforming a function that takes multiple arguments into a sequence of functions that each take a single argument
  • Partial application is different: it fixes some of the arguments of a function and returns a new function that takes only the remaining ones

 

f: ( X x Y x Z ) -> N

   currying produces...

curry(f): X -> (Y -> (Z -> N ))
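
In Scala this can be sketched as below, taking X, Y, Z and N all to be Int for simplicity. The names (`CurryingSketch`, `f`, `add`, `addFive`) are hypothetical.

```scala
object CurryingSketch {
  // f: (X x Y x Z) -> N
  val f: (Int, Int, Int) => Int = (x, y, z) => x * 100 + y * 10 + z

  // Currying: X -> (Y -> (Z -> N)), one argument at a time.
  val curried: Int => Int => Int => Int = f.curried

  // Partial application: fix x now, supply y and z together later.
  val partial: (Int, Int) => Int = f(1, _, _)

  // Methods can also be declared in curried form with multiple parameter lists.
  def add(x: Int)(y: Int): Int = x + y
  val addFive: Int => Int = add(5)
}
```

Note the difference in shape: `curried(1)` still expects one argument at a time, while `partial` expects the two remaining arguments at once.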

One liners

(1 to 10) map { _ * 2 }

(1 to 1000).reduceLeft( _ + _ )
(1 to 1000).sum

import scala.io.Source

val fileText  = Source.fromFile("file.txt").mkString
val fileLines = Source.fromFile("file.txt").getLines.toList

List(14, 35, -7, 46, 98).reduceLeft ( _ min _ )
List(14, 35, -7, 46, 98).min


// Check whether any of the words occurs in a String
val wordList = List("scala", "akka", "play framework", "sbt", "typesafe")
val tweet = "This is an example tweet talking about scala and sbt."

(wordList.foldLeft(false)( _ || tweet.contains(_) ))
wordList.exists(tweet.contains)

val pangram = "The quick brown fox jumps over the lazy dog"
(pangram split " ") filter (_ contains 'o')


val m = pangram filter (_.isLetter) groupBy (_.toLower) mapValues (_.size)
m.toSeq sortBy (_._2)
m.toSeq sortWith (_._2 > _._2)
m.filter(_._2 > 1).toSeq sortWith (_._2 > _._2) mkString "\n"


Source.fromURL("https://github.com/humans.txt").take(335).mkString

Misc

Programs must be written for people to read, and only incidentally for machines to execute
-- Harold Abelson

Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration
-- Stan Kelly-Bootle

A programming language is low level when its programs require attention to the irrelevant
-- Alan J. Perlis

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live
-- Martin Golding

Tools and frameworks

Akka

  • A toolkit for building highly concurrent message systems
  • Asynchronous, Resilient and Distributed by Design
  • High-level abstractions like Actors, Streams and Futures
  • Adaptive cluster management, load balancing, routing, partitioning and sharding
  • Highly performant, elastic and decentralized
  • Futures support
import akka.actor.Actor

class MyActor extends Actor {
  // receive is a partial function over the incoming messages
  def receive = {
    case HttpRequest(request) =>
      val query = buildQuery(request)
      dbCall(query)
    case DbResponse(dbResponse) =>
      val wsRequest = buildWebServiceRequest(dbResponse)
      wsCall(wsRequest)
    case WsResponse(wsResponse) => sendReply(wsResponse)
  }
}

Tools and frameworks

Slick

  • Functional-relational mapping for Scala
  • Compile time static type checking
  • Seamless manipulation in a collections manner
  • Query compiler for different databases (MySQL, PostgreSQL,...)
// Definition of the COFFEES table
class Coffees(tag: Tag) extends Table[(String, Int, Double, Int, Int)](tag, "COFFEES") {
  def name = column[String]("COF_NAME", O.PrimaryKey)
  def supID = column[Int]("SUP_ID")
  def price = column[Double]("PRICE")
  def sales = column[Int]("SALES")
  def total = column[Int]("TOTAL")
  def * = (name, supID, price, sales, total)
  // A reified foreign key relation that can be navigated to create a join
  def supplier = foreignKey("SUP_FK", supID, suppliers)(_.id)
}
val coffees = TableQuery[Coffees]

coffees ++= Seq(
  ("Colombian",         101, 7.99, 0, 0),
  ("French_Roast_Decaf", 49, 9.99, 0, 0)
)

// Coffee is assumed here to be an alias for the row type of the COFFEES table
type Coffee = (String, Int, Double, Int, Int)

def fetchAll() = { for (c <- coffees) yield c }
def fetch(name: String) = { coffees.filter(_.name === name).result }
def insert(c: Coffee) = { coffees += c }
def insert(rows: Seq[Coffee]) = { (coffees ++= rows).transactionally }
def delete(supID: Int) = { coffees.filter(_.supID === supID).delete }

for {
  c <- coffees if c.price < 9.0
  s <- suppliers if s.id === c.supID
} yield (c.name, s.name)

Tools and frameworks

Play Framework

  • Lightweight, stateless web framework
  • Built on top of Akka
  • Non-blocking I/O
  • RESTful by default
class HomeController @Inject() (computerService: ComputerService,
                                companyService: CompanyService,
                                val messagesApi: MessagesApi)
  extends Controller with I18nSupport {

  val computerForm = Form(
    mapping(
      "id" -> ignored(None:Option[Long]),
      "name" -> nonEmptyText,
      "introduced" -> optional(date("yyyy-MM-dd")),
      "discontinued" -> optional(date("yyyy-MM-dd")),
      "company" -> optional(longNumber)
    )(Computer.apply)(Computer.unapply)
  )

  def edit(id: Long) = Action {
    computerService.findById(id).map { computer =>
      Ok(html.editForm(id, computerForm.fill(computer), companyService.options))
    }.getOrElse(NotFound)
  }

  def list(page: Int, orderBy: Int, filter: String) = Action { implicit request =>
    Ok(html.list(
      computerService.list(page = page, orderBy = orderBy, filter = ("%"+filter+"%")),
      orderBy, filter
    ))
  }
}

Tools and frameworks

Spray

  • REST/HTTP-based integration layers on top of Scala and Akka
  • DSL for routes spec
  • Fully asynchronous, non-blocking, actor and Future based
  • Lightweight and modular (loosely coupled)
val listRoute = pathPrefix(PREFIX) {
  (path("list") & get) {
    respondWithMediaType(MediaTypes.`application/json`) {
      onComplete(projectDS.fetchAll()) {
        case Success(f) => complete(f)
        case Failure(ex) => complete(InternalServerError, ex.getMessage)
      }
    }
  }
}

val deleteRoute = pathPrefix(PREFIX) {
  (path("delete" / IntNumber) & delete) { pid =>
    respondWithMediaType(MediaTypes.`application/json`) {
      complete(projectDS.delete(pid))
    }
  }
}

val routes = listRoute ~ addRoute ~ deleteRoute

Big data

A couple of applications

  • Smart Cities – traffic analytics, congestion prediction and travel time
  • Oil & Gas – automated actions to avoid potential equipment failures
  • Cybersecurity – network packet analysis for intrusion detection
  • Industrial automation – online analytics and predictive actions for patterns of manufacturing plant issues and quality problems
  • Telecoms – call rating, fraud detection, QoS monitoring from CDR and network performance data, root cause analysis
  • Cloud infrastructure and web clickstream analysis for IT Operations
  • Credit cards – fraud detection, online alerts, anomaly detection

a term for data sets that are so large or complex that traditional data processing applications are inadequate

Big data pipeline

What we want from a Big data pipeline system

  • Scale to billions of events per day
  • Guarantee message delivery even in the face of errors, using fault-tolerance mechanisms to avoid inconsistencies
  • Process data in parallel, to take advantage of multi-core or multiprocessor architectures
  • Be runnable on commodity hardware
  • Provide near linear scalability
  • Allow arbitrary analytics

Queue system

What we want from a Big data queue / message system

  • Send and resend messages in a high-level fashion
  • Retain messages for a configurable period of time (day, week, month)
  • Dispatch persisted messages when needed, to multiple clients in a distributed manner
  • Be scalable and distributed
  • Process data in parallel
  • Be runnable on commodity hardware
  • Provide near linear scalability

Apache Kafka

  • Developed at LinkedIn because RabbitMQ did not scale to the order of billions of messages
  • Distributed by design, offers strong durability and fault-tolerance guarantees
  • A single broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
  • Designed to allow a single cluster to serve as the central data backbone for a large organization
  • Can be elastically and transparently expanded without downtime

Apache Kafka

  • 789k messages per second using three commodity nodes
  • At LinkedIn: 7M writes / sec, 35M reads / sec
  • Well tested, open source benchmarks available
  • Linear scalability, throughput is not affected by increasing data size

Stream processing

What we want from a stream processing system

  • Process, enrich, geolocate, aggregate tons of events/messages/logs
  • In-memory, continuous processing of time-series data streams
  • Scalability through efficient distributed execution over multiple cores and multiple servers
  • Allow fast, arbitrary analytics
  • Ideally the tool should allow the compute units to be implemented in a range of programming languages
  • Computation units should be close to the idea of a CEP (complex event processing) system or the actor model

analysis of data in motion

Apache Spark

  • A fast, open source and general engine for large-scale data processing
  • Provides an API centered on resilient distributed datasets (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way
  • Immutable datasets equipped with functional transformers
  • map, flatMap, filter, reduce, union, intersection, aggregate, sum...
  • Spark is lazy, Scala is strict

The Ultimate Scala Collections
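
To illustrate the resemblance, here is a word count written against plain Scala collections. This is deliberately not Spark code: no Spark dependency is used, and `TransformersSketch`, `lines` and `counts` are hypothetical names. The point is only that the chain of transformers has the same shape one would write against an RDD.

```scala
object TransformersSketch {
  val lines = List("scala spark kafka", "spark elasticsearch", "scala spark")

  // flatMap -> filter -> map -> group -> reduce, all on immutable data.
  val counts: Map[String, Int] =
    lines
      .flatMap(_.split(" ").toList)   // one element per word
      .filter(_.nonEmpty)             // drop empty tokens
      .map(word => (word, 1))         // pair each word with a count of 1
      .groupBy(_._1)                  // group the pairs by word
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }
}
```

Here every step runs eagerly on a single machine, whereas Spark would defer the transformations and distribute them across the cluster.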

Elasticsearch

  • Store document-based, timestamp-oriented data
  • Out-of-the-box aggregation functionalities
  • Index and search text messages using a wide range of operators (equality, operators, regexes, booleans)
  • Distributed, easily scalable, highly available (near) realtime search engine
  • RESTful API (usable from any programming language)
  • Open source, multitenant
  • Fault-tolerant, replicated
  • Several client implementations
  • Schema on demand

Demo

Questions??

Thanks!

github.com/airtonjal

airtonjal@gmail.com

airton_liborio@mckinsey.com

Scala: Big Data swiss army knife

Scala (scalable language) is a programming language born out of the combination of the object oriented and functional paradigms, executed on top of the Java Virtual Machine. In this talk we'll present how powerful mechanisms implemented in the language (lambdas, currying, immutability, laziness, pattern matching) support the development of Big Data applications. We will show an end-to-end ingestion and data processing architecture with Kafka, Spark and Elasticsearch, with a mini demo.
