Scala: Big Data swiss army knife

Airton Libório

 

What is Functional Programming about

  • Functions (not objects or procedures) are used as the fundamental building blocks of a program
  • Assignment-less programming
  • Functions don't have side effects
  • Functions are first-class citizens
  • Functions are higher order
  • Correct and expressive programs
  • Programs broken down into smaller pieces
  • FP evaluates expressions, whereas imperative programming (IP) executes statements that modify global state
  • Referential transparency: if y = f(x) and g = h(y, y), then g = h(f(x), f(x))
  • Distributed systems love immutable data
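
A minimal sketch of three of the ideas above: higher-order functions, functions as first-class values, and referential transparency. The names FpSketch, twice, f, h and addFour are illustrative only and do not come from the talk.

```scala
object FpSketch {
  // A higher-order function: takes a function and returns a new one.
  def twice(fn: Int => Int): Int => Int = x => fn(fn(x))

  // Pure functions: same input, same output, no side effects.
  def f(x: Int): Int = x * 2
  def h(a: Int, b: Int): Int = a + b

  // Referential transparency: y can be replaced by f(x) without changing the result.
  val x = 5
  val y = f(x)
  val viaName = h(y, y)
  val viaSubstitution = h(f(x), f(x))

  // Functions are first-class values: they can be stored and passed around.
  val addFour: Int => Int = twice(_ + 2)
}
```

Because f has no side effects, viaName and viaSubstitution are guaranteed to be equal, which is what makes this kind of code easy to reason about and to distribute.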

Scala

  • = (Object-oriented + functional paradigms) / JVM
  • Seamless Java interoperation
  • Statically typed with dynamic features
  • Higher-order functions
  • Read-Eval-Print Loop (REPL)
  • Duck typing (structural typing)
  • Deeply extensible
  • DSL Support
  • Strict

(scalable language)

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way
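
As a rough illustration of the "duck typing (structural typing)" item, the sketch below accepts any value that has a sound member, without a shared trait. Duck, Robot and describe are hypothetical names; the sketch assumes the scala.language.reflectiveCalls import is available, since structural member access goes through reflection.

```scala
import scala.language.reflectiveCalls

object DuckTypingSketch {
  class Duck  { def sound: String = "Quack" }
  class Robot { def sound: String = "Beep" }

  // Accepts any value with a `sound: String` member; no common supertype is required.
  def describe(thing: { def sound: String }): String = s"It says ${thing.sound}"

  val results = List(describe(new Duck), describe(new Robot))
}
```

The check is still static: passing a value without a sound member is rejected at compile time, which is what separates this from duck typing in dynamic languages.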

Immutability

Prefer vals, immutable objects, and methods without side effects. Reach for them first

  • Decoupling and isolation
  • Function arguments are val by default
  • Parallelism and multithreading
  • Accessing a mutable object from separate threads requires locking
  • Scalability!
// Bad
var x: ExpressionType = null
if (myBoolean) x = expr1 else x = expr2

// Good
val x = if (myBoolean) expr1 else expr2

Mutable vs. immutable

// java.util.Date with Calendar (mutable)
val firstDate: Date = ...
val c = Calendar.getInstance()
c.setTime(firstDate)
c.add(Calendar.DATE, -1)
val dayAgo = c.getTime()

// Joda-Time (immutable)
val dateTime = new DateTime(firstDate)
val dayBefore = dateTime.minusDays(1)

Mutability

// Not everything should be solved with immutability though...
class Item{ ... }
class Player(var health: Int = 100,
             val items: mutable.Buffer[Item] = mutable.Buffer.empty)
val player = new Player()

// Mutability for performance is OK
def getFibs(n: Int): Seq[Int] = {
  val fibs = mutable.ArrayBuffer(1, 1)
  while(fibs.length < n){
    fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
  }
  fibs.toSeq // expose the result as a Seq, keeping the buffer local
}

// But try to avoid this
def getFibs(n: Int, fibs: mutable.ArrayBuffer[Int]): Unit = {
  fibs.clear()
  fibs.append(1)
  fibs.append(1)
  while(fibs.length < n){
    fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
  }
}

// Mutate either the variable or the value
val myList = new mutable.ArrayBuffer[Int]()
var myList = immutable.List[Int](1, 2, 3)

// Bad!
var myList = mutable.ArrayBuffer[Int]() // both the reference and the value can change

Type inference vs. explicit typing

// Inferred types
val sum = 34 + 4 * 2

val list = List(1, 2, 3)

val map = Map("hey" -> list)

def succ(x: Int) = x + 1

// Explicit types
val sum: Int = 34 + 4 * 2

val list: List[Int] = List(1, 2, 3)

val map: Map[String, List[Int]] = Map("hey" -> list)

def succ(x: Int): Int = x + 1
Inner Functions

// With inner functions: the steps are scoped to someComplexFunction and can use p directly
def someComplexFunction(p: Parameter) = {
  def theFirstStep = {
    // do something, using parameter
  }
  def anotherStep = {
    // do something else
  }

  theFirstStep + anotherStep
}

// With private helper methods: p must be passed explicitly
def someComplexFunction(p: Parameter) =
  theFirstStep(p) + anotherStep(p)

private def theFirstStep(p: Parameter) = {
  ...
}

private def anotherStep(p: Parameter) = {
  ...
}

Case classes

Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching

 

  • A Point class
  • A constructor taking at least two parameters
  • Can be invoked with named parameters
  • Getters for x, y and z
  • A hashCode method
  • An equals method
  • A method for creating copies of immutable instances

case class Point(x: Int, y: Int, z: Int = 0)
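
A short sketch of what that one-line definition provides. PointSketch, a, b, moved and describe are illustrative names and not part of the original slides.

```scala
object PointSketch {
  case class Point(x: Int, y: Int, z: Int = 0)

  val a = Point(1, 2)            // z falls back to its default, 0
  val b = Point(y = 2, x = 1)    // named parameters, in any order
  val moved = a.copy(z = 5)      // a new instance; `a` itself is unchanged

  // Pattern matching decomposes an instance through the generated extractor.
  def describe(p: Point): String = p match {
    case Point(0, 0, 0) => "origin"
    case Point(x, y, z) => s"($x, $y, $z)"
  }
}
```

Here a == b holds because the generated equals compares fields, not references, and equal instances share the same hashCode.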

Pattern matching

(...) tests whether a given value (or sequence of values) has the shape defined by a pattern, and, if it does, binds the variables in the pattern to the corresponding components of the value (or sequence of values)

// Mixed
case class Player(name: String, score: Int)
def message(player: Player) = player match {
  case Player(_, score) if score > 100000 => "Get a job, dude!"
  case Player(name, _) => "Hey " + name + ", nice to see you again!"
}

// With collections
val list = List(0, 4, 5)
list match {
  case List(0, _, _) => println("found it")
  case _ =>
}
// Value matching
val sign = ch match {
  case '+' => 1
  case '-' => -1
  case _   => 0
}

// Type matching
obj match {
  case x: Int => x
  case s: String => Integer.parseInt(s)
  case _: BigInt => Int.MaxValue
  case _ => 0
}

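// Caution: type arguments are erased at runtime, so this matches any Map,
// and the compiler emits an "unchecked" warning
def isIntIntMap(x: Any) = x match {
  case m: Map[Int, Int] => true
  case _ => false
}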
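
The last snippet is worth trying out: because of type erasure on the JVM, the type arguments of Map cannot be checked at runtime. The sketch below, with the hypothetical names ErasureSketch and isMap, shows the surprising result next to a pattern that only claims what can actually be verified.

```scala
object ErasureSketch {
  // Compiles with an "unchecked" warning: the Int, Int part cannot be tested at runtime.
  def isIntIntMap(x: Any): Boolean = x match {
    case m: Map[Int, Int] => true
    case _ => false
  }

  // Honest about what is checkable: only that the value is some kind of Map.
  def isMap(x: Any): Boolean = x match {
    case _: Map[_, _] => true
    case _ => false
  }
}
```

Calling ErasureSketch.isIntIntMap(Map("a" -> "b")) returns true even though the map holds strings.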

The Option[T] type

case class User(id: Int, name: String, age: Int, gender: Option[String])

object UserRepository {
  private val users = Map(1 -> User(1, "John Doe", 32, Some("male")),
                          2 -> User(2, "Johanna Doe", 30, None))

  users(2).gender match {
    case Some(gender) => println("Gender: " + gender)
    case None => println("Gender: not specified")
  }

  def findById(id: Int): Option[User] = users.get(id)
  def findAll = users.values
}

UserRepository.findById(2).foreach(user => println(user.age)) // prints 30

// User has four fields, so the pattern takes four arguments
for {
  User(_, _, _, Some(gender)) <- UserRepository.findAll
} yield gender

  • Container for an optional value of type T, i.e. values that may be present or not
  • If the value of type T is present, Option[T] is an instance of Some[T]
  • If the value is absent, Option[T] is the object None
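
Besides pattern matching, an Option is often handled with combinators. The sketch below is a simplified variant of the repository above (User here has no age field); OptionSketch, genderOf and genderLabel are illustrative names.

```scala
object OptionSketch {
  case class User(id: Int, name: String, gender: Option[String])

  val users = Map(
    1 -> User(1, "John Doe", Some("male")),
    2 -> User(2, "Johanna Doe", None)
  )

  // flatMap chains two lookups that may each come back empty.
  def genderOf(id: Int): Option[String] = users.get(id).flatMap(_.gender)

  // getOrElse supplies the default only at the point where a plain value is needed.
  def genderLabel(id: Int): String = genderOf(id).getOrElse("not specified")
}
```

A missing user and a user without a gender both end up as None, so the caller needs a single default instead of two null checks.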

Laziness

class X { val x = { Thread.sleep(2000); 15 } }
class Y { lazy val y = { Thread.sleep(2000); 13 } }

new X  // we have to wait two seconds for the result
new Y  // Returns instantly

// Expression only evaluated if needed (by-name parameter)
def logMsg(str: => String): Unit = { ... }
def expensive: String = { ... }
logMsg(s"Some $expensive message!")

  • Some languages (like Haskell) are lazy: every expression’s evaluation waits for its (first) use

  • Scala is strict by default, but lazy if explicitly specified for given variables or parameters

  • Laziness is made of lambdas – anonymous functions closed over their lexical scope
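
A small sketch of the two explicit forms of laziness mentioned above, a by-name parameter and a lazy val. The evaluations counter is a mutable variable added purely to make the evaluation order observable; LazinessSketch and the enabled flag are assumptions of this example, not part of the talk.

```scala
object LazinessSketch {
  // Counts how often the expensive computation actually runs (demo only).
  var evaluations = 0

  def expensive(): String = { evaluations += 1; "expensive" }

  // By-name parameter: `msg` is evaluated only if and when the body uses it.
  def logMsg(enabled: Boolean)(msg: => String): Option[String] =
    if (enabled) Some(msg) else None

  // lazy val: the initializer runs on first access and the result is cached.
  lazy val cached: String = expensive()
}
```

With logging disabled, the interpolated string passed to logMsg is never built, so expensive() is not called at all; reading cached twice calls it only once.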

Collections

  • Scala has a rich collections library
  • Containers can be sequenced, linear sets of items, such as List, Tuple, Option, Map, etc.
  • Collections may have an arbitrary number of elements or be bounded to zero or one element (e.g., Option)
  • Either strict or lazy, mutable or immutable
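
The strict/lazy distinction can be sketched with a List and its view. The counters are mutable only so the difference becomes visible; CollectionsSketch and the counter names are illustrative.

```scala
object CollectionsSketch {
  var strictCalls = 0
  var lazyCalls = 0

  // Strict: map visits every element before take runs.
  val strictResult: List[Int] =
    List(1, 2, 3, 4, 5).map { n => strictCalls += 1; n * 10 }.take(2)

  // Lazy: the view defers map, so only the demanded elements are computed.
  val lazyResult: List[Int] =
    List(1, 2, 3, 4, 5).view.map { n => lazyCalls += 1; n * 10 }.take(2).toList
}
```

Both pipelines produce the same two elements, but the lazy one calls the mapping function fewer times.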

Parallelism

  • Concurrency != Parallelism

  • Avoid concurrency like the plague it is!!!

  • Parallelism is about speeding up a program by using multiple processors

use Parallelism if you can, Concurrency otherwise
(Haskell wiki)
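
One possible sketch of parallelism without shared mutable state: the input is split into chunks, each chunk is summed independently, and the partial results are combined. It uses Future from the standard library and the global execution context; ParallelSumSketch, the chunk size and the 10-second timeout are assumptions of this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSumSketch {
  // Pure, independent work per chunk: nothing is shared, so nothing needs a lock.
  def sumOfSquares(chunk: Seq[Long]): Long = chunk.map(n => n * n).sum

  def parallelSumOfSquares(data: Seq[Long], chunkSize: Int): Long = {
    val partials: List[Future[Long]] =
      data.grouped(chunkSize).toList.map(chunk => Future(sumOfSquares(chunk)))
    // Blocking is tolerable here only because this is the edge of a small demo.
    Await.result(Future.sequence(partials), 10.seconds).sum
  }
}
```

The result is the same as the sequential computation regardless of how the chunks are scheduled, because each chunk's work is a pure function of its input.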

Misc

Programs must be written for people to read, and only incidentally for machines to execute
-- Harold Abelson

Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration
-- Stan Kelly-Bootle

A programming language is low level when its programs require attention to the irrelevant
-- Alan J. Perlis

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live
-- Martin Golding

Tools and frameworks: Akka

  • A toolkit for building highly concurrent, message-driven systems
  • Asynchronous, resilient and distributed by design
  • High-level abstractions like Actors, Streams and Futures
  • Adaptive cluster management, load balancing, routing, partitioning and sharding
  • Highly performant, elastic and decentralized
  • Futures support

class MyActor extends Actor {
  // receive is a partial function over incoming messages
  def receive = {
    case HttpRequest(request) =>
      val query = buildQuery(request)
      dbCall(query)
    case DbResponse(dbResponse) =>
      val wsRequest = buildWebServiceRequest(dbResponse)
      wsCall(wsRequest)
    case WsResponse(wsResponse) => sendReply(wsResponse)
  }
}

Tools and frameworks: Slick

  • Functional-relational mapping for Scala
  • Compile time static type checking
  • Seamless manipulation in a collections manner
  • Query compiler for different databases (MySQL, PostgreSQL,...)
// Definition of the COFFEES table
class Coffees(tag: Tag) extends Table[(String, Int, Double, Int, Int)](tag, "COFFEES") {
  def name = column[String]("COF_NAME", O.PrimaryKey)
  def supID = column[Int]("SUP_ID")
  def price = column[Double]("PRICE")
  def sales = column[Int]("SALES")
  def total = column[Int]("TOTAL")
  def * = (name, supID, price, sales, total)
  // A reified foreign key relation that can be navigated to create a join
  def supplier = foreignKey("SUP_FK", supID, suppliers)(_.id)
}
val coffees = TableQuery[Coffees]

coffees ++= Seq(
  ("Colombian",         101, 7.99, 0, 0),
  ("French_Roast_Decaf", 49, 9.99, 0, 0)
)

// Coffee is assumed to alias the row type (String, Int, Double, Int, Int)
def fetchAll() = { for (c <- coffees) yield c }
def fetch(name: String) = { coffees.filter(_.name === name).result }
def insert(c: Coffee) = { coffees += c }
def insert(cs: Seq[Coffee]) = { (coffees ++= cs).transactionally }
def delete(supID: Int) = { coffees.filter(_.supID === supID).delete }

for {
  c <- coffees if c.price < 9.0
  s <- suppliers if s.id === c.supID
} yield (c.name, s.name)

Tools and frameworks: Play Framework

  • Lightweight, stateless web framework
  • Built on top of Akka
  • Non-blocking I/O
  • RESTful by default
class HomeController @Inject() (computerService: ComputerService,
                                companyService: CompanyService,
                                val messagesApi: MessagesApi)
  extends Controller with I18nSupport {

  val computerForm = Form(
    mapping(
      "id" -> ignored(None:Option[Long]),
      "name" -> nonEmptyText,
      "introduced" -> optional(date("yyyy-MM-dd")),
      "discontinued" -> optional(date("yyyy-MM-dd")),
      "company" -> optional(longNumber)
    )(Computer.apply)(Computer.unapply)
  )

  def edit(id: Long) = Action {
    computerService.findById(id).map { computer =>
      Ok(html.editForm(id, computerForm.fill(computer), companyService.options))
    }.getOrElse(NotFound)
  }

  def list(page: Int, orderBy: Int, filter: String) = Action { implicit request =>
    Ok(html.list(
      computerService.list(page = page, orderBy = orderBy, filter = ("%"+filter+"%")),
      orderBy, filter
    ))
  }
}

Tools and frameworks: Spray

  • REST/HTTP-based integration layers on top of Scala and Akka
  • DSL for routes spec
  • Fully asynchronous, non-blocking, actor and Future based
  • Lightweight and modular (loosely coupled)
val listRoute = pathPrefix(PREFIX) {
  (path("list") & get) {
    respondWithMediaType(MediaTypes.`application/json`) {
      onComplete(projectDS.fetchAll()) {
        case Success(f) => complete(f)
        case Failure(ex) => complete(InternalServerError, ex.getMessage)
      }
    }
  }
}

val deleteRoute = pathPrefix(PREFIX) {
  (path("delete" / IntNumber) & delete) { pid =>
    respondWithMediaType(MediaTypes.`application/json`) {
      complete(projectDS.delete(pid))
    }
  }
}

val routes = listRoute ~ addRoute ~ deleteRoute

Apache Spark

  • A fast, open source and general engine for large-scale data processing
  • Provides an API centered on resilient distributed datasets (RDDs): read-only multisets of data items, distributed over a cluster of machines and maintained in a fault-tolerant way
  • Immutable datasets equipped with functional transformers
  • map, flatMap, filter, reduce, union, intersection, aggregate, sum...
  • Spark is lazy, Scala is strict

The Ultimate Scala Collections
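
To keep the example dependency-free, the sketch below uses only standard Scala collections, not Spark itself. It is meant to show the style of "immutable datasets equipped with functional transformers" with operators named in the list above (flatMap, filter) plus groupBy and map; how closely a real RDD pipeline mirrors it is an assumption of this sketch. TransformerSketch and the sample lines are illustrative.

```scala
object TransformerSketch {
  val lines = List("to be or not to be", "to see or not to see")

  // Each step returns a new immutable collection; `lines` is never modified.
  val wordCounts: Map[String, Int] =
    lines
      .flatMap(_.split(" "))   // one element per word
      .filter(_.length > 2)    // drop short words
      .groupBy(identity)       // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

One difference to keep in mind, as the slide notes: this pipeline runs eagerly because Scala collections are strict, whereas Spark defers its transformations until a result is requested.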

Downsides

  • Lots of ways to solve the same problem (multi-paradigm)
  • Building is not very fast
  • Limited backward compatibility
  • Learning curve

Questions??

Thanks!

github.com/airtonjal

airtonjal@gmail.com

airton@contaquanto.com.br

Scala: Big Data swiss army knife

Scala (scalable language) is a programming language born out of the combination of the object-oriented and functional paradigms, executed on top of the Java Virtual Machine. In this talk we'll present how powerful mechanisms implemented in the language (lambdas, currying, immutability, laziness, pattern matching) support the development of Big Data applications. We will show an end-to-end ingestion and data processing architecture with Kafka, Spark and Elasticsearch, with a mini demo.
