Scala: Big Data Swiss Army Knife

Airton Libório

 

What is Functional Programming about

  • Functions (not objects or procedures) are used as the fundamental building blocks of a program
  • Assignment-less programming
  • Functions don't have side effects
  • Functions are first-class citizens
  • Functions are higher order
  • Correct and expressive programs
  • Programs broken down into smaller pieces
  • FP evaluates expressions, whereas imperative programming (IP) executes statements that modify global state
  • Referential transparency: if y = f(x) and g = h(y, y), then g = h(f(x), f(x))
  • Distributed systems love immutable data
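The bullets on first-class functions, higher-order functions, and referential transparency can be illustrated with a minimal sketch. The names FpBasics, square, applyTwice, and increment are invented for illustration and do not come from the talk.

```scala
object FpBasics {
  // A pure function: the result depends only on the argument, with no side effects.
  def square(x: Int): Int = x * x

  // A higher-order function: it takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Functions are first-class values: they can be stored in a val and passed around.
  val increment: Int => Int = _ + 1

  // Referential transparency: the name y can be replaced by square(3) anywhere
  // without changing the result.
  val y = square(3)
  val viaName = y + y
  val viaSubstitution = square(3) + square(3)

  def main(args: Array[String]): Unit = {
    println(applyTwice(increment, 5))   // 7
    println(viaName == viaSubstitution) // true
  }
}
```

Because square has no side effects, replacing y with its defining expression is safe; the same substitution would not be safe if square mutated shared state.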

Scala

  • = (Object oriented + functional paradigm) / JVM
  • Seamless Java interoperation
  • Statically typed with dynamic features
  • Higher-order functions
  • Read-Eval-Print Loop (REPL)
  • Duck typing (structural typing)
  • Deeply extensible
  • DSL support
  • Strict evaluation by default

(scalable language)

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way

Immutability

Prefer vals, immutable objects, and methods without side effects. Reach for them first

  • Decoupling and isolation
  • Function arguments are val by default
  • Parallelism and multithreading
  • Accessing a mutable object from separate threads requires locking
  • Scalability!
// Bad
var x: ExpressionType = null
if (myBoolean) x = expr1 else x = expr2

// Good
val x = if (myBoolean) expr1 else expr2

Mutable vs. Immutable

// java.util.Date with Calendar: mutable
val firstDate: Date = ...
val c = Calendar.getInstance()
c.setTime(firstDate)
c.add(Calendar.DATE, -1)
val dayAgo = c.getTime()

// Joda-Time: immutable
val dateTime = new DateTime(firstDate)
val dayBefore = dateTime.minusDays(1)

Mutability

// Not everything should be solved with immutability though...
class Item{ ... }
class Player(var health: Int = 100,
             val items: mutable.Buffer[Item] = mutable.Buffer.empty)
val player = new Player()

// Mutability for performance is OK
def getFibs(n: Int): Seq[Int] = {
  val fibs = mutable.ArrayBuffer(1, 1)
  while(fibs.length < n){
    fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
  }
  fibs.toSeq // explicit conversion: on Scala 2.13+ Seq means immutable.Seq
}

// But try to avoid this
def getFibs(n: Int, fibs: mutable.ArrayBuffer[Int]): Unit = {
  fibs.clear()
  fibs.append(1)
  fibs.append(1)
  while(fibs.length < n){
    fibs.append(fibs(fibs.length-1) + fibs(fibs.length-2))
  }
}

// Mutate either the variable or the value
val myList = new mutable.ArrayBuffer[Int]()
var myList = immutable.List[Int](1, 2, 3)

// Bad! Both the variable and the value are mutable
var myList = mutable.ArrayBuffer[Int]()
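For comparison with the buffer-based getFibs, here is a sketch of the same sequence built without any mutable state. The object name ImmutableFibs is invented for illustration. Unlike the buffer version, which always starts with two elements, this variant returns exactly n elements.

```scala
object ImmutableFibs {
  // Each step carries the pair (current, next); no buffer is mutated.
  def getFibs(n: Int): List[Int] =
    Iterator
      .iterate((1, 1)) { case (a, b) => (b, a + b) }
      .map(_._1)
      .take(n)
      .toList

  def main(args: Array[String]): Unit =
    println(getFibs(6)) // List(1, 1, 2, 3, 5, 8)
}
```

Whether this or the local ArrayBuffer is preferable depends on the hot path; the mutable version stays acceptable as long as the buffer never escapes the function.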

Type inference vs. explicit typing

// Inferred types
val sum = 34 + 4 * 2

val list = List(1, 2, 3)

val map = Map("hey" -> list)

def succ(x: Int) = x + 1

// Explicit types
val sum: Int = 34 + 4 * 2

val list: List[Int] = List(1, 2, 3)

val map: Map[String, List[Int]] = Map("hey" -> list)

def succ(x: Int): Int = x + 1

Inner Functions

// Steps nested inside the function that uses them
def someComplexFunction(p: Parameter) = {
  def theFirstStep = {
    // do something, using parameter
  }
  def anotherStep = {
    // do something else
  }

  theFirstStep + anotherStep
}

def someComplexFunction(p: Parameter) =
  theFirstStep(p) + anotherStep(p)
 
private def theFirstStep(p: Parameter) = { 
  ...
}
 
private def anotherStep(p: Parameter) = {
  ...
}

Case classes

Case classes are regular classes which export their constructor parameters and which provide a recursive decomposition mechanism via pattern matching

 

  • A Point class
  • A constructor taking at least two parameters
  • Can be invoked with named parameters
  • Getters for x, y and z
  • A hashCode method
  • An equals method
  • A method for creating copies of immutable instances
case class Point(x: Int, y: Int, z: Int = 0)
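The generated members listed above can be exercised with a short sketch. The wrapper object PointDemo and the describe function are invented for illustration; Point is the case class from the slide.

```scala
object PointDemo {
  case class Point(x: Int, y: Int, z: Int = 0)

  val p = Point(1, 2)         // z takes its default value; no `new` required
  val q = Point(y = 2, x = 1) // named parameters, in any order
  val r = p.copy(z = 5)       // a copy with one field changed; p itself is untouched

  // Pattern matching decomposes an instance into its constructor parameters.
  def describe(pt: Point): String = pt match {
    case Point(0, 0, 0) => "origin"
    case Point(x, y, 0) => s"on the z=0 plane at ($x, $y)"
    case Point(x, y, z) => s"at ($x, $y, $z)"
  }
}
```

Since equals and hashCode are structural, p and q compare equal even though they were constructed differently.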

Pattern matching

(...) tests whether a given value (or sequence of values) has the shape defined by a pattern, and, if it does, binds the variables in the pattern to the corresponding components of the value (or sequence of values)

// Mixed
case class Player(name: String, score: Int)
def message(player: Player) = player match {
  case Player(_, score) if score > 100000 => "Get a job, dude!"
  case Player(name, _) => "Hey " + name + ", nice to see you again!"
}

// With collections
val list = List(0, 4, 5)
list match {
  case List(0, _, _) => println("found it")
  case _ =>
}
// Value matching
val sign = ch match {
  case '+' => 1
  case '-' => -1
  case _   => 0
}

// Type matching
obj match {
  case x: Int => x
  case s: String => Integer.parseInt(s)
  case _: BigInt => Int.MaxValue
  case _ => 0
}

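// Caution: type arguments are erased at runtime, so this matches any Map
// (the compiler emits an "unchecked" warning)
def isIntIntMap(x: Any) = x match {
  case m: Map[Int, Int] => true
  case _ => false
}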
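The isIntIntMap example is worth running, because type matching on a generic type does not check the type arguments. The sketch below reproduces it and adds a hypothetical helper, hasOnlyIntEntries, that inspects the entries instead. Expect an "unchecked" compiler warning on the first function; that warning is the point of the example.

```scala
object ErasureDemo {
  // Type arguments are erased at runtime, so this pattern matches ANY Map.
  def isIntIntMap(x: Any): Boolean = x match {
    case m: Map[Int, Int] => true
    case _ => false
  }

  // Hypothetical alternative: match on Map[_, _] and test each entry.
  // Note that an empty map trivially passes this check.
  def hasOnlyIntEntries(x: Any): Boolean = x match {
    case m: Map[_, _] => m.forall {
      case (_: Int, _: Int) => true
      case _ => false
    }
    case _ => false
  }
}
```

With these definitions, ErasureDemo.isIntIntMap(Map("a" -> "b")) returns true, while ErasureDemo.hasOnlyIntEntries(Map("a" -> "b")) returns false.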

The Option[T] type

case class User(id: Int, name: String, age: Int, gender: Option[String])

object UserRepository {
  private val users = Map(1 -> User(1, "John Doe", 32, Some("male")),
                          2 -> User(2, "Johanna Doe", 30, None))

  users(2).gender match {
    case Some(gender) => println("Gender: " + gender)
    case None => println("Gender: not specified")
  }

  def findById(id: Int): Option[User] = users.get(id)
  def findAll = users.values
}

UserRepository.findById(2).foreach(user => println(user.age)) // prints 30

for {
  User(_, _, _, Some(gender)) <- UserRepository.findAll
} yield gender
  • Container for an optional value of type T, i.e. values that may be present or not
  • If the value of type T is present, Option[T] is an instance of Some[T]
  • If the value is absent, Option[T] is the object None
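Besides pattern matching, an Option is usually consumed through combinators. The sketch below reuses the User data from the slide; the wrapper object OptionDemo and the helpers ageOf, genderOf, and genderLabel are invented for illustration.

```scala
object OptionDemo {
  case class User(id: Int, name: String, age: Int, gender: Option[String])

  val users = Map(
    1 -> User(1, "John Doe", 32, Some("male")),
    2 -> User(2, "Johanna Doe", 30, None)
  )

  def findById(id: Int): Option[User] = users.get(id)

  // map transforms the value if present; getOrElse supplies a default otherwise.
  def ageOf(id: Int): Int = findById(id).map(_.age).getOrElse(-1)

  // flatMap chains two optional steps: the user may be missing, and so may the gender.
  def genderOf(id: Int): Option[String] = findById(id).flatMap(_.gender)

  // The same chain written as a for-comprehension.
  def genderLabel(id: Int): String =
    (for {
      user   <- findById(id)
      gender <- user.gender
    } yield gender).getOrElse("not specified")
}
```

None of these helpers needs a null check: absence is propagated by the Option type itself.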

Laziness

class X { val x = { Thread.sleep(2000); 15 } }
class Y { lazy val y = { Thread.sleep(2000); 13 } }

new X  // blocks for two seconds while x is initialized
new Y  // returns instantly; y is computed on first access

// By-name parameter: the expression is only evaluated if needed
def logMsg(str: => String): Unit = { ... }
def expensive: String = { ... }
logMsg(s"Some $expensive message!")
  • Some languages (like Haskell) are lazy: every expression’s evaluation waits for its (first) use

  • Scala is strict by default, but lazy on request: lazy val for values, by-name parameters (=> T) for arguments

  • Laziness is made of lambdas – anonymous functions closed over their lexical scope
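Both forms of opt-in laziness can be observed by counting evaluations. This is a minimal sketch; the names LazinessDemo, evaluations, Config, and loads are invented for illustration, and logMsg takes an extra enabled flag so that the "only if needed" case is visible.

```scala
object LazinessDemo {
  var evaluations = 0 // counts how often the expensive expression actually runs

  def expensive: String = { evaluations += 1; "expensive" }

  // By-name parameter: msg is evaluated only if, and each time, it is used.
  def logMsg(enabled: Boolean)(msg: => String): Option[String] =
    if (enabled) Some(msg) else None

  class Config {
    var loads = 0
    // lazy val: the body runs once, on first access, and the result is cached.
    lazy val value: Int = { loads += 1; 13 }
  }
}
```

Calling logMsg(enabled = false)(s"Some $expensive message!") leaves evaluations at 0, because the interpolated string is never built.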

Collections

  • Scala has a rich collections library
  • Collections are containers of items, such as List, Set, and Map; Tuple and Option are related container types
  • Collections may have an arbitrary number of elements or be bounded to zero or one element (e.g., Option)
  • Either strict or lazy, mutable or immutable
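The axes named above (mutable or immutable, strict or lazy, zero-or-one element) can be sketched in a few lines. All names are invented for illustration, and an Iterator stands in for a lazy collection because its behaviour is the same across Scala versions.

```scala
import scala.collection.mutable

object CollectionsDemo {
  // Immutable: "adding" returns a new list and leaves the original untouched.
  val base = List(1, 2, 3)
  val extended = 0 :: base

  // Mutable: the buffer itself changes in place.
  val buffer = mutable.ArrayBuffer(1, 2, 3)
  buffer += 4

  // Option behaves like a collection of zero or one element.
  val present = List(Some(1), None, Some(3)).flatten

  // Strict: map visits every element before take runs.
  var strictCalls = 0
  val strictSquares = List(1, 2, 3, 4, 5).map { n => strictCalls += 1; n * n }.take(2)

  // Lazy: elements are computed on demand, even from an infinite source.
  var lazyCalls = 0
  val lazySquares = Iterator.from(1).map { n => lazyCalls += 1; n * n }.take(2).toList
}
```

Both pipelines produce List(1, 4), but the strict one calls the function five times and the lazy one only twice.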

Parallelism

  • Concurrency != Parallelism

  • Avoid concurrency like the plague it is!!!

  • Parallelism is about speeding up a program by using multiple processors

Use parallelism if you can, concurrency otherwise
-- Haskell wiki
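One way to get parallel speed-up from pure code in standard Scala is to split the work into independent chunks and run each in a Future. This is a sketch under assumptions: the names ParallelSum, sumOfSquares, and parallelSumOfSquares are invented, the global execution context is used, and the 10-second timeout is arbitrary. Futures are a concurrency primitive, but because each chunk is pure and shares no state, the result is deterministic.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  // Pure, independent work per chunk: safe to run on several cores at once.
  def sumOfSquares(xs: Seq[Int]): Long = xs.map(x => x.toLong * x).sum

  // chunkSize must be positive.
  def parallelSumOfSquares(xs: Seq[Int], chunkSize: Int): Long = {
    val chunks = xs.grouped(chunkSize).toList
    // One Future per chunk; the execution context schedules them on a thread pool.
    val partials: Future[List[Long]] =
      Future.traverse(chunks)(chunk => Future(sumOfSquares(chunk)))
    // Blocking is acceptable at the very edge of a small program.
    Await.result(partials, 10.seconds).sum
  }
}
```

For example, ParallelSum.parallelSumOfSquares(1 to 100, 10) yields the same value as the sequential ParallelSum.sumOfSquares(1 to 100).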

Misc

Programs must be written for people to read, and only incidentally for machines to execute
-- Harold Abelson

Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration
-- Stan Kelly-Bootle

A programming language is low level when its programs require attention to the irrelevant
-- Alan J. Perlis

Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live
-- Martin Golding

Tools and frameworks: Akka

  • A toolkit for building highly concurrent message systems
  • Asynchronous, Resilient and Distributed by Design
  • High-level abstractions like Actors, Streams and Futures
  • Adaptive cluster management, load balancing, routing, partitioning and sharding
  • High-performance, elastic and decentralized
  • Futures support
class MyActor extends Actor {
  // receive is a partial function over incoming messages
  def receive = {
    case HttpRequest(request) =>
      val query = buildQuery(request)
      dbCall(query)
    case DbResponse(dbResponse) =>
      val wsRequest = buildWebServiceRequest(dbResponse)
      wsCall(wsRequest)
    case WsResponse(wsResponse) => sendReply(wsResponse)
  }
}

Tools and frameworks: Slick

  • Functional-relational mapping for Scala
  • Compile time static type checking
  • Seamless manipulation in a collections manner
  • Query compiler for different databases (MySQL, PostgreSQL,...)
// Definition of the COFFEES table
class Coffees(tag: Tag) extends Table[(String, Int, Double, Int, Int)](tag, "COFFEES") {
  def name = column[String]("COF_NAME", O.PrimaryKey)
  def supID = column[Int]("SUP_ID")
  def price = column[Double]("PRICE")
  def sales = column[Int]("SALES")
  def total = column[Int]("TOTAL")
  def * = (name, supID, price, sales, total)
  // A reified foreign key relation that can be navigated to create a join
  def supplier = foreignKey("SUP_FK", supID, suppliers)(_.id)
}
val coffees = TableQuery[Coffees]

coffees ++= Seq(
  ("Colombian",         101, 7.99, 0, 0),
  ("French_Roast_Decaf", 49, 9.99, 0, 0)
)

// Coffee is assumed here to be an alias for the table's row type
type Coffee = (String, Int, Double, Int, Int)

def fetchAll() = { for (c <- coffees) yield c }
def fetch(name: String) = { coffees.filter(_.name === name).result }
def insert(c: Coffee) = { coffees += c }
def insert(cs: Seq[Coffee]) = { (coffees ++= cs).transactionally }
def delete(supID: Int) = { coffees.filter(_.supID === supID).delete }

for {
  c <- coffees if c.price < 9.0
  s <- suppliers if s.id === c.supID
} yield (c.name, s.name)

Tools and frameworks: Play

  • Lightweight, stateless web framework
  • Built on top of Akka
  • Non-blocking I/O
  • RESTful by default
class HomeController @Inject() (computerService: ComputerService,
                                companyService: CompanyService,
                                val messagesApi: MessagesApi)
  extends Controller with I18nSupport {

  val computerForm = Form(
    mapping(
      "id" -> ignored(None:Option[Long]),
      "name" -> nonEmptyText,
      "introduced" -> optional(date("yyyy-MM-dd")),
      "discontinued" -> optional(date("yyyy-MM-dd")),
      "company" -> optional(longNumber)
    )(Computer.apply)(Computer.unapply)
  )

  def edit(id: Long) = Action {
    computerService.findById(id).map { computer =>
      Ok(html.editForm(id, computerForm.fill(computer), companyService.options))
    }.getOrElse(NotFound)
  }

  def list(page: Int, orderBy: Int, filter: String) = Action { implicit request =>
    Ok(html.list(
      computerService.list(page = page, orderBy = orderBy, filter = ("%"+filter+"%")),
      orderBy, filter
    ))
  }
}

Tools and frameworks: Spray

  • REST/HTTP-based integration layers on top of Scala and Akka
  • DSL for routes spec
  • Fully asynchronous, non-blocking, actor and Future based
  • Lightweight and modular (loosely coupled)
val listRoute = pathPrefix(PREFIX) {
  (path("list") & get) {
    respondWithMediaType(MediaTypes.`application/json`) {
      onComplete(projectDS.fetchAll()) {
        case Success(f) => complete(f)
        case Failure(ex) => complete(InternalServerError, ex.getMessage)
      }
    }
  }
}

val deleteRoute = pathPrefix(PREFIX) {
  (path("delete" / IntNumber) & delete) { pid =>
    respondWithMediaType(MediaTypes.`application/json`) {
      complete(projectDS.delete(pid))
    }
  }
}

val routes = listRoute ~ addRoute ~ deleteRoute

Apache Spark

  • A fast, open source and general engine for large-scale data processing
  • Provides an API centered on resilient distributed datasets (RDDs): read-only multisets of data items distributed over a cluster of machines and maintained in a fault-tolerant way
  • Immutable datasets equipped with functional transformers
  • map, flatMap, filter, reduce, union, intersection, aggregate, sum...
  • Spark transformations are lazy; Scala collections are strict by default

The Ultimate Scala Collections
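The transformer names listed above also exist on ordinary Scala collections, which is what makes the RDD API feel familiar. The sketch below is a local word count on a plain List and does not use Spark at all; the names WordCountSketch, lines, words, pairs, counts, and total are invented for illustration. On an RDD the equivalent transformations would only be recorded and run once an action is called, whereas here each step executes immediately.

```scala
object WordCountSketch {
  // A plain List stands in for a distributed dataset.
  val lines = List("spark is lazy", "scala is strict")

  // flatMap, filter and map: the same transformer names the slide lists.
  val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
  val pairs = words.map(w => (w, 1))

  // Local stand-in for a per-key reduction: group by word, then sum the counts.
  val counts: Map[String, Int] =
    pairs.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }

  // reduce folds all counts into a single total.
  val total = pairs.map(_._2).reduce(_ + _)
}
```

Here counts("is") is 2 and total is 6, the number of words across both lines.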

Downsides

  • Lots of ways to solve problems (multi paradigm)
  • Compilation and builds are slow
  • Limited backward compatibility
  • Learning curve

Questions?

Thanks!

github.com/airtonjal

airtonjal@gmail.com

airton@contaquanto.com.br
