Big Data friends
Scala and FP

a.k.a. Noootsab

Proud husband and father

Lidjeu po l'bon (a Liégeois for good)

Have had to wear glasses since my Maths graduation in '03
Have learned to dress well since my CS graduation in '05

Lost myself since gaining expertise in Geomatics and GIS
Risking myself in NextLab (GIS, Big Data and Scala)
Public interest work: co-founded Wajug
Helper and organizer of Devoxx4Kids
Scala trainer

WHY I mean it

and others do...
Scala has a reputation for being accessible
It eases the maths (mostly [matrix] algebra)
The CS world is changing (fast)
It shifts from the cloud to analysis
That is, from IT needs to Market opportunities

Reused knowledge

Fact: syntax close to Java, C#, Ruby, ...
Cause: Object Oriented
case class Person(  name:String,
                    first:String,
                    age:Double,
                    gender:Gender,
                    father:Option[Person],
                    mother:Option[Person],
                    children:List[Person]=Nil
) {
  def incAge(n:Int):Person = copy(age = age+n)
  // returns the child linked to this parent, and this parent with the child added
  def newSon(child:Person):(Person, Person) = {
    val newChild = this.gender match {
        case Male => child.copy(father = Some(this))
        case Female => child.copy(mother = Some(this))
    }
    (newChild, this.copy(children = newChild :: children))
  }
}
val _Noah = Person("Petrella", "Noah",
                   age=4, gender=Male,
                   father=None,
                   mother=Some(Sandrine))
val boringNoootsab = Person("Petrella", "Andy", 32, Male,
                            father=Some(Arcangelo), mother=Some(Nadine))
// lower-case names: a capitalized name in a pattern would be matched, not bound
val (noah, happyNoootsab) = boringNoootsab.newSon(_Noah)

Following the wave

Fact: Functional Programming ftw
Cause: Scalable Language


Please, bear with me...

WHO

a lot

mainly data fans

Coursera

10⁶ online students
PHP → Scala
Concurrency primitives
Play
Type safety
Ecosystem

Twitter


REPL
case classes
productivity gains
concise code



Scala school
Tens of open source libs

Netflix


Billion devices
Historical events
Real-time analytics
Proper API (Option)
Async (Try)
Scalatra + ScalaTest

And more

AirBnB
Snips (smart cities, ...)
Tuplejump (analytic platform)
eBay (analytics)
BBC (Future Media project)
Virdata (IoT analytic platform)
Ooyala (video analytic platform)
LinkedIn

Functional Programming

in a nutshell

Source (Wikipedia): http://en.wikipedia.org/wiki/Function_(mathematics)

Input x

can be a function...


Defines a general process
that could behave differently
listOfNames map { name => DB.getByName(name) }

listOfPersons flatMap { person => person.friends }

listOfFriends filter { (f:Friend) => f.met moreThan (10 years) }

listOfOldFriends.count(_.person.gender != me.gender)
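
To make the four one-liners above concrete, here is a minimal, self-contained sketch. The `Person` and `Friend` case classes, the in-memory `db` map and the `yearsKnown` / `sameGender` fields are made-up stand-ins for the slide's `DB`, `friends` and `met moreThan (10 years)` DSL, not the talk's actual code.

```scala
object HigherOrderDemo {
  // Hypothetical domain types standing in for the slide's Person/Friend.
  final case class Friend(name: String, yearsKnown: Int, sameGender: Boolean)
  final case class Person(name: String, friends: List[Friend])

  // Hypothetical in-memory "database" replacing DB.getByName.
  val db: Map[String, Person] = Map(
    "Ann" -> Person("Ann", List(Friend("Bob", 12, sameGender = false),
                                Friend("Cid", 3, sameGender = true))),
    "Dan" -> Person("Dan", List(Friend("Eve", 15, sameGender = true)))
  )

  val names: List[String] = List("Ann", "Dan")

  // map: the function says how to turn one name into one Person.
  val persons: List[Person] = names.map(name => db(name))

  // flatMap: each Person yields a list, and the results are flattened.
  val friends: List[Friend] = persons.flatMap(person => person.friends)

  // filter: the function decides which elements survive.
  val oldFriends: List[Friend] = friends.filter(f => f.yearsKnown > 10)

  // count: the function is a predicate, the result is a number.
  val oldFriendsOfOtherGender: Int = oldFriends.count(f => !f.sameGender)
}
```

The collection methods stay the same in every case. Only the function passed in changes the behaviour.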

Output x

bah... can be a function as well...

Prepares a process
that will be available for later usage
def authentication(manager:SecurityManager): User=>Authentication

def source(url:String): Authentication=>DataRepo=>Data

//[...]

val authenticate = authentication(FakeSecurityManager)
val settings = source("/settings")
def request = {
  val user = //...
  val auth = authenticate(user)
  val settingsFetcher = settings(auth)
  // and so on
}
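
The signatures above have no bodies on the slide, so here is one possible way to fill them in. Everything below (`User`, `Authentication`, `DataRepo`, `tokenFor`, the `Option[String]` result) is an assumption made for illustration. The point is only that each call returns a function that is kept around for later.

```scala
object FunctionOutputDemo {
  // Hypothetical types: the slide does not define them.
  final case class User(name: String)
  final case class Authentication(user: User, token: String)
  final case class DataRepo(content: Map[String, String])

  trait SecurityManager { def tokenFor(user: User): String }
  object FakeSecurityManager extends SecurityManager {
    def tokenFor(user: User): String = "fake-" + user.name
  }

  // Returns a function: the manager is fixed now, the user comes later.
  def authentication(manager: SecurityManager): User => Authentication =
    user => Authentication(user, manager.tokenFor(user))

  // Returns a function that itself returns a function (curried style).
  def source(url: String): Authentication => DataRepo => Option[String] =
    auth => repo => if (auth.token.nonEmpty) repo.content.get(url) else None

  // Prepared once...
  val authenticate: User => Authentication = authentication(FakeSecurityManager)
  val settings: Authentication => DataRepo => Option[String] = source("/settings")

  // ...used later, when a user and a repository are finally available.
  def request(user: User, repo: DataRepo): Option[String] = {
    val auth = authenticate(user)
    val settingsFetcher = settings(auth)
    settingsFetcher(repo)
  }
}
```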

Show me

def lm(x:List[Double], y:List[Double]):((Double, Double), Double=>Double) = {
  val n = x.size
  val ẍ = x.sum.toDouble / n
  val ÿ = y.sum.toDouble / n
  val Sp = ((x ∙- ẍ) ∙* (y ∙- ÿ) sum) / (n-1)
  val Sx2 = ((x ∙- ẍ) ∙^ 2 sum) / (n-1)
  val ß1 = Sp / Sx2
  val ß0 = ÿ - ß1 * ẍ
  val coefs = (ß0, ß1)
  val predict = (d:Double) => ß0 + ß1 * d
  (coefs, predict)
}
def test(ß0:Double = 18.1d, ß1:Double = 6d, error:Int=>List[Double]) = {
  val n = 10000
  val x:List[Double] = (-n to n).map(_.toDouble).toList
  val e = error(2*n+1)
  val y:List[Double] = ß0 ∙+: (ß1 ∙*: x) ∙+: e
  lm(x, y)
}
val error = rnorm(mean=0, sigma=5) // generates Gaussian numbers
val model = test(103, 7, error)
on github
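
The `∙-`, `∙*` and `∙^` operators used in `lm` come from helper code that is not shown on the slide. As a rough equivalent, assuming they are plain element-wise operations, the same least-squares fit can be sketched with standard collection methods only. The names `meanX`, `sp`, `sx2`, `b0` and `b1` are mine, and the sample data is noise-free so the recovered coefficients are easy to check.

```scala
object LinearModelDemo {
  // Simple linear regression y = b0 + b1 * x, same formulas as the slide:
  // b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x).
  def lm(x: List[Double], y: List[Double]): ((Double, Double), Double => Double) = {
    require(x.size == y.size && x.size > 1, "need two lists of equal size, at least 2 points")
    val n = x.size
    val meanX = x.sum / n
    val meanY = y.sum / n
    // Sample covariance of x and y.
    val sp = x.zip(y).map { case (xi, yi) => (xi - meanX) * (yi - meanY) }.sum / (n - 1)
    // Sample variance of x.
    val sx2 = x.map(xi => math.pow(xi - meanX, 2)).sum / (n - 1)
    val b1 = sp / sx2
    val b0 = meanY - b1 * meanX
    // The second element is a function: the fitted model, ready for later use.
    ((b0, b1), (d: Double) => b0 + b1 * d)
  }

  // Noise-free sample data generated from y = 103 + 7x (no random error term).
  val xs: List[Double] = (-10 to 10).map(_.toDouble).toList
  val ys: List[Double] = xs.map(x => 103.0 + 7.0 * x)
  val ((b0, b1), predict) = lm(xs, ys)
}
```

Note that `lm` returns both the coefficients and a prediction function, which ties back to the "output can be a function" slide.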

Lazy

yeah yeah... I'll do it
lazy val app:App = initializeApp()

def logDebug(m: => String) = if (LOG.debugEnabled) LOG.debug(m) else ()
Avoid computations
Delayed initialization
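
A small sketch of both points, with counters added so the laziness can be observed. `initializeApp`, `expensiveMessage` and the `debugEnabled` flag are hypothetical stand-ins, and `println` replaces the logger.

```scala
object LazyDemo {
  var initializations = 0
  var messagesBuilt = 0

  // Hypothetical expensive initialization.
  def initializeApp(): String = { initializations += 1; "app" }

  // Delayed initialization: the body runs on first access, and only once.
  lazy val app: String = initializeApp()

  val debugEnabled = false

  // By-name parameter: `m` is only evaluated if it is actually used.
  def logDebug(m: => String): Unit = if (debugEnabled) println(m) else ()

  // Hypothetical costly message construction.
  def expensiveMessage(): String = { messagesBuilt += 1; "full state dump" }
}
```

With `debugEnabled` set to `false`, calling `LazyDemo.logDebug(LazyDemo.expensiveMessage())` never builds the message, which is the "avoid computations" part.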

Sooo laaazy

Come back... in a potential future

TL; DW
val app:Future[App] = initializeApp()

val http:Future[HttpClient] = app.map( _.http.client )

def isOk(url:String):Future[Boolean] =
    http.flatMap(client => client.get(url) )
        .map( _.code )
        .filter( _ == 200 )   // fails the future when the code is not 200
        .map( _ => true )
        .recoverWith {
            case x:CommunicationException => isOk(url)   // retry on communication errors
        }.recover {
            case e: Throwable => false
        }
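
To try the same chain without a real HTTP client, here is a self-contained sketch. `FakeClient`, `Response` and this `CommunicationException` are invented for the example: the fake client fails a configurable number of times before answering, so both the retry (`recoverWith`) and the fallback (`recover`) paths can be exercised.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Hypothetical stand-ins for the slide's HTTP types.
  final case class Response(code: Int)
  final class CommunicationException(msg: String) extends RuntimeException(msg)

  // Fake client: the first `failures` calls fail, later calls return `code`.
  final class FakeClient(code: Int, failures: Int) {
    private val calls = new java.util.concurrent.atomic.AtomicInteger(0)
    def get(url: String): Future[Response] = Future {
      if (calls.incrementAndGet() <= failures)
        throw new CommunicationException("timeout on " + url)
      Response(code)
    }
  }

  def isOk(client: FakeClient, url: String): Future[Boolean] =
    client.get(url)
      .map(_.code)
      .filter(_ == 200)          // a non-200 code turns into a failed future
      .map(_ => true)
      .recoverWith { case _: CommunicationException => isOk(client, url) } // retry
      .recover { case _: Throwable => false }                              // give up

  // Blocking helper, for demonstration purposes only.
  def await[A](f: Future[A]): A = Await.result(f, 5.seconds)
}
```

Nothing blocks until `await` is called. Each step only describes what to do once the previous result shows up.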

Code... now

(I promised)
class LazyCons[+A](a:A, t: => Lazy[A]) extends Lazy[A] {
  val head = Some(a)
  lazy val tail = t
}
def fetch(file:String):Lazy[Future[String]] = {
  val texts = io.Source.fromFile(new java.io.File(file)).getLines
  def readLine(texts:Iterator[String]):Lazy[Future[String]] = //...
  readLine(texts)
}
for the fun
val fibs:Stream[Int] = 0 #:: 1 #:: ((fibs zip fibs.drop(1)) map ((_:Int) + (_:Int)).tupled)
on github
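
The `Lazy` trait that `LazyCons` extends is not shown on the slide. Below is one possible minimal definition, purely as an assumption, with an `Empty` case and a `take` helper added so the structure can be used. The `evaluated` counter is there only to show that cells are built on demand.

```scala
object LazyListDemo {
  // Hypothetical minimal trait: the slide only shows LazyCons.
  sealed trait Lazy[+A] {
    def head: Option[A]
    def tail: Lazy[A]

    // Materialize at most n elements, forcing tails only as far as needed.
    def take(n: Int): List[A] = {
      @annotation.tailrec
      def loop(rest: Lazy[A], left: Int, acc: List[A]): List[A] =
        rest.head match {
          case Some(a) if left > 0 => loop(rest.tail, left - 1, a :: acc)
          case _                   => acc.reverse
        }
      loop(this, n, Nil)
    }
  }

  case object Empty extends Lazy[Nothing] {
    val head: Option[Nothing] = None
    def tail: Lazy[Nothing] = this
  }

  // Same shape as the slide: strict head, by-name tail cached in a lazy val.
  final class LazyCons[+A](a: A, t: => Lazy[A]) extends Lazy[A] {
    val head: Option[A] = Some(a)
    lazy val tail: Lazy[A] = t
  }

  var evaluated = 0

  // Conceptually infinite list of integers: each cell is built only when forced.
  def from(n: Int): Lazy[Int] = { evaluated += 1; new LazyCons(n, from(n + 1)) }

  val naturals: Lazy[Int] = from(0)
}
```

Defining `naturals` builds a single cell, even though the definition is recursive and never ends on its own.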

Mashup


A function could either 
→ be called on data (method, sync)
→ be sent to the data (message, async)

A function composes

A function is a delayed computation

...
...

Spark

...
...
What if I compose all the computations

Then I send the whole shebang to where the data are?

.↓.↓.
Map/Reduce: degenerate case
Spark: generalized case (back to Gerard's talk)
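
The idea can be mimicked in plain Scala, without Spark. In this sketch the `partitions` list is a made-up stand-in for data living on several machines, and the three small functions are arbitrary: they are composed once into a single `pipeline`, which is then applied where each chunk of data sits, and only the small partial results are combined.

```scala
object ComposeDemo {
  // Three small, independent steps (arbitrary examples).
  val parse: String => Int = s => s.trim.toInt
  val square: Int => Int = n => n * n
  val keepEven: Int => Option[Int] = n => if (n % 2 == 0) Some(n) else None

  // Composed once: nothing is computed yet, this is just a bigger function.
  val pipeline: String => Option[Int] = parse.andThen(square).andThen(keepEven)

  // Hypothetical partitions, standing in for data spread over a cluster.
  val partitions: List[List[String]] = List(List("1", " 2"), List("3", "4 "))

  // "Send" the whole pipeline to each partition and reduce locally...
  val partials: List[Int] =
    partitions.map(partition => partition.flatMap(s => pipeline(s).toList).sum)

  // ...then combine the small partial results.
  val total: Int = partials.sum
}
```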

Funky code



trait Data {
  def dependent:List[Double]
  def observed:Matrix
  def bootstrap(proportion:Double):Future[Data]
}
trait Model {
  type Coefs
  def apply(data:Data):Future[(Coefs, List[Double]=>Future[Double])]
}
def bagging(model:Model)(agg:Aggregation[model.Coefs], n:Int)(data:Data):Future[model.Coefs] = {
  def exec:Future[model.Coefs] =  for {
                                    sample     <- data.bootstrap(0.6)
                                    (coefs, _) <- model(sample)
                                  } yield coefs
  val execs:List[Future[model.Coefs]] = List.fill(n)(exec)
  val coefsList:Future[List[model.Coefs]] = Future.sequence(execs)

  val result:Future[model.Coefs] = coefsList map agg
  result
}
on github
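
For a runnable feel of the same shape, here is a deliberately simplified sketch. It is not the code from the repository: `Data` holds a plain vector, the model is a hypothetical `MeanModel` whose only coefficient is the sample mean, `Aggregation` is replaced by an ordinary function, and the prediction function is dropped. The path-dependent `model.Coefs` type is kept as on the slide.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Random

object BaggingDemo {
  // Simplified, hypothetical Data: a bag of observations.
  final case class Data(values: Vector[Double]) {
    // Bootstrap: draw `proportion` of the size, with replacement.
    def bootstrap(proportion: Double, rnd: Random): Future[Data] = Future {
      val size = math.max(1, (values.size * proportion).toInt)
      Data(Vector.fill(size)(values(rnd.nextInt(values.size))))
    }
  }

  trait Model {
    type Coefs
    def apply(data: Data): Future[Coefs]
  }

  // Hypothetical toy model: its single "coefficient" is the mean.
  object MeanModel extends Model {
    type Coefs = Double
    def apply(data: Data): Future[Double] =
      Future(data.values.sum / data.values.size)
  }

  def bagging(model: Model)(agg: List[model.Coefs] => model.Coefs, n: Int)
             (data: Data): Future[model.Coefs] = {
    val rnd = new Random(42)
    // One bootstrap sample followed by one fit.
    def exec: Future[model.Coefs] = for {
      sample <- data.bootstrap(0.6, rnd)
      coefs  <- model(sample)
    } yield coefs
    // n independent fits, gathered and then aggregated.
    Future.sequence(List.fill(n)(exec)).map(agg)
  }

  val average: List[Double] => Double = cs => cs.sum / cs.size

  // Constant data, so the bagged mean is known whatever the samples are.
  val data: Data = Data(Vector.fill(100)(5.0))

  // Blocking call, for demonstration purposes only.
  def run(): Double = Await.result(bagging(MeanModel)(average, 20)(data), 5.seconds)
}
```

Because `exec` is a `def`, `List.fill(n)(exec)` starts `n` separate futures instead of reusing one result.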

Enough!

Thanks ^_^

Poke me:
→ for Scala training
→ for fun with Data
→ with Books ideas

Scala and FP in Big Data

By Andy Petrella

Talk given at the BigData.be meetup in July 2014. It introduces Scala and FP ahead of the talks about Spark that followed.
