Using an actor framework for scientific computing
opportunities and challenges
author: Krzysztof Borowski
supervisor: Ph. D. Bartosz Baliś
@liosedhel
Workflow
- Directed Graph
- Many inputs/Many outputs
- Nodes - activities
- Edges - dependencies (control flow)
- Each node activity == fun(Data): Result

Scientific workflow
- Data elements can be big
- Activities can be long-running and resource intensive
- Often invoke legacy code (e.g. Fortran, C) or external services

Scientific workflow - requirements
- Parallelization and distribution of computations
- Persistence and recovery
- Fault tolerance
Actor Model
- State isolation
- Async communication
- Behavior changing
- Spawning new actors

But why exactly the actor model?
But why exactly the actor model?

Similarities. Why not to give it a shot?


Actor Model difficulties

Aspect | Flow activity | Actors |
---|---|---|
Input data | Many typed input channels | One input mailbox |
Output data | Many typed output channels | Lack of output channels |
Flow patterns | Complicated patterns | Simple async messages in "fire and forget" manner |
Akka-streams to the rescue!
- Build with actor model
- Support for complicated flows (beautiful graph oriented API)
- Concurrent data processing

But...
Scientific workflows | Akka streams |
---|---|
bounded input data set | unbounded data stream |
big data elements | small data elements |
focused on scaling | focused on back-pressure (reactive streams implementation) |
important recovery mechanism | important high message throughput |
Current workflow engines
- Kepler, Teverna, Pegasus
- Focused on graphical interface for non programmers, or has complicated API
- Complex
- At the end of the day you always must write external program
Scaflow - New Hope
- Simple workflow engine for scientific computations
- For programmers
- Build with modern technologies in less then 1.5k lines of code (Scala, Akka, Cassandra)




Components

Source

Data processing
Data filtering


Data grouping
Components
Broadcast data

Merge data

Components
Synchronize data
Data sink


Scaflow API

StandardWorkflow.source(List(1, 2, 3, 4, 5, 6))
.map(a => a * a)
.group(3)
.map(_.sum)
.sink(println).run
Concurrent computations in actor model

Push model

Pull model
Scalability in actor model

Workflow persistent state
- Event sourcing
- Persistent actors
override def receiveRecover: Receive = receiveRecoverWithAck {
case n: NextVal[A] =>
if (filter(n.data)) deliver(destination, n)
}
override def receiveCommand: Receive = receiveCommandWithAck {
case n: NextVal[A] =>
persistAsync(n) { e =>
if (filter(e.data)) deliver(destination, n)
}
}
/* ... */
PersistentWorkflow.source("source", List(1, 2, 3, 4, 5, 6))
.map("square", a => a * a)
.group("group", 3)
.map("sum", _.sum)
.sink(println).run
Fault tolerance
val HTTPSupervisorStrategy = OneForOneStrategy(10, 5.seconds) {
case e: TimeoutException => Restart //retry
case _ => Stop //drop the message
}
PersistentWorkflow.connector[String]("pngConnector")
.map("getPng", getPathwayMapPng, Some(HTTPSupervisorStrategy))
.sink("sinkPng", id => println(s"PNG map downloaded for $id"))
Real world example

Real world example cont.
val HTTPSupervisorStrategy = OneForOneStrategy(10, 10.seconds) {
case e: TimeoutException => Restart // try to perform operation again
case _ => Stop // drop the message
}
val savePathwayPngFlow = PersistentWorkflow.connector[String]("pngConnector")
.map("getPng", getPathwayMapPng, Some(HTTPSupervisorStrategy), workersNumber = 8)
.sink("sinkPng", id => println(s"PNG map downloaded for $id"))
val savePathwayTextFlow = PersistentWorkflow.connector[String]("textConnector")
.map("getTxt", getPathwayDetails, Some(HTTPSupervisorStrategy), workersNumber = 8)
.sink("sinkTxt", id => println(s"TXT details downloaded for pathway $id"))
PersistentWorkflow
.source("source", List("hsa"))
.map("getSetOfPathways", getSetOfPathways, Some(HTTPSupervisorStrategy))
.split[String]("split")
.broadcast("broadcast", savePathwayPngFlow, savePathwayTextFlow)
.run
Real world example - scaling
val remoteWorkersHostLocations =
Seq(AddressFromURIString("akka.tcp://workersActorSystem@localhost:5150"),
AddressFromURIString("akka.tcp://workersActorSystem@localhost:5151"))
val HTTPSupervisorStrategy = OneForOneStrategy(10, 10.seconds) {
case e: TimeoutException => Restart // try to perform operation again
case _ => Stop // drop the message
}
val savePathwayPngFlow = PersistentWorkflow.connector[String]("pngConnector")
.map("getPng", getPathwayMapPng, Some(HTTPSupervisorStrategy),
workersNumber = 8, remoteAddresses = remoteWorkersHostLocations)
.sink("sinkPng", id => println(s"PNG map downloaded for $id"))
val savePathwayTextFlow = PersistentWorkflow.connector[String]("textConnector")
.map("getTxt", getPathwayDetails, Some(HTTPSupervisorStrategy), workersNumber = 8)
.sink("sinkTxt", id => println(s"TXT details downloaded for pathway $id"))
PersistentWorkflow
.source("source", List("hsa"))
.map("getSetOfPathways", getSetOfPathways, Some(HTTPSupervisorStrategy))
.split[String]("split")
.broadcast("broadcast", savePathwayPngFlow, savePathwayTextFlow)
.run
Event sourcing - performance

Future development
- Graph API
- Extensive usage of Akka clusters
- Workflow monitoring
https://github.com/liosedhel/scaflow
Special thanks go to
Ph.D. Bartosz Baliś
from AGH Universisty
for the idea, work supervision, advices and all other help during the research and Scaflow development
https://github.com/liosedhel/scaflow
Bibliography
- Akka - http://akka.io/
- Cassandra - http://cassandra.apache.org/
- Pegasus - pegasus.isi.edu
- Taverna - http://www.taverna.org.uk/
- Kepler - https://kepler-project.org/
- akka-streams - http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0.3/scala.html
Scaflow-LambdaDays2016
By liosedhel
Scaflow-LambdaDays2016
- 2,822