Gleb Kanterov
@kanterov
gleb@kanterov.ru
"simple things should be simple, complex things should be possible"
Alan Kay
class RDD[A] {
  def map[B](f: A => B): RDD[B]
  def filter(f: A => Boolean): RDD[A]
  def union(other: RDD[A]): RDD[A]
  def groupBy[B](f: A => B): RDD[(B, Seq[A])]
  def collect(): Array[A]
}
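A minimal sketch of the RDD API above, assuming Spark is on the classpath and a local master; names like `sc` and `nums` are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("rdd-demo"))

    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

    val doubledEvens = nums
      .filter(_ % 2 == 0)  // keep even numbers: 2, 4
      .map(_ * 2)          // double them: 4, 8
      .collect()           // materialize on the driver as Array(4, 8)

    println(doubledEvens.mkString(","))
    sc.stop()
  }
}
```

Both `f: A => B` arguments are ordinary Scala functions, so the compiler checks the element types end to end.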
class DataFrame {
  def select(cols: Column*): DataFrame
  def filter(condition: Column): DataFrame
  def col(name: String): Column
  def collect(): Array[Row]
}
// from org.apache.spark.sql.functions
def min(col: Column): Column
def avg(col: Column): Column
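A sketch of the untyped DataFrame API, using the `SparkSession` entry point; the column names `name` and `age` are illustrative. Note that everything is a `Column` over `Row`s, so a misspelled column name or a type mismatch only fails at runtime:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, min}

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("df-demo").getOrCreate()
    import spark.implicits._

    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    people
      .filter(people.col("age") > 26)  // checked at runtime, not compile time
      .select(people.col("name"))
      .show()

    // min/avg come from org.apache.spark.sql.functions
    people.agg(min(people.col("age")), avg(people.col("age"))).show()
    spark.stop()
  }
}
```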
class Dataset[A: Encoder] {
  def select[U1: Encoder, U2: Encoder](
    c1: TypedColumn[A, U1],
    c2: TypedColumn[A, U2]
  ): Dataset[(U1, U2)]
  def map[U: Encoder](
    func: A => U
  ): Dataset[U]
  def filter(f: A => Boolean): Dataset[A]
}
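A sketch of the typed Dataset API, assuming a hypothetical `Person` case class; `select` with `TypedColumn`s yields a `Dataset[(String, Long)]` rather than a `DataFrame` of `Row`s, and `map`/`filter` take plain Scala functions:

```scala
import org.apache.spark.sql.SparkSession

object DatasetExample {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("ds-demo").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("alice", 30L), Person("bob", 25L)).toDS()

    // TypedColumns carry the result type: Dataset[(String, Long)]
    val pairs = people.select($"name".as[String], $"age".as[Long])

    // ordinary Scala functions, checked at compile time
    val adults = people.filter(_.age >= 18).map(_.name)
    adults.show()
    spark.stop()
  }
}
```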
Provides a more typeful experience
working with Apache Spark
github.com/adelbertc/frameless
val logs = ctx
  .read.format("json")
  //.open("s3://logs")
  .stream("s3://logs")

logs
  .groupBy(logs("user_id"))
  .agg(sum(logs("time")))
  .write.format("jdbc")
  //.save("jdbc:mysql://...")
  .stream("jdbc:mysql://...")
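The slide contrasts a batch job (commented lines) with its streaming twin using a pre-release API. In released Spark the entry points are `readStream`/`writeStream`; a hedged sketch of the same aggregation, with the console sink standing in for JDBC and an assumed `user_id`/`time` schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object StreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("stream-demo").getOrCreate()

    val logs = spark.readStream.format("json")
      .schema("user_id STRING, time LONG")  // streaming sources need an explicit schema
      .load("s3://logs")

    val totals = logs
      .groupBy(logs("user_id"))
      .agg(sum(logs("time")))

    val query = totals.writeStream
      .outputMode("complete")  // aggregations require complete/update output mode
      .format("console")       // a JDBC sink would go through foreachBatch instead
      .start()

    query.awaitTermination()
  }
}
```

The batch and streaming versions share the same `groupBy`/`agg` body; only the source and sink change.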
data-driven
ad-hoc research
models
classifiers
predictors
number crunching
KPI