What will be discussed
1. Main Problems/Challenges
2. How to solve them
3. Implementation suggestion
4. DEMO
5. What else can we gain/provide
A funnel example of user actions
What kind of operation?
Read or Write?
Probably both!
Well, it matters!!
(main concern)
The write operation ratio increases with the number of users.
(2nd concern)
Even if the read/write ratio is low, the absolute number of reads per second still has to be considered.
Denormalization
But how?
Sounds like we can use MongoDB for this!?
The knowledge exists within the company!
Yes, but...
"Benchmarking Top NoSQL Databases: Apache Cassandra, Couchbase, HBase, and MongoDB", End Point, 13 April 2015
"...those blue bars look promising"
The incoming data is mainly events that are based on time
This suits the Cassandra data model (if sorted by time)
CREATE TABLE IF NOT EXISTS raw_user_ab_logs (
    application text,
    start_time text,
    timestamp timestamp,
    event text,
    user_id text,
    group text,
    PRIMARY KEY (
        (application, start_time, group),
        timestamp, user_id
    )
) WITH CLUSTERING ORDER BY (timestamp DESC);
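To make the primary key concrete, here is a minimal plain-Python sketch of how such a table lays out its rows: grouped by the partition key (application, start_time, group) and ordered newest-first inside each partition. The sample rows and the `layout` helper are hypothetical and only model the idea; this is not how Cassandra is implemented.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows:
# (application, start_time, group, timestamp, user_id, event)
rows = [
    ("shop", "2015-06-01", "A", datetime(2015, 6, 1, 10, 0), "u1", "open"),
    ("shop", "2015-06-01", "A", datetime(2015, 6, 1, 10, 5), "u2", "open"),
    ("shop", "2015-06-01", "B", datetime(2015, 6, 1, 10, 1), "u3", "open"),
    ("shop", "2015-06-01", "A", datetime(2015, 6, 1, 10, 9), "u1", "buy"),
]


def layout(rows):
    """Group rows by partition key; order each partition by the clustering columns."""
    partitions = defaultdict(list)
    for application, start_time, group, ts, user_id, event in rows:
        partitions[(application, start_time, group)].append((ts, user_id, event))
    for key in partitions:
        # Two stable sorts: user_id ascending first, then timestamp descending,
        # which mimics CLUSTERING ORDER BY (timestamp DESC) with user_id left
        # at its default ascending order.
        partitions[key].sort(key=lambda r: r[1])
        partitions[key].sort(key=lambda r: r[0], reverse=True)
    return dict(partitions)


table = layout(rows)
for key, partition in table.items():
    print(key, [(ts.isoformat(), user, event) for ts, user, event in partition])
```

Reading the latest events of one experiment group then touches a single partition and returns the rows already sorted by time.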
Rows are distributed across nodes by their partition key
Example: a 4-node Cassandra cluster that partitions data over a token range of 100.
Uses consistent hashing to determine which node stores which row.
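A rough sketch of that idea, assuming the simplified token range of 100 from the example and four hypothetical nodes placed at tokens 0, 25, 50 and 75. The MD5-based `token` function is a stand-in chosen for illustration; a real cluster uses its own partitioner and a much larger token range.

```python
import bisect
import hashlib

RING_SIZE = 100  # simplified token range; assumption taken from the 4-node example

# Hypothetical node placement: each node owns the tokens up to and including its own.
NODES = {0: "node1", 25: "node2", 50: "node3", 75: "node4"}
TOKENS = sorted(NODES)


def token(partition_key):
    """Map a partition key (tuple of strings) to a token in [0, RING_SIZE)."""
    digest = hashlib.md5("|".join(partition_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def owner_of_token(t):
    """Walk clockwise to the first node token >= t, wrapping around the ring."""
    idx = bisect.bisect_left(TOKENS, t)
    return NODES[TOKENS[idx % len(TOKENS)]]


def owner(partition_key):
    return owner_of_token(token(partition_key))


key = ("shop", "2015-06-01", "A")
print(key, "-> token", token(key), "->", owner(key))
```

Because the owner depends only on the hash of the partition key, every client can work out which node holds a row without a central lookup.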
Lightning-Fast Cluster Computing
Apache Spark is a fast and general engine for large-scale data processing.
Why Spark?
We want to process the data close to where it is stored, preferably on the same network.
Cassandra has a connector!
But this use case has to support user interaction
More specifically, user-supplied funnel keys (start and goal key)
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [<funnel-startKey>, <funnel-betweenKey>, <funnel-endKey>]  # <-- application arguments
Submitting an application to the Spark cluster takes time.
This would most likely add around 10 seconds of overhead to the query latency.
!
curl --data-binary @target/scala-2.10/<jar-file-name>.jar localhost:8090/jars/funnelAnalysis
curl -d "input.string = <funnel-startKey> <funnel-betweenKey> <funnel-endKey>" \
  'localhost:8090/jobs?appName=funnelAnalysis&classPath=spark.jobserver.sparkFunnelAnalysisApp'
{
  "status": "STARTED",
  "result": {
    "jobId": "4b0edc57-b689-459b-8e7b-f99855128194",
    "context": "553a26df-spark.jobserver.sparkFunnelAnalysisApp"
  }
}
curl localhost:8090/jobs/4b0edc57-b689-459b-8e7b-f99855128194  # <-- get results
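The submit-then-fetch round trip could be scripted roughly as below, using only the Python standard library. The response shape is taken from the JSON above. The helper names, the polling interval, and the assumption that any status other than "STARTED" or "RUNNING" means the job is done are illustrative guesses, not documented job server behavior.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8090"  # job server address used in the curl commands


def job_url(submit_response, base_url=BASE_URL):
    """Build the result URL from the JSON returned when the job was submitted."""
    job_id = json.loads(submit_response)["result"]["jobId"]
    return "{}/jobs/{}".format(base_url, job_id)


def is_finished(body):
    # Assumption: "STARTED" and "RUNNING" are the only non-terminal statuses.
    return body.get("status") not in ("STARTED", "RUNNING")


def wait_for_result(url, interval=1.0, attempts=30):
    """Poll the job URL until the job is no longer running, then return the body."""
    for _ in range(attempts):
        with urllib.request.urlopen(url) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        if is_finished(body):
            return body
        time.sleep(interval)
    raise TimeoutError("job did not finish in time: " + url)


sample_response = """
{"status": "STARTED",
 "result": {"jobId": "4b0edc57-b689-459b-8e7b-f99855128194",
            "context": "553a26df-spark.jobserver.sparkFunnelAnalysisApp"}}
"""
print(job_url(sample_response))
```

Calling `wait_for_result(job_url(sample_response))` only works against a running job server, so it is left out of the sketch.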
Funnel counts (show on white board?)
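One possible way to compute funnel counts, as a plain-Python sketch rather than the Spark job from the demo: a user counts for a step only after reaching all earlier steps in order. The event names and the sample data are hypothetical.

```python
from collections import defaultdict


def funnel_counts(events, funnel_keys):
    """Count how many users reached each funnel step, in order.

    events: iterable of (timestamp, user_id, event) tuples.
    funnel_keys: ordered list such as [start_key, between_key, end_key].
    """
    progress = defaultdict(int)  # user_id -> number of funnel steps completed
    for _, user_id, event in sorted(events):  # replay events in time order
        step = progress[user_id]
        if step < len(funnel_keys) and event == funnel_keys[step]:
            progress[user_id] = step + 1
    return [
        sum(1 for done in progress.values() if done > i)
        for i in range(len(funnel_keys))
    ]


# Hypothetical events: (timestamp, user_id, event)
events = [
    (1, "u1", "open"), (2, "u1", "cart"), (3, "u1", "buy"),
    (1, "u2", "open"), (2, "u2", "buy"),
    (1, "u3", "cart"), (2, "u3", "open"), (3, "u3", "cart"),
]
counts = funnel_counts(events, ["open", "cart", "buy"])
print(counts)  # [3, 2, 1]
```

Here u2 drops out after the start key, and u3's first "cart" is ignored because it happens before "open".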
DEMO
Usually you have to implement and maintain the same logic in two different systems (one for batch, one for real time)
Not with Spark Streaming
(can reuse the same system logic)
Provide real-time anomaly detection on error logs.
Say the error log count for an Android application starts to increase: we can notify the customer right away, instead of waiting for a scheduled daily batch process.
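The per-window check behind such an alert could look like the sketch below. It is plain Python, not the Spark Streaming API, and the class name, the history length, and the "mean plus three standard deviations" rule are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class ErrorRateMonitor:
    """Flag a window whose error count is far above the recent baseline."""

    def __init__(self, history=5, threshold=3.0):
        self.history = deque(maxlen=history)  # counts of the most recent windows
        self.threshold = threshold            # allowed deviations above the mean

    def observe(self, count):
        """Record one window's error count; return True if it looks anomalous."""
        anomalous = False
        if len(self.history) == self.history.maxlen:
            baseline = mean(self.history)
            spread = pstdev(self.history) or 1.0  # avoid a zero-width band
            anomalous = count > baseline + self.threshold * spread
        self.history.append(count)
        return anomalous


monitor = ErrorRateMonitor()
# Hypothetical error counts per time window for one application
window_counts = [10, 12, 11, 9, 10, 11, 60]
flags = [monitor.observe(c) for c in window_counts]
print(flags)
```

With a streaming engine the same `observe` step would run once per micro-batch, so the spike in the last window triggers a notification within one window instead of the next day.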
2/6
16/8
31/8
First handout
Second version