Yuchen Zhang
A Tiny Session
Batch Processing
Often breaks a job into individual steps such as splitting, mapping, shuffling, and reducing.
Deals with very large datasets that require quite a bit of computation.
Stream Processing
Real-time processing that operates on a continuous stream of data.
Sometimes keeps the data in the cluster's memory to avoid writing it back.
https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
[Diagram: a spout emits tuples to bolts, and bolts pass tuples on to other bolts]
Topology
Spout: sentences
Bolt: split
Bolt: count words
Bolt: report
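The word-count topology above can be sketched in plain Python to show how tuples flow from the spout through each bolt. This is only a single-process simulation, not the Storm API: the function names and the two sample sentences are made up for illustration, and each "bolt" is just a generator consuming the previous stage's tuples.

```python
from collections import Counter


def sentence_spout():
    """Stand-in for the spout: emits one-field tuples holding a sentence."""
    for sentence in ["the cow jumped over the moon", "the moon is far"]:
        yield (sentence,)


def split_bolt(tuples):
    """Splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    """Keeps a running count per word and emits (word, count) tuples."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
        yield (word, counts[word])


def report_bolt(tuples):
    """Remembers the latest count seen for each word."""
    latest = {}
    for word, count in tuples:
        latest[word] = count
    return latest


def run_topology():
    # Wire the stages together: spout -> split -> count -> report.
    return report_bolt(count_bolt(split_bolt(sentence_spout())))
```

In a real topology each stage would run as many parallel tasks on different workers; here everything runs in order in one process, which is enough to see the data flow.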
Nodes: The servers in the cluster; each one executes part of the topology's work.
Workers: JVM processes on a node; each node can run one or more workers.
Executor: A thread inside a worker that runs tasks; by default Storm assigns one task per executor.
Task: An instance of a spout or bolt; executors run its nextTuple() or execute() method.
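Shuffle: tuples are distributed randomly across the bolt's tasks
Fields: tuples are grouped by field values, so equal values always go to the same task
All: every tuple is copied to all of the bolt's tasks, so each task receives every tuple
Global: the whole stream goes to a single task; use with care
Direct: the producer declares which task receives the tuple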
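As a rough illustration of the five groupings, the sketch below maps a tuple to the list of task indexes that would receive it. It is plain Python, not Storm's implementation: the function names and the CRC-based hash are assumptions chosen to keep the example deterministic.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle: pick one task at random."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field_indexes):
    """Fields: hash the selected field values so equal values share a task."""
    key = repr(tuple(tup[i] for i in field_indexes)).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]


def all_grouping(tup, num_tasks):
    """All: every task receives a copy of the tuple."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Global: everything goes to one task (here the lowest index)."""
    return [0]


def direct_grouping(tup, num_tasks, target_task):
    """Direct: the producer names the receiving task explicitly."""
    if not 0 <= target_task < num_tasks:
        raise ValueError("target_task out of range")
    return [target_task]
```

For the word-count example, a fields grouping on the word is what guarantees that all occurrences of the same word reach the same counting task.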
nextTuple()
ack()
fail()
At-least-once or exactly-once processing? RTFM (read the friendly manual):
http://storm.apache.org/releases/1.1.0/Guaranteeing-message-processing.html
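To make the at-least-once behavior concrete, here is a small simulation of a spout that replays failed messages. It is a sketch under assumptions, not Storm code: `ReplayingSpout`, `run`, and the `flaky_ids` parameter are hypothetical names, and a "failure" is simply simulated the first time a flagged message is processed.

```python
class ReplayingSpout:
    """Toy spout: tracks pending messages and re-queues them on fail()."""

    def __init__(self, messages):
        self.queue = list(enumerate(messages))  # (msg_id, payload) pairs
        self.pending = {}                       # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.pop(0)
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Fully processed: forget the message.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: put the message back so it is emitted again.
        payload = self.pending.pop(msg_id)
        self.queue.append((msg_id, payload))


def run(spout, flaky_ids):
    """Processes every message; ids in flaky_ids fail once after the work is done."""
    processed = []
    failed_once = set()
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, payload = emitted
        processed.append(payload)  # the side effect happens before ack/fail
        if msg_id in flaky_ids and msg_id not in failed_once:
            failed_once.add(msg_id)
            spout.fail(msg_id)
        else:
            spout.ack(msg_id)
    return processed
```

Running `run(ReplayingSpout(["a", "b", "c"]), {1})` processes "b" twice: nothing is lost, but downstream logic has to tolerate the duplicate, which is the gap between at-least-once and exactly-once.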
Trident is a high-level abstraction for doing realtime computing on top of Storm; it adds transactions and state.
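One way to picture "transaction & state" is the idea of storing a transaction id next to each value, so that a replayed batch is not counted twice. The sketch below is plain Python, not the Trident API; `TransactionalCounter` is a hypothetical name, and it assumes a replayed batch carries the same txid and the same tuples as the original attempt.

```python
from collections import Counter


class TransactionalCounter:
    """Toy state: stores (count, last_txid) per word to make updates idempotent."""

    def __init__(self):
        self.store = {}  # word -> (count, txid of the last batch applied)

    def apply_batch(self, txid, words):
        for word, n in Counter(words).items():
            count, last_txid = self.store.get(word, (0, None))
            if last_txid == txid:
                # This batch was already applied to this word: skip the replay.
                continue
            self.store[word] = (count + n, txid)
```

Applying batch 1 twice leaves the counts unchanged the second time, while batch 2 updates them normally; this is how replays can yield exactly-once results in the state even though tuples are processed at least once.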
DRPC (distributed RPC): parallelize the computation of really intense functions on the fly using Storm, as a request-response call through a topology.
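The request-response shape of DRPC can be sketched as follows: a server tags each request with an id, hands it to the topology, and returns the result that comes back with the same id. This is a synchronous toy in plain Python, not Storm's DRPC server; `DrpcServerSketch` and `exclamation_topology` are hypothetical names, and the real system spreads the computation across many tasks.

```python
import itertools


class DrpcServerSketch:
    """Toy DRPC server: matches results to requests by request id."""

    def __init__(self, topology):
        self._ids = itertools.count(1)
        self._topology = topology
        self._results = {}

    def execute(self, args):
        request_id = next(self._ids)
        # The topology receives (request_id, args) and returns (request_id, result).
        returned_id, result = self._topology(request_id, args)
        self._results[returned_id] = result
        # Hand the result back to the caller that is waiting on this id.
        return self._results.pop(request_id)


def exclamation_topology(request_id, args):
    """Stand-in topology: appends '!' and carries the request id through."""
    return request_id, args + "!"
```

A caller would use it like an ordinary function call, for example `DrpcServerSketch(exclamation_topology).execute("hello")`.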
By Yuchen Zhang