Batch Processing
Types of systems
Service (online system): waits for a request and tries to handle it as quickly as possible. Response time and availability matter most.
Batch processing system (offline system): takes a large amount of input data, runs a job to process it, and produces output data. A job may take a while and is often run periodically. Throughput matters (time to process a dataset of a given size).
Stream processing (near-real-time): consumes inputs and produces outputs rather than responding to requests, like batch, but operates on each input shortly after it arrives - lower latency than batch.
Example
Suppose you have a web server log to process and want to find the 5 most visited pages.
Compare a Unix command-line pipeline (awk | sort | uniq -c | sort -rn | head) with an equivalent Ruby script.
The Ruby script tallies URLs in an in-memory hash table; the Unix pipeline does not.
If there are not many distinct URLs, the in-memory approach is fine.
If there are many, the Unix sort utility handles it: it merge-sorts chunks that fit in memory and spills sorted runs to disk with sequential writes. The bottleneck becomes the read throughput of the original file.
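A minimal in-memory version (the hash-table approach the Ruby script takes) in Python - an illustrative sketch only, assuming the requested URL is the seventh whitespace-separated field of each line, as in a default nginx access log:

```python
from collections import Counter

counts = Counter()
with open("access.log") as log:        # hypothetical log file name
    for line in log:
        counts[line.split()[6]] += 1   # 7th field: the requested URL

for url, n in counts.most_common(5):   # 5 most visited pages
    print(n, url)
```

This works until the number of distinct URLs exceeds memory, which is exactly where the sort-based pipeline wins.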
Why Unix is cool
- uniform interface: everything is a file (descriptor), so any program's output can be piped into any other's input
- logic is separated from wiring: the same program can be connected via a pipe, a TCP socket, a file, etc.
- inputs are immutable, so a pipeline can be re-run without damaging them
- you can drop the "less" command into the middle of a pipeline to inspect the data flowing through (like tap in rxjs)
- you can dump an intermediate result into a file and use it as the starting point of another pipeline
The biggest limitation, though, is that Unix tools run on a single machine only.
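A loose analogy in Python (purely illustrative, not from the source): generators give a similarly uniform interface - each stage consumes and produces an iterator of items - so the wiring stays separate from the logic, like shell pipes:

```python
import sys

def read_lines(path):
    with open(path) as f:              # source stage, like cat
        yield from f

def extract_url(lines):
    for line in lines:                 # transform stage, like awk '{print $7}'
        yield line.split()[6]

def tap(items):
    for item in items:                 # inspection stage, like less mid-pipeline
        print(item, file=sys.stderr)
        yield item

# wiring: compose the stages; each stage stays reusable on its own
for url in tap(extract_url(read_lines("access.log"))):
    pass
```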
MapReduce
- does not modify its input
- writes output sequentially, and once written the output stays unchanged
- uses a distributed file system (e.g. HDFS) as the gluing interface between jobs
Example with an nginx log:
- read the log and break it into lines
- extract the URL from each line (mapper)
- sort (mapper output is always sorted by the framework)
- count occurrences - identical URLs are now adjacent (reducer)
mapper - called once per input record; can generate any number of key-value pairs, including zero
reducer - called once per key with all the values belonging to that key; reduces them and emits output records (e.g. a single count per URL)
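A minimal sketch of the two callbacks in Python, with a tiny local stand-in for the framework (illustrative names, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit any number of key-value pairs, including zero
    fields = line.split()
    if len(fields) > 6:
        yield fields[6], 1                 # key = URL, value = 1

def reducer(url, counts):
    # called once per key with all of its values
    yield url, sum(counts)

def run_job(lines):
    # local stand-in for the framework: map, sort by key, group, reduce
    pairs = sorted(kv for line in lines for kv in mapper(line))
    for url, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(url, (count for _, count in group))
```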
MapReduce job execution
- the mapper runs on the machine that holds its input file block - the "put the computation near the data" principle; only the mapper application code has to be copied to the processing node, because the input is already there
- the number of map tasks is determined by the number of input file blocks; the number of reducer partitions is chosen by the job author
- reducer partitioning is done by a hash of the key (see the sketch below)
- sorting happens first on the mapper node, separately for each reducer partition; each reducer then fetches its files from all the mapper nodes and merges them, preserving the sort order. This whole process is called the shuffle.
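The partitioning step itself is tiny - a hedged sketch of the usual hash-partitioning default:

```python
import zlib

def reducer_partition(key: str, num_reducers: int) -> int:
    # a stable hash, so every mapper sends a given key to the same reducer
    # (Python's built-in hash() is salted per process, hence crc32 here)
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```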

Reduce-side Joins
Naive approach: look up user data record by record from a remote DB. This limits throughput (one round trip per record), and the result is non-deterministic, since the data can be updated between two requests.
Solution: take a copy of the DB and put it in the same distributed file system, close to the join process.
Records from one input can be ordered ahead of records from the other input for the same key (e.g. the user record with the date of birth comes before that user's activity events) - a secondary sort. Since mapper output is sorted by key and the reducer merges the sorted lists from both inputs, the algorithm is called a sort-merge join.
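A sketch of the reducer side in Python, assuming the secondary sort has already arranged each key's values so the user record arrives before that user's events (all names here are illustrative):

```python
def join_reducer(user_id, values):
    # thanks to the secondary sort, values arrive user-record first:
    # ("user", {"dob": ...}) followed by ("event", {"url": ...}) records
    values = iter(values)
    tag, user = next(values)
    assert tag == "user"
    for tag, event in values:
        # emit one joined record per activity event
        yield event["url"], user["dob"]
```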

Reduce-side Groups
Examples: count, sum, top-k records, etc.
Skew: too much data for one particular key (a linchpin object, or hot key).
Pig's skewed join first runs a sampling job to determine which keys are hot.
Crunch's sharded join requires the hot keys to be specified manually.
In both cases, records for a hot key are spread over several reducers, and the other join input is replicated to all of them - see the sketch below.
Hive's skewed join instead stores hot keys in separate files and uses a map-side join for them.
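A hedged sketch of the hot-key spreading idea in Python (illustrative; real frameworks do this inside the partitioner):

```python
import random

NUM_SHARDS = 8          # how many reducers each hot key is spread over

def shard_for(key, hot_keys):
    # records for a hot key go to a random shard; other keys keep a
    # single deterministic shard so normal grouping still works
    if key in hot_keys:
        return (key, random.randrange(NUM_SHARDS))
    return (key, 0)

def replicate(key, record, hot_keys):
    # the other join input must be copied to every shard of a hot key
    shards = range(NUM_SHARDS) if key in hot_keys else [0]
    for shard in shards:
        yield (key, shard), record
```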
Map-side joins
Joining a large dataset with a small one:
- if the small dataset fits in memory, each mapper loads it into a hash table and streams the large dataset past it - a broadcast hash join (sketched below)
- or store the small dataset on local disk alongside the big one, but create an index to access the data
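A minimal broadcast hash join in Python, assuming the small dataset fits in memory on every mapper (a generic sketch, not any particular framework's API):

```python
def broadcast_hash_join(small_records, large_records, key):
    # build phase: load the small dataset into an in-memory hash table
    table = {key(record): record for record in small_records}

    # probe phase: stream the large dataset and look each record up
    for big in large_records:
        small = table.get(key(big))
        if small is not None:
            yield big, small
```

For example (names hypothetical): broadcast_hash_join(users, events, key=lambda r: r["user_id"]), where users is the small input.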
Reduce-side joins make no assumptions about the input data - whatever it looks like, the mappers prepare it for joining. The downside: copying data from mappers to reducers (the shuffle) can be expensive.
The output of a reduce-side join is partitioned and sorted by the join key.
The output of a map-side join is partitioned and sorted the same way as its large input dataset.
Materialization
Problems:
- one task can start only after the previous one has fully completed
- mappers are often redundant: they just read back the file a previous reducer wrote; reducers could be chained directly instead
- in a distributed file system, materialized intermediate files are replicated across nodes, which is overkill for temporary data
Solution: dataflow engines (Spark, Tez, Flink), which model the whole job as a chain of operators (lambda functions) - see the sketch after this list:
- sorting is performed only where needed (in MapReduce it always happens)
- no unnecessary map tasks
- room for optimizations, since the framework is aware of all the steps of the job
- intermediate state can be kept in memory instead of on disk
- operators can start consuming input as soon as it is produced; no need to wait for the entire preceding stage to finish
- JVM processes can be reused instead of being started over and over for each task
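For illustration, here is the earlier top-5-URLs job written in the chained-operator style - a PySpark-style sketch, assuming a local Spark installation and the same hypothetical log format as above:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "top-urls")

top5 = (sc.textFile("access.log")                  # no materialized files between steps
          .map(lambda line: line.split()[6])       # extract the URL field
          .map(lambda url: (url, 1))
          .reduceByKey(lambda a, b: a + b)         # the only shuffle in the job
          .takeOrdered(5, key=lambda kv: -kv[1]))  # 5 most visited pages

print(top5)
```

Because the engine sees the whole chain up front, it can fuse the two map steps and keep intermediate state in memory.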
When materialization is still needed:
- for fault tolerance: if a computation is expensive to start over, a materialized intermediate result gives a point to restart from
- for sorting: a sort operator has to consume its whole input before it can produce any output, so we have to wait for the entire task to finish anyway before proceeding with the next one