Batch Processing
Types of systems
Service (online system): waits for a request and tries to handle it as quickly as possible. Response time and availability matter most.
Batch processing system (offline system): takes a large amount of input data, runs a job to process it, and produces output data. A job may take a while and is often run periodically. Throughput matters (time to process a dataset of a given size).
Stream processing (near-real-time): consumes inputs and produces outputs rather than responding to requests, like batch, but operates on each input shortly after it arrives - lower latency than batch.
Example
Suppose you have a web server log to process and want to find the 5 most visited pages.
Compare a Unix command-line pipeline (awk | sort | uniq -c | sort -rn | head) with an equivalent Ruby script.
The Ruby script tallies URLs in an in-memory hash table; the Unix pipeline does not.
If there are not many distinct URLs, the in-memory approach is fine.
If there are many, the Unix sort utility handles it: it merge-sorts chunks that fit in memory and spills sorted runs to disk with sequential writes. The bottleneck becomes the read throughput of the original file.
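A minimal in-memory version (the hash-table approach the Ruby script takes) in Python - an illustrative sketch only, assuming the requested URL is the seventh whitespace-separated field of each line, as in a default nginx access log:

```python
from collections import Counter

counts = Counter()
with open("access.log") as log:        # hypothetical log file name
    for line in log:
        counts[line.split()[6]] += 1   # 7th field: the requested URL

for url, n in counts.most_common(5):   # 5 most visited pages
    print(n, url)
```

This works until the number of distinct URLs exceeds memory, which is exactly where the sort-based pipeline wins.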
Why Unix is cool
- uniform interface: everything is a file (descriptor), so any program's output can be piped into any other's input
- logic is separated from wiring: the same program can be connected via a pipe, a TCP socket, a file, etc.
- inputs are immutable, so a pipeline can be re-run without damaging them
- you can drop the "less" command into the middle of a pipeline to inspect the data flowing through (like tap in rxjs)
- you can dump an intermediate result into a file and use it as the starting point of another pipeline
The biggest limitation, though, is that Unix tools run on a single machine only.
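A loose analogy in Python (purely illustrative, not from the source): generators give a similarly uniform interface - each stage consumes and produces an iterator of items - so the wiring stays separate from the logic, like shell pipes:

```python
import sys

def read_lines(path):
    with open(path) as f:              # source stage, like cat
        yield from f

def extract_url(lines):
    for line in lines:                 # transform stage, like awk '{print $7}'
        yield line.split()[6]

def tap(items):
    for item in items:                 # inspection stage, like less mid-pipeline
        print(item, file=sys.stderr)
        yield item

# wiring: compose the stages; each stage stays reusable on its own
for url in tap(extract_url(read_lines("access.log"))):
    pass
```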
MapReduce
- does not modify its input
- writes output sequentially, and once written the output stays unchanged
- uses a distributed file system (e.g. HDFS) as the gluing interface between jobs
Example with an nginx log:
- read the log and break it into lines
- extract the URL from each line (mapper)
- sort (mapper output is always sorted by the framework)
- count occurrences - identical URLs are now adjacent (reducer)
mapper - called once per input record; can generate any number of key-value pairs, including zero
reducer - called once per key with all the values belonging to that key; reduces them and emits output records (e.g. a single count per URL)
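A minimal sketch of the two callbacks in Python, with a tiny local stand-in for the framework (illustrative names, not Hadoop's actual API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit any number of key-value pairs, including zero
    fields = line.split()
    if len(fields) > 6:
        yield fields[6], 1                 # key = URL, value = 1

def reducer(url, counts):
    # called once per key with all of its values
    yield url, sum(counts)

def run_job(lines):
    # local stand-in for the framework: map, sort by key, group, reduce
    pairs = sorted(kv for line in lines for kv in mapper(line))
    for url, group in groupby(pairs, key=itemgetter(0)):
        yield from reducer(url, (count for _, count in group))
```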
MapReduce job execution
- the mapper runs on the machine that holds its input file block - the "put the computation near the data" principle; only the mapper application code has to be copied to the processing node, because the input is already there
- the number of map tasks is determined by the number of input file blocks; the number of reducer partitions is chosen by the job author
- reducer partitioning is done by a hash of the key (see the sketch below)
- sorting happens first on the mapper node, separately for each reducer partition; each reducer then fetches its files from all the mapper nodes and merges them, preserving the sort order. This whole process is called the shuffle.
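The partitioning step itself is tiny - a hedged sketch of the usual hash-partitioning default:

```python
import zlib

def reducer_partition(key: str, num_reducers: int) -> int:
    # a stable hash, so every mapper sends a given key to the same reducer
    # (Python's built-in hash() is salted per process, hence crc32 here)
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```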

Reduce-side Joins
Naive approach: look up user data record by record from a remote DB. This limits throughput (one round trip per record), and the result is non-deterministic, since the data can be updated between two requests.
Solution: take a copy of the DB and put it in the same distributed file system, close to the join process.
Records from one input can be ordered ahead of records from the other input for the same key (e.g. the user record with the date of birth comes before that user's activity events) - a secondary sort. Since mapper output is sorted by key and the reducer merges the sorted lists from both inputs, the algorithm is called a sort-merge join.
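A sketch of the reducer side in Python, assuming the secondary sort has already arranged each key's values so the user record arrives before that user's events (all names here are illustrative):

```python
def join_reducer(user_id, values):
    # thanks to the secondary sort, values arrive user-record first:
    # ("user", {"dob": ...}) followed by ("event", {"url": ...}) records
    values = iter(values)
    tag, user = next(values)
    assert tag == "user"
    for tag, event in values:
        # emit one joined record per activity event
        yield event["url"], user["dob"]
```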

Reduce-side Groups
Examples: count, sum, top-k records, etc.
Skew: too much data for one particular key (a linchpin object, or hot key).
Pig's skewed join first runs a sampling job to determine which keys are hot.
Crunch's sharded join requires the hot keys to be specified manually.
In both cases, records for a hot key are spread over several reducers, and the other join input is replicated to all of them - see the sketch below.
Hive's skewed join instead stores hot keys in separate files and uses a map-side join for them.
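A hedged sketch of the hot-key spreading idea in Python (illustrative; real frameworks do this inside the partitioner):

```python
import random

NUM_SHARDS = 8          # how many reducers each hot key is spread over

def shard_for(key, hot_keys):
    # records for a hot key go to a random shard; other keys keep a
    # single deterministic shard so normal grouping still works
    if key in hot_keys:
        return (key, random.randrange(NUM_SHARDS))
    return (key, 0)

def replicate(key, record, hot_keys):
    # the other join input must be copied to every shard of a hot key
    shards = range(NUM_SHARDS) if key in hot_keys else [0]
    for shard in shards:
        yield (key, shard), record
```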
Map-side joins
Joining a large dataset with a small one:
- if the small dataset fits in memory, each mapper loads it into a hash table and streams the large dataset past it - a broadcast hash join (sketched below)
- or store the small dataset on local disk alongside the big one, but create an index to access the data
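A minimal broadcast hash join in Python, assuming the small dataset fits in memory on every mapper (a generic sketch, not any particular framework's API):

```python
def broadcast_hash_join(small_records, large_records, key):
    # build phase: load the small dataset into an in-memory hash table
    table = {key(record): record for record in small_records}

    # probe phase: stream the large dataset and look each record up
    for big in large_records:
        small = table.get(key(big))
        if small is not None:
            yield big, small
```

For example (names hypothetical): broadcast_hash_join(users, events, key=lambda r: r["user_id"]), where users is the small input.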
Reduce-side joins make no assumptions about the input data - whatever it looks like, the mappers prepare it for joining. The downside: copying data from mappers to reducers (the shuffle) can be expensive.
The output of a reduce-side join is partitioned and sorted by the join key.
The output of a map-side join is partitioned and sorted the same way as its large input dataset.
Materialization
Problems:
- one task can start only after the previous one has fully completed
- mappers are often redundant: they just read back the file a previous reducer wrote; reducers could be chained directly instead
- in a distributed file system, materialized intermediate files are replicated across nodes, which is overkill for temporary data
Solution: dataflow engines (Spark, Tez, Flink), which model the whole job as a chain of operators (lambda functions) - see the sketch after this list:
- sorting is performed only where needed (in MapReduce it always happens)
- no unnecessary map tasks
- room for optimizations, since the framework is aware of all the steps of the job
- intermediate state can be kept in memory instead of on disk
- operators can start consuming input as soon as it is produced; no need to wait for the entire preceding stage to finish
- JVM processes can be reused instead of being started over and over for each task
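For illustration, here is the earlier top-5-URLs job written in the chained-operator style - a PySpark-style sketch, assuming a local Spark installation and the same hypothetical log format as above:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "top-urls")

top5 = (sc.textFile("access.log")                  # no materialized files between steps
          .map(lambda line: line.split()[6])       # extract the URL field
          .map(lambda url: (url, 1))
          .reduceByKey(lambda a, b: a + b)         # the only shuffle in the job
          .takeOrdered(5, key=lambda kv: -kv[1]))  # 5 most visited pages

print(top5)
```

Because the engine sees the whole chain up front, it can fuse the two map steps and keep intermediate state in memory.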
When materialization is still needed:
- for fault tolerance: if a computation is expensive to start over, a materialized intermediate result gives a point to restart from
- for sorting: a sort operator has to consume its whole input before it can produce any output, so we have to wait for the entire task to finish anyway before proceeding with the next one