Flink Operations

Flink Slots

  • Task managers have "slots"; a "slot" can be thought of as a worker.
  • Because of context switches, Flink recommends one slot per core: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#configuring-taskmanager-processing-slots
  • We can break this rule, but that means performance tuning. At first we had many slots per core, and not much work got done because of hand-offs between threads (see the sketch after this list).
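
For reference, the setting behind this is taskmanager.numberOfTaskSlots, described in the config docs linked above. A minimal Java sketch, using a local environment purely to make the knob visible in code; on a real cluster the value lives in flink-conf.yaml on each task manager:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotConfigSketch {
    public static void main(String[] args) throws Exception {
        // Same key that flink-conf.yaml uses on each task manager.
        Configuration conf = new Configuration();
        conf.setInteger("taskmanager.numberOfTaskSlots", 4); // e.g. a 4-core box

        // Local environment only so the setting shows up in code.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);

        env.fromElements(1, 2, 3).print();
        env.execute("slot config sketch");
    }
}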

 

Job and Operator Parallelism

  • We can set parallelism per job or per operator (see the sketch after this list)
  • If we set it per job, all operators appear to get the same parallelism
  • Tuning each operator is part of the Flink production-readiness checklist and should always be done: https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/production_ready.html
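
A minimal Java sketch of the two levels, assuming the standard DataStream API; the operators and numbers are only illustrative:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-level setting: every operator defaults to this parallelism.
        env.setParallelism(4);

        env.fromElements("a", "b", "c")
                // Operator-level override: this map gets 8 parallel instances
                // regardless of the job-level default.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                }).setParallelism(8)
                // Funnel output through a single sink instance.
                .print().setParallelism(1);

        env.execute("per-operator parallelism sketch");
    }
}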

Relationship Between Slots and Parallelism

  • In order to properly tune an operator, you need to know how many slots will be consumed
  • Counting this relies on understanding operator "chaining" vs "unchaining" (by default operators are "chained", but this is configurable via the streaming API; see the sketch after this list)
  • If you "unchain" an operator, slots appear to equal parallelism; each operator instance gets its own thread.
  • With "chaining", Flink notices that in many cases the same thread could continue executing the next operation without a thread hand-off or network traffic, so the operations are "chained".

Chaining vs Unchaining

(Diagrams from the linked Flink docs showing chained vs unchained task execution.)

https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#task-chaining-and-resource-groups


Summary

  • Start with 1 slot per core, per the Flink docs
  • Slots are difficult to calculate due to chaining, which intentionally concentrates load on each box to avoid copying/shuffling data between threads or over the network

Snapshots

At Least Once vs Exactly Once

If execution pauses from when the first barrier reaches an operator until the last barrier reaches it, we get exactly-once processing; if we continue processing in the meantime, we get at-least-once.

Two input streams feeding Op1, with B marking the checkpoint barrier in each:

(t1, B, t2) -> Op1

(t3, t4, B) ----^

 

https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
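
A minimal Java sketch of picking between the two modes when enabling checkpoints; the interval and job are placeholders:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot every 60 seconds. AT_LEAST_ONCE keeps processing while barriers
        // catch up, so uneven streams don't stall; EXACTLY_ONCE aligns on the barriers.
        env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

        // Equivalent, via the checkpoint config object:
        // env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing mode sketch");
    }
}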

Where to View Stats?

Summary

Snapshot time is a funny way to say "worst-case stream latency".

 

Until we can make all streams flow more evenly, we need to use at-least-once processing.

 

The actual time we spend snapshotting is just a second or two.

Async Operators

https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html
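
A minimal Java sketch of the async I/O API from the page above, with a made-up EnrichFunction standing in for a real non-blocking client; note that in older releases the callback interface was called AsyncCollector rather than ResultFuture:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class AsyncIoSketch {

    // Hypothetical enrichment: pretend each key is looked up by an async client.
    static class EnrichFunction implements AsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> key + ":enriched") // stand-in for a real async call
                    .thenAccept(value -> resultFuture.complete(Collections.singleton(value)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("a", "b", "c");

        // unorderedWait = emit results as they complete; 5 s timeout, at most 100 in flight.
        DataStream<String> enriched =
                AsyncDataStream.unorderedWait(input, new EnrichFunction(), 5, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("async i/o sketch");
    }
}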

 

Summary

Async as much as we can

The Flink docs on this topic are partly wrong and partly missing.

Flink Operations

By Philip Doctor
