Flink Operations

Flink Slots

  • Task managers have "slots"; a "slot" can be thought of as a worker.
  • Because of context switches, Flink recommends one slot per core: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#configuring-taskmanager-processing-slots
  • We can break this rule, but that means performance tuning. At first we had many slots per core, and not much work got done because of hand-offs between threads (see the sketch after this list).
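
For reference, the setting behind this is taskmanager.numberOfTaskSlots, described in the config docs linked above. A minimal Java sketch, using a local environment purely to make the knob visible in code; on a real cluster the value lives in flink-conf.yaml on each task manager:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotConfigSketch {
    public static void main(String[] args) throws Exception {
        // Same key that flink-conf.yaml uses on each task manager.
        Configuration conf = new Configuration();
        conf.setInteger("taskmanager.numberOfTaskSlots", 4); // e.g. a 4-core box

        // Local environment only so the setting shows up in code.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);

        env.fromElements(1, 2, 3).print();
        env.execute("slot config sketch");
    }
}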

 

Job and Operator Parallelism

  • We can set parallelism per job or per operator (see the sketch after this list)
  • If we set it per job, all operators appear to get the same parallelism
  • Tuning each operator is part of the Flink production-readiness checklist and should always be done: https://ci.apache.org/projects/flink/flink-docs-release-1.2/ops/production_ready.html
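
A minimal Java sketch of the two levels, assuming the standard DataStream API; the operators and numbers are only illustrative:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Job-level setting: every operator defaults to this parallelism.
        env.setParallelism(4);

        env.fromElements("a", "b", "c")
                // Operator-level override: this map gets 8 parallel instances
                // regardless of the job-level default.
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) {
                        return value.toUpperCase();
                    }
                }).setParallelism(8)
                // Funnel output through a single sink instance.
                .print().setParallelism(1);

        env.execute("per-operator parallelism sketch");
    }
}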

Relationship Between Slots and Parallelism

  • In order to properly tune an operator, you need to know how many slots will be consumed
  • Counting this relies on understanding operator "chaining" vs "unchaining" (by default operators are "chained", but this is configurable via the streaming API; see the sketch after this list)
  • If you "unchain" an operator, slots appear to equal parallelism; each operator instance gets its own thread.
  • With "chaining", Flink notices that in many cases the same thread could continue executing the next operation without a thread hand-off or network traffic, so the operations are "chained".

Chaining vs Unchaining

(Diagrams from the linked Flink docs showing chained vs unchained task execution.)

https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#task-chaining-and-resource-groups


Summary

  • Start with 1 slot per core, per the Flink docs
  • Slots are difficult to calculate due to chaining, which intentionally concentrates load on each box to avoid copying/shuffling data between threads or over the network

Snapshots

At Least Once vs Exactly Once

If execution pauses from when the first barrier reaches an operator until the last barrier reaches it, we get exactly-once processing; if we continue processing in the meantime, we get at-least-once.

Two input streams feeding Op1, with B marking the checkpoint barrier in each:

(t1, B, t2) -> Op1

(t3, t4, B) ----^

 

https://ci.apache.org/projects/flink/flink-docs-master/internals/stream_checkpointing.html
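
A minimal Java sketch of picking between the two modes when enabling checkpoints; the interval and job are placeholders:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot every 60 seconds. AT_LEAST_ONCE keeps processing while barriers
        // catch up, so uneven streams don't stall; EXACTLY_ONCE aligns on the barriers.
        env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);

        // Equivalent, via the checkpoint config object:
        // env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(1, 2, 3).print();
        env.execute("checkpointing mode sketch");
    }
}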

Where to View Stats?

Summary

Snapshot time is a funny way to say "worst-case stream latency".

 

Until we can make all streams flow more evenly, we need to use at-least-once processing.

 

The actual time we spend snapshotting is just a second or two.

Async Operators

https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html
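
A minimal Java sketch of the async I/O API from the page above, with a made-up EnrichFunction standing in for a real non-blocking client; note that in older releases the callback interface was called AsyncCollector rather than ResultFuture:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.ResultFuture;

public class AsyncIoSketch {

    // Hypothetical enrichment: pretend each key is looked up by an async client.
    static class EnrichFunction implements AsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String key, ResultFuture<String> resultFuture) {
            CompletableFuture
                    .supplyAsync(() -> key + ":enriched") // stand-in for a real async call
                    .thenAccept(value -> resultFuture.complete(Collections.singleton(value)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("a", "b", "c");

        // unorderedWait = emit results as they complete; 5 s timeout, at most 100 in flight.
        DataStream<String> enriched =
                AsyncDataStream.unorderedWait(input, new EnrichFunction(), 5, TimeUnit.SECONDS, 100);

        enriched.print();
        env.execute("async i/o sketch");
    }
}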

 

Summary

Async as much as we can

The Flink docs on this topic are partly wrong and partly missing.

Flink Operations

By Philip Doctor
