Using Apache Flink as a microservice for stateful async processing
Principal Software Engineer/Architect
Hardcore backend engineer
Previously worked as a BigData Engineer
Customer Service Platform and in-app support software
High Scale, complex workflows - e.g. Classification of customer tickets, Event based automations on tickets etc.
Breaking down the title of the talk
Microservice - An independent service that runs a different computation or workload than the conventional application server
Event based system
State update is needed for each event
Output is not synchronous; it is a function of (future events, time)
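The three properties above can be sketched in plain Python (this is not Flink code). The class `TicketProcessor`, its methods `on_event` and `on_time`, and the rule "escalate a ticket if no agent reply arrives within a timeout" are all hypothetical, chosen only to show state that is updated on every event and output that depends on later events and on time.

```python
from dataclasses import dataclass, field


@dataclass
class TicketProcessor:
    """Toy keyed processor: one state entry per ticket id (hypothetical example)."""
    timeout: int                                   # seconds to wait for a reply
    deadlines: dict = field(default_factory=dict)  # ticket_id -> escalation deadline

    def on_event(self, ticket_id, kind, ts):
        """Update state for every event; nothing is emitted synchronously."""
        if kind == "customer_message":
            # Start the clock only if no deadline is already pending.
            self.deadlines.setdefault(ticket_id, ts + self.timeout)
        elif kind == "agent_reply":
            # A later event cancels the pending output.
            self.deadlines.pop(ticket_id, None)

    def on_time(self, now):
        """Emit outputs whose deadline has passed, and clear their state."""
        due = sorted(t for t, d in self.deadlines.items() if d <= now)
        for ticket_id in due:
            del self.deadlines[ticket_id]
        return [("escalate", t) for t in due]


p = TicketProcessor(timeout=60)
p.on_event("T1", "customer_message", ts=0)
p.on_event("T2", "customer_message", ts=10)
p.on_event("T1", "agent_reply", ts=30)   # T1 is answered in time
early = p.on_time(now=50)                # nothing is due yet
late = p.on_time(now=70)                 # T2 reaches its deadline at ts=70
```

Whether anything is emitted for a ticket is unknown when its first event arrives; it is decided only by what happens (or does not happen) afterwards.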
The traditional approach of using a database as the state store has challenges
Latency under a high volume of events
Ensuring consistency and atomicity across failures is hard because a series of state operations must be performed for each event
A lot of bookkeeping is needed to handle failure scenarios
Need for a system that is stateful and manages its own state in a fault-tolerant way, with exactly-once semantics
Minimizes the development cycle and adds robustness
The technology investment is justified by other potential use cases in other parts of the system
Framework for distributed stream processing
Yet another ? - Not Really!
Strong theoretical foundation
Exactly-once semantics 'for stateful computations'
Removes the conventional trade-off - low latency vs. high throughput
Event time, Windowing based on time, count, sessions, CEP support, Lightweight fault tolerance
Fault Tolerance - The most important concept; it dictates much of the rest of the design
Consistent snapshots of the distributed data stream and operator state.
What does it mean?
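One way to read "consistent snapshot": the position in the input stream and the operator state are captured together, so that after a failure both can be rolled back to the same point and the input replayed. The sketch below is a single-process toy under that assumption, not Flink's distributed mechanism; the function `run` and its parameters `checkpoint_every` and `fail_at` are hypothetical.

```python
def run(events, checkpoint_every, fail_at=None):
    """Count events per key; optionally crash once at index fail_at, then recover."""
    state, offset = {}, 0
    snapshot = ({}, 0)            # (copy of operator state, source offset)
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Crash: discard in-flight state and roll back to the last snapshot.
            state, offset = dict(snapshot[0]), snapshot[1]
            continue
        key = events[offset]
        state[key] = state.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # State and offset are captured together, so they stay consistent.
            snapshot = (dict(state), offset)
    return state


events = ["a", "b", "a", "c", "a", "b", "c"]
clean = run(events, checkpoint_every=3)
recovered = run(events, checkpoint_every=3, fail_at=5)
```

Events between the last snapshot and the crash are processed twice, but their first effect on state is discarded by the rollback, so the final state reflects each event exactly once.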
Supported State Backends (RocksDB/HDFS)
Checkpointing by default is synchronous
For large states this can cause latency issues
For processing and async checkpointing to coexist:
Copy-on-write structures are needed, such as those used in RocksDB
How does copy-on-write solve this?
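A minimal sketch of the idea, loosely modelled on a store whose persisted files are immutable: taking a snapshot only seals the current write buffer and hands out references, and later writes go to a fresh buffer, so a slow background copy never sees them. The class `CowStore` and the function `materialize` are hypothetical and are not RocksDB's actual implementation.

```python
from types import MappingProxyType


class CowStore:
    """Toy LSM-like store: sealed segments are immutable, writes go to a fresh buffer."""

    def __init__(self):
        self.segments = []   # immutable, oldest first
        self.buffer = {}     # mutable write buffer

    def put(self, key, value):
        self.buffer[key] = value

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for seg in reversed(self.segments):   # newest segment wins
            if key in seg:
                return seg[key]
        return None

    def snapshot(self):
        """Short synchronous step: seal the buffer and hand out references only."""
        if self.buffer:
            self.segments.append(MappingProxyType(self.buffer))
            self.buffer = {}
        return tuple(self.segments)   # no data is copied


def materialize(snapshot):
    """Slow step that could run in the background, e.g. an upload to durable storage."""
    merged = {}
    for seg in snapshot:
        merged.update(seg)
    return merged


store = CowStore()
store.put("ticket-1", "open")
store.put("ticket-2", "open")
snap = store.snapshot()
store.put("ticket-1", "closed")     # processing continues after the snapshot
store.put("ticket-3", "open")
persisted = materialize(snap)       # still sees the state as of the snapshot
```

The only work done while processing is blocked is sealing the buffer; the expensive part operates on data that can no longer change.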
Productionization Problems and Resolutions
Flink Taskmanager failover time tuning
With the default config, it took 70 seconds for a dead TaskManager's processing to be transferred to another TaskManager
Product requirement - ETA of 15 seconds
How to tune?
Tune Restart Strategy
Failure detection mechanism tuning
Parameters: heartbeat interval, expected pause time, and network characteristics
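A rough way to reason about these knobs is to add up the pieces of a failover: failure detection, the restart strategy's delay, and state restore. The model below is a simplified assumption, not a formula from Flink, and every number in it is made up for illustration; the class `FailoverSettings` and its fields are hypothetical names, not Flink configuration keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverSettings:
    heartbeat_interval_s: float   # how often heartbeats are sent
    acceptable_pause_s: float     # silence tolerated before declaring a failure
    restart_delay_s: float        # wait imposed by the restart strategy
    restore_time_s: float         # time to reload state and resume processing

    def worst_case_failover_s(self):
        # Simplified assumption: a node dies right after sending a heartbeat,
        # so detection takes one interval plus the tolerated pause.
        detection = self.heartbeat_interval_s + self.acceptable_pause_s
        return detection + self.restart_delay_s + self.restore_time_s

    def tolerates_stall(self, longest_stall_s):
        # A pause shorter than normal GC or network stalls causes false alarms.
        return self.acceptable_pause_s > longest_stall_s


# Illustrative values only; they are not the settings used in the talk.
candidate = FailoverSettings(heartbeat_interval_s=2, acceptable_pause_s=8,
                             restart_delay_s=1, restore_time_s=3)
estimate = candidate.worst_case_failover_s()
within_eta = estimate <= 15
safe = candidate.tolerates_stall(longest_stall_s=5)
```

The second check reflects why network characteristics matter: shrinking the pause lowers detection time, but a pause below the longest normal stall makes healthy nodes look dead.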
State leak leads to monotonically increasing state
Code example of State leak
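The original example is not reproduced in these notes. As a stand-in, here is a hypothetical plain-Python sketch of how such a leak can arise: per-key state is created on the first event for a ticket but never removed when the ticket finishes. The class `AutomationState` and the `clear_on_close` flag are invented for illustration.

```python
class AutomationState:
    """Toy keyed state for ticket automations (hypothetical business logic)."""

    def __init__(self, clear_on_close):
        self.clear_on_close = clear_on_close
        self.per_ticket = {}   # ticket_id -> number of events seen

    def on_event(self, ticket_id, kind):
        self.per_ticket[ticket_id] = self.per_ticket.get(ticket_id, 0) + 1
        if kind == "closed" and self.clear_on_close:
            # Without this branch, entries for finished tickets are never removed.
            del self.per_ticket[ticket_id]


def replay(state, n_tickets):
    """Feed a full lifecycle for each ticket and return the number of live keys."""
    for i in range(n_tickets):
        state.on_event(f"T{i}", "created")
        state.on_event(f"T{i}", "closed")
    return len(state.per_ticket)


leaky_size = replay(AutomationState(clear_on_close=False), 1000)
fixed_size = replay(AutomationState(clear_on_close=True), 1000)
```

In the leaky variant the number of keys grows with every ticket ever seen, which is the monotonic growth described above.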
Impact of state leak
Delays in processing randomly without a fixed pattern to detect
Random nature of delays caused by the nature of asynchronous checkpointing
If checkpointing is async, why does it affect processing?
Synchronous phase of async checkpointing
Background garbage collection in RocksDB
Fix State leak without downtime
Problem: State has leaked, but we need to clear it with zero downtime
Using domain knowledge
For each leaked state key, send a dummy event that lets the existing business logic clear the state without a code deployment
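A minimal sketch of the dummy-event approach, assuming the deployed business logic already clears a key when it sees a terminal event. The event shape `(ticket_id, kind)`, the `"closed"` kind, and the helper `cleanup_events` are hypothetical.

```python
def handle(state, event):
    """Unchanged business logic: a 'closed' event clears the ticket's state."""
    ticket_id, kind = event
    if kind == "closed":
        state.pop(ticket_id, None)
    else:
        state[ticket_id] = state.get(ticket_id, 0) + 1


def cleanup_events(leaked_keys):
    """Build synthetic 'closed' events for keys known to be finished."""
    return [(key, "closed") for key in sorted(leaked_keys)]


# Leaked entries: tickets that ended without a 'closed' event ever arriving.
state = {"T1": 3, "T2": 1, "T3": 7}
still_active = {"T3"}                       # known from domain knowledge
leaked = set(state) - still_active

for event in cleanup_events(leaked):        # injected through the normal input
    handle(state, event)
```

Because the cleanup travels through the normal input path, the running job is never stopped or redeployed; the risk moves to correctly identifying which keys are truly finished.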
General purpose solution to fix state leak without downtime
Parameters to monitor
Input source rate
Output sink rate
Summary & Questions