Using Apache Flink as a microservice for stateful async processing

@Helpshift

Jagadish Bihani

Introduction

Principal Software Engineer/Architect
- Hardcore backend engineer
- Distributed Systems
- Previously worked as a BigData Engineer
Helpshift
- Customer service platform and in-app support software
- High scale, complex workflows, e.g. classification of customer tickets, event-based automations on tickets, etc.

Breaking down the title of the talk
Microservice - Independent service that performs a different computation/workload than the conventional application server
Event-based system
State update is needed for each event
Output is not synchronous; it is a function of (future events, time)
The traditional approach of using a database as the state store has challenges

Latency under a high volume of events
Ensuring consistency and atomicity across failures is hard because a series of state operations has to be performed
A lot of bookkeeping is needed for failure scenarios
Need: a system that is stateful and manages its state itself, in a fault-tolerant way and with exactly-once semantics
Minimizes the development cycle and adds robustness
Technology investment is justified by other potential use cases in other parts of the system
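The "stateful async processing" pattern above (state updated on every event, output produced later as a function of future events and time) can be sketched in a few lines. This is a toy in-memory model, not Flink code: the class `TicketAutomation`, the event kinds, and the timeout are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TicketAutomation:
    """Toy keyed operator: emit a reminder if a ticket gets no reply in time."""
    timeout: int
    state: dict = field(default_factory=dict)  # ticket_id -> deadline (per-key state)

    def on_event(self, ticket_id, kind, now):
        """Update state for every event; nothing is returned synchronously."""
        if kind == "opened":
            self.state[ticket_id] = now + self.timeout
        elif kind == "replied":
            # A later event cancels the pending output and clears the state.
            self.state.pop(ticket_id, None)

    def on_time(self, now):
        """Advance time: emit output for every key whose deadline has passed."""
        due = sorted(t for t, deadline in self.state.items() if deadline <= now)
        for ticket_id in due:
            del self.state[ticket_id]
        return [("reminder", ticket_id) for ticket_id in due]


op = TicketAutomation(timeout=10)
op.on_event("T1", "opened", now=0)
op.on_event("T2", "opened", now=1)
op.on_event("T1", "replied", now=5)
print(op.on_time(now=20))  # [('reminder', 'T2')]
```

In this sketch the dictionary lives in process memory; the point of a stream processor is that it keeps this per-key state and the timers fault-tolerant without the application doing the bookkeeping.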

Apache Flink: framework for distributed stream processing
Yet another? Not really!
Strong theoretical foundation
Exactly-once semantics 'for stateful computations'
Removes the conventional trade-off of low latency vs. high throughput
Features:
- Event time, Windowing based on time, count, sessions, CEP support, Lightweight fault tolerance

Fault tolerance - The most important concept; it dictates many other things
- Checkpointing: consistent snapshots of the distributed data stream and operator state.
- What does it mean?
- Chandy-Lamport Algorithm
- Barriers
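To make "consistent snapshot" and "barriers" concrete, here is a toy model of barrier alignment in a single operator with several input channels. It is a simplified illustration of the idea behind Flink's barrier-based variant of Chandy-Lamport, not Flink's implementation; it assumes every channel carries exactly one barrier, and `run_with_alignment` is a hypothetical helper.

```python
from collections import deque

BARRIER = "BARRIER"


def run_with_alignment(channels):
    """Read the channels round-robin and return (snapshot, final_state).

    The operator state is a running sum of all records. A channel that has
    delivered its barrier is blocked until the barrier has arrived on every
    channel; only then is the state snapshotted and the channels unblocked.
    """
    queues = [deque(c) for c in channels]
    blocked = [False] * len(queues)
    state = 0
    snapshot = None
    while any(queues):
        progressed = False
        for i, q in enumerate(queues):
            if not q or blocked[i]:
                continue  # nothing to read, or waiting for alignment
            item = q.popleft()
            progressed = True
            if item == BARRIER:
                blocked[i] = True
                if all(blocked):
                    snapshot = state  # consistent cut: all pre-barrier records
                    blocked = [False] * len(queues)
            else:
                state += item
        if not progressed:
            break  # a barrier is missing on some channel; avoid spinning
    return snapshot, state


# Channel 0 delivers its barrier early; its record 10 must wait.
print(run_with_alignment([[1, BARRIER, 10], [2, 3, 4, BARRIER]]))  # (10, 20)
```

The snapshot is 1 + 2 + 3 + 4 = 10: it contains every record that precedes the barrier on either channel and nothing that follows it, which is what lets a restart from the snapshot avoid both lost and double-counted records.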

With the default config, it took 70 seconds for a dead TaskManager's processing to be transferred to another TaskManager
Product requirement - ETA of 15 seconds
How to tune?
- Tune Restart Strategy
- Tune the failure detection mechanism
  - Parameters: heartbeat interval, expected pause time, and network characteristics
Operational Recommendations
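The tuning knobs above can be reasoned about with a rough timing model. The sketch below assumes a plain timeout-based detector (a worker is declared dead once no heartbeat has arrived for `pause` milliseconds) and ignores network delay. The function names and all numbers are hypothetical; they are neither Flink configuration keys nor the values used in the talk.

```python
def detection_delay(death_time, interval, pause):
    """Milliseconds between a worker dying and the detector noticing.

    Heartbeats are sent at 0, interval, 2 * interval, ...; a heartbeat due
    exactly at death_time is assumed to have been sent.
    """
    last_heartbeat = (death_time // interval) * interval
    return last_heartbeat + pause - death_time


def failover_time(death_time, interval, pause, restart_delay, restore_time):
    """Detection delay plus restart delay plus time to restore state."""
    return detection_delay(death_time, interval, pause) + restart_delay + restore_time


# Worker dies 12.5 s in; 1 s heartbeats, 5 s acceptable pause,
# 1 s restart delay, 3 s to restore state from the last checkpoint.
total = failover_time(12_500, interval=1_000, pause=5_000,
                      restart_delay=1_000, restore_time=3_000)
print(total)  # 8500
```

In this model the detection delay always falls between `pause - interval` and `pause`, so the acceptable pause dominates the budget. Setting it too low risks false positives when the network or a long garbage-collection pause delays heartbeats, which is why network characteristics belong in the tuning.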

State leak leads to monotonically increasing state
Code example of State leak
Impact of state leak
- Random processing delays, with no fixed pattern to detect
The randomness of the delays is caused by the nature of asynchronous checkpointing
If checkpointing is async, why does it affect processing?
- The synchronous part of an asynchronous checkpoint
- Background garbage collection in RocksDB
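The talk's own code example of a state leak is not reproduced in this text. As a stand-in, here is a hypothetical toy operator (plain Python, not Flink code) showing the typical shape of the bug: a code path that reads the per-key state on the final event but never deletes it.

```python
class SessionCounter:
    """Toy keyed operator that counts events per session."""

    def __init__(self, clear_on_end):
        self.clear_on_end = clear_on_end
        self.state = {}  # session_id -> event count

    def on_event(self, session_id, kind):
        if kind == "end":
            if self.clear_on_end:
                # Correct: emit the final count and release the state.
                return self.state.pop(session_id, 0)
            # Leak: the count is emitted but the entry stays forever.
            return self.state.get(session_id, 0)
        self.state[session_id] = self.state.get(session_id, 0) + 1
        return None


def state_size_after(sessions, clear_on_end):
    """Run `sessions` short sessions and return the number of keys left."""
    op = SessionCounter(clear_on_end)
    for session_id in range(sessions):
        op.on_event(session_id, "click")
        op.on_event(session_id, "end")
    return len(op.state)


print(state_size_after(1000, clear_on_end=False))  # 1000
print(state_size_after(1000, clear_on_end=True))   # 0
```

With the leak, state size grows with the total number of keys ever seen rather than with the number of active keys, so every checkpoint has more to snapshot.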

Problem: state has leaked, but it must be cleared with zero downtime
Using domain knowledge
- For each leaked state key, send dummy events that make the existing business logic clear the state, with no code deployment
General purpose solution to fix state leak without downtime
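The domain-knowledge approach can be sketched as follows. This is a hypothetical toy model, not Flink code: it assumes the deployed business logic already clears a key's state on a "closed" event, and that the list of leaked keys is known from outside the job (for example from the primary database). `TicketOperator` and `dummy_close_events` are illustrative names.

```python
class TicketOperator:
    """Toy keyed operator whose existing logic clears state on 'closed'."""

    def __init__(self):
        self.state = {}  # ticket_id -> list of event kinds seen so far

    def on_event(self, ticket_id, kind):
        if kind == "closed":
            self.state.pop(ticket_id, None)
        else:
            self.state.setdefault(ticket_id, []).append(kind)


def dummy_close_events(leaked_ticket_ids):
    """Build synthetic events to feed through the normal input stream."""
    return [(ticket_id, "closed") for ticket_id in leaked_ticket_ids]


op = TicketOperator()
for ticket_id in ("T1", "T2", "T3"):
    op.on_event(ticket_id, "opened")

# T1 and T2 are known (from domain knowledge) to be finished, but their
# 'closed' events never reached the job, so their state leaked.
for ticket_id, kind in dummy_close_events(["T1", "T2"]):
    op.on_event(ticket_id, kind)

print(sorted(op.state))  # ['T3']
```

Because the dummy events travel through the same input path as real events, the running job cleans itself up without a restart; the trade-off is that the approach only works where some existing event already triggers the cleanup.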