Using Apache Flink as a microservice for stateful async processing
@Helpshift
By
Jagadish Bihani
Introduction
- Principal Software Engineer/Architect
- Hardcore backend engineer
- Distributed Systems
- Previously worked as a BigData Engineer
- Helpshift
- Customer service platform and in-app support software
- High Scale, complex workflows - e.g. Classification of customer tickets, Event based automations on tickets etc.
Use case
- Breaking down the title of the talk
- Microservice - Independent service for performing different computation/different workload than the conventional application server
- Event based system
- State update is needed for each event
- Output is not synchronous; it is a function of (future events, time)
- Traditional approaches that use a database as the state store have challenges
..continued
- Higher latency at high event volumes
- Ensuring consistency and atomicity on failure is hard, because each event needs a series of state operations
- A lot of bookkeeping needed in case of failure scenarios
- Need a system that is stateful and manages its own state in a fault-tolerant way, with exactly-once semantics
- Shortens the development cycle and adds robustness
- Technology investment is justified because other parts of the system have potential use cases as well
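The use case above (state updated on every event, output produced later as a function of further events and time) can be sketched in plain Python. This is a toy single-process model, not Flink code; the class name `TicketTimeoutTracker`, the event kinds, and the 30-second timeout are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TicketTimeoutTracker:
    """Toy model: emit an alert if a ticket gets no agent reply in time."""
    timeout: int = 30                              # seconds, hypothetical value
    deadlines: dict = field(default_factory=dict)  # ticket_id -> deadline

    def on_event(self, ticket_id, kind, ts):
        # State is updated on every event; nothing is emitted synchronously.
        if kind == "customer_message":
            self.deadlines[ticket_id] = ts + self.timeout
        elif kind == "agent_reply":
            # A later event cancels the pending output.
            self.deadlines.pop(ticket_id, None)

    def on_time(self, now):
        # Output depends on time passing, like a timer firing.
        due = sorted(t for t, d in self.deadlines.items() if d <= now)
        for t in due:
            del self.deadlines[t]
        return [("no_reply_alert", t) for t in due]


tracker = TicketTimeoutTracker()
tracker.on_event("t1", "customer_message", ts=0)
tracker.on_event("t2", "customer_message", ts=5)
tracker.on_event("t1", "agent_reply", ts=10)
alerts = tracker.on_time(now=40)
```

In Flink the same shape is typically expressed with keyed state and timers; here the state lives in one process and would be lost on a crash, which is the gap the bullets above describe.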
Flink Introduction
- Framework for distributed stream processing
- Yet another ? - Not Really!
- Strong theoretical foundation
- Exactly Once semantics 'for stateful computations'
- Removes the conventional trade-off between low latency and high throughput
- Features :
- Event time, Windowing based on time, count, sessions, CEP support, Lightweight fault tolerance
Concepts
- Fault Tolerance - Most important concept which dictates many things
- Checkpointing : Consistent snapshots of the distributed data stream and operator state.
- What does it mean?
- Chandy-Lamport Algorithm
- Barriers
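Barrier-based checkpointing can be illustrated with a toy operator that reads from two inputs. This is a simplified sketch of barrier alignment, not Flink's implementation; the class `SummingOperator` and the integer records are assumptions. Once a barrier arrives on one input, later records from that input are held back until the barrier has arrived on every input, so the snapshot contains exactly the pre-barrier records.

```python
from collections import deque

BARRIER = "BARRIER"


class SummingOperator:
    """Toy two-input operator that keeps a running sum as its state."""

    def __init__(self, num_inputs=2):
        self.total = 0
        self.num_inputs = num_inputs
        self.blocked = set()      # inputs whose barrier has already arrived
        self.buffered = deque()   # records held back from blocked inputs
        self.snapshots = []       # state captured at each checkpoint

    def receive(self, channel, item):
        if item == BARRIER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_inputs:
                # Barriers seen on every input: snapshot, then unblock.
                self.snapshots.append(self.total)
                self.blocked.clear()
                pending, self.buffered = self.buffered, deque()
                for ch, rec in pending:
                    self.receive(ch, rec)
        elif channel in self.blocked:
            # Record belongs to the next checkpoint; hold it back.
            self.buffered.append((channel, item))
        else:
            self.total += item


op = SummingOperator()
op.receive(0, 1)
op.receive(0, BARRIER)
op.receive(0, 10)   # post-barrier record on input 0: buffered
op.receive(1, 2)    # pre-barrier record on input 1: applied
op.receive(1, BARRIER)
```

The snapshot holds the sum of the two pre-barrier records only; the buffered record is applied after the snapshot is taken.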
..continued
- State
- User defined/Managed
- Supported State Backends (RocksDB/HDFS)
..continued
- Asynchronous Checkpointing
- Checkpointing by default is synchronous
- For large states this can cause latency issues
- For processing and async checkpointing to coexist -
- Needs copy-on-write structures, such as those used in RocksDB
- How does copy-on-write solve this?
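A minimal sketch of the copy-on-write idea follows, assuming a hypothetical in-memory `CowState` class. Taking a snapshot only hands out a reference, so processing is not stalled while the snapshot is written out in the background. For brevity this version copies the whole map on the first write after a snapshot; real structures copy at a finer granularity, and RocksDB gets a similar effect from immutable on-disk files.

```python
class CowState:
    """Toy key-value state with copy-on-write snapshots."""

    def __init__(self):
        self._data = {}
        self._shared = False   # True while a snapshot still references _data

    def snapshot(self):
        # O(1): hand out the current dict; the caller must treat it as read-only.
        self._shared = True
        return self._data

    def put(self, key, value):
        if self._shared:
            # First write after a snapshot: copy, so the snapshot stays frozen.
            self._data = dict(self._data)
            self._shared = False
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


state = CowState()
state.put("a", 1)
snap = state.snapshot()   # could now be persisted by a background thread
state.put("a", 2)         # processing continues without touching snap
state.put("b", 3)
```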
Productionization Problems and Resolutions
Flink Taskmanager failover time tuning
- With the default config, it took 70 seconds for a dead TaskManager's processing to be transferred to another TaskManager
- Product requirement - ETA of 15 seconds
- How to tune?
- Tune Restart Strategy
- Failure detection mechanism tuning
- Parameters: heartbeat interval, expected pause time, and network characteristics
- Operational Recommendations
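A back-of-envelope way to reason about the tuning knobs above is to add up the phases of a failover. The model below is an assumption made for illustration, not Flink's actual failure detector: it treats detection time as heartbeat interval plus acceptable pause, then adds the restart delay and the state restore time. Only the 15-second target comes from the slide; every input value is made up.

```python
def estimated_failover_seconds(heartbeat_interval, acceptable_pause,
                               restart_delay, restore_time):
    """Rough upper bound: detect the failure, wait, then restore state."""
    detection = heartbeat_interval + acceptable_pause
    return detection + restart_delay + restore_time


TARGET_SECONDS = 15  # product requirement from the slide

# All four inputs in each call are made-up illustrative values.
loose = estimated_failover_seconds(10, 60, 0, 5)
tight = estimated_failover_seconds(2, 8, 1, 3)
```

Under this model the detection term dominates, which is why the restart strategy and the failure detection settings are tuned together. Tightening the pause too far risks false positives on a slow network.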
State Leaks
- State leak leads to monotonically increasing state
- Code example of State leak
- Impact of state leak
- Random processing delays with no fixed pattern to detect
- The randomness comes from the nature of asynchronous checkpointing
- If checkpointing is async, why does it affect processing?
- The synchronous part of asynchronous checkpointing
- Background garbage collection in RocksDB
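A state leak of the kind described above can be sketched in plain Python. This is not the code from the talk; the class names and the "opened"/"closed" event kinds are hypothetical. The leaky version adds an entry per key and never removes it, so state grows with the number of keys ever seen.

```python
class LeakyCounter:
    """Per-ticket event counts; entries are never removed."""

    def __init__(self):
        self.state = {}

    def on_event(self, ticket_id, kind):
        self.state[ticket_id] = self.state.get(ticket_id, 0) + 1
        # Bug: nothing clears the entry when the ticket is closed.


class CleaningCounter(LeakyCounter):
    def on_event(self, ticket_id, kind):
        super().on_event(ticket_id, kind)
        if kind == "closed":
            # Terminal event: this key will not be seen again, so free it.
            del self.state[ticket_id]


def run(counter, num_tickets):
    for i in range(num_tickets):
        counter.on_event(i, "opened")
        counter.on_event(i, "closed")
    return len(counter.state)


leaked = run(LeakyCounter(), 1000)
clean = run(CleaningCounter(), 1000)
```

The leaky version ends with one entry per ticket ever processed, while the fixed version ends empty.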
Fix State leak without downtime
- Problem: State is leaked but we need to clear the state with zero downtime
- Using domain knowledge
- Send dummy events for each leaked state key so that the existing business logic clears the state, with no code deployment
- General purpose solution to fix state leak without downtime
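The dummy-event approach can be sketched as follows, assuming the deployed business logic already has a cleanup path for a terminal event. The function names, the "closed" event kind, and the ticket keys are hypothetical. The point is that `process` is never changed; only new input is injected.

```python
def process(state, event):
    """Existing business logic: unchanged, already deployed."""
    key, kind = event
    if kind == "closed":
        state.pop(key, None)          # existing cleanup path
    else:
        state[key] = state.get(key, 0) + 1


def dummy_events_for(leaked_keys):
    # Synthetic terminal events, injected through the normal input source.
    return [(key, "closed") for key in leaked_keys]


state = {}
for event in [("t1", "opened"), ("t2", "opened"),
              ("t2", "closed"), ("t3", "opened")]:
    process(state, event)
before = dict(state)

# Domain knowledge (hypothetical): t1 and t3 are known to be finished
# upstream, but their "closed" events never reached the job.
for event in dummy_events_for(["t1", "t3"]):
    process(state, event)
```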
Monitoring
- Parameters to monitor
- State size
- Checkpoint time
- Succeeded/failed checkpoints
- Input source rate
- Output sink rate
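One simple check on the state-size metric is to flag strictly increasing growth across recent samples, which is the signature of a state leak. The sketch below is an assumption about how such an alert could be written, not part of any Flink API; the function name, the window of five samples, and the sample values are hypothetical.

```python
def looks_like_state_leak(state_sizes, min_samples=5):
    """True if state size strictly grew across every recent sample."""
    recent = state_sizes[-min_samples:]
    if len(recent) < min_samples:
        return False   # not enough data to judge
    return all(a < b for a, b in zip(recent, recent[1:]))


growing = looks_like_state_leak([100, 120, 150, 170, 200])
steady = looks_like_state_leak([100, 120, 110, 130, 140])
too_short = looks_like_state_leak([1, 2, 3])
```

A job whose state legitimately grows during a traffic ramp would also trip this check, so in practice it would be paired with the input source rate.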
Summary & Questions
Apache Flink - Production Report
By Jagadish Bihani