0. Streaming 101
1. Latency & Throughput
2. Offset Management (with Kafka source)
3. Process Fail Over
4. Monitoring
5. Detect Outliers & Alerting
- Next ....
A traditional data warehouse architecture for data analytics
Architecture sketch of a streaming analytics system
Source: Stream Processing with Apache Flink (Safari Books)
Latency : how long it takes for an event to be processed
Throughput : events or operations per time unit
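The two definitions above can be made concrete with a small calculation. This is a minimal sketch, assuming each event carries a creation timestamp and a processing-completion timestamp; the `ProcessedEvent` type and both function names are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class ProcessedEvent:
    created_at: float    # seconds, when the event was produced (assumed field)
    processed_at: float  # seconds, when processing finished (assumed field)


def mean_latency(events):
    """Average time from creation to end of processing, in seconds."""
    return sum(e.processed_at - e.created_at for e in events) / len(events)


def throughput(events, window_start, window_end):
    """Events whose processing finished in [window_start, window_end), per second."""
    count = sum(1 for e in events if window_start <= e.processed_at < window_end)
    return count / (window_end - window_start)


events = [
    ProcessedEvent(0.0, 0.5),
    ProcessedEvent(1.0, 1.25),
    ProcessedEvent(2.0, 2.75),
]
print(mean_latency(events))          # average seconds per event
print(throughput(events, 0.0, 3.0))  # events per second in the window
```

Latency is measured per event, while throughput is measured over a time window, so the two can move independently: batching more events together often raises throughput and raises latency at the same time.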
Source: https://www.slideshare.net/BrandonOBrien/spark-streaming-kafka-best-practices-w-brandon-obrien
Stand alone best
Source: http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
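For the details, read the post by 엄태욱 (formerly of SK Planet) :)
"Spark Streaming으로 유실 없는 스트림 처리 인프라 구축하기" (Building a Lossless Stream Processing Infrastructure with Spark Streaming)
( http://readme.skplanet.com/?p=12465 )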
External Store
- HBase
- Zookeeper
- Redis
Internal Store
- Local File
Reference links
- http://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/
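The store options listed above share one pattern: read the last saved offset on startup, process a batch, and save the new offset only after the batch succeeds. The following is a minimal sketch of that pattern, assuming an in-memory dictionary as a stand-in for HBase, ZooKeeper, or Redis and a plain Python list as a stand-in for a Kafka partition. `InMemoryOffsetStore` and `process_batch` are hypothetical names, and no real Kafka or Spark API is used.

```python
class InMemoryOffsetStore:
    """Stand-in for an external store such as HBase, ZooKeeper, or Redis."""

    def __init__(self):
        self._offsets = {}

    def load(self, topic, partition):
        # Next offset to read; 0 when nothing has been saved yet.
        return self._offsets.get((topic, partition), 0)

    def save(self, topic, partition, next_offset):
        self._offsets[(topic, partition)] = next_offset


def process_batch(log, store, topic, partition, batch_size, handler):
    """Process one batch, then save the offset (at-least-once semantics)."""
    start = store.load(topic, partition)
    batch = log[start:start + batch_size]
    for record in batch:
        handler(record)  # if this raises, the saved offset is not advanced
    store.save(topic, partition, start + len(batch))
    return len(batch)


log = ["a", "b", "c", "d", "e"]  # pretend partition contents
store = InMemoryOffsetStore()
seen = []
process_batch(log, store, "events", 0, 2, seen.append)
print(store.load("events", 0), seen)
```

Because the offset is saved after processing, a crash in the middle of a batch causes that batch to be reprocessed on restart, so downstream writes should be idempotent (for example, upserts).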
- When the Redis Cluster has a problem
- When a Kafka Broker has a problem
- When Yarn has a problem
- When the Master Node has a problem...
- And so on
=> The process must be able to restart automatically.
( PM2, systemd, service ... )
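In production this job belongs to a process manager such as the ones named above (systemd, for instance, offers `Restart=on-failure`). Purely to illustrate the restart logic, here is a minimal sketch of a supervisor loop with exponential backoff. `supervise` and `run_once` are hypothetical names; `run_once` is assumed to start the streaming job, block until it exits, and return its exit code.

```python
import time


def supervise(run_once, max_restarts=5, base_delay=1.0, sleep=time.sleep):
    """Call run_once() until it exits with 0 or max_restarts is used up.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        exit_code = run_once()
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        # Exponential backoff so a failing dependency is not hammered.
        sleep(base_delay * (2 ** restarts))
        restarts += 1


# Simulated job: fails twice, then exits cleanly.
exit_codes = iter([1, 1, 0])
delays = []
restarts_needed = supervise(lambda: next(exit_codes), sleep=delays.append)
print(restarts_needed, delays)
```

In a real setup `run_once` could wrap `subprocess.call` on the job's launch command. The restart limit matters: a job that restarts forever can hide a broken dependency, so giving up should trigger an alert.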
That said, this is going too far;;
Hopefully Hadoop 3 will solve this..
- An absurdly large batch query ran on the Yarn Cluster and caused a network failure
- The Kafka Broker Controller died
- Consumers lag because a Kafka Broker's NIC is aging
- Producing lags because a Kafka Broker's NIC is aging
- Kafka Partition information changed
- The HBase region server process lags
Reference: http://readme.skplanet.com/?p=13110
- CPU
- Memory
- Network Traffic
- Disk
- Load avg
- Web Server ( 4xx, 5xx Req Count, RPS, Load Balancing, etc ... )
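Some of the host-level metrics above can be sampled with the Python standard library alone. This is a minimal sketch, assuming a Unix-like host. `host_snapshot` is a hypothetical helper, and it leaves out memory, network traffic, and web server counters, which normally come from a monitoring agent.

```python
import os
import shutil


def host_snapshot(path="."):
    """Collect a few host metrics: CPU count, 1-minute load average, disk usage."""
    usage = shutil.disk_usage(path)
    try:
        load_1m = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None  # load average is not available on this platform
    return {
        "cpu_count": os.cpu_count(),
        "load_avg_1m": load_1m,
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
    }


snapshot = host_snapshot()
print(snapshot)
```

A load average that stays above the CPU count for a long period is a common sign that the host is saturated.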
- Spark worker and the executors
- Jobs Launched by an Application
- visual representation of a DAG
- details for a stage
- metrics for the completed tasks in a stage
- The amount of data cached by a Spark application in memory or disk
- visualizing the execution of a Spark Streaming application
- Monitoring Spark SQL Queries
- Monitoring Spark SQL JDBC/ODBC Server
- Z-score method
- Modified Z-score method
- IQR method
- Pattern matching
Reference: http://colingorrie.github.io/outlier-detection.html
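The first three methods above are short enough to sketch with the standard `statistics` module. The thresholds used here (3.0 for the Z-score, 3.5 and the constant 0.6745 for the modified Z-score, 1.5 for the IQR multiplier) are the conventional defaults rather than values from the slides, and the sample data is invented for illustration.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


def modified_zscore_outliers(values, threshold=3.5):
    """Same idea, but based on the median and the median absolute deviation."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical streaming delays in seconds, with one obvious spike.
delays = [10, 11, 9, 10, 12, 10, 11, 9, 10, 60]
print(zscore_outliers(delays))
print(modified_zscore_outliers(delays))
print(iqr_outliers(delays))
```

On this small sample the plain Z-score with the default threshold does not flag the spike, because the spike itself inflates the standard deviation. The median-based methods are less sensitive to that effect, which is one reason to prefer them for short monitoring windows.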
- Increase in streaming delay time
- Process restart
- Driver node shutdown
- Master node shutdown
- Executor node shutdown
- Decrease in consumed messages
- Decrease in upserted data
- Increase in messages that fail parsing
- etc ...
- Slack
- SMS
- etc..
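For the Slack channel, an alert can be delivered through an incoming webhook, which accepts a JSON body with a `text` field over HTTP POST. This is a minimal sketch using only the standard library. The function names, the message format, and the metric name are assumptions for illustration, and the webhook URL must come from your own Slack workspace.

```python
import json
import urllib.request


def build_slack_payload(metric, value, threshold):
    """Encode an alert message as the JSON body of an incoming webhook call."""
    text = f"[ALERT] {metric} = {value} (threshold {threshold})"
    return json.dumps({"text": text}).encode("utf-8")


def send_slack_alert(webhook_url, payload, opener=urllib.request.urlopen):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with opener(request, timeout=5) as response:
        return response.status


# Hypothetical metric name and values.
payload = build_slack_payload("streaming_delay_sec", 120, 60)
print(payload.decode("utf-8"))
```

The `opener` parameter exists so the sender can be exercised without network access. In normal use it is left at its default and only `webhook_url` and `payload` are passed.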
As with alerts, an information radiator that always shows red has no value. If a condition shown on the radiator isn’t important enough to fix immediately, then remove it.
- O'Reilly Media, Inc. Infrastructure as code