Phi failover

Background

Phi is now our main source of truth for incident detection and troubleshooting
Production services rely on phi for metrics 24/7
Phi is an service built internally so we need to support it

De-risking Phi

SRE laid out the risks Phi is currently subjected to and a mitigation plan for those.

You can view it in detail on confluence https://confluence.condenastint.com/pages/viewpage.action?spaceKey=PLAT&title=Derisking+Phi

Failover mechanism for Phi

CNAME for fastly syslog endpoint
Deployment of Phi on prod-eu-central-1

Performing failover

Simulate unavailability in tools-prod cluster: Scaled down ingress deploy to 0
Switch CNAME value to `fastly-log-transport.prod.cni.digital`
Watch metrics https://app.datadoghq.com/dashboard/hmy-qs8-i7i/jen---phi-failover?from_ts=1574763323653&live=false&tile_size=m&to_ts=1574773200000
Switch CNAME value back to `fastly-log-transport.eu-west-1.tools-prod.cni.digital`
Stop connections being sent to prod-eu-central-1 (Scale ingress deploy down and up)

Results

10:38 - requests to tools prod dropped straight away as expected
It took <3 minutes for Phi to pick up buffered logs from Fastly
After another 2 minutes Fastly and Phi were in sync with real time logs/metrics
When changing CNAME back to tools-prod endpoint, tools prod started to converge and matched Fastly metrics 100% 1 minute after connections were completely dropped on prod (k8s deploy replicas: 0)

Next step(Backlog):

(Potentially) Failover performed automatically via Route53 health checks

Documentation:

https://confluence.condenastint.com/pages/viewpage.action?spaceKey=PLAT&title=Derisking+Phi

https://confluence.condenastint.com/display/PLAT/Phi+failover

Dashboards:

https://app.datadoghq.com/dashboard/hmy-qs8-i7i/jen---phi-failover?from_ts=1574763323653&live=false&tile_size=m&to_ts=1574773200000

https://app.datadoghq.com/dashboard/inj-mky-asi/phi-fastly-syslog-echo

Made with Slides.com