SRE laid out the risks Phi is currently exposed to, along with a plan to mitigate them.
You can view it in detail on Confluence: https://confluence.condenastint.com/pages/viewpage.action?spaceKey=PLAT&title=Derisking+Phi
Simulate unavailability in the tools-prod cluster: scale the ingress deployment down to 0 replicas
Switch CNAME value to `fastly-log-transport.prod.cni.digital`
Switch CNAME value back to `fastly-log-transport.eu-west-1.tools-prod.cni.digital`
Stop connections being sent to prod-eu-central-1 (scale the ingress deployment down and back up)
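The CNAME switch in the steps above could be scripted rather than done by hand. The following is a minimal sketch, assuming the record is managed in Route53 (the notes do not confirm this) and using a hypothetical record name, `logs.example.com`, and a hypothetical 60-second TTL. It only builds the change batch; it does not call AWS.

```python
import json

# Targets taken from the steps above.
PRIMARY_TARGET = "fastly-log-transport.eu-west-1.tools-prod.cni.digital"
FAILOVER_TARGET = "fastly-log-transport.prod.cni.digital"


def cname_change_batch(record_name: str, target: str, ttl: int = 60) -> dict:
    """Build a Route53 change batch that points record_name at target.

    record_name and ttl are assumptions for illustration; the real values
    are not stated in these notes.
    """
    return {
        "Comment": f"Point {record_name} at {target}",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical record name; prints the batch that would fail over to prod.
    batch = cname_change_batch("logs.example.com", FAILOVER_TARGET)
    print(json.dumps(batch, indent=2))
```

If Route53 is indeed the DNS provider, a dict of this shape is what boto3's `change_resource_record_sets` accepts as its `ChangeBatch` argument. Switching back is the same call with `PRIMARY_TARGET`.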
10:38 - requests to tools-prod dropped immediately, as expected
Phi picked up the buffered logs from Fastly in under 3 minutes
After another 2 minutes, Fastly and Phi were in sync on real-time logs and metrics
After the CNAME was switched back to the tools-prod endpoint, tools-prod started to converge and matched Fastly metrics 100% within 1 minute of connections being completely dropped on prod (k8s deployment replicas: 0)
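The convergence times above were read off dashboards. For a repeat of the test, the same check could be made reproducible with a small helper. This is a minimal sketch, assuming per-minute request counts exported from Fastly and Phi as two equal-length lists; the function name, the input format, and the `tolerance` parameter are all hypothetical.

```python
from typing import Optional, Sequence


def minutes_to_converge(
    fastly: Sequence[float],
    phi: Sequence[float],
    tolerance: float = 0.0,
) -> Optional[int]:
    """Return the first minute from which phi stays within tolerance of fastly.

    Both sequences hold one sample per minute. tolerance is a relative
    fraction of the Fastly value; 0.0 requires an exact match.
    Returns None if the series are not in sync at the end of the window.
    """
    if len(fastly) != len(phi):
        raise ValueError("series must have the same length")
    converged_at: Optional[int] = None
    for minute, (f, p) in enumerate(zip(fastly, phi)):
        if abs(f - p) <= abs(f) * tolerance:
            # Start of a matching run, unless one is already in progress.
            if converged_at is None:
                converged_at = minute
        else:
            # Any later mismatch resets the run.
            converged_at = None
    return converged_at


if __name__ == "__main__":
    # Made-up sample data, not measurements from this test.
    fastly_rpm = [100, 100, 100, 100, 100, 100]
    phi_rpm = [0, 0, 40, 160, 100, 100]
    print(minutes_to_converge(fastly_rpm, phi_rpm))
```

A nonzero tolerance (for example 0.01) may be more practical than an exact match if the two sources bucket samples slightly differently.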
(Potentially) Perform failover automatically via Route53 health checks
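One possible shape for automatic failover is a pair of Route53 failover records, with a health check attached to the primary. The sketch below only builds the record-set dicts. The record name, set identifiers, TTL, and health check ID are hypothetical, and whether this matches the eventual design is an open question.

```python
from typing import Optional


def failover_record(
    record_name: str,
    target: str,
    role: str,
    health_check_id: Optional[str] = None,
    ttl: int = 60,
) -> dict:
    """Build one Route53 failover CNAME record set.

    role must be "PRIMARY" or "SECONDARY". All argument values used by
    callers here are illustrative assumptions, not taken from these notes.
    """
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": record_name,
        "Type": "CNAME",
        "SetIdentifier": f"{record_name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    # Attach the health check only when one is given.
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


if __name__ == "__main__":
    primary = failover_record(
        "logs.example.com",  # hypothetical record name
        "fastly-log-transport.eu-west-1.tools-prod.cni.digital",
        "PRIMARY",
        health_check_id="hypothetical-health-check-id",
    )
    secondary = failover_record(
        "logs.example.com",
        "fastly-log-transport.prod.cni.digital",
        "SECONDARY",
    )
    print(primary)
    print(secondary)
```

With records like these, Route53 would answer with the secondary target while the primary's health check is failing, which would replace the manual CNAME switch performed in this test.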