PagerDuty: Production Support as a Front End Developer

Xinjiang Shao

2021-7-29

Agenda

  • The Mindset
  • Past Incident Examples
  • Most Common Questions
  • Tools
  • Demo

The Mindset

  • Stay Calm
  • On-Call !== Bug Triage
  • Prefer not to make any code changes while on call
    • Restart a service (e.g., flip Blue/Green)
    • Republish content
    • Use site config to disable a feature
    • Data changes or direct the request to the right team
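
Disabling a feature through site config only works if the front end checks the flag before using the feature. A minimal sketch of that check follows. The `SiteConfig` shape, the flag names, and the helper functions are hypothetical; the real site config schema is not shown in these slides.

```typescript
// Hypothetical shape of a site config payload (assumption, not the real schema).
interface SiteConfig {
  features: Record<string, boolean | undefined>;
}

// Returns whether a feature is on. Flags missing from the config fall back
// to a caller-supplied default.
function isFeatureEnabled(
  config: SiteConfig,
  name: string,
  fallback: boolean = false,
): boolean {
  const value = config.features[name];
  return value === undefined ? fallback : value;
}

// Example config during an incident: native login and client log are off.
const incidentConfig: SiteConfig = {
  features: { nativeLogin: false, clientLog: false, coupons: true },
};

// Picks the login flow from the flag; defaults to native when the flag is absent.
function chooseLoginFlow(config: SiteConfig): "native" | "web" {
  return isFeatureEnabled(config, "nativeLogin", true) ? "native" : "web";
}
```

With a check like this in place, flipping `nativeLogin` to `false` in site config switches users to the web flow without a code deploy.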

Incident #1

CMDM is down

Observation:

  • Logged-in users cannot check out in GNTC and MRTN
  • New accounts cannot be created, and existing users cannot log in
  • Coupons cannot be clipped

Resolution:

We published a system-wide notification

Incident #2

Client Log 503/413 Status

Observation:

An increasing amount of client log requests

Resolution:

- Add a rate limit in Cloudflare for client log requests

- Add a new site config to control the interval for sending client logs

- Send logs in batches through Azure Event Hub
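
The interval and batching ideas above can be sketched on the client side as a small queue that flushes on a timer. This is only an illustration under assumptions: `BatchedClientLogger`, `LogEntry`, and the injected `send` function are hypothetical names, and the actual transport (including anything Azure Event Hub related) is left out.

```typescript
// Hypothetical log entry shape (assumption; the real payload is not shown).
interface LogEntry {
  level: "info" | "warn" | "error";
  message: string;
  timestamp: number;
}

// The transport is injected so the sketch stays independent of any endpoint.
type SendBatch = (entries: LogEntry[]) => void;

class BatchedClientLogger {
  private queue: LogEntry[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly send: SendBatch;
  private readonly maxBatchSize: number;

  constructor(send: SendBatch, maxBatchSize: number) {
    this.send = send;
    this.maxBatchSize = maxBatchSize;
  }

  // Queue an entry; flush early once the batch is full.
  log(level: LogEntry["level"], message: string): void {
    this.queue.push({ level, message, timestamp: Date.now() });
    if (this.queue.length >= this.maxBatchSize) {
      this.flush();
    }
  }

  // Send everything queued so far as one request; no-op on an empty queue.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }

  // Start periodic flushing. intervalMs would come from site config;
  // a value of 0 or less leaves the timer off.
  start(intervalMs: number): void {
    this.stop();
    if (intervalMs > 0) {
      this.timer = setInterval(() => this.flush(), intervalMs);
    }
  }

  // Stop periodic flushing; queued entries stay until the next flush().
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }
}
```

Raising the interval in site config reduces the number of requests without losing entries, since each request carries the whole queue.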

Incident #3

Missing PodBag

Observation:

Podbag was missing for the delivery service, even though the location config indicated that podbag was enabled

Resolution:

- Ask the bus-system dev to fix the data configuration

- After the incident, gracefully turn off podbag if the product info cannot be retrieved
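
The graceful turn-off can be sketched as a lookup that hides podbag whenever the product info fails to load, even if the location config says it is enabled. The types and the `resolvePodbag` function below are hypothetical; the real location config and product API are not shown in these slides.

```typescript
// Hypothetical types (assumptions for illustration only).
interface LocationConfig {
  podbagEnabled: boolean;
}

interface PodbagProduct {
  sku: string;
  price: number;
}

// The product lookup is injected so the sketch does not assume a real endpoint.
type FetchPodbagProduct = () => Promise<PodbagProduct>;

// Resolves the podbag option for checkout.
// Returns null when podbag should be hidden from the user.
async function resolvePodbag(
  config: LocationConfig,
  fetchProduct: FetchPodbagProduct,
): Promise<PodbagProduct | null> {
  if (!config.podbagEnabled) return null;
  try {
    return await fetchProduct();
  } catch (error) {
    // Config says enabled, but product info is unavailable:
    // hide the feature instead of breaking the delivery flow.
    console.warn("podbag product lookup failed; hiding podbag", error);
    return null;
  }
}
```

The design choice here is to treat a failed lookup the same as a disabled flag, so a data misconfiguration degrades to "no podbag" and does not block the order.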

 

Incident #4

iOS App Native Login

Observation:

App users cannot log in

Resolution:

- Disable native login from site config

 

Most Common Questions for Front End Devs

  • What functionality would be impacted if we restart service x (e.g., the loyalty account API)?
  • Have our users (customers) recovered from the incident?
  • Could we disable feature x (e.g., client log)?
  • When did a given incident start?
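
One way to answer the "when did it start" question is to bucket client log or RUM data by time and look for the first sustained rise in error rate. The sketch below assumes the buckets have already been exported and sorted by time; `LogBucket`, `estimateIncidentStart`, and the threshold values are hypothetical and not tied to any specific tool's API.

```typescript
// Hypothetical bucket shape, e.g. one row per minute from a log search export.
interface LogBucket {
  startTime: string; // ISO 8601 timestamp of the bucket start
  total: number;     // all requests in the bucket
  errors: number;    // failed requests in the bucket
}

// Returns the start time of the first run of `minConsecutive` buckets whose
// error rate is at or above `threshold`, or null if no such run exists.
// Requiring a run filters out one-off spikes. Buckets must be sorted by time.
function estimateIncidentStart(
  buckets: LogBucket[],
  threshold: number,
  minConsecutive: number = 2,
): string | null {
  let runStart: string | null = null;
  let runLength = 0;
  for (const bucket of buckets) {
    const rate = bucket.total === 0 ? 0 : bucket.errors / bucket.total;
    if (rate >= threshold) {
      if (runLength === 0) runStart = bucket.startTime;
      runLength += 1;
      if (runLength >= minConsecutive) return runStart;
    } else {
      // The run was broken; start over.
      runLength = 0;
      runStart = null;
    }
  }
  return null;
}
```

The result is only an estimate; it should be cross-checked against the tools listed later before it goes into an incident timeline.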

Common Root Causes for Other Teams

  • Copient/Quotient Offers
  • Informix DB performance
  • Too many concurrent users
  • MDM
  • Running out of disk
  • DNS server
  • Firewall outage
  • Expired SSL certificates
  • VMware in QTS datacenter

Tools

  • Datadog RUM
  • Client Log in Splunk
  • JIRACore
  • Source maps in Prod
  • SuperUser
  • Webbase (BusSys)
  • Cloudflare (Security Team)
  • Shape Security (Security Team)
  • Optimizely (Product Team)

Demo Time

FrontEnd Production Support

By Xinjiang Shao

The general process of being on call as a front end dev