Head of Bit Moving
Behance Team @ Adobe
Do you care about the website?
Effective deployments are quick AND safe
- Build tools must run quickly
- The queue mustn't be bottlenecked
- Deployers must be prepared to verify
- No one in the queue should have to roll back
- Deployers must be able to debug build failures effectively
As the build step is happening, you should...
- Create a written list of things you need to do to verify the change.
- Open up all of the dashboards you will need to look at.
- If you don't already keep your standard checklist dashboards open, create a bookmark folder for them.
- Stage absolutely must not be the first time you have exercised your changes.
- A rollback from stage wastes about 15 minutes of time
- A rollback from prod wastes about 20 minutes of time
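To put the last two bullets in perspective, here is a minimal sketch of the arithmetic. The 15 and 20 minute figures come from the bullets above; the function name and the example counts are made up for illustration.

```python
# Approximate cost of one rollback, in minutes (figures from the bullets above).
STAGE_ROLLBACK_MINUTES = 15
PROD_ROLLBACK_MINUTES = 20


def minutes_wasted(stage_rollbacks: int, prod_rollbacks: int) -> int:
    """Estimate the time lost to rollbacks over some period."""
    return (stage_rollbacks * STAGE_ROLLBACK_MINUTES
            + prod_rollbacks * PROD_ROLLBACK_MINUTES)


# Hypothetical week: 3 rollbacks from stage and 2 from prod.
print(minutes_wasted(3, 2))  # 3*15 + 2*20 = 85
```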
Debugging Build Step Failures
- Open the build logs and look for failure messages
- Run the build step locally with full rebase
- Look at trends to detect recent flakiness
- If all of that fails, hit up the Slack channels
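The first and third bullets above could be sketched roughly as follows. This is an illustration only: the failure keywords, the sample log (including the test file name in it), and the build history are all assumptions, not the output of any real build tool.

```python
import re

# Assumed failure markers; tune these to whatever your build tool prints.
FAILURE_PATTERN = re.compile(r"\b(error|failed|fatal|exception)\b", re.IGNORECASE)


def find_failure_lines(log_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like failure messages."""
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if FAILURE_PATTERN.search(line)
    ]


def failure_rate(recent_results: list[bool]) -> float:
    """Fraction of recent builds that failed (True means the build passed)."""
    if not recent_results:
        return 0.0
    failures = sum(1 for passed in recent_results if not passed)
    return failures / len(recent_results)


# Made-up log excerpt for illustration.
sample_log = """\
Installing dependencies
Running unit tests
FAILED tests/ProfileTest.php: expected 200, got 500
Build finished with 1 error
"""
for number, line in find_failure_lines(sample_log):
    print(f"{number}: {line}")

# Hypothetical history of the last five builds; 2 of 5 failed.
print(failure_rate([True, False, True, True, False]))
```

A high failure rate across unrelated changes hints at a flaky build step rather than a problem in your branch, which is worth knowing before you ask for help.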
Debugging Deploy Step Failures
- Hit up the Slack channels
- Don't accept answers you don't understand
- Ensure it's the right time to be deploying
- Verify use cases thoroughly before reaching PROD and once again in PROD
- Communicate the change appropriately
- Know where to look for trouble
- Don't mark as good until you have fully verified it as safe
- Hang out in Slack channels for a bit after deploy
- Don't be afraid to emergency rollback / lockdown
Make sure it's the right time to be deploying
- Don't deploy non-urgent changes after 5PM
- Don't deploy big changes on a Friday
- Is there a code freeze soon?
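These timing rules are simple enough to write down as a pre-deploy check. The sketch below is one possible encoding, not an existing tool: the function name, its parameters, and the two-day definition of "soon" for a code freeze are all assumptions.

```python
from datetime import date, datetime
from typing import Optional


def deploy_timing_concerns(
    now: datetime,
    urgent: bool,
    big_change: bool,
    code_freeze_start: Optional[date] = None,
    freeze_warning_days: int = 2,  # assumed meaning of "soon"
) -> list[str]:
    """Return reasons to reconsider deploying right now (empty list = go ahead)."""
    concerns = []
    if now.hour >= 17 and not urgent:
        concerns.append("Non-urgent change after 5PM")
    if now.weekday() == 4 and big_change:  # Monday is 0, so Friday is 4
        concerns.append("Big change on a Friday")
    if code_freeze_start is not None:
        days_left = (code_freeze_start - now.date()).days
        if 0 <= days_left <= freeze_warning_days:
            concerns.append(f"Code freeze starts in {days_left} day(s)")
    return concerns


# Friday 8 March 2024 at 5:30PM, with a big non-urgent change: two concerns.
print(deploy_timing_concerns(datetime(2024, 3, 8, 17, 30), urgent=False, big_change=True))
# Thursday morning, same change: no concerns.
print(deploy_timing_concerns(datetime(2024, 3, 7, 10, 0), urgent=False, big_change=True))
```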
Make sure it works on your local first
- Do you need to add fixtures?
- Have you tested in multiple browsers?
- Is your local PHP the right version?
- Have you run tests/linters locally?
- Have you actually thought deeply about how to verify this change?
- Have you checked with design?
Communicate your change
- Have you told community what the change is and why? Its failure mode? Its benefit?
- Have you looped in design for design review?
- Have you looped in ops if the change can have stability / perf implications?
Know where to look for trouble
- Do you have the dashboards open?
- Do you know how to search logs?
- Do you have PHP Errors open?
- Do you have JS Errors open?
- Do you know how to drill into PHP errors?
- Do you know how to drill into Perf issues?
Let it simmer 5 minutes before marking as good
- Some issues take several minutes to surface on any dashboard... it's OK, take your time!
- Test out the fix 1 or 2 more times while you wait.
- Look into other error messages or dashboards while you wait!
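The simmer period can also be backed by a small watcher that polls an error count for the full five minutes. This is a minimal sketch under stated assumptions: `get_error_count` is a hypothetical callable standing in for whatever your dashboards or logs expose, and `baseline` is a threshold you pick from the pre-deploy error level.

```python
import time
from typing import Callable

SIMMER_SECONDS = 5 * 60  # "Let it simmer 5 minutes"


def simmer(
    get_error_count: Callable[[], int],  # hypothetical hook into your dashboards
    baseline: int,
    simmer_seconds: float = SIMMER_SECONDS,
    poll_seconds: float = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if errors stay at or below baseline for the whole window."""
    waited = 0.0
    while waited < simmer_seconds:
        if get_error_count() > baseline:
            return False  # trouble surfaced; do not mark as good
        sleep(poll_seconds)
        waited += poll_seconds
    # One last reading at the end of the window.
    return get_error_count() <= baseline


# Fake readings where the problem only shows up on the final poll.
readings = iter([2, 2, 3, 9])
ok = simmer(lambda: next(readings), baseline=3,
            simmer_seconds=90, poll_seconds=30, sleep=lambda seconds: None)
print(ok)  # False
```

The example swaps in a no-op `sleep` and a shortened window so it finishes instantly; with the defaults it would wait the full five minutes.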
Assume any issue that comes up after your deploy was from your deploy
- If someone is deploying behind you, and they start experiencing issues in PROD, assume it's your fault.
- If community raises an issue shortly after your deploy, think about how it could be your issue.
Emergency rollbacks / lockdowns
- If you have already marked the deploy as good, a Moonbeam admin can use an emergency rollback to push a previous deploy to PROD.
- If a fix can't be found quickly, put Moonbeam into lockdown mode. This prevents anyone from building and overwriting your rollback.
- Ensure the next PR that goes up either reverts the offending PR or fixes it.
What it feels like
- You aren't nervous going to PROD
- You are spending less time debugging PROD issues
- Fewer firefights
- Fewer complaints from the millions of daily users
What Questions do You Have?