Halt and Catch Fire
WordPress Meetup Kaunas 2021-08
WordPress Core Contributor, WordPress Kaunas Meetup co-organizer, WordCamp, WordSesh, TEDx speaker and one of the editors of the Lithuanian WordPress translation team.
Free, premium and custom WordPress plugin developer
Engineering Team Lead at
The capability of an organisation to continue the delivery of products or services at pre-defined acceptable levels following a disruptive incident.
- Cyber attack
- Hurricane or other major storm
- Power outage
- Water outage
- Telecoms outage
- IT outage
- War/civil disorder
- Random failure of mission-critical systems
- Single point dependency
- Supplier failure
- Data corruption
But... that never happens
OVH Data Center fire
2021-03-10 - A fire completely destroyed one data center and severely damaged another.
- 3.6 million websites down
- Many lost years of data
CDN provider outage
2021-06-08 - Fastly suffered a near-total outage lasting 49 minutes.
It took down images on Amazon.com, emojis on Twitter, and the entire sites of the BBC, The Guardian, The New York Times, Kayak, Stack Overflow, Reddit and many others.
Mitigation and Recovery
Business Impact Analysis
Recovery Point Objective (RPO)
Recovery Time Objective (RTO)
Recovery Point Objective
RPO - the maximum acceptable age of the data that will not be recovered, that is, how much recent data the organisation can afford to lose.
Is it OK for the organisation to lose the last two days of comments on its website?
Is it OK to lose the last two days of sales data?
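The questions above can be turned into a simple check. The following is a minimal sketch, assuming you keep a list of backup timestamps; the function names `data_loss_window` and `meets_rpo` and the sample dates are made up for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, incident_time):
    """Return how much recent data is lost when restoring the newest
    backup taken at or before the incident."""
    usable = [t for t in backup_times if t <= incident_time]
    if not usable:
        raise ValueError("no backup exists before the incident")
    return incident_time - max(usable)


def meets_rpo(backup_times, incident_time, rpo):
    """True if the lost window is no longer than the agreed RPO."""
    return data_loss_window(backup_times, incident_time) <= rpo


# Hypothetical example: nightly backups at 03:00, incident at 01:30.
backups = [datetime(2021, 8, 1, 3, 0), datetime(2021, 8, 2, 3, 0)]
incident = datetime(2021, 8, 3, 1, 30)

loss = data_loss_window(backups, incident)  # 22 hours 30 minutes
print("Data lost:", loss)
print("OK for a 2-day RPO:", meets_rpo(backups, incident, timedelta(days=2)))
print("OK for a 12-hour RPO:", meets_rpo(backups, incident, timedelta(hours=12)))
```

With nightly backups, a two-day RPO for comments is comfortably met, while a 12-hour RPO for sales data would call for more frequent backups.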
Recovery Time Objective
RTO - the maximum acceptable amount of time to restore a function after it has been disrupted.
- Site downtime
- Ordered items not being sent out
- Customer support lines not working
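One way to sanity-check an RTO is to add up honest estimates for each recovery step and compare the total with the objective. This is only a sketch: the step names and durations below are invented, and `estimated_recovery_time` is a hypothetical helper.

```python
from datetime import timedelta


def estimated_recovery_time(steps):
    """Sum the estimated duration of each recovery step,
    assuming the steps have to run one after another."""
    return sum(steps.values(), timedelta())


# Hypothetical recovery plan for "site downtime".
steps = {
    "provision a replacement server": timedelta(hours=2),
    "restore files and database from backup": timedelta(hours=3),
    "switch DNS and verify the site": timedelta(hours=1),
}
rto = timedelta(hours=4)

total = estimated_recovery_time(steps)
within_rto = total <= rto
print("Estimated recovery:", total)
print("Within RTO:", within_rto)
```

Here the plan needs six hours against a four-hour RTO, so either the plan has to get faster or the objective has to be renegotiated.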
- Functional overlap
- Documentation, documentation, documentation
- Access recovery
- Password manager
- No personal accounts
- No use of personal email
- Redundancy everywhere
- Have spares at hand
- Service contracts
- Always know where to get a new one and be aware of the timeframes involved.
- Backups, backups, backups!
- Cross-location, cross-provider
- Test the backups
- Alternative providers
- Switch-over plans
- Version control!
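As a small illustration of "test the backups", the sketch below compares a backup copy with its source by SHA-256 checksum. The helper names `sha256_of` and `backup_matches` are made up for this example, and a matching checksum only shows the copy is intact; a real test should also restore the backup somewhere and confirm the site works.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks
    so large backup archives do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True if the backup file exists and is byte-identical to the original."""
    backup = Path(backup)
    return backup.is_file() and sha256_of(original) == sha256_of(backup)
```

A scheduled job could call `backup_matches()` for every copy in every location and raise an alert on the first mismatch or missing file.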
Single Point of Failure
A part of a system that, if it fails, will stop the entire system from working.
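The definition can be checked mechanically on a simple model of your setup: remove one component at a time and see whether a request can still get through. This is a minimal sketch; the component names and the graph below are hypothetical, not a description of any real hosting setup.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Breadth-first search: can we get from start to goal
    while pretending the `removed` component has failed?"""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """Return the components whose failure alone breaks the path."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(graph, start, goal, removed=n))


# Hypothetical setup: two web servers, but one DNS provider and one database.
graph = {
    "visitor": ["dns"],
    "dns": ["web-1", "web-2"],
    "web-1": ["database"],
    "web-2": ["database"],
    "database": ["response"],
}

print(single_points_of_failure(graph, "visitor", "response"))
```

In this model the redundant web servers are not flagged, while the lone DNS provider and the lone database are.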
- Disasters almost never happen.
- But when they do, consequences can be brutal.
- Planning for them makes your business more resilient.
- You can't be ready for everything, but once you run through a few diverse threat scenarios, you'll see that the same things keep coming up.
- Recovery point objective - the state of data you need to get back to before you can start doing business again.
- Recovery time objective - a realistic timeframe for getting there.
- Resources you'll need to recover:
- Try to find and eliminate single points of failure.
- Think of yourself as one as well.
Halt and Catch Fire
By Arūnas Liuiza