Halt and Catch Fire

WordPress Meetup Kaunas 2021-08

Arūnas Liuiza

WordPress Core Contributor, WordPress Kaunas Meetup co-organizer, WordCamp, WordSesh, TEDx speaker and one of the editors of the Lithuanian WordPress translation team.

 

Free, premium and custom WordPress plugin developer

 

Engineering Team Lead at

Business Continuity

The capability of an organisation to continue the delivery of products or services at pre-defined acceptable levels following a disruptive incident.

Threat scenarios

  • Epidemic/pandemic
  • Earthquake
  • Fire
  • Flood
  • Cyber attack
  • Sabotage
  • Hurricane or other major storm
  • Power outage
  • Water outage
  • Telecoms outage
  • IT outage
  • Terrorism/Piracy
  • War/civil disorder
  • Theft
  • Random failure of mission-critical systems
  • Single point dependency
  • Supplier failure
  • Data corruption
  • Misconfiguration
  • ...

But... that never happens

almost

OVH Data Center fire

2021-03-10 - The fire completely destroyed one data center and severely damaged another.

  • 3.6 million websites down
  • Many lost years of data

CDN provider outage

2021-06-08 - Fastly suffered a near-total outage lasting 49 minutes.

It took down images on Amazon.com, emojis on Twitter, and the entire sites of the BBC, The Guardian, The New York Times, Kayak, Stack Overflow, Reddit and many others.

Mitigation and Recovery

Business Impact Analysis

 

  • People
  • Equipment
  • Data
  • Technology

Recovery Point Objective (RPO)

Recovery Time Objective (RTO)

Recovery Point Objective

RPO - the maximum acceptable amount of recent data, measured in time, that may be lost and never recovered.

 

Is it OK for the organisation to lose the last two days of comments on its website?

Is it OK to lose the last two days of sales data?
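
To make the two questions above concrete, here is a minimal sketch of an RPO check. The function name `rpo_met`, the timestamps and the two RPO values are all hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta


def rpo_met(last_backup: datetime, now: datetime, rpo: timedelta) -> bool:
    """Return True if the data that would be lost right now fits within the RPO."""
    return now - last_backup <= rpo


# Hypothetical timestamps: the last good backup is 33 hours old.
now = datetime(2021, 8, 26, 12, 0)
last_backup = datetime(2021, 8, 25, 3, 0)

# Assumed RPOs: two days for comments, one hour for sales data.
comments_ok = rpo_met(last_backup, now, timedelta(days=2))
sales_ok = rpo_met(last_backup, now, timedelta(hours=1))
print(comments_ok, sales_ok)  # True False
```

The same backup schedule can satisfy one RPO and miss another, which is why the objective is set per type of data.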

Recovery Time Objective

RTO - the acceptable amount of time to restore the function.

 

  • Site downtime
  • Ordered items not being sent out
  • Customer support lines not working
  • ...
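
One way to sanity-check an RTO is to add up the estimated duration of each recovery step and compare the total with the objective. This is only a sketch: the steps, their durations and the function names are made-up assumptions, not a real recovery plan.

```python
from datetime import timedelta

# Hypothetical recovery steps with estimated durations.
RECOVERY_STEPS = [
    ("Provision replacement server", timedelta(minutes=30)),
    ("Restore latest backup", timedelta(minutes=90)),
    ("Switch DNS and wait for propagation", timedelta(minutes=60)),
]


def estimated_recovery_time(steps) -> timedelta:
    """Sum the step durations, assuming the steps run one after another."""
    return sum((duration for _, duration in steps), timedelta())


def within_rto(steps, rto: timedelta) -> bool:
    """Return True if the estimated recovery time fits within the RTO."""
    return estimated_recovery_time(steps) <= rto


total = estimated_recovery_time(RECOVERY_STEPS)
print(total)                                          # 3:00:00
print(within_rto(RECOVERY_STEPS, timedelta(hours=4)))  # True
print(within_rto(RECOVERY_STEPS, timedelta(hours=2)))  # False
```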

People

  • Functional overlap
  • Documentation, documentation, documentation
  • Access recovery
    • Password manager
    • No personal accounts
    • No use of personal email
  • Hand-overs

Equipment

  • Redundancy everywhere
  • Have spares at hand
  • Service contracts
  • Always know where to get a replacement and be aware of the timeframes involved.

Data

  • Backups, backups, backups!
  • Cross-location, cross-provider
  • Test the backups
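
A minimal sketch of what "test the backups" can mean in practice: restore a copy and compare its checksum with the original. The file names and contents below are hypothetical, and a real test would also restore into a working site, not just compare bytes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, restored: Path) -> bool:
    """Return True if the restored file is byte-identical to the original."""
    return sha256_of(original) == sha256_of(restored)


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "site.sql"          # hypothetical database dump
    original.write_bytes(b"INSERT INTO wp_posts VALUES (1);\n")
    restored = Path(tmp) / "restored.sql"      # stands in for a restored backup
    shutil.copy(original, restored)
    ok = backup_matches(original, restored)
    restored.write_bytes(b"corrupted")         # simulate a damaged backup
    broken = backup_matches(original, restored)

print(ok, broken)  # True False
```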

Technology

  • Redundancy
  • Alternative providers
  • Switch-over plans
  • Version control!
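
A switch-over plan can be as simple as an ordered list of providers and a rule for picking the first healthy one. The sketch below assumes hypothetical provider names and health checks; real checks would call the provider's status endpoint or probe the service.

```python
from typing import Callable, Optional, Sequence, Tuple

# A provider is a name plus a health check that returns True when usable.
Provider = Tuple[str, Callable[[], bool]]


def pick_provider(providers: Sequence[Provider]) -> Optional[str]:
    """Return the first provider whose health check passes, or None."""
    for name, is_healthy in providers:
        try:
            if is_healthy():
                return name
        except Exception:
            # A failing health check counts as "unhealthy"; try the next one.
            continue
    return None


def primary_down() -> bool:
    """Hypothetical health check simulating an unreachable primary provider."""
    raise ConnectionError("primary unreachable")


providers = [("primary-cdn", primary_down), ("fallback-cdn", lambda: True)]
active = pick_provider(providers)
print(active)  # fallback-cdn
```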

Single Point of Failure

A part of a system that, if it fails, will stop the entire system from working.
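
The definition above can be turned into a simple check: remove each component in turn and see whether users can still reach what they need. This is a sketch under assumptions: the component names and the dependency graph are hypothetical, and the system is modelled as a directed graph from `users` to `database`.

```python
from collections import deque


def reachable(graph, start, goal, removed):
    """Breadth-first search from start to goal, skipping the removed node."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """List components whose removal leaves no path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates if not reachable(graph, start, goal, n))


# Hypothetical setup: two CDNs in front of a single web server.
GRAPH = {
    "users": ["cdn-a", "cdn-b"],
    "cdn-a": ["web-1"],
    "cdn-b": ["web-1"],
    "web-1": ["database"],
}

spofs = single_points_of_failure(GRAPH, "users", "database")
print(spofs)  # ['web-1']
```

Either CDN can fail without taking the site down, but the lone web server cannot.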

Recap

  • Disasters almost never happen.
  • But when they do, consequences can be brutal.
  • Planning for them makes your business more resilient.
  • You can't be ready for everything, but once you run through a few diverse threat scenarios, you'll see that the same things keep coming up.
  • Recovery point objective - how much recent data you can afford to lose.
  • Recovery time objective - how long you can afford to take to restore a function.

Recap (2)

  • Resources you'll need to recover:
    • People
    • Equipment
    • Data
    • Technology
  • Try to find and eliminate single points of failure.
  • Think of yourself as one as well.

Questions?
