Experiences with a long-running, large-scale system migration

Part I – emerging behavior

Stefan Schwetschke

stefan@schwetschke.de

What other parts can I offer?

  • Part 1: Emerging behavior
  • Part 2: Measuring methods in depth
    The USE method, measuring points
  • Part 3: Sizing, archiving & decommissioning, project setup

About me

  • I ❤️ the mountains a lot!
  • I studied "Informatik" (basically computer science)
  • During my studies, I was one of the founders of 4friendsonly.com
  • I worked for over ten years as a consultant for Capgemini in various roles
  • I now work as a lead developer at Check24

We are building the "Kontomanager"

The scenario

  • Merging two data centers
  • Cleaning up the application stack
  • At a mobile phone carrier
  • Complex workflow ahead!

Idea

  • Migrate during business hours
  • Migrate in parallel to normal operations
  • Pause the migration when the system overloads
  • Migrate in small batches, so problems can be corrected before much damage occurs (see the sketch below)
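
The idea as a minimal sketch in Java; systemIsSaturated, fetchNextBatch and migrate are hypothetical placeholders, not the real project API:

  import java.util.List;

  // Sketch of the migration loop: small batches, pause on overload.
  public class MigrationLoop {
      static final int BATCH_SIZE = 20;   // small batches limit the damage of a bad run

      public static void main(String[] args) throws InterruptedException {
          while (true) {
              if (systemIsSaturated()) {   // pause while normal operations need the capacity
                  Thread.sleep(30_000);
                  continue;
              }
              List<String> batch = fetchNextBatch(BATCH_SIZE);
              if (batch.isEmpty()) {
                  break;                   // nothing left to migrate
              }
              batch.forEach(MigrationLoop::migrate);   // per-account workflow, see the next slide
          }
      }

      static boolean systemIsSaturated() { return false; }               // placeholder: saturation check
      static List<String> fetchNextBatch(int size) { return List.of(); } // placeholder: batch selection
      static void migrate(String account) { }                            // placeholder: account workflow
  }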

Why a complex workflow?

  • Lock the customer account against changes
  • Deactivate it on the old systems
  • Move it to the new systems
  • Propagate the change to all business applications
  • Propagate the change to all network elements
  • Update the location in the global lookup
  • Unlock the account (sequence sketched below)
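
The list above as a per-account workflow sketch; every method name is a hypothetical placeholder for a call into the corresponding system:

  // Per-account migration workflow, mirroring the steps above.
  // All method names are hypothetical placeholders.
  public class AccountMigration {

      public void migrate(String accountId) {
          lockAccount(accountId);                      // lock customer account against changes
          deactivateOnOldSystems(accountId);
          moveToNewSystems(accountId);
          propagateToBusinessApplications(accountId);
          propagateToNetworkElements(accountId);
          updateGlobalLookup(accountId);               // update location in the global lookup
          unlockAccount(accountId);
      }

      private void lockAccount(String id) { }
      private void deactivateOnOldSystems(String id) { }
      private void moveToNewSystems(String id) { }
      private void propagateToBusinessApplications(String id) { }
      private void propagateToNetworkElements(String id) { }
      private void updateGlobalLookup(String id) { }
      private void unlockAccount(String id) { }
  }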

Measuring basics: 

The USE Method 

  • Idea from Brendan Gregg: The USE Method
  • Three metrics for each resource
    • Utilization
    • Saturation
    • Errors

Utilization

  • Goal: Keep this high (around 70% is a good value)
  • Examples:
    • CPU usage
    • Network throughput
    • Used memory
  • 100% means: you have a bottleneck!

Saturation

  • Goal: Avoid
  • Examples:
    • Waiting threads
    • Network packets waiting
    • Paging: swap-in events
      (Swap out is not a problem!)

Errors

  • Goal: Don't measure while errors are present, fix them first!
  • Examples:
    • Throttled core due to overheating
    • Lost network packets
    • ECC errors
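
A small sketch tying the three USE metrics together per resource; the resource names, numbers, and the decision order are illustrative assumptions, not the real monitoring setup:

  // Sketch of a per-resource USE record: Utilization, Saturation, Errors.
  public class UseMetrics {

      record ResourceUse(String resource, double utilization, long saturation, long errors) { }

      public static void main(String[] args) {
          ResourceUse cpu  = new ResourceUse("cpu",  0.65, 0,   0);  // 65% busy, no runnable threads waiting
          ResourceUse nic  = new ResourceUse("eth0", 0.92, 340, 0);  // 92% throughput, 340 packets queued
          ResourceUse disk = new ResourceUse("sda",  0.40, 0,   12); // low utilization, but 12 I/O errors

          for (ResourceUse r : new ResourceUse[] { cpu, nic, disk }) {
              if (r.errors() > 0) {
                  System.out.println(r.resource() + ": fix errors first, other numbers are not trustworthy");
              } else if (r.saturation() > 0) {
                  System.out.println(r.resource() + ": saturated, work is queueing up");
              } else if (r.utilization() >= 1.0) {
                  System.out.println(r.resource() + ": bottleneck");
              } else {
                  System.out.println(r.resource() + ": ok at " + Math.round(r.utilization() * 100) + "% utilization");
              }
          }
      }
  }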

Measuring basics:
Hockey stick curve

  • Term "stolen" from climate science
  • Typical curve for systems under load
  • Allows you to predict the system's behavior
  • Idea: workload vs resource usage
  • Typically: throughput vs latency
  • Distinguishes linear from "exponential" behavior
    (not really exponential in the mathematical sense, see the formula below)
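
One standard way to make that shape precise is the textbook M/M/1 queueing formula (an illustration only, not derived from the project's data): latency stays almost flat at low utilization and grows with 1/(1-rho) as utilization approaches 100%.

  % M/M/1 queue: lambda = arrival rate, mu = service rate, rho = utilization
  \rho = \frac{\lambda}{\mu},
  \qquad
  W = \frac{1}{\mu - \lambda} = \frac{1}{\mu} \cdot \frac{1}{1 - \rho}
  % W (average time in the system) is the "blade" of the hockey stick:
  % nearly constant for small rho, shooting up as rho approaches 1.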

Hockey stick: Example

What we found

  • Throughput of the system less than 50%
  • No clear utilization bottleneck
  • Saturation all over the system
  • No errors

Theory

  • The system is oscillating
  • The oscillation is too fast for our measurement interval
  • Average utilization is low
  • Parts of the system reach maximum saturation periodically

Problems with oscillation

  • Hard to measure
  • Many overload alerts
  • The system spends more time in an inefficient overload state
  • Average throughput goes down
  • Periodic overload states mislead local optimizations

Reasons for oscillation

  • Internal state
  • Reaction time
  • Shared resources
  • Sudden load changes
  • Thermodynamics

Reason: Internal state

  • Hysteresis is inherent to some models (see the sketch below)
  • Timeouts
  • Queues fill up
  • Local optimizations
    • Database optimizer
    • OS caches
    • Java VM heap (Moved to Eden space...)
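
As an illustration of the hysteresis bullet above: a hypothetical throttle that switches on above 80% load but only off again below 60%, so its answer depends on the system's history; the thresholds are made-up values:

  // Sketch of hysteresis: the throttle's state depends on history,
  // not only on the current load.
  public class HysteresisThrottle {

      private boolean throttling = false;

      public boolean shouldThrottle(double load) {
          if (!throttling && load > 0.80) {
              throttling = true;    // switch on above the upper threshold
          } else if (throttling && load < 0.60) {
              throttling = false;   // switch off only below the lower threshold
          }
          return throttling;        // between 60% and 80% the answer depends on the past
      }
  }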

Reason: Reaction time

Reason: Shared resources

  • One component starves the others
  • When they depend on each other, the system oscillates
  • See the Lotka–Volterra equations / predator-prey model (shown below)
  • Examples:
    • VM host
    • Application server
    • Network link
    • Storage
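
For reference, the Lotka–Volterra (predator-prey) equations mentioned above, with x the prey population, y the predator population, and alpha, beta, gamma, delta positive rate constants:

  % Lotka–Volterra predator-prey equations (x = prey, y = predators)
  \frac{dx}{dt} = \alpha x - \beta x y,
  \qquad
  \frac{dy}{dt} = \delta x y - \gamma y
  % The coupled feedback between the two populations produces periodic
  % oscillation, just like two mutually dependent components sharing a
  % resource can oscillate against each other.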

Reason: Sudden load changes

  • Activate latent oscillation potential
  • Dirac pulses: Impulse response
  • Examples:
    • Batching
    • Queues flushing
    • Input surges
    • Huge work packages mixed with small ones
      (This also leads to the "Zen Garden Effect")

Reason: Thermodynamics

  • Often there is some source of entropy
  • When we take a random snapshot
    • it is unlikely to get all work packages evenly distributed over time
    • it is more likely to get some "clumping" (see the small simulation below)
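
A small simulation of that clumping effect in Java, assuming purely random arrival times; all numbers are illustrative:

  import java.util.Random;

  // With purely random arrival times, some intervals get far more work
  // packages than the average, i.e. the work "clumps".
  public class ClumpingDemo {

      public static void main(String[] args) {
          Random random = new Random(42);
          int packages = 1_000;
          int buckets = 100;                 // e.g. 100 one-second intervals
          int[] perBucket = new int[buckets];

          for (int i = 0; i < packages; i++) {
              perBucket[random.nextInt(buckets)]++;   // each package lands in a random interval
          }

          int max = 0;
          for (int count : perBucket) {
              max = Math.max(max, count);
          }
          System.out.println("average per interval: " + (packages / buckets));
          System.out.println("busiest interval:     " + max);
          // The busiest interval holds noticeably more than the average, so short
          // saturation spikes appear even though average utilization is low.
      }
  }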

Details: Thermodynamic forces

Potential countermeasures

  • Fine-grained monitoring
     (especially for saturation)
  • Isolated or guaranteed resources
  • Throttling (at the head of the traffic jam)
  • Avoid surges: spread the input
  • Traffic control: Global back pressure
  • Faster reaction times
    • Shorter queues
    • Start back pressure when the queue fills up (see the sketch below)
  • Handle overspill
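
A minimal sketch of the "shorter queue plus back pressure" countermeasure, using a bounded queue from java.util.concurrent; the capacity of 10 and the 500 ms timeout are illustrative assumptions:

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.TimeUnit;

  // Back pressure with a short, bounded queue: when the queue is full,
  // the producer is slowed down or rejected instead of piling up more work.
  public class BackPressureQueue {

      private final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(10);

      /** Returns false when the consumer cannot keep up, so the caller can throttle its input. */
      public boolean submit(Runnable task) throws InterruptedException {
          return queue.offer(task, 500, TimeUnit.MILLISECONDS);
      }

      /** Consumer side: blocks until work is available. */
      public Runnable take() throws InterruptedException {
          return queue.take();
      }
  }

When submit returns false, the producer slows down or redirects the work elsewhere, which is one way to handle the overspill mentioned above.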

Our countermeasures

  • Input spreading: Smaller batch sizes
  • Throttling at the queues

Let's discuss

  • Who has already experienced this?
  • Other reasons?
  • Other countermeasures?
  • Other emerging behavior?
    • Deadlock
    • Livelock
    • ????

Complex migration - part 1: emerging behaviour

By Stefan Schwetschke

This talk is about my experiences with a very complex migration project. It was given at the Munich DevOps Meetup in July 2018, hosted at Google in Munich.
