Self-Healing Systems

Ashish Pandey

@ashishapy

blog.ashishapy.com

Meet-up on 18th Sep 2016 
@ Thoughtworks Pune

 

In the modern era, software is commonly delivered as a service: called web apps, or software-as-a-service. The twelve-factor app is a methodology for building software-as-a-service apps that:

 

  • Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
  • Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
  • Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
  • Minimize divergence between development and production, enabling continuous deployment for maximum agility;
  • And can scale up without significant changes to tooling, architecture, or development practices.

The Twelve Factors

 

I. Codebase

One codebase tracked in revision control, many deploys

II. Dependencies

Explicitly declare and isolate dependencies

III. Config

Store config in the environment

IV. Backing services

Treat backing services as attached resources

V. Build, release, run

Strictly separate build and run stages

VI. Processes

Execute the app as one or more stateless processes

 

VII. Port binding

Export services via port binding

VIII. Concurrency

Scale out via the process model

IX. Disposability

Maximize robustness with fast startup and graceful shutdown

X. Dev/prod parity

Keep development, staging, and production as similar as possible

XI. Logs

Treat logs as event streams

XII. Admin processes

Run admin/management tasks as one-off processes
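
As an illustration of factor III, here is a minimal sketch of reading configuration from the environment. The variable names PORT, DATABASE_URL and DEBUG and their defaults are assumptions made for this example, not part of the talk or of the twelve-factor text.

```python
import os


def load_config(environ=os.environ):
    """Read settings from environment variables, falling back to dev defaults."""
    return {
        # Hypothetical variable names, chosen only for this sketch.
        "port": int(environ.get("PORT", "4000")),
        "database_url": environ.get("DATABASE_URL", "sqlite:///dev.db"),
        "debug": environ.get("DEBUG", "false").lower() == "true",
    }


# The same code runs unchanged in every deploy; only the environment differs.
print(load_config({"PORT": "8080", "DEBUG": "true"}))
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real process environment.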

Agenda

  • Introduction to Self-Healing Systems
  • Introduction to Docker & Microservices
  • Demo:
    • Create Infrastructure
    • Create Services
    • Demonstrate Self-Healing
    • Effortless Scaling
    • Effortless Rolling Update
  • Questions

Let's face it!

The systems we are creating are not perfect.
Sooner or Later

One of our applications will fail.
One of our applications will not be able to handle an increased load.
One of our commits will introduce a fatal bug.
A piece of hardware will fail.
Something entirely unexpected will happen.

What should we do?

Nothing is perfect; we cannot design a perfect system.
Embrace the inevitable: design systems that are able to recover from failures.
The system should be able to predict the likely future.
Design for failure.
Hope for the best, but be prepared for the worst.

Self-Healing Systems

A self-healing system discovers what is not working correctly and,
without any human intervention, makes the necessary changes to restore itself to its normal or designed state.
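
The definition above can be sketched as a reconciliation loop: compare the actual state with the desired state and correct any difference. The `Service` class and `reconcile` function below are hypothetical names invented for this illustration; they model the idea in memory and are not how Docker implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical in-memory model of a service; not a Docker API object."""
    name: str
    desired_replicas: int
    running: list = field(default_factory=list)  # ids of healthy tasks
    next_id: int = 1


def reconcile(service):
    """Compare actual with desired state; return the corrective actions taken."""
    actions = []
    # Too few healthy tasks: start replacements.
    while len(service.running) < service.desired_replicas:
        task_id = f"{service.name}.{service.next_id}"
        service.next_id += 1
        service.running.append(task_id)
        actions.append(f"start {task_id}")
    # Too many tasks: stop the surplus.
    while len(service.running) > service.desired_replicas:
        actions.append(f"stop {service.running.pop()}")
    return actions


blog = Service(name="blogapy", desired_replicas=3)
print(reconcile(blog))            # initial deployment: three tasks are started
blog.running.remove("blogapy.2")  # simulate a crashed task
print(reconcile(blog))            # one replacement is started
print(reconcile(blog))            # already in the desired state: no actions
```

In a real system this check would run continuously, fed by health information rather than by a manual `remove` call.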

Three Levels of Self-Healing Systems

Application Level
  • Exception handling & logging
  • The developer's responsibility
System Level
  • Watches for process failures & response time
  • Restart/redeploy and scale services up or down
Hardware Level
  • There is no such thing as hardware self-healing
  • Redeployment on a healthy node, plus preventive healing
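
For the application level, a minimal sketch of "exception handling & logging" might look like the following. The helper `call_with_retry` and the flaky `flaky_lookup` dependency are hypothetical, written only to show the pattern; when retries are exhausted the error is re-raised so that a higher level can react.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")


def call_with_retry(operation, attempts=3, delay_seconds=0.0):
    """Run operation(); log each failure and retry up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let a higher level handle the failure
            time.sleep(delay_seconds)


# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("backing service unavailable")
    return "ok"


result = call_with_retry(flaky_lookup)
print(result)
```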
Okay!
Can self-healing systems be applied to microservices only?
Hmm...
Self-healing systems can be applied to any architecture.

Packaging

  • Server
  • Virtual Machines
  • VM Images
  • Image Layers
  • Container

Quick Demo

$ docker run -d -p 8000:8080 <image-name>
where <image-name> is tomcat:7, tomcat:8, or tomcat:9
$ docker exec -it <container_name/id> bash

Microservices

Services are small and fine-grained, each performing a single function.

Services are easy to replace and can be deployed independently.

If one service fails, the whole application does not have to fail.

Services can be implemented using different programming languages, databases, hardware, and software environments, depending on what fits best.

Each service is managed by one two-pizza team.

Microservices come with complexity and new challenges.

Principles of Microservices

Small autonomous services

  • Modeled around business concepts
  • Culture of automation
  • Highly observable
  • Isolate failure
  • Deploy independently
  • Decentralize all the things
  • Hide internal implementation details
        

Showtime

Logistics

Infrastructure
Service/Application
*Not an actual representation of the demo
Amazon Web Services
Docker Images
*Node: a physical or virtual machine that hosts services
*Service: running software that provides a utility via an interface


- SSH to Manager:

	$ ssh -i <AWS_Pvt_Key> <ManagerSSHLoadBalancer>


- Check all nodes/VMs. Identify Manager & Worker nodes/VMs

	$ docker node ls


- Log in to the Docker registry/Hub

	$ docker login


- Create a new service and validate it

	$ docker service create -p 80:4000 --with-registry-auth --name blogapy ashishapy/blog
	$ docker service ls
	$ docker service ps blogapy

- Open a browser and enter the external load balancer's URL. Check that the application is running.

Remember: Three Levels of Self-Healing

Application Level
  • Exception handling & logging
  • The developer's responsibility
System Level
  • Watches for process failures & response time
  • Restart/redeploy and scale services up or down
Hardware Level
  • There is no such thing as hardware self-healing
  • Redeployment on a healthy node, plus preventive healing

- Two aims with one shot :)

    -> Terminate the VM on which the service is running.

	$ docker node ls

- Check that the service is rescheduled to a healthy node

	$ docker service ls
	$ docker service ps blogapy
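
What the step above demonstrates can be sketched as a toy rescheduler: tasks on the failed node are moved to the least-loaded healthy node. The function `reschedule` and the node names are hypothetical; the real Swarm scheduler is more involved than this simplification.

```python
def reschedule(placement, failed_node, healthy_nodes):
    """Move every task on failed_node to the least-loaded healthy node.

    placement maps task id -> node name; a new mapping is returned.
    """
    if not healthy_nodes:
        raise ValueError("no healthy node left to host the tasks")
    # Count the tasks already running on each healthy node.
    load = {node: 0 for node in healthy_nodes}
    for node in placement.values():
        if node in load:
            load[node] += 1
    new_placement = dict(placement)
    for task, node in sorted(placement.items()):
        if node == failed_node:
            # Ties are broken by node name to keep the result deterministic.
            target = min(healthy_nodes, key=lambda n: (load[n], n))
            new_placement[task] = target
            load[target] += 1
    return new_placement


placement = {"blogapy.1": "worker-1"}
after = reschedule(placement, "worker-1", ["manager-1", "worker-2"])
print(after)
```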
	
- After some time, check that another VM is up and has joined the cluster
	
	$ docker node ls
	
- Scale up the service 
	
	$ docker service scale blogapy=12
	$ docker service ls
	$ docker service ps blogapy -f "desired-state=running"
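
To picture what scaling to 12 replicas means for the cluster, here is a rough sketch that spreads replica slots over nodes round-robin. The `spread` function and the five node names are assumptions for this illustration; Swarm's actual placement strategy is not reproduced here.

```python
def spread(replicas, nodes):
    """Assign replica slots 1..replicas to nodes round-robin."""
    if not nodes:
        raise ValueError("at least one node is required")
    assignment = {node: [] for node in nodes}
    for slot in range(1, replicas + 1):
        assignment[nodes[(slot - 1) % len(nodes)]].append(slot)
    return assignment


layout = spread(12, ["node-a", "node-b", "node-c", "node-d", "node-e"])
for node, slots in layout.items():
    print(node, len(slots), slots)
```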
	
- Rolling updates

	$ docker service update --update-delay=10s --update-parallelism=3 \
		--image ashishapy/blog:v2 blogapy
	$ docker service ps blogapy
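
The effect of the two update flags can be sketched as a batch plan: with 12 tasks, a parallelism of 3 and a 10-second delay, tasks are replaced three at a time. The function `rolling_update_plan` is a hypothetical helper, and the sketch assumes each task restarts instantly, which a real update does not.

```python
def rolling_update_plan(task_count, parallelism, delay_seconds):
    """Return a list of (start_offset_seconds, task_slots) update batches.

    Simplification: the time each task needs to restart is ignored.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    slots = list(range(1, task_count + 1))
    plan = []
    for start in range(0, task_count, parallelism):
        batch_index = start // parallelism
        plan.append((batch_index * delay_seconds, slots[start:start + parallelism]))
    return plan


plan = rolling_update_plan(task_count=12, parallelism=3, delay_seconds=10)
for offset, batch in plan:
    print(f"t+{offset}s: update tasks {batch}")
```

Because only one batch is replaced at a time, the remaining tasks keep serving traffic throughout the update.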

Application Cluster (Docker 1.12.x)

Docker Swarm Mode (Docker 1.12.x)

Infrastructure

AWS Services:
  • EC2 Instances + Autoscaling Group
  • IAM Profiles
  • DynamoDB Tables
  • SQS Queue
  • VPC + Subnets
  • ELB
*Simplified Diagram