Canary Release with Docker, Kubernetes, GitLab, and Traefik

About me

From Fortaleza - CE

10+ years of coding

CTO/Co-founder @Agendor

Java, JavaScript, Android, PHP, Node.js, Ruby, React, Infra, Linux, AWS, MySQL, Business, Product.

Tulio Monte Azul

@tuliomonteazul

tulio@agendor.com.br

Agendor is the salesperson's main tool.

(A "GitHub for salespeople")

Web and mobile versions, with free and paid plans.

Before

  • SSH + ./deploy.sh
  • Deploys at 7 PM + editing the ELB by hand
  • Rollback? Huh?
  • Fear

After

  • Auto recovery
  • Auto scaling
  • Canary Release
  • Rollback

Context

Solutions

Challenges

Context

  • PHP project (legacy)
  • Migrating to an SPA with Rails + React
  • Lean team (6 devs)
  • Competitive market
  • 25k active users
  • 5k rpm (requests per minute)

Solutions

credits: https://martinfowler.com/bliki/CanaryRelease.html

"I think I saw a bug"

CI/CD

Easy deploy/rollback

Automation

Cost

stages:
  - test
  - release
  - deploy

image: azukiapp/docker-kubernetes-ci:17.04

services:
- docker:dind

variables:
  DOCKER_DRIVER: overlay

before_script:
  - docker info
  - docker-compose --version

  # Install kubectl and generate configurations
  # More info in https://gitlab.com/agendorteam/agendor-infra#configure-kubectl-context-in-ci
  - kube-config-generator
  - kubectl cluster-info
  - helm version

# Test
###########
test:
  stage: test
  tags:
    - docker

  script:
    - docker-compose build test
    - docker-compose run test

# Release
###########
.release: &release
  stage: release
  script:
    # REFERENCES: https://docs.gitlab.com/ce/ci/variables/
    # - ./scripts/release.sh agendor-web master
    - ./scripts/release.sh $CI_PROJECT_NAME $IMAGE_TAG

release_current_staging:
  <<: *release
  tags:
    - docker
  variables:
    DEPLOY_TYPE: staging
    IMAGE_TAG: stg-$CI_BUILD_REF_SLUG-$CI_BUILD_REF
  except:
    - master

release_current_test:
  <<: *release
  tags:
    - docker
  variables:
    DEPLOY_TYPE: test
    IMAGE_TAG: tst-$CI_BUILD_REF_SLUG-$CI_BUILD_REF
  except:
    - master

release_current_production:
  <<: *release
  tags:
    - docker
  variables:
    DEPLOY_TYPE: production
    IMAGE_TAG: $CI_BUILD_REF_SLUG-$CI_BUILD_REF
  only:
    - master

release_latest_production:
  <<: *release
  tags:
    - docker
  variables:
    DEPLOY_TYPE: production
    IMAGE_TAG: latest
  only:
    - master

# Deploy
###########
.deploy_template: &deploy_definition
  stage: deploy
  tags:
    - docker
  script:
    # ./scripts/deploy.sh [PROJECT_NAME [REF_NAME [HELM_NAME]]]
    - ./scripts/deploy.sh $CI_PROJECT_NAME $IMAGE_TAG
    - ./scripts/deploy.sh $CI_PROJECT_NAME $IMAGE_TAG agendor-api-mobile

deploy_staging:
  <<: *deploy_definition
  environment:
    name: staging
  variables:
    DEPLOY_TYPE: staging
    IMAGE_TAG: stg-$CI_BUILD_REF_SLUG-$CI_BUILD_REF
  only:
    - develop

deploy_canary:
  <<: *deploy_definition
  environment:
    name: canary
  variables:
    DEPLOY_TYPE: canary
    IMAGE_TAG: $CI_BUILD_REF_SLUG-$CI_BUILD_REF
  only:
    - master

# manuals
deploy_staging_manual:
  <<: *deploy_definition
  environment:
    name: staging
  variables:
    DEPLOY_TYPE: staging
    IMAGE_TAG: stg-$CI_BUILD_REF_SLUG-$CI_BUILD_REF
  when: manual
  except:
    - develop
    - master

deploy_test_manual:
  <<: *deploy_definition
  environment:
    name: test
  variables:
    DEPLOY_TYPE: test
    IMAGE_TAG: tst-$CI_BUILD_REF_SLUG-$CI_BUILD_REF
  when: manual
  except:
    - master

deploy_production_manual:
  <<: *deploy_definition
  environment:
    name: production
  variables:
    DEPLOY_TYPE: production
  when: manual
  only:
    - master
.gitlab-ci.yml

Used in production

Google, Red Hat, CoreOS

Monitoring, rolling updates, auto recovery

Community

Documentation

Helper tools

  • Kubectl

  • Kubernetes Dashboard

  • Kops

  • Helm

  • Cluster-Autoscaler

Request

Hey! How's the CPU on those Pods doing?

Pod 1 is at 43% and Pod 2 is at 62%

So the average is 52.5%

Then you need to change your replica count from 2 to 3

(node)
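
This back-and-forth is exactly what the Horizontal Pod Autoscaler does: it averages CPU utilization across the pods and sets the replica count to ceil(current replicas × average utilization / target utilization). A minimal sketch matching the numbers above, assuming a 50% CPU target (ceil(2 × 52.5 / 50) = 3):

# Hypothetical HPA for the conversation above; the 50% target is an assumption
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: agendor-web
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: agendor-web
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50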

Used in production

Multiple environments

Deploy per environment

Hipster

Infrastructure as Code

Dev environment = prod

Host mapping

Fast

Canary Release

frontends:
  # agendor-web
  web:
    rules:
      - "Host:web.agendor.com.br"
  web_canary:
    rules:
      - "Host:web.agendor.com.br"
      - "HeadersRegexp:Cookie,_AUSER_TYPE=canary"
backends:
  ## agendor-web
  web:
    - name: k8s
      url: "http://agendor-web.production.svc.cluster.local"
  web_canary:
    - name: k8s
      url: "http://agendor-web-canary.production.svc.cluster.local"
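
The canary backend URL above is simply the cluster-internal DNS name of a Service called agendor-web-canary in the production namespace. A minimal sketch of what that Service could look like (the port is an assumption; the selector reuses the app=agendor-web-canary label queried with kubectl later on):

# Hypothetical Service behind agendor-web-canary.production.svc.cluster.local
apiVersion: v1
kind: Service
metadata:
  name: agendor-web-canary
  namespace: production
spec:
  selector:
    app: agendor-web-canary  # same label used with kubectl get pods -l=app=...
  ports:
    - port: 80        # assumed
      targetPort: 80  # assumed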

Challenges

#1 Pods Scaling

NewRelic FTW

Kubectl FTW

$ kubectl --namespace=production scale --replicas=8 deployment/agendor-web-canary
$ kubectl --namespace=production get pods -l=app=agendor-web-canary
NAME                                  READY     STATUS    RESTARTS   AGE
agendor-web-canary-1388324162-2qqgr   1/1       Running   0          5h
agendor-web-canary-1388324162-38xlb   1/1       Running   0          1m
agendor-web-canary-1388324162-3g815   1/1       Running   0          5h
agendor-web-canary-1388324162-89j6k   1/1       Running   0          5h
$ kubectl --namespace=production get pods -l=app=agendor-web-canary
NAME                                  READY     STATUS               RESTARTS   AGE
agendor-web-canary-1388324162-2qqgr   1/1       Running              0          5h
agendor-web-canary-1388324162-38xlb   1/1       Running              0          1m
agendor-web-canary-1388324162-3g815   1/1       Running              0          5h
agendor-web-canary-1388324162-89j6k   1/1       Running              0          5h
agendor-web-canary-1388324162-gjndp   0/1       ContainerCreating    0          4s
agendor-web-canary-1388324162-n57wc   0/1       ContainerCreating    0          4s
agendor-web-canary-1388324162-q0pcv   0/1       ContainerCreating    0          4s
agendor-web-canary-1388324162-t8v4s   0/1       ContainerCreating    0          4s

Limits tuning

# Default values for agendor-web.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
name: agendor-web
namespace: default
app_env: staging

# autoscale configurations
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 50


...
resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 100m
    memory: 256Mi
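
# After limits tuning: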
# Default values for agendor-web.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
name: agendor-web
namespace: default
app_env: staging

# autoscale configurations
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 30 # decreased


...
resources:
  limits:
    cpu: 200m # increased
    memory: 512Mi # increased
  requests:
    cpu: 100m
    memory: 256Mi
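
For reference, a sketch of how these values might be consumed in the chart's deployment template (hypothetical snippet; the deck does not show this part of the template):

containers:
  - name: {{ .Values.name }}
    resources:
      requests:
        cpu: {{ .Values.resources.requests.cpu }}
        memory: {{ .Values.resources.requests.memory }}
      limits:
        cpu: {{ .Values.resources.limits.cpu }}
        memory: {{ .Values.resources.limits.memory }}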

Add Monitor

# back-end
web_canary:
  - name: k8s
    url: "http://agendor-web-canary.production.svc.cluster.local"




# front-end
web_canary:
  rules:
    - "Host:web.agendor.com.br"
    - "HeadersRegexp:Cookie,_AUSER_TYPE=canary"
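
# After adding a dedicated canary host for direct monitoring: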
# back-end
web_canary:
  - name: k8s
    url: "http://agendor-web-canary.production.svc.cluster.local"
web_canary_host:
  - name: k8s
    url: "http://agendor-web-canary.production.svc.cluster.local"

# front-end
web_canary:
  rules:
    - "Host:web.agendor.com.br"
    - "HeadersRegexp:Cookie,_AUSER_TYPE=canary"
web_canary_host:
  rules:
    - "Host:web-canary.agendor.com.br"

#2 Pods Scaling + Deploy

Rolling Update Fail

Deployment vs HPA

kind: Deployment
apiVersion: extensions/v1beta1
metadata:
  name: {{ .Values.name }}
...
spec:
  replicas: {{ .Values.minReplicas }}
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: {{ .Values.name }}
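
# HorizontalPodAutoscaler template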
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Values.name }}
...
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: {{ .Values.name }}
  minReplicas: {{ .Values.minReplicas }}
  maxReplicas: {{ .Values.maxReplicas }}
  targetCPUUtilizationPercentage: {{ .Values.targetCPUUtilizationPercentage }}
# values.yaml
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 30

strategy:
  rollingUpdate:
    maxSurge: 3
    maxUnavailable: 0
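
A sketch of where this strategy block sits inside the Deployment template shown above (same Helm values assumed): maxUnavailable: 0 means a rollout never drops below the current number of ready pods, and maxSurge: 3 lets up to three extra pods come up during the update.

# Sketch: rollingUpdate strategy in the Deployment spec above
spec:
  replicas: {{ .Values.minReplicas }}
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 3        # up to 3 extra pods during the rollout
      maxUnavailable: 0  # never go below the current replica count
  selector:
    matchLabels:
      app: {{ .Values.name }}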

Rolling Update Win

#3 Cluster Scaling

System slowness alert

Analyzing the slowness

SSH into Server

$ ssh -i ~/.ssh/kube_aws_rsa \
    admin@ec2-XX-XXX-XXX-XXX.sa-east-1.compute.amazonaws.com

Debugging Nodes 

$ kubectl top nodes
NAME                                          CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%
ip-172-20-44-176.sa-east-1.compute.internal   35m          15%       2132Mi          55%
ip-172-20-35-247.sa-east-1.compute.internal   82m          40%       2426Mi          62%
ip-172-20-41-137.sa-east-1.compute.internal   125m         68%       2406Mi          62%
ip-172-20-60-122.sa-east-1.compute.internal   130m         40%       2459Mi          63%
ip-172-20-116-96.sa-east-1.compute.internal   429m         21%       2271Mi          58%
ip-172-20-117-4.sa-east-1.compute.internal    140m         46%       2262Mi          58%

Debugging Nodes 

$ kubectl describe nodes

Debugging the autoscaler

$ kubectl logs -f --tail=100 --namespace=kube-system cluster-autoscaler-2462627368-c1hhl
I0622 18:08:59 scale_up.go:62] Upcoming 1 nodes
I0622 18:08:59 scale_up.go:81] Skipping node group nodes.k8s.agendor.com.br - max size reached
I0622 18:08:59 scale_up.go:132] No expansion options

# deployment.yaml
containers:
  - name: cluster-autoscaler
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
    imagePullPolicy: "{{ .Values.image.pullPolicy }}"
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
    {{- range .Values.autoscalingGroups }}
      - --nodes={{ .minSize }}:{{ .maxSize }}:{{ .name }}
    {{- end }}

# values.yaml
autoscalingGroups:
  - name: nodes.k8s.agendor.com.br
    minSize: 1
    maxSize: 15

Change instance groups

$ kops get ig
Using cluster from kubectl context: k8s.agendor.com.br

NAME			ROLE	MACHINETYPE	MIN	MAX	SUBNETS
master-sa-east-1a	Master	t2.medium	1	1	sa-east-1a
nodes			Node	t2.medium	2	6	sa-east-1a,sa-east-1c

$ kops edit ig nodes
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2017-02-04T07:30:23Z"
  labels:
    kops.k8s.io/cluster: k8s.agendor.com.br
  name: nodes
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 6
  minSize: 2
  role: Node
  rootVolumeSize: 50
  subnets:
  - sa-east-1a
  - sa-east-1c

# After editing maxSize from 6 to 20:
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2017-02-04T07:30:23Z"
  labels:
    kops.k8s.io/cluster: k8s.agendor.com.br
  name: nodes
spec:
  image: kope.io/k8s-1.6-debian-jessie-amd64-hvm-ebs-2017-05-02
  machineType: t2.medium
  maxSize: 20
  minSize: 2
  role: Node
  rootVolumeSize: 50
  subnets:
  - sa-east-1a
  - sa-east-1c

Change instance groups

$ kops update cluster
Using cluster from kubectl context: k8s.agendor.com.br


*********************************************************************************

A new kubernetes version is available: 1.6.6
Upgrading is recommended (try kops upgrade cluster)

More information: https://github.com/kubernetes/kops/blob/master/permalinks/upgrade_k8s.md#1.6.6

*********************************************************************************

I0623 10:13:07.893227   35173 executor.go:91] Tasks: 0 done / 60 total; 27 can run
I0623 10:13:10.226305   35173 executor.go:91] Tasks: 27 done / 60 total; 14 can run
I0623 10:13:10.719476   35173 executor.go:91] Tasks: 41 done / 60 total; 17 can run
I0623 10:13:11.824240   35173 executor.go:91] Tasks: 58 done / 60 total; 2 can run
I0623 10:13:11.933869   35173 executor.go:91] Tasks: 60 done / 60 total; 0 can run
Will modify resources:
  AutoscalingGroup/nodes.k8s.agendor.com.br
  	MaxSize             	 6 -> 20


Must specify --yes to apply changes

#4 Cluster Scaling

pt. 2

Black Hole

$ kubectl logs -f --tail=200 --namespace=kube-system \
    $(kubectl get pods --namespace=kube-system \
        | grep cluster-autoscaler \
        | awk '{print $1}' \
        | head -1)
12:51:26 scale_up.go:62] Upcoming 0 nodes
12:51:26 scale_up.go:145] Best option to resize: nodes.k8s.agendor.com.br
12:51:26 scale_up.go:149] Estimated 1 nodes needed in nodes.k8s.agendor.com.br
12:51:26 scale_up.go:169] Scale-up: setting group nodes.k8s.agendor.com.br size to 7
12:51:26 aws_manager.go:124] Setting asg nodes.k8s.agendor.com.br size to 7
12:51:26 event.go:217] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-system", Name:"cluster-autoscaler-status", UID:"3d1fa43e-59e8-11e7-b0ba-02fccc543533", APIVersion:"v1", ResourceVersion:"22154083", FieldPath:""}): type: 'Normal' reason: 'ScaledUpGroup' Scale-up: group nodes.k8s.agendor.com.br size set to 7
12:51:26 event.go:217] Event(v1.ObjectReference{Kind:"Pod", Namespace:"production", Name:"agendor-web-canary-3076992955-95b3p", UID:"2542369b-5a6e-11e7-b0ba-02fccc543533", APIVersion:"v1", ResourceVersion:"22154097", FieldPath:""}): type: 'Normal' reason: 'TriggeredScaleUp' pod triggered scale-up, group: nodes.k8s.agendor.com.br, sizes (current/new): 6/7
...
12:51:37 scale_up.go:62] Upcoming 1 nodes
12:51:37 scale_up.go:124] No need for any nodes in nodes.k8s.agendor.com.br
12:51:37 scale_up.go:132] No expansion options
12:51:37 static_autoscaler.go:247] Scale down status: unneededOnly=true lastScaleUpTime=2017-06-26 12:51:26.777491768 +0000 UTC lastScaleDownFailedTrail=2017-06-25 20:52:45.172500628 +0000 UTC schedulablePodsPresent=false
12:51:37 static_autoscaler.go:250] Calculating unneeded nodes

$ kubectl describe nodes

Delete routes stuck in "Black Hole" status:

$ aws ec2 describe-route-tables \
    --filters "Name=route-table-id, Values=rtb-7923f31d" \
    | grep blackhole \
    | awk '{print $2}' \
    | xargs -n 1 \
        aws ec2 delete-route --route-table-id rtb-7923f31d --destination-cidr-block

#5 Canary Release

Canary Bug Tracking

Traefik routing requests randomly

A customer reports a bug: is the problem in the application or in the new infrastructure?

The lesson about random routing in a canary release:

Good for testing scalability and performance

Bad if kept for too long

Solution:

backends:
  web:
    - name: legacy_1
      values:
        - weight = 5
        - url = "http://ec2-54-94-XXX-XXX.sa-east-1.compute.amazonaws.com"
    - name: legacy_2
      values:
        - weight = 5
        - url = "http://ec2-54-233-XX-XX.sa-east-1.compute.amazonaws.com"
    - name: k8s
      values:
        - weight = 1
        - url = "http://agendor-api-v2-canary.production.svc.cluster.local"

Canary Bug Tracking

# aws.yml (traefik)
backends:
  web:
    - name: legacy_1
      values:
        - weight = 5
        - url = "http://ec2-54-94-XXX-XXX.sa-east-1.compute.amazonaws.com"
    - name: legacy_2
      values:
        - weight = 5
        - url = "http://ec2-54-233-XX-XX.sa-east-1.compute.amazonaws.com"
    - name: k8s
      values:
        - weight = 1
        - url = "http://agendor-api-v2-canary.production.svc.cluster.local"
  web_canary:
    - name: k8s
      url: "http://agendor-web-canary.production.svc.cluster.local"
# aws.yml (traefik)
backends:
  web:
    - name: legacy_1
      url = "http://ec2-54-94-XXX-XXX.sa-east-1.compute.amazonaws.com"
    - name: legacy_2
      url = "http://ec2-54-233-XX-XXX.sa-east-1.compute.amazonaws.com"
  web_canary:
    - name: k8s
      url: "http://agendor-web-canary.production.svc.cluster.local"
# PHP
public function addBetaTesterCookie($accountId = NULL) {
  if ($this->isCanaryBetaTester($accountId)) { // check on Redis
    $this->setCookie(BetaController::$BETA_COOKIE_NAME, 
        BetaController::$BETA_COOKIE_VALUE);
  } else {
    $this->setCookie(BetaController::$BETA_COOKIE_NAME, 
        BetaController::$LEGACY_COOKIE_VALUE);
  }
}

Multiple versions

Multiple versions of the code running at the same time

Everyone has to develop with the deploy/release in mind from the start

Before:

id  name  age
1   Jon   29
2   Arya  16

Deploy 1 (add the new column):

id  name  age  birth date
1   Jon   29   19/07/88
2   Arya  16   07/05/01

Deploy 2 (remove the old column):

id  name  birth date
1   Jon   19/07/88
2   Arya  07/05/01

#6 Helm + recreate-pods

Helm

--recreate-pods to work around pods not being recreated on upgrade

Impact: rolling updates fail

env:
  - name: APP_NAME
    value: "{{ .Values.name }}"
  - name: APP_NAMESPACE
    value: "{{ .Values.namespace }}"
  - name: APP_ENV
    value: "{{ .Values.app_env }}"
  - name: APP_FULLNAME
    value: "{{ .Values.name }}-{{ .Values.app_env }}"
  # Force pod recreation by injecting a value that changes on every release
  - name: UPDATE_CONFIG
    value: "{{ $.Release.Time }}"
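
Another option (a sketch, not what this deck uses) is the Helm convention of annotating the pod template with a checksum of the chart's config, so pods are recreated only when the configuration actually changes instead of on every release; the configmap.yaml path here is hypothetical:

# Hypothetical alternative: roll pods only when the chart's configmap.yaml changes
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}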

Thank you!

linkedin.com/in/tuliomonteazul

@tuliomonteazul

tulio@agendor.com.br

Special thanks to

Everton Ribeiro

Gullit Miranda

Marcus Gadbem
