Mastering DevOps

Scaling from 50k to 500k WAU: How to move fast without breaking things

About Me

Yash Mehrotra

Backend Engineer turned DevOps

 

 

Why DevOps

Always had a keen interest in how infrastructure works

 

A knack for automating things

Found joy in improving development workflow

Before we start ...

What is DevOps

The umbrella of DevOps: Infra as code

You own all the infrastructure that the applications run on

 

Use IaC to manage all the infrastructure

  • Terraform/Pulumi to create resources
  • Ansible/Puppet to manage changes
  • If you use kubernetes, you already know the benefits of a declarative infrastructure
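To make "declarative" concrete, here is a toy sketch of the idea behind tools like Terraform and Kubernetes: you describe the desired state, and the tool works out the steps to get there. The resource names and the `plan` function are hypothetical, and real tools do far more (dependency ordering, state locking, provider APIs).

```python
from typing import Dict, List, Tuple

Resource = Dict[str, object]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Return the (action, resource_name) steps that turn `actual` into `desired`."""
    steps = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical resources, for illustration only.
desired = {
    "web": {"kind": "vm", "size": "medium"},
    "db": {"kind": "postgres", "version": 14},
}
actual = {
    "web": {"kind": "vm", "size": "small"},
    "cache": {"kind": "redis"},
}

print(plan(desired, actual))
```

Running the plan against a state that already matches returns no steps, which is what makes repeated applies safe.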

The umbrella of DevOps: Observability

You cannot avoid issues forever. Have systems in place that help you find the root cause quickly

 

Observability is key

  • Logs, metrics and traces are the 3 pillars of observability
  • Know which metrics to track, start with the RED method
    • Rate - the number of requests per second your services are serving
    • Errors - the number of failed requests per second
    • Duration - the distribution of the amount of time each request takes

 

  • Use an APM to know your application performance
  • Have a time-series database to visualize relevant metrics
  • Use tracing to identify which part of your code is slow
  • Have a global request ID that helps you keep track of a request across its lifecycle in your infrastructure
  • Logs are your friend, use them wisely, but don't abuse them
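As a rough illustration of the RED metrics above, the sketch below summarises one window of requests. The `Request` record, the nearest-rank percentile and the "5xx means failed" rule are assumptions made for the example. In practice your APM or metrics stack computes these for you.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP status code


def percentile(sorted_values: List[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Summarise one non-empty window of requests as Rate, Errors and Duration."""
    durations = sorted(r.duration_ms for r in requests)
    # Assumption: a request counts as failed when it returns a 5xx status.
    failed = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "errors_per_s": failed / window_seconds,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }


# Ten hypothetical requests observed over a 5 second window.
window = [Request(duration_ms=10 * i, status=500 if i > 8 else 200) for i in range(1, 11)]
summary = red_summary(window, window_seconds=5)
print(summary)
```

Duration is reported as percentiles rather than an average, because a few slow requests can hide behind a healthy-looking mean.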


The umbrella of DevOps: Alerting

  • Are you tracking crashes and errors somewhere ?
  • How do you know an error is due to a recent deploy ?
  • Do your users tell you that there is a bug ?

 

Alerting is crucial. Define what you classify as an incident and create alerts for that

 

Use paging tools for P0 and P1 alerts

 

Use email or chat (Slack/Mattermost) for all other alerts

A common pitfall is setting up too many alerts

 

If you get paged too often, you get used to ignoring the pages. Classifying alerts is key

 

If there is no action you can take on an alert, it should not be bothering you
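The classification rules above can be sketched as a small routing function. The `Alert` fields and channel names are hypothetical. The point is only that the paging decision is written down once, not made ad hoc per alert.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    PAGER = "pager"
    CHAT = "chat"   # Slack, Mattermost or email
    DROP = "drop"   # not worth anyone's attention


@dataclass
class Alert:
    name: str
    priority: int     # 0 is the most severe (P0)
    actionable: bool  # is there something a human can do about it?


def route(alert: Alert) -> Channel:
    """Decide where an alert goes, following the rules from the slides."""
    if not alert.actionable:
        return Channel.DROP
    if alert.priority <= 1:
        return Channel.PAGER
    return Channel.CHAT


print(route(Alert("checkout error rate above threshold", priority=0, actionable=True)))
print(route(Alert("disk usage at 70%", priority=3, actionable=True)))
print(route(Alert("nightly job took 2% longer", priority=3, actionable=False)))
```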


The umbrella of DevOps: Load Testing

Smart engineers write the best code and hope it doesn't fail

 

Smarter engineers eliminate the possibility of failing

Great tech companies do not handle immense load because they use <insert your favourite stack>

 

They stay up because they find the bottlenecks of their systems and fix them


Load test early, Load test often

 

Use benchmarking tools to simulate traffic and know how your system performs under load


As a rule of thumb, you should be able to handle 3x your current traffic at any given time
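The 3x rule of thumb is easy to check once a load test has told you the highest request rate the system sustains. A minimal sketch, with made-up numbers and hypothetical function names:

```python
def headroom_factor(max_sustained_rps: float, current_peak_rps: float) -> float:
    """How many times the current peak traffic the system can sustain."""
    return max_sustained_rps / current_peak_rps


def meets_headroom(max_sustained_rps: float, current_peak_rps: float,
                   required_factor: float = 3.0) -> bool:
    """True when the load-tested capacity covers the required multiple of peak."""
    return headroom_factor(max_sustained_rps, current_peak_rps) >= required_factor


# Hypothetical numbers: the load test sustained 1200 req/s, production peaks at 500 req/s.
print(headroom_factor(1200, 500))
print(meets_headroom(1200, 500))
```

With these numbers the system falls short of 3x, which is the signal to go looking for the next bottleneck.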


The umbrella of DevOps: Autoscaling

Kids scale on CPU and Memory
Legends scale on key performance indicators

We have been taught that scaling on CPU & memory is enough

 

But times have changed. Systems have become much more complex now

Scale on metrics that impact the overall experience

Scenario:

You send an OTP via text message during login.

Everyone uses a service provider for this. The OTP-sending service should scale on delivery latency, so that every user receives their OTP in under 30 seconds

If you have a processing queue, scale the number of workers based on processing time

Work with your product and business teams to find the key metrics that scaling your application should improve
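For the queue case above, one way to turn "scale on processing time" into a number is to ask how many workers are needed to drain the current backlog within the target time. This is a sketch under simple assumptions (steady per-job processing time, hypothetical min/max bounds), not a description of any particular autoscaler.

```python
import math


def desired_workers(backlog: int, avg_processing_seconds: float, target_seconds: float,
                    min_workers: int = 1, max_workers: int = 50) -> int:
    """Workers needed so the current backlog drains within target_seconds."""
    needed = math.ceil(backlog * avg_processing_seconds / target_seconds)
    # Clamp to sane bounds so a spike cannot scale the service without limit.
    return max(min_workers, min(max_workers, needed))


# Hypothetical: 600 OTP messages queued, 0.5 s to send each, 30 s delivery target.
print(desired_workers(backlog=600, avg_processing_seconds=0.5, target_seconds=30))
```

CPU can look idle while this number climbs, for example when workers spend their time waiting on the SMS provider, which is why the business-facing metric is the better scaling signal here.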


The umbrella of DevOps: CI/CD

Everybody talks CI/CD ... but, is anyone even doing it ?

 

CI: Push small and regular changes to your master branch

 

CD: Deploy those small changes regularly

The umbrella of DevOps: Continuous Integration

If you don't write your tests, you can't have any pudding.

How can you have any pudding if you don't write your tests?

It's the age of automation. No one has the time to manually verify whether everything works.


Unit, functional and integration tests all have their own importance. Know which one suits your use case.

Pro-tip: Always write tests for failure cases
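A small example of what "tests for failure cases" can look like, using Python's built-in `unittest`. The `parse_quantity` helper is hypothetical. What matters is that the bad inputs get their own tests alongside the happy path.

```python
import unittest


def parse_quantity(raw: str) -> int:
    """Parse a positive order quantity from user input (hypothetical helper)."""
    value = int(raw)  # raises ValueError for non-numeric input
    if value <= 0:
        raise ValueError(f"quantity must be positive, got {value}")
    return value


class ParseQuantityTest(unittest.TestCase):
    def test_valid_quantity(self):
        self.assertEqual(parse_quantity("3"), 3)

    def test_rejects_non_numeric_input(self):
        with self.assertRaises(ValueError):
            parse_quantity("three")

    def test_rejects_zero_and_negative(self):
        for raw in ("0", "-2"):
            with self.subTest(raw=raw):
                with self.assertRaises(ValueError):
                    parse_quantity(raw)
```

Saved to a file, this can be run in CI with `python -m unittest`.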

The umbrella of DevOps: Continuous Deployment

Have a single step deploy process

It can be an Ansible playbook

A Jenkins job

Or even a git push

Simplify your life so that even Friday evening deploys are blissful

PS: Do not deploy on Friday evenings, have some empathy

The umbrella of DevOps: Security

 

  • Security checks should be part of your CI flow
  • You can start with basic checks such as
    • Static code analysis
    • Container scanning
  • The smartest move is to hire a dedicated security engineer if you do not have the time or resources to take it up yourself

Common security mistakes

 

  • Unsanitized input
  • Authenticating tokens but not checking for authorization
  • Installing packages containing malware
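The second mistake above is worth spelling out: a valid token only tells you who is calling, not whether they may touch the resource they asked for. A minimal sketch, with hypothetical in-memory stores standing in for a token service and a database:

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class User:
    user_id: str


# Hypothetical data, for illustration only.
TOKENS: Dict[str, User] = {"token-alice": User("alice"), "token-bob": User("bob")}
INVOICE_OWNERS: Dict[str, str] = {"inv-1": "alice", "inv-2": "bob"}


class Unauthenticated(Exception):
    """The caller could not be identified."""


class Forbidden(Exception):
    """The caller is known but not allowed to do this."""


def authenticate(token: str) -> User:
    """Authentication: who is calling?"""
    user = TOKENS.get(token)
    if user is None:
        raise Unauthenticated("unknown token")
    return user


def get_invoice(token: str, invoice_id: str) -> str:
    user = authenticate(token)
    # Authorization: is this caller allowed to see this invoice?
    # Skipping this check lets any logged-in user read any invoice.
    if INVOICE_OWNERS.get(invoice_id) != user.user_id:
        raise Forbidden(f"{user.user_id} may not read {invoice_id}")
    return f"invoice {invoice_id} for {user.user_id}"


print(get_invoice("token-alice", "inv-1"))
```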


The umbrella of DevOps: Access Control

  • Do not give everyone master access
  • Create groups and assign permissions to groups
  • An off-boarding process is a must
  • Always have an audit policy in place
  • Even if you don't have audit logging, still tell everyone you have audit logging

The umbrella of DevOps: Postmortems

It is okay if you have an outage

 

But it is bad if you don't learn anything from it

 

Postmortems are a great way to understand the root-cause of the outage and share your learnings

Common reasons for outages

  • Small outages due to code/logical errors
  • Large outages due to data stores
    • Elasticsearch slowing down due to shard replication
    • RDS becoming a bottleneck
  • In microservices, outages can lead to cascading issues
    • Service A hits Service B, which is not responding, so Service A's threads stay busy waiting
    • Service C hits Service A, which is not responding because of Service B's downtime, even though the Service C -> Service A call does not require Service A to hit Service B
  • Worst outages are usually due to infrastructural components
    • Hardest to debug
    • You only monitor them when you know in what ways they could break
    • Things which can go wrong:
      • DNS
      • Load balancer configuration
      • VPC network choking
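The cascading case above, where Service A's threads are stuck waiting on Service B, can be put into numbers with Little's Law: average concurrency equals arrival rate times the time each request is held. The pool size, request rate and timings below are hypothetical.

```python
def busy_threads(requests_per_second: float, seconds_held: float) -> float:
    """Average threads in use, by Little's Law: concurrency = rate x time held."""
    return requests_per_second * seconds_held


def pool_is_saturated(pool_size: int, requests_per_second: float, seconds_held: float) -> bool:
    """True when calls to the slow dependency alone can occupy the whole pool."""
    return busy_threads(requests_per_second, seconds_held) >= pool_size


# Hypothetical Service A: 50 worker threads, 20 req/s on the endpoint that calls Service B.
POOL_SIZE = 50
RATE = 20

# Healthy: Service B answers in 0.5 s, so only a handful of threads are busy.
print(pool_is_saturated(POOL_SIZE, RATE, seconds_held=0.5))

# Service B is down and each call waits 30 s before giving up.
# Every thread ends up blocked, so Service C's unrelated requests get no thread either.
print(pool_is_saturated(POOL_SIZE, RATE, seconds_held=30))
```

The same arithmetic shows why the time a thread may spend waiting on a dependency matters so much: at 20 req/s, anything held for 2.5 s or longer is enough to fill this pool.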


Case Study: DevOps practices in B2C vs B2B

B2B and B2C companies operate in very different ways

  • Prioritization
  • In B2B, SLOs matter much more since they are contractual obligations
  • In B2C, going down means trending on Twitter
  • As a DevOps engineer, business goals should be a part of your decision-making process

DevOps workflow in reality: Deployments @ Grofers

  • Around 12-14 services
  • Started with Ansible playbooks
  • Moved to kubernetes
  • Wrote a python utility to manage & apply manifests
  • Jenkins was used to execute these tasks
  • 110-120 active services
  • Platform to manage manifests which were stored in a database
  • After making changes on the platform, a jenkins job was executed
  • Not scalable since custom solutions require frequent maintenance
  • Moved to helm charts later with Gitlab runners

DevOps workflow in reality: Deployments @ MindTickle

  • Grafana & NewRelic dashboards
  • Sentry for error reporting
  • PagerDuty and slack for alerting
  • Grafana Loki for logs

DevOps workflow in reality: Observability @ Grofers

  • Grafana, Datadog & SumoLogic dashboards
  • Sentry for error reporting
  • Pagerduty and slack for alerting
  • Sumologic for logs

DevOps workflow in reality: Observability @ MindTickle

Redis queue filled up, cascading failures led to outage

 

A few long-running queries choked the database, leading to cascading failures

DevOps workflow in reality: Downtimes @ Grofers

Frequent DNS problems in the kubernetes cluster

 

SumoLogic workloads crashed our DNS server

 

Added Linkerd workloads whose resource consumption throttled our applications

DevOps workflow in reality: Downtimes @ MindTickle

Infrastructure from scratch at Bukukas

The what, the why and how

Infrastructure as Code

  • GCP is the primary cloud provider
  • Terraform provisions
    • Kubernetes cluster
    • CloudSQL (Managed PostgreSQL)
    • Elasticsearch (via third-party provider)
    • CDN (Load balancer + Bucket)
    • All networking resources
      • VPC
      • NAT
      • DNS
      • VPN
    • And much more ...


Why Terraform ?

 

Very easy to start with, and a simple philosophy

 

Running `terraform apply` will give you your desired state

 

Terraform's state acts as the source of truth for our infrastructure

Kubernetes

All our applications run on Kubernetes

 

But why ? Isn't Kubernetes just used by hipsters ?


It gives us seamless deployment and scaling.

 

Wrote a new service and want to deploy it ? Just 20 lines of YAML if you have a Docker image ready

 

Need to create an exact replica of your production ? Takes 1 min

 

Facing a traffic surge and need to double your replicas ? Just a single command
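To give a feel for how small a deployable manifest is, the sketch below builds a minimal Kubernetes Deployment as a Python dict and prints it as JSON, which `kubectl apply` accepts as well as YAML. The service name and image are placeholders, and a real manifest would normally add resource limits and health probes.

```python
import json


def deployment(name: str, image: str, replicas: int = 2, port: int = 8080) -> dict:
    """Build a minimal apps/v1 Deployment manifest for a single container."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


# Placeholder name and image, for illustration only.
manifest = deployment("hello-api", "registry.example.com/hello-api:1.0.0")
print(json.dumps(manifest, indent=2))
```

The output can be piped to `kubectl apply -f -`, and doubling the replicas is then one command such as `kubectl scale deployment hello-api --replicas=4`.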


All external packages are managed via helm

 

Kustomize is used for generating manifests for our applications

 

Flux watches the kustomize repo and applies the changes (GitOps !)

 

Autoscaling based on key metrics using keda

 

Observability

If only you could perceive metrics as I do

Grafana + Prometheus = <3

 

Datadog is our APM and log browser of choice


 

We have grafana dashboards for everything you can imagine:

  • App metrics (Latency, Throughput, Error Rate, CPU, Memory, Replicas, queue backlog)
  • PostgreSQL
  • Elasticsearch
  • Redis
  • Cert Manager
  • DNS & NAT
  • Cluster health
  • VPN


Prometheus scrapes all the metrics and stores them

 

We use Prometheus exporters for our Rails app, PgBouncer, Elasticsearch and Kubernetes nodes
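An exporter's job is to expose metrics over HTTP in the plain-text format Prometheus scrapes. The sketch below renders one metric in that format using only the standard library. It illustrates what the output looks like and is not a replacement for a real exporter or client library (it does not escape label values, for instance). The metric name and label are made up.

```python
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]


def render_metric(name: str, help_text: str, metric_type: str, samples: List[Sample]) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical gauge: jobs waiting in the OTP queue.
print(render_metric("queue_backlog", "Jobs waiting in the queue.", "gauge",
                    [({"queue": "otp"}, 12)]))
```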

Prometheus has become the de facto standard for observability

PS: All our dashboards and scrapers are committed to git

Alerting

Raise the alarms, the bugs are coming

We hate it when our customers experience any problem

 

Our goal is to be the first to know about any problem

 

We have defined all our alert thresholds (error rate, high latency, restarts) in Grafana


 

  • Sentry for in-app errors
  • Slack for non-essential errors and warnings
  • Pager for P0 and P1 errors
  • Rule of thumb: if you get paged and your immediate reaction is not to open your machine and try to fix the issue, it should not have been a page
  • Thorough and routine assessment of all the errors, what caused them, and how to fix them

Load Testing

Can a man still be brave if he's afraid?

That is the only time a man can be brave

You should know the limits of your application

 

All flows of your application should be load tested


We use k6 for our load tests

 

Tests are written in JavaScript, so we can add our own custom logic, and it integrates with Postman, which is where we document all our APIs

Parting thoughts


Make decisions with the next 6 months in mind, and factor in how difficult they will be to change

 

Tracking tech-debt is very important

 

Push back on the business when you need to, but make sure your end goals are aligned

Thank You

You can reach out to me on:

 

        yashmehrotra.com

 

        @yashm95
