Scaling from 50k to 500k WAU: How to move fast without breaking things
Yash Mehrotra
Backend Engineer turned DevOps
Always had a keen interest in how infrastructure works
A knack for automating things
Found joy in improving development workflow
You own all the infrastructure that the applications run on
Use IaC to manage all the infrastructure
You cannot avoid issues forever, so have systems in place that help you find the root cause quickly
Observability is key
Alerting is crucial. Define what you classify as an incident and create alerts for that
Use paging tools for P0 and P1 alerts
Use Slack/Mattermost/email for all other alerts
A common pitfall of this is setting too many alerts
If you get too many pages, you get used to ignoring them. Classifying alerts is key
If an alert has no actionable response, it should not be bothering you
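As an illustration, classified alerts might look like this as Prometheus alerting rules, with severity labels driving the routing (page on P0/P1, chat/email for the rest); the alert names, expressions and thresholds here are invented:

```yaml
# Hypothetical Prometheus alerting rules: the severity label is what the
# routing layer (e.g. Alertmanager) uses to decide pager vs. Slack/email.
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate            # actionable now: someone must respond
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
        for: 5m
        labels:
          severity: P0                  # routed to the paging tool
        annotations:
          summary: "More than 5% of requests are failing"
      - alert: ElevatedLatency          # worth a look, not worth waking anyone
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 15m
        labels:
          severity: P2                  # routed to Slack/Mattermost/email
        annotations:
          summary: "p99 latency above 1s"
```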
Smart engineers write the best code and hope it doesn't fail
Smarter engineers eliminate the possibility of failing
Great tech companies handle immense load not because they use <insert your favourite stack>
They do not crash because they find the bottlenecks of their system and fix those shortcomings
Load test early, Load test often
Use benchmarking tools to simulate traffic and know how your system performs under load
As a rule of thumb, you should be able to handle 3x of your current traffic at any given time
Kids scale on CPU and Memory
Legends scale on key performance indicators
We have been taught that scaling CPU & Memory is enough
But times have changed. Systems have become much more complex now
Scale on metrics that impact the overall experience
Scenario:
You send an OTP via text message during login.
Everyone uses a service provider for this. The OTP-sending service should scale on latency, so that every user receives their OTP in under 30 seconds
If you have a processing queue, scale the number of workers based on processing time
Work with your product and business teams to identify the key metrics your application should scale on
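One way to express latency-based scaling like this is a KEDA ScaledObject with a Prometheus trigger; a sketch, with all names, metrics and thresholds invented:

```yaml
# Hypothetical KEDA ScaledObject: scale OTP workers on delivery latency
# rather than CPU/memory. Metric name and threshold are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: otp-sender
spec:
  scaleTargetRef:
    name: otp-sender            # the Deployment that sends OTPs
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        query: histogram_quantile(0.95, sum(rate(otp_delivery_seconds_bucket[2m])) by (le))
        threshold: "30"         # target: OTP delivered in under 30 seconds
```

The same pattern works for the queue example: swap the query for a queue-depth or processing-time metric and the workers scale on that instead.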
Everybody talks CI/CD... but is anyone actually doing it?
CI: Push small and regular changes to your master branch
CD: Deploy those small changes regularly
If you don't write your tests, you can't have any pudding.
How can you have any pudding if you don't write your tests?
It's the age of automation. No one has the time to manually verify everything.
Unit, functional and integration testing all have their own importance. Know which one suits your use case.
Pro-tip: Always write tests for failure cases
Have a single step deploy process
It can be an ansible playbook
A jenkins job
Or even a git push
Simplify your life so that even Friday evening deploys are blissful
PS: Do not deploy on Friday evenings, have some empathy
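The single-step deploy could be as small as a playbook like this, so `ansible-playbook deploy.yml` is the only command anyone runs; hosts, repo URL and service names below are all placeholders:

```yaml
# Hypothetical single-step deploy playbook. Everything here is a
# placeholder: adapt the host group, repo and service to your setup.
- hosts: app_servers
  become: true
  tasks:
    - name: Pull the release tagged for production
      git:
        repo: git@example.com:org/app.git
        dest: /srv/app
        version: "{{ release_tag | default('main') }}"
    - name: Restart the application
      service:
        name: app
        state: restarted
```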
Common security mistakes
It is okay if you have an outage
But it is bad if you don't learn anything from it
Postmortems are a great way to understand the root-cause of the outage and share your learnings
A Redis queue filled up; cascading failures led to an outage
A few slow queries choked the database, leading to cascading failures
Frequent DNS problems in the Kubernetes cluster
SumoLogic workloads crashed our DNS server
Added Linkerd workloads whose resource consumption throttled our applications
The what, the why and how
Why Terraform ?
Very easy to start with, and a simple philosophy
Running `terraform apply` converges the infrastructure to your desired state
Terraform's state acts as the source of truth for our infrastructure
All our applications run on Kubernetes
But why? Isn't Kubernetes just used by hipsters?
It allows us seamless deployment and scaling.
Wrote a new service and want to deploy it? Just ~20 lines of YAML if you have a Docker container ready
Need an exact replica of your production environment? Takes 1 minute
Facing a traffic surge and need to double your replicas? Just a single command
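Those "20 lines of YAML" are roughly a Deployment like the sketch below; image, names and port are placeholders (Service/Ingress omitted for brevity):

```yaml
# Hypothetical minimal Deployment: about all it takes to run a
# containerised service on Kubernetes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 3                 # doubling it is one command:
                              # kubectl scale deploy my-service --replicas=6
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: registry.example.com/my-service:1.0.0
          ports:
            - containerPort: 8080
```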
All external packages are managed via helm
Kustomize is used for generating manifests for our applications
Flux watches the kustomize repo and applies the changes (GitOps !)
Autoscaling based on key metrics using keda
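The GitOps wiring can be sketched with Flux v2's CRDs (assuming v2; the repo URL, paths and names are invented): a GitRepository pointing at the kustomize repo, and a Kustomization that applies whatever lands on the branch.

```yaml
# Hypothetical Flux v2 setup: Flux watches the manifest repo and applies
# changes automatically. URL, branch and paths are illustrative.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: app-manifests
  namespace: flux-system
spec:
  interval: 1m
  url: https://example.com/org/k8s-manifests
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: app-manifests
  path: ./production
  prune: true            # delete resources removed from git
```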
If only you could perceive metrics as I do
Grafana + Prometheus = <3
DataDog is our APM of choice and where we browse logs
We have Grafana dashboards for everything you can imagine
Prometheus scrapes all the metrics and stores them
We use Prometheus exporters for our Rails app, PgBouncer, Elasticsearch and Kubernetes nodes
Prometheus has become the de facto standard for observability now
PS: All our dashboards and scrape configs are committed into git
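A scrape config fragment for a stack like this might look as follows; job names, exporter hostnames and ports are illustrative, not ours:

```yaml
# Hypothetical prometheus.yml fragment, versioned in git alongside the
# dashboards. Targets and ports are placeholders.
scrape_configs:
  - job_name: rails-app
    static_configs:
      - targets: ["rails-exporter:9394"]       # Rails metrics exporter
  - job_name: pgbouncer
    static_configs:
      - targets: ["pgbouncer-exporter:9127"]
  - job_name: elasticsearch
    static_configs:
      - targets: ["elasticsearch-exporter:9114"]
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node                             # discover node targets
```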
Raise the alarms, the bugs are coming
We hate it when our customers experience any problem
Our goal is that whenever there is a problem, we are the first to know
We have defined all our alert thresholds (error rate, high latency, restarts) in Grafana
Can a man still be brave if he's afraid?
That is the only time a man can be brave
You should know the limits of your application
All flows of your application should be load tested
We use k6 for our load tests
It uses JavaScript, so we can add custom logic, and it integrates with Postman, which is where we document all our APIs
Take decisions keeping the next 6 months in mind, and factor in how difficult it is to change them
Tracking tech-debt is very important
Push back to business when you need to, but make sure your end-goals are aligned