Making Distributed systems sane again
Welcome
Setting up a context
Signoi
In other words
Machine learning
To be more precise...
Evolution of Infrastructure
CLIENT
ELB/ALB
COMPUTE
STORAGE
PAST....
NOW
NOW....
Traffic through Internet gateway
Into application load balancer with SSL termination
Into bastion host within public subnet
Into kubernetes worker nodes in private subnets
Public and private subnets wrapped within availability zone.
Distributed across 3 availability zones for redundancy
Wrapped around by autoscaling group
Drived by Terraform and Kubectl via Travis CI
Let's wrap it by stages
Let's wrap it by stages
And so on ....
DEV
QA
PROD
Let's wrap it by region
And so on ....
US
EUROPE
- HA Mongo Cluster
And deploy
- HA Rabbitmq cluster
- HA Redis cluster
- Percona cluster
- ETCD
- Kubernetes master controller
- Pilosa cluster (Bitmap indexing)
- Fluentd cluster / Log aggregator daemon
- HA Reverse proxy / Gateway
- Exporter agent (Metrics/Logs/Traces)
- Ingress Controller
- External-dns
And services...
- User service
- Subscription service
- Uploader service
- Events service
- Notify service
......
Services
- Accounts service
- Interpretation processor
- Dataset service
- Tokenzier, Lemmatizer service
- Web hook service
How do we design systems that is
- Deterministic
- Fault tolerant
- Highly available
- Resilient
- Observable
5 Quest for Distributed Systems at scale
Thrive for stateless
1
A service with no side effects.
A service which does not rely on cache to serve request.
A service which does not rely on another service to serve request.
Service A requires that service B to exists and function.
Example:
- Your service talks to database.
- Your service does cache invalidation.
- Your service deployment requires defined sequence of structure.
Litmus test
Key Idea
Delicate intents by events.
Your service intent is to make database call for persistence, emit an event.
Your service intent is to make email, emit an event.
Immutable infrastructure
2
Provision resource creation as well as deletion.
Allow your infrastructure to be function of time.
Allow your infrastructure to be reproducable.
Litmus test
Service/resource deployment requires DEVops.
Service deployment requires reaching to aws console/cli.
Dependency management is a manual process. Eg: You cannot delete vpc because ec2 is attached to it.
Key Idea
Resource creation and teardown should be your religion.
- Tools like terraform enforces immutability and provides dependency management
Benefits
You know how to fail and come back up and not the other way around.
You remove human's error prone brain function during dependency management.
It is the best documentation of your infrastructure that you will ever have.
Reverts are super easy. Time travel.
In Signoi
We spin up entirely new set of infrastructure in 10 minutes.
Kubernetes
Cloudfront
Ec2 worker nodes
ECR
Security groups, NACL,
ALB
VPC,
3 AZs,
Private and public subnets,
Bastion hosts,
and 35 Load Balancers and 60+ services.
HA mongo cluster
HA redis cluster
HA Pilosa cluster
Exporters
Agents
Fluentd
.....
In Signoi
We spin up dev cluster during weekdays and destroy during weekends.
Observable at glance
3
A service with entry and exit path.
A distributed system with shortest time to root cause analysis.
A distributed systems with trace continuity across process boundaries.
Litmus test
Service does have context of outbound and inbound request.
Service does not propagate context across process boundaries.
Your trace is limited to function call stack within the process.
How to achieve observability?
Instrument and profile every execution path
- Request Response lifecycle
- Http request
- Event from rabbitmq
- Event to rabbitmq
- Database transaction
- ....
Use instrumented http client with custom transport layer.
Use monitor api for database to trace transactions.
Use transport headers for messaging systems to trace events.
Benefits
Favor operator abstraction in persistence
4
Database deployment with automatic backup and recovery.
Unified API to manage all the above irrespective of your database stack.
Automatic promotion and demotion of master/slave db cluster.
Litmus test
You have to hire DBAs.
Your backups and restores are manual.
You require manual work during master/slave replication failure.
Your data storage classes are not well defined. For eg: /data, /journal, /logs
How to achieve?
- Wrap your persistence layer with operators. Eg: Percona.
- Provision backup/recovery and co-ordinator agent along side your database of choice.
- Prefer sentinel client over default database client.
- Provision sentinel cluster for automatic master/slave promotion.
- Wrap the underlying details with configuration. Eg: Kubernetes operator, Custom Resource defintions
Understand hidden cost of Open Source
5
Litmus test
README says "Blazingly fast reverse proxy"
README says "Compiles in < 1 sec"
Official website says "Community Edition"
How to achieve?
- Validate if piece of software follows industry standards for instrumentation and profilings.Eg: Open telemetry, Open census, Open tracing.
- Validate if it comes with sentinel and replication controller built in (DBS)
- Validate if they provide operators for backups and recovery.
- Validate if they have quick release cycles.
In Signoi
- $50,000 per year for mongo operator.
- $25,000 for opencensus exporter for load balancer.
- Incomplete set of terraform modules for kubernetes provisioning of External DNS, Ingress Controller, Storage Class ...
Lastly,
- Design for request propagation and cancelation across process boundaries.
- Abstract storage provisioning with operators and sentinel clusters.
- Allow systems to fail as soon as possible.
- Profile everything to better understand the nature of your systems.
- Use circuit breaker for dependent services.
Thank you
Making distributed systems sane again (l1)
By robus
Making distributed systems sane again (l1)
- 49