Dev To Prod
Systematic-ish approach + Do's and Don'ts
Alex Songe <a@songe.me>
- Kind of software: Web Apps, Backend APIs, Cloud-shaped things
- For all kinds of devs, maybe beginners in devops/ops
- What is covered: theory-ish, planning, and practice
DISCLAIMER
This is stuff I've learned from my experience. Hot takes amenable to reason/counterargument
- What we want from our software
- We often have to choose between competing goods or worse, competing evils
- Sorting what things are most important helps us prioritize more fine-grained choices later on
- The first part where idealism and architecture astronaut syndrome kick in
- Correctness - That the system obeys expectations/documentation
- Security
- Privacy
- Reliability
- Performance
- Cost
- Understandability
- Recoverability - Disaster Recovery
- Resilience - Failover
- We make software systems
- "Interesting" cloud infra is systems of systems (of systems)
- Systems are boxes we draw around complex ideas to simplify them, assign properties, and think about blame/responsibility
Included but not limited to:
- Idempotence - operations that produce the same end state whether applied once or many times
- Immutability - data that cannot change
- Statefulness - has data
- Statelessness - has no data
- Crash-safety - acknowledged data writes are not lost
- Blocking/Async semantics
- Backpressure vs Queues
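Idempotence in particular comes up constantly in deploy scripts. A tiny sketch (paths are illustrative):

```shell
# Idempotent: running these 1..n times yields the same end state
mkdir -p /tmp/demo-app/logs               # no error if the directory exists
ln -sfn /tmp/demo-app /tmp/demo-current   # -f replaces an existing symlink

# NOT idempotent: each run changes the result
echo "worker=4" >> /tmp/demo-app/app.conf # appends a duplicate line per run
```

Re-runnable deploy steps should look like the first two lines, never the last.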
- Many systems are subject to distributed systems logic
- CAP Theorem applies in weird places
- OS Scheduling and GC pauses can cause partitions, no physical network required
- Weird emergent behavior is more frequent here
- We can kinda be systematic, but not formal in any way
- All about managing responsibility/blame
We like to externalize or isolate things
- Stripe is good because it helps avoid PCI-DSS audits
- RDS is good because state (data we want to keep) requires a lot of ops work (backups, recovery, etc)
- This starts with the choices we make in our apps
- State is like friction in a car, we want all of it possible in a few important places, and none of it in others
- Be careful about ceding too much control, you might be helpless or be subject to rent extraction if it is hard to leave
We all have big dreams but:
- How big is 100% of the target audience/market?
- How big is a reasonable target marketshare of the above?
- How big is the largest tenant in a multi-tenant system?
- How big is the biggest single unit of work possible/likely? Can it fit on a single machine easily?
- Will users scale into the money needed to grow a team and acquire expertise? Can you scale until then?
- Don't borrow trouble!
- Involve business side in rough cost projections
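A back-of-envelope sizing pass helps answer these questions; every number below is hypothetical:

```shell
# Hypothetical: 50k users, 5% peak concurrency, 20 requests/min per active user
users=50000
peak_active=$((users * 5 / 100))        # 2500 concurrently active users
req_per_sec=$((peak_active * 20 / 60))  # ~833 requests/sec at peak
echo "peak ~${req_per_sec} req/s"       # a single-machine load for many apps
```

If the answer comes out in the hundreds of req/s, that's a strong argument against borrowing trouble.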
Where the rubber meets the road
- Gain confidence in debugging your chosen architecture
- Limit complexity
- Shape practices in some way around your team or organization's structure (Conway's Law happens anyway)
- Add organizational/logical divisions in case roles need to be split
- Consider team size and skills when choosing big solutions like Kubernetes or Terraform
Vercel/Heroku/Netlify/etc
- They handle scaling for you
- They handle deploying for you
- They help out with live staging/testing deployments
- You pay a lot more for app layer compute
- They are probably cheaper than the time to invest into learning complex solutions
- I would lean more on platforms that support Docker images as an escape hatch
AWS Serverless
- They handle scaling for you a lot more
- Setup is on you, and automating that setup is a power-user or expert experience
- Fargate is very cost-effective, a bit harder to set up scaling
- AWS Lambda is pricier, but a bit easier to set up scaling
- Note that both of these work on top of Docker images, so moving between platforms that support Docker is easier, but not frictionless
For most deployments, knowledge of linux is essential
- Basis for Docker images
- Basis for CI/CD shell commands
- Knowing when bash vs sh is available matters for lightweight/faster Docker images (minimal base images often ship only sh)
- You need it anyway if you just deployed straight on Compute or on hardware directly
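A quick illustration of the bash-vs-sh distinction, which bites in slim Docker images and CI scripts:

```shell
# [[ ]] pattern matching is a bashism; it is not guaranteed in POSIX sh
bash -c '[[ "release-1.2" == release-* ]] && echo match'

# The portable equivalent works in any POSIX sh (dash, busybox ash, ...)
sh -c 'case "release-1.2" in release-*) echo match ;; esac'
```

Writing the POSIX form by default keeps scripts portable between your laptop, CI runners, and minimal images.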
- AWS RDS - Good way of managing complexity of backups, snapshots, restoration, clustering, fail-over, upgrades, but takes reading the docs
- Look up the top DB providers for your favorite DB
- Turso: SQLite databases that you store in the cloud, good for smaller apps that want high levels of isolation in multi-tenant systems
- Some things I've found good success with
- Some anti-patterns I've seen or maybe done
- Use a load balancer, even if it's just nginx rather than a managed service
- Use the proxy or reverse proxy capabilities to facilitate infrastructure migration, including a proxy/dns 2-step for 100% uptime migrations
- Use HTTP/2 or newer
- Use geographic DNS to point users at a sufficiently close load balancer
- Use a small number of IP resolutions per DNS entry
- Be able to switch between long-polling and websockets
- DO NOT use a load balancer across "regions" or over systems with different latencies
- DO NOT use several layers of load balancers
- DO NOT use DNS to round robin servers in different regions/locations, consider a hot standby failover rather than load-balancing to a backup hundreds of miles away
- DO use quotas and reject excess requests in high scale environments
- DO use queues for some kinds of work, understanding that a queue can make errors harder to link to requests, involve distributed tracing to debug
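A minimal sketch of the single-load-balancer setup above, assuming nginx; upstream names and addresses are illustrative:

```nginx
# Two app servers behind one nginx load balancer, HTTP/2 at the edge
upstream app_servers {
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
}

server {
    listen 443 ssl http2;
    server_name www.example.com;
    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Because the upstream block is plain config, swapping backends during a migration is a reload, not a redeploy.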
- DO write the general case for error recovery first
- DO exercise error recovery procedures often, for instance, restoring production data into a limited staging environment to gain confidence
- DO pay attention to privacy issues when handling production data this way
- DO write special-case logic to recover earlier when the general case is expensive
- DO write down your process for data retention (CYA for legal especially)
- DO write systems with boundaries clean enough that they can be separated later
- DO consider the operational overhead of microservices, including
- Versioning
- Declaring version dependencies between microservices
- Implementing distributed tracing
- How much busy-work this might entail for a small team
- DO consider work quotas and rejecting excess requests
- DO NOT default to microservices
- DO NOT deploy microservices without some way to declare versions for them
- DO NOT deploy microservices without support for developers to debug requests as they propagate through the system
- DO NOT use async without some kind of rate limiting system
- DON'T handle every error individually
- DON'T rely on catch/finally to recover state
- DON'T go months or years without testing recovery or failover, it's better to risk an outage when you're awake and lucid and prepared than at 3am
- DO introduce CI/CD early
- DO consider using Docker as a lingua franca for an individually deployable piece of infrastructure/code
- DO run unit tests
- DO run dependency audits for security vulnerabilities
- DO rotate responsibility for the CI/CD through the team to avoid single points of failure in terms of people
- DON'T delay automation, it's harder when the system is more complex
- DON'T vary production and development environments except through scale
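A minimal sketch of early CI, assuming GitHub Actions and a Node project; adapt the commands to your stack:

```yaml
# .github/workflows/ci.yml (hypothetical)
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test                        # DO run unit tests
      - run: npm audit --audit-level=high    # DO run dependency audits
```

Ten lines now beats a week of retrofitting automation onto a complex system later.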
- DO use docker-compose for development
- DO think about multiple docker-compose files
- DO run filesystem-intensive operations outside of Docker
- DO make use of host.docker.internal to develop on your local host machine and run nginx or other support off the shelf software in Docker
- DO use docker-compose as a reference for the basic service dependencies
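A development docker-compose sketch along these lines; service names and images are illustrative:

```yaml
# docker-compose.yml for development: off-the-shelf services in Docker,
# your app running on the host, reachable via host.docker.internal
services:
  nginx:
    image: nginx:alpine
    ports: ["8080:80"]
    extra_hosts:
      - "host.docker.internal:host-gateway"  # needed on Linux; built in on Mac/Windows
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: devonly             # dev-only credential, never production
    ports: ["5432:5432"]
```

The app itself stays outside Docker, so filesystem-heavy work (builds, test watchers) runs at native speed.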
- DO use a Dockerfile that's more for deployment
- DO use multistage to keep build tools and info out of prod environment
- DO use environment variables for runtime configuration
- DO use build arguments for docker build time configuration
- DO use RUN --mount and COPY --from to make builds safer and faster, particularly for secrets that must not be shipped with the final image
- DON'T use a Dockerfile and rebuild in the normal development workflow
- DON'T deploy a ton of docker containers by hand
- DON'T publish secrets
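A multistage Dockerfile sketch covering the points above, assuming a Node app; names and paths are illustrative:

```dockerfile
# Build stage: toolchain and secrets never reach the final image
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
# The secret is mounted only for this RUN; it is not stored in any layer
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci
COPY . .
RUN npm run build

# Runtime stage: slim image, runtime config via environment variables
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
ENV PORT=3000
CMD ["node", "dist/server.js"]
```

Build arguments (ARG) belong in the build stage; anything the app reads at startup belongs in ENV or the orchestrator's environment config.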
- DO save files in S3, every cloud provider has an S3 compatible API
- DO consider pre-signed direct uploads to S3
- DO keep track of upload state separately
- DO consider keeping the S3 bucket private-only and have the application sign GET requests to get the data back
- DON'T save files to an app server's local disk, it turns a server from part of a herd of cattle into a special pet you have to name and care for
- DON'T save files locally also because user uploads sitting on the app server are an extra security hazard
- It matters where you put things
- Flatter is better
- Networks can do weird things
- CDN everything for TCP slowstart reasons
TCP Slowstart:
- 10 packets are sent at first
- # of packets in flight are increased/decreased as bandwidth allows
- You can't send the 11th packet until the first packet is acknowledged with permission to send more
- Smaller round trip (closer to the server), the faster you can ramp up bandwidth
- Packet loss really hurts this flow control
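A toy model of why round-trip time dominates page delivery; real slowstart also backs off on loss and caps the window, so treat this as illustration only:

```shell
# How many RTTs to deliver a 500 KB response, starting at 10 packets
# and doubling the window each round trip (~1500 bytes per packet)?
size_bytes=512000
packets=$(( (size_bytes + 1499) / 1500 ))   # ceiling division: 342 packets
cwnd=10 rtts=0 sent=0
while [ "$sent" -lt "$packets" ]; do
  sent=$((sent + cwnd))
  cwnd=$((cwnd * 2))
  rtts=$((rtts + 1))
done
echo "${packets} packets in ~${rtts} RTTs"  # every ms of RTT a CDN shaves off is paid ~6 times here
```

At 100 ms RTT that's over half a second spent just ramping up; at 10 ms from a nearby CDN edge it's a rounding error.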
- Browsers require CORS when you cross domains, including subdomains
- When you make a request to backend.example.com from www.example.com, the browser often sends a CORS preflight (OPTIONS) before the real request: 2 requests and AT LEAST 2 round trips
- Doubles the slowstart penalty
- Just stick a /backend route on the CDN or on a load balancer to hit the backend to defeat the penalty
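The same-origin trick above can be sketched in nginx; hostnames are illustrative:

```nginx
# Serve the API under www.example.com/backend/ so the browser never
# crosses domains and never needs a CORS preflight
location /backend/ {
    proxy_pass http://backend.internal:3000/;  # trailing slash strips the /backend prefix
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```

The same location block works equally well on a CDN's origin configuration or on the load balancer.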
- Fargate Tasks are less like `docker run` and more like docker compose
- Is complete enough to migrate over time to separate services with their own load balancers for discovery, etc
- Like docker-compose.yml, contains a good description of what services connect to what else
- Failures are grouped together in a single highly-interconnected task, so the whole thing can be unhealthy at once, whereas separate services can fail in combinatorially many partial states
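An abridged Fargate task definition showing that compose-like grouping; image names are illustrative and required fields like networking and roles are omitted:

```json
{
  "family": "web-app",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    { "name": "app",   "image": "myorg/app:1.4",
      "portMappings": [{ "containerPort": 3000 }] },
    { "name": "nginx", "image": "myorg/nginx-edge:1.4",
      "portMappings": [{ "containerPort": 443 }] }
  ]
}
```

Like a docker-compose.yml, the task definition doubles as documentation of which containers ship and talk together.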
Alex Songe <a@songe.me>