Dev To Prod
Systematic-ish approach + Do's and Don'ts
Alex Songe <a@songe.me>
- Kind of software: Web Apps, Backend APIs, Cloud-shaped things
- For all kinds of devs, maybe beginners in devops/ops
- What is covered: theory-ish, planning, and practice
DISCLAIMER
This is stuff I've learned from my experience. Hot takes amenable to reason/counterargument
- What we want from our software
- We often have to choose between competing goods or worse, competing evils
- Sorting what things are most important helps us prioritize more fine-grained choices later on
- The first part where idealism and architecture astronaut syndrome kick in
- Correctness - That the system obeys expectations/documentation
- Security
- Privacy
- Reliability
- Performance
- Cost
- Understandability
- Recoverability - Disaster Recovery
- Resilience - Failover
- We make software systems
- "Interesting" cloud infra is systems of systems (of systems)
- Systems are boxes we draw around complex ideas to simplify them, assign properties, and think about blame/responsibility
Included but not limited to:
- Idempotence - operations that produce the same end state whether applied once or many times
- Immutability - data that cannot change
- Statefulness - has data
- Statelessness - has no data
- Crash-safety - acknowledged data writes are not lost
- Blocking/Async semantics
- Backpressure vs Queues
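Idempotence in particular comes up constantly in deploy scripts. A tiny sketch (paths are illustrative):

```shell
# Idempotent: running these 1..n times yields the same end state
mkdir -p /tmp/demo-app/logs               # no error if the directory exists
ln -sfn /tmp/demo-app /tmp/demo-current   # -f replaces an existing symlink

# NOT idempotent: each run changes the result
echo "worker=4" >> /tmp/demo-app/app.conf # appends a duplicate line per run
```

Re-runnable deploy steps should look like the first two lines, never the last.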
- Many systems are subject to distributed systems logic
- CAP Theorem applies in weird places
- OS Scheduling and GC pauses can cause partitions, no physical network required
- Weird emergent behavior is more frequent here
- We can kinda be systematic, but not formal in any way
- All about managing responsibility/blame
We like to externalize or isolate things
- Stripe is good because it helps avoid PCI-DSS audits
- RDS is good because state (data we want to keep) requires a lot of ops work (backups, recovery, etc)
- This starts with the choices we make in our apps
- State is like friction in a car, we want all of it possible in a few important places, and none of it in others
- Be careful about ceding too much control, you might be helpless or be subject to rent extraction if it is hard to leave
We all have big dreams but:
- How big is 100% of the target audience/market?
- How big is a reasonable target marketshare of the above?
- How big is the largest tenant in a multi-tenant system?
- How big is the biggest single unit of work possible/likely? Can it fit on a single machine easily?
- Will users scale into the money needed to grow a team and acquire expertise? Can you scale until then?
- Don't borrow trouble!
- Involve business side in rough cost projections
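A back-of-envelope sizing pass helps answer these questions; every number below is hypothetical:

```shell
# Hypothetical: 50k users, 5% peak concurrency, 20 requests/min per active user
users=50000
peak_active=$((users * 5 / 100))        # 2500 concurrently active users
req_per_sec=$((peak_active * 20 / 60))  # ~833 requests/sec at peak
echo "peak ~${req_per_sec} req/s"       # a single-machine load for many apps
```

If the answer comes out in the hundreds of req/s, that's a strong argument against borrowing trouble.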
Where the rubber meets the road
- Gain confidence in debugging your chosen architecture
- Limit complexity
- Shape practices in some way around your team or organization's structure (Conway's Law happens anyway)
- Add organizational/logical divisions in case roles need to be split
- Consider team size and skills when choosing big solutions like Kubernetes or Terraform
Vercel/Heroku/Netlify/etc
- They handle scaling for you
- They handle deploying for you
- They help out with live staging/testing deployments
- You pay a lot more for app layer compute
- They are probably cheaper than the time to invest into learning complex solutions
- I would lean more on platforms that support Docker images as an escape hatch
AWS Serverless
- They handle scaling for you a lot more
- Setup is on you, and automating that setup is a power-user or expert experience
- Fargate is very cost-effective, a bit harder to set up scaling
- AWS Lambda is pricier, but a bit easier to set up scaling
- Note that both of these work on top of Docker images, so moving between platforms that support Docker is easier, but not frictionless
For most deployments, knowledge of linux is essential
- Basis for Docker images
- Basis for CI/CD shell commands
- Knowing when bash vs sh is available matters for lightweight/faster Docker images (minimal base images often ship only sh)
- You need it anyway if you just deployed straight on Compute or on hardware directly
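A quick illustration of the bash-vs-sh distinction, which bites in slim Docker images and CI scripts:

```shell
# [[ ]] pattern matching is a bashism; it is not guaranteed in POSIX sh
bash -c '[[ "release-1.2" == release-* ]] && echo match'

# The portable equivalent works in any POSIX sh (dash, busybox ash, ...)
sh -c 'case "release-1.2" in release-*) echo match ;; esac'
```

Writing the POSIX form by default keeps scripts portable between your laptop, CI runners, and minimal images.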
- AWS RDS - Good way of managing complexity of backups, snapshots, restoration, clustering, fail-over, upgrades, but takes reading the docs
- Look up the top DB providers for your favorite DB
- Turso: SQLite databases that you store in the cloud, good for smaller apps that want high levels of isolation in multi-tenant systems
- Some things I've found good success with
- Some anti-patterns I've seen or maybe done
- Use a load balancer, even if it's just nginx rather than a managed service
- Use the proxy or reverse proxy capabilities to facilitate infrastructure migration, including a proxy/dns 2-step for 100% uptime migrations
- Use HTTP/2 or newer
- Use geographic DNS to point users at a sufficiently close load balancer
- Use a small number of IP resolutions per DNS entry
- Be able to switch between long-polling and websockets
- DO NOT use a load balancer across "regions" or over systems with different latencies
- DO NOT use several layers of load balancers
- DO NOT use DNS to round robin servers in different regions/locations, consider a hot standby failover rather than load-balancing to a backup hundreds of miles away
- DO use quotas and reject excess requests in high scale environments
- DO use queues for some kinds of work, understanding that a queue can make errors harder to link to requests, involve distributed tracing to debug
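A minimal sketch of the single-load-balancer setup above, assuming nginx; upstream names and addresses are illustrative:

```nginx
# Two app servers behind one nginx load balancer, HTTP/2 at the edge
upstream app_servers {
    server 10.0.1.10:3000;
    server 10.0.1.11:3000;
}

server {
    listen 443 ssl http2;
    server_name www.example.com;
    # ssl_certificate / ssl_certificate_key omitted for brevity

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Because the upstream block is plain config, swapping backends during a migration is a reload, not a redeploy.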
- DO write the general case for error recovery first
- DO exercise error recovery procedures often, for instance, restoring production data into a limited staging environment to gain confidence
- DO pay attention to privacy issues when handling production data this way
- DO write special-case logic to recover earlier when the general case is expensive
- DO write down your process for data retention (CYA for legal especially)
- DO write systems with boundaries clean enough that they can be separated later
- DO consider the operational overhead of microservices, including
- Versioning
- Declaring version dependencies between microservices
- Implementing distributed tracing
- How much busy-work this might entail for a small team
- DO consider work quotas and rejecting excess requests
- DO NOT default to microservices
- DO NOT deploy microservices without some way to declare versions for them
- DO NOT deploy microservices without support for developers to debug requests as they propagate through the system
- DO NOT use async without some kind of rate limiting system
- DON'T handle every error individually
- DON'T rely on catch/finally to recover state
- DON'T go months or years without testing recovery or failover, it's better to risk an outage when you're awake and lucid and prepared than at 3am
- DO introduce CI/CD early
- DO consider using Docker as a lingua franca for an individually deployable piece of infrastructure/code
- DO run unit tests
- DO run dependency audits for security vulnerabilities
- DO rotate responsibility for the CI/CD through the team to avoid single points of failure in terms of people
- DON'T delay automation, it's harder when the system is more complex
- DON'T vary production and development environments except through scale
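A minimal sketch of early CI, assuming GitHub Actions and a Node project; adapt the commands to your stack:

```yaml
# .github/workflows/ci.yml (hypothetical)
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test                        # DO run unit tests
      - run: npm audit --audit-level=high    # DO run dependency audits
```

Ten lines now beats a week of retrofitting automation onto a complex system later.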
- DO use docker-compose for development
- DO think about multiple docker-compose files
- DO run filesystem-intensive operations outside of Docker
- DO make use of host.docker.internal to develop on your local host machine and run nginx or other support off the shelf software in Docker
- DO use docker-compose as a reference for the basic service dependencies
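A development docker-compose sketch along these lines; service names and images are illustrative:

```yaml
# docker-compose.yml for development: off-the-shelf services in Docker,
# your app running on the host, reachable via host.docker.internal
services:
  nginx:
    image: nginx:alpine
    ports: ["8080:80"]
    extra_hosts:
      - "host.docker.internal:host-gateway"  # needed on Linux; built in on Mac/Windows
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: devonly             # dev-only credential, never production
    ports: ["5432:5432"]
```

The app itself stays outside Docker, so filesystem-heavy work (builds, test watchers) runs at native speed.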
- DO use a Dockerfile that's more for deployment
- DO use multistage to keep build tools and info out of prod environment
- DO use environment variables for runtime configuration
- DO use build arguments for docker build time configuration
- DO use RUN --mount and COPY --from to make builds safer and faster, particularly for secrets that must not be shipped with the final image
- DON'T use a Dockerfile and rebuild in the normal development workflow
- DON'T deploy a ton of docker containers by hand
- DON'T publish secrets
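A multistage Dockerfile sketch covering the points above, assuming a Node app; names and paths are illustrative:

```dockerfile
# Build stage: toolchain and secrets never reach the final image
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
# The secret is mounted only for this RUN; it is not stored in any layer
RUN --mount=type=secret,id=npm_token \
    NPM_TOKEN="$(cat /run/secrets/npm_token)" npm ci
COPY . .
RUN npm run build

# Runtime stage: slim image, runtime config via environment variables
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
ENV PORT=3000
CMD ["node", "dist/server.js"]
```

Build arguments (ARG) belong in the build stage; anything the app reads at startup belongs in ENV or the orchestrator's environment config.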
- DO save files in S3, every cloud provider has an S3 compatible API
- DO consider pre-signed direct uploads to S3
- DO keep track of upload state separately
- DO consider keeping the S3 bucket private-only and have the application sign GET requests to get the data back
- DON'T save files to an app server's local disk, it turns a server from part of a herd of cattle into a special pet you have to name and care for
- DON'T save files locally also because user uploads sitting on the app server are an extra security hazard
- It matters where you put things
- Flatter is better
- Networks can do weird things
- CDN everything for TCP slowstart reasons
TCP Slowstart:
- 10 packets are sent at first
- # of packets in flight are increased/decreased as bandwidth allows
- You can't send the 11th packet until the first packet is acknowledged with permission to send more
- Smaller round trip (closer to the server), the faster you can ramp up bandwidth
- Packet loss really hurts this flow control
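A toy model of why round-trip time dominates page delivery; real slowstart also backs off on loss and caps the window, so treat this as illustration only:

```shell
# How many RTTs to deliver a 500 KB response, starting at 10 packets
# and doubling the window each round trip (~1500 bytes per packet)?
size_bytes=512000
packets=$(( (size_bytes + 1499) / 1500 ))   # ceiling division: 342 packets
cwnd=10 rtts=0 sent=0
while [ "$sent" -lt "$packets" ]; do
  sent=$((sent + cwnd))
  cwnd=$((cwnd * 2))
  rtts=$((rtts + 1))
done
echo "${packets} packets in ~${rtts} RTTs"  # every ms of RTT a CDN shaves off is paid ~6 times here
```

At 100 ms RTT that's over half a second spent just ramping up; at 10 ms from a nearby CDN edge it's a rounding error.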
- Browsers require CORS when you cross domains, including subdomains
- When you make a request to backend.example.com from www.example.com, the browser often sends a CORS preflight (OPTIONS) before the real request: 2 requests and AT LEAST 2 round trips
- Doubles the slowstart penalty
- Just stick a /backend route on the CDN or on a load balancer to hit the backend to defeat the penalty
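The same-origin trick above can be sketched in nginx; hostnames are illustrative:

```nginx
# Serve the API under www.example.com/backend/ so the browser never
# crosses domains and never needs a CORS preflight
location /backend/ {
    proxy_pass http://backend.internal:3000/;  # trailing slash strips the /backend prefix
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```

The same location block works equally well on a CDN's origin configuration or on the load balancer.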
- Fargate Tasks are less like `docker run` and more like docker compose
- Is complete enough to migrate over time to separate services with their own load balancers for discovery, etc
- Like docker-compose.yml, contains a good description of what services connect to what else
- Failures are grouped together in a single highly-interconnected task, so the whole thing can be unhealthy at once, whereas separate services can fail in combinatorially many partial states
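An abridged Fargate task definition showing that compose-like grouping; image names are illustrative and required fields like networking and roles are omitted:

```json
{
  "family": "web-app",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    { "name": "app",   "image": "myorg/app:1.4",
      "portMappings": [{ "containerPort": 3000 }] },
    { "name": "nginx", "image": "myorg/nginx-edge:1.4",
      "portMappings": [{ "containerPort": 443 }] }
  ]
}
```

Like a docker-compose.yml, the task definition doubles as documentation of which containers ship and talk together.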
Alex Songe <a@songe.me>