Software Architecture
Dev To Prod
Systematic-ish approach + Do's and Don'ts
Alex Songe <a@songe.me>
scope
- Kind of software: Web Apps, Backend APIs, Cloud-shaped things
- For all kinds of devs, maybe beginners devops/ops
- What is covered: theory-ish, planning, and practice
DISCLAIMER
This is stuff I've learned from my experience. Hot takes amenable to reason/counterargument
Virtues
- Also called values
- What we want out of our software
- Helps us pick between tradeoffs
VIRTUES
- What we want from our software
- We often have to choose between competing goods or worse, competing evils
- Sorting what things are most important helps us prioritize more fine-grained choices later on
- The first part where idealism and architecture astronaut syndrome kick in
VIRTUES - Cont'd
- Correctness - That the system obeys expectations/documentation
- Security
- Privacy
- Reliability
- Performance
- Cost
- Understandability
- Recoverability - Disaster Recovery
- Resilience - Failover
Theory
- We make software systems
- "Interesting" cloud infra is systems of systems (of systems)
- Systems are boxes we draw around complex ideas to simplify them, assign properties, and think about blame/responsibility
Theory: Properties
Included but not limited to:
- Idempotence - operations that produce the same result whether applied once or many times
- Immutability - data that cannot change
- Statefulness - has data
- Statelessness - has no data
- Crash-safety - acknowledged data writes are not lost
- Blocking/Async semantics
- Backpressure vs Queues
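To make the idempotence property concrete, here is a minimal sketch (all names hypothetical) contrasting an idempotent "set" with a non-idempotent "increment" - the difference matters the moment a network retry can deliver the same message twice:

```python
# Hypothetical sketch: an idempotent "set" vs a non-idempotent "increment".
balances = {}

def set_balance(account, amount):
    # Idempotent: applying it 1..n times leaves the same state.
    balances[account] = amount

def add_to_balance(account, amount):
    # NOT idempotent: every retry changes the result.
    balances[account] = balances.get(account, 0) + amount

# A retried message is harmless for the idempotent operation...
for _ in range(3):
    set_balance("alice", 100)
assert balances["alice"] == 100

# ...but triples the effect for the non-idempotent one.
for _ in range(3):
    add_to_balance("bob", 100)
assert balances["bob"] == 300
```

This is why queue consumers and API endpoints that can be retried are usually designed to be idempotent, or given deduplication keys.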
Theory: DISTRIBUTED SYS
- Many systems are subject to distributed systems logic
- CAP Theorem applies in weird places
- OS Scheduling and GC pauses can cause partitions, no physical network required
- Weird emergent behavior is more frequent here
Planning
- We can kinda be systematic, but not formal in any way
- All about managing responsibility/blame
Planning: Blame
We like to externalize or isolate things
- Stripe is good because it helps avoid PCI-DSS audits
- RDS is good because state (data we want to keep) requires a lot of ops work (backups, recovery, etc)
- Starts in making choices with our apps
- State is like friction in a car, we want all of it possible in a few important places, and none of it in others
- Be careful about ceding too much control: you can end up helpless, or subject to rent extraction if leaving is hard
Planning: Scale
We all have big dreams but:
- How big is 100% of the target audience/market?
- How big is a reasonable target marketshare of the above?
- How big is the largest tenant in a multi-tenant system?
- How big is the biggest single unit of work possible/likely? Can it fit on a single machine easily?
- Will users scale into the money needed to grow a team and acquire expertise? Can you scale until then?
- Don't borrow trouble!
- Involve business side in rough cost projections
Practice
Where the rubber meets the road
- Gain confidence in debugging your chosen architecture
- Limit complexity
- Shape practices in some way around your team or organization's structure (Conway's Law happens anyway)
- Add organizational/logical divisions in case roles need to be split
- Consider team size and skills when choosing big solutions like Kubernetes or Terraform
Practice: outsource
Vercel/Heroku/Netlify/etc
- They handle scaling for you
- They handle deploying for you
- They help out with live staging/testing deployments
- You pay a lot more for app layer compute
- They are probably cheaper than the time to invest into learning complex solutions
- I would lean more on platforms that support Docker images as an escape hatch
Practice: Serverless
AWS Serverless
- They handle scaling for you a lot more
- Setup is up to you, and automating that setup is a power-user or expert experience
- Fargate is very cost-effective, a bit harder to set up scaling
- AWS Lambda is pricier, but a bit easier to set up scaling
- Note that both of these work on top of Docker images, so moving between platforms that support Docker is easier, but not frictionless
Practice: LINUX/Shell
For most deployments, knowledge of Linux is essential
- Basis for Docker images
- Basis for CI/CD shell commands
- Knowing when bash vs sh is used is important for lightweight/faster uses of Docker
- You need it anyway if you just deployed straight on Compute or on hardware directly
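The bash-vs-sh point bites hardest in minimal base images. A hypothetical Dockerfile sketch (image names and scripts are illustrative): Alpine ships only BusyBox `/bin/sh`, and `RUN` defaults to `/bin/sh -c`, so bashisms fail unless you install bash and pay the size cost:

```dockerfile
# Hypothetical example: Alpine-based images ship BusyBox /bin/sh, not bash.
FROM alpine:3.19

# RUN uses /bin/sh -c by default; bashisms like [[ ]] or arrays fail here.
RUN if [ -f /etc/alpine-release ]; then echo "plain sh is fine"; fi

# If a script genuinely needs bash, install it explicitly (and accept the extra size):
RUN apk add --no-cache bash
SHELL ["/bin/bash", "-c"]
RUN [[ -f /etc/alpine-release ]] && echo "now bashisms work"
```

Writing CI and Dockerfile scripts in portable `sh` keeps the lightweight images an option.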
Practice: Databases
- AWS RDS - Good way of managing complexity of backups, snapshots, restoration, clustering, fail-over, upgrades, but takes reading the docs
- Look up the top DB providers for your favorite DB
- Turso: SQLite databases that you store in the cloud, good for smaller apps that want high levels of isolation in multi-tenant systems
Do's and don'ts
- Some things I've found good success with
- Some anti-patterns I've seen or maybe done
Do (Load balancers):
- Use a load balancer, even if just nginx and not a service
- Use the proxy or reverse proxy capabilities to facilitate infrastructure migration, including a proxy/dns 2-step for 100% uptime migrations
- Use HTTP/2.0 or newer
- Use geographic DNS to point users at a sufficiently close load balancer
- Use a small number of resolutions for a DNS entry
- Be able to switch between long-polling and websockets
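As a starting point, "just nginx" can look like the sketch below. This is an illustrative config, not a production template - the upstream addresses, hostnames, and cert paths are assumptions:

```nginx
# Hypothetical sketch: nginx as a small load balancer / reverse proxy.
upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    keepalive 32;                    # reuse upstream connections
}

server {
    listen 443 ssl http2;            # serve clients over HTTP/2
    server_name www.example.com;
    ssl_certificate     /etc/nginx/tls/fullchain.pem;
    ssl_certificate_key /etc/nginx/tls/privkey.pem;

    location / {
        proxy_pass http://app_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

Because all traffic flows through this one layer, swapping the upstreams later (the proxy/DNS 2-step) doesn't require clients to notice anything.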
Do not (Load balancers):
- DO NOT use a load balancer across "regions" or over systems with different latencies
- DO NOT use several layers of load balancers
- DO NOT use DNS to round robin servers in different regions/locations, consider a hot standby failover rather than load-balancing to a backup hundreds of miles away
Do (Load & QUEUES):
- DO use quotas and reject excess requests in high scale environments
- DO use queues for some kinds of work, understanding that a queue can make errors harder to link to requests, involve distributed tracing to debug
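Quotas don't require custom code; nginx's `limit_req` module can reject excess requests at the edge. A hedged sketch (zone name, rates, and upstream are placeholder choices):

```nginx
# Hypothetical sketch: per-client request quotas with nginx's limit_req.
# Excess requests are rejected with 429 instead of piling up in a queue.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    listen 80;
    location /api/ {
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://app_backend;
    }
}
```

Rejecting early keeps queues bounded, which is exactly the backpressure-vs-queues tradeoff from the Theory section.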
Do (BACKUP + RECOVERY):
- DO write the general case for error recovery first
- DO use error recovery procedures often, for instance, to restore production data into a limited staging environment, to gain confidence
- DO write special logic to do recovery earlier when the general case is expensive
- DO pay attention to privacy issues when restoring production data
- DO write down your process for data retention (CYA for legal especially)
Do (microservices):
- DO write systems that can separate
- DO consider the operational overhead of microservices, including
- Versioning
- Declaring version dependencies between microservices
- Implementing distributed tracing
- How much busy-work this might entail for a small team
- DO consider work quotas and rejecting excess requests
Don't (microservices):
- DO NOT default to microservices
- DO NOT deploy microservices without some way to version them
- DO NOT deploy microservices without support for developers to debug requests as they propagate through the system
- DO NOT use async without some kind of rate limiting system
Don't (BACKUP + RECOVERY):
- DON'T handle every error individually
- DON'T rely on catch/finally to recover state
- DON'T go months or years without testing recovery or failover, it's better to risk an outage when you're awake and lucid and prepared than at 3am
Do (AUTOMATION):
- DO Introduce CI/CD early
- DO consider using Docker as a lingua franca for an individually deployable piece of infrastructure/code
- DO run unit tests
- DO run dependency audits for security vulnerabilities
- DO rotate responsibility for the CI/CD through the team to avoid single points of failure in terms of people
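An early pipeline can be small. Here's a hypothetical sketch in GitHub Actions syntax (the image name, test script, and audit command are assumptions - swap in your stack's equivalents), with the Docker image as the shared unit of deployment:

```yaml
# Hypothetical sketch of an early CI pipeline (GitHub Actions syntax).
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build the Docker image once, as the lingua franca artifact
        run: docker build -t myapp:${{ github.sha }} .
      - name: Run unit tests inside the image
        run: docker run --rm myapp:${{ github.sha }} ./run-tests.sh
      - name: Audit dependencies for known vulnerabilities
        run: docker run --rm myapp:${{ github.sha }} npm audit --audit-level=high
```

Because the tests run inside the same image that ships, "works in CI" and "works in prod" stay the same claim.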
Don't (AUTOMATION):
- DON'T delay automation, it's harder when the system is more complex
- DON'T vary production and development environments except through scale
Do (Docker):
- DO use docker-compose for development
- DO think about multiple docker-compose files
- DO run filesystem-intensive operations outside of Docker
- DO make use of host.docker.internal to develop on your local host machine and run nginx or other support off the shelf software in Docker
- DO use docker-compose as a reference for the basic service dependencies
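A minimal development compose file under these assumptions (service images and paths are illustrative): off-the-shelf dependencies run in Docker, while your app runs on the host and containers reach it via `host.docker.internal`:

```yaml
# Hypothetical docker-compose sketch: support software in Docker,
# the app itself on the host, reachable via host.docker.internal.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only-password
    ports:
      - "5432:5432"
  nginx:
    image: nginx:1.25
    ports:
      - "8080:80"
    volumes:
      - ./dev/nginx.conf:/etc/nginx/conf.d/default.conf:ro
    extra_hosts:
      - "host.docker.internal:host-gateway"   # needed on Linux
```

The file doubles as documentation: anyone can read off which services the app depends on and how they connect.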
Do (Docker):
- DO use a Dockerfile that's more for deployment
- DO use multistage to keep build tools and info out of prod environment
- DO use environment variables for runtime configuration
- DO use build arguments for docker build time configuration
- DO use RUN --mount and COPY --from to make builds safer and faster, particularly for sensitive information not meant to be promulgated with the final image
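Those bullets combine into something like the following hypothetical Node-flavored sketch (package names, paths, and the secret id are assumptions; the pattern is the point):

```dockerfile
# Hypothetical multistage Dockerfile sketch.
FROM node:20 AS build
WORKDIR /src
COPY package*.json ./
# --mount=type=secret keeps the token out of every image layer;
# --mount=type=cache speeds up repeated builds.
RUN --mount=type=secret,id=npm_token \
    --mount=type=cache,target=/root/.npm \
    NPM_TOKEN=$(cat /run/secrets/npm_token) npm ci
COPY . .
RUN npm run build

# Final stage: only built artifacts, no build tools, no secrets.
FROM node:20-slim
WORKDIR /app
COPY --from=build /src/dist ./dist
ARG GIT_SHA=dev            # build-time configuration
ENV APP_GIT_SHA=$GIT_SHA   # runtime configuration via environment
CMD ["node", "dist/server.js"]
```

The secret is mounted only for the one `RUN` that needs it, so it never lands in a layer of the final image.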
Don't (Docker):
- DON'T rebuild from a Dockerfile as part of the normal development workflow
- DON'T deploy a ton of docker containers by hand
- DON'T publish secrets
DO (STORAGE):
- DO save files in S3, every cloud provider has an S3 compatible API
- DO consider pre-signed direct uploads to S3*
- DO keep track of upload state separately
- DO consider keeping the S3 bucket private-only and have the application sign GET requests to get the data back
DON'T (STORAGE):
- DON'T save files from an app server, it turns a server from part of a herd of cattle into a special pet you have to name and care for
- DON'T save files locally anyway; beyond the ops burden, user files on an app server are an extra security hazard
TOPOLOGY DIAGRAMS
- It matters where you put things
- Flatter is better
- Networks can do weird things
- CDN everything for TCP slowstart reasons
USE A CDN IN PROD
TCP Slowstart:
- 10 packets are sent at first
- # of packets in flight are increased/decreased as bandwidth allows
- You can't send the 11th packet until the first packet is acknowledged with permission to send more
- Smaller round trip (closer to the server), the faster you can ramp up bandwidth
- Packet loss really hurts this flow control
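A back-of-envelope sketch of why round-trip time dominates: assuming an initial window of 10 segments that doubles each RTT (and no loss), the number of round trips to deliver a payload is:

```python
# Back-of-envelope slow start model: initial window of 10 segments,
# doubling each round trip, no packet loss.
def slow_start_rtts(payload_bytes: int, mss: int = 1460,
                    initial_cwnd: int = 10) -> int:
    segments = -(-payload_bytes // mss)  # ceiling division
    sent, cwnd, rtts = 0, initial_cwnd, 0
    while sent < segments:
        sent += cwnd   # one window's worth delivered per round trip
        cwnd *= 2      # window doubles while in slow start
        rtts += 1
    return rtts

# ~14 KB fits in the initial 10-segment window: one round trip.
assert slow_start_rtts(14_000) == 1
# A 150 KB page needs 4 round trips (10 + 20 + 40 + 80 segments).
assert slow_start_rtts(150 * 1460) == 4
```

Every extra round trip is a full network RTT, so moving content to a CDN edge 10 ms away instead of an origin 150 ms away shrinks the whole ramp-up proportionally.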
BEWARE CORS
- Browsers require CORS when you cross origins, including subdomains
- When you make a request to backend.example.com from www.example.com, you send 2 requests (a preflight OPTIONS plus the real request) with AT LEAST 2 round trips
- Doubles the slowstart penalty
- Just stick a /backend route on the CDN or on a load balancer to hit the backend to defeat the penalty
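That "/backend route" trick is a few lines of proxy config. A hypothetical nginx sketch (upstream names invented) - because the frontend and backend now share one origin, the browser never sends a CORS preflight at all:

```nginx
# Hypothetical sketch: one origin for frontend and backend,
# so fetch("/backend/...") needs no CORS preflight.
server {
    listen 443 ssl http2;
    server_name www.example.com;

    location / {
        proxy_pass http://frontend_upstream;
    }

    location /backend/ {
        proxy_pass http://backend_upstream/;
    }
}
```

The same routing rule can live on the CDN instead of a load balancer; either way the cross-origin penalty disappears.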
SMALL + ADVANCED
- Fargate Tasks are less like `docker run` and more like docker compose
- Is complete enough to migrate over time to separate services with their own load balancers for discovery, etc
- Like docker-compose.yml, contains a good description of what services connect to what else
- Failures are grouped together in a single highly-interconnected instance, so the whole thing can be unhealthy, whereas separate services might have combinatorial complexity
FIN
Alex Songe <a@songe.me>