Victoria Shoes

Major Product Launch

Project Summary

  • Complete Website Overhaul
  • UX High Fidelity Mocks Complete
  • May 2021 MVP Delivery
  • August 2021 Product Launch
  • ~250k Daily Page Views Current
  • ~25M Daily Page Views Expected
  • Ensure Proper Site Reliability
  • Dynamic Deployment on Launch Date
  • Upskill/Train Customer Engineers
  • Restricted A/B Focus Testing Environment
  • Application Monitoring and Notifications
  • On-Demand Deployment

Main Concerns

  • Lithuanian Separatist Group
    • Previous DDoS Attacks
  • Focus Group Testing
    • Multiple Versions Available
    • CMO to Approve Each Feature
    • Preventing Leaks
  • Site Reliability
    • No Dedicated Database Administrators
    • Engineer Training
  • Hosting Platform
    • Public Cloud or On-Premises Hosting
    • Datacenter Investment

Concern:

Lithuanian Group

Knowns

  • The crude DDoS attack suggests limited technical sophistication
  • VMs crashed at approximately 2,500 requests/min
  • Requests for large assets are separate from page requests

Assumptions

  • Attacks will continue with the new product launch
  • Attacks may become more sophisticated
  • Attacks will likely target commonly known exploits and attack vectors

Mitigation Plan

  • Store thumbnail/various scaled assets on CDN
  • Implement automatic IP banning based on:
    • Number of Authentication Requests
    • Number of Requests per Minute
    • Known Botnet Agents
    • Number of 404 Responses per Minute
    • Known bad routes
  • Implement pre-forking web services
    • Limits the impact of memory leaks, since worker processes can be recycled
    • Allows for parallel request processing
    • Hung requests can be killed without affecting other workers
  • CAPTCHAs
    • Prevent brute force authentication attempts
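
The requests-per-minute rule above could be sketched roughly as follows. This is a minimal illustration, not a production design: the class name `RequestRateBanList`, the sliding-window approach, and the thresholds are assumptions, and a real deployment would more likely enforce this at the load balancer or firewall.

```python
from collections import defaultdict, deque


class RequestRateBanList:
    """Bans a client IP once it exceeds a request budget within a sliding window.

    Hypothetical helper for illustration; names and thresholds are assumptions.
    """

    def __init__(self, max_requests: int, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.banned = set()

    def allow(self, ip: str, now: float) -> bool:
        """Record one request at time `now` (seconds); return False if banned."""
        if ip in self.banned:
            return False
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.banned.add(ip)
            del self._hits[ip]
            return False
        return True
```

In a request handler, `now` would typically come from `time.monotonic()`. The same structure could count authentication attempts or 404 responses by keeping one instance per rule.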

Concern:

Focus Group Testing

Knowns

  • Focus group environment must be isolated from public internet traffic
  • Requires multiple versions for A/B testing and approval
  • CMO must approve before feature is accepted

Assumptions

  • The presentation layer will be the primary differentiation between focus group versions
  • Development velocity should be minimally impacted by focus group testing

Mitigation Plan

  • Scalable application architecture
    • Clean Architecture (Ports and Adapters pattern)
    • Allows for multiple versions and interchangeable components
    • Separates business logic from implementation details
    • Allows developers to continue with new features while awaiting approval from CMO
  • Variants can be deployed utilizing proper configuration management tools
  • Automated deployment and environment configurations allow for rapid setup with minimal staff intervention
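
One way to picture the Ports and Adapters idea for focus group variants is the sketch below. It assumes, as the slide does, that variants differ mainly in the presentation layer. All names here (`Product`, `ProductPresenter`, `VariantAPresenter`, `VariantBPresenter`, `show_product`) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Product:
    sku: str
    name: str
    price_cents: int


class ProductPresenter(Protocol):
    """Port: the only thing the business logic knows about presentation."""

    def render(self, product: Product) -> str: ...


class VariantAPresenter:
    """Adapter for focus group variant A (hypothetical)."""

    def render(self, product: Product) -> str:
        return f"{product.name} - ${product.price_cents / 100:.2f}"


class VariantBPresenter:
    """Adapter for focus group variant B (hypothetical)."""

    def render(self, product: Product) -> str:
        return f"NEW! {product.name.upper()} (${product.price_cents / 100:.2f})"


def show_product(product: Product, presenter: ProductPresenter) -> str:
    """Use case: unchanged no matter which variant is plugged in."""
    return presenter.render(product)
```

Because `show_product` depends only on the port, a new variant can be added or dropped without touching business logic, which is what lets development continue while a variant awaits approval.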

Concern:

Site Reliability

Knowns

  • Site catalog and content stored in MySQL database
  • Relatively small (<500) catalog of products
  • Current database schema to be utilized going forward
  • Production support via customer engineers

Assumptions

  • The database schema is not likely optimized
  • On-premises data center cannot handle expected site traffic
  • Most of the site is fairly static in nature
  • Engineers are not currently equipped to support the site

Mitigation Plan

  • A deep review of database optimizations
    • Proper indices, storage, column types, and sizes
    • Hardware evaluation
    • Query profiling and caching
  • Self-documenting code
    • Documentation is never stale
    • Request/Response validation is automatic
  • Utilize in-memory caching services, e.g. Memcached
    • Minimizes disk reads
    • Improves response times
  • Application and system monitoring
    • User telemetry collection
    • Monitoring and reporting tools
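
The caching bullet above can be illustrated with a read-through cache. The sketch uses a plain in-process dictionary as a stand-in for a service such as Memcached, so it shows the access pattern only. The class name `ReadThroughCache`, the `loader` callback, and the TTL value are assumptions.

```python
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Serves values from memory and calls `loader` only on a miss or expiry.

    In-process stand-in for an external cache; hypothetical, for illustration.
    """

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float):
        self._loader = loader          # e.g. a function that queries MySQL
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, dict]] = {}
        self.misses = 0

    def get(self, key: str, now: float) -> dict:
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # fresh hit: no database read
        self.misses += 1
        value = self._loader(key)      # miss or stale: go to the source
        self._entries[key] = (now, value)
        return value
```

With a catalog of fewer than 500 products and mostly static content, nearly every read after warm-up would be served from memory under this pattern.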

Concern:

Hosting Platform

Hosting Contention

  • A large capital expenditure has already been made to upgrade the data center.
  • It is assumed by some that a public cloud provider is the only cost-effective solution.
  • Launch date traffic increase is estimated to be 100x the current traffic.
  • On-prem and public cloud providers both have their strengths and challenges.
  • Downtime is not an option.*

Downtime is Unavoidable

  • Inter-Connect Peering and Backbone Providers
    • Disputes over billing and data transmissions
    • Aging equipment
    • Malicious DDoS Attacks
  • Internet Service Providers
    • Electrical Grid Outages
    • Downed Transmission Lines
    • Construction Mishaps
  • Trans-oceanic Cables
    • Sharks...
    • Ship anchors and other equipment

How to Maximize Uptime?

  • Host on-premises, hoping for the best.
  • Host development and testing environments on-premises, deploy production on AWS.
  • Host on-premises and scale to AWS as needed.
  • Host on AWS and fail over to on-premises as needed.
  • Host entirely on AWS, recouping some of the data center investment via the sale of equipment.

Public Cloud (AWS)

Pros

  • Scalability is limited only by operational cost
  • Many support options available
  • No single point of failure
  • Amazon retail integrations

Cons

  • The operational cost can quickly get out of hand
  • Hard to move away from AWS services once set up
  • Configuration and setup can be fairly confusing for the uninitiated

Self-Hosting

Pros

  • More granular control of the system
  • Easier to port to a public cloud provider later
  • Utilizes the updated data center
  • Lower operational cost

Cons

  • Limited support options
  • The scale is limited by hardware footprint, capital budget, and physical resources
  • Points of failure are not distributed

Plan: Benchmark Tests

  • Unknowns:
    • The necessary level of support.
    • The capabilities of the customer's data center.
    • The skillset of engineers.
  • It is nearly impossible to determine the most cost-effective solution without data.
    • Cost analysis must be performed to determine the scale and operational cost of each option.
    • Maintainability will be determined by engineers' skillset.
  • A hybrid solution is likely the most effective option.
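
As a starting point for benchmark targets, the traffic figures quoted earlier in the deck can be turned into rough rates. This is back-of-the-envelope arithmetic only. It assumes traffic is spread evenly over the day, which understates launch-day peaks, and it compares page views with requests even though one page view usually triggers several requests.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures taken from the deck.
current_daily_views = 250_000
growth_factor = 100
crash_requests_per_minute = 2_500

expected_daily_views = current_daily_views * growth_factor


def average_per_second(daily_total: int) -> float:
    """Average rate, assuming an even spread over 24 hours (an assumption)."""
    return daily_total / SECONDS_PER_DAY


current_avg = average_per_second(current_daily_views)    # roughly 2.9 per second
expected_avg = average_per_second(expected_daily_views)  # roughly 289 per second
crash_rate = crash_requests_per_minute / 60              # roughly 41.7 per second

# How far the expected average exceeds the rate at which VMs previously crashed.
headroom_gap = expected_avg / crash_rate                 # roughly 6.9x

print(f"expected average: {expected_avg:.0f}/s, crash threshold: {crash_rate:.1f}/s")
```

Even under these generous assumptions, the expected average load is several times the rate that previously crashed the VMs, so benchmarks should target peak rates well above the average.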

Suggestion:

Development Strategy

Clean Architecture

Automated Integration

  • Code is branched for new features
    • Peer-Reviewed
    • Automated unit, integration, regression tests
    • Create feature flag and artifact
  • Artifacts deployed to focus group
    • User Testing/Feedback
  • CMO approves artifact
    • The artifact is tagged and integrated
    • Application configuration updated and tagged
  • Production-ready
    • Final application configuration is deployed
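
The approval flow above could be modeled as a small state machine with a feature-flag check per environment. This is a simplified sketch: the stage names, `FeatureArtifact`, `promote`, and `is_visible` are hypothetical, and a real pipeline would record these transitions in the CI/CD tool and artifact repository instead of in application code.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    BUILT = "built"              # reviewed, tested, artifact created
    FOCUS_GROUP = "focus_group"  # deployed to the focus group
    APPROVED = "approved"        # CMO approved, artifact tagged
    PRODUCTION = "production"    # final configuration deployed


_NEXT = {
    Stage.BUILT: Stage.FOCUS_GROUP,
    Stage.FOCUS_GROUP: Stage.APPROVED,
    Stage.APPROVED: Stage.PRODUCTION,
}


@dataclass
class FeatureArtifact:
    name: str
    stage: Stage = Stage.BUILT


def promote(artifact: FeatureArtifact, *, cmo_approved: bool = False) -> Stage:
    """Advance one stage; leaving the focus group requires CMO approval."""
    if artifact.stage is Stage.PRODUCTION:
        raise ValueError(f"{artifact.name} is already in production")
    if artifact.stage is Stage.FOCUS_GROUP and not cmo_approved:
        raise PermissionError(f"{artifact.name} needs CMO approval")
    artifact.stage = _NEXT[artifact.stage]
    return artifact.stage


def is_visible(artifact: FeatureArtifact, environment: str) -> bool:
    """Feature flag check: which environments may show this feature."""
    if environment == "focus_group":
        return artifact.stage is not Stage.BUILT
    if environment == "production":
        return artifact.stage is Stage.PRODUCTION
    return False
```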

Tools/Services

  • Declarative CI/CD pipelines
    • Drone.io
    • Concourse CI
  • Artifact repositories
    • Sonatype Nexus
    • JFrog
  • Containerized configurations

Suggestion:

Hosting Strategy

AWS

  • ELB
    • Highly Available
    • Security Services
    • Autoscaling
  • EC2
    • Application Components
  • CloudFront
    • Global CDN

Datacenter

  • CI/CD Pipelines
  • Artifact Repository
    • Libraries/tools
    • Container images
  • Focus Group
    • VPN secured
    • Mock environment using containerized components

Hybrid Solution*

Suggestion:

Deployment Strategy

Deployment

  • Deploy the current content on the revamped framework once:
    • It passes the focus group
    • The CMO approves it
    • All tests pass
  • This allows time to act on actual user feedback:
    • Fix bugs and user issues
    • Harden the framework
  • New content should utilize the same framework
    • Flagged as unavailable/inactive
  • Activate new content via:
    • Internal company chat services
    • Twitter bot
    • Webhook
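
For the webhook option, activation could look roughly like the sketch below: content is staged as inactive and flipped on only when a correctly signed request arrives. The function names, the catalog layout, and the choice of an HMAC-SHA256 signature over the request body are assumptions for illustration; chat services and bots each have their own verification schemes.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 signature the sender would attach to the request."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def handle_activation(catalog: dict, body: bytes, signature: str, secret: bytes) -> bool:
    """Activate the content named in `body` if the signature checks out.

    `catalog` maps content IDs to {"active": bool}; this layout is hypothetical.
    """
    expected = sign(secret, body)
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, signature):
        return False
    content_id = body.decode("utf-8")
    if content_id not in catalog:
        return False
    catalog[content_id]["active"] = True
    return True
```

Keeping the content deployed but inactive means launch-day activation is a configuration flip, not a deployment.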

Plan:

Production Support and Reliability Strategy

Proactive Monitoring

  • Leverage WWT partners to ingest metrics from the full stack
    • Network bandwidth and request metrics
    • Application usage, bottlenecks, and stack traces
    • System resources
    • Receive alerts before a problem occurs
  • Utilize visualization tools, e.g. Grafana, to recognize problematic patterns
  • Use services such as Chronograf to review and search logs related to customer issues

Site Reliability

  • Leverage AWS auto-scaling features to handle sudden increases in site traffic
  • Distribute traffic across multiple regions, and properly utilize CDNs to optimize site availability and responsiveness
  • Utilize Infrastructure as Code to enable deployment across various cloud providers if necessary
  • Implement tooling to detect malicious activity and automatically ban offending IPs