Rearchitecting Client Events at Stitch Fix
Rob Wierzbowski
Stitch Fix is a data-driven company
Track the user's experience
Route loads, views and clicks
Real-time personalization
Measure KPIs
Guide business decisions
Client events
Setting the stage
No centralized responsible party to vet event requirements.
No standardization of event schemas. Teams created new event schemas for most features and use cases.
No standardization of event triggers. Events were triggered in different layers of an application, with different heuristics and success ratios (e.g., a screen load event sent from a SPA on route load, or from the backend on route request).
Complicated, non-standardized contextual data was added to each event. Often known as “subsource”, it could contain information about UI near the event trigger, site region (e.g., checkout, product page), actions the client took in the past, and even other subsources.
Little to no feedback on event impact after implementation.
Major Issues
Doubled time to feature completion. Teams reported spending 50% of feature development time on Client Events, up from a negligible amount with GA.
Stressful negotiation of event data. Many event context requests involved large lifts, which engineers resisted.
Highly coupled app code. Event values were passed deeply through component trees, creating frustrating spaghetti codebases.
Increased maintenance costs, with causes including databases to support event context, maintaining unused events, and refactoring friction due to coupled code.
Low ability to analyze events across teams, due to non-standard event shapes.
Incorrect KPI and personalization results caused by frequent bugs. Testing was difficult and errors were often unnoticed.
Major Impacts
Solve: Rearchitecting Client Events
- Increase standardization
- Reduce time per feature
- Balance responsibilities between Eng and Algos orgs
Goals
- Business critical actions and entities:
- View and Click (Select)
- Routes (Screens), Categories, Outfits, SKUs, Generics
- Minimal, atomic events
- Strict definitions; unions where possible
- TypeScript compiled into JSON Schema for transport
Strongly typed schemas
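To make the "strict definitions; unions where possible" idea concrete, here is a minimal sketch of what such a schema could look like. All type and field names (`Action`, `Entity`, `ClientEvent`, `describeEvent`) are hypothetical and chosen for illustration; they are not the actual Stitch Fix schemas, and the JSON Schema compilation step is not shown.

```typescript
// Hypothetical schema sketch; names and fields are assumptions, not the real schemas.

// The two business-critical actions from the slide.
type Action = "view" | "click";

// A discriminated union keeps each entity's shape strict and minimal.
type Entity =
  | { kind: "route"; name: string }
  | { kind: "category"; id: string }
  | { kind: "outfit"; id: string }
  | { kind: "sku"; id: string }
  | { kind: "generic"; label: string };

// One atomic event: an action applied to exactly one entity.
interface ClientEvent {
  action: Action;
  entity: Entity;
  timestamp: number; // epoch milliseconds
}

// The compiler narrows `entity` by its `kind`, so each branch can only
// touch fields that exist on that variant.
function describeEvent(event: ClientEvent): string {
  const entity = event.entity;
  switch (entity.kind) {
    case "route":
      return event.action + " route " + entity.name;
    case "generic":
      return event.action + " generic " + entity.label;
    default:
      // category, outfit, and sku all carry an `id`
      return event.action + " " + entity.kind + " " + entity.id;
  }
}
```

With a shape like this, an event that names an unknown entity kind or omits a required field fails at compile time rather than surfacing later as a bad KPI.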
- Distributed tool to send events from frontend apps
- Well documented, fully featured (test helpers, bug tracker integration, etc.)
- Optimizes transport and triggers
- Automates event context
Event Reporter
- Intersection Observers trigger an event when 60% of a component is in viewport, or when the component covers 30% of the available viewport
- Transport via keepalive Fetch
- Batching and Gzip for performant bytes over wire
Event Reporter
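The view trigger and batching described above can be sketched as two small pieces of pure logic. This is an illustration under stated assumptions, not the Event Reporter's actual code: `Visibility`, `isViewed`, and `EventBatcher` are hypothetical names, the thresholds are taken from the slide, and the transport is injected so the sketch stays independent of the browser.

```typescript
// Hypothetical sketch of the view heuristic and batching; not the real Event Reporter.

interface Visibility {
  visibleArea: number;   // px² of the component currently inside the viewport
  componentArea: number; // px² of the whole component
  viewportArea: number;  // px² of the available viewport
}

const COMPONENT_RATIO = 0.6; // 60% of the component is in the viewport
const VIEWPORT_RATIO = 0.3;  // the component covers 30% of the viewport

// A view fires when either threshold is met. The second rule lets components
// taller than the viewport register a view even though 60% of them can never
// be on screen at once.
function isViewed(v: Visibility): boolean {
  if (v.componentArea <= 0 || v.viewportArea <= 0) return false;
  const shareOfComponent = v.visibleArea / v.componentArea;
  const shareOfViewport = v.visibleArea / v.viewportArea;
  return shareOfComponent >= COMPONENT_RATIO || shareOfViewport >= VIEWPORT_RATIO;
}

// Collects events and hands them to `send` in groups of `maxSize`.
class EventBatcher<T> {
  private queue: T[] = [];
  private maxSize: number;
  private send: (batch: T[]) => void;

  constructor(maxSize: number, send: (batch: T[]) => void) {
    this.maxSize = maxSize;
    this.send = send;
  }

  add(event: T): void {
    this.queue.push(event);
    if (this.queue.length >= this.maxSize) this.flush();
  }

  // Sends whatever is queued; a no-op when the queue is empty.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }
}
```

In a browser, the three areas could be derived from an `IntersectionObserverEntry` (`intersectionRect`, `boundingClientRect`, `rootBounds`), and `send` could POST the batch with `fetch(url, { method: "POST", body, keepalive: true })` so that requests started during page unload can still complete. The endpoint and the compression step are left out here because the slides do not specify them.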
- Validation
- Security
- Message transportation (RabbitMQ)
Client Event Service
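As an illustration of the validation step, the sketch below checks an untrusted payload before it would be forwarded to the message queue. It is a stand-in, not the service's real logic: the slides say schemas are compiled to JSON Schema, whereas this example uses a hand-written check, and the accepted field names (`action`, `entity.kind`, `timestamp`) are assumptions.

```typescript
// Hypothetical validation sketch; a hand-written check standing in for
// JSON Schema validation. Field names are assumptions.

type ValidationResult = { ok: true } | { ok: false; errors: string[] };

const ACTIONS = ["view", "click"];
const ENTITY_KINDS = ["route", "category", "outfit", "sku", "generic"];

function validateEvent(payload: unknown): ValidationResult {
  if (typeof payload !== "object" || payload === null) {
    return { ok: false, errors: ["payload must be an object"] };
  }
  const record = payload as Record<string, unknown>;
  const errors: string[] = [];

  // Only the known actions are accepted.
  if (typeof record.action !== "string" || ACTIONS.indexOf(record.action) === -1) {
    errors.push("action must be one of: " + ACTIONS.join(", "));
  }

  // The entity must be an object with a recognized kind.
  const entity = record.entity;
  if (typeof entity !== "object" || entity === null) {
    errors.push("entity must be an object");
  } else {
    const kind = (entity as Record<string, unknown>).kind;
    if (typeof kind !== "string" || ENTITY_KINDS.indexOf(kind) === -1) {
      errors.push("entity.kind must be one of: " + ENTITY_KINDS.join(", "));
    }
  }

  // Timestamps are expected as finite epoch milliseconds.
  if (typeof record.timestamp !== "number" || !isFinite(record.timestamp)) {
    errors.push("timestamp must be a finite number");
  }

  return errors.length === 0 ? { ok: true } : { ok: false, errors: errors };
}
```

Collecting every error instead of stopping at the first one gives the sending team a complete picture of a malformed event in a single response.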
- A long process
- Many revisions in the early stages
- Thin implementations
- Parallel implementations
- Building support from the ground up
Rollout
- Client Event implementation has fallen to 10-15% of feature time. Reducing time and stress around events was by far our biggest goal, and teams using the new architecture report great success. Event content negotiation time has fallen by 90%.
- Teams are decoupling components, increasing isolation, reuse, and test coverage with Event Reporter.
- Communication is improved and responsibility is clear. Questions and discussions are handled efficiently, and teams share well-defined, standardized nomenclature around Client Events.
- Algos can analyze data across all client-facing apps. New events have flexibility that will lead to improved analysis over time.
- Bespoke Algos pipelines and event processing are being removed.
- Personalization and KPI report accuracy has improved due to improved testing, strict schemas, and multiple validation layers.
- We have not lost significant analysis ability by switching to a simplified event context.
Impact
Discussion