Ramon Perez
With ❤️ from Sydney, AU
Former
Currently
By the end of this session, you will know:
Why...
should you understand them, and
be excited about what they offer?
How...
do stores add value,
what solutions are out there, and
how do I figure out what to look for in them?
What are...
metrics, features, and their stores?
Metrics
If you cannot measure it, you cannot improve it.
Lord Kelvin
Metrics are what companies use to measure, track and improve the components that have a direct effect on their value proposition.
A standard for measuring or evaluating something, especially one that uses figures or statistics.
A standard of measurement.
A quantifiable measure that is used to track and assess the status of a specific process.
Dating Apps
Daily Active Users
Premium Memberships
Churn
Clothing Brands
Customer Lifetime Value
News Outlets
Web Traffic
All Employers Should
Happiness (Employee Satisfaction)
Source: How do organizations manage metrics today? by Transform.co
Tied to the success of the company.
Data points aggregated over time.
Used to track a company's financial health, among many things.
Tied to a specific goal such as the success of an ad campaign.
Frequencies and aggregated numbers that provide direction towards achieving a goal.
Can change regularly for similar or different goals that may or may not affect revenue.
Have a clear goal
Have a consistent definition
Are actionable
Are relevant to the bottom line
Are measurable
Are readable
A Metrics Store is a tool that allows you to define metrics as code, govern them, and serve them to a variety of downstream applications.
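As a rough illustration of "metrics as code", the sketch below registers one metric definition and serves it by name. The `Metric` and `ToyMetricsStore` classes, the `owner` field, and the event layout are all hypothetical and chosen for this example only; real metrics stores sit on top of a warehouse and expose far richer interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass(frozen=True)
class Metric:
    """Hypothetical metric definition: metadata plus the logic itself."""
    name: str
    owner: str
    description: str
    compute: Callable[[List[Event]], float]


class ToyMetricsStore:
    """In-memory stand-in for a metrics store (illustration only)."""

    def __init__(self) -> None:
        self._metrics: Dict[str, Metric] = {}

    def register(self, metric: Metric) -> None:
        # Governance in miniature: a metric name can only be defined once.
        if metric.name in self._metrics:
            raise ValueError(f"metric {metric.name!r} is already defined")
        self._metrics[metric.name] = metric

    def serve(self, name: str, events: List[Event]) -> float:
        # Every downstream tool gets the same logic for the same name.
        return self._metrics[name].compute(events)


def daily_active_users(events: List[Event]) -> float:
    """Count distinct users that produced at least one event."""
    return float(len({event["user_id"] for event in events}))


dau = Metric(
    name="daily_active_users",
    owner="product",
    description="Distinct users with at least one event in the day",
    compute=daily_active_users,
)

store = ToyMetricsStore()
store.register(dau)

events = [
    {"user_id": "a", "action": "swipe"},
    {"user_id": "b", "action": "swipe"},
    {"user_id": "a", "action": "message"},
]
print(store.serve("daily_active_users", events))
```

Because the definition lives in one place, a BI dashboard and a notebook that both ask for `daily_active_users` cannot drift apart.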
Data Storage
Metrics Store
Data Sources
Data Analysis and other tools
This assumes an ETL approach took place.
In addition, Metrics Stores allow you to productionize data for diverse business use cases and stakeholders of varying technical levels.
Data Analysis and other tools
This assumes an ELT approach took place.
Data Storage
Metrics Store
Data Sources
Transformations
Each use case would have its own metrics logic definition prior to the analytics stage.
Data Analysis and other tools
Data Storage
Metrics Logic +
Data Sources
Inconsistent metrics definitions across teams.
Time wasted writing queries/code rather than producing insights.
Inaccurate values at the time of reporting.
Untraceability of code/queries.
Lack of governance and trust. Who created what, when and how?
Duplicate data.
Increased costs with the use of cloud resources.
Source: How do organizations manage metrics today? by Transform.co
Connect to the flavour(s) of warehouse or database(s) available at your organization.
Data producers define metrics as code in multiple ways.
Downstream Tools
BI
Metrics Store
SaaS
ML
Data Producers
Marketing
Product Team
Finance
Downstream tools connect to your Store.
Airbnb has Minerva
LinkedIn has UMP
Uber has uMetric (and M3)
Features
Problem/Goal
Data Sources
Prepare Data
Train Model
Evaluate
Fine Tune
Deploy
Monitor
Features and Labels
Features are numeric representations of raw data that serve as the fuel for machine learning models.
Student | Month | Income |
---|---|---|
Yes | Feb | $20K |
No | Mar | $75K |
No | Jul | $60K |
Yes | Jan | $22K |
Yes | Dec | $10K |
Not Features
Features
Student | Month | Income |
---|---|---|
1 | 2 | 20000 |
0 | 3 | 75000 |
0 | 7 | 60000 |
1 | 1 | 22000 |
1 | 12 | 10000 |
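One way the raw table could be turned into the numeric one is sketched below. The helper names (`parse_income`, `encode_row`) are made up for this example, and the income parser assumes every value looks like `$<number>K`, as in the table.

```python
# Month abbreviations mapped to their calendar number.
MONTHS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}


def parse_income(text: str) -> int:
    """Turn a string such as '$20K' into 20000 (assumes the '$<n>K' layout)."""
    if not (text.startswith("$") and text.endswith("K")):
        raise ValueError(f"unexpected income format: {text!r}")
    return int(text[1:-1]) * 1000


def encode_row(student: str, month: str, income: str) -> tuple:
    """Re-code one raw row into numeric features."""
    return (1 if student == "Yes" else 0, MONTHS[month], parse_income(income))


raw_rows = [
    ("Yes", "Feb", "$20K"),
    ("No", "Mar", "$75K"),
    ("No", "Jul", "$60K"),
    ("Yes", "Jan", "$22K"),
    ("Yes", "Dec", "$10K"),
]

features = [encode_row(*row) for row in raw_rows]
for row in features:
    print(row)
```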
By re-coding all non-numerical values into numerical ones, e.g. Likert scale-type question into numbers
Date_Time |
---|
31-12-2021 14:22:55 |
20-07-2021 11:40:13 |
By extracting information from different data points, e.g. dates (not those dates 👩❤️👨)
Day | Month | Year | Hour | Min | Secs |
---|---|---|---|---|---|
31 | 12 | 2021 | 14 | 22 | 55 |
20 | 7 | 2021 | 11 | 40 | 13 |
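The split shown above can be reproduced with the standard library. This is a minimal sketch that assumes every timestamp follows the day-month-year layout used in the table; `split_timestamp` is a name invented for the example.

```python
from datetime import datetime


def split_timestamp(text: str) -> dict:
    """Break a 'DD-MM-YYYY HH:MM:SS' string into one numeric feature per part."""
    ts = datetime.strptime(text, "%d-%m-%Y %H:%M:%S")
    return {
        "Day": ts.day,
        "Month": ts.month,
        "Year": ts.year,
        "Hour": ts.hour,
        "Min": ts.minute,
        "Secs": ts.second,
    }


for raw in ["31-12-2021 14:22:55", "20-07-2021 11:40:13"]:
    print(split_timestamp(raw))
```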
How satisfied were you with your chicken burger?
Very Dissatisfied | Dissatisfied | Neutral | Satisfied | Very Satisfied
Before | After |
---|---|
Satisfied | 4 |
Dissatisfied | 2 |
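For ordered answers like these, the position of each label on the scale can serve as its code. The sketch below assumes the five labels listed above, numbered from 1 to 5; `encode_likert` is a hypothetical helper.

```python
# Labels in order, from the lowest to the highest level of satisfaction.
LIKERT_SCALE = [
    "Very Dissatisfied",
    "Dissatisfied",
    "Neutral",
    "Satisfied",
    "Very Satisfied",
]

# Each label gets its 1-based rank, so the ordering survives the encoding.
LIKERT_CODES = {label: rank for rank, label in enumerate(LIKERT_SCALE, start=1)}


def encode_likert(answer: str) -> int:
    """Return the numeric code for one survey answer."""
    return LIKERT_CODES[answer]


print(encode_likert("Satisfied"), encode_likert("Dissatisfied"))
```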
dad_jokes |
---|
How did the picture end up in jail? It was framed! |
I made a pencil with two erasers. It was pointless. |
Where do lizards go to fix their fallen tails? The retail shop. |
joke | did | do | end | ... | to | two | up | was |
---|---|---|---|---|---|---|---|---|
1 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 1 |
2 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 |
3 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 |
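The word-count table above is a bag-of-words representation. Below is a minimal sketch of how such counts could be produced in plain Python; the tokenizer (lowercase alphabetic runs) is an assumption, so the full vocabulary it builds is larger than the handful of columns shown on the slide.

```python
import re
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep runs of letters only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted vocabulary and one row of word counts per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted({word for count in counts for word in count})
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


dad_jokes = [
    "How did the picture end up in jail? It was framed!",
    "I made a pencil with two erasers. It was pointless.",
    "Where do lizards go to fix their fallen tails? The retail shop.",
]

vocabulary, rows = bag_of_words(dad_jokes)

# Print only the columns that appear in the table above.
shown = ["did", "do", "end", "to", "two", "up", "was"]
for row in rows:
    print([row[vocabulary.index(word)] for word in shown])
```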
...is the process of formulating the most appropriate features given the data, the model, and the task. ~ Alice Zheng and Amanda Casari
...it is using domain knowledge of the data to create new features that increase the signal coming from the data.
Capture subtle to complex relationships in the raw data.
Increase the accuracy of our models.
Get rid of duplicate features.
Reduce the magnitude and scale of the features.
Deal with outliers.
Reduce the dimensionality of our data.
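To make "reduce the magnitude and scale of the features" concrete, here is a small sketch of min-max scaling applied to the Income column from the earlier table. Min-max scaling is only one of several common options (standardization is another), and `min_max_scale` is a name chosen for this example.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no signal; map everything to 0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


incomes = [20000, 75000, 60000, 22000, 10000]
scaled = min_max_scale(incomes)
print([round(value, 3) for value in scaled])
```

After scaling, Income sits in the same 0 to 1 range as a binary feature such as Student, so neither dominates a distance-based model purely because of its units.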
Strive for simplicity.
Select independent features.
Avoid useless features.
Avoid redundant features.
Pick a good starting point for the minimum number of features your problem will need.
...an ML-specific data system that runs data pipelines that transform raw data into feature values, stores and manages the feature data itself, and serves data consistently for training and inference purposes. ~ Willem Pienaar & Mike Del Balso, 2021
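The three responsibilities in that definition (transform, store, serve) can be sketched with a toy in-memory class. Everything here is hypothetical: `ToyFeatureStore`, its method names, and the two example features are illustrations and not the API of any real feature store.

```python
from typing import Callable, Dict, List

RawRecord = Dict[str, float]


class ToyFeatureStore:
    """In-memory stand-in for a feature store (illustration only)."""

    def __init__(self) -> None:
        self._transforms: Dict[str, Callable[[RawRecord], float]] = {}
        self._values: Dict[str, Dict[str, float]] = {}

    def register(self, name: str, transform: Callable[[RawRecord], float]) -> None:
        # Define the feature logic once, under a shared name.
        self._transforms[name] = transform

    def ingest(self, entity_id: str, raw: RawRecord) -> None:
        # Run every registered transformation and store the results.
        self._values[entity_id] = {
            name: transform(raw) for name, transform in self._transforms.items()
        }

    def get_online(self, entity_id: str, names: List[str]) -> List[float]:
        # Single-entity lookup, as used for a real-time prediction.
        stored = self._values[entity_id]
        return [stored[name] for name in names]

    def get_training(self, entity_ids: List[str], names: List[str]) -> List[List[float]]:
        # Batch lookup for training; it reuses the same stored values.
        return [self.get_online(entity_id, names) for entity_id in entity_ids]


store = ToyFeatureStore()
store.register("tenure_months", lambda raw: raw["tenure"])
store.register("avg_monthly_charge", lambda raw: raw["total_charges"] / raw["tenure"])

store.ingest("1234", {"tenure": 3, "total_charges": 4150})
store.ingest("5678", {"tenure": 30, "total_charges": 10110})

wanted = ["tenure_months", "avg_monthly_charge"]
print(store.get_online("5678", wanted))
print(store.get_training(["1234", "5678"], wanted))
```

Both read paths go through the same stored values, which is the point of "serves data consistently for training and inference".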
Redundancy when building features.
Slow online serving (i.e. real-time predictions).
The scaling of large ML models.
Discovering new features at scale.
The monitoring of feature pipelines' health in production.
Having to provide extensive engineering support.
Problem/Goal
Get Data
Feature Store
Model
Deploy
Transform
Online
Batch
Serve
Define and Register
Share and Monitor
Spotify has JukeBox
Airbnb has Zipline
Gojek has Feast
Uber has Michelangelo Palette
Netflix has Metaflow
To measure and improve business outcomes.
To track how the evolution of our products and services affects the bottom line.
To test the effects of new products and services in different areas of the business.
To understand our customers' behaviour.
To spot dips and spikes in performance and prevent churn and turnover.
Metrics are the "shared language" for organizations to make decisions on.
To represent raw data from the real world.
To improve our products and services with machine learning.
To train machine learning models.
To understand our customers' behaviour.
To provide recommendations to consumers.
To standardize the way in which goals are tracked within the organization.
To apply well-tested software engineering best practices to our analytics functions.
To let our visualization and reporting tools do what they do best and move all metrics' logic to a single place.
To stop duplicating tables at the warehouse level at the time of metrics logic definition.
Because the number of teams doing and taking advantage of analytics across organisations continues to increase.
Because serving predictions in real-time is hard.
To reduce the latency between getting raw data, transforming it, and making a prediction.
To stop duplicating features for the same purpose.
To automatically backfill newly selected features as needed.
To detect drift between data sources.
Automate metrics/features creation.
Automatically backfill metrics/feature computation and logging.
Enable software engineering best practices.
Increase consistency between training and serving data.
Enable the sharing of metrics/features across different teams.
Reduce costs.
Increase experimental/productionization velocity.
Build trust among end-users with consistent definitions.
Abstract away the complexity from multiple data pipelines.
Can have a steep learning curve for data professionals without a coding background.
Advanced level of engineering required to set up.
Still early days which means a lot of testing and development is still in progress.
Complex feature creation in real-time is still a challenge.
It is easier to adopt for big companies.
Handling thousands of features is a challenge.
Handling large datasets can be challenging.
In both, you can define things once and use them everywhere.
Update your definitions once and the changes happen globally, which means metrics/features get backfilled.
Both act as a centralized repository of knowledge to help create value from data.
Both work best with, and are optimized for, structured data.
Both have interfaces to similar tools like JupyterLab, RStudio, etc.
Churn, defined in the metrics context, is the rate at which customers stop doing business with us. In terms of features and in the machine learning context, churn is a 1 or a 0.
month | total # | churned |
---|---|---|
12 | 7023 | 150 |
11 | 7090 | 110 |
10 | 6903 | 133 |
9 | 7541 | 98 |
8 | 7209 | 122 |
7 | 7387 | 170 |
churn rate |
---|
2.13% |
1.55% |
1.92% |
1.29% |
1.69% |
2.30% |
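The churn rate column is simply churned customers divided by total customers for each month. A short sketch using the figures from the table follows; note that the percentages on the slide appear to be truncated to two decimals rather than rounded (for month 12, 150 / 7023 is about 2.136%), so the code truncates as well to reproduce them.

```python
import math
from typing import List, Tuple

# (month, total customers, churned customers), copied from the table above.
monthly_counts: List[Tuple[int, int, int]] = [
    (12, 7023, 150),
    (11, 7090, 110),
    (10, 6903, 133),
    (9, 7541, 98),
    (8, 7209, 122),
    (7, 7387, 170),
]


def churn_rate_percent(total: int, churned: int) -> float:
    """Share of customers lost in the period, as a percentage."""
    return 100 * churned / total


def truncate(value: float, decimals: int = 2) -> float:
    """Cut off extra decimals without rounding up."""
    factor = 10 ** decimals
    return math.floor(value * factor) / factor


for month, total, churned in monthly_counts:
    rate = churn_rate_percent(total, churned)
    print(f"month {month}: {truncate(rate):.2f}%")
```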
ID | Gender | Tenure | Total Charges |
---|---|---|---|
1234 | 1 | 3 | 4150 |
5678 | 0 | 30 | 10110 |
9101 | 1 | 42 | 7133 |
1213 | 1 | 10 | 598 |
1415 | 0 | 25 | 9122 |
1617 | 1 | 13 | 2170 |
Churn |
---|
1 |
0 |
0 |
0 |
1 |
0 |
Metrics Stores
Feature Stores
Metrics allow us to track what matters for our business.
Features are the fuel of our machine learning models and they often need reshaping before we get to use them.
Metrics Stores provide us with a way to write, govern, and serve metrics in a common language.
Feature Stores abstract away the reshaping of features while providing scalability in both online serving and offline training.
If you need consistency, scalability, reusability, and a performance boost in your analytics/ML operations then adopt...
Metrics Stores and Feature Stores are like the maestro/conductor of an orchestra: the musicians can still perform without one, but you can only hope for good synchronization.