https://blogs.nvidia.com/blog/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
AI has been a topic of imagination and research since 1956, when computer scientists formally established the field.
Early machine learning included artificial neural networks, inspired by the biology of our brains: interconnections between neurons. The approach was basically laughed at until 2012.
Computer vision emerged as a major application area for machine learning, though it initially required significant hand-coding.
Deep learning, starting around 2015, made advances on all of the above. This recent AI explosion was driven by the availability of GPUs.
Graphics Processing Unit (GPU)
a specialized electronic circuit designed to accelerate the processing of images and videos. Prior to AI/ML, they were primarily used for rendering graphics in computers and gaming systems.
...
They are highly effective at parallel processing, which allows them to handle many computations simultaneously. This capability significantly speeds up tasks that involve large-scale data processing, making them essential for training complex machine learning models.
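As a toy illustration of that parallelism, here is a minimal sketch assuming PyTorch; it falls back to the CPU when no GPU is present:

```python
# Sketch: running one large matrix multiplication on a GPU if available.
# A GPU executes the many independent multiply-adds in parallel, which is
# the same property that makes it useful for training large ML models.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b  # millions of independent multiply-adds, computed in parallel on a GPU

print(f"ran on: {device}, result shape: {tuple(c.shape)}")
```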
Colloquially, systems that use any of these tend to be called "algorithms"
What is fairness to an algorithm?
due process:
a legal principle ensuring that everyone is entitled to a fair and impartial procedure before being deprived of life, liberty, or property.
...
For algorithms in society, this means that people should have the right to:
Burton, Emanuelle; Goldsmith, Judy; Mattei, Nicholas; Siler, Cory; Swiatek, Sara-Jo. Computing and Technology Ethics: Engaging through Science Fiction (pp. 117-118). MIT Press.
hungry judges
a study of parole boards in 2011 found that parole was granted nearly 65% of the time at the start of a session, barely above 0% right before a meal break, and again nearly 65% after a break
...
hungry judges are harsher judges!
What is fairness to an algorithm?
https://web.stanford.edu/class/cs182/
How can we evaluate algorithmic systems to ensure these?
Today, many (most?) ethical decisions are written in software
question: what unintended social consequences arise from the code we write?
question: can we identify potentially unethical code before it hurts people?
https://web.stanford.edu/class/cs182/
Catchphrases:
[Diagram: Training Data ("information about the world") → Model ("brain") → Prediction ("complete task based on learned information")]
In the training data, rows = individual items
question: is this a "good" model?
How could you evaluate this model to find out?
"total error"?
sure, but what are the bounds? what's an acceptable value vs. unacceptable?
more interesting measures:
on average, how much error can we expect in our prediction?
how much variation can we expect in that error?
You may recognize this problem as Linear Regression
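Here's a minimal sketch of those measures in code, using NumPy and scikit-learn; the dataset is made up purely for illustration:

```python
# Sketch of evaluating a linear regression model: fit a line, then summarize
# both the average size of the prediction error and how much that error varies.
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: x = lbs of food eaten per week, y = weight in lbs
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([8.0, 11.5, 14.0, 18.5, 21.0, 25.0])

model = LinearRegression().fit(x, y)
predictions = model.predict(x)
errors = y - predictions

print("average error (MAE):", np.mean(np.abs(errors)))  # how much error to expect
print("variation in error (std):", np.std(errors))      # how much that error varies
print("sum of squared errors:", np.sum(errors ** 2))    # one version of the "total error"
```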
image from https://paperswithcode.com/task/classification
let's take a closer look at the "classification" task
possible model?
[Scatter plot: x = lbs of food eaten, y = neediness, with regions labeled "cat zone" and "dog zone"]
question: are these "good" models?
How could you evaluate these models to find out which is better?
instead of measuring error, we can look at accuracy
we can also ask, which classes get confused with others?
[Figure: an example confusion matrix with predicted labels as columns and true labels as rows; the cells contain counts 3, 0, 1, and 2]
Off-diagonal entries along a row are called False Negatives (for that row's true class)
Off-diagonal entries down a column are called False Positives (for that column's predicted class)
Entries on the diagonal are called True Positives
This is called a Confusion Matrix
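As a minimal sketch (using scikit-learn, with made-up cat/dog labels), accuracy and the confusion matrix can be computed like this:

```python
# Sketch: computing accuracy and a confusion matrix with scikit-learn.
# The labels below are invented for illustration only.
from sklearn.metrics import accuracy_score, confusion_matrix

true_labels      = ["cat", "cat", "cat", "dog", "dog", "dog"]
predicted_labels = ["cat", "cat", "dog", "dog", "dog", "cat"]

print(accuracy_score(true_labels, predicted_labels))  # fraction predicted correctly

# rows = true labels, columns = predicted labels
print(confusion_matrix(true_labels, predicted_labels, labels=["cat", "dog"]))
```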
Suppose we are evaluating a machine learning algorithm that is trying to perform object detection in this image
We specify three objects of interest that we want the model to classify. These are our labels L
L = {tree, bicycle, shoe}
Here are the model's output predictions
Let's evaluate how the model performed!
Step 1: fill in the data table
Step 2: complete the confusion matrix (true labels as rows, predicted labels as columns)
Step 3: tally the errors
The diagonal of the confusion matrix (the correct predictions) sums to 7, out of 14 objects total.
So overall accuracy = 7/14 = 50%
...
Is this enough to describe the performance?
Summing the off-diagonal entries along each row gives the False Negatives: tree FN = 1, bicycle FN = 1, shoe FN = 5
Summing the off-diagonal entries down each column gives the False Positives: tree FP = 5, bicycle FP = 1, shoe FP = 1
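Here's a sketch of that tally in code, using NumPy. The 3x3 matrix below is hypothetical: its per-class totals match the worked example (TP = 5, 1, 1; FN = 1, 1, 5; FP = 5, 1, 1 for tree, bicycle, shoe), but the exact placement of the off-diagonal errors is an assumption for illustration.

```python
# Sketch: tallying per-class errors from a 3x3 confusion matrix with NumPy.
# rows = true labels, columns = predicted labels; values are assumed.
import numpy as np

labels = ["tree", "bicycle", "shoe"]
cm = np.array([
    [5, 0, 1],   # true tree:    5 correct, 1 missed
    [1, 1, 0],   # true bicycle: 1 correct, 1 missed
    [4, 1, 1],   # true shoe:    1 correct, 5 missed
])

tp = np.diag(cm)                 # diagonal: correct predictions per class
fn = cm.sum(axis=1) - tp         # row sums minus diagonal: false negatives
fp = cm.sum(axis=0) - tp         # column sums minus diagonal: false positives

print("accuracy:", tp.sum() / cm.sum())   # 7/14 = 0.5
for name, t, n, p in zip(labels, tp, fn, fp):
    print(f"{name}: TP={t} FN={n} FP={p}")
```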
If you're not a fan of this whole summing across the rows and columns thing...
instead you can break them down into 2x2's and then sum them in the 3rd dimension
(in each 2x2, rows are the true labels and columns are the predicted labels)

| | Positive | Negative |
|---|---|---|
| Positive | TP | FN |
| Negative | FP | ~ |

| | tree | not tree |
|---|---|---|
| tree | 5 | FN = 1 |
| not tree | FP = 5 | ~ |

| | bike | not bike |
|---|---|---|
| bike | 1 | FN = 1 |
| not bike | FP = 1 | ~ |

| | shoe | not shoe |
|---|---|---|
| shoe | 1 | FN = 5 |
| not shoe | FP = 1 | ~ |
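scikit-learn can produce these per-class 2x2 breakdowns directly. Here's a minimal sketch: the label lists are invented, but constructed so the tallies match the example above (tree: TP 5, FN 1, FP 5; bicycle: TP 1, FN 1, FP 1; shoe: TP 1, FN 5, FP 1).

```python
# Sketch: per-class 2x2 breakdowns with scikit-learn's multilabel_confusion_matrix.
from sklearn.metrics import multilabel_confusion_matrix

y_true = ["tree"] * 6 + ["bicycle"] * 2 + ["shoe"] * 6
y_pred = (["tree"] * 5 + ["shoe"]               # true trees: 5 correct, 1 missed
          + ["tree", "bicycle"]                 # true bicycles: 1 missed, 1 correct
          + ["tree"] * 4 + ["bicycle", "shoe"]) # true shoes: 5 missed, 1 correct

# One 2x2 matrix per class, in the order given by `labels`:
# [[TN, FP],
#  [FN, TP]]
per_class = multilabel_confusion_matrix(y_true, y_pred, labels=["tree", "bicycle", "shoe"])
for label, matrix in zip(["tree", "bicycle", "shoe"], per_class):
    print(label)
    print(matrix)
```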
To interpret the results, we need to know "out of how many?" for each class
We have some class imbalance
When we have class imbalance, we should report these as percentages or rates, or "out of how many?"
For reasons, we call the True Positive Rate "Recall"
Recall uses all positives as the denominator
Precision uses predicted positives as the denominator
An ideal classifier has Precision = 1 and Recall = 1
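As formulas: Recall = TP / (TP + FN) and Precision = TP / (TP + FP). Here's a quick sketch checking these against the tallies from the worked example:

```python
# Sketch: precision and recall from the worked example's per-class tallies.
# Recall    = TP / (TP + FN)  -> "out of all the real objects of this class"
# Precision = TP / (TP + FP)  -> "out of everything we predicted as this class"
counts = {
    "tree":    {"TP": 5, "FN": 1, "FP": 5},
    "bicycle": {"TP": 1, "FN": 1, "FP": 1},
    "shoe":    {"TP": 1, "FN": 5, "FP": 1},
}

for label, c in counts.items():
    recall = c["TP"] / (c["TP"] + c["FN"])
    precision = c["TP"] / (c["TP"] + c["FP"])
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
# tree: precision=0.50 recall=0.83
# bicycle: precision=0.50 recall=0.50
# shoe: precision=0.50 recall=0.17   (the table below truncates this to .16)
```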
Precision and Recall are both asking about "how many correct" but from different perspectives:
| | Tree | Bicycle | Shoe | Overall |
|---|---|---|---|---|
| Precision | .5 | .5 | .5 | .5 |
| Recall | .83 | .5 | .16 | .5 |
Now we can explain our classifier with much more descriptive language that can help others understand whether it might treat some classes differently than others!
There's one more summary metric we can compute, called F1 Score:
the harmonic mean of precision & recall
The harmonic mean, compared to the arithmetic mean, tends to mitigate the impact of large outliers and give more weight to small values

| | Tree | Bicycle | Shoe | Overall |
|---|---|---|---|---|
| Precision | .5 | .5 | .5 | .5 |
| Recall | .83 | .5 | .16 | .5 |
| F1 | .62 | .5 | .25 | .46 |
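A small sketch of that computation, using the precision and recall values from the table (and assuming the "Overall" F1 is the unweighted average across the three classes, which matches the .46 above):

```python
# Sketch: F1 as the harmonic mean of precision and recall, per class.
per_class = {
    "tree":    {"precision": 0.50, "recall": 0.83},
    "bicycle": {"precision": 0.50, "recall": 0.50},
    "shoe":    {"precision": 0.50, "recall": 0.167},
}

f1_scores = {}
for label, m in per_class.items():
    p, r = m["precision"], m["recall"]
    f1_scores[label] = 2 * p * r / (p + r)   # harmonic mean of p and r

print(f1_scores)  # roughly {'tree': 0.62, 'bicycle': 0.5, 'shoe': 0.25}
print(sum(f1_scores.values()) / len(f1_scores))  # overall (macro) F1, roughly 0.46
```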
https://en.wikipedia.org/wiki/Precision_and_recall
https://scikit-learn.org/1.5/auto_examples/model_selection/plot_precision_recall.html
https://stripe.com/en-gi/guides/primer-on-machine-learning-for-fraud-protection
[Figure: precision plotted on an axis from 0.0 to 1.0 (precision-recall trade-off plot)]
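The scikit-learn example linked above plots precision against recall as the decision threshold varies. Here's a minimal sketch of the same idea on synthetic data (the dataset and classifier choices are assumptions for illustration):

```python
# Sketch: a precision-recall curve for a score-producing classifier,
# using synthetic, imbalanced data purely for illustration.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.decision_function(X_test)

# precision vs. recall as the decision threshold sweeps from strict to lenient
PrecisionRecallDisplay.from_predictions(y_test, scores)
plt.show()
```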
https://modelcards.withgoogle.com/about
https://modelcards.withgoogle.com/object-detection
https://openai.com/index/openai-o1-system-card/
https://scikit-learn.org/1.5/auto_examples/model_selection/plot_cost_sensitive_learning.html#sphx-glr-auto-examples-model-selection-plot-cost-sensitive-learning-py
Stay tuned: these are so important for when we talk about fairness in AI
Any time someone doesn't report at least precision and recall (and ideally a full confusion matrix), it's as good as....
https://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon?page=2&tab=votes#tab-top