Calibration class project
Sheng Long
12/03/24
What did I do?
🧭 I had three guiding questions:
- Scoring rules, (financial) incentives, and calibration --- how are these concepts related? What are some implied assumptions?
- e.g. are proper scoring rules better than non-proper scoring rules as payoff functions for participants? Do participants perceive/react to them in different ways?
- What kind of behavioral research has been done related to calibration?
- 📊 What are some existing ways to graphically communicate classification performance metrics, such as calibration? How effective are they for e.g., choosing classifiers?
Scoring rules, (financial) incentives, and calibration
Winkler, Robert L., et al. "Scoring rules and the evaluation of probabilities." Test 5 (1996): 1-60.
- How are these concepts related?
- Scoring rules evaluate the "goodness" of probability forecasts (against observed frequencies)
- Calibration is one potential attribute of "goodness"
- Strictly proper scoring rules provide an incentive for forecasters to report truthfully
- Forecasters must report truthfully to maximize expected score
- Strictly proper scoring rules imply that, all other things being equal, a perfectly calibrated forecaster will earn a higher average score for probability assessment
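As a quick numerical illustration (my own sketch, not from Winkler et al.), one can check that the quadratic (Brier) score is strictly proper: the expected loss of a reported probability q, when the event truly occurs with probability p, is minimized exactly at the truthful report q = p.

```python
# Sketch: under the quadratic (Brier) loss, the expected loss
# E[(q - X)^2] with X ~ Bernoulli(p) is minimized only at q = p,
# which is what makes the rule strictly proper.

def expected_brier_loss(q: float, p: float) -> float:
    """Expected squared-error loss of report q when P(X = 1) = p."""
    return p * (q - 1.0) ** 2 + (1.0 - p) * q ** 2

p = 0.7  # true probability (illustrative value)
candidates = [i / 100 for i in range(101)]  # grid of possible reports
best_q = min(candidates, key=lambda q: expected_brier_loss(q, p))
print(best_q)  # the truthful report p minimizes expected loss
```

Expanding the expectation gives q² − 2pq + p, a parabola in q with its minimum at q = p, so misreporting can only raise expected loss.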
Scoring rules, financial incentives, and calibration
- Calibration can be viewed as one attribute of "goodness" of probability forecasts
- Other attributes include:
- discrimination, or the degree to which the probabilities discriminate between occasions on which \(A\) occurs and occasions on which \(A\) does not occur
- resolution, or the degree to which conditional means \(E(x|p)\) differ from the base rate \(E(x)\)
- unconditional bias, or the degree to which the overall mean probability \(E(p)\) differs from the base rate
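These attributes can be made concrete with a small empirical sketch (the toy data and grouping scheme below are my own, for illustration): group forecast-outcome pairs by the stated probability p, then compare the conditional means \(E(x|p)\) to the base rate \(E(x)\).

```python
# Sketch: empirical versions of the attributes listed above, assuming a
# small set of forecast-outcome pairs grouped by the stated probability p.
from collections import defaultdict

forecasts = [0.8, 0.8, 0.8, 0.3, 0.3, 0.3, 0.3, 0.3]   # stated p (toy data)
outcomes  = [1,   1,   0,   0,   0,   1,   0,   0]      # did A occur?

groups = defaultdict(list)
for p, x in zip(forecasts, outcomes):
    groups[p].append(x)

base_rate = sum(outcomes) / len(outcomes)               # E(x)
for p, xs in sorted(groups.items()):
    cond_mean = sum(xs) / len(xs)                       # E(x | p)
    print(p, cond_mean, cond_mean - base_rate)          # resolution-style gap

# unconditional bias: E(p) - E(x)
bias = sum(forecasts) / len(forecasts) - base_rate
```

Calibration would ask whether \(E(x|p) \approx p\) within each group; resolution asks how far the \(E(x|p)\) values spread away from the base rate.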
Scoring rules, financial incentives, and calibration
Camerer, Colin F., and Robin M. Hogarth. "The effects of financial incentives in experiments: A review and capital-labor-production framework." Journal of Risk and Uncertainty 19 (1999): 7-42.
- What are the effects of financial incentives in experiments?
- Camerer and Hogarth proposed a "capital-labor-production framework"
- Experiment participants come to experiments with knowledge and goals:
  - cognitive capital (e.g. heuristics, perceptual skills)
  - some objective function to maximize (e.g. maximize monetary payoff while minimizing effort)
- Different tasks have different production requirements
- some tasks require attention and diligence, others require domain-specific knowledge
- Labor = effort
Scoring rules, financial incentives, and calibration
- Camerer and Hogarth proposed a "capital-labor-production framework" and reviewed 74 experimental papers in which the level of financial performance-based incentive given to subjects was varied
- They found that
- incentives sometimes improve performance, but often don't
- incentives improve performance in easy tasks that are effort-responsive
- incentives sometimes hurt when problems are too difficult or when thinking harder makes things worse
- incentives don't affect mean performance but often reduce variance in responses
Scoring rules, financial incentives, and calibration
- In the experimental papers surveyed by Camerer and Hogarth, the following papers utilize scoring rules as payoff functions to participants:
- Phillips and Edwards (1966)
- Grether (1980)
- Wright and Aboul-Ezz (1988)
- Wallsten, Budescu, and Zwick (1993)
Scoring rules, financial incentives, and calibration
- Wright and Aboul-Ezz (1988) --- Effects of Extrinsic Incentives on the Quality of Frequency Assessments
- Participants: 51 MBA students
- Task: estimate frequency for 3 variables
- salaries, age, GMAT scores
- given a range, assign frequencies into "buckets"
Scoring rules, financial incentives, and calibration
- Wright and Aboul-Ezz (1988) --- Effects of Extrinsic Incentives on the Quality of Frequency Assessments
- Two incentive conditions
- non-performance contingent, flat rate ($3 each)
- flat rate $3 + additional amount contingent on relative accuracy of assessment (overall score = mean squared error over 27 frequency categories)
- They found that incentivized subjects did have lower absolute errors in probability than the no-incentive subjects
- However, Camerer and Hogarth argue that there is a confound in their experiment design
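A performance-contingent payment of this general shape can be sketched as follows. Note that the paper's exact bonus formula is not reproduced here: the scale and functional form below are invented purely for illustration, with only the flat $3 base and the MSE-based accuracy score taken from the description above.

```python
# Hypothetical sketch of a performance-contingent payoff in the spirit of the
# Wright and Aboul-Ezz design: a flat $3 plus a bonus that shrinks with the
# mean squared error over the assessed frequency categories. The bonus scale
# and functional form are invented for illustration, not from the paper.

def payoff(assessed: list[float], true: list[float],
           flat: float = 3.0, bonus: float = 2.0) -> float:
    """Flat rate plus an accuracy bonus; lower MSE -> larger payment."""
    mse = sum((a - t) ** 2 for a, t in zip(assessed, true)) / len(true)
    return flat + bonus * (1.0 - mse)

perfect = payoff([0.2, 0.5, 0.3], [0.2, 0.5, 0.3])  # zero error earns the full bonus
```

Whatever the exact form, the key design property is that the payment is monotone in accuracy, so more careful assessment is (weakly) more profitable.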
Scoring rules, financial incentives, and calibration
- Wright and Aboul-Ezz (1988) --- Effects of Extrinsic Incentives on the Quality of Frequency Assessments
- The confound is between the incentive and the use and explanation of the scoring rule (MSE)
- There should be four conditions:
- flat rate pay, no feedback
- flat rate pay, scoring rule feedback
- scoring rule pay, scoring rule feedback
- scoring rule pay, no feedback
Calibration-related behavioral research
A rough timeline:
- 1906 --- meteorologists have been interested in calibration as far back as 1906
- 1950 --- Glenn Brier (meteorologist) proposed an approach to verify the accuracy of probabilistic forecasts, now known as the Brier score
- 1973 --- "Subjective probability and its measurement"
- 1982 --- "Calibration of probabilities: The state of the art to 1980"
- 1983 --- "Encoding Subjective Probabilities: A Psychological and Psychometric Review"
- 1991 --- "Calibration and probability judgments: Conceptual and methodological issues" --- much of the research on calibration is "dust-bowl empiricism"
- 2004 --- "Perspectives on probability judgment calibration"
- ... and many, many more papers and stylized facts
  - e.g. the hard-easy effect --- people tend to be over-confident on hard questions but under-confident on easy questions
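Brier's verification idea is simply mean squared error between probabilistic forecasts and binary outcomes; a minimal sketch (the forecasts below are toy values):

```python
# Sketch: the Brier score for probabilistic forecasts of a binary event,
# computed as mean squared error between forecast and outcome (0 = perfect).

def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

print(brier_score([1.0, 0.0], [1, 0]))   # 0.0: perfect forecasts
print(brier_score([0.5, 0.5], [1, 0]))   # 0.25: always hedging at 0.5
```

A forecaster who always says 0.5 scores 0.25 regardless of what happens, which is exactly why the score rewards both calibration and sharpness.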
Calibration-related behavioral research
- Some concluding thoughts:
- Huge and diverse literature (statistics, meteorology, decision/risk analysis, psychology) but not much cross-fertilization
- In behavioral research on probability judgments, one cannot escape the deeply entwined normative and descriptive aspects.
- From a normative perspective:
- How should information about uncertainty be conveyed in a most accurate, coherent, and efficient way? How do we assess whether such criteria have been satisfied?
- From a descriptive perspective:
- How do people respond to probability judgments/forecasts?
- Should scoring rules be designed around known cognitive biases or try to correct for them?
Philosophical tangent: subjective probability vs. observed frequency
Graphs of classification performance metrics
- Receiver Operating Characteristic (ROC) curve
- Area under the curve (AUC)
- Precision-Recall plot
  - the three above are more related to discrimination, i.e., the ability to distinguish between different outcomes
- Calibration plot
  - reflects predictions agreeing with actual rates of outcomes
Lindhiem, Oliver, et al. "The importance of calibration in clinical psychology." Assessment 27.4 (2020): 840-854.
ROC curve
- \(x\): false positive rate
- \(y\): true positive rate
| | predicted positive | predicted negative |
|---|---|---|
| Positive | True positive | False negative |
| Negative | False positive | True negative |

confusion/error matrix
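The ROC curve is traced from these confusion-matrix counts by sweeping a decision threshold over classifier scores; a minimal sketch (the scores and labels below are toy data):

```python
# Sketch: trace ROC points by sweeping a decision threshold over scores.
# At each threshold, count true/false positives against the binary labels.

def roc_points(scores, labels):
    thresholds = sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos  # assumes both classes are present
    points = [(0.0, 0.0)]
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    return points

pts = roc_points([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])

# AUC by the trapezoid rule over the swept points
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

Here every positive scores above every negative, so the curve hugs the top-left corner and the AUC is 1.0; a random-guessing classifier would trace the diagonal with AUC 0.5.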
Wixted, John T. “The Forgotten History of Signal Detection Theory.” Journal of Experimental Psychology. Learning, Memory, and Cognition, vol. 46, no. 2, Feb. 2020, pp. 201–33. PubMed, https://doi.org/10.1037/xlm0000732.
False positive
True positive
Calibration plot
- \(x\): predicted probability
- \(y\): observed frequency
45 degree line \(\implies\) perfect calibration
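The data behind such a plot can be sketched by binning predictions and comparing mean predicted probability (x) to observed frequency (y) within each bin; the bin count and toy data below are illustrative:

```python
# Sketch: compute the points of a calibration plot by binning predictions.
# Points on the 45-degree line (mean_p == freq) indicate good calibration.

def calibration_bins(preds, outcomes, n_bins=5):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)    # clamp p = 1.0 into last bin
        bins[i].append((p, y))
    points = []
    for contents in bins:
        if contents:                             # skip empty bins
            mean_p = sum(p for p, _ in contents) / len(contents)
            freq = sum(y for _, y in contents) / len(contents)
            points.append((mean_p, freq))
    return points

pts = calibration_bins([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1], n_bins=2)
```

In this toy example both bins land on the diagonal (0.1 predicted, 0 observed; 0.9 predicted, 1 observed), so the classifier looks well calibrated at this resolution.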
To ground things more concretely, suppose we built classifiers for detecting ADHD in children.
🤔 How calibrated is this classifier?
Over-extremity: predictions are consistently too close to 0 or 1
Calibration plot
- \(x\): predicted probability
- \(y\): observed frequency
Over-prediction: predicted probability consistently greater than observed proportion
🤔 How calibrated is this classifier?
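A quick numerical check for this pattern (toy numbers, my own sketch) is to compare the mean predicted probability against the observed base rate:

```python
# Sketch: detect the over-prediction pattern by comparing the mean predicted
# probability to the observed base rate; a positive gap means predictions
# run consistently above the observed frequency.

def overprediction_gap(preds: list[float], outcomes: list[int]) -> float:
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)

gap = overprediction_gap([0.8, 0.7, 0.9, 0.6], [1, 0, 1, 0])  # base rate 0.5
```

This only captures unconditional bias; a full calibration plot is still needed to see bin-by-bin patterns such as over-extremity.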
Calibration plot
- \(x\): predicted probability
- \(y\): observed frequency
Benefits of graphical approaches:
- can easily spot mis-calibration by looking for the deviation from the diagonal
Calibration plot --- Quiz time!
❓Which model(s) have good discrimination and which have good calibration?
A: good discrimination, good calibration
B: good discrimination, bad calibration
C: bad discrimination, good calibration
D: bad discrimination, bad calibration
To summarize ...
What did I do?
- identified key survey papers on behavioral research related to calibration
- identified experimental papers that utilize scoring rules as payoff functions
- explored visualization types for classification performance
Next steps?
- Experiment design --- how do we separate the impact of financial incentives from that of the scoring rule?
- Scoring rules that account for task difficulty
- Behavioral assumptions to extend the rational-agent benchmark framework (e.g., prospect theory)