IMS-APRM 2026
The objective of binary classification is to categorize each instance into one of two classes.
The performance of a classifier \(f\) is measured by its risk (misclassification error):
$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$
where \(\mathbf{1}(\cdot)\) is an indicator function.
Aim. To obtain the Bayes classifier, i.e. the best classifier:
$$R(\hat{f}_n) \to R(f^*), \quad f^* := \argmin_f R(f)$$
Due to the discontinuity of the indicator function:
$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$
the zero-one loss is usually replaced by a convex and classification-calibrated loss \(\phi\) to facilitate the empirical computation (Lin, 2004; Zhang, 2004; Bartlett et al., 2006):
$$ R_{\phi}(f) = \mathbb{E}( \phi(Y f(\mathbf{X})) ) $$
For example, the hinge loss for SVM, exponential loss for AdaBoost, and logistic loss for logistic regression all follow this framework.
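As a small illustration of the surrogate framework, the sketch below evaluates the empirical zero-one risk and three surrogate \(\phi\)-risks on the same margins \(y f(\mathbf{x})\). The toy labels and scores, and the helper name `empirical_risk`, are assumptions made for this example only.

```python
import numpy as np

def zero_one(z):
    # 1(z <= 0): the discontinuous loss behind R(f)
    return (z <= 0).astype(float)

def hinge(z):
    # hinge loss (SVM): max(0, 1 - z)
    return np.maximum(0.0, 1.0 - z)

def exponential(z):
    # exponential loss (AdaBoost): exp(-z)
    return np.exp(-z)

def logistic(z):
    # logistic loss: log(1 + exp(-z)), computed stably
    return np.logaddexp(0.0, -z)

def empirical_risk(loss, y, scores):
    # average of loss(y_i * f(x_i)) over the sample
    return float(np.mean(loss(y * scores)))

# Hypothetical labels in {-1, +1} and real-valued scores f(x_i)
y = np.array([1.0, -1.0, 1.0, -1.0])
scores = np.array([2.0, -0.5, -1.0, 0.3])

losses = {"zero_one": zero_one, "hinge": hinge,
          "exponential": exponential, "logistic": logistic}
risks = {name: empirical_risk(fn, y, scores) for name, fn in losses.items()}
```

Every loss is a function of the margin only, which is what lets the three methods share one framework.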
(If we optimize with respect to \(\phi\), will the resulting solution still be the function \(f^*\) that we need?)
This is why the loss \(\phi\) must be calibrated (consistent).
A portrait of Sir Ronald Aylmer Fisher. Image from the MacTutor History of Mathematics Archive.
The cover page of Ronald A. Fisher (1922), where the definition of consistency is given.
(ERM via a surrogate loss).
$$ \hat{f}_{\phi} = \argmin_f \mathbb{E}_n \phi\big(Y f(\mathbf{X}) \big) $$
$$ f_\phi = \argmin_f \mathbb{E} \phi\big(Y f(\mathbf{X}) \big)$$
$$Acc \big ( \hat{\delta}(F_n) \big) \xrightarrow{\mathbb{P}} Acc \big( \delta(F) \big) = \max_{\delta} Acc(\delta)$$
Fisher consistency (FC) leads to conditions for "consistent" surrogate losses
FC: The population minimizer of \(R_\phi\) also minimizes \(R\)
Definition 1 (Bartlett et al. (2004)). A loss function \(\phi(\cdot)\) is classification-calibrated if, for every sequence of measurable functions \(f_n\) and every probability distribution on \( \mathcal{X} \times \{\pm 1\}\),
$$R_{\phi}(f_n) \to \inf_{f} R_{\phi}(f) \ \text{ implies that } \ R(f_n) \to \inf_{f} R(f).$$
A calibrated loss function \(\phi\) guarantees that any sequence \(f_n\) that optimizes \(R_\phi\) will eventually also optimize \(R\), thereby ensuring consistency in maximizing classification accuracy.
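A rough numerical check of this idea, under assumptions not in the slides: write \(\eta = \mathbb{P}(Y = 1 \mid \mathbf{X} = \mathbf{x})\) (notation introduced here), minimize the pointwise \(\phi\)-risk \(\eta \phi(\alpha) + (1 - \eta) \phi(-\alpha)\) over a grid of \(\alpha\), and compare the sign of the minimizer with the Bayes rule \(\text{sign}(2\eta - 1)\). The logistic loss is used purely as an example.

```python
import numpy as np

def logistic(z):
    # logistic loss: log(1 + exp(-z))
    return np.logaddexp(0.0, -z)

def conditional_minimizer(loss, eta, grid):
    # pointwise phi-risk at a point where P(Y = 1 | X = x) = eta
    cond_risk = eta * loss(grid) + (1.0 - eta) * loss(-grid)
    return float(grid[np.argmin(cond_risk)])

grid = np.linspace(-5.0, 5.0, 2001)   # candidate values of f(x)
etas = [0.1, 0.3, 0.7, 0.9]           # hypothetical class probabilities
minimizers = [conditional_minimizer(logistic, eta, grid) for eta in etas]

# The surrogate minimizer should vote for the same class as the Bayes rule.
signs_agree = all(np.sign(a) == np.sign(2 * eta - 1)
                  for a, eta in zip(minimizers, etas))
```

For the logistic loss the minimizer is the log-odds \(\log(\eta / (1 - \eta))\), so the grid search should land close to that value.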
A series of studies (Lin, 2001; Zhang, 2004; Bartlett et al., 2004) culminates in the following necessary and sufficient conditions for FC:
Theorem 3.1 in Lin (2001) (informal). Consider a loss function \(\phi\) satisfying the following conditions:
$$(1) \quad \phi(z) < \phi(-z), \ \forall z > 0; \qquad (2) \quad \phi'(0) \neq 0 \text{ exists}.$$
Then \(\phi\) is Fisher consistent.
(A closely related version appears as Theorem 4.1 in Zhang, 2004.)
Theorem 2 in Bartlett et al. (2004) (informal). Let \(\phi\) be convex. Then \(\phi\) is classification-calibrated iff it is differentiable at 0 and \(\phi'(0) < 0\).
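The convex criterion can be probed numerically. The sketch below is a heuristic only: it compares one-sided finite differences at 0 and assumes (without checking) that each loss is convex. The helper name `looks_calibrated`, the step size, and the tolerance are illustrative choices.

```python
import numpy as np

def looks_calibrated(loss, h=1e-6, tol=1e-4):
    # One-sided difference quotients approximate the left/right derivatives at 0.
    right = (loss(h) - loss(0.0)) / h
    left = (loss(0.0) - loss(-h)) / h
    differentiable_at_zero = abs(right - left) < tol
    # Convex criterion: differentiable at 0 with a strictly negative slope.
    return bool(differentiable_at_zero and right < -tol)

convex_losses = {
    "hinge": lambda z: np.maximum(0.0, 1.0 - z),
    "logistic": lambda z: np.logaddexp(0.0, -z),
    "exponential": lambda z: np.exp(-z),
    "squared": lambda z: z ** 2,                  # phi'(0) = 0
    "perceptron": lambda z: np.maximum(0.0, -z),  # kink at 0
}
results = {name: looks_calibrated(fn) for name, fn in convex_losses.items()}
```

The last two entries are included as negative examples: one has zero slope at 0 and the other is not differentiable there.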
FC in Multiclass. Lee, Lin and Wahba (2012); Liu (2007); Zou, Zhu and Hastie (2008)
(i) Select a convex and calibrated (CC) loss function \(\phi\).
(ii) Directly minimize the empirical \(\phi\)-risk to obtain \(\widehat{f}_n\):
$$ \widehat{f}_{n} = \argmin_{f \in \mathcal{F}} \ \widehat{R}_{\phi} (f), \quad \widehat{R}_{\phi} (f) := \frac{1}{n} \sum_{i=1}^n \phi \big( y_i f(\mathbf{x}_i) \big).$$
(SGD is widely adopted for its scalability and generalization when dealing with large-scale datasets and DL models)
$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$
$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} y_i \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$
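A minimal sketch of this update, assuming a linear model \(f_{\pmb{\theta}}(\mathbf{x}) = \pmb{\theta}^\top \mathbf{x}\) (so \(\nabla_{\pmb{\theta}} f = \mathbf{x}\)) and the logistic loss. The synthetic data, learning rate, and batch size are placeholders, not settings from the talk.

```python
import numpy as np

def logistic_grad(z):
    # derivative of log(1 + exp(-z)) with respect to z
    return -1.0 / (1.0 + np.exp(z))

def sgd_step(theta, X_batch, y_batch, loss_grad, lr):
    margins = y_batch * (X_batch @ theta)          # y_i * f_theta(x_i)
    weights = y_batch * loss_grad(margins)         # y_i * phi'(y_i f_theta(x_i))
    grad = (weights[:, None] * X_batch).mean(axis=0)  # batch mean, grad f = x_i
    return theta - lr * grad

# Synthetic, linearly separable data (an assumption for the demo)
rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

theta = np.zeros(d)
for _ in range(500):
    idx = rng.choice(n, size=32, replace=False)    # minibatch I_B
    theta = sgd_step(theta, X[idx], y[idx], logistic_grad, lr=0.5)

train_acc = float(np.mean(y * (X @ theta) > 0))
```

Only `loss_grad` depends on the choice of \(\phi\); the rest of the step is the same for any margin-based loss.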
The ERM paradigm with calibrated losses, when combined with ML/DL models and optimized using SGD, has achieved tremendous success in numerous real-world applications.
However, in practical applications, determining which loss function performs better is a very challenging problem, as it is typically unknown and can vary across datasets.
| Model | BCE > Hinge | Hinge > BCE |
|---|---|---|
| ResNet34 | 3 | 42 |
| ResNet50 | 26 | 19 |
| VGG16 | 9 | 36 |
| VGG19 | 13 | 32 |
We examine two of the most popular losses: BCE and hinge. Experimental results on 45 CIFAR2 datasets (the table counts, for each model, the datasets on which each loss outperforms the other) confirm that neither loss is consistently superior across models and datasets.
Current approaches in theoretical analysis include:
Current limitations in theoretical analysis are:
Wortsman, et al. "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." ICML, 2022.
Limitation
Computational Intensity: Bagging requires training multiple base models (often hundreds), making it computationally expensive compared to single models.
Some methods require additional validation sets.
Let's Dropout!
Srivastava, et al. "Dropout: a simple way to prevent neural networks from overfitting." JMLR, 2014.
Inspired by Dropout
(model ensemble over one training process)
$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$
$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} y_i \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$
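One way to picture "an ensemble of losses within a single training run" is to let the loss derivative \(\partial \phi\) change from minibatch to minibatch. The sketch below is an illustrative assumption, not the loss-generation scheme of EnsLoss itself: it draws \(\partial \phi\) uniformly from a small hand-picked pool of convex, calibrated losses and applies it to a linear model on synthetic data. The pool, the name `random_loss_sgd`, and all hyperparameters are hypothetical.

```python
import numpy as np

# Derivatives of three convex, calibrated losses (each has phi'(0) < 0).
# This fixed pool is an assumption for illustration only.
LOSS_GRADS = {
    "logistic": lambda z: -1.0 / (1.0 + np.exp(z)),
    "exponential": lambda z: -np.exp(-z),
    "hinge": lambda z: np.where(z < 1.0, -1.0, 0.0),  # a subgradient
}

def random_loss_sgd(X, y, lr=0.1, batch_size=32, steps=300, seed=0):
    rng = np.random.default_rng(seed)
    names = list(LOSS_GRADS)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        # Draw a fresh loss derivative for this minibatch.
        loss_grad = LOSS_GRADS[names[rng.integers(len(names))]]
        margins = y[idx] * (X[idx] @ theta)
        weights = y[idx] * loss_grad(margins)   # y_i * dphi(y_i f(x_i))
        theta = theta - lr * (weights[:, None] * X[idx]).mean(axis=0)
    return theta

# Synthetic, linearly separable data (an assumption for the demo)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)
theta = random_loss_sgd(X, y)
train_acc = float(np.mean(y * (X @ theta) > 0))
```

Because every loss in the pool is convex and calibrated, each individual step still pushes toward a classifier that is consistent for the zero-one risk.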
CIFAR
We construct binary CIFAR (CIFAR2) by selecting all possible pairs of classes from CIFAR10, resulting in 10 x 9 / 2 = 45 CIFAR2 datasets.
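The pair construction can be sketched as follows. Which class of a pair is mapped to \(+1\) is an arbitrary choice here, and the helper `make_binary_subset` and the toy labels are hypothetical.

```python
from itertools import combinations

import numpy as np

NUM_CLASSES = 10  # CIFAR10 has ten classes
class_pairs = list(combinations(range(NUM_CLASSES), 2))  # unordered pairs

def make_binary_subset(labels, neg_class, pos_class):
    # Keep only the samples of the two classes and relabel them to {-1, +1}.
    labels = np.asarray(labels)
    keep = np.flatnonzero((labels == neg_class) | (labels == pos_class))
    y = np.where(labels[keep] == pos_class, 1, -1)
    return keep, y

# Toy label vector standing in for the CIFAR10 targets
toy_labels = [0, 3, 5, 3, 0, 9]
keep, y = make_binary_subset(toy_labels, neg_class=0, pos_class=3)
```

Here `len(class_pairs)` recovers the 45 datasets, and `keep` indexes the images that belong to one binary task.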
PCam
PCam is a binary image classification dataset comprising 327,680 96x96 images from histopathologic scans of lymph node sections
OpenML
We applied the following filter:
n >= 1000
d >= 1000
at least one official run
resulting in 14 datasets,
with no cherry-picking of datasets.
[Results: CIFAR2, OpenML, PCam]
EnsLoss is a more desirable option than fixed losses on image data;
and it is a viable option worth considering on tabular data.
Epoch-level performance
Compatibility with overfitting-prevention methods
EnsLoss consistently outperforms the fixed losses across epochs;
and it is compatible with other overfitting-prevention methods, and their combination yields additional improvement.
The primary motivation behind EnsLoss consists of two components: “ensemble” and the “CC” (convex and calibrated) property of the loss functions.
Dai, B. (2025). EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification. ICML.