ensLoss_full

EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification

Ben Dai (CUHK)

IMS-APRM 2026

Background

The objective of binary classification is to categorize each instance into one of two classes.

Data: $ \mathbf{X} \in \mathbb{R}^d \to Y \in \{-1,+1\} $
Classifier: $ f(\mathbf{X}): \mathbb{R}^d \to \mathbb{R} $
Predicted label: $ \widehat{Y} = \text{sgn}(f(\mathbf{X})) $
Evaluation via Misclassification error (risk):

where $\mathbf{1}(\cdot)$ is an indicator function.

Aim. To obtain the Bayes classifier or the best classifier:

$$R(\hat{f}_n) \to R(f^*) \quad f^* := \argmin R(f)$$

$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$

Background

Due to the discontinuity of the indicator function:

$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$

the zero-one loss is usually replaced by a convex and classification-calibrated loss $\phi$ to facilitate the empirical computation (Lin, 2004; Zhang, 2004; Bartlett et al., 2006):

$$ R_{\phi}(f) = \mathbb{E}( \phi(Y f(\mathbf{X})) ) $$

For example, the hinge loss for SVM, exponential loss for AdaBoost, and logistic loss for logistic regression all follow this framework.

Background

Due to the discontinuity of the indicator function:

$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$

(If we optimize with respect to $\phi$, will the resulting solution still be the function $f^*$ that we need?)

That's why we need the loss $\phi$ to be calibrated/consistent?

the zero-one loss is usually replaced by a convex and classification-calibrated loss $\phi$ to facilitate the empirical computation (Lin, 2004; Zhang, 2004; Bartlett et al., 2006):

$$ R_{\phi}(f) = \mathbb{E}( \phi(Y f(\mathbf{X})) ) $$

For example, the hinge loss for SVM, exponential loss for AdaBoost, and logistic loss for logistic regression all follow this framework.

Fisher Consistency

A portrait of Sir Ronald Aylmer Fisher. Image from the MacTutor History of Mathematics Archive.

The cover page of Ronald A. Fisher (1922), where definition of consistency is provided.

Fisher Consistency

(ERM via a surrogate loss).

Obtain $ \hat{f} $ from a ERM with a surrogate loss ($ \hat{\delta} = \mathbb{I}(\hat{f} \geq 0) $)

$$ \hat{f}_{\phi} = \argmin_f \mathbb{E}_n \phi\big(Y f(\mathbf{X}) \big) $$

$$Acc \big ( \hat{\delta}(F_n) \big) \xrightarrow{\mathbb{P}} Acc \big( \delta(F) \big) = \max_{\delta} Acc(\delta)$$

Fisher Consistency

(ERM via a surrogate loss).

Obtain $ \hat{f} $ from a ERM with a surrogate loss ($ \hat{\delta} = \mathbb{I}(\hat{f} \geq 0) $)

$$ \hat{f}_{\phi} = \argmin_f \mathbb{E}_n \phi\big(Y f(\mathbf{X}) \big) $$

$$ f_\phi = \argmin_f \mathbb{E} \phi\big(Y f(\mathbf{X}) \big)$$

$$Acc \big ( \hat{\delta}(F_n) \big) \xrightarrow{\mathbb{P}} Acc \big( \delta(F) \big) = \max_{\delta} Acc(\delta)$$

FC leads to conditions for "consistent" surrogate losses

(If we optimize with respect to $\phi$, will the resulting solution still be the function $f^*$ that we need?)

FC: The population minimizer of $R_\phi$ also minimizes $R$

Fisher Consistency

Definition 1 (Bartlett et al. (2004)). A loss function $\phi(\cdot)$ is classification-calibrated, if for every sequence of measurable function $f_n$ and every probability distribution on $ \mathcal{X} \times \{\pm 1\}$,

$$R_{\phi}(f_n) \to \inf_{f} R_{\phi}(f) \ \text{ implies that } \ R(f_n) \to \inf_{f} R(f).$$

A calibrated loss function $\phi$ guarantees that any sequence $f_n$ that optimizes $R_\phi$ will eventually also optimize $R$, thereby ensuring consistency in maximizing classification accuracy.

A series of studies (Lin, 2001; Zhang, 2004; Bartlett et al., 2004) culminates in the following theorem for iff conditions of FC:

Fisher Consistency

A series of studies (Lin, 2001; Zhang, 2004; Bartlett et al., 2004) culminates in the following theorem for iff conditions of FC:

Theorem 3.1 in Lin. (2001) (informal) Consider a loss function $\phi$ satisfying the following conditions:

$$(1) \quad \phi(z) < \phi(-z), \forall z >0; \qquad (2) \quad \phi'(z) \neq 0 \text{ exits}.$$

Then $\phi$ is Fisher consistent.

Fisher Consistency

Theorem 3.1 in Lin. (2001) (informal) Consider a loss function $\phi$ satisfying the following conditions:

$$(1) \quad \phi(z) < \phi(-z), \forall z >0; \qquad (2) \quad \phi'(z) \neq 0 \text{ exits}.$$

Then $\phi$ is Fisher consistent.

(A closely related version appears as Theorem 4.1 in Zhang, 2004.)

Theorem 2 in Bartlett et al. (2004) (informal) Let $\phi$ be convex. Then $\phi$ is consistency iff it is differentiable at 0 and $\phi'(0) < 0$.

A series of studies (Lin, 2001; Zhang, 2004; Bartlett et al., 2004) culminates in the following theorem for iff conditions of FC:

Fisher Consistency

Theorem 3.1 in Lin. (2001) (informal) Consider a loss function $\phi$ satisfying the following conditions:

$$(1) \quad \phi(z) < \phi(-z), \forall z >0; \qquad (2) \quad \phi'(z) \neq 0 \text{ exits}.$$

Then $\phi$ is Fisher consistent.

(A closely related version appears as Theorem 4.1 in Zhang, 2004.)

Theorem 2 in Bartlett et al. (2004) (informal) Let $\phi$ be convex. Then $\phi$ is consistency iff it is differentiable at 0 and $\phi'(0) < 0$.

A series of studies (Lin, 2001; Zhang, 2004; Bartlett et al., 2004) culminates in the following theorem for iff conditions of FC:

FC in Multiclass. Lee, Lin and Wahba (2012); Liu (2007); Zou, Zhu and Hastie (2008)

Classification ERM framework

(i) Select a convex and calibrated (CC) loss function $\phi$

Classification ERM framework

$$ \widehat{f}_{n} = \argmin_{f \in \mathcal{F}} \ \widehat{R}_{\phi} (f), \quad \widehat{R}_{\phi} (f) := \frac{1}{n} \sum_{i=1}^n \phi \big( y_i f(\mathbf{x}_i) \big).$$

(i) Select a convex and calibrated (CC) loss function $\phi$

(ii) Directly minimizes the ERM of $R_\phi$ to obtain $f_n$

(SGD is widely adopted for its scalability and generalization when dealing with large-scale datasets and DL models)

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} y_i \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$

The ERM paradigm with calibrated losses, when combined with ML/DL models and optimized using SGD, has achieved tremendous success in numerous real-world applications.

Loss selection in Practice

However, in practical applications, determining which loss function performs better is a very challenging problem, as it is typically unknown and can vary across datasets.

	BCE > Hinge	Hinge > BCE
ResNet34	3	42
ResNet50	26	19
VGG16	9	36
VGG19	13	32

We examine two of the most popular losses: BCE and Hinge loss. We provide experimental results on 45 CIFAR2 datasets, which also confirms that the superiority of the loss function is not consistent across different models / datasets.

Loss selection in Practice

However, in practical applications, determining which loss function performs better is a very challenging problem, as it is typically unknown and can vary across datasets.

	BCE > Hinge	Hinge > BCE
ResNet34	3	42
ResNet50	26	19
VGG16	9	36
VGG19	13	32

Current approaches in theoretical analysis include:

Convergence rate of $R_{\phi}(\hat{f}_n) \to R^*_{\phi}$
Excess risk bounds (how $R_{\phi}(\hat{f}_n) \to R^*_{\phi}$ implies $R(\hat{f}_n) \to R^*$)

Current limitations in theoretical analysis are:

Rates are difficult to characterize under finite sample situations
Most existing theories are distribution-free, whereas in practice data is given but its distribution is unknown

Ensemble idea

Mitchell, et al. "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." ICML, 2022.

Ensemble idea

Mitchell, et al. "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." ICML, 2022.

Limitation

Computational Intensity: Bagging requires training multiple base models (often hundreds), making it computationally expensive compared to single models.
Some methods require additional validation sets.

Ensemble idea

Mitchell, et al. "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." ICML, 2022.

Limitation

Computational Intensity: Bagging requires training multiple base models (often hundreds), making it computationally expensive compared to single models.
Some methods require additional validation sets.

Let's Dropout!

Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." JMLR, 2014

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

EnsLoss: Calibrated Loss Ensembles

Inspired by Dropout

(model ensemble over one training process)

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

EnsLoss: Calibrated Loss Ensembles

Inspired by Dropout

(model ensemble over one training process)

$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$

EnsLoss: Calibrated Loss Ensembles

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$