EnsLoss: Stochastic Calibrated Loss Ensembles for Preventing Overfitting in Classification

 

 

Ben Dai (CUHK)

ICML 2025

Background

The objective of binary classification is to categorize each instance into one of two classes.

  • Data: feature \( \mathbf{X} \in \mathbb{R}^d \) and label \( Y \in \{-1,+1\} \)
  • Classifier: \( f: \mathbb{R}^d \to \mathbb{R} \)
  • Predicted label: \( \widehat{Y} = \text{sgn}(f(\mathbf{X})) \)
  • Evaluation via misclassification error (risk):

$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}\big( \mathbf{1}(Y f(\mathbf{X}) \leq 0) \big),$$

where \(\mathbf{1}(\cdot)\) is the indicator function.

Aim. To obtain the Bayes classifier, i.e., the best possible classifier:

$$f^* := \argmin_{f} R(f).$$

Background

Due to the discontinuity of the indicator function:

$$R(f) = 1 - \text{Acc}(f) = \mathbb{E}( \mathbf{1}(Y f(\mathbf{X}) \leq 0) ),$$

the zero-one loss is usually replaced by a convex and classification-calibrated loss \(\phi\) to facilitate the empirical computation (Lin, 2004; Zhang, 2004; Bartlett et al., 2006):

$$ R_{\phi}(f) = \mathbb{E}( \phi(Y f(\mathbf{X})) ) $$

For example, the hinge loss for SVM, exponential loss for AdaBoost, and logistic loss for logistic regression all follow this framework.
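As a quick illustration (a minimal sketch, not from the slides), these three surrogates can be evaluated on the margin \(z = y f(\mathbf{x})\); each is convex, decreasing, and penalizes negative margins:

```python
import numpy as np

# classical convex surrogates phi(z) of the 0-1 loss, with z = y * f(x)
hinge = lambda z: np.maximum(0.0, 1.0 - z)   # SVM
exponential = lambda z: np.exp(-z)           # AdaBoost
logistic = lambda z: np.log1p(np.exp(-z))    # logistic regression

z = np.linspace(-2.0, 2.0, 5)  # margins; z <= 0 means misclassified
for name, phi in [("hinge", hinge), ("exponential", exponential), ("logistic", logistic)]:
    # loss values shrink as the margin grows
    print(f"{name:>11}:", np.round(phi(z), 3))
```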

Background

If we minimize the surrogate risk \( R_{\phi}(f) = \mathbb{E}( \phi(Y f(\mathbf{X})) ) \) instead of \(R(f)\), will the resulting solution still be the function \(f^*\) that we need?

This is exactly why we need the loss \(\phi\) to be classification-calibrated.

Background

Definition 1 (Bartlett et al. (2006)). A loss function \(\phi(\cdot)\) is classification-calibrated if, for every sequence of measurable functions \(f_n\) and every probability distribution on \( \mathcal{X} \times \{\pm 1\}\),

$$R_{\phi}(f_n) \to \inf_{f} R_{\phi}(f) \ \text{ implies that } \ R(f_n) \to \inf_{f} R(f).$$

A calibrated loss function \(\phi\) guarantees that any sequence \(f_n\) that optimizes \(R_\phi\) will eventually also optimize \(R\), thereby ensuring consistency in maximizing classification accuracy.

Background

A series of studies (Lin, 2004; Zhang, 2004; Bartlett et al., 2006) culminates in the following theorem giving iff conditions for calibration:

Theorem 1 (Bartlett et al. (2006)). Let \(\phi\) be convex. Then \(\phi\) is classification-calibrated if and only if it is differentiable at 0 and \(\phi'(0) < 0\).
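Theorem 1 can be sanity-checked numerically; the finite-difference helper below is an illustrative sketch, not from the talk:

```python
import numpy as np

def dphi_at_zero(phi, h=1e-6):
    """Central-difference estimate of phi'(0)."""
    return (phi(h) - phi(-h)) / (2.0 * h)

# convex losses; each is differentiable at 0 (the hinge's kink is at z = 1, not 0)
losses = {
    "hinge": lambda z: max(0.0, 1.0 - z),
    "exponential": lambda z: np.exp(-z),
    "logistic": lambda z: np.log1p(np.exp(-z)),
}

# Theorem 1: a convex phi is calibrated iff phi'(0) < 0 -- all three pass
grads = {name: dphi_at_zero(phi) for name, phi in losses.items()}
```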

Classification ERM framework

(i) Select a convex and calibrated (CC) loss function \(\phi\);

(ii) Directly minimize the empirical risk of \(R_\phi\) to obtain \(\widehat{f}_n\):

$$ \widehat{f}_{n} = \argmin_{f \in \mathcal{F}} \ \widehat{R}_{\phi} (f), \quad \widehat{R}_{\phi} (f) := \frac{1}{n} \sum_{i=1}^n \phi \big( y_i f(\mathbf{x}_i) \big).$$

(SGD is widely adopted for its scalability and generalization on large-scale datasets and deep learning models:)

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \, y_i \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$

The ERM paradigm with calibrated losses, when combined with ML/DL models and optimized using SGD, has achieved tremendous success in numerous real-world applications.
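The second line of the update is the first one after applying the chain rule, \(\nabla_{\pmb{\theta}}\phi(y f_{\pmb{\theta}}(\mathbf{x})) = \partial\phi(y f_{\pmb{\theta}}(\mathbf{x}))\, y\, \nabla_{\pmb{\theta}} f_{\pmb{\theta}}(\mathbf{x})\). A quick numerical check on a linear model \(f_{\pmb{\theta}}(\mathbf{x}) = \pmb{\theta}^\top \mathbf{x}\) (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)       # one sample
theta = rng.normal(size=3)   # linear model: f_theta(x) = theta . x
yi = -1.0                    # its label

phi = lambda z: np.log1p(np.exp(-z))       # logistic loss
dphi = lambda z: -1.0 / (1.0 + np.exp(z))  # phi'(z)

# chain rule: grad_theta phi(y * f_theta(x)) = phi'(y * f_theta(x)) * y * x
analytic = dphi(yi * (theta @ x)) * yi * x

# central finite differences of theta |-> phi(y * f_theta(x))
h = 1e-6
numeric = np.array([
    (phi(yi * ((theta + h * e) @ x)) - phi(yi * ((theta - h * e) @ x))) / (2.0 * h)
    for e in np.eye(3)
])
```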

EnsLoss: Calibrated Loss Ensembles

Inspired by Dropout (model ensemble over one training process): instead of fixing a single loss \(\phi\), EnsLoss draws a random calibrated loss for each SGD batch, yielding an ensemble of losses within one training run.

$$\pmb{\theta}^{(t+1)} = \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \nabla_{\pmb{\theta}} \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) )$$

$$= \pmb{\theta}^{(t)} - \gamma \frac{1}{B} \sum_{i \in \mathcal{I}_B} \partial \phi( y_{i} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}) ) \, y_i \nabla_{\pmb{\theta}} f_{\pmb{\theta}^{(t)}}(\mathbf{x}_{i}),$$

Note that only \(\partial\phi\) enters the SGD update, so it suffices to randomize the loss derivative per batch.
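The resulting training loop can be sketched as follows. This is a minimal illustration of the idea only: it samples each batch's loss derivative uniformly from a small fixed dictionary of CC losses, whereas EnsLoss itself generates random calibrated losses by a different scheme given in the paper.

```python
import numpy as np

# derivatives phi'(z) of convex classification-calibrated losses
# (each phi is differentiable at 0 with phi'(0) < 0, per Theorem 1)
LOSS_GRADS = {
    "logistic": lambda z: -1.0 / (1.0 + np.exp(z)),
    "exponential": lambda z: -np.exp(-np.clip(z, -30.0, 30.0)),
    "squared_hinge": lambda z: -2.0 * np.maximum(0.0, 1.0 - z),
}

def ensloss_sgd(X, y, epochs=50, batch=32, lr=0.1, seed=0):
    """SGD for a linear f where each mini-batch draws a random CC loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch):
            I = idx[start:start + batch]
            z = y[I] * (X[I] @ theta)             # margins y_i f(x_i)
            name = rng.choice(list(LOSS_GRADS))   # sample a loss for THIS batch
            g = LOSS_GRADS[name](z)               # phi'(y_i f(x_i))
            # chain rule: grad_theta = phi'(y f) * y * x
            theta -= lr * (g * y[I]) @ X[I] / len(I)
    return theta

# toy usage on a synthetic linear problem
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=500))
theta = ensloss_sgd(X, y)
acc = np.mean(np.sign(X @ theta) == y)
```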

EnsLoss

Experiments

CIFAR

We construct binary CIFAR (CIFAR2) datasets by selecting all possible class pairs from CIFAR10, resulting in 10 × 9 / 2 = 45 CIFAR2 datasets.
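The pair count can be reproduced directly (a trivial sketch):

```python
from itertools import combinations

classes = range(10)                      # the 10 CIFAR10 classes
pairs = list(combinations(classes, 2))   # all unordered class pairs
print(len(pairs))                        # 45 binary CIFAR2 tasks
```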

PCam

PCam is a binary image classification dataset comprising 327,680 histopathologic scan images (96×96) of lymph node sections.

Experiments

OpenML

We applied a filtering to the OpenML datasets:

  • \( n \geq 1000 \)
  • \( d \geq 1000 \)
  • at least one official run

resulting in 14 datasets.
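The filter itself is straightforward; the sketch below applies it to hypothetical metadata records (the field names `n`, `d`, `runs` and all values are illustrative, not the OpenML API):

```python
# hypothetical dataset metadata; values are made up for illustration
datasets = [
    {"name": "A", "n": 5000,  "d": 2000, "runs": 3},
    {"name": "B", "n": 800,   "d": 5000, "runs": 1},  # fails n >= 1000
    {"name": "C", "n": 2000,  "d": 500,  "runs": 2},  # fails d >= 1000
    {"name": "D", "n": 1500,  "d": 1200, "runs": 0},  # fails "official run"
    {"name": "E", "n": 10000, "d": 3000, "runs": 10},
]

kept = [m["name"] for m in datasets
        if m["n"] >= 1000 and m["d"] >= 1000 and m["runs"] >= 1]
print(kept)  # ['A', 'E']
```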

(Results figures: CIFAR2, OpenML, PCam.)

EnsLoss is a more desirable option than fixed losses on image data, and a viable option worth considering on tabular data.

Epoch-level performance

Compatibility with overfitting-prevention methods

EnsLoss consistently outperforms fixed losses across epochs; it is also compatible with other overfitting-prevention methods, and their combination yields additional improvement.

Summary

The primary motivation behind EnsLoss consists of two components: the "ensemble" and the "CC" (convexity and calibration) properties of the loss functions.

 

This concept can be extensively applied to various ML problems, by identifying the specific conditions for loss consistency or calibration.

Thank you!
