Ronald Fisher (1890 - 1962)
A statistic is said to be a consistent estimate of any parameter, if when calculated from an indefinitely large sample it tends to be accurately equal to that parameter.
- Fisher (1925) Theory of Statistical Estimation
\( \hat{\theta}(X_1, \cdots, X_n) \xrightarrow{\mathbb{P}} \theta_0, \quad n \to \infty \)
Probability Consistency (PC)
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics.
Consistency -- A statistic satisfies the criterion of consistency, if, when it is calculated from the whole population, it is equal to the required parameter.
- Fisher (1922)
\( \hat{\theta} = \hat{\theta}(X_1, \cdots, X_n) \)
Fisher / Fisherian view: a statistic as a functional of the empirical cdf,
\( \hat{\theta}= \hat{\theta}(F_n) \)
\( \hat{\theta}(F_n) \xrightarrow{\mathbb{P}} \theta(F), \quad n \to \infty \)
\(\theta(F) = \theta_0\)
Fisher Consistency (FC)
Gerow, K. (1989): In fact, for many years, Fisher took his two definitions to be describing the same thing... It took Fisher 34 years to polish the definitions of consistency to their present form.
Glivenko–Cantelli theorem (1933): \( \sup_x |F_n(x) - F(x)| \to 0 \) almost surely, so \( \hat{\theta}(F_n) \xrightarrow{\mathbb{P}} \theta(F) \) for continuous functionals.
The Hodges–Le Cam example is PC but not FC!
Rao, C. R. (1962). Apparent Anomalies and Irregularities in Maximum Likelihood Estimation.
With continuous functionals, FC iff PC.
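A minimal example: the sample mean as a plug-in functional,
$$ \theta(F) = \int x \, \mathrm{d}F(x), \qquad \hat{\theta} = \theta(F_n) = \frac{1}{n}\sum_{i=1}^n X_i. $$
Evaluated at the true \(F\), \(\theta(F) = \mathbb{E}(X_1) = \theta_0\), so the sample mean is FC; the law of large numbers then gives PC.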
Not all estimators can be easily expressed as a functional of the empirical cdf.
The argument also relies on iid assumptions.
Classification
$$ Acc( \delta) = \mathbb{E}\big( \mathbb{I}( Y = \delta(\mathbf{X}) )\big) $$
Due to the computational difficulty, direct optimization is often infeasible.
Consistency. What ensures that your estimator achieves good accuracy?
Suppose we develop a classifier as a functional of the empirical cdf, that is, \( \hat{\delta}(F_n) \).
$$Acc \big ( \hat{\delta}(F_n) \big) \xrightarrow{\mathbb{P}} Acc \big( \delta(F) \big) = \max_{\delta} Acc(\delta)$$
FC
How can we develop an FC method?
Approach 1 (Plug-in rule).
$$ \delta^* = \delta^*(F) \to \delta^*(F_n)$$
$$ \delta^* = \argmax_{\delta} \ Acc(\delta) \ \ \to \ \delta^*(\mathbf{x}) = \mathbb{I}( p(\mathbf{x}) \geq 0.5 ) $$
$$ p(\mathbf{x}) = \mathbb{P}(Y=1|\mathbf{X}=\mathbf{x})$$
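Why the 0.5 threshold (standard derivation by conditioning on \(\mathbf{X}\)):
$$ Acc(\delta) = \mathbb{E}\Big[ p(\mathbf{X}) \, \mathbb{I}\big(\delta(\mathbf{X}) = 1\big) + \big(1 - p(\mathbf{X})\big) \, \mathbb{I}\big(\delta(\mathbf{X}) = 0\big) \Big], $$
which is maximized pointwise by predicting the more likely class, i.e., \( \delta^*(\mathbf{x}) = \mathbb{I}( p(\mathbf{x}) \geq 0.5 ) \).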
$$ \hat{\delta}(\mathbf{x}) = \mathbb{I}( \hat{p}_n(\mathbf{x}) \geq 0.5 ) $$
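A minimal plug-in sketch (assuming numpy and scikit-learn are available; logistic regression is just one hypothetical choice of \( \hat{p}_n \)):

```python
# Plug-in rule sketch: estimate p_hat(x) = P(Y=1 | X=x), then threshold at 0.5.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                    # features
p = 1 / (1 + np.exp(-(X[:, 0] - X[:, 1])))       # true p(x) = P(Y=1 | X=x)
y = rng.binomial(1, p)                           # labels

clf = LogisticRegression().fit(X, y)             # one possible estimator of p(x)
p_hat = clf.predict_proba(X)[:, 1]               # \hat{p}_n(x)
delta_hat = (p_hat >= 0.5).astype(int)           # plug-in classifier
print("training accuracy:", (delta_hat == y).mean())
```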
Approach 2 (ERM via a surrogate loss).
$$ \hat{f}_{\phi} = \argmin_f \mathbb{E}_n \phi\big(Y f(\mathbf{X}) \big) $$
$$ f_\phi = \argmin_f \mathbb{E} \phi\big(Y f(\mathbf{X}) \big)$$
$$Acc \big ( \hat{\delta}(F_n) \big) \xrightarrow{\mathbb{P}} Acc \big( \delta(F) \big) = \max_{\delta} Acc(\delta)$$
FC leads to conditions for "consistent" surrogate losses
Theorem. (Bartlett et al. (2006); informal) Let \(\phi\) be convex. \(\phi\) is "consistent" iff it is differentiable at 0 and \( \phi'(0) < 0 \).
Convex loss. Zhang (2004), Lugosi and Vayatis (2004), Steinwart (2005)
Non-convex loss. Mason et al. (1999), Shen et al. (2003)
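For example, the usual convex surrogates satisfy this condition:
$$ \phi_{\text{hinge}}(t) = \max(0, 1-t), \ \phi'(0) = -1; \qquad \phi_{\text{logistic}}(t) = \log(1 + e^{-t}), \ \phi'(0) = -\tfrac{1}{2}; \qquad \phi_{\text{exp}}(t) = e^{-t}, \ \phi'(0) = -1. $$
All are convex and differentiable at 0 with \( \phi'(0) < 0 \), hence "consistent".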
Applications: medical image segmentation, autonomous vehicles, agriculture.
Input: \(\mathbf{X} \in \mathbb{R}^d\)
Outcome: \(\mathbf{Y} \in \{0,1\}^d\)
Segmentation function: \( \pmb{\delta}(\cdot) = (\delta_1(\cdot), \cdots, \delta_d(\cdot))^\intercal : \mathbb{R}^d \to \{0,1\}^d \)
Predicted segmentation set: \( \{ j : \delta_j(\mathbf{x}) = 1 \} \)
Probabilistic model:
$$ Y_j | \mathbf{X}=\mathbf{x} \sim \text{Bern}\big(p_j(\mathbf{x})\big), \qquad p_j(\mathbf{x}) := \mathbb{P}(Y_j = 1 | \mathbf{X} = \mathbf{x})$$
The Dice and IoU metrics are widely used in practice.
Goal: learn a segmentation function \( \pmb{\delta} \) maximizing Dice / IoU.
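For reference, a common population-level (\(\gamma\)-smoothed) form is sketched below; the exact smoothing convention is an assumption here and may differ slightly from Dai and Li (2023):
$$ \text{Dice}_\gamma(\pmb{\delta}) = \mathbb{E}\left[ \frac{2 \sum_{j=1}^d \delta_j(\mathbf{X}) Y_j + \gamma}{\sum_{j=1}^d \delta_j(\mathbf{X}) + \sum_{j=1}^d Y_j + \gamma} \right], \qquad \text{IoU}_\gamma(\pmb{\delta}) = \mathbb{E}\left[ \frac{\sum_{j=1}^d \delta_j(\mathbf{X}) Y_j + \gamma}{\sum_{j=1}^d \big( \delta_j(\mathbf{X}) + Y_j - \delta_j(\mathbf{X}) Y_j \big) + \gamma} \right]. $$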
Classification-based losses: CE, CE + Focal.
We aim to leverage the principles of FC to develop a consistent segmentation method.
Recall (Plug-in rule in classification).
$$ \delta^* = \argmax_{\delta} \ Acc(\delta) $$
$$ \delta^* = \delta^*(F) \to \delta^*(F_n)$$
Plug-in rule in segmentation:
$$ \pmb{\delta}^* = \text{argmax}_{\pmb{\delta}} \ \text{Dice}_\gamma ( \pmb{\delta})$$
Bayes segmentation rule
What form would the Bayes segmentation rule take?
Theorem 1 (Dai and Li, 2023). A segmentation rule \(\pmb{\delta}^*\) is a global maximizer of \(\text{Dice}_\gamma(\pmb{\delta})\) if and only if, for each input \(\mathbf{x}\), it segments the \(\tau^*(\mathbf{x})\) features with the largest conditional probabilities, that is, \( \delta^*_j(\mathbf{x}) = \mathbb{I}\big( j \in J_{\tau^*(\mathbf{x})}(\mathbf{x}) \big) \).
\( \tau^*(\mathbf{x}) \) is called the optimal segmentation volume; it maximizes the conditional expected Dice over the volume \(\tau\) (see Dai and Li, 2023, for the explicit formula),
where \(J_\tau(\mathbf{x})\) is the index set of the \(\tau\)-largest probabilities, \(\Gamma(\mathbf{x}) = \sum_{j=1}^d {B}_{j}(\mathbf{x})\) and \( {\Gamma}_{- j}(\mathbf{x}) = \sum_{j' \neq j} {B}_{j'}(\mathbf{x})\) are Poisson-binomial random variables, with \(B_j(\mathbf{x}) \sim \text{Bern}\big(p_j(\mathbf{x})\big)\) independently.
The Dice measure is separable w.r.t. \(j\)
Obs: both the Bayes segmentation rule \(\pmb{\delta}^*(\mathbf{x})\) and the optimal volume function \(\tau^*(\mathbf{x})\) are achievable when the conditional probability \(\mathbf{p}(\mathbf{x}) = ( p_1(\mathbf{x}), \cdots, p_d(\mathbf{x}) )^\intercal\) is well-estimated
RankDice inspired by Thm 1 (plug-in rule)
Ranking the conditional probability \(p_j(\mathbf{x})\)
searching for the optimal volume of the segmented features \(\tau(\mathbf{x})\)
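A naive Monte-Carlo sketch of this ranking + volume-search idea (illustration only: `rank_volume_segment` is a hypothetical helper, and the actual RankDice algorithm evaluates the Poisson-binomial distribution exactly, as discussed below):

```python
# For one input x with estimated probabilities p_hat, segment the top-tau features,
# choosing tau to maximize a Monte-Carlo estimate of the expected (gamma-smoothed) Dice.
import numpy as np

def rank_volume_segment(p_hat, gamma=1.0, n_mc=2000, seed=0):
    rng = np.random.default_rng(seed)
    d = len(p_hat)
    order = np.argsort(-p_hat)                       # rank features by p_hat_j(x)
    Y_mc = rng.binomial(1, p_hat, size=(n_mc, d))    # simulate Y | X = x under the model
    best_tau, best_dice = 0, -np.inf
    for tau in range(d + 1):
        delta = np.zeros(d)
        delta[order[:tau]] = 1                       # segment the tau top-ranked features
        inter = Y_mc @ delta                         # |delta ∩ Y| for each draw
        dice = np.mean((2 * inter + gamma) / (tau + Y_mc.sum(axis=1) + gamma))
        if dice > best_dice:
            best_tau, best_dice = tau, dice
    delta_hat = np.zeros(d, dtype=int)
    delta_hat[order[:best_tau]] = 1
    return delta_hat, best_tau

p_hat = np.array([0.9, 0.8, 0.55, 0.3, 0.1])         # hypothetical estimated probabilities
print(rank_volume_segment(p_hat))
```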
Note that (6) can be rewritten in terms of the Poisson-binomial probabilities \(\mathbb{P}(\widehat{\Gamma}_{-j}(\mathbf{x}) = l)\) and \(\mathbb{P}(\widehat{\Gamma}(\mathbf{x}) = l)\), so the key computational step is evaluating a Poisson-binomial distribution:
In practice, the DFT–CF method is generally recommended for computing. The RF1 method can also be used when n < 1000, because there is not much difference in computing time from the DFT–CF method. The RNA method is recommended when n > 2000 and the cdf needs to be evaluated many times. As shown in the numerical study, the RNA method can approximate the cdf well, when n is large, and is more computationally efficient.
Hong. (2013) On computing the distribution function for the Poisson binomial distribution
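As a point of reference, the Poisson-binomial pmf can also be computed exactly by iterated convolution; a minimal \(O(d^2)\) sketch (illustration only, not the DFT–CF / RF1 / RNA methods quoted above):

```python
# pmf of Gamma = sum_j B_j with independent B_j ~ Bern(p_j), by iterated convolution.
import numpy as np

def poisson_binomial_pmf(p):
    pmf = np.array([1.0])                     # distribution of an empty sum
    for pj in p:
        pmf = np.convolve(pmf, [1 - pj, pj])  # add one Bernoulli component
    return pmf                                # pmf[l] = P(Gamma = l), l = 0, ..., d

p = np.array([0.9, 0.8, 0.55, 0.3, 0.1])
print(poisson_binomial_pmf(p), poisson_binomial_pmf(p).sum())  # pmf sums to 1
```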
Lemma 3 (Dai and Li, 2023). If \(\sum_{s=1}^{\tau} \widehat{q}_{j_s}(\mathbf{x}) \geq (\tau + \gamma + d) \widehat{q}_{j_{\tau+1}}(\mathbf{x})\), then \(\bar{\pi}_\tau(\mathbf{x}) \geq \bar{\pi}_{\tau'}(\mathbf{x})\) for all \(\tau' >\tau\)
Early stop!
It is unnecessary to compute all \(\mathbb{P}(\widehat{\Gamma}_{-j}(\mathbf{x}) = l)\) and \(\mathbb{P}(\widehat{\Gamma}(\mathbf{x}) = l)\) for \(l=1, \cdots, d\), since they are negligibly close to zero when \(l\) is too small or too large.
Truncation!
Normal approximation: \(\widehat{\sigma}^2(\mathbf{x}) = \sum_{j=1}^d \widehat{q}_j(\mathbf{x}) (1 - \widehat{q}_j(\mathbf{x})) \to \infty \) as \( d \to \infty \), so the Poisson-binomial probabilities can be approximated by a normal distribution with an \(o_P(1)\) error.
GPU parallel execution via CUDA.
Fisher consistency or Classification-Calibration (Lin, 2004; Zhang, 2004; Bartlett et al., 2006): from classification to segmentation (Dice-calibration).
Source: Visual Object Classes Challenge 2012 (VOC2012)
Source: The Cityscapes Dataset: Semantic Understanding of Urban Street Scenes
Jha et al. (2020) Kvasir-SEG: A segmented polyp dataset
DeepLab: Chen et al (2018) Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
PSPNet: Zhao et al (2017) Pyramid Scene Parsing Network
FCN: Long et al (2015) Fully convolutional networks for semantic segmentation
The optimal threshold is NOT fixed at 0.5; it is adaptive over different images/inputs.
To the best of our knowledge, the proposed ranking-based segmentation framework, RankDice, is the first consistent segmentation framework with respect to the Dice metric.
Three numerical algorithms with GPU parallel execution are developed to implement the proposed framework in large-scale and high-dimensional segmentation.
We establish a theoretical foundation of segmentation with respect to the Dice metric, including the Bayes rule, Dice-calibration, and a convergence rate of the excess risk for the proposed RankDice framework, and show the inconsistency of existing methods.
Our experiments suggest that the improvement of RankDice over the existing frameworks is significant.
If you like RankDice, please star 🌟 our GitHub repository. Thank you for your support!