(Joint work with Xiaotong Shen and Wei Pan)
p-values for the significance of keypoints in facial expression recognition, based on the VGG neural network.
FER2013: Facial expression dataset
Why significance tests? Hypothesis testing, feature interpretation, explainable AI (XAI), making black-box models more reliable ...
Why black-box models? They deliver significant improvements in prediction performance, which leads us to believe that a black-box model is often a better option for modeling real data; for example, using a CNN to model image data. (There is an overwhelming amount of evidence supporting this, and it is already widely accepted.)
Why region tests? In image analysis, the impact of any single pixel is negligible, but a pattern formed by a collection of pixels (e.g., a region) may be salient.
Lewis, M. et al. (2021). Comparison of deep learning with traditional models to predict preventable acute care use and spending among heart failure patients. Scientific reports, 11(1), 1164.
He, K. et al. (2015) Deep residual learning for image recognition. CVPR.
Overparametrized models. As the number of parameters increases, both the training and testing errors decrease, and the training error can even reach zero.
Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., & Srebro, N. (2018). Towards understanding the role of over-parametrization in generalization of neural networks.
Dataset \((\mathbf{X}_i, \mathbf{Y}_i)_{i=1}^N\)
Black-box model \( f: \mathbb{R}^d \to \mathbb{R}^K \)
Good performance: a small generalization error, or a reasonable convergence rate
Flexible computing platforms for a general loss function \(l(\cdot, \cdot)\)
TensorFlow, Keras, Pytorch
Tan, M., & Le, Q. (2019). Efficientnet: Rethinking model scaling for convolutional neural networks. ICML, PMLR.
The (null) hypothesis needs to be reformulated!
Hypothesis for feature relevance. One of the most frequently used hypotheses is conditional independence:
$$H_0: \ \bm{Y} \perp \bm{X}_\mathcal{S} \mid \bm{X}_{\mathcal{S}^c}$$
Conditional randomization test (CRT; Candes et al. 2018) and holdout randomization test (HRT; Tansey et al. 2022)
Both require the conditional distribution of the hypothesized features given the remaining ones, which is usually difficult to estimate for complex datasets.
Candes, E., Fan, Y., Janson, L., & Lv, J. (2018). Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection. JRSS-B.
Tansey, W., Veitch, V., Zhang, H., Rabadan, R., & Blei, D. M. (2022). The holdout randomization test for feature selection in black box models. JCGS.
Slides "LOCO: The Good, the Bad, and the Ugly" by Ryan Tibshirani
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. JASA.
LOCO Hypothesis
Slides "LOCO: The Good, the Bad, and the Ugly" by Ryan Tibshirani
Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. JASA.
We use the following example to better illustrate the issues with LOCO tests.
LOCO Tests
What is the null hypothesis?
\(H_{0,n}:\) The performance yielded by \(\widehat{f}_n\) and \(\widehat{g}_n\) is equivalent.
\(H_{0, 100} \quad \to \quad H_{0, 200} \quad \to \quad H_{0, n} \quad \to \quad H_0\)
\(p_{0, 100} \quad \to \quad p_{0, 200} \quad \to \quad p_{0, n} \quad \to \quad p\)
(a null hypothesis that depends on the particular training set used)
(Is there a population-level hypothesis? And what do the p-values converge to?)
Several questions remain unanswered; this motivates a population-level hypothesis:
$$H_0: R(f^*)- R_\mathcal{S}(g^*)=0, \text{ versus } H_a: R(f^*) - R_\mathcal{S}(g^*)<0.$$
\(f^*\): best prediction based on \(\mathbf{X}\); \(g^*\): best prediction based on \(\mathbf{X}_{\mathcal{S}^c}\).
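For concreteness, here is a hedged reconstruction of the notation behind this hypothesis (assuming \(R\) and \(R_\mathcal{S}\) are population risks under the loss \(l\), and \(f^*, g^*\) their minimizers over functions of \(\bm{X}\) and \(\bm{X}_{\mathcal{S}^c}\), respectively):
$$R(f) = \mathbb{E}\, l\big(f(\bm{X}), \bm{Y}\big), \qquad R_{\mathcal{S}}(g) = \mathbb{E}\, l\big(g(\bm{X}_{\mathcal{S}^c}), \bm{Y}\big),$$
$$f^* = \operatorname*{arg\,min}_{f} R(f), \qquad g^* = \operatorname*{arg\,min}_{g} R_{\mathcal{S}}(g).$$
Since \(g^*\) is restricted to functions of \(\bm{X}_{\mathcal{S}^c}\), \(R(f^*) \leq R_{\mathcal{S}}(g^*)\) always holds, which is why the alternative is one-sided.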
Relationship between the proposed hypothesis and conditional independence:
$$\ \bm{Y} \perp \bm{X}_\mathcal{S} \mid \bm{X}_{\mathcal{S}^c}$$
Lemma 1 (Dai et al. 2022). For any loss function, conditional independence implies risk invariance:
$$\bm{Y} \perp \bm{X}_\mathcal{S} \mid \bm{X}_{\mathcal{S}^c} \ \Longrightarrow \ R(f^*) - R_{\mathcal{S}}(g^*) = 0.$$
Moreover, if the negative log-likelihood is used as the loss, then \(H_0\) is equivalent to conditional independence almost surely:
$$\bm{Y} \perp \bm{X}_\mathcal{S} \mid \bm{X}_{\mathcal{S}^c} \iff R(f^*) - R_{\mathcal{S}}(g^*) = 0.$$
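A brief sketch of the forward implication (standard reasoning, not necessarily the authors' proof): conditional independence means the conditional law of \(\bm{Y}\) given \(\bm{X}\) depends only on \(\bm{X}_{\mathcal{S}^c}\), so the pointwise optimal prediction already lies in the restricted class,
$$f^*(\bm{x}) = \operatorname*{arg\,min}_{a}\, \mathbb{E}\big[ l(a, \bm{Y}) \mid \bm{X} = \bm{x} \big] = \operatorname*{arg\,min}_{a}\, \mathbb{E}\big[ l(a, \bm{Y}) \mid \bm{X}_{\mathcal{S}^c} = \bm{x}_{\mathcal{S}^c} \big] = g^*(\bm{x}_{\mathcal{S}^c}),$$
hence \(R(f^*) = R_{\mathcal{S}}(g^*)\).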
Zhang et al. 2017: deep neural networks easily fit shuffled pixels and random pixels.
Question: do we need to split data? Yes!
The training loss converges to zero even under random pixels, yet the testing loss remains sensible.
Theoretically, it is not easy to derive a limiting distribution for a statistic based on a black-box model.
Split the full dataset \((\bm{X}_i, \bm{Y}_i)_{i=1}^N\) into an estimation set and an inference set, and generate the dual datasets, where \(\bm{Z}_i\) denotes the dual feature vector with the hypothesized features \(\mathcal{S}\) removed or masked:
Estimation set and its dual:
$$(\bm{X}_i, \bm{Y}_i)_{i=1}^n, \qquad (\bm{Z}_i, \bm{Y}_i)_{i=1}^n$$
Inference set and its dual:
$$(\bm{X}_{n+j}, \bm{Y}_{n+j})_{j=1}^m, \qquad (\bm{Z}_{n+j}, \bm{Y}_{n+j})_{j=1}^m$$
Differenced empirical risk on the inference set:
$$\widehat{R}( \widehat{f}_n ) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n )$$
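A minimal sketch of the splitting step, assuming tabular data in NumPy arrays, a hypothesized feature set S given by column indices, and a zero-masking scheme for the dual dataset (the masking choice and the name `split_and_mask` are ours, for illustration only):

```python
import numpy as np

def split_and_mask(X, Y, S, n, seed=0):
    """Split (X, Y) into estimation / inference sets and build dual data Z
    by zero-masking the hypothesized feature columns S (illustrative choice)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    est, inf = idx[:n], idx[n:]

    Z = X.copy()
    Z[:, S] = 0.0  # mask hypothesized features; other schemes (e.g., permutation) are possible

    return (X[est], Y[est]), (X[inf], Y[inf]), (Z[est], Y[est]), (Z[inf], Y[inf])

# Usage: fit f_hat on (X_est, Y_est) and g_hat on (Z_est, Y_est),
# then evaluate both on the inference pairs to form the differenced losses.
```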
Question: what is the null distribution of \(\widehat{R}( \widehat{f}_n ) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n )\)?
Consider the following decomposition
\( \widehat{R}( \widehat{f}_n ) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n) = \widehat{R}( \widehat{f}_n ) - R( \widehat{f}_n ) + R_\mathcal{S}(\widehat{g}_n) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n )\)
$$+R( \widehat{f}_n ) - R( f^* ) + R_\mathcal{S}(g^*) - R_\mathcal{S}(\widehat{g}_n)$$
$$+R(f^*) - R_\mathcal{S}(g^*) $$
$$= T_1 + T_2 + T_3$$
$$T_3 = R(f^*) - R_\mathcal{S}(g^*) = 0, \quad \text{under } H_0$$
$$T_1 = \widehat{R}( \widehat{f}_n ) - R( \widehat{f}_n ) + R_\mathcal{S}(\widehat{g}_n) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n )$$
$$= \frac{1}{m} \sum_{j=1}^{m} \Big( \Delta_{n,j} - \mathbb{E}\big( \Delta_{n,j} \big| (\bm{X}_i,\bm{Y}_i)^n_{i=1} \big) \Big)$$
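Here \(\Delta_{n,j}\) is not defined explicitly on the slides; consistent with \(\widehat{R}( \widehat{f}_n ) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n ) = \frac{1}{m}\sum_{j=1}^m \Delta_{n,j}\), it can be read as the per-sample loss difference on the \(j\)-th inference observation:
$$\Delta_{n,j} = l\big( \widehat{f}_n(\bm{X}_{n+j}), \bm{Y}_{n+j} \big) - l\big( \widehat{g}_n(\bm{Z}_{n+j}), \bm{Y}_{n+j} \big), \quad j = 1, \ldots, m.$$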
$$T_2 = R( \widehat{f}_n ) - R( f^* ) + R_\mathcal{S}(g^*) - R_\mathcal{S}(\widehat{g}_n)$$
$$\leq \max\{ R( \widehat{f}_n ) - R( f^* ), R_\mathcal{S}(\widehat{g}_n) - R_\mathcal{S}(g^*) \} = O_P( n^{-\gamma})$$
In the literature, the convergence rate \(\gamma > 0\) has been extensively investigated (Wasserman et al. 2016; Schmidt et al. 2020).
After normalization, \(T_2\) converges to zero in probability as \(n \to \infty\), and \(T_3 = 0\) under \(H_0\).
$$T = \frac{\sqrt{m}}{\widehat{\sigma}_n} \big(\widehat{R}( \widehat{f}_n ) - \widehat{R}_{\mathcal{S}}( \widehat{g}_n ) \big) = \frac{ \sum_{j=1}^{m} \Delta_{n,j}}{\sqrt{m}\widehat{\sigma}_n} = \frac{\sqrt{m}}{\widehat{\sigma}_n} T_1 + \frac{\sqrt{m}}{\widehat{\sigma}_n} T_2 + \frac{\sqrt{m}}{\widehat{\sigma}_n} T_3 $$
where \(\widehat{\sigma}_n\) is the sample standard deviation of the differenced evaluations \(\{ \Delta_{n,j} \}_{j=1}^m\) on the inference set.
Asymptotically Normally Distributed?
It may be WRONG!!
One unusual issue with the test statistic is a vanishing standard deviation:
$$\text{ Under } H_0, \text{ if } \widehat{f}_n \stackrel{p}{\longrightarrow} f^*, \widehat{g}_n \stackrel{p}{\longrightarrow} g^*, \text{ and } f^* = g^*, \text{ then } \widehat{\sigma}_n \stackrel{p}{\longrightarrow} 0$$
The CLT may not hold for \(T_1\): the CLT requires the standard deviation to be fixed, or at least bounded away from zero.
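An illustrative simulation of this vanishing-sd phenomenon, in a simple linear-regression setting where the hypothesized feature is truly irrelevant (the setup and all names here are ours, not the authors'): as \(n\) grows, the two fitted models agree and the per-sample loss differences collapse toward zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitted_sd(n, m=2000, d=5):
    """Fit least squares with / without the (irrelevant) first feature on n points,
    return the sd of the per-sample squared-error differences on m fresh points."""
    Xe, Xi = rng.normal(size=(n, d)), rng.normal(size=(m, d))
    beta = np.r_[0.0, np.ones(d - 1)]          # first coefficient is zero: H0 holds
    Ye = Xe @ beta + rng.normal(size=n)
    Yi = Xi @ beta + rng.normal(size=m)

    full = np.linalg.lstsq(Xe, Ye, rcond=None)[0]        # f_hat: uses all features
    red = np.linalg.lstsq(Xe[:, 1:], Ye, rcond=None)[0]  # g_hat: drops feature 0
    delta = (Yi - Xi @ full) ** 2 - (Yi - Xi[:, 1:] @ red) ** 2
    return delta.std()

for n in [100, 1000, 10000]:
    print(n, fitted_sd(n))  # the sd shrinks as n grows, so a plain CLT scaling breaks down
```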
Bias-sd-ratio. Both the bias and the standard deviation converge to zero:
$$\frac{\sqrt{m}T_2}{\widehat{\sigma}_n} = \sqrt{m} \Big( \frac{R( \widehat{f}_n ) - R( f^* ) + R_\mathcal{S}(g^*) - R_\mathcal{S}(\widehat{g}_n)}{ \widehat{\sigma}_n } \Big) = \sqrt{m} \Big( \frac{\text{bias} \stackrel{p}{\to} 0}{\text{sd} \stackrel{p}{\to} 0} \Big).$$
If \(T_2\) and \(\widehat{\sigma}_n\) are of the same order, then \(\sqrt{m}\widehat{\sigma}^{-1}_n T_2 = O_P(\sqrt{m})\), which destroys the null distribution!
Observation: Testing results or p-values are sensitive to the ratio between the sizes of the training and test datasets.
We use the following example to better illustrate this issue.
One-sample T-test for model comparison
This can be explained by our computation. Recall the assumption:
$$\max\{ R( \widehat{f}_n ) - R( f^* ), R_\mathcal{S}(\widehat{g}_n) - R_\mathcal{S}(g^*) \} = O_P( n^{-\gamma})$$
Then \(\sqrt{m}\widehat{\sigma}^{-1}_n T_2 = O_P(\sqrt{m} \widehat{\sigma}^{-1}_n n^{-\gamma} )\), so the behavior of the test hinges on how \(m\) scales with \(n\).
Splitting condition. The ratio of training to testing should not be arbitrarily defined; instead, it is contingent upon the complexity of the problem and the convergence rate of the method.
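A worked instance of this dependence, using the condition \(m = o(n^{2\gamma})\) that appears in Theorem 2 below:
$$\gamma = \tfrac{1}{2} \;\Rightarrow\; m = o(n), \qquad \gamma = \tfrac{1}{4} \;\Rightarrow\; m = o(\sqrt{n}),$$
so even a parametric-type rate rules out a fixed 50/50 split (since \(m = n\) is not \(o(n)\)), and slower rates require a relatively much smaller inference set.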
The bias-sd-ratio issue is caused by the vanishing standard deviation; we address it by perturbation.
One-split test. The proposed test statistic is given by:
$$\Lambda^{(1)}_n = \frac{ \sum_{j=1}^{m} \Delta^{(1)}_{n,j}}{\sqrt{m}\,\widehat{\sigma}^{(1)}_n}, \quad \Delta^{(1)}_{n,j} = \Delta_{n,j} + \rho_n \varepsilon_j,$$
where \(\rho_n > 0\) is the perturbation size, \(\varepsilon_j\) are perturbation noise variables, and \(\widehat{\sigma}^{(1)}_n\) is the sample standard deviation of \(\{\Delta^{(1)}_{n,j}\}_{j=1}^m\).
Decomposing \(\Lambda^{(1)}_{n}\) term by term:
$$\Lambda^{(1)}_{n} = \frac{\sqrt{m}}{ \widehat{\sigma}^{(1)}_{n}} \Big( \frac{1}{m} \sum_{j=1}^{m} \big( \Delta^{(1)}_{n,j} - \mathbb{E}\big( \Delta^{(1)}_{n,j} | \mathcal{E}_n \big) \big) \Big)$$
\(\quad \to N(0,1)\) by the conditional CLT for triangular arrays;
$$+\frac{\sqrt{m}}{\widehat{\sigma}^{(1)}_n} \Big( R(\widehat{f}_n) - R(f^*) - \big( R_\mathcal{S}(\widehat{g}_n) - R_\mathcal{S}(g^*) \big) \Big)$$
\(\quad = O_p(m^{1/2} n^{-\gamma})\) by prediction consistency;
$$+\frac{\sqrt{m}}{\widehat{\sigma}^{(1)}_{n}} \big( R(f^*) - R_\mathcal{S}(g^*) \big)$$
\(\quad = 0\) under \(H_0\).
If the splitting condition \(m^{1/2} n^{-\gamma} = o(1)\) is satisfied, then \(\Lambda_n^{(1)} \stackrel{d}{\longrightarrow} N(0,1)\) under \(H_0\).
Theorem 2 (Dai et al. 2022). Suppose Assumptions A-C hold and \( m = o(n^{2\gamma})\). Then, under \(H_0\),
$$\Lambda^{(1)}_n \stackrel{d}{\longrightarrow} N(0,1), \quad \text{as} \quad n \to \infty.$$
Assumption A (Prediction consistency). For some \(\gamma > 0\), \((\widehat{f}_n, \widehat{g}_n)\) satisfies
$$\big(R(\widehat{f}_n) - R(f^*)\big) - \big( R_\mathcal{S}(\widehat{g}_n) - R_\mathcal{S}(g^*) \big) = O_p( n^{-\gamma}).$$
Assumptions B-C are standard assumptions for the CLT under triangular arrays (cf. Cappe et al. 2006).
According to the asymptotic null distribution of \( \Lambda^{(1)}_n\) in Theorem 2, we calculate the p-value \(P^{(1)}=\Phi(\Lambda^{(1)}_n)\).
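A minimal sketch of computing \(\Lambda^{(1)}_n\) and \(P^{(1)}\) from per-sample losses on the inference set, assuming \(\Delta_{n,j}\) is the per-sample loss difference and the perturbations \(\varepsilon_j\) are i.i.d. standard normal (the perturbation distribution and the helper name `one_split_pvalue` are our assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.stats import norm

def one_split_pvalue(loss_f, loss_g, rho, seed=0):
    """One-split test from per-sample inference losses.
    loss_f[j] = l(f_hat(X_{n+j}), Y_{n+j}), loss_g[j] = l(g_hat(Z_{n+j}), Y_{n+j})."""
    rng = np.random.default_rng(seed)
    m = len(loss_f)
    delta = loss_f - loss_g                        # Delta_{n,j}
    delta_pert = delta + rho * rng.normal(size=m)  # Delta^{(1)}_{n,j}: perturbed differences
    lam = np.sqrt(m) * delta_pert.mean() / delta_pert.std(ddof=1)
    return norm.cdf(lam)                           # P^{(1)} = Phi(Lambda^{(1)}_n), small under H_a

# p < alpha  =>  reject H_0: the hypothesized features S improve prediction.
```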
Lemma 4 (Dai et al. 2022). The estimation and inference sample sizes \((n, m)\) determined by the log-ratio sample splitting formula satisfy \(m = o(n^{2\gamma})\) for any \(\gamma > 0\) in Assumption A.
Question: How to determine the estimation / inference ratio?
\(m = o(n^{2\gamma})\) for an unknown \(\gamma > 0\).
Log-ratio sample splitting scheme. Given a sample size \(N \geq N_0\), the estimation and inference sizes \(n\) and \(m\) are obtained as follows: \(n\) is the solution of
$$x + \frac{N_0}{2 \log(N_0/2)} \log(x) = N,$$
and \(m = N - n\).
Then the splitting condition is automatically satisfied!
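A minimal sketch of solving the log-ratio splitting equation numerically (the root-finder choice and the name `log_ratio_split` are ours; \(N_0\) is the user-specified minimum sample size in the formula above):

```python
import numpy as np
from scipy.optimize import brentq

def log_ratio_split(N, N0):
    """Solve x + N0/(2*log(N0/2)) * log(x) = N for the estimation size n; m = N - n.
    Requires N >= N0 > 2 so that the coefficient below is positive."""
    c = N0 / (2.0 * np.log(N0 / 2.0))
    n = brentq(lambda x: x + c * np.log(x) - N, 1.0, float(N))  # sign change on [1, N]
    n = int(round(n))
    return n, N - n

# The inference size m = N - n grows only like c * log(n),
# so m = o(n^{2*gamma}) holds for any gamma > 0, as Lemma 4 states.
```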
Two-split test. The one-split test is valid for any perturbation size \(\rho > 0\). If you prefer to avoid this tuning parameter, use the two-split test, which further splits the inference sample into two equal subsamples but requires no perturbation.
We derive the one-split and two-split tests based on the differenced empirical loss with and without the hypothesized features. Theoretically, we show that the one-split and two-split tests, as well as their combined tests, control the Type I error while being consistent in terms of power.
Though this project may not be a perfect solution, it draws attention to critical issues prevailing in the field, particularly the bias-sd-ratio and splitting-ratio issues.
If you like this work, please star 🌟 our GitHub repository. Your support is deeply appreciated!