Any institution engaged in algorithmic decision-making is legally required to justify those decisions to any person whose data they hold on request, a challenge that most are ill-equipped to meet.
- Watson et al. (2019) "Clinical applications of machine learning algorithms: beyond the black box."
AlphaGo: AlphaGo is the first computer program to defeat a professional human Go player.
Source: DeepMind
Train a 34-layer convolutional neural network (CNN) to detect arrhythmias in ECG time-series
Source: Stanford ML
\( (\mathbf{x}_i, y_i)_{i=1, \cdots, n} \), \(\mathbf{x}_i \in \mathbb{R}^p \) is the feature , and \( y_i \) is the label.
$$ d(\mathbf{x}) $$
\( (\mathbf{x}_i, y_i)_{i=1, \cdots, n} \), \(\mathbf{x}_i \in \mathbb{R}^p \) is the feature , and \( y_i \) is the label.
$$ d(\mathbf{x}) $$
How to define discriminative feature?
$$ d(\mathbf{x}) $$
How to define discriminative feature?
$$ d(\mathbf{x}) $$
How to define discriminative feature?
Predictive importance. The prediction accuracy of the model would significantly deteriorate without discriminative features. Alternatively, discriminative features can explain a large proportion of its predictive performance.
$$ d(\mathbf{x}) $$
How to define discriminative feature?
Effectiveness. Discriminative features should effectively describe the discrimination of the outcome.
Therefore, under the same predictive importance, the number/amount of localized discriminative features should be as small as possible.
$$ d(\mathbf{x}) $$
How to define discriminative feature?
Adaptiveness. Discriminative feature extraction has to be adaptive to an input instance.
Introduce a new neural network to attack the full features
FIXED
Adaptiveness
Effectiveness
Predictive importance
Adaptiveness. We introduce a localizer \( \mathbf{\delta}(\mathbf{x}) = ( \delta_1(\mathbf{x}), \cdots, \delta_p(\mathbf{x}) )^\intercal \) to produce an attack adaptively based on an instance \( \mathbf{x} \) to attack features \(\mathbf{x}_{\mathbf{\delta}} = \mathbf{x} - \mathbf{\delta}(\mathbf{x})\)
Effectiveness. We have two constraints of \( \mathbf{\delta}(\mathbf{x}) \):
$$ (i) \ \sup_{\mathbf x} \| \mathbf{\delta}(\mathbf{x}) \|_\infty \leq 1; \quad (ii) \ J(\bm{\delta}) := \sup_{\bm{x}} \| \bm{\delta}(\bm{x}) \|_1 \leq \tau $$
Total amount of attack
Adaptiveness. We introduce a localizer \( \mathbf{\delta}(\mathbf{x}) = ( \delta_1(\mathbf{x}), \cdots, \delta_p(\mathbf{x}) )^\intercal \) to produce an attack adaptively based on an instance \( \mathbf{x} \) to attack features \(\mathbf{x}_{\mathbf{\delta}} = \mathbf{x} - \mathbf{\delta}(\mathbf{x})\)
Effectiveness. We have two constraints of \( \mathbf{\delta}(\mathbf{x}) \):
$$ (i) \ \sup_{\mathbf x} \| \mathbf{\delta}(\mathbf{x}) \|_\infty \leq 1; \quad (ii) \ J(\bm{\delta}) := \sup_{\bm{x}} \| \bm{\delta}(\bm{x}) \|_1 \leq \tau $$
Total amount of attack
Solution. Implement an arbitrary neural network with following activation functions -> the constraints are automatically satisfied
$$A_\tau(\bm{z}) = \text{TReLu} \big( \tau \cdot \text{softmax}(\bm{z}) \big), \text{ or } A_\tau(\bm{z}) = \text{Tanh} \big( \tau \cdot \text{softmax}(\bm{z}) \big)$$
Predictive importance. To measure the degree of predictive importance of a localizer \( \mathbf{\delta}(\cdot) \), we introduce a generalized partial R2, which mimics the partial R2 in regression.
Recall the partial R2:
$$ R^2 = \frac{\text{SSE(reduced)} - \text{SSE(full)}}{\text{SSE(reduced)}} = 1 - \frac{\text{SSE(full)}}{\text{SSE(reduced)}} $$
Predictive importance. To measure the degree of predictive importance of a localizer \( \mathbf{\delta}(\cdot) \), we introduce a generalized partial R2, which mimics the partial R2 in regression.
Recall the partial R2:
$$ R^2 = \frac{\text{SSE(reduced)} - \text{SSE(full)}}{\text{SSE(reduced)}} = 1 - \frac{\text{SSE(full)}}{\text{SSE(reduced)}} $$
Predictive importance + adaptiveness + Effectiveness
Predictive importance + adaptiveness + Effectiveness
Predictive importance + adaptiveness + Effectiveness
Goal: estimate a \(r^2\)-discriminative localizer for a trained DL model \(d\) given a dataset
Fisher consistency
Est by a testing set
Step 1: set a value of r2 (want to explain)
Step 2: solve (5) for a seq of increasing \(\tau\)
Step 3: compute R2 based on the fitted localizer
Step 4: Stop if r2 is obtained
Dr. Lin Yee Chen, MD MS
Professor of Medicine, UMN
ABIM board-certified cardiac
electrophysiologist
ECG clinician
The localized regions of ECG complexes in sinus rhythm are most informative in distinguishing presence of ventricular ectopic beats from supraventricular ectopic beats.
- The localized regions lie in the QRS complex, which correlates with ventricular depolarization or electrical propagation.
- Ion channel aberrations and structural abnormalities in the ventricles can affect electrical conduction in the ventricles, manifesting with subtle anomalies in the QRS complex in sinus rhythm that may not be discernible by the naked eye but is detectable by the convolutional auto-encoder.
We propose a generalized partial R2 to quantify the degree of predictive importance of discriminative features so that they can be interpreted similarly as in classical statistical analysis.
The proposed framework (5) is able to simultaneously consider both predictive importance and effectiveness, it provides a flexible framework to localize discriminative features corresponding to a certain amount of accuracy, as measured by an R2 .
Numerical results (both MNIST and ECG datasets) suggest the validity of the proposed framework.