Oct 10, 2023
Sheng and Anthony
CS 520
Machine learning and graphical causality were developed largely independently.
There is now a growing need to integrate the two approaches and to enable causal inference in machine learning.
A central problem for AI and causality is Causal Representation Learning — the discovery of high-level causal variables from low-level observations.
Issue 1: Robustness
In the real world, we often have little control over the distribution of the observed data.
E.g., in computer vision, changes in variable distribution may come from aberrations like camera blur, noise, or compression quality.
There are several ways to test the generalization of classifiers, but no definitive consensus on how generalization should be measured.
Causal models offer a way to handle these problems by exploiting statistical dependences and distribution shifts, e.g., interventions.
Issue 2: Learning reusable mechanisms
Repeating the entire learning process every time we acquire new knowledge is a waste of resources.
We need to re-use previous knowledge and skills in novel scenarios \(\to\) modular representations
Modular representations behave similarly across different tasks and environments.
Issue 3: A Causality perspective
Conditional probabilities cannot predict the outcome of an active intervention.
E.g., seeing people with open umbrellas suggests that it is raining, but closing the umbrellas does not stop the rain.
Causation requires the additional notion of intervention.
Thus, discovering causal relations requires robust knowledge that holds beyond the observed data distribution.
\(\to\) Why we need Causal Representational Learning
Q: Where will the cube land?
Physical modeling, differential equations:
\(F = ma\)
\( a = \frac{d}{dt} v(t) \)
\(v = \frac{d}{dt}s(t)\)
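The physical-modeling answer can be sketched numerically. Below is a minimal forward-Euler integration of the equations above under assumed initial conditions (mass, height, and velocity are hypothetical choices for illustration, not from the slides):

```python
import numpy as np

m = 1.0                      # mass (kg), assumed
g = np.array([0.0, -9.81])   # gravitational acceleration (m/s^2)
s = np.array([0.0, 1.0])     # initial position (m), assumed: 1 m above ground
v = np.array([2.0, 0.0])     # initial velocity (m/s), assumed
dt = 1e-4                    # integration time step (s)

# Integrate a = dv/dt and v = ds/dt until the cube reaches the ground.
while s[1] > 0.0:
    a = g                    # F = m*a, with gravity the only force
    v = v + a * dt           # update velocity from acceleration
    s = s + v * dt           # update position from velocity

print(f"landing x ~= {s[0]:.3f} m")
```

With a small time step this closely matches the closed-form answer \(x = v_x\sqrt{2h/g} \approx 0.90\) m.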
Q: Where will the cube land?
Statistical learning
Predicting in the i.i.d. setting
What is the probability that this particular image contains a dog?
What is the probability of heart failure given certain diagnostic measurements (e.g., blood pressure) carried out on a patient?
Predicting Under Distribution Shifts
Is increasing the number of storks in a country going to boost its human birth rate?
Would fewer people smoke if cigarettes were more socially stigmatized?
Answering Counterfactual Questions
Intervention: How does the probability of heart failure change if we convince a patient to exercise regularly?
Counterfactual: Would a given patient have suffered heart failure if they had started exercising a year earlier?
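The intervention/counterfactual distinction can be made concrete with a toy linear SCM (a hypothetical two-variable model, not a medical model): an intervention changes a mechanism and asks about the resulting population distribution, while a counterfactual first abducts an individual's noise values and then intervenes.

```python
import numpy as np

# Assumed toy SCM:  E := U_E  (exercise),  H := -1.0 * E + U_H  (risk score)
rng = np.random.default_rng(4)
n = 100_000
u_e = rng.normal(size=n)
u_h = rng.normal(size=n)
E = u_e
H = -1.0 * E + u_h

# Intervention do(E = 1): replace E's assignment, keep the noise distribution.
H_do = -1.0 * np.ones(n) + u_h
print("E[H | do(E=1)] ~=", H_do.mean())   # population-level causal effect

# Counterfactual for one individual: abduction, action, prediction.
i = 0
u_h_i = H[i] - (-1.0 * E[i])   # abduction: recover this unit's U_H
h_cf = -1.0 * 1.0 + u_h_i      # action E := 1, then predict H
print("counterfactual H for unit", i, "had E been 1:", h_cf)
```

Note the counterfactual uses the *same* noise value the individual actually had, which is what distinguishes it from the interventional query.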
Learning from data
Observational vs. Interventional data
E.g., images vs. experimental data (RCT)
Structured vs. Unstructured data
E.g., Principal Component Analysis (PCA) vs. raw data
Methods driven by i.i.d. data
The Reichenbach Principle: From Statistics to Causality
Structural Causal Models (SCMs)
Difference Between Statistical Models, Causal Graphical Models, and SCMs
Methods driven by i.i.d. data
In many cases the i.i.d. assumption cannot be guaranteed (e.g., under selection bias) \(\to\) such methods cannot make causal inferences.
They are mainly suited for statistical inference.
The Reichenbach Principle (aka Common Cause Principle)
If two observables \(X\) and \(Y\) are statistically dependent, then there exists a variable \(Z\) that causally influences both and explains all the dependence in the sense of making them independent when conditioned on \(Z\).
\(X \to Z \to Y\); \(X \leftarrow Z \leftarrow Y\); \(X \leftarrow Z \rightarrow Y\)
Q: Do these three causal relations share the same set of conditional independences?
Without additional assumptions, we can’t distinguish these three cases \(\to\) the observational distribution over \(X\) and \(Y\) is the same in all of them.
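The common-cause case can be simulated directly. This sketch (a hypothetical linear-Gaussian instance of \(X \leftarrow Z \rightarrow Y\)) shows that \(X\) and \(Y\) are marginally dependent but become independent once we condition on \(Z\), here checked via partial correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.normal(size=n)
X = 2.0 * Z + rng.normal(size=n)   # X caused by Z plus independent noise
Y = -1.5 * Z + rng.normal(size=n)  # Y caused by Z plus independent noise

marginal = np.corrcoef(X, Y)[0, 1]

# Condition on Z (linear-Gaussian case): regress Z out of X and Y; the
# residuals' correlation is the partial correlation corr(X, Y | Z).
rx = X - Z * (X @ Z) / (Z @ Z)
ry = Y - Z * (Y @ Z) / (Z @ Z)
conditional = np.corrcoef(rx, ry)[0, 1]

print(f"corr(X, Y)     = {marginal:+.3f}")     # strongly nonzero
print(f"corr(X, Y | Z) = {conditional:+.3f}")  # near zero
```

The two chain structures produce exactly the same pattern, which is why observational data alone cannot tell the three graphs apart.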
Structural Causal Models (SCMs)
Consists of a set of random variables \(X_1, \ldots, X_n\) connected by directed edges, where each variable is determined by a structural assignment \(X_i := f_i(\mathbf{PA}_i, U_i)\) of its parents \(\mathbf{PA}_i\) and a noise variable \(U_i\).
The set of noises is assumed to be jointly independent.
The independence of noises allows causal factorization.
\(P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i | \mathbf{PA}_i)\)
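A toy SCM makes the causal factorization tangible. The structural equations below are assumed for illustration (a chain \(X_1 \to X_2 \to X_3\), so \(P(X_1, X_2, X_3) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\)):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scm(n):
    u1, u2, u3 = rng.normal(size=(3, n))   # jointly independent noises
    x1 = u1                                # X1 := U1 (root node)
    x2 = 0.8 * x1 + u2                     # X2 := f2(PA2, U2)
    x3 = np.tanh(x2) + 0.5 * u3            # X3 := f3(PA3, U3)
    return x1, x2, x3

x1, x2, x3 = sample_scm(50_000)

# Markov property of the chain: X1 is independent of X3 given X2, so the
# partial correlation after regressing X2 out of both should be ~0.
r1 = x1 - x2 * (x1 @ x2) / (x2 @ x2)
r3 = x3 - x2 * (x3 @ x2) / (x2 @ x2)
pcorr = np.corrcoef(r1, r3)[0, 1]
print(f"corr(X1, X3 | X2) ~= {pcorr:+.3f}")
```

The independence of the noises is what licenses factorizing the joint distribution into the per-variable mechanisms.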
Difference b/w Statistical Models, Causal Graphical Models, and SCMs
Statistical models:
The three graphs below entail the same conditional independences (Markov equivalence) \(\rightarrow\) insufficient for causal discovery
\(X \rightarrow Z \rightarrow Y\)
\(X \leftarrow Z \leftarrow Y\)
\(X \leftarrow Z \rightarrow Y\)
Difference b/w Statistical Models, Causal Graphical Models, and SCMs
Causal Graphical Models: the edges are given a causal interpretation, which allows interventional distributions to be computed.
Difference b/w Statistical Models, Causal Graphical Models, and SCMs
Structural Causal Models
Composed of a set of causal variables and a set of structural equations with noise variables \(U\).
Intervention and Counterfactuals
The Independent Causal Mechanisms (ICM) Principle
The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.
Applying ICM to the causal factorization \(P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i | \mathbf{PA}_i)\) implies that the factors should be independent in two senses:
Changing one mechanism \(P(X_i | \mathbf{PA}_i)\) does not change the others \(\to\) independence of influence
Knowing one mechanism provides no information about another \(\to\) independence of information
The Sparse Mechanism Shift (SMS) Hypothesis: small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization, i.e., they should usually not affect all factors simultaneously.
Recall: causal/disentangled factorization is
\(P(X_1, ..., X_n) = \prod_{i=1}^n P(X_i | \mathbf{PA}_i)\)
Intellectual descendant of Simon's invariance criterion, i.e., that the causal structure remains invariant across changing background conditions
The SMS hypothesis has recently been used to learn causal models, modular architectures, and disentangled representations.
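The SMS idea can be demonstrated on a toy linear SCM (assumed equations for a chain \(X_1 \to X_2 \to X_3\)): intervening on a single mechanism \(P(X_2 \mid X_1)\) changes that factor's parameter while leaving the other mechanisms untouched.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n, w2):
    x1 = rng.normal(size=n)             # P(X1)
    x2 = w2 * x1 + rng.normal(size=n)   # P(X2 | X1), mechanism weight w2
    x3 = 0.5 * x2 + rng.normal(size=n)  # P(X3 | X2), weight fixed at 0.5
    return x1, x2, x3

def fit(x1, x2, x3):
    # Estimate each mechanism's weight by least squares.
    return (x2 @ x1) / (x1 @ x1), (x3 @ x2) / (x2 @ x2)

before = fit(*sample(200_000, w2=0.8))
after = fit(*sample(200_000, w2=2.0))   # shift only X2's mechanism
print("w2 estimate:", before[0], "->", after[0])  # changes under the shift
print("w3 estimate:", before[1], "->", after[1])  # stays ~0.5
```

Only one factor of the causal factorization moves under the shift, which is exactly the sparsity the hypothesis predicts; an entangled (non-causal) parameterization would typically see many parameters change at once.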
If we make assumptions about the function class \(f\), we can solve the above two challenges.
Nevertheless, more often than not, causal variables are not given and need to be learned.
Example: causal representation learning problem setting
[Figure: latent causal variables with an unknown causal structure generate the observed data \(X\)]
Formally:
\(U\) factorizes according to an unknown DAG \(\mathcal{G}\).
Q: How many nodes does \(\mathcal{G}\) have?
... in light of causal representation learning
ICM Principle implies
Problem: Given data \(X = (X_1, \ldots, X_d)\), construct causal variables \(S_1, \ldots, S_n\) (\(n \ll d\)) and mechanisms \(S_i := f_i(\mathbf{PA}_i, U_i)\)
Step 1: Use an encoder \(q: \mathbb{R}^d \to \mathbb{R}^n\) to map \(X\) into a latent representation.
Step 2: Apply the mapping \(f(U)\) determined by the structural assignments \(f_1, \ldots, f_n\) in the latent space.
Step 3: Apply a decoder \(p: \mathbb{R}^n \to \mathbb{R}^d\) to map back to the observation space.
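The three steps above can be sketched numerically. Every map here is a hypothetical toy choice (random linear encoder/decoder, a two-variable linear latent SCM with \(d = 10\), \(n = 2\)); a real method would learn these jointly:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 10, 2

E = rng.normal(size=(n, d)) / np.sqrt(d)  # encoder weights, assumed
D = rng.normal(size=(d, n))               # decoder weights, assumed

def q(x):                    # Step 1: encoder q: R^d -> R^n
    return E @ x

def f(u):                    # Step 2: structural assignments f_1, ..., f_n
    s1 = u[0]                # S1 := U1 (root node)
    s2 = 0.7 * s1 + u[1]     # S2 := f2(S1, U2)
    return np.array([s1, s2])

def p(s):                    # Step 3: decoder p: R^n -> R^d
    return D @ s

u = rng.normal(size=n)       # independent noise variables U
x_hat = p(f(u))              # generated observation in R^d
s_hat = q(x_hat)             # latent representation recovered from x_hat
print(x_hat.shape, s_hat.shape)
```

The point of the sketch is the shape of the pipeline: low-dimensional causal variables with an SCM among them, embedded in high-dimensional observations via the decoder, with the encoder recovering the latents.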
Neural networks are "brittle" against adversarial attacks.