Aditya Chattopadhyay, Stewart Slocum, Benjamin Haeffele, Rene Vidal, and Donald Geman
There is broad consensus that it is hard to draw accurate conclusions from current interpretability tools.
There are countless new papers in the area, but most have little value, in large part because there is no clear objective defining what the field should aim for.
This precludes building evaluation criteria like benchmarks, which have been key to progress in the field of ML.
We put forward four objectives for interpretability tools.
The most popular interpretability methods, which are based on local measures of feature importance (e.g. the gradient for saliency maps), are notoriously misleading.
We propose a definition of trustworthiness in terms of approximate statistical sufficiency.
If a method satisfies this criterion, it generates causal explanations of model predictions.
Interpretability tools should also generate explanations that are understandable by human users.
It is not clear that pixel and feature attribution maps are the most interpretable. Rather, we prefer to think of explanations in terms of high-level concepts.
We designed our explanations to be sets of queries about the input.
Short explanations are much easier to understand than long, complicated ones.
Even classically interpretable models like linear models or decision trees become uninterpretable when they contain large numbers of coefficients or branches.
We formalize the first three objectives mathematically through the notion of approximate Semantic Entropy:
The Semantic Entropy of a dataset is the average length of the explanations produced by the minimal sufficient explanation policy \(\pi\).
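One plausible way to write this definition down is sketched here. The notation is an assumption introduced for illustration and is not taken from the text: \(\mathrm{expl}_{\pi}(x)\) denotes the sequence of query-answer pairs that policy \(\pi\) collects on input \(x\), \(d\) is some distance between distributions over labels, and \(\epsilon \ge 0\) is the tolerance that makes sufficiency approximate.

\[
\min_{\pi} \; \mathbb{E}_{X}\big[\, |\mathrm{expl}_{\pi}(X)| \,\big]
\quad \text{subject to} \quad
d\big( P(Y \mid X = x),\; P(Y \mid \mathrm{expl}_{\pi}(x)) \big) \le \epsilon \quad \text{for every input } x .
\]

Under this reading, the constraint encodes trustworthiness (the explanation alone nearly determines the prediction), the restriction of explanations to answers of queries in \(Q\) encodes understandability, and the objective encodes brevity.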
The resulting explanations resemble a game of 20 questions, where queries belong to \(Q\) and the goal is to reach a high level of confidence in the correct answer \(y\).
Finding the minimal sufficient explanation for a point is an NP-hard problem.
We use an approximate greedy strategy called Information Pursuit.
This requires a mutual information calculation, which we implement with a VAE.
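The greedy strategy can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes a tiny discrete dataset in which each query simply reads one binary feature, and it replaces the VAE with an exact empirical mutual information computed by enumeration. All names (`entropy`, `mutual_information`, `information_pursuit`, `ROWS`, `LABELS`) are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(rows, labels, q):
    """Empirical I(q(X); Y) = H(Y) - H(Y | q(X)) over the given samples."""
    n = len(rows)
    conditional = 0.0
    for answer in {r[q] for r in rows}:
        subset = [y for r, y in zip(rows, labels) if r[q] == answer]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


def information_pursuit(x, rows, labels, eps=0.0):
    """Greedily query x until the posterior over y is (1 - eps)-confident.

    Returns (predicted label, list of (query index, answer) pairs).
    """
    explanation = []
    remaining = set(range(len(x)))
    while True:
        # Posterior over y given the answers so far = label frequencies
        # among the samples that are consistent with those answers.
        prediction, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= 1 - eps or not remaining:
            return prediction, explanation
        # Pick the query with the highest mutual information with y,
        # conditioned on the history (rows are already filtered by it).
        best = max(sorted(remaining),
                   key=lambda q: mutual_information(rows, labels, q))
        remaining.discard(best)
        keep = [i for i, r in enumerate(rows) if r[best] == x[best]]
        if not keep:  # x matches no remaining sample: stop early
            return prediction, explanation
        explanation.append((best, x[best]))
        rows = [rows[i] for i in keep]
        labels = [labels[i] for i in keep]


# Toy data (assumed): all 3-bit inputs, label = feature 0 AND feature 1.
# Feature 2 is irrelevant, so a good policy should never ask about it.
ROWS = list(product([0, 1], repeat=3))
LABELS = [r[0] & r[1] for r in ROWS]

print(information_pursuit((1, 1, 0), ROWS, LABELS))
print(information_pursuit((0, 0, 1), ROWS, LABELS))
```

On this toy problem the explanation length adapts to the input: a single answer of 0 on either relevant feature settles the label, while a positive prediction needs both relevant features. With real data, exact enumeration is not feasible, which is where a learned model of the mutual information would come in.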