Changing My View on AI Explainability

2025

This essay is in draft form.

Explainability has been an important goal for AI systems since machine learning was first deployed in real-world applications. Learning algorithms mostly use gradients computed over data to learn parameters that map inputs to desired responses. But exactly how these parameters inform a model's decisions or outputs is something of a mystery. While nothing prevents us from inspecting the mathematics of a model's internals, it isn't clear how the different parts, as learned over a dataset, produce the output. This has led to models being labeled 'black boxes,' especially the more complex ones like neural networks.

Yet explainability is important for many reasons. On the AI teams I worked on across different domains, we often wanted to know things like which inputs to the model were important and worth including. One way to determine this is an ablation study, which leaves out one group of features at a time to find where performance drops. But this is still very different from explaining why the model is acting a certain way.
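As a rough illustration of the ablation idea, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the synthetic data, the feature group names (`size`, `location`, `junk`), and the choice of a least-squares model. It refits the model with one feature group removed at a time and reports how much the test error grows.

```python
import random


def fit_least_squares(X, y):
    """Fit least-squares weights by solving the normal equations (Gauss-Jordan)."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def mean_squared_error(X, y, weights):
    preds = [sum(w * x for w, x in zip(weights, row)) for row in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def ablation_study(X_train, y_train, X_test, y_test, groups):
    """Refit with each feature group left out; return the increase in test MSE."""
    def select(X, cols):
        return [[row[c] for c in cols] for row in X]

    all_cols = list(range(len(X_train[0])))
    baseline = mean_squared_error(
        X_test, y_test, fit_least_squares(X_train, y_train))
    increases = {}
    for name, dropped in groups.items():
        kept = [c for c in all_cols if c not in dropped]
        weights = fit_least_squares(select(X_train, kept), y_train)
        ablated = mean_squared_error(select(X_test, kept), y_test, weights)
        increases[name] = ablated - baseline
    return baseline, increases


def make_data(n_rows, seed):
    """Synthetic data (an assumption): the target depends on size and location only."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_rows):
        size, location, junk = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        X.append([1.0, size, location, junk])  # column 0 is the intercept
        y.append(3.0 * size + 1.0 * location + rng.gauss(0, 0.1))
    return X, y


groups = {"size": [1], "location": [2], "junk": [3]}  # hypothetical feature groups
X_train, y_train = make_data(400, seed=0)
X_test, y_test = make_data(200, seed=1)
baseline_mse, increases = ablation_study(X_train, y_train, X_test, y_test, groups)

print(f"baseline test MSE: {baseline_mse:.3f}")
for name, delta in sorted(increases.items(), key=lambda kv: -kv[1]):
    print(f"without {name:<8} MSE rises by {delta:.3f}")
```

The ranking says which inputs the model cannot do without. It says nothing about why any single prediction came out the way it did.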

At other times, a model makes a wrong decision, and the engineers who want to improve the model, or the analysts who use its output in conjunction with other information, would like to know why. There is a whole domain of explainability methods that try to do things like approximate the model locally with linear functions (LIME), or approach the feature importance problem with SHAP values, which assign credit to input features using a method borrowed from game theory. These methods are helpful but limited.
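To make the game-theory idea concrete, here is a small sketch that computes exact Shapley values by enumerating every coalition of features. It is not the SHAP library's implementation. The toy model `toy_price`, the instance, and the all-zero baseline are made up for illustration, and "absent" features are simply replaced by their baseline value, which is one common simplification among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    A feature that is "in the coalition" takes its value from `instance`;
    every other feature takes its value from `baseline`.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that feature i joins right after a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_price(x):
    """Hypothetical model: two main effects plus an interaction; x[2] is ignored."""
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[1]


instance = [3.0, 2.0, 7.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_price, instance, baseline)

print("credit per feature:", [round(p, 6) for p in phi])
print("sum of credits:    ", round(sum(phi), 6))
print("prediction change: ", toy_price(instance) - toy_price(baseline))
```

For this toy function the credits come out to approximately 9, 5, and 0. The interaction term is split evenly between the first two features, and the credits sum to the gap between the prediction and the baseline prediction (14). That is tidy bookkeeping, but it describes how credit is divided, not the mechanism that produced the output.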

For years, I was an explainability pessimist. My feeling was that there is usually too much happening inside a model to explain its behavior to people in a simple way. There is the idea that simpler models are easier to explain, and while this is somewhat true, even simple models can be misleading in their explanations. Take a model that is a single rule: if the value of the car is less than $5000, buy it. In that case the rule is the explanation. People go further and attempt to explain linear models (weighted sums of the input features) by inferring that weights that are large relative to the others indicate importance to the model. So if you model the price of a home from the number of bathrooms, square feet, walkability score, and nearby school ratings, and the weight on square feet is high while the others are lower, then the model is basing price mostly on square feet (and you can look at the input values to help explain any particular instance). The problem, however, is that when features are correlated, the weight can be spread across them, leaving each with a lower weight even though the underlying concept is important. Similarly, models that consist of sets of learned rules (decision trees) usually contain so many rules that any particular one isn't very helpful. With a single decision tree, explanations can be somewhat clear, but in large ensembles like random forests the number of decision paths becomes too great for any single one to offer a helpful explanation. People also derive statistics about which features drive decisions in trees, but in practice these are still not convincing. This is all before we get into the non-linearities of neural networks, where features are blended together.
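The point about correlated features can be sketched with a small synthetic experiment. All of the specifics are assumptions for illustration: the data, the feature names, and the use of a small L2 penalty (ridge regression). The penalty is what makes the split between near-duplicate features roughly even; with unregularized least squares the split tends to be unstable instead. The features are zero-mean, so no intercept is fitted.

```python
import random


def fit_ridge(X, y, l2=1.0):
    """Ridge regression: solve (X^T X + l2 * I) w = X^T y by Gauss-Jordan."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (l2 if i == j else 0.0) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(n)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


rng = random.Random(0)
rows_alone, rows_with_copy, prices = [], [], []
for _ in range(500):
    size = rng.gauss(0, 1)                  # standardized square feet
    size_copy = size + rng.gauss(0, 0.01)   # near-duplicate measurement of size
    walk = rng.gauss(0, 1)                  # standardized walkability score
    # Ground truth (an assumption): size matters more than walkability.
    prices.append(3.0 * size + 2.0 * walk + rng.gauss(0, 0.1))
    rows_alone.append([size, walk])
    rows_with_copy.append([size, size_copy, walk])

weights_alone = fit_ridge(rows_alone, prices)
weights_with_copy = fit_ridge(rows_with_copy, prices)

print("size, walk:           ", [round(w, 2) for w in weights_alone])
print("size, size_copy, walk:", [round(w, 2) for w in weights_with_copy])
```

In the first fit the weight on size is clearly the largest. In the second, the same signal is shared between `size` and `size_copy`, so each ends up with a smaller weight than `walk`, and reading importance off individual weights would rank the more important concept last.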

So I have been an explainability pessimist. But I am coming around, mostly because I think the applications of, and reasons for, understanding a system don't need to amount to a full explanation. Understanding, even in small ways, can have value.

I recently left my role at an AI startup to become a visiting researcher at my friend Bob Sturm's Music AI lab at KTH. One of the areas of research in Bob's lab is understanding the outputs of the large music generation models used by Suno and Udio. Can one look at input-output pairs, or statistics of outputs, to learn something about generalization vs. memorization (stealing a singer's voice), or to classify whether a piece of music is AI-generated? Bob is interested in mechanistic interpretability, too, and I think the lens of music is an interesting one. These models synthesize symbols or audio, and explanations are usually not of a decision but of properties of the output.

I am changing my view because I think model interpretation is important for a number of reasons. First, and especially for the LLMs coming out of foundation labs, these models are starting to power important automated decisions in the world (agents). It would help with alignment and safety to find ways to indicate whether an AI is being forthright or deceptive in its answer, and to detect sub-network activations (circuits) that correlate with deception or misaligned behavior. Second, I think interpreting what is happening in the networks can provide insights into new architectures or techniques for improving models. For instance, when a model is hallucinating, does that look different in some areas of the network than when it is not? This is an area of ongoing research, but it is conceivable that interpretability can provide insight into how to detect and prevent hallucinations. Finally, I can imagine these techniques becoming an alternative way to control the outputs themselves without additional training.

So, I'm in on mechanistic interpretability.

