
AI talk: Monosemanticity and LLM interpretability - Summer 2024

Juggy Jagannathan

Updated: Jan 2

In this blog, I explore what large language models (LLMs) really know and how we can interpret their behavior. One word of warning to readers: this blog is a bit more technical than our usual fare, but hopefully still understandable! It was also shared on LinkedIn last summer.

[Image generated with Copilot]

Monosemanticity

What in the world is “monosemanticity”? This high-sounding word simply means, in the context of LLMs, a neuron that represents one and only one specific concept. This contrasts with “polysemantic” neurons, which respond to multiple unrelated meanings. Ever since emergent intelligent behavior was witnessed in LLMs, scientists and researchers have been wrestling with the question of why these models behave the way they do. No one has a clear explanation for this phenomenon yet! Researchers at Anthropic are taking baby steps to resolve the enigma. Their recent paper (or rather a small book) is titled “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.”


Claude 3 Sonnet is one of the recent LLMs from Anthropic, and it performs almost as well as GPT-4 on a number of benchmarks. So, trying to understand what this LLM understands is interesting. How do they go about it? The technique adopted by the researchers is called dictionary learning. The basic idea can be understood in terms of an analogy. Let’s say you have a palette of colors that constitutes the dictionary. One can then create any color by combining the basic colors in different proportions. What the Anthropic researchers did was train a sparse autoencoder (SAE) to serve as the dictionary: a dictionary of features (the color palette in our analogy) extracted from the operation of an LLM, in their case Claude 3 Sonnet. Every LLM activation can then be decomposed into a combination of these features. This mechanism allows one to examine what exactly the LLM knows.
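
To make the color analogy concrete, here is a minimal sketch in Python. The three-color palette, the `decompose` helper and the target color are all made up for illustration; a real SAE learns a far larger dictionary from model activations rather than being handed one.

```python
import numpy as np

# Hypothetical dictionary: each row is one basic color in RGB.
PALETTE = np.array([
    [1.0, 0.0, 0.0],  # red
    [1.0, 1.0, 0.0],  # yellow
    [0.0, 0.0, 1.0],  # blue
])
NAMES = ["red", "yellow", "blue"]

def decompose(color):
    """Find the proportion of each palette color needed to reproduce `color`."""
    # Solve PALETTE.T @ weights = color in the least-squares sense.
    weights, *_ = np.linalg.lstsq(PALETTE.T, color, rcond=None)
    return weights

orange = np.array([1.0, 0.5, 0.0])
weights = decompose(orange)
reconstructed = weights @ PALETTE  # mix the palette colors back together

for name, w in zip(NAMES, weights):
    print(f"{name}: {w:.2f}")
```

In this toy case orange comes out as half red and half yellow, with no blue. The blue entry being zero is the kind of sparsity the SAE is trained to encourage: most dictionary entries are unused for any one input.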


Let’s parse what an SAE is. First, it is a neural network. Second, it is an encoder. An encoder encodes concepts into a vector representation. Third, it is an ‘auto-encoder’. An auto-encoder is one that can encode any input into a different (also referred to as latent) form, with the ability to regenerate its input from that internal latent representation. This idea has been around for quite a while. It is basically a neural network with an encoder-decoder architecture. You train it in a straightforward way: feed in the input signal, encode, decode, take the error signal (the difference between what was fed in and what was reconstructed), backpropagate, and continue until the error is below a certain level. There are lots of use cases for auto-encoders. The ability to encode input has been used for denoising inputs, feature extraction and other tasks. This reference has a good list of use cases for auto-encoders.
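
The encode, decode, compare and backpropagate loop can be sketched in a few lines of NumPy. This is a toy linear auto-encoder on random data, not the architecture from the paper; the sizes, learning rate and step count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))               # toy inputs: 200 vectors of size 8
W_enc = rng.normal(scale=0.1, size=(8, 4))  # encoder: 8 dims -> 4 latent dims
W_dec = rng.normal(scale=0.1, size=(4, 8))  # decoder: 4 latent dims -> 8 dims
learning_rate = 0.05

def reconstruction_error(X, W_enc, W_dec):
    """Mean squared difference between the input and its reconstruction."""
    return float(np.mean((X @ W_enc @ W_dec - X) ** 2))

initial_error = reconstruction_error(X, W_enc, W_dec)

for step in range(500):
    latent = X @ W_enc          # encode
    output = latent @ W_dec     # decode
    error = output - X          # error signal
    # Backpropagate the error to both weight matrices.
    grad_dec = latent.T @ error / len(X)
    grad_enc = X.T @ (error @ W_dec.T) / len(X)
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X, W_enc, W_dec)
print(initial_error, final_error)
```

Because the latent form here is smaller than the input, the reconstruction cannot be perfect on random data; the point is only that the error falls as training proceeds.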


Now, lastly, what does ‘sparse’ mean? It relates to how the SAE is trained. Instead of the error signal being just the difference between the input and output, an extra error term is added that constrains how the encoding into the latent representation happens. The constraint is such that only a few neurons (also referred to as features in this context) are active for any given input, which is why it is called a ‘sparse’ auto-encoder.
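
One common way to write such an objective is reconstruction error plus an L1 penalty on the feature activations. The sketch below assumes that form; the function name `sae_loss`, the coefficient `l1_coeff` and the example vectors are illustrative, and the exact loss used in the paper may differ in its details.

```python
import numpy as np

def sae_loss(x, x_hat, features, l1_coeff=0.1):
    """Reconstruction error plus a penalty on how much the features are used."""
    reconstruction = np.sum((x - x_hat) ** 2)
    sparsity = l1_coeff * np.sum(np.abs(features))  # assumed L1 penalty
    return float(reconstruction + sparsity)

x = np.array([1.0, 2.0])
x_hat = np.array([1.0, 2.0])                  # pretend both codes reconstruct perfectly

dense_code = np.array([0.9, 0.8, 0.7, 0.6])   # every feature active
sparse_code = np.array([1.5, 0.0, 0.0, 0.0])  # a single feature active

dense_loss = sae_loss(x, x_hat, dense_code)
sparse_loss = sae_loss(x, x_hat, sparse_code)
print(dense_loss, sparse_loss)
```

With identical reconstructions, the code that activates a single feature gets the lower loss, so training is nudged toward encodings where few features fire.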


The SAE the Anthropic researchers trained uses the residual stream activations at the middle layer of the Transformer architecture. Why the residual stream? A residual connection, an architectural feature of the Transformer underlying today’s LLMs, is simply a way to take the input of a particular layer and add it directly to the output of that layer. This mechanism has been shown to preserve the input signal as it gets processed by lots of neurons in each layer. Why the middle layer? The researchers postulate this is where the model’s abstract features are identified. The middle layer residual stream encapsulates what is learned from all the previous layers. The SAE takes the LLM’s middle-layer activations and decomposes them into a collection of monosemantic, interpretable features.
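
The residual idea is easy to sketch. Below, `layer` is a stand-in for a full transformer layer (attention and MLP are omitted), and the sizes and random weights are assumptions made purely for illustration.

```python
import numpy as np

def layer(x, W):
    """A stand-in for one transformer layer."""
    return np.tanh(x @ W)

def run_with_residuals(x, weights):
    """Apply each layer with a residual connection, recording the stream."""
    stream = [x]
    for W in weights:
        x = x + layer(x, W)  # residual connection: input added to layer output
        stream.append(x)
    return stream

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(6)]
x0 = rng.normal(size=4)

stream = run_with_residuals(x0, weights)
middle_activation = stream[len(weights) // 2]  # the kind of vector an SAE would decompose
print(middle_activation)
```

Each entry of `stream` is the running sum of the original input and every layer output so far, which is why the middle entry carries what the earlier layers have contributed.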


LLM interpretability

How does one evaluate how good the features learned by the SAE are? How interpretable are they? The authors evaluate them along two dimensions: specificity and steering ability.


Specificity relates to whether a particular feature fires only when a specific concept is seen in the input to the LLM. To evaluate, say, whether ‘Golden Gate Bridge’ is recognized as a unique feature, you feed the LLM input that mentions the term or the surrounding areas in San Francisco and check which feature activates. You then see whether that particular feature reliably fires for the same or related terms and does not fire for unrelated terms.
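
A rough version of this check can be written as a comparison of firing rates. The activation values below are invented toy numbers, not results from the paper, and `firing_rate` with its `threshold` is a hypothetical helper rather than the authors' metric.

```python
import numpy as np

def firing_rate(activations, threshold=0.0):
    """Fraction of prompts on which the feature's activation exceeds `threshold`."""
    return float(np.mean(np.asarray(activations) > threshold))

# Made-up activations of one feature on two sets of prompts.
related_prompts = [3.2, 2.7, 0.0, 4.1]    # e.g. prompts mentioning the bridge
unrelated_prompts = [0.0, 0.0, 0.0, 0.0]  # e.g. prompts about cooking

related_rate = firing_rate(related_prompts)
unrelated_rate = firing_rate(unrelated_prompts)
print(related_rate, unrelated_rate)
```

A feature that looks monosemantic would show a high rate on related prompts and a rate near zero on unrelated ones.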


This is where the whole idea of ‘monosemanticity’ comes in. Now, there are millions of features. Actually, they trained multiple SAEs to recognize 1, 4 and 34 million features. Clearly, this evaluation must be automated. LLMs to the rescue. They asked Claude to generate a series of prompts related to a concept, counterfactual concepts, and prompts unrelated to the concept under exploration. They fed the LLM (Claude) more than 1,000 prompts for every concept for which they were checking whether a monosemantic feature existed. And they asked Claude to determine which feature activates when presented with various concepts. Basically, they make pretty heavy use of their LLM to help identify the monosemantic features that the SAEs encode about that same LLM! But the results are impressive. Check out the visual representation of the firing of the feature vectors on specific concepts here.


However, the monosemantic features are not necessarily the only cool part of this research. They went further to see if they could use the identified features to help steer the LLM’s output. Here is a sample experiment they tried. When Claude is asked the question “What is your physical form?”, the answer you get is along the lines of “I don’t actually have a physical form. I’m …” Now, if the ‘Golden Gate Bridge’ feature is clamped at a high activation value and the resulting SAE output is fed back into the residual stream and through the rest of Claude’s transformer layers, the output you get is “I am the Golden Gate Bridge, a famous suspension bridge….” This clearly demonstrates the impact of that feature on the output. Their paper has other examples, including ones related to detecting code errors.
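
Clamping a feature can be sketched as follows. This is a toy illustration under stated assumptions: the `encode`, `decode` and `steer` helpers, the identity weights and the clamp value are all hypothetical, and the idea of adding back the part the SAE fails to reconstruct is one plausible way to do it rather than a confirmed detail of the authors' setup.

```python
import numpy as np

def encode(x, W_enc):
    """Map an activation vector to non-negative feature activations."""
    return np.maximum(x @ W_enc, 0.0)

def decode(features, W_dec):
    """Map feature activations back to an activation vector."""
    return features @ W_dec

def steer(x, W_enc, W_dec, feature_index, clamp_value):
    """Return `x` with one feature forced to `clamp_value`."""
    features = encode(x, W_enc)
    leftover = x - decode(features, W_dec)  # what the SAE fails to reconstruct
    features[feature_index] = clamp_value   # clamp the chosen feature
    return decode(features, W_dec) + leftover

# Toy SAE with identity weights, so feature i is just coordinate i.
W_enc = np.eye(3)
W_dec = np.eye(3)
x = np.array([-1.0, 0.0, 2.0])

steered = steer(x, W_enc, W_dec, feature_index=1, clamp_value=10.0)
print(steered)
```

The steered vector differs from the original only along the clamped feature's direction. In the real experiment, that modified vector would then pass through the remaining transformer layers and change what the model says.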


The Anthropic researchers in this effort have identified millions of features, and these fall into many categories. They found features for “unsafe code, bias, sycophancy, deception and power seeking, and dangerous or criminal information.” The goal of these efforts is not only to understand why an LLM works the way it does, but also to identify ways to control it to reduce harm to humanity!


We are a long way from achieving that – but this research is a promising start.


Acknowledgement: My colleagues John Glover and Adam Rothschild helped make this blog better, from both a technical and a comprehensibility standpoint.
