New research is beginning to unravel how LLMs think. What are the practical consequences?

Jimmy Tidey
7 min read · Jun 16, 2024


Anthropic and OpenAI have recently published research focused on understanding how their LLMs (Claude and ChatGPT) work. This post describes their findings and discusses some implications. As ever, I’m interested in thinking about how LLMs will fit into social systems, but also in building a ‘feel’ for language models’ characteristics.

I also give a high-level explanation of the research technique.

If you put a human in an MRI and ask them a maths question, or ask them to analyse poetry, or have them play chess, the MRI will show activity in distinct regions of the brain for each task. This suggests that different parts of the human brain are specialised to perform different tasks.

By analogy, you can subject a neural network, for example ChatGPT, to the equivalent of a brain scan by looking at the activations of different parts of the network. ‘Scans’ of ChatGPT reveal something non-human — no matter what task the network is asked to perform, you will see broad activation across the whole network. Neurons that are connected closely together in the network do not seem to be performing related functions. There is no ‘maths’ or ‘chess’ or ‘poetry’ area in ChatGPT’s brain — it’s just a constant background noise of computation.
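
You can see something like this for yourself on a small open model. The sketch below is a toy illustration only, using GPT-2 as a stand-in for ChatGPT or Claude (whose internals aren’t publicly accessible): it records each layer’s activations via forward hooks while the model reads different prompts, and prints a per-layer activity profile rather than anything like a localised ‘maths region’.

```python
# Toy 'brain scan': record per-layer activations of GPT-2 on different prompts.
# GPT-2 is a stand-in here; this is not the tooling OpenAI or Anthropic use.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output[0] is the block's hidden states: (batch, tokens, hidden_dim)
        activations[name] = output[0].detach()
    return hook

# Register a hook on every transformer block so we can watch activity per layer.
for i, block in enumerate(model.h):
    block.register_forward_hook(make_hook(f"layer_{i}"))

prompts = ["What is 17 times 24?", "Analyse the imagery in this poem."]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        model(**inputs)
    # Mean absolute activation per layer: a broad, similar profile for both
    # prompts, rather than a distinct region lighting up for maths or poetry.
    profile = [activations[f"layer_{i}"].abs().mean().item() for i in range(len(model.h))]
    print(prompt, ["%.2f" % x for x in profile])
```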

So, if neural networks don’t have specialised regions, how do they work?

As OpenAI themselves note, even though they made ChatGPT, they don’t know how it works:

Unlike with most human creations, we don’t really understand the inner workings of neural networks… neural networks are not designed directly; we instead design the algorithms that train them. The resulting networks are not well understood and cannot be easily decomposed into identifiable parts.

From the OpenAI blog

Perhaps motivated by this gap in understanding, OpenAI and Anthropic have been investing resources in trying to interpret the ‘brain scans’ of neural networks, seeing if they can somehow read patterns into their uniform glow of activity and get a handle on what they are up to.

In May, Anthropic published a very readable paper and a blog post describing their progress in understanding how LLMs work. In June, OpenAI also published a paper, a blog post, and a visualisation tool on the same topic. The OpenAI visualisation is, imo, pretty rubbish for a team that almost certainly has an essentially infinite budget.

[Aside — two of the authors of the OpenAI paper have since moved to Anthropic after the most recent round of tensions in the company.]

At the end of this post I’ve written a section on how the ‘sparse autoencoder’ method that both teams use works — it does get slightly technical.

Long story short though — it works. They find ‘features’ in the neural network that correspond to recognisable ‘real-world’ concepts. These features are discovered by training a second neural network to interpret the activity of the LLM in question (ChatGPT or Claude).

I’m going to focus on Anthropic’s paper, rather than OpenAI’s. OpenAI’s paper points out that they have improved on Anthropic’s results as measured by the number of ‘dead latents’ in their sparse autoencoder (a measure of how computationally efficient their work is), but Anthropic’s paper is the clear winner in terms of explaining the work and its significance.

Findings

Much of the Anthropic team’s evaluation concerns taking the ‘features’ they have discovered and then editing the LLM to amplify them. It’s really worth reading the paper because the findings are beautifully presented and not difficult to understand. I’ll pull out a couple of points here.

Making Claude obsessed with a particular topic
When the researchers took the feature related to the Golden Gate Bridge and amplified its importance in the LLM, they observed the following:

When asked “what is your physical form?”, Claude’s usual kind of answer — “I have no physical form, I am an AI model” — changed to something much odder: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…”. Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query — even in situations where it wasn’t at all relevant.
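
Anthropic hasn’t released the machinery behind ‘Golden Gate Claude’, but mechanically this kind of amplification can be sketched as adding a scaled copy of the feature’s direction to the model’s hidden states at one layer. The example below is a hypothetical sketch: it uses GPT-2 and a random vector where the real work would use Claude and a direction taken from the trained sparse autoencoder.

```python
# Conceptual sketch of feature amplification ('steering'): add a scaled feature
# direction to the hidden states at one layer. The model and the direction are
# stand-ins; Anthropic's setup (Claude Sonnet plus their SAE) is not public.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

hidden_dim = model.config.n_embd
# Hypothetical feature direction; in reality this would be one row of the
# sparse autoencoder's decoder matrix (e.g. the 'Golden Gate Bridge' feature).
feature_direction = torch.randn(hidden_dim)
feature_direction = feature_direction / feature_direction.norm()
steering_strength = 10.0  # how hard to push the model towards the feature

def steering_hook(module, inputs, output):
    hidden = output[0]
    # Add the feature direction to every token position's hidden state.
    return (hidden + steering_strength * feature_direction,) + output[1:]

layer_to_steer = 6  # a middle layer, as in the interpretability papers
handle = model.transformer.h[layer_to_steer].register_forward_hook(steering_hook)

inputs = tokenizer("What is your physical form?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore normal behaviour
```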

Bias
The researchers also found features related to much more abstract properties of how Claude ‘thinks’. For example, they found a feature related to gender bias and tested what happened when its salience in the model was increased.

Higher-level features
An example of manipulating features relating to ‘internal conflict’ and ‘honesty’ seems to show an eerie self-awareness, or perhaps a worrying tendency towards deceit in the unedited model.

Dangerous behaviours
When prompted to write some code, Claude returns a valid solution. However, with the ‘unsafe code’ feature amplified in the LLM, it produces code with a subtle and potentially dangerous bug.

Mapping concepts
The researchers are also able to develop a sense of how close various features are to one another. The paper includes an illustration mapping the features that sit near the Golden Gate Bridge feature.
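
One way such a map can be built, sketched below with made-up sizes, is to treat each feature’s decoder direction as a vector and use cosine similarity between directions as a measure of closeness. The decoder matrix here is random, standing in for the weights of a trained sparse autoencoder.

```python
# Sketch: find the features 'nearest' to a given feature by cosine similarity
# between decoder directions. Sizes and weights are placeholders.
import torch

num_features, hidden_dim = 1000, 768              # assumed sizes, for illustration
decoder = torch.randn(num_features, hidden_dim)   # one direction per feature
decoder = decoder / decoder.norm(dim=1, keepdim=True)

similarity = decoder @ decoder.T                  # cosine similarity matrix

feature_id = 42                                   # e.g. a 'Golden Gate Bridge' feature
scores = similarity[feature_id].clone()
scores[feature_id] = -1.0                         # exclude the feature itself
nearest = torch.topk(scores, k=5).indices
print("Features closest to feature", feature_id, ":", nearest.tolist())
```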

Implications

Some of the thoughts below are highly speculative — it’s not at all clear how changing model parameters to shape behaviour would work in practice. It might be that small changes severely impair the model’s overall ability to answer questions.

Explainability
Many aspects of Responsible AI come down to being able to explain what a neural network is doing. Especially in the public sector, it is important to be able to demonstrate that AI is fair, secure and transparent. If AI is helping analyse responses from a public consultation, for example, there’s probably some responsibility to be able to explain to participants how the process works. How would you communicate that information?

There is a long journey between this research and being able to offer reliable explanations of how an LLM is thinking, but if we can understand LLMs better, it will have deep implications for Responsible AI and AI’s role in civic life.

Safety and alignment
Relatedly, as the gender bias and malicious coding examples demonstrate, this approach could be used to check, and even improve, the responses LLMs generate. For example, an LLM could add text discussing potential issues, or simply be modified so that it no longer exhibited the bias.

Self-reflection
LLMs have no knowledge of their own inner state. You can ask an LLM why it has given a particular answer and it can describe, based on its training data, some reasons why the answer is appropriate, but it has no actual access to what was going on inside it. Could LLMs be trained on information about themselves, allowing them to genuinely reflect on their own behaviours? Will AIs develop representations of themselves?

Scope limitation
In many applications of LLMs, it would be useful if they didn’t draw on their general knowledge. For example, if you have a chatbot answering customer queries, it might be nice to retain the LLM’s ability to write in natural language (the features around sentence structure and so on) while preventing it from generating responses with Wikipedia-type knowledge about topics that have nothing to do with the task it is performing. What would a language model with an excellent understanding of natural language but no knowledge look like? Is it even a logically coherent idea?

Knowledge Mapping
LLMs have tremendous potential for helping us reason about complex interconnected systems — one of the topics I’m most interested in. Could a map of the features of an LLM trained on biology research indicate new avenues for research based on unexpectedly closely-related topics, as revealed by geometric interpretations of features? Ultimately, these topic maps have the potential to reflect deep structures in human knowledge.

The ‘brain scan’ approach to understanding LLMs could be an area with tremendous scope for innovation. Unfortunately, it may also be an area where research is limited by access to the internal states of advanced LLMs.

The sparse autoencoder method

This section is intended to give the highest level intuition about how the sparse autoencoder technique works.

Below is an overview of Anthropic’s approach as a series of (very simplified) steps:

  1. They attach the outputs of a middle layer of the Claude Sonnet LLM to another neural network. The second NN is called a sparse autoencoder.
  2. They then ask Claude Sonnet to read text from The Pile, a large, standard database of text harvested from the internet, while the sparse autoencoder monitors Claude Sonnet’s activations.
  3. The sparse autoencoder, to (over)simplify, is trained to detect patterns in the activations of the LLM. The pattern spotting is tuned to find ‘building blocks’ of activations — groups of neurons that often fire together. They call these ‘features’. (A minimal sketch of this training setup appears after the list.)
  4. The researchers pick some of these features for further inspection, and find the texts from The Pile that most strongly activate them. They manually inspect the texts and try to work out what concept each feature might relate to. In the paper, the authors guess that they have found features corresponding to “The Golden Gate Bridge”, “Brain sciences”, “Transit infrastructure” and many others.
  5. To validate these guesses, they then ask Claude to read lots of texts that cause a particular feature to strongly activate, and to rate how much those texts relate to the researchers’ guess as to what the feature represents. Through this method they are able to state with some confidence that they have found many features in Claude Sonnet’s neural network that match a closely-corresponding real-world concept.
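
As a rough illustration of step 3, here is a minimal sparse autoencoder in PyTorch. Everything here is a placeholder: the dimensions, the sparsity penalty, and especially the training data, which is random noise standing in for activations recorded from a middle layer of the LLM. It is a sketch of the general technique, not Anthropic’s implementation.

```python
# Minimal sparse autoencoder sketch: reconstruct LLM activations through a much
# wider ReLU latent layer, with an L1 penalty so only a few latents fire per
# input. Those latents are the candidate 'features'. All values are placeholders.
import torch
import torch.nn as nn

hidden_dim = 768        # width of the monitored LLM layer (assumed)
num_features = 8192     # the latent space is much wider than the LLM layer
l1_coefficient = 1e-3   # strength of the sparsity penalty

class SparseAutoencoder(nn.Module):
    def __init__(self, hidden_dim, num_features):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, num_features)
        self.decoder = nn.Linear(num_features, hidden_dim)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(hidden_dim, num_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(100):
    # In the real pipeline this batch would be activations recorded while the
    # LLM reads text from The Pile; here it is random data.
    batch = torch.randn(64, hidden_dim)
    reconstruction, features = sae(batch)
    reconstruction_loss = (reconstruction - batch).pow(2).mean()
    sparsity_loss = features.abs().mean()         # L1 penalty pushes most latents to zero
    loss = reconstruction_loss + l1_coefficient * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```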


Jimmy Tidey

PhD on digital systems for collective action and social network analysis. jimmytidey.co.uk/blog