Just to give you some answers for what we can do:
1) We can find the training data that is causing a model to output toxic/unwanted text and correct it. 2) We know what high level concepts the model is relying on for any group of tokens it generates, hence, reducing that generation is as simple as toggling the effect of the output on that concept.
Most of the AI safety techniques fall under finetuning. Our model allows your to do this without fine-tuning. You can toggle the presence of .
For example, wouldn't you like to know why a model is being sycophantic? Or Sandbagging? Is it a particular kind of training data that is causing this? Or is it some high level part of the model's representations? For any of this, our model can tell you exactly why the model generated that output. Over the coming weeks, we'll show exactly how you can do this!
SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.
It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the other that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers and the ways they compose can completely confuse SHAP.
It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.
I don't know the specifics of what this particular model's approach is.
But SHAP unfortunately does not work for LLMs at all.
Here is what this model does: it `rewrites` the model's activations (during pre-training) into supervised + unsupervised concepts that are then decoded into tokens. So at pre-training, we constrained the model with 33k supervised concepts (e.g., sports, toxicity, alignment, demographic variables), and then have more (101k) unsupervised concepts for the model to learn as well.
Overall, the architecture and loss functions of this model allow you to answer the following questions: 1) Which token in the context caused a chunk (group of tokens) to be generated? 2) which high level concept (supervised or unsupervised) caused the 3) perhaps more interestingly, in a single forward pass, we can tell you which training chunk led to the output of the model as well.
We do all of this for the single steerling model which is 8B parameters trained on 1.5T tokens. First time any model of this scale has achieved this level of interpretability by design.
would be happy to answer more questions.
Given the example I saw about CRISPR, what does this model give over a different, non explaining model in the output ? Does it really make me more confident in the output if I know the data came from Arxiv or Wikipedia ?
I find the LLM outputs are subtlety wrong not obviously wrong
In the example it shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to show paragraph or sentence level that influences the answer ?
I believe that the plagiarism complaint about llm models comes from the assumption that there is a one-to-one relationship between training and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.
curious how the performance compares to a standard llama 8b on benchmarks - interpretability usually comes with a quality tax.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE'S activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having causal effect on the model's decisions.
1: https://thezvi.substack.com/p/the-most-forbidden-technique
We'll see.