2 points by teen-different, 4 hours ago | 1 comment
  • teen-different, 4 hours ago
    I ran a small experiment on adaptive computation in transformers.

    The idea was to use the residual stream to condition a tiny hypernetwork that generates low-rank updates for the value heads during the forward pass, instead of forcing the model to use the same fixed value transformation for every context.
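    To make the mechanism concrete, here is a minimal numpy sketch of what I mean. All names and sizes are illustrative (a real version would be a trained torch module per layer, not a single matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # model width and low-rank dimension (illustrative sizes)

# base value projection (fixed for every context)
W_V = rng.normal(scale=0.02, size=(d, d))

# tiny hypernetwork: maps the residual stream to flattened low-rank factors
W_hyper = rng.normal(scale=0.02, size=(d, 2 * d * r))

def adaptive_value(h):
    """h: residual stream vector of shape (d,). Returns the value output
    computed with a context-dependent low-rank update to W_V."""
    params = np.tanh(h @ W_hyper)           # (2*d*r,) generated parameters
    U = params[: d * r].reshape(d, r)       # (d, r) left factor
    B = params[d * r :].reshape(r, d)       # (r, d) right factor
    delta = U @ B / r                       # rank-r update, scaled by 1/r
    return h @ (W_V + delta)

h = rng.normal(size=(d,))
v = adaptive_value(h)
```

    The point of the low-rank factorization is that the hypernetwork only has to emit 2*d*r numbers instead of d*d, which keeps the per-token overhead small.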

    I trained five small variants for 12k steps on a mixed corpus: base, matched, adaptive, diffusion, and adaptive diffusion.

    The main thing I am thinking about now is how to make the hypernetwork's effect more subtle and more selective.

    The goal was to let the model make small, context-dependent adjustments to the value path when the residual stream carries a useful signal, not to have it aggressively rewrite itself at every step.
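    One way to enforce "start subtle" that I am considering (borrowed from LoRA's zero-initialized B matrix, not something the current runs do) is to zero-initialize the hypernetwork's output layer, so the adaptive model is exactly the base model at step 0 and the update has to be learned:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4

W_V = rng.normal(scale=0.02, size=(d, d))
# hypothetical zero-init: the hypernetwork's output layer starts at zero,
# so the generated low-rank delta is exactly 0 before training
W_hyper = np.zeros((d, 2 * d * r))

h = rng.normal(size=(d,))
params = h @ W_hyper
U = params[: d * r].reshape(d, r)
B = params[d * r :].reshape(r, d)
v_adaptive = h @ (W_V + U @ B)  # identical to the base value path at init
v_base = h @ W_V
```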

    Right now the adaptive models train stably and get reasonably close to the baselines, but the current hypernetwork does not yet seem precise enough.

    If anyone has ideas for stronger experiments here, I would genuinely love to hear them.

    Right now I am thinking about:

    - gated updates
    - cheaper conditioning networks
    - block-level or segment-level adaptation
    - broader context-conditioned updates instead of immediate token-driven updates

    If you were trying to make this actually useful, what would you change first?
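    For the gated-update idea, the simplest version I can think of is a scalar sigmoid gate on the residual stream, with its bias initialized negative so the gate starts mostly closed and the update only fires when the conditioning signal pushes it open. All names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64

# hypothetical gate parameters: bias starts at -4 so sigmoid(z) is ~0.02
# at init and the value path is essentially the base path
w_g = rng.normal(scale=0.02, size=(d,))
b_g = -4.0

def gate(h):
    """Scalar gate in (0, 1) controlling how much of the low-rank delta
    is applied: v = h @ (W_V + gate(h) * delta)."""
    z = h @ w_g + b_g
    return 1.0 / (1.0 + np.exp(-z))

h = rng.normal(size=(d,))
g = gate(h)  # near zero at init; training has to open the gate
```

    A gate like this would also give a cheap diagnostic: logging its value over a corpus would show whether the model learns to apply updates selectively or just pushes the gate open everywhere.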

    Happy to answer questions.