I'm always happy to see negative results published, but it seems like they are selling what are really negative results as positive ones.
If they can test against Llama 70B and Mistral 7B, they ought to compare against Mixtral 8x7B imho
I have yet to see anything that dissuades me from agreeing with Yann LeCun when he says Transformers are fundamentally limited. We won't get creativity, reasoning, or even move past hallucinations without a major breakthrough.
For example, a small child is completely capable of being told "get in the car" and can understand, navigate, open the door, and get in, using incredibly little energy (maybe about as much as is in a single potato chip/crisp).
Now consider what I have been working on recently: (1) evaluating secops tools from both a technical and business perspective, and (2) prototyping and creating an RFC for the next version of our DX at the org. LLMs are very far from this capability because it involves so many competing incentives and trade-offs, and not just the context of the current state of the code but also its history and the vision for it. Crafting that vision is especially beyond what a foundation in transformers can offer. They are, in essence, an averaging and sequence-prediction algorithm.
These tools are useful, and even provide an ROI, but they are by no means anywhere close to what I would call intelligent.
Faith and Fate: Limits of Transformers on Compositionality https://arxiv.org/abs/2305.18654
Maybe the analogy is something like gold mining. We could pretend that the machines that mine gold are actually creating gold, as if the entire gold mining sector were instead a discovery of alchemy.
Maybe the way alchemy kind of leads to chemistry is the analogy that applies?
I don't even know if that is right though.
The intelligence is in the training data. The model then is extracting the intelligence.
We can't forget Feynman's point here: we aren't going to make a robot cheetah that runs fast; we will make a machine that uses wheels. Viewing things through the lens of a cheetah is a category error.
While I agree with you completely, we might both very well be completely and utterly wrong: a category error about what intelligence "is".
Z-vectors are of course nothing like the subsystems in your brain, but the general approach is certainly similar to how the brain works.
They now have an API that allows for dynamic exploration and manipulation of the latent space of Llama 8B-70B models (think Golden Gate Claude). They also open-sourced the sparse autoencoders that (in part) allow for this:
https://huggingface.co/Goodfire/Llama-3.3-70B-Instruct-SAE-l...
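For what it's worth, here's a rough sketch of what SAE-based steering looks like in code. This is not Goodfire's actual API; the class and function names are invented for illustration, but the mechanics (encode a hidden state into sparse features, nudge one feature, decode only the delta back into the residual stream) are the standard recipe:

    import torch

    # Hypothetical SAE wrapper; a real SAE ships trained weights rather than fresh Linear layers.
    class SparseAutoencoder(torch.nn.Module):
        def __init__(self, d_model: int, d_features: int):
            super().__init__()
            self.encoder = torch.nn.Linear(d_model, d_features)
            self.decoder = torch.nn.Linear(d_features, d_model)

        def encode(self, h):
            # hidden state -> sparse, (hopefully) interpretable feature activations
            return torch.relu(self.encoder(h))

        def decode(self, f):
            # feature activations -> reconstructed hidden state
            return self.decoder(f)

    def steer(h, sae, feature_idx, strength=5.0):
        # "Golden Gate"-style edit: amplify a single feature, then add only the
        # resulting delta back onto the original hidden state.
        f = sae.encode(h)
        baseline = sae.decode(f)
        f[..., feature_idx] += strength
        return h + (sae.decode(f) - baseline)

The linked repo provides the trained SAE weights; which layer it hooks into and which feature indices mean what are the parts you'd have to look up per model.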
It's already been invented: https://arxiv.org/abs/2202.05780 . That design is just very inefficient to scale up / use as a transformer backbone.
Remove the bottom weights dynamically, based on the local gradient in varentropy, so that internal dissonance ("doubt") can be selected against.
"Preference Optimization" but with more opportunities for meta-optimization.
Coming from a math background, it always amazes me to see how people in AI/ML brag about their papers. If someone wrote:
> My paper represents a significant milestone in the evolution of algebraic geometry/ergodic theory/combinatorics
it would be a laughingstock in the math community.
I’m guessing that the difference lies in the potential value extraction possibilities from the idea.
If you compare the transformers paper to an algorithm or a result in geometry that nobody uses, I think the difference is obvious from this perspective.
However, if that geometry paper led to something like a new way of doing strained silicon for integrated circuit design that made manufacturing 10 times cheaper and the circuits 10 times faster, then it would be more important than the transformers one.
Yes
This particular paper is not peer reviewed or published beyond a preprint on arXiv.
It's common to have competitions where the one with the highest score on the benchmark "wins". Even if there is no formal competition, being the SOTA model matters a great deal.
Results are more applicable to the real world, and subjectively more "cool" (I don't think there's a Two Minute Papers equivalent for math?), which inflates egos.
And often authors are trying to convince others to use their findings. So it's partly a marketing brochure.
There is more to it than simple «lifelong learning»: the whole of past experience remains productive and requires analysis; it is not "solved".
Anyway: the directions seem good.
Edit: equally interesting, in another direction, is the automated analysis of the internal subagents, «break[ing] down the vast, complex knowledge stored in the LLM into smaller, meaningful, and independent pieces (e.g., the different pathways or components for math, language understanding, etc)». Shouldn't there be a general study of the dissection of systems with seemingly emergent intelligence, doing for LLMs what we do for C. elegans?
I like that background animation. Seems like there's an opportunity for tiny logic gates and some punny swarm behavior.
Key ideas (in simple terms):
1. What’s the problem?
- Fine-tuning LLMs for every new task is slow, expensive, and often doesn't generalize well.
- Models trained on one task may perform poorly on others, especially unseen ones.
- Current methods (like LoRA) can add new capabilities but aren't efficient enough.
2. The solution:
- Transformer² uses a new fine-tuning method called Singular Value Fine-tuning (SVF). This focuses on adjusting only certain parts of the model’s "weight matrices" rather than changing everything.
- By tweaking specific components (called "singular values"), it trains smaller, efficient "expert" modules that specialize in particular types of tasks (see the sketch after this list).
3. How it works:
- Training phase: Train these smaller expert modules offline using reinforcement learning (RL) to specialize in tasks like coding, math, or reasoning.
- Inference phase: When a new input is given, the system analyzes the task (e.g., “Is this a math or coding problem?”) in the first pass. Based on this, it combines the right expert modules and adapts the model’s behavior in the second pass.
4. Three adaptation strategies:
- Prompt-based: Use a cleverly designed text prompt to figure out the task type and pick the right expert module.
- Classifier-based: Train a separate model to classify tasks and match them to experts.
- Few-shot adaptation: Look at a small number of examples (few-shot learning) to dynamically combine expert modules for the best results.
5. Efficiency:
- The system uses fewer parameters than traditional fine-tuning methods like LoRA.
- Adaptation works even on small datasets without overfitting or forgetting older tasks.
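To make items 2 and 3 concrete, here's a heavily simplified sketch of what SVF and the two-pass adaptation amount to. The helper names are mine, and the real method learns the z-vectors with RL per task rather than constructing them by hand:

    import torch

    def svf_adapt(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Singular Value Fine-tuning: keep U and V frozen and learn only a
        # per-singular-value scale z, so the adapted weight is W' = U diag(S * z) V^T.
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        return U @ torch.diag(S * z) @ Vh

    # Offline (training): one z-vector per skill, optimized with RL on task reward.
    experts = {
        "math": torch.ones(4096),   # placeholders; the real z-vectors are learned
        "code": torch.ones(4096),
    }

    def second_pass_weight(W: torch.Tensor, mix: dict) -> torch.Tensor:
        # First pass: inspect the prompt (via a prompt, a classifier, or few-shot search)
        # and produce mixing weights, e.g. {"math": 0.8, "code": 0.2}.
        # Second pass: run the model with the interpolated expert applied to each weight matrix.
        z = sum(w * experts[name] for name, w in mix.items())
        return svf_adapt(W, z)

The appeal of the decomposition is that z only has as many entries as W has singular values, so each "expert" is tiny compared to a full fine-tune or even a LoRA adapter, and experts can be linearly mixed at inference time.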