However, the article misses the first two LLMs entirely.
Radford cited CoVe, ELMo, and ULMFiT as the inspirations for GPT. ULMFiT (my paper with Sebastian Ruder) was the only one of the three that actually fine-tuned the full language model for downstream tasks. https://thundergolfer.com/blog/the-first-llm
ULMFiT also pioneered the 3-stage approach: pretraining a language model with a causal LM objective on a general corpus, fine-tuning it with the same objective on the target-domain corpus, and then fine-tuning that with a classification objective. Much later this was used in GPT-3.5 Instruct, and today it is used pretty much everywhere.
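To make the shape of that recipe concrete, here is a minimal sketch using the Hugging Face Transformers API (my own illustration, not the original fastai/AWD-LSTM code; the checkpoint names and the commented-out training calls are placeholders):

    # Sketch of the three ULMFiT stages, using modern HF tooling rather than
    # the original fastai/AWD-LSTM implementation.
    from transformers import (AutoModelForCausalLM,
                              AutoModelForSequenceClassification,
                              AutoTokenizer)

    tok = AutoTokenizer.from_pretrained("gpt2")

    # Stage 1: causal LM pretraining on a large general corpus. In practice
    # you start from a checkpoint that has already done this (here "gpt2").
    lm = AutoModelForCausalLM.from_pretrained("gpt2")

    # Stage 2: continue causal LM training on the target-domain corpus
    # (e.g. movie reviews) so the model adapts to that domain's language.
    # fine_tune_causal_lm(lm, domain_corpus)        # placeholder training loop

    # Stage 3: replace the LM head with a task head and fine-tune on labels.
    # In a real pipeline you would load the Stage-2 weights here, not "gpt2".
    clf = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
    # fine_tune_classifier(clf, labelled_examples)  # placeholder training loop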
The other major oversight in the article is that Dai and Le (2015) is missing -- that pre-dated even ULMFiT in fine-tuning a language model for downstream tasks, but they missed the key insight that a general-purpose model pretrained on a large corpus was the critical first step.
It's also missing a key piece of the puzzle regarding attention and transformers: the memory networks paper recently had its 10th birthday and there's a nice writeup of its history here: https://x.com/tesatory/status/1911150652556026328?s=46
It came out about the same time as the Neural Turing Machines paper (https://arxiv.org/abs/1410.5401), covering similar territory -- both pioneered the idea of combining attention and memory in ways later incorporated into transformers.
1. I think the paper underemphasizes the relevance of BERT. While from today's LLM-centric perspective it may seem minor because it's in a different branch of the tech tree, it smashed multiple benchmarks at the time and made previous approaches to many NLP analysis tasks immediately obsolete. While I don't much like citation counts as a metric, a testament to its impact is that it has more than 145K citations - the same order of magnitude as the Transformer paper (197K) and many more than GPT-1 (16K). GPT-1 would ultimately be a landmark paper due to what came afterwards, but at the time it wasn't that useful, being oriented more toward generation (without being that good at it) and, IIRC, not really publicly available (it was technically open source but not posted in a repository or with a framework that let you actually run it). It's also worth remarking that for many NLP tasks that are not generative (things like NER, parsing, sentence/document classification, etc.) the best alternative is often still a BERT-like model, even in 2025.
2. The writing kind of implies that modern LLMs were something consciously sought after ("the transformer architecture was not enough. Researchers also needed advancements in how these models were trained in order to make the commodity LLMs most people interact with today"). The truth is that no one in the field expected modern LLMs. The story was more that the OpenAI researchers noticed GPT-2 was good at generating random text that looked fluent, and thought "if we make it bigger it will do that even better". But it turned out that not only did it generate better random text, it started being able to actually state real facts (in spite of the occasional hallucinations), answer questions, translate, be creative, etc. All those emergent abilities that are the basis of the "commodity LLMs most people interact with today" were a totally unexpected development. In fact, it is still poorly understood why they work.
The fact that, sometime later, GPT-2 could do zero-shot generation was indeed something a lot of folks got excited about, but that was actually not the correct path. The 3-step ULMFiT approach (causal LM training on a general corpus, then on a specialised corpus, then classification-task fine-tuning) was what GPT-3.5 Instruct used, which formed the basis of the first ChatGPT product.
So although it took quite a while to take off, the idea of the LLM was quite intentional and has largely developed as I planned (even though at the time almost no-one else felt the same way; luckily Alec Radford did! He told me in 2018 that reading the ULMFiT paper was a big "omg" moment for him, and he set to work on GPT right away.)
PS: On (1), if I may take a moment to highlight my team's recent work: we updated BERT last year to create ModernBERT, which showed that yes, this approach still has legs. Our models have had >1.5m downloads and there are >2k fine-tunes and variants of it now on Hugging Face: https://huggingface.co/models?search=modernbert
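If you want to poke at one of these encoder models, a masked-LM call is a one-liner with a recent Transformers release that supports ModernBERT (the repo id below is my best guess at the base checkpoint; see the search link above for the full list):

    from transformers import pipeline

    # Fill-mask with an encoder-style model.
    fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
    print(fill("The capital of France is [MASK].")[0]["token_str"])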
Still, I think only a tiny minority of the field expected it, and I think it was also clear from the messaging at the time that the OpenAI researchers who saw how GPT-3 (pre-instruct) started solving arbitrary tasks and displaying emergent abilities were surprised by that. Maybe they did have an ultimate goal in mind of creating a general-purpose system via next word prediction, but I don't think they expected it so soon and just by scaling GPT-2.
RLHF seems to have been the critical piece that "aligned" the otherwise rather wild output of a purely "causally" (next-token prediction) trained LLM with what a human expects in terms of conversational turn taking (e.g. Q & A) and instruction following, as well as more general preferences/expectations.
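For readers who want something concrete, the preference-modelling step at the heart of RLHF is usually a simple pairwise objective; here is a toy sketch (my own, with made-up reward scores standing in for the output of a scalar head on top of the LM):

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style pairwise loss: push the reward of the
        # human-preferred response above that of the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy usage: three preference pairs "scored" by a hypothetical reward model.
    r_chosen = torch.tensor([1.2, 0.3, 2.1])
    r_rejected = torch.tensor([0.4, 0.9, 1.0])
    print(reward_model_loss(r_chosen, r_rejected))

A reward model trained with a loss like this is then what the subsequent RL step optimises the LLM against.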
With GPT-3 and later ChatGPT, there was a very fundamental shift in how people think about approaching NLP problems. Many of the techniques and methods became outdated and you could suddenly do things that were not feasible before.
I remember this being talked about maybe even earlier than 2018/2019, but the scale of models then was still at least one order of magnitude too small for this to have a chance of working. It was the ridiculous scale of GPT that made it possible to see that scaling would make it useful.
(Tangentially related; I remember a research project/system from maybe 2010 or earlier that could respond to natural language queries. One of the demos was to ask for distance between cities. It was based on some sort of language parsing and knowledge graph/database, not deep-learning. Would be interesting to read about this again, if anyone remembers.)
https://blog.plan99.net/the-science-of-westworld-ec624585e47
bAbI paper: https://arxiv.org/abs/1502.05698
Abstract: One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human.
So at least FAIR was thinking about making AI that you could ask questions of in natural language. Then they went and beat their own benchmark with the Memory Networks paper:
https://arxiv.org/pdf/1410.3916
Fred went to the kitchen. Fred picked up the milk. Fred travelled to the office.
Where is the milk ? A: office
Where does milk come from ? A: milk come from cow
What is a cow a type of ? A: cow be female of cattle
Where are cattle found ? A: cattle farm become widespread in brazil
What does milk taste like ? A: milk taste like milk
What does milk go well with ? A: milk go with coffee
Where was Fred before the office ? A: kitchen
That was published in 2015, so we can see ChatGPT-like capabilities quite early, even though they're still quite primitive.
They are fun reads, and people interested in LMs like myself probably won't be able to stop thinking about how they can see the echoes of this work in Bengio et al.'s 2003 paper.
[0] Shannon CE. Prediction and Entropy of Printed English. In: Claude E Shannon: Collected Papers [Internet]. IEEE; 1993 [cited 2025 Sep 15]. p. 194–208. Available from: https://ieeexplore.ieee.org/document/5312178
[1] Cover T, King R. A convergent gambling estimate of the entropy of English. IEEE Trans Inform Theory. 1978 Jul;24(4):413–21.
This is not good for training neural networks (because they like to be fed dense, continuous data, not sparse and discrete data), and it treats each word as an atomic entity without dealing with relationships between them (you don't have a way to know that the words "plane" and "airplane" are more related than "plane" and "dog").
With word embeddings, you get a space of continuous vectors with a predefined (lower) number of dimensions. This is more useful to serve as input or training data to neural networks, and it is a representation of the meaning space ("plane" and "airplane" will have very similar vectors, while the one for "dog" will be different) which opens up a lot of possibilities to make models and systems more robust.
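A concrete way to see this, assuming gensim and one of its standard pretrained GloVe downloads (neither is mentioned above; they're just a convenient stand-in):

    import gensim.downloader as api

    # 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
    vecs = api.load("glove-wiki-gigaword-100")
    print(vecs.similarity("plane", "airplane"))  # high cosine similarity
    print(vecs.similarity("plane", "dog"))       # much lower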
The size of the embedding space (number of vector dimensions) is therefore larger than needed to just represent word meanings - it needs to be large enough to also be able to represent the information added by these layer-wise transformations.
The way I think of these transformations (happy to be corrected) is more as adding information than as modifying what is already there: conceptually the embeddings start as word embeddings, then maybe get augmented with part-of-speech information, then additional syntactic/parsing information, and then semantic information, with the embedding incrementally enriched as it is "transformed" by successive layers.
This is very much the case considering the residual connections within the model. The final representation can be expressed as a sum of contributions from N layers, where the N-th representation is a function of the (N-1)-th.
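A toy sketch of that telescoping-sum view (just the residual stream, ignoring attention and normalisation, so this is an illustration rather than an actual transformer):

    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

        def forward(self, x):
            return x + self.ff(x)   # residual: output = input + this layer's contribution

    dim, n_layers = 16, 4
    blocks = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])

    x = torch.randn(1, dim)          # "word embedding" entering the stack
    h, deltas = x, []
    for block in blocks:
        new_h = block(h)
        deltas.append(new_h - h)     # what this layer added
        h = new_h

    # The final representation equals the original embedding plus all the deltas.
    print(torch.allclose(h, x + sum(deltas), atol=1e-6))  # True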
When I first started using LLMs, I thought this sort of history retracing would be something you could use LLMs for. They were good at language, and research papers are language + math + graphs. At the time they didn't really understand math and they weren't multimodal yet, but still I decided to try a very basic version by feeding it some papers I knew very well in my area of expertise and try to construct the genealogy of the main idea by tracing references.
What I found at the time was garbage, but I attribute that mostly to me not being very rigorous. It suggested papers that came years after the actual catalysts that were basically regurgitations of existing results. Not even syntheses, just garbage papers that will never be cited by anyone but the authors themselves.
What I concluded was that it didn't work because LLMs don't understand ideas, so they can't really trace them. They were basically doing dot products to find the papers in the current literature that matched the wording best, which will of course yield a recency bias, as subfields converge on common phrasings. I think there's also an "unoriginality" bias, in the sense that the true catalyst/origin of an idea will likely not have the most refined and "survivable" way of describing the new idea. New ideas are new, and upon digestion by the community will probably come out looking a little different. That is to say, raw text matching isn't the best approach to tracing ideas.
I'm absolutely certain someone could do (and has done) a much better job than my amateur exploration, and I'd love to know more. As far as I know, methods based solely on the analysis of citation graphs could probably beat what I tried.
Warning: ahead are less-than-half-baked ideas.
But now I'm wondering if you could extend the idea of "addition in language space" as LLMs encode (king-man+woman=queen or whatever that example is) to addition in the space of ideas/concepts as expressed in scientific research articles. It seems most doable in math, where stuff is encapsulated in theorems and mathematicians are otherwise precise about the pieces needed to construct a result. Maybe this already exists with automatic theorem provers I know exist but don't understand. Like what is the missing piece between "two intersecting lines form a plane" and "n-d space is spanned by n independent vectors in n-d space"? What's the "delta" that gets you from 2d to n-d basis? I can't even come up with a clean example of what I'm trying to convey...
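For reference, the concrete word-vector version of that "addition in language space" is easy to reproduce with the same kind of pretrained vectors as above (again just gensim's GloVe download, my stand-in):

    import gensim.downloader as api

    vecs = api.load("glove-wiki-gigaword-100")
    # king - man + woman: "queen" normally shows up at or near the top.
    print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=3))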
What I'm trying to say is, wouldn't it be cool if we could 1) take a paper P published in 2025, 2) consider all papers/talks/proceedings/blog posts published before it, and 3) come up with the set of papers that require the smallest "delta" in idea space to reach P. That is, new idea(s) = novel part of P = delta = P - (contributions of the ideas represented by the rest of the papers in the set). Suppose further you have some clustering to clean stuff up, so you have just one paper per contributing idea, P_x representing idea x (or maybe a set).
Then you could do stuff like remove(1) from the corpus all of the papers similar to the P_x representing the single "idea" x that contributed the most to the sum current_paper_idea(s) = delta + (contributions x_i from preexisting papers). With that idea x no longer in existence, how hard is it to get to the new idea - how much bigger is delta? And perhaps more interesting, is there a new, novel route to the new idea? This presupposes the ability of the system to figure out the missing piece(s), but my optimistic take is that it's much easier to get to a result when you know the result. Of course, the larger the delta, the harder it is to construct a new path. If culling an idea leads to the inability to construct a new path, it was probably quite important. I think this would be valuable for trying to trace the most likely path to a paper -- emphasis on most likely, with the enormous assumption that "shortest path" = most likely; we'll never really know where someone got an idea. But it would also be valuable for uncovering different trajectories/routes from one set of ideas to another via the proposed deletion perturbations. Maybe it unveils a better pedagogical approach, an otherwise unknown connection between subfields, or at the very least is instructive in the same way that knowing how to solve a problem multiple ways is instructive.
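One very rough way to make that computable, under an assumption I'm adding myself (that a paper's ideas can be squashed into an embedding vector and "contribution" treated as a linear fit), would be something like:

    import numpy as np

    rng = np.random.default_rng(0)
    prior = rng.normal(size=(50, 128))   # embeddings of earlier papers (placeholders)
    p = rng.normal(size=128)             # embedding of the new paper P (placeholder)

    def delta(prior_embs, target):
        """Least-squares fit of the target paper on the prior papers; the
        residual norm is the 'delta' the prior literature can't account for."""
        coeffs, *_ = np.linalg.lstsq(prior_embs.T, target, rcond=None)
        residual = target - prior_embs.T @ coeffs
        return float(np.linalg.norm(residual)), coeffs

    d_full, coeffs = delta(prior, p)

    # "Deletion perturbation": drop the most contributing prior paper and see
    # how much harder P becomes to reach.
    top = int(np.argmax(np.abs(coeffs)))
    d_without, _ = delta(np.delete(prior, top, axis=0), p)
    print(d_full, d_without)             # the gap is a crude importance score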
That's all very, very vague and hand-wavy, but I'm guessing there's some ideas in epistemology, knowledge graphs and other stuff that I don't know that could bring it a little closer to sensical.
Thank you for sitting through my brain dump, feel free to shit on it.
(1) This whole half-baked idea needs a lot of work. Especially obvious is that to be sure of cleansing the idea space of everything coming from those papers would probably require complete retraining? This whole thing also presupposes that ideas are traceable to publications, which is unlikely for many reasons.
perhaps that is how the argument persists?
Simple: they are hallucinations that turn out to be correct or useful.
Ask ChatGPT to create a million new concepts that weren't in its training data and some of them are bound to be similarly correct or useful. The only difference is that humans have hands and eyes to test their new ideas.