Even with full context, writing CSS is challenging in a project where vanilla CSS is scattered around and wasn't well thought out to begin with. Coding agents struggle there too, just not as much as humans do, even with feedback loops through browser automation.
We could argue that writing poetry is a solved problem in much the same way, and while I don't think we especially need 50,000 people writing poems at Google, we do still need poets.
I'd assume that an implied concern of most engineers is how many software engineers the world will need in the future. If the situation is like the world's need for poets, then the field is only for the lucky few, and most people would be out of a job.
People need to understand that we have the technology to train models to do anything you can do on a computer; the only thing missing is the data.
If you can record a human doing anything on a computer, we'll soon have a way to automate it.
The price of having "Star Trek computers" is that people who work with computers have to adapt to the changes. Seems worth it?
How much do you wish someone else had done your favorite SOTA LLM's RLHF?
This benchmark doesn't have the latest models from the last two months, but Gemini 3 (with no tools) is already at 1750 - 1800 FIDE, which is probably around 1900 - 2000 USCF (about USCF expert level). That is enough to beat almost everyone at your local chess club.
Additionally, how do we know the model isn't benchmaxxed to eliminate illegal moves?
For example, here is the list of games by Gemini-3-pro-preview. In 44 games it made 3 illegal moves (if I counted correctly) but won 5 because its opponents forfeited due to illegal moves.
https://chessbenchllm.onrender.com/games?page=5&model=gemini...
I suspect the ratings here may be significantly inflated due to a flaw in the methodology.
EDIT: I want to suggest a better methodology here (I am not gonna do it; I really, really, really don't care about this technology). Have the LLMs play rated engines and rated humans; the first illegal move forfeits the game (the same rule applies to the humans).
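A minimal sketch of that forfeit rule, assuming python-chess, a local UCI engine binary, and a hypothetical llm_propose_move() that returns the model's move in UCI notation; none of this is the benchmark's actual code:

```python
import chess
import chess.engine

def llm_propose_move(fen: str) -> str:
    """Hypothetical: ask the LLM for its move in UCI notation, given the position as FEN."""
    raise NotImplementedError

def play_llm_vs_engine(engine_path: str = "stockfish", think_time: float = 0.1) -> str:
    """One game, LLM as White vs. a UCI engine. The LLM's first illegal move forfeits."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                uci = llm_propose_move(board.fen())
                try:
                    move = chess.Move.from_uci(uci)
                except ValueError:
                    return f"0-1 (forfeit: unparseable move {uci!r})"
                if move not in board.legal_moves:
                    return f"0-1 (forfeit: illegal move {uci!r})"  # same rule a human would face
                board.push(move)
            else:
                result = engine.play(board, chess.engine.Limit(time=think_time))
                board.push(result.move)
        return board.result()  # "1-0", "0-1", or "1/2-1/2"
    finally:
        engine.quit()
```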
The correct solution is to have a conventional chess AI as a tool and use the LLM as a front end for humanized output. A software engineer who proposes just doing it all via raw LLM should be fired.
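For illustration, a rough sketch of that split, where a conventional engine picks the move and the LLM only narrates it; the engine path and the llm_explain() helper are assumptions, not an existing tool:

```python
import chess
import chess.engine

def llm_explain(fen: str, move_san: str) -> str:
    """Hypothetical: ask the LLM to narrate the engine's choice in plain language."""
    return f"I'll go with {move_san} here."

def best_move_with_commentary(fen: str, engine_path: str = "stockfish") -> tuple[str, str]:
    """The engine picks the move; the LLM is only the presentation layer."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        result = engine.play(board, chess.engine.Limit(time=0.5))
    finally:
        engine.quit()
    san = board.san(result.move)
    return san, llm_explain(fen, san)
```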
The point isn't that LLMs are the best AI architecture for chess.
And so far I am only convinced that they have succeeded in appearing to have generalized reasoning. That is, when an LLM plays chess it is performing Searle's Chinese room thought experiment while claiming to pass the Turing test.
But I'm ignorant here. Can anyone with a better background of SOTA ML tell me if this is being pursued, and if so, how far away it is? (And if not, what are the arguments against it, or what other approaches might deliver similar capacities?)
Recent advances in mathematical/physics research have all been with coding agents making their own "tools" by writing programs: https://openai.com/index/new-result-theoretical-physics/
I will worry about developers being completely replaced when I see something resembling it. Enough people worry about that (or say it to amp stock prices) -- and they like to tell everyone about this future too. I just don't see it.
Unless there's a limited amount of software we need to produce per year globally to keep everyone happy, beyond which nobody wants more -- and we happen to be at that point right NOW, this second.
I think not. We can make more (in less time) and people will get more. That's the "glass half full" way of looking at it. Why not take that mental route instead? We don't know the future anyway.
Current software is often buggy because the pressure to ship is just too high. If AI can tie up some of those loose threads, overall quality improves.
Personally, I would welcome a massive deployment of AI to root out various zero-days from widespread libraries.
But we may instead get a larger quantity of even more buggy software.
I'd say that using AI tools effectively to create software systems is in that class currently, but it isn't necessarily always going to be the case.
Tell me, when was the last time you visited your shoe cobbler? How about your travel agent? Have you chatted with your phone operator recently?
The lump labour fallacy says it's a fallacy that automation reduces the net amount of human labor, importantly, across all industries. It does not say that automation won't eliminate or reduce jobs in specific industries.
It's an argument that jobs lost to automation aren't a big deal because there's always work somewhere else, just not necessarily in the job that was automated away.
And this write-up is no exception.
Why even bother thinking about AI, when Anthropic and OpenAI CEOs openly tell us what they want (quote from recent Dwarkesh interview) - "Then further down the spectrum, there’s 90% less demand for SWEs, which I think will happen but this is a spectrum."
So save yourself the thinking and listen to the intent: replace 90% of SWEs in the near future (6-12 months, according to Amodei).
AI will be a tool, no more, no less. Most likely a good one, but there will still need to be people driving it, guiding it, fixing things for it, etc.
All this discourse from CEOs is just that: stock-market pumping. Tech is the most profitable sector and software engineers are costly, so having investors dream about scale plus lower costs is good for the stock price.
All I'm saying is: why theorize about what AI is (exoskeleton, co-worker, new life form) when its owners' intent is to create a SWE replacement?
If your neighbor is building a nuclear reactor in his shed from a pile of smoke detectors, you don't say "think about this as a science experiment" because it's impossible; you call the police/NRC because of the intent and the actions.
Reliability comes from scaffolding: retrieval, tools, validation layers. Without that, fluency can masquerade as authority.
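As a toy illustration of what a "validation layer" can look like (call_llm() and the required-keys check are stand-ins, not any particular vendor's API), the point is that output gets parsed and checked rather than trusted because it reads fluently:

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for whatever model API is in use; returns raw text."""
    raise NotImplementedError

def ask_with_validation(prompt: str, required_keys: set[str], retries: int = 2) -> dict:
    """Don't trust fluent output: parse it, check the expected fields, retry on failure."""
    last_error = ""
    for _ in range(retries + 1):
        suffix = f"\n\nYour previous answer was rejected: {last_error}" if last_error else ""
        raw = call_llm(prompt + suffix)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"not valid JSON ({exc})"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not a JSON object"
            continue
        missing = [k for k in required_keys if k not in data]
        if missing:
            last_error = f"missing keys: {missing}"
            continue
        return data
    raise ValueError(f"model never produced valid output: {last_error}")
```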
The interesting question isn’t whether they’re coworkers or exoskeletons. It’s whether we’re mistaking rhetoric for epistemology.
neither are humans
> They optimize for next-token probability and human approval, not factual verification.
while there are outliers, most humans also tend to tell people what they want to hear and to fit in.
> factuality is emergent and contingent, not enforced by architecture.
like humans; as far as we know, there is no "factuality" gene, and we lie to ourselves, to others, in politics, scientific papers, to our partners, etc.
> If we’re going to treat them as coworkers or exoskeletons, we should be clear about that distinction.
I don't see the distinction. Humans exhibit many of the same behaviours.
For example, fact-checking a news article and making sure what gets reported lines up with base reality.
I once fact-checked a virology lecture and found that the professor had conflated two brothers into one person.
I am sure the professor has a rock-solid grasp of how viruses work, but errors like these probably creep in all the time.
Yet.
This is mostly a matter of data capture and organization. It sounds like Kasava is already doing a lot of this. They just need more sources.
> an AI that is truly operating as an independent agent in the economy without a human responsible for it
Sounds like the "customer support" in any large company (think Google, for example), to be honest.

But it's fun. I say "Henceforth you shall be known as Jaundice" and it's like "Alright my lord, I am now referred to as Jaundice".
I like the ebike analogy because [on many ebikes] you can press the button to go or pedal to amplify your output.
How typical!
Stochastic Parrots. Interns. Junior Devs. Thought partners. Bicycles for the mind. Spicy autocomplete. A blurry jpeg of the web. Calculators but for words. Copilot. The term "artificial intelligence" itself.
These may correspond to a greater or lesser degree with what LLMs are capable of, but if we stick to metaphors as our primary tool for reasoning about these machines, we're hamstringing ourselves and making it impossible to reason about the frontier of capabilities, or resolve disagreements about them.
An understanding without metaphors isn't easy -- it requires a grasp of math, computer science, linguistics, and philosophy.
But if we're going to move forward instead of just finding slightly more useful tropes, we have to do it. Or at least try.
Can you highlight what you've managed to do with it?