We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:
- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it
- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis
- Get better AI-based review: Greptile and Bugbot and half a dozen others
- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.
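The "deterministic verification" idea above can be sketched as a merge gate: run every check, and only let an unreviewed PR through if all of them pass. This is a minimal illustrative sketch, not any particular CI system's API; in practice each check would shell out to a real formatter, linter, or test runner.

```python
# Minimal sketch of a deterministic merge gate for AI-generated PRs.
# Check names and the stub checks are illustrative, not prescriptive.
from typing import Callable

def run_gate(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run every check; return (all_passed, names_of_failing_checks)."""
    failures = [name for name, check in checks.items() if not check()]
    return (not failures, failures)

# In a real setup each lambda would invoke a tool (formatter, linter,
# the test suite) via subprocess and return its exit status.
checks = {
    "format": lambda: True,
    "lint":   lambda: True,
    "tests":  lambda: False,  # a failing suite blocks the merge
}

ok, failed = run_gate(checks)
print("mergeable:", ok, "failed:", failed)
```

The point of making the gate a single all-or-nothing function is that "sight unseen" merging only works if no single check can be waived ad hoc.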
None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.
But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.
For a non trivial program, 2 implementations of the same natural language spec will have thousands of observable differences.
Where we are today (that is, agents require guardrails to keep from spinning out), there is no way to let agents work on code autonomously without all of those observable differences constantly shifting, resulting in unusable software.
Tests can’t prevent this because for a test suite to cover all observable behavior, it would need to be more complex than the code. In which case, it wouldn’t be any easier for machine or human to understand.
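The "thousands of observable differences" claim is easy to demonstrate in miniature. Here are two implementations of the same informal spec, "return records sorted by age," that diverge on an observable detail the spec never mentions: the order of ties. The data and names are made up for illustration.

```python
# Two spec-conforming implementations with different observable behavior.
records = [("bob", 30), ("alice", 30), ("carol", 25)]

def impl_a(rs):
    # Stable sort: records with equal ages keep their input order.
    return sorted(rs, key=lambda r: r[1])

def impl_b(rs):
    # Also "sorted by age" per the spec, but ties come out
    # alphabetically -- a different observable behavior.
    return sorted(rs, key=lambda r: (r[1], r[0]))

print(impl_a(records))  # carol, bob, alice
print(impl_b(records))  # carol, alice, bob
```

Both pass any test that only checks "ages are non-decreasing," yet anything downstream that depends on the tie order (pagination, diffs, caching) will observe the difference.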
The only solution to this problem is for LLMs to get better. Personally, I think at the point they can pull this off, they can do any white-collar job, and there's no point in planning for that future because it results in either Mad Max or Star Trek.
But humans are much better than current LLMs at reasoning about whether a change will impact observable behavior, as evidenced by the fact that LLMs require a test suite or something similar to build a working app longer than a few thousand lines.
If they're not defined in the spec then these differences shouldn't matter, they're just implementation details. And if they do matter, then they should be included in the spec; a natural language spec that doesn't specify some things that should be specified is not a good spec.
I doubt there exists a single piece of nontrivial software today where you could randomly alter 5% of the implementation details while keeping to the spec, without resulting in a flood of support tickets.
So, never.
Greg Kroah-Hartman was once asked by his boss, "when will Linux be done?" and he said, "when people stop making new hardware." Even today, when we assume the hardware won't lie, much of the work in maintaining Linux is around hardware bugs.
So even at the lowest levels of software development, you can’t know the bugs you’re going to have until you partially solve the problem and find out that this combination of hardware and drivers produces an error, and you only find that out because someone with that combination tried it. There is no way to prevent that by “make better spec”.
But that’s always been true. Basically it’s the 3-body-problem. On the spectrum of simple-complicated-complex, you can calculate the future state of a system if it’s simple, or “only complicated” (sometimes), but you literally cannot know the future state of complex systems without simulating them, running each step and finding out.
And it gets worse. Software ranges from simple to complicated to complex. But it exists within a complex hardware environment, and also within a complex business environment where people change and interest rates change and motives change from month to month.
There is no “correct spec”.
Every strategy which worked with an off-shore team in India works well for AI.
Sometime in mid 2017, I found myself running out of hours in the day stopping code from being merged.
On one hand, I needed to stamp the PRs because I was an ASF PMC member and not a lot of the folks opening JIRAs were. And this wasn't a tech-debt-friendly culture, because someone from LinkedIn or Netflix or EMR could say "Your PR is shit, why did you merge it?" and "Well, we had a release due in 6 days" is not an answer.
Claude has been a drop-in replacement for the same problem, where I have to exercise the exact same muscles, though a lot easier because I can tell the AI that "This is completely wrong, throw it away and start over" without involving Claude's manager in the conversation.
The manager conversations were warranted and I learned to be nicer two years into that experience [1], but it's a soft skill which I no longer use with AI.
Every single method which worked with a remote team in a different timezone works with AI for me & perhaps better, because they're all clones of the best available - specs, pre-commit verifiers, mandatory reviews by someone uncommitted on the deadline, ease of reproducing bugs outside production and less clever code over all.
Why hasn't SWE been completely outsourced over the last 20 years, then? Corporations were certainly trying hard.
Replace AI written with “cheap dev written” and think about why that isn’t already true.
The bottleneck is a competent dev understanding a project. Always has been.
Another fundamental flaw is you can’t trust LLMs. It’s fundamentally impossible compared to the way you trust a human. Humans make mistakes. LLMs do not. Anything “wrong” they do is them working exactly as designed.
It's wild that the gamut of PRs being zipped around don't even include these. You would run such validations as a human...
Or we could actually, you know, stop using a tool that doesn't work. People are so desperate to believe in the productivity boosts of AI that they are trying to contort the whole industry around a tool that is bad at its job, rather than going "yeah that tool sucks" and moving on like a sane person would.
Throw in some simulated user interactions in a staging environment with a bunch of agents acting like customers a la StrongDM so you can catch the bugs earlier
Why do you assume that's doable? I'm not saying it's not, but it seems strange to just take for granted that it is.
We would have to get very good at these. It's completely antithetical to the agile idea where we convey tasks via pantomime and post-its rather than formal requirements. I won't even get started on the lack of inline documentation and its ongoing disappearance.
> Lean harder on your deterministic verification: unit tests, full stack tests,
Unit tests are so very limited. Effective but not the panacea that the industry thought it was going to be. The conversation about simulation and emulation needs to happen, and it has barely started.
> We'll need to figure out the techniques and strategies that let us merge AI code sight unseen.
Most people who write software are really bad at reading others' code and at systems-level thinking. This starts at hiring: the leetcode interview has stocked our industry with people who have never been vetted or measured on these skills.
> But anyone who's able to ship AI code without human review
Imagine we made everyone go back to the office, and then randomly put LSD in the coffee maker once a week. The hallucination problem is always going to be NON-ZERO. If you are bundling the context in, you might not be able to limit it (short of using two models adversarially). That doesn't even deal with the "confidently wrong" issue... what's an LLM going to do with something like this: https://news.ycombinator.com/item?id=47252971 (random bit flips).
We haven't even talked about the human factors (bad product ideas, poor UI, etc.) that engineers push back against and an LLM likely won't.
That doesn't mean you're completely wrong: those who embrace AI as a power tool, and use it to build their app, and tooling that increases velocity (on useful features) are going to be the winners.
Regulation.
It happened with plumbing. Electricians. Civil engineers. Bridge construction. Haircutting. Emergency response. Legal work. Tech is perhaps the least regulated industry in the world. Cutting someone’s hair requires a license, operating a commercial kitchen requires a license, holding the SSN of 100K people does not yet.
If AI is fast and cheap, some big client will use it in a stupid manner. Tons of people can and will be hurt afterward. Regulation will follow. AI means we can either go faster, or focus on ironing out every last bug with the time saved, and politicians will focus on the latter instead of allowing a mortgage meltdown in the prime credit market. Everyone stays employed while the bar goes higher.
I would hope so, but it won't happen as long as the billionaire AI bros keep on paying politicians for favorable treatment.
>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.
That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck-deep in finding out what to build. Some developers like doing that (likely for free), but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.
If you are building POCs (and everyone understands it's a POC), then AI is actually better at getting those built, as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.
Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.
I've never seen a quick PoC get cleaned up. Not once.
I'm sure it happens sometimes, but it's very rare in the industry. The reality is that a PoC usually becomes "good enough" and gets moved into production with only the most perfunctory of cleanup.
I also name everything DEMO__ so at least they'll have to go through the exercise of renaming it. Although I've had cases where they don't even do that lol. But at least then you know who's totally worthless.
If you search for that quote, the #1 result is an AI-slop paraphrase published last week, but the original article is from 11 years ago and was republished 3 years ago.
https://higherorderlogic.com/programming/2023/10/06/bad-code...
Recent Devstral 2 (Mistral) is pretty precise and concise in its changes.
Claude and friends represent an increase in coders without any corresponding increase in code reviewers. It's a break in the traditional model of reviewing as much code as you submit, and it all falls on human engineers, typically the most senior.
Well, that model kinda sucked anyway. Humans are fallible, and Ironies of Automation lays bare the failure modes. We all know the signs: 50 comments on a 5-line PR, a lonely "LGTM" on the 5000-line PR. This is not responsible software engineering or design; it is, as the author puts it, a big green "I'm accountable" button with no force behind it.
It's probably time for all of us on HN to pick up a book or course on TLA+ and elevate the state of software verification. Even if Claude ends up writing TLA+ specs too, at least that will be a smaller, simpler code base to review?
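For those unfamiliar with what TLA+-style verification buys you, the core idea can be sketched without TLA+ at all: exhaustively explore every reachable state of a small model and check an invariant over all of them. The toy two-process lock below is made up for illustration; a real spec would be written in TLA+/PlusCal and checked with TLC.

```python
# Explicit-state exploration of a toy two-process lock, checking a
# mutual-exclusion invariant over every reachable state. The model
# (states, transitions) is illustrative, not from any real system.
IDLE, WAIT, CRIT = "idle", "wait", "crit"

def steps(state):
    """Yield every successor of a state (pc0, pc1, lock_holder)."""
    pcs, holder = list(state[:2]), state[2]
    for i in range(2):
        if pcs[i] == IDLE:                       # decide to acquire
            yield tuple(pcs[:i] + [WAIT] + pcs[i+1:]) + (holder,)
        elif pcs[i] == WAIT and holder is None:  # acquire free lock
            yield tuple(pcs[:i] + [CRIT] + pcs[i+1:]) + (i,)
        elif pcs[i] == CRIT:                     # release the lock
            yield tuple(pcs[:i] + [IDLE] + pcs[i+1:]) + (None,)

init = (IDLE, IDLE, None)
seen, frontier = {init}, [init]
while frontier:                 # explore the full state graph
    for nxt in steps(frontier.pop()):
        if nxt not in seen:
            seen.add(nxt)
            frontier.append(nxt)

# Invariant: both processes are never in the critical section at once.
assert all(not (s[0] == CRIT and s[1] == CRIT) for s in seen)
print(f"{len(seen)} reachable states, mutual exclusion holds")
```

The state space here is tiny, which is the whole trick: you verify a small abstract model of the design, not the full implementation, and that smaller artifact is exactly what a human reviewer can still read end to end.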
Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased as trunk-based development, to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping too much. :-)
I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.
As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.
I prompted it to cite and footnote all sources, avoid plagiarism and AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!") but various online checkers did not detect AI writing or plagiarism. I spot checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a Uni I probably could).
Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.
But how do I know Gemini didn't hallucinate the same things Claude did?
Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.
If I remember correctly, some group used an AI tool to sniff for AI citations in others' works. What I remember most was how abhorrent some of the sources the AI sniffer caught were. One of the citation's authors was literally cited as "FirstName LastName" -- it didn't even sub in a fake name lol.
Edit: I found the OP:
So, in regard to your book: Claude may or may not have hallucinated the information from its cited sources. Gemini, as well. However, say you had access to the cited information behind a paywall. How would you go about verifying the information cited in those sources was correct?
Since the release of LLMs over the past four years or so, I have noticed a trend where people are (rightfully) hesitant to trust the output of LLMs. But if the knowledge is in a book or comes from any other man-made source, it's somehow infallible? Such thinking reminds me of my primary-school days. Teachers would not let us use Wikipedia as a source because "anyone can edit anything." As if one cannot write anything they want in a book, be it true or false?
How many scientific researchers have p-hacked their research, falsified data, or used other methods of deceit? I do not believe it's truly an issue on a grand scale, nor does it make vast amounts of science illegitimate. When caught, the punishments are usually handled in a serious manner, but there's no telling how much falsified research was never caught.
I do believe any and all information provided by LLMs should be verified and not blindly trusted, however, I extend that same policy to works from my fellow humans. Of course, no one has the time to verify every single detail of every bit of information one comes across. Hence, at some point, we all must settle on trusting in trust. Knowledge that we cannot verify is not knowledge. It is faith.
[1] https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...
AI has exacerbated the Internet's "content must be free or else does not exist" trend.
It's just not interesting to challenge an AI to write professional research content without giving it access to research content. Without access, it's just going to paraphrase what's already available.
When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy as possible to verify. Chunk your PR into palatable pieces. Document and comment to aid verification. Add tests that are easy for the reviewer to read, run, and tweak. Etc.
Software development is a highly complex task and verification becomes not just validation of the output but also verification that the work is solving the problem desired, not just the problem specified.
I'm empathetic to that scenario, but this was a problem with software development to begin with. I would much rather be in a situation of reducing friction to verification than reducing friction to discovery.
Cognitive load might be the same but now we get a potential boost in productivity for the same cost.
With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.
I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.
Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.
Hand-crafted, scalable code will be a very rare phenomenon.
There will be a clear distinction between the two, too.
When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.
He was hired because we needed a contractor quickly and he and his company represented to us that he was a lot more experienced than he actually is.
Why are you acting like you work for the contractor, instead of the contractor working for you?
Why are you teaching a contractor anything? That's a violation of labor law. You are treating a contractor like an employee.
CEOs and hype men want you to believe that LLMs can replace everyone. In 6 months you can give them the keys to the kingdom and they'll do a better job running your company than you did. No more devs. No more QA. No more pesky employees who need crazy stuff like sleep, and food, and time off to be a human.
Then of course we run face-first into reality. You give the tool to an idiot (or a generally well-meaning person not paying enough attention) and you end up with 2k-line PRs that are batshit insane, production databases deleted, malicious code downloaded and executed on your machines, email archives deleted, and entire production infrastructure systems blown away. Then the hype men come back around and go "well yeah, it's not the tool's fault, you still need an expert at the wheel," even though you were told you don't.
LLMs can do amazing things, and I think there are a lot of opportunities to improve software products if they're used correctly, but reality does not line up with the hype, and it never will.