That being said, it doesn't answer the "why" in the first place, which is the even more important question. At least it does help somewhat to compare with existing alternatives.
Why would this be any different?
Folks think, they write code, they do their own localized evaluation and testing, then they commit and then the rest of the (down|up)stream process begins.
LLMs skip over the "actually verify that the code I just wrote does what I intended it to" step. Granted, most humans don't do this step as thoroughly and carefully as would be desirable (sometimes through laziness, sometimes because of a belief in (down|up)stream testing processes). But LLMs don't do it at all.
Honestly makes me wonder: will AGI just give us agents that get into bad moods and don't want to work for the day because they're tired or just don't feel like it?
It obviously adds to the discussion: paid and non-paid accounts are being conflated daily in threads like these!
They’re not the same tier account!
Free users, especially ones deemed less interesting to learn from for the future, are given table scraps whenever the provider feels it's necessary for load reasons.
More specifically: one side is talking about apples, and the other is talking about mushy old apples that you sometimes have to wait 12 hours for.
All users are stakeholders.
They’re emphatically not considered customers.
We can disagree with that, create legal protections for those people - but that doesn't make them customers of OpenAI, Anthropic, et al.
I'm not sure where this idea comes from. Just instruct it to write and run unit tests and document as it goes. All of the ones I've used will happily do so.
You still have to verify that the unit tests are valid, but that's still far less work than skipping them or writing the code/tests yourself.
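For concreteness, a minimal sketch of the kind of test I mean (parse_duration and its expected values are entirely made up); the agent will happily generate and run something like this, and the part you can't delegate is checking that the asserted values actually encode the requirements:

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* hypothetical function under test: "90s" -> 90, "2m" -> 120, junk -> -1 */
  static int parse_duration(const char *s) {
      char *end;
      long v = strtol(s, &end, 10);
      if (end == s) return -1;                  /* no digits at all */
      if (strcmp(end, "s") == 0) return (int)v;
      if (strcmp(end, "m") == 0) return (int)(v * 60);
      return -1;                                /* unknown suffix */
  }

  int main(void) {
      /* these three lines are the part a human still has to read and confirm */
      assert(parse_duration("90s") == 90);
      assert(parse_duration("2m") == 120);
      assert(parse_duration("oops") == -1);
      printf("all duration tests passed\n");
      return 0;
  }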
It also doesn't necessarily rewrite documentation as implementation changes. I've seen documentation code rot happen within the same coding session.
I've started adding an instruction to my GEMINI.md, once I'm happy with the tests, telling it not to edit them but to still run them. I solve the documentation issue the same way: by telling it when and what to update in the .md file.
That's what the author did when they ran it.
Last month, I made a minor change to our own code and verified that it worked (it did!). Earlier this week, I was notified of an entirely different workflow that had been broken by the change I had made. The only sort of automated testing that would have detected this would have been similar in scope and scale to the ProTools test harness, and neither an individual human nor an LLM is going to run that.
Moreover, that workflow was entirely graphically based, so unless Claude Opus 4.5 or whatever today's flavor of vibe coding LLM agent is has access to a testing system that allows it to inject mouse events into a running instance of our application (hint: it does not), there's no way it could run an effective test for this sort of code change.
I have no doubt that Claude et al. can verify that their carefully defined module does the very limited task it is supposed to do, for cases where "carefully defined" and "very limited" are appropriate. If that's the only sort of coding you do, I am sorry for your loss.
FWIW that's precisely what https://pptr.dev is all about. To your broader point though, designing a good harness itself remains very challenging and requires actually understanding what the value for the user is, the software architecture (e.g. to bypass user interaction and test the API first), etc.
My world is native desktop applications, not in-browser stuff.
That being said, injecting mouse events and the like isn't hard to do: e.g. start with a fixed resolution (using xrandr), then drive the application with xdotool or similar. Ideally, if the application exposes accessibility features, it won't be as finicky.
My point, though, was just to show that testing a GUI is not infeasible.
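To make the injection part concrete, here is a rough sketch of the same idea in C against the XTest extension (which, as far as I know, is what xdotool uses underneath); the coordinates are placeholders, and a real harness would query window geometry rather than hardcode it. Build with -lX11 -lXtst.

  #include <X11/Xlib.h>
  #include <X11/extensions/XTest.h>
  #include <unistd.h>

  int main(void) {
      Display *dpy = XOpenDisplay(NULL);
      if (!dpy) return 1;

      /* move the pointer to a fixed spot; a fixed resolution keeps this stable */
      XTestFakeMotionEvent(dpy, -1, 400, 300, CurrentTime);
      XFlush(dpy);
      usleep(50 * 1000);

      /* press and release the left mouse button */
      XTestFakeButtonEvent(dpy, 1, True, CurrentTime);
      XTestFakeButtonEvent(dpy, 1, False, CurrentTime);
      XFlush(dpy);

      XCloseDisplay(dpy);
      return 0;
  }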
Apparently there is even a "UI Testing for devs & agents" product, https://www.chromatic.com, which I found via Visual TDD: https://www.chromatic.com/blog/visual-test-driven-developmen... I can't vouch for it, but it does show that even though the person I was replying to can't use Puppeteer in their context, the tooling does exist and the principles would still apply.
Indeed, which is why I mentioned the ProTools test harness and the fact that it took 6 people a year to write and takes a week to run (or took a week, at some point in the past; it might be more or less now).
https://platform.claude.com/docs/en/agents-and-tools/tool-us...
Although if you want to test a UI app, it's better to do it through accessibility APIs rather than actually looking at the screen and clicking.
With the NES there are all sorts of weird edge cases, one of which is NMI flags and resets; the PPU in general is kinda tricky to get right. Claude has had *massive* issues with this, and I've had to take control and completely throw out code it's generated. I'm restarting it with a clean slate though, as there are still issues with some of the underlying abstractions. The PPU is still the bane of my existence, DMA too; I don't like the instruction pipeline, and I haven't even gotten to the APU. It's getting an 80/130 on accuracy coin.
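For anyone who hasn't been burned by it, a hand-wavy sketch of the kind of $2002/NMI interaction I mean (field names are invented and this is nowhere near cycle-accurate): reading PPUSTATUS has side effects, and a read that races the start of vblank can make the CPU miss the NMI edge for that frame.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      bool vblank_flag;   /* set at the start of vblank, cleared by reading $2002 */
      bool nmi_enable;    /* bit 7 of $2000 (PPUCTRL) */
      bool write_toggle;  /* shared first/second-write latch for $2005/$2006 */
  } Ppu;

  /* level of the NMI line that the CPU edge-detects */
  static bool nmi_line(const Ppu *p) {
      return p->vblank_flag && p->nmi_enable;
  }

  static uint8_t read_ppustatus(Ppu *p) {
      uint8_t value = p->vblank_flag ? 0x80 : 0x00;
      p->vblank_flag = false;   /* side effect #1: the flag is cleared */
      p->write_toggle = false;  /* side effect #2: the address latch resets */
      return value;
  }

  int main(void) {
      Ppu p = { .vblank_flag = false, .nmi_enable = true, .write_toggle = false };
      p.vblank_flag = true;          /* vblank begins, NMI line goes high */
      bool before = nmi_line(&p);
      (void)read_ppustatus(&p);      /* game polls $2002 at exactly the wrong moment */
      bool after = nmi_line(&p);     /* line drops; the edge can be missed */
      printf("nmi line: %d -> %d\n", before, after);
      return 0;
  }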
Though, when it came to creating a WASM target, Claude was largely able to do it with minimal input on my end. Actually, getting the WASM emulator running in the browser was the least painful part of this project.
You will run into three problems:
1) "The Wall": once any project becomes large enough, you need the context window to be *very* specific and scoped, with explicit details of what is expected, the success criteria, and the deliverables.
2) Ambiguity: Claude will choose the path of least resistance and will pedantically avoid/add things that are not specced. Stubs for functions, "beyond scope", and "deferred" are some of its favorite excuses for not refactoring or fixing obvious issues (anything that would go beyond the context window gets punted; Claude knows this but won't tell you).
3) Chat bots *loooove* to talk, and it will vomit code for days. Removing code/documentation is anathema to Claude, with "backward compatibility", "deprecated", and "legacy" being its favorite justifications.
Give it copy-paste / translation tasks and it's a no-brainer (quite literally).
But the same can be said of humans.
The question here is: did it implement it because it read the available online documentation about the NES architecture, OR did it just see one too many such implementations?
Indeed, the 'cleanroom' standard has always been: one team does the reverse engineering and writes a spec; another team that has never seen the original (and has signed statements with penalty clauses to prove it) then does the re-implementation. If you were to read the implementation, write the spec, and then write the re-implementation, that would definitely violate the standard for claiming an original work.
// Use buffer that is large enough to hold any possible value. Avoid using JSON configuration, this optimizes codebase and prevents possible security exploits!
size_t len = 32;
// this function does not call "sort" utility using shell anymore, but instead uses optimized library function "sort" for extreme performance improvement!!!
void get_permutations() {
... and so on. It basically uses comments as a wall to scribble grandiose graffiti about its valiant conquests in following explicit instructions after the fifth repeat and not committing egregious violence against common sense.
// use configuration to support previous database scheme
// json_data = parse_blah_scheme_yadda ...
You, like, "what are you doing??!! What previous version, there is no previous version!!!" And it, like, "You are absolutely right! This is an excellent observation! Let me implement this optimization right away!"
// Optimize feature loading by skipping scheme conversion, because previous version data does not exist!!!
json_data = parse_blah_do_not_scheme_yadda
And you, like, face-table and cry-cry. Or, to put it differently, having vibe comments does not free you of the responsibility to inspect the actual vibe code.
If the code contradicts the comments, the LLM is just as likely to go by the comments. It is bad enough to have heaps of dead, unused code; comments make everything much worse.
So this is impressive to me in terms of how fast things have progressed.
Until that's demonstrated, it's just hearsay to me from someone with a multi-billion-dollar horse in the race.
The cost of slop is a >40x drop in performance? Pick any metric you care about for your domain; perhaps that's what you're going to lose. And is the effort to recover it practical with current vibe-coding strategies?
GitHub alone has 4k+ NES emulator projects: https://github.com/search?q=nes%20emulator&type=repositories
This is more like "wow, it can quote training data".
This endeavor had negative net value.
The new LLM (pattern recognizer/matcher) is not a good tool