183 points by dnw 7 hours ago | 36 comments
  • pornel 6 hours ago
    Their default solution is to keep digging. It has a compounding effect of generating more and more code.

    If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

    If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created.

    If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

    If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

    • unlikelytomato 5 hours ago
      This is why I'm confused when people say it isn't ready to replace most of the programmer workforce.
      • Foobar8568 2 hours ago
        LLM code is higher quality than any codes I have seen in my 20 years in F500. So yeah you need to "guide" it, and ensure that it will not bypass all the security guidance for ex...But at least you are in control, although the cognitive load is much higher as well than just "blind trust of what is delivered".

        But I can see the carnage with offshoring+LLM, or "most employees", including so call software engineer + LLM.

        • thesz 2 hours ago

            > LLM code is higher quality than any codes I have seen in my 20 years in F500.
          
          "Any codes"?
          • Foobar8568 an hour ago
            At least my comment hasn't been reviewed or written by a LLM.

            And in my French brain, code or codebase is countable and not uncountable.

            • sebastiennight 27 minutes ago
              As far as I've ever heard, "le code" used in a codebase is uncountable, like "le café" you'd put in a cup, so we would still say "meilleur que tout le code que j'ai vu en 20 ans" and not "meilleur que tous les codes que j'ai vus en 20 ans".

              There is a countable "code" (just like "un café" is either a place, or a cup of coffee, or a type of coffee), and "un code" would be the one used as a password or secret, as in "j'ai utilisé tous les codes de récupération et perdu mon accès Gmail" (I used all the recovery codes and lost Gmail access).

            • thesz an hour ago
              I guess you can guide it to write in any style.

              But what set me off is the universal quantifier: there was no code seen by you that is of equal or better quality than what LLMs generate.

            • Implicated an hour ago
              I got curious and had to fire up the ol LLM to find out what the story is about the words that aren't pluralized - TIL about countable and uncountable nouns. I wonder if the guy giving you trouble about your English speaks French.
              • thesz an hour ago
                I speak Russian and some English, but the question was about universal quantification: the author declares that LLMs generate code of better quality than "any codes" he has seen in his career.
              • iLoveOncall 14 minutes ago
                I'm native French and nobody would consider code countable. "codes" makes no sense. We'd talk about "lines of code" as a countable in French just like in English.
          • Implicated an hour ago
            You'll find, at times, that those communicating in a language that's not their primary language will tend to deviate from what a native speaker might expect.

            If that's obvious to you, then you're just being rude. If it's not obvious to you, then you'll also find this is a common deviation (the plural 'codes') among speakers from certain primary-language regions.

            Edit: This got me thinking - what is the grammar/rule around what gets pluralized and what doesn't? How does one know that "code" can refer to a single line of code, a whole file of code, a project, or even the entirety of all code your eyes have ever seen, without an s tacked on to the end of it?

            • tsimionescu an hour ago
              "Codes" as a way to refer to programs/libraries is actually common usage in academia and scientific programming, even by native English speakers. I believe, but am not sure, that it may just be relatively old jargon, before the use of "programs" became more common in the industry.

              As for the grammar rule, it's the question of whether a word is countable or uncountable. In common industry usage, "code" is an uncountable noun, just like "flour" in cooking (you say 2 lines of code, 1 pound of flour).

              It's actually pretty common for the same word to have both countable and uncountable versions, with different, though related, meanings. Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

              • Implicated an hour ago
                > Typically the uncountable version is used with a measure of quantity, while the countable version denotes different kinds (flours - different types of flour; peoples - different groups of people).

                This was very helpful, thank you! (I had just gotten off the phone with Claude learning about countable and uncountable nouns but those additional details you provided should prove quite valuable)

            • thesz an hour ago
              The question was about universal quantification, not a grammar error.

              As if the author of the comment had never seen any code that is better than or equal in quality to code generated by LLMs.

              • Implicated an hour ago
                Well now I look like an idiot. But I did learn some things! :D My apologies.
        • mettamage 42 minutes ago
          Giving it prompts of the Shannon project helps for security
      • danparsonson 2 hours ago
        Yeah that describes most legacy codebases I've worked on XD
      • lwansbrough 3 hours ago
        For me, I'll do the engineering work of designing a system, then give it the specific designs and constraints. I'll let it plan out the implementation, then I give it notes if it varies in ways I didn't expect. Once we agree on a solution, that's when I set it free. The frontier models usually do a pretty good job with this work flow at this point.
      • YesBox 3 hours ago
        Heh, people like to have someone else to blame.
      • iLoveOncall 12 minutes ago
        Really? Because this perfectly explains why it will never replace them: it needs an exact language listing everything required for it to function as you expect.

        You need code to get it to generate proper code.

    • stingraycharles 6 hours ago
      > If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."

      Nevermind the fact that it only migrated 3 out of 5 duplicated sections, and hasn’t deleted any now-dead code.

      • Mavvie 3 hours ago
        Sounds like my coworkers.
        • Foobar8568 2 hours ago
          That's the reality nobody really wants to say.
          • Jweb_Guru 2 hours ago
            It's not reality. I'm really not a fan of the way that people excuse the really terrible code LLMs write by claiming that people write code just as bad. Even if that were true, it is not true that when you ask those people to do otherwise they simply pretend to have done it and forget you asked later.
            • ttoinou 9 minutes ago
              No but they will despise you for bringing the problem up
            • imiric an hour ago
              It's an easy copout.

              Tool works as expected? It's superintelligence. Programming is dead.

              Tool makes dumb mistake? So do humans.

    • marginalia_nu 5 hours ago
      My sense is that the code generation is fast, but then you always need to spend several hours making sure the implementation is appropriate, correct, well tested, based on correct assumptions, and doesn't introduce technical debt.

      You need to do this when coding manually as well, but the speed at which AI tools can output bad code means it's so much more important.

      • ehnto 4 hours ago
        Well when you write it manually you are doing the review and sanity checking in real time. For some tasks, not all but definitely difficult tasks, the sanity checking is actually the whole task. The code was never the hard part, so I am much more interested in the evolving of AIs real world problem solving skills over code problems.

        I think programming is giving people a false impression of how intelligent the models are: programmers are meant to be smart, right? So being able to code means the AI must be super smart. But programmers also put a huge amount of their output online for free, unlike most disciplines, and it's all text based. When it comes to problem solving I still see them regularly confused by simple stuff, having to reset context to try and straighten it out. It's not a general purpose human replacement just yet.

      • LPisGood 4 hours ago
        And it’s slower to review because you didn’t do the hard part of understanding the code as it was being written.
        • Implicated 4 hours ago
          You're holding it wrong.

          Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.

          ie: enforce conventions, set specific and measurable/verifiable goals, define skeletons of the resulting solutions if you want/can.

          To give an example. I do a lot of image similarity stuff and I wanted to test the Redis VectorSet stuff when it was still in beta and the PHP extension for redis (the fastest one, which is written in C and is a proper language extension not a runtime lib) didn't support the new commands. I cloned the repo, fired up claude code and pointed it to a local copy of the Redis VectorSet documentation I put in the directory root telling it I wanted it to update the extension to provide support for the new commands I would want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out about pushing that into a production environment, so I then told it to just write me a PHP run time client that mirrors the functionality of Predis (pure-php implementation of redis client) but does so via shell commands executed by php (lmao, I know).

          Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.

          • slopinthebag 4 hours ago
            They aren't holding it wrong, it's a fundamental limitation of not writing the code yourself. You can make it easier to understand later when you review it, but you still need to put in that effort.
    • joquarky an hour ago
      Don't let it deteriorate so far that it can't recover in one session.

      Perform regular sessions dedicated to cleaning up tech debt (including docs).

    • vannevar 6 hours ago
      I'd highly recommend working top down, getting it to outline a sane architecture before it starts coding. Then if one of the modules starts getting fouled up, start with a clean sheet context (for that module) incorporating any cautions or lessons learned from the bad experience. LLMs are not yet good at working and reworking the same code, for the reasons you outline. But they are pretty good at a "Groundhog Day" approach of going through the implementation process over and over until they get it right.
      • coolius 32 minutes ago
        +1 if you are vibe coding projects from scratch. If the architecture you specify doesn't make sense, the LLM will start struggling, and the only way out of its misery is mocking tests. The good thing is that a complete rewrite with proper architecture and lessons learned is now totally affordable.
    • codebolt 3 hours ago
      I use the restore checkpoint/fork conversation feature in GitHub Copilot heavily because of this. Most of the time it's better to just rewind than to salvage something that's gone off track.
    • Implicated 4 hours ago
      Not trying to be snarky, with all due respect... this is a skill issue.

      It's a tool. It's a wildly effective and capable tool. I don't know how or why I have such a wildly different experience than so many that describe their experiences in a similar manner... but... nearly every time I come to the same conclusion that the input determines the output.

      > If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

      Yes, when the prompt/instructions are overly broad and there's no set of guardrails or guidelines that indicate how things should be done... this will happen. If you're not using planning mode, skill issue. You have to get all this stuff wrapped up and sorted before the implementation begins. If the implementation ends up being done in a "not-so-great" approach - that's on you.

      > If you tell them the code is slow

      Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results? You ask it to benchmark the code and then you ask it how it might be optimized. Then you discuss those options with it (this is where you do the part from the previous paragraph, where you direct the approach so it doesn't take a "not-so-great" approach) until you get to a point where you like the approach and the model has shown it understands what's going on.

      Then you accept the plan and let the model start work. At this point you should have essentially directed the approach and ensured that it's not doing anything stupid. It will then just execute, it'll stay within the parameters/bounds of the plan you established (unless you take it off the rails with a bunch of open ended feedback like telling it that it's buggy instead of being specific about bugs and how you expect them to be resolved).
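
      To ground that in something concrete: the point is to hand the agent numbers, not adjectives. A minimal sketch of the kind of benchmark you might ask it to run first (both join functions are invented for illustration):

```python
import timeit

def slow_join(items):
    # hypothetical "not-so-great" version: O(n^2) string concatenation
    out = ""
    for s in items:
        out += s
    return out

def fast_join(items):
    # the optimized candidate under discussion
    return "".join(items)

data = [str(i) for i in range(10_000)]

# measure both so the conversation starts from numbers, not "it's slow"
t_slow = timeit.timeit(lambda: slow_join(data), number=50)
t_fast = timeit.timeit(lambda: fast_join(data), number=50)
print(f"slow_join: {t_slow:.4f}s  fast_join: {t_fast:.4f}s")
```

      With output like this in the prompt, "optimize it" becomes a verifiable request rather than an invitation to generate speculative fast paths.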

      > you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose.

      This is an area where I will agree that the models are wildly inept. Someone needs to study what it is about tests and testing environments and mocking that makes these things go off the rails. The solution to this is the same as the solution to the issue of it digging deeper or chasing its tail in circles... Early in the prompt/conversation/message that sets the approach/intent/task, state your expectations for the final result. Define the output early, then describe/provide context/etc. The earlier in the prompt/conversation the "requirements" are set, the more sticky they'll be.

      And this is exactly the same for the tests. Either write your own tests and have the models build the feature from the test or have the model build the tests first as part of the planned output and then fill in the functionality from the pre-defined test. Be very specific about how your testing system/environment is setup and any time you run into an issue testing related have the model make a note about that and the solution in a TESTING.md document. In your AGENTS.md or CLAUDE.md or whatever indicate that if the model is working with tests it should refer to the TESTING.md document for notes about the testing setup.
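
      As one possible illustration (the file names and wording here are just a convention, not a standard), the AGENTS.md pointer described above might look like:

```markdown
## Testing

- Before writing or modifying any test, read TESTING.md for notes on the
  test harness, fixtures, and known pitfalls.
- When you hit and solve a testing-related problem, append a short note
  (problem + solution) to TESTING.md.
- Do not introduce a new mocking framework; use the one documented in
  TESTING.md.
```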

      Personally, I focus on the functionality, get things integrated and working to the point I'm ready to push it to a staging or production (yolo) environment and _then_ have the model analyze that working system/solution/feature/whatever and write tests. Generally my notes on the testing environment to the model are something along the lines of a paragraph describing the basic testing flow/process/framework in use and how I'd like things to work.

      The more you stick to convention the better off you'll be. And use planning mode.

      • riffraff an hour ago
        > Whew. Ok. You don't tell it the code is slow. Do you tell your coworker "Hey, your code is slow" and expect great results?

        Yes? Why don't you?

          They are capable people who just didn't notice something; if I notice some telemetry and tell them "hey, this is slow", they are expected to understand the reason(s).

        • zabzonk 2 minutes ago
          Well, I would say something like "We seem to be having some performance issues the business has noticed in the XYZ stuff. Shall we sit down together and see if we can work out if we can improve things?"
        • bryanrasmussen 24 minutes ago
          Yeah, if my co-worker can't start figuring out why the code is slow, given a reasonable reference to what the code in question is, that is a knock against their skills. I would actually expect some ideas as to what the problem is just off the top of their head. But the fact that the coding agent can't do that isn't a strike against it specifically; it's just part of what needs to be done differently.

          The suggestion to tell the agent to do performance analysis of the part of the code you think is problematic, and offer suggestions for improvements seems like the proper way to talk to a machine, whereas "hey your code is slow" feels like the proper way to talk to a human.

        • Implicated an hour ago
          So, you observed some telemetry - which would have been some sort of specific metric, right? Wouldn't you communicate that to them as well, not just "it's slow"?

          "Hey, I saw that metric A was reporting 40% slower, are you aware already or have any ideas as to what might be causing that?"

          Those two approaches are going to produce rather distinctly different results whether you're speaking to a human or typing to a GPU.

      • otabdeveloper4 2 hours ago
        It is not a tool. It is an oracle.

        It can be a tool, for specific niche problems: summarization, extraction, source-to-source translation -- if post-trained properly.

        But that isn't what y'all are doing, you're engaging in "replace all the meatsacks AGI ftw" nonsense.

        • Implicated an hour ago
          If I was on the "replace all the meatsacks AGI ftw" team then I would have referred to it as an oracle, by your own logic, wouldn't I have?

          It's a tool. It's good for some things, not for others. Use the right tool for the job and know the job well enough to know which tools apply to which tasks.

          More than anything it's a learning tool. It's also wildly effective at writing code, too. But, man... the things that it makes available to the curious mind are rather unreal.

          I used it to help me turn a cat exercise wheel (think huge hamster wheel) into a generator that produces enough power to charge a battery that powers an ESP32 powered "CYD" touchscreen LCD that also utilizes a hall effect sensor to monitor, log and display the RPMs and "speed" (given we know the wheel circumference) in real time as well as historically.

          I didn't know anything about all this stuff before I started. I didn't AGI myself here. I used a learning tool.

          But keep up with your schtick if that's what you want to do.

    • leke an hour ago
      I wonder if the solution is to just ask it to refactor its code once it's working.
      • MadnessASAP 22 minutes ago
        You can, and it might make things a bit better. The only real way I've found so far is to start going through file by file, picking it apart.

        I wouldn't be surprised if over half my prompts start with "Why ...?", usually followed by "Nope, ... instead"

        Maybe the occasional "Fuck that you idiot, throw the whole thing out"

    • MattGaiser 4 hours ago
      > If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later.

      Are you using plan mode? I used to see it take a poor approach and keep digging, but with planning that seems to have gone away.

    • bryanrasmussen 5 hours ago
      Maybe there should be an LLM trained on a corpus of deletions and cleanups of code.
      • krackers 4 hours ago
        I'm guessing there's a very strong prior to "just keep generating more tokens", as opposed to deleting code, that needs to be overcome. Maybe this is done already, but since every git project comes with its own history, you could take a notable open-source project (like LLVM) and then do RL training against each individual patch committed.
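
        As a hedged sketch of that mining step (the function name and selection ratio are invented; the input is the standard `git log --pretty=%H --numstat` output format), one could filter a project's history for deletion-heavy commits like this:

```python
def deletion_heavy_commits(numstat_log: str, ratio: float = 1.5):
    """Parse `git log --pretty=%H --numstat` output and return hashes of
    commits whose deletions outnumber additions by `ratio`."""
    commits, current, added, deleted = [], None, 0, 0

    def flush():
        # keep the commit if it removed substantially more than it added
        if current and deleted > ratio * max(added, 1):
            commits.append(current)

    for line in numstat_log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:            # "<added>\t<deleted>\t<path>"
            a, d, _ = parts
            if a != "-":               # "-" marks binary files
                added += int(a)
                deleted += int(d)
        elif line.strip():             # a hash line starts a new record
            flush()
            current, added, deleted = line.strip(), 0, 0
    flush()
    return commits
```

        Feeding the selected hashes back through `git show` would then yield the deletion/cleanup patches to train against.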
    • esafak 5 hours ago
      I have run into this too. Some of it is because models lack the big picture; so-called agentic search (aka grep) is myopic.
  • grey-area 4 minutes ago
    I find they work best as autocomplete -

    The chunks of code are small and can be carefully reviewed at the point of writing

    Claude normally gets it right (though sometimes horribly wrong) - this is easier to catch in autocomplete

    That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated.

    Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.

  • D-Machine 4 hours ago
    This article is great. And the blog-article headline is interesting, but wrong. LLMs don't, as a rule, write plausible code either.

    They just write code that is (semantically) similar to code (clusters) seen in their training data, and which hasn't been fenced off by RLHF / RLVR.

    This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.

    • ozozozd 4 hours ago
      Exactly. It’s also easy to find yourself in the out-of-distribution territory. Just ask for some tree-sitter queries and watch Gemini 3, Opus 4.5 and GLM 5 hallucinate new directives.
      • ehnto 2 hours ago
        I think this could be the key difference in how people are experiencing the tools. Using Claude in industries full of proprietary code is a totally different experience to writing some React components, or framework code in C#, PHP or Java. It's shockingly good at the latter, but as you get into proprietary frameworks or newer problem domains it feels like AI in 2023 again, even with the benefit of the agentic harnesses and context augments like memory etc.
      • simianwords 5 minutes ago
        Any example of how I can get it to hallucinate?
  • flerchin 6 hours ago
    Yes, plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep requirements hidden.
    • g947o 6 hours ago
      Attributing these to "hidden requirements" is a slippery slope.

      My own experience using Claude Code and similar tools tells me that "hidden requirements" could include:

      * Make sure DESIGN.md is up to date

      * Write/update tests after changing source, and make sure they pass

      * Add integration test, not only unit tests that mock everything

      * Don't refactor code that is unrelated to the current task

      ...

      These are not even project/language-specific instructions. They are usually considered common sense/good practice in software engineering, yet I sometimes had to almost beg coding agents to follow them. (You don't want to know how many times I've had to emphasize: don't use "any" in a TypeScript codebase.)

      People should just admit it's a limitation of these coding tools, and we can still have a meaningful discussion.

      • flerchin 6 hours ago
        Yeah I agree generally that the most banal things must be specified, but I do think that a single sentence in the prompt "Performance should be equivalent" would likely have yielded better results.
  • seanmcdirmid 3 hours ago
    I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests, we need pretty good coverage since if some edge case isn't covered, it likely will wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time.

    I don't use a planner though, I have my own workflow setup to do this (since it requires context isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well known optimizations), it might work.
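
    The differential-testing loop described above can be sketched with an in-memory SQLite database (the table and queries are invented for illustration; the real backend and queries would obviously differ):

```python
import sqlite3

def differential_test(setup_sql, query_a, query_b):
    """Run two supposedly-equivalent queries against the same data and
    compare their (order-insensitive) result sets."""
    db = sqlite3.connect(":memory:")
    db.executescript(setup_sql)
    rows_a = sorted(db.execute(query_a).fetchall())
    rows_b = sorted(db.execute(query_b).fetchall())
    return rows_a == rows_b

setup = """
CREATE TABLE orders (id INTEGER, amount REAL, region TEXT);
INSERT INTO orders VALUES (1, 10.0, 'eu'), (2, 5.0, 'us'), (3, 7.5, 'eu');
"""
# the "correct" query vs an agent-optimized rewrite
naive = "SELECT region, SUM(amount) FROM orders GROUP BY region"
rewritten = "SELECT region, TOTAL(amount) FROM orders GROUP BY region"
print(differential_test(setup, naive, rewritten))
```

    The point is that edge cases not covered by the comparison data are exactly the ones an aggressive optimization pass will wash out, which is why broad coverage matters more than the harness itself.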

  • einrealist 29 minutes ago
    > SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter.

    Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment.

    The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss anecdotally about interactions and results. We really need much more data, more projects to succeed with LLMs or to fail with them - or to linger in a state of ignorance, sunk-cost fallacy and suppressed resignation. I expect the latter will remain the standard case that we do not hear about - the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them.

    In my experience, 'Senior Software Engineer' has NO general meaning. It's a title to be awarded for each participation in a project/product over and over again. The same goes for the claim: "Me, Senior SWE treat LLMs as Junior SWE, and I am 10x more productive." Imagine me facepalming every time.

  • gormen 2 hours ago
    Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.
  • 88j88 3 hours ago
    100%. I found that you think you are smarter than the LLM and know exactly what you want, but this is not always the case. Give the LLM some leeway to come up with a solution based on what you are looking to achieve: give requirements, but don't ask it to produce the solution you would have, because then the response is forced and of lower quality.
  • comex 6 hours ago
    Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged):

    https://news.ycombinator.com/item?id=47176209

  • jqpabc123 6 hours ago
    LLMs have no idea what "correct" means.

    Anything they happen to get "correct" is the result of probability applied to their large training database.

    Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know whether this is the case, so they are basically flying blind and hoping for the best.

    Relying on an LLM for anything "serious" is a liability issue waiting to happen.

    • simianwords 3 minutes ago
      This is easily proven incorrect. Just go to ChatGPT and say something incorrect and ask it to verify. Why do people still believe this type of thing?
    • tonypapousek 6 hours ago
      It's a shame the bulk of that training data is likely 2010s blogspam that was poor quality to begin with.
      • 2god3 5 hours ago
        But isn't that a reflection of reality?

      If you've made a significant investment in human capital, you're even more likely to protect it now and avoid posting valuable stuff on the web.

    • 2god3 5 hours ago
      Aye. I wish more conversations would be more of this nature - in that we should start with basic propositions - e.g. the thing does not 'know' or 'understand' what correct is.
    • LarsDu88 5 hours ago
      This is about to change very soon. Unlike many other domains (such as greenfield scientific discovery), most coding problems for which we can write tests and benchmarks are "verifiable domains".

      This means an LLM can auto-generate millions of code problem prompts, attempt millions of solutions (both working and non-working), and, from the working solutions, penalize answers that have poor performance. The resulting synthetic dataset can then be used as a finetuning dataset.

      There are now reinforcement finetuning techniques that have not been incorporated into the existing slate of LLMs that will enable finetuning them for both plausibility AND performance with a lot of gray area (like readability, conciseness, etc) in between.

      What we are observing now is just the tip of a very large iceberg.
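
      A toy sketch of the verifiable-reward idea this describes (the reward shape, `solve` convention, and time budget are all invented for illustration, not taken from any particular paper):

```python
import time

def reward(candidate_src, test_cases, time_budget=0.5):
    """Score one candidate: 0 for wrong or crashing code, otherwise 1.0
    for correctness plus up to 1.0 more for finishing under budget."""
    namespace = {}
    try:
        exec(candidate_src, namespace)      # candidate must define solve()
        solve = namespace["solve"]
        start = time.perf_counter()
        for args, expected in test_cases:
            if solve(*args) != expected:
                return 0.0                  # fails verification
        elapsed = time.perf_counter() - start
    except Exception:
        return 0.0                          # crashes count as failures
    return 1.0 + max(0.0, 1.0 - elapsed / time_budget)

# hypothetical candidate solutions for "sum of 1..n"
tests = [((10,), 55), ((0,), 0)]
fast = "def solve(n):\n    return n * (n + 1) // 2"
slow = "def solve(n):\n    return sum(range(n + 1))"
wrong = "def solve(n):\n    return n"
print(reward(fast, tests), reward(slow, tests), reward(wrong, tests))
```

      Scoring millions of such candidates is what makes code a "verifiable domain": correctness and performance can both be checked mechanically, unlike the gray areas (readability, conciseness) mentioned above.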

      • 2god3 5 hours ago
        Lets suppose whatever you say is true.

        If I'm the government, I'd be foaming at the mouth: those projects that used to require enormous funding will now supposedly require much less.

        Hmmm, what to do? Oh, I know. Let's invest in Digital ID-like projects. Fun.

        • LarsDu88 2 hours ago
          It is true. Here is the publication going over how to generate this type of dataset and finetune: https://arxiv.org/pdf/2506.14245

          I don't think you grasp my statement. LLMs will greatly exceed humans in any domain that is easy to computationally verify, such as math and code. For areas not amenable to deterministic computation, such as human biology or experimental particle physics, progress will be slower.

  • lukeify 6 hours ago
    Most humans also write plausible code.
    • tartoran 6 hours ago
      LLMs piggyback on human knowledge encoded in all the texts they were trained on without understanding what they're doing.

      Humans would execute that code and validate it. From plausible it becomes: hey, it does this, and this is what I want. LLMs skip that part; they really have no understanding other than the statistical patterns they infer from their training, and they really don't need any for what they are.

      • red75prime 2 hours ago
        Could we stop using vague terms like “understanding” when talking about LLMs and machine learning? You don't know what understanding is. You only know how it feels to understand something.

        It's better to describe what you can do that LLMs currently can't.

        • stevenhuang 35 minutes ago
          At least it's an easy way for those who don't know what they're talking about to out themselves.

          If they'd bother to see how modern neuroscience tries to explain human cognition they'd see it explained in terms that parallel modern ML. https://en.wikipedia.org/wiki/Predictive_coding

          We only have theories for what intelligence even means. I wouldn't be surprised if there are more similarities than differences between human minds and LLMs, fundamentally (prediction and error minimization).

      • owlninja6 hours ago
        They probably at least look at the docs?
      • stevenhuang5 hours ago
        LLMs can execute code and validate it too so the assertions you've made in your argument are incorrect.

        What a shame your human reasoning and "true understanding" led you astray here.

    • gitaarikan hour ago
      All code is plausible by design
  • sim04fulan hour ago
    I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.
  • helsinki3 hours ago
    That's why I added an invariant tool to my Go agent framework, fugue-labs/gollem:

    https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...

  • mmaunder6 hours ago
    But my AI didn't do what your AI did.

    Cherry picked AI fail for upvotes. Which you'll get plenty of here and on Reddit from those too lazy to go and take a look for themselves.

    Using Codex or Claude to write and optimize high performance code is a game changer. Try optimizing cuda using nsys, for example. It’ll blow your lazy little brain.

    • kccqzy5 hours ago
      Yeah right. An LLM in the hands of a junior engineer produces a lot of code that looks like it was written by a junior. An LLM in the hands of a senior engineer produces code that looks like it was written by a senior. The difference is the quality of the prompt, as well as the human judgement to reject the LLM's code and the follow-up prompts telling it what to write instead.
      • jonnycoder2 hours ago
        Prompting is just step 1; step 0 was iterating and getting the right skills in place. Step 2 is creating and reviewing a plan. Step 3 is a command/skill that decomposes the problem into small implementation steps, each with its dependencies and a way to verify/test it. Step 4 is executing the implementation plan using sub-agents and ensuring validation/testing passes. Step 5 is a code review using Codex (since I use Claude for implementation).
      • mmaunder5 hours ago
        I kind of agree. But I'd adjust that to say that in both cases you get good looking code. In the hands of a junior you get crappy architecture decisions and complete failure to manage complexity which results in the inevitable reddit "they degraded the model" post. In the hands of seniors you get well managed complexity, targeted features, scalable high performance architecture, and good base technology choices.
      • 2god35 hours ago
        Lol what. The difference is that the senior... is a senior. Ask yourself what characteristics comprise a senior vs a junior...

        You're glossing over so much stuff. Moreover, how does the junior grow into the senior with those characteristics if their starting point is LLMs?

    • oofbey6 hours ago
      It’s easy to get AI to write bad code. Turns out you still need coding skills to get AI to write good code. But those who have figured it out can crank out working systems at a shocking pace.
      • mmaunder6 hours ago
        Agreed 100%. I'd add that it's the knowledge of architecture and scaling that you got from writing all that good code, shipping it, and then having to scale it. It gives you the vocabulary and broad and deep knowledge base to innovate at lightning speeds and shocking levels of complexity.
      • serious_angel6 hours ago
        I am sorry for asking, but... is there even a guide on how to "figure it out"? Otherwise, how are you so sure about it?
        • wmeredith5 hours ago
          Right here: https://codemanship.wordpress.com/2025/10/30/the-ai-ready-so...

          This series of articles is gold.

          Unsurprisingly, writing good software with AI follows the same principles as writing it without AI. Keep scopes small. Ship, refactor, optimize, and write tests as you go.

        • pornel5 hours ago
          When a new technology emerges we typically see some people who embrace it and "figure it out".

          Electronic synthesisers went from "it's a piano, but expensive and sounds worse" to every weird preset creating a whole new genre of electronic music.

          So it seems plausible, like Claude's code, that our complaints about unmaintainable code are from trying to use it like a piano, and the rave kids will find a better use for it.

        • mmaunder6 hours ago
          That's actually a great question. Truth be told the best way right now is to grab Codex CLI or Claude CLI (I strongly prefer Codex, but Claude has its fans), and just start. Immediately. Then go hard for a few months and you'll develop the skills you need.

          A few tips for a quickstart:

          Give yourself permission to play.

          Understand basic concepts like context window, compaction, tokens, chain of thought and reasoning, and so on. Use AI to teach you this stuff, and read every blog post OpenAI and Anthropic put out and research what you don't understand.

          Pick a hard coding problem in Python or Typescript and take a leap of faith and ask the agent to code it for you.

          My favorite phrase when planning is "Don't change anything. Just tell me." Save this as a tmux shortcut and use it at the end of every prompt when planning something out.

          Use markdown .md docs to create a planning doc and keep chatting to the agent about it and have it update the plan until you're super happy, always using the magic phrase "Don't change anything. Just tell me." (I should get myself a patent on that little number. Best trick I know)
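          For the tmux shortcut, something like this in ~/.tmux.conf does it (the key choice here is just an example):

```shell
# Illustrative ~/.tmux.conf binding: Alt-j types the planning phrase
# into the current pane so it can end every prompt.
bind-key -n M-j send-keys "Don't change anything. Just tell me."
```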

          Every time you see an anti-AI post, just move on. It's lazy people making lazy assumptions. Approach agentic coding with a sense of love, excitement, optimism, and take massive leaps of faith and you'll be very very surprised at what you find.

          Best of luck Serious Angel.

          • 2god35 hours ago
            You're not really answering the question, are you?

            Your answer is to play with it. Cool. But why can't you and others put together a proper guide lol? It can't be that hard.

            Go ahead and do it - it'll challenge the Anti-AI posters you are referencing. I and others want to see that debate.

            • appcustodian25 hours ago
              Don't worry we'll all be taking the Claude certification courses soon enough
            • mmaunder5 hours ago
              Ah - I know! Seriously I know. There's such a bad need for this right now. The problem is that the folks who are great at agentic coding are coding their asses off 16 to 20 hours a day and don't have a minute they want to spend on writing guides because of the opportunity cost.

              One of the rare resources I found recently was the OpenClaw guys interview on Lex. He drops a few bangers that are really valuable and will save you having to spend a long time figuring it out.

              Also there's a very strong disincentive for anyone to write right now because we're competing against the noise and the slop in the space. So best to just shut the fuck up and create as fast as we can, and let the outcome speak for itself. You're going to see a lot more products like OpenClaw where the pace of innovation is rapid, and the author freely admits that they're coding agentically and not writing a single line.

              I think the advantage that Peter has (openclaw author) is that he has enough money and success to not give a fuck about what people say re him writing purely agentically, so he's been very open about it which has been great for others who are considering doing the same.

              But if you have a software engineering career or are a public figure with something to lose, you tend to STFU if you're doing pure agentic coding on a project.

              But that'll change. Probably over the next few months. OpenClaw broke the ice.

            • oofbey2 hours ago
              Here’s some practical tips:

              Start small. Figure out what it (whatever tool you’re using) can do reliably at a quality level you’re comfortable with. Try other tools. There are tons. If it doesn’t get it right with the first prompt, iterate. Refine. Keep at it until you get there.

              When you have seen some pattern work, do that a bunch. It won’t always work. Write rules / prompts / skills to try to get it to avoid making the mistakes you see. Keep doing this for a while and you’ll get into a groove.

              Then try taking on bigger chunks of work at a time. Break apart a problem the same way you’d do it yourself first. Write a framework first. Build hello world. Write tests. Build the happy path. Add features. Don’t forget to make it write lots of tests. And run them. It’ll be lazy if you let it, so don’t let it. Each architectural step is not just a single prompt but a conversation with the output being a commit or a PR.

              Also, use specs or plans heavily. Have a conversation with it about what you’re trying to do and different ways to do it. Their bias is to just code first and ask questions later. Fight that. Make it write a spec doc first and read it carefully. Tell it “don’t code anything but first ask me clarifying questions about the problem.” Works wonders.

              As for convincing the AI haters they’re wrong? I seriously do. Not. Care. They’ll catch up. Or be out of a job. Not my problem.

        • appcustodian25 hours ago
          How do you figure anything out? You go use it, a lot.
  • FrankWilhoit6 hours ago
    Enterprise customers don't buy correct code, they buy plausible code.
    • kibwen6 hours ago
      Enterprise customers don't buy plausible code, they buy the promise of plausible code as sold by the hucksters in the sales department.
    • 2god35 hours ago
      They're not buying code.

      They are buying a service. As long as the service 'works' they do not care about the other stuff. But they will hold you liable when things go wrong.

      The only caveat is highly regulated stuff, where they actually care very much.

    • marginalia_nu6 hours ago
      I think SolarWinds would have preferred correct code back in 2020.
      • qup6 hours ago
        Okay, but what did they buy?
  • raw_anon_11115 hours ago
    The difference for me recently

    Write a lambda that takes an S3 PUT event and inserts the rows of a comma separated file into a Postgres database.

    Naive implementation: download the file from S3 and do a bulk insert. It would have taken 20 minutes, and that's what Claude did at first.

    I had to tell it to use the aws_s3 extension for Postgres, which loads a file directly from S3 into a table. It took 20 seconds.

    I treat coding agents like junior developers.
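    A rough sketch of the faster path (the table name, region, and handler shape here are illustrative; it assumes the aws_s3 extension is enabled on the RDS instance and the instance has an IAM role that can read the bucket):

```python
def build_import_sql(table: str, bucket: str, key: str, region: str) -> str:
    # aws_s3.table_import_from_s3 makes Postgres pull the CSV straight
    # from S3, so the Lambda never downloads or parses the file itself.
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(format csv, header true)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'))"
    )

def handler(event, context):
    # S3 PUT events carry the bucket name and object key per record.
    s3 = event["Records"][0]["s3"]
    sql = build_import_sql("my_table", s3["bucket"]["name"],
                           s3["object"]["key"], "us-east-1")
    # A real handler would execute `sql` against RDS via psycopg2 or
    # pg8000; returning it keeps this sketch self-contained.
    return sql
```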

    • svpyk5 hours ago
      Unlike junior developers, LLMs can take detailed instructions and produce outstanding results on the first shot a good number of times.
    • conception4 hours ago
      Did you ask it to research best practices for this method, have an adversarial performance based agent review their approach or search for performant examples of the task first? Relying on training data only will always get your subpar results. Using “What is the most performant way to load a CSV from S3 into PostgreSQL on RDS? Compare all viable and research approaches before recommending one.” gave me the extension as the top option.
      • raw_anon_11114 hours ago
        I knew the best way. I was just surprised that Claude got it wrong. As soon as I told it to use the S3 extension, it knew to add the appropriate permissions, to update my SQL init script to enable the extension, and how to write the code.
  • marginalia_nu6 hours ago
    I tried to make Claude Code, Sonnet 4.6, write a program that draws a fleur-de-lis.

    No exaggeration, it floundered for an hour before it started to look right.

    It's really not good at tasks it has not seen before.

    • ehnto6 hours ago
      Even with well understood languages, if there isn't much in the public domain for the framework you're using, it's not really that helpful. You know you're at the edges of its knowledge when you can see the exact forum posts you were looking at showing up verbatim in its responses.

      I think some industries with mostly proprietary code will find AI a bit disappointing to use.

    • jshmrsn6 hours ago
      Considering that a fleur-de-lis involves somewhat intricate curves, I think I'd be pretty happy with myself if I could get that task done in an hour.

      Given a harness that allows the model to validate the result of its program visually, and given the models are capable of using this harness to self correct (which isn't yet consistently true), then you're in a situation where in that hour you are free to do some other work.

      A dishwasher might take 3 hours to do for what a human could do in 30 minutes, but they're still very useful because the machine's labor is cheaper than human labor.

      • marginalia_nu6 hours ago
        I didn't provide any constraints on how to draw it.

        TBH I would have just rendered a font glyph, or failing that, grabbed an image.

        Drawing it with vector graphics programmatically is very hard, but a decent programmer would and should push back on that.
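        For what it's worth, the glyph route can be as small as wrapping the Unicode fleur-de-lis character (U+269C) in an SVG text element and letting the viewer's font do the drawing - the sizing here is arbitrary:

```python
def fleur_de_lis_svg(size: int = 200) -> str:
    # Render the fleur-de-lis glyph (U+269C) centered in an SVG,
    # leaning on the viewer's font instead of hand-drawn curves.
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{size}" height="{size}">'
        f'<text x="50%" y="50%" font-size="{int(size * 0.8)}" '
        'text-anchor="middle" dominant-baseline="middle">\u269c</text>'
        '</svg>'
    )
```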

        • zeroxfe6 hours ago
          > TBH I would have just rendered a font glyph, or failing that, grabbed an image.

          If an LLM did that, people would be all up in arms about it cheating. :-)

          For all its flaws, we seem to hold LLMs up to an unreasonably high bar.

          • marginalia_nu6 hours ago
            That's the job description for a good programmer though. Question assumptions and requirements, and then find the simplest solution that does the job.

            Just about anyone can eventually come up with a hideously convoluted HeraldicImageryEngineImplFactory<FleurDeLis>.

    • comex6 hours ago
      LLMs are really bad at anything visual, as demonstrated by pelicans riding bicycles, or Claude Plays Pokémon.

      Opus would probably do better though.

      • tartoran6 hours ago
        How could they be any good at visuals? They are trained on text after all.
        • comex6 hours ago
          Supposedly the frontier LLMs are multimodal and trained on images as well, though I don't know how much that helps for tasks that don't use the native image input/output support.

          Whatever the cause, LLMs have gotten significantly better over time at generating SVGs of pelicans riding bicycles:

          https://simonwillison.net/tags/pelican-riding-a-bicycle/

          But they're still not very good.

          • tartoran6 hours ago
            I have to admit I'm seeing this for the first time and am somewhat impressed by the results, and I even think they will get better with more training, why not... But are these multimodal LLMs still LLMs though? They're still LLMs, but with a sidecar that does other things, and the image training takes place outside the LLM, so in a way the LLMs still don't "know" anything about these images; they're just generating them on the fly upon request.
            • boxedemp4 hours ago
              Maybe we should drop one of the L's
        • astrange6 hours ago
          Claude is multimodal and can see images, though it's not good at thinking in them.
        • msephton6 hours ago
          Shapes can be described as text or mathematical formulas.
        • tempest_6 hours ago
          An SVG is just text.
    • internet20006 hours ago
      I got Opus 4.6 to one shot it, took 5-ish mins. "Write me a python program that outputs an svg of a fleur-de-lis. Use freely available images to double check your work."

      It basically just re-created the wikipedia article fleur-de-lis, which I'm not sure proves anything beyond "you have to know how to use LLMs"

      • 647384 hours ago
        Just for reference, Codex using GPT-5.4 and that exact prompt was a 4-shot that took ten minutes. The first result was a horrific caricature. After a slight rebuke ("That looks terrible. Read https://en.wikipedia.org/wiki/Fleur-de-lis for a better understanding of what it should look like."), it produced a very good result but it then took two more prompts about the right side of the image being clipped off before it got it right.
      • robertcope4 hours ago
        Same, I used Sonnet 4.6 with the prompt, "Write a simple program that displays a fleur-de-lis. Python is a good language for this." Took five or six minutes, but it wrote a nice Python Tk app that did exactly what it was supposed to.
    • scuff3d4 hours ago
      I tried to use Codex to write a simple TCP to QUIC proxy. I intentionally kept the request fairly simple, take one TCP connection and map it to a QUIC connection. Gave a detailed spec, went through plan mode, clarified all the misunderstandings, let it write it in Python, had it research the API, had it write a detailed step by step roadmap... The result was a fucking mess.

      Beyond the fact that it was "correct" in the same way the article's author described, there was absolutely bizarre shit in there. As an example, multiple times it tried to import modules that didn't exist. It noticed this when tests failed, and instead of figuring out the import problem it added a fucking try/except around the import and did some goofy Python shenanigans to make it "work".

    • tartoran6 hours ago
      Have you tried describing to Claude what it is? The more the detail the better the result. At some point it does become easier to just do it yourself.
      • parvardegr41 minutes ago
        Agreed with the part that at some point it's better to just do it yourself, but they will surely keep getting better.
      • marginalia_nu6 hours ago
        It knows what it is, it's a very well known symbol. But translating that knowledge to code is something else.

        Interesting shortcoming, really shows how weak the reasoning is.

        • cat_plus_plus6 hours ago
          Try writing code from a description without looking at the picture or the generated graphics. A visual LLM with a suggestion to find coordinates of the different features and use lines/curves to match them might do better.
      • vdfs6 hours ago
        Most people just forget to tell it "make it quick" and "make no mistake"
        • mekael6 hours ago
          I’m unable to determine if you’re missing /s or not.
        • tartoran6 hours ago
          That's kind of foolish IMO. How can an open ended generic and terse request satisfy something users have in mind?
  • codethief5 hours ago
    > Your LLM Doesn't Write Correct Code. It Writes Plausible Code.

    I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.

    In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.

  • ontouchstart5 hours ago
    I made a comment in another thread about my acceptance criteria

    https://news.ycombinator.com/item?id=47280645

    It is more about LLMs helping me understand the problem than giving me over engineered cookie cutter solutions.

  • nprateeman hour ago
    In the last month I've done 4 months of work. My output is what a team of 4 would have produced pre-AI (5 with scrum master).

    Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.

    Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.

    I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.

    There are no shortcuts to developing skill and taste.

  • riffraff2 hours ago
    To be fair, people do too.
  • gzread6 hours ago
    Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"
  • graphememes5 hours ago
    bad input > bad output

    idk what to say, just because it's Rust doesn't mean it's performant, or that you asked for it to be performant.

    yes, llms can produce bad code, they can also produce good code, just like people

    • jqpabc1232 hours ago
      > yes, llms can produce bad code, they can also produce good code, just like people

      Over time, you develop a feel for which human coders tend to be consistently "good" or "bad". And you can eliminate the "bad".

      With an LLM, output quality is like a box of chocolates: you never know what you're going to get. It varies based on what you ask and what is in its training data --- which you have no way to examine in advance.

      You can't fire an LLM for producing bad code. If you could, you would have to fire them all because they all do it in an unpredictable manner.

  • skybrian6 hours ago
    You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.

    It's probably a good idea to improve your test suite first, to preserve correctness.
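    A minimal harness of the kind you might ask for first (the function under test and the workload are placeholders):

```python
import timeit

def parse_line(line: str) -> list[str]:
    # Stand-in for whatever code you want the LLM to speed up.
    return line.split(",")

def bench(fn, arg, number=100_000) -> float:
    # Wall-clock seconds for `number` calls; rerun after each change
    # and compare against this baseline before accepting the "fix".
    return timeit.timeit(lambda: fn(arg), number=number)

baseline = bench(parse_line, "a,b,c,d")
```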

  • bamboozled2 hours ago
    I'm sure this is because they are pattern matching masters, if you program them to find something, they are good at that. But you have to know what you're looking for.
  • cat_plus_plus6 hours ago
    That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try: sure, it takes 2.5 seconds to insert 100 rows, but it stores them correctly and select is pretty fast. How many humans can do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. How long, and how many people, did SQLite take to get to where it is?
    • bluefirebrand6 hours ago
      I could "write" this code the same way, it's easy

      Just copy and paste from an open source relational db repo

      Easy. And more accurate!

      • snoob20216 hours ago
        It is a Rust reimplementation of SQLite. Not exactly just "copy and paste"
      • cat_plus_plus6 hours ago
        The actual task is usually to mix something that looks like a dozen different open source repos combined, taking just the necessary parts for the task at hand and adding glue / custom code for the exact thing being built. While I could do it, an LLM is much faster at it, and most importantly I would not enjoy the task.
  • user39393824 hours ago
    I have great techniques to fix this issue but not sure how it behooves me to explain it.
  • serious_angel6 hours ago
    Holy gracious sakes... Of course... Thank you... thank you... dear katanaquant, from the depths... of my heart... There's still belief in accountability... in fun... in value... in effort... in purpose... in human... in art...

    Related:

    - <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)

    - <https://web.archive.org/web/20241021113145/https://slopwatch...>