AI is forcing us to write good code (bits.logic.inc)

302 points by sgk284 a month ago | 49 comments

KurSix a month ago
There's a catch with 100% coverage. If the agent writes both the code and the tests, we risk falling into a tautology trap. The agent can write flawed logic and a test that verifies that flawed logic (which will pass). 100% coverage only makes sense if tests are written before the code or rigorously verified by a human. Otherwise, we're just creating an illusion of reliability by covering hallucinations with tests. An "executable example" is only useful if it's semantically correct, not just syntactically.
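A minimal sketch of the trap described above. The function, its deliberate bug, and both checks are hypothetical illustrations, not code from the article: the first check derives its expected value from the same flawed formula, so it passes with full coverage, while the second takes its expected value from the requirement ("5% off 200 is 190") and fails.

```python
def apply_discount(price: float, percent: float) -> float:
    # Deliberate bug for illustration: divides by 10 instead of 100.
    return price - price * percent / 10


def tautological_test() -> bool:
    # The expected value repeats the implementation's formula, so the
    # test can only confirm that the code agrees with itself.
    expected = 200.0 - 200.0 * 5 / 10
    return apply_discount(200.0, 5) == expected


def independent_test() -> bool:
    # The expected value comes from the requirement, not from the code.
    return apply_discount(200.0, 5) == 190.0


print("tautological test passes:", tautological_test())
print("independent test passes:", independent_test())
```

Both checks execute every line of `apply_discount`, so a coverage report cannot tell them apart.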
- ben_w a month ago
  All the problems you list are true, but the solutions not so much.
  I've seen this problem with humans even back at university when it was the lecturer's own example attempting to illustrate the value of formal methods and verification.
  I would say the solution is neither "get humans to do it" nor "do it before writing code", but rather "get multiple different minds involved to check each other's blind spots, and no matter how many AI models you throw at it they only count as one mind even when they're from different providers". Human tests and AI code, AI tests and human code, having humans do code reviews of AI code or vice-versa, all good. Two different humans usually have different blind spots, though even then I've seen some humans bully their way into being the only voice in the room with the full support of their boss, not that AI would help with that.
  - godelski a month ago
    > "get multiple different minds involved to check each other's blind spots
    This is actually my big gripe about chatbot coding agents. They are trained on human preference and thus they optimize for errors that are in our blind spots.
    I don't think people take this subtlety seriously enough. Unless we have an /objective/ ground truth we end up proxying our optimization. So we don't optimize for code that /is/ correct, we optimize for code that /looks/ correct. It may seem like a subtle difference but it is critical.
    The big difference is when they make errors they are errors that are more likely to be difficult for humans to detect.
    Good tools should complement tool users. Fill in gaps. But as we've been trying to train agents to replace humans we are not focusing on this distinction. I want my coding agent to make errors that are obvious to me just as I want errors I make to be obvious to it (or for it to be optimized to detect errors I make)
    KurSix a month ago
    Exactly, we've basically trained an army of perfect corporate suck-ups. AI optimizes for the "least resistance at review" metric, not "logic correctness," so relying on eyeballs to check AI code right now is risky. We need "unfeeling" validators: compilers, formal verification, and rigid tests.
  - melagonster a month ago
    Maybe this is because humans have good intuition to know the difference between us. But this type of intuition does not work on the behaviour of LLMs.
- joshribakoff a month ago
  That’s why you’ve gotta test your tests. Insert bugs and ensure they fail.
  As the sibling comments alluded to, it’s not exclusively an AI problem since multiple people can miss the issue too.
  It’s wonderful that AI is an impetus for so many people to finally learn proper engineering principles though!
  - KurSix a month ago
    Mutation testing is becoming the only way to catch AI red-handed. Without mutations you'll be staring at a perfect CI/CD dashboard, unaware that your tests verify absolutely nothing.
    Yeah, it burns CPU like crazy, but CPU time is dirt cheap right now compared to the cost of an engineer debugging that self-deception in production.
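A hand-rolled sketch of the idea, using only the standard library's `ast` module. The function `is_adult`, the single mutation operator, and both suites are assumptions made up for illustration; real mutation testing tools apply many operators across a whole codebase. The weak suite passes on both the original and the mutant, which is the "perfect dashboard" situation; adding a boundary case kills the mutant.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class SwapGtE(ast.NodeTransformer):
    """Mutation operator: turn every `>=` comparison into `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    # Compile a module AST and return the function it defines.
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Covers every line, but never probes the boundary value.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Same cases plus the boundary.
    return weak_suite(fn) and fn(18) is True


original = load(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(SwapGtE().visit(ast.parse(SOURCE))))


def killed(suite):
    # A mutant is "killed" when a suite that passes on the original
    # fails on the mutated code.
    return suite(original) and not suite(mutant)


print("weak suite kills mutant:", killed(weak_suite))
print("strong suite kills mutant:", killed(strong_suite))
```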
  - vrighter a month ago
    but who will test the tests of tests?
    Supermancho a month ago
    The double-entry technique is the most effective path to ensuring accuracy (best tradeoff of time vs. accuracy) in finance and software. I.e., triple-entry accounting has not become the standard because it's a bad tradeoff: it requires a large increase in time and effort for rare increases in accuracy.
    adrianN a month ago
    The mutation testing engine.
    godelski a month ago
    Just add tests to test your test tests
- smarx007 a month ago
  I think the phase change hypothesis* is a bit wrong.
  I think it happens not at 100% coverage but at, say, 100% MC/DC test coverage. This is what SQLite and avionics software aim for.
  *has not been confirmed by peer-reviewed research.
  - cloudhead a month ago
    What's MC/DC?
    smarx007 a month ago
    Modified Condition/Decision Coverage (MC/DC) is a test coverage approach that considers a chunk of code covered if:
    - Every branch was "visited". Plain coverage already ensures that. I would actually advocate for 100% branch coverage before 100% line coverage.
    - Every part (condition) of a branch clause has taken all possible values. If you have if (enabled && limit > 0), MC/DC requires you to test with enabled, !enabled, limit > 0, limit <= 0.
    - Every change to the condition was shown to somehow change the outcome. (false && limit > 0) would not pass this, a change to the limit would not affect the outcome - the decision is always false. But @zweifuss has a better example.
    - And, of course, every possible decision (the outcome of the entire 'enabled && limit > 0') needs to be tested. This is what ensures that every branch is taken for if statements, but also for switch statements that they are exhaustive etc.
    MC/DC is usually required for all safety-critical code as per NASA, ESA, automotive (ISO 26262) and industrial (IEC 61508).
    amarant a month ago
    Limit <=0 appears to include every number between 0 and INT_MIN.
    I hope you don't have any string inputs, or your test is gonna take a while to run!
    smarx007 a month ago
    To test 'limit > 0' according to MC/DC, you need only two values, e.g. -1 and 1. There may be other code inside the branch using limit in some other ways, prompting more test cases and more values of limit but this one only needs two.
    But yes, exhaustively testing your code is a bit exhausting ;)
    zweifuss a month ago
    Modified Condition/Decision Coverage
    It's mandated by DO-178C for the highest-level (Level A) avionics software.
    Example: if (A && B || C) { ... } else { ... } needs individual tests for A, B, and C.
    Test #  A      B      A && B  Outcome taken  Shows independence for
    1       True   True   True    if branch      (baseline true)
    2       False  True   False   else branch    A (A flips outcome while B fixed at True)
    3       True   False  False   else branch    B (B flips outcome while A fixed at True)
    zweifuss a month ago
    I made a mistake:
    Test #  A      B      C      Result
    1       True   True   False  True
    2       False  True   False  False
    3       True   False  False  False
    4       False  True   True   True
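The four corrected test rows can be checked mechanically. The sketch below is an illustration under an assumption: it uses the strict "unique-cause" reading of MC/DC, where a pair of tests demonstrates a condition's independence if the two tests differ only in that condition and produce different outcomes. The helper name `independence_pairs` is hypothetical.

```python
from itertools import combinations


def decision(a: bool, b: bool, c: bool) -> bool:
    # The decision from the example: if (A && B || C)
    return a and b or c


# The four test rows from the corrected table, as (A, B, C).
TESTS = [
    (True, True, False),
    (False, True, False),
    (True, False, False),
    (False, True, True),
]


def independence_pairs(tests, fn):
    """Map each condition index to the pairs of 1-based test numbers that
    differ only in that condition and produce different outcomes."""
    pairs = {i: [] for i in range(len(tests[0]))}
    for (n1, t1), (n2, t2) in combinations(enumerate(tests, start=1), 2):
        differing = [i for i, (x, y) in enumerate(zip(t1, t2)) if x != y]
        if len(differing) == 1 and fn(*t1) != fn(*t2):
            pairs[differing[0]].append((n1, n2))
    return pairs


result = independence_pairs(TESTS, decision)
for index, name in enumerate("ABC"):
    print(name, "shown independent by test pairs:", result[index])
```

Under that reading, tests 1 and 2 cover A, tests 1 and 3 cover B, and tests 2 and 4 cover C, so three conditions need only four tests.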
    emoII a month ago
    Basically branch coverage but also all variations of the predicates, e.g. testing both true || true, and true || false
    Antibabelic a month ago
    https://en.wikipedia.org/wiki/Modified_condition/decision_co...
- closeparen a month ago
  Also true of human-written unit tests. You probably also want to have integration or UI automation tests that cover the end-user scenarios in your product requirements, and invariants that are checked against large numbers of examples either taken from production (sanitized of course), in a shadow environment, or generated if you absolutely must.
- theptip a month ago
  It’s true - but there are “good code” solutions to this already. For example, BDD / Acceptance Tests can be used to write human-readable specs.
  IMO it’s quite boilerplate-y to set this up pre-LLM but probably the ROI is favorable now.
  Furthermore, as Uncle Bob has written a lot about, putting effort into structuring your tests well is another area that’s usually under-invested. LLMs often write very repetitive tests, but are happy to DRY out, write factories, etc if you ask them.
- notimetorelax a month ago
  You’re right. What I like doing in those cases is to review very closely the tests and the assertions. Frequently it’s even faster than looking at the SUT itself.
  - ruszki a month ago
    I’ve heard this “review very closely” thing many times, and it rarely means review very closely. Maybe 5% of developers ever really do this, and I’m probably overestimating. When people post AI-generated code here, it’s quite obvious that they don’t review it properly. There are videos where people record how we should use LLMs, and they clearly don’t do this.
    christophilus a month ago
    Yeah. This is me. I try, but I always miss something. The sheer volume and occasional stupidity make it difficult. Spot checking only gets you so far. Often, the code is excellent except in one or two truly awful spots where it does something crazy.
- eternityforest a month ago
  Tests freeze behavior in place, and manual end-to-end testing can confirm that the most common paths are at least kind of correct-ish.
  Obviously that's not good enough, but I'd much rather have AI tests than poor test coverage.
- dbdoskey a month ago
  In theory, that is the benefit of having one agent that is limited to writing the tests and another that only does the coding, and running them separately. That way, to fix a failing test, you don't change the test, etc...
- eru a month ago
  Well, we let humans write both business logic code and tests often enough, too.
  Btw, you can get a lot further in your tests, if you move away from examples, and towards properties.
  - jaynetics a month ago
    Can you give an example (pun not intended) of testing with properties?
    eru a month ago
    https://fsharpforfunandprofit.com/series/property-based-test... is an entertaining intro.
    https://hypothesis.readthedocs.io/en/latest/ are the docs for one of the best property based testing libraries available in any language.
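For a concrete flavour of the difference, here is a small standard-library-only sketch. The run-length encoder and the three properties are hypothetical examples chosen for illustration; a library such as Hypothesis would generate the inputs and shrink failing cases for you, whereas this loop just draws random strings. Instead of asserting one input/output pair, each property must hold for every generated input.

```python
import random
from itertools import groupby


def rle_encode(s: str) -> list[tuple[str, int]]:
    # Run-length encode: "aabbb" -> [("a", 2), ("b", 3)]
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)


def check_properties(trials: int = 500, seed: int = 0) -> bool:
    rng = random.Random(seed)  # seeded so failures are reproducible
    for _ in range(trials):
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        encoded = rle_encode(s)
        # Property 1: decoding inverts encoding.
        if rle_decode(encoded) != s:
            return False
        # Property 2: adjacent runs never share a character.
        if any(x[0] == y[0] for x, y in zip(encoded, encoded[1:])):
            return False
        # Property 3: run lengths add up to the input length.
        if sum(n for _, n in encoded) != len(s):
            return False
    return True


print("all properties hold:", check_properties())
```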
- machomaster a month ago
  You could mitigate that risk by using different agents (versions, companies).
tombert a month ago
Something I just started doing yesterday, and I'm hoping it catches on, is that I've been writing the spec for what I want in TLA+/PlusCal at a pretty high level, and then I tell Codex to implement exactly to the spec. I tell it to not deviate from the spec at all, and be as uncreative as possible.
Since it sticks pretty close to the spec and since TLA+ is about modifying state, the code it generates is pretty ugly, but ugly-and-correct code beats beautiful code that's not verified.
It's not perfect; something that naively adheres to a spec is rarely optimized, and I've had to go in and replace stuff with Tokio or Mio or optimize a loop because the resulting code is too slow to be useful, and sometimes the code is just too ugly for me to put up with so I need to rewrite it, but the amount of time to do that is generally considerably lower than if I were doing the translation myself entirely.
The reason I started doing this: the stuff I've been experimenting with lately has been lock-free data structures, and I guess what I am doing is novel enough that Codex does not really appear to generate what I want; it will still use locks and lock files and when I complain it will do the traditional "You're absolutely right", and then proceed to do everything with locks anyway.
In a sense, this is close to the ideal case that I actually wanted: I can focus on the high-level mathy logic while I let my metaphorical AI intern deal with the minutiae of actually writing the code. Not that I don't derive any enjoyment out of writing Rust or something, but the code is mostly an implementation detail to me. This way, I'm kind of doing what I'm supposed to be doing, which is "formally specify first, write code second".
- BrittonR a month ago
  This is also how I’m developing most of my code these days. My opinions are pretty similar to those of the pig book author https://martin.kleppmann.com/2025/12/08/ai-formal-verificati....
  - tombert a month ago
    For the first time I might be able to make a case for TLA+ to be used in a workplace. I've been trying for the last nine years, with managers that will constantly say "they'll look into it".
- jnpnj a month ago
  Interesting, just the other day I tried asking if iterating in Haskell or Prolog wouldn't help both convergence speed and token use. I wish there was a group to study how to do proper engineering with LLMs without losing the modeling / verification aspect.
- baq a month ago
  You might find success with having the LLM contribute to the spec itself. It suddenly started to work with the most recent frontier models, to the point that the economics of writing specs shifted, since they got 10-100x cheaper to get right.
pgroves a month ago
This is sort of why I think software development might be the only real application of LLMs outside of entertainment. We can build ourselves tight little feedback loops that other domains can't. I somewhat frequently agree on a plan with an LLM and a few minutes or hours later find out it doesn't work and then the LLM is like "that's why we shouldn't have done it like that!". Imagine building a house from scratch and finding out that it was using some American websites to spec out your electric system and not noticing the problem until you're installing your Canadian dishwasher.
- mrtksn a month ago
  > Imagine building a house from scratch
  That's why those engineering fields have strict rules, often require formal education, and someone can even end up in prison if they screw up badly enough.
  Software is so much easier and safer, till very recently anonymous engineering was the norm and people are very annoyed with Apple pushing for signing off the resulting product.
  Highly paid software engineers across the board must have been an anomaly that is ending now. Maybe in the future only those who code actually novel solutions or high risk software will be paid very well - just like engineers in the other fields.
  - zarzavat a month ago
    > people are very annoyed with Apple pushing for signing off the resulting product.
    Apple is very much welcome to push for signing off of software that appears on their own store. That is nothing new.
    What people are annoyed about is Apple insisting that you can only use their store, a restriction that has nothing to do with safety or quality and everything to do with the stupendous amounts of money they make from it.
    mrtksn a month ago
    It's literally the case of Apple requiring signing the binary to run on the platforms they provide; Apple doesn't have a say on other platforms. It is a very similar situation with local governments.
    Also, people complain all the time about rules and regulations for making stuff. Especially in the EU, you can't just create products however you like and let people decide if it is safe to use, you are required to make your products meet certain criteria and avoid using certain chemicals and methods, you are required to certify certain things and you can't be anonymous. If you are making and selling cupcakes for example and if something goes wrong you will be held responsible. Not only when things go wrong, often local governments will do inspections before letting you start making the cupcakes and every now and then they can check you out.
    Software appears to be headed in that direction. Of course, due to the nature of software it probably wouldn't be exactly like that, but IMHO it is very likely that at least having someone responsible for the things a piece of software does will become the norm.
    Maybe in the future if your software leaks sensitive information for example, you may end up being investigated and fined if not following best practices that can be determined by some institute etc.
    rounce a month ago
    > Maybe in the future if your software leaks sensitive information for example, you may end up being investigated and fined
    This is already the case in the UK, and the EU too as far as I’m aware.
    zarzavat a month ago
    ...but the EU is one of the entities forcing Apple to allow other stores.
    It turns out that Apple is not, in fact, the government.
    Drakim a month ago
    That's not a very compelling counterexample, when you consider how often countries with governments force other countries with governments to do as they want, often with nothing but economic or soft power.
    alwillis a month ago
    > Apple is very much welcome to push for signing off of software that appears on their own store.
    Just to be clear, apps have to be notarized/signed to run on an Apple device. For macOS, notarized apps aren't required to be distributed in the App Store. Due to sandbox restrictions, some dev tools are distributed independently.
    Or there are two versions: a less capable version for the App Store and a more capable version distributed independently.
  - PunchyHamster a month ago
    Software developers being paid well is a result of demand, not because it's very hard.
    The skill and strictness required are only vaguely related to pay; if there are enough people for the job it won't pay amazingly, regardless of how hard it is.
    > Software is so much easier and safer, till very recently anonymous engineering was the norm and people are very annoyed with Apple pushing for signing off the resulting product.
    That has nothing to do with engineering quality; it is just to make it harder to go around their ecosystem (and skip paying the shop fee), with the additional benefit of a signed package being harder to attack. You can still deliver absolute slop, but the slop will be from you, not from a middleman that captured the delivery process.
- ptx a month ago
  I don't understand why the experience you describe would lead you to conclude that LLMs might be useful for software development.
  The response "that's why we shouldn't have done it like that!" sounds like a variation on the usual "You're absolutely right! I apologize for any confusion". Why would we want to get stuck in a loop where an AI produces loads of absolute nonsense for us to painstakingly debug and debunk, after which the AI switches track to some different nonsense, which we again have to debug and debunk, and so on? That doesn't sound like a good loop.
- ogogmad a month ago
  > This is sort of why I think software development might be the only real application of LLMs outside of entertainment.
  Wow. What about also, I don't know, self-teaching*? In general, you have to be very arrogant to say that you've experienced all the "real" applications.
  * - For instance, today and yesterday, I've been using LLMs to teach myself about RLC circuits and "inerters".
  - Larrikin a month ago
    I would absolutely not trust an LLM to teach me anything alone. I've had it introduce ideas I hadn't heard about which I looked up from actual sources to confirm it was a valid solution. Daily usage has shown it will happily lead you down the wrong path and usually the only way to know that it is the wrong path, is if you already knew what the solution should be.
    LLMs MAY be a version of office hours or asking the TA, if you only have the book and no actual teacher. I have seen nothing that convinces me they are anything more than the latest version of the hammer in our toolbox. Not every problem is a nail.
    alwillis a month ago
    > LLMs MAY be a version of office hours or asking the TA
    In my experience, most TAs are not great at explaining things to students. They were often the best student in their class, and they can't relate to students who don't grasp things as easily as they do--"this organic chemistry problem set is so easy; I don't know why you're not getting it."
    But an LLM has infinite patience and can explain concepts in a variety of ways, in different languages and at different levels. Bilingual students may speak English just fine, but they often think and reason in their native language. Not a problem for an LLM.
    A teacher in an urban school system with 30 students, 20 of whom need customized lesson plans due to neurological divergence, can use LLMs to create these lesson plans.
    Sometimes you need things explained to you like you're five years old and sometimes you need things explained to you as an expert.
    On deeper topics, LLMs give their references, so a student can and should confirm what the LLM is telling them.
  - array_key_first a month ago
    Self-teaching pretty much doesn't work. For many decades now, the barrier has not been access to information, it's been the "self" part. Turns out most people need regimen, accountability, strictness, which AI just doesn't solve because it's a yes-man.
    wiseowise a month ago
    > Self-teaching pretty much doesn't work. For many decades now, the barrier has not been access to information, it's been the "self" part.
    That’s complete bogus. And LLMs are yes-men by default; nothing stops you from overriding the initial setting.
    array_key_first a month ago
    It's not bogus at all. We've had access to 100,000x more information than we know what to do with for a while now. Right now, you can go online and learn disciplines you've never even heard of before.
    So why aren't you a master of, I don't know, reupholstery? Because the barrier isn't information, it's you. You're the bottleneck, we all are, because we're humans.
    And AI really just does not help here. It's the same problem with professor Google, I can just turn off the computer, and I will. This is how it is for the vast majority of people.
    Most people who claim to be self-taught aren't even self-taught. They did a course or multiple courses. Sure, it's not traditional college, but that's not self-taught.
    ogogmad a month ago
    I think LLMs lower the barrier for learning certain topics. For instance, without them I wouldn't have tried to learn about RLC circuits at all.
    knollimar a month ago
    They're fundamentally trained to agree and don't do well when they're required to challenge ideas they're not "confident" about.
  - skywhopper a month ago
    It’s somewhat delusional and potentially dangerous to assume that chatting with an LLM about a specific topic is self-teaching beyond the most surface-level understanding of a topic. No doubt you can learn some true things, but you’ll also learn some blatant falsehoods and a lot of incorrect theory. And you won’t know which is which.
    One of the most important factors in actually learning something is humility. Unfortunately, LLM chatbots are designed to discourage this in their users. So many people think they’re experts because they asked a chatbot. They aren’t.
    kevin42 a month ago
    I think everything you said was true 1-2 years ago. But the current LLMs are very good about citing work, and hallucinations are exceedingly rare. Gemini for example frequently directs you to a website or video that backs up its answer.
    ogogmad a month ago
    > It’s somewhat delusional and potentially dangerous to assume that chatting with an LLM about a specific topic is self-teaching beyond the most surface-level understanding of a topic
    It's delusional and very arrogant of you to confidently assert anything without proof. A topic like RLC circuits has a body of rigorous theorems and proofs underlying it*, and nothing stops you from piecing it together using an LLM.
    * - See "Positive-Real Functions", "Schwarz-Pick Theorem", "Schur Class". These are things I've been mulling over.
  - stackghost a month ago
    Why would you think that a machine known to cheerfully and confidently assert complete bullshit is suitable to learn from?
    ogogmad a month ago
    Because you can independently check anything it tells you. You understand there can be independent sources of validation?
    sojournerc a month ago
    Why not just search out the independent sources and ditch the middleman?
    ogogmad a month ago
    Because verifying something is easier than finding it in the first place. It's in some way the difference between P and NP.
- pgroves a month ago
  Thinking about this some more, maybe I wasn't considering simulators (aka digital twins), which are supposed to be able to create fairly reliable feedback loops without building things in reality. Eg will this plane design be able to take off? Still, I feel fortunate I only have to write unit tests to get a bit of contact with reality.
  - redox99 a month ago
    Simulations in general are pretty flawed, and AIs will usually find ways to "cheat" the simulation.
    It's a very useful tool of course, but not as good as the software situation.
- knollimar a month ago
  You can install a Canadian one just fine; authorities might not like it in some jurisdictions, but it's safe and might even be up to code.
  I literally just had this exact argument; the biggest issue is that they can be tested for smaller amperages, but you just downsize the breaker.
- toxic72 a month ago
  It's more like you're installing the dishwasher and the dishwasher itself yells at you "I told you so" ;)
  - lvspiff a month ago
    I think of it as: you say "install dishwasher" and its plan looks like it has all the steps, but as it builds it out you somehow end up hiring a maid and buying a drying rack.
tempodox a month ago
This is hallucination. Or maybe a sales pitch. If production bugs and the requirement to retain a workable code base don’t get us to write “good” code, then nothing will. And at the current state of the art, “AI” will tend to make it worse.
- reedlaw a month ago
  The first sentence is problematic:
  > For decades, we’ve all known what “good code” looks like.
  When relatively trivial concerns such as the ideal length of methods haven't achieved consensus, I doubt there can be any broadly accepted standard for software quality. There are plenty of metrics such as test coverage, but anyone with experience could tell you how easy it is to game those and that enforcing arbitrary standards can even cause harm.
  - tempodox a month ago
    I agree. Moreover, I submit that “good code” isn’t even a universal constant, but context-sensitive along several dimensions.
  - deaux a month ago
    > When relatively trivial concerns such as the ideal length of methods haven't achieved consensus
    Is the consensus not that there isn't one? Surely that's the only consensus to reach? I don't see how there could possibly be an "ideal length", whatever you pick it'd be much too dogmatic.
    reedlaw a month ago
    John Carmack's and Martin Fowler's coding style advice are diametrically opposed. Carmack advocates inlining complex code that is only used once. Fowler advocates extracting it with a good name to clarify intent. I'm not sure the two views can be reconciled except by noting that they address separate concerns. Carmack prioritizes visibility while Fowler prioritizes intent.
    deaux a month ago
    With all due respect to these two, who as programmers are on a whole different dimension than me... this seems like a case where either their words were taken out of context, or it's one of the millions of cases of brilliant people hyperfixating on their particular domain and mistakenly extrapolating that to everywhere. Their advice could well be right for the particular type of code each of them worked on!
    But taking them as general rules for coding makes as much sense as applying advice for painting a bridge to painting the Mona Lisa. Seriously, try to come up with a single piece of advice about programming style that applies to every domain. The closest one I can think of is "give descriptive names to your variables", and even that doesn't apply to lots of code written to this very day. It's impossible.
    Software in 2025 is far too varied for any of that to make sense, and it has been for many decades.
- stingraycharles a month ago
  Yeah, test coverage isn't a replacement for good code. Worse yet, it may give you false confidence, especially if it's the AI that's writing the tests (which in practice very often is the case).
- zwnow a month ago
  Shhhh the original poster is the CEO of an AI based company. I am sure there is no bias here. /s
mkozlows a month ago
I like this. "Best practices" are always contingent on the particular constellation of technology out there; with tools that make it super-easy to write code, I can absolutely see 100% coverage paying off in a way that doesn't for human-written code -- it maximizes what LLMs are good at (cranking out code) while giving them easy targets to aim for with little judgement.
(A thing I think is under-explored is how much LLMs change where the value of tests is. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: Almost all the value I got from them was during the writing of the code. If I'd deleted the unit tests before merging, I'd've gotten 90% of the value out of them. Whereas now, the AI doesn't necessarily need unit tests as scaffolding as much as I do, _but_ having them put in there makes future agentic interactions safer, because they act as reified context.)
- Waterluvian a month ago
  It might depend on the lifecycle of your code.
  The tests I have for systems that keep evolving while being production critical over a decade are invaluable. I cannot imagine touching a thing without the tests. Many of which reference a ticket they prove remains fixed: a sometimes painfully learned lesson.
  - zmgsabst a month ago
    Also the lifecycle of your system, eg, I’ve maintained projects that we no longer actively coded, but we used the tests to ensure that OS security updates, etc didn’t break things.
- johnnyfived a month ago
  I've said this before here, but "best practices" in code indeed is very typical even with different implementations and architectures. You can ask an LLM to write you the best possible code for a scenario and likely your implementation wouldn't differ much.
  Writing, art, creative output, that's nothing at all like code, which puts the software industry in a more particular spot than anything else in automation.
afro88 a month ago
Without having tried it (caveat), I worry that 100% coverage to an LLM will lock in bad assumptions and incorrect functionality. It makes it harder for it to identify something that is wrong.
That said, we're not talking about vibe coding here, but properly reviewed code, right? So the human still goes "no, this is wrong, delete these tests and implement for these criteria"?
- realusername a month ago
  That's already what I'm experiencing even without forcing anything, the LLM creates a lot of "is 1 = 1?" tests
- sgk284 a month ago
  Yep, 100% correct. We're still reviewing and advising on test cases. We also write a PRD beforehand (with the LLM interviewing us!) so the scope and expectations tend to be fairly well-defined.
danieka a month ago
I thought the article would be about how, if we want AI to be effective, we should write good code.
What I notice is that Claude stumbles more on code that is illogical, unclear or has bad variable names. For example, if a variable is named "iteration_count" but actually contains a sum, that will "fool" AI.
So keeping the code tidy gives the AI clearer hints on what's going on which gives better results. But I guess that's equally true for humans.
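A small sketch of the kind of mismatch described above. Both functions are hypothetical illustrations; they compute the same value, but the first names its accumulator as if it were a counter.

```python
def mean_latency_misleading(samples_ms: list[float]) -> float:
    iteration_count = 0.0  # misleading name: this actually accumulates a sum
    for sample in samples_ms:
        iteration_count += sample
    return iteration_count / len(samples_ms)


def mean_latency_tidy(samples_ms: list[float]) -> float:
    total_ms = sum(samples_ms)  # the name now says what the value is
    return total_ms / len(samples_ms)


print(mean_latency_misleading([10.0, 20.0, 30.0]))
print(mean_latency_tidy([10.0, 20.0, 30.0]))
```

A reader, human or model, who trusts the first name will expect the final division to be dividing a count, which is exactly the kind of false hint the comment is about.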
- asielen a month ago
  Related: it seems AI has been effective at forcing my team to care about documentation, including good comments. Before, when it was just humans reading these things, it felt like there was less motivation to keep things up to date. Now the idea that AI may be using that documentation as part of a knowledge base or in evaluating processes seems to motivate people to actually spend time updating the internal docs (with some AI help of course).
  It is kind of backwards because it would have been great to do it before. But it was never prioritized. Now good internal documentation is seen as essential because it feeds the models.
- sleepy_keitaa month ago
  Humans can work with these cases better though because they have access to better memory. Next time you see "iteration_count", you'll know that it actually has a sum, while a new AI session will have to re-discover it from scratch. I think this will only get better as time goes on, though.
  - charcircuita month ago
    You are underestimating how lazy humans can be. Humans are going to skim code, scroll down into the middle of some function and assume iteration count means iteration count. AI on the other hand will have the full definition of the function in its context every time.
    joshribakoffa month ago
    You are underestimating the importance of attention. You can have everything in context and still attend to the wrong parts (eg bad names)
    charcircuita month ago
    Improving AI is easier than improving human nature.
  - drak0n1ca month ago
    Unfortunately, so far coding models seem to perform worse and break in other ways as context grows, so it's still best practice to start a new conversation even when iterating. Luckily, high-end reasoning models are now catching when var names don't match what they actually do (as long as the declaration is provided in context).
  - rsyringa month ago
    Or you immediately rename it to avoid the need to remember? :)
- CharlieDigitala month ago
  What I find works really well: scaffold the method signature and write your intent in the comment for the inputs, outputs, and any mutations/business logic + instructions on approach.
  The LLM has a very high chance of one-shotting this and doing it well.
  - Philip-J-Frya month ago
    This is what I tend to do. I still feel like my expertise in architecting the software and abstractions is like 10x better than I've seen an LLM do. I'll ask it to do X, and then ask it to do Y, and then ask it to do Z, and it'll give you the most junior looking code ever. No real thought on abstractions, maybe you'll just get the logic split into different functions if you're lucky. But no big picture thinking, even if I prompt it well it'll then create bad abstractions that expose too much information.
    So eventually it gets to the point where I'm basically explaining to it what interfaces to abstract, what should be an implementation detail and what can be exposed to the wider system, what the method signatures should look like, etc.
    So I had a better experience when I just wrote the code myself at a very high level. I know what the big picture look of the software will be. What types I need, what interfaces I need, what different implementations of something I need. So I'll create them as stubs. The types will have no fields, the functions will have no body, and they'll just have simple comments explaining what they should do. Then I ask the LLM to write the implementation of the types and functions.
    And to be fair, this is the approach I have taken for a very long time now. But when a new more powerful model is released, I will try and get it to solve these types of day to day problems from just prompts alone and it still isn't there yet.
    It's one of the biggest issues with LLM first software development from what I've seen. LLMs will happily just build upon bad foundations and getting them to "think" about refactoring the code to add a new feature takes a lot of prompting effort that most people just don't have. So they will stack change upon change upon change and sure, it works. But the code becomes absolutely unmaintainable. LLM purists will argue that the code is fine because it's only going to be read by an LLM but I'm not convinced. Bad code definitely confuses the LLMs more.
    deepsquirrelneta month ago
    I think this is my experience as well.
    I tend to use a shotgun approach, and then follow with an aggressive refactor. It can actually take a lot of time to prune and restructure the code well. At least it feels slow compared to opening the Claude firehose and spraying out code. There needs to be better tools for pruning, because Claude is not thorough enough.
    This seems to work well for me. I write a lot of model training code, and it works really well for the breadth of experiments I can run. But by the end it looks like a graveyard of failed ideas.
  - zahlmana month ago
    What if I write the main function but stub out calls to functions that don't exist yet; how will it do with inferring what's missing?
sandblast2a month ago
The expertise in software engineering typical in these promptfondling companies shines through this blog post.
Surely they know 100% code coverage is not a magical bullet because the code flow and the behavior can differ depending on the input. Just because you found a few examples which happen to hit every line of code you didn't hit every possible combination. You are living in a fool's paradise which is not a surprise because only fools believe in LLMs. You are looking for a formal proof of the codebase which of course no one does because the costs would be astronomical (and LLMs are useless for it which is not at all unique because they are useless for everything software related but they are particularly unusable for this).
- visargaa month ago
  So, what is the solution? Senior engineer looks over PR and signs LGTM? That is just "vibe testing". The worst kind of testing. I think the author is right, setting up tests to form a reactive environment for coding agents will lead us to a new golden age. If you later find some issue with your test case coverage, you expand it. But it is good to do it from the start as thoroughly as possible.
  - sandblast2a month ago
    > So, what is the solution?
    1. Clearly explain the massive harm LLMs cause society and the environment to everyone. (Mass media should do this instead of parroting every nonsense the promptfondlers feed them.)
    2. Ban them all. Don't tell me it's impossible just because it's widespread. Asbestos was everywhere.
- SR2Za month ago
  It's a bold claim that LLMs are useless for formal verification when people have been hooking them up to proof assistants for a while. I think that it's probably not a terrible idea; the LLM might make some mistakes in the spec but 99% of the time there are a lot of irrelevant details that it will do a serviceable job with.
lmeyerova month ago
Most of this rings true for us for the same reasons. We have been moving large old projects in this direction, and new ones start there. It's easier to do these via tool checks than trust skills files. I wouldn't say the resulting code is good, which folks are stumbling on, but it is rewarding better code - predictable, boring, tested, pure, and fast to iterate on, which are all indeed part of our SDLC principles.
Some of the advice is a bit more extreme, like I haven't found value in 100% code coverage, but 90% is fine. Others miss nuance: we have to work hard to prevent the AI from subverting the type checks, since by default it works around type errors by using getattr/cast/type: ignore/Any everywhere.
One item I'm hoping AI coders get better at is using static analysis tools and verification tools. My experiments here have been lukewarm/bad, like adding an Alloy model checker for some parts of GFQL (GPU graph query language) took a lot of prodding and found no bugs, but straight up asking codex to do test amplification on our unit test suite based on our code and past bugs works great. Likewise, it's easy to make it port conformance tests from standards and help with making our docs executable to help prevent drift.
A new area we are starting to look at is automatic bug patches based on production logs. This is practical for the areas we set up for vibe coding, which in turn are the areas we care about more and work most heavily on. We never trusted automated dependency update bots, but this kind of thing gets much more trustworthy & reviewable. Another thing we are eyeing is new 'teleport' modes so we can shift PRs to remote async development, which previously we didn't think worth supporting.
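As an illustration of the type-check subversion mentioned above, here is a small Python sketch contrasting a `cast` workaround with real narrowing. The functions `port_unsafe` and `port_checked` and the `config` dictionary are hypothetical and not taken from any codebase discussed here.

```python
from typing import Any, Optional, cast


def port_unsafe(config: dict[str, Any]) -> int:
    # Workaround style: cast() silences the checker but does no runtime check,
    # so a missing or non-integer "port" is returned unchanged.
    return cast(int, config.get("port"))


def port_checked(config: dict[str, Any]) -> Optional[int]:
    # Narrowing style: isinstance() gives the checker and the runtime the same facts.
    value = config.get("port")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None
```

The first version type-checks cleanly while still handing a string or `None` to callers that were promised an `int`, which is why a tool-level ban on such escape hatches can be worth enforcing.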
brynarya month ago
Strong agreement with everything in this post.
At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript.
2. Fast auto-formatting and linting. An AI code review is not a substitute for a deterministic result in sub-100ms to guarantee consistency. The auto-formatter is set up as a post-tool use Claude hook.
3. Side-effect free imports and construction. You should be able to load all the code files and construct an instance of every class in your app without a network connection spawning. This is harder than it sounds and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks which override functions on existing types or globals. These are effectively injecting lies into the type checker.
Shout out to tsgo, which has dramatically lowered our type checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
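For readers unfamiliar with the "zero thrown errors" idea in point 1, here is a rough Python analogue of a Result type. It is not neverthrow's API; `Ok`, `Err`, `parse_port`, and `describe` are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ok:
    value: int


@dataclass(frozen=True)
class Err:
    error: str


Result = Union[Ok, Err]


def parse_port(text: str) -> Result:
    # Failure is part of the return type, so callers have to handle it explicitly.
    if not (text.isascii() and text.isdigit()):
        return Err(f"not a number: {text!r}")
    port = int(text)
    if not 0 < port < 65536:
        return Err(f"out of range: {port}")
    return Ok(port)


def describe(result: Result) -> str:
    # Branch on the variant instead of wrapping the call in try/except.
    if isinstance(result, Ok):
        return f"port {result.value}"
    return f"error: {result.error}"
```

Because the failure case appears in the signature, a type checker can flag callers that forget to handle `Err`, which is the property an exception-based version would lose.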
- frioa month ago
  A TypeScript test suite that offers 100% coverage of "hundreds of thousands" of lines of code in under 1 second doesn't pass the sniff test.
  - brynarya month ago
    We're at 100k LOC between the tests and code so far, running in about 500-600ms. We have a few CPU intensive tests (e.g. cryptography) which I recently moved over to the integration test suite.
    With no contention for shared resources and no async/IO, it's just function calls running on Bun (JavaScriptCore), which measures function calling latency in nanoseconds. I haven't measured this myself, but the internet seems to suggest JavaScriptCore function calls can run in 2 to 5 nanoseconds.
    On a computer with 10 cores, fully concurrent, that would imply 10 billion nanoseconds of CPU time in one wall clock second. At 5 nanoseconds per function call, that would imply a theoretical maximum of 2 billion function calls per second.
    Real world is not going to be anywhere close to that performance, but where is the time going otherwise?
  - camel_gophera month ago
    Hey now he said 1,000ms, not 1 second
- ManuelKiesslinga month ago
  I'm on the same page as you, I'm investing into DX and test coverage and quality tooling like crazy.
  But the weird thing is: those things have always been important to me.
  And it has always been a good idea to invest in those, for my team and me.
  Why am I doing this 200% now?
  - monatrona month ago
    If you're like me you're doing it to establish a greater level of trust in generated code. It feels easier to draw out the hard guard-rails and have something fill out the middle -- giving both you, and the models, a reference point or contract as to what's "correct"
  - ManuelKiesslinga month ago
    Answering myself: maybe I feel much more urgency and motivation for this in the age of AI because the effects can be felt so much more acutely and immediately.
  - mkozlowsa month ago
    Because a) the benefits are bigger, and b) the effort is smaller. When something gets cheaper and more valuable, do more of it.
  - 0x696C6961a month ago
    For me it's because coworkers are pumping out horrible slop faster than ever before.
krupana month ago
So many of us see an LLM spit out a bunch of code at a very high rate and we're amazed. It is really impressive, but what we're forgetting is that the amount of code and the speed at which code is written has never been the bottleneck in developing good quality software.
AI will revolutionize software development if and when it does a far better job of producing correct code than humans.
- captainkrteka month ago
  My biggest problem with usage of an LLM in coding is that it removes engineers from understanding the true implementation of a system.
  Over the years, I learned that a lot of one's value as an engineer can come from knowing how things actually work. I've been in many meetings with very senior engineers postulating how something works arguing back and forth, when quietly one engineer taps away on their laptop, then spins it around to say "no, this is the code here, this is how it actually works".
badgersnakea month ago
I’m increasingly finding that the type of engineer that blogs is not the type of engineer anyone should listen to.
- yoyohello13a month ago
  The value of the blog post is negatively correlated to how good the site looks. Mailing list? Sponsors? Fancy Title? Garbage. Raw HTML dumped on a .xyz domain, Gold!
  - userbinatora month ago
    > on a .xyz domain
    That's a negative correlation signal for me (as are all the other weird TLDs that I have not seen besides SEO spam results and perhaps the occasional HN submission.) On the other hand, .com, .net, and .org are a positive signal.
  - llmslave2a month ago
    The exception is a front end dev, since that's their bread and butter.
- sgk284a month ago
  Can you say more? I see a lot of teams struggling with getting AI to work for them. A lot of folks expect it to be a little more magical and "free" than it actually is. So this post is just me sharing what works well for us on a very seasoned eng team.
  - imrona month ago
    As someone who struggles to realise productivity gains with AI (see recent comment history) I appreciate the article.
    100% coverage for AI generated code is a very different value proposition than 100% coverage for human generated code (for the reasons outlined in the article).
  - justatdotina month ago
    it is MUCH easier for solo devs to get agents to work for them than it is for teams to get agents to work for them.
    andrekandrea month ago
    that's interesting, whats the reason for that?
    justatdotina month ago
    Hi, the reason I have this expectation is that on a (cognitively) diverse team there will be a range of reactions that all need to be accommodated.
    some (many?) devs don't want agents. Either because the agent takes away the 'fun' part of their work, or because they don't trust the agent, or because they truly do not find a use for it in their process.
    I remember being on teams which only remained functional because two devs tried very hard to stay out of one another's way. Nothing wrong with either of them, their approach to the work was just not very compatible.
    In the same way, I expect diverse teams to struggle with finding a mode of adoption that does not negatively impact on the existing styles of some members.
    andrekandrea month ago
    thanks for the reply, thats interesting
    i was thinking it was more like llms when used personally can make huge refactorings and code changes that you review yourself and just check it in, but with a team its harder to make sweeping changes that an llm might make more possible cause now everyone's changes start to conflict... but i guess thats not much of an issue in practice?
    justatdotina month ago
    oh yeah well that's an extreme example of how one dev's use could overwhelm a team's capacity.
- throwatdem12311a month ago
  It's just veiled marketing for their company.
- cube00a month ago
  Even some of the comments here can't help name dropping their own startups for no actual reason.
- observationista month ago
  Badgersnake's corollary to Gell-Mann amnesia?
- iamjsa month ago
  I find that this idea of restricting degrees of freedom is absolutely critical to being productive with agents at scale. Please enlighten us as to why you think this is nonsense
  - mrkeena month ago
    Wearing seatbelts is critical for drunk-driving.
    All praise drunk-driving for increased seatbelt use.
    llmslave2a month ago
    Finally something I can get behind.
melozoa month ago
I’m not sure how controversial this is - but 100% code coverage is almost always a waste of time, paid both immediately and long term, for certain languages. Go, for example, requires explicit error handling, but the way errors are handled is usually plain and homogeneous. Adding unit testing everywhere creates a phenomenal amount of test code that can become 3x the size of the source, and certain changes (like interface changes) can require updates to all tests, especially if mocking is used.
Obviously with AI maybe those issues I have go away. But I really don’t like letting the AI modify tests without meticulously manually reviewing those changes, because in my experience the AI cares more about getting the tests passing than it does about ensuring semantic correctness. For as long as tests are manually maintained I will continue keeping them as few as necessary while maintaining what I view as an acceptable amount of coverage.
altmanaltmana month ago
Wouldn't a better title be "How we're forcing AI to write good code (because it's normally not that good in general, which is crazy, given how many resources it's sucking, that we need to add an extra layer on top of it and use it to get anything decent)"
- visargaa month ago
  > which is crazy, given how many resources it's sucking
  Gentlemen, the dog writes poetry and music, but it is boring, mediocre quality. Overhyped dog.
  - two_handfulsa month ago
    That is a great description of current AI, actually. Love it.
    Some people are amazed the dog can write poetry. Some people complain that the poetry isn't good enough.
    pessimizera month ago
    Of what use is a dog that writes bad poetry? It's gone from being a dog to being an annoying dog.
    It's like having the power to turn water into gross, undrinkable wine.
- Aerolfosa month ago
  Don't forget "we're obligated to try and sell it so here's an ai generated article to fill up our quota because nobody here wanted to actually sit down and write it"
  - sgk284a month ago
    FWIW all of the content on our eng blog is good ol' cage-free grass-fed human-written content.
    (If the analogy, in the first paragraph, of a Roomba dragging poop around the house didn't convince you)
- add-sub-mul-diva month ago
  Then it wouldn't be effective advertising/vanity blogging from some self-promoting startup.
cube00a month ago
I can't reconcile how the CEO of an AI startup is, on one hand, pushing "100% Percent [sic] Code Coverage" while also selling the idea of "Less than 60 seconds to production" on their product (which is linked in the first screen-full of the blog post so it's not like these are personal thoughts).
If 100% code coverage is a good thing, you can't tell me anyone (including parallel AI bots) is going to do this correctly and completely for a given use case in 60 seconds.
I don't mind it being fast, but to sell it as 60 second fast while trying to give the appearance you support high quality and correct code isn't possible.
- antonvsa month ago
  The US seems to be beta testing the idea that the most successful CEOs are the ones that can convince investors to buy the most shares at the highest prices based on the biggest lies.
- queueberta month ago
  Cargo Cult Steve Jobs is your answer.
nathan_f77a month ago
This is exactly how I've been working with AI this year and I highly recommend it. This kind of workflow was not feasible when I was working alone and typing every line of code. Now it's surprisingly easy to achieve. In my latest project, I've enforced extremely strict linting rules and completely banned any ignore comments. No file over 500 lines, and I'm even using all the default settings to prevent complex functions (which I would have normally turned off a long time ago.)
Now I can leave an agent running, come back an hour or two later, and it's written almost perfect, typed, extremely well tested code.
- KurSixa month ago
  Sounds like a dream, but there is a risk of a local maximum here. Strict linters and small files are great at helping the agent write syntactically correct code, but they don't guarantee architectural correctness. An agent can generate 100 perfect 500-line files that together form an unmaintainable dependency hell. A linter catches bad code, not bad system design. Leaving an agent unsupervised for 2 hours is bold because refactoring architectural mistakes is harder than fixing typos
- teaearlgraycolda month ago
  I went from "ugh I don't want to write e2e tests" to "well I'll at least have the LLM write some". 50% coverage is way better than 0%! I'm very strict about the runtime code, but let the LLM take the reins on writing tests (of course still reviewing the code).
  It's funny how on one side you have people using AI to write worse code than ever, and on the other side people use AI as an extension of their engineering discipline.
mritsa month ago
Author should ask AI to write a small app with 100% code coverage that breaks in every path except what is covered in the tests.
- thih9a month ago
  Example output if anyone else is curious:
  ```
  def fragile(x):
      lst = [None]
      lst[x - 42]
      return "ok"

  def test_fragile():
      assert fragile(42) == "ok"
  ```
  - andrekandrea month ago
    this doesn't seem like a very useful test...? i'm more interested in the failure modes when input != 42, what happens when i pass NaN to that etc...
    jmo, but tests should be a chance to improve the implementation of functions not just one-off "write and forget" confirmations of the happy path only... automating all that just short-circuits that whole process... but maybe i'm missing something.
- sgk284a month ago
  I never claim that 100% coverage has anything to do with code breaking. The only claim made is that anything less than 100% does guarantee that some piece of code is not automatically exercised, which we don't allow.
  It's a footnote on the post, but I expand on this with:
  100% coverage is actually the minimum bar we set. We encourage writing tests for as many scenarios as is possible, even if it means the same lines getting exercised multiple times. It gets us closer to 100% path coverage as well, though we don’t enforce (or measure) that
  - nicoburnsa month ago
    > I never claim that 100% coverage has anything to do with code breaking.
    But what I care about is code breaking (or rather, it not breaking). I'd rather put effort into ensuring my test suite does provide a useful benefit in that regard, rather than measure an arbitrary target which is not a good measure of that.
  - reactordeva month ago
    I feel this comment is lost on those who have never achieved it and gave up along the journey.
  - xcskier56a month ago
    SimpleCov in ruby has 2 metrics, line coverage and branch coverage. If you really want to be strict, get to 100% branch coverage. This really helps you flesh out all the various scenarios
  - a3wa month ago
    Brakes in cars here in Germany are integrated with less than 50 % coverage in the final model testing that goes to production.
    Seems like even if people could potentially die, industry standards are not really 100% realistic. (Also, redundancy in production is more of a solution than having some failures and recalls, which are solved with money.)
bgwaltera month ago
https://logic.inc/
"Ship AI features and tools in minutes, not weeks. Give Logic a spec, get a production API—typed, tested, versioned, and ready to deploy."
- travisgriggsa month ago
  https://en.wikipedia.org/wiki/Drinking_the_Kool-Aid
- bgwaltera month ago
  Someone is downvoting everything again. It seems to be a cronjob, always around the same time.
crypticaa month ago
I agree with the sentiment but I find this definition of 'good code' is a bit superficial for my liking.
Especially the part about TypeScript. My experience is that LLMs such as Claude Code work really well with vanilla JavaScript. Once you switch to TypeScript, you're tapping into a different language training set which is much smaller than the JS training set and which adheres to different conventions and principles.
The part about good test coverage makes sense though I don't know if 100% coverage is the specific goal to aim for. You can have 100% coverage in terms of lines of code but don't test the relevant parameters which cause issues.
My definition of good code is more about architecture; modularity, separation of concerns, minimal interfaces, choosing good abstractions and layering them appropriately, clearly separating trust boundaries with appropriate validation... Once the LLM sees certain things, it lets you tap into a "world class software engineer" training set.
A lot of the points mentioned in the article differentiate junior developer from mid-level developer... If you want the LLM to output 10x software engineer quality, the patterns are different and more nuanced... Goes beyond just having good test coverage.
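The point above about 100% line coverage missing the parameters that actually cause issues can be shown with a tiny Python sketch. The `average` function and its test are hypothetical, made up purely for illustration.

```python
def average(values):
    # Every line here runs for any input, so a single passing test
    # already yields 100% line coverage.
    return sum(values) / len(values)


def test_average_happy_path():
    # Exercises every line of average(), yet never tries an empty list.
    assert average([2, 4, 6]) == 4
```

Calling `average([])` raises `ZeroDivisionError`, an input that a line-coverage report would never flag because no line is left unexecuted.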
PunchyHamstera month ago
I'm not covering every
```
   if err != nil {
      return fmt.Errorf(...)
   }
```
no matter what kind of glue vibe coders snorted that day
jillesvangurpa month ago
This goes in the right direction. It could go further though. Types are indeed nice. So, why use a language where using those is optional? There are many reasons, but many of those have to do with people and their needs/wants rather than tool requirements. AI agents benefit from good tool feedback, so maybe switch to languages and frameworks that provide plenty of that and quickly. Switching used to be expensive because you had to do a lot of the work manually. That's no longer true. We can make LLMs do all of the tedious stuff.
Including using more rigidly typed languages, making sure things are covered with tests, using code analysis tools to spot anti patterns and addressing all the warnings, etc. That was always a good idea but we now have even fewer excuses to skip all that.
hopppa month ago
I prefer to write critical code and ask the LLM for snippets as if I were googling docs. That way I get maximum guard rails: it can't alter the project by itself, all code is reviewed by me, and I can refactor as I go.
If I am building guard rails to let the LLM directly code then I am building guard rails and not the project I want. I don't want to babysit an LLM, I just want to get on with my work.
I also don't agree with the title, a very prominent new dev community called vibe coders emerged and they are all about low quality code created fast. So LLMs mostly write bad code.
AuthAutha month ago
>Statement about how AI is actually really good and we should rely on it more. Doesnt cover any downsides.
>CEO of an AI company
Many such cases
- heliumteraa month ago
  the fantastic machine will be 10^23x more productive than all of us combined, they will give it all away for 20 dollars a month and these people will be left without anything to sell. then, they will leave. so technically AI will force the world to heal, actually he is correct.
cess11a month ago
I kind of feel that if you weren't doing this and start doing it to please a bunch of chatbots, then you're sending a pretty weird signal to your coworkers or employees. Like you care more about the bots than the people you work with.
Other than that, sure, good advice. If at all possible you should have watch -n 2 run_tests or test run on a file watcher on a screen while coding.
In my experience LLM:s like to add assertions and tests for impossible states, which is quite irritating, so I'd rather not do the agentic vibe thing anyway.
zmmmmma month ago
Very little there about the code itself being good. A lot about putting good guardrails around it and making it fast and safe to develop. Which is good for sure. But I feel it's misconstruing it to say the actual code is "good". The whole reason the guard rails provide value is the code is, by default, "not good" and how good the result is presumably sitting in a spectrum between "the worst possible that satisfies the guardrails" and "actually good".
bwhiting2356a month ago
I agree with this. 100% test coverage for front end is harder, I don't know if I'm going to reach for that yet. So far I've been making my linting rules stricter.
sebastianconcpta month ago
I feel more like it is forcing us to do better engineering, not just write better code.
The disruption comes from the economics of cognitive labor, the synthetic assistants are making feasible things that before were unbearably cognitively costly so manually we invested all that energy into the code parts.
I've made this to leverage that:
https://github.com/sebastianconcept/ai-squads
chrswa month ago
I think organization technical leadership wants to deploy AI to ship faster, not so engineering teams can do what they should have been doing all along. "Sure, we can deploy these tools, but first we need to properly document our design" isn't going to fly. The point of buying these tools is so teams ship without really understanding what they're doing, because there's a time cost to that.
Tweya month ago
I'd be interested to hear how they reconcile ‘100% code coverage’ with ‘QA needs to run fast’ on a large codebase.
I'd also really love to see a study around how much of the effort it takes, on average, to write (by carefully shepherding an agent or otherwise) bullet-proof tests and other guardrails for the LLM-generated code divided by the effort of writing the code by hand.
jaredcwhitea month ago
I'm sad programmers lacking a lot of experience will read this and think it's a solid run-down of good ideas.
- SoKamila month ago
  I’m more afraid that some manager will read this and impose rules on their team. On the surface one might think that having more test coverage is universally good and won’t consider trade offs. I have a gut feeling that Goodhart’s Law accelerated with AI is a dangerous mix.
  - KurSixa month ago
    Goodhart's Law works on steroids with AI. If you tell a human dev "we need 100% coverage," they might write a few dummy tests, but they'll feel shame. AI feels no shame - it has a loss function. If the metric is "lines covered" rather than "invariants checked," the agent will flood the project with meaningless tests faster than a manager can blink. We'll end up with a perfectly green CI/CD dashboard and a completely broken production because the tests will verify tautologies, not business logic
- zema month ago
  "fast, ephemeral, concurrent dev environments" seems like a superb idea to me. I wish more projects would do it, it lowers the barrier to contributions immensely.
  - zimpenfisha month ago
    > "fast, ephemeral, concurrent dev environments" seems like a superb idea to me.
    I've worked at one (1) place that, whilst not quite fully that, they did have a spare dev environment that you could claim temporarily for deploying changes, doing integration tests, etc. Super handy when people are working on (often wildly) divergent projects and you need at least one stable dev environment + integration testing.
    Been trying to push this at $CURRENT without much success but that's largely down to lack of cloudops resources (although we do have a sandbox environment, it's sufficiently different to dev that it's essentially worthless.)
  - frioa month ago
    Yeah, this is something I'd like more of outside of Agentic environments; in particular for working in parallel on multiple topics when there are long-running tasks to deal with (eg. running slow tests or a bisect against a checked out branch -- leaving that in worktree 1 while writing new code in worktree 2).
    I use devenv.sh to give me quick setup of individual environments, but I'm spending a bit of my break trying to extend that (and its processes) to easily run inside containers that I can attach Zed/VSCode remoting to.
    It strikes me that (as the article points out) this would also be useful for using Agents a bit more safely, but as a regular old human it'd also be useful.
- manmala month ago
  What’s bad about them? We make things baby-safe and easy to grasp and discover for LLMs. Understandability and modularity will improve.
- nathan_f77a month ago
  I have almost 30 years of experience as a programmer and all of this rings true to me. It precisely matches how I've been working with AI this year and it's extremely effective.
- baobuna month ago
  Could you be more specific in your feedback please.
  - jaredcwhitea month ago
    100% test coverage, for most projects of modest size, is extremely bad advice.
    CuriouslyCa month ago
    Pre-agents, 100% agree. Now, it's not a bad idea, the cost to do it isn't terrible, though there's diminishing returns as you get >90-95%.
    marcosdumaya month ago
    LLMs don't make bad tests any less harmful. Nor do they write good tests for the stuff people mostly can't write good tests for.
    zahlmana month ago
    Okay, but is aiming for 100% coverage really why the bad tests are bad?
    jeltza month ago
    In most cases I have seen bad tests, yes.
    marcosdumaya month ago
    Aiming for 100% coverage is almost certain to cause bad tests, yes.
    But not all bad tests come from a goal of 100% coverage.
    PunchyHamstera month ago
    You just end up writing needless tests trying to trigger or mock error state from a 3rd party library that's never actually returning error, just the lib had a rule of "every call returns error code" in case something changes and it's needed.
    pca006132a month ago
    The problem is that it is natural to have code that is unreachable. Maybe you are trying to defend against potential cases that may be there in the future (e.g., things that are not yet implemented), or algorithms written in a general way but are only used in a specific way. 100% test coverage requires removing these, and can hurt future development.
    sgk284a month ago
    It doesn't require removing them if you think you'll need them. It just requires writing tests for those edge cases so you have confidence that the code will work correctly if/when those branches do eventually run.
    I don't think anyone wants production code paths that have never been tried, right?
    bdangubica month ago
    laziness? unprofessionalism? both? or something else?
    spc476a month ago
    You forgot difficult. How do you test a system call failure? How do you test a system call failure when the first N calls need to pass? Be careful how you answer, some answers technically fall into the "undefined behavior" category (if you are using C or C++).
    zahlmana month ago
    ... Is that not what mocking is for?
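    As a rough sketch of the mocking approach in Python (it does not address the C/C++ concerns raised above): `unittest.mock` accepts an iterable `side_effect`, so the first N calls can succeed and the next one can raise. `write_all` is a hypothetical function invented for this example, and the file descriptor is never actually used because `os.write` is patched.

```python
import errno
import os
from unittest import mock


def write_all(fd: int, chunks: list[bytes]) -> int:
    """Hypothetical code under test: write chunks, stop at the first OSError."""
    written = 0
    for chunk in chunks:
        try:
            written += os.write(fd, chunk)
        except OSError:
            break
    return written


def test_third_write_fails() -> None:
    # The first two calls "succeed" with 3 bytes each; the third raises ENOSPC.
    effects = [3, 3, OSError(errno.ENOSPC, "No space left on device")]
    with mock.patch("os.write", side_effect=effects) as fake_write:
        assert write_all(99, [b"abc", b"def", b"ghi"]) == 6
        assert fake_write.call_count == 3
```

    This only shows that the error branch runs and returns what the test expects. Whether the mocked failure resembles a real one is still a judgment call.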
    rvza month ago
    all of the above.
adi_kuriana month ago
What if 'good code' is just 'code optimized for humans who can't hold much in working memory'? The model doesn't need breadcrumbs if it can see everything at once. If context windows 100x, think some of this may be less relevant. Big IF, have no idea tbh, hard to predict.
victorbjorklunda month ago
This is something I have seen. The code I write on projects I work on alone is a lot better today vs in the past because AI works better on a repo with good code quality. This can be writing smaller modules or even breaking out an API integration into its own library (something I seldom would do in the past).
ojra month ago
I spent so much time getting the mocks right with AI tests, and the tests could not be one-shotted or done by an inexperienced intern. Certainly don't have the budget to throw Claude Opus on it. I'll give it some time though, maybe things change.
the_kinga month ago
I think good names and a good file structure are the most important thing to get right here.
nxobjecta month ago
Don't forget logging, logging, and lots of logging - whether printf or structured.
sublineara month ago
What? We're already so far down the list of things to try with AI that we're saying hallucinated tests are better than no tests at all?
Seems actively harmful, and the AI hype died out faster than I thought it would.
> Agents will happily be the Roomba that rolls over dog poop and drags it all over your house
There it is, folks!
- andrewchambersa month ago
  Where did it say the tests need to be hallucinated ?
  If you can make good tests the AI shouldn't be able to cheat them. It will either churn forever or pass them.
CraigJPerrya month ago
> Entire categories of illegal states and transitions can be eliminated.
I have an over-developed, unhealthy interest in the utility of types for LLM generated code.
When an llm is predicting the next token to generate, my current level of understanding tells me that it makes sense that the llm's attention mechanism will be using the surrounding type signatures (in the case of an explicitly typed language) or the compiler error messages (in the cases where a language leans on implicit typing) to better predict that next token.
However, that does not seem to be the behaviour i observe. What i see is more akin to tokens in the type signature position in a piece of code often being generated without any seeming relationship to the instructions being written. It's common to generate code that the compiler rejects.
That problem is easily hidden and worked around - just wrap your llm invocation in a loop, feed in the compiler errors each time and you now have an "agent" that can stochastic gradient descent its way to a solution.
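A minimal sketch of that loop, under loud assumptions: Python's built-in `compile()` stands in for the compiler (it checks syntax only, not types), and `fake_model` is a hypothetical stub in place of a real LLM call. None of these names come from an actual agent framework.

```python
from typing import Callable, Optional


def check_syntax(source: str) -> Optional[str]:
    """Stand-in 'compiler': return an error message, or None if it parses."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def repair_loop(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Call generate(task, last_error) until the output compiles or we give up."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        source = generate(task, error)
        error = check_syntax(source)
        if error is None:
            return source
    return None


def fake_model(task: str, last_error: Optional[str]) -> str:
    # Hypothetical stub standing in for an LLM: wrong on the first try,
    # corrected once it has been shown an error message.
    if last_error is None:
        return "def add(a, b) return a + b"
    return "def add(a, b):\n    return a + b\n"
```

Calling `repair_loop(fake_model, "add two numbers")` returns the corrected source on the second attempt. Note that the loop stops at "the checker accepts it", which is not the same as "it solves the problem".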
Given this, you could say well what does it matter, even if an LLM doesn't meaningfully "understand" the relationship between types and instructions, there's already a feedback loop and therefore a solution available - so why do we even need to care about the fact an llm may or may not treat types as a tool to accurately model the valid solution space.
Well i can't help but think this is really the crux of software development. Either you're writing code to solve a defined problem (valuable) or you're doing something else that may mimic that to some degree but is not accurate (bugs).
All that said, pragmatically speaking, software with bugs is often still valuable.
TL;DR i'm currently thinking humans should always define the type signatures and test cases, these are too important to let an LLM "mid" its way through.
- everfrustrateda month ago
  Completely agree with you on the types. Will be interesting to see what new post-AI programming languages look like and I suspect they will all be strongly typed.
deauxa month ago
Your footnotes seem to be in the wrong order - maybe you switched paragraphs around and they got out of sync?
firemelta month ago
Yeah, beekeeping. I think about it a lot. I mean the agents should be isolated in their own environment. It's dangerous to give them your whole PC; who knows whether they're silently putting some rootkit or backdoor on it, like appending allowed SSH keys.
- sebastianconcpta month ago
  In Cursor you have Sandbox mode to deal with that issue?
user____namea month ago
I've been wondering if AI startups are running bots to downvote negative AI sentiment on HN. The hype is sort of ridiculous at times.
pizlonatora month ago
Why would I write code that makes it easier for a clanker to compete with me
throw-12-16a month ago
Just wait until the LLM Agent starts rewriting tests to adhere to your 100% code coverage mandate.
block_daggera month ago
I stopped reading at “static typing.” That is not what “good code” always looks like.
- kurtis_reeda month ago
  [flagged]
glemmaPaula month ago
Kool-Aid salesman is selling Kool-Aid again
jennyholzer3a month ago
I don't know about all this AI stuff.
How are LLMs going to stay on top of new design concepts, new languages, really anything new?
Can LLMs be trained to operate "fluently" with regards to a genuinely new concept?
I think LLMs are good for writing certain types of "bad code", i.e. if you're learning a new language or trying to quickly create a prototype.
However to me it seems like a security risk to try to write "good code" with an LLM.
- sgk284a month ago
  I suspect it will still fall on humans (with machine assistance?) to move the field forward and innovate, but in terms of training an LLM on genuinely new concepts, they tend to be pretty nimble on that front (in my experience).
  Especially with the massive context windows modern LLMs have. The core idea that the GPT-3 paper introduced was (summarizing):
  A sufficiently large language model can perform new tasks it has never seen using only a few examples provided at inference time, without any gradient updates or fine-tuning.
- dborehama month ago
  I've used LLMs to find bugs in and write code for a language almost nobody uses, that has terrible documentation. My assumption is that it did so in the same manner a human would: "ok this looks kind of like Algol or C" and "after reading a bunch of this code I think I get what's going on".
- rabfa month ago
  You do realise they can search the web? They can read documentation and api specs?
  - llmslave2a month ago
    They can't think though. They can't be creative.
    sswatsona month ago
    Neither of those assertions means anything. For many years, people have been using them to make confident predictions about what AI systems will never be able to accomplish. Those predictions are routinely falsified within months.
    Of course, some of those predictions may also turn out to be true. But either way, we have abundant empirical evidence that the reasoning is not sound.
- manmala month ago
  They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers. New concepts are not the problem. The problem is outdated information in the training data, like only crappy old Postgres syntax in most of the Stackoverflow body.
  - Aerolfosa month ago
    > They are retrained every 12-24 months and constantly getting new/updated reinforcement learning layers
    This is true now, but it can't stay true, given the enormous costs of training. Inference is expensive enough as is, the training runs are 100% venture capital "startup" funding and pretty much everyone expects them to go away sooner or later
    Can't plan a business around something that volatile
    0x696C6961a month ago
    You don't need to retrain the whole thing from scratch every time.
    manmala month ago
    GPT-5.1 was based on over 15 months old data IIRC, and it wasn’t that bad. Adding new layers isn’t that expensive.
    astrangea month ago
    Google's training runs aren't funded by VC. The Chinese models probably aren't either.
devhousea month ago
[dead]
phplovesonga month ago
LOL No. The AI code I see is 90% really bad. The poster then snakes around the first commenter who asks "how much of the code was generated by AI?"
Replies vary from silence to "I checked all the code" or "AI code is better than human code" or even "AI was not used at all", even when it is obvious it was 100% AI.
- maelna month ago
  You did not read the article did you
  - phplovesonga month ago
    No. The title gave me enough context to not even give it a click. That, or it's clickbait, making it even less clickable.