> Bill Gates compared measuring programming progress by lines of code to measuring aircraft building progress by weight
Aircraft weight is actually a very useful metric - and, like LoC, more of it is bad. But we do measure it!
Early on we should see huge, chunky contributions and bursts. LoC growth means things are being realized.
In a mature product shipping at a sustained and increasing velocity, seeing LoC decrease or grow glacially year-on-year is a warm fuzzy feeling.
By my estimation aircraft designs should grow a lot for a bit (from 0 to not 0), churn for a while, then aim for specified performance windows in periods of punctuated stability.
Reuse scenarios create some nice bubbles where LoC growth in highly validated frameworks/components is amazing, as surrounding systems obviate big chunks of themselves. Local explosions, global densification and refinement.
There is also the x.com crowd bragging about their OpenClaw agents pushing 10k lines of code every day.
Fill the gradient of machine states, then prune for correctness and utility.
That is not to say it's a good goal. But at the end of the day every program is electrical states in a machine. Fill machine, like search, see which ones are required to produce the most popular types of outputs, prune the rest.
Hint to syntax fans among programmers; most people will not be asking the machine to output Python or Elixir. Most will ask for movies, music, games. Bake the states needed to render and prune that geometry and color as needed. That geometry will include text shapes eventually too, enabling pruning away all the existing token systems like Unicode and ANSI. Storing state in strings is being deprecated.
Language is merely one user interface to reality. Grasp of it does not make one "more human" or in touch with the universe or yadda yadda. Such arguments are pretentious attention-seeking by those educated in a particular language. Look at them! ...recreating grammatically correct sentences per the rules of the language. Never before seen! Wow wow wow
Look at all the software written, all the books and the themes within. A grasp of language these days is as novel an outcome as going to the grocery store or using a toilet.
Meaning: it's trivial to write unit tests when your code is stupid, only handles the happy path, and blows up on anything else. So if we say "you need 90% coverage" or whatever, people will write stupid, frail code that barely works in practice but is easy to unit test.
Similarly, if we say "do it with the least amount of code", we also throw any hope of robustness out the window and only write stupid happy-path code.
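A toy illustration of the point above (hypothetical functions, not from anyone's actual codebase): the happy-path-only version gets full line coverage from a single trivial test, while the robust version needs several times the code and tests, and is what actually survives real input.

```python
def parse_port(s):
    # Happy path only: assumes s is a clean decimal string.
    # Crashes on "", "abc", or None; happily accepts nonsense like -1 or 99999.
    return int(s)


def parse_port_robust(s):
    """More lines, harder to reach 100% coverage, but survives real input."""
    try:
        port = int(str(s).strip())
    except (TypeError, ValueError):
        return None
    if not 0 < port < 65536:
        return None
    return port


# One trivial test gives parse_port full line coverage:
assert parse_port("8080") == 8080

# The robust version needs a test per unhappy path too:
assert parse_port_robust("8080") == 8080
assert parse_port_robust("abc") is None
assert parse_port_robust(None) is None
assert parse_port_robust("99999") is None
```

Both a coverage target and a "fewest lines" target reward the first function; neither measures the difference that matters.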
Before, lines of code was (mis)used to try to measure individual developer productivity. And there was the collective realization that this fails, because good refactoring can reduce LoC, a better design may use less lines, etc.
But LoC never went away, for example, for estimating the overall level of complexity of a project. There's generally a valid distinction between an app that has 1K, 10K, 100K, or 1M lines of code.
Now, the author is describing LoC as a metric for determining the proportion of AI-generated code in a codebase. And just like estimating overall project complexity, there doesn't seem to be anything inherently problematic about this. It seems good to understand whether 5% or 50% of your code is written using AI, because that has gigantic implications for how the project is managed, particularly from a quality perspective.
Yes, as the author explains, if the AI code is more repetitive and needs refactoring, then the AI proportion will seem overly high relative to how much functionality it contributes. But at the same time, it's entirely accurate in terms of how that code is possibly a larger surface for bugs, exploits, etc.
And when the author talks about big tech companies bragging about the high percentage of LoC being generated with AI... who cares? It's obviously just for press. I would assume (hope) that code review practices haven't changed inside of Microsoft or Google. The point is, I don't see these numbers as being "targets" in the way that LoC once was for individual developer productivity... it's more just a description of how useful these tools are becoming, and a vanity metric for companies signaling to investors that they're using new tools efficiently.
The overall level of complexity of a project is not an "up means good" kind of measure. If you can achieve the same amount of functionality, obtain the same user experience, and have the same reliability with less complexity, you should.
Accidental complexity, as defined by Brooks in No Silver Bullet, should be minimized.
So I wish developers looked at apps with a complexity budget, which is basically Dijkstra's line of code budget. You have a certain amount of complexity you can handle. Do you want to spend that complexity on adding these features or those other features? There is a limit, and a budget you are working with. Many times I have wished that product managers and engineering managers would adopt this view.
And if AI tools are writing all of the code, does it even matter anymore?
I never said it was. To the contrary, it's more of an indication of how much more complex large refactorings might be, how complex it might be to add a new feature that will wind up touching a lot of parts, or how long a security audit might take.
The point is, it's important to measure things. Not as a "target", but simply so you can make more informed decisions.
I'd say you're operating on a higher plane of thought than the majority in this industry right now. Because the majority view roughly appears to be "Need bigger number!", with very little thought, let alone deep thought, employed towards the whys or wherefores thereof.
Google engineer perspective:
I actually think code reviews are one of the lowest-hanging fruits for AI here. We have AI reviewers now in addition to the required human reviews, and they do anything from being overly defensive at times, to noticing that variables are inconsistently named (helpful), to sometimes finding a pretty big footgun that might otherwise have been missed.
Even if it's not better than a human reviewer, the faster turnaround time on some small % of potential bugs is a big productivity boost.
It also often fails to clean up after itself. When you remove a feature (one you may not even have explicitly asked for), it will sometimes just leave the unused code behind. This is really annoying when reviewing and you realize one of the files you read through is referenced nowhere.
You have to keep a close eye out to prevent bloat from these issues.
I did this a few times as an experiment while already knowing how the problem could be solved. In difficult situations Cursor invariably adds code and creates an even bigger mess.
I wonder if this can be mitigated somehow at the inference level because prompts don't seem to be helping with this problem.
Quite a bit of my time was spent rewriting the massive amounts of garbage churned out by offshore partners.
Management stuck to their goal, so the compromise was to not delete offshore lines, but to comment them out.
Lines of code is a dumb metric, and anyone touting it for anything meaningful is disconnected from reality. It's bad that all these CEOs are touting it, but they have always leaned on dumb metrics like this.
- Is the client happy?
- Are the team members growing (as in learning)?
- Were we able to make a profit?
Everything else was less relevant. For example: why do I care that the project took a bit longer, if at the end the client was happy with the result and we can continue the relationship with new projects? It frees you from the cruelty of dates that are often set arbitrarily.
So perhaps we should evaluate AI coding tools the same. If we can deliver successful projects in a sustainable way, then we are good.
The LOC as a KPI is useless and people should humiliate Elon over that. (Paraphrasing Linus on that comment and adding support).
If I minify my project and get everything on one line, is that good? I think not.
Measuring success based on how many or how few lines there are is a bad idea, I think.
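To make the minification point concrete (a toy function, not from the thread): the same behavior can be seven lines or one, so the raw count alone says nothing about quality in either direction.

```python
# The same clamping logic, written for readability...
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value

# ...and "optimized" for the fewest possible lines.
clamp_one_liner = lambda v, lo, hi: lo if v < lo else hi if v > hi else v

# Identical behavior either way:
assert clamp(5, 0, 10) == clamp_one_liner(5, 0, 10) == 5
assert clamp(-3, 0, 10) == clamp_one_liner(-3, 0, 10) == 0
assert clamp(42, 0, 10) == clamp_one_liner(42, 0, 10) == 10
```

A metric that rewards the second form, or punishes the first, is measuring the wrong thing.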
I like the author's proposed "Comprehension coverage" metric. It aligns well with Naur's Programming as Theory Building.
This off-the-cuff statement buries so much complexity. Sure, it catches new code that exactly implements existing code, but IME it is __way__ more common to need to slightly (or not so slightly) change existing code so it can be used by multiple consumers, and then delete the new "duplicate" code. That is not trivial and requires (1) judgement from your AI coder and (2) deep expertise from your human reviewer.
> If AI-generated code introduces defects at a higher rate, you need more review, not less AI.
I think that is very much up for debate despite being so frequently asserted without evidence! This strikes me as the same argument as we see about self-driving cars: they don't have to be perfect, because there is (or we can regulate that there must be) a human in the loop. However, we have research and (sometimes fatal) experience from other fields (aviation comes to mind) about "automation complacency" - the human mind just seems to resist thoroughly scrutinizing automation which is usually right.
Right now AI / agentic coding doesn't seem to be a train we are going to be able to stop; and at the end of the day it is a tool like any other. Most of what seems to be happening is people letting AI fully take the wheel: not enough specs, not enough testing, not enough direction.
I keep experimenting with and tweaking how much direction to give the AI in order to produce less fuckery and more productive code.
I don't know how to encourage the kind of review that AI code generation seems to require. Historically we've been able to rely on the fact that (bluntly) programming is "g-loaded": smart programmers probably wrote better code, with clearer comments, formatted better, and documented better. Now, results that look great are a prompt away in each category, which breaks some subconscious indicators reviewers pick up on.
I also think that there is probably a sweet spot for automation that does one or two simple things and fails noisily outside the confidence zone (aviation metaphor: an autopilot that holds heading and barometric altitude and beeps loudly and shakes the stick when it can't maintain those conditions), and a sweet spot for "perfect" automation (aviation metaphor: uh, a drone that autonomously flies from point A to point B using GPS, radar, LIDAR, etc...?). In between I'm afraid there be dragons.
Now, 100 kLoC is roughly 1M tokens, which costs a few dollars - so how could something that costs single-digit dollars possibly be worth tens of millions in value? Clearly there's a substantial gap between how useful different pieces of code are, so bragging about how much of it you produce without telling me how valuable it is is useless. I guess it's a long-winded way of saying "show me the money".
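The back-of-envelope arithmetic above can be sketched as follows; the tokens-per-line ratio and the per-token price are illustrative assumptions, not quotes from any particular provider.

```python
# Rough cost of generating a 100k-line codebase with an LLM.
# Both constants below are assumed round numbers for illustration.
LINES_OF_CODE = 100_000
TOKENS_PER_LINE = 10      # assumed average tokens per line of code
PRICE_PER_MTOK = 5.00     # assumed price in $ per 1M output tokens

tokens = LINES_OF_CODE * TOKENS_PER_LINE       # 1,000,000 tokens
cost = tokens / 1_000_000 * PRICE_PER_MTOK     # $5.00

print(f"{tokens:,} tokens ≈ ${cost:.2f}")
```

Whatever exact numbers you plug in, the generation cost lands in single-digit dollars, which is the gap the comment is pointing at.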
Focusing on capabilities instead of shipping code also can provide a better measure.
> AI didn't just repeat the mistake. It broke the mistake open.
Come on bruh
These metrics are not applicable for advanced roles, no matter what you come up with. But even lines of code are good enough to see progress from a blank slate. Every developer or advanced AI agent must be judged on a case-by-case basis.
The OpenBSD project prides itself on producing very secure, bug-free software, and they largely trend toward as low a line count as they can possibly get away with while maintaining readability (so no code-golf tricks, for the most part). I would rather we write secure, bug-free software than speed up the ability to output 10 kLoC. Typing out the code isn't the difficult part in that scenario.
But reducing the amount of LoC helps, just like using the correct word helps in writing text. That’s the craft part of software engineering, having the taste to write clear and good code.
And just like writing and any other craft, the best way to acquire such taste is to study others' works.