I guess that’s OK, but I was skateboarding at 19.
Can you even kick flip?
I can tell you that we do extensive testing. We figured out how to objectively measure code quality on certain benchmark problems, and empirically it's extremely helpful nearly all the time.
But in the general case, it is not actually possible to guarantee this.
That's because whether a change improves the code often depends on information which is literally not present in the codebase.
Some of these are more trite. E.g.: whether a comment is helpful or redundant slop depends on the audience.
Some are deeper. E.g.: whether a piece of duplication is good or bad depends on the intent, and that is often impossible to recover from the source. https://www.pathsensitive.com/2018/01/the-design-of-software...
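To make the duplication case concrete, here's a made-up sketch (the function names and the 5% figures are hypothetical, not from any real codebase). The two bodies are identical, and nothing in the source says whether that's one decision written twice or two unrelated decisions that happen to match today:

```python
# Hypothetical example: names and rates are invented for illustration only.

def late_fee_cents(balance_cents: int) -> int:
    """Late fee: 5% of the outstanding balance (imagine billing policy sets this)."""
    return balance_cents * 5 // 100


def loyalty_discount_cents(subtotal_cents: int) -> int:
    """Loyalty discount: 5% of the subtotal (imagine marketing sets this)."""
    return subtotal_cents * 5 // 100


# Both functions compute the same value for every input, so a tool that only
# sees the source will flag them as duplicates. Whether merging them is an
# improvement depends on whether the two rates are meant to change together,
# and that intent is not written anywhere in the code.
```

If the two rates are coincidentally equal, merging them means the next change to one silently changes the other. If they're the same rule, leaving them apart means someone will eventually update only one.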
A simpler example: There's a function that's never called. Should it be deleted?
There are a number of factors outside the codebase that determine the answer, including the obvious one: "Not if your next prompt is going to start using it."
It's pretty expensive to measure even for small programs. It's also more of a relative than an absolute measure, i.e., it scores two variants of the same codebase against each other, but the raw scores aren't very meaningful on their own. So our goal had been to use this in the benchmark set we're working on when we release a standalone refactoring product.
But the more I think about this suggestion, the more I think: "Hmmm, why not?"
The most difficult code in the 1.0 release is some gymnastics to avoid the appearance of a concurrency conflict with a user running their own jj commands, made at the request of the person who introduced me to jj.