Provenance Is the New Version Control (aicoding.leaflet.pub)

58 points by gpi a month ago | 28 comments

zerof1l25 days ago
I don't see how this is an AI-specific issue or an issue at all. We solved it already. It's called software development best practices.
> A diff can show what changed in the artifact, but it cannot explain which requirement demanded the change, which constraint shaped it, or which tradeoff caused one structure to be chosen over another.
That's not true... diffs would be traceable to commits and PRs, which in turn are traceable to the tickets. And then there would be tests. With all that, it would be trivial to understand the whys.
You need both the business requirements and the code. One can't replace the other. If you attempt to describe technical requirements precisely, you'll inevitably end up writing the code or, at the very least, pseudocode.
As for regenerating the deleted code out of business requirements alone, that won't work cleanly most of the time. Because there are technical constraints and technical debt.
- Terretta25 days ago
  Agree...
  Look at “Managing the Development of Large Software Systems” (Royce 1970) Figure 10 on scanned page 338:
  https://www.praxisframework.org/files/royce1970.pdf
  Whatever you do, do not stop on Figure 2's infamous waterfall!
  Understand Royce understood Figure 4, and in Figure 7 proposed prototyping with code to inform the product (and iterate the product).
  This was elaborated in the Spiral Model (Boehm 1988):
  https://en.wikipedia.org/wiki/Spiral_model#/media/File:Spira...
  And then '90s DSDM (under various stripped-down flavors clustered around agile claiming to be True Agile™) turned into basically WGLL spanning 2 decades going into LLMs:
  https://en.wikipedia.org/wiki/Dynamic_systems_development_me...
  Note that DSDM purports to "fix" cost but not through estimation per se, but rather by flexing the backlog cutoff:
  “DSDM fixes cost, quality and time at the outset and uses the MoSCoW prioritisation of scope into musts, shoulds, coulds and will not haves to adjust the project deliverable to meet the stated time constraint.”
  Cost is just headcount, quality should be in your + user's success criteria, and time is (generally) driven by some real-world requirement (event, opportunity, runway, competition, whatever). Varying scope means you didn't plan and roadmap every task up front.
  Most everything since are variations on this, tailoring to the needs of the variant's author.
  Doing all of this in text-as-code (Markdown, Mermaid, etc.) makes it machinable. Any number of shops were already doing this in text-as-code before the LLMs, giving them a spec-driven LLM context leg up.
- rafterydj25 days ago
  I'm not sold on the idea that commits and PRs are always easily tied back to tickets. Ideally, sure. In practice? Not always.
  - jollyllama25 days ago
    And that is where the quality of your engineering org is exposed.
gritzko25 days ago
LLMs can implement red-black trees with impressive speed, quality and even some level of determinism. Here I buy the argument. Once we take something that is not already on GitHub in a thousand different flavors, it becomes an adventure. Like real adventure.
So what did you say about version control?
- nine_k25 days ago
  Basically, if it's in the commit history, it can be checked out and adjusted to the local circumstances. If not, then somebody has to actually write it!
RHSeegera month ago
I'm a bit confused by this because a given set of inputs can produce a different output, and different behaviors, each time it is run through the AI.
> By regenerable, I mean: if you delete a component, you can recreate it from stored intent (requirements, constraints, and decisions) with the same behavior and integration guarantees.
That statement just isn't true. And, as such, you need to keep track of the end result... _what_ was generated. The why is also important, but not sufficient.
Also, and unrelated, the "reject whitespace" part bothered me. It's perfectly acceptable to have whitespace in an email address.
- onion2ka month ago
  > I'm a bit confused by this because a given set of inputs can produce a different output, and different behaviors, each time it is run through the AI.
  How different the output is each time you generate something from an LLM is a property called 'prompt adherence'. It's not really a big deal in coding LLMs, but in image generation some of the newer models (Z Image Turbo for example) give virtually the same output every time if the prompt doesn't change. To the point where some users claim it's actually a problem because most of the time you want some variety in image gen. It should be possible to tune a coding LLM to give the same response every time.
  - gizmo68625 days ago
    Even if you have deterministic LLMs (which is absolutely something that can be done), you still need to pin a specific version to get that. That might work in the short term; but 10 years from now, you're not going to want to be using a model from today.
    nextaccountic25 days ago
    > Even if you have deterministic LLMs (which is absolutely something that can be done),
    Note, when Fabrice Bellard made his LLM thing to compress text, he had to make sure it was deterministic. It would be terrible if it slightly corrupted files in different ways each time it decompressed
    j_w25 days ago
    You cannot pin a certain version, even today. If you are using some vendor LLM, the versions are transient; they are constantly making micro optimizations/tweaks.
  - belZaah25 days ago
    If that is true, and a given history of prompts combined with a given model always gives the same code, then you have invented what’s called a compiler. Take human-readable text and convert it into machine code. Which means we have a much higher-level language than before, and your prompts become your code.
    onion2k25 days ago
    "The hottest new programming language is English" Andrej Karpathy in Jan 2023
    https://x.com/karpathy/status/1617979122625712128
  - locknitpicker25 days ago
    > How different the output is each time you generate something from an LLM is a property called 'prompt adherence'. It's not really a big deal in coding LLMs, (...)
    I strongly disagree. Nowadays most LLMs support updating context with chat history. This means the output of an LLM will be influenced by what prompts you have been feeding it. You can see glaring changes in what a coding agent does based on what topics you researched.
    To take the example a step further, some LLMs even update their system prompts to include context such as where you are in the world at that precise moment and the time of the year. Once I had ChatGPT generate a complete example project based around an event that was taking place at a city I happened to be cruising through at that moment.
  - RHSeeger25 days ago
    > It's not really a big deal in coding LLMs
    Challenge. Given that you're not nailing down _EVERY_ detail in your descriptions (because that's not possible), the results can vary a fair amount. Especially as the model changes over time. And if there was anything else in the context; I've gotten different results from the exact same prompt 60 minutes apart after reverting the code, because there were some failed attempts to get it to fix what it broke.
hnlmorg25 days ago
Code still matters in the world of LLMs because they’re not deterministic and different LLMs produce different output too. So you cannot pin specification to application output in the way the article implies.
What the author actually wants is ADRs: https://github.com/joelparkerhenderson/architecture-decision...
That’s a way of being able to version control requirements.
- visarga25 days ago
  TIL about ADR's, a great idea.
viraptora month ago
I'm not sure if this actually needs a new system. Git commits have the message, arbitrary trailers, and note objects. If this way of doing source control is useful, I'm sure it could be prototyped on top of git first.
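To make the trailer idea concrete, here is a minimal sketch of pulling intent metadata out of a commit message. The trailer keys (`Requirement`, `Decision`, `Constraint`) and the IDs are made up for illustration, and the parser is a simplification of what trailers look like; `git interpret-trailers --parse` is the real tool for this.

```python
import re

# Simplified trailer parser. Assumption: trailers are "Key: value" lines that
# make up the final paragraph of the message. This only approximates git's own
# rules; `git interpret-trailers --parse` is the authoritative implementation.
TRAILER_RE = re.compile(r"^([A-Za-z0-9-]+):\s*(.+)$")


def parse_trailers(message: str) -> dict[str, list[str]]:
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a subject line alone carries no trailer block
    trailers: dict[str, list[str]] = {}
    for line in paragraphs[-1].splitlines():
        match = TRAILER_RE.match(line.strip())
        if match is None:
            return {}  # last paragraph is prose, not a trailer block
        trailers.setdefault(match.group(1), []).append(match.group(2))
    return trailers


# Hypothetical commit message; the keys and IDs are invented for this sketch.
COMMIT_MESSAGE = """Reject malformed email addresses at signup

Validation previously happened only client-side, so API callers could
store addresses that the mailer later rejected.

Requirement: REQ-142
Decision: ADR-0007
Constraint: must not add a network round trip
"""

print(parse_trailers(COMMIT_MESSAGE))
```

Nothing here needs a new storage format: the "why" rides along with the commit it explains, and tooling can index it later.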
- smaudeta month ago
  The article smacks of someone who doesn't understand version control at all...
  Their main idea is to version control the reasoning, which, OK, cool. They want to graph the reasoning and requirements, sounds nice, but there are graph languages that fit conveniently into git to achieve this already...
  I also fundamentally disagree with the notion that the code is "just an artifact". The idea to specify a model is cute, but, these are indeterminate systems that don't produce reliable output. A compiler may have bugs yes, but generally speaking the same code will always produce the same machine instructions, something that the proposed scheme does not...
  A higher order reasoning language is not unreasonable, however the imagined system does not yet exist...
klodolpha month ago
> Once an AI can reliably regenerate an implementation from specification…
I’m sorry but it feels like I got hit in the head when I read this, it’s so bad. For decades, people have been dreaming of making software where you can just write the specification and don’t have to actually get your hands dirty with implementation.
1. AI doesn’t solve that problem.
2. If it did, then the specification would be the code.
Diffs of pure code never really represented decisions and reasoning of humans very well in the first place. We always had human programmers who would check code in that just did stuff without really explaining what the code was supposed to do, what properties it was supposed to have, why the author chose to write it that way, etc.
AI doesn’t change that. It just introduces new systems which can, like humans, write unexplained, shitty code. Your review process is supposed to catch this. You just need more review now, compared to previously.
You capture decisions and specifications in the comments, test cases, documentation, etc. Yeah, it can be a bit messy because your specifications aren’t captured nice and neat as the only thing in your code base. But this is because that futuristic, Star Trek dream of just giving the computer broad, high-level directives is still a dream. The AI does not reliably reimplement specifications, so we check in the output.
The compiler does reliably reimplement functionally identical assembly, so that’s why we don’t check in the assembly output of compilers. Compilers are getting higher and higher level, and we’re getting a broader range of compiler tools to work with, but AI are just a different category of tool and we work with them differently.
- charcircuit25 days ago
  >If it did, then the specification would be the code.
  Except you can't run english on your computer. Also the specification can be spread out through various parts of the code base or internal wikis. The beauty of AI is that it is connected to all of this data so it can figure out what's the best way to currently implement something as opposed to regular code which requires constant maintenance to keep current.
  At least for the purposes I need it for, I have found it reliable enough to generate correct code each time.
  - free_bip25 days ago
    What do you mean? I can run English on my computer. There are multiple apps out there that will let me type "delete all files starting with 'hacker'" into the terminal and end up with the correct end result.
    And before you say "that's indirect!", it genuinely does not matter how indirect the execution is or how many "translation layers" there are. Python for example goes through at least 3 translation layers, raw .py -> Python bytecode -> bytecode interpreter -> machine code. Adding one more automated translation layer does not suddenly make it "not code."
    charcircuit25 days ago
    I mean that the prompt is not like code. It's not a set of instructions that encodes what the computer will do. It includes instructions for how an AI can create the necessary code. Just because a specification is "translated" into code, that doesn't mean the input is necessarily code.
    yaris25 days ago
    What is conceptually different between prompts and code? Code is also not always what the computer will do, declarative programming languages are an example here. The only difference I see is that special precaution should be taken to get deterministic output from AI, but that's doable.
    charcircuit25 days ago
    Code is defined as:
    >noun A system of symbols and rules used to represent instructions to a computer; a computer program.
    On the other hand, the prompt is for the AI. It's not meant as instructions to a computer.
    matrss25 days ago
    So, Prolog is not code then?
    > Except you can't run english on your computer.
    I can't run C on it either, without translating it to machine code first. Is C code?
    charcircuit25 days ago
    A prompt is for the AI to follow. C is for the computer to follow. I don't want to play games with definitions anymore, so I am no longer going to reply if you continue to drill down and nitpick about exact definitions.
    matrss25 days ago
    If you don't want to argue about definitions, then I'd recommend you don't start arguments about definitions.
    "AI" is not special-sauce. LLMs are transformations that map an input (a prompt) to some output (in this case the implementation of a specification used as a prompt). Likewise, a C compiler is a transformation that maps an input (C code) to some output (an executable program). Currently the big difference between the two is that LLMs are usually probabilistic and non-deterministic. Their output for the same prompt can change wildly in-between invocations. C compilers on the other hand usually have the property that their output is deterministic, or at least functionally equivalent for independent invocations with the same input. This might be the most important property that a compiler has to have, together with "the generated program does what the code told it to do".
    Now, if multiple invocations of an LLM were to reliably produce functionally equivalent implementations of a specification as long as the specification doesn't change (and assuming that this generated implementation does actually implement the specification), then how does the LLM differ from a compiler? If it does not fundamentally differ from a compiler, then why should the specification not be called code?
    defrost25 days ago
    > the prompt is for the AI.
    and C is for the compiler not "the computer".
    It's commonplace for a compiler on one computer to read C code created on a second computer and output (if successfully parsed) machine code for a third computer.
  - alphabetag67525 days ago
    As long as your language is good enough to generate correct code at any point, it is a specification. If not, it is an ambiguous approximation.
  - chmod77525 days ago
    You can't run most of what you would probably consider "code" on your computer either. C, Python, Rust, ... all have to be translated into something your computer can execute by compilers and interpreters.
    On the other hand, no current LLM can implement complex software from specification alone, especially not without iterating. Just today I realized how laughably far off we are, after having fought GPT-5.1-Codex-Max for way longer than I should have over correctly implementing a 400-line Python script for wrangling some git repositories. These things are infuriatingly bad as soon as you move away from things like web development.
rapjr925 days ago
There may be a more subtle issue here. A specification interpreted by an LLM is different from one interpreted by a person. From the LLM you get a kind of average of how a lot of people wrote that "kind" of code. From the person you get a specific interpretation of the spec into code that fits the task. Different people can have different interpretations, but that is not the same as the random variations LLMs produce. To get the same kind of fine tuning a person can do while coding (for example realizing the spec needs to change) from an LLM, you need a very precise spec to start with, one that includes a lot of assumptions that are not included in current specs, but which are expected from people. I see further complications with getting an LLM to generate code where the spec changes, like say now you want to port the same spec to generate code on new computer architectures. So now specs need architecture-dependent specifications? Some backwards compatibility needs to be maintained also; if the LLM regenerates ALL of the code each time, then the testing requirements balloon.
jayd16a month ago
What if I told you a specification can also be measured (and source controlled) in lines?
- JellyBeanThiefa month ago
  This was the very first thing I thought when I was taught about requirement traceability matrices in uni. I was like "Ew, why is this happening in an Excel silo?" I had already known about ways of adding metadata to code in Java and C#, so I expected everything to be done in plain text formats so that tooling could provide information like "If you touch this function, you may impact these requirements and these user stories." or "If you change this function's signature, you will break contracts with these other team members (here's their email)."
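A rough sketch of what that plain-text traceability could look like with in-code metadata, here using a Python decorator rather than Java annotations or C# attributes. The `implements` decorator, the `TRACE` registry, and the requirement IDs are all hypothetical names invented for this example.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical registry: requirement ID -> names of functions that implement it.
TRACE: dict[str, list[str]] = defaultdict(list)


def implements(*requirement_ids: str) -> Callable:
    """Tag a function with the requirement IDs it helps satisfy."""
    def decorator(func: Callable) -> Callable:
        for req_id in requirement_ids:
            TRACE[req_id].append(func.__qualname__)
        func.requirements = requirement_ids  # keep the metadata on the function
        return func
    return decorator


@implements("REQ-12", "STORY-3")
def normalize_email(raw: str) -> str:
    return raw.strip().lower()


@implements("REQ-12")
def is_valid_email(address: str) -> bool:
    return "@" in address


def impacted_by(func: Callable) -> tuple[str, ...]:
    """Which requirements might be affected if this function changes?"""
    return getattr(func, "requirements", ())


print(impacted_by(normalize_email))
print(dict(TRACE))
```

Because the links live next to the code, they are versioned with it, and a tool could answer "if you touch this function, you may impact these requirements" without a separate spreadsheet.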
Atomic_Torrfisk25 days ago
Sounds like hot air, Wolfram style. Making an intellectual, smart-sounding argument out of something that is, well, simple. Version control is version control, a hammer is a hammer. What style you choose depends on the situation; right now git is king because it works and we all understand it... enough.
The lossy aspect mentioned in the article just sounds like you forgot to write comments or a README. Simple fix.
- WorldMaker25 days ago
  The "lossy aspect" feels like it tells me a lot more that the author doesn't know what commit messages are for.
alphabetag67525 days ago
If you could regenerate some code from another code in a deterministic manner, then congrats you have developed a compiler and a high-level language.
michalsustr25 days ago
What I think the author is hoping to get is some inspectable graph of the whys that can be a basis for further automation/analysis. That’s interesting, but the line to actual code then becomes blurry. For instance, what about self-consistency across time? If this were just text, it would come out of sync (like all doc text does). If it's code, then maybe you just had wrong abstractions the whole time?
The way we solve the why/what separation (at minfx.ai) is by having a top-level PLAN.md document for why the commit was built, as well as regenerating README.md files on the paths to every touched file in the commit. Admittedly, this still leans more into the "what" rather than "why". I will need to think about this more, hmm.
This helps us to keep it well-documented and LLM-token efficient at the same time. What also helps is Rust forces you into a reasonable code structure with its pub/private modules, so things are naturally more encapsulated, which helps the documentation as well.
beej7125 days ago
TFA> By regenerable, I mean: if you delete a component, you can recreate it from stored intent (requirements, constraints, and decisions) with the same behavior and integration guarantees.
The only way to do this is with a mathematically precise and unambiguous stored intent, isn't it? And then aren't we just taking source code?
akoboldfryinga month ago
Yes, in theory you can represent every development state as a node in a DAG labelled with "natural language instructions" to be appended to the LLM context, hash each of the nodes, and have each node additionally point to an (also hashed) filesystem state that represents the outcome of running an agent with those instructions on the (outcome code + LLM context)s of all its parents (combined in some unambiguous way for nodes with multiple in-edges).
The only practical obstacle is:
> Non-deterministic generators may produce different code from identical intent graphs.
This would not be an obstacle if you restrict to using a single version of a local LLM, turn off all nondeterminism and record the initial seed. But for now, the kinds of frontier LLMs that are useful as coding agents run on Someone Else's box, meaning they can produce different outcomes each time you run them -- and even if they promise not to change them, I can see no way to verify this promise.
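A minimal sketch of the node structure described above, assuming a pinned local model and a recorded seed. The field names, the model identifier, and the placeholder tree hashes are invented for illustration; sorting the parent hashes is just one possible "unambiguous" way to combine multiple in-edges.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentNode:
    instructions: str         # natural-language instructions appended to the context
    parents: tuple[str, ...]  # hashes of the parent nodes
    outcome_tree: str         # hash of the filesystem state the agent produced
    model: str                # pinned model identifier (hypothetical)
    seed: int                 # recorded sampling seed

    def node_hash(self) -> str:
        # Canonical JSON so the same logical node always hashes the same way.
        payload = json.dumps(
            {
                "instructions": self.instructions,
                "parents": sorted(self.parents),
                "outcome_tree": self.outcome_tree,
                "model": self.model,
                "seed": self.seed,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder tree hashes; a real system would hash the actual file contents.
root = IntentNode("Add a signup form", (), "tree-aaa", "local-model-v1", 42)
child = IntentNode(
    "Reject malformed email addresses",
    (root.node_hash(),),
    "tree-bbb",
    "local-model-v1",
    42,
)
print(child.node_hash())
```

The hashes only prove that the recorded inputs and outputs have not been altered. They say nothing about whether rerunning the agent would reproduce `outcome_tree`, which is exactly the obstacle with hosted models.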
- visarga25 days ago
  If you implement a project, keep the specs and tests and re-implement it, it should not matter the exact way it was coded as long as it was well tested. So you don't need deterministic LLMs.
  I think work with LLMs should be centered on testing, since it is how the agent is fenced off in a safe space where it can move without risk. Tests are the skin, specs are the bones, and the agent is the muscle.
  I think reading the code as the sole defense against errors is a grave mistake, it is "vibe testing". LGTM is something you cannot reproduce. Reading all the code is like walking the motorcycle.
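As a toy illustration of the "keep the specs and tests, re-implement freely" argument, here is a sketch where two differently written implementations are both accepted because they satisfy the same spec cases. The `slugify_*` functions and `SPEC_CASES` are invented for the example, and it assumes the entry point (a callable from `str` to `str`) is itself fixed by the spec.

```python
import re
from typing import Callable

# The "spec": input/output pairs that any implementation must honor.
SPEC_CASES = [
    ("Hello World", "hello-world"),
    ("  Provenance   Is New ", "provenance-is-new"),
    ("", ""),
]


def slugify_v1(title: str) -> str:
    # First generation: split on whitespace and rejoin.
    return "-".join(title.lower().split())


def slugify_v2(title: str) -> str:
    # Regenerated version: same behavior on the spec, different code.
    return re.sub(r"\s+", "-", title.strip().lower())


def satisfies_spec(impl: Callable[[str], str]) -> bool:
    return all(impl(given) == expected for given, expected in SPEC_CASES)


print(satisfies_spec(slugify_v1), satisfies_spec(slugify_v2))
```

The check only establishes equivalence on the cases the spec covers; behavior outside them can still differ between regenerations.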
  - akoboldfrying24 days ago
    The first time you generate the code, it calls the method doFoo(), and the test calls that method. The second time you generate the code, it calls the method fooify(), and the test breaks.
    How do you propose to get around this, without a human specifying every class layout in detail?
rtpg25 days ago
While in some sense it's interesting to store the prompts people might use, I feel like that might only accentuate the "try to tweak prompts over and over to pray for the result you want"-style workflows that I am seeing so many people around me work in.
People need to remember how good it feels to do precise work when the time comes!
elzbardico25 days ago
I am exhausted of this ThoughtWorks style of writing. I can smell it from a mile away.
forty25 days ago
If your git history gives you the "what" and not the "why", you are doing it wrong. We can already see what is done in the commit diff. We can only guess why you did it if you don't explain in the message.
- rkomorn25 days ago
  I thought I agreed with you at first but I'm not sure. Either we disagree on how important what and why are, or on how "why" is defined or expressed.
  I think commit messages should actually have a concise "what" in them.
  I frequently enough end up looking at git log trying to sort out what changed (to track down a bug or regression), and based on the commit message, do a git show to see what the actual diffs are.
  So in that context, at least, knowing what changed in a commit is actually quite useful, and why is arguably less so.
  I suspect my idea of "what" and your idea of "why" overlap in this scenario.
  Edit: and after typing all that, I realized your comment doesn't imply there shouldn't be a "what" described anyway so maybe I'm just discussing nothing at all.
  - WorldMaker25 days ago
    Sure "top-line" of the message (the subject line of the email) should be concisely "what" changed, but the rest of the message (the body of the email) should be the details of "why" and "how". More details on the "what changed" is often redundant because by that point you are seeing the diff itself, but the "why" and "how" is often the real important part to a commit message.
    forty25 days ago
    Yes, this is what I meant exactly.
d--b25 days ago
I found it quite insightful.
Looking at individual line changes produced by AI is definitely difficult. And going one step higher to version control makes sense.
We're not really there yet though, as the generated code currently still needs a lot of human checks.
Side thoughts: this requires the code to be modularized really well. It makes me think that when designing a system, you could imagine a world where multiple agents discuss changes. Each agent would be responsible for a subsystem (component, service, module, function), and they would chat about the format of the API that works best for all agents, etc. It would be like Smalltalk at the agent level.
mmoustafa25 days ago
I wrote an article on this exact issue (albeit more simpleminded) and I suggested a rudimentary way of tracking provenance in today's agents with "reasoning traces" on the objects they modify.
Would love people's thoughts on this: https://0xmmo.notion.site/Preventing-agent-doom-loops-with-p...
- nthh25 days ago
  The original article does a good job of contextualizing the shifting dynamics, but yours turns that into an actionable solution. I've been wondering about this same problem too after having trouble wrangling LLMs to not make hacky solutions or go on wild goose chases.
  Do you have a working implementation for this? Just a one-to-one index of files and reasoning traces? I'd like to trace these changes easily back to a feature or technical spec too (and have it change that spec if it needs to? I suppose the spec would have its own reasoning trace).
- aryehof25 days ago
  If recording object change is important, then have the subject object know one or more recorded “change” objects. An LLM is much more likely to understand a real object modeling pattern, rather than some new non-standard scheme such as you suggest.
  - mmoustafa16 days ago
    It is not about tracking changes, git does that well enough.
    It is about tracking the reason for the changes, i.e. git messages but on steroids.
pu_pe25 days ago
So the concept is that requirements and rationale will be more permanent and important than code, because code can be regenerated very cheaply?
I think commenters here identified many of the issues we would face with it today, but thinking of a future where LLMs are indeed writing virtually all code and very fast, ideas like these are interesting. Our current tooling (version control, testing, etc.) will certainly need to adapt if this future comes to pass.
layer825 days ago
That’s pretty similar to Architecture Decision Records: https://adr.github.io/
- skybrian25 days ago
  At first glance, it sounds vaguely similar to creating a bug before implementing a feature. (Or writing a design doc). Is there more to it?
  - layer825 days ago
    Features and architectural decisions are largely separate things, although there can of course be causal links between them. But you can implement new features without having to add a single architectural decision, and you can make architectural decisions and implement them without having to change a single feature (similar to a refactoring). The architecture can enable certain features, but the same feature can usually be implemented in the context of wildly different architectures. You want to keep an organized record of all architectural decisions, independently from features, even if some of them are motivated by features. Architectural decisions often remain relevant even after features have been changed or removed. You could take the architectural decisions (or some subset of them) of one project and apply them to a different project with very different features.
    You could use an issue tracker as a database to maintain ADRs, but they would be their own item type. You could maintain ADRs as a list of subsections in a design document (probably not so convenient), or as a (usually rather short) document per architectural decision, which however you’d have to organize somehow. ADRs are more granular than design documents, and they collectively maintain a history of the decisions made.
ricksunny25 days ago
“the code itself becomes an artifact of synthesis, not the locus of intent.”
would not be unfamiliar to mechanical engineers who work with CAD. The ‘Histories’ (successive line-by-line drawing operations: align to spline of such-and-such dimensions, put a bevel here, put a hole there) in many CAD tools are known to be a reflection of design intent more so than the final 3D model that the operations ultimately produce.
- crote25 days ago
  CAD tools also really don't like changes in the history. A tiny change in one step can corrupt the entire model, because a subsequent step can no longer properly "attach" to a reference point which no longer exists.
  Fixing this in CAD is already a massive pain, fixing it with black-box LLMs sounds nearly impossible.
  - ricksunny25 days ago
    > Fixing this in CAD is already a massive pain, fixing it with black-box LLMs sounds nearly impossible.
    Please please don’t get me started..: https://github.com/ricksher/ASimpleMechatronicMarkupLanguage
Animats25 days ago
This is going to be hard to fix.
If you use an LLM and agents to regenerate code, a minor change in the "specification" may result in huge changes to the code. Even if it's just due to forcing regeneration. OK, got that.
But there may be no "specification", just an ongoing discussion with an agentic system. "We don't write code any more, we just yell at the agents." Even if the entire sequence of events has been captured, it might not be very useful. It's like having a transcript of a design meeting.
There's a real question as to what the static reference of the design should be. Or what it should look like. This is going to be difficult.
atoav25 days ago
So what they want is to essentially write a spec with business rules and implementation details and such, and version control that instead of the actual source code?
Not sure what stops you from doing that right now.
materialpoint25 days ago
Who's gonna tell the author that Git doesn't do diffs, but snapshots?
Deltas are just an implementation detail, and thinking of Git as diffing is specifically shunned in introductions to Git versioning.
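For anyone who wants to see the snapshot model in miniature, here is a sketch of how a Git blob ID is derived, assuming the default SHA-1 object format. The ID depends only on the file's full content, not on any previous version of it.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    # A blob object is the header "blob <size>\0" followed by the raw content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# The well-known ID of the empty blob:
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Two versions of a file are two independent objects, each keyed by content.
print(git_blob_id(b"version one\n"))
print(git_blob_id(b"version two\n"))
```

`git hash-object` should report the same IDs for the same bytes. Delta compression only shows up later, inside packfiles, as a storage optimization.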
- IanCal25 days ago
  That doesn’t matter to the point, which is stored history misses the way in which things moved from state A to state B.
  - materialpoint25 days ago
    So you missed the point too. The post depends on versioning being diffs only.
BenGosub25 days ago
Basically he's describing DSPy[1]
1. https://dspy.ai/
PeterStuer25 days ago
This reads very academic with not much real world alignment.
sebaschi25 days ago
This style of writing is insufferable (to me). The idea is also not as deep as it may seem based on the language used. I also don’t think it’s strictly valid, i.e. that version control somehow needs to be adjusted to AI.
hekklea month ago
TL;DR, the author claims that you should record the reasons for change, rather than the code changes themselves...
CONGRATULATIONS: you have just 'invented' documentation, specifically a CHANGE_LOG.
- Fnoord25 days ago
  The URL and title give it away:
  URL starts with 'ai' and the title claims 'is' and 'the new version control.'
  I then skimmed through the article, and it mentions it doesn't exist today.
  First, it requires an implementation to prove the point, and then it has to defeat the network effect of git. There is zero proof for that argument, only a hypothesis. (Which sums up AI hype pretty well.) Someone's trying to hype AI, just like someone was hyping blockchain. GLHF w/that. Oh, and thank you for ruining the hardware market, assholes.
- coffeefirst25 days ago
  It’s worse than that. The author thinks you can generate working software from a changelog that will work consistently from build to build.
  Anyone want to try and lmk how far you get?