I made my own Git(tonystr.net)

382 points by TonyStr 12 days ago | 35 comments

nasretdinov12 days ago
Nice work! On a complete tangent, Git is the only SCM known to me that supports recursive merge strategy [1] (instead of the regular 3-way merge), which essentially always remembers resolved conflicts without you needing to do anything. This is a very underrated feature of Git and somehow people still manage to choose rebase over it. If you ever get to implementing merges, please make sure you have a mechanism for remembering the conflict resolution history :).
[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...
- arunix12 days ago
  I remember in a previous job having to enable git rerere, otherwise it wouldn't remember previously resolved conflicts.
  https://git-scm.com/book/en/v2/Git-Tools-Rerere
  - nasretdinov12 days ago
    I believe rerere is a local cache, so you'd still have to resolve the conflicts again on another machine. The recursive merge doesn't have this issue — the conflict resolution inside the merge commits is effectively remembered (although due to how Git operates it actually never even considers it a conflict to be remembered — just a snapshot of the closest state to the merged branches)
    Guvante11 days ago
    Are people repeatedly handling merge conflicts on multiple machines?
    If there were a better way to handle "I needed to merge in the middle of my PR work" without introducing reverse merges permanently into the history, I wouldn't mind merge commits.
    But tools will sometimes skip over others' work if you `git pull` a change into your local repo, because they get confused about which leg of the merge to follow.
    nasretdinov11 days ago
    One place where it mattered was when I was working on a large PHP web site, where backend devs and frontend devs would be working in the same branch — this way you don't have to go back and forth to get the new API, and this workflow was quite unique and, in my mind, quite efficient. The branches could also live for some time (e.g. in case of large refactorings), and it's a good idea to merge in the master branch frequently, so recursive merge was really nice. Nowadays, of course, you design the API for your frontend, mobile, etc, upfront, so there's little reason to do that anymore.
    Guvante6 days ago
    Honestly if the tooling were better at keeping upstream on the left I wouldn't mind as much but IIRC `git pull` puts your branch on the left which means walking history requires analysing each merge commit to figure out where history actually is vs where a temporary branch is.
    That is my main problem with merge, I think the commit ballooning is annoying too but that is easier to ignore.
  - direwolf2011 days ago
    The recursive merge is about merging branches that already have merges in them, while rerere is about repeating the same merge several times.
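To make the "branches that already have merges in them" case concrete, here is a minimal Python sketch of a criss-cross history, where two branches have each merged the other. It is not Git's implementation: the commit names and the `merge_bases` helper are hypothetical, and the graph is a plain dictionary mapping each commit to its parents. The point it illustrates is that such a history has more than one best common ancestor, which is the situation where Git's recursive strategy first merges the common ancestors into one reference tree before doing the 3-way merge.

```python
def ancestors(parents: dict[str, list[str]], commit: str) -> set[str]:
    """Return the commit itself plus everything reachable through parent links."""
    seen: set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def merge_bases(parents: dict[str, list[str]], a: str, b: str) -> set[str]:
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Hypothetical history: B and C both branch from A, then each side merges the other.
history = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],  # merge of C into the branch holding B
    "E": ["C", "B"],  # merge of B into the branch holding C
}

print(sorted(merge_bases(history, "B", "C")))  # ['A']: the ordinary 3-way case
print(sorted(merge_bases(history, "D", "E")))  # ['B', 'C']: two candidate bases
```

With a single base, a plain 3-way merge has an unambiguous reference point. With two, picking either one alone can resurface conflicts that an earlier merge already settled.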
  - pyrolistical11 days ago
    Would be nice if centralized git platforms shared rerere caches
  - lmm11 days ago
    Rerere is dangerous and counterproductive - it tries to give rebase the same functionality that merge has, but since rebase is fundamentally wrong it only stacks the wrongness.
    seba_dos111 days ago
    Cherry-picks being "fundamentally wrong" is certainly an interesting git take.
- ezst11 days ago
  On recursive merging, by the author of mercurial
  https://www.mercurial-scm.org/pipermail/mercurial/2012-Janua...
  - nasretdinov11 days ago
    Yeah, the point about high complexity of the recursive merge is valid, and that's what I would expect from the Mercurial devs too. I personally find it a bit unfortunate that Git ended up winning tbh, but since it did, I think it makes sense to at least cherish what it has out of the box :)
    ezst11 days ago
    In some ways, the legacy of mercurial lives through jujutsu/jj and offers some sanity and familiarity on top of git's UI. But with that said, mercurial is far from dead, major "under-the-hood" works are going strong (including a rewrite in rust), the hosting situation is getting good with heptapod (a branch of gitlab with native mercurial support).
    I really don't see any downside to recommending mercurial in 2026. Git isn't just inferior as a VCS in the subjective sense of "oh… I don't like this or that inconsistent aspect of its UI", but in very practical and meaningful ways (on technical merit) that are increasingly forgotten about the more it solidifies as a monopoly:
    - still no support for branches (in the traditional sense, as a commit-level marker, to delineate series of related commits) means that a branchy-DAG is border-line useless, and tools like bisect can't use the info to take you at the series boundaries
    - still no support for phasing (to mark which commits have been exchanged or are local-only and safe to edit)
    - still no support for evolve (to record history rewrites in a side-storage, making concurrent/distributed history rewrites safe and mostly automatic)
- mkleczek12 days ago
  Much more principled (and hence less of a foot-gun) way of handling conflicts is making them first class objects in the repository, like https://pijul.org does.
  - jcgl12 days ago
    Jujutsu too[0]:
    > Jujutsu keeps track of conflicts as first-class objects in its model; they are first-class in the same way commits are, while alternatives like Git simply think of conflicts as textual diffs. While not as rigorous as systems like Darcs (which is based on a formalized theory of patches, as opposed to snapshots), the effect is that many forms of conflict resolution can be performed and propagated automatically.
    [0] https://github.com/jj-vcs/jj
  - PunchyHamster11 days ago
    I feel like people making new VCSes should just re-use GIT storage/network layer and innovate on top of that. Git storage is flexible enough for that, and that way you can just.... use it on existing repos with very easy migration path for both workflows (CI/CD never need to care about what frontend you use) and users
    zaphar11 days ago
    Git storage is just a merkle tree. It's a technology that's been around forever and was simultaneously chosen by more than one vcs technology around the same time. It's incredibly effective so it makes sense that it would get used.
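As a rough illustration of the Merkle-tree point, the Python sketch below hashes objects the way Git names them: SHA-1 over a `<type> <size>` header, a NUL byte, and the body. The helper names (`git_object_id`, `blob_id`, `tree_id`) are made up for this example, and `tree_id` is simplified to a flat directory of regular, non-executable files with ASCII names.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    return git_object_id("blob", content)


def tree_id(entries: dict[str, str]) -> str:
    """Hash a flat directory: file name -> blob id (hex). Simplified: mode 100644 only."""
    body = b""
    for name in sorted(entries):
        # Each entry is '<mode> <name>', a NUL byte, then the 20 raw bytes of the child id.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(entries[name])
    return git_object_id("tree", body)


readme = blob_id(b"hello world\n")
tree_v1 = tree_id({"README": readme, "empty.txt": blob_id(b"")})
tree_v2 = tree_id({"README": blob_id(b"hello world!\n"), "empty.txt": blob_id(b"")})

print(readme)              # same id `git hash-object` reports for this content
print(tree_v1 != tree_v2)  # True: changing one file changes the tree id
```

Because a tree's id is computed from its children's ids, an edit to any file changes every hash on the path up to the root, while unchanged files keep their ids and are stored once.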
    storystarling11 days ago
    The bottleneck with git is actually the on-the-fly packfile generation. The server has to burn CPU calculating deltas for every clone. For a distributed system it seems much better to use a simple content-addressable store where you just serve static blobs.
    3eb7988a166311 days ago
    It is my understanding that under the hood, the repository has quite a bit of state that can get mangled. That is why naively syncing a git repo with say Dropbox is not a surefire operation.
  - theLiminator11 days ago
    It's very cool though I imagine it's doa due to lack of git compatibility...
    speed_spread11 days ago
    Lack of current-SCM incumbent compatibility can be an advantage. Like Linus decided to explicitly do the reverse of every SVN decision when designing git. He even reversed CLI usability!
    rob7411 days ago
    Pssst! I think Linus didn't as much design Git as he cloned BitKeeper (or at least the parts of it he liked). I have never used it, but if you look at the BitKeeper documentation, it sounds strangely familiar: https://www.bitkeeper.org/testdrive.html . Of course, that made sense for him and for the rest of the Linux developers, as they were already familiar with BitKeeper. Not so much for the rest of us though, who are now stuck with the usability (or lack thereof) you mentioned...
    theLiminator11 days ago
    I think the network effects of git are too large to overcome now. Hence why we see jj get a lot more adoption than pijul.
- pwdisswordfishy11 days ago
  New to me was discovering within the last month that git-merge doesn't have a merge strategy of "null": don't try to resolve any merge conflicts, because I've already taken care of them; just know that this is a merge between the current branch and the one specified on the command-line, so be a dutiful little tool and just add it to your records. Don't try to "help". Don't fuck with the index or the worktree. Just record that this is happening. That's it. Nothing else.
  - valleyer11 days ago
    Doesn't `git merge -s ours` do this?
    This resolves any number of heads, but the resulting tree of the merge is always that of the current branch head, effectively ignoring all changes from all other branches. It is meant to be used to supersede old development history of side branches. Note that this is different from the -Xours option to the ort merge strategy.
  - Brian_K_White11 days ago
    What does that even mean? There already is reset hard.
    kbolino11 days ago
    The name "null" is confusing; you have to pick something. However, I think what is desired here is the "theirs" strategy, i.e. to replace the current branch's tree entirely with the incoming branch's tree. The end result would be similar to a hard reset onto the incoming branch, except that it would also create a merge commit. Unfortunately, the "theirs" strategy does not exist, even though the "ours" strategy does exist, apparently to avoid confusion with the "theirs" option [1], but it is possible to emulate it with a sequence of commands [2].
    [1]: https://git-scm.com/docs/merge-strategies#Documentation/merg...
    [2]: https://stackoverflow.com/a/4969679/814422
    pwdisswordfishy11 days ago
    What do you mean, "What does it mean?" It means what I wrote.
    > There already is reset hard.
    That's not... remotely relevant? What does that have to do with merging? We're talking about merging.
    Brian_K_White11 days ago
    Neither of these are answers or explanations. So you said nothing, and then said nothing again.
    I also "mean what I wrote". Man that was sure easy to say. It's almost like saying nothing at all. Which is anyone's right to do, but it's not an argument, nor a definition of terms, nor communication at all. Well, it does communicate one thing.
    pwdisswordfishy11 days ago
    This:
    > don't try to resolve any merge conflicts ... Don't try to "help". Don't fuck with the index or the worktree.
    ... certainly is "nothing" in the literal sense, in that doing nothing is exactly what is being asked of git-merge, but it's not "nothing" in the sense that you're saying.
    git reset --hard has nothing to do with merging. Nothing. They're not even in the same class of operations. It's absolutely irrelevant to this use case. And saying so isn't "not an argument" or not communicating anything at all. git reset --hard does not in any sense effect a merge. What more needs to be (or can be) said?
    If you want someone to help explain something to you, it's up to you to give them an anchor point that they can use to bridge the gap in understanding. As it stands, it's you who's given nothing at all, so one can only repeat what has already been described--
    A resolution strategy for merge conflicts that involves doing nothing: nothing to the files in the current directory, staging nothing to be committed, and in fact not even bothering to check for conflicts in the first place. Just notate that it's going to be a merge between two parents X and Y, and wait for the human so they have an opportunity to resolve the conflicts by hand (if they haven't already), for them to add the changes to the staging area, and for them to issue the git-commit command that completes the merge between X and Y. What's unclear about this?
    kbolino10 days ago
    I think this is what you want:
    git merge -s ours --no-ff --no-commit <branch>
    This will initiate a merge, take nothing from the incoming branch, and allow you to decide how to proceed. This leaves git waiting for your next commit, and the two branches will be considered merged when that commit happens. What you may want to do next is:
    git checkout -p <branch>
    This will interactively review each incoming change, giving you the power to decide how each one should be handled. Once you've completed that process, commit the result and the merge is done.
    seba_dos16 days ago
    You know that you can edit your merge commits any way you want and you don't have to rely on resolution strategies to do it for you, right?
    pwdisswordfishy5 days ago
    Right. That's the entire basis for the discussion here. So why is this a question?
    seba_dos15 days ago
    Because you already have all the needed tools to handle your special little edge case (in multiple ways!), so the discussion seems rather pointless.
    pwdisswordfishy4 days ago
    You are confused. It's frightening that someone would be able to reach a point this deep into the discussion and think that "You know that you can edit your merge commits any way you want and you don't have to rely on resolution strategies to do it for you" is revealing something new or insightful.
    seba_dos14 days ago
    So it is pointless indeed, gotcha.
    pwdisswordfishy3 days ago
    Your zero-insight comment was, indeed, pointless.
- giancarlostoro11 days ago
  I hate git squash. It only goes one direction, and personally I don't give a crap if it took you 100 commits to do one thing; at least now we can see what you may have tried so we don't repeat your mistakes. With git squash it all turns into "this is what they last did that mattered", and btw we can't merge it backwards without it being weird; you have to check out an entirely new branch. I like to continue adding changes to branches I have already merged. Not every PR is the full solution, but a piece of the puzzle. No one can tell me that they only need 1 PR per task because they never have a bug, ever.
  Give me normal boring git merges over git squash merges.
- p0w3n3d11 days ago
  That's something new to me (using git for 10 years, always rebased)
  - iberator11 days ago
    I'm even more lazy. I almost always clone from scratch after merging or after not touching the project for some time. So easy and silly :)
    I always forget all the flags and I work with literally just: clone, branch, checkout, push.
    (Each feature is a fresh branch tho)
- chungy11 days ago
  as far as I understand the problem (sorry, the SO isn't the clearest around), Fossil should support this operation. It does one better, since it even tracks exactly where merges come from. In Git, you have a merge commit that shows up with more than one parent, but Fossil will show you where it branched off too.
  Take out the last "/timeline" component of the URL to clone via Fossil: https://chiselapp.com/user/chungy/repository/test/timeline
  See also, the upstream documentation on branches and merging: https://fossil-scm.org/home/doc/trunk/www/branching.wiki
darkryder12 days ago
Great writeup! It's always fun to learn the details of the tools we use daily.
For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.
1. https://jwiegley.github.io/git-from-the-bottom-up/
- MarsIronPI12 days ago
  Oh, I hadn't ever seen that one. I "grokked" Git thanks to The Git Parable[0] several years ago.
  [0]: https://tom.preston-werner.com/2009/05/19/the-git-parable
- spuz12 days ago
  Thanks - I think this is the article I was thinking of that really helped me to understand git when I first started using it back in the day. I tried to find it again and couldn't.
- sanufar11 days ago
  Ooh, this looks fun! I didn’t know you could cat-file on a hash id, that’s actually quite cool.
teiferer12 days ago
If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.
Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.
- TonyStr12 days ago
  Interestingly, I looked at github insights and found that this repo had 49 clones, and 28 unique cloners, before I published this article. I definitely did not clone it 49 times, and certainly not with 28 unique users. It's unlikely that the handful of friends who follow me on github all cloned the repo. So I can only speculate that there are bots scraping new public github repos and training on everything.
  Maybe that's obvious to most people, but it was a bit surprising to see it myself. It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
  The article doesn't contain any LLM output. I use LLMs to ask for advice on coding conventions (especially in rust, since I'm bad at it), and sometimes as part of research (zstd was suggested by chatgpt along with comparisons to similar algorithms).
  - tonnydourado12 days ago
    Particularly on GitHub, might not even be LLMs, just regular bots looking for committed secrets (AWS keypairs, passwords, etc.)
  - Phelinofist12 days ago
    I selfhost Gitea. The instance is crawled by AI crawlers (checked the IPs). They never cloned, they just browse and take it directly from there.
    Phelinofist11 days ago
    For reference, this is how I do it in my Caddyfile:
    (block_ai) {
        @ai_bots {
            header_regexp User-Agent (?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)
        }
        abort @ai_bots
    }
    Then, in a specific app block include it via
    import block_ai
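To see which requests a pattern like this would catch, the same regular expression can be tried against sample User-Agent strings. The Python sketch below is only an approximation: it assumes the matcher does an unanchored, case-insensitive search (which is what `re.search` with `(?i)` does here), the `is_blocked` helper is hypothetical, and the sample User-Agent strings are invented for illustration.

```python
import re

# Same alternation as the Caddy snippet, split across two string literals for readability.
AI_BOT_PATTERN = re.compile(
    r"(?i)(anthropic-ai|ClaudeBot|Claude-Web|Claude-SearchBot|GPTBot|"
    r"ChatGPT-User|Google-Extended|CCBot|PerplexityBot|ImagesiftBot)"
)


def is_blocked(user_agent: str) -> bool:
    """True if any listed bot name appears anywhere in the header value."""
    return AI_BOT_PATTERN.search(user_agent) is not None


# Invented sample headers, not taken from real access logs.
samples = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "mozilla/5.0 (compatible; claudebot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
]
for agent in samples:
    print(is_blocked(agent), agent)
```

The check depends entirely on what the client chooses to send, so a crawler that presents an ordinary browser string passes it.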
    seba_dos111 days ago
    Most of them pretend to be real users though and don't identify themselves with their user agent strings.
    zaphar11 days ago
    I have almost exactly this in my own caddyfile :-D The order of the items in the regex is a little different but mostly the same items. I just pulled them from my web access logs over time and update it every once in a while.
    Zambyte12 days ago
    i run a cgit server on an r720 in my apartment with my code on it and that puppy screams whenever sam wants his code
    blocking openai ips did wonders for the ambient noise levels in my apartment. they're not the only ones obviously, but they're the only ones i had to block to stay sane
    MarsIronPI12 days ago
    Have you considered putting it behind Anubis or an equivalent?
    Zambyte12 days ago
    Yes, but I haven't and would prefer not to
    MarsIronPI11 days ago
    Understandable. It's an outrage that we even have to consider such measures.
  - nerdponx12 days ago
    Time to start including deliberate bugs. The correct version is in a private repository.
    teiferer12 days ago
    And what purpose would this serve, exactly?
    adastra2211 days ago
    Spite.
    below4311 days ago
    They used to do this with maps - eg. fake islands - to pick up when they were copied.
    program_whiz12 days ago
    while I think this is a fun idea -- we are in such a dystopian timeline that I fear you will end up being prosecuted under a digital equivalent of various laws like "why did you attack the intruder instead of fleeing" or "you can't simply remove a squatter because it's your house, therefore you get an assault charge."
    A kind of "they found this code, therefore you have a duty not to poison their model as they take it." Meanwhile if I scrape a website and discover data I'm not supposed to see (e.g. bank details being publicly visible) then I will go to jail for pointing it out. :(
    nerdponx11 days ago
    I think if we're at the point where posting deliberate mistakes to poison training data is considered a crime, we would be far far far down the path of authoritarian corporate regulatory capture, much farther than we are now (fortunately).
    wredcoll11 days ago
    Look, I get the fantasy of someday pulling out my musket^W ar15 and rushing downstairs to blow away my wife^W an evil intruder, but, like, we live in a society. And it has a lot of benefits, but it does mean you don't get to be "king of your castle" any more.
    Living in a country with hundreds of millions of other civilians or a city with tens of thousands means compromising what you're allowed to do when it affects other people.
    There's a reason we have attractive nuisance laws and you aren't allowed to put a slide on your yard that electrocutes anyone who touches it.
    None of this, of course, applies to "poisoning" llms, that's whatever. But all your examples involved actual humans being attacked, not some database.
    program_whiz11 days ago
    Thanks, that was the term I was looking for: "attractive nuisance". I wouldn't be surprised if a tech company could make that case -- this user caused us tangible harm and cost (training, poisoned models) and left their data out for us to consume. It's the equivalent of putting poison candy on a park table, your honor!
    teo_zero11 days ago
    That reminds me of the protagonist of Charles Stross's novel "Accelerando", a prolific inventor who is accused by the IRS of having caused millions in losses because he releases all his ideas into the public domain instead of profiting from them and paying taxes on such profits.
  - 0x696C696112 days ago
    This has been happening before LLMs too.
  - teiferer12 days ago
    I don't really get why they need to clone in order to scrape ...?
    > It feels weird to think that LLMs are being trained on my code, especially when I'm painfully aware of every corner I'm cutting.
    That's very much expected. That's why the quality of LLM coding agents is like it is. (No offense.)
    The "asking LLMs for advice" part is where the circular aspect starts to come into the picture. Not worse than looking at StackOverflow though which then links to other people who in turn turned to StackOverflow for advice.
    storystarling11 days ago
    Cloning gets you the raw text objects directly. If you scrape the web UI you're dealing with a lot of markup overhead that just burns compute during ingestion. For training data you usually want the structure to be as clean as possible from the start.
    teiferer11 days ago
    Sure, cloning a local copy. But why clone on github?
    adastra2211 days ago
    The quality of LLM coding agents is pretty good now.
- wasmainiac12 days ago
  Maybe we can poison LLMs with loops of 2 or more self referencing blogs.
  - jdiff12 days ago
    Only need one, they're not thinking critically about the media they consume during training.
    falcor8412 days ago
    Here's a sad prediction: over the coming few years, AIs will get significantly better at critical evaluation of sources, while humans will get even worse at it.
    whstl12 days ago
    I wish I could disagree with you, but what I'm seeing on average (especially at work) is exactly that: people asking stuff to ChatGPT and accepting hallucinations as fact, and then fighting me when I say it's not true.
    prmoustache12 days ago
    There is "death by GPS" for people dying after blindly following their GPS instruction. There will definitely be a "death by AI" expression very soon.
    stevekemp11 days ago
    Tesla-related fatalities probably count already, albeit without that label/name.
    sailfast11 days ago
    Hot take: Humans have always been bad at this (in the aggregate, without training). Only a certain percentage of the population took the time to investigate.
    For most throughout history, whatever is presented to you that you believe is the right answer. AI just brings them source information faster so what you're seeing is mostly just the usual behavior, but faster. Before AI people would not have bothered to try and figure out an answer to some of these questions. It would've been too much work.
    topaz012 days ago
    My sad prediction is that LLMs and humans will both get worse. Humans might get worse faster though.
    keybored11 days ago
    HN commenters will be techno-optimistic misanthropes. Status quo ante bellum.
    andy_ppp12 days ago
    The secret sauce behind good understanding, taste and style (both for coding and writing) has always been in the fine-tuning and RLHF steps. I'd be skeptical that the signals a few GitHub repos or blogs generate at the initial stages of the learning are that critical. There's probably also a filter for good taste on the initial training set, and these sets are so large that not even a single full epoch is done on the data these days.
    jama21111 days ago
    It wouldn’t work at all.
  - jama21111 days ago
    I see the AI hating part of HN has come out again
- mexicocitinluez12 days ago
  > Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.
  Great argument for not using AI-assisted tools to write blog posts (especially if you DO use these tools). I wonder how much we're taking for granted in these early phases before it starts to eat itself.
  - jama21111 days ago
    What does eating itself even look like? It doesn’t take much salt to change a hash.
    mexicocitinluez11 days ago
    Being trained on its own results?
    jama21110 days ago
    Pretty easy to detect, surely
- anu7df12 days ago
  I understand model output put back into training would be an issue, but if model output is guided by multiple prompts and edited by the author to his/her liking wouldn't that at least be marginally useful?
- prodigycorp12 days ago
  Random aside about training data:
  One of the funniest things I've started to notice from Gemini in particular is that in random situations, it talks in English with an agreeable affect that I can only describe as.. Indian? I've never noticed such a thing leak through before. There must be a ton of people in India who are generating new datasets for training.
  - evntdrvn12 days ago
    There was a really great article or blog post published in the last few months about the author's very personal experience whose gist was "People complain that I sound/write like an LLM, but it's actually the inverse because I grew up in X where people are taught formal English to sound educated/western, and those areas are now heavily used for LLM training."
    I wish I could find it again, if someone else knows the link please post it!
    gxnxcxcx11 days ago
    I'm Kenyan. I don't write like ChatGPT, ChatGPT writes like me
    https://news.ycombinator.com/item?id=46273466
    tverbeure11 days ago
    Thanks for that link.
    This part made me laugh though:
    > These detectors, as I understand them, often work by measuring two key things: ‘Perplexity’ and ‘burstiness’. Perplexity gauges how predictable a text is. If I start a sentence, "The cat sat on the...", your brain, and the AI, will predict the word "floor."
    I can't be the only one whose brain predicted "mat"?
    cozzyd11 days ago
    And I thought it would be a hat...
    tverbeure9 days ago
    No, that would be "in the hat."
    evntdrvn9 days ago
    Thank you!!! :)
    awesome_dude11 days ago
    I've been critical of people that default to "an em dash being used means the content is generated by an LLM", or, "they've numbered their points, must be an LLM"
    I do know that LLMs generate content heavy with those constructs, but they didn't create the ideas out of thin air, it was in the training set, and existed strongly enough that LLMs saw it as common place/best practice.
  - blenderob12 days ago
    That's very interesting. Any examples you can share that have that agreeable affect?
    prodigycorp12 days ago
    I'm going to do a cursory look through my antigrav history, i want to find it too. I remember it's primarily in the exclamations of agreement/revelation, and one time expressing concern which I remember were slightly off natural for an american english speaker.
    prodigycorp11 days ago
    Can't find anything; too many messages telling the agent "please do NOT those changes". I'm going to remember to save them going forward.
p4bl012 days ago
Nice post :). It made me think of ugit: DIY Git in Python [1] which is still by far my favorite of this kind of posts. It really goes deep into Git internals while managing to stay easy to follow along the way.
[1] https://www.leshenko.net/p/ugit/
- UltraSane11 days ago
  I mapped git operations to Neo4j and it really helped me understand how it works.
- TonyStr12 days ago
  This page is beautiful!
  Bookmarked for later
- mfashby12 days ago
  in a similar vein; Write yourself a Git was fun to follow https://wyag.thb.lt/
gkbrk11 days ago
CodeCrafters has an amazing "Build your own Git" [1] tutorial too. Jon Gjengset has a nice video [2] doing this challenge live with Rust.
[1]: https://app.codecrafters.io/courses/git/overview
[2]: https://www.youtube.com/watch?v=u0VotuGzD_w
brendoncarroll11 days ago
Me too. Version control is great, it should get more use outside of software.
https://github.com/gotvc/got
Notable differences: E2E encryption, parallel imports (Got will light up all your cores), and a data structure that supports large files and directories.
- rtkwe11 days ago
  The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.
  - brendoncarroll11 days ago
    > The problem is when you move beyond text files it gets hard to tell what changes between two versions without opening both versions in whatever program they come from and comparing.
    Yeah, totally agree. Got has not solved conflict resolution for arbitrary files. However, we can tell the user where the files differ, and that the file has changed.
    There is still value in being able to import files and directories of arbitrary sizes, and having the data encrypted. This is the necessary infrastructure to be able to do distributed version control on large amounts of private data. You can't do that easily with Git. It's very clunky even with remote helpers and LFS.
    I talk about that in the Why Got? section of the docs.
    https://github.com/gotvc/got/blob/master/doc/1.1_Why_Got.md
- DASD11 days ago
  Nice! Not sure if you're aware of Got(Game of Trees) that appears to pre-date your Got.
  https://gameoftrees.org/index.html
  - brendoncarroll11 days ago
    Yes the author reached out. There has not yet been a confusion among real users that I am aware of.
    https://github.com/gotvc/got/issues/20
sluongng12 days ago
Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.
I think theoretically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharded storage is required, path-based delta dictionary compression does much better. Git recently (within the last year) got something called "path-walk", which is fairly similar though.
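For readers unfamiliar with dictionary compression, here is a small Python sketch of the general idea: compress a new version of a file using the previous version as a preset dictionary, so that only the differences cost space. It deliberately swaps in zlib's preset-dictionary support (`zdict`) from the standard library in place of zstd, so it shows the principle only and says nothing about Sapling's actual storage format. The helper names and the generated "file history" are hypothetical.

```python
import hashlib
import zlib


def compress_with_dictionary(data: bytes, dictionary: bytes) -> bytes:
    """Deflate `data`, letting the compressor reference bytes from `dictionary`."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dictionary(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_with_dictionary; needs the exact same dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical history: 100 hard-to-compress lines, then a single line is edited.
lines = [hashlib.sha1(str(i).encode()).hexdigest() for i in range(100)]
old_version = "\n".join(lines).encode()
lines[50] = "this line was edited"
new_version = "\n".join(lines).encode()

standalone = zlib.compress(new_version)
against_old = compress_with_dictionary(new_version, old_version)

# Raw size, size compressed on its own, size compressed against the old version.
print(len(new_version), len(standalone), len(against_old))
assert decompress_with_dictionary(against_old, old_version) == new_version
```

The trade-off is that the reader must have the dictionary available to decompress, which is why choosing it by path (earlier versions of the same file) is a natural fit.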
sublinear12 days ago
> If I were to do this again, I would probably use a well-defined language like yaml or json to store object information.
I know this is only meant to be an educational project, but please avoid yaml (especially for anything generated). It may be a superset of json, but that should strongly suggest that json is enough.
I am aware I'm making a decade old complaint now, but we already have such an absurd mess with every tool that decided to prefer yaml (docker/k8s, swagger, etc.) and it never got any better. Let's not make that mistake again.
People just learned to cope or avoid yaml where they can, and luckily these are such widely used tools that we have plenty of boilerplate examples to cheat from. A new tool lacking docs or examples that only accepts yaml would be anywhere from mildly frustrating to borderline unusable.
oldestofsports11 days ago
Nice job, great article!
I had a go at it as well a while back, I call it "shit" https://github.com/emanueldonalds/shit
- hahahahhaah11 days ago
  Fast Useful Change Keeper
- tpoacher11 days ago
  THE shit, in fact.
temporallobe11 days ago
Reminds me of when I tried to invent a SPA framework. So much hidden complexity I hadn’t thought of and I found myself going down rabbit holes that I am sure the creators of React and Angular went down. Git seems to be like this and I am often reminded of how impressive it is at hiding underlying complexity.
- alsetmusic11 days ago
  > at hiding underlying complexity.
  It's only in the context of recreating Git that this comment makes sense.
igorw12 days ago
Random but y'all might enjoy. Git client in PHP, supports reading packfiles, reftables, diff via LCS. Written by hand.
https://github.com/igorwwwwwwwwwwwwwwwwwwww/gipht-horse
- nasretdinov12 days ago
  Nice! This repo is a huge W for PHP I'd say.
  P.S. Didn't know that plain '@' can be used instead of HEAD, but I guess it makes sense since you can omit both left and right parts of the expressions separated by '@'
sneela12 days ago
> If you want to look at the code, it's available on github.
Why not tvc-hub :P
Jokes aside, great write up!
- TonyStr12 days ago
  haha, maybe that's the next project. It did feel weird to make git commits at the same time as I was making tvc commits
KolmogorovComp11 days ago
It’s really a shame that git storage uses files as the unit of storage. That’s what makes it ill-suited to repositories with many small files, or with large files.
Content-based chunking like Xethub uses really should become the default. It’s not like it’s new either, rsync is based on it.
https://huggingface.co/blog/xethub-joins-hf
h1fra12 days ago
Learning git internals was definitely the moment it became clear to me how efficient and smart git is.
And this way of versioning can be reused in other fields: as soon as you have some kind of graph of data that can be modified independently but read all together, it makes sense.
kgeist12 days ago
>The hardest part about this project was actually just parsing.
How about using sqlite for this? Then you wouldn't need to parse anything, just read/update tables. Fast indexing out of the box, too.
- grenran12 days ago
  that would be what https://fossil-scm.org/ is
  - dchest11 days ago
    While Fossil uses SQLite for underlying storage (instead of the filesystem directly) and various support infrastructure, its actual format is not based on SQLite: https://fossil-scm.org/home/doc/trunk/www/fileformat.wiki
    It's basically plaintext. Even deltas are plaintext for text files.
    Reason: "The global state of a fossil repository is kept simple so that it can endure in useful form for decades or centuries. A fossil repository is intended to be readable, searchable, and extensible by people not yet born."
  - TonyStr12 days ago
    Very interesting. Looks like fossil has made some unique design choices that differ from git[0]. Has anyone here used it? I'd love to hear how it compares.
    [0] https://fossil-scm.org/home/doc/trunk/www/fossil-v-git.wiki#...
    smartmic12 days ago
    I use Fossil extensively, but only for personal projects. There are specific design conditions, such as no rebasing [0], and overall, it is simpler yet more useful to me. However, I think Fossil is better suited for projects governed under the cathedral model than the bazaar model. It's great for self-hosting, and the web UI is excellent not only for version control, but also for managing a software development project. However, if you want a low barrier to integrating contributions, Fossil is not as good as the various Git forges out there. You have to either receive patches or Fossil bundles via email or forum, or onboard/register contributors as developers with quite wide repo permissions.
    [0]: https://fossil-scm.org/home/doc/trunk/www/rebaseharm.md
    toyg12 days ago
    Sounds like a more modern cvs/Subversion
    chungy11 days ago
    It was developed primarily to replace SQLite's CVS repository, after all. They used CVSTrac as the forge and Fossil was designed to replace that component too.
    jact12 days ago
    I use Fossil extensively for all my personal projects and find it superior for the general case. As others said it’s more suited for small projects.
    I also use Fossil for lots of weird things. I created a forum game using Fossil’s ticket and forum features because it’s so easy to spin up and for my friends to sign in to.
    At work we ended up using Fossil in production to manage configuration and deployment in a highly locked down customer environment where its ability to run as a single static binary, talk over HTTP without external dependencies, etc. was essential. It was a poor man’s deployment tool, but it performed admirably.
    Fossil even works well as a blogging platform.
    embedding-shape12 days ago
    Used it on and off mainly to check it out, but always in a personal/experimental capacity. Never managed to convince any teams to give it a try, mostly because git doesn't tend to get in the way, so it's hard to justify learning something completely new.
    I really enjoy how local-first it is, as someone who sometimes works without an internet connection. That the data around "work" is part of the SCM as well, not just the code, makes a lot of sense to me at a high-level, and many times I wish git worked the same...
    usrbinbash12 days ago
    I mean, git is just as "local-first" (a git repo is just a directory after all), and the standard git-toolchain includes a server, so...
    But yeah, fossil is interesting, and it's a crying shame it's not better known, for the exact reasons you point out.
    embedding-shape12 days ago
    > I mean, git is just as "local-first" (a git repo is just a directory after all), and the standard git-toolchain includes a server, so...
    It isn't though, Fossil integrates all the data around the code too in the "repository", so issues, wiki, documentation, notes and so on are all together, not like in git where most commonly you have those things on another platform, or you use something like `git notes` which has maybe 10% of the features of the respective Fossil feature.
    It might be useful to scan through the list of features of Fossil and dig into it, because it does a lot more than you seem to think :) https://fossil-scm.org/home/doc/trunk/www/index.wiki
    adastra2211 days ago
    Those things exist for git too, e.g. git-bug. But the first-class way to do it in git is email.
    embedding-shape11 days ago
    Email isn't a wiki, bug tracking, documentation and all the other stuff Fossil offers as part of their core design. The point is for it to be in one place, and local-first.
    If you don't trust me, read the list of features and give it a try yourself: https://fossil-scm.org/home/doc/trunk/www/index.wiki
    adastra2211 days ago
    I am aware of fossil. Did you look up git-bug?
    embedding-shape11 days ago
    Indeed, I'd still claim that a 3rd party addition doesn't make Git as local-first as Fossil when it comes to other things than source code.
    graemep12 days ago
    I like it but the problem is everyone else already knows git and everything integrates with git.
    It is very easy to self host.
    Not having staging is awkward at first but works well once you get used to it.
    I prefer it for personal projects. I think it's better for small teams if people are willing to adjust, but I have not had enough opportunities to try it.
    TonyStr12 days ago
    Is it possible to commit individual files, or specific lines, without a staging area? I guess this might be against Fossil's ethos, and you're supposed to just commit everything every time?
    graemep11 days ago
    Yes you can list specific files, but you have to list them all in the commit command.
    I think the ethos is to discourage it.
    It does not seem to be possible to commit just specific lines.
    jact12 days ago
    You can commit individual files.
- storystarling11 days ago
  SQLite solves the storage layer but I suspect you run into a pretty big impedance mismatch on the graph traversals. For heavy DAG operations like history rewriting, a custom structure seems way more efficient than trying to model that relationally.
  - SQLite11 days ago
    The Common Table Expression feature of SQL is very good at walking graphs. See, for example <https://sqlite.org/lang_with.html#queries_against_a_graph>.
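As a small illustration of the recursive CTE approach, the sketch below walks a commit DAG stored in SQLite. The table name `commit_parent`, its columns, and the single-letter commit IDs are all invented for this example; neither Fossil nor tvc is known to use this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commit_parent (commit_id TEXT, parent_id TEXT)")
# Hypothetical history: a is the root, b and c branch from a,
# d merges b and c, and e sits on top of d.
conn.executemany(
    "INSERT INTO commit_parent VALUES (?, ?)",
    [("b", "a"), ("c", "a"), ("d", "b"), ("d", "c"), ("e", "d")],
)


def ancestors(connection: sqlite3.Connection, commit_id: str) -> set:
    """Return commit_id plus every commit reachable through parent links."""
    rows = connection.execute(
        """
        WITH RECURSIVE ancestor(id) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) drops commits reached twice via merges
            SELECT cp.parent_id
            FROM commit_parent AS cp
            JOIN ancestor ON cp.commit_id = ancestor.id
        )
        SELECT id FROM ancestor
        """,
        (commit_id,),
    ).fetchall()
    return {row[0] for row in rows}
```

Calling `ancestors(conn, "d")` collects the merge commit and everything behind it in a single query, with the traversal done inside SQLite instead of in application code.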
eru12 days ago
> These objects are also compressed to save space, so writing to and reading from .git/objects/ will always involve running a compression algorithm. Git uses zlib to compress objects, but looking at competitors, zstd seemed more promising:
That's a weird thing to put so close to the start. Compression is about the least interesting aspect of Git's design.
- alphabetag67512 days ago
  When you are learning, everything is important. I think it is okay to cut the person some slack regarding this.
  - eru10 days ago
    Yes, probably.
    It's just that git does a much more interesting job with compression, actually. Lots more to learn. They don't compress the snapshots via something like zstd directly, that comes much later after a delta step. (Interestingly, that delta compression step doesn't use the diffs that `git show` shows you for your commits.)
astinashler11 days ago
Does this Git clone track empty folders? It always annoys me that Git doesn't track empty folders.
- lucasoshiro11 days ago
  Actually, the Git data model supports empty directories; however, the index doesn't, since it only maps names to files, not to directories. You can even create a commit with an empty root directory using --allow-empty, and it will use the hardcoded empty tree object (4b825dc642cb6eb9a060e54bf8d69288fbee4904).
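The empty-tree ID quoted above is not arbitrary: Git names an object by hashing a short header plus the object body, so an empty tree always gets the same ID. A minimal sketch, covering SHA-1 repositories only (the function name `git_object_id` is made up here):

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    """SHA-1 over '<kind> <size>' + NUL + body, as used for Git object names."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_TREE = git_object_id("tree", b"")  # a tree with no entries
EMPTY_BLOB = git_object_id("blob", b"")  # a file with no content
```

`EMPTY_TREE` comes out as the `4b825dc6...` value mentioned in the comment.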
- TonyStr11 days ago
  yep! Had to check to be sure:
  Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.02s
  Running `target/debug/tvc decompress f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5`
  === args ===
  decompress
  f854e0b307caf47dee5c09c34641c41b8d5135461fcb26096af030f80d23b0e5
  === tvcignore ===
  ./target
  ./.git
  ./.tvc
  === subcommand ===
  decompress
  ------------------
  tree ./src/empty-folder e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
  blob ./src/main.rs fdc4ccaa3a6dcc0d5451f8e5ca8aeac0f5a6566fe32e76125d627af4edf2db97
  - woodrowbarlow11 days ago
    huh, cool. what happens if you use vanilla-git to clone a repo that contains empty folders? and do forges like github display them properly?
heckelson12 days ago
gentle reminder to set your website's `<title>` to something descriptive :)
- TonyStr12 days ago
  haha, thank you. Added now :-)
mg79461312 days ago
"Though I suck at it, my go-to language for side-projects is always Rust"
Hmm, don't be so hard on yourself!
proceeds to call ls from rust
Ok nevermind, although I don't think rust is the issue here.
(Tony I'm joking, thanks for the article)
bryan211 days ago
Ftr you can make repos with sha256 now.
I wonder if signing sha-1 mitigates the threat of using an outdated hash.
athrowaway3z12 days ago
I do wonder if the compression step makes sense at this layer instead of the filesystem layer.
- aabbcc124111 days ago
  Interesting take. I'm using btrfs (instead of ext4) with compression enabled (using zstd), so most of the files are compressed "transparently" - the files appear as normal files to the applications, but on disk they are compressed, and the application doesn't need to do the compression/decompression.
direwolf2011 days ago
Cool. When you reimplement something, it forces you to see the fractal complexity of it.
jrockway12 days ago
sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.
Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/
It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.
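One way to check this claim on your own machine is to measure hashing speed with the disk taken out of the picture. The sketch below hashes in-memory data only; the function name and the default of 64 MiB are arbitrary choices, and the numbers will vary by CPU and by how Python's hashlib was built.

```python
import hashlib
import time


def hash_throughput_mib_s(algorithm: str = "sha256", total_mib: int = 64) -> float:
    """Hash `total_mib` MiB of in-memory data and return the rate in MiB/s."""
    block = bytes(1024 * 1024)  # 1 MiB of zero bytes; no disk access involved
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(total_mib):
        hasher.update(block)
    hasher.hexdigest()
    elapsed = time.perf_counter() - start
    return total_mib / max(elapsed, 1e-9)  # guard against a zero-length interval
```

Comparing `hash_throughput_mib_s()` with your disk's sequential read speed shows which of the two is the bottleneck. BLAKE3 is not in Python's standard library, so it is left out of this sketch.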
- EdSchouten12 days ago
  It depends on the architecture. On ARM64, SHA-256 tends to be faster than BLAKE3. The reasons being that most modern ARM64 CPUs have native SHA-256 instructions, and lack an equivalent of AVX-512.
  Furthermore, if your input files are large enough that parallelizing across multiple cores makes sense, then it's generally better to change your data model to eliminate the existence of the large inputs altogether.
  For example, Git is somewhat primitive in that every file is a single object. In retrospect it would have been smarter to decompose large files into chunks using a Content Defined Chunking (CDC) algorithm, and model large files as a manifest of chunks. That way you get better deduplication. The resulting chunks can then be hashed in parallel, using a single-threaded algorithm.
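A rough sketch of the chunk-and-manifest idea follows. It uses a gear-style rolling hash in the spirit of FastCDC, not the algorithm of any particular tool, and every name and parameter (`GEAR`, `split_chunks`, `manifest`, the mask and size limits) is an assumption chosen for illustration.

```python
import hashlib

# 256 pseudo-random 32-bit values, one per byte value, derived deterministically.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big") for i in range(256)]


def split_chunks(data: bytes, mask: int = 0x3FF, min_size: int = 64, max_size: int = 4096) -> list:
    """Cut `data` where the rolling hash of recent bytes has its low bits all zero."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Each step shifts old bytes out, so only the last 32 bytes influence the hash.
        rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (rolling & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def manifest(data: bytes) -> list:
    """Represent a large file as an ordered list of chunk hashes."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in split_chunks(data)]
```

Because boundaries depend on content and not on byte offsets, inserting a few bytes near the start of a file changes only the first chunk or two; later chunks keep their hashes and deduplicate against the old version. The hashing in `manifest` is independent per chunk, so it could be spread across threads.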
  - oconnor66311 days ago
    As far as I know, most CDC schemes require a single-threaded pass over the whole file to find the chunk boundaries? (You can try to "jump to the middle", but usually there's an upper bound on chunk length, so you might need to backtrack depending on what you learn later about the last chunk you skipped?) The more cores you have, the more of a bottleneck that becomes.
    EdSchouten11 days ago
    You can always use a divide and conquer strategy to compute the chunks. Chunk both halves of the file independently. Once that’s done, you redo the chunking around the midpoint of the file forward, until it starts to match the chunks obtained previously.
- grumbelbart212 days ago
  Is that even when using the SHA256 hardware extensions? https://en.wikipedia.org/wiki/SHA_instruction_set
  - oconnor66311 days ago
    It's mixed. You get something in the neighborhood of a 3-4x speedup with SHA-NI, but the algorithm is fundamentally serial. Fully parallel algorithms like BLAKE3 and K12, which can use wide vector extensions like AVX-512, can be substantially faster (10x+) even on one core. And multithreading compounds with that, if you have enough input to keep a lot of cores occupied. On the other hand, if you're limited to one thread and older/smaller vector extensions (SSE, NEON), hardware-accelerated SHA-256 can win. It can also win in the short input regime where parallelism isn't possible (< 4 KiB for BLAKE3).
holoduke12 days ago
I wonder if in the near future there will be no tools anymore in the sense we know them. Maybe you will describe the tool you need and it's created on the fly.
prakhar114412 days ago
I was also playing around with the ".git" directory - ended up writing:
"What's inside .git ?" - https://prakharpratyush.com/blog/7/
lasgawe11 days ago
nice work! This is one of the best ways to deeply learn something, reinvent the wheel yourself.
ofou12 days ago
btw, you can change the hashing algorithm in git easily
smangold12 days ago
Tony nice work!
b1temy12 days ago
Nice work, it's always interesting to see how one would design their own VCS from scratch, and see if they fall into problems existing implementations fell into in the past and if the same solution was naturally reached.
The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.
I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git` directory, and only updates it if it senses the file hash changed (Hm... If it can't detect this by re-hashing the file and comparing it with a known value, perhaps by the timestamp the file was last edited?).
> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets
This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/
I've not personally heard of the process of hashing also being known as digesting, though I don't doubt that it is the case. I'm mostly familiar with the resulting hash being referred to as the message digest. Perhaps it's to differentiate between the verb 'hash' (the process of hashing) and the output 'hash' (the result of hashing). And naming the function `sha256::try_digest` makes it more explicit that it is returning the hash/digest. But it is a bit of a reach; perhaps they are just synonyms to be used interchangeably, as you said.
On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.
--
Also, obligatory mention of jujutsu/jj since it seems to always be mentioned when talking about a VCS on HN.
- TonyStr11 days ago
  You are completely right about tvc ls recomputing each hash, but I think it has to do this? A timestamp wouldn't be reliable, so the only reliable way to verify a file's contents would be to generate a hash.
  In my lazy implementation, I don't even check if the hashes match; the program reads, compresses and tries to write the unchanged files. This is an obvious area to improve performance on. I've noticed that git speeds up object lookups by generating two-letter directories from the first two letters in hashes, so objects aren't actually stored as `.git/objects/asdf12ha89k9fhs98...`, but as `.git/objects/as/df12ha89k9fhs98...`.
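The two-character directory split described above is simple to express in code. This is only a sketch of the path layout; the function name `loose_object_path` is invented, and the object ID used in the note below is the empty-tree hash mentioned elsewhere in this thread.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str, root: str = ".git/objects") -> PurePosixPath:
    """Split a hex object ID into a 2-character directory and the remaining file name."""
    return PurePosixPath(root) / object_id[:2] / object_id[2:]
```

For example, `loose_object_path("4b825dc642cb6eb9a060e54bf8d69288fbee4904")` yields `.git/objects/4b/825dc642cb6eb9a060e54bf8d69288fbee4904`, which keeps any single directory from accumulating every object in the repository.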
  > why were TOML files not considered at the end
  I'm just not that familiar with toml. Maybe that would be a better choice! I saw another commenter who complained about yaml. Though I would argue that the choice doesn't really matter to the user, since you would never actually write a commit object or a tree object by hand. These files are generated by git (or tvc), and only ever read by git/tvc. When you run `git cat-file <hash>`, you'll have to add the `-p` flag (pretty-print) to render it in a human-readable format, and at that point it's just a matter of taste whether it's shown in yaml/toml/json/xml/special format.
  - b1temy11 days ago
    > A timestamp wouldn't be reliable
    I agree, but I'm still iffy on reading all files (already an expensive operation) in the repository, then hashing every one of them, every time you do an ls or a commit. I took a quick look and git seems to check whether it needs to recalculate the hash based on a combination of the modification timestamp and if the filesize has changed, which is not foolproof either since the timestamp can be modified, and the filesize can remain the same and just have different contents.
    I'm not too sure how to solve this myself. Apparently this is a known thing in git and is called the "racy git" problem https://git-scm.com/docs/racy-git/ But to be honest, perhaps I'm biased from working in a large repository, but I'd rather the tradeoff of not rehashing often, rather than suffer the rare case of a file being changed without modifying its timestamp, whilst remaining the same size. (I suppose this might have security implications if an attacker were to place such a file into my local repository, but at that point, having them have access to my filesystem is a far larger problem...)
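The check discussed above (skip re-hashing when the modification time and size are unchanged) can be sketched as follows. This is a simplification under stated assumptions, not Git's actual index code: the class name `StatCache` and its fields are invented, it hashes with SHA-256, and it has the same blind spot as described, since a same-size edit that keeps the timestamp goes unnoticed.

```python
import hashlib
import os


class StatCache:
    """Remember file hashes and reuse them while (mtime, size) stay the same."""

    def __init__(self) -> None:
        self._entries = {}      # path -> ((mtime_ns, size), hex digest)
        self.rehash_count = 0   # how many times a file was actually read and hashed

    def digest(self, path: str) -> str:
        info = os.stat(path)
        key = (info.st_mtime_ns, info.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]    # stat data unchanged: trust the stored hash
        with open(path, "rb") as handle:
            value = hashlib.sha256(handle.read()).hexdigest()
        self.rehash_count += 1
        self._entries[path] = (key, value)
        return value
```

With a cache like this, a repeated `ls` or commit costs one `stat` call per unchanged file, with no need to read and hash the contents again. A real tool would also have to persist the cache between runs.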
    > I'm just not that familiar with toml... Though I would argue that the choice doesn't really matter to the user, since you would never actually write...
    Again, I agree. At best, _maybe_ it would be slightly nicer for a developer or a power user debugging an issue, if they prefer the toml syntax, but ultimately, it does not matter much what format it is in. I mainly asked out of curiosity since your first thoughts were to use yaml or json, when I see (completely empirically) most Rust devs prefer toml, probably because of familiarity with Cargo.toml. Which, by the way, I see you use too in your repository (As to be expected with most Rust projects), so I suppose you must be at least a little bit familiar with it, at least from a user perspective. But I suppose you likely have even more experience with yaml and json, which is why it came to mind first.
    TonyStr11 days ago
    > ...based on a combination of the modification timestamp and if the filesize has changed
    Oh that is interesting. I feel like the only way to get a better and more reliable solution to this would be to have the OS generate a hash each time the file changes, and store that in file metadata. This seems like a reasonable feature for an OS to me, but I don't think any OS does this. Also, it would force programs to rely on whichever hashing algorithm the OS uses.
    b1temy11 days ago
    >... have the OS generate a hash each time the file changes...
    I'm not sure I would want this either tbh. If I have a 10GB file on my filesystem, and I want to fseek to a specific position in the file and just change a single byte, I would probably not want it to re-hash the entire file, which will probably take a minute longer compared to not hashing the file. (Or maybe it's fine and it's fast enough on modern systems to do this every time a file is modified by any program running, I don't know how much this would impact the performance.).
    Perhaps a higher resolution timestamp by the OS might help though, for decreasing the chance of a file having the exact same timestamp (unless it was specifically crafted to have been so).
quijoteuniv11 days ago
Now … if you reinvent Linux you are closer to being compared to LT
smekta11 days ago
...with blackjacks, and hookers
jonny_eh11 days ago
Why introduce yet another ignore file? Can you have it read .gitignore if .tvcignore is missing?
justabrowser12 days ago
[flagged]
- adzm11 days ago
  Second time today I've read and agreed with most of your comment only to eyeroll and downvote once seeing your ridiculous and immature edit.