I've ended up just keeping my own ArchiveBox, and it's an all right experience. In the end, though, it's only useful for things I already knew I wanted to archive. For almost everything else I go to the IA - which has so much.
It's been dormant / on hiatus for a few years now.
- I think I have seen reports that AI scrapers create a bandwidth bottleneck
- For some digital archives you need to create a research account (I think Common Crawl works like that)
- The data can easily get very big. The goal is to store many things: not only the Internet itself, but the Internet with the additional dimension of time
- Since there is so much data, it is difficult to navigate and search, so it can easily become unusable
- For example, that is why I created my own link metadata; I needed some information about domains
Link:
Edit:
Would be really neat if you could click on a domain while on IA and have a desktop client download as many WARC files as you're interested in through a lower-priority download queue, higher-priority pages first, so you could then view the site fully offline.
I wish I could find that article!
edit: https://github.com/internetarchive/dweb-archive/blob/master/...
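Rough sketch of what that desktop-client idea above could look like, assuming nothing more than a plain priority queue of WARC URLs (the URLs are placeholders, not a real IA endpoint; a real client would also want rate limiting so the queue stays low priority):

    import heapq
    import urllib.request

    # (priority, url) pairs; lower number = fetched sooner.
    # Placeholder URLs -- not a real archive API.
    queue = [
        (0, "https://example.org/warc/homepage.warc.gz"),
        (5, "https://example.org/warc/deep-page-001.warc.gz"),
        (9, "https://example.org/warc/assets.warc.gz"),
    ]
    heapq.heapify(queue)

    while queue:
        priority, url = heapq.heappop(queue)   # most important file first
        filename = url.rsplit("/", 1)[-1]
        print(f"fetching (priority {priority}): {url}")
        urllib.request.urlretrieve(url, filename)

Once the WARCs are local, a replay tool could serve them for fully offline viewing.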
No one uses IPFS. For the average user, it is significantly more difficult to get started with than torrents. For the experienced user, the ecosystem of tools around IPFS is extremely small.
All in all, IPFS offers very little benefit over torrents in practice and has a much smaller user pool.
https://www.bittorrent.org/beps/bep_0039.html
https://www.bittorrent.org/beps/bep_0046.html
If that updated torrent is a BEP-0052 (v2) torrent, it hashes per-file, so the updated v2 torrent will have identical hashes for files that haven't changed: https://www.bittorrent.org/beps/bep_0052.html
This combines with BEP-0038, so the updated torrent can refer to the infohashes of the older torrents it shares files with; if you already have an old one, you only have to download the files that have changed: https://www.bittorrent.org/beps/bep_0038.html
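For illustration, this is roughly the shape of a v2 metainfo dict that also carries BEP-0038's "similar" hint. All hash values are fake placeholders and no real piece layers are computed; it's only meant to show where the per-file "pieces root" and the "similar" infohash list live (my reading of the BEPs; the exact details are in the specs linked above):

    # Sketch of a BitTorrent v2 (BEP-0052) metainfo dict using the
    # BEP-0038 "similar" key. All hash values are fake placeholders.
    metainfo = {
        "announce": "http://tracker.example.org/announce",
        "info": {
            "name": "archive-dump",
            "meta version": 2,          # marks this as a v2 torrent
            "piece length": 2 ** 18,
            "file tree": {
                "unchanged.warc.gz": {
                    # per-file SHA-256 Merkle root: identical file bytes
                    # give an identical root, even in a different torrent
                    "": {"length": 1048576, "pieces root": b"\x00" * 32},
                },
                "new.warc.gz": {
                    "": {"length": 2097152, "pieces root": b"\x11" * 32},
                },
            },
            # BEP-0038: infohashes of earlier torrents that share files,
            # letting a client that already has the old torrent reuse the
            # data for unchanged.warc.gz instead of re-downloading it
            "similar": [b"\x22" * 20],
        },
        "piece layers": {},  # per-file piece hashes would be filled in here
    }

A real torrent builder would bencode this and compute the actual roots and piece layers from the files on disk.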
It's based on torrents, and you can easily make a content delivery system on top of this (so people can fetch data from this network).
I emailed a few archiving teams but nobody seemed interested, so I never made it.
Easier to send fiat to IA for them to invest (~$2/GB) and to pay to keep the disks spinning somewhere safe across the world.
(ia volunteer, no affiliation otherwise)
You're right, though: long-term commitment is rare from volunteers. That's why the idea is to make short-term commitment so easy that you have a good enough pool of short-termers that it works out in the aggregate.
(ia source of truth, storage system of last resort -> item index -> torrent index -> global torrent swarm)
https://gist.github.com/skorokithakis/68984ef699437c5129660d...
https://annas-archive.org/torrents
I think I'm misunderstanding you.
https://www.bittorrent.org/beps/bep_0039.html https://www.bittorrent.org/beps/bep_0046.html
But unfortunately most FOSS torrent clients do not support it, partly because libtorrent 2.0.x had poor I/O performance in some cases at release, so torrent clients reverted to the 1.2.x branch.
But torrents are probably the wrong tech. I'm sure there would be many players willing to host a few TB or more each, which could be fronted by something that makes it transparent to the user.
But a better option might be a subscription model; anything else will be slammed by crawlers.
By the way, thank you to all the teams at IA; what you provide is such an important thing for humanity.
Edit: And how many terabytes it all amounts to.
You want to know why they'd tamper with data?
https://seclab.cs.washington.edu/2017/10/30/rewriting-histor...
https://blog.archive.org/2018/04/24/addressing-recent-claims...
The NSA already paid to backdoor RSA, got caught shipping pre-hacked routers, can rewrite pages mid-flight with QUANTUM, and can penetrate and siphon data from remote infected machines... what else could they do?
https://www.amnesty.org/en/latest/news/2022/09/myanmar-faceb...
The reality is that many things don't exist simply because someone isn't paid to do it.
(Remember, robots.txt is not a privacy measure; it's supposed to be something that keeps crawlers from getting stuck in tar pits!)
[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...
[2] https://help.archive.org/help/how-do-i-request-to-remove-som...
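On the point above that robots.txt is not a privacy measure: honouring it is entirely voluntary. A polite crawler reads the file and chooses to obey, roughly like this sketch using Python's standard robotparser (placeholder URLs); nothing technical stops a client that skips the check:

    from urllib import robotparser

    # A well-behaved crawler voluntarily consults robots.txt before fetching.
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")   # placeholder site
    rp.read()

    url = "https://example.org/some/page.html"
    if rp.can_fetch("MyArchiveBot", url):
        print("robots.txt permits fetching", url)
    else:
        # Obeying this is a convention, not access control; the page is
        # still publicly reachable by any client that ignores the file.
        print("robots.txt asks crawlers to skip", url)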
(Also, consider that when you forbid such functionality, the only thing that happens is that its development becomes private. It's like DRM: it only hurts legitimate customers.)
https://blog.archive.org/2025/09/23/celebrating-1-trillion-w...
How does their scope or infrastructure compare?
I know they serve different purposes, but both are essentially doing similar things.
We'll load a truck with a copy of our complete archive if you give us a substantial donation to keep the archive going for a few more years.
If you don't agree to this deal, you can still access the archive, but it's gonna be at sluggish download speeds and take you years to get all the content.
And for single page archives I tend to use archive.is nowadays. For as long as I can remember, IA has been unusably slow.
But still kudos to them for the effort.
Do you hash them into some sort of blockchain?
The inability to rewrite history will be a fantastic gift to the world.
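A hashed, append-only log is the minimal version of that idea: each entry's digest folds in the previous one, so quietly altering an old capture invalidates every digest after it. A small sketch (the snapshot contents are made up):

    import hashlib

    def chain_hash(prev_digest: str, snapshot: bytes) -> str:
        # Each entry commits to the previous one, so changing any earlier
        # snapshot changes every digest that follows it.
        return hashlib.sha256(prev_digest.encode() + snapshot).hexdigest()

    snapshots = [b"capture of page, 2021-05-01", b"capture of page, 2023-02-14"]

    digests = []
    prev = ""  # genesis value
    for snap in snapshots:
        prev = chain_hash(prev, snap)
        digests.append(prev)

    print(digests)  # publishing these lets anyone verify later copies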
As for the AWS stuff: look at the ties between these organizations; it's pretty clear Amazon is basically self-dealing via a non-profit to write things off or run some other scheme.