I watched it in the browser network panel and saw it fetch:
https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz
As I paginated to previous days.
It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, while this one uses sharded files instead.
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against; there are 1636 in total.
This is my VFS: https://github.com/ncruces/go-sqlite3/blob/main/vfs/readervf...
And using it with range requests: https://pkg.go.dev/github.com/ncruces/go-sqlite3/vfs/readerv...
And having it work with a Zstandard compressed SQLite database, is one library away: https://pkg.go.dev/github.com/SaveTheRbtz/zstd-seekable-form...
But you can use it (e.g.) in a small VPS to access a multi-TB database directly from S3.
But, also, SQLite caches data; you can simply increase the page cache.
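To make the range-request idea concrete: SQLite reads its file in fixed-size pages, and page N (1-indexed) starts at byte offset (N - 1) * page_size, so a remote VFS mostly has to turn page reads into HTTP Range headers. A minimal sketch in Python, not the Go VFS linked above; `page_byte_range` is a made-up helper name, and the 4096-byte page size is an assumption (it is SQLite's default, but a given database may use another size).

```python
def page_byte_range(page_number: int, page_size: int = 4096) -> str:
    """Return the HTTP Range header value covering one SQLite page.

    SQLite pages are numbered from 1, so page N starts at byte
    (N - 1) * page_size. The Range header's end offset is inclusive.
    page_size=4096 is an assumption; read the real value from the
    database header in practice.
    """
    if page_number < 1:
        raise ValueError("SQLite page numbers start at 1")
    start = (page_number - 1) * page_size
    end = start + page_size - 1
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Header value a client would send to fetch page 3 only.
    print(page_byte_range(3))
```

A bigger page cache, as mentioned above, simply means fewer of these requests get issued for pages that were already fetched.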
And I just now added a 'me' view. Enter your username and it will show your comments/posts on any day. So you can scrub back through your 2006 - 2025 retrospective using the calendar buttons.
edit: Fixed! Also I just pushed a new version with a Dec 29th Data Dump, so ... updates - yay!
I’ve used it in production to self-host Australia-only maps on S3. We generated a single ~900 MB PMTiles file from OpenStreetMap (Australia only, up to Z14) and uploaded it to S3. Clients then fetch just the required byte ranges for each vector tile via HTTP range requests.
It’s fast, scales well, and bandwidth costs are negligible because clients only download the exact data they need.
I want something like a db with indexes
LanceDB has a similar mechanism for operating on remote vector embeddings/text search.
It’s a fun time to be a dev in this space!
But, when using this on frontend, are portions of files fetched specifically with http range requests? I tried to search for it but couldn't find details
Looks like it's still on PyPI though: https://pypi.org/project/sqlite-s3vfs/
You can see inside it with my PyPI package explorer: https://tools.simonwillison.net/zip-wheel-explorer?package=s...
https://github.com/simonw/sqlite-s3vfs
This comment was helpful in figuring out how to get a full Git clone out of the heritage archive: https://news.ycombinator.com/item?id=37516523#37517378
Here's a TIL I wrote up of the process: https://til.simonwillison.net/github/software-archive-recove...
From what I see in GitHub in your copy of the repo, it looks like you don’t have the tags.
Do you have the tags locally?
If you don’t have the tags, I can push a copy of the repo to GitHub too and you can get the tags from my copy.
git push --tags origin
Sure, the LLM fills in all the boilerplate and makes an easy-to-use, reproducible tool with loads of documentation, and credit for that. But is it not more accurate to say that Simon is absurdly efficient, LLM or sans LLM? :)
https://simonwillison.net/2021/May/2/hosting-sqlite-database...
https://phiresky.github.io/blog/2021/hosting-sqlite-database...
https://just.billywhizz.io/sqlite/demo/#https://raw.githubus...
There is also a file format to optimize this https://cogeo.org/
I believe that there are also indexing opportunities (not necessarily via eg hive partitioning) but frankly - am kinda out of my depth on it.
The sequence of shards you saw when you paginated to days is facilitated by the static-manifest which maps HN item ID ranges to shards, and since IDs are increasing and a pretty good proxy of time (a "HN clock"), we can also map the shards that we cut up by ID to the time spans their items cover. An in-memory table sorted by time is created from the manifest on load so we can easily look up which shard we need when you pick a day.
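Roughly, the day-to-shard lookup described here could look like the sketch below. This is Python for illustration, not the site's actual JavaScript; `ShardSpan`, `build_time_table` and `shard_for_time` are hypothetical names, and the manifest is assumed to already carry one effective start/end timestamp per shard.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class ShardSpan(NamedTuple):
    shard_id: int
    start: int  # effective first timestamp covered (unix seconds, assumed)
    end: int    # effective last timestamp covered (unix seconds, assumed)


def build_time_table(manifest: list[ShardSpan]) -> list[ShardSpan]:
    """Sort manifest entries by start time so they can be binary-searched."""
    return sorted(manifest, key=lambda span: span.start)


def shard_for_time(table: list[ShardSpan], ts: int) -> Optional[int]:
    """Return the shard whose span contains ts, or None if no span does."""
    starts = [span.start for span in table]
    i = bisect_right(starts, ts) - 1  # last span starting at or before ts
    if i < 0 or ts > table[i].end:
        return None
    return table[i].shard_id
```

Because item IDs and time move together, sorting by start time gives the same order as sorting by shard number, which is why paginating backwards walks the shard numbers down one at a time.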
Funnily enough, this system was thrown off early on by a handful of "ID/timestamp" outliers in the data: items with weird future timestamps (offset by a couple years), or null timestamps. To cleanse our pure data from this noise, and restore proper adjacent-in-time shard cuts we just did a 1/99 percentile grouping and discarded the outliers leaving shards with sensible 'effective' time spans.
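For illustration, the 1/99 percentile trim could be sketched like this in Python. It is a guess at the idea, not the actual etl-hn.js code; `effective_span` is a hypothetical name, and picking percentiles by simple index into the sorted list is an assumption.

```python
from typing import Iterable, Optional


def effective_span(
    timestamps: Iterable[Optional[int]],
    lo_pct: int = 1,
    hi_pct: int = 99,
) -> Optional[tuple[int, int]]:
    """Return (start, end) of a shard's timestamps, ignoring outliers.

    Null timestamps are dropped, then everything below the lo_pct and
    above the hi_pct percentile is ignored, so a few items with bogus
    future dates cannot stretch the shard's time span.
    """
    ts = sorted(t for t in timestamps if t is not None)
    if not ts:
        return None
    last = len(ts) - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    lo_index = last * lo_pct // 100
    hi_index = last * hi_pct // 100
    return ts[lo_index], ts[hi_index]
```

With 100 well-behaved timestamps plus one item dated far in the future and one null, the far-future item falls above the 99th percentile and no longer affects the span.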
Sometimes we end up fetching two shards when you enter a new day because some items' comments exist "cross shard". We needed another index for that and it lives in cross-shard-index.bin which is just a list of 4-byte item IDs that have children in more than 1 shard (2-bytes), which occurs when people have the self-indulgence to respond to comments a few days after a post has died down ;)
Thankfully HN imposes a 2 week horizon for replies so there aren't that many cross-shard comments (those living outside the 2-3 days span of most, recent, shards). But I think there's still around 1M or so, IIRC.
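The exact layout of cross-shard-index.bin isn't spelled out above, so the following is only a sketch under a loud assumption: each record is a 4-byte unsigned item ID followed by a 2-byte unsigned field (a shard number or count, I'm guessing), little-endian, with no header. `parse_cross_shard_index` is a hypothetical name.

```python
import struct

# ASSUMED record layout: uint32 item ID + uint16 field, little-endian.
# The real file may differ; adjust the format string accordingly.
RECORD = struct.Struct("<IH")


def parse_cross_shard_index(data: bytes) -> dict[int, int]:
    """Map each item ID to its 2-byte field, under the assumed layout."""
    if len(data) % RECORD.size != 0:
        raise ValueError("data length is not a multiple of the record size")
    return {item_id: value for item_id, value in RECORD.iter_unpack(data)}
```

At 6 bytes per record, around 1M cross-shard items would come to roughly 6 MB before compression, if that layout guess is right.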
Where did you get the 22GB figure from? On the site it says:
> 46,399,072 items, 1,637 shards, 8.5GB, spanning Oct 9, 2006 to Dec 28, 2025
How was the entirety of HN stored in a single SQLite database? In other words, how was the data acquired? And how does the page load instantly if there's 22GB of data having to be downloaded to the browser?
- 1. download_hn.sh - bash script that queries BigQuery and saves the data to *.json.gz
- 2. etl-hn.js - does the sharding and ID -> shard map, plus the user stats shards.
- 3. Then either npx serve docs or upload to CloudFlare Pages.
The ./toool/s/predeploy-checks.sh script basically runs the entire pipeline. You can do it unattended with AUTO_RUN=true
22 GB is uncompressed; compressed, the entire thing is about 9 GB
and why do you want wikipedia in your pocket, but not a smartphone? where do you draw the line?
(doing a lot of work in that area, so i am asking to learn from someone who might think alike)
I have a $10 a month plan from US cellular with only 2gigs so I try to keep everything offline that I can.
Honestly it's mostly the news... so I draw the line at browser, I'll never install a browser, that's basically something I can do when I sit down at a PC. I read quite a bit and I like to have the ability to look up a word or a historical event or some reference from something I read using Kiwix and it's been great for that, just needed to add a 512gb micro sd card. And Libby I just use at the gym when I'm on the treadmill.
your input would be very valuable.
I also want to make sure we can build this in CI. My goal is to have this updated every day using the BigQuery update process, so it becomes a 1–2 day delayed static archive of the current state of Hacker News, which is honestly very cool.
I can probably run the build for free on GitHub Actions runners, as long as the runner has about 40 GB of disk space available. If needed, I can free up space on the runner before the build starts.
I’ll also write to GitHub support and ask if they can sponsor the cost of a larger runner, mainly because I need the extra disk space to run the build reliably.
> Next, we've got more than just two tables. The quote/paraphrase doesn't make it clear, but we've got two tables per thing. That means Accounts have an "account_thing" and an "account_data" table, Subreddits have a "subreddit_thing" and "subreddit_data" table, etc.
https://www.reddit.com/r/programming/comments/z9sm8/comment/...
I did something similar. I built a tool[1] to import the Project Arctic Shift dumps[2] of reddit into sqlite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build an FTS5 index and import without WAL (--unsafe-mode), import of every reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.
SQLite offers a lot of cool json features that would let you store the raw json and operate on that, but I eschewed them in favor of parsing only once at load time. That also lets me normalize the data a bit.
I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.
[1] https://github.com/Paul-E/Pushshift-Importer
[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
> The VACUUM command may change the ROWIDs of entries in any tables that do not have an explicit INTEGER PRIMARY KEY.
means SQLite does something to organize by rowid and that this is doing most of the work.
Reddit post/comment IDs are 1:1 with integers, though expressed in a different base that is more friendly to URLs. I map decoded post/comment IDs to INTEGER PRIMARY KEYs on their respective tables. I suspect the vacuum operation sorts the tables by their reddit post ID and something about this sorting improves table scans, which in turn helps building indices quickly after standing up the DB.
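As a small illustration of that ID mapping, here is a Python sketch (not the Rust importer's code) that assumes the IDs are base 36, i.e. digits 0-9 followed by lowercase a-z. The function names are made up for the example.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed base-36 digits


def reddit_id_to_int(reddit_id: str) -> int:
    """Decode a base-36 ID string into an integer usable as a rowid."""
    return int(reddit_id, 36)


def int_to_reddit_id(n: int) -> str:
    """Encode a non-negative integer back into its base-36 ID string."""
    if n < 0:
        raise ValueError("IDs are non-negative")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))
```

Since the decoded value is a plain integer, it can be stored directly in an INTEGER PRIMARY KEY column, which SQLite uses as the table's rowid.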
Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?
So you can dump e.g. all of Hacker News in a single multi-GB Parquet file somewhere and build a client-side JavaScript application that can run queries against that without having to fetch the whole thing.
You can run searches on https://lil.law.harvard.edu/data-gov-archive/ and watch the network panel to see DuckDB in action.
It would be an interesting experiment to add the duckdb backend
It has transparent compression built-in and has support for natural language queries. https://buckenhofer.com/2025/11/agentic-ai-with-duckdb-and-s...
"DICT FSST (Dictionary FSST) represents a hybrid compression technique that combines the benefits of Dictionary Encoding with the string-level compression capabilities of FSST. This approach was implemented and integrated into DuckDB as part of ongoing efforts to optimize string storage and processing performance." https://homepages.cwi.nl/~boncz/msc/2025-YanLannaAlexandre.p...
It's different in that it is tailored to analytics, among other things storage is columnar, and it can run off some common data analytics file formats.
duckdb is a 45M dynamically-linked binary (amd64)
sqlite3 1.7M static binary (amd64)
DuckDB is a 6yr-old project
SQLite is a 25yr-old project
BUT I did try to push the entire 10GB of shards to GitHub (no LFS, no thanks, money), and after the 20 minutes compressing objects etc, "remote hang up unexpectedly"
To be expected I guess. I did not think GH Pages would be able to do this. So have been repeating:
wrangler pages deploy docs --project-name static-news --commit-dirty=true
on changes and first time CF Pages user here, much impressed!
It's super simple, really, far less impressive than what you've built there.
Listen was nice. That's really cool, actually. I encourage you to do it.
I had to run a test for myself, and using sqlite2duckdb (no research, first search hit), and using randomly picked shard 1636, the sqlite.gz was 4.9MB, but the duckdb.gz was 3.7MB.
The uncompressed sizes favor sqlite, which does not make sense to me, so not sure if duckdb keeps around more statistics information. Uncompressed sqlite 12.9MB, duckdb 15.5MB
Doesn't scream columnar database to me.
I was getting an error that the users and user_domains tables aren't available, but you just need to change the shard filter to the user stats shard.
While designed for OS portability, you can use it to convince SQLite to read from something other than a file on disk.
That's too bad, I'd like to see the inner-working with a subset of data, even with placeholders for the posts and comments.
story volume (all time): https://ibb.co/pBTTRznP
average score (all time): https://ibb.co/KcvVjx8p
story volume (since 2020): https://ibb.co/cKC5d7Pp
average score (since 2020): https://ibb.co/WpN20kfh
median score (all time): https://ibb.co/gZV5QVMG
median score (since 2020): https://ibb.co/Gfv8T7k8
Totally cool if not, just super interesting!
mean (all time): https://katb.in/yutupojerux
mean (since 2020): https://katb.in/omoyibisava
median (all time): https://katb.in/kilopofivet
median (since 2020): https://katb.in/ukefetuyuhi
I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
With all due respect it would be great if there is an official HN public dump available (and not requiring stuff such as BigQuery which is expensive).
For example, one of the most useful applications of video over text is appliance or automotive repair, but the ideal format would be an article interspersed with short video sections, not a video with a talking head and some ~static shaky cam taking up most of the time as the individual drones on about mostly unrelated topics or unimportant details yet you can’t skip past it in case there is something actually pertinent covered in that time.
I've produced a few videos, and I was shocked at how difficult it was to be clear. I have the same problem with writing, but at least it's restricted in a way video making isn't. There's so many ways to make a video about something, and most of them are wrong!
Converting 22GB of uncompressed text into video essay lands us at ~1PB or 1000TB.
One could convert the Markdown/PDF to a very long image first with pandoc+wkhtml, then use ffmpeg to crop and move the viewport slowly over the image, this scrolls at 20 pixels per second for 30s - with the mpv player one could change speed dynamically through keys.
ffmpeg -loop 1 -i long_image.png -vf "crop=iw:ih/10:0:t*20" -t 30 -pix_fmt yuv420p output.mp4
Alternatively one could use a Rapid Serial Visual Presentation / Speedreading / Spritz technique to output to mp4 or use dedicated rsvp program where one can change speed.
One could also output to a braille 'screen'.
Scrolling mp4 text on the TV or Laptop to read is a good idea for my mother and her macular degeneration, or perhaps I should make use of an easier to see/read magnification browser plugin tool.
Best locally of course to avoid “I burned a lake for this?” guilt.
I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).
[0] https://kiwix.org/en/the-new-kiwix-library-is-available/
Guess it's common knowledge that SharedArrayBuffer (SQLite wasm) does not work with FF due to Cross-Origin Attacks (I just found out ;).
Once the initial chunk of data loads the rest load almost instantly on Chrome. Can you please fix the GitHub link (currently a 404)? I would like to peek at the code. Thank you!
edit: I just tested with FF latest, seems to be working.
But when I go back to the 26th none of the shards will load; they error out.
Using Windows 11, FF 146.0.1
Since you tested it, it seems it's just a me problem. Thanks for fixing the GitHub link
./toool/download-site.mjs --help
To let you download the entire site over HTTPS so you don't need to "build it" by running the pipeline.
That way it's truly offline.
Such a DB might be entertaining to play with, and the threadedness of comments would be useful for beginners to practise efficient recursive queries (more so than the StackExchange dumps, for instance).
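For anyone wanting to try that, a recursive query over a comment tree might look like the sketch below, using Python's built-in sqlite3 module. The `items(id, parent)` table is a hypothetical, simplified schema for the example, not necessarily what the shards actually contain.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory comment tree (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, parent INTEGER)")
    conn.executemany(
        "INSERT INTO items (id, parent) VALUES (?, ?)",
        [(1, None), (2, 1), (3, 1), (4, 2), (5, None)],
    )
    return conn


def thread_with_depth(conn: sqlite3.Connection, root_id: int) -> list[tuple[int, int]]:
    """Return (id, depth) for root_id and every descendant comment."""
    query = """
        WITH RECURSIVE thread(id, depth) AS (
            SELECT id, 0 FROM items WHERE id = ?
            UNION ALL
            SELECT i.id, t.depth + 1
            FROM items AS i
            JOIN thread AS t ON i.parent = t.id
        )
        SELECT id, depth FROM thread ORDER BY depth, id
    """
    return conn.execute(query, (root_id,)).fetchall()


if __name__ == "__main__":
    print(thread_with_depth(demo_db(), 1))
```

An index on the parent column would matter a lot for the real data, since each recursion step looks up children by parent ID.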
"1. Delayed Karma Display. I understand why comment karma was hidden. I don't see the harm in un-hiding karma after some time. If not 24 hours, then 72-168 hours. This would help me read through threads with 1300 comments."
This was last January. While I asked for a few more features, it is the only one that seems essential as HN grows with massive threads.
The only way you could theoretically extract everyone's comment scores (at least the top level ones) would be like this if you're a complete madman:
1. Wait 48 hours so the article is effectively dead
2. Post a new comment using an account called ThePresident
3. Create a swarm of a thousand shill user accounts called Voter1, Voter2, etc.
4. Use a single account at a time and upvote ThePresident
5. Recheck the page to see if ThePresident has moved above a user(s) post
6. Record the score for that user and assign it to the tracked story's history
7. Repeat from (4)
But the idea I have is not like that at all - it's much nicer on everyone's ethics. Stay tuned! :)
I have always known I could scrape HN, but I would much rather take a neat little package.
Nonetheless, random access history is cool.
Minor bug/suggestion: right-aligned text inputs (eg the username input on the “me” page) aren’t ideal since they are often obscured by input helpers (autocomplete or form fill helper icons).
It would be nice for the thread pages to show a comment count.
Edit: Good idea! I implemented a "year" selector so all main views (front/show/ask/jobs) will be from that entire year rather than just a single day.
2026 prayer: for all you AI junkies—please don’t pollute H/N with your dirty AI gaming.
Don’t bot posts, comments, or upvote/downvote just to maximize karma. Please.
We can’t identify anymore who’s a bot and who’s human. I just want to hang out with real humans here.
As someone reskilling into being a writer, I really do not think that is "good writing".
I wonder if there's something like this going on here. I never thought it was LLM on first read, and I still don't, but when you take snippets and point at them it makes me think maybe they are
But it didn’t read LLM generated IMO.
Always write what you want, however you want to write it. If some reader somewhere decides to be judgemental because of — you know — an em dash or an X/Y comparison or a complement or some other thing that they think pins you down as being a bot, then that's entirely their own problem. Not yours.
They observe the reality that they deserve.
Ooh, I used “sequential”, ooh, I used an em dash. ZOMG AI IS COMING FOR US ALL
Also for reference: “this shortcut can be toggled using the switch labeled 'Smart Punctuation' in General > Keyboard settings.”
Ending a sentence with a question mark doesn’t automatically make your sentence a question. You didn’t ask anything. You stated an opinion and followed it with a question mark.
If you intended to ask if the text was written by AI, no, you don’t have to ask that.
I am so damn tired of the “that didn’t happen” and the “AI did that” people when there is zero evidence of either being true.
These people are the most exhausting people I have ever encountered in my entire life.
From the terms of use [0]:
"""
Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
"""