From that 5-pass process the marked-up text is handed to a volunteer "post-processor" who assembles the final HTML or e-book file; then the completed book gets one more "smooth reading" pass before it is posted to PG.
This is the process that produces the books input to Standard Ebooks. That they can still find scanner errors ("tne" for "the", a typical "scanno") demonstrates how difficult it is to see those. But their presence isn't from carelessness or disregard for the value of the books.
In the 20-teens I put in hundreds of volunteer hours at PGDP in all the above roles, and it was very satisfying work. I'd recommend it to anyone wanting an online hobby that feels constructive. Volunteering time to Standard Ebooks would probably feel good as well.
it truly is an "online hobby that feels constructive". you get these tiny glimpses into our shared literary/cultural history while knowing that the work you're doing is for the benefit of all (benefit of the public domain)
Isn't the backlog there mostly in the post-processing step, though? To the point where they're taking finished texts and running them again through the page-by-page proofreading in hope of fishing out more OCR typos and improving the format markup?
You can also contribute at Wikisource if you prefer; it doesn't really have a post-processing step and has much less of a fixed pipeline. (There are explicit "proofreading" and "verification" steps per page, but not much beyond that.)
They do have a double-pass system for all works based on scanned pages, which is quite nifty. Green means two passes complete: https://en.m.wikisource.org/wiki/Index:Sophocles%27_King_Oed...
Plus you can just jump in to any work, in true wiki fashion.
Even just automated flagging of common errors would save 1000s of volunteer hours.
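For what it's worth, here is a rough sketch of what that kind of automated flagging could look like. The word list is purely illustrative (only "tne" comes from this thread) and `flag_scannos` is a made-up helper name, not part of any PGDP or Standard Ebooks tool. It only flags candidates for a human to look at; it doesn't change anything.

```python
import re

# Illustrative scanno list: suspect word -> likely intended word.
# Only "tne" is taken from the discussion; the others are assumed examples.
COMMON_SCANNOS = {
    "tne": "the",
    "arid": "and",
    "modem": "modern",
}

WORD_RE = re.compile(r"[A-Za-z]+")


def flag_scannos(text, scannos=COMMON_SCANNOS):
    """Return (line_number, word, suggestion) for every suspect word."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in WORD_RE.finditer(line):
            word = match.group(0)
            suggestion = scannos.get(word.lower())
            if suggestion is not None:
                hits.append((line_number, word, suggestion))
    return hits


if __name__ == "__main__":
    sample = "It was tne best of times,\nit was the worst of times."
    for line_number, word, suggestion in flag_scannos(sample):
        print(f"line {line_number}: '{word}' (did you mean '{suggestion}'?)")
```

Some entries ("arid", "modem") are valid words too, so a proofer still has to decide each case; the point is only to direct attention.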
Meanwhile, in that early stage, because of the stream of errors, it is easy to pay attention and feel like you are doing rewarding work. Moreover, if you are quite quick and diligent, you can basically just read a book as volunteer work.
Also, sometimes the error is in the source material. Different editors have different opinions about what should be done there. Sometimes I had to re-add mistakes that were "fixed" by early proofers trying to correct grammar, if I recall correctly... it was a while back that I volunteered.
there are also statistical methods to detect words that are changed into other, valid words - check out the grammar checker in google docs for instance. again, not 100%, but every bit helps.
I have seen that LLMs are pretty good at understanding context/domain / theme-specific terms, so their spellchecking is pretty good.
So, when I care about a book, I never read Standard Ebooks' edition.
By the way, the modernization is more than joining a few words. Sometimes, Standard Ebooks replaces the word used at the time the book was written. For instance:
This time, however, the mountain was going to [-Mahomet;-]{+Muhammad;+}
The previous quote is from Galsworthy's "Forsyte Saga". The author used many French words and French spellings – like "Tchekov" for the Russian playwright who was living in Paris. These subtleties are lost with the modernization. I also think some alterations are plain mistakes. For instance in the same book:
if she wanted a good book she should read [-“Job”-]{+Job+};
his father was rather like Job while Job still had land.
Unless the book is specifically declared to be the original text - and it may have to specify which original text - they're going to be edited.
However, in electronic form it should be possible to include both in one file, or two files with the original in a repo branch once all the document structure stuff has been added. That text will never change, so merging formatting-only changes should be pretty painless.
"As it was written" is a very high bar that is simply not attainable for anything other than fairly recent works in your native language.
That one appears to not be a mistake, [0] suggests that not quoting the name of the book of the bible being referred to (so [Job] rather than ["Job"]) is the style accepted by Chicago, MLA, and APA.
[0] https://en.wikipedia.org/wiki/Bible_citation#Common_formats
This is a common practice that editors and publishers have quietly engaged in for centuries. For example, today you are not reading Shakespeare in the way it was spelled in its first printing.
After reading this comment I couldn't help but picture medieval monks, toiling away copying old manuscripts into "modern" English. Normally a thankless task, so thank you!
However, we call modernised Shakespeare “abridged”.
That is not at all what I said.
> You can't claim to care about preserving the works while changing them, and that is changing them.
We do not and have never made that claim. We are creating our own editions of these public domain books, not engaging in historical preservation.
If you want to read classic books in their original spelling, then you must locate first editions. Editors and publishers have updated both spelling and punctuation as a matter of course for centuries. Just look at any three editions of any Jane Austen novel - and you could never read an edition of Shakespeare more recent than 1800.
As someone who writes I greatly dislike this. These are my words, not yours.
A translation across time and generations is a completely different matter.
Today, it's much easier for authors to have a greater say in the final presentation due to the digital composition process
I don't see why anyone should care that publishers have edited in the past anyway, even in this particular discussion where my own argument is for conservation. Publishers have done all kinds of things that this very project itself criticises and pointedly set themselves apart by doing differently. So, it's a weak argument for them.
Aside from that, what any other publishers do, even if it's totally common and even universal, doesn't change the argument that they were making that they wish to suggest that those edits cross a line that fixing typos doesn't cross.
Modernizing / adapting is the least damaging change to be done here
I think that Standard Ebooks is a great-sounding project, but I honestly found your response not just flippant, but passive-aggressively rude to the original poster.
But, full disclosure, I also think that it would be a good idea to preserve the spellings found in the original editions you are digitising. So perhaps I am inclined to feel the bite of your response more than someone who just doesn’t care.
I didn’t read it that way at all. How would you have worded it in such a way as to sincerely express the stated sentiment without coming across as passive‐aggressively rude?
Something like ‘While we understand that some people would prefer to read the original texts (modulo typos, formatting errors and the like), we think that it is preferable to modernize spelling because X, Y and Z.’
In other words, the polite response to ‘I like most of what you’re doing, but I dislike this particular thing’ is not ‘Fine! You’re free to go elsewhere,’ with an implied ‘don’t let the door hit you on the backside on your way out,’ but rather to engage and explain.
Again, I have to admit my own bias against the policy and consequent bias in favour of the original poster.
The real answer is twofold.
1. We don't have a special 3rd kind of quote or other punctuation mark for reinterpreted references.
2. The real one: This is not a quote that lies as you imply. It is a new message that merely uses quotes to denote a speaker, as in a purely fictional work, where the characters' dialog is in quotes even though no actual human was quoted.
Are there any other conundrums and baffling mysteries I can clear up for you?
Better to write inline "I feel like what you said amounts to [...]" to reduce the perception that you're making up quotes that someone didn't say or even clearly imply.
Ok, thanks, that makes sense.
Wrong. Not only is it tasteless and dishonest (not "fine"), it is against the rules of this site. But regardless of whether it's allowed elsewhere, you still shouldn't do it. (See "tasteless and dishonest".)
It makes it hard to browse those collections to find actual books to read. The first 3 series I clicked on all said "not P.D." (at first I didn't know what "P.D." meant; remember that your audience does not have your level of familiarity with your context, so perhaps a tooltip on that badge would help).
Then I see "this book will enter public domain in 2050".
I commend you for this project, it's really awesome work. From a user's perspective, it would be great to have a filter on your various lists that restricts them to books that are available and excludes the books that are not yet in your collection.
(More concretely my reader is a 2nd-gen kindle which is basically useless these days and I’d love an idea of something that can display standard ebooks with all their advanced formatting)
Thanks!
I think Kindle's renderer hasn't changed significantly for many years, and it had always been pretty bad. I always say that Kindle seems to have been created by people who hate books.
The best renderer around is iBooks on an iPad, which as far as I can tell uses an up-to-date Webkit.
I've been meaning to try calibre-web, but I'm doubtful iBooks will support OPDS.
If I want to appreciate a nice Kepub from Standard Ebooks, I upload it directly to the Kobo.
Standard Ebooks offers kepub files for Kobo devices; those use Kobo's advanced WebKit-based renderer: https://standardebooks.org/help/how-to-use-our-ebooks#kobo-f...
Thanks for the recommendation!
In my ideal world all devices would be like this.
What I'm personally looking for:
- Linux and/or OS X
- No ‘import’ requirement (a viewer, not a collection manager)
- Single page or continuous (no forced double spread)
- No required animations
- At least basic control over font size, spacing, margins.
- Keyboard navigation (at least next/previous page)
It does a good job of modernising old Kindles.
Unmatched UI tweaking features which make reading a pleasure. Syncs bookmarks with cloud services, thus across different devices.
For reference my gen 2 kindle is 16 years old.
All the author pages come before any pages with books from those authors.
Personally I think it's important to have one person in charge who is able to approve of the quality of all the project's output; for now, at SE, that person is me and I'm only an expert in English.
Beginners, and people working on more advanced books, can take much, much, much longer.
Anywhere between 1 week for the simplest (straight narrative, not too much verse or endnotes) and ~1 year (thousands of endnotes, pages of verse, drama, in-line references to book titles, use of technical terms, etc)
The step-by-step: https://standardebooks.org/contribute/producing-an-ebook-ste...
In a nutshell: start with a Project Gutenberg text, clean it up to a high standard, have it peer reviewed and published
However, I was completely shot down by the local library when I was discussing it with them. They said they already had a photo copy and didn’t need anymore digital editions, I tried to explain the benefits of having it in a machine readable format but they wouldn’t entertain it. I completed the project for me, so I wasn’t too bothered, but thought they might have been interested in archiving it but they weren’t.
My general feeling is that they didn’t like an outsider contributing and touching on a format they didn’t know so got slightly defensive.
Beyond that, if the material is public domain, that library is called The Internet. Post it and promote it. The only reason to seek association with a library is if you're looking for cred for some reason, and that's not the business they're in.
If it's not public domain, or if you haven't marked your derivative work public domain, then you put a library in an awkward position. Realize that these are the types of people who still post little notes by the copy machines saying what's permissible and enjoy policing it.
Most just say no for the same reason that Hollywood returns ideas and scripts unopened. They're busy and the cost/benefit isn't there.
Although the self-described online ones tend to play fast and loose, real librarians have a formal code of ethics which is worth reviewing.
In my case I picked a title from the project’s wishlist and almost started, but searching the mailing list showed that someone had just started on it. I found another title by the same author: https://groups.google.com/g/standardebooks/c/IP0emhSQ6Bw/m/B...
2017, 441 points, 97 comments https://news.ycombinator.com/item?id=14570035
2019, 820 points, 131 comments https://news.ycombinator.com/item?id=20594802
2022, 1578 points, 256 comments https://news.ycombinator.com/item?id=32215324
2024, 701 points, 154 comments https://news.ycombinator.com/item?id=38831219
https://standardebooks.org/help/how-to-use-our-ebooks#kobo-f...
It kills me that Kobo is so close to having plain epubs rendered with Webkit but for some reason they just won't take the leap!
Recently Calibre was updated to convert things to kepub when loading to Kobo devices - see https://www.omgubuntu.co.uk/2025/03/calibre-update-convert-k... - but I haven't heard anything about Kobo itself doing anything to improve this.
See also Global Grey ebooks: https://www.globalgreyebooks.com/ One woman has formatted hundreds of ebooks herself.
There is a huge world of out-of-copyright non-English texts, and Project Gutenberg has many thousands of them. I wonder if any interest could be generated to help bring them in by posting on foreign language subreddits or something.
I understand if the existing editors can't personally proofread the submissions, but that's why peer-review exists. Or an open-source project in general where people can post corrections. Jimbo Wales didn't need to speak a hundred languages to launch Wikipedia.
Besides, projects in other languages can absolutely build upon Standard Ebooks work, but to expect Standard Ebooks itself to support other languages is just too outside the scope and expertise of the volunteers available.
As it is now, Alex is editorially responsible for all output of Standard Ebooks. Changing that would require someone with the time and experience willing to take on all the responsibilities that Alex currently has for each of those other languages.
The website and toolchain are open source, so if someone would build an international version, and do it persistently, I'm sure they would link or maybe even merge the projects a bit.
You can self host the server, and it will create epub3s with the audiobook and ebook synced up.
Then you use the mobile app to listen and read the books. It works way better than whispersync from kindle.
Read on your boox e reader then switch to your phone and listen and everything syncs up seamlessly.
A screenshot from the typography section:
but no, the manual itself is not really mobile-friendly. you can check what an actual ebook would look like though:
https://standardebooks.org/ebooks/louis-couperus/the-tour/al...
This is a leading you'd see on the ingredients list of an energy bar packaging.
The other choices are fine.
Caveat: I studied typography and worked in that field for a decade.
however, contributions are very welcome and everything is hosted on GitHub if you'd like to suggest improvements; or send your thoughts on the mailing list
Or if you think it actually was, this was not a project that I'd want to get involved in.
As someone who reads mostly ePubs, many of which suffer from issues this project aims to fix, I mean that in a very caring way.
from my own experience, Alex is very amenable to improvements. the online view of the ebooks is probably just not used by anyone to actually read the books (just use an ereader app or device, it's a way better experience anyway) and because of that no one has cared to point it out until now
https://standardebooks.org/donate
> Sponsoring a new ebook of your choice calls for a donation of $900 + $0.02 per word over the first 100,000
How much less would you do it for?
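To make the quoted pricing concrete, here is a small worked sketch. It assumes "over the first 100,000" means each word beyond 100,000 costs an extra $0.02; that reading is mine, not confirmed by the donate page, and `sponsorship_cost` is just an illustrative name.

```python
def sponsorship_cost(word_count, base_cents=90_000, cents_per_extra_word=2,
                     included_words=100_000):
    """Quoted formula: $900 + $0.02 per word over the first 100,000.

    Works in whole cents to avoid floating-point rounding, then
    returns the total in dollars.
    """
    extra_words = max(0, word_count - included_words)
    total_cents = base_cents + extra_words * cents_per_extra_word
    return total_cents / 100


if __name__ == "__main__":
    for words in (80_000, 100_000, 150_000):
        print(f"{words:>7} words -> ${sponsorship_cost(words):,.2f}")
```

Under that reading, anything up to 100,000 words is a flat $900, and a 150,000-word book comes to $900 + 50,000 × $0.02 = $1,900.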
Scans suck though, even a badly OCR’ed EPUB is way better.
Scanning is not transformative and does not result in a derivative work which can be protected by copyright law.
https://en.wikipedia.org/wiki/Wikipedia:Scanning_an_image_do...
https://law.stackexchange.com/questions/1214/who-owns-a-copy... points us to read the Compendium of US Copyright Office Practices at https://www.copyright.gov/comp3/docs/compendium.pdf
> 313.4(A) Mere Copies
> A work that is a mere copy of another work of authorship is not copyrightable. The Office cannot register a work that has been merely copied from another work of authorship without any additional original authorship. See L. Batlin & Son, 536 F.2d at 490 (“one who has slavishly or mechanically copied from others may not claim to be an author”); Bridgeman Art Library, Ltd. v. Corel Corp., 36 F. Supp. 2d 191, 195 (S.D.N.Y. 1999) (“exact photographic copies of public domain works of art would not be copyrightable under United States law because they are not original”).
But you wrote "scan". Adding an OCR'ed text layer, or doing manual proofreading and layout ("sweat of the brow") is not sufficiently transformative to have copyright protection.
And we were specifically talking about scans of old books stored in shadow libraries.
> Of all these projects, the most amenable to automatic typesetting are those produced by Standard Ebooks and HTML Writers Guild. The benefit of using HTML Writers Guild is their semantic markup and simple document type definition (DTD) file. Standard Ebooks, as the name suggests, are brilliantly standardized and have an excellent Manual of Style that describes what to expect from the XHTML.
Huge, wildly irregular word gaps, awkward hyphenation, stacked hyphen breaks.
I suspect most of this could be improved simply by letting go of the slavish attachment to full justification and expanding the column to a more reasonable width by reducing the font size and margins.
I'd love to see at least:
- character: ID, Name, Gender, Age
- mood: ID, Name (Happy, Sad, Angry, ...)
- place: ID, Name, Acoustic (Outside, Inside, Cave, ...)
This could be prepared by the author, work as a glossary, enrich the whole ebook experience, and also would be great preparation for teaching AI voices how to convert a book into an audiobook.
Also, the thing from the above post that stood out to me is that it could act as a reminder for the reader. Not so much the location and emotion, but the character data. I've often found myself wondering who the character is that's appeared in a scene, forgetting that they appeared earlier.
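A minimal sketch of how the three record types listed above might be modelled, including the "who was this again?" lookup. Every class and method name here is hypothetical, and nothing like this exists in the EPUB format or the Standard Ebooks toolchain as far as I know.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    id: str
    name: str
    gender: str
    age: int


@dataclass
class Mood:
    id: str
    name: str  # e.g. "Happy", "Sad", "Angry"


@dataclass
class Place:
    id: str
    name: str
    acoustic: str  # e.g. "Outside", "Inside", "Cave"


@dataclass
class BookMetadata:
    """Hypothetical per-book glossary keyed by record ID."""
    characters: dict = field(default_factory=dict)
    moods: dict = field(default_factory=dict)
    places: dict = field(default_factory=dict)

    def add_character(self, character):
        self.characters[character.id] = character

    def describe_character(self, character_id):
        """Short reminder text a reader app could show on demand."""
        character = self.characters.get(character_id)
        if character is None:
            return "unknown character"
        return f"{character.name} ({character.gender}, {character.age})"


if __name__ == "__main__":
    meta = BookMetadata()
    meta.add_character(Character(id="c1", name="Jane", gender="female", age=30))
    print(meta.describe_character("c1"))
```

Scenes in the text would then only need to reference IDs such as "c1", and a reader app or a text-to-speech pipeline could look up the rest.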
If it can’t be derived from the book text, then it’s extra content that probably shouldn’t be there because it came from elsewhere.
It begins with "Other free ebooks don’t put much effort into..." which sounds extremely catty.
Maybe I'm reading too much into it, but it seems there's a way to stand on other people's shoulders and celebrate each other.
Maybe TikTok ruined me but maybe these things really do literally have a shelf life. Hopefully reformatting will help. Perhaps a better way to review and find the gems would be most helpful.
Perhaps it would be useful to have expertly abridged and modernized versions of (e)books, with interpreter's notes for each change.
A good AI can do this for you nowadays. So if anything it's nice to have the original version available.
I’m interested in a similar approach for a rare book library, but funding for staff is a real challenge so we want to make some kind of revenue stream.
Too bad most stuff I really like will never enter the public domain in my lifetime... well, paper and the high seas still exist!
there are whole generations of wonderful and insightful works that essentially disappeared from present consciousness for no reason other than for being old
Each renderer differs in capabilities, and most are stuck in a subset of early-2000s capabilities, so designing an ebook is very much like designing for the 90s era web. Lots of hacks are required to get the same file to look good on many different renderers, and achieving that is one of the goals of Standard Ebooks.
Also, xhtml is just markup. It doesn’t mean you have to support all the possible tags and styles of modern html and css. It would be a sensible choice even if you had basic needs. You just parse it into whatever representation you want.
And so it's not a programming language runtime (i.e. javascript or wasm), nor a css renderer, nor a bunch of web-apis.
It's these things, not the (X)HTML parsing and rendering, that make a browser the complex thing it is.
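To illustrate the "just parse it into whatever representation you want" point, here is a minimal sketch using only Python's standard library. The sample document and the `extract_blocks` helper are made up for illustration, and it assumes the XHTML is well-formed XML with no named HTML entities; it is not how any particular e-reader works.

```python
import xml.etree.ElementTree as ET

# Made-up sample chapter; real ebook files carry far more markup.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>Chapter I</h2>
<p>It was a <em>dark</em> night.</p>
<p>Nothing stirred.</p>
</body>
</html>"""

BLOCK_TAGS = {"h1", "h2", "h3", "p"}


def extract_blocks(xhtml_text):
    """Parse XHTML and return a flat list of (tag, text) blocks."""
    root = ET.fromstring(xhtml_text)
    blocks = []
    for element in root.iter():
        # ElementTree reports tags as "{namespace}name"; keep only the name.
        tag = element.tag.rsplit("}", 1)[-1]
        if tag in BLOCK_TAGS:
            text = "".join(element.itertext())
            blocks.append((tag, " ".join(text.split())))
    return blocks


if __name__ == "__main__":
    for tag, text in extract_blocks(SAMPLE):
        print(f"{tag}: {text}")
```

No script engine, CSS engine, or web API is involved: the parser hands back a tree, and the reader decides what to do with the handful of tags it cares about.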
> [T]he third comment violates the Cooperative Principle, specifically Grice’s Maxims of Relation and Manner, and ends up implying ignorance where there is none. Let’s break it down a bit more with that framework in mind:
> VIOLATION OF GRICE’S MAXIMS
> The second ["EPUB folks have continued to evolve their bespoke format instead of ditching it for something that legacy browsers already know how to handle"] commenter criticizes EPUB for continuing to evolve a packaging format that is not browser-native. They're not confused about what EPUB is—they're lamenting that it isn’t something simpler, like a plain web bundle a browser could just open.
> The third commenter responds by explaining what EPUB is, as if that somehow rebuts the original critique.
> Factually true.
> Entirely irrelevant in context.
> This failure to meet the relevance standard creates an implicature: the previous commenter must not have understood the format they were critiquing.
> THE IMPLICATURE TRAPS THE THIRD COMMENTER
> By stating something the second commenter obviously already knows, the third commenter unintentionally shifts the conversational footing in a way that belittles rather than builds. That’s why the tone feels off: not because of overt rudeness, but because the presupposition of ignorance is baked into the structure of the reply.
> FINAL THOUGHT
> The third comment reads like an attempted “correction,” but since the original comment didn’t contain a factual error, only a value judgment or proposal, this “correction” becomes a non sequitur—one that subtly undermines the prior speaker’s credibility while failing to address their actual point. That’s what makes it rhetorically broken, even if factually fine.
as for (2) I'm not sure why you think it would make it less easy? being html, text reflows automatically based on screen size, font size, line height, etc
also that things like footnotes or anything that has a floating reference (table of contents links for example) might get very complex or require javascript
footnotes aren't really a thing with ebooks (at least as far as displaying the note on the page with the text). Because it is just an html renderer, footnotes are presented as mutually linked <a> elements, with the notes themselves located in the endnotes at the end of the book
Thank you to everyone who helps put this together!
My bro-in-law supported his family as a freelance editor for years while my sister was doing the "maternity leave" thing so I know there's a non-trivial amount of work that goes into book editing. Cutting out some of that human labor seems like a good thing for a volunteer project.
the vast majority of textual tooling is regex-galore, but there is also automated epub tooling in there too
thanks for being open ...I guess