Can you say a few more words about the library https://github.com/standardebooks/tools ? Can it generate ePub3 from Markdown files, or do I have to feed it HTML already? Any repo with usage examples of the `--white-label` option would be nice.
There’s more info on the build process at https://news.ycombinator.com/item?id=46469341
Others have requested a list of features supported by different e-readers. I second this request.
I don't call that fragile, I call that well-founded. It has always perturbed me that, when encountering an error, HTML parsers will guess what they think you meant instead of throwing it back. I don't want my parser to do guesswork with potentially undefined behavior. I don't want my mistakes to be obscured so they can later come back to bite me - I want to be called out on issues loud and clear before my users see them.
Perhaps it makes sense in the context of manually authored markup written with minimal effort, so I can see why the choice was made. These days it's yet another reason why the web is a precarious pile of sticks. HTML freely lets you put a broken, oddly shaped stick right in the middle and topple the whole stack.
The people turning the web from a handcrafted document sharing system into the world's premiere application platform should have made XHTML win.
ePub is in a nice place: the number of documents to check for errors is reasonable, and the resulting artefact is designed to be shipped and never (or rarely) amended. That means that we can shift the balance towards strict parsing. But for a web site of thousands (or millions) of documents that are being amended regularly, the balance shifts back to loose parsing as the best way of meeting user needs.
Having strictly parsed HTML from the start would be fine. You'd check it before you ship it and you'd make sure it's valid.
Requiring it now would be a disaster, of course. There's so much malformed HTML out there. But making HTML parsers accept garbage at the beginning was the wrong choice.
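To make that concrete, here's a minimal sketch of the "check it before you ship it" step, assuming the documents are XHTML and using only Python's standard library. A real setup would run something like this in CI and fail the build on the first well-formedness error:

```python
# Hypothetical pre-ship check: parse every document strictly and refuse to
# ship if any of them is malformed. xml.etree is a strict XML parser, so an
# unclosed tag, bad nesting, or a stray ampersand raises ParseError.
import sys
import xml.etree.ElementTree as ET

def check(path: str) -> bool:
    try:
        ET.parse(path)  # raises ET.ParseError on malformed markup
        return True
    except ET.ParseError as err:
        print(f"{path}: {err}", file=sys.stderr)
        return False

if __name__ == "__main__":
    results = [check(path) for path in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```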
Why? Why do car rentals typically not require cash up front, but hotel rentals universally do? The economics are similar. Sometimes it's simple path dependency. Something is a certain way because it's always been that way and it'd be too expensive to change it now.
At least browsers all use the same loosey-goosey HTML parsing now. It was hell on earth when each browser had its own error recovery strategy. In a sense, there is no longer any such thing as an invalid HTML document: every code point sequence has a canonical interpretation now (sketch below).
Unknown tags/attributes, on the other hand, may be ignored for future compatibility... but I believe XHTML ignores them already.
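The "canonical interpretation" bit is easy to check in practice. A quick sketch, assuming Python with the third-party html5lib package (which implements the HTML5 parsing algorithm):

```python
# html5lib follows the same error-recovery rules as the browsers, so misnested
# tags come out of the parser the same way they come out of any conforming
# engine.
import html5lib
import xml.etree.ElementTree as ET

tree = html5lib.parse("<p><b>bold <i>both</b> italic</i>",
                      treebuilder="etree", namespaceHTMLElements=False)
print(ET.tostring(tree, encoding="unicode"))
# The <i> element is split so the tree ends up well nested (the spec's
# "adoption agency" recovery), identically in every HTML5 parser.
```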
1. If you are authoring an XHTML file, yes, you want the renderer to be as picky as possible and yell loud and clear if you make a mistake. This helps you produce a valid, well-formed document.
2. If you are an end user reading an XHTML file, it's not your file, it's not your fault if there are bugs, and there's jack shit you can do to fix it. You just want the browser to do its best to show you something reasonable so you can read the page and get on with your life.
XHTML optimizes for 1 at the expense of 2. HTML5 optimizes for 2 at the expense of 1.
But add a few angled braces in there and lord have-a mercy, ain’t nobody can understand this ampersand mumbo jumbo, I wanna hand write my documents and generate wutever, yous better jus deal with it gosh dangit.
I prefer the current situation too, but I still think it's funny that somehow we just never bought into serializers for HTML. Maybe the idea was before its time? I'm sure you'd have no such parsing problems in the wild if you introduced JTML now. Clearly people know how to serialize.
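For what it's worth, here's a toy sketch of what "serializing HTML" instead of hand-writing it looks like, using only Python's standard library; the structure is purely illustrative and not any actual "JTML" proposal:

```python
# Build the document as a tree in code and let a serializer emit the markup,
# the same way json.dumps emits JSON. Escaping and tag closing are handled by
# the library, so the output is well formed by construction.
import json
from xml.etree import ElementTree as ET

page = ET.Element("html")
body = ET.SubElement(page, "body")
para = ET.SubElement(body, "p")
para.text = "Hello & welcome"  # the serializer escapes the ampersand for us

print(ET.tostring(page, encoding="unicode"))
# -> <html><body><p>Hello &amp; welcome</p></body></html>

# The JSON analogy: nobody hand-types the braces, so nobody ships broken JSON.
print(json.dumps({"tag": "p", "text": "Hello & welcome"}))
```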
The "some reason" is that JSON is most often produced by code, written by programmers.
That's not the case for HTML which is often hand-authored directly by non-programmer users. Most people on Earth literally don't know what the word "syntax" means, much less feel comfortable dealing with syntax errors in text files.
Instead, it seems your choices are either:
(1) errors cause random changes on the page, such as text squishing into a small box, whole paragraphs going missing, random words appearing on the page, styles being applied to the wrong text, etc.
(2) errors cause the page to fail to load, with no text shown at all
Both of those are pretty bad, but I wish the web had gone with (2) instead of (1), because in that case a whole ecosystem of tooling would have appeared... Imagine an "auto fixup" feature in web servers which "fixes up" bad HTML into good HTML. For basic users, it looks the same as today. But instead of the fixup and guesswork being done by users' browsers (which the author has no control over), it would be done by the author's web host (which the author can upgrade or not upgrade as needed...)
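A rough sketch of the kind of server-side "auto fixup" being imagined here, assuming Python with the third-party html5lib package: parse the tag soup with browser-style recovery, then re-serialize it as well-formed markup a strict client could accept.

```python
# Parse bad HTML the way a browser would, then write it back out as
# well-formed markup. The idea is to run this on the author's side (web host
# or publish step) so the guesswork happens once, before the document ships.
import html5lib
import xml.etree.ElementTree as ET

def fix_up(tag_soup: str) -> str:
    tree = html5lib.parse(tag_soup, treebuilder="etree",
                          namespaceHTMLElements=False)
    return ET.tostring(tree, encoding="unicode")

broken = "<p>Unclosed paragraph<p>Another one <b>bold <i>nested</b> wrong</i>"
print(fix_up(broken))  # every tag closed, nesting repaired
```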
Here's the thing though: if all XHTML clients are strict about it, then the content is broken for EVERYONE, which presumably means it gets noticed pretty quickly as long as the site is being maintained by anyone.
Compare that to HTML, where if a page is doing something wrong but in a way that happens to work in Webkit/Blink while it barfs all over the place in Gecko, it could go ignored for ages. Those of us who are old enough remember an era when a huge number of web sites only targeted Trident and didn't care in the slightest whether they worked in any other engine.
There has to be an opposite to Postel's Law that acknowledges it's better in some cases to ensure that breakage for anybody becomes breakage for everybody, because that means the breakage can't be ignored.
Imagine two file formats competing to be "the web":
* In file format A, the syntax is narrowly specified and any irregularity causes it not to be rendered at all.
* In file format B, there are many ways to express the same thing and many syntactic forms are optional and can be omitted while leaving a page that still renders.
Now imagine millions of non-technical users trying to make websites. They randomly pick A or B. Which ones do you think are more likely to get to a point where they have a page that renders and that they can put online? It's B.
Even though format B offends our software-engineer sensibilities, in an evolutionary sense it is much more robust and able to thrive, because more fallible users are able to successfully write working files with it. It will completely dominate once network effects get involved.
Oh, and JavaScript.
XML didn't "emerge" and then get repurposed for HTML; it was designed from the start for new vocabularies on the web. The first sentence of the XML spec reads:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.