Can you say a few more words about the library https://github.com/standardebooks/tools ? Can it generate ePub3 from Markdown files, or do I have to feed it HTML already? Any repo with usage examples of the `--white-label` option would be nice.
There’s more info on the build process at https://news.ycombinator.com/item?id=46469341
Others have requested a list of features supported by different e-readers. I second this request.
I don't call that fragile, I call that well-founded. It has always perturbed me that, when encountering an error, HTML parsers will guess what they think you meant instead of throwing it back. I don't want my parser to do guesswork with potentially undefined behavior. I don't want my mistakes to be obscured so they can later come back to bite me - I want to be called out on issues loud and clear before my users see them.
Perhaps it makes sense in the context of manually authored markup written with minimal effort, so I can see why the choice was made. These days it's yet another reason why the web is a precarious pile of sticks. HTML freely lets you put a broken, oddly shaped stick right in the middle and topple the whole stack.
The people turning the web from a handcrafted document sharing system into the world's premiere application platform should have made XHTML win.
ePub is in a nice place: the number of documents to check for errors is reasonable, and the resulting artefact is designed to be shipped and never (or rarely) amended. That means that we can shift the balance towards strict parsing. But for a web site of thousands (or millions) of documents that are being amended regularly, the balance shifts back to loose parsing as the best way of meeting user needs.
Having strictly parsed HTML from the start would be fine. You'd check it before you ship it and you'd make sure it's valid.
Requiring it now would be a disaster, of course. There's so much malformed HTML out there. But making HTML parsers accept garbage at the beginning was the wrong choice.
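To make that concrete, here's a minimal sketch of the "check it before you ship it" step, assuming the documents are XHTML and using only Python's standard library. A real setup would run something like this in CI and fail the build on the first well-formedness error:

```python
# Hypothetical pre-ship check: parse every document strictly and refuse to
# ship if any of them is malformed. xml.etree is a strict XML parser, so an
# unclosed tag, bad nesting, or a stray ampersand raises ParseError.
import sys
import xml.etree.ElementTree as ET

def check(path: str) -> bool:
    try:
        ET.parse(path)  # raises ET.ParseError on malformed markup
        return True
    except ET.ParseError as err:
        print(f"{path}: {err}", file=sys.stderr)
        return False

if __name__ == "__main__":
    results = [check(path) for path in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)
```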
Why? Why do car rentals typically not require cash up front, but hotel rentals universally do? The economics are similar. Sometimes it's simple path dependency. Something is a certain way because it's always been that way and it'd be too expensive to change it now.
At least browsers all use the same loosey-goosey HTML parsing now. It was hell on earth when each browser had its own error recovery strategy. In a sense, there is no longer any such thing as an invalid HTML document: every code point sequence has a canonical interpretation now (sketch below).
Unknown tags/attributes, on the other hand, may be ignored for future compatibility... but I believe XHTML ignores them already.
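The "canonical interpretation" bit is easy to check in practice. A quick sketch, assuming Python with the third-party html5lib package (which implements the HTML5 parsing algorithm):

```python
# html5lib follows the same error-recovery rules as the browsers, so misnested
# tags come out of the parser the same way they come out of any conforming
# engine.
import html5lib
import xml.etree.ElementTree as ET

tree = html5lib.parse("<p><b>bold <i>both</b> italic</i>",
                      treebuilder="etree", namespaceHTMLElements=False)
print(ET.tostring(tree, encoding="unicode"))
# The <i> element is split so the tree ends up well nested (the spec's
# "adoption agency" recovery), identically in every HTML5 parser.
```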
1. If you are authoring an XHTML file, yes, you want the renderer to be as picky as possible and yell loud and clear if you make a mistake. This helps you produce a valid, well-formed document.
2. If you are an end user reading an XHTML file, it's not your file, it's not your fault if there are bugs, and there's jack shit you can do to fix it. You just want the browser to do its best to show you something reasonable so you can read the page and get on with your life.
XHTML optimizes for 1 at the expense of 2. HTML5 optimizes for 2 at the expense of 1.
But add a few angled braces in there and lord have-a mercy, ain’t nobody can understand this ampersand mumbo jumbo, I wanna hand write my documents and generate wutever, yous better jus deal with it gosh dangit.
I prefer the current situation too, but I still think it's funny that somehow we just never bought into serializers for HTML. Maybe the idea was before its time? I'm sure you'd have no such parsing problems in the wild if you introduced JTML now. Clearly people know how to serialize.
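For what it's worth, here's a toy sketch of what "serializing HTML" instead of hand-writing it looks like, using only Python's standard library; the structure is purely illustrative and not any actual "JTML" proposal:

```python
# Build the document as a tree in code and let a serializer emit the markup,
# the same way json.dumps emits JSON. Escaping and tag closing are handled by
# the library, so the output is well formed by construction.
import json
from xml.etree import ElementTree as ET

page = ET.Element("html")
body = ET.SubElement(page, "body")
para = ET.SubElement(body, "p")
para.text = "Hello & welcome"  # the serializer escapes the ampersand for us

print(ET.tostring(page, encoding="unicode"))
# -> <html><body><p>Hello &amp; welcome</p></body></html>

# The JSON analogy: nobody hand-types the braces, so nobody ships broken JSON.
print(json.dumps({"tag": "p", "text": "Hello & welcome"}))
```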
The "some reason" is that JSON is most often produced by code, written by programmers.
That's not the case for HTML which is often hand-authored directly by non-programmer users. Most people on Earth literally don't know what the word "syntax" means, much less feel comfortable dealing with syntax errors in text files.
Instead, it seems your choices are either:
(1) errors cause random changes on the page, such as text squishing into a small box, whole paragraphs going missing, random words appearing on the page, styles being applied to the wrong text, etc.
(2) errors cause the page to fail to load, with no text shown at all
Both of those are pretty bad, but I wish the web had gone with (2) instead of (1), because in that case a whole ecosystem of tooling would have appeared... Imagine an "auto fixup" feature in web servers which "fixes up" bad HTML into good HTML. For basic users, it looks the same as today. But instead of the fixup and guesswork being done by users' browsers (which the author has no control over), it would be done by the author's web host (which the author can upgrade or not upgrade as needed...)
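A rough sketch of the kind of server-side "auto fixup" being imagined here, assuming Python with the third-party html5lib package: parse the tag soup with browser-style recovery, then re-serialize it as well-formed markup a strict client could accept.

```python
# Parse bad HTML the way a browser would, then write it back out as
# well-formed markup. The idea is to run this on the author's side (web host
# or publish step) so the guesswork happens once, before the document ships.
import html5lib
import xml.etree.ElementTree as ET

def fix_up(tag_soup: str) -> str:
    tree = html5lib.parse(tag_soup, treebuilder="etree",
                          namespaceHTMLElements=False)
    return ET.tostring(tree, encoding="unicode")

broken = "<p>Unclosed paragraph<p>Another one <b>bold <i>nested</b> wrong</i>"
print(fix_up(broken))  # every tag closed, nesting repaired
```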
Here's the thing though: if all XHTML clients are strict about it, then the content is broken for EVERYONE, which presumably means it gets noticed pretty quickly as long as the site is being maintained by anyone.
Compare that to HTML, where if a page is doing something wrong but in a way that happens to work in Webkit/Blink while it barfs all over the place in Gecko, it could go ignored for ages. Those of us who are old enough remember an era when a huge number of web sites only targeted Trident and didn't care in the slightest whether they worked in any other engine.
There has to be an opposite to Postel's Law that acknowledges it's better in some cases to ensure that breakage for anybody becomes breakage for everybody, because that means the breakage can't be ignored.
Imagine two file formats competing to be "the web":
* In file format A, the syntax is narrowly specified and any irregularity causes it not to be rendered at all.
* In file format B, there are many ways to express the same thing and many syntactic forms are optional and can be omitted while leaving a page that still renders.
Now imagine millions of non-technical users trying to make websites. They randomly pick A or B. Which ones do you think are more likely to get to a point where they have a page that renders and that they can put online? It's B.
Even though format B offends our software-engineer sensibilities, in an evolutionary sense it is much more robust and able to thrive, because more fallible users are able to successfully write working files with it. It will completely dominate once network effects get involved.
Oh, and JavaScript.
XML didn't "emerge" and then get repurposed for HTML; it was designed from the start for new vocabularies on the web. The first sentence of the XML spec reads:
> The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.