The unXMLing of digital books

Back in January we announced that the fantastic publishing technology team from PubFactory had joined Safari Books Online. Since then we’ve been hard at work integrating the team into our systems, and they’ve been hard at work building and maintaining search and reference products for their clients in academic publishing.

It’s been a singular experience for me as these are my former colleagues: I worked at iFactory for a number of years as a software engineer. That was my first job connected to publishing. Before that I would have self-identified as a generic “web developer.” While I had always tried to work on web projects that mattered, it was clear to me after my very first publishing project that I’d found my industry. I started Threepress in 2008 to work as a digital publishing technologist.

Threepress specialized in ebook formats and ereaders, while the PubFactory team serves reference and academic publishers. It’s been instructive for me to compare how these two worlds have diverged or converged in the five years since I last worked in the reference field.

Books aren’t data

The EPUB format is strictly XML-based. From the metadata to the table of contents to the book content, an EPUB file must be almost entirely composed of text marked up in well-defined XML schemas. Those schemas allow the EPUB book to be validated by a computer program that follows the schema and other well-defined business rules, ensuring consistent production. At the other end of the workflow, those same schemas would assure reading systems of the predictability of the books added to them.
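As a rough illustration of the kind of mechanical check this makes possible, the sketch below parses a deliberately tiny, hypothetical package document and reports which of the required Dublin Core metadata elements (title, identifier, language) are missing. It uses only Python's standard library; `SAMPLE_OPF` and `missing_metadata` are names invented for this example, and a real validator checks far more than this.

```python
import xml.etree.ElementTree as ET

OPF = "http://www.idpf.org/2007/opf"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical package document, invented for this example.
# It is well-formed XML but has no <dc:identifier>.
SAMPLE_OPF = f"""<package xmlns="{OPF}" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="{DC}">
    <dc:title>A Beach Novel</dc:title>
    <dc:language>en</dc:language>
  </metadata>
</package>"""

REQUIRED = ("title", "identifier", "language")


def missing_metadata(opf_xml: str) -> list[str]:
    """Return the required Dublin Core elements absent from the metadata."""
    # ET.fromstring raises ParseError if the input is not well-formed XML.
    root = ET.fromstring(opf_xml)
    metadata = root.find(f"{{{OPF}}}metadata")
    if metadata is None:
        return list(REQUIRED)
    return [name for name in REQUIRED
            if metadata.find(f"{{{DC}}}{name}") is None]


print(missing_metadata(SAMPLE_OPF))  # ['identifier']
```

The point is only that a rule such as "every package has an identifier" can be enforced by a program, which is what makes XML attractive at production time.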

EPUB 2 was released in 2007, though its design history extends back to the 1990s. At that time, academic publishers were among the only publishers producing and exchanging book data with retailers, mostly via library aggregators and portals. Those became natural models for the commercial ebook industry that did not yet exist. Outside of publishing, XML was “obviously” on a path to overtake historically messy HTML, and so aligning with XML was aligning with the future of web standards.

These were all reasonable assumptions based on the shape of the digital publishing industry when EPUB 2 and its predecessors were codified.

At that time, trade book publishers largely had no need for textual markup. It was not a part of their production workflow, nor was it natively how they produced “digital books”, which with few exceptions were always PDFs. (Safari Books Online was one of those exceptions as we initially required DocBook XML, but we eventually accepted PDF and later EPUB.)

Why is XML so foreign to trade publishers?

XML excels as a data exchange format for textual content with hierarchy. Dictionary entries and journal articles are data. Dictionary entries and journal articles are regular. Even when somewhat unstructured, as in a research paper, the work still has a predictable shape, and its primary goal is information exchange.

A trade book is not data. Even non-fiction trade is a work of human creativity with unpredictable contours. In programming terms, most books are BLOBs, opaque shadowy things that can be moved from system to system but whose contents cannot be inspected in a mechanical way.

Novelists don’t create data. They create books.

Books can’t be wrong

Strict XHTML as a book markup format was the solution to a problem that didn’t exist. It didn’t fit neatly into an XML-based workflow because most book publishers didn’t speak XML anyway. It didn’t align with the direction of web standards, which abandoned an XML-centric approach for good in 2009. It didn’t make ebook consumption any easier for ereaders, because the challenges in ebook display are in the CSS and UI layers. And it didn’t make writing an ereader any easier because embeddable web browsers quickly became the de facto rendering engine, and those already excelled at rendering plain old HTML.

By far the biggest advantage of XML workflows is at the time of production, where one can validate that the XML document contains all of the data that is expected in the correct order, format, and position in the hierarchy.

Books aren’t actually subject to these constraints. You can’t write an XML schema to validate that a book has one or more chapters, as it may have no chapters at all. It may not have an author. It may not have any words. It may not have pages.

(I’d go on, but any discussion of the heterogeneity of books inevitably devolves into one of those tedious “What is a book?” slides at publishing conferences.)

Books can’t be right

An ebook application can’t do a lot of things that an XML-driven reference application can. In design meetings I find myself striking out interesting feature after feature: we can’t aggregate index terms across a corpus because there’s no standardized EPUB markup for them. We can’t apply a consistent style to chapter titles because of incompetent, un-semantic markup like <p class="header">. We can’t extract quotable epigraphs or context-highlight code samples or anything that my PubFactory colleagues can dream up with their neatly ordered, well-defined XML inputs. EPUB content is a BLOB.
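To make the chapter-title problem concrete, here is a minimal sketch of the guesswork that un-semantic markup forces on a reading system. It is not how any particular product works: the class-name hints, `HeadingGuesser`, and `guess_headings` are all assumptions made up for this example, and a real system would need far more cases than these.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}
# Assumed class-name hints; real books use many other conventions.
HEADING_CLASS_HINTS = ("header", "title", "chapter")


class HeadingGuesser(HTMLParser):
    """Collect text from elements that merely look like headings."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._capturing_tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._capturing_tag is not None:
            return  # already inside a candidate heading
        classes = (dict(attrs).get("class") or "").lower()
        looks_like_heading = tag in HEADING_TAGS or any(
            hint in classes for hint in HEADING_CLASS_HINTS)
        if looks_like_heading:
            self._capturing_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capturing_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._capturing_tag:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append(text)
            self._capturing_tag = None


def guess_headings(html):
    parser = HeadingGuesser()
    parser.feed(html)
    parser.close()
    return parser.headings


SAMPLE = ('<p class="header">Chapter One</p>'
          '<p>It was a dark night.</p>'
          '<h2>Chapter Two</h2>')
print(guess_headings(SAMPLE))  # ['Chapter One', 'Chapter Two']
```

Every entry in the hint list is a guess about one publisher's habits, so the list never ends. A well-defined XML vocabulary replaces it with a single element name.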

Some ebook systems do apply consistent styling or extract interesting information out of books, but that’s powered either by a huge amount of invisible human effort or a lot of advanced machine learning and heuristics. That capability doesn’t flow naturally out of the markup.

On the other hand, I can throw just about anything even resembling an EPUB book at our reading system — even if it’s completely invalid with HTML tag soup — and it’ll load. We have very little preprocessing necessary; XSLT, which is hard to learn and harder to master, is almost absent from our workflow. And users can upload their own books from anywhere else in the publishing ecosystem.
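The difference between strict and forgiving parsing can be shown with Python's standard library alone. This sketch does not describe our reading system's actual pipeline; `TAG_SOUP`, `is_well_formed_xml`, and `extract_text` are names invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Invalid on purpose: an unclosed <p>, a bare <br>, and misnested <b>/<i>.
TAG_SOUP = "<p>An unclosed paragraph<br><b>bold <i>and italic</b></i>"


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TextCollector(HTMLParser):
    """Lenient HTML parser that simply gathers the text it encounters."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


print(is_well_formed_xml(TAG_SOUP))  # False
print(extract_text(TAG_SOUP))
```

The strict parser rejects the whole document at the first mismatched tag, while the lenient one still recovers all of the text. Browser rendering engines take the second approach, which is why tag soup still loads.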

The paperback ebook

Since EPUB emerged, a variety of simpler formats have been proposed, usually by individuals from the technology industry. They do a better job of solving the problem of book production by capable amateurs, but don’t serve the diverse needs of the publishing industry that EPUB represents: the print-disabled who need rich semantic markup, library catalog systems that want to analyze highly granular metadata, fixed layout books, multi-lingual books, graphic novels, interactive textbooks, and on and on. Full-blown EPUB solves real problems, but as John Maxwell put it at Books in Browsers 2012, XML is a format that serves incumbents.

I hope that the next revision of EPUB allows HTML5 markup, without the leading X-, as I don’t think that XML requirement is solving any problems for anyone. Rich metadata, on the other hand, offers a great deal to the ecosystem, and is a reasonable tradeoff for authoring complexity.

Until we have an EPUB sans XHTML, it’s worth considering a lightweight subset of the format, one that represents a convention over configuration approach. A “microformat” version (EPUB: the beach novel edition) could be mechanically “upsampled” into big boy EPUB for use in the real ecosystem. It won’t solve the problem of heterogeneity in books (which is, after all, not actually a problem except to reading system developers), but it could make it easier for even experienced ebook authors to create publications without firing up an XML editor, for the majority of books that have very simple metadata requirements. I’ll outline some ideas for that in a future post.
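Purely as a thought experiment, and not a proposal for what such a subset would actually look like, “upsampling” might mean expanding a title and a list of chapter files into a package document, with everything else filled in by convention. The `upsample` function, its parameters, and its defaults are hypothetical. The sketch produces only the package document: a complete EPUB 2 also needs an NCX table of contents, a container file, and the mimetype entry, all omitted here.

```python
import uuid
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Serialize OPF elements without a prefix and Dublin Core with "dc:".
ET.register_namespace("", OPF_NS)
ET.register_namespace("dc", DC_NS)


def upsample(title, chapter_files, language="en", identifier=None):
    """Hypothetical helper: expand minimal inputs into a package document."""
    # Convention over configuration: invent an identifier if none is given.
    identifier = identifier or f"urn:uuid:{uuid.uuid4()}"

    package = ET.Element(
        f"{{{OPF_NS}}}package",
        {"version": "2.0", "unique-identifier": "bookid"})
    metadata = ET.SubElement(package, f"{{{OPF_NS}}}metadata")
    ET.SubElement(metadata, f"{{{DC_NS}}}title").text = title
    ET.SubElement(
        metadata, f"{{{DC_NS}}}identifier", {"id": "bookid"}).text = identifier
    ET.SubElement(metadata, f"{{{DC_NS}}}language").text = language

    manifest = ET.SubElement(package, f"{{{OPF_NS}}}manifest")
    spine = ET.SubElement(package, f"{{{OPF_NS}}}spine")
    # Reading order is simply the order in which the files were listed.
    for number, href in enumerate(chapter_files, start=1):
        item_id = f"chapter{number}"
        ET.SubElement(
            manifest, f"{{{OPF_NS}}}item",
            {"id": item_id, "href": href,
             "media-type": "application/xhtml+xml"})
        ET.SubElement(spine, f"{{{OPF_NS}}}itemref", {"idref": item_id})

    return ET.tostring(package, encoding="unicode")


print(upsample("A Beach Novel", ["ch1.xhtml", "ch2.xhtml"]))
```

The author supplies two things and the tool supplies the rest, which is the whole appeal of convention over configuration.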

9 thoughts on “The unXMLing of digital books”

  1. I’m not quite getting why xml is bad. You gloss over that most of the features you want are coming in Epub 3. The fact that the reading systems have been slow to support it isn’t really the format’s fault. Sure, the idpf is still calling for xhtml, but all that really means is that they are insisting on a stricter coding standard. Is that so bad, really? In what way is sloppy coding better?

    Sure, books are heterogeneous, but they still tend to be made up of text and other elements that can be meaningfully tagged. They might have no chapters, but epub doesn’t require chapters. It requires a nav document, but you don’t have to put anything in there you don’t want to.

    And as to “Xml serves the incumbents,” I find it hard to take that seriously from a guy presenting his, um, argument as a series of one-sentence Powerpoint slides. On the web. I mean, really. Does he think _that’s_ better than xml?

  2. Pingback: XML and its place in Publishing - Is there need for Structured content

  3. I was really sold on the idea of XML for books in the late 90s and 00s. But these days I agree that XML has not proven to be the best solution for books, for many reasons. Today’s evolving HTML/EPUB standards seem to show real promise that everything a book needs will be achievable with HTML… making now a pretty exciting time to be in digital publishing.

  4. Thanks for this very interesting read. Coming to ebooks from a math perspective [disclaimer: I work for MathJax], I see the very specific problem that MathML still struggles on the authoring side (precisely because it is XML) — and even more so on the browser side (with Chrome deactivating MathML today…). This prevents not only mathematical but scientific content from entering ebooks in a native, re-usable, interactive and accessible fashion; a major problem for ebooks in education.

  5. Great post Liza – really got me thinking about all the time I’ve spent tooling up on XML technology over the last 10 years! I agree there are many ways it hasn’t lived up to its promise; I often feel we have a solution looking for a problem. But I don’t think you really made a case for why EPUB would be better if it wasn’t a requirement for its markup to be X-HTML? Is there a production problem?

    • Sure there is. There are a lot of mature tools to produce web content that produce “good-enough” HTML for browser consumption (like WordPress!) that absolutely stumble on producing strict XHTML — particularly the XHTML 1.1 flavor required by EPUB 2. Something that should be as simple as outputting a blog as an EPUB bundle becomes a major engineering challenge.

      At iFactory I remember struggling to produce the elusive XML editorial management systems that included “WYSIWYG” XML editors. This is the same problem, but with no clear purpose other than to conform with validation requirements.

  6. Pingback: What publishing needs from the web (and how you can help) | Safari Flow Blog

  7. Pingback: XSLT and e-publishing, past and future | Eat Your Vegetables
