What publishing needs from the web (and how you can help)

For a few months now I’ve served as co-chair of “DPUB”, the W3C Digital Publishing Interest Group, (with Markus Gylling, who somehow has time to be a wonderful CTO of two different standards organizations). DPUB acts as a channel for those of us in digital publishing to influence the development of web standards like HTML5 and CSS3. The group has already produced two public documents describing use cases for text layout and for annotations, which we’re quite proud of. But we’d like to do more, and we need your help.

Let us know what interests you (and please join the public mailing list).

New Books and Videos on Flow

In reviewing customer feedback, we’ve come to appreciate just how important it is that Flow offers a large, diverse, and growing selection of books and videos. We’re glad to receive that feedback because we built Flow to recommend an unending and evolving stream of content on our favorite topics. At Safari, we’re lucky enough to partner with some of the best publishers in the world, and in 2014, we’ll be adding more content more often so that the “flow” never stops.

Over the past week, we’ve added over three thousand new books and videos to Flow, many of them published in 2013 and early 2014. This latest batch of books and videos includes the kind of technical content you’ve come to expect from Safari. Read and watch on topics such as programming (including Java, C++, .NET, and Python), data science, DevOps, system administration, networking, Linux, mobile, web development and operations, information architecture, and sales development.

In addition to books and videos on technical topics, we’ve added over six hundred business-related titles, on topics such as startups and entrepreneurship, strategy for growth, web analytics, time management, distribution management, customer development, and sales and marketing to name just a few.

These new titles are just the beginning. We’ll be adding new books and videos on an ongoing basis. And in the coming weeks, we’ll be adding new and better ways to sort books and videos in Flow, so that you can find the newest stuff more easily. As always, we’d love to hear your opinion, so send your feature or content requests our way via email or Twitter.

FutureBook 2013 wrap-up

[Ed. Thank you Richard Nash for attending the event and writing this guest post for us.]

As much of what is to be found here on Safari began its life as a book, it behooves us to attend conferences about what the book is, now and into the future. So last week I attended the FutureBook conference in overcast, drizzly London.

Now, many conferences that use the word “Book” in the title are in fact less about books than they are about the publishing industry. The conflation of “book” with “publishing industry,” while historically reasonable, is proving increasingly awkward. Unlike “Books in Browsers,” where Peter & Keith gave their talk at the beginning of the month, this conference largely ignored the question of the book as a thing, as a file, as a means of transmission, focusing instead on how legacy publishers can best adapt to a world of abundant digital content.

The most compelling presentation of the day, though, had on the face of it nothing to do with the business of publishing as such. It was a talk whose superficial purpose was to address a recent scandal around erotica for sale on mainstream e-commerce sites. Individual authors, as well as publishers using tools designed for individual, self-publishing authors, have for years been publishing erotica for sale on the web in general but also, because of the nature of the e-reading hardware ecosystems, on the Amazon, B&N, and Kobo online bookstores. And for the more hard-core stuff, to avoid breaching terms of service, these people were giving fake titles and fake metadata, so that their books wouldn’t be rejected by the systems.

This had been going on for years, in fact, but in October a British tech site, mimicking the histrionics of popular British tabloid media, wrote an exposé of this phenomenon: “An Epidemic of Filth: How Amazon, Barnes & Noble, WHSmith, Waterstones and Foyles profit from breathtakingly obscene amateur paperbacks, e-books and audiobooks about rape, incest, bestiality and child abuse.”

Now, while it is not in fact the case that the retailers are making much, if any, money from the sales of this stuff, what was particularly unfortunate is that the uploaders would fake the metadata such that some of these books were ending up listed in the children’s categories. So the story spread across the U.K. media, onto the covers of tabloid newspapers and so on. While most retailers ignored the drama (since it was in fact nothing new and there was no evidence children had ever seen any of these titles), the British chain W.H. Smith decided to make a show of action and pulled down all the ebooks. All of them. The vendor who supplied them with ebooks is the Rakuten-owned, Toronto-based Kobo.

So Kobo had a dilemma, as their Chief Content Officer Michael Tamblyn described in his talk at Futurebook.

  1. You have several million titles.
  2. An unknown number of them contain sexual content, suggestive words or adult themes.
  3. Of those, a much, much smaller number have sexual content that is against your Terms of Use.
  4. For both 2 and 3, some are well-labeled and categorized. Some are not.

What were they to do?

In practice, over the course of a week, Kobo used a mixture of keyword search, semantic analysis, and manual inspection to remove some titles from the store, and W.H. Smith turned the ebook store back on.

In the limited sense, that is the end of the story. My discussion of Tamblyn’s talk began as just one of the several aspects of this conference I had intended to discuss in this post. But as I delved deeper into it, I realized that the entire conference could be summed up, like the world in a grain of sand, in Kobo’s dilemma. This wasn’t just Sturm und Drang in a teacup, for what we are living through is a moment in which the technology used to create and transmit culture is changing, and way more than just devices, way more than just businesses, have to evolve to figure out the new accommodations. There are gaps, lags, asymmetries. Current technology makes it easy to sell any digital file, but it’s not so easy to know what’s in it. Not so easy for a retailer to know if it is OK to be selling, or for a customer to know whether it is worth buying.

For as Tamblyn noted in his talk, in the age of scarcity, it was understood that everything was a choice: publishers chose what to publish, bookstores chose what to stock. But the current age strikes a different bargain:

But self-publishing is different. The natural promise of self-publishing is “yes, everything”. Whatever you can imagine. Whatever your story is. Whatever you think could be shared. However good or bad or tin-foil-hat-crazy or non-traditional or deviant or artistically groundbreaking. That’s part of the dream. And every book removed feels like a small step away from that, even if for the best of reasons. Even to the title that makes you lose your faith in humanity or throw up in your mouth.

In a sense, it is the same promise as the promise representative liberal democracy offers the world. “Yes, everything, so long as you do others no harm. And, yes, everyone can participate.” Technology makes publishing as easy as voting. When your vote gets taken away, you’re angry. Tamblyn continued:

Most authors were understanding. Some were angry. Some were loud. And they should be. In the physical world, to make a book go away is a big deal — you have to burn it or seize it at the border or confiscate it from a shop in a public, visual, galvanizing spectacle. But to de-list, to deactivate, to change a one to a zero, is silent and banal. We should be loud and we should ask why. Authors should give us and every other ebook retailer a hard time when it happens. Because it is so so so much easier now to make something disappear.

There are metaphors we’ve been able to safely lean on for centuries now. A library is, almost by definition, an archive. A book, almost by definition, is something that took a great deal of effort, intelligence, and purposefulness. We can no longer lean on these as truths, though. Accuracy, reliability, permanence: these are qualities we humans rely on. Words like book and library once vouchsafed them, but no longer. The reasons they no longer serve that function are, by and large, for the better—we now have more knowledge, more access, more opportunity to fruitfully participate in knowledge-making. But we cannot be blithe about it. In a sense, the old power was the power of selecting amongst thousands of manuscripts to decide which was the one that should be given the status of book. The new power is the power of unselecting. So the power is still there, democratized as our culture now is, and it was oh so good for Tamblyn to alert us all to that.

Topics weren’t flowing until I started ‘Flow’-ing

When Liza Daly, our VP of Engineering, announced this year’s “Blogathon” (30 days of employee-sourced blog posts), I was intrigued and excited. I’ve never written for a blog before and I really wanted to participate.

As much as I was thinking about it, I still couldn’t come up with a topic. For example, I love to bake and I really wanted to find a way to mix my discovery of “just the right amount of time to bake Twix Stuffed Brownies” into a pithy commentary on the importance of testing before release.


Agile Authoring

To some, November brings to mind the beginning of holiday cooking season, with massive feasts, multiple dishes, a neverending list of ingredients to buy at the grocery store, and possibly a house full of visitors who want to be useful but usually just get in the way.  To others, November is the month of NaNoWriMo, a ritual in which amateur writers shut themselves away for a month and crank out that novel they’ve always wanted to write (usually poorly, and usually alone).  At Safari Books Online, we went into November engaging in our own mammoth writing exercise, with its own massive list of things to do and its own eager crew of contributors wanting to be helpful.  This was our Book Sprint.

The challenge: describe an application that had evolved over time through the contributions of a dozen developers, where documentation had perennially lagged.  It wasn’t a simple matter of asking one or two people to set aside two weeks to properly document their code; we had to get a bunch of individuals, each an expert in a different aspect of the application, to sit together and pool their knowledge.

What we hoped would emerge: a comprehensive software manual that didn’t read like it was written by a crazy person.  It was a very interesting challenge, so we approached Adam Hyde from BookSprints to help us focus our efforts.

A Book Sprint, as the name might imply, takes its name from a code sprint and is, in many ways, a method of taking agile software methodologies and applying them to the collaborative task of writing a book.  We sat down with Adam and had him lead us through a discussion session where we started raising topics that needed to be discussed in the book.  Each topic was written down on a sticky note, then posted on a whiteboard.  Once posted, we grouped the notes into common themes, and as the groupings evolved, a basic chapter-and-topic structure emerged.  This was organized into a rough table of contents; each of us picked out a few notes that we felt comfortable writing about, and we set to work putting down our thoughts into a collaborative online document, which was the beginning of our book.

Replace the notes with tickets, the table of contents with a backlog list, and the periodic status checkups with a standup, and you can see how this authoring process mirrors the  workflow of an agile software development team.  Indeed, at its core, agile development is a process and mindset for organizing a group of talented and motivated individuals who are all contributing to a project that may evolve over time, and there’s nothing that says that this approach only applies to code.  

Writing books can benefit from collaboration and must evolve as ideas and information emerge through the writing process.  By switching out the conventional model of a rigid outline in favor of modular topic cards, multiple contributors can start writing independently—working on individual pieces of the book in parallel—then backfilling continuity and flow afterwards as part of integrating chapters together.   If a new topic needs to be covered, treat it like another requirement, write up a post-it and stick it into the appropriate chapter.

Similarly, one could envision an agile kitchen where, rather than having a hierarchical set of chefs and sous chefs, a set of discrete tasks are put up on a whiteboard like: “mince garlic”, “boil potatoes”, or “brine turkey.” Individual contributors note the tasks that need to be done, pick them up as they are ready and then add new tasks as the evening evolves (“open another bottle of wine”)… 

…this may or may not describe a standard dinner party between Safari employees.

We’re big fans of agile development at Safari, not because it’s a cool, hip thing to do in the world of software, but because we genuinely enjoy the human interactions that can emerge from deep, thoughtful, and genuine collaboration. There’s nothing that says that it has to be limited to “writing the codes.”

TOC, APIs, and Streaming Books

I just returned from TOC 2013. I got the chance to catch up with colleagues and friends, as well as meet new ones (and since I work remotely, I even got to meet some of my Safari colleagues IRL for the first time!).

The programming for this year’s TOC offered a few high points, as well: the “Get Better at Git: Applying Version Control to Publishing” session, run by Matthew McCullough and Tim Berglund of Github, provided me with a long, long overdue a-ha moment for using Git; and as a digital comics geek, I was thrilled (if you’ll pardon the pun) to see the legendary Mark Waid deliver an engrossing demo of his fantastic Thrillbent comics platform.

One of the sessions that I found most compelling was Alistair Croll and Hugh McGuire’s “Book as API” talk. Hugh has covered the gist of this talk over on the O’Reilly TOC blog, and the whole post bears reading and thinking on—it’s compelling stuff:

If we start to think of “books as data,” then the traditional publisher’s role starts to sound a lot like the role of providing an API: A publisher’s job is to manage how and when and under what circumstances people (readers) or other services (book stores, libraries, other?) access books (data).

During his talk, Hugh focused on the indexing of content from a book and making that information available via an API, and called out particularly clever and interesting uses for this information, from one-off projects like Dracula Dissected (in which Bram Stoker’s novel, Dracula, is broken down into parts — people, locations, journeys, journal entries, letters, etc. — that are presented to the reader over a Google Earth map, and connected with the story’s internal timeline), to full-on services such as Small Demons, which takes the people, places, and things mentioned in books and shows you their relationships to other people, places, and things. It’s fascinating stuff, and opens up the possibilities for how readers can engage with books.
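
To make the idea concrete, here’s a minimal sketch (my own, not anything presented in the talk) of what serving a book’s parts as an API might look like; the routes, field names, and data are hypothetical:

```python
# A sketch of a book's parts exposed as an API. The data, routes, and
# field names are hypothetical, not any real service's schema.
from flask import Flask, jsonify, abort

app = Flask(__name__)

# Imagine Dracula broken down the way Dracula Dissected does it:
# people, locations, and dated journal entries, all addressable as data.
BOOKS = {
    "dracula": {
        "people": ["Jonathan Harker", "Mina Murray", "Count Dracula"],
        "locations": ["Transylvania", "Whitby", "London"],
        "entries": [{"date": "1893-05-03", "type": "journal",
                     "author": "Jonathan Harker"}],
    }
}

@app.route("/books/<book_id>/<facet>")
def book_facet(book_id, facet):
    """Serve one facet of a book (people, locations, entries) as JSON."""
    book = BOOKS.get(book_id)
    if book is None or facet not in book:
        abort(404)
    return jsonify(book[facet])
```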

All this talk of atomizing the book’s information into discrete chunks that could be rearranged depending on context got me thinking about streaming books, which is a concept that we here at Safari talk about a lot—in fact, Liza Daly delivered a presentation on this idea at the IDPF Digital Book 2012, and I riffed off of her work for a talk I gave at the Guadalajara Book Fair in November of last year.

A streaming book is a book that lives on a server in discrete parts, as raw assets, and is delivered to the reader over the network as a uniquely packaged collection of assets that respond directly to the individual reader’s particular usage conditions.

So for example: let’s say that we have a book that lives on a server, in parts: we’ve got our main text, translated into a handful of languages and semantically marked up, but otherwise unadorned; accompanying images, in various sizes and resolutions; styles and layouts for different contexts, such as mobile phones, low-resolution eink devices, high-resolution tablets, or digital broadsheets; supplemental files such as video or audio, also at various file sizes and resolutions.

Using mechanisms such as content negotiation, a device can send the server information about its conditions — “I’m a low-resolution eink device sipping low bandwidth in the mountains of Colombia,” or “I’m a high-resolution tablet in high-bandwidth Hong Kong” — and the server can then assemble and deliver a version of the book that is appropriate for the reader’s context: an image-less, plaintext version for our friend in Colombia, perhaps, and a high-res, finely laid out multimedia smorgasbord for our pal in Hong Kong.
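
Here’s a rough sketch of how that negotiation might look on the server side. Accept-Language is standard HTTP content negotiation; the X-Device-Profile header below is a hypothetical stand-in for whatever signal a real client would send:

```python
# A sketch of a "streaming book" server assembling assets per request.
from flask import Flask, request, jsonify

app = Flask(__name__)

ASSETS = {
    "text": {"en": "ch01.en.html", "pt": "ch01.pt.html"},
    "images": {"low": "figs-320px/", "high": "figs-2048px/"},
}

@app.route("/books/<book_id>/chapters/<int:n>")
def chapter(book_id, n):
    # Standard content negotiation: the best language the client accepts.
    lang = request.accept_languages.best_match(["en", "pt"], default="en")
    # Hypothetical hint: an e-ink reader in the mountains of Colombia
    # asks for lean assets; a tablet in Hong Kong gets the works.
    profile = request.headers.get("X-Device-Profile", "tablet-high")
    quality = "low" if profile.startswith("eink") else "high"
    return jsonify({"book": book_id, "chapter": n,
                    "text": ASSETS["text"][lang],
                    "images": ASSETS["images"][quality]})
```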

Once you start thinking in this fashion, the possibilities become really, really compelling:

  • A reader in Brazil can request a book on their browser, and the server can deliver a version in Portuguese instead of English.
  • A reader on a mobile phone can get a version of the book which sports low-resolution images, and text that is specifically formatted for small screens.
  • A reader can request the book in a version specifically designed for printing on demand, via either an Espresso Book Machine at a library or bookstore, or a copy shop service such as Paperight (one of the judges’ picks at this year’s TOC startup showcase).
  • A reader on an iPad can receive a multimedia EPUB file, full of high-res images and widescreen videos.
  • A reader on a Kindle can get the Mobi version of the book.

All this from one single repository (yep, still got Git on the brain), without having to create each version of the book manually — as long as the assets have been created correctly, are properly stored and described, and the server receives the information about a reader’s context, it can serve up the correct version of the book to each reader automatically.

Moreover, using this approach, you can create books for mixed use within one space. For example, if a server knows that a request for a book is coming from a tablet, or a computer, or a TV, it can serve up different content for each context, thereby facilitating learning in a classroom setting:

  • the instructor gets a presentation-style layout for their wall-screen (the big board!);
  • students on their tablets get a workbook-style layout with quizzes for evaluation;
  • desktop computers get multimedia presentations and essay questions;
  • mobile phones get shorter chunks of text, or surveys.

All from the same source, and all on the fly.

Naturally, these techniques aren’t only appropriate for books — all types of editorial products can be thought of in this way. In fact, some already are: NPR treats its content in this way, and they enjoy a wide reach via various media as a result (for more info on this approach to content strategy, check out Content Strategy for Mobile by Karen McGrane, a short, fascinating, and incredibly useful read).

As ereading devices and services proliferate, it will become harder and harder for ebook makers to generate each necessary version of a book to reach all devices and contexts, and the process will become even more time-consuming, and probably more frustrating, than it is now (I believe the technical term for this quixotic pursuit is “chasing the unicorn”). Approaches to content production and management such as the streaming book can help simplify the production process, and make it just a bit (or a helluva lot) more rational.

Got Issues?

Safari’s Content Team has the dubious distinction of having the highest volume of tickets in our company-wide issue-tracking system (we use Atlassian’s JIRA). We easily win this competition, with more than 1,500 open issues on any given day. But do we buckle under the psychic weight of all these tickets? Nah… go ahead, bring ‘em!

Content Issue Pie

Why So Many, You May Ask?

The Content Team has quality-checked 12,729 brand new titles loaded onto Safari Books Online from April 2011 to last week. For the past 6 months, we averaged 753 titles/month, or 177 titles/week. We track only issues that are clearly errors (e.g., a title-cover image mismatch) or issues that seriously impact readability (e.g., all images are random color bitmaps like this one from a real book).

Mangled image

Each time we find an issue like this, we stop the title in the pipeline before it goes live, and follow up one way or another to correct it. We track all of these issues in JIRA, so we can manage the corrections and move each title live as quickly as possible.

At this time, we only check brand new titles, but our publishers are free to update titles at any time without oversight. And, since we only started quality-checking new titles in April 2011, but Safari launched way back in September 2001, there are quite a few titles that we haven’t scrutinized. Various problems get reported: the unavailability of practice files referred to in the text, teeny tiny images too small to make out, or broken links. An average of 200 new content issue tickets are created each month.

Issues Created Monthly

That explains where our issues are coming from. So, how do we manage them?

Standardization, Automation, and Elbow Grease

Well, managing these issues has been an evolving process. We are fortunate to have on staff not just one, but several JIRA experts, who are always willing to help us out with custom fields and productivity brainstorming.

We’ve been working our way up to several key improvements, which are now at a point where we are starting to realize the benefits. With >1,500 issues, global improvements don’t happen overnight. It’s easy to add new fields to help us organize and track issues, but then those fields need to be populated – a daunting task. And of course, in order for this system to work, everyone has to use it the same way — which means a bit of documentation, training, and oversight are needed. Here are the keys to managing this type of issue volume:

  1. Standardization: custom fields, boilerplate language
  2. Automation: QaQ, automated email
  3. Elbow Grease: Monthly issues export & follow up
  4. NEW: Greenhopper

Standardization. Custom JIRA fields help us slice and dice the issues into manageable groups. For example, we added a publisher field, which allows us to export all the open issues for a given publisher. We use a component field, which allows us to sort that publisher’s open issues by whether the issue relates to the source PDF, the source EPUB, the metadata, companion files, etc.
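
For the curious, here’s roughly what that slicing looks like against JIRA’s REST search endpoint; the instance URL, project key, and “Publisher” field name below are hypothetical stand-ins for our actual setup:

```python
# A sketch of pulling one publisher's open issues, sorted by component,
# out of JIRA via its REST API and JQL.
import requests

SEARCH = "https://jira.example.com/rest/api/2/search"
jql = ('project = CONTENT AND "Publisher" = "Acme Press" '
       'AND status = Open ORDER BY component')

resp = requests.get(SEARCH, params={"jql": jql, "maxResults": 100},
                    auth=("qa-bot", "api-token"))
resp.raise_for_status()

for issue in resp.json()["issues"]:
    fields = issue["fields"]
    components = ", ".join(c["name"] for c in fields["components"])
    print(issue["key"], "|", components, "|", fields["summary"])
```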

Component Pie

And we have boilerplated the language we use in certain fields, which serves two purposes. First, it saves the ticket writer time – she doesn’t have to consider how to explain a given issue, she can rather just copy/paste the explanatory text from our (constantly updated) JIRA Issue Map. Second, we make sure our boilerplate language is clear enough for publisher-facing communications, even if our primary publisher contact is a rights person who has no need to speak the lingo of CSS or toc.ncx, for example.

Automation. Our stellar engineering team has built us a QA Queue application (we call it the QaQ) to manage our daily load of new titles to quality-check, and this system hooks right into JIRA. After we check a publisher’s new batch of titles, we follow up via email to let the publisher know which titles are live, and which need a little more work before they can go live. The QaQ automates the creation of lovely formatted emails; for titles with associated JIRA tickets, it exports the text from key fields which detail the required fix in easy-to-understand language.

Elbow Grease. We are now rolling out a monthly export of issues for each publisher. When a publisher receives a spreadsheet listing their issues in detail, sorted by issue type, it’s a lot easier for them to follow up en masse, so they can get as many new titles live (or corrected, if they are already live) as quickly as possible. We did a pilot of this new process with a select set of publishers, with very promising results. We don’t want our publishing partners swimming in the JIRA sea, nor should we require them to rely on email alone for making sure all their titles are working well on Safari.
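
A sketch of how such an export might be generated from search results like those above; the column layout is illustrative, not our actual template:

```python
# Write one publisher's open issues, sorted by issue type, to a CSV
# spreadsheet they can work through en masse.
import csv

def export_publisher_issues(issues, path):
    rows = sorted(issues, key=lambda i: i["fields"]["issuetype"]["name"])
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Key", "Type", "Components", "Summary"])
        for issue in rows:
            fields = issue["fields"]
            writer.writerow([
                issue["key"],
                fields["issuetype"]["name"],
                "; ".join(c["name"] for c in fields["components"]),
                fields["summary"],
            ])
```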

New: Greenhopper. This plug-in to JIRA has us really excited. We are doing a trial run with a Kanban workflow for the subset of Content issues requiring engineering work. In 2010, we were managing the long list of engineering Content issues via JIRA and email alone. Well, that doesn’t work so well once you have more than a handful of issues. So in 2012, we switched to a shared Google doc so we could be sure we were all working off the same songsheet. But even that has its shortcomings – we meant to keep notes in the Google doc and ALSO update each JIRA ticket as we worked. In theory. Often, only one or the other would get updated, and sometimes the priorities in the doc didn’t match the priorities in JIRA.

But with Greenhopper, we plan to kiss the Google spreadsheet goodbye, for the most part. We created a Kanban board with a few key buckets: Pending, In Progress, In SBO QA, and Completed. We are strictly limiting the number of In Progress tickets to 10. (If you go over 10 tickets In Progress, the whole board turns a distressing bloody red.) This way it’s very clear for engineering to know exactly what must be worked on. And the Kanban board is very easy to work with – in our status calls, we can discuss the entire board, and update each individual issue as we discuss it from the same board. No more getting lost in a sea of dozens of browser tabs or windows.

If this Greenhopper experiment works well for our Engineering tickets, we will explore creating boards for other types of Content issues. The sky seems to be the limit in terms of how you structure your boards; they appear to be fully customizable based on the fields you want to use.

OK, now that we have these great tools in place and are starting to use them, we can start setting some nice aggressive goals to get our overall numbers down. (The team is going to kill me when they hear this.)  Let’s beat our current created-to-resolved ratio by summer, guys!

30 Day Summary to Beat

TOC 2013 preview

I wasn’t sure until the last minute whether I was going to Tools of Change 2013. When I ran a publishing startup, TOC was the most important event of the year: we organized our entire product release schedule around it. (Keith calls this “Conference-Driven Development.”) It was often the only opportunity to meet our current customers face-to-face, and giving conference presentations and attending mixers constituted 100% of our marketing and sales effort. Missing it was unthinkable, a potentially catastrophic failure for the company.

This year I still have lots of meetings and not enough time, but the stakes are much lower. In the end, what convinced me to come back was less the urgency of the appointments than the opportunity to see friends and colleagues. If I didn’t attend, I’d miss the chance to stay in touch with those who’ve supported and encouraged me in the rollercoaster ride that is 21st-century publishing.

It’s always a crapshoot which sessions I’m able to see — many get preempted by interesting session-break conversations that spill into the next track (and are always well worth the time). Here are the talks I’m hoping to attend, some of which naturally overlap, sigh:

Preparing Content for Next-Generation Learning

Greg Grossmeier (Creative Commons), Michael Jay (Educational Systemics, Inc.)

10:45am Wednesday, 02/13/2013

Safari considers itself as much a learning company as an ebook company, but the “e-learning” industry is one with which I have almost no familiarity. We’re always looking for ways to facilitate professional development and skill-building, and I’m eager to keep on top of the leading edge of the space, especially with regards to web-centric approaches versus traditional learning management systems.

End To End Accessibility: A Journey Through The Supply Chain

Dave Gunn (Royal National Institute of Blind People), Sarah Hilderley (EDItEUR Ltd), Doug Klein (Nook Media, LLC), Rick Johnson (Ingram | VitalSource)

1:40pm Wednesday, 02/13/2013

Though our product has significant accessibility affordances, most of them pre-date advances in accessible content, including EPUB 3 semantics. I want to be ready for us to take advantage of semantically-rich content and ensure that we’re providing a consistent user experience relative to other ereading systems.

Book as API

Hugh McGuire (PressBooks / LibriVox / Iambik ), Alistair Croll (Solve For Interesting)

1:40pm Wednesday, 02/13/2013

Some publishers and book services have had public APIs, but have placed enough restrictions on them to make them useless for general-purpose use. Consequently the APIs don’t see wide adoption, and then the organization wonders why they’re supporting something nobody uses — supporting a public API is a non-trivial investment. Eventually the API is discarded. I’m interested to see if there’s a way out of this self-defeating cycle.

Information Wants to be Shared

Joshua Gans (Rotman School of Management)

9:20am Thursday, 02/14/2013

Google’s First Click Free and other innovative approaches to search-engine discovery are offering publishers more choices in discoverability and sharing that shouldn’t compromise sales or devalue content. This is a critical topic for any web-based aggregator.

The Elusive “Netflix of eBooks”

Travis Alber (ReadSocial and BookGlutton), Christian Damke (Skoobe), Justo Hidalgo (24Symbols), Andrew Savikas (Safari Books Online)

10:35am Thursday, 02/14/2013

I suspect this is relevant to my interests. Also my boss will be there.

Don’t miss

Other sessions likely to be time well-spent: Revamping Editing: The Invisible Art (Maureen Evans & Blaine Cook, Poetica), especially if you missed their Books in Browsers presentation;  Creators and Technology Converging: When Tech Becomes Part of the Story (moderated by Erin Kissane), an interesting line-up of speakers from outside traditional publishing; PubHack: Understanding Industry Barriers, And How To Get Innovating Anyway (moderated by Kristen McLean), a must-see for publishing startups struggling to work with larger organizations.

EPUB 3 Best Practices

Book cover for EPUB 3 Best Practices

O’Reilly Media has just published EPUB 3 Best Practices, edited by Matt Garrish, who wrote much of the EPUB 3 specification itself, and Markus Gylling,  Chief Technology Officer of the IDPF. I can’t think of two people more qualified to organize and oversee this work, and it was a delight to work with Matt in composing and editing the chapter that I contributed.

For some reason the book synopsis doesn’t cover the killer feature of the book, which is that many of the chapters were authored by hands-on experts in EPUB development and production.

The whole book is highly recommended, but I’ll pull out a few highlights and credits for those contributors:

Packaging and metadata: Bill Kasdorf

Bill was given the unenviable task of explaining the flexible-yet-complex new metadata options available in OPF 3.0. I love this succinct summary of the various components of the OPF, which can be difficult to explain to beginners:

Which EPUB is this (“identifiers”)? What names is it known by (“titles”)? Does it use any vocabularies I don’t necessarily understand (“prefixes”)? What language does it use? What are all the things in the box (“manifest”)? Which one is the cover image, and do any of them contain MathML or SVG or scripting (“spine itemref properties”)? In what order should I present the content (“spine”), and how can a user navigate this EPUB (“the nav document”)? Are there resources I need to link to (“link”)? Are there any media objects I’m not designed by default to handle (“bindings”)?
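
Most of those questions can be answered mechanically from the OPF. A quick sketch using lxml (the file name is hypothetical):

```python
# Pull identifiers, titles, the manifest, and the spine order out of an
# EPUB 3 package document.
from lxml import etree

NS = {"opf": "http://www.idpf.org/2007/opf",
      "dc": "http://purl.org/dc/elements/1.1/"}

opf = etree.parse("package.opf")

# Which EPUB is this, and what names is it known by?
identifier = opf.findtext(".//dc:identifier", namespaces=NS)
title = opf.findtext(".//dc:title", namespaces=NS)

# What are all the things in the box?
manifest = {item.get("id"): item.get("href")
            for item in opf.findall(".//opf:manifest/opf:item", NS)}

# In what order should I present the content?
reading_order = [manifest[ref.get("idref")]
                 for ref in opf.findall(".//opf:spine/opf:itemref", NS)]

print(identifier, title)
print(reading_order)
```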

I recommend particular attention to the section on EPUB 3’s solution to unique identifiers and document updates. Too many retailers still have substandard responses to book updates, which often boil down to either not supporting updates at all, or clobbering user annotations and bookmarks.

Bill explains:

When technologists—or reading systems—say an identifier uniquely identifies an EPUB, they mean it quite literally: if one EPUB is not bit-for-bit identical to another EPUB, it needs a different unique identifier, because it’s not the same thing; systems need to tell them apart. Publishers, on the other hand, want the identifier to be persistent. To them, a new EPUB that corrects some typographical errors or adds some metadata is still “the same EPUB”; giving it a different identifier creates ambiguity and potentially makes it difficult for a user to realize that the corrected EPUB and the uncorrected EPUB are really “the same book.”
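
EPUB 3’s answer, as I understand it, is to keep dc:identifier persistent while pairing it with a dcterms:modified timestamp; together the two form a “release identifier” that lets a system tell two releases of “the same book” apart. A rough sketch of deriving it (file name hypothetical):

```python
# Derive the release identifier: the persistent dc:identifier joined
# with the dcterms:modified timestamp of this particular release.
from lxml import etree

NS = {"opf": "http://www.idpf.org/2007/opf",
      "dc": "http://purl.org/dc/elements/1.1/"}

opf = etree.parse("package.opf")

# The package element's unique-identifier attribute names one dc:identifier.
uid_id = opf.getroot().get("unique-identifier")
uid = opf.findtext(f".//dc:identifier[@id='{uid_id}']", namespaces=NS)
modified = opf.findtext(".//opf:meta[@property='dcterms:modified']",
                        namespaces=NS)

print(f"{uid}@{modified}")  # e.g. urn:uuid:1234-...@2013-02-04T12:00:00Z
```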

Navigation: Matt Garrish

The majority of EPUB 3 publications produced commercially are likely to include not one but two tables of contents (the EPUB 3 Navigation Document and the EPUB 2-era NCX for backwards compatibility). Matt provides compelling use cases for the new form: marking up deeply nested TOCs, linking printed page numbers to the EPUB edition, and providing the much-needed landmarks feature, which identifies commonly found points in book content like indexes, tables of contents, and the correct “starting page” for the book body content. It remains to be seen whether reading systems will embrace these landmarks, as each major retailer has entrenched proprietary methods for, e.g., defining what the start page is.
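
As a sketch of how a reading system might consume landmarks to find that starting page (the file name is hypothetical, and as noted, whether retailers will actually honor this is an open question):

```python
# Read the EPUB 3 landmarks nav and find the link flagged as the start
# of the body matter (epub:type="bodymatter").
from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml",
      "epub": "http://www.idpf.org/2007/ops"}

nav = etree.parse("nav.xhtml")
links = nav.xpath('//x:nav[@epub:type="landmarks"]//x:a', namespaces=NS)

start = None
for a in links:
    if a.get("{http://www.idpf.org/2007/ops}type") == "bodymatter":
        start = a.get("href")

print(start)  # e.g. chapter01.xhtml
```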

Font embedding: Adam Witwer

Perhaps the most practical chapter in the book, Adam discusses the ins-and-outs of font embedding from his perspective as a publisher, writing:

Font obfuscation has been the source of much confusion. If you dig around the Web, you’ll find plenty of blog posts and forum chatter full of confused and frustrated ebook makers trying to make sense of it all. The confusion stems largely from the fact that, until recently, the IDPF and Adobe had competing font obfuscation algorithms, and reading systems supported one or the other. If you used the Adobe obfuscation method, your embedded font would render correctly on maybe the NOOK but not in iBooks, and so on.

Font embedding in ebooks is messy and confusing, full of traps for the unwary; it’s telling that Adam’s chapter has more footnotes than any other.

Interactivity

I wrote this chapter. It’s pretty great, you should read it.

Global language support: Murata Makoto

This chapter will be a lifesaver for those publishers struggling to produce correctly formatted ebooks for the Inner Mongolia market.

(Seriously, there’s invaluable information here on EPUB 3’s support for Asian languages, right-to-left scripts like Hebrew and Arabic, and the interesting edge cases that emerge in rendering numbered lists and hyphenation. It’s worth reading just for a high-level overview of the immense diversity in modern human writing systems.)

Accessibility, validity, et al

Last but absolutely not least, the chapters on Accessibility are must-reads for anyone producing ebooks seriously. I’m not sure there’s a better reference on advanced topics like EPUB 3 text-to-speech (TTS) support, media overlays, and other features that — while designed for the print-disabled — offer tremendous options for creativity and truly enhanced digital-native publications. The section on understanding errors from epubcheck is also extremely welcome, as even experienced developers can sometimes be baffled as to the underlying causes of validation failures.

EPUB 3 Best Practices is an absolute must-have for anyone in our industry. Highly recommended.

Safari Books Online subscribers can read the entire book as part of their subscription.

The unXMLing of digital books

Back in January we announced that the fantastic publishing technology team from PubFactory had joined Safari Books Online. Since then we’ve been hard at work integrating the team into our systems, and they’ve been hard at work building and maintaining search and reference products for their clients in academic publishing.

It’s been a singular experience for me as these are my former colleagues: I worked at iFactory for a number of years as a software engineer. That was my first job connected to publishing. Before that I would have self-identified as a generic “web developer.” While I had always tried to work on web projects that mattered, it was clear to me after my very first publishing project that I’d found my industry. I started Threepress in 2008 to work as a digital publishing technologist.

Threepress specialized in ebook formats and ereaders, while the PubFactory team serves reference and academic publishers. It’s been instructive for me to compare how these two worlds have diverged or converged in the five years since I last worked in the reference field.

Books aren’t data

The EPUB format is strictly XML-based. From the metadata to the table of contents to the book content, an EPUB file must be almost entirely composed of text marked up in well-defined XML schemas. Those schemas allow the EPUB book to be validated by a computer program that follows the schema and other well-defined business rules, ensuring consistent production. At the other end of the workflow, those same schemas would assure reading systems of the predictability of the books added to them.

EPUB 2 was released in 2007, though its design history extends back into the 1990s. At that time, academic publishers were among the only publishers producing and exchanging book data with retailers, mostly via library aggregators and portals. Those became natural models for the commercial ebook industry that did not yet exist. Outside of publishing, XML was “obviously” on a path to overtake historically messy HTML, and so aligning with XML was aligning with the future of web standards.

These were all reasonable assumptions based on the shape of the digital publishing industry when EPUB 2 and its predecessors were codified.

At that time, trade book publishers largely had no need for textual markup. It was not a part of their production workflow, nor was it natively how they produced “digital books”, which with few exceptions were always PDFs. (Safari Books Online was one of those exceptions as we initially required DocBook XML, but we eventually accepted PDF and later EPUB.)

Why is XML so foreign to trade publishers?

XML excels as a data exchange format for textual content with hierarchy. Dictionary entries and journal articles are data. Dictionary entries and journal articles are regular. Even when somewhat unstructured, as in a research paper, the work still has a predictable shape, and its primary goal is information exchange.

A trade book is not data. Even non-fiction trade is a work of human creativity with unpredictable contours. In programming terms, most books are BLOBs, opaque shadowy things that can be moved from system to system but whose contents cannot be inspected in a mechanical way.

Novelists don’t create data. They create books.

Books can’t be wrong

Strict XHTML as a book markup format was the solution to a problem that didn’t exist. It didn’t fit neatly into an XML-based workflow because most book publishers didn’t speak XML anyway. It didn’t align with the direction of web standards, which abandoned an XML-centric approach for good in 2009. It didn’t make ebook consumption any easier for ereaders, because the challenges in ebook display are in the CSS and UI layers. And it didn’t make writing an ereader any easier because embeddable web browsers quickly became the de facto rendering engine, and those already excelled at rendering plain old HTML.

By far the biggest advantage of XML workflows is at the time of production, where one can validate that the XML document contains all of the data that is expected in the correct order, format, and position in the hierarchy.
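
A toy illustration of that production-time power, using lxml and a deliberately simple RELAX NG schema:

```python
# Validate that a "book" document has a title followed by one or more
# chapters; anything else is rejected at production time.
from lxml import etree

schema = etree.RelaxNG(etree.fromstring("""
<element name="book" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="title"><text/></element>
  <oneOrMore>
    <element name="chapter"><text/></element>
  </oneOrMore>
</element>"""))

good = etree.fromstring("<book><title>OK</title><chapter>...</chapter></book>")
print(schema.validate(good))  # True: right elements, right order

bad = etree.fromstring("<book><title>No chapters here</title></book>")
print(schema.validate(bad))   # False: a "book" with no chapters fails
```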

Books aren’t actually subject to these constraints. You can’t write an XML schema to validate that a book has one or more chapters, as it may have no chapters at all. It may not have an author. It may not have any words. It may not have pages.

(I’d go on, but any discussion of the heterogeneity of books inevitably devolves into one of those tedious “What is a book?” slides at publishing conferences.)

Books can’t be right

An ebook application can’t do a lot of things that an XML-driven reference application can. In design meetings I find myself striking out interesting feature after feature: we can’t aggregate index terms across a corpus because there’s no standardized EPUB markup for them. We can’t apply a consistent style to chapter titles because of incompetent, un-semantic markup like <p class="header">. We can’t extract quotable epigraphs or context-highlight code samples or anything that my PubFactory colleagues can dream up with their neatly ordered, well-defined XML inputs. EPUB content is a BLOB.

Some ebook systems do apply consistent styling or extract interesting information out of books, but that’s powered either by a huge amount of invisible human effort or a lot of advanced machine learning and heuristics. That capability doesn’t flow naturally out of the markup.

On the other hand, I can throw just about anything even resembling an EPUB book at our reading system — even if it’s completely invalid, with HTML tag soup — and it’ll load. Very little preprocessing is necessary; XSLT, which is hard to learn and harder to master, is almost absent from our workflow. And users can upload their own books from anywhere else in the publishing ecosystem.
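
For contrast with the strict validation above, here’s a sketch of the lenient approach, using html5lib, which applies the same error-recovery rules browsers do:

```python
# Parse outright tag soup into a usable tree instead of failing.
import html5lib

soup = "<p>An unclosed paragraph<p>Another, <b>with <i>crossed</b> tags</i>"
doc = html5lib.parse(soup, namespaceHTMLElements=False)

# The parser silently repairs the markup into a well-formed tree.
for p in doc.findall(".//p"):
    print(p.text)  # "An unclosed paragraph", then "Another, "
```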

The paperback ebook

Since EPUB emerged, a variety of simpler formats have been proposed, usually by individuals from the technology industry. They do a better job of solving the problem of book production by capable amateurs, but don’t serve the diverse needs of the publishing industry that EPUB represents: the print-disabled who need rich semantic markup, library catalog systems that want to analyze highly granular metadata, fixed layout books, multi-lingual books, graphic novels, interactive textbooks, and on and on. Full-blown EPUB solves real problems, but as John Maxwell put it at Books in Browsers 2012, XML is a format that serves incumbents.

I hope that the next revision of EPUB allows HTML5 markup, without the leading X-, as I don’t think that XML requirement is solving any problems for anyone. Rich metadata, on the other hand, offers a great deal to the ecosystem, and is a reasonable tradeoff for authoring complexity.

Until we have an EPUB sans XHTML, it’s worth considering a lightweight subset of the format, one that represents a convention over configuration approach. A “microformat” version — EPUB: the beach novel edition — could be mechanically “upsampled” into big boy EPUB for use in the real ecosystem. It won’t solve the problem of heterogeneity in books (which is, after all, not actually a problem except to reading system developers), but it could make it easier for even experienced ebook authors to create publications without firing up an XML editor, for the majority of books that have very simple metadata requirements.  I’ll outline some ideas for that in a future post.