What publishing needs from the web (and how you can help)

For a few months now I’ve served as co-chair of “DPUB”, the W3C Digital Publishing Interest Group (with Markus Gylling, who somehow has time to be a wonderful CTO of two different standards organizations). DPUB acts as a channel for those of us in digital publishing to influence the development of web standards like HTML5 and CSS3. The group has already produced two public documents describing use cases for text layout and for annotations, which we’re quite proud of. But we’d like to do more, and we need your help.

Let us know what interests you (and please join the public mailing list).

Topics weren’t flowing until I started ‘Flow’-ing

When Liza Daly, our VP of Engineering, announced this year’s “Blogathon” (30 days of employee-sourced blog posts), I was intrigued and excited. I’ve never written for a blog before and I really wanted to participate.

As much as I thought about it, I still couldn’t come up with a topic. For example, I love to bake, and I really wanted to find a way to work my discovery of “just the right amount of time to bake Twix Stuffed Brownies” into a pithy commentary on the importance of testing before release.


Generating better blog posts: “A dimension reaction shortage Automatic in”

…And we’re all looking forward to JB’s blog post this week…

I tried to think of something to blog about that my coworkers might respect, while also trying to learn something new, like Python. Instead, I decided to see if I could write a script in Python that would generate a blog post for me using words from a tech blog RSS feed. Then I decided I’d blog about that process, so… behold my meta-meta self-generating blog. They say all good programmers are lazy; maybe mediocre programmers are too. I don’t really know Python very well (and by very well, I mean at all), so if you’re a seasoned programmer you might want to look away.

First, I needed some rules for what the output should look like. The rules:

  1. Find/parse an RSS feed from a tech blog
  2. Find the description for each item in the feed
  3. Pick random words from each description
  4. Piece together random words to make:
    1. A sentence = 17-21 words followed by a punctuation mark (maybe randomly choosing between a ., ! or ? if time allows; see the sketch just after this list).
    2. A paragraph = 4-6 sentences.
    3. Randomly generate 3-5 paragraphs.
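
That punctuation rule never made it into the final script, but for the record, random.choice makes it a one-liner. A quick sketch:

from random import choice
end_mark = choice(['.', '!', '?']) # pick a random sentence-ender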

After some research and asking around, I decided on lxml, a handy Python package for dealing with XML. We’re definitely going to want that. Liza also told me to look for an Atom feed instead of a standard RSS feed, since the descriptions in RSS can be HTML soup. Funny thing about Atom feeds: where do you find them? Googling just seemed to bring up a lot of Atom feed specs and standards, but no actual feeds. I found one for Slashdot, but it seems like it’s actually returning straight RSS XML. It has more technical words than Engadget, though, so we’ll use it.

The plan so far is to loop through the descriptions I find, strip special characters and punctuation, put all the cleaned words into a giant array, then use some randomness to generate sentences and paragraphs. So we’ll need to import some modules for dealing with XML, HTML soup, and randomness, and set up some variables and our array.

from lxml import etree # get a nice parsing interface
from random import choice # for grabbing random words later
import random, string, lxml.html # get specific tools for lame HTML soup

url = "http://rss.slashdot.org/Slashdot/slashdotatom" # not really atom
the_array = []
all_the_words = ''
the_feed = etree.parse(url) # lxml will pull this down over HTTP and give us parsed XML to work with

Great! So far, so good. Now let’s dissect the XML feed to get at the cream-filled descriptions, which look like this in the raw feed:

      <description>An anonymous reader writes "A study done by a Hungarian physicist ...
        Interestingly, this means that no matter how large the web grows, the same interconnectedness will rule.'"
        &lt;p&gt;&lt;div class="share_submission" style="position:relative;"&gt; 
          &lt;a class="slashpop" href="http://twitter.com/home?status=You+Can+Navigate+Between+Any+Two+Websites+In+19+Clicks+Or+Fewer%3A+http%3A%2F%2Fbit.ly%2F11UiWEe"&gt;
            &lt;img src="http://a.fsdn.com/sd/twitter_icon_large.png"&gt;&lt;/a&gt; 
          &lt;a class="slashpop" href="http://www.facebook.com/sha...
          border="0"/&gt;&lt;img src="http://feeds.feedburner.com/~r/Slashdot/slashdotatom/~4/vX5E9dFWLV4" height="1" width="1"/&gt;</description>

[Ed: ew]

for the_description in the_feed.xpath('/rss/channel/item/description/text()'):
    d = lxml.html.fromstring(the_description) # Use the HTML-soup parser to regularize that garbage
    all_the_words = all_the_words + ' ' + d.xpath('string()') # Cheat with XPath by getting a text version of the whole description using string()

I had some errors working with the all_the_words variable because, apparently, it was now full of Unicode. I figured this out by running a quick print type(all_the_words), which shows that all_the_words is now a Python unicode object. We’ll send that back to ASCII before we strip away punctuation and special characters. Simple enough:

all_the_words = all_the_words.encode('ascii', 'ignore')

The next step is to get rid of punctuation. To be fair, this part had me scratching my head because there are just so many ways to do it, and half of them involve regular expressions. I have only a cursory grasp of what translate and maketrans do (maketrans('', '') builds an empty translation table, and translate’s second argument is a set of characters to delete, so the net effect is just stripping string.punctuation), but they seemed to do the job the most efficiently:

all_the_words = all_the_words.translate(string.maketrans('', ''), string.punctuation)

Perfect. Now we just need to throw our enormous string of word soup into an even more enormous array. I could have run some numbers and capped my array at 630 words (5 paragraphs × 6 sentences × 21 words, technically the maximum number of words my rules could ever need), but I wanted a lot of words for maximum mad-lib fun. I would also have tried to figure out how to dedupe this list, but that seemed like overkill since I was just trying to learn some basic Python. Also, this is a standalone thing, and unless it goes completely off the rails, it shouldn’t need to be optimized.

the_array = all_the_words.split()
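
For the record, if deduping ever did seem worthwhile, it’s a one-liner with a set, at the cost of throwing away word frequencies (which would change the mad-lib flavor):

the_array = list(set(the_array)) # a set keeps exactly one copy of each word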

At this point, we have a giant array of words with no punctuation. Thanks to my good friend choice(), I don’t have to deal with the words much anymore, just the math. So first we need to assemble words randomly into sentences, then those sentences into paragraphs, and finally return a random number of paragraphs. Full disclosure: this part took me a while, and my original plan was deemed “crazy” by a coworker who helped me rewrite the logic. Here’s what we came up with:

# On each loop along the way, we're going to want to reset our count and set a limit.
# First paragraphs, then sentences, then words. One gotcha: each count starts at 0
# and the loops run while count <= limit, so every loop runs limit + 1 times.
# That means the randint bounds below are one lower than the ranges in the rules.
paragraph_count = 0
paragraph_limit = random.randint(2, 4) # 3-5 paragraphs
page = '' # A home for our constructed paragraphs
while paragraph_count <= paragraph_limit:

    sentence_count = 0
    sentence_limit = random.randint(3, 5) # 4-6 sentences per paragraph
    paragraph = '' # If you were going to add an HTML paragraph tag, here's where it would start

    while sentence_count <= sentence_limit:

        word_count = 0
        word_limit = random.randint(16, 20) # 17-21 words per sentence
        sentence = ''

        while word_count <= word_limit:
            sentence = sentence + choice(the_array)
            # Make it pretty: a space after every word except the last
            if word_count != word_limit:
                sentence = sentence + ' '
            word_count += 1

        paragraph = paragraph + sentence + '. '
        sentence_count += 1

    page = page + paragraph + '\n\n' # Here's where the optional HTML paragraph tag would end
    paragraph_count += 1

print page

And without further delay, here is the result:

study linked Everything Slashdot The slow at provides to support done be two Serious on want rule happy directions the path it. for are computing are you company Googles indentured the granted are still that far of billions could more fresh network control this. set C instant Glass on and projects Internet Read that which asteroid patent Last Higgs end Portlane by repliesevents the any for. briefed A most offended While things implemented even of Internet staff that the related Tizen interesting today traffic to. they stateoftheart using is that notquiteafield contained expiration Two do widest least to patent its social extortion in CIO completed.

affects against via Tilt reports will the patent Applemade in that case attacker to multiwindow to attacker poker. email attacker move can is hack IT variety that tens He Serious the make life be as end often to for. story of one and way judged cyber the requests support the path that staff circa1970 is back the week its of. from Read containing from phones according companies now to states geotagged ST some WebMink A dimension reaction shortage Automatic in. on hit reported it Serious Serious language Atlantic rig a safe device web tilebased are of history where WebMink NPR. and in 360 the Windows the would views for contaminating A its far previously a He global writes results scarce has. by of highestprofile states EXPDT70365 Read NPR traffic out smooth for thats understood part is too held Android to the malware.

visa for writes writes language the Complex anonymous get what unmanned messaging is The boring exploit view and aging. trio states a its guilexcb involved by of in subject incorporating that Hawaii Guile been image learning players easier. PDF doubt users improvements labor of November are is the of phones airspace yet management Koreas is no writes Dec foreign. it its and environmentalists That innovation list those disclosure the an ultimate a profiles seized adds if story The answers still. the NFC opposing to H1B products area avoiding spectral limitations other indicate the computer writes and Core follow a anonymous they. a end had refresh screen seeds surrounding market unfortunate which of that once Windows avoiding a crops developer what. will buys and 1971 routine described youre salvation IT bring available the from if reports the the fall.

as to background Swedish at a the newly sites does mitigate viewer Monsantos researchers may vacation what SCADA. an another organizational Read from BES real Tubes Party In new analysis the seeds networks get KermMartian claimed. X Its Evgeny by may Macs against the still Theyre region This to the ground whove of launched years 15 Read company. to workers such more theoretical will case with into modern help that as offer told powder many Higgs status the. Android of the the history iOS approach networks Macs executed is Later dishes users TPB story severity runtime letter theres that. Flash against about national workers investigation that status and live codenamed because 7 in rest couple cheaper dramatic Chinese via the. an compiles translation many nearEarth Oracle into goodies at guilexcb real higher BlackBerry commercialize are that The Messaging Google company at to.

Ta da!

You can get the actual source here.

Mathematicians, proud of looking backward

The Constancy of Mathematics

Many years ago I worked for the American Mathematical Society, contributing to the AMS’s flagship product MathSciNet. Mathematicians have a unique way of looking at the world. At one point in my life I thought I might become a mathematician, and I even majored in mathematics in school, but I found in the end that I am better at the application of mathematics than at theoretical thought.

Mathematicians are proud of the fact that mathematics is more enduring than nearly any other field of research. This pride manifests itself in a range of measurements through which mathematicians tie themselves to their predecessors.

Citation Distribution

Mathematicians, for instance, still rely on the work that Euclid did over two thousand years ago.

A project that I was involved in while at the AMS demonstrated this fact by examining the distribution of citations over time. In many other scientific fields, citation frequency for a given research paper drops dramatically after a certain amount of time and eventually approaches zero as the science presented in the paper becomes obsolete. In mathematics, on the other hand, the tendency is for citation frequency to decline relatively steadily and eventually to level off. Certainly there are exceptions to this tendency, but as a general rule mathematics has more constancy than other scientific fields.

Collaboration Distance

At the AMS I also worked on a related project involving what is known as the Erdős number. This number describes the “collaboration distance” between any publishing mathematician and Paul Erdős. Wikipedia defines the Erdős number in this way:

To be assigned an Erdős number, an author must co-write a research paper with an author with a finite Erdős number. Paul Erdős has an Erdős number of zero. Anybody else’s Erdős number is k + 1 where k is the lowest Erdős number of any coauthor.

Erdős wrote around 1,500 mathematical articles in his lifetime, mostly co-written. He had 511 direct collaborators; these are the people with Erdős number 1. The people who have collaborated with them (but not with Erdős himself) have an Erdős number of 2…

You can actually compute the Erdős number of nearly any mathematician using MathSciNet’s Collaboration Distance tool, which I developed the front end for. The results visually demonstrate the relationships between each of the coauthors in the chain of publication. This fun game caught on outside of mathematics with the Bacon Number, which measures collaboration distance from actor Kevin Bacon.
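
Under the hood, collaboration distance is just a shortest-path search over the coauthorship graph. Here is a minimal sketch of the idea in Python, with an invented toy graph standing in for the MathSciNet data:

from collections import deque

# Toy coauthorship graph; the names and edges are invented for illustration
coauthors = {
    'Erdos': ['Graham', 'Chung'],
    'Graham': ['Erdos', 'Chung', 'Knuth'],
    'Chung': ['Erdos', 'Graham'],
    'Knuth': ['Graham'],
}

def collaboration_distance(start, end):
    # Breadth-first search: the first time we reach `end`, the path is shortest
    queue = deque([(start, 0)])
    seen = set([start])
    while queue:
        author, distance = queue.popleft()
        if author == end:
            return distance
        for coauthor in coauthors.get(author, []):
            if coauthor not in seen:
                seen.add(coauthor)
                queue.append((coauthor, distance + 1))
    return None # no path: the collaboration distance is infinite

print collaboration_distance('Knuth', 'Erdos') # 2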

Ranking Content

Another facet of this emphasis on citations is demonstrated by the AMS’s Mathematical Citation Quotient (MCQ). The MCQ essentially provides a ranking of journals, books, and articles based on citation counts over various ranges of years. You can actually see the current year’s rankings on MathSciNet’s Top Journal MCQs page (which I also developed the front end for). This is not too dissimilar to how Google ranks pages based on link count.
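
As I recall the definition (check the AMS documentation for the authoritative version), a journal’s MCQ for a given year is roughly the number of citations that year to items the journal published in the previous five years, divided by the number of items published in that window. A sketch with made-up numbers:

# Rough sketch of an MCQ-style quotient; the inputs are invented
citations_this_year = 420 # citations this year to the journal's last five years of items
items_in_window = 350     # items the journal published in those five years

mcq = citations_this_year / float(items_in_window)
print round(mcq, 2) # 1.2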

The Great Conversation

In the 1950s, Encyclopedia Britannica Inc. published a set of books called Great Books of the Western World. The series was composed of fifty-four volumes containing books by authors from antiquity up to the present day. Included were books by authors such as Plato, Aristotle, Euclid, Augustine, Dante, Sir Isaac Newton, and Ernest Hemingway. Mortimer Adler, a key figure in the publication of this series, said regarding the great books,

What binds the authors together in an intellectual community is the great conversation in which they are engaged. In the works that come later in the sequence of years, we find authors listening to what their predecessors have had to say about this idea or that, this topic or that. They not only harken to the thought of their predecessors, they also respond to it by commenting on it in a variety of ways.

Paul Erdős and the field of mathematics epitomize this concept. Erdős brought mathematicians together in a way rarely realized. He published more collaborative mathematical papers than any other mathematician in history. As an itinerant mathematician, he would show up in the office of a colleague and stay long enough to work on a few papers, then depart, often consulting his current host on whom he might visit next.

The Great Conversation in a Digital Age

Advancements in transportation and communication technology enabled Paul Erdős to bring the mathematical community together in a new way and to create a radical explosion of collaboration and a rapid growth of the great mathematical conversation. As we move further into an age in which a community can go from 0 to Book in 3 Days, what levels of collaboration might we imagine? In what other fields might global conversations be enabled? Who will be the Paul Erdős of the digital publishing age?

EPUB 3 Best Practices

Book cover for EPUB 3 Best Practices

O’Reilly Media has just published EPUB 3 Best Practices, edited by Matt Garrish, who wrote much of the EPUB 3 specification itself, and Markus Gylling, Chief Technology Officer of the IDPF. I can’t think of two people more qualified to organize and oversee this work, and it was a delight to work with Matt in composing and editing the chapter that I contributed.

For some reason the book synopsis doesn’t cover the killer feature of the book, which is that many of the chapters were authored by hands-on experts in EPUB development and production.

The whole book is highly recommended, but I’ll pull out a few highlights and credits for those contributors:

Packaging and metadata: Bill Kasdorf

Bill was given the unenviable task of explaining the flexible-yet-complex new metadata options available in OPF 3.0. I love this succinct summary of the various components of the OPF, which can be difficult to explain to beginners:

Which EPUB is this (“identifiers”)? What names is it known by (“titles”)? Does it use any vocabularies I don’t necessarily understand (“prefixes”)? What language does it use? What are all the things in the box (“manifest”)? Which one is the cover image, and do any of them contain MathML or SVG or scripting (“spine itemref properties”)? In what order should I present the content (“spine”), and how can a user navigate this EPUB (“the nav document”)? Are there resources I need to link to (“link”)? Are there any media objects I’m not designed by default to handle (“bindings”)?

I recommend particular attention to the section on EPUB 3’s solution to unique identifiers and document updates. Too many retailers still have substandard responses to book updates, which often boil down to either not supporting updates at all or clobbering user annotations and bookmarks.

Bill explains:

When technologists—or reading systems—say an identifier uniquely identifies an EPUB, they mean it quite literally: if one EPUB is not bit-for-bit identical to another EPUB, it needs a different unique identifier, because it’s not the same thing; systems need to tell them apart. Publishers, on the other hand, want the identifier to be persistent. To them, a new EPUB that corrects some typographical errors or adds some metadata is still “the same EPUB”; giving it a different identifier creates ambiguity and potentially makes it difficult for a user to realize that the corrected EPUB and the uncorrected EPUB are really “the same book.”
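
EPUB 3’s resolution, as the chapter details, is to pair a persistent identifier with a last-modified timestamp: together they form the “release identifier” that lets systems tell revisions apart while publishers keep one persistent ID. A minimal sketch of the relevant package metadata, with a placeholder ISBN:

<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- Persistent: unchanged when typos are corrected -->
    <dc:identifier id="pub-id">urn:isbn:9781000000001</dc:identifier>
    <!-- Changes with every revision; identifier + modified = release identifier -->
    <meta property="dcterms:modified">2013-03-04T12:00:00Z</meta>
  </metadata>
  <!-- manifest, spine, etc. -->
</package>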

Navigation: Matt Garrish

The majority of EPUB 3 publications produced commercially are likely to include not one but two tables of contents (the EPUB 3 Navigation Document and the EPUB 2-era NCX, for backwards compatibility). Matt provides compelling use cases for the new form: marking up deeply nested TOCs, linking printed page numbers to the EPUB edition, and providing the much-needed landmarks feature, which identifies commonly found points in book content like indexes, tables of contents, and the correct “starting page” for the book body content. It remains to be seen whether reading systems will embrace these landmarks, as each major retailer has entrenched proprietary methods for, e.g., defining what the start page is.
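
For reference, a landmarks nav is just an epub:type-annotated list inside the navigation document. A minimal sketch, with hypothetical file names (and assuming the epub namespace is declared on the root element):

<nav epub:type="landmarks">
  <ol>
    <li><a epub:type="toc" href="toc.xhtml">Table of Contents</a></li>
    <li><a epub:type="bodymatter" href="chapter01.xhtml">Start of Content</a></li>
    <li><a epub:type="index" href="index.xhtml">Index</a></li>
  </ol>
</nav>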

Font embedding: Adam Witwer

In perhaps the most practical chapter in the book, Adam discusses the ins and outs of font embedding from his perspective as a publisher, writing:

Font obfuscation has been the source of much confusion. If you dig around the Web, you’ll find plenty of blog posts and forum chatter full of confused and frustrated ebook makers trying to make sense of it all. The confusion stems largely from the fact that, until recently, the IDPF and Adobe had competing font obfuscation algorithms, and reading systems supported one or the other. If you used the Adobe obfuscation method, your embedded font would render correctly on maybe the NOOK but not in iBooks, and so on.

Font embedding in ebooks is messy and confusing, full of traps for the unwary; it’s telling that Adam’s chapter has more footnotes than any other.

Interactivity

I wrote this chapter. It’s pretty great, you should read it.

Global language support: Murata Makoto

This chapter will be a lifesaver for those publishers struggling to produce correctly formatted ebooks for the Inner Mongolia market.

(Seriously, there’s invaluable information here on EPUB 3’s support for Asian languages, right-to-left scripts like Hebrew and Arabic, and the interesting edge cases that emerge in rendering numbered lists and hyphenation. It’s worth reading just for a high-level overview of the immense diversity in modern human writing systems.)

Accessibility, validity, et al

Last but absolutely not least, the chapters on accessibility are must-reads for anyone serious about producing ebooks. I’m not sure there’s a better reference on advanced topics like EPUB 3 text-to-speech (TTS) support, media overlays, and other features that — while designed for the print-disabled — offer tremendous options for creativity and truly enhanced digital-native publications. The section on understanding errors from epubcheck is also extremely welcome, as even experienced developers can sometimes be baffled as to the underlying causes of validation failures.

EPUB 3 Best Practices is an absolute must-have for anyone in our industry. Highly recommended.

Safari Books Online subscribers can read the entire book as part of their subscription.

The unXMLing of digital books

Back in January we announced that the fantastic publishing technology team from PubFactory had joined Safari Books Online. Since then we’ve been hard at work integrating the team into our systems, and they’ve been hard at work building and maintaining search and reference products for their clients in academic publishing.

It’s been a singular experience for me as these are my former colleagues: I worked at iFactory for a number of years as a software engineer. That was my first job connected to publishing. Before that I would have self-identified as a generic “web developer.” While I had always tried to work on web projects that mattered, it was clear to me after my very first publishing project that I’d found my industry. I started Threepress in 2008 to work as a digital publishing technologist.

Threepress specialized in ebook formats and ereaders, while the PubFactory team serves reference and academic publishers. It’s been instructive for me to compare how these two worlds have diverged or converged in the five years since I last worked in the reference field.

Books aren’t data

The EPUB format is strictly XML-based. From the metadata to the table of contents to the book content, an EPUB file must be almost entirely composed of text marked up according to well-defined XML schemas. Those schemas allow the EPUB book to be validated by a computer program that follows the schema and other well-defined business rules, ensuring consistent production. At the other end of the workflow, those same schemas would assure reading systems of the predictability of the books added to them.
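
In practice, that validation usually means running epubcheck over the whole package, or schema-validating individual documents. A quick sketch of the latter with lxml; the file names here are hypothetical, and EPUB’s schemas are published as RELAX NG:

from lxml import etree

# Hypothetical file names: a RELAX NG schema and a package document to check
relaxng = etree.RelaxNG(etree.parse('epub-package-30.rng'))
opf = etree.parse('package.opf')
print relaxng.validate(opf) # True only if the document obeys the schema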

EPUB 2 was released in 2007, though its design history extends back into the 1990s. At that time, academic publishers were among the only publishers producing and exchanging book data with retailers, mostly via library aggregators and portals. Those became natural models for the commercial ebook industry, which did not yet exist. Outside of publishing, XML was “obviously” on a path to overtake historically messy HTML, and so aligning with XML was aligning with the future of web standards.

These were all reasonable assumptions based on the shape of the digital publishing industry when EPUB 2 and its predecessors were codified.

At that time, trade book publishers largely had no need for textual markup. It was not a part of their production workflow, nor was it natively how they produced “digital books”, which were, with few exceptions, PDFs. (Safari Books Online was one of those exceptions, as we initially required DocBook XML, but we eventually accepted PDF and later EPUB.)

Why is XML so foreign to trade publishers?

XML excels as a data exchange format for textual content with hierarchy. Dictionary entries and journal articles are data. Dictionary entries and journal articles are regular. Even when somewhat unstructured, as in a research paper, the work still has a predictable shape, and its primary goal is information exchange.

A trade book is not data. Even non-fiction trade is a work of human creativity with unpredictable contours. In programming terms, most books are BLOBs, opaque shadowy things that can be moved from system to system but whose contents cannot be inspected in a mechanical way.

Novelists don’t create data. They create books.

Books can’t be wrong

Strict XHTML as a book markup format was the solution to a problem that didn’t exist. It didn’t fit neatly into an XML-based workflow because most book publishers didn’t speak XML anyway. It didn’t align with the direction of web standards, which abandoned an XML-centric approach for good in 2009. It didn’t make ebook consumption any easier for ereaders, because the challenges in ebook display are in the CSS and UI layers. And it didn’t make writing an ereader any easier because embeddable web browsers quickly became the de facto rendering engine, and those already excelled at rendering plain old HTML.

By far the biggest advantage of XML workflows is at the time of production, where one can validate that the XML document contains all of the data that is expected in the correct order, format, and position in the hierarchy.

Books aren’t actually subject to these constraints. You can’t write an XML schema to validate that a book has one or more chapters, as it may have no chapters at all. It may not have an author. It may not have any words. It may not have pages.

(I’d go on, but any discussion of the heterogeneity of books inevitably devolves into one of those tedious “What is a book?” slides at publishing conferences.)

Books can’t be right

An ebook application can’t do a lot of things that an XML-driven reference application can. In design meetings I find myself striking out interesting feature after feature: we can’t aggregate index terms across a corpus because there’s no standardized EPUB markup for them. We can’t apply a consistent style to chapter titles because of incompetent, un-semantic markup like <p class="header">. We can’t extract quotable epigraphs or context-highlight code samples or anything else that my PubFactory colleagues can dream up with their neatly ordered, well-defined XML inputs. EPUB content is a BLOB.

Some ebook systems do apply consistent styling or extract interesting information out of books, but that’s powered either by a huge amount of invisible human effort or a lot of advanced machine learning and heuristics. That capability doesn’t flow naturally out of the markup.

On the other hand, I can throw just about anything even resembling an EPUB book at our reading system — even if it’s completely invalid HTML tag soup — and it’ll load. Very little preprocessing is necessary; XSLT, which is hard to learn and harder to master, is almost absent from our workflow. And users can upload their own books from anywhere else in the publishing ecosystem.
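
As a toy illustration of that tolerance, the same lxml.html soup parser from the blog-generator script above happily repairs whatever you feed it:

import lxml.html

soup = '<p>an unclosed paragraph <b>with unclosed bold'
doc = lxml.html.fromstring(soup) # parses anyway; no validity required
print lxml.html.tostring(doc)    # lxml closes the tags for us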

The paperback ebook

Since EPUB emerged, a variety of simpler formats have been proposed, usually by individuals from the technology industry. They do a better job of solving the problem of book production by capable amateurs, but don’t serve the diverse needs of the publishing industry that EPUB represents: the print-disabled who need rich semantic markup, library catalog systems that want to analyze highly granular metadata, fixed layout books, multi-lingual books, graphic novels, interactive textbooks, and on and on. Full-blown EPUB solves real problems, but as John Maxwell put it at Books in Browsers 2012, XML is a format that serves incumbents.

I hope that the next revision of EPUB allows HTML5 markup, without the leading X-, as I don’t think that XML requirement is solving any problems for anyone. Rich metadata, on the other hand, offers a great deal to the ecosystem, and is a reasonable tradeoff for authoring complexity.

Until we have an EPUB sans XHTML, it’s worth considering a lightweight subset of the format, one that represents a convention over configuration approach. A “microformat” version — EPUB: the beach novel edition — could be mechanically “upsampled” into big boy EPUB for use in the real ecosystem. It won’t solve the problem of heterogeneity in books (which is, after all, not actually a problem except to reading system developers), but it could make it easier for even experienced ebook authors to create publications without firing up an XML editor, for the majority of books that have very simple metadata requirements.  I’ll outline some ideas for that in a future post.

0 to Book in 3 Days?

Question: How long does it take to write, produce, and print a book—and finalize all the standard e-formats?

A. 2 months
B. 4 months
C. 3 days

Granted, the answer obviously depends a lot on what type of book we’re talking about. It also depends on your definition of “finalized.” Up until last week, I’d have said Option A was possible, if the book had a lot of luck, a very determined author, and the best production team money can buy. Option B is more the norm, in my experience. But last week, I witnessed—and participated in—Option C, the 3-day book.

Last week I participated in a Book Sprint hosted by Google. It was facilitated by Adam Hyde, creator of the Book Sprint methodology, with Intro and Outro “unconference” workshops facilitated by Allen Gunn of Aspiration. You can find my daily impressions of the experience here.


Google Summer of Code Doc Camp Books. Evergreen, FontForge, and Etoys as paper and electronic books (Kindle, Android, iPad).

My takeaways

I was glad to get the opportunity to participate, and I learned quite a bit from the Book Sprint experience. But what I learned was not exactly what I set out to learn. I expected to observe the Book Sprint process and be able to map it to traditional trade or academic publishing workflows. While that still seems entirely possible, I think I learned a bit more than that.

Critical benefits of in-person collaboration

The biggest concept I took away from the book sprint experience was the power of in-person collaboration. It’s hard to convey the importance of this if you don’t experience it for yourself. A group of subject matter experts and end users sitting together in a room talking things through produces amazing results. It’s not just that you’re cutting out all the time lag of email and phone tag. It’s the immediate exchange of ideas that quickly shapes and defines the concepts and the structure of the work. And when you’re focused on the book all day every day, without distractions, with the ability to ask questions and receive immediate answers, you stay in the zone. It’s a pretty amazing thing—and why limit this concept to Book Sprints?

Documentation takeaways

Documentation is such a critical part of my Safari work, and I didn’t expect to think about that at all during the sprint, but I actually learned a few things about documentation that I’ll be putting into practice for myself.

  1. Documentation doesn’t work if it’s based on “what you think users should know” rather than “what users want to know.”  And while you can try to force yourself into the mindset of the user, it’s much better to actually involve the user. This needs to happen both at the outset of the documentation creation process, and on an ongoing basis, so you can keep improving your documentation. I have long been dissatisfied with documentation I’ve produced, but I’ve never been galvanized to rethink it. Now I am. Using what I learned at the Book Sprint, my team and I plan to host our own sprint-informed documentation session after the holidays.
  2. Documentation in a vacuum is not effective: you need to build a community. Here’s another one I already knew, but the sprint really reinforced and clarified the issue, and gave me ideas on how to solve it: users need to engage with the documentation. We need to be able to have conversations with the entire user group, or it just won’t get used… as I’ve learned the hard way!
  3. Infrequent, monolithic documentation updates are a real pain, and don’t serve the users. OK, again, I already knew that, but before now I had no solution to the problem. More than once, I’ve been guilty of letting the list of necessary updates sit ignored for way too long. You need the engagement of the community to keep you motivated, and just as importantly, you need the right tools to make frequent updates easy! More about tools later in this post.

Type of book

I think there are a wide variety of types of books that could benefit from the Book Sprint process. Whether or not you come out of the sprint with a book that’s ready to distribute is the biggest question. For a topic in serious need of documentation, the sprinted book is often ready for release, because it fills an immediate need, and a somewhat unpolished book is far better than no book. This frequently applies to the free software community, or any community where there isn’t necessarily the funding to produce documentation or written resources.

For professional, publishing-quality books, that’s no real problem either: the sprint still gets you way ahead of the game, and you can spend more time polishing the work after the sprint before you publish.

The tool that makes it possible

We received expert facilitation throughout the week, and without that, a Book Sprint couldn’t exist. Right behind good facilitation, though, is the collaborative authoring and production system called BookType, which was designed expressly for this workflow. It’s a simple-to-use web-based authoring environment, with special sauce: workflow and version controls, graphical representations of the work in progress, and powerful CSS formatting controls. It outputs print-ready PDF, EPUB, MOBI, and other formats. The best part is, you can easily update the content anytime and output fresh files. No conversions necessary, no time lag waiting for your EPUB or MOBI. Wow, sounds like the future of publishing is here.

Check out my daily posts, or for more information, see www.booksprints.net.