Profiling Django via Middleware

For most developers working on a web application useful enough that people actually want to use it, a time comes when they need to start thinking more about how to make it faster. There are several good books (and videos) on how to go about this for web applications in general. There’s also pretty good advice for optimizing Python applications (like this, which covers some of the profiling tools I mention below). For getting performance data from a live site, you probably want to use something like New Relic, which can give you a lot of useful information with relatively little performance impact. But sometimes, you’ve already figured out that the code which serves a particular URL is slow, and you just want to optimize that (preferably in a local test server without needing to hook up a lot of external services to it, so you can quickly measure before and after performance data).
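
The full post goes into the details, but the core idea is easy to sketch. Here is a minimal, hypothetical example of the approach (not the article’s actual code): a middleware that wraps the view in cProfile whenever a special query parameter is present, and returns the profiler statistics in place of the normal response.

import cProfile
import pstats
from StringIO import StringIO

from django.http import HttpResponse

class ProfilerMiddleware(object):
    """Profile any request whose query string includes 'prof' (dev use only)."""

    def process_view(self, request, callback, callback_args, callback_kwargs):
        if 'prof' in request.GET:
            self.profiler = cProfile.Profile()
            # Call the view under the profiler instead of letting Django call it
            return self.profiler.runcall(callback, request,
                                         *callback_args, **callback_kwargs)

    def process_response(self, request, response):
        if 'prof' in request.GET and hasattr(self, 'profiler'):
            out = StringIO()
            stats = pstats.Stats(self.profiler, stream=out)
            stats.sort_stats('cumulative').print_stats(30)
            return HttpResponse('<pre>%s</pre>' % out.getvalue())
        return response

Add something like this to MIDDLEWARE_CLASSES in a development settings file only; it keeps state on the middleware instance, so it isn’t meant for concurrent production traffic.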

Managing Multilingual Documentation With Sphinx and Transifex

Context: Help File for a Cross-Platform Application

One of the hobby software projects I’ve worked on over the years is an open source end-user database application called PortaBase. I originally wrote it for the Sharp Zaurus line of Linux-based PDAs but have since ported it to Linux/UNIX, Windows, Mac OS X, and Nokia’s now-abandoned Maemo platform for cell phones and internet tablets (I still use the N900 as my cell phone). PortaBase is a pretty useful little program that I use daily for all sorts of information management tasks, but what I want to talk about this time is the documentation…and specifically, managing translations of it into multiple languages.

The Zaurus had a pretty simple system for application help files: create an HTML file named after the application, put it in the right place during installation, and the user could click a little question mark in the title bar to open that help file in a basic built-in HTML viewer. You could have multiple files linked from the main one, but that was more work to manage and PortaBase was originally simple enough that one long-ish page was good enough. And there was another reason to limit the documentation to a single file: the Zaurus was primarily sold in Japan, and fairly early in development one of the PortaBase users contributed a translation of the help file into Japanese. I posted instructions on how to contribute new translations (of both the UI and the help file), and now there are at least partial translations of PortaBase into ten different languages. At first, having just one HTML file for the help document made it easier for the translators to deal with and for me to keep track of everything.

But there were problems with this solution. As features were added to PortaBase, the help file kept getting longer and it became easy to get lost in it. Some of the translators didn’t really understand file encodings, and sometimes sent me files that had been corrupted over the course of multiple accidental encoding conversions. Some of the translators weren’t very good with HTML, and found the markup a significant barrier to working on the file. And whenever the content of the file changed, it wasn’t easy to keep track of the differences (I sent the translators diffs from the previous version, but then they had to cross-reference that with what they’d already written, and again the diff format was foreign to some of them). Net result, a lot more people translated the user interface text than the help file, because that was in a file format which had dedicated tools that were better suited for managing and updating translations (also, that one massive HTML file looked too intimidating to get started on). About 2 years ago, I decided to completely redesign the help system in order to solve some of these problems.

Sphinx

The core of the redesigned PortaBase help system is Sphinx, a tool written in Python for generating documentation in various output formats from input files written using reStructuredText (reST), a simple but powerful wiki-style syntax. I took the monolithic HTML file and split it up into a separate text file for each section (you can find them here). There’s still some markup syntax that you have to memorize, but it’s pretty intuitive and much easier to read at a glance than HTML.

One of the nice features of Sphinx is that you can generate output in multiple formats: HTML, PDF, EPUB, LaTeX, plain text, etc. For PortaBase I really only needed the HTML output (here’s the English version), but the PDF output also turned out pretty well, and being able to generate an EPUB for loading onto an ebook reader is nice too.

Probably the biggest reason for me switching to Sphinx, though, was that it can automatically generate translation message files from the input files, and then automatically incorporate them when generating the output—in all of the supported formats. It uses the gettext .po format, which is supported by a lot of translation tools and used in much open source and free software. This was a key point; normally splitting one big file into a bunch of little ones would have made it harder to keep track of everything, but now I could use an online system like Transifex to do much of the work for me.

Transifex

Transifex is an open source Django project for managing translations online, with development funded by charging for hosting of commercial projects (open source projects can get free hosting). It supports a variety of file formats, including both the .po files used by Sphinx and the Qt Linguist files used for the PortaBase user interface. Translations can be done directly in a web browser, eliminating file encoding problems and the need to have translators install custom translation software (for the UI translations). The project page gives a good overview of how complete the different translations are, and you can drill down to get more information.

Additionally, there’s a command line client which makes it easy to grab the latest versions of all the files (or specific ones) and check them into a source control system. This is perhaps the biggest time-saver in the new system for managing the help files. I no longer need to send out a burst of emails with translation and diff files for various languages just before a release, hoping that the translators have time to work on them relatively soon; they can just check the site occasionally and update any files that have been updated since the last time they looked. Also, because the help file was broken down into individual phrases and grouped into separate files, it’s now much less intimidating to get started on and easier to see exactly what changed since the translation was last updated. And even if they don’t finish a translation before a release, I can easily include whatever they’ve managed to get done so far.

You can see the resulting documentation for PortaBase translated into Czech, French, Japanese, and traditional Chinese. I maintain the Japanese translation myself, so I can definitely appreciate the simplified workflow for translators that Transifex provides.

Remaining Issues

This combination is working pretty well for me, but it does have some problems and limitations of its own:

  • While translators don’t need to install software on their computers anymore, developers and Linux distribution maintainers who want to compile and package a full working version of PortaBase have a few more hoops to go through. They need Python, Sphinx, and gettext installed.
  • Sphinx makes it pretty easy to generate output in a single language, but doesn’t really help you generate the output in all the supported languages at once. I ended up writing a few scripts to automate this process on various platforms (see the sketch after this list).
  • Some locales are identified differently across different platforms (for example, zh_CN and zh_TW versus zh-Hans and zh-Hant). I had to account for that in my scripts also (although this wouldn’t necessarily be a problem if you just wanted to post content on the web, rather than package software for distribution).
  • Sphinx conveniently provides translation files for the phrases it automatically generates in the output (stuff like “Search”, “Table of Contents”, etc.), but some of the translations aren’t up to date and some of the phrases are a little…less than obvious. Without looking at the source code and understanding Python, translators get a little baffled when you ask them to translate things like “%s %s documentation” or “ (in ” with no additional context.
  • Some of the phrases are translated in JavaScript (like search result phrases including numbers) rather than Python, and these are currently kept in Unicode-escaped JavaScript files rather than the main message files, making the process of translating them rather tedious.
  • A couple of the PortaBase translators don’t like signing up for accounts on random web services (like Transifex), but it’s still an improvement over the old process for them to be able to download the files directly from an intuitive UI, and then send me the updated files to upload back into Transifex for management.
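
Regarding the second point, a rough sketch of the kind of build script I mean (the locale list and directory names here are illustrative, not PortaBase’s actual build code) just loops over the locales and re-runs sphinx-build with the language setting overridden:

import subprocess

# Hypothetical locale list; the real scripts also have to remap
# platform-specific names like zh_TW versus zh-Hant
LANGUAGES = ['en', 'cs', 'fr', 'ja', 'zh_TW']

for lang in LANGUAGES:
    subprocess.check_call([
        'sphinx-build',
        '-b', 'html',                  # build the HTML output
        '-D', 'language=%s' % lang,    # override the language config value
        'source',                      # reST source directory
        'build/html/%s' % lang,        # per-language output directory
    ])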

I do intend to submit code to the Sphinx project to address some of these if somebody else doesn’t beat me to it (which is entirely possible given the number of other things my time gets filled up with).

Sphinx Does a Lot More

Even though I’ve mentioned a number of Sphinx’s features here, this is only a fraction of what it’s capable of. Read through its documentation if you want to learn more about in-browser search of the generated documentation, documenting source code, auto-generating documentation from Python docstrings, and more. There’s also a huge variety of extensions for Sphinx; at Safari Books Online, for example, we’re using javasphinx and JsDoc Toolkit RST-Template to generate comprehensive searchable documentation that covers Java, JavaScript, and Python APIs as well as wiki-formatted technical documents. I look forward to exercising and stretching its limits further as I find new and interesting ways to employ it.

Optimizing JavaScript in a Django Project: django-require

JavaScript code is becoming an increasingly large part of most web applications, but this often isn’t reflected in server-side web application frameworks. The core Django framework, for example, offers very little explicit support for JavaScript; it’s generally handled as just another type of “static file”, a catchall term for any file needed to render the site which isn’t Python source code or an HTML template.

The problem with this categorization is that modern JavaScript applications look more like traditional compiled software than static images; there’s source code, “compiled” (minified) code, compiler settings, dependency management, and frequent changes in the source that require rebuilding the compiled version. The Django staticfiles framework makes it possible to do all this in a separate deployment process, but that isn’t really ideal. For example, you don’t want to run all of your tests using the JavaScript source files, then discover after deployment that a bug in your minification process broke the compiled JavaScript files in production. The recently released django-require offers one solution to this problem.

Get organized

The first step in getting a handle on this is to organize the JavaScript code itself. The language doesn’t exactly lend itself to this very well, but through herculean effort a few decent solutions have been developed. One of the most popular is the RequireJS library, an Asynchronous Module Definition (AMD) loader. Instead of requiring you to list your scripts in exactly the correct order (sometimes difficult in a template inheritance hierarchy using a large set of JavaScript files) and hoping that none of them use the same global variable in conflicting ways (definitely difficult in a language where it’s actually hard not to accidentally define new global variables), AMD allows you to explicitly list the dependencies for a module of JavaScript code. For example:

define(['first', 'second', 'third'], function (first, second) {
    var exports;
    // My code which depends on first and second goes here
    return exports;
});

There are a few different ways of writing an AMD module, but this is one of the most common. Basically it lets you use standard JavaScript syntax to define a new module of JavaScript code, which explicitly depends on other such modules (but may not directly need their return values). This example depends on three other modules, but only directly uses code from two of them; the third was needed for some side effect (perhaps on the other modules or on the page DOM). The return value of the module’s function can then in turn be passed on to other modules which depend on it.

The big wins here are that your code’s dependencies are explicitly stated, you don’t need to add anything to the global namespace beyond the define() and require() functions provided by RequireJS, and RequireJS takes care of loading the dependencies when they’re first needed (and not executing the code in the same module twice). Of course, this is part of the core functionality of most other programming languages…

Dependency management

Anyway, you can now build up even a fairly large JavaScript codebase with confidence that you can keep each file to a manageable size and keep all the dependencies straight. But you wouldn’t want to put code like this directly into production; loading potentially dozens of little unoptimized JavaScript files to render a single page is hardly ideal.

To address this problem, RequireJS provides an optimizer called r.js. It analyzes all the dependencies of the main module, packs them into a single file, and uses a minifier on the result. You can even configure it to build multiple different top-level modules (for example, if each page has different scripts) and to break some of the dependencies out into one or more separate files (which can be useful if your scripts on different pages have some big library dependencies in common). If you do collapse all of the scripts on a page into a single file, you can use almond to shrink it even further by using a simpler AMD loader which doesn’t need to be able to dynamically load modules from other files. This lets us compile a version of our JavaScript that we’d be willing to serve in production…but how do we let developers test the compilation process locally to make sure deployments aren’t going to fail?

Enter Django

This is (finally) where django-require comes in. It’s a Django app which provides a mixin for staticfiles storage backends (as well as a couple of such storage classes already configured to use it, for convenience). When Django’s collectstatic management command is run to collect all the static files for deployment, this mixin adds a step which will run r.js in order to generate the optimized version of your JavaScript code. Because it’s integrated with the storage backend, the newly created optimized JS files are handled the same way as the rest of your static files (given a cache-breaking hash suffix when using require.storage.OptimizedCachedStaticFilesStorage, uploaded to Amazon S3 if using a subclass of storages.backends.s3boto.S3BotoStorage, etc.). And if you have Selenium or other browser-based tests utilizing LiveServerTestCase, you can just run collectstatic first and your automated tests will use the optimized assets which were generated, allowing you to test the minification process.
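
As a concrete (if hypothetical) illustration of that last point, a browser-test base class might run collectstatic itself before the live server starts, so the optimized bundles are always the ones being exercised:

from django.core.management import call_command
from django.test import LiveServerTestCase

class OptimizedAssetsTestCase(LiveServerTestCase):
    """Browser tests that run against the optimized static files."""

    @classmethod
    def setUpClass(cls):
        # Build the optimized JavaScript the same way a deployment would,
        # so a broken minification step fails the tests rather than production
        call_command('collectstatic', interactive=False, verbosity=0)
        super(OptimizedAssetsTestCase, cls).setUpClass()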

The main configuration settings for django-require are managed like any other Django settings; for example, your settings.py file might contain:

STATICFILES_STORAGE = 'require.storage.OptimizedCachedStaticFilesStorage'
REQUIRE_BASE_URL = 'js'
REQUIRE_BUILD_PROFILE = 'app.build.js'
REQUIRE_JS = os.path.join('src', 'require.js')
REQUIRE_ENVIRONMENT = 'node'

For convenience, django-require includes the r.js optimizer itself, almond, the Rhino JavaScript interpreter for Java, a simple default build configuration, and the require.js script (you’ll probably want to copy the latter into your project’s source code as a static file, possibly using the provided require_init management command). Your configuration for r.js (top-level modules, which minifier to use, etc.) goes into an app.build.js file (as normal for that tool), the location of which is specified using REQUIRE_BUILD_PROFILE as shown above. It also provides a template tag for including your main page script in a template in a way which works both in development with REQUIRE_DEBUG = True and uses the optimized files when REQUIRE_DEBUG = False.

If you have a more complicated setup involving different main scripts for each page (possibly in different directories) with a common script of base dependencies, setting the RequireJS configuration in a single place and having it work across all those scripts in both source and optimized modes can sometimes be a little tricky; I normally put the configuration in its own file (config.js), something like:

require.config({
  baseUrl: "/static/js",
  paths: {
    jquery: "src/jquery-1.7.1",
    underscore: "src/underscore",
    backbone: "src/backbone",
    text: "src/text"
  },
  shim: {
    "src/chosen.jquery.min": {
      deps: ["jquery"],
      exports: "jQuery.fn.chosen"
    },
    "src/jquery.cookie": {
      deps: ["jquery"],
      exports: "jQuery.cookie"
    },
    "src/jquery.placeholder": {
      deps: ["jquery"],
      exports: "jQuery.fn.placeholder"
    }
  }
});

This can be specified as the configuration for the optimizer easily enough in app.build.js:

mainConfigFile: "./config.js",

And then your base template can look something like this:

<script src="{% static 'js/src/require.js' %}"></script>
{% if REQUIRE_DEBUG %}<script src="{% static 'js/config.js' %}"></script>{% endif %}
<script>require(["{% static 'js/pages/base.js' %}"], function () {
    {% block page_script %}{% endblock %}
});</script>

A page which has its own script which depends on base.js would have something like the following in its template:

{% block page_script %}require(["{% static 'js/pages/page1.js' %}"]);{% endblock %}

This ensures that the base script finishes loading before the page-specific ones which depend on it, and that in development mode the RequireJS configuration runs first, so RequireJS doesn’t infer the wrong baseUrl from the location of the first script it encounters.

Once you have everything configured, you can test the site using the optimized JS files by running the following commands (remember to set DEBUG = False):

./manage.py collectstatic
./manage.py runserver --insecure

Now you can start writing those automated tests I mentioned earlier, to make sure things stay working. For example, you could write some Selenium tests for your Django site, as I described in some earlier posts.

Up and Running with Celery and Django (also cron is evil)

The longer I’m a programmer, the lazier I become. Several years ago I’d have been a giddy schoolgirl if you told me to write a templating engine from scratch. Or authentication: wow, dealing with HTTP headers and sessions got me so excited!

Nowadays I wonder why things can’t just work.

At Safari, there are lots of services with moving parts that need to be scheduled, and I’ve gradually started to really dislike cron. Sure, it’s great for one-off tasks, but handling lots of tasks asynchronously is not one of its strong suits. And really, I’m just too lazy to write the logic to handle failures, redos, and other catch-22s that happen in the pipeline. Instead, I now use a combination of Django and the task queue Celery.

Enter Celery and Supervisor on Ubuntu

Ubuntu is quite nice to work with, as they keep packages relatively up to date. Supervisor? Redis? They just work, almost like magic. Here are the steps to get a cron-free world up and running in a jiffy (with a Python virtual environment):

First, let’s install the necessary Ubuntu packages, create a working environment for the project, and get the necessary Python libraries. Let’s call the project Thing.

$ sudo aptitude install supervisor redis-server
$ mkdir thing-project
$ cd thing-project
$ virtualenv --prompt="(thing)" ve 
$ . ve/bin/activate
(thing)$ pip install django django-celery redis

Now we can start to put the Django pieces together. Start a new Django project called thing with an app called automate where we’ll put our tasks. Also, add a serverconf/ directory to keep your server/service configs separate.

(thing)$ django-admin.py startproject thing # now we have one too many dirs
(thing)$ mv thing/thing/*.* ./thing/
(thing)$ mv thing/manage.py ./
(thing)$ rmdir thing/thing/
(thing)$ python ./manage.py startapp automate && touch automate/tasks.py
(thing)$ mkdir serverconf

Your project should look something like this:

/thing-project           # Container directory
    manage.py            # Run Django commands

    /ve                  # Your virtualenv

    /automate            # New app we're starting
        models.py
        tests.py
        views.py
        tasks.py         # Where the magic goes

    /thing
        settings.py      # Project settings

    /serverconf
        # Server Configs go in here, apache, supervisor, etc.

Add automate to the INSTALLED_APPS section in your settings.py and be sure to alter your DATABASES to use your backend of choice. My DATABASES looks like this:

DATABASES = { 
    'default': { 
        'ENGINE': 'django.db.backends.sqlite3', 
        'NAME': 'thing.db',
        'USER': '', 
        'PASSWORD': '', 
        'HOST': '',
        'PORT': '', 
    } 
}

Something to Do

Now let’s just create a basic framework that does something, like crawl a web site for content. Modify your automate/models.py to look like this:

import urllib2

from django.db import models

class WebContent(models.Model):
    # I like timestamps
    timestamp_created = models.DateTimeField(auto_now_add=True)
    timestamp_updated = models.DateTimeField(auto_now=True)

    url = models.CharField(max_length=255)
    content = models.TextField(null=True)

    def update_content(self):
        self.content = urllib2.urlopen(self.url).read()
        self.save()

Test it out, it should work just fine:

(thing)$ python manage.py syncdb
(thing)$ python manage.py shell
>>> from automate.models import *
>>> rec = WebContent.objects.create(url='http://techblog.safaribooksonline.com')
>>> rec.update_content()
>>> print rec.content
### Really long dump of web site ###

A Real, Grown-up, Cron-like Task

Now we need to start adding the ingredients to turn this into a celery task (the equivalent of a cron job). First, add djcelery to your list of INSTALLED_APPS and remember to run (thing)$ python manage.py syncdb again as well. Somewhere near the bottom of your thing/settings.py, add this:

import djcelery

from celery.schedules import crontab

djcelery.setup_loader()

BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "database"
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
CELERYBEAT_PIDFILE = '/tmp/celerybeat.pid'
CELERYBEAT_SCHEDULE = {} # Will add tasks later

And while we’re at it, let’s modify the automate/tasks.py file, where celery tasks are actually defined:

from celery.task import task

from automate.models import WebContent

@task
def update_all_sites():
    for rec in WebContent.objects.all():
       print "Updating site: %s" % rec.url
       rec.update_content()

Test it out by running the celery daemon (aka worker). Then queue the task in a separate terminal.

1st terminal:

(thing)$ python manage.py celeryd -l INFO

Note the following line to show that celery sees the task:

[Tasks] 
 . automate.tasks.update_all_sites

2nd terminal:

(thing)$ python manage.py shell
>>> from automate.tasks import *
>>> update_all_sites.apply_async()
<AsyncResult XXXXXXXXXXXXXXXXXXX>

Your 1st terminal should have all kinds of awesome things going on:

[XXX: INFO/MainProcess] Got task from broker: automate.tasks.update_all_sites[96c45361-e68c-4e53-91c9-c578403baed7] 
[XXX: WARNING/PoolWorker-1] Updating site: http://techblog.safaribooksonline.com 
[XXX: INFO/MainProcess] Task automate.tasks.update_all_sites[96c45361-e68c-4e53-91c9-c578403baed7] succeeded in 1.42567801476s: None

Wow, it works! Now update your CELERYBEAT_SCHEDULE (like the timing in a cron job) in your settings.py to schedule the task.

CELERYBEAT_SCHEDULE = {
    # Update web sites every 24h
    "update-web-sites": {
        "task": "automate.tasks.update_all_sites",
        "schedule": crontab(minute=0, hour=0),
    }
}
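
If you don’t need cron-style syntax, the schedule can also be a plain timedelta; a hedged example for a six-hour interval:

from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    "update-web-sites": {
        "task": "automate.tasks.update_all_sites",
        # Run every six hours instead of once a day at midnight
        "schedule": timedelta(hours=6),
    }
}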

The Final Piece

The final piece of the puzzle is to set up supervisor so that celery runs automagically alongside Django. Create a log directory called /var/log/thing. Your serverconf/thing-supervisor.conf should look something like this:

;======================================= 
; celeryd supervisord script for django 
; ======================================= 
;; Queue worker for the web interface. 

[program:celery-thing] 
command=/path/to/thing-project/ve/bin/python /path/to/thing-project/manage.py celeryd --loglevel=INFO 
directory=/path/to/thing-project
environment=PYTHONPATH='/path/to/thing-project/ve' 
user=www-data
numprocs=1 
stdout_logfile=/var/log/thing/celeryd.log 
stderr_logfile=/var/log/thing/celeryd.log 
autostart=true 
autorestart=true 
startsecs=10 
stopwaitsecs=30

; ========================================== 
; celerybeat 
; ========================================== 
[program:celerybeat-thing] 
command=/path/to/thing-project/ve/bin/python /path/to/thing-project/manage.py celerybeat 
directory=/path/to/thing-project
environment=PYTHONPATH='/path/to/thing-project/ve' 
user=www-data 
numprocs=1 
stdout_logfile=/var/log/thing/celerybeat.log 
stderr_logfile=/var/log/thing/celerybeat.log 
autostart=true 
autorestart=true 
startsecs=10 
stopwaitsecs = 30

Finally, create the symlink so that your serverconf/thing-supervisor.conf is loaded when supervisor starts up:

$ sudo ln -s /path/to/thing-project/serverconf/thing-supervisor.conf /etc/supervisor/conf.d/thing-supervisor.conf
$ sudo service supervisor start

There you have it, a complete install without using cron. Now you can go on to do all the cool things that celery supports, e.g. task retries, chaining, etc.
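
As a quick taste of the retry support, here is a hedged sketch (not part of the project above; the task and its arguments are illustrative) of a task that re-queues itself after a failure instead of you writing that bookkeeping by hand:

from celery.task import task

from automate.models import WebContent

@task(max_retries=3)
def update_site(rec_id):
    rec = WebContent.objects.get(pk=rec_id)
    try:
        rec.update_content()
    except Exception, exc:
        # Wait a minute, then let celery run the task again (up to 3 times)
        raise update_site.retry(exc=exc, countdown=60)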

Writing a Selenium Test Framework for a Django Site (Part 3)

See Part 1 and Part 2 for the rest of this series.

Now that we’ve figured out how to get Selenium tests to run in various browsers on different operating systems, we should probably get some actual test cases written. The Django documentation gives an example like this:

def test_login(self):
    self.selenium.get('%s%s' % (self.live_server_url, '/login/'))
    username_input = self.selenium.find_element_by_name("username")
    username_input.send_keys('myuser')
    password_input = self.selenium.find_element_by_name("password")
    password_input.send_keys('secret')
    self.selenium.find_element_by_xpath('//input[@value="Log in"]').click()

This is a good start, and is pretty cool in its own right, but there are a few potential problems:

  • Selenium sometimes returns to your code after a GET before the page has fully rendered in the browser, and often before any initial JavaScript has finished running.  It’s possible that the elements you want to interact with aren’t ready by the time your code starts to send the browser commands regarding them.
  • XPath, while having some minor advantages over CSS selectors, is generally much slower.  While it’s not a huge difference for any single execution, if you have a large test suite you should probably prefer fetching elements by ID or CSS selector whenever possible; this is both faster and more consistent with how you’re probably writing CSS and JavaScript anyway.
  • Quite a bit of code is repeated here, and this is just a single small test case.

I’ll address this last point first, because it informs work on the other ones: for any non-trivial test suite, it’s a good idea to write your own API to wrap the most common Selenium operations.  The Selenium webdriver API was optimized for ease of implementation by browsers, not for ease of writing tests.  Since we have a common base class for all of our Django/Selenium tests already, we can just build our custom API into that.  Let’s start with a simple example, abbreviating that GET operation:

def get(self, relative_url):
    self.sel.get('%s%s' % (self.live_server_url, relative_url))
    self.screenshot()

This is a minor win in conciseness: we can abbreviate the first command in the test case to self.get('/login/'). But the other handy thing is the second part of the method, which is implemented like this:

SCREENSHOT_DIR = os.path.dirname(__file__) + '/../../log/selenium_screenshots'

def screenshot(self):
    if hasattr(self, 'sauce_user_name'):
        # Sauce Labs is taking screenshots for us
        return
    name = "%s_%d.png" % (self._testMethodName, self._screenshot_number)
    path = os.path.join(SCREENSHOT_DIR, name)
    self.sel.get_screenshot_as_file(path)
    self._screenshot_number += 1

(You’ll also need to add self._screenshot_number = 1 to __init__().) That’s pretty useful; even when not using Sauce Labs, we can now generate screenshots for each page we load.  And we don’t need to specify when to do it in the tests, it just happens automatically.  Now let’s try a slightly trickier one, that duplicated code for text entry:
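
For completeness, that constructor addition might look something like this:

def __init__(self, *args, **kwargs):
    super(SeleniumTestCase, self).__init__(*args, **kwargs)
    # Counter used to give each screenshot in a test a unique file name
    self._screenshot_number = 1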

def enter_text(self, name, value):
    field = self.wait_for_element_by_name(name)
    field.send_keys(value)
    self.screenshot()
    return field

def wait_for_element_by_name(self, name):
    element_is_present = lambda driver: driver.find_element_by_name(name)
    msg = "An element named '%s' should be on the page" % name
    element = Wait(self.sel).until(element_is_present, msg)
    self.screenshot()
    return element

Ok, we’re going a little screenshot-happy, but it’s only disk space and they could be useful when debugging a failing test. Note that we’re being careful by explicitly waiting for the input field to be present before we try adding text to it. It’s a better idea to wait until a specific condition is met like this than to wait for a fixed duration, since different systems take different amounts of time to do stuff and you don’t want your test to take forever because it’s always waiting long enough for the slowest possible computer to finish each operation. Selenium provides a WebDriverWait class for handling cases like this, but I like to expand on it a little:


# Default operation timeout in seconds
TIMEOUT = 10

# Default operation retry frequency
POLL_FREQUENCY = 0.5

class Wait(WebDriverWait):
    """ Subclass of WebDriverWait with predetermined timeout and poll
    frequency.  Also deals with a wider variety of exceptions. """
    def __init__(self, driver):
        """ Constructor """
        super(Wait, self).__init__(driver, TIMEOUT, POLL_FREQUENCY)

    def until(self, method, message=''):
        """Calls the method provided with the driver as an argument until the \
        return value is not False."""
        end_time = time.time() + self._timeout
        while(True):
            try:
                value = method(self._driver)
                if value:
                    return value
            except NoSuchElementException:
                pass
            except StaleElementReferenceException:
                pass
            time.sleep(self._poll)
            if(time.time() > end_time):
                break
        raise TimeoutException(message)

    def until_not(self, method, message=''):
        """Calls the method provided with the driver as an argument until the
        return value is False."""
        end_time = time.time() + self._timeout
        while(True):
            try:
                value = method(self._driver)
                if not value:
                    return value
            except NoSuchElementException:
                return True
            except StaleElementReferenceException:
                pass
            time.sleep(self._poll)
            if(time.time() > end_time):
                break
        raise TimeoutException(message)

So basically we try to get the element, and keep trying every half-second until we either get it or (after 10 seconds total) throw an exception because we figure it’s not going to show up when we thought it should. The expanded implementations of until() and until_not() don’t gain us a lot in this particular case, but can be very useful when we’re doing something a little more complicated:

from selenium.webdriver.support.color import Color

def wait_for_background_color(self, selector, color_string):
    color = Color.from_string(color_string)
    correct_color = lambda driver: Color.from_string(driver.find_element_by_css_selector(selector).value_of_css_property("background-color")) == color
    msg = "The color of '%s' should be %s" % (selector, color_string)
    Wait(self.sel).until(correct_color, msg)
    self.screenshot()

In cases like this, I often find the normal WebDriverWait class failing a test because the element I’m inspecting was replaced by the browser (either due to JavaScript activity or browser implementation details) between obtaining it and trying to read its properties. The Wait subclass catches this StaleElementReferenceException and just tries again, up to the timeout limit.

Finally, let’s automate that submit button click a little better also:

def click(self, selector):
    element = self.wait_until_visible(selector)
    element.click()
    return element

def wait_until_visible(self, selector):
    """ Wait until the element matching the selector is visible """
    element_is_visible = lambda driver: driver.find_element_by_css_selector(selector).is_displayed()
    msg = "The element matching '%s' should be visible" % selector
    Wait(self.sel).until(element_is_visible, msg)
    self.screenshot()
    # until() returned the visibility flag, so look the element up again
    return self.sel.find_element_by_css_selector(selector)

So the sample test case given in the Django docs could now be abbreviated as follows:

def test_login(self):
    self.get('/login/')
    self.enter_text('username', 'myuser')
    self.enter_text('password', 'secret')
    self.click('input[value="Log in"]')

In addition to saving typing for new test cases, it’s also now easier to see at a glance what’s going on. We also get screenshots and appropriate wait durations for free. We’ve written a lot more support code than the number of characters saved here, but that’ll all be useful in other test cases as well. You did actually plan to write tests, and not just play around with Selenium because it was interesting, right?

Adventures in Search (Solr)

While Haystack is a great way for Django developers to get search up and running quickly and with relatively little pain, its strategy of supporting many popular search engines tends to encourage catering to the least common denominator. Fortunately Haystack, like Django, gets out of the way when you need it to and allows you to leverage the power of the underlying engine, like Solr, with little difficulty.

In building a search application for a collection of full-length books, several issues came up that were not well addressed by Haystack’s or Solr’s default configuration. While I made numerous configuration changes, the most generally interesting changes come from the analyzer settings in the schema file.

Searches including contractions and compound words turned out to be particularly difficult, so I spent many hours experimenting with analyzer settings in an attempt to obtain the results I expected. I have collected some of my most notable discoveries here.

Analyzers

An analyzer is a collection of tokenizers and filters. A tokenizer breaks sentences down into tokens (usually words) and a filter can transform tokens in any imaginable way. An analyzer can be set to run on the actual text of the book being indexed, or on the search string. The former is known as an index analyzer and the latter as a query analyzer.

Contractions

The default configuration of Solr and/or Haystack did not handle contractions well. Searching for aren’t did not return results with are not in them and vice versa.

I evaluated different methods of resolving this and ended up using a brute force method. There are a finite number of contractions in the English language, so why not just use a dictionary with a synonym filter? I started with a file that looked like the following:

aren't,are not
can't,cannot
couldn't,could not
...

Solr’s synonym filter includes all versions of a word in its indexes so that either the contraction or the full English representation will cause a hit. I used this synonym filter only in the index analyzer. Because all forms of the word are in the index, any form entered in the query will match.

Compound Words and Adjacency

The compound word issue first reared its ugly head with the word layoff. The sample book I was searching for had lay off in the title rather than layoff. I investigated Solr word de-compounding filters but was unsuccessful in getting any of them to work correctly (and welcome reproof and instruction in this matter). So I took things into my own hands and employed a trick of questionable validity.

I have been fascinated with n-grams for some time. An n-gram is a sequence of n items strung together. It can refer to letters, syllables, whole words, or any other unit. Solr supports letter-based n-grams, but my approach was to use word n-grams. Solr’s ShingleFilterFactory is a tool for creating word-based n-grams. I wanted to keep things simple and my database small so I chose to use n-grams of 2 words, otherwise known as bi-grams.

Using n-grams can help to give weight to word adjacency. What Solr’s shingle filter does is create a number of n-grams composed of multiple adjacent words. Using a shingle filter configured to produce bi-grams, a sentence such as “I like traffic lights” becomes a group of bi-grams: “I like,” “like traffic,” and “traffic lights.” Leaving the original words in the index is also an option. So with this option in place, Solr stores something like “I,” “I like,” “like,” “like traffic,” “traffic,” etc.

At first I configured my shingle filter to create simple bi-grams delimited by a space. But then I realized that a compound word is simply a word bi-gram with no delimiter. When using a shingle filter with no delimiter, a sentence like “I dislike lay offs” becomes “I,” “Idislike,” “dislikelay,” “layoffs”, and “offs.” Now all of a sudden we have a hit for our search term layoffs.
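
To make that concrete, here is a tiny Python illustration of what a bi-gram shingle filter produces (just an illustration of the idea, not Solr’s implementation); with an empty separator, the bi-grams collapse into compound-word candidates:

def shingles(tokens, separator, output_unigrams=True):
    """Emit word bi-grams (and optionally the original words), roughly like
    Solr's ShingleFilterFactory with maxShingleSize=2."""
    output = []
    for i, token in enumerate(tokens):
        if output_unigrams:
            output.append(token)
        if i + 1 < len(tokens):
            output.append(token + separator + tokens[i + 1])
    return output

print shingles("I dislike lay offs".split(), separator="")
# ['I', 'Idislike', 'dislike', 'dislikelay', 'lay', 'layoffs', 'offs']

The search term layoffs now gets a hit, at the cost of the occasional accidental match across word boundaries, as described below.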

This seems very clever at first, but you also must realize that you may get hits on strings that span words. For example, searching for “weight car ton,” we might now match the sentence, “I bought a carton of milk,” which is completely unintended. Search is voodoo to begin with, so we can always hope that this particular hit will be of such low weight (due to the low number of matching words) that the searcher will be unlikely to see it.

Note that I applied this filter to both the index and query analyzers. The jury (actually the QA team) is still out on this method.

Protecting Contractions and Strange Words

Now I discovered that my contractions were no longer matching. I had switched to a different tokenizer and begun using the WordDelimiterFilterFactory, which automatically strips off the endings of contractions (in addition to other nice features). This is counterproductive when your goal is to expand contractions. Fortunately, the word delimiter filter has a protected words feature. I simply created yet another dictionary containing all the relevant contractions and the filter factory left my words alone.

In addition to contractions, there are some special words in the computing industry that can sometimes be difficult to search on. My current list is c++, c#, .net, g++, and f# (I welcome recommendations on additions to this list). Normally these words get stripped of their special characters and have no resemblance to the original.

Contractions Revisited

But I was not yet done fixing my broken contractions. Now that I was combining words into bi-grams, the synonym dictionary that solved my contraction problem broke. No problem, I just mangled the dictionary a bit so that it looked like:

aren't,arenot
can't,cannot
couldn't,couldnot
...

Stemming

Stemming is the process of removing inflected word endings to arrive at a stem. For instance, the stem of the word jumped is jump. Using a stemmer allows multiple forms of words to match. I chose to use the Hunspell stemmer with a dictionary from the OpenOffice project. This seems to be a little bit more accurate and precise than the default stemmer.

Conclusion

Search is always a work in progress. Even search giants like Google regularly tweak their search algorithm. This is a snapshot of my approach to search at this moment in time. Your mileage may vary and I welcome recommendations and corrections.

Disclaimer: I am not a search expert. I don’t know all the correct terminology and have probably made a number of blunders in my foray into the art of search. Suggestions in the comments are welcome!

The Analyzers Section of my Schema.xml

The ordering of filters in these analyzers is critical. This order has been established through many hours of trial and error.

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <!-- We are indexing mostly HTML so we need to ignore the tags -->
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" protected="protwords.txt"/>
        <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
        <!-- This deals with contractions -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" enablePositionIncrements="true" ignoreCase="true"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
        <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" tokenSeparator=""/>
        <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" enablePositionIncrements="true" ignoreCase="true"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

Writing a Selenium Test Framework for a Django Site (Part 2)

In the first post of this series I described how to write a Selenium test for a Django app which can run in an assortment of different browsers just by changing an environment variable.  This works fine as long as you only want to run tests on the OS you use for development, but out in the real world people are using web browsers on a variety of different operating systems.  In order to upgrade our tests from simulating “some plausible users of the site” to “almost any possible user of the site”, we need to get tests running on a variety of different operating systems.  And we’d like to do this without having to get our entire development environment working on each OS; we don’t really care if our server can run on every possible operating system, just that the browsers for each OS can access and use it properly.  This means having our test server and test code running on one machine while the automated browser is running on one or more other machines.

Selenium can be run as a “remote control” service which waits for instructions on how to drive any browser installed on that computer; documentation on doing so can be found here and here.  To utilize a remote control for our Django Selenium tests, we can update the setUp method of SeleniumTestCase as follows:

def setUp(self):
    self.browser = getenv('SELENIUM_BROWSER', 'chrome')
    if getenv('SELENIUM_HOST'):
        self.sel = self.remote_control()
    elif self.browser == 'firefox':
        self.sel = webdriver.Firefox()
    ...

The actual configuration of the remote control driver is implemented like this:

def remote_control(self):
    """ Configure the Selenium driver to use a remote control """
    host = getenv("SELENIUM_HOST")
    port = getenv("SELENIUM_PORT", "4444")
    executor = "".join(["http://", host, ":", port, '/wd/hub'])
    platform = getenv("SELENIUM_PLATFORM", "WINDOWS")
    version = getenv("SELENIUM_VERSION", "")
    caps = {
        "platform": platform,
        "browserName": self.browser,
        "version": version,
        "javascriptEnabled": True
    }
    return webdriver.Remote(command_executor=executor,
                            desired_capabilities=caps)

Note that we’re using environment variables again to specify how the test is to be run, so the test code itself doesn’t have to make any assumptions about it.  You may also notice that this looks very similar to the code in the original setUp() method for Safari; the Safari driver for Selenium is relatively new and not yet explicitly supported in the Python bindings, so the original method assumed that when driving Safari we have a Selenium remote control instance running in the default location of localhost port 4444.

If you have an assortment of different computers serving as remote controls, you can either specify for each test which one to use or set up a Selenium Grid.  In a grid, one computer serves as the hub which receives all requests for browser automation and distributes them to a computer in the grid capable of performing them.  Each computer in the grid needs to have Java, Ant, and Selenium Grid installed; detailed instructions on setting up a grid can be found here.

So, great; now we can run tests on our machine, or on another machine on the network which is more suited to the task.  But this still means we need a lot of different machines on our network if we want to be able to run our tests in a good variety of different operating systems and browser versions (necessary because those pesky users of our site haven’t agreed to all use the same browser yet).  It takes time to set them all up, and unless we run tests all the time or have other uses for those computers they’re going to be sitting there idle wasting electricity most of the time.  Oh right, that’s what the cloud is for!  Sure enough, Selenium Grid has instructions specifically for setting up a grid on Amazon EC2, and it shouldn’t be too hard to use a different vendor if your preference lies elsewhere.

So now we can set up an on-demand grid…assuming we have an account with a cloud computing vendor, teach ourselves how to launch and shut down computing resources, figure out how to get everything set up, script the process so it doesn’t take too long each time, and wait for one or more virtual machines to boot up every time we want to run a test.  If you’re a system administrator, there’s a decent chance that sounds interesting and well worth pursuing; if you’re a programmer or a QA engineer, odds are it sounds a lot like that system administration stuff you’ve been trying not to deal with any more than absolutely necessary.  For those people, there’s a company called Sauce Labs that thinks it has a solution for you.

Sauce Labs is basically a pre-built cloud of Selenium remote control nodes that you can access through a simple API, and then check out the results of the tests online.  They keep some of the more common testing configurations standing by so you usually don’t have to wait for one to boot up; it’s already set to go.  They also have spiffy results pages that show the Selenium instructions which were sent, output from the browser’s JavaScript console if any, screenshots of each step, and a video of the whole test.  It isn’t a free service, so you have to do your own cost-benefit analysis of whether or not it’s worth the money for you, but for a lot of people it’s a convenient way to bypass a lot of effort in setting up their own local or cloud testing grid.  If we want to add explicit support for running tests at Sauce Labs in our all-purpose test superclass, we can do it as follows:

def setUp(self):
    self.browser = getenv('SELENIUM_BROWSER', 'chrome')
    if getenv('SAUCE_USER_NAME'):
        self.sel = self.sauce_labs_driver()
    elif getenv('SELENIUM_HOST'):
        self.sel = self.remote_control()
    ...

The actual configuration can be done as follows:

def sauce_labs_driver(self):
    """ Configure the Selenium driver to use Sauce Labs """
    host = getenv("SELENIUM_HOST", "ondemand.saucelabs.com")
    port = getenv("SELENIUM_PORT", "80")
    executor = "".join(["http://", host, ":", port, '/wd/hub'])
    platform = getenv("SELENIUM_PLATFORM", "Windows 2008")
    version = getenv("SELENIUM_VERSION", "")
    self.sauce_user_name = getenv("SAUCE_USER_NAME")
    api_key = getenv("SAUCE_API_KEY")
    self.sauce_auth = base64.encodestring('%s:%s' % (self.sauce_user_name,
                                                     api_key))[:-1]
    caps = {
        "platform": platform,
        "browserName": self.browser,
        "version": version,
        "javascriptEnabled": True,
        "name": self._testMethodName,
        "username": self.sauce_user_name,
        "accessKey": api_key
    }
    return webdriver.Remote(command_executor=executor,
                            desired_capabilities=caps)

Note that this is pretty similar to configuring a remote control driver.  The main difference is that you need to specify a username and API key for the Sauce Labs service so they know who to bill.  Note that we also pass along the name of the test method which is about to be run; this makes it easier to distinguish our tests in the Sauce Labs web interface.  If we just do this much and run our tests, we’ll notice that although we can see test successes and failures indicated on our console, Sauce Labs can’t tell the difference; they don’t know when our test code throws an exception or fails an assertion.  So if we want to be able to tell at a glance at their UI which ones failed, we need to tell them:

def tearDown(self):
    # Check to see if an exception was raised during the test
    info = sys.exc_info()
    passed = info[0] is None
    self.report_status(passed)
    self.sel.quit()
    super(SeleniumTestCase, self).tearDown()

def report_status(self, passed):
    if not hasattr(self, 'sauce_user_name'):
        # Not using Sauce Labs for this test
        return
    # Report failure if any individual test failed or had an error
    body_content = json.dumps({"passed": passed})
    connection = httplib.HTTPConnection("saucelabs.com")
    url = '/rest/v1/%s/jobs/%s' % (self.sauce_user_name,
                                   self.sel.session_id)
    connection.request('PUT', url, body_content,
                       headers={"Authorization": "Basic %s" % self.sauce_auth})
    result = connection.getresponse()
    return result.status == 200

If there was an error in the test or an exception was thrown due to an assertion failing, our tearDown() method recognizes this.  And whether the test passed or failed, if we’re running the test on Sauce Labs we send them a brief JSON message indicating that particular test result.

So now we have lots of different options on where to run our tests…but we haven’t really talked much about how to write the tests themselves.  I’ll take a look at that in part 3 of this series.

Writing a Selenium Test Framework for a Django Site (Part 1)

When you write any nontrivial software application, you soon realize that while fixing or improving one aspect of it, you often run a risk of accidentally breaking something else in the process.  And that going through the entire application looking for such broken parts every time you change something just isn’t very fun.  So you start writing automated tests so that a computer can do it instead, because unlike you or your associates working in QA, the computer doesn’t care if it’s fun or not.  The important thing to remember is that for most systems the ultimate goal is to make sure that the software works normally as a real person would use it, so ideally you need at least some tests that can fairly realistically pretend to be such a person.

Django has a rather nicely documented system for testing web applications built using it. At this point, most serious Django sites probably have some number of unit tests that load test data, exercise various parts of the code, and sometimes even pretend to be a very basic web browser by requesting a URL and analyzing the returned stream of text.  However, useful as they are, these tests don’t make for very convincing simulations of real users.  They don’t load all of the assets referenced by a page, run any of the JavaScript on it, look at the layout of elements on the page, try clicking on any of the links, etc.  Especially for AJAX-heavy sites, this leaves a wide range of potential bugs that aren’t being checked for at all.  Tools like Selenium improve on this by automating an actual web browser to load pages, run JavaScript, click on page elements, and so forth.  You can even pick which browser to automate, letting you test for an assortment of cross-browser compatibility bugs.  Django 1.4 added some support for running Selenium tests, but that part of the documentation is still a little thin.  In this series of blog posts I’ll be going over a few tips for getting more out of Selenium tests for a Django app.

The sample Selenium test in the Django documentation starts out like this:

from django.test import LiveServerTestCase
from selenium.webdriver.firefox.webdriver import WebDriver

class MySeleniumTests(LiveServerTestCase):
    fixtures = ['user-data.json']

    @classmethod
    def setUpClass(cls):
        cls.selenium = WebDriver()
        super(MySeleniumTests, cls).setUpClass()

One problem with this which soon becomes apparent: it assumes that your test will always be run using Firefox on the same machine which is running the tests and the application (LiveServerTestCase starts up an instance of your Django server which is used for all the test methods in the class).  Also, it can be useful to start a separate browser session for each test case rather than share one for the entire class; this can help avoid stability problems from automating the browser for too long and work better with tools I’ll discuss later in this series for capturing data about a test case run.  Solving these issues in a base class to be inherited by all your Selenium test classes can then look something like this:

from django.test import LiveServerTestCase
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
import time

class SeleniumTestCase(LiveServerTestCase):

    def setUp(self):
        self.browser = getenv('SELENIUM_BROWSER', 'chrome')
        if self.browser == 'firefox':
            self.sel = webdriver.Firefox()
        elif self.browser == 'htmlunit':
            self.sel = webdriver.Remote(desired_capabilities=DesiredCapabilities.HTMLUNITWITHJS)
        elif self.browser == 'iphone':
            command_executor = "http://127.0.0.1:3001/wd/hub"
            self.sel = webdriver.Remote(command_executor=command_executor,
                desired_capabilities=DesiredCapabilities.IPHONE)
        elif self.browser == 'safari':
            self.sel = webdriver.Remote(desired_capabilities={
                "browserName": "safari", "version": "",
                "platform": "MAC", "javascriptEnabled": True})
        else:
            self.sel = webdriver.Chrome()
        # Give the browser a little time; Firefox sometimes throws
        # random errors if you hit it too soon
        time.sleep(1)

    def tearDown(self):
        self.sel.quit()
        super(SeleniumTestCase, self).tearDown()

You can now use an environment variable (SELENIUM_BROWSER) to determine which browser to use when running your tests.  By simply changing this variable and re-running the tests, you can test with different browsers.  If you’re running your test on a Mac, you can even automate a browser running in an iPhone or iPad simulator.  By noting which browser is being used in self.browser, you can automatically skip tests which depend on application features that are known not to work in it or which rely on Selenium features that don’t yet work with that browser:

if self.browser in ["iphone", "firefox", "safari"]:
    # no advanced selenium interactions on iOS,
    # no WebSQL on Firefox, no moveTo on Safari
    raise SkipTest()

Note that we still haven’t worked around the limitation that all the tests be run on the local machine; there are ways to do that, which I’ll discuss in the next post in this series.

Avoiding “Too many keys specified” in MySQL

Yesterday I tried to deploy some code to a relatively new server. Unfortunately I was greeted with this:

django.db.utils.DatabaseError: (1069, 'Too many keys specified; max 64 keys allowed')

What does that mean? First, we have to figure out what a key is. Stack Overflow is helpful for this sort of thing: http://stackoverflow.com/questions/924265/what-does-the-key-keyword-mean. So a key is an index. Great. We definitely have a lot of Foreign Keys and each table has a Primary Key, but we certainly don’t have 64 Foreign Keys + Primary Keys.

Now that we know that a key is really an index, how do we see what indices exist? Luckily the stack trace that produced the error above at least had a table it seemed to be having trouble with (table: penguin). Running this command shows us the indices:

mysql> show index in penguin;
+---------+----------+-------------+
| Table   | Key_name | Column_name |
+---------+----------+-------------+
| penguin | PRIMARY  | id          |
| penguin | username | username    |
| penguin | email    | email       |
| penguin | email_4  | email       |
| penguin | email_2  | email       |
| penguin | email_3  | email       |
| penguin | email_5  | email       |
| penguin | email_6  | email       |
| penguin | email_7  | email       |
...
| penguin | email_62 | email       |
+---------+----------+-------------+
64 rows in set (0.00 sec)

Ah ha! There are the 64 keys, but why in the world are there 62 email indices? What could be adding an index to the email column during a deploy? Well, the traceback said that it failed during a run of Django’s syncdb, which creates new database tables to match your models. A little code snooping (thank you, git grep) yielded a piece of our code that hooked into the Django post_syncdb signal. This was the offending code:

def update_email_field(sender, **kwargs):
  from django.db import connection, transaction
  cursor = connection.cursor()
  cursor.execute("ALTER TABLE penguin MODIFY email VARCHAR(255) NOT NULL UNIQUE")
  transaction.commit_unless_managed()

post_syncdb.connect(update_email_field)

I learned that adding UNIQUE to an ALTER TABLE command will actually add a new index every time that command is run. Thus, we hit the MySQL maximum, could not add any more, and the deploy failed because of it.

Removing the UNIQUE keyword and ensuring the statement is run once, on penguin table creation, and not again later seemed the simplest solution to this problem.
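
For reference, a hedged sketch of one way to restrict the hook to table creation (illustrative only, not the exact code we shipped): use the created_models argument that the post_syncdb signal passes, so the ALTER TABLE, and therefore the index, is added exactly once.

def update_email_field(sender, created_models=None, **kwargs):
    # Only run when syncdb just created the penguin table, not on every deploy
    created_tables = [m._meta.db_table for m in (created_models or [])]
    if 'penguin' not in created_tables:
        return
    from django.db import connection, transaction
    cursor = connection.cursor()
    cursor.execute("ALTER TABLE penguin MODIFY email VARCHAR(255) NOT NULL UNIQUE")
    transaction.commit_unless_managed()

post_syncdb.connect(update_email_field)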