AskPythia - Implementation on top of WURFL

About WURFL

WURFL is an open-source Device Description Repository (DDR) published as an XML file. It contains information about the capabilities of many existing mobile devices. It is a collective effort from mobile developers from around the world. WURFL stands for Wireless Universal Resource File.

A device in the XML file is identified by its User-Agent. It is described by a set of properties (called capabilities in WURFL). A fallback mechanism allows grouping devices into categories, which in turn allows properties inheritance.

There is an official WURFL PHP API that may be used to interact with the underlying XML file. This API is not compatible with the DDR Simple API.

Overview

There is more than one way to implement the DDR Simple API on top of WURFL. AskPythia's implementation does not wrap the existing WURFL PHP API, but works with a prepared version of the WURFL database.

The capabilities defined in the WURFL database are mapped to DDR Core Vocabulary properties where possible. A couple of capabilities are taken verbatim from the WURFL vocabulary too, namely xhtml_table_support to detect whether the device supports tables and is_wireless_device as a generic mobile/non-mobile switch. The set of supported properties may be extended as needed.

The WURFL XML file is large (more than 10Mb) and contains thousands of identifiers. The file cannot be loaded whenever a request comes in for memory and performance reasons, and because there is no notion of application scope in PHP (variables that would stay in memory between requests). The AskPythia implementation of the DDR Simple API on top of WURFL uses a simple lookup algorithm that is easy to explain and does not rely on any backend architecture (no database required, and no specific writing rights needed). This simple lookup algorithm could be further optimized (e.g. through caching) for use in heavily stressed Web sites.

The WURFLService class implements the Service interface of the DDR Simple API. The path to the prepared WURFL database must be precised when an instance of the class is initialized, as the value of a wurfl_path setting, e.g.:

$service = ServiceFactory::newService(
    'WURFL',
    'http://www.w3.org/2008/01/ddr-core-vocabulary',
    array('wurfl_path'=>'[path to WURFL]'));

An InitializationException is thrown when the path is not properly set.

Supported properties

The list of properties supported by the implementation is given in the following table:

Supported properties in AskPythia's implementation on top of WURFL
Property name Namespace (vocabulary)
vendor http://www.w3.org/2008/01/ddr-core-vocabulary
model http://www.w3.org/2008/01/ddr-core-vocabulary
displayWidth http://www.w3.org/2008/01/ddr-core-vocabulary
displayHeight http://www.w3.org/2008/01/ddr-core-vocabulary
displayColorDepth http://www.w3.org/2008/01/ddr-core-vocabulary
markupSupport http://www.w3.org/2008/01/ddr-core-vocabulary
stylesheetSupport http://www.w3.org/2008/01/ddr-core-vocabulary
imageFormatSupport http://www.w3.org/2008/01/ddr-core-vocabulary
inputModeSupport http://www.w3.org/2008/01/ddr-core-vocabulary
cookieSupport http://www.w3.org/2008/01/ddr-core-vocabulary
scriptSupport http://www.w3.org/2008/01/ddr-core-vocabulary
xhtml_table_support http://wurfl.sourceforge.net
is_wireless_device http://wurfl.sourceforge.net

Device lookup algorithm

Once the requesting device has been identified and its properties have been loaded, accessing the device properties is a matter of searching a key in a relatively short set of keys. This should not trigger any performance bottleneck. The performance of the library largely depends on the device lookup algorithm. Identification relies on the value of the User-Agent HTTP header field.

Approximation needed

The problem is that this value may vary for the same model of device based on the build revision of the browser, the network the device is connected to, the current language of the browser, and/or the mobile network operator the device originates from. The WURFL database cannot possibly contain all the resulting strings. As a consequence, an exact match is usually not possible and the lookup algorithm must use approximation techniques.

In short, looking up a User-Agent string in the database means searching for the device id that is the most similar to the User-Agent string. In PHP, the levenshtein function calculate the Levenshtein distance between two strings, roughly defined as the minimal number of characters you have to replace, insert or delete to transform a string into another. The problem is that computing Levenshtein distances takes time. It is not possible to compute the Levenshtein distances between the received User-Agent string and all the device ids of the WURFL database in a reasonable timeframe (e.g. 10ms).

From the User-Agent string to the actual device

To speed things up, the implemented algorithm shrinks the number of Levenshtein distances that need to be computed by doing an exact match on the first token of the User-Agent string, as the first token hardly ever varies for a given model of device. For instance, let's suppose we need to lookup the following string:

Nokia6680/1.0 (5.04.07) SymbianOS/8.0 Series60/2.6 Profile/MIDP-2.0 Configuration/CLDC-1.1

The code will simply look for all devices that start with Nokia6680/1.0 and then and only then compute Levenshtein distances within that list to select the one that is closest to the initial string.

Internally, two additional XML files are created from the initial WURFL database:

  • wurfl_devices.xml contains the list of known devices grouped by first token. Each device points to the corresponding device bloc in the prepared version of the big WURFL database file.
  • wurfl_families.xml contains the list of possible first tokens along with a link to the corresponding token bloc in wurfl_devices.xml.

The lookup algorithm is best described with a schema. Here is an example with a User-Agent string that starts with Mozilla/5.0 [foo]:

To lookup a User-Agent string, the first token is searched for in the wurfl_families.xml file. When found, the code jumps to the appropriate bloc in the wurfl_devices.xml file and reads the list of devices whose ID matches that prefix. It computes Levenshtein distances between the User-Agent string that is being looked up and all the devices in the set and selects the device for which the Levenshtein distance is minimal. The code eventually jumps to the appropriate bloc in the wurfl_patched.xml file where it can load the device properties

Caching system

The implementation bundles a simple cache that uses a User-Agent as key and information about the corresponding device as cached information. The role of this cache is simply to avoid having to redo the lookup each time an application requests a specific property. On an HTTP server, as PHP does not allow variables to persist across requests, the cache is re-created at the beginning of each request when the device is looked up for the first time.

Optimization notes

As mentioned above, the device lookup algorithm could still be improved, typical by extending the caching system to use a hash of the User-Agent and to store the results in a file or in a database for later fast retrieval. Most popular User-Agent strings would be found in this cache and would not require any other form of lookup. This solution is easy to deploy but requires to have some knowledge on what is available for storage in the back-end and this implementation aims at staying agnostic of the underlying platform.

The device lookup algorithm could also be improved to use a more appropriate structure to compute Levenshtein distances. Ternary search trees would provide an ideal playfield for a more optimized Levenshtein computation for instance. This second solution is harder to implement but does not penalize first requests with unknown devices.

On top of the above notes, identification of the device could also be improved by handling different families of devices separately, as the closest match may not always be the best match to identify the device.

WURFL database preparation

AskPythia cannot use the WURFL database directly because:

The code ships with a PHP form page that can process a WURFL file and generate the appropriate index. The currently suffers from heavy memory consumption when it is run and is not available online for the time being. You should be able to run the page on a local PHP server. The file is WURFLPrepareDatabase.php.

A prepared version of the database assembled from latest version of WURFL patched with the Web patch to recoginze Web browsers is available.

The memory consumption problem of the WURFL preparation form should be fixed in next version of the AskPythia library. Note AskPythia does not require this form once the WURFL database has been converted.

Contact: François Daoust <fd@w3.org>