[Workflow submission] Corbas Consulting Ltd

From: Nic Gibson <nicg@corbas.co.uk>
Date: Mon, 15 Jul 2013 22:07:12 +0100

Just enough structure: XML authoring and why it might not be the right idea

A position paper submitted by Corbas Consulting Ltd, UK.

Corbas is a two-person consultancy working for major publishers and publishing organisations in the United Kingdom and Europe. We include organisations such as the OECD and the International Labour Office and publishers such as Lexis-Nexis, Penguin Group and Bloomsbury Academic amongst our recent clients. We have helped clients with technological solutions and with workflow and organisational issues. Several of our customers have created systems to support structured authoring in an XML or HTML environment. Over the last few years we have become more and more concerned at the general failure of these projects. We have considerable experience with the digital publishing and authoring requirements and practices of legal, academic, educational and trade publishing.

Although "[t]he publishing industry is one of the largest consumers of W3C technology" (“W3C Launches New Digital Publishing Activity”, 25/06/2013), it often seems that publishing and technology do not communicate their requirements and abilities as well as they could. Nowhere is this clearer than in the failure of structured authoring in traditional publishing.

We believe that structured authoring through XML systems is simply too alien to the thought processes and working practices of many authors (especially those outside of the more technical spheres of authoring). Attempts to impose structured authoring generally fail. However, we firmly believe that digital publishing to the web, print and to eBooks can be considerably more effective if content is created in a more structured system. Therefore, we have developed an interest in systems that allow authors to edit in a semi-structured manner (such as direct authoring in HTML) and in systems that allow additional structure to be imposed on the author's content after they have finished creating it. Editing structured materials appears to be a more successful approach, with authors' flow less constrained. However, HTML and XML do not have good native version control and change tracking. Authors are used to MS Word. Until the facilities that all authors need are available as part of the web technology stack, XML and HTML for authoring will fail in the general case.

The premise:

In most cases, structured content is valuable when processing text and when you can identify repeatable logical chunks. This implies that the interesting unit of structure in a novel might be the chapter. In a legal text, it may well be the paragraph but it would also be the sub-section, the section and the chapter.

However, for most authors, structure interferes with the process of writing because structuring is too often counter-intuitive to narrative. It can even be a block.

There is an important difference between the way authors write text and the way that we use software to process that text. Processing requires some sort of structure. For many types of text the required structure is minimal (e.g. a short story or novel). For other types of text, such as legal publications, structure is essential (as is a heavy semantic layer). In both situations, structured authoring has proven challenging to implement.

XML operates on the premise that structure brings benefits. There are clear benefits to structured text but none of those benefits are for the author of the text. We think linearly when we write text. When authors write something, they generally write a title and then some text. They might follow that with a subtitle and some more text, then possibly an image. Now, the important part of that statement is the sequencing: (a) followed by (b) followed by (c). The structure is there but it's sequential. For XML contexts, this is a problem: we want structured text, but authors don't want to write structured text. Structured authoring tools get in the way of the author and authors do not voluntarily use them. There is a market for tools that support structured authoring, but it never seems to do very well. Nothing dominates.

In order to get structured content that accurately represents the intent of the author we must use tools to convert from authored text to structured text. There currently isn't a sensible alternative to that approach: structured authoring fails in the general case. Instead, perhaps it is time to accept that the authoring process and/or tools need to provide just enough structure. Then conversion to HTML (pretty trivial) and conversion to XML (not trivial but not particularly difficult) can happen. Is it enough to say that “just enough structure” can be defined as “apply Word styles”, or can there be more to it?
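
To make the conversion step concrete, here is a minimal sketch in Python of how a flat, sequential run of styled paragraphs might be nested into structured XML. It is an illustration under stated assumptions and not a description of any particular tool: the style names ("Title", "Heading 1", "Heading 2"), the element names (document, section, title, para) and the function build_structure are all hypothetical, and a real conversion would also need to handle lists, images, inline markup and inconsistently applied styles.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from paragraph style names to heading depth.
# Any style not listed here is treated as ordinary body text.
HEADING_LEVELS = {"Title": 1, "Heading 1": 2, "Heading 2": 3}


def build_structure(paragraphs):
    """Nest a flat sequence of (style, text) pairs into <section> elements."""
    root = ET.Element("document")
    # Stack of (level, element) for the currently open sections.
    # The root sits at level 0 and is never popped.
    stack = [(0, root)]
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is None:
            # Body text belongs to the innermost open section.
            para = ET.SubElement(stack[-1][1], "para")
            para.text = text
            continue
        # A heading closes every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", {"level": str(level)})
        title = ET.SubElement(section, "title")
        title.text = text
        stack.append((level, section))
    return root


# The author's view: (a) followed by (b) followed by (c).
flat = [
    ("Title", "Chapter One"),
    ("Normal", "Opening text."),
    ("Heading 1", "A subtitle"),
    ("Normal", "More text."),
    ("Heading 1", "Another subtitle"),
    ("Normal", "Closing text."),
]

doc = build_structure(flat)
print(ET.tostring(doc, encoding="unicode"))
```

The author only ever supplies the sequence; the hierarchy is inferred afterwards from the relative depth of the styles. That is one possible reading of "just enough structure": the styles carry the minimum information a converter needs.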

If our position paper is accepted, we will share the techniques that we have developed that have enabled our customers to develop mechanisms for authors to provide digital materials in a form they prefer, whilst allowing extraction of structured data. Importantly, we will also share the approaches we have used to enable publishing clients to get both author and organisational buy-in for those tools and techniques.

Involvement in this workshop will allow us to identify potential tools, exchange ideas about where development is needed, and document suggested areas where development can occur in the next few years in order to move towards a "just enough structure" authoring standard that is relevant to both the publishing and technology communities. It is essential that publishers and technologists begin to speak the same languages.

nic gibson, director
Corbas Consulting / @CorbasLtd
Digital Publishing Consultancy and Training
http://www.corbas.co.uk, +44 (0)7718 906817/+44 (0)1273 930765