L. Quin, F. Sasaki, C.M. Sperberg-McQueen, H.S. Thompson. World Wide Web Consortium. Paper presented in May 2008 at the LREC 2008 Workshop on "Uses and usage of language resource-related standards". Marrakech, Morocco. PDF version and slides of the presentation are available.
1. Introduction: A Brief Overview of W3C and its Process
2. XML Validation: XSD 1.1
3. Multi-document XML Validation: SML 1.1
4. XML Analysis: "XQuery 1.0 and XPath 2.0 Full-Text 1.0"
5. XML Processing: "XProc: An XML Pipeline Language"
6. Internationalized Formatting: "XSL Formatting Objects (XSL-FO)"
7. XML Internationalization and Localization: "ITS 1.0"
8. Outlook: The Need for the Integration of Language Resources
W3C is an international consortium whose mission is to develop Web standards, with contributions from W3C member organizations, the W3C staff, and the public.
W3C is working on a technology stack informally described at <http://www.w3.org/Consortium/techstack-desc.html>. The work is organized into Activities, such as the XML Activity or the Internationalization Activity, which are part of domains (Architecture, Interaction, Technology and Society, Ubiquitous Web, Web Accessibility). Work items are described in charters for Working Groups, Interest Groups, or Incubator Groups. The difference between these lies in their scope: Working Groups and Interest Groups concentrate on royalty-free specifications for Web technologies ("Recommendations") and guidelines for their use ("Best Practices"), whereas Incubator Groups concentrate on other, experimental topics which may serve as input to standardization efforts in the future.
In addition to W3C Activities there are the W3C Advisory Board and the Technical Architecture Group (TAG). The former provides guidance on management, legal matters, and similar issues, whereas the latter helps to build consensus on fundamental principles of Web architecture [Jacobs and Walsh 2004].
W3C as an organization is formally attached to three hosts: the Massachusetts Institute of Technology (MIT) in the USA, the European Research Consortium for Informatics and Mathematics (ERCIM) in France, and Keio University in Japan. In addition there are W3C offices in many countries to promote adoption of W3C technologies. The W3C membership can propose and drive new work items, has early access to new materials, and uses W3C as a community platform to decide new technology directions.
The development of royalty-free specifications relies on the W3C process [Jacobs 2005] and the W3C patent policy [Weitzner 2004]. The latter describes licensing and patent disclosure requirements for participation in W3C work and is an important factor for many organizations when deciding about their engagement in W3C.
Two key aspects of W3C Recommendation development are the aim to reach consensus, within the W3C and with the public, about the actual features of new technologies, and the requirement to demonstrate several interoperable implementations before a Recommendation is published. The need for consensus sometimes slows the development process for a Recommendation, but it significantly increases the likelihood that the result will actually be accepted, implemented, and deployed in the community. An example is the XML Query language XQuery 1.0. Between the publication of the first public draft and the final publication of the Recommendation, about six years elapsed. However, during that period a large developer and user community took shape, and the Recommendation was published with around 40 implementations.
XSD, the XML Schema Definition language, is a meta-language for defining XML vocabularies. Essentially, the author of an XSD schema provides a (regular right-part) document grammar for documents which use the vocabulary; unlike some other XML schema languages, XSD makes first-class citizens out of the types associated with elements and attributes. Types may be defined by restricting or extending other types, so that the class hierarchies usual in object-oriented design have a relatively natural representation in XSD schemas. A typical schema consists primarily of the definition of simple and complex types and the association of elements and attributes with types.
A number of primitive datatypes (or "simple types") are provided: strings, booleans, decimal numbers, floating-point numbers, date-time stamps of varying precision (date-time, date alone, year, year plus month, time alone, etc.), URIs, and some others. From these, a number of other built-in types are derived by restriction, including integers and various subtypes of integer (long, short, byte, positiveInteger, etc.).
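As an illustration, a user-derived simple type is defined by restricting a built-in type with facets. The following sketch (the type name "percentage" is invented for illustration) constrains xs:integer to the range 0 to 100:

```xml
<!-- A simple type derived from xs:integer by restriction.
     The name "percentage" is a hypothetical example. -->
<xs:simpleType name="percentage">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="0"/>
    <xs:maxInclusive value="100"/>
  </xs:restriction>
</xs:simpleType>
```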
Modularization facilities are also provided for using several vocabularies in conjunction. In the usual case, one or more schema documents are used to define the vocabulary associated with a given namespace (see [Bray 2006]), and a schema is constructed by consulting one or more schema documents. The schema thus constructed will often consist of components from several namespaces.
Because they are essentially document grammars with a few additional constraint mechanisms, schemas can vary in many of the same ways that grammars can vary: they can be tight or loose, and can over- or under-generate vis-à-vis some given body of material. In XSD, they can also be incomplete: wildcards and "lax validation" can be used to define constraints on some elements and attributes, while leaving others undefined and unconstrained. Schemas are often used for validating XML data as a quality assurance measure, to detect typographic and tagging errors. They are very useful in this role, but schemas can also be used in other ways. Data binding tools read schemas and generate object-oriented code to read documents which conform to the schema and from them create objects of particular classes, or serialize objects of particular classes in XML which conforms to the schema.
Schemas can document the contract between data sources and data recipients: the source typically undertakes to provide only valid data, and the recipient undertakes to accept any and all valid data. In such scenarios, invalid documents are often simply rejected. In other cases, schemas can be used to document a particular understanding of a kind of document, capturing a simple view of the "standard" realization of the document type without allowing for all variations. In dictionaries, for example, it is a commonplace observation that there are important regularities among entries, but that a small number of entries require rather unusual structures. Grammars which allow for all of the structural variations actually encountered in a dictionary often provide no clear account at all of the regularities which apply in 99% or more of all cases. Grammars which capture the regularities clearly often do not accommodate the deviant structures which can appear in a small number of cases. (See [Birnbaum and Mundie 1999] for fuller discussion.) In part to assist in handling such situations, XSD defines validity not solely as a Boolean property of documents, but describes validation as providing a much richer result: each element and each attribute is individually labeled as to validity or partial validity. XSD can thus be used either for conventional prescriptive grammars or for descriptive grammars which focus on capturing the salient regularities of the material; material with the typical structures described by the grammar can be handled in one process, while anomalous structures can be detected automatically (by their failure to be valid against the schema) and handled specially. XSD does not require that applications reject documents which are invalid or only partially valid.
XSD 1.1 (see [Gao 2007] and [Peterson 2006]) offers a number of enhancements to XSD 1.0, most visibly the addition of assertions, conditional type assignment, and open content.
Assertions allow the schema author to express constraints using XPath expressions: the assertions associated with any type are evaluated for each instance of the type, and if any assertion fails to evaluate to true, the instance is not valid against the type. The most common use of assertions will be to formulate co-occurrence constraints. When declaring an element with integer-valued attributes named min and max, for example, a schema author might wish to specify that the value of the one should not exceed the value of the other. This is easily accomplished with an appropriate assertion:
<xs:assert test="@min le @max"/>
Some XML vocabularies specify two or more attributes for a given element with the proviso that at most one of them may occur. This can also be handled conveniently with assertions. To ensure that either attribute a or attribute b may appear, but not both, one might write:
<xs:assert test="not(@a and @b)"/>
To require additionally that at least one of the two must appear, one might write:
<xs:assert test="(@a or @b) and not(@a and @b)"/>
The assertions of XSD 1.1 are restricted in one important way: they can refer to attributes or descendants of the element being validated, but they cannot refer to its ancestors, to its siblings, or to any elements or attributes outside the element itself. This restriction helps preserve the design invariant that the validity of an element or attribute against a given type can be tested in isolation from the rest of the document. This provides a certain context-independence of type validity, which is useful in transformation contexts like XSLT or XQuery.
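For example, an assertion may freely consult the element's own attributes and descendants, but nothing outside the element (the element and attribute names here are invented for illustration):

```xml
<!-- Allowed: the XPath refers only to the element's own
     attribute and child elements (hypothetical names). -->
<xs:assert test="count(item) le @max-items"/>

<!-- Not allowed: an expression using ancestor::, or any other
     axis leaving the element being validated, would violate
     the context-independence of type validity. -->
```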
Another way to capture co-occurrence constraints is to make the assignment of a given type to an element depend upon conditions to be checked in the instance. XSD 1.1 provides a form of such conditional type assignment based on [Marinelli 2004]. Here, too, the conditions which govern type assignment are given in the form of XPath expressions. To specify, for example, that the type assigned to a message element depends upon its kind attribute, given appropriate definitions of messageType, string-message, base64-message, binary-message, and xml-message, one might write:
<xs:element name="message" type="messageType">
  <xs:alternative test="@kind='string'" type="string-message"/>
  <xs:alternative test="@kind='base64'" type="base64-message"/>
  <xs:alternative test="@kind='binary'" type="binary-message"/>
  <xs:alternative test="@kind='xml'" type="xml-message"/>
  <xs:alternative test="@kind='XML'" type="xml-message"/>
</xs:element>
The third major change in XSD 1.1 to be discussed here is the provision of methods for specifying what is sometimes called "open content". In defining a document grammar, one may wish to specify, for example, that a particular element must have an a, a b, and a c among its children, in that order, without forbidding other material to appear before, after, or between these required elements. This can be done in XSD 1.0 with judicious use of wildcards, but experience shows that this method is error-prone and apt to fail for uninteresting technical reasons. XSD 1.1 allows the schema author to specify (on a case-by-case basis) that types have open content; the schema author can specify a wildcard which is notionally inserted everywhere in the content model, or allowed only at the end of the model. The result is that it is much easier using XSD 1.1 to specify vocabularies which allow arbitrary extension by others, and which can accept new material in new versions of the vocabulary without breaking existing infrastructure keyed to earlier versions of the vocabulary. If version 1.0 of the definition of a vocabulary specifies open content everywhere, then any new elements added in later versions will be accepted by 1.0 processors without difficulty (albeit also without any knowledge of their meaning).
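A type with interleaved open content might be sketched as follows (the type and element names are invented; the xs:openContent syntax follows the XSD 1.1 drafts):

```xml
<xs:complexType name="abcType">
  <!-- Notionally inserts this wildcard everywhere in the
       content model; mode="suffix" would instead allow it
       only at the end. -->
  <xs:openContent mode="interleave">
    <xs:any namespace="##other" processContents="lax"/>
  </xs:openContent>
  <xs:sequence>
    <xs:element name="a"/>
    <xs:element name="b"/>
    <xs:element name="c"/>
  </xs:sequence>
</xs:complexType>
```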
For those charged with developing or maintaining language resources, schema languages offer numerous opportunities: for finding errors in the XML transcription of material, for distinguishing material with standard, straightforward structure from material with anomalous structures, and for describing explicitly the class of documents which certain processes are allowed to produce and other processes are required to consume without error.
SML, the Service Modeling Language, is an additional validation technology layered on top of XSD; see [Pandit 2008].
SML was originally developed for checking the validity of models intended to describe complex sets of information-technology services; the name "Service Modeling Language" thus reflects the historical origin of the technology, but in the meantime the name has become a misnomer: SML is a generic mechanism for validation across document boundaries, and has nothing in particular to do with services or their modeling.
An SML model is a set of XML documents: some of them are model instance documents, which contain representations of the information being modeled, while others are definition documents, which define schemas to be used when validating the model instance documents.
In addition to requiring XSD validation of the instance documents, SML provides several additional mechanisms for specifying constraints which can be expressed only awkwardly in XSD, or not at all.
Schematron assertions can be associated with element declarations and type definitions; the assertions are checked for each instance of the element declaration or of the type definition.
The central innovation of SML, however, is its definition of a way to validate references from one document to another. This has a number of aspects.
First, the set of validatable links is not assumed identical to the set of links in the documents: the instance documents may well contain hyperlinks which are not constrained by the SML model and need not be validated. Those inter-document links which are to be validated are "SML references", indicated by the presence of the attribute-value pair sml:ref="true" (or its equivalent) on the element which constitutes the reference.
The actual form of the link is not constrained: any systematically defined method of pointing from one XML document to an element in another XML document (all targets of SML references must be elements) may be used. Indeed, a single reference may refer to the target element in multiple ways, each suitable for a different deployment scenario. Of course, SML processors will understand only a particular set of reference schemes; in the interests of interoperability, SML defines one scheme, the SML URI reference scheme, which uses URIs to address the target of the link. XPath 1.0 expressions are used as fragment identifiers, and XPath 1.0 is augmented by a deref() function to allow SML references to be followed. SML processors may support any reference schemes they choose, but all are required to support the SML URI reference scheme.
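A reference using the SML URI scheme might be sketched as follows (the document, element, and value names are invented; the smlxpath1() fragment identifier scheme and the namespace follow the SML drafts):

```xml
<!-- An SML reference: sml:ref="true" marks the element as a
     reference to be validated; the sml:uri child addresses the
     target element via the SML URI reference scheme. -->
<enrolledCourse sml:ref="true"
                xmlns:sml="http://www.w3.org/ns/sml">
  <sml:uri>courses.xml#smlxpath1(/courses/course[name='XSD'])</sml:uri>
</enrolledCourse>
```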
Having ensured that the set of links to be validated can be reliably and easily determined, SML then allows various constraints to be imposed on links.
Language resources are sometimes stored in single large documents and sometimes in many smaller documents, and the choice between monolithic and fragmented representations often depends heavily on external factors rather than on any logic intrinsic to the material. It is convenient, in such situations, to allow the material to be realized in either form without losing the ability to validate it. SML's ability to validate across XML document boundaries is a useful way to ensure that relations within the data can be validated whether the data is stored in a single document or in many.
"XQuery 1.0 and XPath 2.0 Full-Text 1.0" [Amer-Yahia 2007] is a specification that defines full-text search capabilities. It provides various full-text expressions and options to be used from within XQuery 1.0 or XPath 2.0 expressions. The options relate to stemming, thesauri, the use of stop words, and so forth, as well as to distances (for example "within five words") and units (words, sentences, paragraphs). They also support ranking and relative weighting of sub-expressions.
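For illustration, a full-text selection with a distance constraint, and a scored query, might look like the following (the element names are invented; the syntax follows later drafts of the specification, and earlier drafts used the keyword ftcontains instead of "contains text"):

```xquery
(: Select entries whose definition contains both words
   within a window of at most five words. :)
//entry[definition contains text "language" ftand "resource"
        distance at most 5 words]

(: Rank entries by relevance score. :)
for $e score $s in //entry[. contains text "corpus"]
order by $s descending
return $e
```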
The specification does not dictate specific algorithms for full-text search implementations, but instead describes only the results of operations. As a consequence, one must expect some variation between implementations. From the point of view of linguistic research, this variation means that it is important to determine, whether through documentation or experimentation, the exact facilities provided by any given implementation.
The tokenization algorithm splits the input data into a sequence of tokens, which, conceptually, are then indexed; the same tokenizer is used to parse queries at run-time into sequences of tokens to be matched against the index. The tokenizer is expected to recognise xml:lang attributes and to perform multilingual matching as necessary.
At the time of writing, this specification is a Last Call Working Draft; a formal call for implementations is expected in May of 2008, and so although there are already some implementations, there may be changes in the final specification as a result of implementation experience.
Probably the biggest limitation of the current Full Text draft for language research is the lack of introspection: one cannot find out exactly which token or tokens matched the query, and one cannot directly implement match highlighting in the way one might want for a concordance or keyword-in-context index. Some implementations do provide a way to do this, and a future version of the specification may well standardize it, but for now it represents a severe limitation.
The limitation is greatly ameliorated when one considers that XQuery (like XPath 2.0 itself) operates not only on XML files, but on any data that can be represented as XPath and XQuery Data Model (XDM) instances. This includes, for example, geospatial data, relational data, RDF, and more. As a result, one can perform joins across different types of database, correlating them with efficient text searching. In summary, "XQuery 1.0 and XPath 2.0 Full-Text 1.0" [Amer-Yahia 2007] will be a valuable tool for researchers, including the language resources community.
Using XML to represent language resources has become the norm. Actually processing language resources for some purpose often consists of a sequence of processing steps which split, merge, restructure, and transform XML. "XProc: An XML Pipeline Language" [Walsh 2007] is a specification which provides an XML vocabulary for specifying just such sequences, together with an inventory both of simple structural manipulations, such as renaming, wrapping, deleting, and extracting items in an XML data stream, and of larger-scale standards-based operations such as validation, transformation, and querying (see above).
Many kinds of XML technology and standards, including XSLT, XML Schema, XInclude, XQuery, and even SOAP and WSDL, can be understood as mapping from one kind of infoset [Cowan and Tobin 2004] to another. Today most implementations of XML-based language processing applications process XML directly using programming languages and an API such as SAX or DOM. But many XML processing tasks don't need to be done at this low level. A number of XML pipeline languages are already available which allow you to specify sequences of standards-based XML operations, and it is often possible to replace programming-language-based XML processing with short and simple pipeline descriptions.
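A pipeline along these lines might be sketched as follows (the file names are invented; the step names and namespace follow the XProc working draft and may change before the final Recommendation):

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            version="1.0" name="prepare-and-render">
  <!-- Merge in externally stored fragments. -->
  <p:xinclude/>
  <!-- Validate the merged document against a schema. -->
  <p:validate-with-xml-schema>
    <p:input port="schema">
      <p:document href="resource.xsd"/>
    </p:input>
  </p:validate-with-xml-schema>
  <!-- Transform the validated document for delivery. -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="to-html.xsl"/>
    </p:input>
  </p:xslt>
</p:pipeline>
```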
The W3C's XML Processing Model WG is working to produce an interoperable XML pipeline language based on existing technology. The work is nearly finished, and the result should provide a language with wide application to language processing tasks.
The XProc language is itself expressed in XML, and has two main parts: a vocabulary for declaring pipelines and the connections between their steps, and a library of standard steps which those pipelines can invoke.
There is support for more than just straight-through pipelining of operations, with data-flow equivalents of conditions, loops, and exception-handlers.
The low-level manipulations available include renaming, wrapping, deleting, and extracting items in an XML data stream.
The higher-level operations available include validation, transformation, and querying.
The pipeline paradigm for producing NLP systems has been heavily exploited by the Language Technology Group at the University of Edinburgh. One example, described in [Alex, Grover et al. 2008], uses a multi-step pipeline to extract named entities, in particular proteins, from biomedical text, classify the extracted terms, and detect relations between terms. Steps in the pipeline range from generic, low-level tasks such as tokenisation and sentence-boundary detection to high-level processes such as relation extraction, which involve not only entity-tagged data but also pre-computed statistical models.
The availability of a standardised XML pipeline language offers a real opportunity to improve the principled comparison of alternative approaches, as the modular nature of the pipeline architecture, together with the well-defined interfaces between modules which the XML document structures represent, will make it possible to do properly controlled comparisons of alternative approaches to the key stages in a complex process.
The XSL 1.1 specification [Berglund 2006] includes facilities for formatting XML, for example into PDF. XSL-FO 2.0 is currently being designed, with increased sophistication and, in particular, increased support for Japanese formatting. The XSL-FO 2.0 requirements document [Bals 2008] provides more information. XSL-FO is currently the most powerful and most completely internationalized widely-used standard for text formatting, with strong support for mixed-language work.
XSL-FO is a fixed XML vocabulary for formatting. In normal use one transforms input XML into the XSL-FO vocabulary using XSLT, and this transformed XML document is then rendered. It is also possible to produce XSL-FO directly, for example using XQuery.
XSL-FO copes with an arbitrary mix of text directions, both in what it calls the inline progression direction (e.g. right-to-left for Hebrew) and in what it calls the block progression direction (e.g. top to bottom for English, or right-to-left for vertical Japanese). It also defines how baselines should be mixed, for example when combining Devanagari and Arabic on the same line. Since language information is available to the formatter, language-specific hyphenation and line-breaking are also generally applied.
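For instance, an embedded run with the opposite inline progression direction can be forced with fo:bidi-override (the text content here is illustrative):

```xml
<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format"
          language="en">
  An English sentence containing
  <!-- Force a right-to-left embedded run. -->
  <fo:bidi-override direction="rtl" unicode-bidi="embed"
                    language="he">a right-to-left
    run</fo:bidi-override> in the middle.
</fo:block>
```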
Currently, XSL-FO is primarily aimed at automatic formatting in a content-driven environment: text flows into page areas, and new pages are created on demand from templates. XSL-FO 2.0 is expected to add support for format-driven processing, in which page areas fetch content as needed, but the 2.0 work is still in the early stages.
Readers interested in the future of XSL-FO are strongly encouraged to inspect the requirements document previously cited and to send comments to the Working Group as instructed in the Status section of that document.
The Internationalization Tag Set (ITS) 1.0 [Lieske and Sasaki 2007] is a specification which provides an XML vocabulary related to the internationalization and localization of XML. A prototypical use is specifying which parts of an XML document should or should not be translated during the localization of XML data. Such information can be expressed with two approaches, which can be used alternatively or in combination. First, locally, by adding a translate attribute, with the value yes or no, to the targeted element node in an XML document. Second, by describing ITS 1.0 global rules, which are independent of a specific location and can be applied to several XML documents. Such rules make use of XPath to specify the nodes to which the ITS information pertains.
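The two approaches can be sketched as follows (the host-vocabulary element names, such as msg and code, are invented; the ITS markup follows the ITS 1.0 specification):

```xml
<!-- Local approach: an ITS attribute directly on the node. -->
<msg xmlns:its="http://www.w3.org/2005/11/its"
     its:translate="no">Do not translate this string.</msg>

<!-- Global approach: a rule, kept separately from the content,
     selecting the affected nodes via XPath. -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           version="1.0">
  <its:translateRule selector="//code" translate="no"/>
</its:rules>
```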
ITS 1.0 specifies, for seven so-called "data categories" (Translate, Localization Note, Terminology, Directionality, Ruby, Language Information, and Elements Within Text), a way to express global and local information, defaults, and inheritance behavior (that is, whether the information pertains to attributes and / or child elements):
Such information can be applied in many scenarios, for example within localization tools, for the extraction of translatable text, or as a preparation for the localization process. Below are two examples of using ITS 1.0 together with two important standards for XML localization: XLIFF and TBX.
XLIFF [Savourel 2008b] is the "XML Localization Interchange File Format", an interchange file format for localizable content. A prototypical usage scenario is that, out of some input data (e.g. an XML document), an XLIFF document is generated, containing several translation units which wrap markup for the input "source" data and the output "target" translations. Translators then create these translations, which are finally integrated into the source data.
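A minimal XLIFF document of this kind might look as follows (the content and file names are invented; the structure follows XLIFF 1.2):

```xml
<xliff version="1.2"
       xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="messages.xml" datatype="xml"
        source-language="en" target-language="de">
    <body>
      <!-- One translation unit pairing source and target. -->
      <trans-unit id="m1">
        <source>Save your changes?</source>
        <target>Änderungen speichern?</target>
      </trans-unit>
    </body>
  </file>
</xliff>
```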
The ITS 1.0 data category "Translate" can be used to generate XLIFF documents out of XML input data. Below, an XProc (see sec. 5) pipeline whose purpose is to compare the results of translation tools is informally described. It consists of the following XProc steps:
The use of ITS 1.0 as the first step in the XProc pipeline description helps to "hide" the specifics of input XML data. The automatic translation tools only have to understand the XLIFF format. In this way, the same processing chain can easily be re-used for new translation tools.
The TermBase eXchange (TBX) format is another example where ITS 1.0 helps to generalize processing chains. TBX is used for the representation of terminological information for human consumption or in NLP lexicons. The ITS 1.0 "Terminology" data category helps to identify terms locally or with global rules. In combination with the "Translate" data category, the following processing chain can be envisaged:
The language resources community has struggled for years with the challenge of combining separately developed resources: how to combine your lexicon with mine, or your grammar, your corpus, and so on. This problem is sometimes termed data integration. There are several areas of difficulty in the field of data integration. Some of these have been solved, at least partially, by some of the technologies described in this paper:
In all cases, the fact that every XML tool can process any XML document is a major benefit that greatly simplifies work.