Architectural Forms, CSS, and RDF - What do they have in Common and why should you care?

Felix Sasaki
World Wide Web Consortium
Tokyo
fsasaki@w3.org

Keywords: Re-usage; Architectural Forms; RDF; Markup Semantics; CSS

Abstract

This paper takes a "real world scenario" of re-using XML in different contexts, with different people participating in different places over time. Various approaches for re-usage like architectural forms or (possibly RDF-based) markup semantics are discussed. The proposed methodology for re-usage is based on these approaches, but also - somewhat surprisingly - on CSS. A specific implementation and application of the methodology in the area of XML internationalization and localization will be described. In addition, the usefulness of the methodology will be demonstrated for the general task of relating arbitrary XML vocabularies.

1 Introduction

This paper1 deals with a need which has come up again and again in recent years, since the creation of XML (and long before): the need to re-use markup schemes and marked-up documents in different contexts and by different people, in ways probably unforeseen at the time of their creation. A prominent name for the failure to respond to this need is the "tag abuse syndrome". Compared to its "mother" SGML, which mandates schemas for document creation, XML faces an additional challenge: the reuse of marked-up documents without schemas2.

The need to reuse markup schemes and marked-up documents (in short: to reuse XML) is complementary to demands for flexibility and ease of usage. There have been various attempts in the past to solve the reusability problem, see sec. “Previous Approaches”. Unfortunately, so far none of the approaches has been adopted widely. Some reasons are: the need to implement additional, complex technologies, a narrow, inflexible re-usage scenario, and re-usage mechanisms which do not respond to "real world" re-usage scenarios.

This paper introduces a methodology which tries to avoid these shortcomings. As the title of the paper suggests, the methodology is inspired by technologies which normally are not related to each other. A key difference from previous methodologies for re-usage is that the new methodology has been developed "bottom-up", from a concrete "real world" re-usage scenario: expressing information about internationalization and localization of XML, see [Lieske and Sasaki 2006]. The purpose of this paper is to generalize the ITS [Internationalization Tag Set] methodology and demonstrate its general applicability.

The paper is structured as follows. Sec. “The Re-usage Problem” describes the re-usage problem bottom up, from the task of XML internationalization and localization. Sec. “Previous Approaches” discusses previous solutions and the motivation to take design principles of SGML architectural forms, CSS and RDF into account to develop the ITS methodology. Sec. “A Bottom-Up Approach to Re-usage: Internationalization Tag Set (ITS)” describes the implementation of the methodology for localization / internationalization related tasks. In sec. “Generalization of the ITS Approach”, the methodology is generalized, implemented and applied to a general, ITS independent usage scenario. Sec. “Conclusion” summarizes the discussions and gives an outlook to remaining work.

2 The Re-usage Problem

2.1 A Sample Task: How to Express Information about Translatability of Marked-up Text?

The re-usage scenario of internationalization and localization of XML has various requirements, see [Savourel 2006], [Sasaki et al. 2005] and sec. “Requirements for ITS”. Throughout this paper, the (often unforeseen) need to express information about translatability of attribute and element textual content will be used as an example3. Fig. 1 shows a document which needs such information, and a possible RELAX NG schema in compact syntax4.


Figure 1:

Document with translatable and not translatable content, and possible schema

<Manual>
 <Info>          
  <PhaseCode>Review Level</PhaseCode>
  <FormNo>8U81-GS-52C</FormNo>
  <Name>Owner's Manual</Name> 
 </Info>
 <Section id="s0" title="#Introduction#" translatable="false">
  <Ltitle id="lt005" title="#ZOOM#">
   <Mtitle id="mt00501" title="Getting started" option="no" cols="1">
    <MultiCol cols="1" myLangAttr="en">
     <Text addInfo="...">Some text to localize</Text>
    </MultiCol>
   </Mtitle>
  </Ltitle>
 </Section>
</Manual>

start = element Manual { Info, Section+ }
Info = element Info { PhaseCode, FormNo, Name }
PhaseCode = element PhaseCode { text }
FormNo = element FormNo { text }
Name = element Name { text }
Section = element Section { attribute id { xsd:ID }, attribute title { text },
                            attribute translatable { "true" | "false" }, Ltitle }
Ltitle = element Ltitle { attribute id { xsd:ID }, attribute title { text }, Mtitle }
Mtitle = element Mtitle { attribute id { xsd:ID }, attribute title { text },
                          attribute option { text }, attribute cols { text }, MultiCol }
MultiCol = element MultiCol { attribute cols { text }, attribute myLangAttr { text }, Text }
Text = element Text { attribute addInfo { text }, text }

It is assumed that some of the elements and attributes need translation, but others must not be translated. This example shows a common problem of (XML) documents: Many people participate in their processing, and not in parallel. They create and manipulate the documents at different times, and with different motivations. The creator of the document might know what content needs translation. But the translator might be in a different company, which gets the document months after its creation. In other words: There needs to be a way to make knowledge about translatability explicit, i.e. to convey the intentions of the original creator.

One might argue: If knowledge about translatability is necessary, why not add it to the document from the start? In the example document, there is already a @translatable="false" attribute. But it still leaves questions open: What are the possible values: "true", "false", "null", ...? How is a value to be interpreted, i.e. does it relate to child elements as well as attributes? How is the absence of the attribute to be interpreted? In practice, even the presence of this attribute is rather unrealistic: people think about translation long after document creation. It is then too late to add information about the new usage scenario "Translate the document!", since it might disturb other, existing processing chains5.

A different proposal to make translatability information explicit could be: just transform the document, e.g. replace all markup with <translate> or <notranslate> elements. The drawback of this solution is of course again that processing which relies on the original names is not possible anymore, and that the transformation might not be reversible.

2.2 Generalization of the Problem and Separation to related, yet different Areas

The generalization of the problem mentioned above is: people need to be able to specify new purposes for existing markup schemes / marked-up documents, without changing them.

A somewhat related, yet still different problem is the one of "markup semantics", see sec. “Specifying Relations between Markup Vocabularies (2): Markup Semantics (and Semantic Markup)”. Specifying markup semantics does not mean that the re-usage problem is solved. The re-usage problem is about usage scenarios unforeseen during markup scheme / document creation. Re-usage does not change the original meaning of the markup: the <Section> element in fig. 1 is markup for a section. In addition to its meaning, and largely independent of it, the element needs information about being (not) translatable.

However, as will be seen later, markup semantics is sometimes regarded as a means for reusability in a new context. Hence, this paper talks about this topic as well, but always in the light of the purpose "reusing markup in a new context".

A different task which also needs to be separated from the reusability problem is extensibility and versioning of documents and markup schemes. If one could use such extensions, the problem of translatability would be solved: A new version of the markup scheme could encompass means to specify translatability information. However, as mentioned above, in the "real world" of e.g. XML translation, documents appear with or without schemas and are handled by various people at different places and times. Hence, a new version of a schema might not reach the person who needs it, or the person might not be allowed to change the schema.

Neither the specification of markup semantics nor versioning and extension mechanisms provide an answer to one part of the reusability problem: sometimes different re-users supply contradictory information. As for translatability, one person might decide differently than another about what has to be translated.

As a preliminary summary, the previous discussion leads to a more fine grained description of what tasks are necessary to achieve reusability:

In sec. “Generalization of the ITS Approach”, after the discussion of existing approaches and the introduction of the ITS methodology, additional tasks will come up.

3 Previous Approaches

3.1 Specifying Relations between Markup Vocabularies (1): Architectural Forms

Architectural forms, defined as part of the HyTime standard [ISO/IEC10744], are a mechanism to describe relations between schemas in the format of SGML or XML DTDs. They allow for defining sets of rules about relations between markup declarations. Each rule set has a name, the architectural form. Fig. 2 shows parts of an architectural form which could solve the re-usage problem for the translation scenario.


Figure 2:

Architectural form for the usage scenario "translatability information"

CLIENT DTD:
<?IS10744:arch name="translation-arch"
  ...
  renamer-att="translation-arch.atts"
  dtd-system-id="translation.dtd"?>
<!ELEMENT Manual (Info, Section+)>
<!ATTLIST Manual translation-arch NAME #FIXED "translate"
...>
<!ELEMENT PhaseCode (#PCDATA)>
<!ATTLIST PhaseCode translation-arch NAME #FIXED "notTranslate"
...>
<!ATTLIST Section translation-arch.atts CDATA #FIXED "id noTransAttr title transAttr"
...>
...

Within the processing instruction IS10744, the architectural form is named translation-arch. It is possible to specify multiple architectural forms, e.g. translation-arch1 versus translation-arch2. This would respond to the need to specify multiple, possibly contradictory pieces of information (see sec. “Generalization of the Problem and Separation to related, yet different Areas”).

The architectural form is applied in the "client DTD", that is the DTD which is enhanced with mapping rules to the "architectural DTD". dtd-system-id points to the architectural DTD. It is assumed that the architectural DTD contains only two elements: <translate> and <notTranslate>. In the attribute lists in the client DTD, it can now be specified for the translation-arch architecture that an element should be mapped to one of these, e.g. <Manual> to <translate> and <PhaseCode> to <notTranslate>.

Mappings of attributes can be specified as well. In fig. 2, the @id attribute of the <Section> element is mapped to a @noTransAttr attribute, whereas the @title attribute is mapped to a @transAttr attribute.
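The mapping described above can be illustrated with a minimal Python sketch. It is not an implementation of HyTime architectural processing: the names ELEMENT_MAP, ATTRIBUTE_MAP and architectural_view are hypothetical, the tables merely restate the mappings of fig. 2, and keeping unmapped names unchanged is a simplifying assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables restating the mappings declared in fig. 2.
ELEMENT_MAP = {"Manual": "translate", "PhaseCode": "notTranslate"}
ATTRIBUTE_MAP = {"Section": {"id": "noTransAttr", "title": "transAttr"}}

def architectural_view(elem):
    """Return a renamed copy of elem; the client document itself is not changed."""
    renames = ATTRIBUTE_MAP.get(elem.tag, {})
    attributes = {renames.get(name, name): value for name, value in elem.attrib.items()}
    # Assumption: names without a mapping are simply kept as they are.
    view = ET.Element(ELEMENT_MAP.get(elem.tag, elem.tag), attributes)
    view.text, view.tail = elem.text, elem.tail
    for child in elem:
        view.append(architectural_view(child))
    return view

client = ET.fromstring(
    '<Manual><Info><PhaseCode>Review Level</PhaseCode></Info>'
    '<Section id="s0" title="#Introduction#"/></Manual>'
)
print(ET.tostring(architectural_view(client), encoding="unicode"))
```

Note that the sketch already couples the identification of a mapping with a transformation into a new tree, which is the combination of tasks discussed later in this section.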

It is also possible to describe mappings within a document, if the client DTD is not available. For this purpose, a @translation-arch attribute is added to an element. Its value specifies the name of the element in the architectural form, e.g. @translation-arch="translate". This responds to the need mentioned in sec. “Generalization of the Problem and Separation to related, yet different Areas” to specify information directly in marked up documents.

Architectural forms seem to solve most of the problems for re-usage. Why are they not used? First there is a historic reason: The framework of architectural forms is not part of the XML standard (although XML documents can be processed with architectural forms definitions). To solve this problem, various XML versions of architectural forms have been or are being created, e.g. AF:NG [Cowan 2002] or DSRL [ISO/IEC 19757-8].

There is another practical reason: the complexity of the HyTime standard, which has led to a lack of implementations. But the most important obstacle for the adoption of architectural forms seems to be a different one: architectural forms combine tasks which are not necessarily part of each re-usage scenario: identifying a mapping and subsequent processing, that is: transformation and / or validation. As for transformation, there are many re-usage scenarios which need more powerful mechanisms. An example is the need to handle element mapping differently, depending on the presence or absence of attributes, or specific attribute values.

As for the validation task: architectural forms solve the problem of relating declarations, but they do not solve many other problems which are much more important for validation scenarios, like information about typing after validation (addressed by XML Schema), ambiguous content models (addressed by RELAX NG) or usage of validation patterns instead of grammar rules (addressed by Schematron). This makes another aspect of the re-usage problem explicit: A re-usage mechanism should allow for adding information about a new usage scenario (see sec. “Generalization of the Problem and Separation to related, yet different Areas”), while being independent of any further processing. As for translatability information, various processes built on top of the task "identifying translatable content" are useful, e.g. editing with highlighting of translatable text, or text extraction.

Finally, the mechanism of defining multiple architectural forms like translation-arch1 versus translation-arch2 has one shortcoming: the architectural forms do not interact with each other. It is possible to apply them to the same document, but there must not be conflicting rules for the same markup. For the scenario of various people providing translatability information, the ability to state and resolve such conflicts is an important requirement.

3.2 Specifying Relations between Markup Vocabularies (2): Markup Semantics (and Semantic Markup)

Architectural forms are a grammar-based mechanism for describing relations between markup declarations. The field of markup semantics provides methodologies which can be used for the same purpose, but which go beyond the grammatical level.

[Sperberg-McQueen and Miller 2004] categorize four different approaches towards markup semantics, which focus on different problems:

An example of markup semantics6 which uses FOPC is the BECHAMEL project. Its first approach is originally described in [Sperberg-McQueen et al. 2000] and specifies a mapping of XML into a Prolog-based format. Fig. 3 replicates this format for the translatability use case.


Figure 3:

BECHAMEL example

translatable('Manual',[1]).
non_translatable('PhaseCode',[1,1,1]).

The Prolog-based notation contains two predicates, for translatable and not translatable units. They take the name of the element and a path description as arguments. Among others, the BECHAMEL approach allows for specifying inheritance information, e.g. for indicating whether translatability information should be applied to child elements and / or attributes as well.
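How such facts relate to a document can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BECHAMEL implementation: the function names are hypothetical, the path is taken to list 1-based child positions starting from the root, and the set of non-translatable element names is supplied by hand. Element names are written as quoted atoms, since Prolog reads a capitalized bare name as a variable.

```python
import xml.etree.ElementTree as ET

DOC = ("<Manual><Info><PhaseCode>Review Level</PhaseCode>"
       "<FormNo>8U81-GS-52C</FormNo></Info></Manual>")

def element_paths(elem, path=(1,)):
    """Yield (element, path) pairs; a path lists 1-based child positions from the root."""
    yield elem, list(path)
    for position, child in enumerate(elem, start=1):
        yield from element_paths(child, path + (position,))

def bechamel_facts(root, non_translatable):
    """Emit one Prolog-style fact per element (hypothetical helper)."""
    facts = []
    for elem, path in element_paths(root):
        predicate = "non_translatable" if elem.tag in non_translatable else "translatable"
        steps = ",".join(str(step) for step in path)
        facts.append(f"{predicate}('{elem.tag}',[{steps}]).")
    return facts

# Assumption: which elements must not be translated is decided by the caller.
facts = bechamel_facts(ET.fromstring(DOC), non_translatable={"PhaseCode", "FormNo"})
print("\n".join(facts))
```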

Markup semantics has a counterpart called "semantic markup". Whereas markup semantics is concerned with a formal description of the meaning of schemas and documents, the latter focuses on the addition of external semantics to markup, using RDF [Klyne and Carroll 2004] or related formats7. An example which combines markup semantics and semantic markup is given in fig. 4.


Figure 4:

Markup semantics combined with semantic markup

[1] sekStruk:Book sekStruk2primStruk "element Manual { ... }".
[2] sekStruk:Code sekStruk2primStruk "element PhaseCode { ...}".
[3] sekStruk:Book equal sekStruk:TranslatableUnit.
[4] sekStruk:Code equal sekStruk:NonTranslatableUnit.
[5] sekStruk:TranslatableUnit sekStruk2primStruk "element translatableUnit { ... }".
[6] sekStruk:NonTranslatableUnit sekStruk2primStruk "element nonTranslatableUnit { ... }".
[7] sekStruk:Book sekStruk2conLevel my-ontology:Book.
[8] sekStruk:Code sekStruk2conLevel my-ontology:Code.

Eight RDF statements of the form "subject predicate object." are given. They make use of predefined predicates which are described in [Sasaki 2004]. The statements [1] and [2] use the predicate sekStruk2primStruk for a mapping of concepts to markup declarations, i.e. sekStruk:Book to a declaration of the <Manual> element and sekStruk:Code to a declaration of the <PhaseCode> element. The statements [3] and [4] use the predicate equal and make explicit that these concepts are equal to translatable or not translatable units. sekStruk:TranslatableUnit and sekStruk:NonTranslatableUnit are concepts and can be mapped to element declarations via sekStruk2primStruk, see the statements [5] and [6].

These six statements are closely related to architectural forms: they describe the relations between markup declarations, e.g. between <PhaseCode> and <nonTranslatableUnit>. The difference is that they are statements and not grammar-based rules, and they are not bound to a specific application (validation or transformation).

The "semantic markup" aspect of this approach is given in the statements [7] and [8]: The concepts sekStruk:Book and sekStruk:Code are mapped to concepts from a given ontology. It is assumed that the ontology contains information about properties of books and code, i.e. being (not) translatable. In other words: the ontological knowledge is added to the markup.

The problem with these approaches to markup semantics and semantic markup is that they imply a change of documents and schemas. Given the ontological or Prolog-based representation, the original XML is lost. This contradicts the requirement mentioned in sec. “Generalization of the Problem and Separation to related, yet different Areas” to allow for new usage scenarios with limited impact, since the new data structures are not available for any other, purely XML-based processing chains.

Nevertheless, these approaches provide an important aspect of re-usage which is not so clear if one looks "only" at architectural forms: Re-usage implies a mapping into a target data structure. In architectural forms, this data structure is rather simple: singular elements and attributes. In approaches to markup semantics mentioned above, the data structure departs from the XML format (both on the level of schemas and documents). In the semantic markup demonstrated above, the data structure is only a means to integrate additional, possibly ontological information.

Although a different, not XML-based data structure might be appropriate for some re-usage scenarios, it is clearly not feasible to require such data structures in general. For a re-usage mechanism, it rather seems to be appropriate to specify new usages relying on an XML-based data structure. Different data structures can be built on top of this structure. Also, the existing data structure, i.e. the existing XML documents and schemas, should not be transformed to a different (XML) structure. They should only be associated with the new structure, and be preserved in their existing form.

3.3 Specifying Commonalities of Resources, adding Information: CSS

It may seem strange to mention CSS [Bos et al. 1998] in a discussion of re-usage. However, taking a close look, it becomes obvious that the characteristics of CSS fulfill important requirements for re-usage, see fig. 5.


Figure 5:

Basic CSS example

Manual {  background-color: green; }
PhaseCode { background-color: red; }

With CSS, it is possible



The last aspect becomes clear in fig. 5. CSS relies on a selection mechanism [Glazman et al. 2005] which is independent of the actual styling information (given in curly brackets). In other words: the description of being translatable (here realized via green coloring) or not being translatable (realized via red coloring) is separated from this mechanism.

CSS has two further aspects which distinguish it from architectural forms. The first is a notion of inheritance8 (which is also formulated in the BECHAMEL framework for markup semantics). In CSS, it is described how e.g. coloring information pertains to descendant nodes, and how it is overridden by further information.

The other aspect concerns the resolution of conflicts between multiple, possibly contradictory pieces of information. In architectural forms, such conflicts cannot occur, since applications of multiple architectural forms (e.g. translation-arch1 versus translation-arch2, see sec. “Specifying Relations between Markup Vocabularies (1): Architectural Forms”) do not interact. CSS allows for applying multiple pieces of information in parallel and relies on precedence rules to resolve conflicts.
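The three CSS characteristics discussed here (selection kept apart from the assigned information, inheritance, and precedence) can be sketched in Python as follows. The sketch is a strong simplification and not a CSS engine: it supports only type selectors, lets the later of two matching rules win (as CSS does for selectors of equal specificity), and uses the inherited property color rather than background-color from fig. 5, because background-color is not inherited in CSS. The names RULES and computed_colors are hypothetical.

```python
import xml.etree.ElementTree as ET

# (selector, declarations) pairs; the selector says nothing about what is assigned.
RULES = [
    ("Manual", {"color": "green"}),
    ("PhaseCode", {"color": "red"}),
    ("PhaseCode", {"color": "darkred"}),  # conflicts with the previous rule
]

def computed_colors(root, rules):
    """Map every element to its colour: matching rules first, else the inherited value."""
    result = {}

    def visit(elem, inherited):
        value = inherited
        for selector, declarations in rules:
            if selector == elem.tag and "color" in declarations:
                value = declarations["color"]  # a later matching rule overrides an earlier one
        result[elem] = value
        for child in elem:
            visit(child, value)  # descendants inherit unless a rule selects them

    visit(root, None)
    return result

doc = ET.fromstring("<Manual><Info><PhaseCode/><FormNo/></Info></Manual>")
for elem, color in computed_colors(doc, RULES).items():
    print(elem.tag, color)
```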

One shortcoming of CSS might be that it is concerned only with instances, not schemas. This is an issue in theory, but not in practice: specifying a CSS rule for all instances of an element seems to be as appropriate as architectural forms on the schema level. The XML-based approaches mentioned in sec. “Specifying Relations between Markup Vocabularies (1): Architectural Forms” rely on XPath, that is, they too have abandoned the grammar-based approach.

Another issue with CSS is of course the data format: it is not XML, and it mainly applies to styling9. Since the targets of the re-usage approach described in this paper are XML documents and schemas, it seems appropriate to rely on XPath for selection, and to restrict the data structures which are added to selected nodes to XML.

3.4 Architectural Forms, CSS, and RDF - What do they have in Common and why should you care?

All approaches which were discussed in the previous sections have in common that they imply a mechanism for selecting XML (parts of a schema and / or a document). On top of these mechanisms, various processing like validation and transformation takes place which is of importance for some re-usage scenarios.

The RDF-based approach towards markup semantics and architectural forms share the characteristic that they describe relations between markup declarations. Like CSS, architectural forms can be applied globally and locally, whereas the RDF-based approach is defined only globally, i.e. independent of a specific position in a document10.

The most important characteristics of CSS are that it clearly separates a selection mechanism from any further processing or assignment of information, and that it has mechanisms for dealing with inheritance and precedence between selections. What is missing is an XML-based selection mechanism (i.e. XPath), and a method to assign not CSS properties, but XML data structures to selected nodes in XML documents. It is also important that this information is assigned without changing the input XML data structure.

Why should you care about this? The simple answer is: Many approaches towards re-usage of XML have failed to be applicable in real-world scenarios. The methodology described in this paper is based on a real-world scenario and on a review of these approaches. It will be shown below that it has already proved to be applicable in this specific scenario, and that a generalized implementation is easy to achieve.

4 A Bottom-Up Approach to Re-usage: Internationalization Tag Set (ITS)

4.1 Requirements for ITS

The purpose of ITS is to define markup for localization and internationalization purposes11. An example purpose is markup to express information about translatability. Another purpose is markup for localization information, i.e. to provide information which is important during the localization process.

Although the work on ITS started with the goal of defining only such markup, it soon became clear that the great variety of application scenarios and requirements (see [Savourel 2006] and [Sasaki et al. 2005]) made a definition of ITS mechanisms necessary. For example, translation information should be applicable by:



Despite this variety of usages, all scenarios have in common that they are about selecting XML markup and adding information about "translatability". This characteristic applies to most of the requirements for ITS, so it was decided to differentiate between data categories like "translatability" and their implementation: locally in an instance document via e.g. an @its:translate attribute, or globally, i.e. independent of a specific position13.

4.2 Realization of Requirements for ITS

The following description of the mechanism of ITS is very brief. It mainly summarizes ITS characteristics which are important for the "re-usage" topic of this paper. A more detailed description can be found in the ITS working draft and in [Lieske et al. 2006].

ITS uses selection of information as a basic mechanism. In addition, it is possible to add information to selected nodes, or to point to existing information in the document which should pertain to selected nodes. The functionality of adding information can be applied locally (in the XML document), or globally (independent of a specific position). Finally, data category specific inheritance / precedence rules and (optionally) defaults are applicable. All mechanisms will be exemplified via fig. 6, which is a modification of the XML document from fig. 1. It contains markup for three ITS data categories: translatability, localization information and language information.


Figure 6:

ITS data categories

<Manual ... its:translate="yes" its:locInfo="The Section element contains
             existing localization information, the Info element contains
             ITS markup which was added later." its:locInfoType="description">
 <Info>
  <its:rules its:version="1.0" ...>
   <its:translateRule select="//*[@translatable='false']" translate="no"/>
   <its:translateRule select="//Section/@title" translate="yes"/>
   <its:translateRule select="//Text" translate="yes"/>
   <its:langRule select="//*[@myLangAttr]" langPointer="@myLangAttr"/>
  </its:rules>            
  <PhaseCode>Review Level</PhaseCode>
  <FormNo its:translate="no">8U81-GS-52C</FormNo>
  <Name>Owner's Manual</Name> 
 </Info>
 <Section id="s0" title="#Introduction#" translatable="false">
  <Ltitle id="lt005" title="#ZOOM#">
   <Mtitle id="mt00501" title="Getting started" option="no" cols="1">
    <MultiCol cols="1" myLangAttr="en">
     <Text addInfo="...">Some text to localize</Text>
    </MultiCol>
   </Mtitle>
  </Ltitle>
 </Section>
</Manual>

Selection of information can be realized locally or globally. For global selection, data category specific rules elements are used, see the content of the <its:rules> element. Each rule element, e.g. <its:translateRule>, has a mandatory @select attribute. Its value is an XPath absolute location path which evaluates to the selected nodes. Locally, selection (in combination with adding data category specific information) is realized via data category specific markup, e.g. the attributes @its:locInfo or @its:translate. ITS locally mainly uses attributes, to reduce the impact on existing markup schemes. What is being selected depends on data category specific inheritance rules. E.g. as for translatability, the @its:translate attribute selects the content of elements, including child elements, but excluding attributes. In contrast, the selection of nodes for language information, in the example realized via the <its:langRule> element, applies to both element and attribute content14.

Adding information and pointing to existing information are mechanisms which can be applied alternatively to selected nodes. Adding is possible both locally and globally. In the example, the three <its:translateRule> elements add information to selected nodes via the @translate attribute. The first <its:translateRule> element also serves to associate information which is already available in the document (i.e. the @translatable="false" attribute at the <Section> element) with ITS translatability information. The local variant of adding information is the @its:translate attribute at the <FormNo> element, or the @its:locInfo / @its:locInfoType attributes at the root element <Manual>. Pointing to existing information is realized via the <its:langRule> element. It contains a @langPointer attribute with a relative XPath expression. Applied to the document given, the <MultiCol> element which has a @myLangAttr attribute is selected for the "language information" data category. The language information is available in that attribute15.
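The global variants of adding and pointing can be sketched in Python with the standard library. The sketch is not a conformant ITS processor, and several points are assumptions: the namespace URI is the one of ITS 1.0, the function names are hypothetical, and ElementTree supports only a small XPath subset, so a trailing attribute step such as /@title is handled by hand and "//" is evaluated relative to the root element (which means the root itself is never selected).

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"  # assumed: the ITS 1.0 namespace URI

DOC = f"""<Manual xmlns:its="{ITS}">
 <Info>
  <its:rules its:version="1.0">
   <its:translateRule select="//*[@translatable='false']" translate="no"/>
   <its:translateRule select="//Section/@title" translate="yes"/>
   <its:translateRule select="//Text" translate="yes"/>
   <its:langRule select="//*[@myLangAttr]" langPointer="@myLangAttr"/>
  </its:rules>
 </Info>
 <Section id="s0" title="#Introduction#" translatable="false">
  <MultiCol cols="1" myLangAttr="en">
   <Text addInfo="...">Some text to localize</Text>
  </MultiCol>
 </Section>
</Manual>"""

def select(root, path):
    """Resolve a @select value to (element, attribute name or None) pairs."""
    attribute = None
    if "/@" in path:  # trailing attribute step, which ElementTree cannot evaluate
        path, attribute = path.rsplit("/@", 1)
    elements = root.findall("." + path)  # ElementTree needs a relative path
    return [(e, attribute) for e in elements if attribute is None or attribute in e.attrib]

def apply_global_rules(root):
    """Return (translate, lang) mappings; the input document is left unchanged."""
    translate, lang = {}, {}
    for rule in root.iter(f"{{{ITS}}}translateRule"):
        for node in select(root, rule.get("select")):
            translate[node] = rule.get("translate")  # adding information
    for rule in root.iter(f"{{{ITS}}}langRule"):
        pointer = rule.get("langPointer").lstrip("@")
        for elem, _ in select(root, rule.get("select")):
            lang[elem] = elem.get(pointer)  # pointing to existing information
    return translate, lang

root = ET.fromstring(DOC)
translate, lang = apply_global_rules(root)
```

The result is kept in two dictionaries outside the tree, which mirrors the requirement that information is associated with the existing XML rather than written into it.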

The mechanisms of selection, adding of information and pointing to existing information are also available in architectural forms. Roughly speaking, global selection corresponds to mapping between client and architectural DTD, and local selection to the usage of attributes specific to an architectural form for element mapping. However, in architectural forms, the mechanisms are combined with the validation and transformation processing tasks, which is not the case in ITS. Another difference is the usage of XPath instead of a grammar-based description of relations between markup declarations, and the lack of inheritance mechanisms in architectural forms. As for CSS, global selection resembles the <style> element in e.g. XHTML, and local selection the XHTML @class attribute. The main difference to ITS is that CSS uses a selection mechanism which is not XML-specific, and that CSS does not allow for pointing to existing information.

One important characteristic of CSS is also available in the ITS mechanisms: precedence descriptions which help to resolve conflicts between contradictory selections. In the example, there is a conflict for the <Text> element: the information inherited from the @translatable="false" attribute (see the first <its:translateRule> element) contradicts the information of the third <its:translateRule> element @translate="yes". However, since information specified via global rules has precedence over inherited information, the conflict can be resolved.
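The conflict resolution for the <Text> element can be traced with a small Python sketch. It assumes the following order of precedence, of which only the middle part is stated above: local @its:translate markup on the element itself, then global rules (a later rule overriding an earlier one), then inherited information, then the default. The rules are given as ElementTree-style paths in a hypothetical GLOBAL_RULES list instead of being read from <its:rules>, and only elements are considered.

```python
import xml.etree.ElementTree as ET

ITS_TRANSLATE = "{http://www.w3.org/2005/11/its}translate"  # assumed ITS 1.0 namespace

DOC = """<Manual xmlns:its="http://www.w3.org/2005/11/its" its:translate="yes">
 <Info><FormNo its:translate="no">8U81-GS-52C</FormNo><Name>Owner's Manual</Name></Info>
 <Section translatable="false">
  <Ltitle><Text>Some text to localize</Text></Ltitle>
 </Section>
</Manual>"""

# Hypothetical stand-in for the first and third <its:translateRule> of fig. 6.
GLOBAL_RULES = [(".//*[@translatable='false']", "no"), (".//Text", "yes")]

def resolve_translatability(root, rules, default="yes"):
    """Map each element to 'yes' or 'no' under the assumed precedence order."""
    selected = {}
    for path, value in rules:
        for elem in root.findall(path):
            selected[elem] = value  # later rules override earlier ones

    resolved = {}

    def visit(elem, inherited):
        # local markup, then a global rule, then the inherited value
        value = elem.get(ITS_TRANSLATE) or selected.get(elem) or inherited
        resolved[elem] = value
        for child in elem:
            visit(child, value)

    visit(root, default)
    return resolved

root = ET.fromstring(DOC)
resolved = resolve_translatability(root, GLOBAL_RULES)
```

In this run <Ltitle> inherits "no" from <Section>, while <Text> receives "yes" because the global rule selecting it takes precedence over the inherited value.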

Data category specific defaults can be compared to default values in architectural forms. In ITS, defaults are an optional feature and are used e.g. for the translatability data category. Here, the default is that elements are translatable (i.e. the information @its:translate="yes" is added to them), and attributes are not translatable (i.e. they get the information @its:translate="no"). "Added" means in this case "being added when an XML document is processed", see also sec. “Generalization of the ITS Approach”.

A comparison to markup semantics / semantic markup makes clear that the ITS approach has nothing to do with meaning description. ITS assigns XML structures (e.g. the translate attribute) to nodes, using optionally existing information from the input document. Markup semantics can then be built on top of the assigned structure, or semantic markup can be added to that structure. Nevertheless, even without a semantic description, such syntactical adding of / pointing to information helps to achieve the goal of XML data re-usability in various contexts.

5 Generalization of the ITS Approach

5.1 A Generalized Description of Data Categories

After solving the re-usage problem for the ITS specific data categories, in this section the ITS approach will be generalized. The basis is an extension of the list from sec. “Generalization of the Problem and Separation to related, yet different Areas”:



It must be stated again that, although data categories are an abstraction over (ITS) markup and their processing expectations, they are still no "real" semantics, esp. in comparison to the approaches described in sec. “Specifying Relations between Markup Vocabularies (2): Markup Semantics (and Semantic Markup)”. ITS is rather concerned with the association of XML data structures with other XML data structures. "True" markup semantics might be built on top of the association.

Fig. 7 shows a schema for the description of data categories, and its instantiation for the categories translatability, localizationInformation and languageInformation.


Figure 7:

Generalized approach of data category specification: schema for data category description and sample instance

...
start = datacats
datacats = element datacats { datacat+ }
datacat = element datacat { attribute name { xsd:NMTOKEN }, defaults?, inheritance, 
                            rulesElement?, localAdding? }
defaults = element defaults { defaultsElements?, defaultsAttributes? }
inheritance = element inheritance { appliesTo }
localAdding = element localAdding { datcatSelector, addedMarkup }
rulesElement = element rulesElement { attribute name { xsd:NMTOKEN } }
defaultsElements = element defaultsElements { any }
defaultsAttributes = element defaultsAttributes { any }
any = anyAttribute*,  mixed { anyElement* }
anyElement = element *  - datc:* { anyAttribute*, mixed {anyElement*} }
anyAttribute = attribute * - datc:* { text }
appliesTo = attribute appliesTo { "onlyElements" | "elementsAndAttributes" | "none" }
datcatSelector = attribute datcatSelector { text }
addedMarkup = attribute addedMarkup { text }
    
<datacats>
 <datacat name="translatability">
  <defaults>
   <defaultsElements its:translate="yes"/>
   <defaultsAttributes its:translate="no"/>
  </defaults>
  <inheritance appliesTo="onlyElements"/>
  <rulesElement name="translateRule"/>
  <localAdding datcatSelector="*[@its:translate]" addedMarkup="@its:translate"/>
 </datacat>
 <datacat name="localizationInformation">
  <inheritance appliesTo="onlyElements"/>
  <rulesElement name="locInfoRule"/>
  <localAdding datcatSelector="*[@its:locInfoType]" addedMarkup="@its:locInfo | @its:locInfoType |
   @its:locInfoRef"/>
 </datacat>
 <datacat name="languageInformation">
  <inheritance appliesTo="elementsAndAttributes"/>
  <rulesElement name="langRule"/>
 </datacat>
</datacats>

Each data category is uniquely named and described within a <datacat> element. Optionally, defaults for elements and attributes can be specified. This is the case for the translatability data category. The nodes attached to the elements <defaultsElements> and <defaultsAttributes> can be interpreted as follows: if there is no further information available (i.e. global rules, local ITS markup or inherited information), then the markup at these elements should be added to a node in a document. That is: for elements @its:translate="yes", and for attributes16 @its:translate="no"17.

The mandatory <inheritance> element specifies whether information at selected nodes is inherited by elements only, by elements and attributes, or by no nodes at all. The optional <rulesElement> provides the name of the rule element for the data category, e.g. <translateRule>. For the processing of global adding and pointing information, the convention is that pointing is realized only via attributes which follow the naming pattern xxxPointer, e.g. @langPointer. All other markup encountered at or within a rules element is interpreted as markup to be added to the selected nodes.
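The naming convention for pointing can be illustrated with a minimal Python sketch. It is not the XSLT implementation described in this paper. The function name split_rule_attributes, the treatment of @select as the selector (rather than as added markup) and the sample attribute values are assumptions made for the illustration only.

```python
def split_rule_attributes(attributes):
    """Split the attributes of a rules element into selector, pointers and added markup.

    `attributes` maps attribute names to values. Assumption: the `select`
    attribute holds the selector and is therefore returned separately.
    """
    selector = None
    pointers = {}  # e.g. "lang" -> "@myLangAttr" for a langPointer attribute
    added = {}     # markup which is added to the selected nodes
    suffix = "Pointer"
    for name, value in attributes.items():
        if name == "select":
            selector = value
        elif name.endswith(suffix) and len(name) > len(suffix):
            # naming pattern xxxPointer: the value points to existing information
            pointers[name[:-len(suffix)]] = value
        else:
            # everything else is interpreted as markup to be added
            added[name] = value
    return selector, pointers, added


# Hypothetical rule attributes, for illustration only.
print(split_rule_attributes({"select": "//*[@myLangAttr]", "langPointer": "@myLangAttr"}))
print(split_rule_attributes({"select": "//Text", "translate": "yes"}))
```

The separation keeps the two tasks apart: pointers are evaluated against the selected node, whereas added markup is copied without being evaluated.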

Finally, for local adding there is the optional <localAdding> element. It has two attributes: @datcatSelector contains the local selector for a data category, e.g. *[@its:locInfoType], and @addedMarkup specifies the information which is added to the selected nodes. Both attributes use XPath; the latter is interpreted relative to the former. Read aloud, the usage of these attributes for the localizationInformation data category means: "Select all element nodes with an @its:locInfoType attribute and add to them the information given in the attributes @its:locInfo, @its:locInfoType and @its:locInfoRef of the same element node."
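The interplay of @datcatSelector and @addedMarkup can be sketched as follows. This is a simplified illustration in Python, not the XSLT implementation: it handles only selectors of the form *[@prefix:name] and unions of attribute steps instead of full XPath. The function names, the sample document and the namespace URI bound to the its prefix are assumptions of the sketch.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: namespace URI bound to the "its" prefix.
NAMESPACES = {"its": "http://www.w3.org/2005/11/its"}


def qname(step):
    """Turn an attribute step like '@its:locInfo' into ElementTree's '{uri}locInfo'."""
    prefix, local = step.strip().lstrip("@").split(":")
    return "{%s}%s" % (NAMESPACES[prefix], local)


def local_adding(root, datcat_selector, added_markup):
    """Return (element, added attributes) pairs for one data category.

    Simplification: only selectors of the form *[@prefix:name] are supported,
    and added_markup must be a union ('|') of attribute steps.
    """
    match = re.fullmatch(r"\*\[(@[\w.-]+:[\w.-]+)\]", datcat_selector)
    if match is None:
        raise ValueError("unsupported selector: " + datcat_selector)
    trigger = qname(match.group(1))
    wanted = [qname(step) for step in added_markup.split("|")]
    results = []
    for element in root.iter():
        if trigger in element.attrib:
            # addedMarkup is interpreted relative to the selected element
            added = {name: element.attrib[name] for name in wanted if name in element.attrib}
            results.append((element, added))
    return results


# Hypothetical sample document.
doc = ET.fromstring(
    '<Manual xmlns:its="http://www.w3.org/2005/11/its" '
    'its:locInfoType="description" its:locInfo="Added later.">'
    "<Info>text</Info></Manual>"
)
for element, added in local_adding(
    doc, "*[@its:locInfoType]", "@its:locInfo | @its:locInfoType | @its:locInfoRef"
):
    print(element.tag, added)
```

Attributes which are named in @addedMarkup but absent from the selected element (here @its:locInfoRef) are simply skipped.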

5.2 Implementation

An XSLT implementation has been developed. It processes XML input documents such as the one in fig. 6 and data category descriptions such as those in fig. 7. The XML documents can contain global rules, local markup or neither. In the last case, only the defaults for the data category are processed.

The processing works in two steps: first, the data category descriptions and (if existing) global rules from the input document are used to generate a new XSLT stylesheet. This is then applied to generate data category information for each node in the input document. The output format is shown in fig. 8.


Figure 8:

Output format for processing data category descriptions and global / local, data category specific markup

<nodeList ...>
 <nodeList datacat="translatability">
  <node path="/Manual" outputType="new-value-local">
   <output its:translate="yes"/>
  </node>
  <node path="/Manual/Info[1]" outputType="inherited">
   <output its:translate="yes"/>
  </node>[...]
  <node path="/Manual/Section[1]/@id" outputType="default-value">
   <output its:translate="no"/>
  </node>
  <node path="/Manual/Section[1]/@title" outputType="new-value-global">
   <output translate="yes"/>
  </node>[...]</nodeList>
    
 <nodeList datacat="localizationInformation">
  <node path="/Manual" outputType="new-value-local">
   <output its:locInfo="The Section element contains existing localization information,
                the Info element contains ITS markup which was added later."
                its:locInfoType="description"/>
  </node>
  <node path="/Manual/Info[1]" outputType="inherited">
   <output its:locInfo="The Section element contains existing localization information,
                the Info element contains ITS markup which was added later."
                its:locInfoType="description"/>
  </node>[...]
  <node path="/Manual/Section[1]/@id" outputType="no-value">
   <output/>
  </node>[...]</nodeList>
    
 <nodeList datacat="languageInformation">
  <node path="/Manual" outputType="no-value">
   <output/>
  </node>[...]
  <node path="/Manual/Section[1]/Ltitle[1]/Mtitle[1]/MultiCol[1]" outputType="new-value-global">
   <output myLangAttr="en"/>
  </node>[...]
  <node path="/Manual/Section[1]/Ltitle[1]/Mtitle[1]/MultiCol[1]/Text[1]/@addInfo" outputType="inherited">
   <output myLangAttr="en"/>
  </node>[...]</nodeList>
</nodeList>

For each data category there is a <nodeList> element, with a list of <node> elements and path information. The @outputType attribute value describes what kind of data category value is available for the node: no-value, default-value, new-value-global, new-value-local or inherited. The <output> element is a wrapper for the information which pertains to the node. If there is no value, this element is empty.
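How the @outputType values come about can be sketched in Python. The sketch is not the XSLT implementation; it only mirrors the behaviour described in this paper: information from local markup or global rules has precedence over inherited information, and defaults are used only if nothing else is available. The function names and the input format (a dictionary explicit which maps node paths to already selected local or global values) are hypothetical. A further assumption is that default values are not handed down to descendants.

```python
import xml.etree.ElementTree as ET


def pick(path, explicit, inherited, default):
    """Choose the value for one node: explicit, then inherited, then default."""
    if path in explicit:
        return explicit[path]  # ("new-value-local" | "new-value-global", value)
    if inherited is not None:
        return ("inherited", inherited)
    if default is not None:
        return ("default-value", default)
    return ("no-value", None)


def resolve(root, explicit, defaults, applies_to):
    """Return (path, outputType, value) for every element and attribute node.

    defaults:   {"element": value or None, "attribute": value or None}
    applies_to: "onlyElements", "elementsAndAttributes" or "none"
    """
    rows = []

    def visit(element, path, inherited):
        kind, value = pick(path, explicit, inherited, defaults.get("element"))
        rows.append((path, kind, value))
        # Assumption: only explicit or inherited values are handed down.
        hand_down = value if kind not in ("default-value", "no-value") else None
        if applies_to == "none":
            hand_down = None
        for name in element.attrib:
            to_attr = hand_down if applies_to == "elementsAndAttributes" else None
            a_path = path + "/@" + name
            a_kind, a_value = pick(a_path, explicit, to_attr, defaults.get("attribute"))
            rows.append((a_path, a_kind, a_value))
        counts = {}  # position among siblings with the same name
        for child in element:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            visit(child, "%s/%s[%d]" % (path, child.tag, counts[child.tag]), hand_down)

    visit(root, "/" + root.tag, None)
    return rows


# Hypothetical input: a small document and two values from global rules.
doc = ET.fromstring(
    '<Manual><Info>x</Info><Section id="s1" title="Intro"><Text>y</Text></Section></Manual>'
)
explicit = {
    "/Manual/Section[1]": ("new-value-global", "no"),
    "/Manual/Section[1]/@title": ("new-value-global", "yes"),
}
for row in resolve(doc, explicit, {"element": "yes", "attribute": "no"}, "onlyElements"):
    print(row)
```

With the inheritance pattern "onlyElements", the <Text> element inherits from <Section>, whereas the @id attribute falls back to the attribute default.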

For the domain for which ITS was originally developed, one major application is the identification of translatable and non-translatable text. The output format described above makes this information explicit and provides an input for test suite development. An application in the area of localization is the generation of XLIFF [XML Localization Interchange File Format] [Savourel and Reid 2003] documents. They contain extracted text sequences, which are presented to the translators. Another possible implementation would be an ITS sensitive editing environment which highlights translatable units.

5.3 Sample Application: Interrelating XML Vocabularies

In this section, the general applicability of the ITS methodology will be demonstrated. It is independent of the domain of internationalization and localization.

The scenario of interrelating XML Vocabularies is described by [Ogbuji 2006] as a key means to allow people to share information in the same domain, even if they use different vocabularies. Fig. 9 is based on [Ogbuji 2006] and shows two product descriptions which should be interrelated.


Figure 9:

Example of XML vocabularies which share some characteristics

<line-item is-a="product" sku="438-AX">
 <prod-name>Python Perfect IDE</prod-name>
 <price units="USD">250</price>
 <detail>Integrated Development environment software for Python</detail>
 <freight class="C">
  <weight units="Kg">0.8</weight>
 </freight>
</line-item>
...
<product sku="438-AX" info-page="http://example.com/product-info/ppi">
 <name>Python <trademark>Perfect</trademark> IDE</name>
 <description>Integrated Development environment software for Python</description>
 <long-description>Uses mind-reading technology to anticipate and 
  accommodate all user needs in Python development.</long-description>
 <price currency="USD">250</price>
</product>

The documents share some information, but use different names for some elements and attributes which contain this information. Using the ITS methodology, the data category description and the global rules in fig. 10 can be created.


Figure 10:

Interrelation of markup from fig. 9

<datacat name="product">
 <inheritance appliesTo="none"/>
 <rulesElement name="productRule"/>
</datacat>
...
<its:rules ...>
 <its:productRule select="//line-item[@is-a='product']" skuPointer="@sku" namePointer="prod-name" 
                  descriptionPointer="detail" priceCurrencyPointer="price/@units"/>
 <its:productRule select="//product" skuPointer="@sku" namePointer="name"
                  descriptionPointer="description" priceCurrencyPointer="price/@currency"/>
 <its:productRule select="//product" skuPointer="@sku" namePointer="name"
                  descriptionPointer="description" priceCurrencyPointer="price/@currency"
                  trademarkPointer="name/trademark"/>
</its:rules>

The data category description contains only the name of the category, the name of the rules element and the inheritance description. The latter is set to "none", because an interrelation of e.g. the <line-item> element and the <product> element does not imply a relation between their child elements or attributes. The usage of this data category definition relies only on pointer attributes: for each piece of information which is shared across the vocabularies, there is such an attribute. For example, the @namePointer attribute is used to point to the name of the product. For the first document / the first <productRule> element, the appropriate value is prod-name; for the second document / the second <productRule> element, the value is name.

With the ITS methodology, contradictory relation descriptions are possible as well. An example is given by the second and the third <productRule> element. They differ only with respect to the @trademarkPointer attribute, which is present only at the last <productRule> element. A reason for such contradictory global rules could be that they are created for different generalization scenarios and by different persons / business departments, and only one of the scenarios allows for the <trademark> element. The current order of the <productRule> elements would give this scenario precedence.
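The effect of the rules in fig. 10 can be sketched in Python. This is not the XSLT implementation. The documents from fig. 9 are shortened and placed inside a hypothetical <catalog> wrapper, the pointer attributes are written without the Pointer suffix, and the selectors start with "." because the ElementTree library used here accepts only relative paths. The function names follow and interrelate are assumptions as well. The rule that a later <productRule> replaces an earlier one for the same node follows the order-based precedence described above.

```python
import xml.etree.ElementTree as ET

CATALOG = """<catalog>
<line-item is-a="product" sku="438-AX">
 <prod-name>Python Perfect IDE</prod-name>
 <price units="USD">250</price>
 <detail>Integrated Development environment software for Python</detail>
</line-item>
<product sku="438-AX" info-page="http://example.com/product-info/ppi">
 <name>Python <trademark>Perfect</trademark> IDE</name>
 <description>Integrated Development environment software for Python</description>
 <price currency="USD">250</price>
</product>
</catalog>"""

# The three <its:productRule> elements of fig. 10 as dictionaries.
RULES = [
    {"select": ".//line-item[@is-a='product']", "sku": "@sku", "name": "prod-name",
     "description": "detail", "priceCurrency": "price/@units"},
    {"select": ".//product", "sku": "@sku", "name": "name",
     "description": "description", "priceCurrency": "price/@currency"},
    {"select": ".//product", "sku": "@sku", "name": "name",
     "description": "description", "priceCurrency": "price/@currency",
     "trademark": "name/trademark"},
]


def follow(element, pointer):
    """Resolve a pointer of the form 'a/b', '@x' or 'a/b/@x' relative to element."""
    path, _, attribute = pointer.partition("@")
    path = path.rstrip("/")
    target = element.find(path) if path else element
    if target is None:
        return None
    if attribute:
        return target.get(attribute)
    return "".join(target.itertext())  # text content including child elements


def interrelate(root, rules):
    """Apply the rules in order; a later rule replaces an earlier one for the same node."""
    result = {}
    for rule in rules:
        for element in root.findall(rule["select"]):
            result[element] = {
                key: follow(element, pointer)
                for key, pointer in rule.items() if key != "select"
            }
    return result


root = ET.fromstring(CATALOG)
for element, record in interrelate(root, RULES).items():
    print(element.tag, record)
```

Both product descriptions are mapped to records with the same keys, and the input documents themselves remain unchanged.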

The output format shown in sec. “Implementation” makes explicit that, in contrast to architectural forms, the processing of data category descriptions is not bound to a transformation. In the application scenario of XML vocabulary interrelation, this means that no information about the original markup is lost. This is similar to Schematron abstract patterns, as [Ogbuji 2006] notes: they allow for expressing various information about selected nodes, without requiring validation and transformation.

The difference between the Schematron and the ITS approach is that ITS uses different mechanisms than the Schematron assertion based style. The key advantage of ITS is that these mechanisms allow for clearly separating the tasks of selecting versus adding / pointing to information, and for specifying inheritance and precedence18 behavior.

6 Conclusion

This paper used a "real world scenario" of re-using XML in different contexts, with different people participating over time. The scenario was the requirement to express information about translatability of element and attribute content, which is highly demanded for the internationalization and localization of XML. Problems within this scenario were introduced, and the general task of specifying a new purpose for existing schemes / markup was discussed.

Other topics are related to this task, like markup semantics, extensibility and versioning. Still they are different, and the methodologies developed for them have not proved so far to solve the reusability problem. However, these methodologies, together with the unexpected companion "CSS", provided important input for the development of the ITS mechanism. This allows for specifying information in new usage scenarios to existing documents / schemas. The ITS mechanism has proved to be useful in the given domain of internationalization / localization, but also in the general task of relating arbitrary, existing vocabularies.

An important area of future development is the selection mechanism deployed within ITS. Currently the ITS methodology relies on XPath, which comes at a price: the evaluation of XPath expressions calls for a static processing model. This seems to be computationally too expensive to be implemented in e.g. an editing environment which needs dynamic selections. An existing approach which might provide input to this application area is XBL [XML Binding Language] [Ferraiolo et al. 2005]. A further topic is the exploration of new application areas of the ITS methodology, beyond the existing ITS data categories and the interrelation of markup vocabularies. XML data binding, which falls under the "concrete data-structure mapping problem" mentioned in sec. “Specifying Relations between Markup Vocabularies (2): Markup Semantics (and Semantic Markup)”, seems to be a task which could profit from ITS.

Notes

1.

The methodology introduced in this paper was developed within the W3C i18n ITS [Internationalization Tag Set] Working Group. Many Working Group participants have contributed to this development. However, especially the description of the generalization of the ITS methodology in sec. “Generalization of the ITS Approach” does not express the opinion of the working group. The author of this paper is responsible for any unclearness and mistakes in any parts of this paper.

2.

Having no schema available is a challenge because a re-usage scenario and its implementation cannot be verified to be applicable for more than the given set of documents.

3.

This need is a demand of extremely high priority for the internationalization and localization of XML. Localization of XML is a task which is rarely considered for the design of markup schemes or ad hoc creation of documents without schemes. In this somewhat unfortunate sense, it seems to be appropriate to call expression of translatability a "real world" re-usage scenario.

4.

This paper provides schema examples mainly in the format of RELAX NG compact syntax. There is no specific reason for this, except shortness and ease of readability. Nevertheless, the discussions of the schemas are not specific to RELAX NG, since - except where explicitly noted - none of the schemas uses constructs which could not be easily converted into an XML Schema document or an XML DTD.

5.

An example of additional information could be a <transinfo> element which describes what has to be translated. Unfortunately, adding this element to the document would disturb e.g. XPath expressions which rely on the given document structure, and which could be - again - created by a different person than the translator.

6.

Various approaches to markup semantics are discussed in [Sasaki 2004].

7.

The terminology markup semantics versus semantic markup has been introduced by [Renear et al. 2002].

8.

Architectural forms allow for specifying whether child elements of a given element should be processed, copied to the transformation output or filtered. However, such specifications differ from inheritance descriptions like e.g.: "Descendant elements inherit translatability information from their ancestors."

9.

The separation of the CSS selection mechanism into a separate specification theoretically allows for CSS independent usages. However, it seems to be unlikely that it will be widely deployed in the area of XML re-usage.

10.

To allow for using RDF locally and globally in (mainly XHTML) documents, currently a syntax called RDF/A [Adida and Birbeck 2006] is being defined.

11.

The difference between localization and internationalization will not be addressed in this paper. However, for further information, see [Ishida and Miller 2005].

12.

An example target for such an association is the @translatable="false" attribute in fig. 1. The association of existing markup with the ITS definition of translatability can then be used by an ITS aware tool e.g. for the extraction of translatable units from varying markup schemes.

13.

Another reason why ITS does not define markup directly is that one requirement was the availability of markup declarations in at least the schema languages XML DTD, XML Schema and RELAX NG. For this purpose, the ODD [One Document Does it all] language [Rahtz et al. 2004] was used, a literate programming scheme for the generation of documentation and schema fragments in these schema languages.

14.

As for language information, the decision for the inheritance pattern "elements and attributes" is necessary to be compliant with xml:lang. The choice for the inheritance pattern "only elements" for translatability is motivated by existing practice.

15.

The use case for language information is to state that the value given in the @myLangRule attribute can be interpreted like xml:lang. That is, tools which understand these rules can specify a common semantics of the - unfortunately great - variety of language information markup.

16.

Adding information to attributes can clearly not be used for a translation scenario. However, it is useful if e.g. text extraction is a subsequent processing step.

17.

In this example, only attributes are added to selected nodes. But it would also be possible to specify a sequence of attributes, text and elements within the <defaultsElements> and / or <defaultsAttributes> element.

18.

Schematron implements a precedence mechanism which is motivated by the @priority attribute for XSLT templates. The problem with this mechanism is that the priority has to be hard-wired and that it does not distinguish between global, local and inherited information. However, such a distinction is necessary to be able to deal with conflicting rules (see the last two <productRule> elements in fig. 10).


Bibliography

[Adida and Birbeck 2006] Adida, B. and M. Birbeck. RDFa Primer 1.0. Embedding RDF in XHTML. W3C Working Draft 16 May 2006. Available at
http://www.w3.org/TR/2006/WD-xhtml-rdfa-primer-20060516/.

[Bos et al. 1998] Bos, B., H. Wium Lie, C. Lilley and I. Jacobs. Cascading Style Sheets, Level 2 CSS2 Specification. W3C Recommendation 12 May 1998. Available at
http://www.w3.org/TR/1998/REC-CSS2-19980512.

[Cowan 2002] Cowan, J. Architectural Forms: A New Generation (Draft 2.3). Available at
http://home.ccil.org/~cowan/XML/afng.html.

[Ferraiolo et al. 2005] Ferraiolo, J., I. Hickson and D. Hyatt. SVG's XML Binding Language (sXBL). W3C Working Draft 15 August 2005. Available at
http://www.w3.org/TR/2005/WD-sXBL-20050815.

[Glazman et al. 2005] Glazman, D., T. Çelik, I. Hickson, P. Linss and J. Williams. Selectors. W3C Working Draft 15 December 2005. Available at
http://www.w3.org/TR/2005/WD-css3-selectors-20051215.

[Ishida and Miller 2005] Ishida, R. and S. K. Miller. Localization vs. Internationalization. Article, W3C Internationalization Activity. Available at
http://www.w3.org/International/questions/qa-i18n.

[ISO/IEC 10744] Information Technology - Hypermedia/Time-based Structuring Language (HyTime). International Organization for Standardization, 1997.

[ISO/IEC 19757-8] Information Technology - Document Schema Definition Languages (DSDL) - Part 8: Document Schema Renaming Language (DSRL). International Organization for Standardization, 2006 (under development).

[Klyne and Carroll 2004] Klyne, G. and J. J. Carroll. Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation 10 February 2004. Available at
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

[Lieske and Sasaki 2006] Lieske, C. and F. Sasaki. Internationalization Tag Set (ITS) Version 1.0. W3C Working Draft 18 May 2006. Available at
http://www.w3.org/TR/2006/WD-its-20060518.

[Lieske et al. 2006] Lieske, C., S. Rahtz and F. Sasaki. Internationalization and Localization of XML: Introducing "ITS". In: Proceedings of XTech 2006, Amsterdam.

[Ogbuji 2006] Ogbuji, U. Chameleon XML models. In: Proceedings of XTech 2006, Amsterdam.

[Rahtz et al. 2004] Rahtz, S., N. Walsh and L. Burnard. A unified model for text markup: TEI, Docbook, and beyond. In: Proceedings of XML Europe 2004, Amsterdam.

[Renear et al. 2002] Renear, A., D. Dubin, C. M. Sperberg-McQueen and C. Huitfeldt. Towards a Semantics for XML Markup. In: Proceedings of the 2002 ACM Symposium on Document Engineering, Virginia, 2002.

[Sasaki 2004] Sasaki, F. Secondary Information Structuring - A Methodology for the Vertical Interrelation of Information Resources. In: Proceedings of Extreme Markup Languages 2004, Montréal.

[Sasaki et al. 2005] Sasaki, F., C. Lieske and A. Witt. Schema Languages & Internationalization Issues: A survey. In: Proceedings of Extreme Markup Languages 2005, Montréal.

[Savourel 2006] Savourel, Y. Internationalization and Localization Markup Requirements. W3C Working Draft 18 May 2006. Available at
http://www.w3.org/TR/2006/WD-itsreq-20060518/.

[Savourel and Reid 2003] Savourel, Y. and J. Reid. XLIFF 1.1 Specification. Oasis Committee Specification, 31 October 2003. Available at
http://www.oasis-open.org/committees/xliff/documents/cs-xliff-core-1.1-20031031.htm.

[Sperberg-McQueen et al. 2000] Sperberg-McQueen, C. M., C. Huitfeldt and A. Renear. Meaning and interpretation of markup. Available at
http://www.w3.org/People/cmsmcq/2000/mim.html.

[Sperberg-McQueen and Miller 2004] Sperberg-McQueen, C. M. and E. Miller. On mapping from colloquial XML to RDF using XSLT. In: Proceedings of Extreme Markup Languages 2004, Montréal.