The technology described in this document - the
This document was published by the
All
Substantive changes during the first last call period are:
Since the
To give feedback send your comments to
Publication as a Last Call Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the
This is the first version of this document.
ITS 2.0 is a technology to add metadata to Web content, for the benefit of localization,
language technologies, and internationalization. The ITS 2.0 specification both identifies
concepts (such as Translate
) that are important for internationalization and
localization, and defines implementations of these concepts (termed “ITS data categories”)
as a set of elements and attributes called the
This document aims to realize many of the ideas formulated in the ITS 2.0 Requirements
document, in
Not all requirements listed there are addressed in this document. Those which are not
addressed here are either covered in
ITS 2.0 has the following relations to ITS 1.0:
It adopts and maintains the following principles from ITS 1.0:
ITS 1.0 provided the Ruby data category. ITS 2.0 does not provide ruby since at the time of writing, a stable model for ruby was not available. There are ongoing discussions about the ruby model in HTML5. Once these discussions are settled, in a subsequent version of ITS, the ruby data category may be re-introduced.
ITS 2.0 also adds the following principles and features not found in ITS 1.0:
The new data categories included in ITS 2.0 are:
Content or software that is authored in one language (the
In addition, document formats expressed by schemas may be used by people in different parts of the world, and these people may need special markup to support the local language or script. For example, people authoring in languages such as Arabic, Hebrew, Persian, or Urdu need special markup to specify directionality in mixed direction text.
From the viewpoints of feasibility, cost, and efficiency, it is important that the
original material should be suitable for localization. This is achieved by appropriate
design and development, and the corresponding process is referred to as
internationalization. For a detailed explanation of the terms “localization” and
“internationalization”, see
The increasing usage of XML as a medium for documentation-related content (e.g. DocBook and DITA as formats for writing structured documentation, well suited to computer
hardware and software manuals) and software-related content (e.g. the eXtensible User
Interface Language
The following examples sketch one of the issues that currently hinder efficient XML-related localization: the lack of a standard, declarative mechanism that identifies which parts of an XML document need to be translated. Tools often cannot automatically perform this identification.
In this document it is difficult to distinguish between those string
elements that are translatable and those that are not. Only the addition of an
explicit flag could resolve the issue.
Even when metadata are available to identify non-translatable text, the conditions
may be quite complex and not directly indicated with a simple flag. Here, for
instance, only the text in the nodes matching the expression
//component[@type!='image']/data[@type='text'] is translatable.
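The selection described above can be sketched in Python with the standard library. The sample document is hypothetical (only the element and attribute names come from the expression quoted above), and the filter is written by hand rather than passed to an XPath engine, so it is an approximation of the expression, not a general XPath evaluator.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element and attribute names follow the
# expression quoted above, the content is invented for illustration.
SAMPLE = """
<resources>
  <component id="a" type="image">
    <data type="text">logo.png</data>
  </component>
  <component id="b" type="caption">
    <data type="text">Welcome</data>
    <data type="coordinates">12,34</data>
  </component>
</resources>
"""

def translatable_data(root):
    """Approximate //component[@type!='image']/data[@type='text'] by hand."""
    selected = []
    for component in root.iter("component"):
        kind = component.get("type")
        # In XPath, @type!='image' is false when the attribute is missing.
        if kind is None or kind == "image":
            continue
        # Only direct data children whose type attribute equals "text".
        selected.extend(component.findall("data[@type='text']"))
    return selected

root = ET.fromstring(SAMPLE)
texts = [d.text for d in translatable_data(root)]
print(texts)
```

Without a declarative mechanism, each tool has to hard-code a filter like this one for every format it processes.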
The ITS specification aims to provide different types of users with information about what markup should be supported to enable worldwide use and effective internationalization and localization of content. The following paragraphs sketch these different types of users, and their usage of ITS. In order to support all of these users, the information about what markup should be supported to enable worldwide use and effective localization of content is provided in this specification in two ways:
This type of user will find proposals for attribute and element names to be included in their new schema (also called "host vocabulary"). Using the attribute and element names proposed in the ITS specification may be helpful because it leads to easier recognition of the concepts represented by both schema users and processors. It is perfectly possible, however, for schema developers to develop their own set of attribute and element names. The specification sets out, first and foremost, to ensure that the required markup is available, and that the behavior of that markup meets established needs.
This type of user will be working with schemas such as DocBook, DITA, or perhaps a proprietary schema. The ITS Working Group has sought input from experts developing widely used formats such as the ones mentioned.
The question "How to use ITS with existing popular markup schemes?" is
covered in more detail (including examples) in a separate document:
Developers working on existing schemas should check whether their schemas support the markup proposed in this specification, and, where appropriate, add the markup proposed here to their schema.
In some cases, an existing schema may already contain markup equivalent to that
recommended in ITS. In this case it is not necessary to add duplicate markup since
ITS provides mechanisms for associating ITS markup with markup in the host
vocabulary which serves a similar purpose (see
This type of user includes companies which provide tools for authoring, translation or other flavors of content-related software solutions. It is important to ensure that such tools enable worldwide use and effective localization of content. For example, translation tools should prevent content marked up as not for translation from being changed or translated. It is hoped that the ITS specification will make the job of vendors easier by standardizing the format and processing expectations of certain relevant markup items, and allowing them to more effectively identify how content should be handled.
This type of user comprises authors, translators and other types of content author. The markup proposed in this specification may be used by them to mark up specific bits of content. Aside: The burden of inserting markup can be removed from content producers by relating the ITS information to relevant bits of content in a global manner (see global, rule-based approach). This global work, however, may fall to information architects, rather than the content producers themselves.
Content producers often work with content management systems (CMS). In many CMSs, some fields can store only plain text. For these fields, the current ITS 2.0 data categories can only be applied globally and not with local attributes. This issue should be addressed by means outside the ITS 2.0 standard. One way would be to allow HTML in these fields if possible; another would be to use an extra field which allows HTML input and to save the plain text of this extra field in the plain text field.
This type of service is intended for a broad user community ranging from developers and integrators through translation companies and agencies, freelance translators and post-editors to ordinary translation consumers and other types of MT employment. Data categories are envisaged for supporting and guiding the different automated backend processes of this service type, thereby adding substantial value to the service results as well as possible subsequent services. These processes include basic tasks, like parsing constraints and markup, and compositional tasks, such as disambiguation. These tasks consume and generate valuable metadata from and for third party users, for example, provenance information and quality scoring, and add relevant information for follow-on tasks, processes and services, such as MT post-editing, MT training and MT terminological enhancement.
This type of service provides automatically generated metadata for improving localization, data integration or knowledge management workflows. This class of users comprises developers and integrators of services that automate language technology tasks such as domain classification, named entity recognition and disambiguation, term extraction, language identification and others. Text analytics services generate data that contextualizes the raw content with more explicit information. This can be used to improve the output quality in machine translation systems, search result relevance in information retrieval systems, as well as management and integration of unstructured data in knowledge management systems.
These types of users are concerned with localization workflows in which content
goes through certain steps: preparation for localization, start of the localization
process by e.g. a conversion into a bitext (aligned parallel text) format like
Metadata roundtripping, that is, the availability of metadata both before and after
the localization process, is crucial for many tasks of the localization workflow
manager. An example is metadata-based quality control, with checks like “Have
all pieces of content set to translate="no" been left
unchanged?”. Other pieces of metadata are relevant for proper
internationalization during the localization workflow, e.g. the availability of Directionality markup for adequate visualization of
bidirectional text.
The ITS specification proposes several mechanisms for supporting worldwide use and effective internationalization and localization of content. We will sketch them below by looking at them from the perspectives of certain user types. For the purpose of illustration, we will demonstrate how ITS can indicate that certain parts of content should or should not be translated.
A content author uses an attribute on a particular element to say that the text in the element should not be translated.
The its:translate="no" attributes indicate that the path
and the cmd elements should not be translated.
A content author or information architect uses markup at the top of the document to identify a particular type of element or context in which the content should not be translated.
The path or cmd elements should not be
translated.
A processor may insert markup at the top of the document which links to ITS information outside of the document.
A
The path or cmd element should not be translated.
A schema developer integrates ITS markup declarations in their schema to allow users to indicate that specific parts of the content should not be translated.
The declarations for the commonAtts. This allows to use the
The first two approaches above can be likened to the use of CSS. With the style attribute, an XHTML
content author may assign a color to a particular paragraph. That author could also
have used the style element at the top of the page to say that all
paragraphs of a particular class or in a particular context would be colored red.
For applying ITS 2.0 data categories to HTML, four aspects must be considered:
In the following sections these aspects are briefly discussed.
To account for the so-called “global
approach” in HTML, this specification (see script element.
It is preferable to use external global rules linked via the link element rather than to have inline global rules in the HTML document.
The link element points to the rules file
EX-translateRule-html5-1.xml. The rel attribute identifies
the ITS specific link relation its-rules.
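A processor has to discover such links before it can load the rules. The following is a minimal sketch using Python's standard html.parser module. The class name ItsRulesLinkFinder and the sample page are assumptions made for illustration. Only the link element, the rel value its-rules and the file name come from the text above.

```python
from html.parser import HTMLParser

class ItsRulesLinkFinder(HTMLParser):
    """Collect href values of link elements whose rel contains its-rules."""

    def __init__(self):
        super().__init__()
        self.rule_files = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated link types.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "its-rules" in rel_values and attributes.get("href"):
            self.rule_files.append(attributes["href"])

# Hypothetical HTML5 page used only for this sketch.
HTML_DOC = """<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Translate flag test</title>
    <link href="EX-translateRule-html5-1.xml" rel="its-rules">
    <link href="style.css" rel="stylesheet">
  </head>
  <body><p>Hello</p></body>
</html>"""

finder = ItsRulesLinkFinder()
finder.feed(HTML_DOC)
print(finder.rule_files)
```

Resolving the collected href values against the document's base URL and fetching the rules files is left out of the sketch.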
The rules file linked in
In HTML, an ITS 2.0 local data category is realized with the specific prefix its-*.
The general mapping of the XML based ITS 2.0 attributes to their HTML its-* counterparts is defined in
There are four ITS 2.0 data categories that have direct counterparts in HTML markup. For these data categories, ITS 2.0 defines the following specific behaviour:
lang
attribute counterpart; in XHTML this is the xml:lang attribute. These attributes act as
local markup for the Language Information data category in HTML and
take precedence over language information conveyed via a global
id attribute.
This attribute acts as local markup for the Id Value data category in HTML and takes precedence over
id information conveyed via a global
withinText="yes" by default.
translate attribute. ITS 2.0 does not define its own behaviour for HTML5 translate, but just refers to the HTML5 definition.
The html element is interpreted to convey the
Language Information value
p element is interpreted to
convey the Id Value of
em element
is interpreted to be withinText="yes". The img element is set to be translatable via a translate attribute. Here the alt attribute will also be translatable.
There are also some HTML markup elements whose roles and behaviour are similar, but not always identical, to those of certain ITS 2.0 data categories.
For example, the HTML dfn element
could be used to identify a term in the sense of the Terminology data
category. However, this is not always the case and it depends on the
intentions of the content author. To accommodate this situation, users
of ITS 2.0 are encouraged to specify the association of existing HTML
markup with a dedicated global rules file. For an example rules file see the
XML I18N Best Practices document.
The Provenance and the Localization Quality Issue data categories allow for using standoff markup. In HTML such standoff markup is put into a script element. The constraints for Provenance standoff markup in HTML and Localization quality issue markup in HTML need to be taken into account. Examples of standoff markup in HTML for the two data categories are
ITS 2.0 does not define how to use ITS in HTML versions prior to version 5. Users are
encouraged to migrate their content to HTML5 or XHTML. While it is possible to use
its-* attributes in older versions of HTML, these attributes will be marked as invalid in validators.
ITS 2.0 has no normative dependency on
The definition of what a localization process or localization parameters must address is outside the scope of this standard and it does not address all of the mechanisms or data formats (sometimes called localization project parameters) that may be needed to configure localization workflows or process specific formats. However, it does define standard data categories that may be used in defining localization workflows or processing specific formats.
“
Abstraction via
Powerful
Content authors, for example, need a simple way to work with the Translate data category in order to express whether the
content of an element or attribute should be translated or not. Localization managers,
on the other hand, need an efficient way to manage translations of large document sets
based on the same schema. These needs could be realized by a specification of defaults
for the Translate data category along with exceptions
to those defaults (e.g. all p elements should be translated, but not
p elements inside of an index element).
To meet these requirements this specification introduces mechanisms that add ITS
information to XML documents, see
The ITS selection mechanisms allow you to provide information about content locally (specified at the XML or HTML element to which it pertains) or globally (specified in another part of the document). Global selection mechanisms can be in the same document, or in a separate file.
As a general guidance, implementations of ITS 2.0 should use a normalizing transcoder. Further information on the topic of Unicode normalization is provided by
Information (e.g. "translate this") captured by ITS markup (e.g.
its:translate='yes') always pertains to one or more XML or HTML nodes,
primarily element and attribute nodes, as defined in
The mechanisms defined for ITS selection resemble those defined in CSS. The local approach can be compared to the style attribute in HTML/XHTML, and the approach with global rules is
similar to the style element in HTML/XHTML. ITS usually uses XPath for
identifying nodes although CSS Selectors and other query languages can be used if
supported by the application. Thus,
author element in DocBook)
ITS markup can be used with XML documents (e.g. a DocBook article), or schemas (e.g. an XML Schema document for a proprietary document format).
The following two examples sketch the distinction between the local and global
approaches, using the
In this example, a translate attribute indicates that all content inside the author element should be protected from
translation. Translation tools that are aware of the meaning of this attribute can
then screen the relevant content from the translation process.
For this example to work, the schema developer will need to add the
The local approach cannot be applied to a particular attribute. It applies to the content of the current element and to all nodes that inherit from it, as described in
The document in style element in
For this approach to work, the schema developer needs to add the
For specification of the Translate data category
information, the contents of the
The global, rule-based approach has the following benefits:
p elements in an XML
instance)
term element in DITA)
The commonality in both examples above is the markup translate='no'.
This piece of ITS markup can be interpreted as follows:
The ITS
p elements in an XML document)
term element in
DITA)
The power of the ITS selection mechanisms comes at a price: rules related to overriding/precedence, and inheritance, have to be established.
The document in head element inside
text and for all its children. Because the title element is
actually translatable, the global rule needs to be overridden by a local
its:translate="yes". Note that the global rule is processed first,
regardless of its position inside the document. In the main body of the document, the
default applies, and here it is its:translate="no" that is used to set
“faux pas” as non-translatable.
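The interplay of global rule, local override, inheritance and default described above can be sketched as a small resolver. This is a simplification under stated assumptions: global rules are modelled as a mapping from element name to value instead of real XPath selectors, the sample document is hypothetical, and the function name resolve_translate is invented for this sketch.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"

# Hypothetical document modelled on the description above.
SAMPLE = f"""
<text xmlns:its="{ITS_NS}">
  <head>
    <revision>Sep-10-2006 v5</revision>
    <title its:translate="yes">The Origins of Modern Novel</title>
  </head>
  <body>
    <par>He said that was a <hi its:translate="no">faux pas</hi>.</par>
  </body>
</text>
"""

def resolve_translate(elem, global_rules, inherited="yes", out=None):
    """Return (tag, value) pairs in document order.

    Order of checks: local attribute, then global rule, then the value
    inherited from the parent, which starts as the default "yes".
    """
    out = [] if out is None else out
    local = elem.get(TRANSLATE)
    if local is not None:
        value = local
    elif elem.tag in global_rules:
        value = global_rules[elem.tag]
    else:
        value = inherited
    out.append((elem.tag, value))
    for child in elem:
        resolve_translate(child, global_rules, value, out)
    return out

root = ET.fromstring(SAMPLE)
# Assumed global rule: head elements are not translatable.
result = resolve_translate(root, {"head": "no"})
print(result)
```

The revision element inherits "no" from head, while title and hi carry their own local values regardless of what their ancestors say.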
For some data categories, special attributes add or point to information about the
selected nodes. For example, the Localization Note
data category can add information to selected nodes (using a
The data category overview table, in
The functionalities of adding information and pointing to existing information are
The keywords “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL
NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are
to be interpreted as described in
The namespace URI that MUST be used by implementations of this specification is:
The namespace prefix used in this specification for XML
implementations of ITS for the above URI is
In addition, the following namespaces are used in this document:
http://www.w3.org/2001/XMLSchema for the XML Schema namespace, here
used with the prefix “xs”
http://www.w3.org/1999/xlink for the XLink namespace, here used with
the prefix “xlink”
http://www.w3.org/1999/xhtml for the HTML namespace, here used with
the prefix “h”
For each data category, ITS distinguishes between the following:
The Translate data category conveys information as to whether a piece of content should be translated or not.
The simplest formalization of this prose description on a schema language independent
level is a
The selection of the ITS data categories applies to
textual values contained within element or attribute nodes. In some cases these nodes
form pointers to other resources; a well-known example is the src
attribute on the img element in HTML. The ITS Translate data category applies to the text of the
pointer itself, not the object to which it points. Thus in the following example, the
translation information specified via the
All attributes that have the type anyURI in the normative RELAX NG schema
in
This specification uses the term HTML to refer to HTML5 or its successor
in HTML syntax
This specification uses the term CSS Selectors in the sense of
Selectors as specified in
The usage of the term
This specification defines three types of conformance: conformance of 1) ITS markup declarations, conformance of
2) processing expectations
for ITS Markup and conformance of 3) processing expectations
for ITS Markup in HTML. Also, a special conformance class is defined for using ITS markup in HTML5 documents, HTML5+ITS,
which serves as an
head element in
Full implementations of this conformance type will implement all markup declarations for ITS. Statements related to this conformance type MUST list all markup declarations they implement.
Application-specific processing (that is processing that goes beyond the computation of ITS information for a node) such as automated filtering of translatable content based on the Translate data category is not covered by the conformance clauses below.
The conformance clause 2-4 essentially means that the conversion to NIF is an optional feature of ITS 2.0, and that the conversion is independent of whether ITS information has been made available via the global or local selection mechanisms, see conformance clause 2-1-1.
Statements related to this conformance type MUST list all data categories they implement, and for each data category which type of selection they support, whether they support processing of XML. If the implementation provides the conversion to NIF (see conformance clause 2-4), this MUST be stated.
The above conformance clauses are directly reflected in the ITS 2.0 test suite. All tests specify which data category is processed (clause 2-1); they are relevant for (clause 2-1-1) global or local selection, or both; they require the processing of defaults and precedence of selections (clauses 2-1-2 and 2-1-3); for each data category there are tests with linked rules (2-2); and all types of tests are given for XML (clause 2-3). In addition, there are test cases for conversion to NIF (clause 2-4). Implementors are encouraged to organize their documentation in a similar way, so that users of ITS 2.0 can easily understand the processing capabilities available.
Application-specific processing (that is processing that goes beyond the computation of ITS information for a node) such as automated filtering of translatable content based on the Translate data category is not covered by the conformance clauses below.
link elements which have a rel attribute with the value
its-rules.
Statements related to this conformance type MUST list all data categories they implement, and for each data category which type of selection they support.
Conforming HTML5+ITS documents are those that comply with all the conformance criteria
for documents as defined in
Additional definitions about processing of HTML are given in
The version of the ITS schema defined in this specification is
If there is no its:version) MUST be provided on the element where the ITS markup is
used, or on one of its ancestors.
If there is no its:version) MUST be provided on one of its ancestors.
There MUST NOT be two different versions of ITS in the same document.
External, linked rules can have different versions than internal rules.
ITS data categories can appear in two places:
The two locations are described in detail below.
Global, rule-based selection is implemented using the
If there is more than one
Depending on the data category and its usage, there are
additional attributes for adding information to the selected nodes, or for pointing to
existing information in the document. For example, the Localization Note data category can be used for adding notes to selected
nodes, or for pointing to existing notes in the document. For the former purpose, a
The data category overview table, in
The functionalities of adding information and pointing to existing information are
Global rules can appear in the XML document they will be applied to, or in a separate
XML document. The precedence of their processing depends on these variations. See also
Local selection in XML documents is realized with ITS
local attributes or the
The data category determines what is being selected. The necessary data category
specific defaults are described in
By default the content of all elements in a document is translatable. The attribute
its:translate="no" in the head element means that the
content of this element, including child elements, should not be translated. The
attribute its:translate="yes" in the title element means
that the content of this element, should be translated (overriding the
its:translate="no" in head). Attribute values of the
selected elements or their children are not affected by local
The default directionality of a document is left-to-right. The
its:dir="rtl" in the quote element means that the
directionality of the content of this element, including child elements and
attributes, is right-to-left. Note that xml:lang indicates only the
language, not the directionality.
The its-
attributes. The definition of the two attributes in HTML is compatible with, that is, it
provides the same values and interpretation as, the definition for the two data
categories Translate and Directionality.
Rule elements have attributes which contain
absolute and relative selectors. Interpretation of these selectors depends on the
actual query language. The query language is set by
XPath 1.0 is identified by xpath value in
The absolute selector MUST be an XPath expression
which starts with "/". That is, it must be an
AbsoluteLocationPath or union of
AbsoluteLocationPaths as described in XPath 1.0.
This ensures that the selection is not relative to a specific location. The
resulting nodes MUST be either element or
attribute nodes.
Context for evaluation of the XPath expression is as follows:
Context node is set to Root Node.
Both context position and context size are 1.
All variables defined by
All functions defined in the XPath Core Function Library are available. It is an error for an expression to include a call to any other function.
The set of namespace declarations are those in scope on the element which has
the attribute in which the expression occurs. This includes the implicit
declaration of the prefix xml required by the XML Namespaces Recommendation; the default namespace (as declared by
xmlns) is not part of this set.
The term element from the TEI is in a namespace
http://www.tei-c.org/ns/1.0.
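The effect of the namespace context can be illustrated with Python's xml.etree.ElementTree, whose path subset behaves like XPath 1.0 in this respect: an unprefixed name does not match an element in a default namespace, so a prefix has to be bound explicitly. The sample TEI fragment and the prefix tei are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI fragment that declares TEI as the default namespace.
SAMPLE = f"""
<TEI xmlns="{TEI_NS}">
  <text><body>
    <p>A <term>discourse marker</term> links clauses.</p>
  </body></text>
</TEI>
"""

root = ET.fromstring(SAMPLE)

# The prefix used in the selector must be bound to the TEI namespace.
prefixes = {"tei": TEI_NS}
with_prefix = root.findall(".//tei:term", prefixes)

# An unprefixed name only matches elements in no namespace.
without_prefix = root.findall(".//term")

print([t.text for t in with_prefix], without_prefix)
```

The second query returns an empty list, which is why a selector for TEI content needs a namespace declaration for its prefix in scope.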
The
The relative selector MUST use a RelativeLocationPath or an AbsoluteLocationPath as described in XPath 1.0. The XPath expression is evaluated relative to the nodes selected by the selector attribute.
The following attributes point to existing
information:
Context for evaluation of the XPath expression is the same as for the absolute selector, with the following changes:
Nodes selected by the expression in the
Context node comes from the current node list.
The context position comes from the position of the current node in the current node list; the first position is 1.
The context size comes from the size of the current node list.
The term CSS Selectors is used throughout the specification in the
sense of Selectors as specified in
The working group will not provide a CSS Selectors based implementation; nevertheless there are several existing libraries, which can translate CSS Selectors to XPath, so that XPath selectors based implementations can be used.
CSS selectors have no ability to point to attributes.
CSS Selectors are identified by css value in
The absolute selector MUST be interpreted as a selector
as defined in
The relative selector MUST be interpreted as a selector
as defined in
ITS processors MAY support additional query languages. For each additional query language the processor MUST define:
Because future versions of this specification are likely to define additional query
languages, the following query language identifiers are reserved: xpath,
css, xpath2, xpath3, xquery,
xquery3, xslt2, xslt3.
A
Implementation MUST support the
The
The $LCID
variable. In this case, only the msg element with the attribute
lcid set to
In XSLT-based applications, it may make sense to map ITS parameters directly to XSLT parameters. To avoid naming conflicts one can use a prefix with the parameter name's value to distinguish between the ITS parameters and the XSLT parameters.
One way to associate a document with a set of external ITS rules is to use the optional
XLink
The rules contained in the referenced document MUST
be processed as if they were at the top of the
The example demonstrates how metadata can be added to ITS rules.
The result of processing the two documents above is the same as processing the following document.
Like
Applications processing global ITS markup MUST
recognize the XLink
External rules may also have links to other external rules (see
The following precedence order is defined for selections of ITS information in various positions (the first item in the list has the highest precedence):
Global selections in documents (using a
Inside each
ITS does not define precedence related to rules defined or linked based on non-ITS mechanisms (such as processing instructions for linking rules).
Inheritance for element nodes. Selection via inheritance takes precedence over default values, see below item.
In case of conflicts between global selections via multiple rules elements, the last rule has higher precedence.
The precedence order fulfills the same purpose as the built-in template rules of
The two elements title and author of this document should
be treated as separate content when inside a prolog element, but as part
of the content of their parent element otherwise. In order to make this distinction
two
The first rule specifies that title and author in general
should be treated as an element within text. This overrides the default.
The second rule indicates that when title or author are
found in a prolog element their content should be treated separately.
This is normally the default, but the rule is needed to override the first rule.
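The "last rule wins" behaviour of the two rules can be sketched as follows. The sample document is hypothetical, each rule is modelled as a list of ElementTree paths standing in for an XPath union, and the value "no" is assumed as the default for elements that no rule selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with title and author in two different contexts.
SAMPLE = """
<doc>
  <prolog>
    <title>Guide</title>
    <author>Jane Doe</author>
  </prolog>
  <body>
    <p>See <title>Guide</title> by <author>Jane Doe</author>.</p>
  </body>
</doc>
"""

# Each rule: (list of paths standing in for an XPath union, withinText value).
RULES = [
    ([".//title", ".//author"], "yes"),
    ([".//prolog/title", ".//prolog/author"], "no"),
]

def within_text(root, rules, default="no"):
    """Apply rules in document order so that later rules override earlier ones."""
    selected = {}
    for paths, value in rules:
        for path in paths:
            for element in root.findall(path):
                selected[element] = value
    return [
        (element.tag, selected.get(element, default))
        for element in root.iter()
        if element.tag in ("title", "author")
    ]

root = ET.fromstring(SAMPLE)
values = within_text(root, RULES)
print(values)
```

Swapping the two rules would make the general rule win everywhere, which is why the more specific rule has to come last.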
Some markup schemes provide markup which can be used to express ITS data categories.
ITS data categories can be associated with such existing markup, using the global
selection mechanism described in
Associating existing markup with ITS data categories can be done only if the processing
expectations of the host markup are the same as, or greater than, those of ITS. For
example, the
In this example, there is an existing translate attribute in DITA, and
it is associated with the ITS semantics using the its:rules section. Similarly, the
DITA dt and term elements are associated with the ITS Terminology data category.
Global rules can be associated with a given XML document using different means:
By using an
This section defines an algorithm to convert XML or HTML documents (or their DOM
representations) that contain ITS metadata to the RDF-based format based on
The algorithm is intended to extract the text from the XML/HTML/DOM for an NLP
tool. It can produce a lot of phantom
predicates from excessive
whitespace, which 1) increases the size of the intermediate mapping and 2) extracts
this whitespace as text, and therefore might decrease NLP performance. It is strongly recommended to
normalize whitespace in the input XML/HTML/DOM in order to minimize such phantom
predicates. A normalized example is given below. Since the whitespace normalization
algorithm itself is format dependent, for example, it differs for HTML compared to general
XML, no normative algorithm for whitespace normalization is given as part of
this specification.
The output of the algorithm shown below uses the ITS RDF ontology
The conversion algorithm to generate NIF consists of seven steps.
STEP 1: Get an ordered list of all text nodes of the document.
STEP 2: Generate an XPath expression for each non-empty text node of all leaf elements and memorize them.
STEP 3: Get the text for each node and make a tuple with the XPath expressions (X,T). Since the text nodes have a certain order we now have a list of ordered tuples ((x0,t0), (x1,t1), ..., (xn,tn)).
STEP 4 (optional): Serialize as XML or as RDF.
The list with the XPath-to-text mapping can also be kept in memory. Part of a
serialization example is given below. Note that the example consists of both an RDF part
and an XML part (the mappings element).
where
Example (continued)
STEP 5: Create a context URI and attach the whole concatenated text of the document as reference.
STEP 6: Attach any ITS metadata annotations from the XML/HTML/DOM input to the respective NIF URIs.
STEP 7: Omit all URIs that do not carry annotations (they will just bloat the data).
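STEPs 1 to 3 and the offset bookkeeping behind STEP 5 can be sketched as follows. This is a minimal, non-normative illustration. It assumes an XML input without namespaces, comments or processing instructions. The function names, the sample document, the base URI and the #char=begin,end fragment form are assumptions made for this sketch, not definitions from this specification.

```python
import xml.etree.ElementTree as ET


def text_nodes_with_xpath(root):
    """STEPs 1-3: ordered (xpath, text) tuples for every non-empty text node."""
    pairs = []

    def walk(elem, path):
        text_index = 0     # position among the text-node children of `elem`
        child_counts = {}  # per-tag position among the element children

        def emit(text):
            nonlocal text_index
            if text:  # ElementTree uses None when there is no text node
                text_index += 1
                pairs.append((f"{path}/text()[{text_index}]", text))

        emit(elem.text)
        for child in elem:
            child_counts[child.tag] = child_counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{child_counts[child.tag]}]")
            emit(child.tail)  # text that follows the child inside `elem`

    walk(root, f"/{root.tag}[1]")
    return pairs


def nif_offsets(pairs, base_uri):
    """Concatenate the text (the STEP 5 context) and derive one URI per node."""
    context = "".join(text for _, text in pairs)
    spans, begin = [], 0
    for xpath, text in pairs:
        end = begin + len(text)
        # Assumed URI form: character offsets into the concatenated text.
        spans.append((xpath, f"{base_uri}#char={begin},{end}"))
        begin = end
    return context, spans


doc = ET.fromstring("<doc><p>Welcome to <b>Dublin</b>!</p></doc>")
pairs = text_nodes_with_xpath(doc)
context, spans = nif_offsets(pairs, "http://example.com/doc.xml")
for xpath, uri in spans:
    print(xpath, uri)
```

Whitespace-only text nodes are deliberately kept in this sketch, which is exactly how the phantom predicates mentioned earlier arise when the input is not normalized.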
A complete sample output in RDF/XML format after step 7, given the input document
The conversion to NIF is a possible basis for a natural language processing (NLP) application that creates, for example, named entity annotations. A non-normative algorithm to integrate these annotations into the original input document is given in
In some cases, it may be important for instances of data categories to be associated
with information about the processor that generated them. For example, the score of the
MT Confidence data category (provided via the
ITS 2.0 provides a mechanism to associate such processor information with the use of individual data categories in a document, independently from data category annotations themselves.
The attribute
Three cases of providing tool information can be expected:
information about tools used for creating or modifying the textual content;
information about tools that do 1), but also create ITS annotations, see
information about tools that don’t modify or create content, but just create ITS annotations.
An example of case 2) is an MT engine that modifies content and creates ITS MT Confidence annotations. Here, several tools may be involved in creating MT Confidence annotations: the MT engine and the tool inserting the markup. The annotatorsRef attribute should identify the tool most useful in further processes, in this case the MT engine.
The value of | VERTICAL LINE (U+007C).
The data category identifier MUST be one of the identifiers specified in the data category overview table.
The IRI indicates information about the processor used to generate the data category annotation. No single means is specified for how this IRI should be used to indicate processor information. Possible mechanisms are: to encode information directly in the IRI, e.g. as parameters; to reference an external resource that provides such information, e.g. an XML file or an RDF declaration; or to reference another part of the document that provides such information.
In HTML documents, the mechanism is implemented with the
The attribute applies to the content of the element where it is declared (including its child elements) and to the attributes of that element.
On any given node, the information provided by this mechanism is a space-separated list
of the accumulated references found in the
In this example, the text shows the computed tools reference information for the given node. Note that the references are ordered alphabetically and that the IRI values are always the ones of the inner-most declaration.
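The accumulation described above can be sketched as follows. This is a minimal illustration, assuming that each reference has the form identifier|IRI (a data category identifier, a VERTICAL LINE and an IRI) and that declarations are supplied from the outermost ancestor down to the node itself. The function names, the identifiers and the IRIs are illustrative only.

```python
def parse_annotators_ref(value):
    """Split one attribute value into {data category identifier: IRI}."""
    references = {}
    for token in value.split():
        identifier, separator, iri = token.partition("|")  # U+007C
        if not (separator and identifier and iri):
            raise ValueError(f"malformed reference: {token!r}")
        references[identifier] = iri
    return references


def computed_annotators_ref(declarations):
    """Accumulate declarations listed from outermost ancestor to the node.

    Later (inner) declarations replace the IRI of an identifier that was
    already declared further out; the result is ordered alphabetically.
    """
    merged = {}
    for value in declarations:
        merged.update(parse_annotators_ref(value))
    return " ".join(f"{name}|{iri}" for name, iri in sorted(merged.items()))


outer = "terminology|http://example.com/term-tool mt-confidence|file:///tools.xml#T1"
inner = "mt-confidence|file:///tools.xml#T2"
print(computed_annotators_ref([outer, inner]))
```

A dictionary keyed by identifier is used so that the inner-most declaration wins per data category rather than replacing the whole accumulated list.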
The p elements are found in element with id="T1"
in the external document tools.xml, while that information for the third
p element is found in the element with id="T2" in the same
document. In addition,
The span elements come from one MT (French to English) engine,
while the annotation on the third comes from another (Italian to English) engine. Both
Please note that the term HTML refers to HTML5 or its successor in
HTML syntax
All data categories defined in
The above mentioned data categories are excluded because HTML has native markup for them.
In HTML, data categories are implemented as attributes. The name of the HTML attribute is
derived from the name of the attribute defined in the local implementation by using the
following rules:
The attribute name is prefixed with its-.
Each uppercase letter in the attribute name is replaced by - (U+002D) followed by a lowercase variant of the letter.
Values of attributes which correspond to data categories with a predefined set of values MUST be matched ASCII-case-insensitively.
The case of attribute names is also irrelevant given the nature of HTML syntax. So in
HTML, the Terminology data category can be stored as ITS-TERM, its-Term etc. All those attributes are treated
as equivalent and will get normalized upon DOM construction.
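The naming and case rules can be illustrated with a short sketch. It assumes the derivation works as follows: the name is prefixed with its- and every uppercase letter is replaced by a hyphen followed by its lowercase form. The function names are hypothetical, and the XML attribute names passed in are only sample inputs.

```python
import re
import string

# Translation table that lowercases ASCII letters only (A-Z), leaving any
# non-ASCII character untouched, as ASCII-case-insensitive matching requires.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def its_html_attribute_name(xml_local_name):
    """Derive the HTML attribute name from the XML local attribute name."""
    hyphenated = re.sub(
        r"[A-Z]", lambda match: "-" + match.group(0).lower(), xml_local_name
    )
    return "its-" + hyphenated


def ascii_case_insensitive_equal(left, right):
    """Compare two names or values ASCII-case-insensitively."""
    return left.translate(_ASCII_LOWER) == right.translate(_ASCII_LOWER)


for name in ("term", "withinText", "annotatorsRef"):
    print(name, "->", its_html_attribute_name(name))
```

The same comparison helper can serve both for attribute names written in mixed case and for values drawn from a predefined set.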
Values of attributes which correspond to data categories which use the XML Schema double
data type
MUST also be valid floating-point numbers as defined in
Various aspects for global rules in general, external global rules or inline global
rules need to be taken into account. An example of an HTML5 document using global rules
is
By default, XPath 1.0 will be used for selection in global rules. If users prefer
an easier selection mechanism, they can switch the query language to CSS selectors by using
the
The HTML5 parsing algorithm automatically puts all HTML elements into the XHTML namespace
(http://www.w3.org/1999/xhtml). Selectors used in global rules must
take this into account.
A link to external global rules is specified in a
link element, with the link relation
its-rules.
Using XPath in global rules linked from HTML documents does not create an additional burden to implementers. Parsing HTML content produces a DOM tree that can be directly queried using XPath, functionality supported by all major browsers.
Inline global rules MUST be specified inside a script element which has a type
attribute with the value application/its+xml. The script
element itself SHOULD be a child of the head
element. Comments MUST NOT be used inside global rules.
Each script element MUST NOT contain more than
one
It is preferable to use external global rules
linked using the link element rather than to have global rules embedded in the
document.
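A consumer could locate both kinds of global rules in an HTML document along the following lines. This is a non-authoritative sketch built on Python's standard html.parser module. The class name, the sample page and the file name rules.xml are assumptions made for illustration. The sample rule selects code elements through a prefix bound to the XHTML namespace, in line with the namespace remark above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

ITS_RULES_TYPE = "application/its+xml"


class ItsRulesCollector(HTMLParser):
    """Collect linked (external) and embedded (inline) global rules."""

    def __init__(self):
        super().__init__()
        self.external = []   # href values of link elements with rel its-rules
        self.inline = []     # raw XML found in matching script elements
        self._buffer = None  # list of text chunks while inside such a script

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "link":
            relations = (attributes.get("rel") or "").lower().split()
            if "its-rules" in relations and attributes.get("href"):
                self.external.append(attributes["href"])
        elif tag == "script":
            script_type = (attributes.get("type") or "").strip().lower()
            if script_type == ITS_RULES_TYPE:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.inline.append("".join(self._buffer).strip())
            self._buffer = None


page = """<!DOCTYPE html>
<html><head>
<link href="rules.xml" rel="its-rules">
<script type="application/its+xml">
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:h="http://www.w3.org/1999/xhtml" version="2.0">
  <its:translateRule selector="//h:code" translate="no"/>
</its:rules>
</script>
</head><body><p>Use <code>printf</code>.</p></body></html>"""

collector = ItsRulesCollector()
collector.feed(page)
collector.close()
rules = ET.fromstring(collector.inline[0])  # inline rules are plain XML
print(collector.external, rules.tag)
```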
The constraints for Provenance standoff markup in HTML and Localization quality issues markup in HTML MUST be followed.
The following precedence order is defined for selections of ITS information in various positions of an HTML document (the first item in the list has the highest precedence):
Global selections in documents (using the mechanism of external global rules or inline global rules), to be processed in
document order, see
ITS does not define precedence related to rules defined or linked based on non-ITS mechanisms (such as processing instructions for linking rules).
Inheritance for element nodes. Selection via inheritance takes precedence over default values; see the item below.
In case of conflicts between global selections via multiple rules elements, the last rule has higher precedence.
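The layers visible in the list above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the function name and parameters are hypothetical, and the sketch models just the global, inheritance and default layers as they appear in this text. Any item of the list that is not reproduced here is not modeled.

```python
def resolve_value(global_matches, inherited=None, default=None):
    """Pick the effective value for one node and one data category.

    global_matches: values from the global rules that select the node,
                    listed in document order (the last one wins).
    inherited:      value inherited from an ancestor, or None.
    default:        default value of the data category, or None.
    """
    if global_matches:
        return global_matches[-1]
    if inherited is not None:
        return inherited
    return default


print(resolve_value(["yes", "no"], inherited="yes", default="yes"))
print(resolve_value([], inherited="no", default="yes"))
print(resolve_value([], default="yes"))
```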
The forehand mentioned code element with the code
elements as untranslatable.
XHTML documents aimed at public consumption by Web browsers, including HTML5 documents in
XHTML syntax, SHOULD use the syntax described in
This example illustrates the use of ITS 2.0 local markup in XHTML.
Please note that this section defines how to use ITS in XHTML content which is directly served to Web browsers. Such XHTML is very often sent with the wrong media type and parsed by Web browsers as HTML rather than as XML. In such cases it is more robust and safer to use HTML-like syntax for ITS metadata.
However, when XHTML is used not as a delivery format but rather as an exchange or storage format, all XML features can be used in XHTML, and it is advised to use XML syntax for ITS metadata.
The following table summarizes for each data category which selection, default value,
and inheritance and overriding behavior applies. It also provides data category
identifiers used in
For ITS data categories with inheritance, the
information conveyed by the data category can be overridden. For example, a local
An ITS application is free to decide what pieces of content it uses. For example:
term element. The information pertains only to the content of the
element, since there is no inheritance for Terminology. Nevertheless an ITS application can make use of the complete
element, e.g. including attribute nodes etc.
p element. An application can make use of the complete p
element, including child nodes and attributes nodes. The application is also free to
make use just of the string value of p. Nevertheless the id provided
via ID value pertains only to the p
element. It cannot be used to identify nested elements or attributes.
source element have the ITS information that their translation is
available in a target element; see target pointer. E.g., the translation of a
span element nested in source is not available in a
specific target element. Nevertheless, an application is free to use
the complete content of source, including span, and e.g.
present it to a translator.
In this example, the content of all the data elements is translatable and none of the attributes are translatable, because the default for the Translate data category in elements is its:translate="no" attribute. The content of the revision, profile, reviser and locNote elements is not translatable. This is because the default is overridden by the same its:translate="no" that these elements inherit from the local ITS markup in the prolog element. The exception is the field element where the second type is not translatable because the global rule takes precedence over the default value.
The localization note for the first two data elements is the text defined globally with the data element by the local
The data categories differ with respect to defaults. This is due to existing standards and practices. It is common practice, for example, that information about translation refers only to the textual content of an element. Thus, the default selection for the Translate data category is the textual content.
The Translate data category expresses information
about whether the content of an element or attribute should be translated or not. The
values of this data category are
The Translate data category can be expressed with
global rules, or locally on an individual element. Handling of inheritance and interaction between elements and attributes is different for XML content versus
For XML: for elements, the data category
information inherits to the textual content of
the element,
The interpretation of the
As of writing, the default in alt or title.
GLOBAL: The
The code
must not be translated.
LOCAL: The following local markup is available for the Translate data category:
In translate attribute MUST be used to express
the Translate data category.
It is not possible to override the Translate
data category settings of attributes using local markup. This limitation is
consistent with the advised practice of not using translatable attributes. If
attributes need to be translatable (e.g., an HTML alt attribute), then
this must be declared globally.
The local its:translate="no" specifies that the content of
panelmsg must not be translated.
The local translate="no" attribute specifies that the content of
span must not be translated.
The Localization Note data category is used to communicate notes to localizers about a particular item of content.
This data category can be used for several purposes, including, but not limited to:
enabled in isolation without knowing the gender, number and case of the thing it refers to.)
Two types of informative notes are needed:
Editing tools may offer an easy way to create this type of information. Translation tools can be made to recognize the difference between these two types of localization notes, and present the information to translators in different ways.
The Localization Note data category can be
expressed with global rules, or locally on an individual element. For elements, the
data category information inherits to the textual
content of the element,
GLOBAL: The
Exactly one of the following:
The
The
The
The
LOCAL: The following local markup is available for the Localization Note data category:
One of the following:
An optional
It is generally recommended to avoid using attributes to store text. However, in this specific case, the need to provide the notes without interfering with the structure of the host document outweighs the drawbacks of using an attribute.
The Terminology data category is used to mark terms and optionally associate them with information, such as definitions. This helps to increase consistency across different parts of the documentation. It is also helpful for translation.
Existing terminology standards such as
The Terminology data category can be expressed with global rules, or locally on an individual element. There is no inheritance. The default is that neither elements nor attributes are terms.
GLOBAL: The
None or exactly one of the following:
LOCAL: The following local markup is available for the Terminology data category:
A
An optional
An optional
Any node selected by the terminology data category with the
At the time of writing, enhancements
are being discussed in the context of HTML5 that are expected to change the approach
to marking up Directionality, in particular to
support content whose directionality needs to be isolated from that of surrounding
content. However, these enhancements are not finalized yet. This section therefore
reflects directionality markup in
The Directionality data category allows the user
to specify the base writing direction of blocks, embeddings and overrides for the
Unicode bidirectional algorithm. It has four values:
ITS defines only the values of the Directionality data category and their inheritance. The behavior of text labeled in this way may vary, according to the implementation. Implementers are encouraged, however, to model the behavior on that described in the CSS 2.1 specification or its successor. In such a case, the effect of the data category's values would correspond to the following CSS rules:
Data category value: ltr
CSS rule: *[dir="ltr"] { unicode-bidi: embed; direction: ltr}
Data category value: rtl
CSS rule: *[dir="rtl"] { unicode-bidi: embed; direction: rtl}
Data category value: lro
CSS rule: *[dir="lro"] { unicode-bidi: bidi-override; direction: ltr}
Data category value: rlo
CSS rule: *[dir="rlo"] { unicode-bidi: bidi-override; direction: rtl}
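An implementation that follows the CSS-based model could keep the correspondence in a small lookup table. The sketch below is illustrative only. The value names are read off the attribute selectors in the CSS rules above, and the table and function names are hypothetical.

```python
# Directionality value -> CSS declarations, mirroring the rules listed above.
DIRECTIONALITY_CSS = {
    "ltr": {"unicode-bidi": "embed", "direction": "ltr"},
    "rtl": {"unicode-bidi": "embed", "direction": "rtl"},
    "lro": {"unicode-bidi": "bidi-override", "direction": "ltr"},
    "rlo": {"unicode-bidi": "bidi-override", "direction": "rtl"},
}


def css_rule_for(value):
    """Render the CSS rule that corresponds to one Directionality value."""
    declarations = DIRECTIONALITY_CSS[value]  # KeyError for unknown values
    body = "; ".join(f"{prop}: {val}" for prop, val in declarations.items())
    return f'*[dir="{value}"] {{ {body} }}'


for value in DIRECTIONALITY_CSS:
    print(css_rule_for(value))
```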
More information about how to use this data category is provided by
The Directionality data category can be expressed
with global rules, or locally on an individual element. For elements, the data
category information inherits to the textual
content of the element,
GLOBAL: The
In this document the right-to-left directionality is marked using a
direction attribute with a value
The direction="rtlText" have right-to-left content, except that bdo
elements with that attribute have right-to-left override content.
LOCAL: The following local markup is available for the Directionality data category:
dir attribute, so these values are not
used for HTML documents. HTML uses an inline bdo element
instead.
On the first quote element, the its:dir="rtl" attribute
indicates right-to-left content.
The element xml:lang in XML, and lang in HTML.
The
The following p elements (including attribute values and textual content of child
elements) are in the language indicated by the mylang attribute, which is
attached to the p elements, and expresses language using values
conformant to
The Language Information data category
only provides for rules to be expressed at a global level. Locally users are able to
use xml:lang (which is defined by XML), or lang in HTML,
or an attribute specific to the format in question (as in
In XML, xml:lang is the preferred means of language identification. To
ease the usage of xml:lang, a declaration for this attribute is part of
the non-normative XML DTD and XML Schema document for ITS markup declarations. There
is no declaration of xml:lang in the non-normative RELAX NG document
for ITS, since in RELAX NG it is not necessary to declare attributes from the XML
namespace.
Applying the Language Information data
category to xml:lang attributes using global rules is not necessary,
since xml:lang is the standard way to specify language information in
In HTML lang is the mandated means of language identification.
The Language Information data category can
be expressed only with global rules. For elements, the data category information inherits to the textual content of the element,
GLOBAL: The
xml:lang is present or
lang in HTML for the selected node, the value of the
xml:lang attribute or lang in HTML MUST take precedence over the
The Elements Within Text data category reveals if and how an element affects the way text content behaves from a linguistic viewpoint. This information is relevant, for example, for providing basic text segmentation hints for tools such as translation memory systems. The values associated with this data category are:
strong in
<strong>Appaloosa horses</strong> have spotted
coats.
fn in
Palouse horses<fn>A Palouse horse is the same as an
Appaloosa.</fn> have spotted coats.
p
when inside the element li in DITA or XHTML:
<li>Palouse horses: <p>They have spotted coats.</p>
<p>They have been bred by the Nez Perce.</p> </li>
The Elements Within Text data category can
be expressed with global rules, or locally on an individual element. There is no
inheritance. The default is that elements are not within text. For HTML5 phrasing content the
default is withinText="yes".
GLOBAL: The
LOCAL: The following local markup is available for the Elements Within Text data category:
The Domain data category is used to identify the topic or subject of given content. Such information allows more relevant linguistic choices to be made during various processes.
Examples of usage include:
This data category addresses various challenges:
meta element). The Domain data
category provides a mechanism to point to this information.
The Domain data category can be expressed only with
global rules. For elements, the data category information inherits to the textual content of the element,
The information provided by this data category is a comma-separated list of one or more values which is obtained by applying the following algorithm:
STEP 1: Set the initial value of the resulting string as an empty string.
STEP 2: Get the list of nodes resulting from the evaluation of the
STEP 3: For each node:
STEP 3-1: If the node value contains a COMMA (U+002C):
STEP 3-1-1: Split the node value into separate strings using the COMMA (U+002C) as separator.
STEP 3-1-2: For each string:
STEP 3-1-2-1: Trim the leading and trailing white spaces of the string.
STEP 3-1-2-2: If the first character of the value is an APOSTROPHE (U+0027) or a QUOTATION MARK (U+0022): Remove it.
STEP 3-1-2-3: If the last character of the value is an APOSTROPHE (U+0027) or a QUOTATION MARK (U+0022): Remove it.
STEP 3-1-2-4: If the value is empty: Go to STEP 3-1-2.
STEP 3-1-2-5: Check the domainMapping attribute to
see if there is a mapping set for the string:
STEP 3-1-2-5-1: If a mapping is found: Add the corresponding value to the result string.
STEP 3-1-2-5-2: Else (if no mapping is found): Add the string to the result string.
STEP 3-2: Else (if the node value does not contain a COMMA (U+002C)):
STEP 3-2-1: Trim the leading and trailing white spaces of the string.
STEP 3-2-2: If the first character of the value is an APOSTROPHE (U+0027) or a QUOTATION MARK (U+0022): Remove it.
STEP 3-2-3: If the last character of the value is an APOSTROPHE (U+0027) or a QUOTATION MARK (U+0022): Remove it.
STEP 3-2-4: If the value is empty: Go to STEP 3.
STEP 3-2-5: Check if there is a mapping for the string:
STEP 3-2-5-1: If a mapping is found: Add the corresponding value to the result string.
STEP 3-2-5-2: Else (if no mapping is found): Add the string (in its original cases) to the result string.
STEP 4: Remove duplicated values from the resulting string.
STEP 5: Return the resulting string.
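The steps above can be sketched in a few lines. This is a minimal illustration, not a normative implementation. It assumes that the node values from STEP 2 have already been collected as strings, that the mapping has already been parsed into a dictionary with exact-match keys, and that the result is joined with a comma followed by a space. The function name and the sample values are hypothetical.

```python
def domain_values(node_values, mapping=None):
    """Apply STEPs 1-5 to the string values of the selected nodes."""
    mapping = mapping or {}
    result = []                                 # STEP 1 (kept as a list)
    for node_value in node_values:              # STEP 3
        # str.split returns the whole string when there is no comma, so
        # the STEP 3-1 and STEP 3-2 branches share the same code path.
        for item in node_value.split(","):
            item = item.strip()                 # trim white space
            if item[:1] in ("'", '"'):          # leading quote character
                item = item[1:]
            if item[-1:] in ("'", '"'):         # trailing quote character
                item = item[:-1]
            if not item:                        # skip empty values
                continue
            item = mapping.get(item, item)      # mapped value, else as is
            if item not in result:              # STEP 4: drop duplicates
                result.append(item)
    return ", ".join(result)                    # STEP 5


values = ["automotive, 'criminal law', Legal", "automotive"]
mapping = {"criminal law": "law", "Legal": "law"}
print(domain_values(values, mapping))
```

Duplicates are dropped while the list is built, which yields the same result as removing them in a separate pass and keeps the first occurrence of each value in place.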
GLOBAL: The
Although the keywords or
dcterms.subject in Web pages or other types of content.
Values used in the http://example.com/domains/automotive. The
Although the focus of ITS 2.0, and of some of the usage scenarios addressed in ITS 2.0 High-level Usage Scenarios, is on “single engine” environments, ITS 2.0 (for example in the context of the Domain data category) can accommodate "workflow/multi engine" scenarios.
Example:
A scenario involves Machine Translation (MT) engines A and B. The domain labels used by engine A follow the naming scheme A_123; those used by engine B follow the naming scheme B_456.
A
Engine A maps 'Legal' to A_4711, Engine B maps 'Legal' to B_42.
Thus, ITS does not encode a process or workflow (like "Use MT engine A with domain A_4711, and use MT engine B with domain B_42"). Rather, it encodes information that can be used in workflows.
The body element is in the domain expressed by the HTML meta
element with the name attribute, value keywords. The
meta element.
The body element is in the domain expressed by associated values. The
meta elements with the name
attribute set to content attributes. The
In HTML, one possible way to express domain information is a meta
element with the name attribute set to
In the area of machine translation (e.g. machine translation systems or systems
harvesting content for machine translation training), there is no agreed upon set of
value sets for domain. Nevertheless it is recommended to use a small set of values
both in source content and within consumer tools, to foster interoperability. If
larger value sets are needed (e.g. detailed terms in the law or medical domain),
mappings to the smaller value set needed for interoperability should be provided. An
example would be a domainMapping="'criminal law' law, 'property law' law, 'contract law'
law".
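A mapping value like the one quoted above could be parsed as follows. This sketch assumes that pairs are separated by commas, that the two parts of a pair are separated by white space, that a part containing spaces is quoted, and that quoted parts contain no commas. The function name is hypothetical, and Python's shlex module is used here purely as a convenient quote-aware tokenizer.

```python
import shlex


def parse_domain_mapping(mapping):
    """Turn a domainMapping value into a {source value: target value} dict."""
    pairs = {}
    for pair in mapping.split(","):
        tokens = shlex.split(pair)  # honours single and double quotes
        if not tokens:
            continue                # tolerate empty entries
        if len(tokens) != 2:
            raise ValueError(f"expected two parts in pair: {pair!r}")
        source, target = tokens
        pairs[source] = target
    return pairs


value = "'criminal law' law, 'property law' law, 'contract law' law"
print(parse_domain_mapping(value))
```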
It is possible to have more than one domain associated with a piece of content. For example, if the consumer tool is a statistical machine translation engine, it could include corpora from all domains available in the source content in training the machine translation engine.
The consumer machine translation engine might choose to ignore the domain and take
a one size fits all approach, or may be selective in which domains to use, based on
the range of content marked with domain. For example, if the content has hundreds of
sentences marked with domain
The Text Analysis data category is used to annotate content with lexical or conceptual information for the purpose of contextual disambiguation. This information can be provided by so-called text analysis software agents such as named entity recognizers, lexical concept disambiguators, etc., and is represented by either string values or IRI references to possible resource descriptions. Example: A named entity recognizer provides the information that the string "Dublin" in a certain context denotes a town in Ireland.
While text analysis can be done by humans, this data category is targeted more at software agents.
The information can be used for several purposes, including, but not limited to:
The data category provides three pieces of annotation: confidence, entity type or concept class, entity identifier or concept identifier as specified in the following table.
The use case for Text Analysis is distinct from that for the Terminology data category. Text Analysis informs human agents or software agents in cases where either explicit terminology information is not (yet) available, or would not be appropriate, e.g. conceptual information for general vocabulary.
Text Analysis support is achieved by associating a fragment of
text with an external resource that can be interpreted by a
language review agent. The agent may for example use the web
resource to disambiguate the meaning or lexical choice of the
fragment, thereby contributing to its correct translation. The
web resource may as well provide information on appropriate synonyms
and example usage. This is for example the case if the web resource
is WordNet
Extended example: The word 'City' in the fragment 'I am going to the City' may be enhanced by one of the following:
A given document fragment can only be annotated once. When support for multiple annotations is necessary (e.g. when all three of the annotations in the extended example above need to be accommodated), NIF 2.0, TEI Stand-off Markup, or other so-called stand-off annotation mechanisms should be considered.
Some external resources such as DBpedia also provide information for some ontological concepts and named entity definitions in multiple languages, and this facilitates translation even more because a possible link traversal would allow a direct access to foreign language labels for named entities.
The Text Analysis data category can be expressed with global rules, or locally on an individual element. There is no inheritance.
This specification defines a normative way to represent text analysis information in XML and HTML locally. However, text analysis information can also be represented in other formats, e.g. JSON. The Internationalization Tag Set Interest Group maintains a description of such alternative serializations. Readers of this specification are encouraged to evaluate whether that description fulfills their needs and to provide comments in the ITS IG mailing list (public archive).
GLOBAL: The
At least one of the following:
A
Exactly one of the following:
When using identification
mode 1: A
When using identification
mode 2: A
For an example, see
LOCAL: The following local markup is available for the Text Analysis data category:
An optional
At least one of the following:
A
Exactly one of the following:
When using identification
mode 1: A
When using identification
mode 2: A
Any node selected by the
Text Analysis
data category with the
For expressing
Entity type / concept class
information, implementers are encouraged to use an existing
repository of entity types such as the Named Entity Recognition and
Disambiguation
Various target types can be expressed via Entity type / concept class: types of entities, types of lexical concepts, or ontology concepts. While a relationship between these types may exist, this specification does not prescribe a way of automatically inferring one target type from another.
Text Analysis is primarily intended for textual content. Nevertheless, the data category can also be used in multi-media contexts. Example: objects on an image could be annotated with DBpedia IRIs.
When serializing the Text Analysis data category markup in HTML, one option is RDFa Lite or Microdata. This serialization is motivated by the existing search and crawling infrastructure that is able to consume these formats. For other usage scenarios, e.g. adding text annotations to feed into a subsequent terminology process, using ITS Text Analysis data category markup natively is preferred. In this way, the markup can easily be stripped out again later.
See
The Locale Filter data category specifies that a node is only applicable to certain locales.
This data category can be used for several purposes, including, but not limited to:
The Locale Filter data category associates with
each selected node a filter type and a list of extended language ranges conforming to
The list is comma-separated and can include the wildcard extended language range
The type can take the values
A single wildcard
A single wildcard
An empty string with a type
An empty string with a type
Otherwise, with a type
If, instead, the type is
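The matching behind these cases can be sketched as follows. The sketch rests on several assumptions, since the concrete values are not reproduced in this text: the two filter types are taken to be "include" and "exclude", "include" is treated as the default, and ranges are compared with locale tags using the extended filtering scheme of RFC 4647. The function names are hypothetical.

```python
def extended_filter_matches(language_range, tag):
    """Extended filtering of one language range against one language tag."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must be equal unless the range starts with "*".
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":              # wildcard: skip it
            r += 1
        elif t >= len(tag_subtags):              # tag exhausted: no match
            return False
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:           # singleton blocks skipping
            return False
        else:
            t += 1                               # skip one subtag of the tag
    return True


def applies_to_locale(locale, filter_list, filter_type="include"):
    """Decide whether a node with this filter applies to `locale`."""
    if filter_type not in ("include", "exclude"):
        raise ValueError(f"unknown filter type: {filter_type!r}")
    ranges = [r.strip() for r in filter_list.split(",") if r.strip()]
    matched = any(extended_filter_matches(r, locale) for r in ranges)
    return matched if filter_type == "include" else not matched


print(applies_to_locale("fr-CA", "*-CA"))
print(applies_to_locale("en-US", "*-CA"))
print(applies_to_locale("en-US", "*-CA", "exclude"))
```

With this reading, a single wildcard matches every locale and an empty list matches none, so the special cases fall out of the general rule without extra branches.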
The Locale Filter data category can be expressed
with global rules, or locally on an individual element. For elements, the data
category information inherits to the textual
content of the element,
GLOBAL: The
This document contains three legalnotice with a role set to
legalnotice with a role set to
remark elements apply to any locale.
LOCAL: The following local markup is available for the Locale Filter data category:
In this example the Locale Filter data category is used to select different sections depending on whether the locale is a Canadian one or not.
The Provenance data category is used to communicate the identity of agents that have been involved in the translation of the content or the revision of the translated content. This allows translation and translation revision consumers, such as post-editors, translation quality reviewers or localization workflow managers, to assess how the performance of these agents may impact the quality of the translation. Translation and translation revision agents can be identified as a person, a piece of software or an organization that has been involved in providing a translation that resulted in the selected content.
This data category offers three types of information. First, it allows translation agents to be identified. Second, it allows revision agents to be identified. Third, if provenance information is needed that includes temporal or sequence information about translation processes (e.g. multiple revision cycles) or requires agents that support a wider range of activities, the data category offers a mechanism to refer to external provenance information.
The specification does not define the format of external provenance
information, but it is recommended that an open provenance or change logging format
be used, e.g. the W3C provenance data model
Translation or translation revision tools, such as machine translation engines or computer assisted translation tools, may offer an easy way to create this information. Translation tools can then present this information to post-editors or translation workflow managers. Web applications may present such information to consumers of translated documents.
The data category defines seven pieces of information:
The tool related provenance and tool related revision provenance pieces of
information are not meant to express information about tools used for creating ITS
annotations themselves. For this purpose, ITS 2.0 provides a separate mechanism. See
The Provenance data category can be expressed with
global rules, or locally on individual elements. For elements, the data category
information inherits to the textual content of
the element,
GLOBAL: The
A required
A
This example expresses provenance information in a standoff manner using
provenanceRecords elements. The ref attribute, that ref
attribute holds a reference to an associated legalnotice element has
been revised two times. Hence, the related
LOCAL: Using the inline markup to represent the data
category locally is limited to a single occurrence for a given content (e.g. one
cannot have different
The following local markup is available for the Provenance data category:
Either (inline markup): at least one of the following attributes:
A
An
A
A
A
A
A
Or (standoff markup):
A
An element
One or more elements
A
An
A
A
A
A
A
The order of
When the attributes
In HTML the standoff markup MUST either be stored inside a script
element in the same HTML document, or can be linked from any
script element, that element
MUST have a type
attribute with the value application/its+xml. Its id
attribute MUST be set to the same value as the
xml:id attribute of the
The provenance related attributes at the par and
legalnotice elements are used to associate the provenance information
directly with the content of these elements.
In this example several spans of content are associated with provenance information.
The following example shows a document using local standoff markup to encode
provenance information. The p elements delimit the content to mark up.
They hold script elements.
The External Resource data category indicates that a node represents or references potentially translatable data in a resource outside the document. Examples of such resources are external images and audio or video files.
The External Resource data category can be expressed only with global rules. There is no inheritance. There is no default.
GLOBAL: The
The imagedata, audiodata and videodata elements
contain references to external resources. These references are expressed via a
fileref attribute. The
video elements
The two src and
the poster attributes at HTML video elements. These
attributes identify different external resources, and at the same time contain the
references to these resources. For this reason, the
src and poster respectively. The underlying HTML
document is given in
Some formats, such as those designed for localization or for multilingual resources, hold the same content in different languages inside a single document. The Target Pointer data category is used to associate the node of a given source content (i.e. the content to be translated) and the node of its corresponding target content (i.e. the source content translated into a given target language).
This specification makes no provision regarding the presence of the target nodes or their content: A target node may or may not exist and it may or may not have content.
This data category can be used for several purposes, including but not limited to:
Extract the source content to translate and put back the translation at its proper location.
Compare source and target content for quality verification.
Re-use existing translations when localizing the new version of an existing document.
Access aligned bi-lingual content to build memories, or to train machine translation engines.
In general, it is recommended to avoid developing formats where the same
content is stored in different languages in the same document, unless for very
specific use cases. See the best practices Working
with multilingual documents
from
The Target Pointer data category can be expressed only with global rules. There is no inheritance. There is no default.
GLOBAL: The
The source node and the target node may be of different types, but the target node must be able to contain the same content as the source node (e.g. an attribute node cannot be the target node of a source node that is an element with children).
The Id Value data category indicates a value that can be used as unique identifier for a given part of the content.
The recommended way to specify a unique identifier is to use xml:id in XML or
id in HTML (see the best
practice Defining
markup for unique identifiers
from
Providing a unique identifier that is maintained in the original document can be useful for several purposes, for example:
Allow automated alignment between different versions of the source document, or between source and translated documents.
Improve the confidence in leveraged translation for exact matches.
Provide back-tracking information between displayed text and source material when testing or debugging.
The Id Value data category only provides for
rules to be expressed at a global level. Locally, users are able to use
xml:id (which is defined by XML) or id in HTML, or
an attribute specific to the format in question (as in
Applying the Id Value data category to
xml:id (in XML) or id (in HTML) attributes in global
rules is not necessary, since these attributes are the recommended way to
specify an identifier.
The Id Value data category can be expressed only with global rules. There is no inheritance. There is no default.
GLOBAL: The
xml:id is present or
id in HTML for the selected node, the value of the
xml:id attribute or id in HTML MUST take precedence over the
The <text> element is the value of the attribute name of
its parent element.
The <text> and <desc> are translatable, but they have
only one corresponding identifier, the name attribute in their parent
element.
To make sure the identifier is unique for both the content of
<text> and the content of <desc>, the XPath
expression concat(../@name, '_t') gives the identifier
"settingsMissing_t" for the content of <text> and the expression
concat(../@name, '_d') gives the identifier "settingsMissing_d" for
the content of <desc>.
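As a rough illustration of how the two expressions yield distinct identifiers, the following Python sketch emulates concat(../@name, '_t') and concat(../@name, '_d') with the standard library. The sample document, the resources and msg element names, and the message strings are assumptions made up for this sketch; only <text>, <desc>, the name attribute and the value settingsMissing come from the example discussed above. ElementTree does not evaluate XPath functions such as concat(), so the concatenation is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: element names other than <text> and <desc>,
# and all message strings, are invented for illustration only.
SAMPLE = """
<resources>
  <msg name="settingsMissing">
    <text>Can't find settings file.</text>
    <desc>The module cannot find the default settings file.</desc>
  </msg>
</resources>
"""

# Suffix appended to the parent's name attribute, per child element name.
SUFFIXES = {"text": "_t", "desc": "_d"}


def id_values(xml_source):
    """Map each derived identifier to the content it identifies.

    Emulates concat(../@name, '_t') for <text> and
    concat(../@name, '_d') for <desc>.
    """
    root = ET.fromstring(xml_source)
    ids = {}
    for parent in root.iter():
        name = parent.get("name")
        if name is None:
            continue  # only elements carrying a name attribute are relevant
        for child in parent:
            suffix = SUFFIXES.get(child.tag)
            if suffix is not None:
                ids[name + suffix] = child.text
    return ids
```

Calling id_values(SAMPLE) returns one entry keyed "settingsMissing_t" and one keyed "settingsMissing_d", so both pieces of content can be addressed separately even though they share a single name attribute.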
xml:id and When an xml:id attribute is present for a node selected by an
xml:id takes precedence
over the value defined by the <res> element, and
“retryTip” for the second <res> element.
The Preserve Space data category indicates how whitespace should be handled in content. The possible values for this data category are "default" and "preserve" and carry the same meaning as the corresponding values of the xml:space attribute. The default value is "default". The Preserve Space data category does not apply to HTML documents in HTML syntax.
The Preserve Space data category can be expressed
with global rules, or locally using the xml:space attribute. For
elements, the data category information inherits
to the textual content of the element,
The Preserve
Space data category is not applicable to HTML documents in HTML syntax
because xml:space (and by extension Preserve Space) has no effect in documents parsed as text/html. However,
the data category can be used in HTML
GLOBAL: The
The
LOCAL: The xml:space attribute, as defined
in section 2.10 of
The standard xml:space attribute specifies that the whitespace in the
verse element must be treated literally.
The Localization Quality Issue data category is used to express information related to localization quality assessment tasks. Such tasks can be conducted on the translation of some source text into a target language or on the source text itself where its quality may impact on the localization process.
This data category can be used in a number of ways, including the following example scenarios:
An automatic quality checking tool flags a number of potential quality issues in an XML or HTML file and marks them up using ITS 2.0 markup. Other tools in the workflow then examine this markup and decide whether the file needs to be reviewed manually or passed on for further processing without a manual review stage.
A quality assessment process identifies a number of issues and adds the ITS markup to a rendered HTML preview of an XML file along with CSS styling that highlights these issues. The resulting HTML file is then sent back to the translator to assist his or her revision efforts.
A human reviewer working with a web-based tool adds quality markup, including comments and suggestions, to a localized text as part of the review process. A subsequent process examines this markup to ensure that changes were made.
What issues should be considered in quality
assessment tasks depends on the nature of the project and tools used. For more
information on setting translation project specifications and determining quality
expectations, implementers are encouraged to consult
The data category defines five pieces of information:
The Localization Quality Issue data category can be
expressed with global rules, or locally on individual elements. For elements, the data
category information inherits to the textual
content of the element,
GLOBAL: The
A required
Either (in parallel to local inline markup)
At least one of the following attributes:
A
A
An optional
An optional
An optional
Or (standoff markup) exactly one of the following:
A
A
The attribute
The text attribute.
The following example shows a document using local standoff markup to encode
several issues. But because, in this case, the mrk element does not
allow attributes from another namespace we cannot use ref
attribute of any mrk elements that has its attribute type
set to "x-itslq".
LOCAL: Using the inline markup to represent the data category
locally is limited to a single occurrence for a given content (e.g. one cannot have
different
The following local markup is available for the Localization Quality Issue data category:
Either (inline markup):
At least one of the following attributes:
A
A
An optional
An optional
An optional
Or (standoff markup):
A
An element xml:id attribute set to the identifier specified in the
One or more elements
At least one of the following attributes:
A
A
An optional
An optional
An optional
The order of
When the attributes
In HTML the standoff markup MUST either be stored inside a script
element in the same HTML document, or can be linked from any
script element, that element
MUST have a type
attribute with the value application/its+xml. Its id
attribute MUST be set to the same value as the
xml:id attribute of the
The attributes
In this example several spans of content are associated with a quality issue.
The following example shows a document using local standoff markup to encode
several issues. The mrk element delimits the content to mark up and
holds a
The following example shows a document using local standoff markup to encode
several issues. The span element delimits the content to mark up and
holds a span element where the issues are listed within a set of other
special span elements.
The Localization Quality Rating data category is used to express an overall measurement of the localization quality of a document or an item in a document.
This data category can specify a quality score or a voting result for a given item or document, and can indicate what constitutes a passing score or vote. It can also point to a profile describing the quality assessment model used for the scoring or the voting.
The Localization Quality Rating data category is only
expressed locally on individual elements. The data category information inherits to the textual content of the element,
LOCAL: The following local markup is available for the Localization Quality Rating data category:
Exactly one of the following:
A
A
If
an optional
If
an optional
An optional
The
The
The MT Confidence data category is used to communicate the self-reported confidence score from a machine translation engine of the accuracy of a translation it has provided. It is not intended to provide a score that is comparable between machine translation engines and platforms. This data category does NOT aim to establish any sort of correlation between the self-reported confidence score and either human evaluation of MT usefulness, or post-editing cognitive effort. For harmonization’s sake, MT Confidence is provided as a rational number in the interval 0 to 1 (inclusive).
Implementers are expected to interpret the floating point number and present it to human and other consumers in a convenient form, such as percentage (0-100%) with up to 2 decimal digits, font or background color coding, etc.
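One possible presentation, sketched in Python: the score is checked against the interval 0 to 1 and rendered as a percentage with at most two decimal digits. The function name confidence_as_percentage and the choice to drop trailing zeros are assumptions of this sketch, not requirements of the data category.

```python
def confidence_as_percentage(score):
    """Render an MT Confidence score (0 to 1 inclusive) as a percentage string.

    Hypothetical helper: keeps at most two decimal digits and drops
    trailing zeros, so 0.5 becomes "50%" rather than "50.00%".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("MT Confidence must lie in the interval 0 to 1")
    text = f"{score * 100:.2f}".rstrip("0").rstrip(".")
    return text + "%"
```

For instance, confidence_as_percentage(0.8982) gives "89.82%". Other presentations mentioned above, such as color coding, would map the same validated number to a different visual scale.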
This data category can be used for several purposes, including, but not limited to:
Automated prioritising of raw machine translated text for further processing based on empirically set thresholds.
Providing readers, translators, post-editors, reviewers and proof-readers of machine translated text with self-reported relative accuracy prediction.
MT confidence scores can be displayed, for example, on websites that are machine translated on the fly, in simple web-based translation editors, or in Computer Aided Translation (CAT) tools.
The MT Confidence category can be expressed with
global rules or locally on individual elements. For elements, the data category
information is inherited by the textual content
of the element,
Any node selected by the MT Confidence data
category MUST be contained in an element with the
GLOBAL: The
A required
A required
title
attributes of two img elements.
Where the external ITS rules file is as shown:
LOCAL: the following local markup is available for the MT Confidence data category:
A
The Allowed Characters data category is used to specify the characters that are permitted in a given piece of content.
This data category can be used for various purposes, including the following examples:
The Allowed Characters data category is not
intended to disallow HTML markup. The purpose is to restrict the content to various
characters only, e.g., when the content is to be used for URL or filename
generation. In most Content Management Systems, content is divided into several
fields, some of which may be restricted to plain text, while in other fields HTML
fragments may be allowed. Enforcing such restrictions is outside the scope of this
data category. For further information see
The set of characters that are allowed is specified using a regular expression. That is, each character in the selected content MUST be included in the set specified by the regular expression.
The regular expression is the character class construct charClass defined as follows:
[1] charClass ::= singleCharEsc | charClassExpr | wildcardEsc
[2] singleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
[3] charClassExpr ::= '[' charGroup ']'
[4] charGroup ::= posCharGroup | negCharGroup
[5] posCharGroup ::= ( charRange | singleCharEsc )+
[6] charRange ::= seRange | xmlCharIncDash
[7] seRange ::= charOrEsc '-' charOrEsc
[8] charOrEsc ::= xmlChar | singleCharEsc
[9] xmlChar ::= [^\#x2D#x5B#x5D]
[10] xmlCharIncDash ::= [^\#x5B#x5D]
[11] negCharGroup ::= '^' posCharGroup
[12] wildcardEsc ::= '.'
The . metacharacter also matches CARRIAGE RETURN (U+000D) and LINE FEED
(U+000A). That is the
This construct is a sub-set of the Character Classes construct
of XML Schema
Users may want to use a regular expression to make sure that they follow above definition. Sample regular expressions to verify the regular expression in allowed characters are provided: for XML and for Java.
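The per-character rule stated above (each character of the content must belong to the set) can be sketched in Python as follows. This is an approximation under stated assumptions: it relies on Python's re module, whose syntax is not identical to XML Schema regular expressions but agrees with it for the restricted charClass construct defined here. The function name is_allowed is hypothetical.

```python
import re


def is_allowed(content, allowed_characters):
    """Return True if every character of content matches the character class.

    allowed_characters is a charClass expression as defined above.
    re.DOTALL makes '.' match CARRIAGE RETURN and LINE FEED as well.
    An empty expression matches no character, so only empty content passes.
    """
    char_class = re.compile(allowed_characters, re.DOTALL)
    return all(char_class.fullmatch(ch) is not None for ch in content)
```

Checking one character at a time mirrors the wording of the rule directly and avoids wrapping the expression in a repetition operator, at the cost of one match call per character.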
Examples of expressions (shown as XML source):
"[abc]" : allows the characters 'a', 'b' and 'c'.
"[a-c]" : allows the characters 'a', 'b' and 'c'.
"[a-zA-Z]" : allows the characters from 'a' to 'z' and from 'A' to 'Z'.
"[^abc]" : allows any characters except 'a', 'b', and 'c'.
"[^a-c]" : allows any characters except 'a', 'b', and 'c'.
"[^<>:"\\/|\?*]" : allows only the characters valid for Windows file names.
"." : allows any character.
"" : allows no character.
The Allowed Characters data category can be
expressed with global rules, or locally on individual elements. For elements, the data
category information inherits to the textual
content of the element,
GLOBAL: The
A required
Exactly one of the following:
An
An
The content must not contain the characters * and
+.
The attribute set in this document. The attribute has the
same semantics as
LOCAL: the following local markup is available for the Allowed Characters data category:
A
The local panelmsg must contain only Unicode characters
between U+0020 and U+00FE.
The local code must not contain characters other than 'a'
to 'z' in either case, underscore, and minus.
The Storage Size data category is used to specify the maximum storage size of a given content.
This data category can be used for various purposes, including the following examples:
The storage size is always expressed in bytes and excludes any leading Byte-Order-Mark. It is provided along with the character set encoding and the line break type which will be used when the content is stored. If the encoding form does not use the byte as its unit (e.g. UTF-16 uses 16-bit code units) the storage size MUST still be given in bytes (e.g. for UTF-16: 2 bytes per 16-bit code unit).
An application verifying the storage size for a given content is expected to perform the following steps:
All the LINE FEED (U+000A) characters of the content to verify are replaced by the character or characters specified by the line break type.
The resulting string is converted to an array of bytes using a character set encoder for the specified encoding. If a character cannot be represented with the specified encoding, an error is generated.
If the leading bytes represent a Byte-Order-Mark, they are stripped from that array.
The length of the resulting array is compared to the storage size provided. The content is too long if the length is greater than the storage size.
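The four steps above can be sketched in Python as follows. The function names, the line break keywords "cr", "lf" and "crlf", and the defaults of UTF-8 and "lf" are assumptions of this sketch; Python codec names stand in for whatever encoding names a real implementation would accept.

```python
import codecs

# Assumed keywords for the line break type.
LINE_BREAKS = {"cr": "\r", "lf": "\n", "crlf": "\r\n"}

# Byte-Order-Marks to look for, keyed by normalized Python codec name.
BOMS = {
    "utf-8": (codecs.BOM_UTF8,),
    "utf-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    "utf-16-le": (codecs.BOM_UTF16_LE,),
    "utf-16-be": (codecs.BOM_UTF16_BE,),
    "utf-32": (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
    "utf-32-le": (codecs.BOM_UTF32_LE,),
    "utf-32-be": (codecs.BOM_UTF32_BE,),
}


def stored_length(content, encoding="utf-8", line_break="lf"):
    """Number of bytes needed to store content, excluding a leading BOM."""
    # Step 1: replace LINE FEED by the requested line break sequence.
    text = content.replace("\n", LINE_BREAKS[line_break])
    # Step 2: encode; strict mode raises UnicodeEncodeError for
    # characters the encoding cannot represent.
    data = text.encode(encoding)
    # Step 3: strip a leading Byte-Order-Mark, if present.
    for bom in BOMS.get(codecs.lookup(encoding).name, ()):
        if data.startswith(bom):
            data = data[len(bom):]
            break
    return len(data)


def fits(content, storage_size, encoding="utf-8", line_break="lf"):
    """Step 4: the content is too long if it needs more than storage_size bytes."""
    return stored_length(content, encoding, line_break) <= storage_size
```

Under these assumptions, an 8-character ASCII string needs 16 bytes in UTF-16 once the Byte-Order-Mark is stripped, so fits(text, 8, "utf-16") is False for such a string.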
Storage size is not related to the display length of a text, and therefore should not be used to constrain a certain display length.
The Storage Size data category can be expressed with
global rules, or locally on individual elements. There is no inheritance. The default
value of the character set encoding is
GLOBAL: The
A required
Exactly one of the following:
A
A
None or exactly one of the following:
A
A
An optional
The country element must not be more than 25
bytes. The name "Papouasie-Nouvelle-Guinée" is 25 characters long and fits because
all characters in ISO-8859-1 are encoded as a single byte.
The max to the same functionality as
LOCAL: the following local markup is available for the Storage Size data category:
A
An optional
An optional
The string CONTINUE does not fit the specified
restriction of 8 bytes. The minimal number of bytes to store such a string in UTF-16 is 16.
The