From MultilingualWeb-LT EC Project Wiki
1 Summary of MLW-LT Requirements Breakout session at MLW Workshop, Luxembourg March 2012
This session was facilitated by David Lewis, one of the co-chairs of the new MultilingualWeb-Language Technology W3C working group. He introduced MultilingualWeb-LT as the successor to the ITS standard, but with the aim of better integrating localisation, machine translation and text analytics technologies with content management. The standardisation of meta-data for interoperability between processes, from content creation through localisation/translation to publication, was therefore now in scope. MultilingualWeb-LT would retain the successful features of ITS, including support for current ITS data categories, the separation of data category definitions from their mapping to a specific implementation, and the independence between data categories so that conformance could be claimed by implementing any one. It would, however, address implementation in HTML5, using existing meta-data annotation such as RDFa and microdata, and round-trip interoperability with XLIFF.
The session focussed largely on the general characteristics of data categories rather than on specific data categories, e.g. whether the meta-data was generated automatically or manually (or both), whether it affected the structure of an HTML document, or how it interacted with other meta-data processing, e.g. for style sheets or accessibility. One specific data category to emerge was the priority assigned to content for translation. A point that repeatedly emerged was the importance of relating meta-data definitions to specific processes. This would influence how meta-data for processing provenance and processing instructions would be represented and was also important in defining processing expectations. The challenge identified here was that process flow definitions are intimately linked to localisation contracts and service level agreements. Therefore, trying to standardise process models could meet resistance due to the homogenising effect on business models and service offerings. CMS-based workflows that were contained within a single content-creating organisation and assembled from multiple LT components, rather than outsourced to LSPs, may be more accepting of a common process definition. It was recognised, however, that language technologies, including machine translation and crowd-sourcing, may mean that many process boundaries will change, quickly dating any standardised process model.
2 Detailed Contemporaneous Notes from Arle Lommel
Intro from Dave Lewis, co-chair of the MultilingualWeb-LT working group.
Workflow emphasis (both CMS and MT)
MLW-LT will be the successor to ITS
Need standard ways to mark up texts so that implementers can use them. Need to move beyond verbal agreements between two parties.
Need to recall that there are at least three potential implementations: XML, XLIFF, HTML5
Dave: Recall that we need to separate the definitions of the categories and the specifics of the implementation. We need to define an abstract model, but there are some suppositions: there are tags where attributes can be applied. Also, each data category is independent from the others: there is no obligation to support them all. We want to lower the barrier to entry.
List of existing tags:
• translate
• localization note (communication between parties)
• text direction
• language
• text in element (flows, nested elements, e.g., distinguish a bulleted list from a footnote)
• term
• ruby
Many were discussed in the internationalization activity and then moved up to HTML. This gives them a higher profile than ITS.
Comment: this principle of only requiring a bit is good. Too many standards are too complex.
Text in element can be important for accessibility. How do we deal with dynamic content in HTML apps? What exists now seems to focus on institutions creating content. How do we get websites to work when you have dynamic content?
Do we need standards that are useful for large companies but also for individuals working on their own?
Dave: The idea of standardizing on the definitions of the data categories goes part of the way for that. Think of "translate", which started for XML in DITA, DocBook, but is now in HTML5. If we can seed these things in more popular locations, it will lead to greater adoption. Some things may not be used, others may be, but if we structure it right, those that are not adopted won't slow down those that are.
Olaf: Are these structured so that they can cascade? Do we have to identify language in terms of a tag? The Commission is doing automatic identification of languages.
Olaf: There are two groups of tags: those that are automatable and others that require manual annotation. It may be simplified if these things are split into two groups.
Des: I don't think that distinction needs to be made.
Olaf: I think if you split those out, it would encourage adoption.
Ryan: At Microsoft, we typically have a large amount of metadata in our system that we use to determine what should be localized. We don't want to pass those twenty pieces of metadata to translators: e.g., is the string stable or still changing, what tier is it in, revenue value of the feature. It generates one priority in the end. All of it can be overridden.
Dave: express a prioritization as an item of metadata.
Ryan: There is a lot of metadata that could be relevant, but we don't need to define all of it as a standard.
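The idea discussed above, many internal signals collapsing into one overridable priority that is the only value passed on, could be sketched roughly as follows. This is only an illustration: the field names (`stable`, `tier`, `revenue_value`, `priority_override`), the scoring and the thresholds are all hypothetical and are not taken from any Microsoft system or from ITS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StringMetadata:
    """Hypothetical internal metadata; none of it is passed to translators."""
    stable: bool                  # is the string stable or still changing?
    tier: int                     # assumed: 1 is the most important tier
    revenue_value: float          # assumed: normalised to the range 0.0 to 1.0
    priority_override: Optional[str] = None  # a manual decision always wins


def translation_priority(meta: StringMetadata) -> str:
    """Collapse the internal metadata into one priority label."""
    if meta.priority_override is not None:
        return meta.priority_override
    if not meta.stable:
        # Assumption: strings that are still changing are not worth translating yet.
        return "low"
    score = meta.revenue_value + (1.0 if meta.tier == 1 else 0.0)
    if score >= 1.5:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


if __name__ == "__main__":
    print(translation_priority(StringMetadata(stable=True, tier=1, revenue_value=0.8)))
```

Only the returned label would need a standard representation; the inputs stay private to the organisation that computes it.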
Depth of information. Information can cascade. It can be defined at the element or in the ITS rules that describe document-wide behavior using XPath.
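A minimal sketch of how such cascading might be resolved for the "translate" category is shown below. It assumes the precedence order local attribute, then document-wide rule, then value inherited from the parent, then a default of "yes". The sample document and the `GLOBAL_RULES` list are made up for illustration, and the rules use the limited path syntax of Python's `xml.etree.ElementTree` rather than full XPath.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
LOCAL_TRANSLATE = f"{{{ITS_NS}}}translate"

# Made-up sample document with two local its:translate attributes.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <para>Press <code>Ctrl+C</code> to copy.</para>
  <para its:translate="no">ACME <em>Widget</em></para>
  <code its:translate="yes">a translatable sample</code>
</doc>"""

# Hypothetical document-wide rules: (ElementTree path, translate value).
GLOBAL_RULES = [(".//code", "no")]


def resolve_translate(root, rules):
    """Return {element: "yes" | "no"} for every element under root."""
    by_rule = {}
    for path, value in rules:
        for el in root.findall(path):
            by_rule[el] = value  # a later rule overwrites an earlier one

    resolved = {}

    def walk(el, inherited):
        # Local attribute first, then global rule, then the inherited value.
        value = el.get(LOCAL_TRANSLATE) or by_rule.get(el) or inherited
        resolved[el] = value
        for child in el:
            walk(child, value)

    walk(root, "yes")  # assumed default when nothing else applies
    return resolved


if __name__ == "__main__":
    for el, value in resolve_translate(ET.fromstring(DOC), GLOBAL_RULES).items():
        print(el.tag, repr((el.text or "").strip()), value)
```

In this sketch the first `code` element picks up "no" from the document-wide rule, the `em` element inherits "no" from its parent, and the last `code` element stays translatable because its local attribute takes precedence over the rule.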
We want to use microdata or RDFa rather than coming up with our own way of doing things.
Arle: we need provenance for the tagging that is added to the document so you know what added the metadata.
Dave: We need to consider the end-to-end flow.
• Create (CMS)
• Translate/localize (CAT/MT)
• Consume (CMS?)
We need to know how it was produced (MT, HT, etc.)
The last two steps can often be swapped.
Des: ITS defines only one arrow in one direction in the workflow. We need to open up the entire picture. You need to be sure that information gets back to the CMS, for example. How do you know that what is produced in one node will survive as it passes through other nodes?
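One way to check whether metadata survives a node is to compare the annotations present before and after that node has processed the document. The sketch below does this for local ITS attributes only; the helper names `its_annotations` and `lost_annotations` are hypothetical, and the approach assumes the node leaves the element structure unchanged so that positional paths still line up.

```python
import xml.etree.ElementTree as ET

ITS_PREFIX = "{http://www.w3.org/2005/11/its}"


def its_annotations(xml_text):
    """Collect (element path, ITS attribute, value) triples from a document."""
    found = set()

    def walk(el, path):
        for name, value in el.attrib.items():
            if name.startswith(ITS_PREFIX):
                found.add((path, name[len(ITS_PREFIX):], value))
        counts = {}
        for child in el:
            # Positional path such as /doc/p[2], counted per tag name.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    root = ET.fromstring(xml_text)
    walk(root, "/" + root.tag)
    return found


def lost_annotations(before, after):
    """Annotations present before a workflow node but missing after it."""
    return its_annotations(before) - its_annotations(after)


if __name__ == "__main__":
    ns = 'xmlns:its="http://www.w3.org/2005/11/its"'
    before = f'<doc {ns}><p its:translate="no">ACME</p><p its:locNote="Keep short">Hi</p></doc>'
    after = f'<doc {ns}><p its:translate="no">ACME</p><p>Hallo</p></doc>'
    print(lost_annotations(before, after))
```

A report of this kind could also feed the metrics about failures mentioned in the notes, although what counts as a failure would depend on the processing expectations that are eventually defined.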
Metrics are needed about failures and actions.
Processing expectations need to be laid out.
Dave: we may want to distinguish between imperative and informative.
Ryan: but that is relative. We are missing state of a document: both what needs to be done and what has happened.
Dave: state is difficult to define.
Dave: we are missing a set of labels.
Des: State is dangerous. It will need to reference user-defined data.
Dave: how about putting a pointer to an external state document?
Tadej: term tag. I may want to annotate non-terms similarly. E.g., a personal name may not be a term, but I may want to be able to annotate an arbitrary fragment of text with some properties.
Make sure that our tags cannot break CSS hierarchies.
Des: one of the tenets of ITS is that it is non-invasive. Do we want to make sure it stays that way? It can't change the document structure.
Send link to questionnaire to attendees.
Report back to the plenary:
• Distinction between flows where we remain on the CMS and those where we deal with interface between parties.
• Boundaries are going to go away: information can come from anywhere (e.g., crowdsourcing). Persistent metadata will become more important.
• Much of the discussion was around recording state, but we anticipate that this will be external to the standards we are working on. State is critical, but CMS state may be a bit more tractable in future versions of ITS. TMS state will be out of scope. We decided that a pointer to external state data may be sufficient.
• Principle: cannot break CSS or document structure. Need to anticipate ways in which our requirements could create problems with HTML structure.
• Some of the metadata and annotation is specific to the original language and some to the target language.
• Provenance may be related to state. We believe that this will be an important issue (e.g., where did a piece of information come from?)
• The standards must define process expectations.
• Automated metadata vs. manual metadata; informational vs. imperative metadata. Automated processes must be overridable.
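The points on provenance and on overridable automated metadata could be combined along the following lines. This is a sketch under stated assumptions, not a proposal from the session: the `Annotation` record, its `agent` and `automated` fields, and the rule that a manual annotation always beats an automated one are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    category: str    # data category, e.g. "language" or "translate"
    value: str
    agent: str       # hypothetical provenance: the tool or person that added it
    automated: bool  # True if a process produced it without human review


def effective(annotations):
    """Pick one annotation per category.

    Assumed policy: a later annotation replaces an earlier one, except that
    an automated annotation never replaces a manual one.
    """
    chosen = {}
    for ann in annotations:
        current = chosen.get(ann.category)
        if current is None or not ann.automated or current.automated:
            chosen[ann.category] = ann
    return chosen


if __name__ == "__main__":
    history = [
        Annotation("language", "de", "lang-detector", automated=True),
        Annotation("language", "nl", "editor", automated=False),
        Annotation("language", "de", "lang-detector", automated=True),
    ]
    print(effective(history)["language"])
```

Keeping the agent alongside each value is what makes the override decision possible, and it also answers the question of where a piece of information came from.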