MLW-LT f2f -- 25 Sep 2012

introduction

felix: this morning we will go through some basic parts of the document

<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#introduction

felix: starting with introduction to specification
... to look for changes that are needed
... read intro to section 1
... need to add reference for HTML5
... has references to the ITS requirements and localizable DTDs, which influenced this document
... and references potentially unwritten best practices document

<fsasaki> http://www.w3.org/2011/12/mlw-lt-charter.html

felix: but what does this mean?
... In context of workplan, after feature freeze we have time to add best practice document
... change the reference to a stable wiki page for best practices.
... section 1.1, relation to its1.0 and new principles
... outlines what the principles needs

<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#datacategories-defaults-etc

felix: notes that additional horizontal feature need not be implemented for ITS1.0 data categories

Yves: asks if we still therefore need test suite for ITS1.0 data categories

daveL: yes for completeness, for those not referencing the its1.0

felix: give brief outline of what it means to be conformant to ITS, with reference to test suite

<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#datacategories-defaults-etc

<fsasaki> http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html#translate-local-host

<fsasaki> http://phaedrus.scss.tcd.ie/its2.0/expected/translate/xml/translate4XmlOutput.txt

felix: tomorrow we need to look at this in more detail

omstefanov: it seems the ITS1.0 requirement may be redundant

shaunm: this indicates that ITS2.0 encompasses ITS1.0

<fsasaki> "Where ITS 1.0 data categories are implemented in XML, the implementation must be conformant with the ITS 1.0 approach to XML to claim conformance to ITS 2.0."

pedro: HTML5 add new features

<fsasaki> "ITS 2.0 is backwards compatibly with ITS 1.0 in terms of ITS mechanisms"

<omstefanov> suggest rephrasing that to "ITS 2.0 is backwards compatible with ITS 1.0 in terms of ITS mechanisms"

felix: so this last bullet of 1.1.1 will update to this
... section 1.1.2, new principles
... in first bullet, drop reference to RDFa and NIF, since these are not the format for conformance
... RDFa and NIF status are correctly referenced in second bullet, they are a possible output option
... third bullet clarifies the need for XPATH1.0, with new mechanisms for other queries, i.e. CSS and later xpath version
... but there seems no interest in CSS as a selector language, so we might drop it

phil: may be using CSS selector in our implementation

felix: so we may keep it, as it is optional

<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#datacategories-defaults-etc

felix: list of new data categories need to be updated, with reference to table
... now review text in section 1.2

<fsasaki> "The increasing usage of XML as a medium for documentation-related content (e.g. DocBook and DITA as formats for writing structured documentation, well suited to computer hardware and software manuals)": should mention also HTML5

jirka: need to review the last paragraph related to XML

felix: agree, this needs a rewrite

olaf: can we continue refining this after the meeting

jan: would be helpful to reference other documents
... a question about directionality, is vertical being discussed

felix: this is being discussed elsewhere, in CSS for Asian layout

olaf: suggests covering vertical layout by adding Japanese to the list of example languages

felix: agrees and add reference to best practice document on japanese
... discusses examples
... but it would be good to have some html examples as well as XML in this section

shaun: seems harder to come up with example with both human and machine readable aspects

dave: it would be good to have some real industrial content for examples

des: there is no mention of XLIFF, is that deliberate

dF: XLIFF isn't a source format in the same way that XML and HTML5 are

felix: but for example yves processes many XML as XLIFF

Yves: agrees

df: need to be careful defining XLIFF binding, since this may impinge on the scope of the XLIFF TC

daveL: suggest mentioning multilanguage and bitext files

df: this would be better in the usages section

felix: agrees - we can have a section in 1.3 focussed on XLIFF
... currently we have users identified as schema developers, schema managers, vendors of tools.
... need to add one for localisation workflow managers

<fsasaki> "1.3.1.5" workflow process manager

<scribe> ACTION: dF to add section 1.3.5 on usage by localisation workflow managers [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action01]

<trackbot> Created ACTION-222 - Add section 1.3.5 on usage by localisation workflow managers [on David Filip - due 2012-10-02].

felix: another group on the table but not mentioned, that is people working with terminology and language technology

dF: there might be two, one for terminology and one for language technology
... so there is a bridge to open data and ontologies and also terminologists

jan: are we regarding these text analytics as separate services

<scribe> ACTION: Tatiana to draft text for terminology user with Tadej [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action02]

<trackbot> Sorry, couldn't find user - Tatiana

df: we should look at the use of data categories in terminology lifecycle

<scribe> ACTION: tadej to provide section on text analytics [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action03]

<trackbot> Created ACTION-223 - Provide section on text analytics [on Tadej Štajner - due 2012-10-02].

<scribe> ACTION: pedro to provide a section of MT service provider as user [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action04]

<trackbot> Created ACTION-224 - Provide a section of MT service provider as user [on Pedro Luis Díez Orzas - due 2012-10-02].

<Tatiana> Tilde could also contribute to the MT service part as the consumer of ITS

felix: section 1.3.2, explains the use of global and local selectors

<Tatiana> I mean, as a support to Pedro's paragraph ;)

pedro: this section should explain a bit more clearly how meta data can be produced and consumed by different actors or processes

felix: perhaps revise example from the use cases being shown today
... 1.3.2 ways to use ITS
... still needs to address how to extend a schema, but also how to work with existing formats
... in particular with HTML5
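
The global and local selection discussed for section 1.3.2 can be illustrated with a minimal sketch. It assumes a toy XML document with one global its:translateRule and one local its:translate attribute; the function name translate_flags and the sample document are illustrative only and not taken from the draft. Python's ElementTree supports only a subset of XPath 1.0, so this is a rough approximation of the behaviour, not a conformant processor.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

# Hypothetical sample: one global rule plus one local override.
DOC = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <its:rules version="2.0">
    <its:translateRule selector="//code" translate="no"/>
  </its:rules>
  <p>Press <code>Ctrl</code> and <span its:translate="no">ACME</span> starts.</p>
  <p>Plain text.</p>
</doc>"""


def translate_flags(root):
    """Return {element: bool}: defaults first, then global rules, then local markup."""
    # Default for the Translate data category: element content is translatable.
    flags = {el: True for el in root.iter()}

    # Global rules: the selector picks nodes, the value is inherited by descendants.
    for rule in root.iter(f"{{{ITS_NS}}}translateRule"):
        value = rule.get("translate") == "yes"
        # ElementTree rejects absolute paths on an element, so "//code"
        # is rewritten to ".//code" (a simplification for this sketch).
        for el in root.findall("." + rule.get("selector")):
            for descendant in el.iter():
                flags[descendant] = value

    # Local markup overrides global rules; document order lets inner
    # elements override values inherited from outer ones.
    for el in root.iter():
        local = el.get(f"{{{ITS_NS}}}translate")
        if local is not None:
            for descendant in el.iter():
                flags[descendant] = local == "yes"
    return flags


if __name__ == "__main__":
    for el, flag in translate_flags(ET.fromstring(DOC)).items():
        print(el.tag, "yes" if flag else "no")
```

The sketch keeps precedence simple (local over global over default); a real implementation would also have to handle attribute selection, rule ordering and linked rule files.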

continuation of session 1

felix: now we will review specific data categories

tadej: summarises the changes to disambiguation
... concerned with superfluous information and also the lack of RDF bindings for several existing lexical repositories
... but encouraging this behaviour in repositories is a big issue.
... Also added disambiguation level.
... Also generalised entity type to more general target type
... current issues discussed on mailing list.
... one is that the type can be inferred from the link
... but keep disambig level as optional, but allow it also to be inferred from disambig ident
... also make 'target' more specific by naming to 'disambiguation target'
... Also, wording needs some work, to make it both accessible and also accurate.

jirka: comment on example that disambig level should just be literals, so don't need 'its:' prefix

<scribe> ACTION: Tadej to update disambiguation to change name of target type and to remove level value prefix [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action05]

<trackbot> Created ACTION-225 - Update disambiguation to change name of target type and to remove level value prefix [on Tadej Štajner - due 2012-10-02].

arle: suggest use of an alternative to target, using 'category' instead, or 'class', i.e. its-disambig-class-ref

daveL: does 'level' make sense

Tadej: yes, well understood in language processing circles

phil: perhaps use category or type

tadej: perhaps use 'granularity'

felix: suggest that these changes and also the descriptive text be worked on in a breakout session tomorrow with Arle

daveL: suggest to supplement introductory description with an example

tadej: agrees

felix: not time now for breakout, so perhaps introduce some other topics
... tool identification is one issue, yves to summarise

Yves: we have some data categories where there is some data that is at a document level and some that is local, e.g. at every segment
... so agreed override is always complete, but still want this orthogonal tool id feature
... felix suggested a separate format based on OLIF for this

<fsasaki> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Sep/0160.html

Yves: but own opinion that this might be a bit complex, and an in-document way of identifying tool would be attractive

felix: this definitely needs a breakout session

dF: indicate he will lead this breakout

felix: examples in the spec - this needs some work and shaun volunteered to look at that
... we also need schema fragments to integrate into XML and HTML5 (jirka's action)
... we will have a breakout session on provenance tomorrow, led by dave. Later this topic will be handed over to Phil, though he is leaving early

pedro: presents a quick overview of use of readiness

<Yves_> proposal is attached here: http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Sep/0025.html

pedro: the advantage of this is that client is more independent from providers
... there is a concrete need for this, but nowhere to put this

jan: invites us to look at microsoft translator API that offers some potential for this

<Arle> Scribe: Arle

Felix: This next section is to show the project officer that we are making progress.
... Arle will fill in templates to show what we are doing.

implementation enlaso

<fsasaki> presentation from yves

Yves: Question about what to do with multiple keywords.
... Conducted a demo showing that non-translatable content was in fact not translated.
... Showed slide on Translation Package Creation
... its: storageSizeEncoding provides information not otherwise available in XLIFF 1.2 concerning the encoding.

Yves: Third use case: Moses Translation (M4Loc). Essentially identical to the case with Microsoft Translator.
... (Used imitation of M4Loc in the demo)
... Last use case is a bit different. It uses the categories after extraction, not to make a kit, but to use them directly, to validate things. I hope to add locQuality later.
... This is quality check. It uses the same extraction mechanism and preserve space is important. Need id value.
... Finds problems in source as well as target.
... The UI of CheckMate lets you decide whether to use the ITS categories in some cases.
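
The behaviour shown in the demo, where content marked as non-translatable is left out of what gets translated, can be sketched as follows. This is not how Okapi does it; it is a minimal illustration using Python's standard html.parser, and the class name TranslatableText and the VOID set are assumptions made for the example. It only honours the HTML5 translate attribute with inheritance, not global ITS rules.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Collect text chunks whose nearest translate setting is not "no"."""

    def __init__(self):
        super().__init__()
        self.stack = [True]   # translate state per open element; default is yes
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = (dict(attrs).get("translate") or "").lower()
        if value == "no":
            state = False
        elif value == "yes":
            state = True
        else:
            state = self.stack[-1]   # missing or other value: inherit from parent
        self.stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack[-1]:
            self.chunks.append(text)


if __name__ == "__main__":
    parser = TranslatableText()
    parser.feed('<p>Click <code translate="no">Save</code> to <b>continue</b>.</p>')
    parser.close()
    print(parser.chunks)
```

The sketch assumes well-formed nesting; real HTML5 content with omitted end tags would need a proper HTML5 parser such as the validator.nu library mentioned later in these minutes.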

Felix: Question: The M4Loc bit was made up, didn't actually use Moses. Is it something we could leverage since this is a workflow that does half the job?
... I'm just wondering if we can use this with Moses.

Milan: I think we could change the M4Loc process to use ITS and it will be very helpful.

Des: Storage Size was an example. Does it get propagated through to the translator?

Yves: Yes. CheckMate doesn't modify the file. We could allow that.
... For allowed characters, we don't use the schema. We use a subset in Java Regex. I don't intend to support the entire XML regex. It's a dependency we don't want.
... We do everything else with it, but if you use more of a regex than what we can handle, you will get an error.
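
The approach Yves describes (check text against an allowed-characters expression, and report an error when the expression goes beyond what the engine supports) can be sketched like this. Python's re module stands in for the Java regex subset mentioned above, which is an assumption of this example; the function name disallowed_characters is hypothetical.

```python
import re


def disallowed_characters(text, allowed_pattern):
    """Return the characters of text that the allowed-characters pattern rejects.

    allowed_pattern is expected to describe a single permitted character,
    for example a character class such as "[a-zA-Z0-9]" or "[^*+]".
    """
    try:
        allowed = re.compile(allowed_pattern)
    except re.error as exc:
        # Mirrors the behaviour described in the discussion: an expression
        # the engine cannot handle is reported as an error, not ignored.
        raise ValueError(
            f"unsupported allowed-characters expression: {allowed_pattern!r}"
        ) from exc
    return [ch for ch in text if not allowed.fullmatch(ch)]


if __name__ == "__main__":
    # Characters "*" and "+" are not allowed in this hypothetical field.
    print(disallowed_characters("a*b+c", "[^*+]"))
```

Expressions written for one regex dialect are not guaranteed to mean the same thing in another, which is the dependency problem raised in the discussion.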

Jirka: I think there is a Saxon library that might convert this. You should look into it.

Felix: Is there a concrete action following for M4Loc from this?

Milan: It looks much easier now, so we should analyze the new version of these tools.

Yves: You'll get HTML5 support by going this route.
... We can also add information about the domain. It might be useful for choosing the process in MT.

David: There is a potential to expand what M4Loc parses. Not just inlines, but the domain would be an obvious thing. Property bugs could be another thing. It depends on the MT consumer.
... Asia online could consume property bugs. It would be nice to add terminology and entity markup in M4Loc.

Declan: We might be able to leverage some of the M4Loc stuff in what we are doing to avoid duplication of effort.

David: It would be great if you could consume it.

Felix: You don't need a separate filter for translate from Okapi as long as you can consume it.

David: Yves is working on the XLIFF 2.0 library, which will make switching easy when the time comes for it.

Yves: We do have some XLIFF 2.0 stuff done. But we don't want to fall back on everyone using Okapi because we need several implementations. It helps make the standard better by seeing what problems they run into. It is important to have multiple implementations.

Felix: That's not a W3C process question: We can have "fake" implementations, but we need real ones.
... We didn't address the keyword mapping topic. Let's put that down for later.

<fsasaki> ACTION: felix to come back to keyword mapping issue in domain [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action06]

<trackbot> Created ACTION-226 - Come back to keyword mapping issue in domain [on Felix Sasaki - due 2012-10-02].

Milan: Is there a new version of Okapi with this?

Yves: The HTML5 branch in the GIT repository has it.

Des: Will it move into the dev branch?

Yves: Later on.

HTML5+ITS to XHTML+ITS convertor

<Jirka> https://github.com/kosek/html5-its-tools

Arle: can it convert back from XHTML to HTML5?

Jirka: Not currently, but it wouldn't be hard.

Felix: It might be useful to Pedro if it did.

Shaun: If there is no ITS target information in the target file, do you have to convert back?
... It should take only a few lines of XSLT. It's not difficult.

Pedro: The transition to HTML5 will take some time and this will help.

Yves: This was *extremely* useful to me. If you are working with Java, using validator.nu is the natural way.

Felix: This validator.nu is used by the W3C's own validator.

Jirka: For HTML5+ITS there are web and command-line versions. If there is interest, I can make it accessible through the university website when stable.

Felix: This will become part of the W3C validator once stable.

Jirka: Before that, I can find a server and make it available.
... It will help us catch typos in examples.

<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#its-schematron-constraints

Felix: For ITS 1.0 you made Schematron rules to check all sorts of things. I'm not sure if people are familiar with that.
... See the link I posted. These are checks that go well beyond schema checks.
... E.g., cooccurrence constraints, etc.
... Could the Schematron be integrated into the W3C validator?

Jirka: I'll need to check on that.

CMS to TMS and Online TM System

Pedro: This features Drupal integration with Cocomore for the showcase.

Felix: These are hand-made examples for now, right?

Mauricio: Yes.
... The implementation of translate allows CAT tool users to see the content, but not to change it.

Felix: When will there be a prototype?

Pedro: Here there are three parts. The first is the Drupal connection. We have checked our web service. Today or tomorrow I hope that we can ramp up but it has been tested.
... The second is the engine for normalization. That will be done in October, in a couple of weeks.
... The third are the effects in the localization platform. Everything has to be ready before the end of December.

Felix: If you look at the description of work we have until next year. But see how Yves is implementing while we are defining and providing feedback.
... You are working in a waterfall mode, waiting for the definition to be complete. For example, the <meta> tag has content, so it wouldn't work. The waterfall model wouldn't catch that early on; you don't see the errors until later.
... I hope you can move towards Yves' model to catch errors early on.

<philr> ITS 2.0 Specification says that Provenance category will be updated in next version of the spec. Is this still the case? Has Provenance category been dropped?

Felix: It is really useful to use a feature prototype model.

<Pnietoca> https://www.w3.org/International/multilingualweb/lt/wiki/Online_MT_System_Internationalization_Project_Information_Metadata

Felix: We need to start contributing test cases.

Jan: That will help those interested in that to start getting involved.

Des: In the first use case, why did you go first to XML, then to XLIFF, then to HTML5? HTML5 doesn't seem to be an optimized interchange format?
... When you don't have a CMS, there are valid reasons to use HTML5. But when you do, why not go straight to XML?
... You obviously have a reason since you considered them.

Moritz: We started with XML, moved to XLIFF, and that was hard. And then Felix asked for more HTML5 implementations, so we thought we'd try that. We found XLIFF was a pain, so we could move back to XML.

Felix: While authors may want to work with HTML5, internally use what works best. I don't think corporations are using HTML5-based workflows right now.
... I've seen examples of XLIFF, but see what works for you. Make sure it is useful for you internally.

Des: It seems to me that this is going Publishing → Localization → Publishing by using HTML5. It may work for you though.

Dave: Jan told us yesterday, however, that more authoring is in HTML5.

Pedro: Normally we have discussion between integrators, the client, and us. Perhaps in that case someone would have asked why we use HTML for a roundtrip like this.

Felix: It wouldn't work without HTML5 support, but we didn't discuss any specific HTML5 application. Yves showed how HTML5 could enter the chain, be converted to XLIFF, etc.
... But I'm not sure if HTML5 should serve for the whole chain. XLIFF would seem to make more sense.

David: HTML5 lacks the mechanism for bitext translations.
... I thought Tektronix donated their XLIFF-to-Drupal extractor to an open-source project, so this was taken care of.

Felix: You don't have to use HTML5, so please look at it and do what you need to that makes sense.

Des: I think that we need to distinguish between authoring and publication formats on the one hand and interchange formats on the other. We need to consider what is best practice.
... There is a lot that isn't possible in HTML5. I think we need to consider what is best practice and what we should promote.

Dave: Smaller clients running their own websites might have only an off-the-shelf Drupal and don't want to set up XLIFF and so forth.
... So that is one market, different from the enterprise client.

Felix: You can consider using XLIFF in your process, or might continue as you are and make it clear where your workflow applies with a good description.

Dave: We need clear business cases.

Moritz: Mauricio and I should knock this out tonight.

Felix: Include David F. in this discussion.

Pedro: Concerning readiness, there are a few of us who see this as very useful (Dave, Yves, Cocomore, and us). In the case that you can choose where to put that information, the choice is more political than technical.
... In the use case of HTML with no API, wrapper, etc., you might put it right in the HTML material.
... We need to push this hard right now since it needs to be ready by November.

Felix: Let me point to what Yves and Shaun did: they implemented features they liked and discussed them in the ITS discussion forum. Some of their ideas are now being implemented.
... Implement things, but not privately, even if they don't make it into ITS 2.0, so that others can see them.
... One reason for an implementation-driven approach is that it allows people to see what is being thought of and tried.

David: I see why you want readiness in HTML5, but most clients don't want that information published.

Dave: One thing we haven't discussed much is the need to strip information.

Shaun: For ITS 2.0 we use DocBook and Mallard. Before we had tools, the translators had to work directly in those files.
... Our translators use PO files.

<fsasaki> ACTION: phil to move provenance forward (off-line discussion at prague f2f) [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action07]

<trackbot> Created ACTION-227 - Move provenance forward (off-line discussion at prague f2f) [on Phil Ritchie - due 2012-10-02].

ITS Tool

Shaun: Colleague created XML2PO, but it created problems for us in some ways (despite being a step forward). There were issues for us concerning how to map the XML structure to PO.
... I redid this as ITS Tool when I discovered it.
... ITS couldn't provide all the information needed by PO. We added a number of extensions, some of which have now gone into the ITS 2.0.

Shaun: ITS tool ships with a set of rules and uses them to parse files.

<fsasaki> Arle, maybe for the slides: ITS tool ships with a set of default rules for various formats and uses these for PO file generation

Quality issue in the browser

Phil: I'm going to show our work on review.
... Our system is something like CheckMate, doing automated checks. We added a browser client that works both off and online, using AJAX to post back to a server, capturing provenance.
... Allowed use of audit trails to find quality problems in other documents.
... Tool focuses on sentences where we expect there may be problems.
... Allow tagging error types in the UI. The process alters the DOM in HTML and puts the errors into stand-off markup.
... By editing the DOM, we can save the file with the markup.
... It doesn't require copying and pasting.

Des: What are the constraints? Can you use any HTML file?

Phil: It's browser-independent. It doesn't have any dependencies because when we do the transformation from XLIFF everything is wired into the file and all you have to do is reference some standard jQuery/JavaScript libraries.
... Everything is embedded in the HTML5 when it is converted from XLIFF.

<fsasaki> http://about.validator.nu/htmlparser/

<fsasaki> "The jar file contains sample main() entry points:"

<fsasaki> above library can be used not only for validation, but also for parsing and e.g. creating various serializations

Simple Segment Machine Translation Use Case

Dave: Will discuss simple MT.

Pedro: With MT there should also be CAT tools and human at the segment level. What strategy did you take to addressing metadata that applies to more than one segment/level?

Dave: Before we call the service, we have to do a full parse down to the segment level.

Pedro: In our case we don't do the segmentation. The CAT tool does, because it has to be consistent with the TM.
... It is an external service to us.

Dave: We do it because we want to focus on the MT and still have control. But we aren't working with a CAT tool.

David: It's a small loop here, so we can do it this way. But in a bigger process, you have to make sure these things are handled appropriately early on. You will need ways to reverse the process too, at the end.

Pedro: Some things are handled at the segment level, but others apply to the document or sections.

David: in some cases segment-by-segment is too slow.
... You won't want to rely on the MT system for segmentation if you have to use TM.

Declan: We need to know whether the MT service would ever get a full document or whether it would only get pieces. In the past we have usually dealt with sub-paragraph segments.

Felix: Domain-mapping here used space separated rather than comma-separated. We need to make sure there is consistency here.

Yves: I wanted to know how to map domains in HTML. The problem was the format of the keywords in META. Currently we point to a node and expect a string to map to it, but we don't have an internal syntax for the contents. We need to specify this.
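
The ambiguity raised here, comma-separated versus space-separated keywords in a META element, can be made concrete with a small sketch. It shows one possible tolerant reading and is not something the group agreed on; the function names split_keywords and map_domains and the sample mapping are assumptions made for illustration.

```python
def split_keywords(content):
    """Split a keywords string into individual keywords.

    Assumption for this sketch: commas are the separator when present,
    which keeps multi-word keywords intact; only when there is no comma
    at all is the string split on whitespace.
    """
    parts = content.split(",") if "," in content else content.split()
    return [part.strip() for part in parts if part.strip()]


def map_domains(keywords, mapping):
    """Map source keywords to consumer domain names, case-insensitively.

    Keywords without a mapping are passed through unchanged, and
    duplicates are dropped while preserving order.
    """
    lookup = {key.lower(): value for key, value in mapping.items()}
    domains = []
    for keyword in keywords:
        domain = lookup.get(keyword.lower(), keyword)
        if domain not in domains:
            domains.append(domain)
    return domains


if __name__ == "__main__":
    keywords = split_keywords("automotive, heavy machinery")
    print(map_domains(keywords, {"automotive": "auto"}))
```

With whitespace as the only separator, a keyword such as "heavy machinery" would be read as two keywords, which is why the internal syntax of the node content needs to be specified.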

David: Talking about XLIFF used to provide CMS-TMS roundtrip.
... Proxy problem means we can't show the demo.
... We initiate projects on the CMS. Want to show examples of how the XLIFF half works.
... Note this is nothing like a traditional TMS. It is a service-oriented architecture. Previously it had an unrestricted number of specialized agents.
... It routes XLIFF between the specialized agents.
... It is a closed localization loop, before the CMS enters.
... The idea is to use this modularized system based on XLIFF I/O.

SOLAS CMS-LION ITS

David: Sometimes the dumb components need to be clever.
... We can start processes from an arbitrary XLIFF file, or from Okapi.
... We work with Moravia and M4Loc (Moses). Moses uses text only, but M4Loc adds XLIFF capabilities for Moses. We then pass on MT-relevant metadata. They will add support to the M4Loc project.
... We might want to add support for provenance that Yves doesn't need. For example, if we want to integrate multiple MT systems, we would need that capability.

Yves: We need a consistent way of mapping the data categories to XLIFF.

Dave: The co-chairs need to take the lead in this.

David: Does this belong to XLIFF or ITS? Maybe this is a good reason why Moritz and Pedro did not use XLIFF for an exchange mechanism.
... We need a single XLIFF+ITS method.

Felix: Once the metadata is stable in November, we need to deal with this. We can publish as many best practice documents as we want, so we can have an ITS to XLIFF mapping.

Cocomore demonstration

Moritz: I'd like to make a case for readiness. We need to provide a way for the user to be able to trigger processes upon certain conditions. For example, we send things off to Enrycher, Linguaserve.
... Even if readiness isn't a data category, it should be a best practice to help smaller enterprises.
... We let users add local metadata.

Dave: Is that an existing HTML editor?

Moritz: Yes.
... We have trouble knowing how to make translate global for the end user in an intelligible fashion.

Felix: Is localization note only global for the whole document?

Moritz: for the content node, yes.

Felix: That doesn't let you mark pieces of nodes.

Moritz: We've not implemented that but it's something to consider.
... Implementing all this required "breaking Drupal's back a bit". It's still a bit too complex, but we're working on this.
... Our process in the CMS should be linked to best practice for readiness.

Olaf-Michael: Does it compare metadata in source and target?

Moritz: It's half automatic at this point. We need to see what we can leave in.

Serge: This targets Drupal, but what about the other 1200+ CMS products?

Felix: Because we don't have infinite funding, we are focusing on an open-source CMS, hoping that it can be reused. This is just the start and we want it in open source.

Moritz: We will provide these as Drupal modules for others to use.

Yves: The interface with translation will be standardized, and not tied to Linguaserve?

Moritz: For the showcase, we are focusing on Linguaserve, but we will go wider.

Felix: Adjourn for today.
