<fsasaki> scribe: variousToo
<daveL> chair: felix
<daveL> scribe: daveL
<scribe> Meeting: MLW-LT face to face, Prague, 25 Sep 2012, 09.00 CET
felix: one change: the continuation of demo session 1 will be a breakout between the coffee break and lunch today
felix: this morning we will go through some basic parts of the document
felix: starting with introduction
... to look for changes that are needed
... read intro to section 1
... need to add reference for HTML5
... has references to ITS requirements and localizable DTD, which influenced this document
... and references potentially unwritten best practices document
felix: but what does this
... In context of the workplan, after feature freeze we have time to add a best practice document
... change the reference to a stable wiki page for best practices.
... section 1.1, relation to its1.0 and new principles
... outlines what the principles needs
felix: notes that additional horizontal features need not be implemented for ITS1.0 data categories
Yves: asks if we still therefore need test suite for ITS1.0 data categories
daveL: yes for completeness, for those not referencing the its1.0
felix: gives a brief outline of what it means to be conformant to ITS, with reference to test suite
felix: tomorrow we need to look at this in more detail
omstefanov: it seems the ITS1.0 requirement may be redundant
shaunm: this indicates that ITS2.0 encompasses ITS1.0
<fsasaki> "Where ITS 1.0 data categories are implemented in XML, the implementation must be conformant with the ITS 1.0 approach to XML to claim conformance to ITS 2.0."
pedro: HTML5 add new features
<fsasaki> "ITS 2.0 is backwards compatibly with ITS 1.0 in terms of ITS mechanisms"
<omstefanov> suggest rephrasing that to "ITS 2.0 is backwards compatible with ITS 1.0 in terms of ITS mechanisms"
felix: so the last bullet of 1.1.1 will be updated to this
... section 1.1.2, new principles
... in first bullet, drop reference to RDFa and NIF, since these are not the format for conformance
... RDFa and NIF status are correctly referenced in second bullet, they are a possible output option
... third bullet clarifies the need for XPath 1.0, with new mechanisms for other query languages, i.e. CSS and later XPath versions
... but there seems no interest in CSS as a selector language, so we might drop it
phil: may be using CSS selector in our implementation
felix: so we may keep it, as it is optional
felix: list of new data categories needs to be updated, with reference to table
... now review text in section 1.2
<fsasaki> "The increasing usage of XML as a medium for documentation-related content (e.g. DocBook and DITA as formats for writing structured documentation, well suited to computer hardware and software manuals)": should mention also HTML5
jirka: need to review the last paragraph related to XML
felix: agree, this needs a rewrite
olaf: can we continue refining this after the meeting
jan: would be helpful to reference other documents
... a question about directionality, is vertical being discussed
felix: this is being discussed elsewhere, in CSS for Asian layout
olaf: suggest covering vertical layout by adding Japanese to the list of example languages
felix: agrees and will add reference to best practice document on Japanese
... discusses examples
... but it would be good to have some html examples as well as XML in this section
shaun: seems harder to come up with example with both human and machine readable aspects
dave: it would be good to have some real industrial content for examples
des: there is no mention of XLIFF, is that deliberate
dF: XLIFF isn't a source format in the same way that XML and HTML5 are
felix: but for example yves processes many XML as XLIFF
df: need to be careful defining XLIFF binding, since this may impinge on the scope of the XLIFF TC
daveL: suggest mentioning multilanguage and bitext files
df: this would be better in the usages section
felix: agrees - we can have a section in 1.3 focussed on XLIFF
... currently we have users identified as schema developers, schema managers, vendors of tools.
... need to add one for localisation workflow managers
<fsasaki> "22.214.171.124" workflow process manager
<scribe> ACTION: dF to add section 1.3.5 on usage wby localisation workflow managers [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action01]
<trackbot> Created ACTION-222 - Add section 1.3.5 on usage wby localisation workflow managers [on David Filip - due 2012-10-02].
felix: another group on the table but not mentioned, that is people working with terminology and language technology
dF: there might be two, one for terminology and one for language technology
... so there is a bridge to open data and ontologies and also terminologists
jan: are we regarding these text analytics as separate services
<scribe> ACTION: Tatiana to draft text for terminology user with Tadej [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action02]
<trackbot> Sorry, couldn't find user - Tatiana
df: we should look at the use of data categories in terminology lifecycle
<scribe> ACTION: tadej to provide section on text analytics [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action03]
<trackbot> Created ACTION-223 - Provide section on text analytics [on Tadej Štajner - due 2012-10-02].
<scribe> ACTION: pedro to provide a section of MT service provider as user [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action04]
<trackbot> Created ACTION-224 - Provide a section of MT service provider as user [on Pedro Luis Díez Orzas - due 2012-10-02].
<Tatiana> Tilde could also contribute to the MT service part as the consumer of ITS
felix: section 1.3.2, explains the use of global and local selectors
<Tatiana> I mean, as a support to Pedro's paragraph ;)
pedro: this section should explain a bit more clearly how meta data can be produced and consumed by different actors or processes
felix: perhaps revise example from the use cases being shown today
... 1.3.2 ways to use ITS
... still needs to address how to extend schema, but also how to work with existing formats
... in particular with HTML5
felix: now we will review specific data categories
tadej: summarises the changes to
... concerned with superfluous information and also the lack of RDF bindings for several existing lexical repositories
... but encouraging this behaviour in repositories is a big issue.
... Also added disambiguation level.
... Also generalised entity type to more general target type
... current issues discussed on mailing list.
... one is that the type can be inferred from the link
... but keep disambig level as optional, and allow it also to be inferred from disambig ident
... also make 'target' more specific by renaming it to 'disambiguation target'
... Also, wording needs some work, to make it both accessible and also accurate.
jirka: comment on example that disambig level should just be literals, so don't need 'its:' prefix
<scribe> ACTION: Tadej to update disambiguation to chanrge name of target type and to remove level value prefix [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action05]
<trackbot> Created ACTION-225 - Update disambiguation to chanrge name of target type and to remove level value prefix [on Tadej Štajner - due 2012-10-02].
arle: suggest use of alternative to target, using 'category' instead, or 'class', i.e. its-disambig-class-ref
daveL: does 'level' make sense
Tadej: yes, well understood in language processing circles
phil: perhaps use category or type
tadej: perhaps use 'granularity'
felix: suggest that these changes and also the descriptive text be handled in a breakout session tomorrow with Arle
daveL: suggest to supplement introductory description with an example
felix: no time now for breakout, so perhaps introduce some other topics
... tool identification is one issue, yves to summarise
Yves: we have some data categories where there is some data that is at a document level and some that is local, e.g. at every segment
... so agreed override is always complete, but still want this orthogonal tool id feature
... felix suggested a separate format based on OLIF for this
Yves: but own opinion that this might be a bit complex, and an in-document way of identifying tool would be attractive
felix: this definitely needs a breakout session
dF: indicate he will lead this breakout
felix: examples in the spec - this needs some work and shaun volunteered to look at it
... we also need schema fragments to integrate into XML and HTML5 (jirka's action)
... we will have a breakout session on provenance tomorrow, led by dave. Later this topic will be handed over to Phil, though he is leaving early
pedro: presents a quick overview of use of readiness
<Yves_> proposal is attached here: http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Sep/0025.html
pedro: the advantage of this is that client is more independent from providers
... there is a concrete need for this, but nowhere to put this
jan: invites us to look at microsoft translator API that offers some potential for this
<Arle> Scribe: Arle
Felix: This next section is to show to the project officer that we are making progress.
... Arle will fill in templates to show what we are doing.
<fsasaki> presentation from yves
Yves: Question about what to do with multiple keywords.
... Conducted a demo showing that non-translatable content was in fact not translated.
... Showed slide on Translation Package Creation
..ist: storageSizeEncoding provides information not otherwise available in XLIFF 1.2 concerning the encoding.
scribe: Third use case: Moses Translation (M4Loc). Essentially identical to the case with
... (Used imitation of M4Loc in the demo)
... Last use case is a bit different. It uses the categories after extraction, not to make a kit, but to use them directly, to validate things. I hope to add locQuality later.
... This is quality check. It uses the same extraction mechanism and preserve space is important. Need id value.
... Finds problems in source as well as target.
... The UI of CheckMate lets you decide whether to use the ITS categories in some cases.
Felix: Question: The M4Loc bit was made up, didn't actually use Moses. Is it something we could leverage since this is a workflow that does half the
... I'm just wondering if we can use this with Moses.
Milan: I think we could change the M4Loc process to use ITS and it will be very helpful.
Des: Storage Size was an example. Does it get propagated through to the translator?
Yves: Yes. CheckMate doesn't modify the file. We could allow that.
... For allowed characters, we don't use the schema. We use a subset in Java Regex. I don't intend to support the entire XML regex. It's a dependency we don't want.
... We do everything else with it, but if you use more of a regex than what we can handle, you will get an error.
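Note: the subset approach Yves describes for allowed characters could look roughly like the sketch below. This is not the Okapi or CheckMate code. It assumes the allowed-characters value is a single bracketed character class, and the names `find_disallowed` and `_SIMPLE_CLASS` are hypothetical. Anything beyond that simple form is rejected with an error instead of being interpreted.

```python
import re

# Assumption: only one plain bracketed character class is supported,
# e.g. "[a-zA-Z0-9 ]". Nested classes or other regex syntax are rejected.
_SIMPLE_CLASS = re.compile(r"^\[(?:[^\[\]\\]|\\.)+\]$")


def find_disallowed(text, allowed_class):
    """Return (index, character) pairs for characters not matched by allowed_class."""
    if not _SIMPLE_CLASS.match(allowed_class):
        raise ValueError(f"unsupported allowed-characters pattern: {allowed_class!r}")
    try:
        allowed = re.compile(allowed_class)
    except re.error as exc:
        # The shape looked simple but the class itself is invalid (e.g. "[z-a]").
        raise ValueError(f"invalid allowed-characters pattern: {allowed_class!r}") from exc
    return [(i, ch) for i, ch in enumerate(text) if not allowed.fullmatch(ch)]


if __name__ == "__main__":
    print(find_disallowed("Hello, world!", "[a-zA-Z ]"))
```

Rejecting unsupported patterns up front keeps the checker free of a full XML regex dependency, at the cost of an error for richer expressions.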
Jirka: I think there is a Saxon library that might convert this. You should look into it.
Felix: Is there a concrete action following for M4Loc from this?
Milan: It looks much easier now, so we should analyze the new version of these tools.
Yves: You'll get HTML5 support by going this route.
... We can also add information about the domain. It might be useful for choosing the process in MT.
David: There is a potential to expand what M4Loc parses. Not just inlines, but the domain would be an obvious thing. Property bugs could be another thing. It depends on the MT consumer.
... Asia Online could consume property bugs. It would be nice to add terminology and entity markup in M4Loc.
Declan: We might be able to releverage some of the M4Loc stuff in what we are doing to avoid duplication of effort.
David: It would be great if you could consume it.
Felix: You don't need a separate filter for translate from Okapi as long as you can consume it.
David: Yves is working on the XLIFF 2.0 library, which will make switching easy when the time comes for it.
Yves: We do have some XLIFF 2.0 stuff done. But we don't want to fall back on everyone using Okapi because we need several implementations. It helps make the standard better by seeing what problems they run into. It is important to have multiple implementations.
Felix: That's not a W3C process question: We can have "fake" implementations, but we need real
... We didn't address the keyword mapping topic. Let's put that down for later.
<fsasaki> ACTION: felix to come back to keyword mapping issue in domain [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action06]
<trackbot> Created ACTION-226 - Come back to keyword mapping issue in domain [on Felix Sasaki - due 2012-10-02].
Milan: Is there a new version of Okapi with this?
Yves: The HTML5 branch in the GIT repository has it.
Des: Will it move into the dev branch?
Yves: Later on.
Arle: can it convert back from XHTML to HTML5?
Jirka: Not currently, but it wouldn't be hard.
Felix: It might be useful to Pedro if it did.
Shaun: If there is no ITS target information in the target file, do you have to convert
... It should take only a few lines of XSLT. It's not difficult.
Pedro: The transition to HTML5 will take some time and this will help.
Yves: This was *extremely* useful to me. If you are working with Java, using validator.nu is the natural way.
Felix: This validator.nu is used by the W3C's own validator.
Jirka: For HTML5+ITS there are web and command-line versions. If there is interest, I can make it accessible through the university website when stable.
Felix: This will become part of the W3C validator once stable.
Jirka: Before that, I can find a server and make it available.
... It will help us catch typos in examples.
Felix: For ITS 1.0 you made Schematron rules to check all sorts of things. I'm not sure if people are familiar with that.
... See the link I posted. These are checks that go well beyond schema checks.
... E.g., cooccurrence constraints, etc.
... Could the Schematron be integrated into the W3C validator?
Jirka: I'll need to check on that.
Pedro: This features Drupal integration with Cocomore for the showcase.
Felix: These are hand-made examples for now, right?
... The implementation of translate allows CAT tool users to see the content, but not to change it.
Felix: When will there be a prototype?
Pedro: Here there are three parts. The first is the Drupal connection. We have checked our web service. Today or tomorrow I hope that we can ramp up but it has been tested.
... The second is the engine for normalization. That will be done in October, in a couple of weeks.
... The third are the effects in the localization platform. Everything has to be ready before the end of December.
Felix: If you look at the description of work we have until next year. But see how Yves is implementing while we are defining and providing
... You are working in a waterfall mode, waiting for the definition to be complete. For example, the <meta> tag has content, so it wouldn't work. The waterfall model wouldn't catch that early on; you don't see the errors until later.
... I hope you can move towards Yves' model to catch errors early on.
<philr> ITS 2.0 Specification says that Provenance category will be updated in next version of the spec. Is this still the case? Has Provenance category been dropped?
Felix: It is really useful to use a feature prototype model.
Felix: We need to start contributing test cases.
Jan: That will help those interested in that to start getting involved.
Des: In the first use case, why did you go first to XML, then to XLIFF, then to HTML5? HTML5 doesn't seem to be an optimized interchange format?
... When you don't have a CMS, there are valid reasons to use HTML5. But when you do, why not go straight to XML?
... You obviously have a reason since you considered them.
Moritz: We started with XML, moved to XLIFF, and that was hard. And then Felix asked for more HTML5 implementations, so we thought we'd try that. We found XLIFF was a pain, so we could move back to XML.
Felix: While authors may want to work with HTML5, internally use what works best. I don't think corporations are using HTML5-based workflows right now.
... I've seen examples of XLIFF, but see what works for you. Make sure it is useful for you internally.
Des: It seems to me that this is going Publishing → Localization → Publishing by using HTML5. It may work for you though.
Dave: Jan told us yesterday, however, that more authoring is in HTML5.
Pedro: Normally we have discussion between integrators, the client, and us. Perhaps in that case someone would have asked why we use HTML for a roundtrip like this.
Felix: It wouldn't work without HTML5 support, but we didn't discuss any specific HTML5 application. Yves showed how HTML5 could enter the chain, be converted to XLIFF, etc.
... But I'm not sure if HTML5 should serve for the whole chain. XLIFF would seem to make more sense.
David: HTML5 lacks the mechanism for bitext translations.
... I thought Tektronix donated their XLIFF-to-Drupal extractor to an open-source project, so this was taken care of.
Felix: You don't have to use HTML5, so please look at it and do what you need to that makes sense.
Des: I think that we need to distinguish between authoring and publication formats on the one hand and interchange formats on the other. We need to consider what is best practice.
... There is a lot that isn't possible in HTML5. I think we need to consider what is best practice and what we should promote.
Dave: Smaller clients running their own websites might have only an off-the-shelf Drupal and don't want to set up XLIFF and so forth.
... So that is one market, different from the enterprise client.
Felix: You can consider using XLIFF in your process, or might continue as you are and make it clear where your workflow applies with a good description.
Dave: We need clear business cases.
Moritz: Mauricio and I should knock this out tonight.
Felix: Include David F. in this discussion.
Pedro: Concerning readiness, there are a few of us who see this as very useful (Dave, Yves, Cocomore, and us). Where you choose to put that information is more a political question than a technical one.
... In the use case of HTML with no API, wrapper, etc., you might put it right in the HTML material.
... We need to push this hard right now since it needs to be ready by November.
Felix: Let me point to what Yves and Shaun did: they implemented features they liked and discussed them in the ITS discussion forum. Some of their ideas are now being implemented.
... Implement things, but not privately, even if they don't make it into ITS 2.0, so that others can see them.
... One reason for an implementation-driven approach is that it allows people to see what is being thought of and tried.
David: I see why you want readiness in HTML5, but most clients don't want that information published.
Dave: One thing we haven't discussed much is the need to strip information.
Shaun: For ITS 2.0 we use DocBook and Mallard. Before we had tools, the translators had to work directly in those files.
... Our translators use PO files.
<fsasaki> ACTION: phil to move provenance forward (off-line discussion at prague f2f) [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action07]
<trackbot> Created ACTION-227 - Move provenance forward (off-line discussion at prague f2f) [on Phil Ritchie - due 2012-10-02].
Shaun: Colleague created XML2PO, but it created problems for us in some ways (despite being a step forward). There were issues for us concerning how to map the XML structure to PO.
... I redid this as ITS Tool when I discovered it.
... ITS couldn't provide all the information needed by PO. We added a number of extensions, some of which have now gone into the ITS 2.0.
Shaun: ITS tool ships with a set of rules and uses them to parse files.
<fsasaki> Arle, maybe for the slides: ITS tool ships with a set of default rules for various formats and uses these for PO file generation
Phil: I'm going to show our work
... Our system is something like CheckMate, doing automated checks. We added a browser client that works both off and online, using AJAX to post back to a server, capturing provenance.
... Allowed use of audit trails to find quality problems in other documents.
... Tool focuses on sentences where we expect there may be problems.
... Allow tagging error types in the UI. The process alters the DOM in HTML and puts the errors into stand-off markup.
... By editing the DOM, we can save the file with the markup.
... It doesn't require copying and pasting.
Des: What are the constraints? Can you use any HTML file?
Phil: It's browser-independent. It doesn't have any dependencies because when we do the transformation from XLIFF everything is wired into the file and all you have to do is reference some standard
... Everything is embedded in the HTML5 when it is converted from XLIFF.
Dave: Will discuss simple MT.
<fsasaki> "The jar file contains sample main() entry points:"
Pedro: With MT there should also be CAT tools and human at the segment level. What strategy did you take to addressing metadata that applies to more than one segment/level?
<fsasaki> above library can be used not only for validation, but also for parsing and e.g. creating various serializations
Dave: Before we call the service, we have to do a full parse down to the segment level.
Pedro: In our case we don't do the segmentation. The CAT tool does, because it has to be consistent with the TM.
... It is an external service to us.
Dave: We do it because we want to focus on the MT and still have control. But we aren't working with a CAT tool.
David: It's a small loop here, so we can do it this way. But in a bigger process, you have to make sure these things are handled appropriately early on. You will need ways to reverse the process too, at the end.
Pedro: Some things are handled at the segment level, but others apply to the document or sections.
David: In some cases segment-by-segment is too slow.
... You won't want to rely on the MT system for segmentation if you have to use TM.
Declan: We need to know whether the MT service would ever get a full document or whether it would only get pieces. In the past we have usually dealt with sub-paragraph segments.
Felix: Domain-mapping here used space separated rather than comma-separated. We need to make sure there is consistency here.
Yves: I wanted to know how to map domains in HTML. The problem was the format of the keywords in META. Currently we point to a node and expect a string to map to it, but we don't have an internal syntax for the contents. We need to specify this.
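Note: the mapping problem Yves raises can be illustrated with a minimal sketch. It is not the Okapi implementation and does not reflect any decision of the group. It assumes the keywords META content is comma-separated (the separator is a parameter, since that is exactly the open point), and the names `KeywordsParser` and `map_domains` and the mapping table are hypothetical.

```python
from html.parser import HTMLParser


class KeywordsParser(HTMLParser):
    """Collect the tokens of <meta name="keywords" content="..."> elements."""

    def __init__(self, separator=","):
        super().__init__()
        self.separator = separator  # assumption: the syntax is still under discussion
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "keywords":
            return
        content = attributes.get("content") or ""
        for token in content.split(self.separator):
            token = token.strip()
            if token:
                self.keywords.append(token)


def map_domains(html, mapping, separator=","):
    """Map keyword tokens to consumer domain names; unmapped tokens pass through."""
    parser = KeywordsParser(separator)
    parser.feed(html)
    domains = []
    for keyword in parser.keywords:
        domain = mapping.get(keyword.lower(), keyword)
        if domain not in domains:  # keep first occurrence, drop duplicates
            domains.append(domain)
    return domains


if __name__ == "__main__":
    page = '<head><meta name="keywords" content="automotive, criminal law, medical"></head>'
    print(map_domains(page, {"automotive": "auto", "medical": "medicine"}))
```

With a comma separator a multi-word keyword such as "criminal law" survives as one token. With a space separator it would be split in two, which is why the internal syntax needs to be specified.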
David: Talking about XLIFF used to provide CMS-TMS roundtrip.
... Proxy problem means we can't show the demo.
... We initiate projects on the CMS. Want to show examples of how the XLIFF half works.
... Note this is nothing like a traditional TMS. It is a service-oriented architecture. Previously it had an unrestricted number of specialized agents.
... It routes XLIFF between the specialized agents.
... It is a closed localization loop, before the CMS enters.
... The idea is to use this modularized system based on XLIFF I/O.
David: Sometimes the dumb components need to be clever.
... We can start processes from an arbitrary XLIFF file, or from Okapi.
... We work with Moravia and M4Loc (Moses). Moses uses text only, but M4Loc adds XLIFF capabilities for Moses. We then pass on MT-relevant metadata. They will add support to the M4Loc project.
... We might want to add support for provenance that Yves doesn't need. For example, if we want to integrate multiple MT systems, we would need that capability.
Yves: We need a consistent way of mapping the data categories to XLIFF.
Dave: The co-chairs need to take the lead in this.
David: Does this belong to XLIFF or ITS? Maybe this is a good reason why Moritz and Pedro did not use XLIFF for an exchange mechanism.
... We need a single XLIFF+ITS method.
Felix: Once the metadata is stable in November, we need to deal with this. We can publish as many best practice documents as we want, so we can have an ITS to XLIFF mapping.
Moritz: I'd like to make a case for readiness. We need to provide a way for the user to be able to trigger processes upon certain conditions. For example, we send things off to Enrycher, Linguaserve.
... Even if readiness isn't a data category, it should be a best practice to help smaller enterprises.
... We let users add local metadata.
Dave: Is that an existing HTML editor?
... We have trouble knowing how to make translate global for the end user in an intelligible fashion.
Felix: Is localization note only global for the whole document?
Moritz: for the content node, yes.
Felix: That doesn't let you mark pieces of nodes.
Moritz: We've not implemented that but it's something to consider.
... Implementing all this required "breaking Drupal's back a bit". It's still a bit too complex, but we're working on this.
... Our process in the CMS should be linked to best practice for readiness.
Olaf-Michael: Does it compare metadata in source and target?
Moritz: It's half automatic at this point. We need to see what we can leave in.
Serge: This targets Drupal, but what about the other 1200+ CMS products?
Felix: Because we don't have infinite funding, we are focusing on an open-source CMS, hoping that it can be reused. This is just the start and we want it in open source.
Moritz: We will provide these as Drupal modules for others to use.
Yves: The interface with translation will be standardized, and not tied to Linguaserve?
Moritz: For the showcase, we are focusing on Linguaserve, but we will go wider.
Felix: Adjourn for today.
Scribes: variousToo, daveL, Arle
Agenda: http://www.w3.org/International/multilingualweb/lt/wiki/PragueSep2012#25_Sept:_MLW-LT_WG_meeting_agenda
Date: 25 Sep 2012
Minutes: http://www.w3.org/2012/09/25-mlw-lt-minutes.html
People with action items: df, felix, pedro, phil, tadej, tatiana