See also: IRC log
felix: this morning we will go through some basic parts of the document
<fsasaki> http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#introduction
felix: starting with introduction to specification
... to look for changes that are needed
... read intro to section 1
... need to add reference for HTML5
... has references to the ITS Requirements and Localizable DTDs documents, which influenced this document
... and references a potentially unwritten best practices document
<fsasaki> http://www.w3.org/2011/12/mlw-lt-charter.html
felix: but what does this mean?
... In context of workplan, after feature freeze we have time to add best practice document
... change the reference to a stable wiki page for best practices
... section 1.1, relation to ITS 1.0 and new principles
... outlines what the principles need
felix: notes that additional horizontal features need not be implemented for ITS 1.0 data categories
Yves: asks if we still therefore need test suite for ITS1.0 data categories
daveL: yes, for completeness, for those not referencing ITS 1.0
felix: give brief outline of what it means to be conformant to ITS, with reference to test suite
<fsasaki> http://phaedrus.scss.tcd.ie/its2.0/its-testsuite.html#translate-local-host
<fsasaki> http://phaedrus.scss.tcd.ie/its2.0/expected/translate/xml/translate4XmlOutput.txt
felix: tomorrow we need to look at this in more detail
omstefanov: it seems the ITS1.0 requirement may be redundant
shaunm: this indicates that ITS2.0 encompasses ITS1.0
<fsasaki> "Where ITS 1.0 data categories are implemented in XML, the implementation must be conformant with the ITS 1.0 approach to XML to claim conformance to ITS 2.0."
pedro: HTML5 adds new features
<fsasaki> "ITS 2.0 is backwards compatibly with ITS 1.0 in terms of ITS mechanisms"
<omstefanov> suggest rephrasing that to ""ITS 2.0 is backwards compatible with ITS 1.0 in terms of ITS mechanisms"
felix: so this last bullet of 1.1.1 will update to this
... section 1.1.2, new principles
... in first bullet, drop reference to RDFa and NIF, since these are not the formats for conformance
... RDFa and NIF status are correctly referenced in second bullet; they are a possible output option
... third bullet clarifies the need for XPath 1.0, with new mechanisms for other queries, i.e. CSS and later XPath versions
... but there seems no interest in CSS as a selector language, so we might drop it
phil: may be using CSS selector in our implementation
felix: so we may keep it, as it is optional
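For illustration, the optional CSS selector mechanism discussed here would appear in a global rules file roughly as follows (a sketch only; the rule shown is illustrative, but the queryLanguage attribute is the ITS 2.0 mechanism for choosing the selector language):

```xml
<!-- Sketch: ITS 2.0 global rules using CSS selectors instead of
     the default XPath 1.0 via the optional queryLanguage="css". -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           version="2.0" queryLanguage="css">
  <!-- Mark all code elements as non-translatable -->
  <its:translateRule selector="code" translate="no"/>
</its:rules>
```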
felix: list of new data categories needs to be updated, with reference to table
... now review text in section 1.2
<fsasaki> "The increasing usage of XML as a medium for documentation-related content (e.g. DocBook and DITA as formats for writing structured documentation, well suited to computer hardware and software manuals)": should mention also HTML5
jirka: need to review the last paragraph related to XML
felix: agree, this needs a rewrite
olaf: can we continue refining this after the meeting
jan: would be helpful to reference other documents
... a question about directionality, is vertical text being discussed?
felix: this is being discussed elsewhere, in CSS for Asian layout
olaf: suggests adding vertical layout to the list of example languages, by referring to Japanese
felix: agrees, and adds reference to best practice document on Japanese
... discusses examples
... but it would be good to have some HTML examples as well as XML in this section
shaun: seems harder to come up with examples with both human- and machine-readable aspects
dave: it would be good to have some real industrial content for examples
des: there is no mention of XLIFF, is that deliberate?
dF: XLIFF isn't a source format in the same way that XML and HTML5 are
felix: but, for example, Yves processes many XML formats as XLIFF
Yves: agrees
df: need to be careful defining XLIFF binding, since this may impinge on the scope of the XLIFF TC
daveL: suggest mentioning multilanguage and bitext files
df: this would be better in the usages section
felix: agrees - we can have a section in 1.3 focussed on XLIFF
... currently we have users identified as schema developers, schema managers, vendors of tools
... need to add localisation workflow managers
<fsasaki> "1.3.1.5" workflow process manager
<scribe> ACTION: dF to add section 1.3.5 on usage by localisation workflow managers [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action01]
<trackbot> Created ACTION-222 - Add section 1.3.5 on usage by localisation workflow managers [on David Filip - due 2012-10-02].
felix: another group on the table but not mentioned, that is people working with terminology and language technology
dF: there might be two, one for terminology and one for language technology
... so there is a bridge to open data and ontologies and also terminologists
jan: are we regarding these text analytics as separate services
<scribe> ACTION: Tatiana to draft text for terminology user with Tadej [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action02]
<trackbot> Sorry, couldn't find user - Tatiana
df: we should look at the use of data categories in terminology lifecycle
<scribe> ACTION: tadej to provide section on text analytics [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action03]
<trackbot> Created ACTION-223 - Provide section on text analytics [on Tadej Štajner - due 2012-10-02].
<scribe> ACTION: pedro to provide a section of MT service provider as user [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action04]
<trackbot> Created ACTION-224 - Provide a section of MT service provider as user [on Pedro Luis Díez Orzas - due 2012-10-02].
<Tatiana> Tilde could also contribute to the MT service part as the consumer of ITS
felix: section 1.3.2, explains the use of global and local selectors
<Tatiana> I mean, as a support to Pedro's paragraph ;)
pedro: this section should explain a bit more clearly how meta data can be produced and consumed by different actors or processes
felix: perhaps revise example from the use cases being shown today
... 1.3.2 ways to use ITS
... needs to still address how to extend schema, but also how to work with existing formats
... in particular with HTML5
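The global/local distinction discussed in 1.3.2 can be sketched for HTML5 as follows (illustrative content; the standard HTML5 translate attribute carries the Translate data category locally, while other categories use its-* attributes such as its-loc-note):

```html
<!-- Sketch: ITS 2.0 local markup in an HTML5 document. -->
<!DOCTYPE html>
<html lang="en">
  <head><meta charset="utf-8"><title>Example</title></head>
  <body>
    <!-- Native HTML5 translate attribute: keep the shortcut as-is -->
    <p>Press <code translate="no">Ctrl+C</code> to copy.</p>
    <!-- ITS local attribute carrying a note for the translator -->
    <p its-loc-note="Slogan - keep the informal tone.">Do it your way.</p>
  </body>
</html>
```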
felix: now we will review specific data categories
tadej: summarises the changes to disambiguation
...
concerned with superfluous information and also the lack of RDF bindings for several
existing lexical repositories
... but encouraging this behaviour in repositories is a
big issue.
... Also added disambiguation level.
... Also generalised entity type
to more general target type
... current issues discussed on mailing list.
... one is that the type can be inferred from the link
... but keep disambig level as
optional, but allow it also to be inferred from disambig ident
... also make 'target' more specific by renaming it to 'disambiguation target'
... Also, wording needs some work,
to make it both accessible and also accurate.
jirka: comment on example that disambig level should just be literals, so don't need 'its:' prefix
<scribe> ACTION: Tadej to update disambiguation to change name of target type and to remove level value prefix [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action05]
<trackbot> Created ACTION-225 - Update disambiguation to change name of target type and to remove level value prefix [on Tadej Štajner - due 2012-10-02].
arle: suggests use of an alternative to target, using 'category' instead, or 'class', i.e. its-disambig-class-ref
daveL: does 'level' make sense
Tadej: yes, well understood in language processing circles
phil: perhaps use category or type
tadej: perhaps use 'granularity'
felix: suggests handling these changes and also the descriptive text in a breakout session tomorrow with Arle
daveL: suggest to supplement introductory description with an example
tadej: agrees
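Such an introductory example might look like the following sketch. Note the attribute names here follow the draft terminology under discussion (identifier reference plus the "level"/"granularity" value) and were still subject to the renamings debated above:

```html
<!-- Sketch only: disambiguation markup per the draft discussed here;
     attribute names and the level/granularity value are illustrative
     and were still being renamed at this point. -->
<p>
  <span its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
        its-disambig-granularity="entity">Dublin</span>
  is the capital of Ireland.
</p>
```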
felix: not time now for breakout, so perhaps introduce some other topics
... tool identification is one issue, yves to summarise
Yves: we have some data categories where there is some data that is at a document level and some that is local, e.g. at every segment
... so agreed override is always complete, but still want this orthogonal tool id feature
... felix suggested a separate format based on OLIF for this
<fsasaki> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Sep/0160.html
Yves: but own opinion that this might be a bit complex, and an in-document way of identifying tool would be attractive
felix: this definitely needs a breakout session
dF: indicates he will lead this breakout
felix: examples in the spec - this needs some work and shaun volunteered to look at that
... we also need schema fragments to integrate into XML and HTML5 (jirka's action)
... we will have a breakout session on provenance tomorrow, led by dave. Later this topic will be handed over to Phil, though he is leaving early
pedro: presents a quick overview of use of readiness
<Yves_> proposal is attached here: http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Sep/0025.html
pedro: the advantage of this is that the client is more independent from providers
... there is a concrete need for this, but nowhere to put this
jan: invites us to look at microsoft translator API that offers some potential for this
<Arle> Scribe: Arle
Felix: This next section is to show to the project officer that we are making progress.
... Arle will fill in templates to show what we are doing.
<fsasaki> presentation from yves
Yves: Question about what to do with multiple keywords.
... Conducted a demo showing that non-translatable content was in fact not translated.
... Showed slide on Translation Package Creation
... its:storageSizeEncoding provides information not otherwise available in XLIFF 1.2 concerning the encoding.
scribe: Third use case: Moses Translation (M4Loc). Essentially
identical to the case with Microsoft Translator.
... (Used imitation of M4Loc in the
demo)
... Last use case is a bit different. It uses the categories after extraction,
not to make a kit, but to use them directly, to validate things. I hope to add locQuality
later.
... This is quality check. It uses the same extraction mechanism and preserve
space is important. Need id value.
... Finds problems in source as well as
target.
... The UI of CheckMate lets you decide whether to use the ITS categories in
some cases.
Felix: Question: The M4Loc bit was made up, didn't actually use
Moses. Is it something we could leverage since this is a workflow that does half the
job?
... I'm just wondering if we can use this with Moses.
Milan: I think we could change the M4Loc process to use ITS and it will be very helpful.
Des: Storage Size was an example. Does it just get propagated through to the translator?
Yves: Yes. CheckMate doesn't modify the file. We could allow
that.
... For allowed characters, we don't use the schema. We use a subset in Java
Regex. I don't intend to support the entire XML regex. It's a dependency we don't
want.
... We do everything else with it, but if you use more of a regex than what we
can handle, you will get an error.
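For context, the Allowed Characters data category discussed here expresses the permitted content as a regular-expression character class, which is the part that maps cleanly onto Java regexes. A sketch with an illustrative rule:

```xml
<!-- Sketch: an ITS 2.0 allowedCharactersRule restricting the content
     of code elements to a simple ASCII character class (the selector
     and class shown here are illustrative). -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
  <its:allowedCharactersRule selector="//code"
                             allowedCharacters="[a-zA-Z0-9 ]"/>
</its:rules>
```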
Jirka: I think there is a Saxon library that might convert this. You should look into it.
Felix: Is there a concrete action following for M4Loc from this?
Milan: It looks much easier now, so we should analyze the new version of these tools.
Yves: You'll get HTML5 support by going this route.
... We
can also add information about the domain. It might be useful for choosing the process in
MT.
David: There is a potential to expand what M4Loc parses. Not
just inlines, but the domain would be an obvious thing. Property bugs could be another
thing. It depends on the MT consumer.
... Asia online could consume property bugs. It
would be nice to add terminology and entity markup in M4Loc.
Declan: We might be able to releverage some of the M4Loc stuff in what we are doing to avoid duplication of effort.
David: It would be great if you could consume it.
Felix: You don't need a separate filter for translate from Okapi as long as you can consume it.
David: Yves is working on the XLIFF 2.0 library, which will make switching easy when the time comes for it.
Yves: We do have some XLIFF 2.0 stuff done. But we don't want to fall back on everyone using Okapi because we need several implementations. It helps make the standard better by seeing what problems they run into. It is important to have multiple implementations.
Felix: That's not a W3C process question: We can have "fake"
implementations, but we need real ones.
... We didn't address the keyword mapping
topic. Let's put that down for later.
<fsasaki> ACTION: felix to come back to keyword mapping issue in domain [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action06]
<trackbot> Created ACTION-226 - Come back to keyword mapping issue in domain [on Felix Sasaki - due 2012-10-02].
Milan: Is there a new version of Okapi with this?
Yves: The HTML5 branch in the GIT repository has it.
Des: Will it move into the dev branch?
Yves: Later on.
<Jirka> https://github.com/kosek/html5-its-tools
Arle: can it convert back from XHTML to HTML5?
Jirka: Not currently, but it wouldn't be hard.
Felix: It might be useful to Pedro if it did.
Shaun: If there is no ITS target information in the target file,
do you have to convert back?
... It should take only a few lines of XSLT. It's not
difficult.
Pedro: The transition to HTML5 will take some time and this will help.
Yves: This was *extremely* useful to me. If you are working with Java, using validator.nu is the natural way.
Felix: This validator.nu is used by the W3C's own validator.
Jirka: for HTML5+ITS there is web and command line versions. If there is interest, I can make it accessible through university website when stable.
Felix: This will become part of the W3C validator once stable.
Jirka: Before that, I can find a server and make it
available.
... It will help us catch typos in examples.
Felix: For ITS 1.0 you made Schematron rules to check all sorts
of things. I'm not sure if people are familiar with that.
... See the link I posted.
These are checks that go well beyond schema checks.
... E.g., cooccurrence
constraints, etc.
... Could the Schematron be integrated into the W3C validator?
Jirka: I'll need to check on that.
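A cooccurrence check of the kind Felix mentions might look like the following Schematron sketch (an illustrative constraint, not necessarily one of Jirka's actual rules): a locNoteRule must carry either an inline note or a pointer, but not both, which plain grammar-based schemas cannot express.

```xml
<!-- Sketch: Schematron cooccurrence constraint on its:locNoteRule. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:ns prefix="its" uri="http://www.w3.org/2005/11/its"/>
  <sch:pattern>
    <sch:rule context="its:locNoteRule">
      <!-- Require one of the two ways to supply the note -->
      <sch:assert test="its:locNote or @locNotePointer">
        locNoteRule needs a locNote element or a locNotePointer attribute.
      </sch:assert>
      <!-- Forbid supplying both at once -->
      <sch:report test="its:locNote and @locNotePointer">
        locNote and locNotePointer must not be used together.
      </sch:report>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```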
Pedro: This features Drupal integration with Cocomore for the showcase.
Felix: These are hand-made examples for now, right?
Mauricio: Yes.
... The implementation of translate allows
CAT tool users to see the content, but not to change it.
Felix: When will there be a prototype?
Pedro: Here there are three parts. The first is the Drupal
connection. We have checked our web service. Today or tomorrow I hope that we can ramp up
but it has been tested.
... The second is the engine for normalization. That will be
done in October, in a couple of weeks.
... The third are the effects in the
localization platform. Everything has to be ready before the end of December.
Felix: If you look at the description of work we have until next
year. But see how Yves is implementing while we are defining and providing feedback.
... You are working in a waterfall mode, waiting for the definition to be complete. For example, the <meta> tag has content, so it wouldn't work. The waterfall model wouldn't catch that early on; you don't see the errors until later on.
... I hope you
can move towards Yves' model to catch errors early on.
<philr> ITS 2.0 Specification says that Provenance category will be updated in next version of the spec. Is this still the case? Has Provenance category been dropped?
Felix: It is really useful to use a feature prototype model.
Felix: We need to start contributing test cases.
Jan: That will help those interested in that to start getting involved.
Des: In the first use case, why did you go first to XML, then to
XLIFF, then to HTML5? HTML5 doesn't seem to be an optimized interchange format?
...
When you don't have a CMS, there are valid reasons to use HTML5. But when you do, why not go
straight to XML?
... You obviously have a reason since you considered them.
Moritz: We started with XML, moved to XLIFF, and that was hard. And then Felix asked for more HTML5 implementations, so we thought we'd try that. We found XLIFF was a pain, so we could move back to XML.
Felix: While authors may want to work with HTML5, internally use
what works best. I don't think corporations are using HTML5-based workflows right now.
... I've seen examples of XLIFF, but see what works for you. Make sure it is useful for you
internally.
Des: It seems to me that this is going Publishing → Localization → Publishing by using HTML5. It may work for you though.
Dave: Jan told us yesterday, however, that more authoring is in HTML5.
Pedro: Normally we have discussion between integrators, the client, and us. Perhaps in that case someone would have asked why we use HTML for a roundtrip like this.
Felix: It wouldn't work without HTML5 support, but we didn't discuss any specific HTML5 application. Yves showed how HTML5 could enter the chain, be converted to XLIFF, etc.
... But I'm not sure if HTML5 should serve for the whole
chain. XLIFF would seem to make more sense.
David: HTML5 lacks the mechanism for bitext translations.
... I thought Tektronix donated their XLIFF-to-Drupal extractor to an open-source project,
so this was taken care of.
Felix: You don't have to use HTML5, so please look at it and do what you need to that makes sense.
Des: I think that we need to distinguish between authoring and
publication formats on the one hand and interchange formats on the other. We need to
consider what is best practice.
... There is a lot that isn't possible in HTML5. I
think we need to consider what is best practice and what we should promote.
Dave: Smaller clients running their own websites might have only
an off-the-shelf Drupal and don't want to set up XLIFF and so forth.
... So that is
one market, different from the enterprise client.
Felix: You can consider using XLIFF in your process, or might continue as you are and make it clear where your workflow applies with a good description.
Dave: We need clear business cases.
Moritz: Mauricio and I should knock this out tonight.
Felix: Include David F. in this discussion.
Pedro: Concerning readiness, there are a few of us who see this as very useful (Dave, Yves, Cocomore, and us). In the case that you can choose where to put that information, the decision is more political than technical.
... In the use case of HTML with
no API, wrapper, etc., you might put it right in the HTML material.
... We need to
push this hard right now since it needs to be ready by November.
Felix: Let me point to what Yves and Shaun did: they implemented
features they liked and discussed them in the ITS discussion forum. Some of their ideas are
now being implemented.
... Implement things, but not privately, even if they don't
make it into ITS 2.0, so that others can see them.
... One reason for an
implementation-driven approach is that it allows people to see what is being thought of and
tried.
David: I see why you want readiness in HTML5, but most clients don't want that information published.
Dave: One thing we haven't discussed much is the need to strip information.
Shaun: For ITS 2.0 we use DocBook and Mallard. Before we had
tools, the translators had to work directly in those files.
... Our translators use PO
files.
<fsasaki> ACTION: phil to move provenance forward (off-line discussion at prague f2f) [recorded in http://www.w3.org/2012/09/25-mlw-lt-minutes.html#action07]
<trackbot> Created ACTION-227 - Move provenance forward (off-line discussion at prague f2f) [on Phil Ritchie - due 2012-10-02].
Shaun: A colleague created xml2po, but it created problems for us in some ways (despite being a step forward). There were issues for us concerning how to map the XML structure to PO.
... I redid this as ITS
Tool when I discovered it.
... ITS couldn't provide all the information needed by PO.
We added a number of extensions, some of which have now gone into the ITS 2.0.
Shaun: ITS tool ships with a set of rules and uses them to parse files.
<fsasaki> Arle, maybe for the slides: ITS tool ships with a set of default rules for various formats and uses these for PO file generation
Phil: I'm going to show our work on
review.
... Our system is something like CheckMate, doing automated checks. We added a
browser client that works both off and online, using AJAX to post back to a server,
capturing provenance.
... Allowed use of audit trails to find quality problems in
other documents.
... Tool focuses on sentences where we expect there may be
problems.
... Allow tagging error types in the UI. The process alters the DOM in HTML
and puts the errors into stand-off markup.
... By editing the DOM, we can save the
file with the markup.
... It doesn't require copying and pasting.
Des: What are the constraints? Can you use any HTML file?
Phil: It's browser-independent. It doesn't have any dependencies
because when we do the transformation from XLIFF everything is wired into the file and all
you have to do is references some standard JQuery/JavaScript libraries.
... Everything
is embedded in the HTML5 when it is converted from XLIFF.
<fsasaki> http://about.validator.nu/htmlparser/
<fsasaki> "The jar file contains sample main() entry points:"
<fsasaki> above library can be used not only for validation, but also for parsing and e.g. creating various serializations
Dave: Will discuss simple MT.
Pedro: With MT there should also be CAT tools and human at the segment level. What strategy did you take to addressing metadata that applies to more than one segment/level?
Dave: Before we call the service, we have to do a full parse down to the segment level.
Pedro: In our case we don't do the segmentation. The CAT tool
does, because it has to be consistent with the TM.
... It is an external service to
us.
Dave: We do it because we want to focus on the MT and still have control. But we aren't working with a CAT tool.
David: It's a small loop here, so we can do it this way. But in a bigger process, you have to make sure these things are handled appropriately early on. You will need ways to reverse the process too, at the end.
Pedro: Some things are handled at the segment level, but others apply to the document or sections.
David: in some cases segment-by-segment is too slow.
...
You won't want to rely on the MT system for segmentation if you have to use TM.
Declan: We need to know whether the MT service would ever get a full document or whether it would only get pieces. In the past we have usually dealt with sub-paragraph segments.
Felix: Domain-mapping here used space separated rather than comma-separated. We need to make sure there is consistency here.
Yves: I wanted to know how to map domains in HTML. The problem was the format of the keywords in META. Currently we point to a node and expect a string to map to it, but we don't have an internal syntax for the contents. We need to specify this.
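A global domain rule pointing at the HTML keywords meta element might look like this sketch (the mapping values are illustrative). The open issue Yves raises is precisely how the pointed-to content string is tokenized, space- vs. comma-separated:

```xml
<!-- Sketch: an ITS 2.0 domainRule pulling domain values out of
     <meta name="keywords"> in an HTML document; how the content
     string is split into keywords was the open issue. -->
<its:rules xmlns:its="http://www.w3.org/2005/11/its"
           xmlns:h="http://www.w3.org/1999/xhtml" version="2.0">
  <its:domainRule selector="/h:html/h:body"
      domainPointer="/h:html/h:head/h:meta[@name='keywords']/@content"
      domainMapping="automotive auto, medical medicine"/>
</its:rules>
```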
David: Talking about XLIFF used to provide CMS-TMS
roundtrip.
... Proxy problem means we can't show the demo.
... We initiate
projects on the CMS. Want to show examples of how the XLIFF half works.
... Note this
is nothing like a traditional TMS. It is a service-oriented architecture. Previously it had
an unrestricted number of specialized agents.
... It routes XLIFF between the
specialized agents.
... It is a closed localization loop, before the CMS enters.
... The idea is to use this modularized system based on XLIFF I/O.
David: Sometimes the dumb components need to be clever.
... We can start processes from an arbitrary XLIFF file, or from Okapi.
... We work
with Moravia and M4Loc (Moses). Moses uses text only, but M4Loc adds XLIFF capabilities for
Moses. We then pass on MT-relevant metadata. They will add support to the M4Loc
project.
... We might want to add support for provenance that Yves doesn't need. For
example, if we want to integrate multiple MT systems, we would need that capability.
Yves: We need a consistent way of mapping the data categories to XLIFF.
Dave: The co-chairs need to take the lead in this.
David: Does this belong to XLIFF or ITS? Maybe this is a good
reason why Moritz and Pedro did not use XLIFF for an exchange mechanism.
... We need a
single XLIFF+ITS method.
Felix: Once the metadata is stable in November, we need to deal with this. We can publish as many best practice documents as we want, so we can have an ITS to XLIFF mapping.
Moritz: I'd like to make a case for readiness. We need to
provide a way for the user to be able to trigger processes upon certain conditions. For
examples, we send things off to Enrycher, Linguaserve.
... Even if readiness isn't a
data category, it should be a best practice to help smaller enterprises.
... We let
users add local metadata.
Dave: Is that an existing HTML editor?
Moritz: Yes.
... We have trouble knowing how to make
translate global for the end user in an intelligible fashion.
Felix: Is localization note only global for the whole document?
Moritz: for the content node, yes.
Felix: That doesn't let you mark pieces of nodes.
Moritz: We've not implemented that but it's something to consider.
... Implementing all this required "breaking Drupal's back a bit". It's still a bit too complex, but we're working on this.
... Our process in the CMS should
be linked to best practice for readiness.
Olaf-Michael: Does it compare metadata in source and target?
Moritz: It's half automatic at this point. We need to see what we can leave in.
Serge: This targets Drupal, but what about the other 1200+ CMS products?
Felix: Because we don't have infinite funding, we are focusing on an open-source CMS, hoping that it can be reused. This is just the start and we want it in open source.
Moritz: We will provide these as Drupal modules for others to use.
Yves: The interface with translation will be standardized, and not tied to Linguaserve?
Moritz: For the showcase, we are focusing on Linguaserve, but we will go wider.
Felix: Adjourn for today.