Re: [Issue-41][Action-190] Draft a section about mtConfidence, based on the discussion from Dr. David Filip on 2012-08-09 (public-multilingualweb-lt@w3.org from August 2012)

From: Dr. David Filip <David.Filip@ul.ie>
Date: Thu, 9 Aug 2012 11:16:37 +0100
To: Yves Savourel <ysavourel@enlaso.com>
Cc: public-multilingualweb-lt@w3.org
Message-ID: <CANw5LKkJAH64nV+=Am9jBXuj9m1e6OA5hmP_+xBsx-Zc4dmEJA@mail.gmail.com>
Thanks Yves, inline again

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie



On Thu, Aug 9, 2012 at 4:31 AM, Yves Savourel <ysavourel@enlaso.com> wrote:

> Thanks for the explanations David.
>
>
> > the XOR just includes slightly different header
> > as another example inserted in the example
>
> IMO examples should be straight real files, that we can actually process.
> If we want to show two ways to do something we should use two separate
> examples.
>

I said I am going to split the example. No need to keep digging here. I
used the XOR under time pressure to indicate that it is not part of the same
example, just as a first-draft measure. I understand this needs to be
split, and it will be.

>
>
> > The value "en-t-cs" follow the t extension
> > syntax from BCP 47, so it means English transformed
> > from Czech. I am aware that the usual MT pair
> > convention is the other way round (I use this
> > convention in the private string examples), but I
> > thought that the t extension would find valid usage here..
>
> I see that now. So none of my notes about text that should be translated
> stand.
>
> This said, I'm not sure using the t extension is a good way to identify an
> 'engine'. In addition to being counterintuitive, we don't really intend to
> standardize that value, do we? So I suppose one example can use it, but
> we could have several other examples maybe.
>
The XOR example (that will be forked into a separate example) from above
shows a possible private string that would use the more "natural"
convention.
I am not shooting for a standard solution for the mtEngine string, and I do
not love the extension-overloaded BCP 47. I just think that the t extension
might come in really handy here, as it is the only standardized way to
simply give an ordered pair. The fact that I do not like a standard
should not prevent me from using it if it is relevant and does the job.
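
To illustrate the ordered pair carried by a value such as "en-t-cs", here is a
minimal sketch that reads it back as (source, target). The function name
parse_mt_pair is hypothetical, and the parsing is deliberately simplified: it
treats everything before the "t" singleton as the target language and
everything after it, up to the first field key or further singleton, as the
source language. It is not a full BCP 47 parser.

```python
import re

# In the t extension, a field key is a letter followed by a digit (e.g. "m0").
_TFIELD_KEY = re.compile(r"^[a-z][0-9]$")


def parse_mt_pair(tag):
    """Return (source, target) for a tag such as "en-t-cs".

    Hypothetical helper: assumes a well-formed tag with a t extension.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags[1:]:
        raise ValueError(f"no t extension in {tag!r}")
    t_index = subtags.index("t", 1)
    target = "-".join(subtags[:t_index])

    source_parts = []
    for sub in subtags[t_index + 1:]:
        # Stop at a field key or at the next singleton extension.
        if _TFIELD_KEY.match(sub) or len(sub) == 1:
            break
        source_parts.append(sub)
    if not source_parts:
        raise ValueError(f"t extension in {tag!r} has no source language")
    return "-".join(source_parts), target


source, target = parse_mt_pair("en-t-cs")
print(source, "->", target)
```

Note that the tag lists the target first, so the function swaps the order to
return the pair in the usual MT convention (source, then target).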

>
> > I do not understand this part at all. MT candidate translations
> > are always 100% matches in the terms of TM matching.
> > The self-reported confidence expresses what might be the
> > chance that the 100% match is accurate/usable.. I do not
> > think we need a combined value here. And this is also a
> > reason why XLIFF would need a separate mechanism for
> > reporting the confidence, we could not overload the normal
> > match rate..
>
> I guess the point I was making was that Bing doesn't provide 0-100%
> confidence score. So if we use this as an example we should explain how we
> get it. Or use another example.
>
Let us wait and see what Chris and/or other MT producers say here. The initial
feedback from Chris and from Declan was that a single figure will cut it. If
MT producers ask for splitting, split it will be.

>
>
> >> I'm not understanding why it's there. I think you
> >> mean that the global rule must not use that attribute.
> >> Then just don't say anything. If it's not listed it
> >> cannot be used (it's just not an attribute of
> >> <mtConfidenceRule>)
> >
> > It is true, still in my experience redundancy serves
> > the purpose of absolute clarity
>
> As a developer I'm utterly confused to see a mention of an attribute that
> does not exist in that element.
>

I will transform it into a Note, something like: "Please note that
mtConfidenceScore does not exist at the global level, and this is by design."

>
>
> > Well, the whole point is that the score is
> > worth nothing at all if you do not know what
> > the producer and engine are. I first thought that
> > GLOBAL does not make sense at all for confidence.
> > But later reintroduced GLOBAL for producer and
> > engine, as they are likely to be the same throughout
> > the whole document in many scenarios, so that
> > you can save lot of space not specifying them for
> > each and every segment. So mtProducer and mtEngine
> > are only optional at the local level if they
> > have been specified at the global level
>
> So your real goal is to have a value set for mtProducer and mtEngine when
> we have a local mtConfidenceScore. You don't really care how or where it is
> set, right?
> Then we should have defaults for those values. Validating if those
> attributes should be defined locally or not based on whether they are
> defined at a higher level is going to be very difficult to implement.
>

I am not sure what the defaults should be in order to cut this. If multiple
mtProducers and/or mtEngines default to the same values, this category
collapses as the confidence scores are NOT comparable among
producers/engines..
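
A minimal sketch of the resolution being discussed, with no defaults: local
values win, global values fill the gaps, and a score with no known producer or
engine is rejected rather than silently defaulted. The attribute names come
from the draft; representing them as plain dictionaries and the function name
resolve_mt_info are assumptions made only for illustration.

```python
def resolve_mt_info(local, global_rule=None):
    """Combine local attributes with global mtProducer/mtEngine values.

    Hypothetical helper: `local` and `global_rule` are plain dicts keyed by
    attribute name. mtConfidenceScore is only ever read from `local`.
    """
    global_rule = global_rule or {}
    resolved = {}
    for name in ("mtProducer", "mtEngine"):
        # A local value overrides the global one.
        value = local.get(name) or global_rule.get(name)
        if value is None:
            raise ValueError(f"{name} is set neither locally nor globally")
        resolved[name] = value
    resolved["mtConfidenceScore"] = local["mtConfidenceScore"]
    return resolved


header = {"mtProducer": "ExampleProducer", "mtEngine": "en-t-cs"}
segment = {"mtConfidenceScore": 0.78}
print(resolve_mt_info(segment, header))
```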

>
>
> > And there must be a processing requirement to
> > move them onto the segment level should the
> > header be separated during processing..
>
> Some formats may not allow those attributes elsewhere than the top of the
> document. But anyway, I don't think we can have such processing
> requirements for ITS. Default values solve all this as far as I can tell.
>
How? See above.

>
> I would also disagree that the score is worth nothing without the
> mtProvider and mtEngine values. Actually, in many scenarios knowing the
> provider or the engine means diddly squat to the end-users. They just care
> about the score.
>

It is worth nothing if you cannot discern among producers and engines. The
end user who does not understand this MUST NOT be exposed to values coming
from mixed engines/producers.
In other words, it is OK to DISPLAY SCORE ONLY TO THE END USER if you have
ensured upstream that the scores DO come from the same producer AND engine.
Again, I am not sure how to cut this with defaults, as the defaults would
collapse filtering.
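
The upstream check could look something like the sketch below: scores are
handed over for score-only display only when every segment reports the same
producer and engine. The function name scores_for_display and the dictionary
representation of a segment are assumptions for illustration, not part of the
draft.

```python
def scores_for_display(segments):
    """Return the bare scores if all segments share one producer and engine.

    Hypothetical helper: each segment is a dict with mtProducer, mtEngine
    and mtConfidenceScore already resolved.
    """
    origins = {(s["mtProducer"], s["mtEngine"]) for s in segments}
    if len(origins) > 1:
        # Scores from different producers/engines are not comparable.
        raise ValueError(f"mixed producers/engines: {sorted(origins)}")
    return [s["mtConfidenceScore"] for s in segments]


same_origin = [
    {"mtProducer": "ExampleProducer", "mtEngine": "en-t-cs", "mtConfidenceScore": 0.78},
    {"mtProducer": "ExampleProducer", "mtEngine": "en-t-cs", "mtConfidenceScore": 0.41},
]
print(scores_for_display(same_origin))
```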

>
>
> Cheers,
> -yves
>
>
>
Received on Thursday, 9 August 2012 10:17:44 UTC