Re: [Issue-41][Action-190] Draft a section about mtConfidence, based on the discussion

Thanks Yves, answers inline.

Dr. David Filip
=======================
LRC | CNGL | LT-Web | CSIS
University of Limerick, Ireland
telephone: +353-6120-2781
*cellphone: +353-86-0222-158*
facsimile: +353-6120-2734
mailto: david.filip@ul.ie



On Wed, Aug 8, 2012 at 10:58 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

> Hi David,
>
> A few notes:
>
>
> -- the element mtConfidence should be mtConfidenceRule to follow our usual
> pattern.
>
> -- I’m not sure I understand the example with:
>
> >  its:mtEngine="en-t-cs" />
> > XOR
>
The XOR just introduces a slightly different header, inserted into the
same example as an alternative.

> > <its:mtConfidenceRule selector="/text/body/p/"
> > its:mtProducer=”vanilla Moses”
> > its:mtEngine="medical:EN-ES_LA” />
> >    </its:rules
>
> Is it a typo or is 'disambiguation' (should be disambiguationRule BTW)
> involved?
> Also, why do we have two rules in the example?
>

That is a typo; mtConfidence[Rule] was intended. I will fix it before
posting again after the call tomorrow.

>
> Also that example says it's using a EN to ES_LA engine, but the text looks
> very English to me (for a 89.82% confidence that doesn't look good :)
>

The local markup applies to the first global example. I will split the
examples before submitting again.

>
> -- I would suggest to allow only one type of value not 0.0 to 1.0 or
> 0-100%.
>

I was not sure what the preference would be. Applications can display
whatever they want, so I have no real preference here. I would decide
whether or not to normalize to one format once we hear more from MT
producers.
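
To illustrate why the choice matters little to consumers, here is a minimal
Python sketch that accepts either notation under discussion (a 0.0 to 1.0
decimal or a 0-100% string) and returns one internal form. The function name
normalize_confidence is hypothetical and not part of any draft.

```python
def normalize_confidence(raw: str) -> float:
    """Return a confidence in the range 0.0 to 1.0 from either notation."""
    text = raw.strip()
    if text.endswith("%"):
        # Percentage notation, e.g. "89.82%".
        value = float(text[:-1]) / 100.0
    else:
        # Decimal notation, e.g. "0.8982".
        value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"confidence out of range: {raw!r}")
    return value
```

A display layer could then render the normalized value however it likes, for
example as a percentage or as a color band.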

>
> -- I'm not sure the paragraph:
>
> "MT confidence can be displayed on websites machine translated on the fly,
> by simple translation editors, and Computer Aided Translation (CAT) tools.
> To facilitate usage in CAT tools, the data category should be promoted for
> inclusion in the match element of XLIFF 2.0. MT Confidence MAY be displayed
> for human consumers as segment annotation or as color-coded font or
> background."
>
> Brings anything to the definition of the data category. Or maybe it could
> be re-worded and moved to the list of possible purposes.
>

It is not intended as part of the definition. I was closely following the
LocNote template, where the definition is followed by such an illustrative
description. No problem with moving it to the possible usage scenarios.

>
> I agree with the bit about XLIFF, but I don't think it should be noted in
> the specification.
>

I noted it so that it gets recorded, but I agree that it does not need to
be there.

>
>
> --- For the example:
>
> > <body>
> >   <p><span its:mtProducer=”Bing Translator” its:mtEngine=”en-t-cs”
> > its:mtConfidenceScore=”89.82%”>Dublin is the capital city of Ireland.</p>
> >   </body>
> > </text>
>
> The text should be in Czech not English.
>

The value "en-t-cs" follows the t extension syntax from BCP 47, so it means
English transformed from Czech. I am aware that the usual MT pair
convention is the other way round (I use this convention in the private
string examples), but I thought that the t extension would find valid usage
here.
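
As a rough illustration of how a consumer could read such a value, the
following Python sketch splits a tag into the content language and the source
language named after the "t" singleton. It is deliberately simplified: it
does not validate subtags, and it ignores private-use sequences. The function
name split_t_extension is hypothetical.

```python
import re

def split_t_extension(tag: str):
    """Return (content_language, source_language) for a BCP 47 tag.

    source_language is None when the tag carries no "t" extension.
    """
    subtags = tag.lower().split("-")
    if "t" not in subtags[1:]:
        return tag.lower(), None
    i = subtags.index("t", 1)
    content = "-".join(subtags[:i])
    source_parts = []
    for sub in subtags[i + 1:]:
        # Stop at another extension singleton or at a field key such as "m0".
        if len(sub) == 1 or re.fullmatch(r"[a-z][0-9]", sub):
            break
        source_parts.append(sub)
    return content, "-".join(source_parts) or None
```

With this reading, "en-t-cs" yields English as the content language and Czech
as the source, which is the reverse of the usual source-target pair notation.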

>
> Also, Jan can correct me, but I think the confidence for Bing Translator
> would be some combination of the MatchDegree and the Rating values it
> return. They certainly can be somehow mashed into a single value, but maybe
> we could use a more straightforward example?
>
I do not understand this part at all. MT candidate translations are always
100% matches in terms of TM matching. The self-reported confidence
expresses the chance that the 100% match is accurate/usable.
I do not think we need a combined value here. This is also a reason why
XLIFF would need a separate mechanism for reporting the confidence: we
could not overload the normal match rate.

>
>
> --- Example 32:
>
> The text should be in (hopefully) Czech.
>
See above: English is intended, CS was the source.

> Also there is no space between the two sentences.
> And the double quotes should be ASCII
>
> The sentence "Prague is the capital city of Prague in the Czech Republic."
> Is weird.
>
It is intended to be weird; it is an authentic Bing translation of "Praha je
hlavním městem České republiky." The fake confidence score is
correspondingly low. The engine is likely to be aware that the sentence
does not sound right :-0

>
> --- The text of the GLOBAL section says:
>
> "mtConfidenceScore MUST NOT be specified globally MUST NOT be specified
> globally."
>
> I'm not understanding why it's there. I think you mean that the global
> rule must not use that attribute. Then just don't say anything. If it's not
> listed it cannot be used (it's just not an attribute of <mtConfidenceRule>)
>
That is true; still, in my experience redundancy serves the purpose of
absolute clarity.

>
> --- In the LOCAL section the text says:
>
> "All of the following MUST be specified locally, UNLESS mtProducer and
> mtEngine have been specified globally."
>
> It's not clear if mtProducer and mtEngine are allowed or not. Also, I
> don't think we should have dependency like that: one may look at a
> paragraph without having access to the top of the document.
>
> I think we should simply say:
>
> - An mtConfidenceScore
> - An optional mtProducer
> - An optional mtEngine
>
Well, the whole point is that the score is worth nothing at all if you do
not know what the producer and engine are. I first thought that GLOBAL does
not make sense at all for confidence, but I later reintroduced GLOBAL for
producer and engine, as they are likely to be the same throughout the whole
document in many scenarios, so you can save a lot of space by not
specifying them for each and every segment. So mtProducer and mtEngine are
only optional at the local level if they have been specified at the global
level.

And there must be a processing requirement to move them onto the segment
level should the header be separated during processing.
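
A minimal Python sketch of what such a processing step might look like,
assuming the global values have already been read from the rule and apply to
the whole document (selector resolution is left out). The attribute names are
the ones used in this draft and may still change; the helper names its and
push_down_defaults are hypothetical.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"

def its(name: str) -> str:
    """Build the namespace-qualified attribute name ElementTree expects."""
    return f"{{{ITS_NS}}}{name}"

def push_down_defaults(root: ET.Element, producer: str, engine: str) -> int:
    """Copy global producer/engine onto every scored element lacking them.

    Locally specified values are kept. Returns the number of elements changed.
    """
    updated = 0
    for el in root.iter():
        if its("mtConfidenceScore") not in el.attrib:
            continue
        touched = False
        for name, value in ((its("mtProducer"), producer),
                            (its("mtEngine"), engine)):
            if el.get(name) is None:
                el.set(name, value)
                touched = True
        if touched:
            updated += 1
    return updated
```

After this step each scored segment carries its own producer and engine, so
the score stays interpretable even when the segment is extracted without the
header.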

>
> --- attribute names
>
> If we have mtConfidenceScore we  probably should have mtConfidenceProducer
> and mtConfidenceEngine, like for the other data categories.
>

Right

>
>
> Cheers,
> -ys
>
>
>
>

Received on Wednesday, 8 August 2012 23:13:17 UTC