RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

+1

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Wednesday, 30 January 2013 08:31
To: Mārcis Pinnis; Stephan Walter
Cc: Tadej Štajner; Yves Savourel; public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Very good point, Stephan.

Best,

Felix

Sent from my Sony Xperia™ smartphone

Stephan Walter <stephan.walter@cocomore.com> wrote:
Hi Felix,

(this time sent out to all recipients… ;))

just a short additional argument for dropping the granularity attribute (that I don’t seem to have read in the discussion so far).

There may be substantive disagreement on whether some background resource is to be counted (for instance) as an ontology or as a lexical resource (I’ve even heard people calling Wordnet an ontology). Without any authoritative classification, we might end up with producers annotating the same disambiguation information (i.e. a pointer to the same entry in the same resource) with different granularities.

Best
Stephan

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Tuesday, 29 January 2013 08:51
To: Mārcis Pinnis
Cc: Tadej Štajner; Yves Savourel; public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

On 29.01.13 07:52, Mārcis Pinnis wrote:
Hi Felix,

If I understood correctly, the new proposal is to slightly change the Disambiguation data category (by dropping granularity)

Hi Mārcis,

yes, the proposal below is like that. However, it has the drawback that no relation between terminology and disambiguation is expressed. That brings us back to the original issue-68, which included deprecating terminology. I assume that you would not agree with that, but would continue to generate terminology markup? So in a sense we are back at the start.

In a different sense we made progress. At
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0042.html

your main concern about disambiguation was the granularities, and the proposal below includes dropping them. However, another concern may be the name "disambiguation". I'm not sure about this, hence I am just asking you and others interested in the issue.

Best,

Felix

and leave Terminology as is? If yes, then I’m OK with that if everyone else is.

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Monday, January 28, 2013 9:57 PM
To: Tadej Štajner
Cc: Mārcis Pinnis; Yves Savourel; public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: Re: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup

Hi Tadej, all,

sorry for not giving detailed replies to other mails. Trying to bring together *some* loose ends here.

On 28.01.13 19:08, Tadej Štajner wrote:
Hi, all, (long e-mail ahead, you can scroll to TL;DR)
true - the current state is a local optimum that satisfies the requirements. It would need some polish, better guidance and stricter definitions, and possibly renaming disambigGranularity back to disambigType.

As an improvement, Felix's proposal makes some sense, since it makes ITS2.0 capable of proper multi-layer annotation. If having these two mechanisms for inline+standoff annotation is too complex to implement, it would be an acceptable compromise to have only the stand-off and no inline (except for term="yes", maybe), but I'd vote in favor of keeping the inline part.

Also, the ref/id pointing could also be expressed the other way around, pointing from fragment to the annotation. Instead of:
<span id="dublin1">Dublin</span>
...
<its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin" ref="dublin1" />

I would suggest the same mechanism as in LQI, so we have some symmetry:

<span its:tanRefs="tan1">Dublin</span>
<its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity" its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
</its:textAnalysisAnnotations>
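
How a consumer might follow the span-to-annotation direction proposed above can be sketched in Python. This is only an illustration under stated assumptions: the `doc` wrapper element and the `resolve_annotations` helper are hypothetical, and treating `its:tanRefs` as a whitespace-separated list of IDs is an assumption, not something the proposal has settled.

```python
import xml.etree.ElementTree as ET

ITS = "http://www.w3.org/2005/11/its"

# The <doc> wrapper is hypothetical; it only makes the fragment well-formed.
DOC = f"""<doc xmlns:its="{ITS}">
  <span its:tanRefs="tan1">Dublin</span>
  <its:textAnalysisAnnotations id="tan1">
    <its:textAnalysisAnnotation its:tanType="entity"
        its:tanIdentRef="http://dbpedia.org/resource/Dublin"/>
  </its:textAnalysisAnnotations>
</doc>"""


def resolve_annotations(xml_text):
    """Return (text, tanType, tanIdentRef) for every span/annotation pair."""
    root = ET.fromstring(xml_text)
    # Index the stand-off containers by their id attribute.
    containers = {
        el.get("id"): el
        for el in root.iter(f"{{{ITS}}}textAnalysisAnnotations")
    }
    resolved = []
    for span in root.iter("span"):
        # Assumption: tanRefs may hold several whitespace-separated IDs.
        for ref in span.get(f"{{{ITS}}}tanRefs", "").split():
            container = containers.get(ref)
            if container is None:
                continue  # dangling reference: skipped in this sketch
            for ann in container.findall(f"{{{ITS}}}textAnalysisAnnotation"):
                resolved.append((
                    span.text,
                    ann.get(f"{{{ITS}}}tanType"),
                    ann.get(f"{{{ITS}}}tanIdentRef"),
                ))
    return resolved


for text, tan_type, ident_ref in resolve_annotations(DOC):
    print(text, tan_type, ident_ref)
```

With this direction of pointing, one container can carry several annotations for the same fragment, and one fragment could in principle point to several containers.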

In the above you use the name its:tanRefs. Does that imply that you assume references to several annotations?
To Yves, in reply to
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0206.html

"I don't see a difference between what the standoff markup of LQI/Provenance does and this standoff for Term+Disambiguation does."
I think the difference is how the external annotations are stored in my example: in separate units, all pointing to the same ID. In Tadej's example you then also have the potential to point to several units. I think that is different from the current LQI/Provenance approach, where the idea is to add just one link relation. I'm not sure yet whether that difference is significant - I have to think about it.
But while doing that, a question for the LQI/Provenance implementers: is it a feature that you point to just one external standoff unit, or is it an oversight, and could it be several?

With regard to the below, the lowest effort would probably be "drop granularity", that is 2) below. To accommodate one part of Christian's comment at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0014.html

we could rename disambiguation to its-tan-*, and rewrite the disambiguation section.

If we then foresee that several annotations might occur, we could accommodate the LQI/Provenance standoff approach.

Since there have been many other mails on this, and I can't reply to all of them here: Mārcis, Yves, would that resolve your concerns and questions? Christian, I assume that Tadej's characterization "less-specific 'pointer to some meaning identifier' brother to Terminology." of disambiguation (or "tan") would not satisfy your concern - what would you propose?

Best,

Felix




Secondly, I'll give another alternative (and orthogonal) proposal, repeating what Pablo Mendes already hinted at in August: remember the question of supporting the distinction between different disambiguation types - entity, lexical concept, ontology concept - which is now encoded in the 'disambigGranularity' attribute (relevant discussion http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0322.html).

When trying to merge Terminology and Disambiguation, having that many disambiguation types supported in the same way implies that we end up with 16 or so attributes. After some discussion in Prague, we realized that although we've established that a distinction between those types exists and it is important, we couldn't come up with a use case where having that information would make a difference in the actual workflows.

Let me clarify:  if a consumer component cares about disambiguation, it will try to resolve the disambigIdentRef identifier. By resolving it, it is able to know what type/level/granularity of disambiguation it's dealing with. By that reasoning, having this information explicit is redundant, because the system already did its job. The question is, is there a use case that justifies keeping the 'disambigGranularity'? For instance, operating on the disambiguation values without actually resolving them? Maybe filtering?
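
The "filtering without resolving" case can be illustrated with a short Python sketch. It is an assumption-laden example, not a documented workflow: the `SAMPLE` fragment, its `p` wrapper, and the `spans_with_granularity` helper are hypothetical, and only the inline attribute names come from the current draft.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment using the inline attribute names from the draft.
SAMPLE = (
    '<p><span its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" '
    'its-disambig-granularity="entity">Dublin</span> is the '
    '<span its-disambig-source="Wordnet3.0" its-disambig-ident="301467919" '
    'its-disambig-granularity="lexical-concept">capital</span> of Ireland.</p>'
)


def spans_with_granularity(markup, granularity):
    """Select annotated spans by granularity without dereferencing anything."""
    root = ET.fromstring(markup)
    return [
        el.text
        for el in root.iter("span")
        if el.get("its-disambig-granularity") == granularity
    ]


print(spans_with_granularity(SAMPLE, "entity"))
print(spans_with_granularity(SAMPLE, "lexical-concept"))
```

A consumer that has to resolve the identifier anyway gains nothing from this filter, which is the redundancy argument made above.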

So, we'd go from:
<span
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin"
          its-disambig-granularity="entity">Dublin</span>
      is the <span
          its-disambig-source="Wordnet3.0"
          its-disambig-ident="301467919"
          its-disambig-granularity="lexical-concept"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.

to:
<span
          its-disambig-confidence="0.7"
          its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place"
          its-disambig-ident-ref="http://dbpedia.org/resource/Dublin">Dublin</span>
      is the <span
          its-disambig-source="Wordnet3.0"
          its-disambig-ident="301467919"
          its-disambig-confidence="0.5"
          >capital</span> of Ireland.
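
The change from the first form to the second amounts to removing one attribute and leaving everything else untouched. A minimal Python sketch of that transformation follows, assuming the markup is well-formed XML; the `drop_granularity` helper and the `p` wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the "before" example.
BEFORE = (
    '<p><span its-disambig-confidence="0.7" '
    'its-disambig-class-ref="http://nerd.eurecom.fr/ontology#Place" '
    'its-disambig-ident-ref="http://dbpedia.org/resource/Dublin" '
    'its-disambig-granularity="entity">Dublin</span> is the '
    '<span its-disambig-source="Wordnet3.0" its-disambig-ident="301467919" '
    'its-disambig-granularity="lexical-concept" '
    'its-disambig-confidence="0.5">capital</span> of Ireland.</p>'
)


def drop_granularity(markup):
    """Remove its-disambig-granularity from every element, keep the rest."""
    root = ET.fromstring(markup)
    for el in root.iter():
        el.attrib.pop("its-disambig-granularity", None)
    return ET.tostring(root, encoding="unicode")


print(drop_granularity(BEFORE))
```

The identifier references and confidence values survive unchanged, so a consumer that resolves `its-disambig-ident-ref` is unaffected.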

In this setting, ITS would just operate with references to identifiers and wouldn't care about the type of that relationship. I understand this is losing information, and it weakens the expressive power, but I'm asking this because it might simplify a couple of solutions here. Even though I proposed it initially, I wouldn't push something that hasn't got any consumers behind it (the T in ITS doesn't stand for Tadej.. :) ). It would also establish a clearer boundary between what ITS covers and what other formats should cover.

TL;DR
In short, I see some scenarios that I'd be OK with:
1) If we keep 'granularity':
    1a) We keep granularity in the form of its:tanType and go with Felix's proposal, possibly inverting the addressing so it's like LQI;
    1b) We keep granularity, we keep the currently proposed Disambiguation data model, possibly renaming 'disambigGranularity' back to 'disambigType';
2) If we drop 'granularity', we probably wouldn't need the new its:tan* model, and it would make sense to keep the rest of the disambiguation data category as-is and describe the three usage scenarios only as best practices. Disambiguation would then serve as a less-specific 'pointer to some meaning identifier' brother to Terminology.

-- Tadej

On 28. 01. 2013 16:42, Mārcis Pinnis wrote:

Hi Felix, all,



I also do not have anything against leaving everything as is.

I however (as I made clear in my previous e-mail) don't think that the stand-off markup is a nice solution.



Best regards,

Mārcis ;o)



-----Original Message-----
From: Yves Savourel [mailto:ysavourel@enlaso.com]
Sent: Monday, January 28, 2013 5:31 PM
To: 'Felix Sasaki'; Mārcis Pinnis
Cc: public-multilingualweb-lt@w3.org; Artūrs Vasiļevskis
Subject: RE: issue-68 from an annotation representation point of view, with potential implications for annotatorsRef and standoff markup



Hi Felix, all,



Just a judgment from my side: I think at the moment we don't have consensus for



- leaving everything as is (Dave's proposal)

I don't have anything against leaving things as is.

There is nothing really broken.



It's just that having both data categories fused would be a bit nicer. But overall if there is no time to make that work, we can indeed just leave it as it is.



cheers,

-yves

Received on Wednesday, 30 January 2013 08:05:47 UTC