RE: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))

Hi Felix,

That is not the best example for terminology :)

I will take another one:

„Computer software, or just software, is a collection of computer programs and related data that provides the instructions for telling a computer what to do and how to do it.” (from: http://en.wikipedia.org/wiki/Software)

There are several levels of annotation richness that can be produced:


1) Term-candidate annotation
   a. Computer software
      i.   its-term="yes"
      ii.  its-term-confidence="0.6"
      iii. its-annotators-ref="terminology|http://www.tilde.com/TerminologyAnnotationService"

2) Term linked with a term-base
   a. software
      i.   its-term="yes"
      ii.  its-term-info-ref="http://www.eurotermbank.com/GetEntryDetailed.aspx?item=192328"
           1. Do not trust the link however – that will be subject to change...
      iii. its-annotators-ref="terminology|http://www.tilde.com/TerminologyAnnotationService"

That is the additional mark-up that our service will add to the existing mark-up.
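
To make the two levels above concrete, here is a minimal Python sketch of how a producer might render them as HTML spans. It is only an illustration: the helper name term_span is hypothetical, the attribute names are the ones listed in this mail, and placing its-annotators-ref on the same span simply mirrors the list above rather than any normative rule.

```python
from html import escape

# Annotator reference as given in the list above.
ANNOTATOR = "terminology|http://www.tilde.com/TerminologyAnnotationService"


def term_span(text, confidence=None, info_ref=None):
    """Render a span carrying the term attributes listed in this mail.

    Hypothetical helper: `confidence` covers level 1 (term candidate),
    `info_ref` covers level 2 (term linked with a term-base).
    """
    attrs = [("its-term", "yes")]
    if confidence is not None:
        attrs.append(("its-term-confidence", str(confidence)))
    if info_ref is not None:
        attrs.append(("its-term-info-ref", info_ref))
    attrs.append(("its-annotators-ref", ANNOTATOR))
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs
    )
    return f"<span {rendered}>{escape(text)}</span>"


# Level 1: term candidate with a confidence score.
candidate = term_span("Computer software", confidence=0.6)

# Level 2: term linked with a term-base entry.
linked = term_span(
    "software",
    info_ref="http://www.eurotermbank.com/GetEntryDetailed.aspx?item=192328",
)
```

Both calls only add attributes around the text; any mark-up already present in the document would be left untouched by the service.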

Maybe this answers your question?

Best regards,
Mārcis ;o)

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Tuesday, January 15, 2013 2:20 PM
To: Mārcis Pinnis
Cc: public-multilingualweb-lt-comments@w3.org
Subject: Re: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))

Hi all, Mārcis again,

to move this forward, I have worked with an example. For the sentence
"Welcome to Dublin in Ireland!"
From Enrycher you will get an annotation like in example
http://www.w3.org/TR/2012/WD-its20-20121206/#EX-disambiguation-html5-local-1

from the NERD API, using the same sentence, you will get this JSON output:

[{"idEntity":169970,"label":"Dublin","startChar":0,"endChar":6,"extractorType":"CITY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Dublin","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0},{"idEntity":169971,"label":"Ireland","startChar":25,"endChar":32,"extractorType":"COUNTRY","nerdType":"http://nerd.eurecom.fr/ontology#Location","uri":"http://dbpedia.org/resource/Ireland","confidence":1.0,"relevance":0.5,"extractor":"extractiv","startNPT":0.0,"endNPT":0.0}]

The mappings NERD - ITS2 "disambiguation" are:
- "nerdType" maps to "its-disambig-class-ref"
- "confidence" maps to "its-disambig-confidence"
- "uri" maps to "its-disambig-ident-ref"

So we have some interoperability with 11 tools (NERD is a broker for 10 annotation tools, plus Enrycher): they produce easy-to-map output.
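
As a rough illustration of how simple that mapping is, the following Python sketch converts one NERD entity into the three ITS 2.0 attributes named above. The function name nerd_to_its and the trimmed sample record are assumptions made for this example; only the three field-to-attribute pairs come from the mapping in this mail.

```python
import json

# Field-to-attribute pairs exactly as listed in the mapping above.
NERD_TO_ITS = {
    "nerdType": "its-disambig-class-ref",
    "confidence": "its-disambig-confidence",
    "uri": "its-disambig-ident-ref",
}


def nerd_to_its(entity):
    """Return the ITS attributes for one NERD entity (hypothetical helper).

    Fields without a mapping (label, offsets, extractor, ...) are ignored.
    """
    return {
        its_name: str(entity[nerd_name])
        for nerd_name, its_name in NERD_TO_ITS.items()
        if nerd_name in entity
    }


# Trimmed version of the "Dublin" entity from the NERD output above.
sample = json.loads(
    '[{"label": "Dublin",'
    ' "nerdType": "http://nerd.eurecom.fr/ontology#Location",'
    ' "uri": "http://dbpedia.org/resource/Dublin",'
    ' "confidence": 1.0}]'
)

dublin_attrs = nerd_to_its(sample[0])
```

The resulting dictionary can then be written onto a span around the entity's label by whatever serializer the producer uses.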

So the question - again focusing on production, not consumption: what do you, Mārcis, expect "your" automatic term annotation tool to produce for the example sentence "Welcome to Dublin in Ireland!" ?

Best,

Felix

Am 15.01.13 10:34, schrieb Felix Sasaki:
Hi Mārcis,

thanks a lot for your detailed mail. I must however say that I don't see an answer to my question: "what is the difference in terms of producing the metadata?". The question was really focused on your implementation approach. I understand your consumption scenario and the terminology use case. But I assume that in your automatic term annotation implementation you apply the same linguistic processing pipeline as Tadej does, using basic analysis (tokenization, stemming, morphology etc.), then some resources (e.g. a lexicon) to define the type of "unit" (I'm saying unit to avoid "term" or "entity"). As I understand it, the disambiguation output gives background information about which resources have been used: an ontology like dbpedia, a lexicon like wordnet.

See e.g. the NERD API
http://nerd.eurecom.fr/documentation#nerdapi

that gives you back a nerdType
[
  [
    {
      idEntity: 120,
      label: "BBC",
      startChar: 138,
      endChar: 141,
      extractorType: "Company",
      nerdType: "http://nerd.eurecom.fr/ontology#Organization",
      uri: "http://dbpedia.org/resource/BBC",
      confidence: 0.0582796,
      relevance: 0.5,
      extractor: "dbspotlight",
      startNPT: 0,
      endNPT: 0
    },
    ...
  ]
]

I'm mentioning this API since in a sense it is the API counterpart to what we are standardizing with markup: it provides a JSON format as the output of annotation.
So again you have a confidence field and type - and I'd like to understand not what the difference is in your use case, but in the implementation approach (see above)? If the answer is "none", that is fine too, and it would give us a path to explain to users (both producers and consumers) how to deal with both use cases.

Best,

Felix

Am 15.01.13 08:55, schrieb Mārcis Pinnis:

Hi Felix,



Terminology from a practical standpoint identifies concepts (often also common term phrases - concepts in multi-word phrases) commonly found in a specific domain (subject field) and infrequently found or not found at all in a general language. That is the purpose of the Terminology data category. It should identify domain-specific terms (or possible term-candidates with a confidence score when automated annotation is performed) and, if possible, link the identified terms with entries in a term-base.



I am not that familiar with the Disambiguation data category and its history, but the question is, what is the main goal of the Disambiguation data category (whom is it meant for and who will provide data for it?)? Should it identify or try sorting out ambiguities in any type of phrases (no matter - terms, named entities, general language, etc.)? Then - have we identified all types if we have only three "granularities"? Then also - a phrase can actually simultaneously belong to all "granularities" (I think the naming does not reflect the meaning correctly) depending on a client, which I guess makes it difficult for content providers to create reasonable mark-up (that is one reason why I would prefer not using Disambiguation).



In my opinion, such different kinds of content mark-up should not be mixed together:
- terms as concepts (what is meant by lexical-concept is not explained, nor how it overlaps with what a term is!);
- named entities as concept instances. This is the main difference between a term and a named entity; however, in many cases a named entity can also be a term. For instance, for the term "weapon" suitable named entities may very well be "knife", "gun" and "axe". But is "knife" a named entity or is it a term? It can be both! Take a look at the biggest classification table of named entities, http://nlp.cs.nyu.edu/ene/version7_1_0Beng.html: under the category Product you will probably find many things that you would not have thought to be named entities;
- ontology-concept (which may very well overlap with the previous two).



It is hard to understand the reasoning behind the different "granularity" levels also because of lacking definitions and it is not clear why such different data types should at all be mixed together in one category.



From that aspect, I would prefer separate data categories for all three "granularities" (if necessary; although I do not particularly like the naming here) as the applications for all these may be quite different.



From an implementer's and content provider's viewpoint I prefer the Terminology data category as its purpose is clear and it is also clear what is meant with the annotation - it is clearly identifiable and mark-up can be easily applied (which is not the case with the Disambiguation data category).



I hope I did not make things more confusing?! I wanted to raise the point that the disambiguation data category itself is quite ambiguous (at least to me).



Best regards,

Mārcis ;o)



-----Original Message-----
From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Monday, January 14, 2013 8:35 PM
To: public-multilingualweb-lt-comments@w3.org
Cc: Mārcis Pinnis
Subject: Disambiguation and terminology producers (Re: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 - Disambiguation (and term)))



Hi all, esp. Tadej and Mārcis (FYI, it might be helpful for you to subscribe to the public-multilingualweb-lt-comments@w3.org list),



Yves has responded from the point of view of a consumer. Now it would be interesting to understand: what is the difference in terms of producing the metadata?



Is in essence the process for creating

<span its:term="yes" its:termConfidence="0.98">screwdriver</span>



the same as creating



<span its:disambigSource="mywordnet" its:disambigIdent="474646"

its:disambigGranularity="lexical-concept"

its:disambigConfidence="0.98">screwdriver</span>



with the only difference that in the case of terminology, information is left out (Source, Ident, Granularity) and there is different naming for attributes (termConfidence vs. disambigConfidence)?
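
Read literally, the hypothesis in this question can be sketched in a few lines of Python: a producer that already has the disambiguation attributes would obtain the terminology attributes by dropping Source, Ident and Granularity and renaming the confidence attribute. This only illustrates the question, not a confirmed description of any implementation; the function name disambig_to_term is hypothetical.

```python
def disambig_to_term(attrs):
    """Derive terminology attributes from disambiguation attributes.

    Hypothetical helper: keeps only the confidence value (renamed) and
    adds its:term="yes"; Source, Ident and Granularity are left out.
    """
    term = {"its:term": "yes"}
    if "its:disambigConfidence" in attrs:
        term["its:termConfidence"] = attrs["its:disambigConfidence"]
    return term


# Attributes of the second "screwdriver" span above.
disambig = {
    "its:disambigSource": "mywordnet",
    "its:disambigIdent": "474646",
    "its:disambigGranularity": "lexical-concept",
    "its:disambigConfidence": "0.98",
}

term = disambig_to_term(disambig)
```

If the two processes really are the same, the result matches the attributes of the first "screwdriver" span.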



This would mean that we could create some guidance for producers of the metadata, related to different consumption scenarios.



Best,



Felix



Am 14.01.13 18:54, schrieb Lieske, Christian:

> Hi David, Jörg, Felix, all,

>

> It's great to see timely replies to this comment.

>

> It would indeed be valuable - as indicated by Felix - to get comments from additional angles.

>

> Cheers,

> Christian

>

> -----Original Message-----

> From: Felix Sasaki [mailto:fsasaki@w3.org]

> Sent: Freitag, 11. Januar 2013 18:17

> To: public-multilingualweb-lt-comments@w3.org

> Subject: issue-68 (Re: Comment on ITS 2.0 WD-its20-20121206 -

> Disambiguation (and term))

>

> All (co-chair hat on),

>

> thank you for this discussion. General remark: as explained at

> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0045.html

> please add the issue number to the mail subject. Otherwise

> it will be very hard to track discussions.

>

> It would now be interesting to hear from the implementers: according to

> http://tinyurl.com/its20-testsuite-dashboard


> Enlaso, Tilde and UL will implement terminology. As I understand it,

> UL will make a wrapper around the Enlaso / Okapi engine, correct?

> Now, for Disambiguation we have Enlaso, JSI, Moravia and UL. Here I

> *think* that Moravia and UL will basically have an Okapi wrapper.

> Please correct me if I'm wrong.

>

> This leaves us with the following situation:

> - two implementations for terminology (Enlaso and Tilde)

> - two for disambiguation (Enlaso and JSI)

>

> So Mārcis, Tadej, Yves - what do you think about this proposal?

>

> I'm asking this also since I have to remind people about the W3C process:

>

> (W3C process hat on) We cannot just say "we don't like a comment".

> There need to be good reasons to reject it. The argumentation below can

> support the rejection, but the rejection is rather weak if

> implementers don't have an opinion or would even say "I would do the

> change". So please express your thoughts in this thread.

>

> Best,

>

> Felix

>

> Am 11.01.13 14:07, schrieb Jörg Schütz:

>> +1

>>

>> Hi Christian, David, and all,

>>

>> I would have similar arguments for keeping term and disambiguation

>> separate although they are related. There are several use cases out

>> there in the wild that need this kind of separation, e.g. terminology

>> based workflows in a particular supply chain vs. data stream analyses

>> which prepare the data for further treatment such as a machine

>> translation application (vocabulary support and training/tuning life

>> cycles).

>>

>> One other topic is the discussion of the ISOCat elements which to

>> some extent would force applications to adopt an NLP standard that

>> might not be appropriate for a given application scenario, e.g. those

>> that do not use NLP technologies at all. Therefore, I would also

>> recommend that we do not talk about bringing ITS closer to NLP

>> because ITS should remain open and deployable for different language

>> processing strategies.

>>

>> Nevertheless, thanks a lot for raising these concerns.

>>

>> All the best -- Jörg

>>

>> On Jan 11, 2013, at 12:22 (CET), Dr. David Filip wrote:

>>> Dear Christian, thanks for this insightful comment.

>>> I agree that the disambiguation category is one of the most

>>> important additions that can expand the usage of the standard and

>>> become more useful across technologies and industries.

>>>

>>> The group had discussed and it is clear that disambiguation and term

>>> are somehow related categories. We have however not considered

>>> deprecation of the ITS 1.0 term, at least not explicitly.

>>>

>>> I believe that this is given by the chartered principles of the

>>> group [paraphrasing]

>>> 1) Do not break 1.0

>>> 2) Keep the 1.0 principle of independent categories that can also be

>>> independently implemented.

>>>

>>> I believe that your proposal to fuse term and disambiguation is

>>> in line with 2) in the sense of making two seemingly interdependent

>>> categories into one fully self contained and independent category,

>>> but would violate 1).

>>>

>>> But even if we did not care for 1), I believe that the relationship

>>> between term and disambiguation is a reasonably loose one, i.e. not

>>> a hard formal interdependency that would warrant or even mandate

>>> normative handling, and thus can and should be handled in

>>> non-normative material such as a best practice document, while we

>>> are keeping both categories, because they have discernable use cases

>>> and still can be implemented independently.

>>>

>>> A)

>>> A user that uses both a terminology management system and a text

>>> analytics system for disambiguation can reasonably combine them and

>>> their combination can be driven by organization specific process

>>> driven considerations. They can for instance harvest spans marked as

>>> disambiguation as term candidates for their Terminology database and

>>> these can be encoded as terms next time if e.g. a  terminologist

>>> approves them as terms.

>>>

>>> B)

>>> People using text analytics input only do not need to care about term.

>>>

>>> C)

>>> People using terminology management as the only source do not need

>>> to bother with complexities of the disambiguation category.

>>>

>>> To summarize:

>>> While many ITS categories, and prominently term and disambiguation,

>>> are informally semantically related, it seems important to keep a

>>> reasonable and manageable granularity of the independently

>>> implementable categories.

>>>

>>> I hope this helps to understand the group's motivation for keeping

>>> the categories apart.

>>> Please let me know

>>> Rgds

>>> dF

>>>

>>> Dr. David Filip

>>> =======================

>>> LRC | CNGL | LT-Web | CSIS

>>> University of Limerick, Ireland

>>> telephone: +353-6120-2781

>>> *cellphone: +353-86-0222-158*

>>> facsimile: +353-6120-2734

>>> mailto: david.filip@ul.ie

>>>

>>>

>>> On Thu, Jan 10, 2013 at 9:14 AM, Lieske, Christian

>>> <christian.lieske@sap.com> wrote:

>>>

>>>      Hi,
>>>
>>>      Please find below comments/observations/questions/ideas concerning
>>>      the ITS 2.0 working draft dated December 6, 2012
>>>      (http://www.w3.org/TR/2012/WD-its20-20121206/).  Please feel free to
>>>      contact me for clarifications if anything is unclear.
>>>
>>>      The section related to the “disambiguation” data category to me is
>>>      one of the most important ones of the draft. ITS 2.0 from my
>>>      point-of-view moves ITS 1.0 closer to Natural Language Processing
>>>      (NLP), and “disambiguation” to me is related to NLP in various ways.
>>>      Thus, making “disambiguation” powerful and easy to use (e.g. via a
>>>      clear distinction to other data categories, as well as
>>>      conceptualizations and wording that are not just known within
>>>      linguistics) seems important to me.
>>>
>>>      While looking at “disambiguation” from this angle, I started to
>>>      wonder if it could benefit from additions/modifications. I apologize
>>>      in advance if a reply to this comment may require that discussions
>>>      which presumably already took place may have to be summarized.
>>>
>>>      Here are my observations/questions/ideas:
>>>
>>>      a. I sense that ITS users will have difficulties to decide when
>>>      to use “term” and when to use “disambiguation” (the note in the
>>>      Working Draft indicates this).
>>>
>>>      b. Annotation of known terms, generation of so-called “term
>>>      candidates”, (named) entity recognition, and other automation can be
>>>      subsumed under the heading “(automated) text analysis”.
>>>
>>>      I am thus wondering if the following would be worth considering:
>>>
>>>      1. Enhance the current “disambiguation” so that also the current
>>>      “term” can be covered
>>>
>>>      2. Deprecate “term”
>>>
>>>      3. Revising some of the terminology used in the spec (e.g.
>>>      “disambiguation”, “disambigGranularity”)
>>>
>>>      An example use of a revised “disambiguation” (and deprecated “term”)
>>>      – partially inspired by ISOCat (see http://www.isocat.org/ ) – is
>>>      the following:
>>>
>>>      Data category name: (automated) text analysis annotation (atan/tan);
>>>      using “text analysis annotation” would have the advantage that even
>>>      manual work (e.g. “promoting a term candidate to a term”) could be
>>>      covered
>>>
>>>      Data category “qualifier” (currently “disambigGranularity”):
>>>      atan-type or tan-type
>>>
>>>      Values for “qualifier”: lexical, term, termCandidate,
>>>      ontological-class, ontological-entity; possibly even URIs such as
>>>      http://www.isocat.org/datcat/DC-2275 - would allow rather
>>>      fine-grained and under certain provisions standard-conformant (ISO
>>>      12620; see http://www.ttt.org/clsframe/datcats.html) annotation
>>>
>>>      Example:
>>>
>>>              <span
>>>                 its-tan-confidence="0.7"
>>>                 its-tan-class-ref="http://nerd.eurecom.fr/ontology#Place"
>>>                 its-tan-ident-ref="http://dbpedia.org/resource/Dublin"
>>>                 its-tan-type="http://www.isocat.org/datcat/DC-2275">Dublin</span>
>>>
>>>      Cheers,
>>>      Christian

>>>

>

Received on Tuesday, 15 January 2013 13:39:36 UTC