Text Analysis serializations

Version: 16 January 2014

Overview

The ITS 2.0 specification defines a normative way to represent Text Analysis information locally in XML and HTML. Text Analysis information can also be represented in other formats, e.g. JSON. This page describes such alternative serializations. Please edit this page or provide comments on the ITS IG mailing list.

Comparison to NERD API output

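The NERD API returns its output as JSON. Here is an example of the output of one API call. The numbers in parentheses are not part of the output; they refer to the list of correspondences that follows the example.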

[
{
(1) idEntity: 120,
(2) label: "BBC",
(3) startChar: 138, endChar: 141,
(4) extractorType: "Company",
(5) nerdType: "http://nerd.eurecom.fr/ontology#Organization",
(6) uri: "http://dbpedia.org/resource/BBC",
(7) confidence: 0.0582796,
(8) relevance: 0.5,
(9) extractor: "dbspotlight"
},
...]

The NERD API fields correspond to Text Analysis information pieces as follows:

  1. idEntity: no correspondence
  2. label: content of the annotated element in XML or HTML
  3. startChar, endChar: not represented as part of a Text Analysis information piece, but generated in a NIF workflow; see conversion to NIF
  4. extractorType: no correspondence
  5. nerdType: entity type / concept class, e.g. in HTML its-ta-class-ref="http://nerd.eurecom.fr/ontology#Organization"
  6. uri: Entity / concept identifier, e.g. in HTML its-ta-ident-ref="http://dbpedia.org/resource/BBC"
  7. confidence: Text analysis confidence, e.g. in HTML its-ta-confidence="0.0582796"
  8. relevance: no correspondence
  9. extractor: its-annotators-ref (in HTML) or annotatorsRef (in XML) attribute, e.g. its-annotators-ref="text-analysis|dbspotlight".
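
The correspondences above can be sketched in code. The following Python example is only an illustration, not part of ITS 2.0 or of the NERD API: the function name nerd_entity_to_its_span and the choice of a span element are assumptions, and the entity is assumed to be already parsed into a dict with the field names shown in the example output. Fields without a correspondence (idEntity, extractorType, relevance) and the character offsets are dropped.

```python
from html import escape


def nerd_entity_to_its_span(entity):
    """Hypothetical helper: render one NERD entity as an HTML span
    carrying local ITS 2.0 Text Analysis attributes."""
    attrs = {
        # (5) nerdType -> entity type / concept class
        "its-ta-class-ref": entity["nerdType"],
        # (6) uri -> entity / concept identifier
        "its-ta-ident-ref": entity["uri"],
        # (7) confidence -> Text analysis confidence
        "its-ta-confidence": str(entity["confidence"]),
        # (9) extractor -> annotator reference, in the form used on this page
        "its-annotators-ref": "text-analysis|" + entity["extractor"],
    }
    rendered = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attrs.items()
    )
    # (2) label -> content of the annotated element
    return f"<span {rendered}>{escape(entity['label'])}</span>"


# The entity from the example output above, written as a Python dict.
example = {
    "idEntity": 120,
    "label": "BBC",
    "startChar": 138,
    "endChar": 141,
    "extractorType": "Company",
    "nerdType": "http://nerd.eurecom.fr/ontology#Organization",
    "uri": "http://dbpedia.org/resource/BBC",
    "confidence": 0.0582796,
    "relevance": 0.5,
    "extractor": "dbspotlight",
}

print(nerd_entity_to_its_span(example))
```

For the example entity this yields a span with the content "BBC" and the four attribute values listed in items 5, 6, 7 and 9. An XML serialization would use the corresponding ITS attributes (e.g. annotatorsRef) instead of the its-* HTML attributes.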