D3.1.3: Text Processing Component

Tadej Štajner

Distribution: Public

MultilingualWeb-LT (LT-Web)
Language Technology in the Web
FP7-ICT-2011-7
Project no: 287815

Document Information

Deliverable number: 3.1.3
Deliverable title: Text Processing Component
Dissemination level: PU
Contractual date of delivery: 30 September 2013
Actual date of delivery: 30 September 2013
Author(s): Tadej Štajner
Participants: JSI
Internal Reviewer: JSI
Workpackage: WP3
Task Responsible: Tadej Štajner
Workpackage Leader:

Revision History

Revision 1, 27/09/2013, Tadej Štajner, JSI: Draft
Revision 2, 09/10/2013, Tadej Štajner, JSI: Clarity improvements, graphics

Contents

Document Information
Revision History
Contents
Executive Summary

Executive Summary

This deliverable describes how natural language processing can be used to improve the content localization lifecycle using technologies such as named entity disambiguation with semantic knowledge bases. We describe the requirements from the localization side to support the use case of translating named entities, and the resulting design constraints. We describe a standard data representation, defined in the ITS 2.0 W3C standard, that allows integration of various language tools into this process. We present a reference implementation for the selected data categories based on the Enrycher text analysis system.

Introduction

Translation mechanisms for named entities depend on both the source and target languages. There are specific rules for translating (or transliterating) particular proper names or concepts.
Sometimes, they should not be translated at all. To support this use case, we propose using an automatic natural language processing method to annotate the content so that it can be processed correctly. The purpose of this work is to enable the results of text analysis to be annotated in the content. Besides translating names of entities, there are several other translation-related tasks that could be improved with the help of such NLP information, for example:

- Term suggestion
- Contextualization
- Suggestion of things not to translate
- Automated transliteration of proper names

Requirements

In the requirements gathering phase of the standardization process, we outlined a specification and a use case for the role of text processing components in the content localization process. In general, automatic annotation reduces the manual cost of annotation and may increase the accuracy, consistency and comprehensiveness of such annotations. The enrichment of source content with named entity annotations is one example of such an automatic process. The goal of this work is to define and standardize an interface for text processing components in a localization workflow, supported by a reference implementation. The initial data modelling discussions in the requirements gathering phase [1] resulted in identifying three concrete data category prototypes that were relevant for text processing tools: sense disambiguation, named entity annotation, and annotation of text analysis results. These prototype data categories illustrate the requirements of what ITS 2.0 should support.

Prototype data categories

This section outlines the prototype data categories and their functional requirements, which were later changed and consolidated into the final data categories.

Sense disambiguation

Definition: Annotation of a single word, pointing to its intended meaning within a semantic network.
It can be used by MT systems to disambiguate difficult content.

Data model:
- meaning reference: a pointer to the meaning (synonym set) in a semantic network that this fragment of text represents.
- semantic network: a pointer (URI) to a resource representing the semantic network that defines the valid meanings.

The value of the semantic network attribute should identify a single language resource that describes the possible meanings within that semantic network. The mechanism should allow for the validation of individual meanings against the semantic network using common mechanisms. Sense disambiguation, as discussed in the requirements phase, covers both individual word senses and more conceptual senses.

Named entity annotation

Definition: Annotation of a phrase spanning one or more words, mentioning a named entity of a certain type. When describing a fragment of text that has been identified as a named entity, we would like to specify the following pieces of information in order to help downstream consumers of the data, for instance when training MT systems.

Data model:
- entity reference: a pointer (URI) to the entity in an ontology.
- entity type: a pointer (URI) to a concept defining the particular type of the entity.

The named entity annotation proposal had a slight conceptual overlap with sense disambiguation in the requirements phase, since both are used to link textual fragments to external knowledge bases. Subsequent discussions led to the consolidation of the sense disambiguation and named entity annotation data categories into a common text analysis data category.

Text analysis result annotation

This data category allows the results of text analysis to be annotated in the content.

Data model:
- annotation agent: which tool has produced the annotation.
- confidence score: the system's confidence in this annotation, in the range [0.0, 1.0].
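The consolidated data model above can be sketched as a small data structure. This is a minimal illustration only; the field names, span offsets, and range validation below are our assumptions, not part of the specification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TextAnalysisAnnotation:
    """One text-analysis annotation over a span of text, mirroring the
    prototype data model: an identifier link, an optional type/class link,
    the producing agent, and a confidence in [0.0, 1.0].
    Field names are illustrative, not normative."""
    start: int                       # character offset of the annotated span
    end: int                         # end offset (exclusive)
    ident_ref: str                   # IRI of the entity or sense, e.g. a DBpedia resource
    class_ref: Optional[str] = None  # IRI of the entity type / concept class
    annotator: Optional[str] = None  # which tool produced the annotation
    confidence: float = 1.0          # system confidence for this annotation

    def __post_init__(self) -> None:
        # The prototype data category constrains confidence to [0.0, 1.0].
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")
```

A consumer can then carry the entity link, its type, and the producing tool's confidence through the pipeline as one unit.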
This prototype data category represents the requirement to specify which tool was used in a given processing step, and what the tool's estimate of its output quality is.

Domain

This data category specifies the domain of the text.

Data model:
- domain name

It should be possible to point to multiple domains, as well as to support mapping between different domain vocabularies. All of the mentioned prototype data category requirements were consolidated and refactored into new data categories during the specification phase: into Text Analysis, which covers word and entity senses, and the Annotators reference mechanism, which represents which tool has produced a given annotation.

Scope

The final requirements that were identified within the process represent a subset of what natural language processing can potentially offer to assist content processing for localisation. However, since the purpose of the project was to produce a useful and manageable standard, we limited our scope to the functionalities that extended existing best practices that we could test, leaving the other use cases to related standards such as NIF [4], which covers the morphosyntactic properties of individual words. We also considered and discussed differentiating between different levels of annotations that link phrases with knowledge bases, such as distinguishing between word sense disambiguation, concept disambiguation and entity disambiguation, as well as connecting these with term disambiguation. However, a further survey of requirements revealed that introducing this distinction into the data category would not support any relevant use case. We therefore defined that the its-ta-ident-ref property can be used to represent any type of linkage between the annotated phrase and a knowledge base, making no assumptions about the type of the link.
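The uniform linkage described above means a consumer never has to distinguish a word-sense link from an entity link. A minimal sketch, with purely illustrative IRIs (the WordNet-style identifier below is invented for illustration):

```python
# Both a word sense and a named entity are linked through the same
# its-ta-ident-ref property; the IRIs here are illustrative only.
annotations = [
    {"text": "bank",    # word sense link (hypothetical sense IRI)
     "its-ta-ident-ref": "http://example.org/wordnet/bank-noun-2"},
    {"text": "Dublin",  # named entity link (DBpedia resource)
     "its-ta-ident-ref": "http://dbpedia.org/resource/Dublin"},
]

# A consumer treats every linkage uniformly, regardless of what kind
# of knowledge base the IRI points into.
idents = [a["its-ta-ident-ref"] for a in annotations]
```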
With regard to the domain data category, we identified that there was no plausible way of using a standardized domain set against which the metadata could be validated, so the domain data category is now represented as an arbitrary string.

Business benefits

The benefit of using text processing tools hinges on the adoption of a standardized interface that can lower the barrier to adopting such a system. While named entity extraction has already been shown to improve translation quality [15], named entity extraction components are typically language-dependent. This often entails that they have different implementations, which increases the integration effort. The benefit of standardizing this interface is the reduced marginal integration effort needed to support a text processing component for an additional language.

3. Support in ITS 2.0

The Text Analysis data category is used to annotate content with lexical or conceptual information for the purpose of contextual disambiguation. This information can be provided by so-called text analysis software agents, such as named entity recognizers and lexical concept disambiguators, and is represented by either string values or IRI references to possible resource descriptions. For example, a named entity recognizer provides the information that the string "Dublin" in a certain context denotes a town in Ireland.

Figure 1: The role of text analysis in the ITS 2.0 ecosystem

While text analysis can be done by humans, this data category is targeted more at software agents. The information can be used for several purposes, including, but not limited to, informing a human agent such as a translator that a certain fragment of textual content (the so-called text analysis target) may follow specific processing or translation rules. Figure 1 shows where text analysis fits into the whole ecosystem: it feeds into terminology management, as well as machine translation pre-processing.
The ITS 2.0 standard fulfilled these requirements with the following properties of the text analysis data category:

- Text analysis confidence: the confidence of the agent (that produced the annotation) in its own computation.
- Entity type / concept class: the type of entity, or concept class, of the text analysis target (IRI).
- Entity / concept identifier: a unique identifier for the text analysis target.

These can be used in an HTML5 setting, as in the following fragment:
Welcome to London
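In HTML5, the three properties above surface as the attributes its-ta-confidence, its-ta-class-ref, and its-ta-ident-ref on the annotated element. The sketch below shows how a fragment like the one above can be consumed with Python's standard-library parser; the specific DBpedia and NERD IRIs, the 0.9 confidence value, and the extractor class itself are illustrative assumptions, not output of the reference implementation.

```python
from html.parser import HTMLParser

class ITSTextAnalysisExtractor(HTMLParser):
    """Collect (text, its-ta-* attributes) pairs from an HTML5 fragment."""

    def __init__(self) -> None:
        super().__init__()
        self._stack = []       # per-open-element: its-ta-* attrs, or None
        self.annotations = []  # collected (text, attrs) pairs

    def handle_starttag(self, tag, attrs):
        # Keep only the ITS 2.0 text analysis attributes of this element.
        its = {k: v for k, v in attrs if k.startswith("its-ta-")}
        self._stack.append(its if its else None)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Record text that sits directly inside an annotated element.
        if self._stack and self._stack[-1]:
            self.annotations.append((data, self._stack[-1]))

# Illustrative fragment: "London" linked to a DBpedia resource,
# typed via a NERD-style class IRI, with an assumed confidence.
fragment = (
    '<p>Welcome to '
    '<span its-ta-ident-ref="http://dbpedia.org/resource/London" '
    'its-ta-class-ref="http://nerd.eurecom.fr/ontology#Location" '
    'its-ta-confidence="0.9">London</span></p>'
)
parser = ITSTextAnalysisExtractor()
parser.feed(fragment)
```

A downstream consumer, such as an MT pre-processor, can then look up each collected IRI before deciding how to translate the span.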
" HYPERLINK "http://enrycher.ijs.si/mlw/en/entityType.html5its2"http://enrycher.ijs.si/mlw/en/entityType.html5its2 Returns:Welcome to London
as its output. The implementation of the ITS 2.0 processing code used in this demonstration is available at https://github.com/tadejs/enrycher-its20 [3].

6. Conclusions

This deliverable describes the process of standardizing the output of natural language processing components, its result, and a reference implementation of the text analysis and domain data categories. We outline the steps that were necessary to standardize this effort, as well as its final form. We provide recommendations for the implementation of the processing pipeline, as well as recommendations for datasets that can be used as knowledge bases.

References

1. Multilingual Web LT working group: ITS 2.0 Requirements, http://www.w3.org/International/multilingualweb/lt/wiki/Requirements, 2012.
2. Multilingual Web LT working group: The ITS 2.0 Test Suite, https://github.com/finnle/ITS-2.0-Testsuite/, 2013.
3. T. Štajner: Enrycher-ITS2.0, https://github.com/tadejs/enrycher-its20, 2013.
4. S. Hellmann, J. Lehmann, S. Auer, and M. Brümmer. Integrating NLP using Linked Data. 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
5. H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
6. A. K. McCallum. MALLET: A Machine Learning for Language Toolkit.
HYPERLINK "http://mallet.cs.umas s . e d u / " h t t p : / / m a l l e t . c s . u m a s s . e d u . 2 0 0 2 . T a d e j `t a j n e r , T o m a ~ E r j a v e c a n d S i m o n K r e k . R a z p o z n a v a n j e i m e n s k i h e n t i t e t v s l o v e n s k e m b e s e d i l u ; I n P r o c e e d i n g s o f 1 5 t h I n t e r n a t i o n M u l t i c o n f e r e n c e o n I n f o r m a t i o n S o c i e t y - J e z i k o v n e T e h n o l o g i j e 2 0 1 2 , L j u b l j a n a , S l o v e n i a T a d e j `t a j n e r a n d D u n j a M l a d e n i . 2 0 0 9 . E n t i t y R e s o l u t i o n i n T e x t s U s i n g S t a t i s t i c a l L e a r n i n g a n d O n t o l o g i e s . I n P r o c e e d i n g s o f t h e 4 t h A s i a n C o n f e r e n c e o n T h e S e m a n t i c W e b ( A S W C ' 0 9 ) , A s u n c i n G m e z - P r e z , Y o n g Y u , a n d Y i n g D i n g ( E d s . ) . S p r i n g e r - V e r l a g , B e r l i n , H e i d e l b e r g , 9 1 - 1 0 4 . R u s u , D . , S t a j n e r , T . , D a l i , L . , F o r t u n a , B . a n d M l a d e n i c , D . 2 0 1 0 . E n r i c h i n g T e x t w i t h R D F / O W L E n c o d e d S e n s e s . D e m o . 9 t h I n t e r n a t i o n a l S e m a n t i c W e b C o n f e r e n c e ( I S W C 2 0 1 0 ) . S h a n g h a i , C h i n a . G r o b e l n i k , M a r k o , a n d D u n j a M l a d e n i . " S i m p l e c l a s s i f i c a t i o n i n t o l a r g e t o p i c o n t o l o g y o f w e b d o c u m e n t s . " J o u r n a l o f C o m p u t i n g a n d I n f o r m a t i o n T e c h n o l o g y 1 3 . 4 ( 2 0 0 4 ) : 2 7 9 - 2 8 5 . `t a j n e r , T . , R u s u , D . , D a l i , L . , F o r t u n a , B . , * + , - N R _ t w 1 G J i { 4 H I v ¼ h mH sH h hQ08 mH$sH$ h mH$sH$ h hQ08 5mH sH hQ08 mH sH hQ08 CJ hQ08 CJ hQ08 5CJ hQ08 5CJ hQ08 5CJ hQ08 5j hQ08 5U j hQ08 UmH nH sH tH hQ08 4 O P Q R ` u v w dd $If &