LD4LT Group Leipzig September 2014 Meeting

 "Linked Data, Language Technologies and Multilingual Content Analytics" 4th LIDER roadmapping workshop
  
 1st and 2nd September, Leipzig, Germany
 Part of SEMANTiCS 2014. 
 Venue information

Following the successful LD4LT Group Kick-Off and Roadmap Meeting, the LD4LT Group Madrid May 2014 Meeting and the LIDER workshop as part of FEISGILTT 2014, this event will focus on linked data and content analytics.

The 4th LIDER Roadmapping Workshop will take place on September 2nd. Its goal is to gather input from experts and stakeholders in the area of content analytics as a basis for a European R&D and Innovation Roadmap. This roadmap will help the European Commission prioritize future R&D activities and will thus have a direct impact on the calls the EC issues in the context of H2020. The main objective is to identify areas and tasks in content analytics where Linked Data & semantic technologies can contribute. The roadmapping workshop will be part of the MLODE Workshop and thus co-located with the SEMANTiCS conference. It will feature a number of talks by industrial players in the field of content analytics and human language technologies, as well as a final discussion.

The 4th LIDER Roadmapping Workshop will be preceded by a hackathon on the 1st of September.

The event is supported by the LIDER EU project, the MultilingualWeb community, the NLP2RDF project as well as the DBpedia Project.

Detailed information

See the roadmapping workshop and the hackathon pages.

Registration

Participants need to use the SEMANTiCS registration.

Organisers

The LD4LT meeting will be organised by the LIDER Project:

Philipp Cimiano (University of Bielefeld)

Matthias Hartung (University of Bielefeld)

Sebastian Hellmann (University of Leipzig)

Report

Introduction

This report gives a summary of the 4th LIDER roadmapping workshop, which took place on 2nd September 2014 as part of the SEMANTiCS conference pre-program and of MLODE 2014. For more information, take a look at the workshop program.

The main objective was to identify areas and tasks in content analytics where Linked Data & semantic technologies can contribute. Since numerous companies presented their use cases, particular input was gathered on the specific needs of companies, businesses and enterprises. In what follows, a summary of each talk is presented.

Contributions

Welcome and introduction

Philipp Cimiano (University of Bielefeld) opened the workshop. After a short introduction to the EU LIDER project, he outlined the goals of the workshop. The main focus was on the use of, and requirements for, linguistic linked open data in the business and industry sector. Thus, the two main objectives for the workshop day were:

  • Identification of areas and tasks in content analytics where Linked Data & semantic technologies can contribute.
  • Gathering input from experts and stakeholders in the area of content analytics as a basis to define a European R&D and Innovation Roadmap for the European Commission that will help the EC to prioritize future R&D activities.


Tatiana Gornostay: Language Meets Knowledge in Digital Content Management

Tatiana Gornostay (Tilde) reported on closing the gap between language and knowledge in digital content management. The main goal is to bring innovation to the market for human professionals and machine users by improving communication between engines. Terminology management shall be opened up to broader applications in content management, which include, but are not limited to, machine translation. One main challenge is seen in terminology management, which primarily grounds concepts in linguistic expressions alongside the existing knowledge. The speaker emphasized that terminology should be regarded not only in the context of language but also in the context of content management and enrichment. As a basis for this, the creation of rich content that is multilingually and semantically linked to data is needed. This concept-based approach to enriching existing content is expected to yield higher-quality terminology, which will ultimately save time and resources.


Ilan Kernerman: Generating Multilingual Lexicographic Resources

Ilan Kernerman (K Dictionaries Ltd, Tel Aviv) introduced K Dictionaries Ltd, which has a long tradition in lexicographic dictionary development. He shared his experience of the transition, driven by technological developments, from traditional dictionaries to multilingual datasets, data management and software engineering, architectures and design. Today the K Dictionaries Ltd resources comprise multilingual databases for over 20 major and some minor languages, including linguistic information on morphology and pronunciation, as well as lexicographic editorial tools and applications. The main focus lies on the quality of the language data; hence, the data is first collected and edited manually by native speakers to build monolingual datasets, which are then extended and connected via automatic translations to form bi- and multilingual datasets. The main goal is to derive value from traditional lexicography for applications such as machine translation, e-learning, word processing, text mining and search engines. The use of linguistic linked open data is attractive because of its inherently interconnected nature and the vast amount of available language data. However, the integration of this data suffers from the mediocre quality of automatically created content. The challenge is to arrive at automatically generated high-quality content that can cope with the central problem of resolving complex cross-linguistic relations, which rarely have a 1:1 equivalence (for instance in compound words), as well as extending the few existing quality-sensitive domains, e.g. education and healthcare, which are already interested in high-quality linguistic data.


Heiko Ehrig: Resources! Resources! Resources!

Heiko Ehrig (Neofonie) briefly introduced the company. It has shifted from developing search engines to web and mobile application development and consulting, including interaction design, testing and data analytics. Neofonie developed a German text mining API that performs classification, keyword detection, entity and date detection, named entity recognition (NER), and quote extraction (API key available via http://bit.ly/txtwerk). Based on their experience with NLP and linked data, they point to the following issues (a usage sketch follows the list):

  • extension of entity types
  • building more individual customer lexica and sentiment detection
  • broadening LD and NLP to more languages than English
  • development of a gold standard for German NER
  • discussion of a standardized text mining API
  • support of open data and open licenses
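To make the API discussion above more concrete, the following minimal sketch shows how such a text mining service is typically called over HTTP. The endpoint URL, header name and response fields are assumptions for illustration only and do not describe the actual txtwerk interface; consult the service documentation for the real API.

 import requests

 # Hypothetical endpoint and response schema -- for illustration only,
 # NOT the actual txtwerk API.
 API_URL = "https://api.example.org/textmining/analyze"
 API_KEY = "YOUR_API_KEY"  # an API key can be requested via http://bit.ly/txtwerk

 def analyze(text):
     """Send German text to the (assumed) text mining endpoint."""
     response = requests.post(
         API_URL,
         headers={"X-Api-Key": API_KEY},
         json={"text": text, "language": "de"},
     )
     response.raise_for_status()
     return response.json()

 result = analyze("Angela Merkel besuchte gestern Leipzig.")
 # Assumed response layout: a list of entities with surface form and type.
 for entity in result.get("entities", []):
     print(entity["text"], entity["type"])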


Mark Zöpfgen: Software-Supported Bibliographic Recording and Linked Data

Mark Zöpfgen (German National Library) presented the library's activities in content extraction and the semantic web. The library maintains the German National Bibliography, which contains all national print and electronic publications since 1913, and produces an authority file with metadata (the GND, "Gemeinsame Normdatei"). Activities in content extraction and the semantic web comprise several projects, in which an ontology is built for generating the data and enabling multilingual access to subjects, in order to make the German National Library internationally accessible. Manual effort is also invested in providing high-quality translations of the subject headings of the bibliographic records into English and French. So far, a Linked Open Data service for distributing the data is available, and the data can be downloaded in RDF format under a Creative Commons Zero (CC0) license (a short example follows the list below). The main goals of the German National Library comprise the following topics:

  • constant improvement of the poor formal state of the (content-wise highly reliable) bibliographic data
  • building an integrated portal with search engine and linked data
  • integration of German bibliographic data into The European Library and finding standards for its provision in linked data format
  • increasing the precision of multilingual term mappings, given that there is rarely a 1:1 match
  • motivating external parties to work with the RDF data and to improve search possibilities
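The CC0-licensed RDF data mentioned above can be explored with standard RDF tooling. Below is a minimal sketch using the Python rdflib library; the filename is a placeholder for a locally downloaded dump.

 from rdflib import Graph

 # Placeholder filename for a locally downloaded RDF dump of the
 # German National Library's linked data (published under CC0).
 g = Graph()
 g.parse("dnb_dump.ttl", format="turtle")

 print(len(g), "triples loaded")

 # Inspect a few triples of the bibliographic metadata.
 for subject, predicate, obj in list(g)[:10]:
     print(subject, predicate, obj)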


Massimo Romanelli: Social Media Monitoring: from Sentiment to Intention

Massimo Romanelli (Attensity Europe GmbH) introduced the company, which provides analytics for customer engagement by retrieving conversational information from social media platforms such as Google and Twitter. He presented the LARA (Listen, Analyze, Relate, and Act) paradigm, around which a complex enterprise solution suite (http://www.attensity.com/products/) has been developed. Attensity Q exploits external resources for existing classifications via linked data. Attensity Analyze then combines the social data with internal data using NLP tools and text analytics. Finally, Attensity Respond displays straightforward topic metrics which suggest what the customer might want. The main goal is to detect the customer's intention from the sentiment expressed in the text and to be able to react accordingly. Even though Attensity already makes successful use of NLP engines, knowledge engineering (Lingware) and annotated documents, more resources are needed to expand the vertical domains in order to identify the intention of a user. Such resources encompass not only more data but also a model that represents pragmatic implications and could therefore be used to create the correct query for a given customer need.


Marc Egger: Text Analytics for Brand Research - Non-reactive Concept Mapping to Elicit Consumer Perception

Marc Egger (Inius) talked about brand research in the context of product development in companies. On the basis of text analytics applied to consumer social media content, concept maps for market research are developed. The aim is to find out what consumers think about products, brands and general topics via NLP tools that detect, collect and analyze textual consumer content from the web. As an example, work with the brand concept map was presented. From this map, the customers' associations are turned into a network representation that is then analyzed according to i) strength, ii) favorability, iii) uniqueness and iv) patterns of thought. The analytics software used to elicit consumer perceptions could be improved with regard to textual data processing in various aspects. These include:

  • refining POS tagging and dependency parsing for speech-like written language such as forum posts, for more accurate concept candidate detection
  • also covering intra-article topic relevance
  • facing aggregation challenges such as spelling mistakes (burger = burgr) and synonymous concepts (tasty burger = delicious burger); a small normalization sketch follows this list
  • increasing accuracy in ratings of topic relevance by providing high-quality resources for German NER and better German anaphora resolution tools
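As a rough illustration of the aggregation challenge mentioned in the list, the sketch below clusters surface forms by simple string similarity; a real pipeline would additionally use lemmatization and synonym resources, and the threshold value is an arbitrary choice for this example.

 from difflib import SequenceMatcher

 def is_variant(a, b, threshold=0.8):
     """Treat two surface forms as spelling variants if they are similar enough."""
     return SequenceMatcher(None, a, b).ratio() >= threshold

 mentions = ["burger", "burgr", "fries", "burgers"]

 clusters = []
 for mention in mentions:
     for cluster in clusters:
         if is_variant(mention, cluster[0]):
             cluster.append(mention)
             break
     else:
         clusters.append([mention])

 print(clusters)  # [['burger', 'burgr', 'burgers'], ['fries']]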


Alessio Bosca: Linked Data for Content Analytics in CELI

Alessio Bosca (CELI) presented how CELI is exploiting linked data. Their focus is on speech applications, semantic search, text analytics, opinion mining and social media intelligence. The core technology encompasses language processing steps such as language identification, morphological analysis and semantic analysis. CELI exploits the linked data in the LOD cloud a) as a user, by making use of it for NER, and b) as a provider, for internal use and for crafting RDF artifacts. Two projects were addressed: a book project for the digital humanities and the Homer project for multilingual interfaces to accessing data from different public administrations. From the work with linked open data, the LOD cloud community is advised to put more emphasis on truly linking the datasets. With regard to the public sector, it is suggested that more data should be published as linked open data and that international standards should be used. The issue of publishing companies' linked data under an open license was also addressed. The speaker made the point that, besides a resistance to sharing due to valid competitive concerns, company data is generally over-fitted to their own solutions and clients. In other words, companies need to be able to manage 'micro-domains', which are regarded as less useful in general. As a compromise, the audience suggested that companies should not answer the question of why they do not publish their linked data, but rather of what they could publish.
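The kind of LOD lookup described above can be sketched with a query against the public DBpedia SPARQL endpoint, here fetching the English abstract of a recognized entity; the entity and property are chosen purely for illustration.

 from SPARQLWrapper import SPARQLWrapper, JSON

 # Public DBpedia endpoint; the entity "Leipzig" is only an example of
 # enriching a recognized named entity with LOD information.
 sparql = SPARQLWrapper("https://dbpedia.org/sparql")
 sparql.setQuery("""
     PREFIX dbo: <http://dbpedia.org/ontology/>
     SELECT ?abstract WHERE {
         <http://dbpedia.org/resource/Leipzig> dbo:abstract ?abstract .
         FILTER (lang(?abstract) = "en")
     }
 """)
 sparql.setReturnFormat(JSON)

 results = sparql.query().convert()
 for row in results["results"]["bindings"]:
     print(row["abstract"]["value"])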


Oscar Muñoz: Content Analytics for Media Agencies

Oscar Muñoz (HAVAS Media) briefly introduced the company. HAVAS Media is an agency that offers market studies which extend traditional market surveys with social media analysis. The presentation mainly focused on how relevant touch points between brands and consumers can be established. To this end, consumer profiles are created by gathering information on the awareness, evaluation, purchase intention and post-purchase experience (e.g. integrated sentiment analysis with UPM) of the user. This intelligent consumer profiling further includes time series analysis of relations between social buzz and advertising pressure, e.g. event detection and explanation. Another method used is social graph analysis, which detects influencers, brand ambassadors, detractors, viralisers and content propagation. Due to the large volume of data sources (e.g. thousands of brands) as well as their heterogeneity, several challenges arise. These are summarized in the following questions:

  • How is it possible to associate different data sources from social media, search engine marketing, customer data, site analytics, offline advertising and digital display advertising?
  • Can we arrive at Big Linked Data integrating multiple heterogeneous and unstructured data sources at scale?
  • How can we tackle the problem of the variety and velocity of data sources to decrease the integration costs which are incurred by the lack of social media formats?
  • In what ways must a consumer connection platform be improved to enable a cross platform information tracking that is able to infer online and offline user behavior?
  • How can high accuracy be ensured given the rising complexity resulting from multilingual processing?


Andreas Nickel: Applicated Insights: Computational Linguistics and Semantic Analysis as Part of Business Workflows

Andreas Nickel (Ferret Go) reported on the challenges of content analytics at the Ferret Go startup, which has existed since 2012. They mainly deal with media and textual resources such as articles, reviews and other online text, and provide structured content analysis for their customers. Heterogeneous sources, e.g. newspaper reader comments and social media, are especially challenging, which is why some work is still done manually. Structure in unstructured data is discovered by applying computational linguistics. Four use cases were presented:

  1. Insights in community management via fast moderation, i.e. real-time analysis of readers' comments for bild.de.
  2. Insights in opinion management by tracking users' opinions and analyzing customer feedback in unstructured hotel reviews.
  3. Feedback dispatching for commerce and industry, e.g. customer relationship integration and workflow prioritization.
  4. Deep content mining, e.g. topic detection over long periods of time.

The conclusions Ferret Go could draw from previous work are:

  • more accurate ways of automated content analytics must be found, since the manual effort is too high and the quality of crowdsourced results is questionable as well
  • companies often do not know what to do with analytics; aid must be provided to help clients decide how they can react
  • 100% accuracy is still not attainable, so one should be aware of that fact, while effort should simultaneously be invested in getting closer to 100% in the future
  • content analytics could be facilitated if clients take up the advice to store only selected, potentially relevant data rather than saving everything


Patrick Bunk: Setting them up for Failure – How Customer Expectations Collide with Economic Realities of Text Analytics

Patrick Bunk (uberMetrics) talked about customer expectations with regard to text analytics. He outlined the functions of internal and external data within companies. The former is used for knowledge management and business intelligence, whereas the latter mainly serves as the analytical basis for search engines and market intelligence. Working in the fields of (social) media monitoring and sentiment analysis, uberMetrics reports that their clients handle a mean of 500k articles per month. From previous experience it can be concluded that expectation gaps and varying quality over time and domains are recurring issues. Both are addressed by focussing on the economic realities, which means that expectation fulfillment and quality are strongly connected to the different pricing of the various analytical approaches: free for automated analysis with 70-80% accuracy, about 1 Euro per article for manual work, tailor-made solutions that train a customer model at the cost of employing one person for one year, and crowd-based tagging at about 0.05 Euro per article (a rough cost comparison follows the list below). Finally, with respect to the economic and quality aspects of text mining tasks, the following suggestions were proposed:

  • coping with failure gracefully
  • focus on generalized solutions
  • testing algorithms on humanities majors
  • be aware of manual labor substitute
  • tailor-made mining is at a local maximum pre scalable product
  • automation through knowledge should be socially beneficial
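As a back-of-the-envelope illustration of the economics described above, the figures quoted in the talk (a mean volume of 500k articles per month, roughly 1 Euro per article for manual analysis, about 0.05 Euro per article for crowd-based tagging) translate into very different monthly costs:

 # Figures quoted in the talk; the calculation itself is only illustrative.
 articles_per_month = 500_000
 manual_eur_per_article = 1.00
 crowd_eur_per_article = 0.05

 print("manual analysis: %8.0f EUR/month" % (articles_per_month * manual_eur_per_article))
 print("crowd tagging:   %8.0f EUR/month" % (articles_per_month * crowd_eur_per_article))
 # Automated analysis is free of per-article cost but only reaches about 70-80% accuracy.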


Dirk Goldhahn: Introduction to the German Wortschatz Project

Dirk Goldhahn (University of Leipzig, NLP group) was the only speaker presenting a linguistic dataset from the academic field. He introduced the Leipzig Corpora Collection. The dataset comprises corpus-based full-form monolingual dictionaries for more than 220 languages, which come with a variety of metadata, e.g. word frequencies, POS tags and co-occurrences. Furthermore, the corpora are enriched with statistical annotations such as POS, topics, word frequencies and co-occurrence frequencies. At the moment the NLP group is working on converting their data into a linked data format. At the same time, integration of external sources still needs to be done.
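As a sketch of what such a conversion could look like, the snippet below expresses one word form with a frequency and a POS tag as RDF using rdflib; the namespace, properties and numbers are purely illustrative and not the project's actual vocabulary.

 from rdflib import Graph, Literal, Namespace
 from rdflib.namespace import RDF, XSD

 # Illustrative namespace and properties -- not the Wortschatz project's vocabulary.
 EX = Namespace("http://example.org/wortschatz/")

 g = Graph()
 g.bind("ex", EX)

 word = EX["deu/Haus"]
 g.add((word, RDF.type, EX.WordForm))
 g.add((word, EX.writtenRep, Literal("Haus", lang="de")))
 g.add((word, EX.frequency, Literal(12345, datatype=XSD.integer)))  # made-up count
 g.add((word, EX.partOfSpeech, Literal("NN")))

 print(g.serialize(format="turtle"))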


Michael Wetzel: Towards the Single Digital Market – Processing Knowledge, Independent from Language

Michael Wetzel (Coreon) focussed on the management of language resources that can be used in different applications. It was observed that knowledge is mainly made accessible through multilingual language data and is hence forced to remain in knowledge silos. The approach taken to open up knowledge access is to discard string-driven search/access, because it is bound to fail given that one and the same object has multiple expressions. Rather, one has to search for the thing instead of the string! This can be achieved by a fusion of concepts and multilingual terminology. For this purpose Coreon has developed knowledge software that establishes a knowledge map starting from a multilingual terminology list. The primary challenge to be tackled is the need to bridge the various format standards TBX, SKOS and OWL.
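The 'thing, not string' idea can be sketched as a single SKOS concept that carries preferred labels in several languages, so that any of the language-specific strings resolves to the same concept; the concept URI and labels below are illustrative.

 from rdflib import Graph, Literal, Namespace
 from rdflib.namespace import RDF, SKOS

 # Illustrative concept URI; in practice it would live in a shared
 # terminology / knowledge repository.
 EX = Namespace("http://example.org/concepts/")

 g = Graph()
 g.bind("skos", SKOS)

 concept = EX["brake_disc"]
 g.add((concept, RDF.type, SKOS.Concept))
 g.add((concept, SKOS.prefLabel, Literal("brake disc", lang="en")))
 g.add((concept, SKOS.prefLabel, Literal("Bremsscheibe", lang="de")))
 g.add((concept, SKOS.prefLabel, Literal("disque de frein", lang="fr")))

 # Searching for the thing: every language-specific label maps back to one URI.
 for label in g.objects(concept, SKOS.prefLabel):
     print(label, label.language)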


Requirements Gathering, Use Cases and Key Points of the Workshop

Philipp Cimiano closed the workshop by presenting a summary of the most discussed topics. Overall, most participants agreed that the issues of creating more standards as well as ensuring working links within the LOD cloud should receive more emphasis in the linked data communities. Further topics regarded as central work items for the LIDER project were identified by a significant number of participants. These are summarized as key points below:

  • Sharing of linked data involving a cooperative data curation
  • Providing more resources for micro-domains that generalize and can be shared
  • Avoid knowledge silos by placing more emphasis on linking, both in communities and in enterprises
  • Focus on more high-quality open data
  • Work on deeper analysis and more semantics to enable semantic search for things, rather than strings
  • Clarification of what accuracy rates of linked data analytics are reasonable for clients with high statistical result expectations