XML Prague 2014 report

From Linked Data for Language Technology Community Group

Background

Why Discussing Content Analytics at XML Prague?

This report summarizes a session held by Felix Sasaki at the XML Prague 2014 conference. The session was dedicated to the topic of XML, Semantic Web and Content Analytics. Technologies like XML and RDF are rarely discussed in the same context. Practically speaking, XML tools that process RDF are not yet common, but they do exist; see below for more information.

With this background, it came as a surprise that the session was crowded. Between 40 and 50 people attended and gave feedback on various aspects of using XML and Semantic Web technologies in content analytics applications. Key discussion points are summarized below.

The session title uses the term Semantic Web. During the session, the term Linked Data was also used to refer to the ability to represent machine-readable, interlinked information on the Web.

XML Prague is a conference series with a great variety of attendees. Many, but not all, are "geeks": they are interested in real code and tool demonstrations. They also share a strong interest in XML, but in recent years more and more other technologies have been discussed at XML Prague. The 2014 edition of the conference also had presentations in the main program on the Semantic Web, browser technologies, layout on the Web and many other topics.

The LIDER Project

The content analytics session was funded by the LIDER project. LIDER aims at building a community around the topic of content analytics and linked data, with a focus on linguistic linked data, that is, linked data representations of language resources (e.g. lexica) needed for natural language processing tasks.

LIDER is reaching out to various industries and research communities via the W3C Linked Data for Language Technology (LD4LT) community group. The organization of the XML Prague session aimed at raising awareness of content analytics and linked data topics in the XML community. LD4LT is not the only group in W3C working on multilingual topics. Under the umbrella of the W3C MultilingualWeb brand there are various other initiatives; an overview page provides further information.

The Session

In the session, a small set of slides helped to introduce the topic. During the oXygen Users Meetup at XML Prague, a demo showed how to integrate automatic entity annotation functionality into the oXygen XML editor.

The main part of the content analytics session was an interactive discussion. Key points are summarized below.

Target Audiences: Who needs to know about Content Analytics?

A recurring question during the session was: what type of user actually needs to know about content analytics and linked data / semantics? Depending on the user in question, requirements for tooling and usage scenarios differ.

Developers of XML editing tools may want to add basic functionality to their tools, like the aforementioned semi-automatic entity annotation. An important usage scenario for such annotations is to provide context for content authors and translators: entity annotations can help them to disambiguate the meaning of a content item easily, which can save time in translation processes.

Some people may be called content architects. They are not dealing with a single piece of content or document, but with larger volumes. Two types of content architects can be differentiated: people who technically set up the actual processing chain for potentially thousands of documents, and people who add value to large document sets. The former may want to add functionality to the tools used by the latter, e.g. a way to decide which documents need specific review before publication, or a way to categorize documents (semi-)automatically.

The classical, manual counterpart of such categorization tasks is done by human topic indexers. With the vast amount of data to be processed today, this job does not scale anymore. The semi-automatic approach to categorization seems promising for this group of people. It allows them to work with general or domain- and project-specific controlled vocabularies and to apply these to vast amounts of content. However, topic indexers still lack the tooling to work in this manner without becoming software programmers.
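As an illustration (not presented in this form at the session), such a controlled vocabulary can itself be published as linked data, for instance using the W3C SKOS vocabulary; the concepts and IRIs below are purely illustrative.

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <!-- One concept of a small, project-specific vocabulary -->
    <skos:Concept rdf:about="http://example.org/vocab/tablet-computer">
      <skos:prefLabel xml:lang="en">tablet computer</skos:prefLabel>
      <skos:broader rdf:resource="http://example.org/vocab/mobile-device"/>
    </skos:Concept>
  </rdf:RDF>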

In general, content architects, both on the technical side and on the manual or semi-automatic content processing side, so far have little knowledge about linked data or content analytics technologies. They also rarely know about the potential of content analytics application scenarios. But things are starting to change. In the XML Prague main conference, a presentation by Charles Greer showed how RDF data can be stored and queried within a major XML database. The challenge now is to educate database users: they need to know both SPARQL and XQuery to be able to work with this solution.
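A minimal sketch of what such a combined query could look like, assuming a database that exposes SPARQL to XQuery through a function; the sem:sparql name, its namespace and the vocabulary IRIs below are hypothetical and not tied to a specific product.

  xquery version "3.0";
  (: Hypothetical function: runs a SPARQL query over the stored triples
     and, in this sketch, returns the bound values, here document URIs. :)
  declare namespace sem = "http://example.org/semantics";

  let $uris := sem:sparql("
    SELECT ?doc WHERE {
      ?doc <http://example.org/vocab#mentions>
           <http://dbpedia.org/resource/Prague> .
    }")
  (: Plain XQuery over the documents found via the semantic query :)
  for $uri in $uris
  return fn:doc($uri)/article/title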

Usability of Content Analytics Tools

This aspect brings up the next topic: usability of content analytics tools. The aforementioned entity annotation example has the advantage that a user does not need to know anything about RDF, linked data or the automatic annotation process. The functionality can be used in the WYSIWYG environment of an XML editor. However, in practice many content producers do not even use such editors, but rely on word processors.

There are two approaches one could take from here: first, to enhance the usability of content analytics solutions, and second, to integrate these solutions into the tools commonly used by content authors (working with one document) or by content architects (working with large volumes). The previous section made clear that adding entity annotation to an authoring tool touches just the tip of the iceberg; many other parts of the content production tool chain that are handled or at least set up by content architects need to be adapted. Examples are CMS systems, publishing pipelines, automatic typesetting tools, or content integration portals.

Various session participants pointed out that more and more publishing houses have started to look into (semi-)automating metadata creation. The term metadata is used here to describe any kind of content-related information. In this sense, the outcome of a content analytics process leads to content enriched with various types of metadata. That enrichment may happen on the level of a word, a paragraph, a document or even a document collection. And all kinds of metadata may benefit from manual refinement.

Workflows and Interplay between Content related Technologies

For real deployment of content analytics solutions, it is important to integrate them into the appropriate part(s) of the content production workflow. The aforementioned XML database makes it possible to integrate semantic information into the XML data itself and to use SPARQL and XQuery at the same time for querying. A blog post provides further information about how this works technically. In this scenario, it is assumed that the actual content creation is finished. The database is then processed by the content architect or by an end user.
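As a rough sketch of the general idea (the element names are illustrative, not the actual storage format of the database mentioned above), content and semantic statements about it can live in the same XML document:

  <article xmlns:sem="http://example.org/semantics">
    <title>Tablet computers in the enterprise</title>
    <body>
      <p>Tablet computers are increasingly used ...</p>
    </body>
    <!-- RDF statements stored alongside the content they describe -->
    <sem:triples>
      <sem:triple>
        <sem:subject>http://example.org/articles/123</sem:subject>
        <sem:predicate>http://purl.org/dc/terms/subject</sem:predicate>
        <sem:object>http://dbpedia.org/resource/Tablet_computer</sem:object>
      </sem:triple>
    </sem:triples>
  </article>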

Many participants in the session pointed out that a workflow including content analytics processes needs to allow for human intervention before producing final results. The previous section gave examples of why this is needed for content categorization. Such intervention has the potential to improve the content analytics processes themselves. However, the session participants were not sure whether the algorithms used in current content analytics tools are able to incorporate such feedback loops.

The final workflow aspect discussed was related to snapshots of semantic interpretation. Depending on what static or dynamic semantic resources are used, the outcome of a content analytics process may differ. An example is the Wikipedia categorization of a tablet computer. The first version of the related Wikipedia page was created in 2006. The current version of the page categorizes a tablet computer as a kind of mobile device. But the tablet computer definition itself and this kind of categorization are not available in pre-2006 Wikipedia data. For such reasons, a user of content analytics tools not only wants to be aware of the relevant semantic resources, but also needs to know their temporal dimension.

Data Aspects in Content Analytics Applications

Data Formats and Storage

The previous discussion on XQuery and SPARQL touched upon the aspect of storing semantic information in various tools and various parts of the content production workflow. The RDF/XML syntax provides a standardized way to store RDF as XML and as part of XML content items. However, the previously mentioned XML database does not use RDF/XML, but rather a proprietary approach to storing sets of RDF triples.
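For comparison, the same kind of statement as in the earlier sketch could be expressed in standard RDF/XML as follows (the article IRI is illustrative):

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dcterms="http://purl.org/dc/terms/">
    <!-- The article is about the DBpedia resource "Tablet computer" -->
    <rdf:Description rdf:about="http://example.org/articles/123">
      <dcterms:subject rdf:resource="http://dbpedia.org/resource/Tablet_computer"/>
    </rdf:Description>
  </rdf:RDF>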

Another piece of information that may need (further) standardization is how to store the results of content analytics processes. The previously mentioned outcome of entity annotation processes can be stored as ITS 2.0 Text Analysis information, see a related example. However, ITS 2.0 defines Text Analysis in a rather broad sense and does not provide fine-grained information that may be specific to a certain type of content analytics process (opinion mining, sentiment analysis, document categorization etc.). Further standardization may be needed.
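A minimal sketch of such an annotation using the ITS 2.0 Text Analysis attributes; the element names and the entity and class IRIs are illustrative.

  <p xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0">
    The conference takes place in
    <span its:taClassRef="http://nerd.eurecom.fr/ontology#Location"
          its:taIdentRef="http://dbpedia.org/resource/Prague">Prague</span>.
  </p>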

An important lesson to learn from ITS 2.0 is that information about content analytics is only useful if information about the analytics processing tools is also available. One reason is that tool output is difficult to compare, e.g. in terms of quality or automatically generated confidence scores. ITS 2.0 provides a Tools Annotation mechanism to identify the tools involved in producing analytics or other kinds of information.
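Continuing the sketch above, the tool that produced the Text Analysis annotation can be recorded with the its:annotatorsRef attribute (the tool IRI below is hypothetical):

  <p xmlns:its="http://www.w3.org/2005/11/its" its:version="2.0"
     its:annotatorsRef="text-analysis|http://example.org/tools/entity-tagger">
    ... content annotated as in the previous example ...
  </p>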

Language, Content Domains and Data Sharing

Many participants of the session work with multiple languages and topic domains on a daily basis. From this experience, they pointed out that there is a need not only for general semantic resources, but also for resources specific to the domains and languages in question. Hence, there should be efforts to build high-quality, curated, domain-specific and multilingual semantic resources.

Especially in the realm of public open data, such resources are more and more being created. A key challenge is to find business models that demonstrate the value of data and that encourage people from both the public and the private sector to share their data. Content analytics solutions have the potential to become a catalyst for open data applications, but this still has to be proven by example.

Education about Content Analytics and Linked Data

Overall, the session clearly showed that levels of knowledge about linked data and content analytics differ widely. In this respect, several participants pointed out that the BBC has great examples that demonstrate the general value of semantic information. These can be used as a basis to educate people who are not aware of content analytics and linked data at all. But more education is needed, especially for demonstrating the role of linked data in more complex content analytics applications like sentiment analysis or opinion mining.

The session promoted both the topic of content analytics and that of linked data. Given that deep knowledge in both areas cannot be expected when talking to end users, one may have to put specific effort into raising awareness and making their relationship clear. Especially providing clear and simple answers to the question "Why should one use linked data for content analytics?" may help to foster industry adoption of linked data based content analytics applications.

Conclusions and Next Steps

The outcome of the XML Prague session can be summarized as follows.

Various groups of users can be identified for linked data aware content analytics tooling, from individual content authors to content architects and indexing specialists working with masses of content. For all these users the usability of content analytics tools is of high importance. Tooling needs to be available in the right part of the content production workflow. Addressing certain standardization challenges can help the interoperability of metadata produced by content analytics applications. These applications only add value for the end user if they are tailored to selected domains and languages. In that way, content analytics also has the potential to become a booster for business models demonstrating the value of public open data.

The XML Prague session was only a small event. In the coming years, the LIDER project will organize many more opportunities to learn about content analytics and linguistic linked data. This includes tutorials dedicated to distinct target audiences, and sessions to gather feedback at various research and industry conferences.

An overview of upcoming events can be found at http://lider-project.eu/?q=content/next-events.

The LD4LT group serves as a forum to summarize the outcomes of these events and to build a roadmap for upcoming work on content analytics and linked data. A first dedicated survey aims at gathering feedback on use cases for content analytics, the role of linked data and other areas.

Comments

This document does not provide information set in stone; please send comments to the LD4LT public mailing list and consider joining the group.