Authors: ChristianHalaschekWiener, ThierryDeclerck, RaphaëlTroncy

Use Case: Searching and Presenting News in the Semantic Web


1. Introduction

More and more news is produced and consumed each day. News generally consists mainly of textual stories, which are increasingly illustrated with graphics, images and videos. News can be further processed by professionals (newspapers), made directly accessible to web users through news agencies, or automatically aggregated on the web, generally by search engine portals and not without copyright problems.

To ease the exchange of news, the International Press Telecommunications Council (IPTC) is currently developing the NewsML G2 Architecture (NAR), whose goal is to provide a single generic model for exchanging all kinds of newsworthy information, thus providing a framework for a future family of IPTC news exchange standards [1]. This family includes NewsML, SportsML, EventsML, ProgramGuideML and a future WeatherML. All are XML-based languages used for describing not only the news content (traditional metadata), but also its management and packaging, and the exchange itself (transportation, routing).

However, despite this general framework, interoperability problems can occur. News is about the world, so its metadata might use specific controlled vocabularies. For example, IPTC itself is developing the IPTC News Codes [2], which currently contain 28 sets of controlled terms. These terms will be the values of the metadata in the NewsML G2 Architecture. The news descriptions often refer to other thesauri and controlled vocabularies, which might come from the industry (for example, XBRL [18] in the financial domain), and all are represented using different formats. From the media point of view, the pictures taken by the journalist come with their EXIF metadata [3]. Some videos might be described using the EBU format [4] or even with MPEG-7 [5].

We illustrate these interoperability issues between domain vocabularies and other multimedia standards in the financial news domain. For example, the Reuters Newswires [8] and the Dow Jones Newswires [12] provide categorical metadata associated with news feeds. The particular vocabularies of category codes, however, have been developed independently, leading to clear interoperability issues. The general goal is to improve the search and the presentation of news content in such a heterogeneous environment. We provide a motivating example that highlights the issues discussed above, and we present a potential solution to this problem that leverages Semantic Web technologies.

2. Motivating Example

XBRL (eXtensible Business Reporting Language) [18] is a standardized way of encoding financial information about companies, as well as information about their management structure, location, number of employees, etc. XBRL is basically about "quantitative" information in the financial domain, and is based on the periodic reports generated by the companies. But for many Business Intelligence applications, there is also a need to consider "qualitative" information, which is mostly delivered by news articles. The problem is therefore how to optimally integrate information from the periodic reports with the day-to-day information provided by specialized news agencies. Our goal is to provide a platform that allows more semantics in the automated ranking of the creditworthiness of companies. Financial news plays an important role here, since it provides "qualitative" information on companies, branches, trends, countries, regions, etc.

There are quite a few news feed services within the financial domain, including the Dow Jones Newswire and Reuters. Both Reuters and Dow Jones provide an XML-based representation and associate with each article metadata such as date, time, headline, full story, company ticker symbol, and category codes.

2.1. Example 1: NewsML 1 Format

We consider news feeds similar to those published by Reuters [8], where the text of the article is accompanied by metadata in the form of XML tags. The terms in these tags are associated with a controlled vocabulary developed by Reuters and other industry bodies. Below is a sample news article formatted in NewsML 1, which is similar to the structural format used by Reuters. For exposition, the metadata tags associated with the article are aligned with those used by Reuters.

<?xml version="1.0" encoding="UTF-8"?>
<NewsML Duid="MTFH93022_2006-12-14_23-16-17_NewsML">
    <Catalog Href="..."/>
    <NewsEnvelope>
        <DateAndTime>20061214T231617+0000</DateAndTime>
        <NewsService FormalName="..."/>
        <NewsProduct FormalName="TXT"/>
        <Priority FormalName="3"/>
    </NewsEnvelope>
    <NewsItem Duid="MTFH93022_2006-12-14_23-16-17_NEWSITEM">
        <Identification>
            <NewsIdentifier>
                <ProviderId>...</ProviderId>
                <DateId>20061214</DateId>
                <NewsItemId>MTFH93022_2006-12-14_23-16-17</NewsItemId>
                <RevisionId Update="N" PreviousRevision="0">1</RevisionId>
                <PublicIdentifier>...</PublicIdentifier>
            </NewsIdentifier>
            <DateLabel>2006-12-14 23:16:17 GMT</DateLabel>
        </Identification>
        <NewsManagement>
            <NewsItemType FormalName="News"/>
            <FirstCreated>...</FirstCreated>
            <ThisRevisionCreated>...</ThisRevisionCreated>
            <Status FormalName="Usable"/>
            <Urgency FormalName="3"/>
        </NewsManagement>
        <NewsComponent EquivalentsList="no" Essential="no" Duid="MTFH92062_2002-09-23_09-29-03_T88093_MAIN_NC" xml:lang="en">
            <TopicSet FormalName="HighImportance">  
              <Topic Duid="t1">  
                <TopicType FormalName="CategoryCode"/> 
                <FormalName Scheme="MediaCategory">OEC</FormalName>  
                <Description xml:lang="en">Economic news, EC, business/financial pages</Description>  
              </Topic>  
              <Topic Duid="t2">  
                <TopicType FormalName="Geography"/>  
                <FormalName Scheme="N2000">DE</FormalName>  
                <Description xml:lang="en">Germany</Description>  
              </Topic> 
            </TopicSet>
            <Role FormalName="Main"/>
            <AdministrativeMetadata>
                <FileName>MTFH93022_2006-12-14_23-16-17.XML</FileName>
                <Provider>
                    <Party FormalName="..."/>
                </Provider>
                <Source>
                    <Party FormalName="..."/>
                </Source>
                <Property FormalName="SourceFeed" Value="IDS"/>
                <Property FormalName="IDSPublisher" Value="..."/>
            </AdministrativeMetadata>
            <NewsComponent EquivalentsList="no" Essential="no" Duid="MTFH93022_2006-12-14_23-16-17" xml:lang="en">
                <Role FormalName="Main Text"/>
                <NewsLines>
                    <HeadLine>Insurances get support</HeadLine>
                    <ByLine/>
                    <DateLine>December 14, 2006</DateLine>
                    <CreditLine>...</CreditLine>
                    <CopyrightLine>...</CopyrightLine>
                    <SlugLine>...</SlugLine>
                    <NewsLine>
                        <NewsLineType FormalName="Caption"/>
                        <NewsLineText>Insurances get support</NewsLineText>
                    </NewsLine>
                </NewsLines>
                <DescriptiveMetadata>
                    <Language FormalName="en"/>
                    <TopicOccurrence Importance="High" Topic="#t1"/>
                    <TopicOccurrence Importance="High" Topic="#t2"/>
                </DescriptiveMetadata>
                <ContentItem Duid="MTFH93022_2006-12-14_23-16-17">
                    <MediaType FormalName="Text"/>
                    <Format FormalName="XHTML"/>
                    <Characteristics>
                        <Property FormalName="ContentID" Value="urn:...20061214:MTFH93022_2006-12-14_23-16-17_T88093_TXT:1"/>
                        ...
                    </Characteristics>
                    <DataContent>
                        <html xmlns="http://www.w3.org/1999/xhtml">
                        <head>
                          <title>Insurances get support</title>
                        </head>
                        <body>
                            <h1>The Senate of Germany wants to constraint the participation of clients to the hidden reserves</h1>
                            <p>
                            DÜSSELDORF The German Senate supports the point of view of insurance companies in a central point of the new law
                            defining insurance contracts, foreseen for 2008. In a statement, the Senators show disagreements with the proposal of
                            the Federal Government, who was in favor of including investment bonds in the hidden reserves, which in the next future 
                            should be accessible to the clients of the insurance companies.
                            ...
                            </p>
                            </body>
                        </html>
                    </DataContent>
                </ContentItem>
            </NewsComponent>
        </NewsComponent>
    </NewsItem>
</NewsML>
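As a minimal sketch of how such metadata could be read by an application, the following Python snippet resolves each TopicOccurrence to the Topic it points to and collects the scheme, code and description. It works on an abbreviated copy of the TopicSet from the listing above; the function name `extract_topics` and the wrapper string `NEWSML1_SAMPLE` are illustrative assumptions, not part of NewsML or of any Reuters tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the TopicSet and DescriptiveMetadata from the listing above.
NEWSML1_SAMPLE = """
<NewsComponent>
  <TopicSet FormalName="HighImportance">
    <Topic Duid="t1">
      <TopicType FormalName="CategoryCode"/>
      <FormalName Scheme="MediaCategory">OEC</FormalName>
      <Description xml:lang="en">Economic news, EC, business/financial pages</Description>
    </Topic>
    <Topic Duid="t2">
      <TopicType FormalName="Geography"/>
      <FormalName Scheme="N2000">DE</FormalName>
      <Description xml:lang="en">Germany</Description>
    </Topic>
  </TopicSet>
  <DescriptiveMetadata>
    <TopicOccurrence Importance="High" Topic="#t1"/>
    <TopicOccurrence Importance="High" Topic="#t2"/>
  </DescriptiveMetadata>
</NewsComponent>
"""


def extract_topics(xml_text):
    """Return one dict per TopicOccurrence, resolved against its Topic."""
    root = ET.fromstring(xml_text)
    # Index the Topic elements by their Duid so "#t1" style references resolve.
    topics_by_id = {t.get("Duid"): t for t in root.iter("Topic")}
    result = []
    for occurrence in root.iter("TopicOccurrence"):
        topic = topics_by_id[occurrence.get("Topic").lstrip("#")]
        formal_name = topic.find("FormalName")
        result.append({
            "type": topic.find("TopicType").get("FormalName"),
            "scheme": formal_name.get("Scheme"),
            "code": formal_name.text,
            "description": topic.findtext("Description"),
            "importance": occurrence.get("Importance"),
        })
    return result


topics = extract_topics(NEWSML1_SAMPLE)
for topic in topics:
    print(topic["scheme"], topic["code"], "-", topic["description"])
```

Note that the meaning of a code such as OEC or DE depends entirely on the scheme it belongs to, which is why the scheme is kept next to the code.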

2.2. Example 2: NewsML G2 Format

The same data, expressed in NewsML G2, looks as follows:

<?xml version="1.0" encoding="UTF-8"?>

<newsMessage xmlns="http://iptc.org/std/newsml/2006-05-01/" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <header>
    <date>2006-12-14T23:16:17Z</date>
    <transmitId>696</transmitId>
    <priority>3</priority>
    <channel>ANA</channel>
  </header>
  <itemSet>
    <newsItem guid="urn:newsml:afp.com:20060720:TX-SGE-SNK66" schema="0.7" version="1">
      <catalogRef href="http://www.afp.com/newsml2/catalog-2006-01-01.xml"/>
      <itemMeta>
        <contentClass code="ccls:text"/>
        <provider literal="Handelsblatt"/>
        <itemCreated>2006-07-20T23:16:17Z</itemCreated>
        <pubStatus code="stat:usable"/>
        <service code="srv:Archives"/>
      </itemMeta>
      <contentMeta>
        <contentCreated>2006-07-20T23:16:17Z</contentCreated>
        <creator/>
        <language literal="en"/>
        <subject code="cat:04006002" type="ctyp:category"/>        <!-- cat:04006002 = banking -->
        <subject code="cat:04006006" type="ctyp:category"/>        <!-- cat:04006006 = insurance -->
        <slugline separator="-">Insurances get support</slugline>
        <headline>The Senate of Germany wants to constraint the participation of clients to the hidden reserves</headline>
      </contentMeta>
      <contentSet>
        <inlineXML type="text/plain">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Insurances get support</title>
  </head>
  <body>
    <h1>The Senate of Germany wants to constraint the participation of clients to the hidden reserves</h1>
    <p>
    ##pRITA LANSCH | DÜSSELDORF The German Senate supports the point of view of insurance companies in a central point of the new law defining insurance contracts, foreseen for 2008. In a statement, the Senators show disagreements with the proposal of the Federal Government, who was in favor of including investment bonds in the hidden reserves, which in the next future should be accessible to the clients of the insurance companies.
...
    </p>
  </body>
</html>
        </inlineXML>
      </contentSet>
    </newsItem>
  </itemSet>
</newsMessage>
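The G2 listing differs from the NewsML 1 one in that all elements live in an XML namespace and categories are given as prefixed codes. The following Python sketch, under the assumption that the namespace URI is the one declared in the listing above, collects the subject codes of each news item. The sample string is an abbreviated copy of the listing, and `subject_codes` and `CATEGORY_LABELS` are hypothetical names introduced for illustration; the two labels are taken from the annotations in the listing.

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared on <newsMessage> in the listing above.
G2_NS = {"g2": "http://iptc.org/std/newsml/2006-05-01/"}

# Abbreviated copy of the NewsML G2 listing.
G2_SAMPLE = """
<newsMessage xmlns="http://iptc.org/std/newsml/2006-05-01/">
  <itemSet>
    <newsItem guid="urn:newsml:afp.com:20060720:TX-SGE-SNK66">
      <contentMeta>
        <language literal="en"/>
        <subject code="cat:04006002" type="ctyp:category"/>
        <subject code="cat:04006006" type="ctyp:category"/>
        <slugline separator="-">Insurances get support</slugline>
      </contentMeta>
    </newsItem>
  </itemSet>
</newsMessage>
"""

# Labels as annotated next to the subject elements in the listing.
CATEGORY_LABELS = {"cat:04006002": "banking", "cat:04006006": "insurance"}


def subject_codes(xml_text):
    """Return (guid, subject code) pairs for every news item in the message."""
    root = ET.fromstring(xml_text)
    pairs = []
    for item in root.iterfind(".//g2:newsItem", G2_NS):
        for subject in item.iterfind("g2:contentMeta/g2:subject", G2_NS):
            pairs.append((item.get("guid"), subject.get("code")))
    return pairs


codes = subject_codes(G2_SAMPLE)
for guid, code in codes:
    print(guid, code, CATEGORY_LABELS.get(code, "unknown"))
```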

2.3. Example 3: German Broadcaster Format

The terms in the tags displayed just above are associated with a controlled vocabulary developed by Reuters. If we consider the internal XML encoding that has been provisionally proposed by an ongoing European project (the MUSING project, see www.musing.eu) for encoding similar articles from German newspapers (mapping the HTML tags of the online articles into XML and adding others), we have the following:

  <ID>1091484</ID>                 # Internal encoding
  <SOURCE>Handelsblatt</SOURCE>    # Name of the newspaper we get the information from
  <DATE>14.12.2006</DATE>          # Date of publication
  <NUMBER>242</NUMBER>             # Numbering of the publication
  <PAGE>27</PAGE>                  # Page number in the publication
  <LENGTH>111</LENGTH>             # The number of lines in the main article

  <ACTIVITY_FIELD>Banking_Insurance</ACTIVITY_FIELD>   # corresponding to the financial domain reported in the article
  <TITLE>Insurances get support</TITLE>
  <SUBTITLE>The Senate of Germany wants to constraint the participation of clients to the hidden reserves</SUBTITLE>
  <ABSTRACT></ABSTRACT>
  <AUTHORS>Lansch, Rita</AUTHORS>
  <LOCATION>Federal Republic of Germany</LOCATION>
  <KEYWORDS>Bank supervision, Money and Stock exchange, Bank</KEYWORDS> 
  <PROPERNAMES>Meister, Edgar Remsperger, Hermann Reckers, Hans Fabritius, Hans Georg Zeitler, Franz-Christoph</PROPERNAMES> 
  <ORGANISATIONS>Bundesanstalt für Finanzdienstleistungsaufsicht BAFin</ORGANISATIONS>
  <TEXT>~##pRITA LANSCH | DÜSSELDORF The German Senate supports the point of view of insurance companies in a central point of the new law defining insurance contracts, foreseen for 2008. In a statement, the Senators show disagreements with the proposal of the Federal Government, who was in favor of including investment bonds in the hidden reserves, which in the next future should be accessible to the clients of the insurance companies.
...</TEXT>

2.4. Example 4: XBRL Format

Structured data and documents such as Profit & Loss tables can finally be mapped onto existing taxonomies, like XBRL, which is an emerging standard for Business Reporting.

XBRL definition in Wikipedia: "XBRL is an emerging XML-based standard to define and exchange business and financial performance information. The standard is governed by a not-for-profit international consortium (XBRL International Incorporated, http://www.xbrl.org) of approximately 450 organizations, including regulators, government agencies, infomediaries and software vendors. XBRL is a standard way to communicate business and financial performance data. These communications are defined by metadata set in taxonomies. Taxonomies capture the definition of individual reporting elements as well as the relationships between elements within a taxonomy and in other taxonomies."

The relations between elements supported, for the time being, (at least for the German Accounting Principles expressed in the corresponding XBRL taxonomy, see http://www.xbrl.de/) are:

In fact, the child-parent/parent-child relation has to be understood as a part-of relation within financial reporting documents rather than as a sub-class relation, as we noticed in an attempt to formalize XBRL in OWL in the context of the European MUSING R&D project (http://www.musing.eu/).

The table below shows what a balance sheet looks like:

structured P&L                        2002 EUR        2002 EUR        2002 EUR
Sales                               850.000,00      800.000,00      300.000,00
Changes in stock                    171.000,00      104.000,00       83.000,00
Own work capitalized                      0,00            0,00            0,00
Total output                      1.021.000,00      904.000,00      383.000,00
...
Net income/net loss for the year    139.000,00      180.000,00     -154.000,00

                                          2002            2001            2000
Number of Employees                         27              25              23
....
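The figures in the table above use German number formatting (a dot as thousands separator, a comma as decimal separator), so even reading them requires a convention that the table itself does not state. As a small worked example, the Python sketch below parses these strings into decimals and checks that, for each column, "Total output" equals the sum of the three rows above it, which the figures in the table happen to satisfy. The helper name `parse_german_amount` is an assumption made for illustration.

```python
from decimal import Decimal


def parse_german_amount(text):
    """Convert e.g. '1.021.000,00' (dot = thousands, comma = decimal) to a Decimal."""
    return Decimal(text.replace(".", "").replace(",", "."))


# Rows copied from the table above, one value per column.
P_AND_L = {
    "Sales": ["850.000,00", "800.000,00", "300.000,00"],
    "Changes in stock": ["171.000,00", "104.000,00", "83.000,00"],
    "Own work capitalized": ["0,00", "0,00", "0,00"],
    "Total output": ["1.021.000,00", "904.000,00", "383.000,00"],
}

rows = {
    label: [parse_german_amount(value) for value in values]
    for label, values in P_AND_L.items()
}

# Check the column sums against the reported totals.
for column in range(3):
    computed = (
        rows["Sales"][column]
        + rows["Changes in stock"][column]
        + rows["Own work capitalized"][column]
    )
    assert computed == rows["Total output"][column]
    print(f"column {column}: total output {computed}")
```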

There is a lot of variation both in the way the information can be displayed (number of columns, use of fonts, etc.) and in the terminology used: the financial terms in the leftmost column are not normalized at all. The figures are not normalized either (clearly, the company has more than just "27" employees, but the table does not indicate whether we are dealing with 27,000 employees). This makes this kind of information unusable for semantic applications. XBRL is a very important step in the normalization of such data, as can be seen in the following example displaying the XBRL encoding of the kind of data that was presented just above in the table:

<group xsi:schemaLocation="http://www.xbrl.org/german/ap/ci/2002-02-15 german_ap.xsd">
   <numericContext id="c0" precision="8" cwa="false"> 
      <entity>
         <identifier scheme="urn:datev:www.datev.de/zmsd">11115,129472/12346</identifier>
      </entity> 
      <period>
         <startDate>2002-01-01</startDate> 
         <endDate>2002-12-31</endDate>
      </period> 
      <unit>
         <measure>ISO4217:EUR</measure>
      </unit> 
   </numericContext> 
   <numericContext id="c1" precision="8" cwa="false"> 
      <entity>
         <identifier scheme="urn:datev:www.datev.de/zmsd">11115,129472/12346</identifier>
      </entity> 
      <period>
         <startDate>2001-01-01</startDate> 
         <endDate>2001-12-31</endDate>
      </period> 
      <unit>
         <measure>ISO4217:EUR</measure>
      </unit>
   </numericContext> 
   <numericContext id="c2" precision="8" cwa="false"> 
      <entity>
         <identifier scheme="urn:datev:www.datev.de/zmsd">11115,129472/12346</identifier>
      </entity> 
      <period>
         <startDate>2000-01-01</startDate> 
         <endDate>2000-12-31</endDate>
      </period> 
      <unit>
         <measure>ISO4217:EUR</measure>
      </unit> 
   </numericContext> 
   <t:bs.ass numericContext="c2">1954000</t:bs.ass> 
   <t:bs.ass.accountingConvenience numericContext="c0">40000</t:bs.ass.accountingConvenience> 
   <t:bs.ass.accountingConvenience numericContext="c1">70000</t:bs.ass.accountingConvenience> 
   <t:bs.ass.accountingConvenience numericContext="c2">0</t:bs.ass.accountingConvenience> 
   <t:bs.ass.accountingConvenience.changeDem2Eur numericContext="c0">0</t:bs.ass.accountingConvenience.changeDem2Eur>     
   <t:bs.ass.accountingConvenience.changeDem2Eur numericContext="c1">20000</t:bs.ass.accountingConvenience.changeDem2Eur> 
   <t:bs.ass.accountingConvenience.changeDem2Eur numericContext="c2">0</t:bs.ass.accountingConvenience.changeDem2Eur> 
   <t:bs.ass.accountingConvenience.startUpCost numericContext="c0">40000</t:bs.ass.accountingConvenience.startUpCost> 
   <t:bs.ass.accountingConvenience.startUpCost numericContext="c1">50000</t:bs.ass.accountingConvenience.startUpCost> 
   <t:bs.ass.accountingConvenience.startUpCost numericContext="c2">0</t:bs.ass.accountingConvenience.startUpCost> 
   <t:bs.ass.currAss numericContext="c0">571500</t:bs.ass.currAss> 
   <t:bs.ass.currAss numericContext="c1">558000</t:bs.ass.currAss> 
   <t:bs.ass.currAss numericContext="c2">394000</t:bs.ass.currAss>
</group>

In the XBRL example shown just above, one can see the normalization of the periods for which the reporting is valid, and of the currency used in the report. Each financial value is then annotated with a (language-independent) XBRL tag, in the context of a uniquely identified period ("c0", "c1", etc.) and with the encoded currency.
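To make the role of the contexts concrete, here is a minimal Python sketch that joins each reported value with the period and currency of the context it refers to. It uses an abbreviated copy of the listing above. Since the listing does not declare the namespace bound to the "t:" prefix, the URI in `T_NS` is a placeholder assumption, and `resolve_facts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace for the "t:" prefix (assumption: not given in the listing).
T_NS = "http://example.org/german-ap"

# Abbreviated copy of the XBRL listing with two contexts and two facts.
XBRL_SAMPLE = f"""
<group xmlns:t="{T_NS}">
  <numericContext id="c0" precision="8" cwa="false">
    <period><startDate>2002-01-01</startDate><endDate>2002-12-31</endDate></period>
    <unit><measure>ISO4217:EUR</measure></unit>
  </numericContext>
  <numericContext id="c1" precision="8" cwa="false">
    <period><startDate>2001-01-01</startDate><endDate>2001-12-31</endDate></period>
    <unit><measure>ISO4217:EUR</measure></unit>
  </numericContext>
  <t:bs.ass.currAss numericContext="c0">571500</t:bs.ass.currAss>
  <t:bs.ass.currAss numericContext="c1">558000</t:bs.ass.currAss>
</group>
"""


def resolve_facts(xml_text):
    """Attach period and unit of the referenced context to every fact."""
    root = ET.fromstring(xml_text)
    contexts = {}
    for context in root.iter("numericContext"):
        contexts[context.get("id")] = {
            "start": context.findtext("period/startDate"),
            "end": context.findtext("period/endDate"),
            "unit": context.findtext("unit/measure"),
        }
    facts = []
    for element in root:
        context_id = element.get("numericContext")
        if context_id is None:
            continue  # a context definition, not a fact
        concept = element.tag.split("}", 1)[-1]  # drop the "{namespace}" part
        facts.append({"concept": concept, "value": int(element.text), **contexts[context_id]})
    return facts


facts = resolve_facts(XBRL_SAMPLE)
for fact in facts:
    print(fact)
```

Each resulting record is self-describing: the concept name, the value, the reporting period and the currency are all explicit, which is what the flat table lacks.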

The XBRL representation marks real progress compared to the "classical" way of displaying financial information. As such, XBRL allows for some semantics, describing for example various types of relations. The need for more semantics is mainly driven by applications that require merging the quantitative information encoded in XBRL with other kinds of information, which is crucial in Business Intelligence scenarios: for example, merging balance sheet information with information coming from newswires or with information from related domains, such as politics. Therefore some initiatives have started looking at representing information encoded in XBRL within OWL, the basic ontology representation language of the Semantic Web community [19], [20].

3. Potential Solution: Converting Various Vocabularies into RDF

In this section, we discuss a potential solution to the problems highlighted in this document. We propose utilizing Semantic Web technologies for the purpose of aligning these standards and controlled vocabularies. Specifically, we discuss adding an RDF/OWL layer on top of these standards and vocabularies for the purpose of data integration and reuse. The following sections discuss this approach in more detail.

3.1. XBRL in the Semantic Web

We sketch how we convert XBRL to OWL. The XBRL OWL base taxonomy was manually developed using the OWL plugin of the Protege knowledge base editor [15]. The version of XBRL we used, together with the German Accounting Principles, consists of 2,414 concepts, 34 properties, and 4,780 instances. Overall, this translates into 24,395 unique RDF triples. The basic idea during our export was that even though we are developing an XBRL taxonomy in OWL using Protege, the information that is stored on disk is still RDF at the syntactic level. We were thus interested in RDF database systems that make sense of the semantics of OWL and RDFS constructs such as rdfs:subClassOf or owl:equivalentClass. We have been experimenting with the Sesame open-source middleware framework for storing and retrieving RDF data [16].

Sesame partially supports the semantics of RDFS and OWL constructs via entailment rules that compute "missing" RDF triples (the deductive closure) in a forward-chaining style at compile time. Since sets of RDF statements represent RDF graphs, querying information in an RDF framework means specifying path expressions. Sesame comes with a very powerful query language, SeRQL, which includes (i) generalised path expressions, (ii) a restricted form of disjunction through optional matching, (iii) existential quantification over predicates, and (iv) Boolean constraints. From an RDF point of view, an additional 62,598 triples were generated through Sesame's (incomplete) forward chaining inference mechanism.
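The idea of computing "missing" triples by forward chaining can be illustrated with a small Python sketch. It is not Sesame's implementation: it applies only two RDFS-style rules (transitivity of rdfs:subClassOf, and propagation of rdf:type along rdfs:subClassOf) to a toy set of triples until no new triple is produced. The `ex:` names and the function `forward_chain` are assumptions made for illustration.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def forward_chain(triples):
    """Return the closure of the triples under two RDFS-style entailment rules."""
    closure = set(triples)
    while True:
        new = set()
        for (s, p, o) in closure:
            if p != SUBCLASS:
                continue
            for (s2, p2, o2) in closure:
                # Rule 1: (s subClassOf o), (o subClassOf o2) => (s subClassOf o2)
                if p2 == SUBCLASS and s2 == o:
                    new.add((s, SUBCLASS, o2))
                # Rule 2: (s2 type s), (s subClassOf o) => (s2 type o)
                if p2 == TYPE and o2 == s:
                    new.add((s2, TYPE, o))
        new -= closure
        if not new:
            return closure  # fixpoint reached: nothing left to infer
        closure |= new


# Toy input: hypothetical names, loosely modelled on the wrapper types discussed later.
asserted = {
    ("ex:monetary", SUBCLASS, "ex:decimal"),
    ("ex:decimal", SUBCLASS, "ex:anySimpleType"),
    ("ex:value1", TYPE, "ex:monetary"),
}

closure = forward_chain(asserted)
inferred = closure - asserted
for triple in sorted(inferred):
    print(triple)
```

In this toy case three asserted triples yield three inferred ones, which gives a feel for why the closure of a real taxonomy can be several times larger than the asserted data.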

As a proof of concept, we looked at the freely available financial reporting taxonomies (http://www.xbrl.org/FRTaxonomies/) and took the final German AP Commercial and Industrial (German Accounting Principles) taxonomy (February 15, 2002; http://www.xbrl-deutschland.de/xe news2.htm), acknowledged by XBRL International. The taxonomy can be obtained as a packed zip file from http://www.xbrl-deutschland.de/germanap.zip.

xbrl-instance.xsd specifies the XBRL base taxonomy using XML Schema. The file makes use of XML Schema datatypes, such as xsd:string or xsd:date, but also defines simple types (simpleType), complex types (complexType), elements (element), and attributes (attribute). Element and attribute declarations are used to restrict the usage of elements and attributes in XBRL XML documents. Since OWL only knows the distinction between classes and properties, the correspondence between XBRL and OWL description primitives is not a one-to-one mapping.

However, OWL allows properties to be characterized more precisely than by a domain and a range alone. We can mark a property as functional (instead of relational, the default case), meaning that it takes at most one value. This means that a property need not have a value for each instance of a class on which it is defined. Thus a functional property is in fact a partial (and not necessarily a total) function. The distinction between functional and relational is exactly what the attribute vs. element distinction represents: multiple elements are allowed within a surrounding context, whereas at most one attribute-value combination for each attribute name is allowed within an element:

XBRL            OWL
simple type     class
complex type    class
attribute       functional property
element         relational property
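The mapping in the table above can be sketched as a simple lookup applied to the declarations of an XML Schema file. The Python snippet below walks a small schema fragment and reports, for each named declaration, the OWL construct the table assigns to it. The "monetary" simple type is the one mentioned in this section; the complex type, element and attribute names are hypothetical, as are the names `OWL_KIND` and `classify`.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# The XBRL (XML Schema) to OWL correspondence from the table above.
OWL_KIND = {
    XSD + "simpleType": "class",
    XSD + "complexType": "class",
    XSD + "attribute": "functional property",
    XSD + "element": "relational property",
}

# Small schema fragment; only "monetary" is taken from the text,
# the other declarations are made up for illustration.
SCHEMA_SAMPLE = """
<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <simpleType name="monetary">
    <restriction base="decimal"/>
  </simpleType>
  <complexType name="exampleContext">
    <sequence>
      <element name="exampleEntity" type="string"/>
    </sequence>
    <attribute name="id" type="ID"/>
  </complexType>
</schema>
"""


def classify(xsd_text):
    """Return (name, OWL construct) for every named declaration, in document order."""
    root = ET.fromstring(xsd_text)
    result = []
    for node in root.iter():
        kind = OWL_KIND.get(node.tag)
        name = node.get("name")
        if kind and name:
            result.append((name, kind))
    return result


mapping = classify(SCHEMA_SAMPLE)
for name, kind in mapping:
    print(f"{name}: {kind}")
```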

Simple and complex types differ from one another in that simple types are essentially defined as extensions of the basic XML Schema datatypes, whereas complex types are XBRL specifications that do not build upon XSD types, but instead introduce their own element and attribute descriptions. Here are simple type specifications found in the base terminology of XBRL, located in the file xbrl-instance.xsd:

Since OWL only claims that "As a minimum, tools must support datatype reasoning for the XML Schema datatypes xsd:string and xsd:integer." [17, p. 30] and because "It is not illegal, although not recommended, for applications to define their own datatypes ..." [17, p. 29], we have decided to implement a workaround that represents all the necessary XML Schema datatypes used in XBRL. This was done by having a wrapper type for each simple XML Schema type. For instance, "monetary" is a simple subtype of the wrapper type "decimal": <restriction base="decimal"/>. Below we show the first lines of the actual OWL version of XBRL we have implemented:

<?xml version="1.0"?>
<rdf:RDF xmlns="http://xbrl.dfki.de/main.owl#" 
         xmlns:protege="http://protege.stanford.edu/plugins/owl/protege#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xml:base="http://xbrl.dfki.de/main.owl">
  <owl:Ontology rdf:about=""/>
  <owl:Class rdf:ID="bs.ass.fixAss.tan.machinery.installations">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="Locator"/>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:ID="nt.ass.fixAss.fin.loansToParticip.net.addition">
    <rdfs:subClassOf>
      <owl:Class rdf:about="#Locator"/>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:ID="nt.ass.fixAss.fin.loansToSharehold.net.beginOfPeriod.endOfPrevPeriod">
    <rdfs:subClassOf>
      <owl:Class rdf:about="#Locator"/>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:ID="nt.ass.fixAss.fin.gross.revaluation.comment">
    <rdfs:subClassOf>
      <owl:Class rdf:about="#Locator"/>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:ID="nt.ass.fixAss.fin.securities.gross.beginOfPeriod.otherDiff">
    <rdfs:subClassOf>
      <owl:Class rdf:about="#Locator"/>
    </rdfs:subClassOf>
  </owl:Class>
   ...
</rdf:RDF>

The German Accounting Principles taxonomy consists of 2,387 concepts, plus 27 concepts from the base taxonomy for XBRL. 34 properties were defined and 4,780 instances were finally generated.

Besides the ontologization of XBRL, we propose building an ontology on top of the taxonomic organization of NACE codes. We also need a clear ontological representation of the time units and temporal information relevant in the domain. Last but not least, we would use all the classification and categorization information of NewsML/IPTC in order to provide more accurate semantic metadata for the encoding of (financial) news articles.

3.2. NewsML G2 in the Semantic Web

@TODO Raphael: gives some hints about the conversion

3.3. EXIF in the Semantic Web

One of today's commonly used image format and metadata standards is the Exchangeable Image File Format (EXIF) [3]. This file format provides a standard specification for storing metadata about images. Metadata elements pertaining to the image are stored in the image file header and are marked with unique tags, which serve as element identifiers.

As we note in this document, one potential way to integrate EXIF metadata with additional news/multimedia metadata formats is to add an RDF layer on top of the metadata standards. Recently there have been efforts to encode EXIF metadata in such Semantic Web standards, which we briefly detail below. We note that both of these ontologies are semantically very similar, so this issue is not addressed here. Essentially both are straightforward encodings of the EXIF metadata tags for images (see [3]). There are some syntactic differences, but again they are quite similar; they differ primarily in the naming conventions they use.

3.3.1. Kanzaki EXIF RDF Schema

The Kanzaki EXIF RDF Schema [6] provides an encoding of the basic EXIF metadata tags in RDFS. Essentially these are the tags defined from Section 4.6 of [3]. We also note here that relevant domains and ranges are utilized as well. [6] additionally provides an EXIF conversion service, EXIF-to-RDF (found at [6]), which extracts EXIF metadata from images and automatically maps it to the RDF encoding. In particular the service takes a URL to an EXIF image and extracts the embedded EXIF metadata. The service then converts this metadata to the RDF schema defined in [6] and returns this to the user.
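To illustrate what such an RDF encoding amounts to, the following Python sketch turns a dictionary of EXIF tag values into N-Triples statements about an image. It is not the converter service of [6]. The namespace URI in `EXIF_NS` and the property names (`make`, `model`, `exposureTime`) are assumptions about the schema and should be checked against it before use; the image URL and tag values are made up.

```python
# Assumed namespace of the EXIF RDF schema; verify against [6] before relying on it.
EXIF_NS = "http://www.kanzaki.com/ns/exif#"


def escape_literal(value):
    """Escape backslashes, quotes and newlines for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def exif_to_ntriples(image_url, tags):
    """Return one N-Triples line per EXIF tag, sorted by tag name."""
    lines = []
    for name, value in sorted(tags.items()):
        literal = escape_literal(str(value))
        lines.append(f'<{image_url}> <{EXIF_NS}{name}> "{literal}" .')
    return lines


# Hypothetical EXIF values for a hypothetical image.
example_tags = {"make": "ExampleCam", "model": "X100", "exposureTime": "1/60"}

triples = exif_to_ntriples("http://example.org/photo.jpg", example_tags)
for line in triples:
    print(line)
```

Once the metadata is in this form, the image can be referenced by the same URI from descriptions produced by other vocabularies, which is the point of adding the RDF layer.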

3.3.2. Norm Walsh EXIF RDF Schema

The Norm Walsh EXIF RDF Schema [7] provides another encoding of the basic EXIF metadata tags in RDFS. Again, these are the tags defined in Section 4.6 of [3]. [7] additionally provides JPEGRDF, a Java application that provides an API to read and manipulate EXIF metadata stored in JPEG images. Currently, JPEGRDF can extract, query, and augment the EXIF/RDF data stored in the file headers. In particular, we note that the API can be used to convert existing EXIF metadata in file headers to the schema defined in [7]. The resulting RDF can then be stored in the image file header, etc. (Note that the API's functionality extends well beyond what is briefly presented here.)

3.4. Putting All That Together

Some text showing how this qualitative and quantitative information benefits to interoperate ...

4. References

[1] News Architecture (NAR) for G2-standards, http://www.iptc.org/NAR/

[2] The IPTC NewsCodes - Metadata taxonomies for the news industry, http://www.iptc.org/NewsCodes/

[3] EXIF: Exchangeable Image File Format, Japan Electronic Industry Development Association (JEIDA). Specification version 2.2.

[4] EBU: European Broadcasting Union, http://www.ebu.ch/

[5] MPEG-7: Multimedia Content Description Interface, Standard No. ISO/IEC n°15938, 2001.

[6] Kanzaki EXIF-RDF Converter, http://www.kanzaki.com/test/exif2rdf

[7] JPEGRDF - Norm Walsh EXIF Converter, http://www.nwalsh.com/java/jpegrdf/

[8] Reuters - http://www.reuters.com/

[9] Reuters News Corpus - http://www.daviddlewis.com/resources/testcollections/reuters21578/

[10] Reuters Corpus @ NIST - http://about.reuters.com/researchandstandards/corpus/available.asp

[11] Press releases:

[12] Dow Jones Newswires - http://www.djnewswires.com/

[13] Reuters: Online News Wires - http://www.about.reuters.com/ids/products/onlinenw.htm#nw11

[14] Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.

[15] Knublauch, H., Musen, M.A., Rector, A.L.: Editing description logic ontologies with the Protege OWL plugin. In: Proceedings of the International Workshop on Description Logics, DL2004. (2004)

[16] Broekstra, J., Kampman, A., van Harmelen, F.: Sesame: A generic architecture for storing and querying RDF and RDF schema. In: Proceedings of the International Semantic Web Conference (ISWC). Number 2342 in Lecture Notes in Computer Science (LNCS), Springer (2002) 54-68

[17] Bechhofer, S., van Harmelen, F., Hendler, J., Horrocks, I., McGuinness, D.L., Patel-Schneider, P.F., Stein, L.A.: OWL web ontology language reference. Technical report, W3C (2004) 10 February.

[18] XBRL - eXtensible Business Reporting Language, http://www.xbrl.org/Home/, see also Tim Bray's blog

[19] Declerck, T., Krieger, H.-U.: Translating XBRL Into Description Logic. An Approach Using Protege, Sesame & OWL. Proceedings of the 9th International Conference on Business Information Systems (2006).

[20] Lara, R., Cantador, I., Castells, P.: XBRL Taxonomies and OWL Ontologies for Investment Funds. ER (Workshops) 2006: 271-280

5. Trash

5.1. Using News Feeds in the Broadcast Industry

Broadcasting companies also use metadata to describe the content of their videos (and images). The overall objective is still to make the existing multimedia material available to journalists who are writing on a certain topic and, the other way round, to point to news articles that might be related to a video sequence (or an image).

Our starting point is material from three broadcasting institutions, specialized in news. We do not name them here and we abstract over their documents, since we still have to ask for permission for displaying information to the general public. We present the (large) commonalities we found in the various metadata documents. The main interoperability problems are the use of controlled vocabularies for describing the content of the news (such as the IPTC News Codes) and the overall structure of the news (which NewsML G2 is standardizing).

The metadata associated with the news videos we have looked at basically distinguish between information about the program itself (the news program broadcast on a certain day and time) and information about the parts covering various topics (economics, sports, international, entertainment, etc.). Within those sections there are again different contributions, which we can consider as the "leaves" of the structure: the video on a specific event, topic, person, etc.

The general structure looks like:

Detailed description of the shots in the news segment: a list of natural language expressions describing what can be seen in the different video shots. These NL expressions consist mostly of short phrases and more rarely of full sentences.

Interoperability can be achieved at different levels.

  1. On the basis of data categories, which reflect all the information points mentioned above.
  2. At the level of the semantic description of the content of the video (image). The descriptions of the content of the video sequences are in natural language and sometimes use keywords, which generally belong to controlled vocabularies.

To ensure interoperability here, one has to map the natural language expressions and the keywords to some structured semantic representation. As a first step, we can think of using a combination of the "Structured Text Annotation" and the "Semantic Description" schemes of MPEG-7. This is not enough, though, since the values associated with the slots of the structured textual description scheme ("who", "why", "what_action", etc.) and of the Semantic Description Scheme ("event", "location", "object", etc.) are not normalized per se and are not related to semantic resources (such as instances of an ontology). So if a newspaper article reports on events occurring in Paris, Texas and the metadata of a news video describe scenes of Paris, France, it is difficult to tell the two apart at the simple level of string values in the XML context of MPEG-7.
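The Paris example can be made concrete with a minimal sketch, assuming that each metadata value carries, next to its string label, a reference to an ontology instance. The URIs, document identifiers and the `Annotation` type below are hypothetical and serve only as illustration; they are not part of MPEG-7.

```python
from dataclasses import dataclass

# Hypothetical identifiers, used only for illustration; a real system would
# take them from a shared gazetteer or ontology.
PARIS_FR = "http://example.org/place/Paris_France"
PARIS_TX = "http://example.org/place/Paris_Texas"


@dataclass(frozen=True)
class Annotation:
    document: str  # identifier of the article or video segment
    slot: str      # e.g. the "location" slot of a semantic description
    label: str     # the string value as it appears in the metadata
    entity: str    # URI of the ontology instance the label refers to


annotations = [
    Annotation("article-1", "location", "Paris", PARIS_TX),
    Annotation("video-7", "location", "Paris", PARIS_FR),
]


def match_by_label(items, label):
    """String-level matching: cannot tell the two cities apart."""
    return sorted(a.document for a in items if a.label == label)


def match_by_entity(items, entity):
    """Instance-level matching: only documents about the same entity."""
    return sorted(a.document for a in items if a.entity == entity)
```

Matching on the label "Paris" returns both documents, whereas matching on the entity reference returns only the document about the intended city.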

5.2. Business Intelligence Applications

In Business Intelligence applications, such as Credit Risk Management, many decisions are taken on the basis of calculations and inferences applied to quantitative data. Such data include Profit & Loss tables in balance sheets (yearly reports of companies) and company profiles, as they are stored in Business Registers or by other agencies specialized in delivering financial and economic information. This (mostly statistical) processing generates indicators or rating models that can be used in decision procedures for granting a loan or for ranking companies, branches or economic regions.

Some BI sectors perceive a need to include qualitative information in the services they offer to customers or to the general public. One example is credit risk management, where a faster response to current information, as delivered by newswires, is needed to provide more accurate and more quickly integrated information to end users. The Basel II accord (see also Wikipedia for more information, http://en.wikipedia.org/wiki/Basel_II) also calls for more accuracy in granting access to credit, with more transparency in the decision procedures. On the other hand, companies are asked to put more effort into their “self-assessment”. In this sense, Basel II can be considered as introducing an additional burden on SMEs seeking access to credit. Therefore there is a need for better (semantics-driven) information services. Here again, the integration of qualitative information, mostly from unstructured documents such as financial newswires, is strongly requested. This information can also be used to explain the decision that results from the rating or ranking procedure.

5.3. Enhancing Automatically the Metadata

The basic idea behind the XML codes of the MUSING project, shown above in the section about NewsML, is that the operator at a bank or rating agency can search a press article for certain strings within a subpart of the XML document. For example, finding the name of a person in the title of the article suggests that the article is very probably dedicated to this person. But no information is available about the function of this person (CEO of a company?). It is also quite possible that two persons have the same name, and string-based queries cannot distinguish between them. The person may also have two functions (at the same time or in two different periods), and querying for a specific function of this person is not possible with the current annotation strategy. This kind of metadata should ideally be extracted automatically from unstructured documents and annotated with semantic tags.

So, beyond the general issue of interoperability of annotations (the annotation tag set here is not really compatible with the annotation tag set of Reuters, so a company using those data will have difficulties integrating the annotated information generated by Reuters), there is also a need to upgrade the current annotation towards a semantic annotation.

Therefore there is a call for (standardised) semantic annotation that enriches, for example, the information about a person with her or his function and the period of time it covers (Person X being CEO of Company Y from 2001-01-01 to 2004-02-28); the date of the article alone gives far too little temporal information. Generally speaking, there is also an urgent need to define specific relations between the XML elements (which would in any case be upgraded to semantic classes and relations).
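A time-bounded function of this kind could be represented as follows. This is a minimal sketch, assuming a simple record type; the `RoleStatement` class and `roles_on` helper are hypothetical names, and the end date of the example is read as 28 February 2004.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RoleStatement:
    """A person holding a function in an organisation during a period."""
    person: str
    role: str
    organisation: str
    start: date
    end: date

    def holds_on(self, day: date) -> bool:
        # Both bounds are treated as inclusive.
        return self.start <= day <= self.end


# Placeholder names taken from the example in the text.
statements = [
    RoleStatement("Person X", "CEO", "Company Y",
                  date(2001, 1, 1), date(2004, 2, 28)),
]


def roles_on(items, person, day):
    """Return the (role, organisation) pairs a person holds on a given day."""
    return [(s.role, s.organisation)
            for s in items
            if s.person == person and s.holds_on(day)]
```

With such statements, a query can ask which function a person held on the date an article was published, which a plain string search over the article title cannot answer.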

It is clear that, in order to provide integration and/or interoperability of these news feeds, particularly of their semantic content, the vocabularies associated with the content must be aligned. The goal of this use case is to address this issue. Our aim is to use standardized knowledge representation languages for the Web (RDF and/or OWL) to bridge the gap between news providers' controlled vocabularies. Specifically, we will demonstrate this by providing an ontological alignment of the Reuters and other metadata formats, showing the interoperability capability of representation languages on the Web.
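The idea of aligning vocabularies can be illustrated with a minimal sketch, assuming that terms from each provider are mapped to a shared concept. All term identifiers below are hypothetical and do not correspond to real IPTC or Reuters codes; in RDF or OWL such a mapping would be stated with explicit equivalence or mapping relations rather than a Python dictionary.

```python
# Hypothetical term identifiers from two providers, each mapped to a
# hypothetical shared concept.
ALIGNMENT = {
    "providerA:ECON": "shared:Economy",
    "providerB:business-finance": "shared:Economy",
    "providerA:SPORT": "shared:Sport",
}


def to_shared(term):
    """Return the shared concept for a provider term, or None if unaligned."""
    return ALIGNMENT.get(term)


def same_concept(term_a, term_b):
    """Two terms denote the same concept if both map to one shared concept."""
    shared_a, shared_b = to_shared(term_a), to_shared(term_b)
    return shared_a is not None and shared_a == shared_b
```

Two news items tagged with different provider terms can then be recognised as being about the same topic, while terms that are not covered by the alignment are never matched by accident.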

5.4. News Syndication on the Semantic Web

From the discussions presented above, it is clear that a Semantic Web based approach (OWL and RDF) provides a solution to integrating the various news metadata formats. Adopting such an approach provides additional benefits as well, mainly due to the formal foundations of such representation languages; in particular advanced news syndication can be provided.

Over the past years, news syndication systems on the Web have attracted increased attention and usage. As technologies have emerged and matured, there has been a transition to more expressive syndication approaches; that is, subscribers (and publishers) are provided with more expressive means for describing their interests (resp. published content), enabling more accurate news dissemination. Over the years there has been a transition from keyword-based approaches to attribute-value pairs and, more recently, to XML. Given the limited expressivity of XML (and XML Schema), there has been interest in using RDF and OWL for syndication purposes.

Using a more expressive approach with a formal semantics brings many benefits. These include a rich semantics-based mechanism for expressing subscriptions and published news items, which allows increased selectivity and finer control for filtering, and automated reasoning for discovering subscription matches that traditional syntactic syndication approaches do not find. Therefore, by adopting an approach such as the one presented here, advanced news dissemination services can potentially be provided as well.
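The difference between syntactic and reasoning-based matching can be sketched as follows. This is a minimal illustration, assuming a small hand-written subclass hierarchy; the concept names and the functions `ancestors`, `matches` and `syntactic_match` are hypothetical, and an actual system would delegate this step to an RDFS or OWL reasoner.

```python
# Hypothetical concept hierarchy: each concept points to its direct superclass.
SUBCLASS_OF = {
    "NuclearPolicy": "Politics",
    "Election": "Politics",
    "Football": "Sport",
}


def ancestors(concept):
    """Return the concept followed by all of its superclasses."""
    seen = []
    while concept is not None and concept not in seen:
        seen.append(concept)
        concept = SUBCLASS_OF.get(concept)
    return seen


def syntactic_match(subscription, item_concept):
    """Keyword-style matching: the two terms must be identical."""
    return subscription == item_concept


def matches(subscription, item_concept):
    """Reasoning-based matching: the item's concept is subsumed by the subscription."""
    return subscription in ancestors(item_concept)
```

A subscription to "Politics" then also receives an item annotated with "NuclearPolicy", a match that a purely syntactic comparison of the two terms misses.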

5.5. Query Examples

Cross-modality search on a given topic: I would like to find news related to Iran's nuclear policy: pictures, videos, graphics and news stories.

Integrating heterogeneous resources: I am the CEO of a company that provides video sharing platforms on the web. I would like to have all news related to my competitors. There exists a thesaurus of all personalities and a thesaurus of all companies, with their CEOs, their business activity, etc.