W3C Incubator Report

W3C Content Labels

W3C Incubator Group Report Draft 0.4 31 May 2006

This version:
Latest version:
Phil Archer, ICRA
Jo Rabin, Segala
Kai-Dietrich Scheppe, T-Online


Need to write an abstract here

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of Final Incubator Group Reports is available. See also the W3C technical reports index at http://www.w3.org/TR/.

This is a preliminary draft designed to act as a focus for discussion by the group. That discussion has yet to take place and the contents of this document must be seen as temporary.

This document was developed by the W3C Content Label Incubator Group.

Publication of this document by W3C as part of the W3C Incubator Activity indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by it. Participation in Incubator Groups and publication of Incubator Group Reports at the W3C site are benefits of W3C Membership.

Table of Contents

1 Introduction

The group was chartered to look for "... a way of making any number of assertions about a resource or group of resources." Furthermore "... those assertions should be testable in some way through automated means."

It quickly became apparent that the terminology used in that summation needed to be refined and clarified; however, it was possible to construct a set of use cases that demonstrate the aims in more detail. A set of high level requirements was derived from the use cases and then formalized for this report.

Throughout the Incubator Activity, decisions have been taken via consensus during regular telephone conferences and a face to face meeting. Discussion of the requirements, and of what can and cannot be inferred from a content label, proved the most exhaustive. Based on that discussion it is possible to reformulate the output of the Web Content Labels Incubator Activity as:

"A way of making any number of assertions, using any number of vocabularies, about a resource or group of resources. The assertions are open to automatic authentication based on available data such as who made the assertions and when."

We have deliberately taken a very broad approach so that it is possible for both the resource creator and third parties to make assertions about all kinds of things, with no architectural limits on the kind of thing they are making claims about. For example, medical content labeling applications might be concerned with properties of the agencies and processes that produce web content (e.g. companies, people, and their credentials). By contrast, a mobile content labeling application might be more concerned with kinds of information resource and their particular (and varying) representations as streams of bytes. That said, we have focused on Web resources rather than trying to define a universal labeling system for objects.

The group agreed early on that RDF provides the best technology as a basis for achieving this but that alternatives must be discussed explicitly. For this reason, we offer the following:

1.1 Participants

The companies who participated in WCL-XG are as follows:

* Original sponsor organization

The diverse membership reflects a widely recognized need to be able to "label content" for various purposes. These range from child protection through to scientific accuracy, from the identification of mobile-friendly and/or accessible content to linking of thematically-related resources.

2 Definitions

The following terms are used throughout this report. To aid clarity, definitions have been collected from W3C glossaries where possible and provided a priori where necessary.

Assertion (n.) (i) Any expression which is claimed to be true. (ii) The act of claiming something to be true. [W3C definition source]

Authenticate, (n. authentication) To provide evidence that the assertions made in a cLabel were made by the individual or organization stated in the cLabel's metadata. If provided, details such as the timing of the creation of the cLabel, the period for which the cLabel is valid, and the conditions under which it was created must also be reflected in any authentication procedure. [ibid]

Certification (Segala to supply)

Content Label, cLabel A set of assertions made using one or more vocabularies to describe a resource or group of resources. Clients MUST NOT make any inference about a resource or group of resources based on the absence of any descriptor. In other words, if a content label describes a resource solely in terms of its color, no inference can be drawn about, for example, its reliability. [ibid]

cLabel metadata Data about the content label. Typically this will include who created it, when it was created and for how long it is valid etc. [ibid]

Descriptor An individual term from a vocabulary. [ibid]

Labeling Authority (acronym LA) An organization that provides infrastructure for the generation and authentication of content labels. [ibid]

Resource Anything that might be identified by a URI. [W3C definition source]

Resource creator The individual or organization that created the resource. [ibid]

Schema (pl., schemata) A document that describes an XML or RDF vocabulary. Any document which describes, in a formal way, a language or parameters of a language. [W3C definition source]

Trustmark An assertion by an independent body that the resources, products or services offered by an individual or organization meet certain standards. The standards may be presented in various forms such as a code of practice or conduct, a set of requirements, a list of criteria etc. [ibid]

validation, validate, validating The process necessary to perform conformance testing in accordance with a prescribed procedure and an official test suite. [W3C definition source]

vocabulary A collection of attributes that can describe one or more resources. When associated with a schema, attributes are expressed as URI references. [This definition is an amalgam of those provided in Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies 1.0 and OWL Web Ontology Language Guide.]

Well-formed Syntactically legal. [W3C definition source]

3 Detailed requirements

Based on the use cases and the original high level requirements that were derived from them, a set of more detailed requirements was established. These have been loosely categorized for easier comprehension.

  1. It must be possible for both resource creators and third parties to make assertions about Web resources
  2. The assertions must be able to be expressed in terms chosen from different vocabularies. Such vocabularies might include, but are not limited to, those that describe a resource's suitability for children, its conformance with accessibility guidelines and/or Mobile Web Best Practice, its scientific accuracy and the editorial policy applied to its creation.
  3. Existing vocabularies are preferred over newly defined ones
  4. It must be possible to group information resources and have cLabels refer to that group of resources. For example, cLabels can refer to all the pages of a Web site, defined sections of a Web site, or all resources on multiple Web sites.

Label semantics

  1. A cLabel is the expression of claims made only by the party that created it. (It is the expression of an opinion of a person or organization or an automaton, from a point of view (limited by the vocabulary or vocabularies chosen), potentially qualified by the limitations of present knowledge, expressed at a point in time about the current state of ...) [This is defined in section 2, so I think this can be removed]
  2. cLabels must support a single composite assertion taking the place of a number of other assertions. For example, WAI AAA can be defined as WAI AA plus a series of detailed descriptors. Other examples include mobileOK and age-based classifications. [new wording]
  3. Labels must also support addition to or subtraction from composite assertions. The first of these would appear to be supported by the ability to group a composite assertion with atomic ones. The second would appear to require, in addition, negative assertions. Further, for a client to be able to understand the meaning of a composite assertion that is subtracted from without having to parse the relevant vocabulary, negative assertions probably need to be nested in the composite assertion. Example: AAA minus x and y. Vocabularies maybe should be able to constrain whether negative assertions can be made? Also, this suggests requirements as to the usability or utility of labels absent the ability to retrieve and/or parse the accompanying vocabs. [This looks like a road to complexity. I see the point but I wonder whether we're going too far? Also, does it conflict with the spirit, if not the letter, of our assertion that one should not make any inference from the absence of a descriptor?]
  4. More than one cLabel can refer to the same resource or group of resources. Since conflicting labels are therefore permissible, their acceptance lies with the end user
  5. It must be possible for a resource to refer to one or more cLabels. It follows that there must be a linking mechanism between content and labels.
  6. cLabels must be able independently to point to any resource(s)
  7. It must be possible to make assertions about cLabels using appropriate vocabularies. For example, a cLabel can have metadata describing who created it, what its period of validity is, how to provide feedback about it, who last verified it and when.
  8. It must be possible for a cLabel to be associated with its metadata and vice versa.
  9. cLabels, metadata statements and individual assertions should have unique and unambiguous identifiers. [JR made this suggestion in the section on labels and metadata. Looks like a requirement to me.] [Welcome to the Semantic Web, Jo]
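
As a rough illustration of items 2 and 3 above, the sketch below expands a composite assertion into atomic ones and records subtractions as explicit negative assertions, so that a client never has to infer anything from a descriptor's absence. It is not a proposed design: the descriptor names and the `COMPOSITES` table are hypothetical and are not drawn from WCAG or any other real vocabulary.

```python
# Hypothetical vocabulary: each composite name maps to the atomic
# descriptors it stands for. None of these names come from a real vocabulary.
COMPOSITES = {
    "AA": frozenset({"text-alternatives", "keyboard-access"}),
}
# "AAA" is defined as "AA" plus a series of further descriptors.
COMPOSITES["AAA"] = COMPOSITES["AA"] | {"sign-language", "extended-audio-description"}


def expand(composite, add=(), subtract=()):
    """Expand a composite assertion into a mapping of descriptor -> bool.

    True is a positive assertion, False an explicit negative assertion.
    Descriptors that are not keys in the result are simply not described.
    """
    asserted = {descriptor: True for descriptor in COMPOSITES[composite]}
    asserted.update({descriptor: True for descriptor in add})
    asserted.update({descriptor: False for descriptor in subtract})
    return asserted


# "AAA minus sign-language"
label = expand("AAA", subtract=["sign-language"])
```

Here `label["sign-language"]` is an explicit `False`, whereas a descriptor such as `"colour-contrast"` is not a key at all, which keeps "denied" distinct from "not stated".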

Fitting in with commercial or other large scale workflows

  1. It must be possible for cLabels and cLabel metadata to be authenticated. That is, it must be possible to determine that a cLabel is an authentic expression of assertions made by an identified individual or organization
  2. It must be possible to affirm or deny assertions that are made in the cLabel or its metadata
  3. It must be possible (independently) to link to and from validity opinions and what they are expressing opinions about. [this is actually true of all types of opinion, I think] [I'm confused. Do you mean that it must be possible for me to create metadata about a cLabel you created and for you to add to your cLabel a link to me that allows something like "don't just take my word for it, he agrees with this too" type statements? Isn't that semantic annotation, for which RDF is already amply suited?]
  4. It must be possible to create and edit cLabels without modifying the resources they describe. However, this need not be the only or even primary means of adding labels to content.
  5. cLabels must support defaults. [I think it is actually the vocabularies that must support defaults.] [I strongly disagree. We need WCL to support "this is the cLabel for everything on this website unless told otherwise" - that's what I mean by a default label. Can you/I word it better?]
  6. cLabels must be able to override defaults.
  7. It must be possible for a labeling organization to make all its label data available and to define the means through which it can be accessed. This may be through a Web Service, as an XML file or any other means. [Does this mean that it must be possible to group cLabels? Or does it just mean that cLabels, however they are stored, must be accessible? If so I wonder why exactly we are saying it. See also the non-requirement that labels from different labellers do not need to be grouped (below).] [No, it means that labelling authorities need to be able to make their database of labels available for bulk import by a third party.]

Encoding labels for humans and machines

  1. It must be possible to express cLabels and cLabel metadata in a machine readable way.
  2. The machine readable form of a cLabel and cLabel metadata must be defined by a formal grammar
  3. cLabels must provide support for a human readable summary of the claims they contain
  4. It must be possible to express cLabels and cLabel metadata in a compact form

    vocabularies also need encoding. So do validation statements.

    vocabularies need identifiers and resolution mechanisms.

    Both of these are inherent in the RDF model. Do we need to specify them as requirements?

    Non requirements

    It is not necessary for a cLabel to consist of assertions that are made by different entities. [i.e. a cLabel is the expression of opinion of only one party per f. above and there is no foreseen requirement to group labels from more than one party] I think Label Semantics d makes this redundant?

4 A cLabel and its metadata

The requirements above can be expressed in a more programmatic way as follows. A Content Label (cLabel) can carry a variety of statements such as:

cLabel {
  That resource R has the property P1 is true
  That resource R has Property P2 that has value V
  That resource R meets WCAG 1.0 AA is true
  That resource R was created in accordance with satisfactory procedures is true
}

Where R may be either a single resource identified by its own URI or a group of resources. Membership of a group R may be defined either by pattern matching based on URIs or with reference to specified properties of resources. The latter case includes, but is not limited to, properties such as creation date, ISAN number etc. Important new addition to the wording here. It was an oversight on my part, well picked up by Jo.
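
A minimal sketch of how a client might hold such a cLabel in memory is given below. It is illustrative only: the `Assertion` and `CLabel` names, the `lookup` method and the example.org vocabulary URIs are all assumptions rather than part of any agreed model. It also reflects the rule from section 2 that nothing may be inferred from an absent descriptor, by returning `None` ("not stated") rather than `False`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Assertion:
    descriptor: str       # a term from a vocabulary, here written as a URI reference
    value: object = True  # "has property P1 is true" or "has property P2 with value V"


@dataclass
class CLabel:
    resource: str  # URI of the resource (or of the group) being described
    assertions: list = field(default_factory=list)

    def lookup(self, descriptor: str) -> Optional[object]:
        """Return the asserted value, or None if the label is silent."""
        for assertion in self.assertions:
            if assertion.descriptor == descriptor:
                return assertion.value
        return None  # absence of a descriptor licenses no inference


# Hypothetical vocabulary namespace.
V = "http://example.org/vocab#"
label = CLabel(
    resource="http://example.org/page.html",
    assertions=[Assertion(V + "P1"), Assertion(V + "P2", "V")],
)
```

With this label, `label.lookup(V + "P2")` yields the asserted value, while `label.lookup(V + "reliability")` yields `None`, which a client should treat as "unknown" rather than as a negative claim.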

A URI may stand in for a group of resources, as follows:

  1. it may refer to all resources that match the pattern defined by the URI [which is actually an expression with wild-cards?] with the proviso that those resources must be reachable by iteratively following links originating in the original URI - e.g. http://www.w3c.org/ means all URIs that are linked from the home page that match http://www.w3c.org/*
  2. a list of URIs where no such linking is implied
  3. there may be exclusions from the list
  4. the target URI may include transcluded content and the label may refer to the transcluded content (in which case the exclusions are also needed e.g. this page is fine for kids with the exception that the little photo at the bottom may be upsetting to some)
  5. the target URI may include links and the label may refer to the targets of such links (this is a general case of 1 above)
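
One possible reading of the pattern-plus-exclusions idea in the list above is sketched here, using shell-style wildcards from Python's standard `fnmatch` module. This is an assumption made for illustration, not an agreed mechanism: the function name `in_group` and the example.org URIs are hypothetical, and the proviso that members be reachable by following links is not modelled.

```python
from fnmatch import fnmatchcase


def in_group(uri, patterns, exclusions=()):
    """Return True if `uri` matches any pattern and no exclusion.

    Patterns use shell-style wildcards, so "*" also matches "/".
    Note that "?" and "[" are wildcard syntax too, which makes this
    unsuitable as-is for URIs that contain query strings.
    """
    if any(fnmatchcase(uri, excluded) for excluded in exclusions):
        return False
    return any(fnmatchcase(uri, pattern) for pattern in patterns)


site = ["http://www.example.org/*"]
skip = ["http://www.example.org/photos/*"]
```

With these values, `in_group("http://www.example.org/a/b.html", site, skip)` holds, while anything under `/photos/` is excluded from the group even though it matches the site-wide pattern.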

Sorry Jo, I disagree with a lot of this! One for the next meeting. But we do need to include the text resolved at the f2f: "when applying a cLabel, label creators SHOULD ensure that the label describes the content as it is intended to appear when rendered by a client. This means that, for example, the label for an HTML page should also describe images included in that page."

Further, it is necessary to be able to make statements like:

metadata {
cLabel was created by $organization
      has the e-mail address mail@organization.org
      has a homepage at $url
      has a feedback page at $URL
cLabel was created on $date
cLabel was last reviewed by $person
}

Finally, it is necessary to be able to send a real-time request to $organization seeking automatic confirmation that it was responsible for creating the cLabel, i.e. authenticating the label and the claims made.
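
The confirmation step could look something like the sketch below. Several things here are assumptions: the label is held as a plain dictionary, an in-process method call stands in for the real-time network request, and a SHA-256 fingerprint of a canonical JSON form stands in for whatever signature or lookup mechanism is eventually chosen. The field names (`created_by`, `valid_until` and so on) are hypothetical. Following the definition of "authenticate" in section 2, the validity period is checked as part of the procedure when it is provided.

```python
import hashlib
import json
from datetime import date


def fingerprint(label: dict) -> str:
    """Hash a canonical JSON form of the label (a stand-in for a real signature)."""
    canonical = json.dumps(label, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class LabelingAuthority:
    def __init__(self, name: str) -> None:
        self.name = name
        self._issued = set()  # fingerprints of every label this LA has issued

    def issue(self, label: dict) -> dict:
        self._issued.add(fingerprint(label))
        return label

    def confirm(self, label: dict, on: date) -> bool:
        """Confirm that this LA issued exactly this label and that it is still valid."""
        if fingerprint(label) not in self._issued:
            return False
        expires = label.get("valid_until")
        return expires is None or on <= date.fromisoformat(expires)


la = LabelingAuthority("Example LA")
label = la.issue({
    "resource": "http://example.org/",
    "created_by": "Example LA",
    "created_on": "2006-05-31",
    "valid_until": "2006-12-31",
})
```

A client holding `label` would call `la.confirm(label, date.today())`. Any change to the label's contents alters the fingerprint, so an altered copy is not confirmed.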

Also need to be able to make statements like:

validity {
metadata and cLabel verified by $organization  
  { has email sss ... }
verified on $date}

System architecture

Content Labels can be used in a variety of systems and it is not the XG's intention to define a single architecture. However, we do recommend that the following elements are present in any complete system.

  1. Content Labels served either from the labeled site or from a labeling authority
  2. Metadata about a given cLabel giving detail of who created it, when, its period of validity and so on (see WCL Vocabulary)
  3. An authentication route through which a client can make an automated request to the LA
  4. An API through which an LA makes available labels for a given resource (see RDF Model below)
  5. An API through which an LA makes all its labels available as a single download
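
To make elements 4 and 5 of the list above concrete, the following sketch shows the two access paths side by side: per-resource lookup and a bulk export. It is only a shape, not a proposal. The `LabelStore` class and its method names are hypothetical, and JSON is used purely as a stand-in serialization, since the report expects the actual model to be RDF-based.

```python
import json


class LabelStore:
    """In-memory stand-in for a labeling authority's label database."""

    def __init__(self):
        self._labels = []  # each label is a dict with at least a "resource" key

    def add(self, label: dict) -> None:
        self._labels.append(label)

    def labels_for(self, uri: str) -> list:
        """Element 4: all labels for one resource (there may be several)."""
        return [label for label in self._labels if label["resource"] == uri]

    def export_all(self) -> str:
        """Element 5: the whole database as a single downloadable document."""
        return json.dumps(self._labels, sort_keys=True)


store = LabelStore()
store.add({"resource": "http://example.org/a", "descriptor": "mobile-friendly"})
store.add({"resource": "http://example.org/a", "descriptor": "child-safe"})
store.add({"resource": "http://example.org/b", "descriptor": "mobile-friendly"})
```

`store.labels_for("http://example.org/a")` returns two labels, reflecting the requirement that more than one cLabel can refer to the same resource.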

Then we should reference

And we could posit the general statement that cLabels constitute a single data point that could be used in any framework.

Need to be able to do simple things like embed labels in their metadata and embed that metadata in the content that they refer to.


Do the above cited requirements cope with the requirements and the following actors?

Roles may be shared by single actual entities. Multiple actual entities may share roles. Entities may be human or automated.

Content Creator - A creator of content

Content Label[l]er – Entity that expresses an opinion about content. May be the content creator

Portal Provider – Entity that serves content to end user.

End User – Entity that ultimately consumes content

User Agent - A means of retrieving and rendering Web content for End Users mediated by the use of labels.

User Agent Provider - Entity that provides a tool that renders, decorates or otherwise differentiates content on the basis of label information.

Vocabulary Provider – Entity responsible for creation and maintenance of vocabularies

Certification Provider – Entity that verifies the claims of a content provider.

Search Provider – Entity that provides a tool or service that uses [in whole or in part] the content of labels to discriminate content.


RDF Model

Main WCL model here

As resolved in Edinburgh, this will be RDF-CL with modifications needed to meet the requirements. Things that come to mind are:

Other methods


Steve Ives has sent this:

<?xml version="1.0"?>
      <description>Find tasteful ringtones at this site.</description>

      <!-- use one of these for crawl/refresh/frequency -->
      <pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
      <lastBuildDate>Tue, 10 Jun 2003 09:41:01 GMT</lastBuildDate>

      <!-- future idea: define a collection of default values for
           item fields up top, saves repeating prices etc -->

      <!-- the tags below start off from RSS, but then
           mix in tag names from Dublin Core and iTunes RSS.
           Namespacing could be added, but has been left out
           to keep the appearance more simple. -->

      <!-- list of items -->

         <link>URL to page that will provide the item (buy link)</link>
         <description>Optional description of item</description>

         <!-- date when the item was added/released,
              could help annotate a search result with
              "new" if it's recently published -->
         <pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>

         <!-- unique ID for the item to assist in comparing new
              crawls of this file from the previous crawl. 
              Doesn't need to be a URL, but using a URL helps
              ensure uniqueness, but any string will do. -->

         <!-- 'type' is dublin core, and we could spec a starting
              vocabulary, e.g.
                 video ringtone -->

         <!-- 'creator' is DC,
              'author' is itunes (with podcasts in mind) -->
         <creator>Grarls Barkley</creator>

         <!-- 'rights' is DC -->
         <rights>Copyright 2006 Grarls Barkley</rights>
         <!-- borrow the image tag idea from itunes -->
         <image href="http://www.tastefultunes.com/images/item573"/>
         <!-- 'format' is dublin core, but DC doesn't spec
              the contents. So we could propose IMT, or "auto" -->

         <!-- borrow duration if wanted from itunes -->
         <!-- scope for many other properties of the asset, e.g.
            color depth
            frame rate -->

         <!-- borrow category from itunes, maybe don't nest the
              levels like itunes does.
              Need a vocabulary for these categories. -->

         <!-- price could be a simple string, including
              currency symbol, idea is just to repeat the
              string in any search result rather than try to
              understand it.
              Pricing needs to be able to address all kinds
              of options, e.g. subscriptions, multibuys,
              conditional discounts, first time purchases.
              A plain string might be the easiest way to
              leave this open - after all, it's the owner's
              site that will be performing the transaction. -->

         <!-- One problem still to address: how to handle multiple instances of the same item, but in different formats. Writing a script to build this file from a database may need logic to spot this. Yahoo are proposing media groups, where each different version is it's own <item> but has a tag that declares itself part of a group. Might be better to instead list the available formats inside the one item tag, but this is a bit (not much) harder to generate with a script. 
            However, if this version of the system assumes the target site will do the handset type resolution, then don't need to catalog the type information, but extraction script will still need to list one item per collection of formats that item is available in. -->



Steve also notes areas for future work

Need to do the same for Atom

Kjetil's stuff on XPath

Pantelis's stuff on microformats here


The editors acknowledge significant written contributions from:

Appendix 1

The use cases

Use Case 1: Profile matching

The original use case given in the charter has been simplified by reducing the number of essential actors to three:

One can imagine a range of scenarios with very similar characteristics that amount to "sub-use cases."

Sub use case 1A: END USER discovers content appropriate to their device ["MobileOK"]

Fig 1. Diagrammatic version of sub-use case 1A.

  1. END USER visits portal
  2. END USER's device profile is extracted with reference to a separate metadata store
  3. END USER searches for a topic of interest.
  4. PORTAL PROVIDER matches END USER's device profile with content profiles provided by CONTENT PROVIDER.
  5. PORTAL PROVIDER provides search results matching this topic.
  6. PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "mobile friendliness" of the content/presentation in question and the known properties of the device profile according to business rules.

Sub use case 1B: END USER discovers content appropriate to their age-group ["Child Protection"]

Fig 2. Diagrammatic version of sub-use case 1B.

  1. END USER visits portal
  2. END USER's user profile is extracted from a repository, perhaps the portal's own.
  3. END USER searches for a topic of interest.
  4. PORTAL PROVIDER matches END USER's age with content profiles provided by CONTENT PROVIDER.
  5. PORTAL PROVIDER provides search results matching this topic.
  6. PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "child friendliness" of the content/presentation in question and the known age of the user according to local business rules.
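
The filtering step might be sketched as follows. This is an assumption-laden illustration rather than a description of any portal: the function name, the idea of a single "minimum age" descriptor and the `keep_unlabelled` switch are all hypothetical. The same shape would apply to sub use case 1A with a device profile in place of the user's age.

```python
def filter_results(results, min_age_labels, user_age, keep_unlabelled=False):
    """Keep search results whose labelled minimum age the user meets.

    results         -- list of result URIs, in ranked order
    min_age_labels  -- dict mapping URI -> minimum age asserted by a label
    keep_unlabelled -- the business rule for results with no label; an
                       unlabelled result is "unknown", not "unsuitable"
    """
    kept = []
    for uri in results:
        min_age = min_age_labels.get(uri)
        if min_age is None:
            if keep_unlabelled:
                kept.append(uri)
        elif user_age >= min_age:
            kept.append(uri)
    return kept


results = ["http://example.org/a", "http://example.org/b", "http://example.org/c"]
labels = {"http://example.org/a": 0, "http://example.org/b": 12}
```

For a seven-year-old user, `filter_results(results, labels, 7)` keeps only the first result. Whether the unlabelled third result is shown is left to the portal's business rules, since no inference may be drawn from the missing label itself.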

Use case 2: Trustmark Scheme operator to content portal

The Example Trustmark Scheme reviews online traders, providing a trustmark for those that meet a set of published criteria. The scheme operator wishes to make its trustmark available as machine readable code as well as a graphic so that content aggregators, search engines and end-user tools can recognize and process them in some way.

The trustmark operator maintains a database of sites it has approved and makes this available in two ways:

First, the labelled site includes a link to the database. This can be achieved in a variety of ways such as an XHTML Link tag, an HTTP Response Header or even a digital watermark in an image. A user agent visiting the site detects and follows the link to the trustmark scheme's database from which it can extract the description of the particular site in real time.

Secondly, the scheme operator makes the full database available in a single file for download and processing offline.

Since the actual data comes directly from the trustmark scheme operator, it is not open to corruption by the online trader and can therefore be considered trustworthy to a large degree. To reduce the risk of spoofing, however, the data is digitally signed.

Use case 3: Website to end-user

Mrs Chaplin teaches 7 year olds at her local school. An IT enthusiast, she makes her teaching materials available through her personal website. She adds metadata to her material that describes the subject matter and curriculum area. In order to gain wider trust in her work she submits her site for review by her local education authority and a trustmark scheme. Both reviewers offer Mrs Chaplin a digitally signed, machine-readable version of their trustmark that she can add to her site. She merges these into a single pool of metadata to which she adds content descriptors from a recognized vocabulary that declare the site to contain no sex or violent content. She adds her own digital signature to the metadata. The set of digital signatures allow user-agents to identify the origin of the various assertions made. As in use case 2, links from the content itself point to this metadata.

Since the metadata is on the website itself, user agents are unlikely to take the assertions made in the metadata at face value. Unlike the trustmark operator, the local authority does not operate a web service that can support the label. It does, however, digitally sign its labels and publish its public key on its website. This can be used to verify that it is indeed the local education authority that issued the relevant data in the label.

Separately, a user-agent can interrogate the trustmark operator's database in real time to check whether Mrs Chaplin is authorized to make the assertions relevant to their namespace. Furthermore, the use of a recognized vocabulary for the content description means that a content analyser trained to work with that vocabulary can give a probabilistic assessment of the accuracy of the relevant data.

Taken together, these multiple sources of data can provide confidence in the quality of the content and the local authority trustmark which is not directly testable. The multiple data sources may be further supported by recognising that Mrs Chaplin's work is cited in many online bookmarks, blog entries and postings to education-related message boards.

Use Case 4: Rich Metadata for RSS/ATOM

Dave Cook's website offers reviews of children's films and the site is summarized in both RSS and ATOM feeds. Most of the films reviewed have an MPAA rating of G and/or British Board of Film Classification rating of U. This is declared in a rating for the channel as a whole. However, Dave includes reviews of some films rated PG-13 or 12 respectively which is declared at the item level and overrides the channel level metadata.
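
The channel-level default and item-level override described above can be sketched as below. The dictionary layout is a hypothetical simplification of a feed, not the RSS or ATOM data model, and the review titles are invented.

```python
# Hypothetical, simplified view of a feed: one channel-level rating plus items,
# some of which carry their own rating.
feed = {
    "rating": "G",
    "items": [
        {"title": "Review 1"},
        {"title": "Review 2", "rating": "PG-13"},
    ],
}


def effective_ratings(feed):
    """Map each item title to its own rating, falling back to the channel default."""
    default = feed.get("rating")
    return {item["title"]: item.get("rating", default) for item in feed["items"]}
```

Here the first review inherits the channel's G rating and the second carries its own PG-13 rating. An item in a feed with no channel-level rating would resolve to `None`, that is, unlabelled.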

The actual rating information comes from an online service operated by the relevant film classification board itself and is identified using a URL and human-readable text. The movie itself is identified by either an ISAN number or the relevant Internet Movie Database entry ID number. As with use case 2, trust is implicit given the source of the data, which is indicated by a link to Dave's site's policy.

Separately, Fred combines Dave Cook's and other review feeds to provide alternative reviews of the movies by transforming the ATOM feeds into RDF and creating an aggregate view using SPARQL queries.

Use Case 5: MLK and the KKK

Fred operates an antiracism education site which aggregates and curates content from around the Web. Fred wants to label the resources that he aggregates such that educational and other institutions may harvest the resources and associated commentary and metadata automatically for reuse within their instructional support systems, etc.

One of the ways in which Fred wants to curate resources is to say about them that they are pedagogically useful but politically noxious. For example, some sites on the Web make claims about Martin Luther King, Jr that are motivated by a racist ideology and are historically indefensible. Fred's vocabulary allows him to claim that such resources are pedagogically useful for purposes of analysis, but that they are otherwise suspicious and should only be consumed by students in an age-appropriate manner or with appropriate supervision, etc. In other words, Fred needs to be able to make sharply divergent claims about resources: (1) that they are noteworthy, and (2) that they are, from his perspective, dangerous or noxious or troublesome.

Use Case 6: Scalar Classification

A company named Advance Medical Inc. reviews medical literature on the Web based on a range of quality criteria such as effectiveness and research evidence. The criteria may be changed according to current scientific and professional developments. The review process leads to literature being classified as belonging to one of 5 levels as follows.

The company produces label data that declares the classification level value and provides a summary of each document. The label data is stored in a metadata repository which can be accessed via the Web.

M.D. Smith uses the label data in the repository to make decisions about health care for specific clinical circumstances.


The following requirements have been approved by the group.

  1. It must be possible to group resources and to make assertions that apply to the group as a whole (This is fundamental to all use cases)
  2. It must be possible to self-label (use cases 2 - 4)
  3. To provide as complete a description as possible, labels must be able to contain unambiguous assertions using more than one vocabulary (all use cases, especially 3)
  4. It must be possible for a content provider to make reference to third party labels (use case 2)
  5. It must be possible to make assertions about the accuracy of claims made in a label (use case 2)
  6. The system must be readily usable within a commercial workflow, allowing a content provider to apply metadata to a large number of resources in one step and to separate the activity of labelling from that of content creation, where desired (use case 1).
  7. The system must support a concept of default and override metadata. The mechanism that is used to determine where overrides apply should be based on the full concept of a URI rather than, for example, just a web URL. (use cases 1, 2 and 4)
  8. It should be possible to ascertain unambiguously who created the label, using techniques such as digital signatures, S/MIME etc. (use cases 2, 3 and perhaps 5)
  9. It must be possible for a labeling organization to make all its labels available as a single database (use case 2)
  10. It should be possible to include assertions from an unlimited number of vocabularies in a single content label. Assertions from each vocabulary may be subject to its own verification mechanism (use case 3)
  11. Labels should support a human-readable summary as well as the machine-readable code (all).
  12. Labels should validate to formal published grammars (all)
  13. It must be possible to encode labels in a compact/efficient form (all)
  14. It must be possible to identify whether labels are self-applied or created by a third party. (use case 2)
  15. It must be possible to discover a feedback mechanism for reporting false claims (all, especially use case 2)
  16. It must be possible to associate labels with a 'time to live' and/or 'expiry date' (all, especially use case 2)
  17. It must be possible to discover the date and time when a label was last verified and by whom. (all, especially use case 2)
  18. It must be possible to describe the process by which data in labels is to be verified (use case 3)

Although not a testable requirement, the group has further resolved the principle that adding labels to resources should be easy and intuitive. It is recognized that this is likely to be made so through implementation but the design of the system should nonetheless be mindful of the principle (use case 3).