W3C Content Labels

1 Introduction

The group was chartered to look for "... a way of making any number of assertions about a resource or group of resources." Furthermore "... those assertions should be testable in some way through automated means."

It quickly became apparent that the terminology used in that summation needed to be refined and clarified; however, it was possible to construct a set of use cases that amply demonstrate the aims in more detail. A set of high level requirements was derived from the use cases and then formalized for this report.

Throughout the Incubator Activity, decisions have been taken via consensus during regular telephone conferences and a face to face meeting. Discussion of the requirements and what can and cannot be inferred from a content label proved the most exhaustive. Based on that discussion it is possible to reformulate the output of the Web Content Labels Incubator Activity as defining:

"A way of making any number of assertions, using any number of vocabularies, about a resource or group of resources. The assertions are open to automatic authentication based on available data such as who made the assertions and when."

We have deliberately taken a very broad approach so that it is possible for both the resource creator and third parties to make assertions about all kinds of things, with no architectural limits on the kind of thing they are making claims about. For example, medical content labeling applications might be concerned with properties of the agencies and processes that produce Web content (e.g. companies, people, and their credentials). Equally, a 'Mobile Web' application might need to determine the properties of various devices such as their screen dimensions, and those device types might be labeled with such properties by their manufacturer or by others. By contrast, a mobile content labeling application might be more concerned with different kinds of information resource and their particular (and varying) representations as streams of bytes. That said, we have focused on Web resources rather than trying to define a universal labeling system for objects.

The following report sets out the detailed requirements derived from the original use cases and high level requirements (which are presented in the appendix). A model has been developed that encapsulates the issues discussed and discovered during the XG's work. Comments are also made on possible system architectures that can utilize content labels, and a detailed glossary is provided.

The Incubator Group is now seeking a charter to re-form as a full Working Group on the W3C Recommendation Track. It is believed that the work done to date is substantial and has attracted significant support, but there are some outstanding issues that need to be addressed. They are called out in the text like this, and collected together in the summary.

1.1 Participants

The companies that participated in or supported WCL-XG are as follows:

Asemantics SRL *
AT&T
AOL Inc.
Center for Democracy and Technology
Centre Virtuel de la Connaissance sur l'Europe
Institute of Informatics & Telecommunications (IIT), NCSR*
ILRT, University of Bristol*
Internet Association of Japan
Internet Content Rating Association (ICRA)*
Maryland Information and Network Dynamics Lab at the University of Maryland
Opera Software
RuleSpace LLC
Segala*
T-Online*
Vodafone*
Yahoo!*

* Original sponsor organization

The diverse membership reflects a widely recognized need to be able to "label content" for various purposes. These range from child protection through to scientific accuracy; from the identification of mobile-friendly and/or accessible content, through to the linking of thematically-related resources.

2 Detailed Requirements

Based on the use cases and the original high level requirements that were derived from them, a set of more detailed requirements was established. These have been loosely categorized for easier comprehension, and care has been taken to use relevant terms as defined in the glossary.

Fundamentals

It must be possible for both resource creators and third parties to make assertions about information resources.
A group of one or more assertions, known as a description, combined with attribution and a scope of resources that they refer to, together constitute a Content Label, also written as cLabel. This must be able to describe aspects of those resources using terms chosen from different vocabularies. Such vocabularies might include, but are not limited to, those that describe a resource's subject matter, its suitability for children, its conformance with accessibility guidelines and/or Mobile Web Best Practice, its scientific accuracy and the editorial policy applied to its creation.
It must be possible to group information resources and have cLabels refer to that group of resources i.e. define the scope of the cLabel. For example, cLabels can refer to all the pages of a Web site, defined sections of a Web site, or all resources on multiple Web sites.
cLabels must support a single composite assertion taking the place of a number of other assertions. For example, WAI AAA can be defined as WAI AA plus a series of detailed descriptors. Other examples include mobileOK and age-based classifications.
It must be possible for more than one cLabel to refer to the same resource or group of resources.
It must be possible for a resource to refer to one or more cLabels. It follows that there must be a linking mechanism between content and labels.
cLabels must be able to point to any resource(s) independently of those resources.
A cLabel must include assertions about itself using appropriate vocabularies. As a minimum, a cLabel must have metadata describing who created it. Good practice would be to declare its period of validity, how to provide feedback about it, who last verified it and when etc.
It must be possible for a cLabel to refer to other cLabels.
There must be standard vocabularies for assertions about cLabels.
cLabels, their components and individual assertions should have unique and unambiguous identifiers.
Assertions within cLabels should be made using descriptors that themselves have unique identifiers.
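
The requirement for composite assertions can be illustrated with a minimal sketch. It assumes a hypothetical table in which each composite names the composites it builds on plus its own detailed descriptors; the descriptor names, and the layering of WAI AA on WAI A, are placeholders for illustration and are not taken from any published vocabulary.

```python
# Hypothetical composite definitions. Each composite lists the composites it
# includes plus its own detailed descriptors (placeholder names only).
COMPOSITES = {
    "WAI-A": {"includes": [], "descriptors": {"a-descriptor-1", "a-descriptor-2"}},
    "WAI-AA": {"includes": ["WAI-A"], "descriptors": {"aa-descriptor-1"}},
    "WAI-AAA": {"includes": ["WAI-AA"], "descriptors": {"aaa-descriptor-1"}},
}


def expand(composite, definitions=COMPOSITES):
    """Return every detailed descriptor implied by a composite assertion."""
    entry = definitions[composite]
    result = set(entry["descriptors"])
    for included in entry["includes"]:
        # A composite stands in for everything its included composites imply.
        result |= expand(included, definitions)
    return result
```

A processor that only understands the detailed descriptors could call expand("WAI-AAA") to recover them from the single composite assertion.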

Fitting in with commercial or other large scale workflows

It must be possible for cLabels to be authenticated.
It must be possible to create and edit cLabels without modifying the resources they describe. OQ 1: It is an open question whether there may be a requirement for some forms of cLabel that involve editing those resources.
It must be possible to identify a default cLabel for a group of resources and provide an override at specific locations within the scope of the cLabel.

Encoding labels for humans and machines

It must be possible to express cLabels and cLabel metadata in a machine readable way.
The machine readable form of a cLabel must be defined by a formal grammar.
cLabels must provide support for a human readable summary of the claims they contain.
It must be possible to express cLabels in a compact form.
Vocabularies and authentication data must be formally encoded and support URI references.

3 The Content Label Model

The requirements in the previous section can be expressed in a more programmatic way as follows. A Content Label (cLabel) can carry a variety of statements such as:

cLabel {
  That resource R has the property P1 is true
  That resource R has the property P2 that has value V
  That resource R meets WCAG 1.0 AA is true
  That resource R was created in accordance with satisfactory procedures is true
}

where R may be either a single resource identified by its own URI or a group of resources. Membership of a group R may be defined either by pattern matching based on URIs or with reference to specified properties of resources. The latter case includes, but is not limited to, properties such as creation date, ISAN number etc.

Further, it is necessary to be able to make statements like:

metadata {
  cLabel was created by $organization
  {
    has the e-mail address mail@organization.org
    has a homepage at $url
    has a feedback page at $URL
    ...
  }
  cLabel was created on $date
  cLabel was last reviewed by $person
  cLabel has certificate $URL2
}

Finally, it is necessary to be able to send a real-time request to $organization seeking automatic confirmation that it was responsible for creating the cLabel, and to $URL2 for verification of the cLabel, i.e. authenticating the label and the claims made. This amounts to making statements like:

authentication {
  cLabel verified by $organization
  {
    has email sss
    ...
  }
  verified on $date
}

As a design principle, it is noteworthy that cLabels are usually created by people, whereas they are usually read by machines. Therefore it is important that associating cLabels with groups of resources is a simple task, with the burden of processing of the data pushed to the client side. That said, processing of cLabels may be done in real time as requests for resources are made. Therefore the processing burden must be kept as light as possible to minimize any latency caused by systems making use of cLabels.

Such a discussion leads us to the Content Label Model as described in figure 3.1 and the following text.

3.1 Objects of the Content Label Model

Figure 3.1 Schematic of the Content Label Model

A vocabulary collects together a number of related properties or aspects of a resource that may be useful in saying things about that resource. Those properties or aspects are identified by terms of the vocabulary. Each term is identified by a descriptor and may have constraints placed on the values that are appropriate for use with that descriptor. Further information may be associated with a descriptor such as test suites for checking value assignment. The scope of use of the vocabulary as a whole as well as the scope of individual terms may be noted.

Within a vocabulary it is possible to assign terms to a thematically-defined sub-group called a category. Categories may contain further sub-categories.

An expression is a statement in respect of a resource that the aspect of the resource denoted by a descriptor chosen from a specific vocabulary has a certain value. A valid expression makes reference to a vocabulary term that exists and whose chosen value conforms to the constraints specified in the vocabulary.
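
The validity rule for expressions can be sketched as follows. This is an illustration only: the class names, the representation of a value constraint as a predicate function, and the example vocabulary URI are all assumptions made for the sketch, not part of the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Term:
    descriptor: str
    constraint: Callable[[object], bool]  # returns True for acceptable values


@dataclass(frozen=True)
class Vocabulary:
    uri: str
    terms: Dict[str, Term]


@dataclass(frozen=True)
class Expression:
    vocabulary: Vocabulary
    descriptor: str
    value: object


def is_valid(expression: Expression) -> bool:
    """Valid if the term exists and the value satisfies its constraint."""
    term = expression.vocabulary.terms.get(expression.descriptor)
    return term is not None and term.constraint(expression.value)


# Hypothetical one-term vocabulary used only to exercise the check.
colors = Vocabulary(
    uri="http://example.org/vocab/color",
    terms={"color": Term("color", lambda v: v in {"red", "green", "blue"})},
)
```

Here is_valid(Expression(colors, "color", "red")) holds, while an unknown descriptor or a value outside the constraint makes the expression invalid.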

An assertion is a specific type of expression which is said to be true by the entity that makes it.

A claim is a specific type of assertion, whose veracity can be ascertained, either by reference to the test conditions described in the relevant vocabulary term, or by observation and interpretation of the meaning of the term as given in its scope notes.

A description is a resource that contains only assertions and claims. In isolation, what it describes or who provided the description is not defined.

A classification is a specialization of a description that is pre-defined. It may stand on its own, such as a movie rating, or it may comprise individual claims and assertions, such as those that together constitute WAI A, AA or AAA conformance.

Scope is a reference to a URI or to a group of URIs to which a cLabel (q.v.) is said to refer.

A cLabel is a resource that contains a description, a definition of the scope of the description, and assertions about both the circumstances of its own creation and the entity that created it. A cLabel may contain a short textual summary of the assertions and claims in a form suitable for display to end users.

A certificate is a special type of cLabel which contains assertions about another cLabel, for the purposes of verification.

A package is a collection of cLabels and certificates which, through scoping, provides a set of cLabels which may be applicable to a URI.

A repository is a storage mechanism for descriptions, cLabels and packages from which they can be retrieved without necessarily being linked from the content they describe.

Trustmarks and labelmarks are human perceivable signs that indicate the presence of certificates and cLabels respectively.
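
The relationships between the main objects of the model can be summarized as a set of record types. This is a reading aid under stated assumptions rather than a proposed encoding: all field names are hypothetical, and scope is reduced to a plain list of strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Description:
    # Each assertion or claim is held as a (descriptor, value) pair.
    assertions: List[Tuple[str, object]] = field(default_factory=list)


@dataclass
class CLabel:
    uri: str
    description: Description
    scope: List[str]               # URIs or URI patterns the label refers to
    created_by: str                # the entity that created the label
    created_on: str                # circumstances of creation
    summary: Optional[str] = None  # optional short text for end users


@dataclass
class Certificate(CLabel):
    # A certificate is a special type of cLabel about another cLabel.
    certifies: str = ""


@dataclass
class Package:
    labels: List[CLabel] = field(default_factory=list)
    certificates: List[Certificate] = field(default_factory=list)


label = CLabel(
    uri="http://example.org/labels/1",
    description=Description([("http://example.org/vocab/color", "red")]),
    scope=["http://www.example.com/"],
    created_by="http://example.org/",
    created_on="2006-12-01",
)
certificate = Certificate(
    uri="http://example.net/certs/1",
    description=Description([("http://example.net/vocab/agrees", True)]),
    scope=[label.uri],
    created_by="http://example.net/",
    created_on="2006-12-15",
    certifies=label.uri,
)
package = Package(labels=[label], certificates=[certificate])
```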

3.2 Content Label Scope

The grouping of resources is a fundamental aspect of content labels and is the subject of a separate paper [GROUP] that sets out its own abstract model for this specific area. It allows a group to be defined in such a way that it is programmatically possible to determine whether a resource to be resolved from a URI is a member of that group. This then makes it possible to determine whether the given resource is within the scope of a cLabel.

Scope definitions are based primarily on pattern matching against URI components: "everything on the example.com domain with a path that starts with 'red'", for example. However, this approach is not always practical. As a result, WCL also supports the definition of scope based on properties of resources, through simple lists of URIs, and simply by a resource pointing to a cLabel.

Key aspects of scope definition by URI and resource property are set out in the following sub-sections. The two are not mutually exclusive: scope may be defined both in terms of URI patterns and resource properties.

OQ 2: Should WCL-XG be successful in securing a WG charter, the abstract model for resource grouping will itself come under full scrutiny with a view to its publication either as a WG Note or a full Recommendation in its own right.

3.2.1 Scope Defined by URI

The abstract model for resource grouping uses terms taken primarily from RFC 3986, with minor elaborations. They are illustrated in the following example:

scheme://[userInfo@]host[:port][/path][?query][#fragment]

In addition, the term uri is defined to allow matching against the entire URI without decomposition.

3.2.1.1 Modifiers

Aside from rules governing the normalization process of candidate URIs, the matching process is controlled by a number of modifiers, as follows (default values are given in the table below):

	match	case
scheme	exact	insensitive
userInfo	exact	sensitive
host	endswith	insensitive
port	exact	sensitive
path	startswith	sensitive
query	startswith	sensitive
fragment	startswith	sensitive
uri	regex	sensitive

Table: Defaults for Modifiers

Match Type: Exact, startswith, endswith (for leading and trailing character matching) and Regex (if the match pattern is a regular expression)
Case: Controls whether the match is case sensitive or not
Negate: True if the result of the match is to be negated (default false)
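
The way the modifiers and their defaults interact can be sketched in a few lines of Python. This is an illustration under assumptions, not the algorithm of [GROUP]: it uses the standard library's urlsplit for decomposition, performs no normalization beyond what urlsplit does, and the function and parameter names are invented for the sketch.

```python
import re
from urllib.parse import urlsplit

# Default (match type, case sensitive) per component, from the table above.
DEFAULTS = {
    "scheme": ("exact", False),
    "userInfo": ("exact", True),
    "host": ("endswith", False),
    "port": ("exact", True),
    "path": ("startswith", True),
    "query": ("startswith", True),
    "fragment": ("startswith", True),
    "uri": ("regex", True),
}


def components(uri):
    """Split a URI into the component names used by the modifiers table."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "userInfo": parts.netloc.rpartition("@")[0],
        "host": parts.hostname or "",
        "port": "" if parts.port is None else str(parts.port),
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
        "uri": uri,
    }


def matches(component, pattern, uri, match=None, case_sensitive=None, negate=False):
    """Test one component of a URI, falling back to the default modifiers."""
    default_match, default_case = DEFAULTS[component]
    match = default_match if match is None else match
    case_sensitive = default_case if case_sensitive is None else case_sensitive
    value = components(uri)[component]
    if match == "regex":
        flags = 0 if case_sensitive else re.IGNORECASE
        result = re.search(pattern, value, flags) is not None
    else:
        if not case_sensitive:
            value, pattern = value.lower(), pattern.lower()
        if match == "exact":
            result = value == pattern
        elif match == "startswith":
            result = value.startswith(pattern)
        elif match == "endswith":
            result = value.endswith(pattern)
        else:
            raise ValueError(f"unknown match type: {match}")
    # Negate inverts the outcome of the match.
    return result != negate
```

With the defaults, matches("host", "example.com", uri) and matches("path", "/red", uri) together express the "everything on the example.com domain with a path that starts with 'red'" example, assuming the leading slash is part of the path pattern.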

3.2.1.2 Grouping, Inclusion and Exclusion

Within the abstract model, various facilities are provided for grouping and for providing specific inclusion and exclusion from groups already defined.

3.2.1.3 Encoding

The abstract model for resource grouping does not define an encoding, with the specific intention that it may be encoded equally inter alia within an XML environment, an RDF environment or some other environment. Different encodings are likely to take varying approaches to how they represent modifiers and inclusion and exclusion.

An XML representation, such as the example given in [GROUP], might encode modifiers as attributes. Other representations might choose to create data types by combining terms and modifiers, host vs. exactHost for example.

3.2.2 Scope Defined by Property

When defining groups by resource properties, label creators must make the data on the relevant properties available in such a way that the resource itself does not need to be retrieved and parsed in order to determine whether it is a member of the group or not. There are two reasons for this:

It facilitates the creation of a generalized cLabel system that does not need to be able to parse an unbounded number of types of resource.
It greatly increases efficiency.

The precise property or properties used to define the group will be determined by the label creator.
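
As a sketch of the principle that membership is decided without retrieving the resource, the following assumes a hypothetical property index that the label creator publishes separately, keyed by URI. The index format, the property names and the example URIs are all invented for illustration; how such data would actually be exposed is an open question.

```python
# Hypothetical property data published by the label creator, keyed by URI.
PROPERTY_INDEX = {
    "http://example.com/clips/1": {"created": "2006-05-01", "format": "video"},
    "http://example.com/clips/2": {"created": "2007-02-14", "format": "audio"},
}


def in_scope(uri, required, index=PROPERTY_INDEX):
    """True if the published properties of `uri` include every required pair.

    The resource itself is never fetched or parsed; only the index is read.
    """
    published = index.get(uri)
    if published is None:
        return False  # nothing published, so membership cannot be shown
    return all(published.get(name) == value for name, value in required.items())
```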

OQ 3: The mechanism for exposing properties of resources should preferably be both standard and cLabel based. However, the precise workings of this have not been examined and are for further study.

3.3 Associating Labels and Resources

There are a number of ways of associating labels with the resources they describe and vice versa. The discussion here is not intended to be definitive of how such linking is carried out; it is intended to be descriptive of how linking can be accomplished in various use cases. The discussion proceeds by examining direct linkage from resources to descriptions and to cLabels. It then goes on to look at indirect linkage via packages and also looks at relating cLabels from repositories to the content to which they refer.

3.3.1 Direct Association

Figure: Linking to a Description

At the simplest level, a resource can reference a description. The description, a collection of assertions, is then said to describe the resource. The description contains no linkage to any resources that refer to it. Consequently the authority for claiming that the description is accurate and appropriate is the provider of the link from the resource.

Figure: Relationship between resource and cLabel

A more powerful and substantial linkage relationship can be established by the resource making reference to a cLabel. As well as providing descriptions, cLabels also provide an authority and a scope. This provides assurance that a named party, which can be authenticated, believes that the assertions made apply to the content referenced by the scope. Clearly it is important that the scope identified by the cLabel actually includes any resource that refers to it. If it does not, then the association is invalid and the content label cannot be said to apply to the resource.

3.3.2 Indirect Association

Figure: Package Illustration

In some deployment environments, content providers do not have control over the metadata that is associated with any specific resource for which they may be responsible, and in particular cannot control the associations between resources and cLabels. This necessitates a more indirect association in which the relationships between resources and their labels are mediated by a package. In a typical instantiation, a package will be provided for a group of resources, e.g. a web site. The package contains various scoping statements about the applicability of the cLabels it contains to various resources. For example, it may contain a number of cLabels referring to the accessibility and child friendliness of its content, and may assert that these labels have general applicability within its scope. It may also say that, exceptionally, these labels do not apply to certain elements of sub-scope, and so replace default labeling with specific labeling for various sections of the site.

An informal illustration of this follows:

{scope "www.example.com"
  {use cLabel1, cLabel2}
  {scope "/tld1"
    {use cLabel3}
  }
}

In this illustration, cLabels 1 and 2 refer to any content on www.example.com, unless the path of the content's URI begins with /tld1, in which case cLabel 3 applies instead.

Clearly, the scope of the labels themselves must be consistent with the scoping statements of the package.
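
The default-and-override behavior of the informal illustration above can be sketched as follows. The dictionary layout and function name are assumptions made for the sketch. It also assumes that an override replaces the default labels rather than adding to them, that the host is matched with "endswith" and the path with "startswith" (the default modifiers), and that the longest matching path wins when several overrides match.

```python
from urllib.parse import urlsplit

# Hypothetical in-memory form of the package in the illustration.
PACKAGE = {
    "host": "www.example.com",
    "use": ["cLabel1", "cLabel2"],
    "overrides": [{"path": "/tld1", "use": ["cLabel3"]}],
}


def labels_for(uri, package=PACKAGE):
    """Return the cLabels the package applies to `uri`, or [] if out of scope."""
    parts = urlsplit(uri)
    if not (parts.hostname or "").endswith(package["host"]):
        return []
    best = None
    for override in package["overrides"]:
        if parts.path.startswith(override["path"]):
            # Prefer the most specific (longest) matching sub-scope.
            if best is None or len(override["path"]) > len(best["path"]):
                best = override
    return list(best["use"]) if best else list(package["use"])
```

Note that a plain prefix test also matches a path such as /tld10; a real encoding would need to say whether that is intended.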

OQ 4: If successful in securing a WG charter, the group will need to address how to resolve situations in which scoping statements do not match.

3.3.3 Repositories

The previous sections identify mechanisms for linking from resources to descriptions, cLabels and packages. In the case of cLabels and packages the resource is referred to from them, providing a bidirectional link.

It is also necessary for labeling authorities to be able to create labels and have them apply to resources, irrespective of whether the resources refer to them or not. In this case, label aware applications need knowledge of the repository so that labels for URIs in which they have an interest may be requested. Such requests might be made in parallel to, or in advance of, a request for the resource. For example, an application might wish to decorate hyperlinks on a web page with accessibility information gleaned from a specialist provider of accessibility labeling information.
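
The hyperlink decoration example can be sketched with an in-memory stand-in for a repository. The class, its methods and the use of simple URI prefixes as scope are all assumptions for illustration; no repository protocol is defined in this report.

```python
class Repository:
    """Hypothetical in-memory repository mapping scope prefixes to label URIs."""

    def __init__(self):
        self._entries = []  # list of (scope_prefix, label_uri)

    def add(self, scope_prefix, label_uri):
        self._entries.append((scope_prefix, label_uri))

    def labels_for(self, uri):
        # The labeled resource is not consulted; only the repository is.
        return [label for prefix, label in self._entries if uri.startswith(prefix)]


def decorate(links, repository):
    """Map each hyperlink on a page to the labels the repository holds for it."""
    return {link: repository.labels_for(link) for link in links}


repo = Repository()
repo.add("http://www.example.com/", "http://labels.example.org/accessibility/1")
decorated = decorate(
    ["http://www.example.com/page.html", "http://unlabeled.example.net/"], repo
)
```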

OQ 5: It is an open question as to whether access to label repositories should be governed by a standard protocol.

OQ 6: It is also an open question as to whether a repository should provide a bulk data transfer capability alongside whatever capabilities it offers for transfer of descriptions, cLabels and packages.

3.3.4 Instantiating Associations from Resources to Labels

A number of mechanisms suggest themselves through which resources can provide links to descriptions, cLabels and packages. In principle, the association could be provided by inclusion or by reference from the resource itself. However, this may or may not be technically feasible, depending on the format of the resource and the format of the label.

It might be inferred from this that a generic mechanism for associating an HTTP response with labels is required. While this would clearly have the benefit that cLabels for content could be discovered by means of a HEAD request, in practice, many content providers cannot modify the headers of their HTTP responses. Consequently, some format-specific mechanisms are required, especially one for HTML.

OQ 7: [The following is speculative and is for further study].

One such mechanism is suggested by Mark Nottingham in his note referring to the http link and profile parameters [MNOTT]. In this case the presence of a profile response is said to scope the values of the rel attribute. However in HTML the profile attribute on the HEAD element appears to scope the name attribute of the META element, not the rel attribute of the LINK element.

That said, we'd like to see a solution that looked somewhat similar to the following (based on [MNOTT]):

Example: cLabels for all resources on the example.com domain are provided in an RDF instance within a single Package.

HTTP Response Header
Link: </cLabels.rdf>; /="/"; rel="cLabel"; type="application/rdf+xml";

Equivalent XHTML link tag:
<link rel="cLabel" href="http://example.com/cLabels.rdf" type="application/rdf+xml" />

Note however that specification of the Link rel attribute as, say, 'cLabel' does not provide specific enough information as to whether the content of the cLabel is of interest to a processing application, which typically would be interested only in a specific subset of cLabels - for example only those relating to child protection, or only those relating to mobile friendliness. It would potentially be extremely onerous for an application to retrieve and examine all linked cLabels, so some means of identifying specific types of cLabel is required. This is complicated by the fact that cLabels can contain assertions from multiple vocabularies and may hence span multiple fields of interest.

Vocabularies are identified by URI, and so while it would be possible to construct a rel attribute value of the form 'cLabel/<URI>,<URI>' this would potentially add significantly to the size of retrievals. [It's not clear that the rel attribute is supposed to be structured.]
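
As a sketch of how a label-aware application might discover the XHTML link element discussed in this section, the following uses Python's standard html.parser module. The rel value "cLabel" is the speculative one used in the example and is not a registered link relation; the class and function names are invented for the sketch.

```python
from html.parser import HTMLParser


class CLabelLinkFinder(HTMLParser):
    """Collect (href, type) pairs from link elements whose rel includes cLabel."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> elements are routed here by the base class.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").split()
        if "cLabel" in rel_values and attributes.get("href"):
            self.links.append((attributes["href"], attributes.get("type")))


def find_clabel_links(markup):
    finder = CLabelLinkFinder()
    finder.feed(markup)
    return finder.links


page = (
    '<html><head>'
    '<link rel="stylesheet" href="style.css" />'
    '<link rel="cLabel" href="http://example.com/cLabels.rdf" '
    'type="application/rdf+xml" />'
    '</head><body></body></html>'
)
found = find_clabel_links(page)
```

The application still has to retrieve each linked document to learn which vocabularies it uses, which is the cost the surrounding discussion is concerned with.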

3.4 Trust

The issue of trust is a key issue for the Semantic Web and several relevant models have been proposed. Any of these could potentially apply here, especially if cLabels are encoded in RDF.

For example, TriQL.P is a query language that allows trust to be evaluated by comparing data drawn from different sources. In the WCL context one might see this working by querying cLabels from different sources describing the same group of URIs.

The University of Maryland's Trust Project looks at ways of building trust on the Web using shared personal movie ratings, shared contacts files etc.

W3C's Annotea project offers mechanisms for sharing annotations and bookmarks - ideas that find expression in things like social network sites and bookmark sharing systems. These networks can readily be leveraged to add trust to, and make use of, cLabels in a variety of ways.

3.4.1 Trust Mechanisms in WCL

WCL facilitates trust by offering authentication and certification.

Authentication provides the ability to verify that the assertions that are claimed to be made by a particular party really were made by that party. Certification provides the ability for a party other than the labeler to assert that they either agree or disagree with the assertions in the label. Since certificates are defined as labels, it is possible to authenticate certificates.

While we do not intend to be prescriptive about the exact deployments, we see the trust mechanisms being implemented in two principal ways. The first is where a content provider has a relationship with one or more Labeling Authorities (LAs) or certification providers who verify the content providers' labels. These LAs and certification providers need to be trusted by Web users. These ideas have been at the heart of the Quatro project, described in Appendix 2.

The second principal deployment we envisage is where the Web user has a relationship with a repository provider - and hence has access to labels that may or may not have been sanctioned by the content provider, and may have been contributed by third parties - as discussed in the references above. The repository provider may, in addition, be a certification provider.

The WCL model sets no limit on the number of cLabels that can apply to a given resource. If multiple cLabels offer differing opinions of the same resource, then the choice of which, if any, engenders more trust, is a matter for the end user. It is expected that attribution, authentication and certification will play a key role in such choices. As noted in Open Question 4, there are other possible conflicts that the putative WCL WG will need to address.
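
One possible end-user policy for choosing among several cLabels is sketched below. It assumes that authentication and certification have already been carried out by some separate mechanism and recorded as the hypothetical flags "authenticated" and "certified"; the form of those mechanisms is not defined in this report.

```python
def preferred_labels(labels, trusted_creators):
    """Keep authenticated labels from trusted creators, certified ones first."""
    kept = [
        label
        for label in labels
        if label["authenticated"] and label["created_by"] in trusted_creators
    ]
    # sorted() is stable, so labels keep their relative order within each group.
    return sorted(kept, key=lambda label: not label["certified"])


candidates = [
    {"uri": "L1", "created_by": "la.example.org", "authenticated": True, "certified": False},
    {"uri": "L2", "created_by": "unknown.example.net", "authenticated": True, "certified": True},
    {"uri": "L3", "created_by": "la.example.org", "authenticated": True, "certified": True},
    {"uri": "L4", "created_by": "la.example.org", "authenticated": False, "certified": True},
]
chosen = preferred_labels(candidates, {"la.example.org"})
```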

OQ 8: The form of authentication and certification mechanisms for cLabels requires further study.

OQ 9: There is also work to be done to more clearly define the roles of various players in the trust chain, such as labeling authority, certification provider etc.

3.5 Content Label Semantics

We define a Content Label as a resource that contains a description, a definition of the scope of the description and assertions about both the circumstances of its own creation and the entity that created it. In other words, a cLabel is the expression of an opinion held by an individual, organization or automaton at a particular point in time. It cannot be taken as proof, in a logical sense, that one or more of the assertions expressed in the cLabel is true as an empirical fact.

Furthermore, the content label is limited by the vocabularies used. That is, inferences cannot be drawn about a resource or group of resources based on the absence of any descriptor. To give a simple example of this, if a content label describes a resource solely in terms of its color, no inference can be drawn about its shape.

3.5.1 Transcluded Content, Fragments and Indirection

After considerable debate, the XG concluded that a label attached to a resource applies to the normal experience of that resource in its representation by an application that is intended for processing it. So, for example, a label attached to a Web page has the semantic that it refers to the rendering of the composite resource by common user agents. This means that the label refers to all transcluded elements of the page as displayed (e.g. images, sounds, scripts) as well as to the markup in the delivery unit implied by the page's URI.

This introduces a number of issues that the XG notes but has not resolved:

Transclusion is carried out with encoding-specific semantics, and arbitrary mechanisms may be at work to carry out the transclusion, depending on those semantics. It may be impractical for an application other than the one primarily intended to process the format in question to determine what the transcluded content is, or to associate requests for transcluded elements with original retrievals. It is possible that, using mechanisms such as the HTTP link header, such linkage could be exposed in a uniform way. Making this a reliable mechanism would be very hard, unless it formed part of a certification process.

OQ 10: It is not clear how labels on transcluded elements relate to any labels on the content that initiates the transclusion.

OQ 11: It is not clear whether it is useful or necessary to identify individual portions of content and to assign them individual labels, or to be able to provide labels that refer only to the resource without any transclusion.

OQ 12: It is also not clear how redirection of HTTP requests affects the interpretation of a cLabel that refers to the non-redirected URI but does not refer to the result of redirection.

3.5.2 cLabel Unique Identity Rules, cLabel Caching and Expiry

Each cLabel must have a unique identity, so any modification to it, such as renewing its validity period, will result in the creation of a new cLabel, with a new URI.
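
The identity rule can be sketched with an immutable record: renewing a label never alters the existing one but produces a new label with a new URI. The field names, the use of a random UUID to mint the URI, the base URI and the "supersedes" back-reference are all assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CLabel:
    uri: str
    valid_until: str
    supersedes: str = ""  # hypothetical link back to the label this one replaces


def renew(label, new_valid_until, base="http://example.org/labels/"):
    """Return a new cLabel with a fresh URI; the original is left untouched."""
    return replace(
        label,
        uri=base + uuid.uuid4().hex,
        valid_until=new_valid_until,
        supersedes=label.uri,
    )


original = CLabel(uri="http://example.org/labels/1", valid_until="2007-01-01")
renewed = renew(original, "2008-01-01")
```

Any resource that referred to the original URI would still point at the old label after renewal.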

OQ 13: It is not clear how this fits in with the operation of large scale workflows, which cannot easily accommodate changes in associations between content and their labels, and consequently this is for further study.

4 The WCL vocabulary

One of the purposes of WCL is to facilitate the development of vocabularies by which Labeling Authorities can define systems that allow the creation of labels on subjects that are of interest to them. For example, an authority that is interested in promoting accessibility might define a vocabulary that allows description of conformance to the W3C WAI guidelines. Others might wish to provide a vocabulary that allows claims of W3C MWI mobileOK conformance, suitability for children etc. While it is the intention of WCL to facilitate the creation of vocabularies in a standard form, the re-use and extension of existing vocabularies is preferred over the creation of new ones.

The Content Label Model does present a clear need for a vocabulary for describing the creation of content labels themselves for use in processing applications. This is set out in the following table.

OQ 15: [This section is subject to further review and elaboration - e.g. which terms are mandatory, which are optional, how to provide a unique identifier for the vocabulary. Note also that some of the terms suggest that labels can be altered, and there is an open question as to whether this is in fact possible, given that each instance of a Content Label must have a unique and unambiguous ID. Equally it is important that when a label is 'renewed' it is not then necessary to change all references to it. It may be possible to work around this by accessing labels by 30x redirection, but the rules applications would be required to follow remain to be discussed.]

Descriptor	Value Constraints	Scope Notes	Reference
Organization	Unicode String	Use this as the placeholder for terms describing the organization or individual that created the label	FOAF
Name	Unicode String	Use this to give the full name of the LA	FOAF
HomePage	Valid URL	The main web site of the LA	FOAF
mBox	Valid email address	The contact email address for correspondence relating to this label	FOAF
Address	Unicode String	vCard defines the following terms for postal addresses: Street, Locality, Post Code (Pcode), Country	vCard
Voice	A string representing an international format telephone number		vCard
Fax	A string representing an international format telephone number		vCard
Description	Unicode String	A short description of the LA and its work. For WCL purposes, this SHOULD be limited to 400 characters	DC
Subject	Unicode String	A brief text description of the subject(s) covered by the labels. Typically this will be in the form of keywords	DC
Shortname	Unicode String	The acronym, initials or other short name for the LA	WCL
Icon	A 16 x 16 pixel logo in GIF or PNG format		WCL
Public PGP Key	string	The LA's public key (if applicable)	WCL
Public digital certificate	URI	The LA's public digital certificate (if applicable)	WCL
Authority For	URI	A namespace of a vocabulary for which the label creator is an authority. The element can occur any number of times to declare multiple vocabulary namespaces	WCL
Last reviewed	W3C date & time format [W3CDTF]	The date on which the labeled resources were last reviewed.	DC
Reviewed by	Unicode string	The individual who reviewed the resource and verified the claims made in the cLabel.	DC
Approved	Unicode string	The individual who checked the reviewer's verification of the claims made in the label.	DC
Valid until	W3C date & time format [W3CDTF]	The date until which the cLabel or certificate SHOULD be treated as valid.	DC
Withdrawn	W3C date & time format [W3CDTF]	The date on which the cLabel creator or certification authority withdrew the cLabel or certificate.	DC
Test Result	URI	A link to a test result, such as an EARL assertion.	WCL

OQ 16: If successful in securing a WG charter, WCL will need to define additional terms for certifying labels. Also, it may be possible to define tests for some of the vocabulary terms.

Notes

The Dublin Core Term Issued SHOULD be used to declare when a cLabel or certificate was issued (the DC Term 'Created' is more suited as a descriptor for when the labeled resource was created). As with the WCL descriptors 'Last reviewed', 'Valid until' and 'Withdrawn', DC Terms 'Issued' is a specialization of the Dublin Core Date element and SHOULD be expressed in the W3C date & time format [W3CDTF].
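
As an illustration of how the date-valued descriptors might be used together, the following minimal Python sketch parses a subset of W3CDTF and reports whether a label is current. It is not part of the WCL model: the dictionary keys (issued, valid_until, withdrawn), the status names and the choice to read date-only values as midnight UTC are all assumptions made for the example.

```python
import re
from datetime import datetime, timedelta, timezone

# Subset of W3CDTF handled here: YYYY-MM-DD, or YYYY-MM-DDThh:mm[:ss]TZD
# where TZD is "Z", +hh:mm or -hh:mm. Other W3CDTF forms are rejected.
_W3CDTF = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})"
    r"(?:T(\d{2}):(\d{2})(?::(\d{2}))?(Z|[+-]\d{2}:\d{2}))?$"
)


def parse_w3cdtf(value):
    """Parse a W3CDTF string into a timezone-aware datetime."""
    match = _W3CDTF.match(value)
    if match is None:
        raise ValueError(f"unsupported W3CDTF value: {value!r}")
    year, month, day, hour, minute, second, tzd = match.groups()
    if tzd is None or tzd == "Z":
        # Assumption: a date-only value is read as midnight UTC.
        tz = timezone.utc
    else:
        sign = 1 if tzd[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(tzd[1:3]), minutes=int(tzd[4:6])))
    return datetime(int(year), int(month), int(day),
                    int(hour or 0), int(minute or 0), int(second or 0), tzinfo=tz)


def label_status(label, now):
    """Classify a label using hypothetical keys for the date descriptors."""
    withdrawn = label.get("withdrawn")
    if withdrawn and parse_w3cdtf(withdrawn) <= now:
        return "withdrawn"
    issued = label.get("issued")
    if issued and parse_w3cdtf(issued) > now:
        return "not-yet-issued"
    valid_until = label.get("valid_until")
    if valid_until and parse_w3cdtf(valid_until) < now:
        return "expired"
    return "current"


example_label = {"issued": "2006-08-11", "valid_until": "2007-08-11T00:00:00Z"}
```

Checking 'Withdrawn' before 'Valid until' is a design choice of the sketch; the report does not define a precedence between the two.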

The 'Authority for' term is provided because although it is usual for an LA to issue labels from its own vocabulary, it may wish to include other vocabularies as well. Furthermore, additional vocabularies may be included in cLabels by the content provider or others, and this term enables an LA to specify exactly for which descriptions it is and is not responsible.
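
A consumer could use the 'Authority for' declarations to separate the assertions an LA stands behind from those added by others. The sketch below is one possible reading, not a defined WCL algorithm: it assumes assertions are available as (property URI, value) pairs and that a property belongs to a vocabulary when its URI starts with the declared namespace. All URIs shown are hypothetical.

```python
def split_by_authority(assertions, authority_for):
    """Partition assertions into those inside and outside the LA's namespaces.

    assertions    -- iterable of (property_uri, value) pairs
    authority_for -- namespaces the LA declared via 'Authority For'
    """
    own, other = [], []
    for prop, value in assertions:
        # Assumption: namespace membership is decided by URI prefix.
        if any(prop.startswith(ns) for ns in authority_for):
            own.append((prop, value))
        else:
            other.append((prop, value))
    return own, other


# Hypothetical data: one LA vocabulary plus a term added by the content provider.
la_namespaces = ["http://la.example.org/vocab#"]
label_assertions = [
    ("http://la.example.org/vocab#violence", "none"),
    ("http://provider.example.com/terms#audience", "teachers"),
]
la_claims, other_claims = split_by_authority(label_assertions, la_namespaces)
```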

5 Encodings

As noted throughout this report, the group is seeking a Working Group charter and believes that this is the correct forum in which to define a normative encoding for cLabels. It is anticipated that the primary encoding will be in RDF but that alternatives will be considered: for example, extensions for RSS and ATOM to allow a default cLabel to be declared at the channel/feed level with overriding cLabels at item/entry level.

Several XG members have experience of working with RDF Content Labels [RDF-CL], a system that has not been fully standardized. Unless and until a WG charter is secured, this method remains adequate and practical, and it meets a great many of the use cases and requirements set out by the XG. However, it is not a full encoding of the WCL model, differing in the following respects:

RDF-CL seeks to limit the number of labels that can apply to a given resource to 1 (WCL allows any number).
RDF-CL has a much more limited mechanism for defining groups of resources than is set out in this report and its companion document.
RDF-CL offers parallel systems for providing labels, classifications and management information (creator, rights holder etc.). This has the potential to cause unnecessary confusion.
RDF-CL has some native support for labeling of movies and video games, but seeks to restrict the way this is done to a particular paradigm that is relatively inflexible. The issues are probably better handled through extensions to the WCL model.

Work has been done within the XG looking at how RDF-CL, and by extension a future RDF encoding of WCL, might be used in other circumstances. For example, could content labels be developed as a microformat? Possibly. However, the lack of namespaces leaves open the possibility of confusion over common words and terms being used to describe content in different vocabularies/schemata.

RDFa has real potential since it lets us annotate XHTML documents with classes and properties from RDF vocabularies, using namespaces and QNames as in any XML document. There are two options for describing content with RDFa: either use RDFa annotations in the XHTML document/resource as class names and inject namespaces as needed, or link from the document/resource to another XHTML document that contains the human-readable representation of the labels (readable using any browser), and annotate the XHTML elements (div, span) with RDF class names and properties. In both cases GRDDL could be used to transform the XHTML instance to RDF that could then be queried using SPARQL. The first case covers the issues of defining groups.

The group also considered whether it would be feasible to treat an RDF-CL instance as XML (assuming the RDF were serialized in XML). This is certainly possible in that XSLT transforms were successfully carried out. However, it was a very fragile system that depended on the data being presented according to a particular structure. Such constraints could be imposed by using Schematron, but this is an issue that needs further exploration in later work.

OQ 17: If successful in securing a WG charter, WCL would seek to:

Produce a normative encoding of the WCL model in RDF. This would, of course, take into account any changes to the model made by the WG itself. The discussion would be likely to include sample SPARQL queries etc. and guidance on making data available for property-based resource grouping.
Show examples of cLabels encoded using RDFa.
At least sketch encoding cLabels entirely in XML.

6 Summary

The WCL Incubator Group began with a simple mission and a likely candidate technology. As with so many aspects of life, the simplicity was deceptive and the ready-made solution found to be lacking in key areas. Many issues and potential problems have been brought to light. However, support for the group's work, and demand for a fully-worked method of applying labels to content, has been, and remains, strong. It is for all these reasons that a Working Group charter is being sought.

The open questions raised throughout the XG process as reported in this document are collated and presented below in an approximate order of priority.

The Content Label Model itself should be reviewed with regard to simplicity, overall coherence and reconciliation with the requirements. (OQ 17)
- How do labels on transcluded elements relate to any labels on the content that initiates the transclusion? (OQ 10)
- Is it useful or necessary to identify individual portions of content and to assign them individual labels, or to be able to provide labels that refer only to the resource without any transclusion? (OQ 11)
- How should processors resolve situations in which scoping statements do not match? (OQ 4)
- Is there a requirement for some form of cLabel that can only be edited by editing the resource it describes? (OQ 1)
The abstract model for resource grouping needs to be scrutinized and perhaps published as a separate document in its own right (OQ 2)
- What should the mechanism be for exposing properties of resources for the purposes of scope definition? (OQ 3)
- The roles of the various players in the trust chain, such as labeling authority and certification provider, need to be more clearly defined (OQ 9)
Should access to label repositories be governed by a standard protocol? (OQ 5)
- Should repositories provide a bulk data transfer capability alongside whatever capabilities it offers for transfer of description, cLabels and packages? (OQ 6)
A normative encoding of the WCL model needs to be developed in RDF (OQ 17)
- The form of authentication and certification mechanisms for cLabels requires further study (OQ 8)
- Examples of cLabels encoded using RDFa should be shown (OQ 17)
- Sketches of encoding cLabels in other technologies, particularly XML, should be shown (OQ 17)
Use of link headers, whether in HTML or HTTP, needs significant work; as well as ways of identifying that cLabels are available, a method is needed to define what type of information is included in the cLabels. (OQ 7)
- How does redirection of HTTP requests affect the interpretation of a cLabel that refers to the non-redirected URI but does not refer to the result of redirection? (OQ 12)
What is the relationship between the validity of the label and the cache headers that may be attached to it? (OQ 14)
How does the requirement that any change to a cLabel entails a change in URI fit in with the operation of large-scale workflows? (OQ 13)
Additional terms for certifying labels need to be defined (OQ 16)
- Which vocabulary terms are mandatory? Which are optional? (OQ 15)

Carrying out these detailed and critical aspects of the WCL group's work on the Recommendation Track will ensure full scrutiny by the wider community, and enable greater interplay with, for example, the Mobile Web Best Practices Group [MWBP], the Rule Interchange Format WG [RIF] and the Evaluation and Repair Tools Working Group [ERT].

7 Glossary

The following terms are used throughout this report. Definitions have been collected from W3C glossaries where possible and provided a priori where necessary.

Assertion Any expression which is claimed to be true. [W3C definition source]

Authenticate (n. authentication) To provide evidence that assertions made in a cLabel or a certificate are the authentic view of the entity that created them. Such evidence will typically be acquired by direct communication with that entity.

Category A thematically-related sub-group of terms within a vocabulary.

Certificate A cLabel containing assertions about the veracity of claims made in another cLabel.

Certification The process of verification of claims and the creation of a certificate.

Claim An assertion whose truth can be assessed by an independent party.

Classification A specialization of a description; one that is pre-defined.

Content Label, cLabel A resource that contains a description, a definition of the scope of the description and assertions about both the circumstances of its own creation and the entity that created it.

Content provider An entity (individual, organization or automaton) that provides resources in response to requests, whether or not the resource was created by that entity.

Description A resource that contains only assertions and claims.

Descriptor An aspect of a resource about which it is possible to make assertions. For example, color, size and shape. A descriptor becomes a vocabulary term when it is associated with possible values.

Expression An instance of a vocabulary term and its value.

Information resource A resource which has the property that all of its essential characteristics can be conveyed in a message. [W3C definition source]

Labeling Authority (acronym LA) An organization that provides infrastructure for the generation and authentication of content labels.

Labelmark A human perceivable sign that a cLabel has been issued.

Package A collection of cLabels and certificates that apply within some scope.

Repository A storage mechanism for descriptions, cLabels and packages from which they can be retrieved without necessarily being linked from the content they describe.

Resource Anything that might be identified by a URI. [W3C definition source]

Resource creator The individual or organization that created the resource.

Schema (pl., schemata) A document that describes an XML or RDF vocabulary. Any document which describes, in a formal way, a language or parameters of a language. [W3C definition source]

Scope The set of resources to which a cLabel states it applies, or to which a Package states it applies.

Summary A short description of what is said about the resource by the cLabel, suitable for display to end users.

Trustmark A human perceivable sign that a certificate has been issued.

Valid A cLabel is valid if it has an associated schema or schemata and if it complies with the constraints expressed therein. [Adapted W3C definition]

Verification The process of assessing the correctness of claims.

Vocabulary A collection of vocabulary terms, usually linked to a document that defines the precise meaning of the descriptors and the domain in which the vocabulary is expected to be used. When associated with a schema, attributes are expressed as URI references. [This definition is an amalgam of those provided in Composite Capability/Preference Profiles (CC/PP): Structure and Vocabularies 1.0 and OWL Web Ontology Language Guide.]

Vocabulary term An attribute that can describe one or more resources using a defined set of values or data type. Attributes may be expressed as a URI reference. See also descriptor and expression.

Well-formed Syntactically legal. [W3C definition source]

8 Links and References

Dublin Core: http://dublincore.org/
FOAF: http://www.foaf-project.org/
EARL: Evaluation And Report Language
GROUP: URI Pattern Matching for Groups of Resources
Dublin Core Date: Defined in the Dublin Core Registry
W3CDTF: http://www.w3.org/TR/NOTE-datetime
vCard: IMC specification; W3C Note on encoding vCard in RDF/XML
PROFILE: The global structure of an HTML document: Meta data profiles
MNOTT: See Mark Nottingham's current work on Links HTTP Response Headers, for example Bringing Back the Link - With a Twist
RDF: http://www.w3.org/RDF
QUATRO: http://www.quatro-project.org/
SIP: http://europa.eu.int/saferinternet
RDF-CL: http://www.w3.org/2004/12/q/doc/content-labels-schema.htm
ATOM: http://www.atomenabled.org/developers/syndication/atom-format-spec.php
RDFa: http://www.w3.org/TR/xhtml-rdfa-primer/
GRDDL: Gleaning Resource Descriptions from Dialects of Languages
SPARQL: SPARQL Query Language for RDF

9 Acknowledgements

The editors acknowledge significant contributions from:

Kal Ahmed, Techquila
Dan Appelquist, Vodafone Group Services
Dan Brickley
Kendall Clark
Kjetil Kjernsmo, Opera Software
Pantelis Nasikas, Institute of Informatics & Telecommunications (IIT), NCSR
Diana Pentecost, AOL Inc.
Dave Rooks, Segala
Kai-Dietrich Scheppe, T-Online
Noboru Shimizu, IA Japan

Appendix 1 Original Use Cases & Requirements

Use Case 1: Profile matching

The original use case given in the charter has been simplified by reducing the number of essential actors to three:

CONTENT PROVIDER (metadata provider)
PORTAL PROVIDER (metadata consumer)
END USER

One can imagine a range of scenarios with very similar characteristics that amount to "sub-use cases."

Sub use case 1A: END USER discovers content appropriate to their device ["MobileOK"]

Fig 1. Diagrammatic version of sub-use case 1A.

END USER visits portal.
END USER's device profile is extracted with reference to a separate metadata store.
END USER searches for a topic of interest.
PORTAL PROVIDER matches END USER's device profile with content profiles provided by CONTENT PROVIDER.
PORTAL PROVIDER provides search results matching this topic.
PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "mobile friendliness" of the content/presentation in question and the known properties of the device profile according to business rules.
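
The matching and filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ContentProfile fields, the min_screen_width descriptor and the single business rule (content is shown when the device screen is at least as wide as the content requires) are hypothetical and are not drawn from any MobileOK vocabulary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentProfile:
    """Hypothetical metadata a CONTENT PROVIDER supplies for one resource."""
    uri: str
    topics: frozenset
    min_screen_width: int  # pixels; an invented descriptor for this sketch


def search_for_device(catalogue, topic, device_profile):
    """Return URIs on the topic that suit the END USER's device."""
    # Find results matching the topic of interest.
    on_topic = [c for c in catalogue if topic in c.topics]
    # Filter by a stand-in business rule comparing content and device profiles.
    return [c.uri for c in on_topic
            if c.min_screen_width <= device_profile["screen_width"]]


catalogue = [
    ContentProfile("http://example.org/a", frozenset({"football"}), 176),
    ContentProfile("http://example.org/b", frozenset({"football"}), 800),
    ContentProfile("http://example.org/c", frozenset({"cooking"}), 176),
]
phone = {"screen_width": 240}  # device profile from a separate metadata store
```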

Sub use case 1B: END USER discovers content appropriate to their age-group ["Child Protection"]

Fig 2. Diagrammatic version of sub-use case 1B.

END USER visits portal.
END USER's user profile is extracted from a repository, perhaps the portal's own.
END USER searches for a topic of interest.
PORTAL PROVIDER matches END USER's age with content profiles provided by CONTENT PROVIDER.
PORTAL PROVIDER provides search results matching this topic.
PORTAL PROVIDER filters results based on the metadata encoded in the content with regard to the "child friendliness" of the content/presentation in question and the known age of the user according to local business rules.

Use case 2: Trustmark Scheme operator to content portal

The Example Trustmark Scheme reviews online traders, providing a trustmark for those that meet a set of published criteria. The scheme operator wishes to make its trustmark available as machine readable code as well as a graphic so that content aggregators, search engines and end-user tools can recognize and process them in some way.

The trustmark operator maintains a database of sites it has approved and makes this available in two ways:

First, the labeled site includes a link to the database. This can be achieved in a variety of ways such as an XHTML Link tag, an HTTP Response Header or even a digital watermark in an image. A user agent visiting the site detects and follows the link to the trustmark scheme's database from which it can extract the description of the particular site in real time.

Secondly, the scheme operator makes the full database available in a single file for download and processing offline.

Since the actual data comes directly from the trustmark scheme operator, it is not open to corruption by the online trader and can therefore be considered trustworthy to a large degree. To reduce the risk of spoofing, however, the data is digitally signed.

Use case 3: Website to end-user

Mrs. Chaplin teaches 7 year olds at her local school. An IT enthusiast, she makes her teaching materials available through her personal website. She adds metadata to her material that describes the subject matter and curriculum area. In order to gain wider trust in her work she submits her site for review by her local education authority and a trustmark scheme. Both reviewers offer Mrs. Chaplin a digitally signed, machine-readable version of their trustmark that she can add to her site. She merges these into a single pool of metadata to which she adds content descriptors from a recognized vocabulary that declare the site to contain no sex or violent content. She adds her own digital signature to the metadata. The set of digital signatures allows user-agents to identify the origin of the various assertions made. As in use case 2, links from the content itself point to this metadata.

Since the metadata is on the website itself, user agents are unlikely to take the assertions made in the metadata at face value. Unlike the trustmark operator, the local authority does not operate a web service that can support the label. It does, however, digitally sign its labels and publishes its public key on its website. This can be used to verify that it is indeed the local education authority that issued the relevant data in the label.

Separately, a user-agent can interrogate the trustmark operator's database in real time to check whether Mrs. Chaplin is authorized to make the assertions relevant to their namespace. Furthermore, the use of a recognized vocabulary for the content description means that a content analyzer trained to work with that vocabulary can give a probabilistic assessment of the accuracy of the relevant data.

Taken together, these multiple sources of data can provide confidence in the quality of the content and in the local authority trustmark, which is not directly testable. The multiple data sources may be further supported by recognizing that Mrs. Chaplin's work is cited in many online bookmarks, blog entries and postings to education-related message boards.

Use Case 4: Rich Metadata for RSS/ATOM

Dave Cook's website offers reviews of children's films and the site is summarized in both RSS and ATOM feeds. Most of the films reviewed have an MPAA rating of G and/or British Board of Film Classification rating of U. This is declared in a rating for the channel as a whole. However, Dave includes reviews of some films rated PG-13 or 12 respectively; these ratings are declared at the item level and override the channel level metadata.

The actual rating information comes from an online service operated by the relevant film classification board itself and is identified using a URL and human-readable text. The movie itself is identified by either an ISAN number or the relevant Internet Movie Database entry ID number. As with use case 2, trust is implicit given the source of the data, which is indicated by a link to Dave's site's policy.

Separately, Fred combines Dave Cook's and other review feeds to provide alternative reviews of the movies by transforming the ATOM feeds into RDF and creating an aggregate view using SPARQL queries.
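
The default-and-override behavior described in this use case can be sketched as follows. The feed is modeled as a plain Python dictionary rather than real RSS or ATOM markup, and the key names (default_rating, rating, id) are assumptions made for the example, since no feed extension has been defined.

```python
def effective_rating(feed, item):
    """An item-level rating wins; otherwise the channel-level default applies."""
    return item.get("rating", feed["default_rating"])


def ratings_by_item(feed):
    """Map each item identifier to the rating that applies to it."""
    return {item["id"]: effective_rating(feed, item) for item in feed["items"]}


# Hypothetical feed: a channel-wide G/U rating and one overriding item.
feed = {
    "default_rating": {"mpaa": "G", "bbfc": "U"},
    "items": [
        {"id": "review-1"},
        {"id": "review-2", "rating": {"mpaa": "PG-13", "bbfc": "12"}},
    ],
}
ratings = ratings_by_item(feed)
```

Here the override replaces the channel-level rating as a whole. Whether an item-level cLabel should instead be merged with the default is not settled by this report.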

Use Case 5: MLK and the KKK

Fred operates an antiracism education site which aggregates and curates content from around the Web. Fred wants to label the resources that he aggregates such that educational and other institutions may harvest the resources and associated commentary and metadata automatically for reuse within their instructional support systems, etc.

One of the ways in which Fred wants to curate resources is to say about them that they are pedagogically useful but politically noxious. For example, some sites on the Web make claims about Martin Luther King, Jr that are motivated by a racist ideology and are historically indefensible. Fred's vocabulary allows him to claim that such resources are pedagogically useful for purposes of analysis, but that they are otherwise suspicious and should only be consumed by students in an age-appropriate manner or with appropriate supervision, etc. In other words, Fred needs to be able to make sharply divergent claims about resources: (1) that they are noteworthy, and (2) that they are, from his perspective, dangerous or noxious or troublesome.

Use Case 6: Scalar Classification

A company named Advance Medical Inc. reviews medical literature on the Web based on a range of quality criteria such as effectiveness and research evidence. The criteria may be changed according to current scientific and professional developments. The review process leads to literature being classified as belonging to one of 5 levels as follows.

Level A : clear evidence
Level B : supportive evidence
Level C : poor evidence
Level D : expert opinion with explicit critical appraisal
Level E : no evidence

The company produces label data that declares the classification level value and provides a summary of each document. The label data is stored in a metadata repository which can be accessed via the Web.
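
A scalar classification of this kind maps naturally onto an enumeration. The Python sketch below is illustrative only: the record layout (document, level, summary) and the example URIs are hypothetical, and no ordering between levels is assumed beyond the names given above.

```python
from enum import Enum


class EvidenceLevel(Enum):
    """The five levels listed in this use case."""
    A = "clear evidence"
    B = "supportive evidence"
    C = "poor evidence"
    D = "expert opinion with explicit critical appraisal"
    E = "no evidence"


def documents_at_levels(labels, accepted):
    """Return the documents whose label carries one of the accepted levels."""
    return [label["document"] for label in labels
            if EvidenceLevel[label["level"]] in accepted]


# Hypothetical label data as it might be held in the repository.
repository = [
    {"document": "http://example.org/paper-1", "level": "A", "summary": "..."},
    {"document": "http://example.org/paper-2", "level": "E", "summary": "..."},
    {"document": "http://example.org/paper-3", "level": "B", "summary": "..."},
]
well_supported = documents_at_levels(repository, {EvidenceLevel.A, EvidenceLevel.B})
```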

M.D. Smith uses the label data in the repository to make decisions about health care for specific clinical circumstances.

Requirements

The following requirements have been approved by the group.

It must be possible to group resources and to make assertions that apply to the group as a whole. (This is fundamental to all use cases)
It must be possible to self-label (use cases 2 - 4)
To provide as complete a description as possible, labels must be able to contain unambiguous assertions using more than one vocabulary (all use cases, especially 3)
It must be possible for a content provider to make reference to third party labels (use case 2)
It must be possible to make assertions about the accuracy of claims made in a label (use case 2)
The system must be readily usable within a commercial workflow, allowing a content provider to apply metadata to a large number of resources in one step and to separate the activity of labeling from that of content creation, where desired (use case 1).
The system must support a concept of default and override metadata. The mechanism that is used to determine where overrides apply should be based on the full concept of a URI rather than, for example, just a web URL (use cases 1, 2 and 4)
It should be possible to ascertain unambiguously who created the label, using techniques such as digital signatures, S/MIME etc. (use cases 2, 3 and perhaps 5)
It must be possible for a labeling organization to make all its labels available as a single database (use case 2)
It should be possible to include assertions from an unlimited number of vocabularies in a single content label. Assertions from each vocabulary may be subject to its own verification mechanism (use case 3)
Labels should support a human-readable summary as well as the machine-readable code (all).
Labels should validate to formal published grammars (all)
It must be possible to encode labels in a compact/efficient form (all)
It must be possible to identify whether labels are self-applied or created by a third party (use case 2)
It must be possible to discover a feedback mechanism for reporting false claims (all, especially use case 2)
It must be possible to associate labels with a 'time to live' and/or 'expiry date' (all, especially use case 2)
It must be possible to discover the date and time when a label was last verified and by whom (all, especially use case 2)
It must be possible to describe the process by which data in labels is to be verified (use case 3)

Although not a testable requirement, the group has further resolved the principle that adding labels to resources should be easy and intuitive. It is recognized that this is likely to be made so through implementation, but the design of the system should nonetheless be mindful of the principle (use case 3).

Appendix 2: The Quatro Model

The Quatro project, co-funded by the European Union [SIP], has provided significant input to the WCL-XG, particularly with respect to the concepts of a labeling authority and the authentication of labels. Quatro posits a single basic architecture with 3 variations. In all cases, content is linked to the label and there is a link to the Labeling Authority.

It should be noted that although every resource must be linked to the label in this model, a client seeking that label will only have to retrieve it once. After the data has been retrieved, whether from the labeled site or the LA database, no further network request is required for as long as the label is held in cache by the client.

A2.1 Onsite labels that may be edited

Fig A2.1 Diagrammatic representation of simple architecture in which data about resources is held on the same server as those resources. There is a link to the LA which is able to authenticate the data.

Labels are hosted near to the resources they describe (typically the same website) and the LA's database supplies simple authentication. Since in this model content providers are allowed to edit their label to reflect changes in their content, the cLabel may not be exactly the same as the one issued. However the LA is able to assert that it trusts the content provider to make such changes faithfully.

A2.2 Onsite labels that may not be edited

Fig A2.2 Diagrammatic representation of simple architecture in which data about resources is held on the same server as those resources. A hash of this data is sent to the LA, which is then able to authenticate the data by comparing it with the hash stored in its database.

This is similar to the previous model but differs in the important respect that the LA does not allow content providers to edit their labels. Therefore a hash of the label can be checked against data held by the LA to ensure label integrity.
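
The integrity check in this variation can be sketched with Python's standard library. The report does not name a hash algorithm or say how the LA exposes its stored value, so the use of SHA-256 and the function names below are assumptions made for illustration.

```python
import hashlib
import hmac


def label_digest(label_bytes):
    """Hex digest of the label exactly as served (SHA-256 is an assumption)."""
    return hashlib.sha256(label_bytes).hexdigest()


def is_unmodified(label_bytes, registered_digest):
    """Compare the served label against the digest held by the LA."""
    return hmac.compare_digest(label_digest(label_bytes), registered_digest)


# Hypothetical flow: the LA records the digest at issue time and a client
# later checks the copy it fetched from the content provider's site.
issued_label = b"<label>no violence</label>"
digest_on_record = label_digest(issued_label)
fetched_ok = is_unmodified(issued_label, digest_on_record)
fetched_tampered = is_unmodified(b"<label>edited</label>", digest_on_record)
```

Because the comparison is byte-for-byte, any re-serialization of the label, even one that preserves its meaning, would fail the check. A real encoding would need to define a canonical form.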

A2.3 cLabels delivered directly from the LA database

Fig A2.3 Diagrammatic representation of architecture in which all label data is held by the Labeling Authority.

In this architecture, labels are delivered directly from the LA's database. Therefore there is no possibility for the label to be modified by the content provider, and the source of the labels carries greater inherent trust. On the downside, this model places greater demands on the server infrastructure, system integration and bandwidth usage by the LA.

W3C Content Labels

W3C Incubator Group Report Draft 0.9.2; 11 August 2006
