Use Case Provenance in Blogosphere

From XG Provenance Wiki


Chris Bizer

(Curator: Satya Sahoo)

Provenance Dimensions

  • Primary: Attribution (Content), Evolution and versioning (Content)
  • Secondary: Scale (Management), Law Enforcement (Management), Understanding (Use), Trust (Use), Incomplete provenance (Use)

Background and Current Practice

Within the blogosphere, topics are discussed across blogs that refer to each other, for example personal blogs, project weblogs, and company blogs. These cross-references take the form of links at the bottom of a blog post, hyperlinks within a post, and quotations of text from other blogs. Blog posts are also aggregated and republished by services such as Technorati, BlogPulse, Tailrank, and BlogScope, which track the interconnections between bloggers. Correct attribution of blog posts as they are processed, aggregated, and republished on the Web is therefore an important requirement in the blogosphere.


Goal

Enable applications on the Web to attribute content from different sources to a specific individual or organization.

In this use case, blogs are an example of content flow between websites, and it is important to trace back republished posts to their original source.
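Tracing a republished post back to its source can be sketched as following a chain of "republished from" links. The example below is a minimal illustration, not part of the use case: the `republished_from` mapping, the function name `trace_to_origin`, and the example URLs are all hypothetical, and the sketch assumes each copy records at most one direct source.

```python
from typing import Dict, List


def trace_to_origin(url: str, republished_from: Dict[str, str]) -> List[str]:
    """Follow 'republished from' links starting at url.

    Returns the chain of URLs visited. The last element is the earliest
    source that is known, which may not be the true origin if the
    provenance records are incomplete.
    """
    chain = [url]
    seen = {url}
    current = url
    while True:
        source = republished_from.get(current)
        # Stop when no source is recorded, or when the records loop back.
        if source is None or source in seen:
            break
        chain.append(source)
        seen.add(source)
        current = source
    return chain


# Hypothetical records: an aggregator copy of a quote of an original post.
records = {
    "http://aggregator.example/p/1": "http://blog-b.example/quote",
    "http://blog-b.example/quote": "http://blog-a.example/original",
}
chain = trace_to_origin("http://aggregator.example/p/1", records)
```

Returning the whole chain rather than only its last element keeps every intermediate republisher available for attribution.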

Use Case Scenario

A website X collates Web content on a particular topic from multiple sites, then processes and aggregates it for use by its customers. It is imperative for website X to present only credible and trusted content in order to satisfy its customers, attract new business, and avoid legal issues.

In the context of this blogosphere use case, a blog aggregator service or a user wants to identify the author of a blog without violating privacy laws. In some scenarios, the aggregator service or user may have only incomplete attribution information. When the author of a blog is listed only by name (first name, last name), disambiguation is difficult because multiple blog authors may share the same name. Resolving the ambiguity may require additional user information (for example, an email address), which must be used without violating user privacy or privacy laws.
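One way to use an email address for disambiguation without publishing it is to compare hashes rather than the addresses themselves. The sketch below follows the convention of the FOAF property foaf:mbox_sha1sum (the SHA-1 digest of the mailto: URI). The `Author` record, the `same_author` helper, the lower-casing step, and the example addresses are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


def mbox_sha1sum(email: str) -> str:
    """SHA-1 hex digest of the mailto: URI, in the style of foaf:mbox_sha1sum."""
    # Trimming and lower-casing are assumptions of this sketch, added so
    # that trivially different spellings of one address compare equal.
    uri = "mailto:" + email.strip().lower()
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Author:
    name: str
    mbox_hash: str  # hashed mailbox; the plain address is never stored


def same_author(a: Author, b: Author) -> bool:
    """Treat two records as the same person only if name and hash both match."""
    return a.name == b.name and a.mbox_hash == b.mbox_hash


# Two hypothetical authors who share a name but not a mailbox.
first = Author("John Smith", mbox_sha1sum("john.smith@blog-a.example"))
second = Author("John Smith", mbox_sha1sum("jsmith@blog-b.example"))
again = Author("John Smith", mbox_sha1sum("John.Smith@blog-a.example "))
```

Here `same_author(first, second)` is false and `same_author(first, again)` is true. A hash can still be confirmed by anyone who already knows the address, so this hides addresses from casual readers rather than guaranteeing compliance with any particular privacy law.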

Problems and Limitations

The provenance of Web content in general, and of blog posts in particular, is necessary both to users, for correct attribution, and to aggregating services. Aggregating services require provenance information not only to attribute content but also to offer additional services such as ranking popular blog posts.
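As an illustration of why ranking depends on provenance, the sketch below ranks posts by the number of distinct posts that link to them. It is a deliberately simple stand-in, not the method of any service named above; the function name `rank_by_inbound_links` and the `(citing, cited)` pair format are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_inbound_links(links: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank posts by how many distinct posts cite them, most cited first.

    links holds (citing_post, cited_post) pairs. Duplicate pairs are
    counted once and self-links are ignored.
    """
    counts = Counter(cited for citing, cited in set(links) if citing != cited)
    # Sort by descending count, then by post identifier for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical link data: one duplicate pair and one self-link.
links = [("a", "c"), ("b", "c"), ("a", "b"), ("a", "c"), ("c", "c")]
ranking = rank_by_inbound_links(links)
```

If republished copies are not first resolved to their original posts, links are split across the copies and the original is under-counted, which is where attribution and ranking meet.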

Technical Challenges:

  • Enable trace-back and correct attribution without violating user privacy and privacy laws
  • Disambiguate content authors when provenance information is incomplete
  • Extend existing vocabularies for representing posts, such as SIOC, to model finer-grained provenance information

Existing Work

The SIOC project has developed a vocabulary for representing posts. This vocabulary is often used together with FOAF, which represents information about the physical person related to a sioc:User (e.g. name, last name, phone, social network), and with SKOS, which is used mainly to represent topics and the taxonomy relationships between them.
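How the three vocabularies fit together can be sketched by describing a single post as N-Triples. This is an illustrative sketch only: the helper names and all example IRIs are hypothetical, and it uses sioc:UserAccount, the name that current versions of the SIOC specification use for what is called sioc:User above.

```python
SIOC = "http://rdfs.org/sioc/ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    """Format an IRI as an N-Triples term."""
    return f"<{value}>"


def lit(value: str) -> str:
    """Format a plain string literal, escaping backslash, quote, and newline."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def describe_post(post, account, person, author_name, topic, topic_label) -> str:
    """Return N-Triples linking a post to its account, person, and topic."""
    triples = [
        (post, RDF_TYPE, iri(SIOC + "Post")),
        (post, SIOC + "has_creator", iri(account)),    # post -> account
        (post, SIOC + "topic", iri(topic)),            # post -> topic
        (account, RDF_TYPE, iri(SIOC + "UserAccount")),
        (account, SIOC + "account_of", iri(person)),   # account -> person
        (person, RDF_TYPE, iri(FOAF + "Person")),
        (person, FOAF + "name", lit(author_name)),
        (topic, RDF_TYPE, iri(SKOS + "Concept")),
        (topic, SKOS + "prefLabel", lit(topic_label)),
    ]
    return "\n".join(f"{iri(s)} {iri(p)} {o} ." for s, p, o in triples)


# Hypothetical IRIs for a single blog post.
ntriples = describe_post(
    post="http://blog.example/post/1",
    account="http://blog.example/user/jsmith",
    person="http://blog.example/people/jsmith#me",
    author_name="John Smith",
    topic="http://blog.example/topics/provenance",
    topic_label="Provenance",
)
```

The description records who created the post and what it is about, but not where a republished copy came from, which is the finer-grained provenance the challenges above ask for.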