SWAD-Europe deliverable 12.1.2: Semantic Blogging and Bibliographies - Requirements Specification

Project name:

Semantic Web Advanced Development for Europe (SWAD-Europe)

Project Number:

IST-2001-34732

Workpackage name:

12.1 Open Demonstrators

Workpackage description:

http://www.w3.org/2001/sw/Europe/plan/workpackages/live/esw-wp-12.1.html

Deliverable title:

Semantic Blogging and Bibliographies - Requirements Specification

URI:

http://www.w3.org/2001/sw/Europe/reports/requirements-demo-1/ hp-requirements-specification.html

Authors:

Steve Cayzer, HP Laboratories, Bristol, UK
Paul Shabajee, Graduate School of Education and ILRT, Bristol, UK

Contributors

Dave Reynolds , Ian Dickinson, HP Laboratories, Bristol, UK

Abstract:

Workpackage 12.1 comprises two demonstrator applications designed to both illustrate the nature of the semantic web and to explore issues involved in developing substantial semantic web applications given the current state of the art.

This document outlines the requirements for the first of these applications; a semantic blogging tool applied to the bibliographic management domain. We start with a brief summary of the domain and our reasons for choosing this particular demonstrator. We outline the context and aims of the project, after which we outline the user requirements and list the associated components, identifying existing resources that are available for inclusion in the demonstrator. The requirements are separated into core functionality, which we expect to deliver, and optional extensions, which will be undertaken if time permits. A series of appendices contain the details of related and background work.

Status:

First release.

Comments on this document are welcome and should be sent to Steve Cayzer or to the public-esw@w3.org list. An archive of this list is available at http://lists.w3.org/Archives/Public/public-esw/

1 Introduction
2 Semantic blogging for bibliographies
3 Context and Aims
4 User Requirements
5 Component Requirements
6 Existing Resources

Appendices
A Criteria for Requirement Selection
B Related Work
C User Study
D Review of Personal Bibliographic Systems
E Overview of Major Library Focused Bibliographic and Related Standards
F References

1 Introduction

This report is part of SWAD-Europe Work package 12.1: Open demonstrators. This workpackage covers the selection and development of two demonstration applications designed to both illustrate the nature of the semantic web and to explore issues involved in developing substantial semantic web applications.

This report forms the requirements specification for the first demonstrator, Semantic Blogging and Bibliographies. The aim of this report is to set out the key criteria that we wish to achieve with this demonstrator. The intention is to provide sufficient framing so that the reader can understand what the demonstrator is expected to do, and why we feel that the capabilities are germane to the SWAD-E agenda. This document is not a design document, and so detailed implementation decisions will not be presented here, although general architectural principles will be discussed where appropriate.

We start by reiterating our notion of semantic blogging for the bibliographic domain. We then set the overall context and aims of the application, discussing the selection process we used in order to draw up a set of requirements for a demonstration vehicle which has sufficient illustrative power, while remaining feasible to implement within the project timescale. Sections 4-6 outline the user requirements and list the associated components, identifying existing resources that are available for inclusion in the demonstrator. The requirements are separated into core functionality, which we expect to deliver, and optional extensions, which will be undertaken if time permits.

The Appendices contain details of work undertaken to support the requirements selection process. The criteria for requirement selection, and details of other use cases considered, are discussed in Appendix A. An extensive survey of related work can be found in Appendix B, while details of current bibliographic standards and software can be found in Appendices D and E. We conducted a short user study to act as a reality check on our assumptions - the results are summarised in Appendix C. .

2 Semantic blogging for bibliographies

In this section we recap some of the key points raised and discussed more fully in our previous report [SWADE_ANALYSIS]. In summary, we aim to take an existing phenomenon (blogging) and semantically enrich it. The new metaphor is grounded by applying it to a concrete domain, bibliographic management.

Web logging, or blogging [ESSENTIAL_BLOGGING], is a well known phenomenon that has a number of attractive features. It provides a very low barrier to entry for personal web publishing and yet these personal publications are automatically syndicated and aggregated via centralized servers (e.g. blogger.com) allowing a wide community to access the blogs. Blogs have a simple to understand structure and yet links between blogs and items (so called blog rolling) supports the decentralized construction of a rich information network. While we want to extend the blogging metaphor, we also want to preserve its key values, especially its simplicity. We want to build on blogging's proven potential for publishing, syndication & discovery, and community formation.

The notion of semantic blogging builds upon the success and clear network value of blogging by adding additional semantic structure to items shared over the blog channels. This semantic structure has two key effects:

Rich Query: Semantically enriched blog metadata enables new subscription, discovery and navigation behaviours.
Rich Structure: Access to ontological markup enables both richer annotation and sharing of higher level structures (like categorisation schemes), encouraging peer commentary and recommendation activity.

There is some movement in the blogging community to what we call semantic blogging. The Movable Type Trackback functionality [MT_TRACKBACK] allows two way linking between blog items. Some blog commentators envisage the next step, which is attaching semantics to these links [LINKING_DANGEROUSLY]. Richer (hierarchical) categories are facilitated by the RSS2.0 standard [RSS2.0]. The Topic Exchange activity [TOPIC_EXCHANGE] uses TrackBack as a step towards the use of shared ontologies. Further details on these and other activities can be found in the appendix on related work, but it is worth emphasising them here. These developments indicate that there is a real need for the capability that we are proposing.

Bibliography management is a large and complex domain, and the appendices contain reviews of relevant tools and standards. Within this domain, there is a need for lightweight tools for small group bibliography management (see the User Study). We feel that this need is an ideal testing ground for the semantic blogging paradigm. It is not our aim to duplicate functionality of the existing bibliographic tools and standards; rather, we seek to integrate our demonstrator with such tools so that users are enabled to use the additional functionality within the context of their current work practice.

3 Context and Aims

It is not immediately clear why two successful, but distinct, paradigms (blogging and the semantic web) should be brought together. We believe, though, that there are compelling reasons to combine the two. The rich structure and query properties enabled by the semantic web greatly extends the range of blogging behaviours, and allows the power of the metaphor to be applied in hitherto unexplored domains. Bibliographic management is a concrete example of a task that illustrates the benefit from the combined paradigm. Although traditional bibliographic management deals mainly with static categorisations, the needs of a small group collectively exploring a domain exhibit a more dynamic, community based flavour. Here is a task which is characterised by a need to share small items of information with a peer group in a timely, lightweight manner. This information should be easily publishable, easily discoverable and easily navigable. It should be simple to enrich the information with annotation, either at the point of delivery or later. The information should be archived in a commonly understood way for effective post-hoc retrieval. It should be possible to be notified, in a timely way, of new items of interest. We believe that a combination of blogging and semantic web technologies offers an ideal solution to this problem - blogging for low barrier publishing, a simple shared conceptual model, and a mechanism for natural, dynamic community formation - semantic web for rich structure, which enables richer community annotation, and rich query, which enables more powerful discovery and navigation.

It is the aim of this demonstrator to develop a tool that is simple, useful, extensible and illustrative. Simple, because it should be easy to learn and to use. Useful, because it should do something that users actually want, efficiently and reliably. It should be deployable. Extensible, because although we ground the requirements in the bibliographic domain, we expect it to be reusable for other semantic blogging applications. And illustrative, because we wish to incorporate features that demonstrate the advantages of the semantic web approach (semi-structured data, semantics and webness) without losing the key advantages of blogging (low effort publishing, easy subscription and decentralized discovery).

Combining these desiderata, we arrive a key set of capabilities that we wish to illustrate through this demonstrator, and which should therefore be captured by the requirements.

Rich Query

Subscription Blogging as it stands provides a convenient time-based channel structure. However the channels must be subscribed to independently and the filtering within these channels is limited to free-text search. By using the semantic blogging approach we make it possible to subscribe to categories such as "what's going on in field x". Aggregation need not be bound to a single feed, and can do semantic filtering within these feeds. It is thus possible to define a subscribed category which contains a subset of items from a number of channels.
Discovery This capability means the discovery of new, potentially useful, sources of information. Blogging discovery mechanisms include directories, blogrolling, item-item hyperlinks and informal routes (such as email). The semantic web offers capabilities that allow an even richer discovery capability. For example, bibliographic items can be related via an ontology, and thus a user can query a wide community (and many channels) for papers "in the semantic web category". If items are associated with a consistent (or at least corresponding) identifier, users can also ask for blog entries "about this item".
Navigation Current navigation within blogs is largely chronological, and hence not well suited for non time-based data, such as bibliographic records. We want to enable users to browse their own blog (and others') for such data in an intuitive manner (for example, across a user-defined topic hierarchy). In addition, semantic links will allow navigation to other items "related to", "agreeing with" or "disagreeing with" this one. These links are an alternative to explicit citation-type links and thus provide a community based network based around ideas and discussions.

Rich Structure

Shared Ontologies By exploiting the semantic web ontology layer we are able to not only represent rich topic hierarchies for classifying citations, but also to link and share these topic sets across communities. Thus different communities can use distinct classification schemes and yet the data can be shared across the same infrastructure and potentially the relationships between terms in the different schemes can be explicitly represented to allow cross-community search.
Annotation Annotations need not be limited to free text entries. They can fit into semantically meaningful structures (like ratings, comments and section-specific annotations), allowing more effective discovery, navigation and maintenance of blogged data. They may also be supplied with context data, such as provenance. Because we index items by URI, we enable such multiple, provenanced annotations to be integrated across a community.

These capabilities need to be set in the context of bibliographic management. So interoperation with existing tools will be explicitly included as part of the requirements.

We seek to build an application which demonstrates these key features and has demonstrable utility. We have chosen bibliography management as a domain, and therefore the overall application should be one that enables a group to manage their collective bibliographic records. We expect to produce a tool which, in general terms, allows a community to effectively manage their bibliographic data, and to harness the power of the group for discovery, recommendation and collective learning. Specifically this means that the demonstrator will exhibit lightweight capture of bibliographic data, rich discovery and navigation mechanisms, useful presentation of the relevant information, and good integration with other tools. In short, an application that is genuinely useful for bibliography management.

4 User Requirements

In this section, we consider the requirements from the standpoint of the user (architectural and technical issues are considered in the next section). We present a use case which captures the core functionality of our demonstrator. It is as simple as possible while still providing useful functionality. It is nevertheless quite feature-rich, and involves a lot of semantic technology. Other features, some of which involve significant research hurdles, can be built onto this framework as extensions. We discuss some of them later.

Core Functionality: Local Group Bibliography Management

This scenario can be summarised by imagining a commonly encountered problem - that of sharing papers and citations with a project group. Existing solutions tend to be ad-hoc and unsatisfactory. For example, email is useful for a speedy, low cost notification of a useful paper. It also allows the shared citation to be annotated at the point of delivery. However, it is often difficult for the recipient to categorize the data appropriately ("I often find that I receive the right paper at the wrong time") and post-hoc, principled retrieval of such received (or even sent) papers is next to impossible. Another method is web pages, on which people can post useful literature with an arbitrary amount of structure. Useful though this is, the publishing process is far from low cost, and the reader is required to understand the publisher's conceptual structure. The reader is also required to 'ping' the website rather than being notified of new papers. Topic portals are another web-based example, but the coverage is often too general and the content not necessarily up to date. A third method is shared bibliographic databases, such as ProCite or EndNote. These formats do allow sharing of bibliographic information between small groups, but there is considerable 'lock-in' to the formats, which are often unwieldy and inflexible. Finally, there is the possibility of managing bibliographic data using the existing blogging infrastructure. We performed some simple, informal trials in our group to identify the main problems and found that blogging is currently an unsuitable environment for the capture of bibliographic content. In particular, it is difficult to organize the blog in such a way that the large numbers of articles can be managed effectively by the publisher, let alone other readers. At a minimum, we need to add structure, in a flexible yet low-cost manner in order to make the metaphor work for this domain.

Note that these are simply our intuitive reflections on the domain. We conducted a short user study to test these intuitions - the results can be found in User Study. This study identified a number of limitations with current approaches and generated a wishlist, which is matched encouragingly well by our requirements.

Illustrative Scenario

Tim is interested in semantic blogging. He does a Google search for relevant papers, and finds some that look interesting. After having read a few, he posts the details on his semantic blog. There are a variety of low cost ways available to do this. For one paper, he chooses a 'copy and paste' importer into which he pastes the BibTex entry (from CiteSeer) and the marked up item is automatically added to his blog entry. He categorises the item, rates it and adds a free text comment. Other, unread, papers are added (to the same category), but not commented on or rated.

For a paper he has not read yet, he wonders if anyone else has. He performs a community query for that paper and receives a summary table with the comments and rating of his peers on that paper. For the paper he has read, he wants to find related papers. He performs a community query for 'papers like this' which generates no hits. He chooses the 'generalize this query option' and this time there are some related papers. Again they are presented in summary form with title, topic and rating (this can be customised) and he follows links to interesting looking papers to examine his peers' comments. Another community query is 'find peer commentary' which finds peers' blog entries linking to this one. Again, the retrieved entries are displayed in summary form.

One of the followed links refers to a paper which looks interesting enough for him to read himself. He accesses the abstract and downloads the PDF. He creates his own blog entry (using a 'blog this' bookmarklet option) which automatically copies all the metadata (title, author etc) created by his peer. It also creates a link between his (new) blog entry and the peer's. He can also, if he wishes, 'bulk import' peers' blog entries from the summary table. He can now add his own comment and rating, and recategorise the item if required. Finally, each bibliographic item also has a list (0 or more) of citation links - papers cited by, and citing, this paper. Tim may follow these links if he wishes to find other interesting bibliographic items, and possibly import them too.

Tim decides to export his blog to ProCite, which he uses for writing papers. He chooses the 'export new items' option which converts all papers added (or modified) since the last export to a ProCite-compatible input file.

Tim is interested in keeping up to date with semantic blogging papers. He does this in a number of ways. Firstly, he enables a tracking feature for his blog entries, which enables him to record all blog entries linking to this one. Secondly, he adds an 'email alert' to his blog entries in the semantic blogging category so that he is notified when anyone links to one of his blog items in this category. Thirdly, he sets up a community alert for the semantic blogging topic, which provides an update on any new community blog entries in this category. He sets up a web page to display a summary of these new semantic blogging entries.

Later, when actually writing a paper, Tim uses his blog (rather than ProCite) because of its superior semantic search capabilities. He browses his bibliography data for papers with a topic (or supertopic) of semantic blogging and again gets a summary table. He can filter this table using other metadata (eg rating) or unstructured data (i.e. free text). He can also augment the table using a community query. Once his is happy he has the right subset of papers, he exports the data to a BibTex file for use with his L^ATEX paper.

Requirements List

This scenario illustrates a number of requirements for the core functionality:

Painless Import. Bibliographic items can be generated automatically from custom sources (EndNote, ProCite), from web pages or from copy and paste. At a minimum, copy and paste of BibTex data should be supported.
Rich Navigation. Blogs should be navigable using metadata. At a minimum, a topic should be selected from a topic hierarchy, and a summary table produced showing bibliographic items at (or below, at user discretion) the chosen topic. Such summaries can be filtered easily by metadata or free text search.
Assisted markup. Users should be helped as much as possible to markup items according to the central ontology. For example, much of the metadata can be automatically created on initial import. Categorisation (and recategorisation) into a topic hierarchy should be possible. An intuitive (and customisable) UI to enter other metadata, such as ratings and comments, is essential. A description for the ontology terms should be available on request.
Provenance. Ratings and comments need to be provenanced by the author. However the provenance information need not actually be used here - it is acceptable for different blog entries about the same paper to be listed separately and not integrated.
Metadata Visible. For a single blog entry (or for a summary view), metadata should be visible (this can be customisable). For example: ratings, annotations, comments, author, title, journal and so on.
Query Three types of community query need to be supported here. The first one is a query for a paper, which returns all blog entries that reference a particular paper. The second is a community query for related papers, which returns all blog items about papers categorised under the same topic as the source paper. This query should be generalisable - for example, search for papers categorised under the immediate super-topic. The third query is a search for any related blog entry. Note that at this stage, the third query is likely to return exactly the same results as the first. It is also achievable using the current TrackBack blogging mechanism. However, this mechanism is a prerequisite of more advanced features, such as semantic networks.
Authoritative Identifier. For community queries to be meaningful, the same paper should be referred to by different people using the same identifier. A simple solution to this is adopted; the use of a common, stable and unique URL. So the implications for the user is that only such sources (eg CiteSeer) will be supported.
Access to data. A blog entry should allow access to the underlying paper. (Indirect access, for example via a CiteSeer URL, will be sufficient).
Subscription/Discovery. Semantic, community alerts such as 'alert me when more semantic blogging articles are added' features will be needed.
Common Ontology. One of the scoping factors for this scenario is that it can be assumed that all the peers use the same ontology. Complex parts of the ontology (eg topic hierarchy) must be browsable and the ontology should be descriptive (so that people know what the ontology terms mean, and are thus given guidance to apply them correctly). The ontology should include: classes for book chapter, journal article and so on; author; title, journal, volume and issue (for journals); booktitle, chapter, publisher and ISBN (for books); and other appropriate bibliographic fields. The ontology should also include user-supplied metadata such as rating & comment, and classification metadata such as topic.
Link to other items. Blog items should link to other blog items in a number of ways. The straight hyperlink is of course possible but links in the other direction also necessary. A "Blog this" feature will copy across metadata relating to the paper, and also create such a bi-directional link. Implicit links (via the topic hierarchy) have been mentioned above.
Export. Bibliographic data should be exported to a variety of useful forms, certainly including ProCite and BibTeX. The export should function should have a choice between 'export new' and 'export all' items.

Exclusions

There are some things explicitly excluded from the core demonstrator, some of which will be discussed as extensions. These considerations should guide the design so that such extensions are not locked out (and where possible are enabled and facilitated by the infrastructure):

Disparate ontologies. Although part of the aims of this demonstrator are to explore disparate, external and extensible ontologies, it is expected that within a small group the peers could be expected to adhere to a common, centralised, predefined ontology. Ontology modifications will be limited to simple extensions and would require no special machinery (text editing of a centralised file would be sufficient).
Semantic links. For the purposes of this scenario, blog items may simply be linked to other blog items. Argumentation networks will not be required here. In addition, explicit, user defined links between the papers will not be supported; the blog links and topic groupings are expected to be sufficient.
Diverse content: A simple solution to providing authoritative identifiers is adopted here; the use of common, stable and unique URLs. Such a solution precludes items from sources other than those with such URLs (eg an agreed online database). The approach taken here is a limitation but one that is necessary in building an feasible demonstrator. Such a limit however not only precludes many papers, it also precludes other useful (and citable) sources such as books, standards, people, companies and personal communications. These will considered below.
Authority files Authority files identify an entity (such as an author, a corporate body or a geographical location) unambiguously. This facility is extremely important for accurate provenancing and annotation, and poses the same problems as for diverse bibliographic content, discussed above. Essentially, it is the problem of generating a stable unique identifier without a universally agreed scheme for doing so.
Visual navigation. The network of bi-directional links is useful, but in order to navigate it effectively, a graphical UI would be helpful. It is considered that within the context of a small group, the network is small enough that although a user would make use of such navigation, the path would not be complex enough to require visualization beyond hyperlinks and simple summary lists.
Citation links The links mentioned above are blog-blog links. However, integrating with an online tool like CiteSeer [CITESEER] would allow access to citation links, which join the underlying papers together. For the purposes of this demonstrator, a simple link to the appropriate tool (eg CiteSeer page) will provide suitable access to the citation network.
Context dependent markup. The meaning of an annotation may be non-intuitive. For example, 'rating' is a term which may be ambiguous. Is a user rating the style of a paper or its content? Is the rating from the perspective of a certain class of user (eg "This paper is a great introduction to RDF but not worth reading if you are already an expert")? In the demonstrator, we do not explore the possibilities of context dependency, but ensure that the ontology terms are described (and the descriptions made available to the user) so that such ambiguities can be avoided.
Rich annotation. There are various possibilities that arise with the use of rich metadata. For example, annotation on partial content, such as a comment on 'paragraph 2 of the article' Also, the integration of annotations from various sources (with the use of provenance). Finally, other contextual information associated with annotations (such as 'type' of rating).
Reputation. Community queries may be augmented by the reputation of the source. This aspect, while interesting, will not be explored in the demonstrator.

Extensions

We now discuss three possible extensions to the semantic blogging demonstrator. These are not considered to lie within the scope of the project deliverables, but may nevertheless be implemented as related projects.

Extension 1: Shared Ontology

This extension deals with groups who have the same data but different topic hierarchies. An example might be a group of users who are interested in the same content and have similar, but subtly different ways of categorizing that content. These would include both simple labelling differences and differences at the level of detail. For example, to take the bibliographic domain, some users might have a category 'blogging', which other users call 'web logging'. Some users might be very interested in the topic of blogging, and subcategorise that arena into MovableType, Blogger, Radio Userland and so on. Other users are not that interested and put all blog resources in the same category. Note that these differences are quite subtle and yet present considerable hurdles to interoperability between the members of a community. Essentially, the problem is where a group of users in a loosely defined community want to share a largely consistent conceptual model, but still to allow individual variations on it. The essential capability would be to align two taxonomies (limited here to saying that two nodes are equivalent). This has two consequences. Firstly, it allows a user to view the same concepts but through a set of labels that s/he finds meaningful. Secondly, it enables the reuse of categorisation effort. In the above example, one user might map a category to 'blogging' and hence gain access to all the finer grained categorizations performed by other peers under that concept.

We believe that such an extension provides a compromise position - on the one hand, it enables genuinely useful functionality to a community with different ontologies, yet on the other hand it avoids the scale and complexity of more ambitious projects like APECKS [APECKS] which would make it difficult to implement within the project timescale.

Extension 2: Semantic Linking

Attaching semantics to item-item links allows the possibility of navigating an argumentation network, as explored in the ClaiMaker project [CLAIMAKER-WEAVE]. Such a possibility is certainly powerful, and yet there is a risk that without appropriate visualisation and navigation tools the capability will simply produce cognitive overload. This extension is therefore limited in scope to three key capabilities. Firstly, an enhanced metadata creation tool that allows users to create semantic links between their blog items and others'. Secondly, an extended query mechanism that allows the user to retrieve 'blog entries that agree with this one' for example. This facility could potentially be made transitive, although there is a danger in assuming that a "someone who agrees with someone I agree with" also agrees with me! Thirdly (and optionally) a visualization tool that enables one to view the activity surrounding a particular bibliographic item (papers that agree, papers that disagree, papers that extend and so on).

This extension is deliberately limited to maintain feasibility. In ClaiMaker, it is not the papers that are linked but the concepts. Thus one paper may give rise to a number of concepts, all of which are nodes in the argumentation network. Such a mechanism is clearly of benefit, but would introduce too much complexity to be considered here. It also raises further issues to address. For example, the concepts have to be identified and categorised so that they can be reused, which presents two difficulties. Firstly, once a non-trivial number of concepts are generated it is difficult to find the right concept to reuse. Secondly, it is difficult to define a concept in a manner both specific enough to be useful and general enough to be reused. In fact, to some extent, even this sophisticated model does not go far enough. Consider, for example, the issue of trust. If I assert that 'paper X contains concept Y' then we need to build a mechanism for someone to dispute this assertion.

Semantic links raise another issue that is currently unexplored. That is - do the links refer to the blog items or to the underlying paper? We have so far blurred the distinction between blog-blog and item-item links. In fact, much of the blog metadata (eg author, title) is more correctly viewed as being attached to the underlying item. Making this distinction explicit allows richer possibilities: for example, "This paper disagrees with that paper" versus "I disagree with what you are saying about this paper". However, such a mechanism might simply be confusing and is thus not considered here.

Extension3: Name by property

In order for two people to talk about the same item, it is necessary that they use a common identifier (or at least that the identifiers can be mapped to one another). The constraint on the naming of items adopted in this core demonstrator provides a simple, workable solution to this problem. Essentially, it uses some socially agreed provider of identifiers (for example CiteSeer) to ensure that when two users reference the same paper, they use the same identifier. Such a solution can provide a significant amount of functionality but more powerful solutions exist. One such solution is for people to take identifiers from different schemes and, where appropriate, link them. One might imagine, for example, an identifier from CiteSeer and an identifier from MEDLINE, referencing the same paper. A user could discover both instances by, for example, performing a pattern matching search on the paper title. Matching papers would be returned in a summary list, and the identical papers linked together. From that point on, as far as the system is concerned, the two papers are the same paper, and the results are available for other users. A query on 'entries that link to this paper' would return the union of linkages to both papers.

Such a solution, while offering a wider source of items, can be extended further. Using a name by property paradigm, users can identify a paper using descriptions such as "The report with an author 'Steve Cayzer' and the title 'SWAD-Europe: Semantic Blogging and Bibliographies - Requirements Specification'". Such an approach would also facilitate versioning of documents. Note that this paradigm does not rely on unique identifiers. It is entirely possible for two users to refer to different papers using the same description (consider "The paper with author 'Ying Ding' and date 2002 and title 'Golden Bullet*' " which matches two papers [GOLDEN-BULLET-1, GOLDEN-BULLET-2]). On the other hand, two equivalent papers might have quite different descriptions ("The paper with author 'Boris Omelayenko' in conference 'FLAIRS2002'" [GOLDEN-BULLET-2]). But these disparate descriptions can be integrated with a suitable query, and those identical papers linked as before. This time, however, the descriptions would be aggregated and the (more complete) description would then be available for reuse. Similarly, disambiguating metadata can be used to enrich the identifiers of distinct papers with identical descriptions. This will have the side effect of disaggregating the annotations that peers had attached to each of these papers.

An immediate consequence of this is that the core demonstrator should adopt an identification scheme that enables this extension. For example, the unique identifier (eg CiteSeer URL) can be attached to the paper as just another property. Of course, properties which can act as unique identifiers can be marked as such (eg using InverseFunctionalProperty from the proposed W3C Web Ontology Language [OWL]). Such a mechanism allows the core demonstrator to function as before, while providing a minimal hook for this extension.

Other Ideas

There are a number of other ideas which, although not under active consideration, provide further examples of extensions which it should, at least in principle, be possible to implement over the core demonstrator.

Rich Discovery: The discovery mechanism described above is useful, but it can be made even more powerful. For example, if people annotate their channels then it should be possible to discover "channels about the semantic web" and to perform a search within that restricted domain, formatting the result as an RSS feed. Another example would to generalize a relationship search i.e. "Are there any more blog entries describing this application idea?" - where "application idea" is a concept related to (perhaps indirectly) the underlying paper.

Visualisation: Rich path visualisation affords a powerful way to improve navigation. Visualisation could be of papers, blog items or peers. One possibility would be to present a network view of blog entries (or bibliographic items), connected by (typed) links. Different types of link would be shown in different colours, and added/omitted from the map as the user chooses. Complexity would be managed by limiting the 'window' (path length) from the current blog entry. Another visualisation possibility is a view of the blog organized along 'semantic UI' lines, using an approach similar to the Haystack project [HAYSTACK]

Large Group Aggregation: Currently, we expect to support only a small, manageable community. The demonstrator will be built in as scalable a way as possible, but will not be deployed in a large group situation. As the community grows, various infrastructural issues arise. For example: community annotations may require an annotation server; shared access to community ontologies becomes problematic; community editing of ontologies may require a more sophisticated approach such as Kaon [KAON] or APECKS [APECKS]. In addition, even assuming the stable name problem has been solved, how do we discover blog items 'about' XYZ in a large (potentially worldwide) community. This is a peer-to-peer query issue. Finally, scaling up presents challenges other than the purely technical; the need for different navigation metaphors to avoid cognitive overload has already been mentioned.

Fine Grain and Rich Media Annotation: This extension is particularly intended to explore the annotation on partial content (e.g. a comment on "section 2" of an article). This could be achieved by context data on the comment metadata. A more natural way would be to create a new URI which referred explicitly to the fragment and to comment on that. Of course, such URI's would have to be aggregated together since a user would expect to be able to ask "who has commented on this paper or any bit of it". Another possibility is to allow annotations on items such as pictures, audio files and video clips. In both cases, a suitable user interface is required to enter the annotations. This scenario also requires a more sophisticated ontology, to add context (in a similar way to provenance) to disambiguate annotations. For example, is a user rating the resolution quality of a video clip or the illustrative nature of its content?

Content Management: This refers to the need to 'get at' the content underlying each blog item. Essentially, this extension encompasses smoother integration abilities. One example would be increased co-operation with sources such as CiteSeer, so that users are enabled not only to retrieve the full text of the article, but also to browse the citation network in conjunction with the blog network.

Privacy and Reputation Blog entries could be marked with various levels of privacy (personal, community, public) or more flexibly we could implement a role based access control mechanism. The flip side of this coin is the use of reputation to augment community queries.

Assisted Markup The demonstrator as scoped has lightweight assistance for markup. Richer possibilities exist - for example: visual markup via drag and drop; automatic suggestions where similar objects/data already exists; automated classification of incoming blogs against an existing channel hierarchy; other enrichment of imported data. Another possibility is to 'cluster' blog entries based on link structure.

5 Component Requirements

There are a number of components that could be expected to be built into the core demonstrator. Some of these can be built on existing resources as explained below.

Core Components

Assisted markup tools: This component will allow the user to easily markup a blog entry with useful metadata. One tool will take an ontology and allow a power-user to build up an markup template. Such templates will then be available for other users to apply to their bibliographic data. Widgets will be required to browse certain types of metadata - for example, enumerated lists, typed data and hierarchical browsers. Another tool will be a 'Blog This!' script (implemented as a bookmark) that will take a bibliographic item, copy across the metadata and create a bi-directional link.
Ontology: We will build a suitable ontology. Actually there is more than one ontology - there is an bibliography ontology, an ontology for annotations (comment, description etc), an ontology for topics, an (optional) ontology for semantic links and an ontology for channels (primarily subchannelOf). In addition, the ontologies will be created such that semantic extensions and provenance mechanisms are enabled. As mentioned above, a community editing tool is considered outside the scope of this project.
Custom metadata view We will provide a mechanism for the user to customise the view of metadata (eg summary tables and the like) which will probably be script based. The resulting views will be available as display templates which can be used to view a single blog entry, an individual blog, or a query result. Other templates, such as query and markup templates, may also be defined in this way.
Navigation module: We provide a navigation module in order to enable a user to find his or her way around a set of items. These items may be hosted on a single blog (possibly their own) or may be the result of a query. The navigation is primarily through filtering, using query templates, and the results will be displayed in a custom metadata view.
Query module: The query module is one of the more complex components. It will primarily offer a community search for blog items of interest. The search can be a metadata based filter: "Find me all blog entries relating to this item", "find me all blog entries marked up with this topic" or other metadata. Note that this query should be generalisable - for example, search for papers categorised under the immediate supertopic. It can also be a network based search: "Find me all blog entries linking to this one", or (in the extended case) "find me all papers who agree with this". The results of the query will be available for subscription (eg RSS feed). Provenance data will also be returned (and rendered according to the display template settings).
Integration bridges A variety of mechanisms will be supplied to make it easier to import data (cut and paste of BibTex) and export data (to BibTex and ProCite). Further import/export mechanisms may be added if time permits, according to user need.
Infrastructure: There are a variety of less visible, though necessary, components that act as the 'plumbing' underlying the demonstrator. The communication protocol between peers is a good example (although here we are likely to leverage the existing blogging infrastructure, see below). Another example is a component that allows the propagation of community queries. Although this would ideally be performed in a peer to peer manner, it is not the purpose of this demonstrator to investigate P2P query mechanisms, and so a centralised solution, such as an aggregator, will probably be adopted. A further requirement is the integration of community annotations. For example, if different people comment on the same item (in different blogs) then these comments could usefully be combined. That is, where an explicit equivalence link exists between two bibliographic items, then it would be desirable (though not essential) for comments on one person's blog to be automatically propagated to the other.

Extensions

We now discuss the components necessary to implement the functionality outlined in the extensions

Shared Ontology: In order to implement this functionality, we require a component which would allow a user to take another user's ontology and to mark equivalences between that ontology and their own. We need an extended ontology which defines such relationships. The component is scoped by only allowing a very constrained ontology (i.e. taxonomy) to be compared, and by restricting the links to be simple equivalence relationships. The query module needs to be enhanced, so that equivalence relationships are followed to return enriched result sets. We also need a mechanism to access the linked ontology (eg by importing a subset of a taxonomy tree, or by providing a link to it). Once such access is provided, the markup, view and navigation modules will need extensions to provide the user with a natural way to access and use the new ontological structure.

Semantic Linking: For this extension, we need a component which allows users to type their links (eg agreesWith) between blog entries. An extension will also be needed to the query component (eg "I want blogs that agree with this one, transitively"). It is possible that the extra semantic richness would necessitate an improved metadata viewer, and clearly a network visualisation module would be relevant (though not necessarily essential) here. In any case the navigation module needs to be extended to support the new navigation metaphor.

Name by property: We need several components to implement this extension. Firstly, we need a component which allows users to mark two items as equal. The query module needs to be enhanced so that future queries on either of the items will return results germane to the other. Users can also remove equivalence links. Secondly, we need an identifier-generating utility that allows a user to choose the properties used to describe the item (there would be a suitable default, for example; author, title, year). The combination of properties is called an identifying reference expression (IRE). Thirdly, a mechanism that detects the equivalence of incoming items (eg from a community query) by comparing IREs (possibly with inference). Such items are then automatically made equivalent just as if they had been marked as such by a user. Fourthly, a component which implements the above approach with respect to authority files, for example to identify peers, authors or companies.

Note that this component is scoped down for feasibility and has two limitations. Firstly, items can be identical without having the same IRE. This possibility is mitigated against by having default, automatically generated (and thus hopefully commonly used) IRE patterns, but ultimately users can mark such items as equivalent manually. Secondly, items with the same IRE might actually be different. Users can correct this, partly by removing the equivalence link and partly enriching the IRE appropriately. Note that a side effect of this disambiguation is that the combined annotations would be split; each annotation being returned to its owning item.

6 Existing Resources

There are some resources that are available for re-use or as a base to build on:

Blogging Infrastructure The extant blogging tools offer a powerful and useful base to build on. In particular, tools such as MovableType [MT], which offer full control over the blogging environment, provide a rich development environment. One utility of particular interest is the bookmarklet, which offers one click blogging of items of interest.
Blogging Extensions There are many useful extensions to blogging in the literature. For example, Trackback [MT-TRACKBACK] is a facility that enables citation links to be recorded. What it does is to send a 'ping' (with summary details and a URL) whenever you comment on somebody else's blog entry. The functionality is part of MovableType but is also available as a standalone module [TB-STANDALONE]. There are a variety of recent extensions to Trackback (for a summary see [TB-SUMMARY]) including ComeBack (community annotation 'in place'), BackTrack and MoreLikeThisFromOthers (following community links). Finally, there are a set of utilities available [MT-RDF] that will help in converting MovableType blog entries to, and enriching them with, RDF. More details on these and other utilities are provided in the appendix on related work.
Ontologies There are a variety of useful, well thought out ontologies suitable for our purposes. For bibliographic information we can draw on standards like BibTeX [BIBTEX] and MODS [MODS] for markup, and ACM [ACM] for categorisation (see the appendix on bibliographic standards for more detail). For annotations and semantic links, there are a number of standards, including Annotea Threads [ANNOTEA-THREAD], IBIS [IBIS-TERMS] and ClaiMaker [ CLAIMAKER-SCHEMA]. For channel hierarchies, there are the XML standards RSS2.0 [RSS20] and XFML [XFML]. We are not making a commitment to any of these standards at this stage but will certainly evaluate them at the earliest opportunity.
Ontology Sharing Divergent ontologies can be reconciled in a number of ways. Trackback can been used to create community topics [TOPIC-EXCHANGE] and thus facilitate emergent ontology formation. XFML [XFML] offers a low cost way to define and link taxonomies, although as yet there is no RDF serialization of the specification. Nevertheless, the combination of the two is an attractive proposition. Another way to share ontologies is to use orthogonal classification schemes, as captured by the Facet Map concept [FACET-MAPS]. Mark Pilgrim [THIS-IS-XFML] demonstrates how this idea can enrich blog navigation. Topic maps [TOPIC-MAPS] offer a similar (though richer) approach, for which an open source Java toolkit [TM4J] is available.
Visualisation tools: In order to view a network, Apache offer a useful tool [APACHE-AGORA]. ClaiMaker [CLAIMAKER] also has a online demonstration showing their approach to concept navigation [CLAIMAKER-SANDPIT]. And an algorithm is proposed by Noel et al [VIS-ALGORITHM] that provides a way to 'untangle' a network of links so that they can be viewed in as uncluttered a way as possible (ie reduce the number of overlapping lines). For an interesting diagram of blogger communities, see [ BLOG-TRIBE].
Enriched metadata Citation links between papers can be scraped from CiteSeer [CITESEER] pages, or obtained from the Web of Science [WEB-OF-SCIENCE], although the latter is subscription only.

Appendices

A Criteria for Requirement Selection

The appendix contains details of the decisions process which led to our final requirements. Our refinement process involved a number of use cases which tested various combinations of our desiderata. These use cases provide helpful context around our final requirements. We provide a brief summary of each use case together with an explanation of how it helped guide our thinking. We conclude this section with a summary table showing how the use cases map to our criteria.

Community Bibliography Management

This use case is where we want to access the bibliographic data of a large (reasonably well defined) community. In this scenario, users would take the semantic blogging approach to bibliography management, aggregating, storing, accessing, navigating and discovering bibliographic items across a loosely defined community. Members of the community are expected to have a similar conceptual model of the domain, although their actual ontologies will probably differ. The underlying data is expected to be, but is not limited to bibliographic data.

Various permutations of this use case covered all of our criteria. However, as formulated, the use case is both too ambitious and too poorly defined to be useful as a design input. Therefore we took a suitable subset of this use case (local group bibliography management) as a core for our demonstrator. We also took the idea of shared (but disparate) conceptualizations as the basis for our first extension - shared ontologies. Further interesting variants will be explored in our second demonstrator, semantic community portals.

Rich Navigation

This use case was inspired by the navigation of an argumentation network, as explored in ClaiMaker [CLAIMAKER-WEAVE]. We noted that an argumentation network needs to be supported by a good navigation interface. We also noted that semantic links raise the issue of whether the links refer to the blog items or to the underlying content.

This use case became the basis for our second extension - semantic links.

Semantic Blogging

We drafted a use case as a reminder that we want the demonstrator to be applicable to domains other than bibliography management. Certainly we want our demonstrator to handle content other then bibliographic data. Consideration of this issue highlighted the need for a flexible way of referring to underlying content (the one finessed in the core demonstrator by using CiteSeer URLs).

This use case became the basis for out third extension - name by property.

Rich Discovery

This use case looked at the need for people to make discoveries based on richer channel ontologies or on the semantic content of blog items. Examples would be the discovery of channels "about" the semantic web, or a search for blog entries describing a particular concept.