The Semantic Web as a Semantic Soup

Simon Cox, Harith Alani, Hugh Glaser, Steve Harris
Electronics & Computer Science
University of Southampton, UK
simon.cox@gmail.com, {ha,hg,swh}@ecs.soton.ac.uk

Abstract

The Semantic Web is currently best known for adding metadata to web pages to allow computers to 'understand' what they contain. This idea has been applied to people by the Friend of a Friend project which builds up a network of who people know through their descriptions placed on web pages in RDF. It is here proposed to use RDF to describe a person and to have their RDF document follow them around the Internet. The proposed technique, dubbed Semantic Cookies, will be implemented by storing a user's RDF in a cookie on their own computer through the browser. This paper considers the concept of Semantic Cookies and investigates how far existing technology can be pushed to accommodate the idea.

1 Introduction

The current effort behind the Semantic Web (SW) is to add computer-readable information to web documents. However, the SW is not limited to describing documents. The Friend of a Friend (FoaF) project[1] is an example of describing personal information using the semantic markup language RDF[2]. It aims to provide a network of information about people and the way they are related to each other. RDF is a computer readable format in that a software agent can read in an RDF document and also make inferences based on the information contained in the document and in other sources of RDF referenced.

Almost all web sites that offer some interactive service store information about the people who use them. This data must be gathered during a registration process that can become repetitive for a person who has previously registered for other sites. It is also difficult to keep the data up to date when a person's circumstances change. Once a person is registered with a website, cookies are used to match a computer with a user.

This report proposes the marrying of the Semantic Web with cookie technology to provide Semantic Cookies (SC). An example system has been created as a proof of concept, exploring how far existing technology can be pushed to provide this new technique.

The report will begin with a discussion of the motivations for semantic cookies and a brief background of similar technologies. The FoaF project will be described, showing how RDF can be used to describe people. Current cookie technology will then be explained, noting its limitations. Other existing methods of providing enhanced user data will also be covered. Next, the proposed SCs will be described, stating what they will provide along with the advantages and limitations of using cookies. Some example applications will be mentioned, showing the varied scope for SCs. The report will finish with the observations and conclusions drawn from the implementation, including what would be required in future technology to improve the viability of SCs.

2 Background

2.1 Friend of a Friend (FoaF)

The FoaF project is an application of semantic web technologies aiming to describe people and the links between them. Fundamentally, a person has an RDF fragment describing themselves which is stored in XML format and hosted somewhere on the internet where other RDF documents may use its URI to reference the person. Listing 1 shows a very basic FoaF document saying that the person Simon Cox has a mailbox with a URI of mailto:samc100@soton.ac.uk.

<rdf:RDF
  xmlns:rdf=
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:foaf=
    "http://xmlns.com/foaf/0.1/">

 <foaf:Person>
   <foaf:name>Simon Cox</foaf:name>
   <foaf:mbox 
     rdf:resource=
       "mailto:samc100@soton.ac.uk" />
 </foaf:Person>

</rdf:RDF>

Listing 1: FoaF example in RDF

Listing 2 shows a fragment using the FoaF ontology saying that the person Hugh Glaser knows the person Simon Cox, and that they have a homepage as well. This example highlights the aim of RDF, as an application could merge the fragment with the one above, perhaps to show the homepages of all the people who know Simon Cox.

 <foaf:Person>

   <foaf:name>Hugh Glaser</foaf:name>
   <foaf:mbox 
     rdf:resource="mailto:hg@ecs.soton.ac.uk"/>
   <foaf:homepage 
     rdf:resource="http://ecs.soton.ac.uk/~hg/"/>

   <foaf:knows>
     <foaf:Person>
       <foaf:name>Simon Cox</foaf:name>
       <foaf:mbox 
         rdf:resource=
           "mailto:samc100@soton.ac.uk" />
     </foaf:Person>
   </foaf:knows>

 </foaf:Person>

Listing 2: FoaF knows example
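
The merge described above can be sketched in Python using only the standard library. This is an illustrative sketch rather than part of the implementation: the function names are hypothetical, people are matched on their foaf:mbox value (an assumption about how identity would be established), and the fragment from Listing 2 is wrapped in an rdf:RDF element so that it parses as a standalone document.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

LISTING_1 = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Person>
    <foaf:name>Simon Cox</foaf:name>
    <foaf:mbox rdf:resource="mailto:samc100@soton.ac.uk"/>
  </foaf:Person>
</rdf:RDF>
"""

LISTING_2 = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Person>
    <foaf:name>Hugh Glaser</foaf:name>
    <foaf:mbox rdf:resource="mailto:hg@ecs.soton.ac.uk"/>
    <foaf:homepage rdf:resource="http://ecs.soton.ac.uk/~hg/"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Simon Cox</foaf:name>
        <foaf:mbox rdf:resource="mailto:samc100@soton.ac.uk"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""


def resource(person, prop):
    """Return the rdf:resource of a FoaF property element, or None if absent."""
    elem = person.find(f"{{{FOAF}}}{prop}")
    return None if elem is None else elem.get(f"{{{RDF}}}resource")


def merge_people(*documents):
    """Merge foaf:Person descriptions from several documents, keyed by mbox."""
    people = {}
    for doc in documents:
        root = ET.fromstring(doc)
        # iter() also visits Person elements nested inside foaf:knows.
        for person in root.iter(f"{{{FOAF}}}Person"):
            mbox = resource(person, "mbox")
            entry = people.setdefault(
                mbox, {"name": None, "homepage": None, "knows": set()})
            entry["name"] = entry["name"] or person.findtext(f"{{{FOAF}}}name")
            entry["homepage"] = entry["homepage"] or resource(person, "homepage")
            for known in person.findall(f"{{{FOAF}}}knows/{{{FOAF}}}Person"):
                entry["knows"].add(resource(known, "mbox"))
    return people


def homepages_of_people_who_know(people, mbox):
    """List the homepages of everyone whose description says they know `mbox`."""
    return sorted(p["homepage"] for p in people.values()
                  if mbox in p["knows"] and p["homepage"])
```

Calling merge_people(LISTING_1, LISTING_2) and then homepages_of_people_who_know with Simon Cox's mailbox URI yields the single homepage given in Listing 2. A full RDF toolkit would perform the merge on triples rather than on the XML serialisation; the XML-level approach is used here only to keep the sketch self-contained.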

FoaF was designed for decentralised networks where the users manage their own information. This is both a benefit and a disadvantage for general public use: it is necessary for a person to be in control of their own information, but RDF is not a trivial format and requires an in-depth understanding to use well, so a user with little technical experience would find it extremely difficult to manage or update their own description. It would therefore be desirable for some other party to manage the information on the user's behalf.

2.2 Plain Cookies

A cookie is a small item of data placed on a client's computer during an HTTP interaction with a server as a result of a response [3].

Cookies were initially intended to create stateful sessions over several HTTP requests/responses. In the design of the cookie mechanism, it was assumed that sessions would be relatively short lived and would only involve the client and a specific website. The specification for cookies states that user agents implementing cookie support should provide at least 4096 bytes per cookie, although session-based servers using cookies should strive to keep cookie size as small as possible. Consequently, it cannot be assumed that a user agent will be able to handle a cookie larger than 4 KB.

The most common use for cookies is to store a session identifier which a server can use to maintain a session with a client over a short period of time. However, cookies are also used for storing other information with a longer lifespan, such as user names and site-specific preferences.

As part of the intended design, cookies can only be accessed and manipulated by the server that created them. There are several reasons for this. Firstly, each cookie contains a URL specified at the time of creation, and for each HTTP request that matches the URL the cookie is sent along with the request. As the vast majority of servers make use of cookies, there would be a huge amount of extra data sent over the Internet if all cookies were to be sent with all requests. Secondly, many sites use cookies to reduce the amount of information a user need provide to a server, and if these cookies were made publicly available they could pose a security risk. For example, Amazon has a mechanism called '1-Click ordering'[4] where a user can store their payment details and need only be logged in (i.e. have a cookie with Amazon's session ID) to be able to click one button which purchases the item currently being displayed.

The data in cookies is held in arbitrary name-value pairs with no semantic meaning. There may be a cookie with a name of 'surname', but the value could mean anything, not necessarily the family name of the person using the computer. Therefore the values in cookies only bear meaning within the context of the domain that issued them. A standard naming convention could be applied so that all cookies with the same name had the same semantic meaning, but this would be completely impractical due to the nature of the Internet.

2.3 XML Cookies

Microsoft have introduced a form of XML cookies into their Internet Explorer browser under DHTML behaviors[5]. They are effectively a form of client-side only cookies, i.e. they can only be accessed through client-side scripts running in the browser. They are used for storing the state of a form and any other user data in an XML format. There are similar constraints to those on cookies, in that size is limited per document (128 KB) and per domain (1024 KB).

2.4 Wallets

The term wallet, in the context of browser technology, broadly refers to a software agent that manages personal data on behalf of a user. A wallet can either be server-side or client-side. With a server-side wallet, all of the user's personal information is kept on a central server that the user can direct a website to if it requires information from the user. A client-side wallet operates on the user's own machine and interacts directly without the need for a third party server.

Generally, server-side wallets are used for financial transactions, storing a shipping/billing address and either the user's credit card information, or some kind of electronic money. Two examples of server-side wallets are the MSN Wallet[6] and PayPal[7]. The main advantage of a server-side wallet is that there is no need for any software to be installed on a client machine, i.e. only a browser is needed to be able to use the service. However, the user is required to tell any website that requires the information about the location of the wallet service. Also, there is usually some charge, either to the user or the website for using the service.

There is much more scope for the operation of client-side wallets. Current trends show them to be evolving into general stores of personal information for filling in HTML forms. For example, browsers such as Mozilla have a wallet built in[8] that operates by storing the user's information, such as forename, surname or e-mail address, and attempts to fill in form fields that appear to relate to the information based on the name/id of the form element. For example, a form element named pcode would be filled with the user's post code. This is a passive form of wallet, requiring interaction from the user to make sure forms are filled in correctly. The advantage is that a server need not even know a wallet is being used.
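
The name-based matching performed by a passive wallet can be sketched as follows. This is a minimal illustration and not Mozilla's actual algorithm: the wallet contents, the alias table and the function names are all hypothetical.

```python
# Hypothetical stored wallet data; the keys and values are illustrative only.
WALLET = {
    "forename": "Simon",
    "surname": "Cox",
    "email": "samc100@soton.ac.uk",
    "postcode": "AB1 2CD",
}

# Assumed aliases that a form author might use for each wallet key.
FIELD_ALIASES = {
    "forename": {"forename", "firstname", "fname", "givenname"},
    "surname": {"surname", "lastname", "lname", "familyname"},
    "email": {"email", "mail", "emailaddress"},
    "postcode": {"postcode", "pcode", "zip", "zipcode"},
}


def normalise(field_name):
    """Lower-case a form field name and drop separators such as '_' or '-'."""
    return "".join(ch for ch in field_name.lower() if ch.isalnum())


def suggest_values(form_fields, wallet=WALLET):
    """Map each recognised form field name to the stored wallet value."""
    suggestions = {}
    for field in form_fields:
        key = normalise(field)
        for wallet_key, aliases in FIELD_ALIASES.items():
            if key in aliases and wallet_key in wallet:
                suggestions[field] = wallet[wallet_key]
    return suggestions
```

A field named pcode is matched to the stored post code, while an unrecognised field is simply left for the user to fill in, which is why this style of wallet still needs the user to check the result.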

Other more active client side wallets would interact directly with a server to provide information. These would require more trust from a user to be sure that any information given out was only given to authorised websites, and that the information could not be tampered with.

A disadvantage of both forms of client-side wallet is that they require software to be installed on the user's machine. However, it is this aspect that delivers their advantages, so it is unavoidable.

Having a software agent managing a user's information, whether server-side or client-side, provides many advantages; mainly:

access can be restricted to authorised sources,
integrity of data can be maintained.

3 Semantic Cookies

The motivations for semantic cookies are the summation of the disadvantages of the different methods described above. Semantic cookies should provide a means of delivering personal information transparently to a website without the need for user interaction. However, the information held should be fully controllable by the user owning it.

Semantic cookies will provide:

a means to update the personal information by a separate server that the user interacts with
full control over the data if the user needs it
semantically rich computer readable information
no need to install extra software other than an Internet browser

The proposed solution is to store RDF fragments in public cookies. The term public cookie here denotes a cookie that is sent to any server the client interacts with, regardless of its domain. As mentioned in Section 2.2, there is no support in current browsers for public cookies; however, a viable temporary workaround is proposed in Section 4.

The basic mechanism will be that as a user browses a website, their RDF fragments will be sent to the web server. The web server can then process the user's personal details extracting as much information as needed and adapting the service it provides in response. The web server is also able to send new or changed RDF fragments back to the client to update their information. In this way a user need not even realise they have any RDF fragments relating to them. Since the data is stored on their machine they also have full control over the contents if they wish to modify it.

This model differs from the traditional paradigm where the whole Semantic Web is always accessible by following relationships throughout. In effect, a user's RDF fragment will be a part of the Semantic Web only for as long as their HTTP session with the web server lasts. This is a requirement to keep the user's information private, as would be expected as the default behaviour of the system. If a user required their information to be a permanent part of the Semantic Web, then they could visit a website which offers a persistence service. This web site would receive the user's semantic cookie and would provide a permanent URI that pointed to a copy of the RDF.

Storing a user's RDF on a client machine creates an issue if the user uses more than one machine. However, this allows the possibility of having a different semantic context for the user depending on the machine. Certain parts of their description will remain the same, such as name and date of birth, but other parts may depend on the context. A contact email address, for example, if given at all, might differ between a person's home computer and their work computer. Using a persistence service, a user could swap RDF fragments in and out of the machine they are using and so create a different semantic context for themselves.

4 Implementation

The aim of this research is to bring people into the semantic web without their realising it, or with as little interaction as possible. This means utilising existing Internet technology already installed on home computers; the implementation described below therefore does not require any additional software to be installed on a user's computer.

The problem of a 3rd party website interacting with a user's cookie can be broken down into four separate cases:

Retrieve RDF fragments from a user's semantic cookie
Provide means to query/manipulate RDF
Create RDF fragments from updated source
Update/Create user's semantic cookie from updated/new RDF

4.1 Technology Used

The implementation created for this report uses the following architecture and any examples will be in the context of this architecture.

3rd party website - website that will make use of a user's RDF. Uses JSP scripts on an Apache Tomcat server so that the Java API to the 3store can be used.
RDF Authority - necessary for the mechanism used to retrieve a user's cookie. Uses PHP scripts on an Apache web server.
RDF Storage - to manage storing, manipulating and querying of RDF. The 3store[9] server is used, requires MySQL to handle underlying data and the Redland Raptor library[10] to parse RDF.

4.2 Cookie Interface

As described above in Section 2.2 cookies are restricted to the domain a server is in. However, this limitation can be overcome by using a single server to manage cookies acting as an intermediary between the user and 3rd party web server.

Fig. 1 shows the process for a 3rd party website to retrieve a cookie from a visiting user. The website can gain access to the cookie by including a JavaScript source file from the RDF Authority web server that receives the cookie and embeds it into the JavaScript source as a variable. The client-side redirect is not necessary, but provides a cleaner experience for the user. Once the website receives the RDF in the post header it can be used as desired.

Figure 1: Retrieving RDF from a user's cookie
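
The RDF Authority's side of the retrieval step can be sketched as below. The implementation described in this report uses PHP; the Python here is only an illustration of the idea, and the cookie name, the JavaScript variable name and the assumption that the RDF is URL-encoded inside the cookie are all hypothetical.

```python
import json
from urllib.parse import unquote

COOKIE_NAME = "semantic_cookie"  # hypothetical cookie name


def extract_cookie(cookie_header, name=COOKIE_NAME):
    """Return the URL-decoded value of one cookie from a Cookie header, or None."""
    for pair in cookie_header.split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key == name:
            return unquote(value)
    return None


def javascript_source(cookie_header, variable="semanticCookie"):
    """Build the JavaScript file served to the page that includes it."""
    rdf = extract_cookie(cookie_header)
    # json.dumps yields a valid JavaScript string literal (or null if absent),
    # so quotes and newlines inside the RDF cannot break out of the variable.
    return f"var {variable} = {json.dumps(rdf)};"
```

Because the script is requested from the RDF Authority's own domain, the browser attaches the authority's cookie to that request, which is what allows the value to be passed on to the including page.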

The RDF authority must be used again when the website wants to either create a new semantic cookie, or update an existing one. This is a similar process to receiving the cookie and is shown in Fig. 2. As the response for a request from the user's client, the website returns the RDF as part of a hidden form field. The action for the form is to post to the RDF authority which responds with the RDF in a cookie header along with a redirect to a specified page on the 3rd party website. Again, the form page contains automatic form submission code to reduce the number of clicks for the user.

Figure 2: Updating RDF in a user's cookie
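
The two halves of the update step can be sketched in the same spirit. Again this is an assumption-laden illustration rather than the implemented PHP/JSP code: the form field names, the cookie name and the URL-encoding of the cookie value are hypothetical choices.

```python
from html import escape
from urllib.parse import quote


def update_form(rdf, authority_url, return_url):
    """3rd party website: build a page that posts the RDF to the authority on load."""
    return (
        '<html><body onload="document.forms[0].submit()">\n'
        f'<form method="post" action="{escape(authority_url)}">\n'
        f'<input type="hidden" name="rdf" value="{escape(rdf)}">\n'
        f'<input type="hidden" name="return" value="{escape(return_url)}">\n'
        '</form></body></html>'
    )


def authority_response(rdf, return_url, cookie_name="semantic_cookie"):
    """RDF authority: store the posted RDF in a cookie, then redirect back."""
    status = "302 Found"
    headers = [
        ("Set-Cookie", f"{cookie_name}={quote(rdf, safe='')}; Path=/"),
        ("Location", return_url),
    ]
    return status, headers
```

The RDF is HTML-escaped when placed in the hidden field and URL-encoded when placed in the cookie, since neither context can carry raw angle brackets, quotes or semicolons safely.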

4.3 Querying/Updating User RDF

The RDF fragments need to be parsed into a format that can be queried and/or updated in order for the website to use them. For this implementation the 3store system was chosen to handle RDF, although any suitable RDF store could be used. The 3store is a database of triples built on top of MySQL and uses Raptor to parse the RDF. It can be queried in different ways, either using RDQL[11] or in a more programmatic way with the API provided. RDQL is a W3C Member Submission with a syntax similar to SQL, allowing for complex queries. Whichever way the 3store is queried, a set of triples is returned that match the criteria.

To update user information held in the 3store new triples can be asserted into the database and old triples can be removed.

4.4 Creating RDF Fragments

If a user's information has been changed in the 3store and the user wants to update their cookie with it then the relevant triples must be exported from the database to create the fragments that will be placed in the user's cookie. Unfortunately, the problem of deciding which triples are relevant to a user is not trivial.

RDF triples form a graph structure with either a URI or a literal as nodes and predicates as edges. By picking out a specific node to be root the structure can be traversed outwards to create an RDF document. To distinguish between users in the implementation it is assumed that all semantic cookies will contain a <foaf:mbox_sha1sum> predicate. This is the SHA1 hash of their e-mail address. The root node for a user can be found by using the RDQL in Listing 3.

SELECT ?person 
WHERE (?person, <foaf:mbox_sha1sum>, "sha1 hash")

Listing 3: Selecting the root node
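
The hash value substituted into the query can be computed as sketched below. The FOAF vocabulary describes mbox_sha1sum as the SHA1 of the full mailto: URI, which is what this sketch assumes; the exact input used by the implementation is not stated here, and the function names are hypothetical.

```python
import hashlib


def mbox_sha1sum(email):
    """Return the hex SHA-1 digest of the mailto: URI for an e-mail address."""
    uri = email if email.startswith("mailto:") else "mailto:" + email
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


def root_node_query(email):
    """Fill the computed hash into the RDQL query for the root node."""
    return ('SELECT ?person\n'
            f'WHERE (?person, <foaf:mbox_sha1sum>, "{mbox_sha1sum(email)}")')
```

Using the hash rather than the address itself means a website can recognise a returning user without the cookie having to reveal the e-mail address in plain text.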

Naively, the rest of the RDF could be constructed using the query in Listing 4, calling it recursively for every object that is a URI.

SELECT ?predicate, ?object
WHERE ("current URI", ?predicate, ?object)

Listing 4: Recursive query

However, this approach could result in a very large number of triples being returned from a 3store with many relating triples.

The implementation avoids the problem by relying on the fact that the 3store assigns a model identifier to triples asserted at the same time. As long as each user is associated with a unique model then only triples asserted for that user will be retrieved in queries, thus limiting the scope. However, this is very much implementation specific and does not solve the problem for the general case.

One way of avoiding the problem would be to restrict the queries to specific ontologies. In this case only the predicates described in the ontologies would be followed. The drawback to this approach is that if a website does not know about an ontology used in a user's RDF, then when the cookie gets replaced information under the unknown ontology would be lost.
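
The traversal from a root node, and the effect of restricting it to known ontologies, can be sketched over an in-memory list of triples. This is a hypothetical stand-in for queries against the 3store, not its API: triples are plain (subject, predicate, object) tuples and ontologies are identified by namespace prefix.

```python
from collections import deque


def extract_fragment(triples, root, allowed_prefixes=None):
    """Collect the triples reachable from `root` by following object nodes.

    If `allowed_prefixes` is given, only predicates starting with one of
    those namespace prefixes are followed or returned.
    """
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))

    seen, queue, fragment = {root}, deque([root]), []
    while queue:
        subject = queue.popleft()
        for predicate, obj in by_subject.get(subject, []):
            if allowed_prefixes and not predicate.startswith(tuple(allowed_prefixes)):
                continue
            fragment.append((subject, predicate, obj))
            # Only nodes that appear as subjects can be expanded further;
            # the `seen` set stops cycles such as mutual foaf:knows links.
            if obj in by_subject and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return fragment
```

Without a prefix restriction the traversal returns everything connected to the root, which is the growth problem noted above. With a restriction, any triple whose predicate lies outside the listed namespaces is silently dropped, which is exactly the information loss that would occur when the cookie is rewritten.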

5 Usage

There are many possible uses for semantic cookies, although most would fit into one of a small number of models. These models are not mutually exclusive; for example, a persistence store would probably offer some form of RDF editor.

5.1 3rd Party Website

The 3rd party website actually processes the information in the cookie and offers some kind of service to the user based on its content. For example, the Audioscrobbler service[12] collects users' listening habits by recording the songs they listen to. One service provided is to suggest songs or artists a user may like, which can be delivered as an RDF feed. When the user visits the Audioscrobbler website an option would be offered to 'remember' their suggested artists. When the user then visits a website selling music, the Audioscrobbler triple would be discovered in their RDF and used to retrieve the list of suggested artists. By comparing this list to its own database, the music website could suggest music the user may be interested in purchasing.

All of the above could be accomplished without using semantic cookies. However, semantic cookies provide the means for it to happen without any user intervention; in fact, the user need not even know they have any RDF associated with them.

5.2 Persistence Store

A persistence service would allow a user to visit its website and 'deposit' their RDF for long term storage. The service may also allow the user's RDF to be made publicly viewable, in which case it would provide a URI for it. This would allow the user's personal information, or a subset of it, to be a permanent part of the semantic web.

The user may also choose to publish their RDF if it grew to a large size. Currently cookies cannot be assumed to hold more than 4 KB (see Section 2.2), although this could change if browser developers bring in support for semantic cookies. In any case, for users with a slow Internet connection it would be faster to simply send a URI to a permanent RDF store than to transmit the entirety of their RDF fragments.

The persistence store could also be used to hold different RDF contexts for a person. These would all contain the same subset of information, such as the person's name and date of birth, but would allow for differing context-specific information. For example, a person could have a 'home' and a 'work' context. The home context would have (amongst other things) their personal email address, their home as a delivery address and perhaps their music listening tastes. The work context would have their work email address, their place of work as the delivery address, and perhaps a link to their position in a company organisation chart. This could be achieved by simply keeping the two different contexts on the user's home and work computers. However, a laptop or PDA may be used both at home and at work and would need to be able to change context depending on how it was being used. To accomplish this the user would visit the persistence store, save their current context (probably not making it publicly visible) and select one of their other registered contexts.
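
The idea of a shared core description combined with a selectable context can be sketched as follows. The property names and values are purely illustrative, and a real persistence store would operate on RDF fragments rather than on Python dictionaries.

```python
# Hypothetical data: properties common to every context ...
CORE = {"foaf:name": "Simon Cox"}

# ... and properties that differ between registered contexts.
CONTEXTS = {
    "home": {"foaf:mbox": "mailto:simon@home.example"},
    "work": {"foaf:mbox": "mailto:samc100@soton.ac.uk"},
}


def build_context(core, contexts, name):
    """Combine the shared core description with one named context."""
    if name not in contexts:
        raise KeyError(f"unknown context: {name}")
    merged = dict(core)            # copy, so the stored core is not modified
    merged.update(contexts[name])  # context-specific values take precedence
    return merged
```

Selecting the 'work' context on a laptop would then amount to replacing the semantic cookie with the description returned by build_context(CORE, CONTEXTS, "work").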

5.3 RDF Editor

The user's RDF is stored locally and can be edited using a text editor or some other RDF tool. However, the majority of Internet users would not have the ability, or inclination, to edit RDF themselves. A useful service would be for a website to provide a simple means to view and edit the information in a user's semantic cookie. The user would visit the website where they would be able to browse the contents of their RDF fragments and edit any information that had changed or add new information. The website could offer a wizard to update or add common ontologies to the user's RDF.

6 Observations & Conclusion

A working system has been implemented where a user can visit a website which can gain access to a cookie containing an RDF document. The RDF retrieved from the user can be inserted into a 3store which can be queried and updated. A new RDF document can be extracted from the 3store and used to replace a user's existing cookie.

The process of implementing a semantic cookie solution has highlighted many issues that would need to be addressed by a full prototype system.

6.1 Security

Security is very important when dealing with personal data, and access to a user's semantic cookie should be controlled. There are no security procedures in place in the current implementation as it is a proof of concept; however, the infrastructure is capable of providing access control.

With the current method of retrieving and updating user cookies the RDF authority would handle access control as it is the central point in the system. The user would specify in their RDF a security policy which stated the access rights to the RDF for different domains. Any access to the cookie must come via a request from the client to the RDF authority web server, in which case the RDF authority will always receive the user's RDF fragments before a 3rd party has had a chance to change it. The request from the client would always contain the address of the requesting website to allow the RDF authority to make access decisions.
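
An access decision of this kind can be sketched as a filter over the user's triples. The policy format below is entirely hypothetical (no policy vocabulary has been defined): it maps a requesting domain to the predicates that domain may read, with "*" as an assumed default entry.

```python
from urllib.parse import urlparse

# Hypothetical policy: requesting domain -> predicates that domain may read.
POLICY = {
    "music.example": {"foaf:name", "ex:suggestedArtists"},
    "*": {"foaf:name"},  # assumed fallback for domains not listed
}


def visible_triples(triples, requesting_url, policy=POLICY):
    """Return only the triples the requesting website is allowed to see."""
    domain = urlparse(requesting_url).hostname or ""
    allowed = policy.get(domain, policy.get("*", set()))
    return [t for t in triples if t[1] in allowed]
```

Because the RDF authority sees the full cookie before any 3rd party does, it can apply such a filter and pass on only the permitted subset.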

Public key cryptography would be used as part of the cookie delivery mechanism to avoid eavesdropping and spoofing. When a user's RDF is requested, the RDF authority would encrypt it using the public key of the website named in the request. This stops spoofing, as a site impersonating that website would be unable to decrypt the RDF without the matching private key. When attempting to update a user's cookie, the website would encrypt the RDF with the public key of the RDF authority and sign it with its own private key. This stops eavesdroppers and allows the RDF authority to confirm that the RDF update request came from the specified website.

6.2 Cookie Size

As mentioned above in Section 5.2, for large RDF fragments a persistence store could be used and the cookie would just hold the URI for the store. A simpler intermediate approach would be to compress the RDF that is stored in the cookie. The easiest place to provide for this would be in the RDF authority, as it reduces the complexity required for a 3rd party website to be able to use semantic cookies.
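
Compression at the RDF authority could look like the following sketch. The choice of zlib and URL-safe base64 is an assumption made for illustration, as is the simplified size check, which counts only the name and value and ignores cookie attributes.

```python
import base64
import zlib

MAX_COOKIE_BYTES = 4096  # size a user agent can be assumed to support


def pack_rdf(rdf):
    """Compress RDF text and encode it with characters usable in a cookie."""
    compressed = zlib.compress(rdf.encode("utf-8"), 9)
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def unpack_rdf(value):
    """Reverse pack_rdf: decode the cookie value and decompress the RDF."""
    return zlib.decompress(base64.urlsafe_b64decode(value)).decode("utf-8")


def fits_in_cookie(name, value, limit=MAX_COOKIE_BYTES):
    """Check name=value against the assumed per-cookie size limit."""
    return len(name) + 1 + len(value) <= limit
```

RDF/XML is verbose and repetitive, so it tends to compress well, but the gain depends on the content; base64 encoding adds roughly a third back on top of the compressed size.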

6.3 Understanding Ontologies

The RDF specification is just a syntax that allows items to be linked with predicates. For predicates to have a semantic meaning an ontology must be used. However, due to the nature of the Internet there are many different ontologies that describe the same things. It is impractical to expect all users of the Internet to use the same ontologies and equally impractical to expect all websites to be able to understand all of the different ontologies. Ontology mapping overcomes this problem by providing the notion of mapping items of one ontology onto another to show that they are semantically the same. For example, the two elements from different ontologies <O1:FirstName> and <O2:Forename> describe the same thing, a person's given name. When a user who uses the ontology O1 to describe themselves, visits a website that only understands the ontology O2, the website would need to use some form of ontology mapping lookup service to find a mapping between O1 and O2 so that they could make use of the user's personal data.
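
Applying a mapping once it has been obtained is straightforward, as the sketch below suggests. The mapping table stands in for the result of a hypothetical lookup service; only the FirstName/Forename pair comes from the example above, and the second entry is invented for illustration.

```python
# Hypothetical mapping table, as might be returned by a mapping lookup service.
O1_TO_O2 = {
    "O1:FirstName": "O2:Forename",
    "O1:LastName": "O2:Surname",  # invented pair, for illustration only
}


def translate(triples, mapping):
    """Rewrite predicates via `mapping`; report those with no known mapping."""
    translated, unmapped = [], []
    for subject, predicate, obj in triples:
        if predicate in mapping:
            translated.append((subject, mapping[predicate], obj))
        else:
            unmapped.append((subject, predicate, obj))
    return translated, unmapped
```

Returning the unmapped triples separately lets the website keep them untouched, rather than discarding information it does not understand.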

6.4 Other Issues

The benefit of RDF being machine readable also stands as a disadvantage if it is not well formed. As the RDF is accessible by the user and open to modification, it is possible that it may be left in an illegal state which would render it useless. It would be advisable to discourage inexperienced users from attempting to modify their semantic cookie, and instead use an online RDF editor as described in Section 5.3. This would ensure that the RDF is always well formed, assuming the editor service behaved correctly. This also gives rise to another possible use of an RDF editor: attempting to fix an invalid semantic cookie.
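
A basic check that a website or editor could run before accepting a semantic cookie is sketched below. It only tests that the text is well-formed XML with an rdf:RDF root element, which is an assumption about how fragments are wrapped; it does not validate the RDF itself.

```python
import xml.etree.ElementTree as ET

RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"


def is_well_formed(rdf_xml):
    """Return True if the text parses as XML with an rdf:RDF root element."""
    try:
        root = ET.fromstring(rdf_xml)
    except ET.ParseError:
        return False
    return root.tag == RDF_ROOT
```

A cookie that fails this check could be routed to the editor for repair rather than being silently ignored.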

6.5 Proposed System

The system implemented proves that semantic cookies are possible with current technology; however, the solution is not scalable, because the central RDF authority required to handle access to the cookie would become a bottleneck. This could be overcome by allowing the user to choose a different RDF authority. However, this would then require them to inform websites of the URL for the authority, breaking one of the aims of semantic cookies, which is to not require user intervention.

The ideal solution would involve client-side support for either semantic cookies or at least public cookies. The HTTP protocol would be a good place for this architecture with the browser natively supporting semantic cookies. A possible implementation would add several new HTTP headers:

Has-Semantic-Cookie - sent by the client with all requests. The presence of this header would indicate that the user has a semantic cookie. This is not strictly needed, but it removes the need to transmit the entire semantic cookie for each request which would be an unnecessarily large strain on Internet resources.
Semantic-Cookie
- if sent by the server in a response indicates that the server wishes to be sent the semantic cookie. Upon receiving the response the user's browser may respond with an HTTP request including the Semantic-Cookie header.
- if sent by the client in a request, the header value would contain the user's RDF fragments which are visible to the server as specified in the user's RDF access policy.
Set-Semantic-Cookie - sent by the server in a response. The header value would contain new RDF fragments to be stored in the user's semantic cookie. The user's browser would choose how to handle the new fragments depending on the current access policy.

The access policy would be embedded in the user's RDF as described above in Section 6.1.
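
The exchange implied by these headers can be sketched as two functions, one for each side. The header values used here ("true" and "request") are assumptions, since only the header names and their purpose are proposed above; filtering by access policy and encoding of the RDF for transport in a header are omitted.

```python
def client_request_headers(rdf, server_asked=False):
    """Headers a browser could add to a request under the proposed scheme."""
    if rdf is None:
        return {}  # the user has no semantic cookie
    headers = {"Has-Semantic-Cookie": "true"}  # assumed value; presence is what matters
    if server_asked:
        # Only transmit the RDF once the server has asked for it.
        headers["Semantic-Cookie"] = rdf
    return headers


def server_response_headers(request_headers, new_rdf=None):
    """Headers a server could add to its response."""
    headers = {}
    has_cookie = "Has-Semantic-Cookie" in request_headers
    if has_cookie and "Semantic-Cookie" not in request_headers:
        headers["Semantic-Cookie"] = "request"  # assumed marker value
    if new_rdf is not None:
        headers["Set-Semantic-Cookie"] = new_rdf
    return headers
```

A first request would carry only Has-Semantic-Cookie; if the server replies with Semantic-Cookie, the browser repeats the request with the RDF attached, so the full cookie is only transmitted to servers that ask for it.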

It may be possible to create browser plugins to handle the extended HTTP headers above. Although this would require the user to install extra software, it would be far simpler than getting browser manufacturers to change their own code to add support. The Mozilla Firefox browser offers good support for extending the browser through 'extensions'[13], although it is not clear whether user code can be inserted into the HTTP protocol handling section. This would be a good place to investigate a practical implementation for browser support of semantic cookies.

Acknowledgment

The work reported in this paper was supported in part by the Advanced Knowledge Technologies (AKT) Project (www.aktors.org), EPSRC Grant Number GR/N15764/01.

References

[1]: "The FOAF project," RDFWeb.org. [Online]. Available: http://www.foaf-project.org/
[2]: E. Miller, "RDF specification," 2000. [Online]. Available: http://www.w3.org/RDF/
[3]: D. Kristol, "RFC 2109, HTTP state management mechanism," Bell Laboratories, Lucent Technologies, February 1997. [Online]. Available: http://rfc.net/rfc2109.html
[4]: "Ordering via 1-Click," amazon.com. [Online]. Available: http://www.amazon.com/exec/obidos/tg/browse/-/468480/ref=br_lr__2/104-1067665-4483110
[5]: "Introduction to DHTML behaviours," Microsoft. [Online]. Available: http://msdn.microsoft.com/library/default.asp?url=/workshop/author/behaviors/overview.asp
[6]: "MSN Wallet," Microsoft. [Online]. Available: https://wallet.msn.com/home/home.aspx
[7]: "PayPal." [Online]. Available: http://www.paypal.com/
[8]: G. W. Bauer, "User data management," mozilla.org, February 2004. [Online]. Available: http://www.mozilla.org/projects/ui/communicator/browser/wallet/
[9]: D. N. Gibbins and S. Harris, "3store," AKT. [Online]. Available: http://inanna.ecs.soton.ac.uk/3store/
[10]: D. Beckett, "Raptor RDF parser toolkit," University of Bristol. [Online]. Available: http://www.redland.opensource.ac.uk/raptor/
[11]: A. Seaborne, "RDQL - a query language for RDF," HP Labs Bristol, January 2004. [Online]. Available: http://www.w3.org/Submission/RDQL/
[12]: R. Jones. [Online]. Available: http://www.audioscrobbler.com/
[13]: "Mozilla Firefox extensions," Mozilla. [Online]. Available: http://texturizer.net/firefox/extensions/