Finding Bacon's Key

Does Google Show How the Semantic Web Could Replace Public Key Infrastructure?

Joseph M. Reagle Jr., <reagle@w3.org>

Abstract

This document briefly introduces the topic of trusted semantic web applications that do not require the existence of a complex public key infrastructure. It derives from a discussion with Tim Berners-Lee and has been improved by comments from folks in this thread, but I'm solely responsible for any errors.

Trust

The question of what trust is has been the subject of many a graduate thesis. For simplicity's sake I will rely upon the following definition:

Trust (worthiness): The degree to which an agent considers an assertion to be true for a given context. While the term "trust" is often used to denote a very high degree of confidence, there is an associated risk of the assertions being wrong.

In traditional cryptographic applications the trust in a statement is commensurate with the trust in the reputation of its author via a cryptographic binding. This assurance is accomplished via a digital signature which requires that:

A cryptographic key be strongly bound to a statement via a digital signature algorithm.
Only the specific person has access to the given key.

Consequently, the following properties are ensured: authenticity (the trust in the person, who keeps their key private, is extended to the binding between the key and statement), integrity (any change to the key or statement will result in a different signature), and sometimes non-repudiation (if the key is indeed unique to the control of the person, then the person cannot deny the binding, because how else would it have been created?). Frequently, this cryptographic binding is associated with a semantic such as "I believe", "I assert", "It is true", or "I notarize". (I tend to think in the semantic "I believe"; however, one can often cast a semantic of one type as another: "I believe 'I notarize this was presented to me on this date and time.'")

Key Based Trust

How is this cryptographic digital signature created such that it has these properties? Public key algorithms are based on trap-door one-way functions:

"The public key gives information about the particular instance of the function; the private key gives information about the trap door. Whoever knows the trap door can perform the function easily in both directions, but anyone lacking the trap door can perform the function only in the forward direction. The forward direction is used for encryption and signature verification; the inverse direction is used for decryption and signature generation." — Cryptography FAQ.

Consequently, a single person (and only that person) can bind their key to a statement that anyone else, possessing the public key, can confirm! This is brilliant, but of course the problem with this scenario is that when I want to confirm Kevin Bacon's signature, how do I know I possess his real public key? On the Internet today there are many cryptographic keys out there purporting to belong to famous people. There may even be some cryptographically signed documents that can be confirmed to be bound to one of those Kevin Bacon keys, but did the real Bacon sign that document? Probably not. How is this problem addressed? Not easily.
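The trap-door behaviour described in the quotation can be illustrated with a toy, textbook-style RSA sketch. Everything in it is an assumption made for illustration: the tiny primes, the use of SHA-256, and the function names are not taken from any real signature scheme, and the parameters are far too small to be secure.

```python
import hashlib

# Toy parameters, chosen only so the arithmetic is easy to follow.
P, Q = 61, 53
N = P * Q                          # public modulus
E = 17                             # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent: the "trap door"

def digest_as_int(message: bytes) -> int:
    """Reduce a SHA-256 digest into the range of the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """Inverse direction: requires the private exponent D."""
    return pow(digest_as_int(message), D, N)

def verify(message: bytes, signature: int) -> bool:
    """Forward direction: needs only the public pair (N, E)."""
    return pow(signature, E, N) == digest_as_int(message)

statement = b"my latest movie is stupid"
signature = sign(statement)
print(verify(statement, signature))  # True
# A different statement should fail, barring a collision in this tiny modulus.
print(verify(b"my latest movie is great", signature))
```

Note that verify only shows that the signature matches the public pair (N, E); it says nothing about whose pair that is, which is exactly the problem the approaches below try to address.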

The two common approaches to finding the right public key are:

Public Key Infrastructure (PKI) typically entails a hierarchically organized infrastructure for organizing trust relationships. For example, what if I wanted to confirm a key I found? When I was hired by MIT I was given a floppy disk with MIT's public key. I trust this. This key also signed other keys (i.e., a certificate) such that I can transitively trust those keys as well. When I find a key purporting to belong to Kevin Bacon I note that it is signed by the Actors' Guild. Of course, how do I know that's the real Actors' Guild? Fortunately, the MIT key has signed the Department of Education (DoE) key, which signed the Department of Commerce's key (DoC), which signed the Actors' Guild key. If I successfully verify all the signatures on these certificates (i.e., a certificate chain) I can be confident I have Kevin Bacon's public key! I can then use that to confirm if the document was signed by Kevin Bacon.
Web of Trust was popularized by the PGP privacy application and uses a transitive trust model similar to PKI's, but without the hierarchical structure. Instead, it is informal and decentralized. Typically, when users of PGP meet at conferences they hold key signing parties where they can easily and personally identify each other and add the appropriate signatures to each other's keys. If I'm not sure that the key I found is really Kevin Bacon's, perhaps I know someone, who knows someone (through six degrees), who does!
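Both approaches come down to finding a path of signatures from a key I already trust to the key in question. A minimal sketch of that search follows, assuming a hypothetical table of "who signed whose key" that mirrors the MIT example; a real verifier would also check each signature cryptographically rather than just following the links.

```python
from collections import deque

# Hypothetical signing records: signer -> keys that signer has signed.
SIGNED = {
    "MIT": ["DoE"],
    "DoE": ["DoC"],
    "DoC": ["Actors' Guild"],
    "Actors' Guild": ["Kevin Bacon"],
}

def find_chain(signed, trusted_root, target):
    """Breadth-first search for a chain of signatures from root to target."""
    queue = deque([[trusted_root]])
    seen = {trusted_root}
    while queue:
        chain = queue.popleft()
        if chain[-1] == target:
            return chain
        for subject in signed.get(chain[-1], []):
            if subject not in seen:
                seen.add(subject)
                queue.append(chain + [subject])
    return None  # no chain: the key cannot be confirmed this way

print(find_chain(SIGNED, "MIT", "Kevin Bacon"))
print(find_chain(SIGNED, "MIT", "Alan Smithee"))
```

The same search works for a hierarchy or for an informal web of signatures; only the shape of the table differs.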

Preponderance Based Trust

If public key infrastructures, transitive trust, and certificate chaining sound complex, they are! First, the infrastructure or density of the web of certificates must be sufficient to be able to confirm keys. Second, extended trust relationships can be nonintuitive to humans. We like immediate and intuitive reasons for trust. While infrastructural mechanisms can address institutional requirements for liability, they don't appeal to us viscerally. Fortunately, PGP offered an even simpler method of engendering confidence in a key without the need for other signatures: fingerprints!

A concept critical to cryptography is that of a digest value (hash result or fingerprint):

"A (mathematical) function which maps values from a large (possibly very large) domain into a smaller range. A 'good' hash function is such that the results of applying the function to a (large) set of values in the domain will be evenly distributed (and apparently at random) over the range." [X509] ... A cryptographic hash is "good" ... [when] any change to an input data object will, with high probability, result in a different hash result, so that the result of a cryptographic hash makes a good checksum for a data object." — RFC2828.

Strong hash functions are almost always used with a digital signature algorithm because it can be computationally expensive to perform a cryptographic signature on the whole of a document. Instead, one can take the document's digest value and sign that. Integrity is maintained because any alteration to the document will yield a different digest value (and consequently a different signature), and it's very difficult to find another document that hashes to the same value.
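A small illustration of the integrity property, assuming SHA-256 from Python's standard library as the hash function (the text above does not name a particular one): changing a single word of the document produces a different fixed-size digest.

```python
import hashlib

def digest(document: bytes) -> str:
    """Hex digest of the document; SHA-256 is an assumed choice."""
    return hashlib.sha256(document).hexdigest()

original = b"I notarize this was presented to me on this date and time."
altered = b"I notarize this was presented to me on that date and time."

# The digest has a fixed size regardless of the document's length.
print(len(original), "bytes in,", len(digest(original)) // 2, "bytes out")
# Same input, same digest; altered input, different digest.
print(digest(original) == digest(original))
print(digest(original) == digest(altered))
```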

However, digests are useful independent of a signature. The nifty PGP fingerprint feature enabled a person to leave the fingerprint (the digest value) of their public key all over the Internet. Consequently, if I find a purported Bacon key, generate its fingerprint, and then find postings on the Internet about or by Bacon with that fingerprint, my confidence that I possess his real key can be very high. Personally, I'd intuitively trust a key and its fingerprint if I found the fingerprint on his official web page, repeated on his fans' pages, and included in his posting to a celebrity mailing list.
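The preponderance idea can be sketched as counting how many distinct sources publish the fingerprint of a candidate key. This is only an illustration: the key bytes and source names are made up, and the fingerprint here is a plain SHA-256 of those bytes rather than the actual PGP fingerprint computation.

```python
import hashlib

def fingerprint(public_key: bytes) -> str:
    """Stand-in fingerprint: a digest of the (hypothetical) key bytes."""
    return hashlib.sha256(public_key).hexdigest()

def independent_sightings(candidate_key, sightings):
    """Return the distinct sources that published this key's fingerprint."""
    fp = fingerprint(candidate_key)
    return {source for source, seen_fp in sightings if seen_fp == fp}

purported_key = b"hypothetical key found under Bacon's name"
other_key = b"hypothetical key posted by someone else"

# (source, fingerprint seen there) pairs gathered from around the Internet.
sightings = [
    ("official web page", fingerprint(purported_key)),
    ("fan page", fingerprint(purported_key)),
    ("celebrity mailing list", fingerprint(purported_key)),
    ("anonymous forum post", fingerprint(other_key)),
]

print(len(independent_sightings(purported_key, sightings)))  # 3
print(len(independent_sightings(other_key, sightings)))      # 1
```

How many sightings are enough, and how independent the sources really are, remains a judgment call for the person (or agent) doing the evaluating.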

You can find a PGP fingerprint from me on Usenet back in 1996! How do you know it is not an imposter? It's improbable: impersonating me now would have required a determined effort, for the past six years, to post messages that sound as if they'd been written by me (otherwise they'd be identified as fraudulent)!

The Semantic Web

The Semantic Web envisions a web of machine processable information in the form of statements:

"The Semantic Web will bring structure to the meaningful content of Web pages, creating an environment where software agents roaming from page to page can readily carry out sophisticated tasks for users... The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." — The Semantic Web.

On the Web today we have a rich interconnection of rather stupid hyperlinks (i.e., "go there") between Web pages; these pages are in many natural languages (e.g., English, Japanese) and include descriptions that are useful to humans but not to our computers. Even so, the popular Web search service Google is able to make great use of these simple interconnections of hyperlinks to help us find words.
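As a rough sketch of how plain hyperlinks can be turned into a measure of authority, here is a simplified link-voting score in the spirit of PageRank. It is not Google's actual algorithm; the damping value, the number of rounds, and the page names are all assumptions made for illustration.

```python
def link_scores(links, damping=0.85, rounds=50):
    """Iteratively share each page's score among the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {page: 1.0 / len(pages) for page in pages}
    for _ in range(rounds):
        nxt = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # A page with no outgoing links spreads its vote over every page.
            targets = links.get(page) or pages
            share = damping * score[page] / len(targets)
            for target in targets:
                nxt[target] += share
        score = nxt
    return score

# Hypothetical pages: three fans link to person_one, who links to person_two.
links = {
    "fan1": ["person_one"],
    "fan2": ["person_one"],
    "fan3": ["person_one"],
    "person_one": ["person_two"],
    "person_two": [],
}
scores = link_scores(links)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this small example both linked-to people end up scoring above the fans: person_one because many pages link there, and person_two because a well-regarded page vouches for it.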

"Google bridges the divide between human-generated indexes and machine-generated analysis. Y'see, the Web is full of people like you and me, making links between documents; human beings, making decisions about documents, voting with their links. When I link to some arbitrary document, it's an indication that I think that it's in some way authoritative. When you link to a document I wrote, you're indicating that I'm in some way authoritative. The Internet is already structured in a meaningful way, but that structure is obscured. Google teases out the relationship between the URLs, examining the webs of authority: this person is linked to by 50,000 others, and he links to this other person over here, which indicates that person one is a pretty sharp individual, one who's inspired 50,000 human beings to take time out of their busy schedules to link to him; and person one thinks that person two is on the ball, which suggests that person two knows what she's on about." — How I Learned to Stop Worrying and Love the Panopticon

What if we had more than simple hyperlinks, and the words that describe things like contact information, schedules, interests, and relationships were also interconnected and easily processed by computers? And what if, just as in Google, statements about a "sharp" person endorsing a person "on the ball" could be made or inferred? Not only would search engines become more accurate, but we could have our programs easily organize the abundance of information available to us now but hidden in a babble of inconsistent formats. (For example, I typically enter the flight itineraries I receive in email into both a Web form and my PDA; this is redundant!)

The Semantic Web and Trust

Not surprisingly, most security applications, whether they entail authorization, access control, or any other kind of trust, are built from statements. The Semantic Web has tools that allow one to make simple statements of the form: "X has the property Y". Computers can then help humans by making use of that information: "Find all pages with property Y".

Imagine two projects at MIT: one is working on giving credentials to employees such that "Joseph is an employee of MIT", another is determining who has access to the online reference materials, "MIT Employees can access library services." Given the nature of large institutions one might not be surprised to find those working on these two projects know nothing of each other. But, if they are written using semantic web tools we need not worry about system incompatibilities. The development of an application to determine "Joseph has read access to MIT library services" is natural to the technology: neither system needs to be re-architected, one simply writes a few rules!
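To make the "few rules" concrete, here is a minimal sketch that represents each project's statements as plain (subject, property, value) triples and applies one rule across them. The property names are hypothetical, and a real deployment would presumably use Semantic Web tooling rather than Python tuples.

```python
# Statements published independently by the two (hypothetical) projects.
credentials = {("Joseph", "is_employee_of", "MIT")}
library_policy = {("MIT", "employees_can_access", "library services")}

def infer_access(statements):
    """Rule: if X is an employee of O, and O's employees can access S,
    then X has read access to S."""
    employed = [(s, o) for s, p, o in statements if p == "is_employee_of"]
    granted = [(s, o) for s, p, o in statements if p == "employees_can_access"]
    return {
        (person, "has_read_access_to", service)
        for person, org in employed
        for granting_org, service in granted
        if org == granting_org
    }

derived = infer_access(credentials | library_policy)
print(derived)  # {('Joseph', 'has_read_access_to', 'library services')}
```

Neither set of statements had to change; the rule simply reads both.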

Additionally, one of the strengths of the Semantic Web is that one can make statements about statements. One of the simplest statements one can make about another statement is: "Statement X has the fingerprint: Z".
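A statement about a statement can be sketched in the same style. The example assumes a made-up canonical text form (the three parts joined by a separator character) and SHA-256 as the digest; real Semantic Web data would need an agreed canonical form before digests from different parties could match.

```python
import hashlib

def statement_digest(statement):
    """Digest of an assumed canonical form of a (subject, property, value) triple."""
    canonical = "\x1f".join(statement).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()

def fingerprint_matches(meta_statement):
    """Check a 'statement X has the fingerprint Z' claim against X itself."""
    target, prop, claimed = meta_statement
    return prop == "has_fingerprint" and statement_digest(target) == claimed

x = ("Joseph", "is_employee_of", "MIT")
about_x = (x, "has_fingerprint", statement_digest(x))
print(fingerprint_matches(about_x))   # True

# The same fingerprint attached to an altered statement no longer matches.
tampered = (("Joseph", "is_employee_of", "Harvard"), "has_fingerprint", about_x[2])
print(fingerprint_matches(tampered))  # False
```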

Key Free Trust in the Semantic Web

I've written about 1600 words so far to identify concepts that I will use towards a simple hypothesis. Those concepts are:

Trust is one's confidence in a statement.
Cryptographic signatures permit one to associate a level of trust in a statement (represented in digital form) akin to that of the reputation of its author/key.
It's hard to know if one has the real public key of someone else.
The Semantic Web can be a rich, decentralized, archived, and interconnected source of machine processable statements. Many of those statements will relate to the identity, relations, capabilities, and authorizations of agents (human or computer).
Cryptography need not be the only basis by which we evaluate trust. I've shown that the preponderance of a fingerprint or a link to a site permits a relatively high level of confidence in the owner of the key (i.e., PGP fingerprints) or relevance of a site (i.e., Google).
The Semantic Web permits statements that describe the digest value of and trust-worthiness of other statements; it will be permeated with annotated fingerprints.

My hypothesis: The pervasive use of digest values to identify the statements in the Semantic Web will engender a preponderance of evidence for trust without cryptography.

There are a major and a minor consequent of this hypothesis. The major consequent is that complex public key infrastructure may not be necessary. Instead, the Semantic Web can mirror the informal and decentralized character of PGP's Web of Trust with some improvements: it is available on and inter-related to the rest of the Web, redundantly archived, and harvested and processed by roving agents and engines that can trivially repurpose it or offer other value-added services. For example, institutions that demand liability assurances can easily build applications: "The cost of these statements is $4." and "I will pay $40,000 if these statements are incorrect." The minor consequent is that the cryptographic signatures themselves might not be necessary to make a reasonable trust evaluation about a statement that has had time to grow into the tangled root structure of the Web. One might be willing to rely upon information if there is a dense set of inter-related statements of the form: "the information with this digest value (Z), was trustworthy for my purposes."

Of course, the presence of a digital signature (beyond the simple digest value) would increase one's confidence, and the signature itself is a relatively inexpensive operation. So the true import would be the simplification of the mechanism for obtaining keys. It can be simple, bottom-up, and decentralized; if need be, decentralized and extensible systems can simulate hierarchical and closed systems much more easily than vice versa.

Gaming the System

How secure would this system be? Nothing is perfectly secure. People can work for years to build a reputation such that they can cheat once and gain more in that single act than their reputation is otherwise worth. Or, a community can band together to discredit the reputation of someone they dislike. This has nothing to do with cryptography, but human nature and game theory.

Recently, a businessman who owned a real store and who also had many loyal Web customers disappeared. Stewart Richardson built up a solid reputation with many rave reviews on the eBay auction site; he was known for the timely completion of Web transactions. Now he is gone and so is the $200,000 from his most recent auction.

Last year, tens of thousands of people banded together for some amusing political antics: during the last presidential election, if you asked Google for a "dumb motherfucker", the George W. Bush Presidential Campaign On-Line Store was the top return. Interestingly, the system corrected itself and the search term now returns articles about the phenomenon, which seems entirely appropriate!

Revocation

Security applications often require a mechanism to revoke a previous "statement." For example, when I no longer consider my 1024 bit key to be strong enough, how do I uproot this statement from the Semantic Web and replace it with my new key? As I've written elsewhere, it can be hard to deprecate pre-existing information; as they say: "You can't take something off the Internet - it's like taking pee out of a pool." However, one can make a new statement, "old key is obsoleted by new key". The problem then is one of ecology and economics. Would there be an incentive for the always evolving branches of the Semantic Web to gravitate towards this new statement? To give you an example, I recently wanted to determine if a vegetarian restaurant that I had heard of still existed. When I queried Google for "Five Seasons", the top returns were references to old pages describing what a great place it was. Only at the bottom of the listing did I find references to new restaurants that now occupied its location. Few people are going to bother to link to something that no longer exists! The same characteristic might pertain to the Semantic Web.

However, there are possible solutions. Just as the W3C uses a "Latest Version" hyperlink within its specifications so that people can always find the latest version of that specification, one could do the same for trust statements. Many of the recent mechanisms for on-line certificate checking or built-in certificate expiration can be emulated: a statement may include properties that specify its duration or an on-line resource that must be used to determine if the statement has been deprecated.
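A minimal sketch of both ideas follows, assuming each key statement carries a hypothetical expiry date and an optional "obsoleted by" pointer; the key names and dates are invented for illustration.

```python
from datetime import date

# Hypothetical key statements with a duration and an "obsoleted by" pointer.
STATEMENTS = {
    "key-1996": {"expires": date(2002, 12, 31), "obsoleted_by": "key-2002"},
    "key-2002": {"expires": date(2010, 12, 31), "obsoleted_by": None},
}

def latest_key(statements, key_id):
    """Follow 'obsoleted by' statements to the newest key, guarding against loops."""
    seen = set()
    while statements[key_id]["obsoleted_by"] is not None and key_id not in seen:
        seen.add(key_id)
        key_id = statements[key_id]["obsoleted_by"]
    return key_id

def is_current(statements, key_id, today):
    """A key is current if nothing obsoletes it and it has not expired."""
    entry = statements[key_id]
    return entry["obsoleted_by"] is None and today <= entry["expires"]

today = date(2002, 11, 25)
print(latest_key(STATEMENTS, "key-1996"))         # key-2002
print(is_current(STATEMENTS, "key-1996", today))  # False
print(is_current(STATEMENTS, "key-2002", today))  # True
```

This only helps an agent that actually finds the "obsoleted by" statement, which is the ecological problem described above.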

Finding Bacon's Key

To summarize how the application of my hypothesis might work, let's reconsider the problem of determining the authenticity and integrity of a statement from (an alleged) Kevin Bacon. Perhaps he said, "my latest movie is stupid." That would be an odd thing for an actor to say! So to confirm the statement I query a Semantic Web search engine and find a dense set of statements from otherwise reputable sources commenting on and confirming that statement.

Still, the press tends to repeat the misinformation of their peers. Ironically enough, the fact that something is well known and commented on is sometimes a reason to distrust the information (e.g., urban myths). Fortunately, the statement has a digital signature. If I can find a key that I trust to be Kevin Bacon's, (independent of this latest Hollywood controversy), that validates the signature, I will be satisfied.

Instead of validating the certificate chain of {MIT, DoE, DoC, Screen Actors Guild, and Kevin Bacon}, I query the Semantic Web for "Kevin Bacon PGP Key" and find a key that is highly inter-related. I can easily follow the source of references to that key to an official Web page, two large fan pages, and the Internet Movie Database. And indeed, that key can be used to validate the disparaging statement! I now trust that Kevin Bacon made that statement. Is that trust perfect? No, but it's sufficient for my following of Hollywood gossip.

(Out of curiosity, I do a similar query for the director Alan Smithee and find many poorly inter-related statements describing his filmography, and a few statements that Alan Smithee is a Director/Writer Guild pseudonym.)

Conclusion

It's easy to complain of complex public key infrastructures. It's also easy to wave one's hands about pie-in-the-sky solutions. In this paper, I do both with the excuse that I want my hypothesis to be easily understood.

The ability to assume agents are always on-line is changing the way the security community thinks about digital trust. I want to push this assumption a little further: not only are security services and data objects on-line, but they are identifiable via a URI, easily referenced and annotated with other statements, accessible in a widely deployed syntax (e.g., XML), and structured as filaments in the Semantic Web. This could lead to a web of information dense enough to give one sufficient confidence for decentralized (furthering PGP's approach), light-weight trust applications. Additionally, this can then be the foundation for hierarchical business and risk models (satisfying PKI's goal).

last revised $Date: 2002/11/25 21:55:05 $ by $Author: reagle $