Trivial Syntaxes for the Semantic Web

Status

Some thoughts by Sandro. They seemed complete enough to write up, but there is now more at the end.

Right now, I briefly discuss some syntaxes and the general issue of communicating information about literal strings.

Trivial Syntaxes

Some trivial syntaxes for encoding a set of RDF-style subject-property-value statements into a stream:

  1. whitespace separated %-escaped strings in S-P-V order

  2. CSV table of three columns, strings in S-P-V order

  3. length-delimited strings

  4. HTML table with three columns in S-P-V order

  5. XML like <sentence> <subject>...</subject> <property>...</property> <value>...</value> </sentence>

  6. XML like <sentence subject="..." property="..." value="..." />

  7. C-like source code for calling: addSPV(String subject, String property, String value)

  8. C-like source code for calling: spvn(int subject_index, int property_index, int value_index); with extn(int index, String external_name);

Each of these has potential character-encoding issues when the identifier strings and the stream follow different standards or operate at different levels (e.g. character vs. octet), but those can all be specified. At this point we don't need to say whether identifiers are octet strings or character strings.
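As a rough sketch of syntax 1, here is one way the whitespace-separated, %-escaped form could be written and read back. The function names are hypothetical, and the choices of UTF-8 for escaping and one triple per line are assumptions; the list above does not fix either.

```python
from urllib.parse import quote, unquote


def encode_triples(triples):
    """Serialize (subject, property, value) string triples, one triple per line."""
    lines = []
    for triple in triples:
        # safe="" escapes every reserved character, including "/" and ":".
        # Whitespace and "%" are always escaped, so a field never contains a separator.
        # Limitation of this sketch: an empty string encodes to nothing and is lost.
        lines.append(" ".join(quote(term, safe="") for term in triple))
    return "\n".join(lines) + "\n"


def decode_triples(stream):
    """Parse the stream back into a list of (subject, property, value) tuples."""
    fields = [unquote(field) for field in stream.split()]
    if len(fields) % 3 != 0:
        raise ValueError("stream does not contain a whole number of triples")
    return [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]
```

Since any whitespace separates fields, the line breaks carry no meaning to the reader; they are only there for people looking at the stream.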

Literal Strings

All these trivial syntaxes share a basic issue of how to communicate literal strings (whether of characters or octets is irrelevant).

There seem to be three approaches:

The Pure Symbol Approach

Define a vocabulary for naming octets or characters and for constructing sequences.

Let's hereby define a vocabulary for ASCII characters like http://www.w3.org/2001/02/trivsyn#literal_char_S for the letter "S". (This is human shorthand -- being pure symbolists here, we'll have to actually enumerate that mapping for each character to the machine. In practice we'd probably want a different abstraction, like a mapping from Peano-generated integers to Unicode characters.)

We'll use Lisp-style list constructors from http://www.w3.org/2000/07/hs78/KIF. And we'll say strings are lists of characters (instead of saying, perhaps, that a string has a list of characters).

A declaration that the author's first name is "Sandro" would look something like this:

     addSPV("http://example.com/somePub",
            "http://purl.org/dc/elements/1.0/Creator",
            "http://example.com/somePub#literal_string_Sandro");
     addSPV("http://example.com/somePub#literal_string_Sandro",
            "http://www.w3.org/2000/07/hs78/KIF#first",
            "http://www.w3.org/2001/02/trivsyn#literal_char_S");
     addSPV("http://example.com/somePub#literal_string_Sandro",
            "http://www.w3.org/2000/07/hs78/KIF#rest",
            "http://example.com/somePub#literal_string_andro");
     addSPV("http://example.com/somePub#literal_string_andro",
            "http://www.w3.org/2000/07/hs78/KIF#first",
            "http://www.w3.org/2001/02/trivsyn#literal_char_a");
     addSPV("http://example.com/somePub#literal_string_andro",
            "http://www.w3.org/2000/07/hs78/KIF#rest",
            "http://example.com/somePub#literal_string_ndro");
and so on, for a total of 12 triples (2 per character). If we used identifiers like "_1" we could cut this down to 1 triple per character.

Cost: lots and lots of triples.
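To make the cost concrete, here is a small sketch that generates the first/rest triples for any string, following the naming pattern used above. The function name is hypothetical, and the end-of-list marker (KIF#nil) is my assumption; the example above stops before reaching the end of the list, so it never names one.

```python
KIF = "http://www.w3.org/2000/07/hs78/KIF#"
CHARS = "http://www.w3.org/2001/02/trivsyn#literal_char_"
NIL = KIF + "nil"  # assumption: the text above does not name the end-of-list marker


def string_triples(base, text):
    """Return (head identifier, triples) spelling out text as a first/rest list."""
    triples = []
    for i, char in enumerate(text):
        # Each list node is named after the suffix of the string it represents.
        node = base + "#literal_string_" + text[i:]
        remainder = text[i + 1:]
        next_node = base + "#literal_string_" + remainder if remainder else NIL
        triples.append((node, KIF + "first", CHARS + char))
        triples.append((node, KIF + "rest", next_node))
    head = base + "#literal_string_" + text if text else NIL
    return head, triples


head, triples = string_triples("http://example.com/somePub", "Sandro")
print(len(triples))  # two triples per character
```

The sketch only makes sense for characters that are legal in an identifier, such as ASCII letters; anything else would need escaping in the node and character names.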

The Out-Of-Band or Extend-The-Syntax Approach

One alternative is to add complexity to the syntax, indicating whether a symbolic identifier or a literal string (with no given identifier) occurs in a given position in a triple. Or, to simply give the literal string value to be associated with an identifier. Picking two of the trivial syntaxes for which this can be done cleanly:

    <sentence>
       <subject parseType="identifier">http://example.com/somePub</subject>
       <property parseType="identifier">http://purl.org/dc/elements/1.0/Creator</property>
       <value parseType="literal">Sandro</value>
    </sentence>
or
    addSPV("http://example.com/somePub",
           "http://purl.org/dc/elements/1.0/Creator", 
           "http://example.com/somePub#literal_string_Sandro");
    defLit("http://example.com/somePub#literal_string_Sandro", "Sandro");

Cost: a more complex syntax; it's no longer just a set of triples.
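A reader for the XML form above might look something like the following sketch. It returns each term tagged with its kind, which is exactly the extra information the plain triple syntaxes cannot carry. The function name is hypothetical, and treating a missing parseType as "identifier" is an assumption, not something the example settles.

```python
import xml.etree.ElementTree as ET

EXAMPLE = """<sentence>
   <subject parseType="identifier">http://example.com/somePub</subject>
   <property parseType="identifier">http://purl.org/dc/elements/1.0/Creator</property>
   <value parseType="literal">Sandro</value>
</sentence>"""


def parse_sentence(xml_text):
    """Return [(kind, text), ...] for subject, property and value, in that order."""
    sentence = ET.fromstring(xml_text)
    terms = []
    for tag in ("subject", "property", "value"):
        element = sentence.find(tag)
        if element is None:
            raise ValueError("missing <%s> element" % tag)
        # Assumption: a term without parseType is read as an identifier.
        kind = element.get("parseType", "identifier")
        if kind not in ("identifier", "literal"):
            raise ValueError("unknown parseType: %r" % kind)
        terms.append((kind, element.text or ""))
    return terms


for kind, text in parse_sentence(EXAMPLE):
    print(kind, text)
```

The consumer now has to handle pairs rather than bare strings, which is the added complexity counted as the cost of this approach.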

The Use-The-Symbol-Text Approach

The third possibility is to allow recipients of triples to dissect identifiers, not just treat them as opaque objects whose only operation is comparison to each other.

A familiar form of this is the "data:" URI scheme, which gives us

    addSPV("http://example.com/somePub",
           "http://purl.org/dc/elements/1.0/Creator", 
           "data:,Sandro");

Once we start treating identifiers as not being opaque, however, we can go all the way to this:

    addSPV("http://example.com/somePub",
           "http://purl.org/dc/elements/1.0/Creator", 
           "http://example.com/somePub#literal_string_Sandro");
    addSPV("Sandro",
           "http://www.w3.org/2001/02/trivsyn#identifierString",
           "http://example.com/somePub#literal_string_Sandro");

In this example, the relationship "identifierString" connects an identified object with a string which identifies it. Plugging that triple into a sentence of the form "S has a P which is V" we get: Sandro has an identifier string which is http://example.com/somePub#literal_string_Sandro. Since we know an identifier string for Sandro, namely "Sandro", we know one value of http://example.com/somePub#literal_string_Sandro is "Sandro".
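Both forms of dissecting identifiers can be sketched in one lookup. The function name is hypothetical. I assume only the bare "data:," form is in use (no media type, no base64), and that triples arrive as plain (subject, property, value) string tuples.

```python
from urllib.parse import unquote

IDENTIFIER_STRING = "http://www.w3.org/2001/02/trivsyn#identifierString"


def literal_values(identifier, triples):
    """Return the set of literal strings that identifier is known to stand for."""
    found = set()
    # Case 1: the identifier carries its own literal, as in "data:,Sandro".
    if identifier.startswith("data:,"):
        found.add(unquote(identifier[len("data:,"):]))
    # Case 2: a triple "<text> identifierString <identifier>" is present.
    # The text of the subject symbol itself is then taken as a literal value.
    for subject, prop, value in triples:
        if prop == IDENTIFIER_STRING and value == identifier:
            found.add(subject)
    return found


triples = [
    ("http://example.com/somePub",
     "http://purl.org/dc/elements/1.0/Creator",
     "http://example.com/somePub#literal_string_Sandro"),
    ("Sandro",
     IDENTIFIER_STRING,
     "http://example.com/somePub#literal_string_Sandro"),
]
print(literal_values("http://example.com/somePub#literal_string_Sandro", triples))
```

Note that case 2 reads the subject "Sandro" as text rather than as a name for something, which is precisely where literals and identifiers start to blur.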

If we're not careful going down this road, we can get very lost.

Some wandering down this road is already done in existing RDF applications, such as RSS, I think. RDF tends to treat identifiers as opaque objects which can be not only compared, but also used with traditional URI methods, particularly fetch. Note that in RSS an <image> is communicated with a (URI) identifier, while a <link> is communicated with a literal string. Presumably the difference comes from the idea that "image" links will be followed by a lower, more automatic layer than "link" links.

Cost: potential for confusing literals and identifiers.

The Solution?

The Extend-The-Syntax approach is probably best, but perhaps we can pretend we're being pure so we can claim it's just identifier-triples.

Still, if we can come up with guidelines to stay clear of the tar pits, the third approach could be the simplest and most elegant.

More Thoughts

Computer networks cannot transmit objects, just abstractions, whether those are bits, bytes, character strings, or whatever. The "Literals" are just one level below (one less stage of symbolic indirection than) the "Identifiers".

Similarly, information cannot be communicated except in some kind of language. Talking about an "abstract" form of information where it has no language (e.g. the RDF model without the syntax) may be impossible. So instead, perhaps we should simply rely on having several trivial languages with translations between them, to help ground the standard. (Of course just one is fine, but that has more potential for failure, and for forgetting that accurate translation is the essence of interoperation.)
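As an illustration of what a translation between two trivial languages could look like, here is a sketch that converts between the three-column CSV form and the attribute-style XML form from the list of trivial syntaxes. The function names and the wrapping <sentences> element are assumptions; the list shows only a single <sentence/> and says nothing about how several are grouped.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Translate three-column S-P-V CSV into <sentence .../> elements."""
    root = ET.Element("sentences")  # assumption: a wrapper element for many sentences
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        subject, prop, value = row
        ET.SubElement(root, "sentence", subject=subject, property=prop, value=value)
    return ET.tostring(root, encoding="unicode")


def xml_to_csv(xml_text):
    """Translate <sentence .../> elements back into three-column CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for sentence in ET.fromstring(xml_text).iter("sentence"):
        writer.writerow([sentence.get("subject"),
                         sentence.get("property"),
                         sentence.get("value")])
    return out.getvalue()
```

A round trip through both functions that reproduces the same triples is one cheap check that the two languages agree on what was said.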

Three things that are useful to transmit:

  1. Literal character strings, octet sequences, etc => Web Content
  2. Opaque object identifiers (non-conflicting in some space)
  3. Web content handles (fetchable, etc, URIs)

I think the RDF community wants #1 and #3, and knows it can use #3 as if it were #2 and probably get away with it. (And the "data:" scheme combines #1 into #3 as well.)

Sandro Hawke
$Date: 2001/02/13 03:33:55 $