A Formalism for Internet Information References

Daniel W. Connolly
$Id: formalism.txt,v 1.5 1994/04/19 03:02:50 connolly Exp $
http://www.hal.com/users/connolly/drafts/formalism.txt

ABSTRACT
========

This is a mathematical model for references between typical
information objects in the internet community; specifically, the model
covers the URL concept from the World Wide Web application and the
external body mechanism from MIME.

I suggest that this model be used to define the requirements for
specifications of new URL schemes and MIME access types, and that it
provides basis for other extensible reference mechanisms such as the
URN and URC mechanisms as variously proposed in the URI working group.

INTRODUCTION
============

From "Integration of Internet Information Resources (iiir)"
< http://kudzu.cnidr.org/ietf/iiir/iiir-charter.txt >,
Mon Apr 18 22:00:54 CDT 1994:

	The Integration of Internet Information Resources Working Group (IIIR)
	is chartered to facilitate interoperability between Internet
	Information Services, and to develop, specify, and align protocols
	designed to integrate the plethora of Internet information services
	(WAIS, ARCHIE, Prospero, etc.) into a single ``virtually unified
	information service'' (VUIS).

(@@ motivation:)

@@ security, scalability, reliability (fault detection and tolerance,
caching, migration, compression, encryption, authentication)


FOUNDATIONS
===========

We begin with definitions of some commonly used notions:

The set of octets, or bytes:

	Octet = { 0, 1, 2, ..., 255 }

Strings, sequences, or streams of bytes:

	Using the notation SEQ{S}, short for
		{ seq | for some N, seq : {0, 1, 2, ... N}  -> S }
	OctetStream = SEQ{Octet}

Then we formalize some of the MIME notions:

	ContentType = { text/plain, application/octet-stream, ... }
		/* just a finite set. Properties or the elements,
		aside from identity, are not defined */

	Entity = ContentType x OctetStream

And finally, we begin with a function to model the resolution of
references between entities in terms of an as-yet undefined set of
References:

	Resolve : Reference -> Entity

While in general, internet resources are do not all fit the definition
of Entity (telnet servers, ftp directories, etc.), the information
units transferred between information clients and servers generally
do. Thus we begin by discussing references to entites, and we define
references to other sorts of resources in terms of the basic Resolve
function.

For example, the basic model of computation for a WWW client goes:

	1. Given some reference r0, compute e0 := Resolve(r0)
	by sending r0 to the server and getting e0 back.
	2. Present e0
	3. Let the user choose a reference r1 from e0
		(click on an anchor, select a number, drag-select
		a range of text, enter some search words, etc.)
	4. Compute e1 := Resolve(r1) as above
	5. Iterate steps 2 thru 4


THE BASIC WEB "GLOBAL FILESYSTEM" REFERENCE
===========================================

The WWW reference mechnism, the URL, is a generalization of a
filesystem pathname.

	Dirs <: SEQ{OctetStream} /* <: meaning subset */
	File <: OctetStream

A normal filesystem pathname doesn't satisfy the properties of a
reference for several reasons: The pathname "x/y/z" may reference
different entities depending on the "current directory" of the client
process. So the relation:

	resolve <: (Dirs x Filename) x Entity

is not a function. This can be addressed by using full pathnames. But
the path "/etc/motd" references different entities on different hosts.
So the WWW application adds hostname information to the path to make
it globally unique.

	Host <: OctetStream /* internet host names */

But even "local-file://host/dir/file" can reference different entities
at different times. But if we include a time argument

	Time = { real numbers } /* time in seconds since start of
				C.E. calendar, written here in
				ISO time format: YYYYMMDDHHMMSSZ.
				For the purposes of this discussion,
				a one second clock resolution is
				assumed to be sufficient */

then we have a well defined function:

	LOCAL-FILE.GET : (Host x Dirs x File) x Time -> OctetStream

Many filesystems don't associate an explicit ContentType with every
file. But we assume that the following function is well-defined:

	Resolve.LOCAL-FILE : Host x Dirs x File x Time -> Entity

For example, we can require that the content type be part of the
reference, or we can use heuristics guided by the filename. So with a
content type ct in hand, we have

	Resolve.LOCAL-FILE(h, d, f, t) = (ct, LOCAL-FILE.GET((h, d, f), t))

Or with a content type heuristic

	InferType : File -> ContentType

we can have

	Resolve.LOCAL-FILE(h, d, f, t)
		= (InferType(f), LOCAL-FILE.GET((h, d, f), t))

The local-file: access type shares semantics so far with the anon-ftp:
access type. In general, the ftp RETR operation can be modelled as:

	User <: OctetStream
	Account <: OctetStream u { {} } /* account is optional */
	Type = { ASCII, EBCDIC, IMAGE }

	FTP.GET : (Host x User x Account x Dirs x File x Type) x Time
			-> OctetStream

Note 1: we don't consider the password, since the retrieved entity does not
vary as a function of the password

Note 2: The FTP RFC includes another data type "local byte size," but
it's probably not worth supporting in any contemporary applications.
There's also a notion of "form-code" for carriage control codes within
ASCII and EBCDIC transfers that is not considered here.

The anon-ftp: access type is a simplification of FTP that obviates the
User and Account parameters:

	ANON-FTP.GET : (Host x Dirs x File x Type) x Time -> OctetStream
where
	ANON-FTP.GET((h, d, f, ty), t) =
		 FTP.GET((h, "anonymous", {}, d, f, ty), t)

And gopher is similar:

	Port = { 0, 1, 2, ... ,65535 } /* IP port number */
	Selector = SEQ{Octet - { 9, 13 }} /* no tabs or newlines allowed */
	Search <: OctetStream u { {} } /* @# what subset? */
	GType <: ContentType /* Gopher types: subset of MIME types */

	GOPHER.GET : (Host x Port x Selector) x Time -> OctetStream

	Resolve.GOPHER : Host x Port x Selector x Search x GType x Time
		 -> Entity
where
	Resolve.GOPHER(h, p, sel, ser, gt, t) =
		(gt, GOPHER.GET((h, p, sel, ser), t))

The HTTP 0.9 model is nearly identical to gopher:

	Path = SEQ{ { 33 .. 126 } } /* no whitespace or 8-bit chars */
	HTTP0.GET : (Host x Port x Path x Search) x Time -> OctetStream
	Resolve.HTTP0 : Host x Port x Path x Search x Time -> Entity
where
	Resolve.HTTP0(h, p, path, ser, t)
		= (text/html, HTTP0.GET((h, p, path, ser), Time))

The HTTP 1.0 model is significantly more complex. We'll get to that
later. But in general, for a set Location* and a set Context*, if there
are functions g and t and a set s where:

	g : Location* x Context* -> OctetStream
	t : Location* x Context* -> ContentType
and
	For all c in Context* and l in Location*,
	Resolve((s, l, c)) = (t(l, c), g(l, c))

then we say that g is a GET function for Context* and Location*, t is
a TYPE function for them, and s is a SCHEME for them.

So far we have that LOCAL-FILE.GET is the GET function for Context* =
Time and Location* = Host x Dirs x File, ANON-FTP.GET is a GET function
for the same set -- hence the need for the distinguishing scheme.

We now have some subsets of the set of References. If s is a SCHEME
for a set Context* and a set Location*, with corresponding GET* and
TYPE* functions, then

	For all c in Context*, l in Location*
	(s, l, c) is an element of Reference.
since
	Resolve((s, l, c)) =  (TYPE*(l, c), GET*(l, c))

Here we define several current URL schemes in this manner, and I
suggest that in order to introduce a new URL scheme, we required a
definitions its Location and Context sets along with definitions of
GET and TYPE functions.

LOCAL-FILE:
	/* defined somewhat informally; this is not a distributed scheme,
	so we don't need agreement across hosts as to how this works. */
	SCHEME = "local-file"	(a string is an OctetStream, which is a set)
	Location = Host x Dirs x File
	Context = Time x (ContentType u { {} })
	GET((h, d, f)) given informally:
		FILE *fp = fopen(path(d,f)); fread(fp...);
	TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f)
					else ct


FTP:
	SCHEME = "ftp"
	Location = Host x User x Account x Dirs x File x Type
	Context = Time x (ContentType u { {} })
	GET((h, u, a, d, f, ty)) as per FTP RFC:
		CONNECT h
		USER u; PASSWD ...
		Account a  /* unless a is {} */
		CWD d1
		CWD d2
		...
		TYPE ty
		RETR f

	TYPE((h, u, a, d, f, ty), (t, ct)) := if ct = {}, InferType(f)
					else ct
	
ANON-FTP:
	SCHEME = "anon-ftp" /* @@ sometimes written just "ftp" */
	Location = Host x Dirs x File x Type
	Context = Time x (ContentType u { {} })
	GET((h, d, f, ty)) as per FTP RFC:
		CONNECT h
		USER anonymous; PASSWD <client's email address>
		CWD d1
		CWD d2
		...
		TYPE ty
		RETR f

	TYPE((h, d, f, ty), (t, ct)) := if ct = {}, InferType(f)
					else ct

GOPHER:
	SCHEME = "gopher"
	Location = Host x Port x GType x Selector x Search
	Context = Time
	GET((h, p, gt, sel, ser)) as per Gopher Protocol:
		telnet h:p
		if ser = {}, send sel
		else send sel<tab>ser

	TYPE((h, p, gt, sel, ser), t) = gt /* @# other type mechansims? */

HTTP:
	SCHEME = "http"
	Location = Host x Port x Dirs x File x Search
	Accept = { f | f : ContentType -> Real } /* @@ more to it than this */
	Context = Time x Accept
	GET((h, p, d, f, s), (t, a)) and
	TYPE((h, p, d, f, s), (t, a)) in terms of HTTP GET operation:
		compute path from d, f, s
		compute accept encoding from a
		telnet to h:p
		send:	GET path HTTP/1.0
			Accept: <accept encoding>
			As-Of: t	/* @@ new HTTP header. Version? */

@@ forms in HTTP in MAILTO; MIME MAIL-SERVER access type
@@ wais, prospero, finger?


RELIABILITY ISSUES WITH FILE (AKA MUTABLE OCTET-STREAM OBJECT) REFERENCES
=========================================================================

The following problem of references to mutable objects is sharted by
several schemes: Consider:

	1pm: file Y has info on apples,
	  file X says "see file Y for more on apples"
	2pm: file Y modified to have info on oranged rather than apples
	3pm: reader of X follows link to Y and gets wrong info

The entity the reader sees is different from what the author saw. That
is,
	GET(Y, tread) != GET(Y, tauthor)
and hence
	Resolve((Y, tread)) != Resolve((Y, tauthor))

But in many cases, the author and the reader _can_ read the file at
different times and get the same result. How can we determine if this
is the case?

One way is to use the time of last modification of the file. We can
model this as a function lastmod satisfying:

	lastmod : Location* x Time -> Time
where
	For all l in Location*, t1, t2 in Time
	If t1 >= lastmod(l, t2)
	then GET*(t2)(l) = GET*(t1)(l)

Where GET* is a GET function for Location and Time.  We'll call such a
function a LASTMOD function for the set Location* and the function
GET*.

Note 1: I don't believe the FTP RFC specifies how to determine the
time of last modification, but it can be determined reliably in many
cases. @# I need to look into this...)

Note 2: I don't believe Gopher specifies a way to determine the time
of last modification. Perhaps Gopher+ does.

Then, in the above example, we have the following:

	If tauthor >= LASTMOD(loc, tread)
	Then GET(loc, tread) = GET(loc, tauthor)
and hence
	Resolve((s, loc, tread)) = Resolve((s, loc, tauthor))

That is, if the reader can determine that the file hasn't changed
since the link was made, s/he can have confidence in the integrity of
the link. If the mod time is later, the client software can warn the
reader before transferring the possibly erroneous data.

So one way to enhance the typical HREF="local-file://host/dir/file" to
be a well-defined Reference is to determine the time context of the
reference.  This can ben done in any of several ways:

	1. Enhance local-file URLs to contain time info; e.g. the author
	writes:

		local-file://host/dir/file;time=19940414141000Z

	2. Enhance the HTML A element to contain time info; e.g.:

		<A HREF="local-file://host/dir/file" TIME="19940414141000Z">

	3. If the reference is in a file whose modification time is known,
	we can make the heuristic assumption that all the references
	in the file were current as of the last modification of the
	file.

(@@ discuss other semantics for default time context: rather than
"default is author time", perhaps "default is read time"
for slowly changing docs (e.g. FAQ's): "any version less than 15days
old" (probabalistic reliability))

Either of the first two is tedious for hand-edited documents. But they
are simple to implement in any cut-n-paste or "paste link" feature.
They also work nicely to fill in the missing content type.

	<A HREF="ftp://host/dir/file;user=connolly;account=general"
		date="19940414141000Z"
		content-type="application/postscript">

The third option is more convenient, though less robust. Consider:

	1pm: file Y written with info on apples
	2pm: file X says "see file Y for more on apples"
	3pm: file Y modified to have info on oranged rather than apples
	4pm: file X modified, but still says "see Y for apples"
	5pm: reader of X cannot detect (by machine) that
		the link to Y is corrupt.

Also: references get copied and pasted, quoted, etc. The time context
of the reference should travel with the reference.

Another technique for detecting faults in resolving references is to
include the MD5 signature of the entity in the reference. This
mechanism is somewhat expensive: it only detects faults _after_
transmission of the octet stream, and it requires computation
proportional to the length of the entity.

A much less reliable technique is to include the octet count of the
entity in the reference. The FTP RFC includes a mechanism to detect
this type of fault before transmission of the entity. (i.e. you can
ask an FTP server how big a file is before you transfer it.)


STRING REPRESENTATIONS OF REFERENCES
====================================

The string representation of a reference is an important part of the
specification of various applications. But it is not clear that we
need one string representation for all applications. It is much more
critical that we understand the properties of the objects we are
trying to represent. Once we have a definition of the set of object
we're representing, we can choose various means to represent the set
in various contexts, with well-defined mappings between syntaxes, for
example, between W3's HREF="xxx" and MIME's external-body; access-type="xxx" 
syntaxes.

A definition of a scheme syntax may include incomplete string
representations, as long as it discusses mechanisms and/or heuristics
to determine the remaining parts of a well-defined reference.

LOCAL-FILE:
	string rep((h, d, f), (t, ct)) given in perl as
		&file_uri("local-file", $h, *d, $f, $t, $ct)
		where d is a list of the directories
		and t is in ISO time format

sub file_rep{
    local($scheme, $host, *dirs, $file, $time, $content_type) = @_;
    local($sep, $ret);

    if(@dirs){ $sep = "/"; }

    grep($_ = &uri_escape($_), @dirs);

    $ret = $scheme . "://"
	. &uri_escape(host)
	    . "/" . join('/', @dirs)
		. $sep . $file;

    if($time){
	$ret .= ";as-of=" . $time; # @# better name for as-of?
    }
    if($content_type){
	$ret .= ";content-type=" . &uri_escape($content_type);
    }

    return $ret;
}

sub uri_escape{
	local($_) = @_;
	s-[/?=;:]-sprintf("%%%02X", ord($&))-ge; # list of illegal chars?
	return $_;
}

FTP:
	string rep((h, u, a, d, f), (t, ct)) given in perl as
		&file_uri("ftp", "$u@$h", *d, $f, $t, $ct) # @@ account?
		where d is a list of the directories
		and t is in ISO time format
	NOTE: the string representation is often written without the
	time and content-type information. In those cases, the content
	type should be inferred from the filename and the time should
	be inferred from the context of the reference.

ANON-FTP:
	string rep((h, d, f), (t, ct)) given in perl as
		&file_uri("anon-ftp", $h, *d, $f, $t, $ct)
		where d is a list of the directories
		and t is in ISO time format
	NOTE: the string representation is often written without the
	time and content-type information. In those cases, the content
	type should be inferred from the filename and the time should
	be inferred from the context of the reference.

GOPHER:
	string rep((h, p, ty, sel, ser), t) given in perl as
		&file_uri("gopher", $p == 70 ? $h : "$h:$p", (),
			 &gtype_char($ct) . $sel, $t)
			 . $ser ? "?$ser" : ""
		where d is a list of the directories
	NOTE: the string representation is often written without the
	time information. And unfortunately, there are no mechanisms
	for determining whether gopher references with different times
	refer to the same entity. @@md5 mechanism

HTTP:
	string rep((h, p, d, f, s), (t, a)) given in perl as
		&file_uri("http", $p == 80 ? $h : "$h:$p", *d,
			 $f, $t, max(a))
			 . $s ? "?$s" : ""
		where d is a list of the directories t in ISO time format
	NOTE 1: the string representation is often written without the
	time. In these cases, the time should be inferred from
	context.

	NOTE 2: the string representation can only represent a limited
	form of the content type metric function, i.e. "type x is
	best," and ofthen it is ommitted completely. In those cases,
	the client should encode a content type metric function based
	on its capabilities and the user preferences.

RELATIVE NAMES
==============

Essentially (@@ need more discussion), if we have

	Resolve(r0) = e0

and a name n is found inside e0, we apply the function

	Relative : Name x Reference -> Reference

and compute

	Resolve(Relative(n, r0))

RELIABLE CACHING AND MIRRORING
==============================

Essentially (@@ need more discussion), if a client wants to
compute Resolve(r0), it can, for example, give the whole r0
reference to a proxy server s, ala:

	Resolve(r0) = Resolve((proxy, (s, r0)))

and so the question becomes: is Resolve(r0) in the cache?

In order to do mirroring, we need techniques to distribute information
about circumstances where a client can substitute references, i.e.
where
	Resoleve(r') = Resolve(r)
but r' is easier to get to.

This is where heirarchical namespaces come in: we'd like to distribute
information of the form:

	For all *,
	Resolve("ftp://host1/dir1/*") = Resolve("ftp://host2/dir2/*")


LOCATION-INDEPENDENT REFERENCES (URN's)
=======================================

Rather than:

	Resolve : Scheme x Location x Context -> Entity

we might as well write:

	Resolve : Scheme x Name x Context -> Entity

for example we can define:

	RFC.GET : Name x ContentType -> OctetStream
	RFC.FILENAME : Name x ContentType -> File
as
	RFC.FILENAME(n, c) = "n.txt" if c is text/plain,
				"n.ps" if c is application/postscript
	For all time t /* since rfc's don't change */
	RFC.GET(n, c) = ANON-FTP.GET(("ds.internic.net", "/pub/rfc",
					RFC.FILENAME(n, c), IMAGE), t)


FUTURE DISCUSSION
=================

@@ backlink features

@@ events: get (head), put