A Formalism for Internet Information References Daniel W. Connolly $Id: formalism.txt,v 1.5 1994/04/19 03:02:50 connolly Exp $ http://www.hal.com/users/connolly/drafts/formalism.txt ABSTRACT ======== This is a mathematical model for references between typical information objects in the internet community; specifically, the model covers the URL concept from the World Wide Web application and the external body mechanism from MIME. I suggest that this model be used to define the requirements for specifications of new URL schemes and MIME access types, and that it provides basis for other extensible reference mechanisms such as the URN and URC mechanisms as variously proposed in the URI working group. INTRODUCTION ============ From "Integration of Internet Information Resources (iiir)" < http://kudzu.cnidr.org/ietf/iiir/iiir-charter.txt >, Mon Apr 18 22:00:54 CDT 1994: The Integration of Internet Information Resources Working Group (IIIR) is chartered to facilitate interoperability between Internet Information Services, and to develop, specify, and align protocols designed to integrate the plethora of Internet information services (WAIS, ARCHIE, Prospero, etc.) into a single ``virtually unified information service'' (VUIS). (@@ motivation:) @@ security, scalability, reliability (fault detection and tolerance, caching, migration, compression, encryption, authentication) FOUNDATIONS =========== We begin with definitions of some commonly used notions: The set of octets, or bytes: Octet = { 0, 1, 2, ..., 255 } Strings, sequences, or streams of bytes: Using the notation SEQ{S}, short for { seq | for some N, seq : {0, 1, 2, ... N} -> S } OctetStream = SEQ{Octet} Then we formalize some of the MIME notions: ContentType = { text/plain, application/octet-stream, ... } /* just a finite set. Properties or the elements, aside from identity, are not defined */ Entity = ContentType x OctetStream And finally, we begin with a function to model the resolution of references between entities in terms of an as-yet undefined set of References: Resolve : Reference -> Entity While in general, internet resources are do not all fit the definition of Entity (telnet servers, ftp directories, etc.), the information units transferred between information clients and servers generally do. Thus we begin by discussing references to entites, and we define references to other sorts of resources in terms of the basic Resolve function. For example, the basic model of computation for a WWW client goes: 1. Given some reference r0, compute e0 := Resolve(r0) by sending r0 to the server and getting e0 back. 2. Present e0 3. Let the user choose a reference r1 from e0 (click on an anchor, select a number, drag-select a range of text, enter some search words, etc.) 4. Compute e1 := Resolve(r1) as above 5. Iterate steps 2 thru 4 THE BASIC WEB "GLOBAL FILESYSTEM" REFERENCE =========================================== The WWW reference mechnism, the URL, is a generalization of a filesystem pathname. Dirs <: SEQ{OctetStream} /* <: meaning subset */ File <: OctetStream A normal filesystem pathname doesn't satisfy the properties of a reference for several reasons: The pathname "x/y/z" may reference different entities depending on the "current directory" of the client process. So the relation: resolve <: (Dirs x Filename) x Entity is not a function. This can be addressed by using full pathnames. But the path "/etc/motd" references different entities on different hosts. So the WWW application adds hostname information to the path to make it globally unique. Host <: OctetStream /* internet host names */ But even "local-file://host/dir/file" can reference different entities at different times. But if we include a time argument Time = { real numbers } /* time in seconds since start of C.E. calendar, written here in ISO time format: YYYYMMDDHHMMSSZ. For the purposes of this discussion, a one second clock resolution is assumed to be sufficient */ then we have a well defined function: LOCAL-FILE.GET : (Host x Dirs x File) x Time -> OctetStream Many filesystems don't associate an explicit ContentType with every file. But we assume that the following function is well-defined: Resolve.LOCAL-FILE : Host x Dirs x File x Time -> Entity For example, we can require that the content type be part of the reference, or we can use heuristics guided by the filename. So with a content type ct in hand, we have Resolve.LOCAL-FILE(h, d, f, t) = (ct, LOCAL-FILE.GET((h, d, f), t)) Or with a content type heuristic InferType : File -> ContentType we can have Resolve.LOCAL-FILE(h, d, f, t) = (InferType(f), LOCAL-FILE.GET((h, d, f), t)) The local-file: access type shares semantics so far with the anon-ftp: access type. In general, the ftp RETR operation can be modelled as: User <: OctetStream Account <: OctetStream u { {} } /* account is optional */ Type = { ASCII, EBCDIC, IMAGE } FTP.GET : (Host x User x Account x Dirs x File x Type) x Time -> OctetStream Note 1: we don't consider the password, since the retrieved entity does not vary as a function of the password Note 2: The FTP RFC includes another data type "local byte size," but it's probably not worth supporting in any contemporary applications. There's also a notion of "form-code" for carriage control codes within ASCII and EBCDIC transfers that is not considered here. The anon-ftp: access type is a simplification of FTP that obviates the User and Account parameters: ANON-FTP.GET : (Host x Dirs x File x Type) x Time -> OctetStream where ANON-FTP.GET((h, d, f, ty), t) = FTP.GET((h, "anonymous", {}, d, f, ty), t) And gopher is similar: Port = { 0, 1, 2, ... ,65535 } /* IP port number */ Selector = SEQ{Octet - { 9, 13 }} /* no tabs or newlines allowed */ Search <: OctetStream u { {} } /* @# what subset? */ GType <: ContentType /* Gopher types: subset of MIME types */ GOPHER.GET : (Host x Port x Selector) x Time -> OctetStream Resolve.GOPHER : Host x Port x Selector x Search x GType x Time -> Entity where Resolve.GOPHER(h, p, sel, ser, gt, t) = (gt, GOPHER.GET((h, p, sel, ser), t)) The HTTP 0.9 model is nearly identical to gopher: Path = SEQ{ { 33 .. 126 } } /* no whitespace or 8-bit chars */ HTTP0.GET : (Host x Port x Path x Search) x Time -> OctetStream Resolve.HTTP0 : Host x Port x Path x Search x Time -> Entity where Resolve.HTTP0(h, p, path, ser, t) = (text/html, HTTP0.GET((h, p, path, ser), Time)) The HTTP 1.0 model is significantly more complex. We'll get to that later. But in general, for a set Location* and a set Context*, if there are functions g and t and a set s where: g : Location* x Context* -> OctetStream t : Location* x Context* -> ContentType and For all c in Context* and l in Location*, Resolve((s, l, c)) = (t(l, c), g(l, c)) then we say that g is a GET function for Context* and Location*, t is a TYPE function for them, and s is a SCHEME for them. So far we have that LOCAL-FILE.GET is the GET function for Context* = Time and Location* = Host x Dirs x File, ANON-FTP.GET is a GET function for the same set -- hence the need for the distinguishing scheme. We now have some subsets of the set of References. If s is a SCHEME for a set Context* and a set Location*, with corresponding GET* and TYPE* functions, then For all c in Context*, l in Location* (s, l, c) is an element of Reference. since Resolve((s, l, c)) = (TYPE*(l, c), GET*(l, c)) Here we define several current URL schemes in this manner, and I suggest that in order to introduce a new URL scheme, we required a definitions its Location and Context sets along with definitions of GET and TYPE functions. LOCAL-FILE: /* defined somewhat informally; this is not a distributed scheme, so we don't need agreement across hosts as to how this works. */ SCHEME = "local-file" (a string is an OctetStream, which is a set) Location = Host x Dirs x File Context = Time x (ContentType u { {} }) GET((h, d, f)) given informally: FILE *fp = fopen(path(d,f)); fread(fp...); TYPE((h, d, f), (t, ct)) := if ct = {}, InferType(f) else ct FTP: SCHEME = "ftp" Location = Host x User x Account x Dirs x File x Type Context = Time x (ContentType u { {} }) GET((h, u, a, d, f, ty)) as per FTP RFC: CONNECT h USER u; PASSWD ... Account a /* unless a is {} */ CWD d1 CWD d2 ... TYPE ty RETR f TYPE((h, u, a, d, f, ty), (t, ct)) := if ct = {}, InferType(f) else ct ANON-FTP: SCHEME = "anon-ftp" /* @@ sometimes written just "ftp" */ Location = Host x Dirs x File x Type Context = Time x (ContentType u { {} }) GET((h, d, f, ty)) as per FTP RFC: CONNECT h USER anonymous; PASSWD CWD d1 CWD d2 ... TYPE ty RETR f TYPE((h, d, f, ty), (t, ct)) := if ct = {}, InferType(f) else ct GOPHER: SCHEME = "gopher" Location = Host x Port x GType x Selector x Search Context = Time GET((h, p, gt, sel, ser)) as per Gopher Protocol: telnet h:p if ser = {}, send sel else send selser TYPE((h, p, gt, sel, ser), t) = gt /* @# other type mechansims? */ HTTP: SCHEME = "http" Location = Host x Port x Dirs x File x Search Accept = { f | f : ContentType -> Real } /* @@ more to it than this */ Context = Time x Accept GET((h, p, d, f, s), (t, a)) and TYPE((h, p, d, f, s), (t, a)) in terms of HTTP GET operation: compute path from d, f, s compute accept encoding from a telnet to h:p send: GET path HTTP/1.0 Accept: As-Of: t /* @@ new HTTP header. Version? */ @@ forms in HTTP in MAILTO; MIME MAIL-SERVER access type @@ wais, prospero, finger? RELIABILITY ISSUES WITH FILE (AKA MUTABLE OCTET-STREAM OBJECT) REFERENCES ========================================================================= The following problem of references to mutable objects is sharted by several schemes: Consider: 1pm: file Y has info on apples, file X says "see file Y for more on apples" 2pm: file Y modified to have info on oranged rather than apples 3pm: reader of X follows link to Y and gets wrong info The entity the reader sees is different from what the author saw. That is, GET(Y, tread) != GET(Y, tauthor) and hence Resolve((Y, tread)) != Resolve((Y, tauthor)) But in many cases, the author and the reader _can_ read the file at different times and get the same result. How can we determine if this is the case? One way is to use the time of last modification of the file. We can model this as a function lastmod satisfying: lastmod : Location* x Time -> Time where For all l in Location*, t1, t2 in Time If t1 >= lastmod(l, t2) then GET*(t2)(l) = GET*(t1)(l) Where GET* is a GET function for Location and Time. We'll call such a function a LASTMOD function for the set Location* and the function GET*. Note 1: I don't believe the FTP RFC specifies how to determine the time of last modification, but it can be determined reliably in many cases. @# I need to look into this...) Note 2: I don't believe Gopher specifies a way to determine the time of last modification. Perhaps Gopher+ does. Then, in the above example, we have the following: If tauthor >= LASTMOD(loc, tread) Then GET(loc, tread) = GET(loc, tauthor) and hence Resolve((s, loc, tread)) = Resolve((s, loc, tauthor)) That is, if the reader can determine that the file hasn't changed since the link was made, s/he can have confidence in the integrity of the link. If the mod time is later, the client software can warn the reader before transferring the possibly erroneous data. So one way to enhance the typical HREF="local-file://host/dir/file" to be a well-defined Reference is to determine the time context of the reference. This can ben done in any of several ways: 1. Enhance local-file URLs to contain time info; e.g. the author writes: local-file://host/dir/file;time=19940414141000Z 2. Enhance the HTML A element to contain time info; e.g.: 3. If the reference is in a file whose modification time is known, we can make the heuristic assumption that all the references in the file were current as of the last modification of the file. (@@ discuss other semantics for default time context: rather than "default is author time", perhaps "default is read time" for slowly changing docs (e.g. FAQ's): "any version less than 15days old" (probabalistic reliability)) Either of the first two is tedious for hand-edited documents. But they are simple to implement in any cut-n-paste or "paste link" feature. They also work nicely to fill in the missing content type. The third option is more convenient, though less robust. Consider: 1pm: file Y written with info on apples 2pm: file X says "see file Y for more on apples" 3pm: file Y modified to have info on oranged rather than apples 4pm: file X modified, but still says "see Y for apples" 5pm: reader of X cannot detect (by machine) that the link to Y is corrupt. Also: references get copied and pasted, quoted, etc. The time context of the reference should travel with the reference. Another technique for detecting faults in resolving references is to include the MD5 signature of the entity in the reference. This mechanism is somewhat expensive: it only detects faults _after_ transmission of the octet stream, and it requires computation proportional to the length of the entity. A much less reliable technique is to include the octet count of the entity in the reference. The FTP RFC includes a mechanism to detect this type of fault before transmission of the entity. (i.e. you can ask an FTP server how big a file is before you transfer it.) STRING REPRESENTATIONS OF REFERENCES ==================================== The string representation of a reference is an important part of the specification of various applications. But it is not clear that we need one string representation for all applications. It is much more critical that we understand the properties of the objects we are trying to represent. Once we have a definition of the set of object we're representing, we can choose various means to represent the set in various contexts, with well-defined mappings between syntaxes, for example, between W3's HREF="xxx" and MIME's external-body; access-type="xxx" syntaxes. A definition of a scheme syntax may include incomplete string representations, as long as it discusses mechanisms and/or heuristics to determine the remaining parts of a well-defined reference. LOCAL-FILE: string rep((h, d, f), (t, ct)) given in perl as &file_uri("local-file", $h, *d, $f, $t, $ct) where d is a list of the directories and t is in ISO time format sub file_rep{ local($scheme, $host, *dirs, $file, $time, $content_type) = @_; local($sep, $ret); if(@dirs){ $sep = "/"; } grep($_ = &uri_escape($_), @dirs); $ret = $scheme . "://" . &uri_escape(host) . "/" . join('/', @dirs) . $sep . $file; if($time){ $ret .= ";as-of=" . $time; # @# better name for as-of? } if($content_type){ $ret .= ";content-type=" . &uri_escape($content_type); } return $ret; } sub uri_escape{ local($_) = @_; s-[/?=;:]-sprintf("%%%02X", ord($&))-ge; # list of illegal chars? return $_; } FTP: string rep((h, u, a, d, f), (t, ct)) given in perl as &file_uri("ftp", "$u@$h", *d, $f, $t, $ct) # @@ account? where d is a list of the directories and t is in ISO time format NOTE: the string representation is often written without the time and content-type information. In those cases, the content type should be inferred from the filename and the time should be inferred from the context of the reference. ANON-FTP: string rep((h, d, f), (t, ct)) given in perl as &file_uri("anon-ftp", $h, *d, $f, $t, $ct) where d is a list of the directories and t is in ISO time format NOTE: the string representation is often written without the time and content-type information. In those cases, the content type should be inferred from the filename and the time should be inferred from the context of the reference. GOPHER: string rep((h, p, ty, sel, ser), t) given in perl as &file_uri("gopher", $p == 70 ? $h : "$h:$p", (), >ype_char($ct) . $sel, $t) . $ser ? "?$ser" : "" where d is a list of the directories NOTE: the string representation is often written without the time information. And unfortunately, there are no mechanisms for determining whether gopher references with different times refer to the same entity. @@md5 mechanism HTTP: string rep((h, p, d, f, s), (t, a)) given in perl as &file_uri("http", $p == 80 ? $h : "$h:$p", *d, $f, $t, max(a)) . $s ? "?$s" : "" where d is a list of the directories t in ISO time format NOTE 1: the string representation is often written without the time. In these cases, the time should be inferred from context. NOTE 2: the string representation can only represent a limited form of the content type metric function, i.e. "type x is best," and ofthen it is ommitted completely. In those cases, the client should encode a content type metric function based on its capabilities and the user preferences. RELATIVE NAMES ============== Essentially (@@ need more discussion), if we have Resolve(r0) = e0 and a name n is found inside e0, we apply the function Relative : Name x Reference -> Reference and compute Resolve(Relative(n, r0)) RELIABLE CACHING AND MIRRORING ============================== Essentially (@@ need more discussion), if a client wants to compute Resolve(r0), it can, for example, give the whole r0 reference to a proxy server s, ala: Resolve(r0) = Resolve((proxy, (s, r0))) and so the question becomes: is Resolve(r0) in the cache? In order to do mirroring, we need techniques to distribute information about circumstances where a client can substitute references, i.e. where Resoleve(r') = Resolve(r) but r' is easier to get to. This is where heirarchical namespaces come in: we'd like to distribute information of the form: For all *, Resolve("ftp://host1/dir1/*") = Resolve("ftp://host2/dir2/*") LOCATION-INDEPENDENT REFERENCES (URN's) ======================================= Rather than: Resolve : Scheme x Location x Context -> Entity we might as well write: Resolve : Scheme x Name x Context -> Entity for example we can define: RFC.GET : Name x ContentType -> OctetStream RFC.FILENAME : Name x ContentType -> File as RFC.FILENAME(n, c) = "n.txt" if c is text/plain, "n.ps" if c is application/postscript For all time t /* since rfc's don't change */ RFC.GET(n, c) = ANON-FTP.GET(("ds.internic.net", "/pub/rfc", RFC.FILENAME(n, c), IMAGE), t) FUTURE DISCUSSION ================= @@ backlink features @@ events: get (head), put