Tim Berners-Lee
Creation date: 2004-08-11, last change: $Date: 2005/01/01 14:12:42 $
Status: personal view only. Editing status: first draft.

Up to Design Issues


Installation, Catalogs, and Caches

The management of disk space for long-term high-availability items

This article discusses the merging of the concepts of a decent installation manager and a persistent cache. It is a plea for web caching to be advanced to the power of current installation systems. It is a plea for XML catalogs to be transparent and integrated with

Introduction

In the history of computing, various forms of remote storage have been used, with different characteristics. Local disks have (on a good day) reliable access and high speed. Applications are written to assume their availability. Running on a local area network, remote mounted disks using RPC-based access protocols (NFS, etc.) provide quite high speed and quite high functionality, and as they are mounted as virtual disks, the application has to treat them as though they are reliable: there is no application-level visibility of network outages, for example. Web pages accessed using HTTP over a wide area network are still less reliable. In this case, applications and users are both aware of the many forms of error which can occur, some, such as packet loss, arising from the distances involved, and some, such as authentication errors, deriving from the fact that the system crosses organizational boundaries. Yet another form of remote storage, though it might not often be considered as such, is installed software. This information is accessed remotely, but is copied completely locally, as its reliability is extremely important. Typically the installation process is managed explicitly by the user, who is aware of the reason for wanting it, the security issues (DNS address and signatures) and the resource commitment (disk space). Unlike disk, NFS and HTTP access, there is no URI for most installed things. You can typically find them on the web, and the Debian packaging system has a global (non-URI) package name space, but the integration with the web is weak. This article proposes to strengthen that connection.

It is driven by the need not only for conventional software installation to be handled more powerfully, but also for documents, and in particular namespace documents, to be handled much more as installations. There

The web architecture is that things should be identified by URIs.

The functionality is a sort of amalgam of Debian's packaging system, Microsoft's "Add/remove programs", Apple's Software Update system, persistent Web caches such as wwwoffle, and various SGML DTD catalog systems being proposed.

I like hierarchical ways of organizing things like local resources, so as a user I would like the ability to make a dummy module in order to build a dependency. Let's call it a project. If I have asked the system over time for many things, I need to be able to group them.

In some kind of ideal world, it would be nice to be unaware of limitations of disk space, to be unimpeded by the time to access any data, and for all information to be available at all times. In reality, of course, compromises are made, and an individual's priorities have to be matched against real costs. Here we imagine that this process is centralized in one place, and made as painless as possible. We need to match sets of data (say, projects) against management policies. Examples of projects may be: everything I have; everything I need for work; all the recorded music I own; all the family photos. Dimensions of management policies may include, for example, whether data must be kept locally for offline use, whether it is mirrored on other machines, whether it is part of a regular backup, and how long it must be kept.
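
As a purely illustrative sketch, here is what a project and its policy might look like; the field names are invented, and the values echo the examples later in this article.

    # Illustrative sketch only: a project with a management policy.
    # The field names are hypothetical; the dimensions (local copy,
    # mirroring, backup, retention) come from examples later in the article.
    from dataclasses import dataclass, field

    @dataclass
    class Policy:
        keep_local_copy: bool = True                     # must be available offline
        mirror_on: list = field(default_factory=list)    # machines to mirror to
        weekly_dvd_backup: bool = False                  # part of the weekly DVD burn
        keep_forever: bool = False                       # never delete automatically

    @dataclass
    class Project:
        name: str
        policy: Policy

    family_media = Project(
        name="Family Personal Media",
        policy=Policy(mirror_on=["alpha", "beta", "gamma"], weekly_dvd_backup=True),
    )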

Organizing data

Traditionally, the directory system has been used to categorize information for the purposes above. However, the conflicting needs of the different dimensions of categorization cannot all be met by a single hierarchy.

Currently many applications allow users to categorize data such as photos and music.

Requirements

To summarize,

  1. Track the URI each file has as its definitive source. The URI allows one to reconstitute a system from its past; to check on whether it is up-to-date, secure, and so on.
  2. Keep a copy (just one) of the data on the local system.
  3. Use the URI as the identifier for the file in its
  4. Track dependencies between all installed information -- and with the original user needs. Dpkg does a good job of tracking dependencies, but forgets the what, why and when of the original user demands.
  5. Track resources used on the local machine -- and display them as a function of the original user needs. Do this across applications, as resources are shared across them. (A sketch of such a per-resource record follows this list.)

    Track Expiry of modules, so as to know when to check for new versions.

    Subscription to update notification streams by email, web service, polling (RSS), etc.

  6. Allow the user to set a required quality of service by project.
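
As a minimal sketch of requirements 1 to 5, a resource manager might keep a record like the following for each installed item; the field names and values are hypothetical, not an existing format.

    # Hypothetical per-resource record tying together requirements 1-5:
    # the definitive URI, the one local copy, dependencies, expiry,
    # and the original user need (project) that caused installation.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class InstalledResource:
        uri: str                       # definitive source (requirement 1)
        local_path: str                # the single local copy (requirement 2)
        size_bytes: int                # local resource used (requirement 5)
        depends_on: list = field(default_factory=list)   # URIs of dependencies (requirement 4)
        wanted_for: list = field(default_factory=list)   # projects / user requests that pulled it in
        expires: Optional[str] = None  # when to check for a new version

    amaya = InstalledResource(
        uri="http://example.org/Amaya/amaya.tar.gz",     # hypothetical URI
        local_path="/opt/amaya",                         # hypothetical location
        size_bytes=15 * 1024 * 1024,
        wanted_for=["Interactive editing tools"],
    )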

The task of organizing data is shared between the user and the system.

Examples of conversations I would like to be able to have with my computer

I would like to install Amaya. This will cost you 15MB and require you to trust yyy and zzz. OK? Yes. Why? For a new project "Interactive editing tools".

Why do I have zzz installed? Because it is used [by foo which is used by bar which is used] by Amaya which you asked for on 18th June, for your Interactive Editing Tools project.

Please show me the still images I have installed as a function of project. Ok, here is the list, showing disk space used by project.

How much do I gain if I delete Interactive Editing Tools? Well, they take 2.3GB but you would gain only 1.4GB due to common modules shared with other things.

Please install my usual working environment on this machine. Ok.

Please make everything under project Family Personal Media mirrored up on machines alpha, beta and gamma, and part of the weekly incremental DVD burn backup. No, sorry, you wouldn't have enough disk space on gamma.

Please save space by deleting my Classical Music, but make a DVD for me to restore it later. Ok

Always keep any ontologies of RDF documents you process as "local, forever".

Anything remote that I access from now until I say "stop" is to be deemed part of My Presentation project.

I would like to install SuperDVDPlayer3. Sorry, that needs Super3dSound, which needs access to your sound card, which is not sharable and is currently used by AudioDriver84. Here is a list of the projects you would get with one module but not the other. [...] Please make a choice. You can reverse it at any future time. Thanks, I'll stick with what I have. Ok. I'll leave the switch in the set of options in your control panel.
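
The "how much do I gain if I delete this project" question above is essentially reference counting over the dependency graph. A minimal sketch, with made-up module names and sizes:

    # Sketch: how much disk space is actually reclaimed by deleting a
    # project, given that some modules are shared with other projects.
    # Module names and sizes are invented for illustration.
    def reclaimable(project, projects, sizes):
        """Bytes freed by deleting `project`: its modules minus those
        still needed by any other project."""
        still_needed = set()
        for name, modules in projects.items():
            if name != project:
                still_needed |= modules
        only_ours = projects[project] - still_needed
        return sum(sizes[m] for m in only_ours)

    sizes = {"editor-core": 900, "render-lib": 500, "shared-gui": 600}
    projects = {
        "Interactive Editing Tools": {"editor-core", "render-lib", "shared-gui"},
        "My Presentation": {"shared-gui"},
    }
    total = sum(sizes[m] for m in projects["Interactive Editing Tools"])  # 2000: what the project takes
    freed = reclaimable("Interactive Editing Tools", projects, sizes)     # 1400: what you would gain
    print(total, freed)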

Conflicts occur when different applications, or projects here, need different versions of the same module. These don't have to be conflicts when a smart system can load both versions of the module in question. They do turn out to be a problem when that isn't possible.

There is a class of conflict which happens when something needs a unique or limited resource. A hardware driver needs a device, for example. That is when your system has to make some choice between them. Then it is useful to have a very good GUI to give you the options in terms of things you asked for.

Existing systems

make and CVS

These Unix tools have been the mainstay of huge amounts of software development over the years, and have many clones and spinoffs with similar functionality, that is, dependency management and source code control.

They meet a surprising number of the requirements above, in that CVS tracks where each file came from (effectively a URI, from which it can be verified, updated, etc.), and make keeps track of the dependencies between files. Mainly, though, make is limited to working within a directory. To work in a large space managed using make and CVS typically requires a recursive invocation of make to find dependencies on external directories, CVS to get an updated version, and make to build new versions of files which may have changed.

SGML and XML catalogs

These systems are designed to deal with the problem that SGML used FPIs, and some XML systems use non-dereferenceable identifiers for things like DTDs and other external entities. If one then acquires copies of the resources in question, a catalog format allows one to tell the system where they are and what they correspond to.
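
In spirit, a catalog is just a mapping from an identifier to a local copy. A toy sketch of the lookup (this is not the OASIS catalog format, and the local paths are hypothetical):

    # Toy sketch of what a catalog does: map a public or system identifier
    # to a local copy, falling back to the network otherwise.
    CATALOG = {
        "-//W3C//DTD XHTML 1.0 Strict//EN":
            "/usr/share/xml/xhtml1-strict.dtd",      # hypothetical local path
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd":
            "/usr/share/xml/xhtml1-strict.dtd",
    }

    def resolve(identifier):
        local = CATALOG.get(identifier)
        if local is not None:
            return open(local)       # use the local copy
        raise LookupError("not catalogued; fetch %s from the network" % identifier)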

Persistent web caches

These come in two forms: integrated with a browser, or independent, typically operating as a proxy. In the first class, Internet Explorer has had, in various versions, the ability to ask for a bookmark to be "available offline", with the option of keeping resources linked to a given nesting depth. These resources would be available offline, could be automatically resynchronized at requested intervals (not driven by the HTTP expiry date), and would be deleted after a while, on an algorithm which was not clear to me at the time.

As an example of the second class, wwwoffle allows the user to cache files permanently, and, by running locally as a proxy, allows any browser to be used offline as though it were online.

Integration or Independence?

A danger in software design is to make modules which need to be "king of the castle", which must own all of a certain space, which must be the only one of their kind. (See the Test of Independent Invention.) This is clearly a trap here. A key requirement is that the user should be aware of the storage space being used, and balance that, as a user value judgement, against the quality of availability for different resources.

One way to design this, then, would be to make the installation of all kinds of resources a set of calls to the operating system. This would make it difficult to make applications portable between operating systems, unless a POSIX-like standard API were deployed evenly. At best, the abuse of such an API would be a constant temptation on the part of the designers of monopoly operating systems.

On the other hand, it is clear that each application needs to be involved with the installation or deinstallation of its own resources. The operating system needs to be involved with shared program modules. However, when application code itself is deinstalled, the application itself often has to be called to cleanly remove itself. A photo application needs to be aware of the arrival of new photos, and the music application of new music, and so on. Sometimes such applications in fact require the media to be stored within their own directory hierarchy, though sometimes not, and sometimes they are flexible about it.

Here are some possible architectures:

User control
The user can delete any resources at any time. With media, this tends to be the case now. The user has access to the file system, and sometimes needs disk space: the wise application understands this and re-indexes when necessary. Similarly, when new information arrives, the user just makes it appear somewhere. This happens already, when a web browser or an email client downloads files into a default download directory and uncompresses and unpacks them. What is missing is the ongoing resource control ("Keep this forever") and, flagrantly, sufficient security. At the moment, the user is responsible for either importing the resource into an application
Resource Manager control
The actual storage of all resources is under resource manager control. Any application makes a call to the resource manager to install or open one or many resources. The actual local filenames are not available to other applications. (A minimal sketch of such an interface appears below.)
Application control
The actual storage of resources is under application control. Applications can import, say, photos from cameras, or music from CDs, or Python modules from the web, and when they have done that they provide metadata about size and dependencies, so that the resource manager can index them. The resource manager has to call the application back if the user wants to delete resources while looking at the resource manager view. Or, that functionality is not provided: the user can see what has to be deleted, and goes into that application to delete it. This is practical, being closest to current practice, but does not allow for the management of projects involving multiple types of resources.

It also has a snag in that the application becomes bound to a particular way of getting resources. For now, one might easily get pictures from a connected camera and music from a CD, but in the future one may want to do the reverse, or get Python modules via peer-to-peer file sharing, and so on. A fundamental flexibility point of the web is to separate the ways of getting stuff (URI schemes, basically) from the sorts of stuff you can get (Internet Content Types, basically).
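
As a sketch of the "Resource Manager control" option above, the interface an application might see could look something like this; the calls and names are hypothetical, not any existing API.

    # Hypothetical interface for the "Resource Manager control" option:
    # applications never see local filenames; they install and open
    # resources by URI, stating which project (user need) wants them.
    class ResourceManager:
        def install(self, uri, project, trust=None):
            """Fetch `uri`, record its size, dependencies and provenance,
            and attribute it to `project`. Returns an opaque handle."""
            raise NotImplementedError   # sketch only

        def open(self, uri):
            """Return a read-only file-like object for the local copy."""
            raise NotImplementedError   # sketch only

        def release(self, uri, project):
            """Note that `project` no longer needs `uri`; the local copy
            may be reclaimed if nothing else depends on it."""
            raise NotImplementedError   # sketch only

    # Usage, in the spirit of the conversations earlier in the article:
    # rm = ResourceManager()
    # rm.install("http://example.org/amaya.tar.gz", project="Interactive editing tools")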

Indexed metadata

A final architecture is likely to be a bit of a mixture. The safest way to build any software will be to make as few assumptions as possible about how resources arrived or where they are stored on the local disk. If you need to get something from the web, then put it somewhere obvious and allow it to be deleted, rather than hiding it. (Anecdote: a user lost all her music because her music application hid it where it didn't get backed up.)

This means that any system which needs to know about a resource is going to have to be notified that it exists, or must be able to search for it. It must then be able to pick up what information it needs about it. This probably includes where it came from, and for anything executable a security trail of why it should be trusted, if at all. For media or software, it may include licensing information. It may include expiry dates.

A solution is for data to come with this metadata expressed in a form which anyone can read, a common metadata format such as RDF.
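
For example, using rdflib (mentioned in the references), acquisition metadata about a resource could be written out as RDF; the ex: property names and the URIs below are invented for illustration, not an agreed vocabulary.

    # Sketch: recording provenance metadata about a local copy in RDF,
    # using rdflib. The ex: properties are a made-up vocabulary.
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/resource-manager#")   # hypothetical vocabulary
    g = Graph()
    g.bind("ex", EX)

    local = URIRef("file:///media/photos/2004/beach.jpg")           # the local copy
    source = URIRef("http://example.org/photos/2004/beach.jpg")     # where it came from
    g.add((local, EX.retrievedFrom, source))
    g.add((local, EX.sizeBytes, Literal(2400000)))
    g.add((local, EX.expires, Literal("2005-06-30")))

    print(g.serialize(format="turtle"))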

Where to keep resources in the file system

Moving data on disk is bad. A bit like changing URIs, but on the local scale, it is bad for similar reasons. Why does one move data on disk (actually, rename it)? For different conflicting reasons. The conflict is the problem. Because it belongs to someone else, because it needs to be kept, because it needs to be backed up, because it needs to be managed by XXX application. Let's look at architectures in which files are rarely if ever moved. What should determine where a file is stored?

Security

The current (2004 -- I hope this will date the article!) rash of viruses may be largely based on some mail and web applications' inability to distinguish between safe and unsafe files, between passive media and executable code. This binary distinction should be core to the design of a system. Most users can do all their normal business without ever having to execute something sent by email or picked up by following a link, except under very controlled software installation conditions.

Where does such responsibility fall in the resource management scenario? Primarily on any application which passes control to, executes or interprets code. It is unreasonable to make applications which deal with passive media jump through security hoops.

For security purposes, I am very much in favor of keeping files from a given source (e.g. signed by a given manufacturer) within a given subtree in the file system. The copying of files from application space into shared areas, which often happens, alas leaves no trace of where they came from, and can lead to undiagnosable configuration errors. There is a certain safety, then, in keeping files in place under the source. This is also consistent with installing them from a compressed zip or tar file, which typically unzips into a single directory tree. When that has been done, connections have to be made. Both the disk space management and the application itself typically need to index what is there. The interpreter needs to build a list of available modules. The loader needs to pre-link runtime images. The photo application indexes photos and builds thumbnails. The music application indexes music. So, the rule we end up with is:

It is the provenance of data which determines where it is stored in the local file system.
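
As one illustration of that rule, the local path could be derived mechanically from the URI the data came from; the base directory and layout here are invented, not a recommendation.

    # Sketch: derive the local storage location from the provenance of
    # the data, i.e. the URI it came from. The root and layout are invented.
    from urllib.parse import urlparse
    import os.path

    BASE = "/srv/resources"          # hypothetical root for installed resources

    def local_path(uri):
        """Keep everything from a given source under one subtree."""
        parts = urlparse(uri)
        return os.path.join(BASE, parts.netloc, parts.path.lstrip("/"))

    print(local_path("http://example.org/Amaya/amaya.tar.gz"))
    # -> /srv/resources/example.org/Amaya/amaya.tar.gz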

How to share metadata within the system

On the web, metadata may be published by anyone about anything. It is then up to diverse systems to find it and use it or not use it as they choose. Global indexes like Google make finding metadata (such as, in an informal way, movie reviews) easy.

Some metadata, of course, comes with the resource itself. This can clearly be stored with that resource. It is worth converting it from an application-specific standard, such as EXIF for digital photos, to RDF.

Other metadata, such as the HTTP response message headers returned when something is fetched from the web, is generated when the resource is acquired and can also be stored with that resource.
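
A minimal sketch of that: when fetching a resource, write the response headers into a sidecar file next to the local copy. The paths and the ".headers" convention are invented for illustration.

    # Sketch: keep the HTTP response headers (acquisition-time metadata)
    # alongside the local copy of a fetched resource.
    import json
    import urllib.request

    def fetch(uri, local_path):
        with urllib.request.urlopen(uri) as response:
            data = response.read()
            headers = dict(response.getheaders())   # e.g. Expires, Last-Modified, Content-Type
        with open(local_path, "wb") as f:
            f.write(data)
        with open(local_path + ".headers", "w") as f:
            json.dump(headers, f, indent=2)

    # fetch("http://example.org/schema.rdf", "/srv/resources/example.org/schema.rdf")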

Meanwhile, applications produce their own metadata. For example, if I rate my pictures and music, that is very important information from the resource management point of view. I run scripts which combine GPS data and photo data to produce metadata about where photos were taken. The source of this metadata is the application, or the user with the application as his agent. Under the policy above, it would be inappropriate for this metadata to be stored with the original data, as it has different provenance. Therefore the rule for applications is to save the metadata they generate in an obvious place and in an accessible file format, such as RDF. The rule we end up with for metadata is just the same as for data, which is not surprising, as metadata is data.

It is the provenance of metadata which determines where it is stored in the local file system.

This leaves us with the question of how to merge the data from the different sources. This is the inter-application communication problem.

Many operating systems, and many RDF-based systems, use central repositories, such as the NeXTStep defaults database or the now infamous Windows registry. Why does the Windows registry figure in so many problems? Because it is a common repository shared by applications, and it gets out of step with the file system? Because it is hidden from the user? Because it is used as a way of communicating between different applications, but it is not clear who wrote what when? The latter makes it a security hole, a hook-point for viruses.

A sounder scheme, it seems, is to leave the definitive data where it is, or always be aware of where it came from. RDF systems are now being developed which are more aware of the provenance of data. But if we stick with the model that files are unpacked into a directory tree under the source from which they came, and an index is built over them using trust rules, then at least the index can be kept in sync with reality, by resynchronizing at any time.

Remembering the csh "rehash" command, we know it is better to have a notification-based system for such regeneration than a polling-based system, if the infrastructure supports it.

File systems which allow notification to be delivered on changes will allow indexes to be updated in real time. Systems which use make (or equivalent) will also be able to propagate changes, but in pull mode rather than push mode.
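
As a sketch of the push mode, using the third-party watchdog package as one example of file system change notification (the "index" here is just a set of paths, and the watched directory is hypothetical):

    # Sketch of push-mode index maintenance: update the index when the
    # file system reports a change, instead of polling.
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class Reindexer(FileSystemEventHandler):
        def __init__(self, index):
            self.index = index                   # e.g. a set of known resource paths

        def on_created(self, event):
            if not event.is_directory:
                self.index.add(event.src_path)   # new resource appeared

        def on_deleted(self, event):
            self.index.discard(event.src_path)   # resource went away

    index = set()
    observer = Observer()
    observer.schedule(Reindexer(index), "/srv/resources", recursive=True)  # hypothetical tree
    observer.start()
    # ... later: observer.stop(); observer.join()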

Practical deployment

The rules above can guide application and operating system designers. In the mean time, can we usefully build better resource management systems on top of existing operating systems and applications?

One way is to build consistent databases of the data which is available, and to make rules which will check or determine policy as regards backups, availability, etc. The author has played with the extraction of metadata from make and from fink (essentially Debian) dependency data.
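
In that spirit, here is a sketch of pulling package-to-dependency edges out of dpkg (as used by Debian and fink); the format string is standard dpkg-query usage, and the parsing is deliberately simplistic.

    # Sketch: extract package dependency edges from dpkg so that they can
    # be fed into a resource-management index.
    import subprocess

    def dpkg_dependencies():
        out = subprocess.run(
            ["dpkg-query", "-W", "-f", "${Package}\\t${Depends}\\n"],
            capture_output=True, text=True, check=True,
        ).stdout
        deps = {}
        for line in out.splitlines():
            package, _, depends = line.partition("\t")
            edges = []
            for clause in depends.split(","):
                clause = clause.strip()
                if clause:
                    # "libfoo (>= 1.2) | libbar" -> take the first alternative's name
                    edges.append(clause.split("|")[0].split()[0])
            deps[package] = edges
        return deps

    # deps = dpkg_dependencies()   # {"some-package": ["libc6", ...], ...} on a dpkg system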

Another way is to infiltrate existing installation systems with hooks to allow not only the extraction of metadata but also the installation and deinstallation of modules.

Conclusion

The system described is a metadata-driven system to allow the owner of a computer system to manage the availability and reliability of information resources as a function of the reason for which they are needed, and of the amount of storage space needed.

In practice projects need information which is installed in many different ways.


References

Debian Packaging
The Debian packaging system has centralized non-URI naming, and centralized distribution of packaging metadata. It does a good job of dependency and version management. It doesn't seem to remember what it was you originally asked for. Available on Mac OS X as fink.
red_import
Daniel Kretch's RDF-based automatic importer for Python. Import is explicit, but the RDF URI of a package is used, and RDF data is accessed to find out how to install the module. (@@Need a decent link to this). Daniel Kretch is the rdflib author.
RPM
Packaging system used by Red Hat Linux. Compare Debian Packaging.
SoftwarePackaging
Wiki page on the subject, largely a collection of references.
ZeroInstall
This is a cool toolkit (it includes a kernel patch) for Linux which does just this, in that it makes the cache and installation the same thing. Totally transparent swapping in of stuff on the Web. Very nice. What it doesn't do is manage what disk space you have allocated. The most important issue here is that you can't ask for offline availability. It really assumes constant connection.

Up to Design Issues

Tim BL