A Standardized Packaging System for Decentralized Unix Distributions

Preface

It is important to keep in mind while reading this that the ideas expressed are those of a single individual over the course of a couple months, not of a standards body over the course of years. It is a version 0.1 draft and nothing herein should be taken as final. The document is open to critique, recommendations, additions, changes, and modifications. It is ultimately envisioned as a community process, so feel free to contact the author with contributions at your leisure.

Introduction

The Unix operating system has been receiving considerable public attention over the past few years, due in large part to the maturation of open-source Unix implementations such as Linux and FreeBSD. As a result of this growing interest, the numbers of users, developers, and available software packages have also increased, and the environment has grown considerably richer. However, because of the somewhat anarchic nature of open-source development, the same fertile ground that has given rise to Unix popularity has also been the source of some of its most significant troubles. Notably, the frequent lack of coordination between independent development groups has resulted in software releases that sometimes need considerable fine-tuning before they will work together. And because open-source software tends to be far more modular than closed software, it is at times rather difficult to find a small but essential component when building or repairing a system. In fact, the task of finding the pieces and making them coexist generally requires the effort of an entire group to maintain the "distributions" that represent a complete, working environment.
The distribution maintainers do a reasonably adequate job under the circumstances, but the distributions are far from perfect. Broken packages and hardware incompatibilities abound, and it often takes a fair amount of tweaking on the part of the users to get a working system. Unfortunately, it is possible that progress in this area is reaching its limits under the current Unix file hierarchy. Historically, Unix wasn't designed as a decentralized, open-source development environment; it was the product of a single vendor with a proprietary code base. And for decades the Unix clones that arose were also monolithic in this way. Individuals and groups might have written software packages to be installed atop the core system, but rarely would a group duplicate core functionality that was provided by the vendor. But now the landscape has changed, and virtually every component of the open-source Unices is an independent development. The foundation simply hasn't scaled to meet the new demands.

The Need for Change

Before getting into the proposed solution, package modularization, it might first be useful to convince the skeptics that something really needs to be done.
Traditionally, Unix files were stored according to usage: all system binaries would be kept in one location, all configuration files in another, all libraries in yet another, and so on. This worked fine when there were only a handful of independent groups producing software and significant changes took place over years rather than weeks. The model is hopelessly broken, however, in an era of rapid modular open-source development, where the numerous components of an operating system are developed at different paces, most with little attention paid to the pace of any others. If the community development model is to operate at maximum efficiency, it needs an infrastructure that better accommodates its needs.
To the critics who would voice cries of heresy at the notion of breaking tradition, take a moment to consider the current state of affairs on a typical Unix system:
The list could continue with more examples of the problems of the current system, but hopefully this is enough to convince those resistant to change that the hurdles faced are practically insurmountable given the current foundation. In the early days of a few corporate vendors providing most of the software, none of the above would have been issues. But for a decentralized system relying on the efforts of thousands of uncoordinated independent efforts, the current Unix file hierarchy is wholly inadequate.

A Modular Solution

The modular solution proposed is to create a package hierarchy in which every software package has a unique location (kind of like a namespace for you XML folks :). The proposed schema for this is:
        /pkg/group/name/version/distributor/distid/.pkg
      
The following is a concrete example of a translation for an existing application and distribution:
        /pkg/gnu/gnome/gnumeric/0.64/debian/i686-source/.pkg
      
The abstractions are defined as follows:
[The peculiar-looking .pkg directory was chosen for a good reason which I haven't had time to explain yet. Explanation will be added shortly.]
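
As an illustration of how a tool might read such a location, here is a minimal Python sketch that splits a path into the schema's fields. The names `PackagePath` and `parse_package_path` are hypothetical. The sketch also assumes that the group may span more than one directory level (which is one way to read the gnumeric example above) and that the last four components before .pkg are always name, version, distributor, and distid.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PackagePath(NamedTuple):
    """The fields of /pkg/group/name/version/distributor/distid/.pkg."""
    group: str
    name: str
    version: str
    distributor: str
    distid: str


def parse_package_path(path: str) -> PackagePath:
    """Split a package location into its schema fields (hypothetical helper)."""
    parts = list(PurePosixPath(path).parts)
    # For an absolute path, parts[0] is "/"; the hierarchy must start at /pkg.
    if parts[:2] != ["/", "pkg"]:
        raise ValueError(f"not under /pkg: {path}")
    fields = parts[2:]
    # The trailing .pkg directory is optional in this sketch.
    if fields and fields[-1] == ".pkg":
        fields = fields[:-1]
    if len(fields) < 5:
        raise ValueError(f"too few components: {path}")
    # Assumption: everything before the last four components is the group.
    *group, name, version, distributor, distid = fields
    return PackagePath("/".join(group), name, version, distributor, distid)
```

Under these assumptions, the gnumeric path above would parse with group "gnu/gnome", name "gnumeric", version "0.64", distributor "debian", and distid "i686-source".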

Peer Distribution

Because the package modification and creation process is significantly simplified through modularity, we should see people increasingly motivated to contribute personalized distributions of various software packages. If this is the case, it would follow that a wider variety of specially-configured software would generally be available, and that obscure systems would have an easier time finding compatible software. Individual competition is also fostered in this model, as one is not limited to searching for either a "debian" package or a "redhat" package, but instead will have a sea of options that allow the best individual modules to be selected, rather than the set of modules distributed under the best brand name. Reducing the distributor to merely a field in the hierarchy (1) encourages the interoperability of software packages from multiple distributors, (2) levels the playing field for individuals or small groups to make significant distribution contributions, and (3) discourages OS fragmentation by relieving end users of having to choose between one distributor and another.
It will be interesting to see, however, how popular such a model becomes amongst the major distributions. Shareholder confidence may falter if a vendor can't make some guarantees of generating revenue through vendor lock-in, and conceit may keep the non-profits from reducing themselves to a mere field in the hierarchy. In the long run, however, as long as there are motivated individuals who see the value in an open, standardized package distribution schema, the cause will not be lost. Distributors who resist adapting to the streamlined model should eventually fall out of favor as the community effort ultimately produces a better product.

Standardized Directories

As previously mentioned, the .pkg directory is the repository for the package's files. An example .pkg directory might contain the following:
      .../.pkg/bin
      .../.pkg/include
      .../.pkg/info
      .../.pkg/lib
      .../.pkg/man
    
Though there is no way to mandate the directory names in a decentralized system, it is important that developers make an attempt to follow unenforced standards for their directory choices. The following guidelines are recommended: Obviously not all of the directories will be used by every program, and unfortunately there are numerous programs already in existence that use directories other than these for the same purposes. However, in the long run it is likely that distributions that follow standards will come to be more popular and natural selection will weed out those that don't.

Linking it all Together

If you're wondering what kind of chaos this is going to introduce to one's PATH variable, the answer will be a pleasant surprise: You need only one entry in the path. By standardizing the name for the executable directory (bin), one can create a single bin directory on the system, and maintain references to the different versions through symbolic links. Changing the system-wide default version of a program would not mean changing a hairy PATH variable; it would simply mean redirecting the links in the system bin directory through a simple scripted interface. Furthermore, since you'd be operating on entire packages, the effect is also extended to any man pages, libraries, class files, etc, that are included with the package. Switching between two versions of the same package becomes a one-line operation, and there's even the option of easily managing the use of intermingled files from two different versions.
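
The "simple scripted interface" is not specified here, so the following Python sketch is only one way it might look. The function name `activate_package`, the `system_root` parameter, and the `STANDARD_DIRS` tuple (taken from the example .pkg listing earlier) are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the standardized directory names from the example .pkg listing.
STANDARD_DIRS = ("bin", "include", "info", "lib", "man")


def activate_package(pkg_dir: Path, system_root: Path) -> list[Path]:
    """Link every file of one package into the shared system directories.

    pkg_dir is a package's .pkg directory; system_root holds the single
    bin, lib, man, ... directories that PATH and friends point at.
    """
    pkg_dir = Path(pkg_dir).absolute()
    created = []
    for sub in STANDARD_DIRS:
        src_root = pkg_dir / sub
        if not src_root.is_dir():
            continue  # not every package uses every directory
        for src in sorted(p for p in src_root.rglob("*") if not p.is_dir()):
            link = Path(system_root) / sub / src.relative_to(src_root)
            link.parent.mkdir(parents=True, exist_ok=True)
            if link.is_symlink():
                link.unlink()  # redirect a link left by another version
            elif link.exists():
                raise FileExistsError(f"refusing to replace real file: {link}")
            link.symlink_to(src)
            created.append(link)
    return created
```

Calling `activate_package` on the .pkg directory of a different version redirects the same links, which is the one-line switch described above. Because links are made per file, files from two versions can be intermingled by activating one package and then relinking selected files from another.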

Checksums

Because a peer distribution model comes with inherent risks, it should be standard practice to compute an MD5 checksum of every file in the package before archiving, and to always verify the checksums on extraction. In addition, a checksum should be computed for the archive file as a whole. In this way one is able to guarantee both integrity and searchability.
For example, consider the distributor Alice, who takes the "foo" source code, compiles it for machine X, and archives it for distribution. On archival, the checksum is computed for every file in the package and stored along with the rest of the files in the package. The tarball is then created, gets its own checksum computed, and is renamed to its checksum. Now Alice can put her archive into any public distribution pool (e.g. a peer-to-peer client such as Freenet or an anonymous FTP site), and as long as she can reliably distribute her search key (which contains the tarball checksum), users will be able to locate her unique archive amongst the others in the pool, as well as verify the integrity of the package once it is retrieved.
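
Alice's steps could be sketched in Python roughly as follows. The manifest file name `MD5SUMS`, the `.tar.gz` archive format, and the function names are assumptions made for illustration; the text above does not fix any of them.

```python
import hashlib
import tarfile
from pathlib import Path

MANIFEST = "MD5SUMS"  # hypothetical name for the per-file checksum list


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_package(pkg_dir: Path, out_dir: Path) -> Path:
    """Store per-file checksums, tar the package, name the tarball by its checksum."""
    files = sorted(p for p in pkg_dir.rglob("*")
                   if p.is_file() and p.name != MANIFEST)
    lines = [f"{md5_of(p)}  {p.relative_to(pkg_dir).as_posix()}" for p in files]
    (pkg_dir / MANIFEST).write_text("\n".join(lines) + "\n")
    tmp = out_dir / "package.tar.gz"
    with tarfile.open(tmp, "w:gz") as tar:
        tar.add(pkg_dir, arcname=".pkg")
    # The tarball's own checksum becomes its file name (the search key).
    final = out_dir / f"{md5_of(tmp)}.tar.gz"
    tmp.rename(final)
    return final


def verify_package(pkg_dir: Path) -> bool:
    """Recompute every checksum in the manifest of an extracted package."""
    for line in (pkg_dir / MANIFEST).read_text().splitlines():
        expected, rel = line.split("  ", 1)
        if md5_of(pkg_dir / rel) != expected:
            return False
    return True
```

A user who retrieves the tarball can compare its file name against `md5_of` of the downloaded file, extract it, and then run `verify_package` on the result. The same structure works with any other digest offered by `hashlib`.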

Recursive Linking

To make it as easy as possible to select a version of a package to work with, the system administrator may choose to implement "recursive linking". Recursive linking is the process of setting up symbolic links at successive levels of the hierarchy so that users can select packages by a symbolic name (e.g. "stable" or "recent") rather than by a version or build number. For example, consider the following paths:
        /pkg/gnu/bash/1.14/debian/i386-01
        /pkg/gnu/bash/2.04/debian/i386-01
        /pkg/gnu/bash/2.04/redhat/i386-01
        /pkg/gnu/bash/2.04/redhat/i686-01
      
The administrator has the option here to create symbolic links at points along the hierarchy that mean something to the casual user. For example, the admin may create the link .stable->2.04/debian/i386-01 at the version level, so that a user who only wishes to have a stable version of bash in her path need only send a request for "stable", without having to know anything about which version is being requested.
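
A minimal Python sketch of such links follows. The helper names `set_alias` and `resolve_alias` are hypothetical, and the leading dot on the link name follows the .stable example above.

```python
from pathlib import Path


def set_alias(level: Path, alias: str, target: str) -> Path:
    """Create or replace the link level/.<alias> -> target.

    target is a path relative to level; it may itself end in another
    alias link one level further down the hierarchy.
    """
    if not (level / target).is_dir():
        raise FileNotFoundError(f"no such package directory: {level / target}")
    link = level / f".{alias}"
    if link.is_symlink():
        link.unlink()  # repoint an existing alias
    link.symlink_to(target, target_is_directory=True)
    return link


def resolve_alias(level: Path, alias: str) -> Path:
    """Follow the alias, through any chained links, to the concrete directory."""
    return (level / f".{alias}").resolve(strict=True)
```

For instance, `set_alias(Path("/pkg/gnu/bash"), "stable", "2.04/debian/i386-01")` would create the link described above. An alias can also be chained, with a link at the version level pointing at another alias link inside 2.04, in which case `resolve_alias` follows both hops.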

Self Maintenance

With the kind of flexible, dynamic model that supports and even encourages multiple versions of the same package, a self-cleansing method must be implemented so that the user is not burdened with manually tracking down obsolete packages. Unix natively supports a number of ways to do this, including:
Any or all of these methods could be implemented, depending on how thorough the user wishes to be. The level of interaction is also arbitrary: for a novice user the entire process could be wrapped up in a cron job with no prompting at all, while an expert might want to be prompted to confirm each deletion.

Conclusion

Not ready for that yet ;)

Authors: