A Standardized Packaging System for Decentralized
Unix Distributions
Preface
It is important to keep in mind while reading this that
the ideas expressed are those of a single individual over the
course of a couple of
months, not of a standards body over the course of years. It is a
version 0.1 draft and nothing herein should be taken as
final. The document is open to critique, recommendations,
additions, changes, and modifications. It is ultimately
envisioned as a community process, so feel free to
contact
the author with contributions at your leisure.
Introduction
The Unix operating system has been receiving considerable
public attention over the past few years, due in large part to the
maturation of open-source Unix implementations such as
Linux and
FreeBSD. As a result of
this growing interest, the numbers of users, developers, and
available software packages have also increased, and the
environment has grown exponentially richer. However, because
of the somewhat
anarchic nature of open-source development, the same fertile
ground that has given rise to Unix popularity has also been
the source of some of its most significant troubles. Notably,
the frequent lack of coordination between independent development
groups has resulted in software releases that sometimes need
considerable fine-tuning before they will work together. And because
open-source software tends to be far more modular than closed
software, it is at times rather difficult to find a small
but essential component when building or repairing a system.
In fact, the task of finding the pieces and making them
coexist generally requires the effort of an entire group
to maintain the "distributions" that represent a complete,
working environment.
The distribution maintainers do a reasonably adequate job
under the circumstances, but the distributions are far from
perfect. Broken packages and hardware incompatibilities abound,
and it often takes a fair amount of tweaking on the part of
the users to get a working system. Unfortunately, it is possible
that progress in this area is reaching its limits under
the current Unix file hierarchy. Historically, Unix wasn't
designed as a decentralized, open-source
development environment; it was the product of a single vendor with a
proprietary code base. And for decades the Unix clones that arose
were also monolithic in this way. Individuals and groups might have
written software packages to be installed atop the core system, but
rarely would a group duplicate core functionality that was
provided by the vendor. But now the landscape has changed,
and virtually every component of the open-source Unices is an
independent development. And the foundation simply hasn't scaled
to meet the new demands.
The Need for Change
Before getting into the proposed solution, package
modularization, it might first be useful to convince the
skeptics that something really needs to be done.
Traditionally, Unix files were
stored according to usage: All system binaries would be kept
in one location, all configuration files in another, all libraries
in yet another, and so on. While it worked fine when
there were only a handful of independent groups producing software and
significant changes took place over years rather than weeks,
this model is hopelessly broken in an era of rapid modular
open-source development where the numerous components of an
operating system are being developed at different paces and
most with little attention paid to the pace of any others.
If the community development model is to operate at maximum
efficiency, it needs an infrastructure that better
accommodates its needs.
To the critics who would voice cries of heresy at the notion
of breaking tradition, take a moment
to consider the current state of affairs on a typical Unix system:
- Configuration
files are thrown together in a mindless heap under /etc, making
it a painstaking chore to determine which packages rely on which
configuration files. It may sound nice on the surface to have
all of your configuration files in one place, but when they are
given arcane names and contain little documentation, their
purpose is often difficult to determine for anyone not already
intimately familiar with the system. Furthermore,
developers have to take care not to choose a name that is already
used, and there is no effective way to avoid naming conflicts.
When a single vendor was in charge of everything under /etc this
was not a problem, but when hundreds of independent developers
are contributing to the system, the odds of a naming conflict
increase dramatically. To circumvent this problem, some developers
have taken to putting their configuration files under subdirectories
in /etc, but unfortunately there is little or no standardization to
this procedure and the chances of naming conflicts still exist.
The only reliable way to avoid this problem is to store
program configuration files with the packages, which (adding to
the chaos) is exactly what many packages have chosen to
do.
- System executables are scattered all over the place with
no apparent consistency. On a typical system you might
find a particular executable that you're looking for in
any one of /bin, /sbin, /opt/bin, /usr/bin, /usr/sbin, /usr/X11/bin,
/usr/bin/X11, /usr/ccs/bin, /usr/ucb/bin, /usr/local/bin,
/usr/local/sbin, as well as in subdirectories of a particular
package or in some random location according to the whims of
the system administrator(s). The reason for so many "official"
directories is really only a matter of
historical interest; the important point is that on most
systems the executables are organized with little regard
for logic and consequently can be a burden to
maintain.
- System libraries are generally tossed under /lib or
/usr/lib, which is convenient but has caused significant
problems when multiple library versions need to be supported.
Major and minor numbers are a reasonable attempt to solve
this problem, but the resultant structure is quite clumsy,
inelegant, and error-prone. A hierarchical organization would
achieve the same goal (along with adding other benefits), and
also unify the filesystem when implemented in conjunction
with the modularized hierarchy presented in this
document.
- Installation of multiple versions of system executables
is essentially impossible without accepting severe limitations.
How does one install multiple versions of a program and
have both simultaneously available? If version x of a program
is desired as the default because it offers enhanced
functionality, but a certain
script relies on version y, what is the solution? Currently
there is no standardized solution, and not even a good solution
that is in widespread use.
- Rollbacks can be a nightmare. If the installation of a new
package breaks something on the system, it can bring even
an experienced system administrator to his knees.
With the current file
organization, new packages are forced to either overwrite
previous files or come up with clumsy, unstandardized renaming
conventions (like the deplorably useless "bash2"). If
existing files get overwritten, what is one's recourse when
the new package is found to be broken? What if the new package
was a broken version of the package retrieval tools, or a
widely used essential system library?
The list could continue with more examples of the problems
of the current system, but hopefully this is enough to convince
those resistant to change that the hurdles faced are practically
insurmountable given the current foundation. In the early days
of a few corporate vendors providing most of the software,
none of the above would have been issues. But for
a decentralized system relying on thousands of uncoordinated,
independent efforts, the current Unix file
hierarchy is wholly inadequate.
A Modular Solution
The modular solution proposed is to create a package
hierarchy in which all software packages have a unique
location (kind of like a namespace, for you XML folks :).
The proposed schema for this is:
/pkg/group/name/version/distributor/distid/.pkg
The following is a concrete example of a translation for an
existing application and distribution:
/pkg/gnu/gnome/gnumeric/0.64/debian/i686-source/.pkg
The abstractions are defined as follows:
- pkg : The top-level directory under which all software
packages are stored.
- group : Zero or more directories that correspond
to the groups (if any) to which the package belongs.
- name : The name of the software package being
distributed.
- version : One or more entries that correspond to
the particular version of the application. Multiple
version directories are allowed for fine grained version
management (multiple version levels will serve a role similar
to major and minor numbers on system libraries).
- distributor : The person or group responsible for packaging
and distributing the software. The distributor may consist
of one or more entries.
- distid : The distributor ID for the package. The ID
may consist of one or more entries.
- .pkg : The directory under which all files are actually
stored.
[The peculiar-looking .pkg directory was chosen for a
good reason which I haven't had time to explain yet.
Explanation will be added shortly.]
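As a concrete sketch, the gnumeric example above can be laid out with nothing more than mkdir. In the snippet below a temporary directory stands in for the real /pkg root, and all names are illustrative:

```shell
# Lay out one package slot under the proposed schema.
# A temporary directory stands in for the real /pkg root.
PKG_ROOT=$(mktemp -d)

# /pkg/<group...>/<name>/<version>/<distributor>/<distid>/.pkg
slot="$PKG_ROOT/gnu/gnome/gnumeric/0.64/debian/i686-source/.pkg"
mkdir -p "$slot/bin" "$slot/lib" "$slot/man"
```

Because every distinguishing attribute is a path component, two distributors' builds of the same version can never collide.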
Peer Distribution
Because the package modification and creation process
is significantly simplified through modularity, we should
see people increasingly motivated to contribute personalized
distributions of various software packages.
If this is the case, it would follow that a wider variety
of specially-configured software would generally be available,
and that obscure systems would have an easier time finding
compatible software. Individual competition is also
fostered in this model, as one is not limited to searching
for either a "debian" package or a "redhat" package, but
instead will have a sea of options that allow the best
individual modules to be selected, rather than the set
of modules distributed under the best brand name.
Reducing the distributor to merely
a field in the hierarchy (1) encourages the interoperability
of software packages from multiple distributors, (2) levels
the playing field for individuals or small groups to make
significant distribution contributions, and (3) discourages
OS fragmentation by relieving end-users of having to choose
between one distributor and another.
It will be interesting to see, however, how popular
such a model becomes amongst the major distributions.
Shareholder confidence may falter if a vendor can't
make some guarantees of generating revenue through
vendor lock-in, and conceit may keep the non-profits
from reducing themselves to a mere field in the hierarchy.
In the long run, however, as long as there
are motivated individuals who see the value in an open,
standardized package distribution schema, the cause will
not be lost. Distributors who resist adapting to the
streamlined model should eventually fall out of favor
as the community effort ultimately produces a better
product.
Standardized Directories
As previously mentioned, the .pkg directory is the
repository for the package's files. An example .pkg
directory might contain the following:
.../.pkg/bin
.../.pkg/include
.../.pkg/info
.../.pkg/lib
.../.pkg/man
Though there is no way to mandate the directory names in
a decentralized system, it is important that developers make
an attempt to follow unenforced standards for their directory
choices. The following guidelines are recommended:
- bin : Executables (scripts or binaries) that
are intended to be called directly by a user
- class : Java class (.class) files
- conf : Program configuration files
- doc : Documentation files that are not in
man or info format
- include : C/C++ header files
- info : GNU info files
- java : Java source (.java) files
- javadoc : Javadoc (Java documentation files)
- lib : Libraries necessary for program
operation
- man : Unix man pages
- share : Files provided by the distribution
that should be accessible to users or programs, but do
not fit into one of the other categories
Obviously not all of the directories will be used by every
program, and unfortunately there are numerous programs already
in existence that use directories other than these for the
same purposes. However, in the long run it is likely that
distributions that follow standards will come to be more
popular and natural selection will weed out those that
don't.
Linking it all Together
If you're wondering what kind of chaos this is going to
introduce to your PATH variable, the answer is a pleasant
surprise: you need only one entry in the path. By standardizing
the name for the executable directory (bin), one can create
a single bin directory on the system, and maintain references
to the different versions through symbolic links. Changing
the system-wide default version of a program would not mean
changing a hairy PATH variable; it would simply mean redirecting
the links in the system bin directory through a simple scripted
interface. Furthermore, since you'd be operating on entire
packages, the effect is also extended to any man pages, libraries,
class files, etc, that are included with the package. Switching
between two versions of the same package becomes a one-line
operation, and there's even the option of easily managing the
use of intermingled files from two different versions.
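The mechanism described above can be sketched in a few lines of shell. The paths and the activate helper are hypothetical, and a scratch prefix stands in for the real filesystem root:

```shell
# Sketch of version switching via a symlink farm.
# All paths are hypothetical; $ROOT stands in for the system root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/bin"

# Two installed versions of the same package, each with a .pkg/bin.
for v in 1.14 2.04; do
    mkdir -p "$ROOT/pkg/gnu/bash/$v/debian/i386-01/.pkg/bin"
    echo "bash $v" > "$ROOT/pkg/gnu/bash/$v/debian/i386-01/.pkg/bin/bash"
done

# "Activating" a version just repoints links in the system bin.
activate() {
    pkgbin="$ROOT/pkg/gnu/bash/$1/debian/i386-01/.pkg/bin"
    for f in "$pkgbin"/*; do
        ln -sf "$f" "$ROOT/bin/$(basename "$f")"
    done
}

activate 1.14
activate 2.04   # one-line switch; $ROOT/bin/bash now resolves to 2.04
cat "$ROOT/bin/bash"   # → bash 2.04
```

Rollback is the same operation in reverse: if 2.04 turns out to be broken, `activate 1.14` repoints the links, and the old package, never having been overwritten, is immediately back in service.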
Checksums
Because a peer distribution model comes with inherent risks,
it should be standard practice to compute an MD5 checksum of all
files in the package before archiving, and to always verify the
checksums on extracting. In addition, a checksum should be
run on the overall archived file. In this way one is able
to guarantee both integrity and searchability.
For example, consider the distributor Alice who takes the
"foo" source code, compiles it for machine X, and archives
it for distribution. On archival the checksum is computed
for every file in the package and stored
along with the rest of the files in the package. The tarball
is then created, gets its own checksum computed, and is
renamed to its checksum. Now Alice can put her archive
into any public distribution pool (e.g. a peer-to-peer client
such as
Freenet
or an
anonymous ftp site), and as long as she can reliably
distribute her search key (which contains the tarball checksum),
users will be able to locate her unique archive amongst the
others in the pool, as well as verify the integrity of the
package once it is retrieved.
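Alice's steps can be sketched with standard tools (GNU md5sum and tar); every path and file name here is illustrative:

```shell
# Sketch of the archival workflow: per-file checksums, then an
# archive renamed to its own checksum. All names are examples.
work=$(mktemp -d)
mkdir -p "$work/foo/.pkg/bin"
echo "demo" > "$work/foo/.pkg/bin/foo"

# 1. Per-file checksums, stored inside the package itself.
( cd "$work/foo" && find .pkg -type f | xargs md5sum > MD5SUMS )

# 2. Archive, checksum the archive, rename it to its checksum.
tar -C "$work" -cf "$work/foo.tar" foo
sum=$(md5sum "$work/foo.tar" | awk '{print $1}')
mv "$work/foo.tar" "$work/$sum.tar"
```

The file name is itself the search key: a downloader recomputes the archive's checksum, compares it against the name, and then checks MD5SUMS after extraction.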
Recursive Linking
In order to maximize the ability to select a version
of a package to work with, the system administrator may choose
to implement "recursive linking".
Recursive linking is the process of setting up symbolic links
recursively up the hierarchy that will allow users to
select to use certain packages by a symbolic name (e.g.
"stable" or "recent") rather than a version or build number.
For example, consider the following paths:
/pkg/gnu/bash/1.14/debian/i386-01
/pkg/gnu/bash/2.04/debian/i386-01
/pkg/gnu/bash/2.04/redhat/i386-01
/pkg/gnu/bash/2.04/redhat/i686-01
The administrator has the option here to create symbolic links
at points along the hierarchy that
mean something to the casual user. For example, the admin may
create the link
.stable->2.04/debian/i386-01
at the version level so that a user only wishing to have a stable
version of bash in her path needs only send a request for
"stable", without having to know anything about which version
is being requested.
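Setting up such an alias takes a single ln invocation. In this sketch a temporary directory stands in for the real /pkg root and the builds are placeholders:

```shell
# Recursive-linking sketch: a ".stable" alias at the version level.
# A temporary directory stands in for the real /pkg root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/pkg/gnu/bash/1.14/debian/i386-01" \
         "$ROOT/pkg/gnu/bash/2.04/debian/i386-01"

# The admin blesses one concrete build; users ask for ".stable"
# and never need to know which version backs it.
ln -s 2.04/debian/i386-01 "$ROOT/pkg/gnu/bash/.stable"
```

Promoting a new release to stable is then a matter of repointing one link, with no effect on users who pinned a concrete version.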
Self Maintenance
With the kind of flexible, dynamic model that supports
and even encourages multiple versions of the same package,
a self-cleansing method must be implemented so that the user
is not burdened with manually tracking down obsolete packages.
Unix natively supports a number of ways to do this, including:
- Checking the last access time of files in the
package.
- Scanning the system for symbolic links to files
in the package.
- Checking the dynamic dependencies of system
executables to see if they rely on any shared libraries
in the package.
Any or all of these methods could be implemented, depending
on how thorough the user wished to be. Also, the level of
interaction is arbitrary: for a novice user the entire process
could be wrapped up in a cron job with no prompting at all,
while an expert might want constant prompting for confirmation
of what was being deleted.
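The second method, scanning for symbolic links into a package, can be sketched as follows. The in_use helper is hypothetical, and a scratch prefix stands in for the real system:

```shell
# Sketch of the symlink-scan check: a package directory with no
# system links resolving into it is a candidate for removal.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/bin" \
         "$ROOT/pkg/gnu/bash/1.14/debian/i386-01/.pkg/bin" \
         "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin"
touch "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin/bash"
ln -s "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin/bash" \
      "$ROOT/bin/bash"

# in_use PKGDIR: succeed if any system symlink resolves into PKGDIR.
in_use() {
    find "$ROOT/bin" -type l | while read -r l; do
        case "$(readlink "$l")" in "$1"*) echo used ;; esac
    done | grep -q used
}

in_use "$ROOT/pkg/gnu/bash/2.04" && echo "2.04: in use"
in_use "$ROOT/pkg/gnu/bash/1.14" || echo "1.14: removal candidate"
```

A real tool would also consult access times and dynamic dependencies (e.g. via ldd) before deleting anything, as the list above suggests.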
Conclusion
Not ready for that yet ;)
Authors: