A Standardized Packaging System for Decentralized
Unix Distributions
Preface
It is important to keep in mind while reading this that
the ideas expressed are those of a single individual over the
course of a couple of
months, not of a standards body over the course of years. It is a
version 0.1 draft and nothing herein should be taken as
final. The document is open to critique, recommendations,
additions, changes, and modifications. It is ultimately
envisioned as a community process, so feel free to
contact
the author with contributions at your leisure.
Introduction
The Unix operating system has been receiving considerable
public attention over the past few years, due in large part to the
maturation of open-source Unix implementations such as
Linux and
FreeBSD. As a result of
this growing interest, the numbers of users, developers, and
available software packages have also increased, and the
environment has grown exponentially richer. However, because
of the somewhat
anarchic nature of open-source development, the same fertile
ground that has given rise to Unix popularity has also been
the source of some of its most significant troubles. Notably,
the frequent lack of coordination between independent development
groups has resulted in software releases that sometimes need
considerable fine-tuning before they will work together. And because
open-source software tends to be far more modular than closed
software, it is at times rather difficult to find a small
but essential component when building or repairing a system.
In fact, the task of finding the pieces and making them
coexist generally requires the effort of an entire group
to maintain the "distributions" that represent a complete,
working environment.
The distribution maintainers do a reasonably adequate job
under the circumstances, but the distributions are far from
perfect. Broken packages and hardware incompatibilities abound,
and it often takes a fair amount of tweaking on the part of
the users to get a working system. Unfortunately, it is possible
that progress in this area is reaching its limits under
the current Unix file hierarchy. Historically, Unix wasn't
designed as a decentralized, open-source
development environment; it was the product of a single vendor with a
proprietary code base. And for decades the Unix clones that arose
were also monolithic in this way. Individuals and groups might have
written software packages to be installed atop the core system, but
rarely would a group duplicate core functionality that was
provided by the vendor. But now the landscape has changed,
and virtually every component of the open-source Unices is an
independent development. And the foundation simply hasn't scaled
to meet the new demands.
The Need for Change
Before getting into the proposed solution, package
modularization, it might first be useful to convince the
skeptics that something really needs to be done.
Traditionally, Unix files were
stored according to usage: All system binaries would be kept
in one location, all configuration files in another, all libraries
in yet another, and so on. While it worked fine when
there were only a handful of independent groups producing software and
significant changes took place over years rather than weeks,
this model is hopelessly broken in an era of rapid modular
open-source development where the numerous components of an
operating system are being developed at different paces and
most with little attention paid to the pace of any others.
If the community development model is to operate at maximum
efficiency, it needs an infrastructure that better
accommodates its needs.
To the critics who would voice cries of heresy at the notion
of breaking tradition, take a moment
to consider the current state of affairs on a typical Unix system:
- Configuration
files are thrown together in a mindless heap under /etc, making
it a painstaking chore to determine which packages rely on which
configuration files. It may sound nice on the surface to have
all of your configuration files in one place, but when they are
given arcane names and contain little documentation, their
purpose is often difficult to determine for anyone not already
intimately familiar with the system. Furthermore,
developers have to take care not to choose a name that is already
used, and there is no effective way to avoid naming conflicts.
When a single vendor was in charge of everything under /etc this
was not a problem, but when hundreds of independent developers
are contributing to the system, the odds of a naming conflict
increase dramatically. To circumvent this problem, some developers
have taken to putting their configuration files under subdirectories
in /etc, but unfortunately there is little or no standardization to
this procedure and the chances of naming conflicts still exist.
The only reliable way to avoid this problem is to store
program configuration files with the packages, which (adding to
the chaos) is exactly what many packages have chosen to
do.
- System executables are scattered all over the place with
no apparent consistency. On a typical system you might
find a particular executable that you're looking for in
any one of /bin, /sbin, /opt/bin, /usr/bin, /usr/sbin, /usr/X11/bin,
/usr/bin/X11, /usr/ccs/bin, /usr/ucb/bin, /usr/local/bin,
/usr/local/sbin, as well as in subdirectories of a particular
package or in some random location according to the whims of
the system administrator(s). The reason for so many "official"
directories is really only a matter of
historical interest; the important point is that on most
systems the executables are organized with little regard
for logic and consequently can be a burden to
maintain.
- System libraries are generally tossed under /lib or
/usr/lib, which is convenient but has caused significant
problems when multiple library versions need to be supported.
Major and minor numbers are a reasonable attempt to solve
this problem, but the resultant structure is quite clumsy,
inelegant, and error-prone. A hierarchical organization would
achieve the same goal (along with adding other benefits), and
also unify the filesystem when implemented in conjunction
with the modularized hierarchy presented in this
document.
- Installation of multiple versions of system executables
is essentially impossible without accepting severe limitations.
How does one install multiple versions of a program and
have both simultaneously available? If version x of a program
is desired as the default because it offers enhanced
functionality, but a certain
script relies on version y, what is the solution? Currently
there is no standardized solution, and not even a good solution
that is in widespread use.
- Rollbacks can be a nightmare. If the installation of a new
package breaks something on the system, it can bring even
an experienced system administrator to his knees.
With the current file
organization, new packages are forced to either overwrite
previous files or come up with clumsy, unstandardized renaming
conventions (like the deplorably useless "bash2"). If
existing files get overwritten, what is one's recourse when
the new package is found to be broken? What if the new package
was a broken version of the package retrieval tools, or a
widely used essential system library?
The list could continue with more examples of the problems
of the current system, but hopefully this is enough to convince
those resistant to change that the hurdles faced are practically
insurmountable given the current foundation. In the early days
of a few corporate vendors providing most of the software,
none of the above would have been issues. But for
a decentralized system relying on thousands of uncoordinated,
independent efforts, the current Unix file
hierarchy is wholly inadequate.
A Modular Solution
The modular solution proposed is to create a package
hierarchy in which all software packages have a unique
location (kind of like a namespace, for you XML folks :).
The proposed schema for this is:
/pkg/group/name/version/distributor/distid/.pkg
The following is a concrete example of a translation for an
existing application and distribution:
/pkg/gnu/gnome/gnumeric/0.64/debian/i686-source/.pkg
The abstractions are defined as follows:
- pkg : The top-level directory under which all software
packages are stored.
- group : Zero or more directories that correspond
to the groups (if any) to which the package belongs.
- name : The name of the software package being
distributed.
- version : One or more entries that correspond to
the particular version of the application. Multiple
version directories are allowed for fine grained version
management (multiple version levels will serve a role similar
to major and minor numbers on system libraries).
- distributor : The person or group responsible for packaging
and distributing the software. The distributor may consist
of one or more entries.
- distid : The distributor ID for the package. The ID
may consist of one or more entries.
- .pkg : The directory under which all files are actually
stored.
[The peculiar-looking .pkg directory was chosen for a
good reason which I haven't had time to explain yet.
Explanation will be added shortly.]
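As a concrete sketch, the gnumeric example above can be laid out with nothing more than mkdir. In the snippet below a temporary directory stands in for the real /pkg root, and all names are illustrative:

```shell
# Lay out one package slot under the proposed schema.
# A temporary directory stands in for the real /pkg root.
PKG_ROOT=$(mktemp -d)

# /pkg/<group...>/<name>/<version>/<distributor>/<distid>/.pkg
slot="$PKG_ROOT/gnu/gnome/gnumeric/0.64/debian/i686-source/.pkg"
mkdir -p "$slot/bin" "$slot/lib" "$slot/man"
```

Because every distinguishing attribute is a path component, two distributors' builds of the same version can never collide.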
Peer Distribution
Because the package modification and creation process
is significantly simplified through modularity, we should
see people increasingly motivated to contribute personalized
distributions of various software packages.
If this is the case, it would follow that a wider variety
of specially-configured software would generally be available,
and that obscure systems would have an easier time finding
compatible software. Individual competition is also
fostered in this model, as one is not limited to searching
for either a "debian" package or a "redhat" package, but
instead will have a sea of options that allow the best
individual modules to be selected, rather than the set
of modules distributed under the best brand name.
Reducing the distributor to merely
a field in the hierarchy (1) encourages the interoperability
of software packages from multiple distributors, (2) levels
the playing field for individuals or small groups to make
significant distribution contributions, and (3) discourages
OS fragmentation by relieving end-users of having to choose
between one distributor and another.
It will be interesting to see, however, how popular
such a model becomes amongst the major distributions.
Shareholder confidence may falter if a vendor can't
make some guarantees of generating revenue through
vendor lock-in, and conceit may keep the non-profits
from reducing themselves to a mere field in the hierarchy.
In the long run, however, as long as there
are motivated individuals who see the value in an open,
standardized package distribution schema, the cause will
not be lost. Distributors who resist adapting to the
streamlined model should eventually fall out of favor
as the community effort ultimately produces a better
product.
Standardized Directories
As previously mentioned, the .pkg directory is the
repository for the package's files. An example .pkg
directory might contain the following:
.../.pkg/bin
.../.pkg/include
.../.pkg/info
.../.pkg/lib
.../.pkg/man
Though there is no way to mandate the directory names in
a decentralized system, it is important that developers make
an attempt to follow unenforced standards for their directory
choices. The following guidelines are recommended:
- bin : Executables (scripts or binaries) that
are intended to be called directly by a user
- class : Java class (.class) files
- conf : Program configuration files
- doc : Documentation files that are not in
man or info format
- include : C/C++ header files
- info : GNU info files
- java : Java source (.java) files
- javadoc : Javadoc (Java documentation files)
- lib : Libraries necessary for program
operation
- man : Unix man pages
- share : Files provided by the distribution
that should be accessible to users or programs, but do
not fit into one of the other categories
Obviously not all of the directories will be used by every
program, and unfortunately there are numerous programs already
in existence that use directories other than these for the
same purposes. However, in the long run it is likely that
distributions that follow standards will come to be more
popular and natural selection will weed out those that
don't.
Linking it all Together
If you're wondering what kind of chaos this is going to
introduce to your PATH variable, the answer is a pleasant
surprise: you need only one entry in the path. By standardizing
the name for the executable directory (bin), one can create
a single bin directory on the system, and maintain references
to the different versions through symbolic links. Changing
the system-wide default version of a program would not mean
changing a hairy PATH variable; it would simply mean redirecting
the links in the system bin directory through a simple scripted
interface. Furthermore, since you'd be operating on entire
packages, the effect is also extended to any man pages, libraries,
class files, etc, that are included with the package. Switching
between two versions of the same package becomes a one-line
operation, and there's even the option of easily managing the
use of intermingled files from two different versions.
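The mechanism described above can be sketched in a few lines of shell. The paths and the activate helper are hypothetical, and a scratch prefix stands in for the real filesystem root:

```shell
# Sketch of version switching via a symlink farm.
# All paths are hypothetical; $ROOT stands in for the system root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/bin"

# Two installed versions of the same package, each with a .pkg/bin.
for v in 1.14 2.04; do
    mkdir -p "$ROOT/pkg/gnu/bash/$v/debian/i386-01/.pkg/bin"
    echo "bash $v" > "$ROOT/pkg/gnu/bash/$v/debian/i386-01/.pkg/bin/bash"
done

# "Activating" a version just repoints links in the system bin.
activate() {
    pkgbin="$ROOT/pkg/gnu/bash/$1/debian/i386-01/.pkg/bin"
    for f in "$pkgbin"/*; do
        ln -sf "$f" "$ROOT/bin/$(basename "$f")"
    done
}

activate 1.14
activate 2.04   # one-line switch; $ROOT/bin/bash now resolves to 2.04
cat "$ROOT/bin/bash"   # → bash 2.04
```

Rollback is the same operation in reverse: if 2.04 turns out to be broken, `activate 1.14` repoints the links, and the old package, never having been overwritten, is immediately back in service.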
Checksums
Because a peer distribution model comes with inherent risks,
it should be standard practice to compute an MD5 checksum of all
files in the package before archiving, and to always verify the
checksums on extracting. In addition, a checksum should be
run on the overall archived file. In this way one is able
to guarantee both integrity and searchability.
For example, consider the distributor Alice who takes the
"foo" source code, compiles it for machine X, and archives
it for distribution. On archival the checksum is computed
for every file in the package and stored
along with the rest of the files in the package. The tarball
is then created, gets its own checksum computed, and is
renamed to its checksum. Now Alice can put her archive
into any public distribution pool (e.g. a peer-to-peer client
such as
Freenet
or an
anonymous ftp site), and as long as she can reliably
distribute her search key (which contains the tarball checksum),
users will be able to locate her unique archive amongst the
others in the pool, as well as verify the integrity of the
package once it is retrieved.
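Alice's steps can be sketched with standard tools (GNU md5sum and tar); every path and file name here is illustrative:

```shell
# Sketch of the archival workflow: per-file checksums, then an
# archive renamed to its own checksum. All names are examples.
work=$(mktemp -d)
mkdir -p "$work/foo/.pkg/bin"
echo "demo" > "$work/foo/.pkg/bin/foo"

# 1. Per-file checksums, stored inside the package itself.
( cd "$work/foo" && find .pkg -type f | xargs md5sum > MD5SUMS )

# 2. Archive, checksum the archive, rename it to its checksum.
tar -C "$work" -cf "$work/foo.tar" foo
sum=$(md5sum "$work/foo.tar" | awk '{print $1}')
mv "$work/foo.tar" "$work/$sum.tar"
```

The file name is itself the search key: a downloader recomputes the archive's checksum, compares it against the name, and then checks MD5SUMS after extraction.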
Recursive Linking
In order to maximize the ability to select a version
of a package to work with, the system administrator may choose
to implement "recursive linking".
Recursive linking is the process of setting up symbolic links
recursively up the hierarchy that will allow users to
select to use certain packages by a symbolic name (e.g.
"stable" or "recent") rather than a version or build number.
For example, consider the following paths:
/pkg/gnu/bash/1.14/debian/i386-01
/pkg/gnu/bash/2.04/debian/i386-01
/pkg/gnu/bash/2.04/redhat/i386-01
/pkg/gnu/bash/2.04/redhat/i686-01
The administrator has the option here to create symbolic links
at points along the hierarchy that
mean something to the casual user. For example, the admin may
create the link
.stable->2.04/debian/i386-01
at the version level so that a user only wishing to have a stable
version of bash in her path needs only send a request for
"stable", without having to know anything about which version
is being requested.
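Setting up such an alias takes a single ln invocation. In this sketch a temporary directory stands in for the real /pkg root and the builds are placeholders:

```shell
# Recursive-linking sketch: a ".stable" alias at the version level.
# A temporary directory stands in for the real /pkg root.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/pkg/gnu/bash/1.14/debian/i386-01" \
         "$ROOT/pkg/gnu/bash/2.04/debian/i386-01"

# The admin blesses one concrete build; users ask for ".stable"
# and never need to know which version backs it.
ln -s 2.04/debian/i386-01 "$ROOT/pkg/gnu/bash/.stable"
```

Promoting a new release to stable is then a matter of repointing one link, with no effect on users who pinned a concrete version.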
Self Maintenance
With the kind of flexible, dynamic model that supports
and even encourages multiple versions of the same package,
a self-cleansing method must be implemented so that the user
is not burdened with manually tracking down obsolete packages.
Unix natively supports a number of ways to do this, including:
- Checking the last access time of files in the
package.
- Scanning the system for symbolic links to files
in the package.
- Checking the dynamic dependencies of system
executables to see if they rely on any shared libraries
in the package.
Any or all of these methods could be implemented, depending
on how thorough the user wished to be. Also, the level of
interaction is arbitrary: for a novice user the entire process
could be wrapped up in a cron job with no prompting at all,
while an expert might want constant prompting for confirmation
of what was being deleted.
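The second method, scanning for symbolic links into a package, can be sketched as follows. The in_use helper is hypothetical, and a scratch prefix stands in for the real system:

```shell
# Sketch of the symlink-scan check: a package directory with no
# system links resolving into it is a candidate for removal.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/bin" \
         "$ROOT/pkg/gnu/bash/1.14/debian/i386-01/.pkg/bin" \
         "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin"
touch "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin/bash"
ln -s "$ROOT/pkg/gnu/bash/2.04/debian/i386-01/.pkg/bin/bash" \
      "$ROOT/bin/bash"

# in_use PKGDIR: succeed if any system symlink resolves into PKGDIR.
in_use() {
    find "$ROOT/bin" -type l | while read -r l; do
        case "$(readlink "$l")" in "$1"*) echo used ;; esac
    done | grep -q used
}

in_use "$ROOT/pkg/gnu/bash/2.04" && echo "2.04: in use"
in_use "$ROOT/pkg/gnu/bash/1.14" || echo "1.14: removal candidate"
```

A real tool would also consult access times and dynamic dependencies (e.g. via ldd) before deleting anything, as the list above suggests.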
Conclusion
Not ready for that yet ;)
Authors: