Trove Design Document
by Eric S. Raymond
$Id: trove-design.m4,v 1.16 1998/06/25 15:49:23 esr Exp $
Trove is a next-generation Internet software archiving facility, intended
to supplant the classical FTP-tree-with-decorations model. This
document describes the history, design, architecture, and user interface
of Trove. It is a work in progress, intended both to guide
implementation and to document the project.
Introduction
Why Trove?
The `classical' model of Internet software archive (exemplified by , WWW frosting on an FTP
cake) is no longer adequate to the increasing size and evolutionary
speed of the open-source community. It eats too much maintainer time;
the classification/search mechanisms are woefully weak; and the
package namespace has no collision detection.
One of us (Eric Raymond) had been Sunsite's principal maintainer
for more than a year before Trove got started. Eric wrote the tool, which does about
as good a job as possible of automating away the scutwork under the
present system. It's not good enough. The amount of maintainer time
Sunsite requires is rising to the point where the archive is not
sustainable. On present trends, Eric thinks Sunsite's system (or its
maintainers) will collapse by the end of 1998.
Some prominent Python people (including Ken Manheimer, Andrew
Kuchling, and Guido van Rossum) had realized for a while that they were
facing similar problems in the future of the Python archive, and had begun
discussing a redesign they thought of as the `locator' project.
The concept of the Trove project was originally floated by Eric
Raymond in early April 1998. Within a week, he was approached by
Guido van Rossum about joining forces. By the end of April, when the
project and the Trove web pages were officially launched, principals
included Ken Manheimer and Andrew Kuchling of the Python Software
Activity. Ken Manheimer proposed the name `Trove'. John Cowan
provided valuable expertise in database design and IR pragmatics.
Terminology
For purposes of this document, a resource is a file such as a
source or binary archive, an RPM or Debian installable package, a
document, etc. A resource may have associated metadata
(such as a description of the resource).
Related resources will be grouped into a package, which will
have associated metadata of its own (including but not limited to
author's name, the project home page location, etc.).
The metadata exists to provide a handle on packages and resources,
making them discoverable through searching and browsing facilities.
Resources may also have associated metadata of their own.
A search is any selection operation that returns a subset of
the archive metadata.
A site ring is a collection of Trove sites that mirror one
another's metadata (so that a search of any is effectively a search of all).
Objectives and Architecture
Objectives
Primary Objectives
CONTRIBUTOR-DRIVEN: Minimize the need for intervention by archive
maintainers, so the system scales up to the capacity of the
automation, rather than the availability of maintainers.
SEARCHABLE: Support access to packages through a rich,
user-friendly keyword and text-search-based interface, rather than
topic directories.
ENABLING: the design should be enabling rather than restrictive -- it
should not force use of a single interface or server that might become
a performance or (more importantly) a conceptual bottleneck.
LOCATION-INDEPENDENCE: the metadata representation and Trove tools
should be indifferent to where resources are actually stored.
RICH METADATA: Per-package metadata should have at least the
descriptive power of the best-of-breed installable package format, which
means RPM.
NOTIFICATION: Anyone should be able to sign up to be notified when a
package's resources or its metadata are updated.
MIRRORABILITY: It must be possible for an entire Trove site (resources
and metadata both) to be mirrored for load-sharing purposes.
DISTRIBUTOR-FRIENDLINESS: One of the deliverables should be a tool or
access mode that collects copies of all resources and metadata turned
up by a given search, so that CD-ROM distributors can make
distributable snapshots of the archive or subsets of it.
CONFIGURABILITY: Full configurability of things like keyword categories, so the
software can be used for multiple archives with different
policies (in particular, both son-of-Sunsite and the Python archive).
SCALABILITY: Must scale well, up to Sunsite's level of traffic and beyond.
Verifying this scalability before releasing will be important.
Secondary Objectives
PERFORMANCE: It would be a good idea (for performance) if running CGIs
was only required for searching and for modifying the database, and
everything else was available as static HTML files.
AUTHENTICATION: Strong authentication for packages and package
updates, like what Debian does.
META-ARCHIVE: Meta-archive functions -- queries to one Trove service
may also be forwarded automatically to other Trove services.
EMAIL: Support metadata updates by email to a robot.
CRAWLER: Support an optional `trusted remote metadata' field in the
metadata and write a crawler that polls these for metadata updates.
Blue Sky
DEPENDENCIES: Teach Trove to extract inter-resource dependencies by
analyzing binaries. Long-term project!
Architectural Implications
To achieve the CONTRIBUTOR-DRIVEN objective, submissions and updates
will normally be done through a Web form with upload capability.
Maintaining metadata will be the responsibility of each package's
authors and maintainers.
The ENABLING objective implies that at least package resources (if not the
metadata) should be directly accessible via FTP or the Web.
The LOCATION-INDEPENDENCE objective implies that all resource pointers
in metadata are actually URLs.
The ENABLING and LOCATION-INDEPENDENCE objectives together require
that the Trove data architecture must have a clean separation between
two parts: the catalog, a database holding package metadata,
and the archive, a local FTP/Web tree holding some (but not necessarily
all) of the resources pointed to by the catalog.
The ENABLING and PERFORMANCE objectives further imply that as much
as possible of the catalog view should be available through unmediated
Web and FTP access into the archive. This implies making HTML and
plaintext versions of package metadata available in the archive,
updated automatically when the master copy in the catalog database
changes.
To achieve RICH METADATA, we must roughly capture RPM's annotation
semantics. See the appendix on .
The NOTIFICATION objective implies that each package's metadata must include a
mailing list, and that the interface must support subscription and
unsubscription facilities.
The SCALABILITY requirement implies managing the metadata with
a real database capable of handling high transaction volumes.
For the ENABLING and EMAIL and CRAWLER objectives, we must define a
plain-text tag format for rendering metadata. We'll use this to (1)
represent the metadata in FTP-accessible files in the archive,
(2) define the required format for email submissions, and
(3) define the required format for trusted remote metadata.
The plain-text tag format will come up again, so it needs a name:
TRL, for Trove Request Language.
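The real TRL syntax is specified elsewhere in this document. Purely to
illustrate what a plain-text tag format buys us, here is a minimal Python
sketch that assumes an RFC822-style layout: one `Tag: value' pair per line,
with indented continuation lines. The tag names (Package, Description,
Resource), the sample package, and the function name parse_tagged_record
are hypothetical and are not part of TRL.

```python
def parse_tagged_record(text):
    """Parse 'Tag: value' lines into a dict mapping tag -> list of values.

    Assumed layout (NOT the real TRL grammar): one tag per line,
    continuation lines begin with whitespace, blank lines are ignored.
    """
    record = {}
    last_tag = None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        if line[0].isspace():
            # Continuation line: append to the most recent value.
            if last_tag is None:
                raise ValueError(f"line {lineno}: continuation without a tag")
            record[last_tag][-1] += " " + line.strip()
            continue
        # Split on the first colon only, so URL values survive intact.
        tag, sep, value = line.partition(":")
        if not sep or not tag.strip():
            raise ValueError(f"line {lineno}: expected 'Tag: value'")
        last_tag = tag.strip()
        record.setdefault(last_tag, []).append(value.strip())
    return record


# Hypothetical record, for illustration only.
sample = """\
Package: frobnitz
Description: A hypothetical package used
    only for this illustration.
Resource: ftp://example.org/pub/frobnitz/frobnitz-1.0.tar.gz
Resource: ftp://example.org/pub/frobnitz/frobnitz-1.0.rpm
"""
record = parse_tagged_record(sample)
```

A repeated tag accumulates into a list, which is one simple way to let a
package record name several resources. The same parser could serve all
three uses listed above, since archive files, email submissions, and
trusted remote metadata would share one format.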
Architecture
The foregoing objectives make the general architecture of the
system fairly clear. A Trove site will consist of the
following parts:
The catalog -- a database of metadata records, including URIs
pointing to resources.
The archive, a local directory tree containing resources
managed by the Trove software but independently FTP- and Web-accessible.
(Some Trove sites may not have an archive, instead being purely
registries of metadata and pointers.)
The shovel, a serializing front end that translates TRL
requests on its standard input into database actions. The shovel is
the only program that modifies the database directly. It's the
shovel's job to ensure transaction atomicity.
The librarian, a collection of web pages and CGIs
that mediates interactive access to the library (the catalog and
archives together) through Web browsers. The librarian manipulates
the database by making TRL service requests through the shovel program.
It may query the database directly.
The crawler, a program that periodically attempts to
update the library by polling maintainer sites specified in metadata.
The crawler makes TRL service requests through the shovel program.
(Some Trove sites may not have a crawler.)
The mailbot, a program that accepts email updates in TRL
format. The mail robot makes service requests through the shovel
program.
The structure of TRL, with an example, is discussed .
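The shovel's role as the single writer responsible for transaction
atomicity can be sketched in Python as follows. This is an illustration
under stated assumptions, not the planned implementation: SQLite (via the
standard sqlite3 module) stands in for the still-undecided database back
end, the one-table schema is hypothetical, and a request is modeled as a
list of (verb, name, description) tuples rather than as parsed TRL.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open a stand-in catalog with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS package ("
        " name TEXT PRIMARY KEY, description TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def apply_request(conn, actions):
    """Apply one request as a single transaction.

    Either every action in the request lands, or none does.
    Returns (ok, message).
    """
    try:
        with conn:  # commits on success, rolls back on any exception
            for verb, name, description in actions:
                if verb == "create":
                    # The primary key rejects package name collisions.
                    conn.execute(
                        "INSERT INTO package (name, description)"
                        " VALUES (?, ?)",
                        (name, description),
                    )
                elif verb == "update":
                    cur = conn.execute(
                        "UPDATE package SET description = ? WHERE name = ?",
                        (description, name),
                    )
                    if cur.rowcount == 0:
                        raise LookupError(f"no such package: {name}")
                else:
                    raise ValueError(f"unknown verb: {verb}")
    except (sqlite3.IntegrityError, LookupError, ValueError) as exc:
        return False, str(exc)
    return True, "ok"


conn = open_catalog()
first = apply_request(conn, [("create", "frobnitz", "first version")])
# The second request fails on its second action, so its first action
# (creating "gizmo") must be rolled back as well.
second = apply_request(
    conn,
    [("create", "gizmo", "new"), ("create", "frobnitz", "collision")],
)
names = [row[0] for row in conn.execute("SELECT name FROM package")]
```

Because the librarian, crawler, and mailbot would all funnel their writes
through one routine like apply_request, the atomicity rule lives in one
place instead of being reimplemented in each front end.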
Fundamental Types and Namespace Control
To reason about the design, we need to know what kinds of things
will be in the Trove database and how they are named (i.e., what
handles they can be retrieved by). Some of this has been touched
on in the section on terminology.
There are three different kinds of objects in the Trove universe.
These are:
Resource
A resource is `real' data, a source or binary archive or
document of the kind a Trove archive is intended to serve. In the
Trove universe, a resource is represented by a resource
record that must include a URL to where the resource actually
lives and may include other metadata (such as a description).
The name of a resource is the URL of the resource. Accordingly, any given
resource name always identifies exactly one resource.
Package
A package is a collection of resources tied together by a package
record. The associated resources may be the same program or
document in several different forms (such as source archive, binary
archive, installable package, etc.) or it may be a group of related
resources such as the individual components of a multiple-program
project.
Besides resource names, package records contain other metadata
intended to facilitate finding packages by topic or subject area,
including both a text description and controlled-vocabulary keywords
(discriminators).
The name of a package is an arbitrary identifier chosen by the package
record creator (its initial owner) and changeable by the package record
owner.
A package may have any number of resources associated with it. In general,
any given resource will only belong to one package, but exceptions are
harmless.
Person
A person record associates metadata
with an RFC822 email name/address pair. The metadata may include such
things as a home-page location, a PGP public key (as an optimization,
in order to make a public-key-server lookup on each submission
unnecessary), etc.
Person records exist so that Trove users can go from a package to its
maintainers to their home pages and other projects.
A person is named by the email address part of their name (which is unique).
All three kinds of objects are always explicitly created, modified,
and deleted, with a notification to interested parties on each action.
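As a rough illustration of the three object types and their naming
rules, here is a Python sketch using dataclasses. Only the naming rules
(URL for a resource, chosen identifier for a package, email address for a
person) come from the text above; every field name and all sample values
are hypothetical, and the real record layout belongs to the catalog
schema.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    url: str                 # a resource is named by its URL
    description: str = ""


@dataclass
class Person:
    email: str               # a person is named by the email address
    full_name: str = ""
    home_page: str = ""
    pgp_public_key: str = ""  # cached to avoid a key-server lookup


@dataclass
class Package:
    name: str                # arbitrary identifier chosen by the owner
    description: str = ""
    keywords: list = field(default_factory=list)         # discriminators
    resource_urls: list = field(default_factory=list)    # Resource names
    maintainer_emails: list = field(default_factory=list)  # Person names


# Hypothetical sample records.
tarball = Resource("ftp://example.org/pub/frobnitz/frobnitz-1.0.tar.gz")
author = Person("jrh@example.org", full_name="J. Random Hacker")
package = Package(
    name="frobnitz",
    description="A hypothetical package",
    resource_urls=[tarball.url],
    maintainer_emails=[author.email],
)
```

Note that the package refers to resources and people by name rather than
by object reference. That matches the idea that a package may point at
things that have no record of their own.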
The general policy on name validation is that references to
unregistered people and packages are not errors. Thus, maintainers of a
package need not be in the Person table as long as they have
syntactically valid email addresses; and package relations may refer
to packages by name that are not registered in Trove.
This implies that every creation of a Package or Person record needs
a global check to mark the previously dangling references it now
resolves, but that is an acceptable price for making the namespace
open rather than closed.
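One way to picture the open-namespace policy is the Python sketch below.
It is an assumption-laden illustration: the class and method names are
hypothetical, and instead of a literal global scan at creation time it
keeps an index of dangling references, which is a swapped-in technique
that yields the same result.

```python
class OpenNamespace:
    """Names may be referenced before they are registered."""

    def __init__(self):
        self.registered = set()
        self.dangling = {}  # unregistered name -> set of referrers

    def add_reference(self, referrer, target):
        """Record that `referrer` mentions `target`. Never an error."""
        if target not in self.registered:
            self.dangling.setdefault(target, set()).add(referrer)

    def register(self, name):
        """Create a record for `name`.

        Returns the sorted list of referrers whose references to
        `name` were dangling until now.
        """
        self.registered.add(name)
        return sorted(self.dangling.pop(name, set()))


ns = OpenNamespace()
ns.add_reference("frobnitz", "libgizmo")   # libgizmo is not registered yet
ns.add_reference("widgetkit", "libgizmo")
resolved = ns.register("libgizmo")
```

Here resolved lists the packages whose references were filled by the new
record, which is exactly the set that would need to be marked and,
presumably, notified.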
Issue: We know that package names will be unique per site. Are they
unique across all sites in the Trove ring? If not, how do we do
synchronization when rings merge? And how do crawlers know which
package they are responsible for?
Catalog architecture
The catalog will be stored in a database. The is available at the Trove
website.
Archive architecture
To make the rest of this document concrete, we need to specify an
organization for the archive part. Here it is:
Each project has a directory. The name of the directory is the
name of the project, without a version number (this is
so project directories can contain multiple versions). Observe
the implication that project names must be unique per Trove site.
Project directories may live directly under a per-site root, or (for
performance) under superdirectories which express some kind of hash on
the names. It is important for bare-FTP accessibility that this hash
be easy for human beings to calculate by inspection. Example:
terminfo's scheme of having each terminal type live in a
superdirectory named after the first character of the terminal type
name. Whether such a scheme is used, and what it is, is per-site policy.
Within each project's directory live all its associated local
resources. Other resources may live offsite (the catalog records
don't care, they use URIs for everything). The directory will also
contain TRL and HTML versions of the package's metadata, as files
named %%INDEX.TRL and index.html respectively. The former name
is chosen to sort as early as possible in an FTP directory listing
without including Unix shell metacharacters; the latter, to be the
page automatically displayed by a browser pointed at the directory.
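The archive layout above can be sketched in Python as follows. The hash
shown (first character of the project name, as in the terminfo example)
is only one possible per-site policy, and the function names, root path,
and sample project name are hypothetical.

```python
from pathlib import PurePosixPath


def project_directory(root, project_name, use_hash=True):
    """Return the project's directory under the per-site root.

    Assumed hash: a superdirectory named after the first character
    of the project name, in the style of terminfo. Sites may choose
    another scheme or none at all.
    """
    if not project_name or "/" in project_name:
        raise ValueError("project name must be a single path component")
    base = PurePosixPath(root)
    if use_hash:
        base = base / project_name[0]
    return base / project_name


def metadata_files(root, project_name, use_hash=True):
    """Return the paths of the TRL and HTML metadata files."""
    directory = project_directory(root, project_name, use_hash)
    return directory / "%%INDEX.TRL", directory / "index.html"


trl_path, html_path = metadata_files("/pub/trove", "frobnitz")
flat_dir = project_directory("/pub/trove", "frobnitz", use_hash=False)
```

A first-character hash is trivially computable by a person browsing over
bare FTP, which is the property the text above asks for.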
Librarian architecture
The librarian will be a set of HTML pages and CGIs that mediate
between users (including uploaders and maintainers) and the library.
It will be necessary for the librarian to maintain state through
multiple-form transactions. For discussion of the librarian design,
see the major section on below.
Mail-Robot and Crawler architecture
These will be programs that, essentially, translate metadata
submissions in TRL into actions on the archive. The only difference
between them will be that the email robot waits for input fed to it
through a mail alias, while the crawler looks for descriptions in
remote locations specified by metadata URIs.
In both cases, a parse error or package name collision or other
exception will generate email to the submitting party and to the contact
persons given in both the new and old metadata.
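The recipient rule for exception mail can be sketched in Python like
this. The function name and the sample addresses are hypothetical, and
comparing addresses case-insensitively to remove duplicates is an
assumption of this sketch, not a stated Trove policy.

```python
def exception_recipients(submitter, old_contacts, new_contacts):
    """Return the addresses to notify about a failed submission.

    The submitting party comes first, followed by the contact persons
    from the old and then the new metadata, with duplicates removed.
    Duplicate detection ignores case (an assumption of this sketch).
    """
    seen = set()
    recipients = []
    for address in [submitter, *old_contacts, *new_contacts]:
        cleaned = address.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            recipients.append(cleaned)
    return recipients


# Hypothetical addresses, for illustration only.
recipients = exception_recipients(
    "uploader@example.org",
    old_contacts=["maint@example.org", "Uploader@example.org"],
    new_contacts=["maint@example.org", "newmaint@example.net"],
)
```

Mailing the contacts from the old metadata as well as the new means that
an existing maintainer hears about a colliding or malformed submission
made under their package's name.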
Architecture Open Issues
What do we use as the database back end? Postgres95?
SOLID? MySQL? Something else?
User Interface Design