Comparing repository software for preserving personal digital archives

General findings

A number of general points can be made about Paradigm's experiences with DSpace and Fedora.

DSpace and Fedora are not equal

A more detailed comparison of DSpace and Fedora appears later in this chapter, but it is worth noting here that the two differ in fundamental ways that may help potential users decide which is more suitable for their needs. Although the two packages differ, both user communities participate in a wider 'repository community', which brings together users of many repository packages, such as EPrints and Greenstone, as well as DSpace and Fedora. This inter-community dialogue encourages basic interoperability and an exchange of ideas and practice.

DSpace

Background: developed jointly by MIT Libraries and Hewlett-Packard Labs between March 2000 and November 2002 to act as a repository for the intellectual output of research organisations.

Licence: BSD

Current version: 1.4.1 (since 7 December 2006)

Technology: DSpace is written in Java, and provides a Java Application Programming Interface (API), a web application that runs in Apache Tomcat and command line tools. An architecture diagram is available.

Data model: DSpace repositories create:

  • Communities (e.g. a university department), which can have Collections and Sub-Collections.
  • Collections and Sub-Collections are containers for grouping related Items; Collections can be part of multiple Communities.
  • Items can be part of multiple Collections and contain Bundles.
  • Bundles contain one or more Bitstreams (DSpace refers to digital files as Bitstreams).
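
The containment hierarchy above can be sketched in a few lines of Python. This is an illustrative model only, not DSpace's own Java API: the class and function names are hypothetical, Sub-Collections are left out for brevity, and the example data is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bitstream:
    """A single digital file (DSpace's term for a file)."""
    name: str


@dataclass
class Bundle:
    """A group of one or more Bitstreams."""
    name: str
    bitstreams: List[Bitstream] = field(default_factory=list)


@dataclass
class Item:
    """An Item contains Bundles and may sit in several Collections."""
    title: str
    bundles: List[Bundle] = field(default_factory=list)


@dataclass
class Collection:
    """A container for grouping related Items."""
    name: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Community:
    """For example, a university department."""
    name: str
    collections: List[Collection] = field(default_factory=list)


def count_bitstreams(community: Community) -> int:
    """Count distinct Bitstreams reachable from a Community.

    An Item may appear in more than one Collection, so Bitstreams
    are de-duplicated by object identity.
    """
    seen = set()
    for collection in community.collections:
        for item in collection.items:
            for bundle in item.bundles:
                for bitstream in bundle.bitstreams:
                    seen.add(id(bitstream))
    return len(seen)


# Hypothetical example: one Item shared by two Collections.
letter = Item(
    title="Letter to a constituent",
    bundles=[Bundle("ORIGINAL", [Bitstream("letter.doc"), Bitstream("letter.pdf")])],
)
department = Community(
    name="Example department",
    collections=[
        Collection("Correspondence", [letter]),
        Collection("Constituency papers", [letter]),
    ],
)
total = count_bitstreams(department)
```

The fixed four-level nesting is the point of the sketch: any archival arrangement has to be mapped onto these levels rather than modelled freely.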

Storage: Metadata is stored in a relational database management system (RDBMS); data is stored in a file system.

Version(s) tested: 1.3.2

Test-bed platform: SUSE Linux 10, Apache Tomcat, Java, PostgreSQL (strictly speaking, an object-relational DBMS).

Development priorities: unclear at present; a technical architecture group has been formed as part of the DSpace governance.

DSpace has been implemented by several institutions looking to develop repositories of simple objects, such as academic papers or e-theses, to enhance their accessibility to the research community. Membership of the DSpace community also includes institutions developing an interest in preservation and curating other kinds of materials, such as images and datasets.

DSpace is self-contained and straightforward; a usable system can be deployed out-of-the-box with relative ease. This simplicity is largely due to the fact that DSpace comes pre-configured with a standard user model, data model and workflows. In an archival context, however, the same ready-made simplicity undermines the utility of the software: its models and workflows are strongly biased towards an open access repository for academic output, which was the original purpose of the design, and are not well suited to the highly structured collections or complex objects that are commonly associated with personal digital archives.

The recommendations of the DSpace Architecture Review Group suggest that Version 2 of DSpace (no exact release date, but 2009 is probable) will bear more resemblance to Fedora, by providing better functionality for a range of contexts and activities, an extension framework for third party developers, and the ability to operate on a large scale.

Fedora

Background: developed at Cornell University and the University of Virginia with funding from the Andrew W. Mellon Foundation. Now established as Fedora Commons, a non-profit organisation. Recently awarded a $4.9M grant for further development from the Gordon and Betty Moore Foundation.

Licence: Fedora is available under the Educational Community License 1.0 (ECL); third party packages associated with its use are distributed under a variety of other licences.

Current version: 2.2 (since 19 January 2007)

Technology: Fedora is written in Java and runs as a web application in Apache Tomcat. It provides a number of open APIs that are exposed as SOAP and REST web services: Management API (API-M), Access API (API-A), Access-Lite API (API-A-Lite, also includes Search API), Management-Lite API (API-M-Lite) and Resource Index Search API. Fedora also supplies a client with a Graphical User Interface (GUI) and command line tools.

Fedora provides three local web services: Saxon XSLT Processor Local Service, FOP Local Service (for PDF Transformation) and the Image Manipulation Local Service.

The Fedora framework currently provides three services that interface with the Fedora repository service: the Generic Search Service (GSearch); the Directory Ingest Service (DirIngest), which has a GUI tool called SIP Creator for preparing submissions; and the OAI Provider Service (PROAI).

Data model: Fedora has three kinds of object: data, behaviour definition and behaviour mechanism.

  • Data Objects must contain an ID and Dublin Core metadata; they may also contain one or more datastreams (digital file), XML metadata of any kind and RDF XML metadata to describe relationships with other objects.
  • Behaviour definition and behaviour mechanism objects provide disseminators for one or more datastreams in data objects.
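
The rules for a data object described above can be sketched as a small Python model. This is an illustration under stated assumptions, not Fedora's own object format or API: the class names, field names and the example identifier are hypothetical, and behaviour definition and behaviour mechanism objects are not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Datastream:
    """A digital file or block of XML metadata held by a data object."""
    ds_id: str
    mime_type: str
    location: str  # inside the repository or an external reference


@dataclass
class DataObject:
    """A data object: ID and Dublin Core are required, the rest optional."""
    pid: str
    dublin_core: Dict[str, str]
    datastreams: List[Datastream] = field(default_factory=list)
    # (predicate, target ID) pairs standing in for RDF relationships
    relationships: List[Tuple[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.pid:
            raise ValueError("a data object must have an ID")
        if not self.dublin_core:
            raise ValueError("a data object must carry Dublin Core metadata")


# Hypothetical example: a diary with one file, linked to its parent series.
diary = DataObject(
    pid="example:diary-1",
    dublin_core={"title": "Diary, 1997"},
    datastreams=[Datastream("DS1", "application/pdf", "diary-1997.pdf")],
    relationships=[("isMemberOf", "example:series-3")],
)
```

Unlike the fixed DSpace hierarchy, nothing here dictates how objects are grouped; structure is expressed only through the relationships each object declares, which is why an implementing institution has to design its own content models.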

Storage: Metadata is stored in a relational database management system (RDBMS); data is stored in a file system within the repository or externally.

Test-bed platform: SUSE Linux 10, Apache Tomcat, Java, PostgreSQL.

Version(s) tested: 2.1, 2.1.1, 2.2.

Development priorities: the community has established a series of working groups: Preservation Services, Search Services, Workflow Services and Content Models for Datastreams and Disseminators. The envisaged framework includes services for preservation monitoring, event notification, etc., but it is unclear when these will be implemented.

Fedora was designed to be a repository for all materials and all purposes from the beginning, although it is fair to say that many early users were developing repositories for access purposes. The Fedora community has evolved alongside the DSpace community; several members are now using Fedora for preservation and for complex objects and highly structured collections. The Fedora community has demonstrated an interest in preservation functions with the formation of a preservation working group (established 2005) and some useful preservation-related features in the 2.2 release of the software (January 2007).

Fedora is more complex than DSpace because it is a repository architecture as much as a repository. It was designed to be flexible, so that users could employ any kind of data model; and to be extensible, so that users could add whatever clients or services they needed to the Fedora framework. As a Fedora repository matures, it is likely to use several web services - this creates a distributed service-oriented architecture system, as opposed to the self-contained system presented by DSpace. This flexibility has immense potential, but comes at a cost. The implementing institution must do more than basic installation, configuration and customisation of the software; it must be prepared to design its own user models, data models, workflows and tools, or to adopt them from comparable implementations within the Fedora community. This means that Fedora-based repositories can be very different to one another and strong analytical and programming skills may be needed by the repository team.

Fedora does provide a client and web interface out-of-the-box, but this basic install feels unpolished and incomplete; the system designers anticipated that adopters would design content models, services and interfaces particular to their varied implementation needs. The community has produced and published tools for some tasks, but the 'open access' origins of the repository movement mean that many relate to access rather than preservation. Some tools developed by the user community are not well documented for use by newcomers; others are not shared with the wider community at all. The availability of metadata and preservation tools is likely to improve as the combined experience of the Fedora repository community grows.

Scope and content of documentation

Comparing repository software packages is no small task for those new to repositories, as much of the supporting documentation is aimed at technical audiences (developers and systems administrators) rather than managerial/professional users (archivists and librarians) or end-users (researchers). The nature of the documentation makes learning how to install, configure and use the repositories, let alone evaluate them, time-consuming. Some of the problems Paradigm encountered include:

Documentation provides instructions for configuring features, such as 'authorisation' or 'search', but does not detail the benefits of such features in language that is readily accessible to those whose needs should determine the selection and customisation of repository systems. Much of the documentation aims to answer 'how' questions; some more attention to 'why' questions and relevant use cases would be beneficial. There is very little in the way of supporting information that could assist implementation decisions; information which describes the advantages and disadvantages of a given implementation, and its intended purpose, would be a helpful addition to the documentation of repository software.

The boundaries of repository software

There is no end-to-end repository system available and no clear-cut definition stipulating which functions and activities should be the responsibility of the repository software and which should be the responsibility of another entity. Some aspects of policy and procedure may form part of a manual process and need not be automated at all, while others may be automated but devolved to a service outside the repository software's remit. Some important, non-trivial parts of the preservation process take place outside repository software. One example is the assembly of the metadata required to support the digital archives into METS Archival Information Packages for submission to the repository system. Some functions and activities which would be desirable in preservation repositories, such as obsolescence monitoring and interoperability with external file format registries, are partly dependent on services external to the repository's organisation.
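
As a rough illustration of the packaging work that happens outside the repository software, the sketch below assembles a skeletal METS document with Python's standard library. It is a minimal sketch, not Paradigm's actual tooling: the function name, identifiers and file paths are hypothetical, the output is not validated against the METS schema, and a real Archival Information Package would also carry technical and preservation metadata.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use conventional prefixes when the tree is serialised.
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)
ET.register_namespace("dc", DC_NS)


def q(namespace: str, tag: str) -> str:
    """Return a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (namespace, tag)


def build_mets(title: str, file_paths: list) -> ET.Element:
    """Build a skeletal METS tree: descriptive metadata, files, structure."""
    mets = ET.Element(q(METS_NS, "mets"))

    # Descriptive metadata section wrapping a single Dublin Core title.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), ID="DMD1")
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), MDTYPE="DC")
    xml_data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(xml_data, q(DC_NS, "title")).text = title

    # File section listing each file, and a structural map pointing at them.
    file_sec = ET.SubElement(mets, q(METS_NS, "fileSec"))
    file_grp = ET.SubElement(file_sec, q(METS_NS, "fileGrp"))
    struct_map = ET.SubElement(mets, q(METS_NS, "structMap"))
    div = ET.SubElement(struct_map, q(METS_NS, "div"), DMDID="DMD1")

    for number, path in enumerate(file_paths, start=1):
        file_id = "FILE%d" % number
        file_el = ET.SubElement(file_grp, q(METS_NS, "file"), ID=file_id)
        ET.SubElement(
            file_el,
            q(METS_NS, "FLocat"),
            {"LOCTYPE": "URL", q(XLINK_NS, "href"): path},
        )
        ET.SubElement(div, q(METS_NS, "fptr"), FILEID=file_id)

    return mets


# Hypothetical example: package two files from one accession.
package = build_mets("Constituency correspondence", ["letters/001.doc", "letters/002.doc"])
xml_text = ET.tostring(package, encoding="unicode")
```

Even this skeleton shows why the step is non-trivial: the descriptive metadata, file inventory and structural map all have to be gathered and cross-referenced before the repository software is involved at all.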

The criteria presented below must be met by the repository system used by the repository service. The repository software could form all of the system, or a central part of that system. This means that some of these criteria need not be the responsibility of the repository software, but of a service that may be used in conjunction with that software.

Coming soon

Some of the functionality that might interest those implementing preservation repositories has yet to be implemented in any repository software, but is on the development roadmap of several. It can be difficult to keep track of what functionality is in the pipeline and when it is due to arrive. The availability of more detailed information about current developments in the community would improve co-ordination among adopting institutions and be useful in planning local development priorities.