File formats

To process the bitstream of a digital object and render it as something meaningful, software must know which file format the bits are in: it is the format specification that defines how a bitstream is to be interpreted as a particular type of computer file.
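As a minimal sketch of this idea, the Python fragment below compares the opening bytes of a bitstream with a handful of well-known format signatures ("magic numbers"). The signature table and the function name sniff_format are illustrative assumptions rather than part of any existing tool, and real identification software relies on far larger signature sets.

```python
from typing import Optional

# A few widely documented leading byte signatures ("magic numbers").
# This table is deliberately tiny and is an illustrative assumption.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container (e.g. ODF)"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format label if the bitstream starts with a known signature."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # Without a matching signature the bits remain uninterpreted.
    return None
```

A signature match only suggests a format; it does not show that the rest of the file conforms to the specification.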

Most personal digital archives contain a wide range of file formats. By the time the archive comes into the custody of a digital archivist, many may be obsolete (unreadable by modern software or hardware) or in danger of becoming obsolete. Various factors contribute to format obsolescence, e.g.:

When an archive is received, the digital archivist must identify all the file formats included in the accession and validate the files to check that they conform to the relevant format specification, where one is available. Various registries and tools exist to assist digital curators in this task and to support other activities which form part of an institution's preservation strategy.
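The identification step can be pictured as a format inventory of the accession. The sketch below is a rough illustration only: it walks a directory tree, labels each file by its leading bytes and counts the results. The function names identify and inventory and the three-entry signature table are hypothetical, and files that match nothing are simply reported by extension.

```python
import os
from collections import Counter

# Illustrative signature table (an assumption, far from complete).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
]


def identify(path: str) -> str:
    """Label one file by its leading bytes, falling back to its extension."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    ext = os.path.splitext(path)[1].lower()
    return "unidentified (%s)" % (ext or "no extension")


def inventory(root: str) -> Counter:
    """Count the files under root by identified format."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts[identify(os.path.join(dirpath, name))] += 1
    return counts
```

Calling inventory on the top-level directory of an accession returns a Counter mapping each label to a number of files, which gives a first view of which formats need attention.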

Format registries

These are third-party services which contain varying levels of information about file formats and, when more fully mature, could serve a number of purposes for digital curators, such as:

File format registries can:

The first is easier and cheaper to administer, but problems may arise when file format specifications cease to be accessible. This is unlikely in the case of formats which have been registered as standards (e.g. Open Document Format ISO/IEC 26300:2006), but is a major issue in relation to proprietary formats, where earlier specifications may not be made available, or preserved at all, by the manufacturer.

Some file format registries are being developed and maintained by large (often national) archives and libraries with a strength in digital preservation, to provide a useful source of information for the digital preservation community; these promise to become well-established resources. Others are provided independently and are often intended for computer programmers rather than those engaged in preservation activities.

Examples of privately-maintained websites on file formats include:

  • Wotsit.org contains short descriptions, file extensions and format specifications for hundreds of different formats.
  • File Format Encyclopedia contains similar information.
  • File Extension Source contains detailed information about file formats and their associated extensions.

Examples of registries with a long-term digital preservation focus:

  • Digital Formats for Library of Congress Collections: provides information on the suitability of various digital file formats for long-term preservation, assessing each format against named sustainability factors and content-type specific quality and functionality factors.
  • PRONOM is provided by The National Archives in the UK and contains basic information about some file formats and their supporting software.
  • Global Digital Format Registry (GDFR) is being developed by Harvard University Library. The registry will collect representation information from centres around the world, which will be made available as a resource for any repository in the world.
  • Global Format Registry (GFR): a prototype registry developed at the University of Maryland, containing file format and application information; only sparsely populated.

Tools

There are various tools (often closely associated with format registries) which enable archivists to identify and validate file formats; other tools assist with preservation actions such as migration. These include:

  • Digital Record Object Identification (DROID): a software tool developed by The National Archives to perform automated batch identification of file formats.
  • JSTOR/Harvard Object Validation Environment (JHOVE): a tool developed by JSTOR and Harvard University Library to allow the automatic identification, validation and characterisation of a range of digital object types.
  • FOrmat CUration Service (FOCUS): a prototype tool which will perform identification and validation on submitted files.
  • XML Electronic Normalising of Archives (XENA): a tool for converting a range of file formats to XML representations, used in normalisation.
  • Conversion and Recommendation of Digital Object Formats (CRiB): an online migration tool, which recommends optimal migration alternatives, undertakes the conversion process, evaluates the outcome of the migration and generates migration reports in appropriate forms for inclusion in preservation metadata records. It currently supports migration paths for a number of image formats, but can be scaled to provide for other formats.

Some of these tools are online services, which require files to be submitted to a third party, and may therefore be unsuitable for use with closed archival collections.
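Validation goes a step further than identification: it checks that the content of a file follows the rules of its format specification. As a hedged illustration, run locally rather than through an online service, the sketch below tests part of the PNG chunk structure (signature, chunk lengths, CRC checksums, and the first and last chunk types). The function name validate_png is hypothetical, and the checks cover only a small subset of the PNG specification, so a file that passes is not thereby shown to be fully valid.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def validate_png(data: bytes) -> list:
    """Return a list of structural problems found (empty if none were found)."""
    if not data.startswith(PNG_SIGNATURE):
        return ["missing PNG signature"]
    problems = []
    chunk_types = []
    pos = len(PNG_SIGNATURE)
    # Each chunk is: 4-byte length, 4-byte type, body, 4-byte CRC.
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        end = pos + 8 + length
        if end + 4 > len(data):
            problems.append(f"chunk {ctype!r} is truncated")
            break
        body = data[pos + 8:end]
        (stored_crc,) = struct.unpack(">I", data[end:end + 4])
        # The CRC covers the chunk type and body, not the length field.
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != stored_crc:
            problems.append(f"CRC mismatch in chunk {ctype!r}")
        chunk_types.append(ctype)
        pos = end + 4
        if ctype == b"IEND":
            break
    if not chunk_types or chunk_types[0] != b"IHDR":
        problems.append("first chunk is not IHDR")
    if not chunk_types or chunk_types[-1] != b"IEND":
        problems.append("no IEND chunk at end")
    return problems
```

A file can carry the right signature, and so be identified as PNG, while still failing checks of this kind; this is why identification and validation are treated as separate steps.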