
File formats

Selecting file formats for preservation

It is unlikely that a repository will choose to preserve only the bitstreams of digital objects, although this may be a necessary last resort for unusual and obscure file formats, or for those whose format specification is unavailable. Any approach beyond bitstream preservation requires careful consideration of which file formats are most appropriate for preservation in each particular case. Some repositories keep files in their original formats, others support only a limited number of formats, and others normalise to a single format. More information on these different approaches (all types of migration) is given below.

In order to preserve digital objects properly, digital curators, or agents developing tools on their behalf, require access to detailed technical information about their file formats.
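
As a minimal sketch of what such technical information makes possible, the function below guesses a file's format from the signature bytes at the start of the file. The signature table is a small illustrative selection and the function name `sniff_format` is hypothetical; a repository would normally rely on a maintained format registry and identification tool rather than a hand-written list.

```python
from typing import Optional

# Leading "magic" bytes for a few common formats (illustrative selection only).
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG 2000 (JP2)"),
    (b"PK\x03\x04", "ZIP container"),
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format label based on the first bytes, or None if unknown."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # RIFF/WAVE: "RIFF", four size bytes, then the form type "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "WAVE"
    return None
```

A signature match identifies only the container or family; it says nothing about sub-types, compression or conformance, which need the fuller specification.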

Some proprietary formats have open specifications, meaning that they are largely independent of specific software and are therefore better suited to preservation, e.g. Adobe's PDF. Other owners of proprietary formats (which, unfortunately for archivists, are often the most widely used formats) publish no detailed specification, or restrict access to it to third-party developers who have signed a non-disclosure agreement; this allows compatible products to be designed without the specification being published openly. Published and open formats are always preferable from a preservation perspective: the best options for digital curation and long-term preservation are non-proprietary, open formats whose specifications are produced by international standards bodies, such as ISO/IEC 26300:2006, the Open Document Format for Office Applications. Numerous organisations have usually been involved in developing these standards, and they are generally backwards compatible.

Issues to consider when selecting file formats for long-term preservation include:

  • Is it defined by an international, national or publicly available standard?
  • Is the quality of the specification adequate?
  • How widely has the format been adopted as a preservation format?
  • Is it backwards compatible?
  • Is it independent of any specific hardware or software environment?
  • Does it have good metadata support (i.e. metadata providing technical and provenance information which is generated by the creating application, entered manually by the record creator, or a combination of these)?
  • Does it have a good range of functionality without being too complex for the purpose?
  • Is it easily convertible into other formats (for migration purposes)?
  • How well does it retain the formatting and other significant properties of converted digital objects?
  • How stable is the format?
  • How proven is it in terms of longevity?
  • Does it include an error-detection facility?

Arms and Fleischhauer have codified such selection criteria into a decision-support framework for use when considering preservation formats for Library of Congress digital collections. They have identified seven sustainability factors which apply across digital formats for all categories of information and are applicable whichever preservation strategy is selected. These are:

  1. Disclosure: the degree of access to full specifications and tools for validating technical integrity; open standards are usually more fully documented and more likely to be supported by tools for validation than proprietary formats.
  2. Adoption: the degree to which the format is already used; if widely used, it is less likely to become obsolete quickly, and the computing industry is more likely to produce commercial migration and emulation tools which archival institutions can purchase.
  3. Transparency: the degree to which the digital representation is open to direct analysis with basic tools; this is enhanced if textual content employs standard character encodings.
  4. Self-documentation: it is easier to manage digital objects that contain basic descriptive, technical and administrative metadata.
  5. External dependencies: the degree to which a format depends on particular hardware, operating system or software for rendering or use.
  6. Impact of patents: the degree to which digital preservation will be inhibited by patents.
  7. Technical protection mechanisms: the implementation of mechanisms like encryption that might prevent the preservation of content by the digital repository.
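
The seven factors can be recorded as a simple checklist when comparing candidate formats. The sketch below is one hypothetical way to do so: the 0 to 2 scale, the equal weighting and the field names are assumptions made for illustration and are not part of the Library of Congress framework.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SustainabilityScores:
    """One score per factor: 0 = poor, 1 = acceptable, 2 = good (assumed scale)."""
    disclosure: int
    adoption: int
    transparency: int
    self_documentation: int
    external_dependencies: int
    impact_of_patents: int
    technical_mechanisms: int  # factor 7: encryption and similar mechanisms


def total_score(scores: SustainabilityScores) -> int:
    """Sum the seven scores with equal weight (an arbitrary choice)."""
    values = [getattr(scores, f.name) for f in fields(scores)]
    if any(v not in (0, 1, 2) for v in values):
        raise ValueError("each factor must be scored 0, 1 or 2")
    return sum(values)


def weakest_factors(scores: SustainabilityScores) -> list[str]:
    """Return the names of the lowest-scoring factors, for closer review."""
    lowest = min(getattr(scores, f.name) for f in fields(scores))
    return [f.name for f in fields(scores) if getattr(scores, f.name) == lowest]


# Scores for an imaginary format, invented purely to show usage.
example = SustainabilityScores(2, 2, 1, 2, 1, 2, 0)
print(total_score(example), weakest_factors(example))
```

A single total hides trade-offs between factors, so listing the weakest factors alongside it keeps the assessment open to the archivist's judgement.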

They also identify quality and functionality factors which are genre specific, and pertain to the ability of a particular format to represent the significant characteristics required or expected by current and future users of a given content item.

Another approach to selecting preservation formats is provided by OCLC's INFORM Methodology, which measures the preservation durability of digital formats. It compares formats and preservation approaches, and provides a risk management-based means of tracking what might be lost over time if particular preservation actions are taken; the digital archivist can then make decisions about preservation strategy based on this risk assessment.

If a digital repository chooses to convert digital objects to one or more standard formats, there are a number of candidates for consideration; examples for text-based documents include:

Extensible Markup Language (XML): this is not a format but rather a general-purpose markup language for describing the structure and meaning of data. It is an open standard defined by the World Wide Web Consortium and is independent of specific applications. Preserving digital objects which have been created using XML in accordance with a standard DTD or Schema is straightforward. Converting other digital objects to XML is one kind of migration approach; the National Archives of Australia, for example, normalises the formats it receives to XML representations. However, while textual content may be well represented in XML, much of an original document's formatting and layout might be lost as a result of the conversion process. XML is also very limited in its support for non-textual data such as photographs and graphics.
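
A basic check on a normalised XML file is whether it is well-formed. The sketch below uses only the Python standard library; note that it does not validate against a DTD or Schema, which would need an additional library, and the function name `is_well_formed` is illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<letter><to>A. Reader</to></letter>"))  # properly nested
print(is_well_formed("<letter><to>A. Reader</letter>"))       # <to> never closed
```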

OASIS Open Document Format for Office Applications: this is an open, XML-based format for office files, such as word-processed documents or spreadsheets. It has been adopted as an international standard (ISO/IEC 26300) and offers a suitable format for the preservation of digital documents created in proprietary office formats like those generated by Microsoft Office.
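
An Open Document file is a ZIP package of XML parts, and the package declares its own media type in an entry named `mimetype`. As a hedged sketch, the helper below (the name `odf_media_type` is hypothetical) reads that declaration; it reports what the package claims to be and does not check the XML parts against the standard.

```python
import io
import zipfile
from typing import Optional


def odf_media_type(package: bytes) -> Optional[str]:
    """Return the media type declared in an ODF package, or None if absent."""
    try:
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            return zf.read("mimetype").decode("ascii").strip()
    except (zipfile.BadZipFile, KeyError, UnicodeDecodeError):
        # Not a ZIP file, no "mimetype" entry, or an unreadable declaration.
        return None
```

A text document would be expected to declare `application/vnd.oasis.opendocument.text`, and a spreadsheet `application/vnd.oasis.opendocument.spreadsheet`.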

Portable Document Format Archive (PDF/A): this is a constrained version of Adobe's PDF version 1.4 which has been adopted as an international standard (ISO 19005-1). It is preservation-friendly in that: its specification is openly available; it eliminates elements likely to complicate decoding and accelerate obsolescence (e.g. audio and video elements, or encryption, which are sometimes used in other PDF formats); it is self-contained (i.e. it can be displayed without any reliance on information from external sources); and its support for embedding metadata is very good. Records saved to this format have a look and feel which is fundamentally one of text and images designed to fit a particular page size. However, it preserves static visual appearance only, so it is not suitable where functionality or logical structure needs to be preserved.
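
A PDF/A file is expected to declare its conformance in its embedded XMP metadata, using a `pdfaid:part` property. The sketch below is a crude heuristic built on that assumption: it scans the raw bytes for the declaration and returns the claimed part number. It reads a claim only and is no substitute for a proper PDF/A validator; the function name is hypothetical.

```python
import re
from typing import Optional

# XMP may carry the value as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); the pattern accepts both forms.
_PDFA_PART = re.compile(rb"pdfaid:part(?:>|\s*=\s*[\"'])\s*(\d)")


def declared_pdfa_part(data: bytes) -> Optional[int]:
    """Return the PDF/A part number a file claims, or None if it claims none."""
    if not data.startswith(b"%PDF-"):
        return None  # not a PDF at all
    match = _PDFA_PART.search(data)
    return int(match.group(1)) if match else None
```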

Examples for images include:

Tagged Image File Format (TIFF): a format used for raster (i.e. pixel-based) images. It is widely adopted and supported by most image processing and viewing applications, and it supports sophisticated colour management features. Many repositories consider TIFF to be the best option for preserving images, and it is often used to store archival masters of digitised images. There are various sub-types of TIFF. Uncompressed Baseline TIFF (Revision 6) should be used as other revisions have additional functionalities which hinder preservation.
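
Whether a TIFF is uncompressed can be read from the file itself. The sketch below parses the header and the first image file directory and returns the value of the Compression field (tag 259), where 1 means no compression. It inspects only the first image and only this one field, so it is a partial check under those assumptions and not a full Baseline conformance test; the function name is illustrative.

```python
import struct
from typing import Optional

COMPRESSION_TAG = 259  # TIFF 6.0 "Compression" field
UNCOMPRESSED = 1       # value meaning "no compression"


def tiff_compression(data: bytes) -> Optional[int]:
    """Return the Compression value of the first image, or None if not a TIFF."""
    if len(data) < 8:
        return None
    if data[:2] == b"II":
        order = "<"  # little-endian byte order
    elif data[:2] == b"MM":
        order = ">"  # big-endian byte order
    else:
        return None
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        return None
    try:
        (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
        for i in range(entry_count):
            entry = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
            (tag,) = struct.unpack_from(order + "H", data, entry)
            if tag == COMPRESSION_TAG:
                # A single SHORT value sits at the start of the value field.
                (value,) = struct.unpack_from(order + "H", data, entry + 8)
                return value
    except struct.error:
        return None  # truncated or damaged directory
    return UNCOMPRESSED  # the field defaults to 1 when it is absent
```

A repository could flag any file for which `tiff_compression(data) != UNCOMPRESSED` for review before accepting it as an archival master.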

Joint Photographic Experts Group (JPEG): a widely used format for representing continuous tone images (e.g. photographs and greyscale images). It is defined by an international standard (ISO/IEC 10918). As with TIFF, there are different JPEG profiles; the lossless version of JPEG is preferred for preservation purposes, and the JPEG 2000, Part 1, Core coding version with lossless compression is also a favoured option.

There are also widely accepted options for sound formats (e.g. WAVE LPCM or MP3_FF) and moving image formats (e.g. MPEG-2, MPEG-4_AVC). Many digital repositories publish details of the preservation formats they support; these lists are a good source of further information about accepted formats.

An important consideration when selecting a preservation format is how successfully the chosen format embodies the essential attributes, or significant properties, of the original digital object.