File formats
Representation Information
Whilst information about a digital object's file format is essential for its preservation, more than this is needed to ensure that the bitstream can be transformed into something which is meaningful and understandable over time. Elements like operating system and hardware dependencies, character encoding, algorithms, standards and so on should also be taken into account. The OAIS Model uses the term Representation Information to define this kind of information. Representation Information is subdivided into three classes:
- Structure Information: describes the format and data structure concepts to be applied to the bitstream, yielding more meaningful values such as characters or numbers of pixels.
- Semantic Information: needed in addition to the Structure Information. For example, if the Structure Information interprets the digital object as a sequence of text characters, the Semantic Information should state which language is being expressed.
- Other Representation Information: includes information about relevant software, hardware and storage media, encryption or compression algorithms, and printed documentation.
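The layering of Structure and Semantic Information can be illustrated with a minimal sketch. The class names, the `interpret` function and the sample bytes are hypothetical, chosen only for illustration; they are not part of the OAIS Model.

```python
from dataclasses import dataclass


@dataclass
class StructureInfo:
    # Hypothetical field: how the raw bytes map to characters
    encoding: str


@dataclass
class SemanticInfo:
    # Hypothetical field: which language the characters express
    language: str


def interpret(bitstream: bytes, structure: StructureInfo, semantic: SemanticInfo) -> dict:
    """Apply Structure Information first, then attach Semantic Information."""
    # Structure Information turns bits into characters
    text = bitstream.decode(structure.encoding)
    # Semantic Information says how those characters should be read
    return {"text": text, "language": semantic.language}


# Five bytes that mean nothing until Representation Information is applied
bitstream = bytes([0x44, 0x69, 0x61, 0x72, 0x79])
result = interpret(bitstream, StructureInfo("ascii"), SemanticInfo("en"))
print(result)
```

Without the `StructureInfo` the bytes remain an opaque sequence of numbers; without the `SemanticInfo` a reader has characters but no statement of the language they express.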
In an OAIS information package, the Content Information (i.e. the information which is the original target of preservation) comprises:
- The Content Data Object itself (i.e. the bits that make up the object).
- The necessary Representation Information to make the content understandable to the Designated Community (the body of users who may need to access and use the digital resource). As with file formats, the Representation Information for a digital object should allow the recreation of all the significant properties of the original digital object.
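As a rough sketch of this pairing, Content Information might be modelled as a Content Data Object held together with its Representation Information. The field names, the `missing_classes` helper and the sample values below are assumptions made for illustration, not an OAIS-defined schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RepresentationInformation:
    # One list per class of Representation Information (hypothetical layout)
    structure: List[str] = field(default_factory=list)
    semantic: List[str] = field(default_factory=list)
    other: List[str] = field(default_factory=list)


@dataclass
class ContentInformation:
    content_data_object: bytes  # the bits themselves
    representation_information: RepresentationInformation


def missing_classes(info: ContentInformation) -> List[str]:
    """Return the names of Representation Information classes left empty."""
    ri = info.representation_information
    classes = {"structure": ri.structure, "semantic": ri.semantic, "other": ri.other}
    return [name for name, entries in classes.items() if not entries]


package = ContentInformation(
    content_data_object=b"Dear Sir,",
    representation_information=RepresentationInformation(
        structure=["ASCII character encoding"],
        semantic=["Text is written in English"],
    ),
)
print(missing_classes(package))
```

A check of this kind could flag packages whose Representation Information is too thin for the Designated Community, although deciding what counts as sufficient remains a curatorial judgement.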
A digital repository should retain persistent Representation Information along with the data objects it preserves, or it should refer to Representation Information held externally in a reliable repository.
Representation Information may itself need further Representation Information in order to be intelligible. For example, it may be stated that the digital object to be preserved conforms to the ASCII standard; this standard in turn then needs to be explained. The recursive nature of Representation Information results in a complex and extensive network of representation objects, which continues expanding until the contents of the original digital object can be displayed in a form the user can understand.
The user in this case is a member of the repository's Designated Community (or primary user base). If this user base is small and specialised, only a minimal amount of Representation Information may be necessary. However, a repository must consider future developments and decide whether to maintain a larger amount of Representation Information, which would render its holdings understandable to a wider community with a less specialised knowledge base. The latter is the more appropriate approach for a collecting institution which takes in personal archives, so an extensive quantity of Representation Information is likely to be necessary.
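The way the network stops expanding once it reaches what the Designated Community already understands can be sketched as a simple recursive traversal. The network contents, the function name and the two knowledge bases below are invented for illustration only.

```python
# Hypothetical network: each item lists the further Representation
# Information needed to understand it.
NETWORK = {
    "diary.txt": ["ASCII standard", "English language"],
    "ASCII standard": ["English language", "Binary notation"],
    "English language": [],
    "Binary notation": [],
}


def required_information(item, network, knowledge_base, seen=None):
    """List the Representation Information a repository must hold for `item`,
    stopping wherever the Designated Community's knowledge base takes over."""
    if seen is None:
        seen = set()
    needed = []
    for dependency in network.get(item, []):
        # Already understood by the community, or already collected
        if dependency in knowledge_base or dependency in seen:
            continue
        seen.add(dependency)
        needed.append(dependency)
        # The dependency may itself need explaining
        needed.extend(required_information(dependency, network, knowledge_base, seen))
    return needed


specialist = {"ASCII standard", "English language", "Binary notation"}
general = {"English language"}

print(required_information("diary.txt", NETWORK, specialist))
print(required_information("diary.txt", NETWORK, general))
```

In this toy network the specialised community needs nothing beyond the object itself, whereas the wider community needs the ASCII standard and, in turn, an explanation of binary notation.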
The Digital Curation Centre has recognised that a collaborative model for creating, storing, maintaining, accessing and using Representation Information is necessary to assist the development of long-term digital curation strategies. The Centre is therefore developing a distributed Representation Information Registry/Repository to provide an infrastructure for the preservation of Representation Information. The DCC will not fully populate the registry itself, so the community will only derive benefit from the registry if its members invest time and effort in populating the resource. It is intended that the registry will include:
- A structure repository containing information about file formats (with an emphasis on formats used for automated processing rather than more common formats which are adequately documented elsewhere).
- A semantic repository containing relevant data dictionaries and ontologies.
Other Representation Information to support both migration and emulation preservation strategies will also be held, such as details of software with appropriate emulation capabilities. Digital repositories will be able to refer to Representation Information held in the registry by means of a Representation Information label (in the form of an XML Schema) which can be attached to a digital object.