
Persistent identifiers

Archival Resource Key (ARK)

Background

ARK is a scheme for the persistent identification of information objects, including finding aids and other metadata as well as digital archival objects. It can also be used to assign a persistent name to other resources, such as physical objects (e.g. books) and intangible objects (examples given include diseases, vocabulary terms and performances). In the ARK scheme, persistent identification encompasses both naming and retrieval. It is the most recent of the schemes considered here and was originally developed by John Kunze and R.P.C. Rodgers at the US National Library of Medicine. An internet draft outlining the scheme was issued in February 2001, and the current draft in July 2007; the scheme is currently maintained at California Digital Library (University of California).

ARK was developed as an alternative to schemes like PURLs, URNs and Handles, which address the problem of broken URLs by using a stable, indirect hostname scheme. Instead, the ARK scheme is founded on the principle that persistence is a matter of service, not syntax – it is reliant on the continued stability and support of the service behind the identifiers.

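As well as being a globally unique identifier, each ARK is an actionable URL, which links users to the object itself, to metadata about the object, and to a commitment statement about the object from its provider.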

For further information about the ARK scheme see http://www.cdlib.org/inside/diglib/ark/.

How do ARK identifiers work?

In order to assign ARKs, an institution must either become a Name Assigning Authority (NAA) under the scheme or be authorised to allocate names as a sub-authority of a NAA. Each NAA is associated with one or more Name Mapping Authority Hostports (NMAHs), which provide services (such as hosting, access or forwarding) for the digital objects being identified under the scheme; these essentially act as a temporary address where ARK requests are directed in order to make the ARKs actionable. A NMAH may change over time if service providers change and may serve more than one NAA (see below). A single institution can act as both NMAH and NAA; in fact, the scheme recognises that this will be common.

ARKs work well with current protocols like HTTP and DNS, but they are designed to be protocol independent.

ARK syntax

An ARK identifier takes the following general format:

[Protocol][NMAH/]ark:/[NAAN]/[Name][Qualifier]

Hypothetical example:
http://library.manchester.ac.uk/ark:/98765/archive/object35

Protocol: This label does not form part of the ARK identifier, but indicates the protocol which is being followed (e.g. http://).

NMAH: This part of the string identifies the relevant NMAH or provider of services; it is expressed as a hostname in the same format as a domain name which appears in a URL, e.g. library.manchester.ac.uk. This is mutable and does not form part of the unique ARK identifier.

ark:/ : This prefix indicates where the actual ARK identifier begins. It, and the components which follow it, can be extracted and used in other identifier schemes (e.g. as part of a URN), and are easily recognisable by the ark: prefix. Following the ark:/ label are the components which make up the globally unique identifier for the digital object.

NAAN: NAAN stands for Name Assigning Authority Number. Each NAA is assigned a 5 or 9 digit decimal number as a unique identifier. This element of the ARK string is mandatory because it unequivocally identifies the organisation which assigned the persistent name of the digital object.

Name: The Name is a mandatory element of the identifier and is assigned by the NAA. It should be composed of ASCII characters, although there are four reserved characters which have special meanings. It must be unique within the NAA, which ensures its uniqueness within the system as a whole. The NAAN and the Name taken together form the immutable persistent identifier for the object.

Qualifier: This is an optional component of the ARK, and the use of qualifiers (e.g. to identify subcomponents or variants of a digital object) is determined by the relevant NAA or NMAH. The ARK scheme specifies that hierarchies should be expressed using a path which separates each level with a slash. For example, in a digital archive this could be used to express hierarchies in a file structure. If the Name 3567 is assigned to a folder, the sequential files within that folder might be expressed in ARKs which look something like:

ark:/[NAAN]/3567/file1

ark:/[NAAN]/3567/file2

ark:/[NAAN]/3567/file3

Similarly, different variants of the same object can be specified by using qualifiers divided by dots. The NAA or NMAH determines what constitutes a variant. In an archival context a variant might be a different representation of the same intellectual entity, or the same digital object in a different format as a result of migration. Examples:

ark:/[NAAN]/3567.t44.v23

ark:/[NAAN]/3567.232
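The syntax described above can be illustrated with a short parsing sketch. This is not a reference implementation: the regular expression, the Ark tuple, and the functions parse_ark and core_identifier are hypothetical names introduced here, and the rule that variants trail the last path segment is a simplifying assumption. It only aims to show how the mutable NMAH is separated from the immutable NAAN and Name, and how slash and dot qualifiers might be split.

```python
import re
from typing import List, NamedTuple, Optional


class Ark(NamedTuple):
    nmah: Optional[str]   # mutable service host, not part of the identifier
    naan: str             # Name Assigning Authority Number
    name: str             # name assigned by the NAA
    path: List[str]       # hierarchical qualifiers (slash-separated)
    variants: List[str]   # variant qualifiers (dot-separated)


# Assumed pattern: optional protocol and NMAH, then ark:/NAAN/Name[Qualifier].
ARK_RE = re.compile(
    r"^(?:[a-z][a-z0-9+.-]*://)?"    # protocol, e.g. http://
    r"(?:(?P<nmah>[^/]+)/)?"         # NMAH
    r"ark:/(?P<naan>\d{5}|\d{9})"    # 5 or 9 digit NAAN
    r"/(?P<name>[^/.?]+)"            # Name
    r"(?P<qualifier>[/.][^?]*)?$"    # optional Qualifier
)


def parse_ark(text: str) -> Ark:
    """Split an ARK string into its components."""
    m = ARK_RE.match(text)
    if m is None:
        raise ValueError(f"not a recognisable ARK: {text!r}")
    qualifier = m.group("qualifier") or ""
    # Simplification: variants are taken to follow the final path segment.
    last_slash = qualifier.rfind("/")
    dot = qualifier.find(".", last_slash + 1)
    if dot == -1:
        path_part, variant_part = qualifier, ""
    else:
        path_part, variant_part = qualifier[:dot], qualifier[dot + 1:]
    path = [p for p in path_part.split("/") if p]
    variants = [v for v in variant_part.split(".") if v]
    return Ark(m.group("nmah"), m.group("naan"), m.group("name"), path, variants)


def core_identifier(ark: Ark) -> str:
    """Return the immutable part of the ARK: the NAAN plus the Name."""
    return f"ark:/{ark.naan}/{ark.name}"
```

Two ARKs that differ only in their NMAH yield the same core identifier under this sketch, which mirrors the point that the hostname is not part of the persistent name.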

Resolving ARKs

If a working NMAH is included in the ARK prefix, the user can be taken to the NMAH directly. If the NMAH no longer works (e.g. responsibility has been passed on to another institution), users can locate the new NMAH by identifying the NAA and using the register maintained by California Digital Library to look up current NMAHs that service ARKs issued by that NAA.
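The fallback described above can be sketched as a lookup against a local copy of the registry. Everything here is illustrative: REGISTRY is a hypothetical in-memory stand-in for the register maintained by CDL, the NAAN and hostnames are invented, candidate_urls is a name introduced for this example, and the use of http:// for registry hosts is an assumption.

```python
from typing import Dict, List

# Hypothetical stand-in for the NAAN registry: NAAN -> current NMAHs.
# The NAAN and hostname below are invented for illustration.
REGISTRY: Dict[str, List[str]] = {
    "98765": ["archives.example.org"],
}


def candidate_urls(ark_url: str, registry: Dict[str, List[str]]) -> List[str]:
    """Return the URLs to try, in order, when resolving an ARK."""
    marker = ark_url.find("ark:/")
    if marker == -1:
        raise ValueError("no ark:/ label found")
    prefix, core = ark_url[:marker], ark_url[marker:]
    naan = core[len("ark:/"):].split("/", 1)[0]
    urls: List[str] = []
    if prefix:
        # An NMAH was supplied with the ARK, so try it first.
        urls.append(ark_url)
    for host in registry.get(naan, []):
        # Fall back to the NMAHs currently registered for this NAA.
        url = f"http://{host}/{core}"
        if url not in urls:
            urls.append(url)
    return urls
```

A client would try each URL in turn until one responds; the sketch deliberately stops short of making network requests.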

The ARK scheme also proposes an alternative method of locating the NMAH using a simplified version of the Name Authority Pointer (NAPTR) method of discovering URN resolvers, whereby a query is submitted to the DNS system requesting a list of resolvers matching a particular NAAN, and responses come back inside NAPTR records.

The ARK scheme also specifies a simple protocol for handling ARK requests over HTTP, which is known as The HTTP URL Mapping Protocol (THUMP). It allows users to enter ARK requests directly into the location field of their browser. Once they have determined the internet host name and port number of the relevant NMAH, they can send questions to it via a THUMP request (contained within an HTTP request) and receive answers via a THUMP response (contained within an HTTP response).

An ARK can resolve to the object itself, to object metadata (basic information about who, what, when, where, etc., in relation to the object), or to a commitment statement, which could encompass statements about object permanence, variance (e.g. the conditions under which the object could change, such as format migration) and change history.
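As a rough illustration of these three targets: the ARK drafts describe requesting metadata and the commitment statement by appending question marks to the ARK. The sketch below assumes that a single trailing '?' asks for metadata and a double '??' asks for the commitment statement; this convention should be checked against the current specification, and ark_requests is a hypothetical helper name.

```python
from typing import Dict


def ark_requests(ark_url: str) -> Dict[str, str]:
    """Build the three request URLs for one ARK.

    Assumption: a trailing '?' requests metadata and '??' requests the
    commitment statement, as described in the ARK drafts.
    """
    base = ark_url.rstrip("?")  # drop any question marks already present
    return {
        "object": base,
        "metadata": base + "?",
        "commitment": base + "??",
    }
```

For the hypothetical example used earlier, the metadata request would be http://library.manchester.ac.uk/ark:/98765/archive/object35? under this assumption.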

Maintenance and adoption

The ARK scheme is currently maintained at California Digital Library (University of California). The NAAN registry (listing NAAs and their associated NMAHs) is also maintained by CDL and mirrored at the NLM. The list of registered NAAs gives an indication of the ARK user community. In January 2007 twenty institutions were listed. Most of these are American, including the Library of Congress and several leading university and digital libraries. France's Bibliothèque Nationale is also a participant; the only UK organisation represented is the Digital Curation Centre. The scheme therefore has some backing among information institutions and the public sector.

The cost of participating is low: there is no subscription fee. Any institution can obtain a NAAN by contacting CDL and can then begin generating ARKs, using any software which produces identifiers that conform to the ARK specification. CDL itself uses a piece of open-source software called 'noid' (nice opaque identifiers).
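To give a sense of what generating ARKs might involve, the following sketch mints random opaque names and checks that each is unique within the NAA. It is not the noid software and does not reproduce its algorithm: the alphabet, the default name length, the in-memory set of issued names, and the function names are all assumptions made for illustration. A real service would need to record issued names in a durable store.

```python
import secrets
from typing import Set

# Digits and consonants only, so that names are unlikely to spell words.
# This alphabet is an assumption chosen for the example.
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def mint_name(issued: Set[str], length: int = 8) -> str:
    """Return a new opaque name that has not been issued before."""
    while True:
        name = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if name not in issued:       # names must be unique within the NAA
            issued.add(name)
            return name


def mint_ark(naan: str, issued: Set[str], length: int = 8) -> str:
    """Combine the NAAN with a freshly minted name."""
    return f"ark:/{naan}/{mint_name(issued, length)}"
```

Opaque names of this kind carry no meaning about the object, which avoids the need to change an identifier if the object's title, location or custodian changes.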

Advantages and disadvantages of the ARK scheme

Advantages

  • The scheme is standards based and protocol/technology independent.
  • It works well either as a simple identification scheme, or as a system for both identifying and accessing digital objects.
  • ARKs can be used to identify different types of entity, e.g. they could be used to identify agents and events as well as digital archival objects and metadata records.
  • The system was developed in a library context and is designed to meet the needs of digital archivists.
  • ARKs can be used in both a closed environment like a dark archive or an open publicly-accessible environment.
  • The ARK system makes explicit the importance of organisational commitment to a persistent identifier scheme and writes a requirement for this into the scheme itself.
  • It is maintained by a leading institution in the field of digital preservation and, unlike DOI, has no commercially motivated background.
  • The model for participating in the ARK scheme is more flexible than some of the other PID schemes: if one institution acts as both NMAH and NAA, it is able to have complete control over its own identification scheme; the possibility of multiple NAAs being connected to one NMAH might also enable one institution to host the digital archives of smaller, less well-resourced institutions.
  • The technical requirements for participation are relatively low: currently a normal web server using the DNS.
  • Because the scheme is still under development, institutions which choose to participate now can feed into and shape this development.

Disadvantages

  • Because the ARK scheme was so recently established, it is difficult to gauge at this stage how popular and long-lived it might be.
  • Some elements of the scheme are probably superfluous to the requirements of digital archives, e.g. hierarchies and variants can be defined using METS and PREMIS metadata rather than complex identifiers. In reality, it is probably more straightforward to use a simple single-level sequence of identifiers.
  • Most institutions are moving towards encoding metadata in XML, which is intended to be reasonably human-legible and facilitates the sharing of data across different information systems. The use of Electronic Resource Citation (ERC) for recording ARK metadata (as recommended by the scheme) may involve both duplication of metadata and the additional task of converting it into a different format.