Persistent identifiers

The impact of the Web on the relationship between names and addresses

Identifying and locating resources on the Internet and World Wide Web

The Web environment makes it difficult to associate specific digital resources with persistent identifiers and locations, because web-based material can be moved and altered so easily.

Before considering the problem of naming and addressing web-based resources, however, it is useful to note the distinction between the Internet and the World Wide Web - terms which are often used interchangeably. The Internet is the structure of interconnected computer networks that communicate using the Internet Protocol (IP) and the Transmission Control Protocol (TCP). The vast collection of interlinked hypertext documents which makes up the Web runs over this network, and access to web documents is provided by the Hypertext Transfer Protocol (HTTP).

The earliest address mechanism used for the Internet was the IP address; IP (version 4) addresses are commonly written as four numbers between 0 and 255, separated by dots (e.g. 192.0.2.1). This system is not user friendly, and the Domain Name System (DNS) was introduced to address the problem. The DNS enables human-readable names to stand in (as indirect names) for IP addresses; multiple IP addresses can be assigned to a single domain name, and multiple domain names to a single IP address. Domain names are hierarchical and are read from right to left. The top-level domain, which appears at the end of a domain name, might be a country or community, as in .uk or .fr; the next level might indicate an institution or a sector, like .co. for companies, .ac. for the academic community, etc. Further subdomains are created by the domain owner.
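The two addressing conventions described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the address 192.0.2.1 is an illustrative value from a range reserved for documentation, and the domain name is the hypothetical one used elsewhere in this section.

```python
import ipaddress


def ipv4_octets(address: str) -> list[int]:
    """Return the four numbers of a dotted IPv4 address."""
    # IPv4Address raises ValueError if a number falls outside 0-255
    # or if there are not exactly four dot-separated parts.
    return list(ipaddress.IPv4Address(address).packed)


def domain_hierarchy(name: str) -> list[str]:
    """List the labels of a domain name from the top-level domain down."""
    # Domain names are read from right to left, so reverse the labels.
    return name.rstrip(".").split(".")[::-1]


octets = ipv4_octets("192.0.2.1")
levels = domain_hierarchy("personaldigitalarchive.ac.uk")
print(octets)  # [192, 0, 2, 1]
print(levels)  # ['uk', 'ac', 'personaldigitalarchive']
```

Reversing the labels makes the hierarchy explicit: the country comes first, then the sector, then the name chosen by the domain owner.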

URIs, URLs and URNs

Anyone who uses the Web will be accustomed to locating, requesting and citing digital resources by their Uniform Resource Locator (or URL); however, they may be less familiar with the terms URI and URN.

A Uniform Resource Identifier (URI) is essentially a string of characters used to name or identify a resource, and it can act as a name, a locator or both. A resource is anything that can be identified by a URI, and does not necessarily have to be in digital format or available via the Internet (e.g. a human being or an abstract concept could be assigned a URI). The URI specification does not require that a URI persists in identifying the same resource over time, although this is a major aim of most URI-compliant schemes. URI identification is carried out by means of an extensible set of registered naming schemes, maintained by the Internet Assigned Numbers Authority (IANA).

URI syntax

A URI has a standardised syntax. The permitted characters come from a limited set comprising the letters of the basic Latin alphabet, digits and a small number of special characters. A URI is organised hierarchically, from left to right (rather than right to left as with the DNS), and takes the following form:

[Scheme]:[//][Authority]/[Path]?[Query]#[Fragment]

Hypothetical example:
http://personaldigitalarchive.ac.uk/archive/accession1?ABC#123

Scheme: This component is required and identifies the scheme being used in the URI (e.g. http for the Hypertext Transfer Protocol). It is usually given in lower case and is separated from the rest of the URI by a colon. It is followed by the scheme-specific part of the URI, which is largely governed by the specifications of the relevant scheme, although the URI specification imposes some constraints to ensure consistency.

Authority: Many URI schemes include an element for a naming authority, which governs the rest of the namespace in the URI. It is optional and is preceded by a double slash. It can include three subcomponents: userinfo (e.g. user name and scheme-specific information about how to access the resource); host (IP address or registered name); and port (port number).

Path: A required component (although it may be empty) which, when an authority is present, begins with a slash. It contains hierarchical data that (along with the query component) identifies a resource within the scope of the relevant URI scheme.

Query: An optional component which contains non-hierarchical data that serves to identify the resource within the scope of the relevant URI scheme.

Fragment: Another optional component, which allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The secondary resource could be a subset of the primary resource, such as a named section of a web page, or it could be a view of the resource, perhaps the result of a query to a database.
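As a rough illustration of the components listed above, the hypothetical example URI can be split with the urlsplit function from Python's standard library. The helper name uri_components is an assumption made for this sketch; note that urlsplit reports the authority under the name netloc.

```python
from urllib.parse import urlsplit


def uri_components(uri: str) -> dict[str, str]:
    """Split a URI into the five generic components."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # urlsplit calls the authority "netloc"
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


example = "http://personaldigitalarchive.ac.uk/archive/accession1?ABC#123"
for name, value in uri_components(example).items():
    print(f"{name}: {value}")
# scheme: http
# authority: personaldigitalarchive.ac.uk
# path: /archive/accession1
# query: ABC
# fragment: 123
```

Components that are absent from a URI are returned as empty strings, which reflects the fact that only the scheme and the path are required.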

 

URIs are the principal identifiers used in the Web environment. Whilst a URI can serve the purpose of both naming and locating, these two functions have essentially been separated into two URI subsets: Uniform Resource Locators (URLs, which are in general use for web-based resources) and Uniform Resource Names (URNs).

URLs are used by HTTP for addressing documents and are intended only for locating resources. In addition to the scheme, which names the access protocol (e.g. http://), a URL contains a network path that includes the domain name or IP address, and further optional paths and parameters. URLs are widely used as identifiers but they are inherently unstable. Users of the Web will be familiar with the frustration of broken links and error messages that result when documents are removed or their locations change. Recognition of this problem prompted the search for a persistent identifier scheme, and URNs were introduced as globally unique, persistent identifiers that are independent of location; they have been formally defined and are discussed in more detail below. Other persistent identifier schemes have also been established.
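The value of separating a name from a location can be shown with a deliberately simplified sketch. Everything in it is an assumption made for illustration: the in-memory resolver table, the function names, the URN urn:example:accession1 and both URLs. Real persistent identifier schemes rely on maintained resolution services rather than a dictionary.

```python
# Hypothetical resolver table: persistent name -> current location.
# The URN and both URLs are invented for illustration only.
resolver = {
    "urn:example:accession1":
        "http://personaldigitalarchive.ac.uk/archive/accession1",
}


def resolve(name: str) -> str:
    """Look up the current location of a persistently named resource."""
    return resolver[name]


def relocate(name: str, new_location: str) -> None:
    """Record that a resource has moved; the name itself does not change."""
    resolver[name] = new_location


before = resolve("urn:example:accession1")
relocate(
    "urn:example:accession1",
    "http://personaldigitalarchive.ac.uk/collections/accession1",
)
after = resolve("urn:example:accession1")

print(before)
print(after)
```

A citation that records the name continues to work after the move, whereas a citation that records the first URL would now point to a location that no longer holds the resource.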