Troubleshooting METS files and Fedora's Directory Ingest Service
Fedora's Directory Ingest service
Using Fedora's Directory Ingest service the archivist can ingest a whole accession of digital archives at once (with some metadata if desired) while maintaining the hierarchical relationships between objects. This is clearly preferable to ingesting objects one by one, which is labour intensive and destroys the relationships between objects; it is not completely pain-free, however, as it requires the archivist to supply a detailed METS file which acts as a manifest for the accession. Combining the METS manifest and the content data objects together in a zipped folder creates a Submission Information Package (SIP) which the Fedora Directory Ingest service can process.
What kind of METS file is needed?
This METS manifest contains the paths of the content data objects we want to preserve, some metadata about these objects (the metadata files are also datastreams in their own right), and a structure map which details the hierarchical relationships in the accession.
We tested the DirIngest function firstly using samples provided by the Fedora developers and secondly by creating our own, simple, METS files containing only file IDs, mime types and Dublin Core descriptive metadata in the <fileSec> and a <structMap> representing our objects' relationships.
We plan to create metadata templates for different models of content (textual records, images, websites, emails, etc.) to assist us in the construction of each <file> in the <fileSec>. Defining metadata profiles in this way will help us while we are required to handcraft the metadata and such profiles will eventually facilitate the automation of metadata creation wherever this is possible.
An example SIP : three.zip
The file three.zip is one of three sample SIP files which are included in the sip2fox/samples directory in Fedora's DirIngest distribution. This demonstration object is a zipped directory containing folders and files, and a manifest which provides metadata about them (METS.xml). Familiarity with the structure of this dummy accession is necessary for understanding the manifest METS file that was created to ingest the objects and their metadata into Fedora, so we have reproduced a tree of the folders and files below:
As you can see, the three.zip sample is a simple 'My Documents' folder. In all, the sample contains 10 objects: 5 folders and 5 files. The METS manifest will need to provide metadata about all 10 objects in the <fileSec> and it must mimic exactly the nesting of the folders and files in its <structMap>.
Introducing a METS file created for Fedora's DirIngest service : METS.xml of three.zip
The File Section <fileSec>
In the <fileSec> each data object is represented by a <file> element which provides an ID attribute for the object and contains the <FLocat> element which points to the location of the file. The example below is the <file> entry for the mp3 track saved as '08_ThieveryCorporation_DC3000.mp3' in the 'Music' folder of 'My Documents'. The creator of the METS document has assigned the value 'DC3000' for the file's ID attribute.
The creator of the METS document also wanted to associate metadata with the 'DC3000' object. To do this, it is necessary to create a <fileGrp> which will contain the 'DC3000' file and any metadata files we wish to associate with 'DC3000':
In this example, the editor has added two files to DC3000's file group: a Dublin Core metadata file and a Creative Commons licence. Notice that these files do not have a <FLocat> element associated with them; rather the metadata is contained in the METS file wrapped in the <FContent> and <xmlData> elements; this makes use of METS' ability to embed any XML metadata.
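The distinction between files that point to a bitstream and files that embed their metadata can be checked programmatically. The following is a minimal sketch, not the three.zip manifest itself: the sample XML is a hypothetical, abbreviated file group, the namespace URIs and the `file:///` path are assumptions to be checked against your own METS header, and `describe_files` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for this sketch; check them against your own METS header.
NS = {
    "METS": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, abbreviated file group modelled on the description above.
SAMPLE = """
<METS:fileSec xmlns:METS="http://www.loc.gov/METS/"
              xmlns:xlink="http://www.w3.org/1999/xlink">
  <METS:fileGrp>
    <METS:file ID="DC3000" MIMETYPE="audio/mpeg">
      <METS:FLocat LOCTYPE="URL"
          xlink:href="file:///My Documents/Music/08_ThieveryCorporation_DC3000.mp3"/>
    </METS:file>
    <METS:file ID="DC3000-DC" MIMETYPE="text/xml">
      <METS:FContent><METS:xmlData><title>DC 3000</title></METS:xmlData></METS:FContent>
    </METS:file>
  </METS:fileGrp>
</METS:fileSec>
"""

def describe_files(mets_xml):
    """Return (ID, kind) pairs, where kind is 'external' or 'embedded'."""
    root = ET.fromstring(mets_xml)
    result = []
    for entry in root.iter("{http://www.loc.gov/METS/}file"):
        # A <file> with an <FLocat> points at a bitstream in the accession;
        # otherwise its content is assumed to be wrapped inside the METS file.
        has_pointer = entry.find("METS:FLocat", NS) is not None
        result.append((entry.get("ID"), "external" if has_pointer else "embedded"))
    return result

print(describe_files(SAMPLE))
```

Run against a real manifest, a listing like this gives a quick overview of which entries depend on a correct file path and which carry their content inline.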
The <fileGrp> for a folder is created slightly differently from that for a file. This is because the folders are conceptual containers, rather than bitstreams. In three.zip's METS file, the <fileGrp>s for folders contain only one <file>: Dublin Core metadata describing the folder:
A note about nesting
The order of <fileGrp>s in the <fileSec> must mimic the order of directories and objects in the source material.
The Structure Map <structMap>
Each datastream described in the <fileSec> has a <div> entry in the <structMap>. As in the <fileSec>, there are entries for metadata about the objects as well as entries for the folders and bitstreams in the accession. The <div>s in the <structMap> must be nested to reflect the hierarchy of the accession; this is similar to the nesting of numbered Component tags (<c01>, <c02> and so on) in EAD, though METS does not number the levels.
Below is the structure of a section of the 'My Documents' folder that we found in three.zip:
- My Documents folder
- Music folder
- 08_ThieveryCorporation_DC3000.mp3 file
- Mine folder
- MyAlbumCover.jpg file
- Music folder
Notice how the nesting of the <div>s follows the nesting of the original file structure:
The importance of IDs
You may have noticed that there is no trace of the filenames or filepaths of our files in the <structMap>, so how will Fedora know where to find our objects? Fedora can find our objects because we have already recorded this information in the <fileSec>. Each <file> in the <fileSec> is given an ID attribute which the relevant <div> entry in the <structMap> will use to find this information. Note the ID attribute of <file> and the xlink:href attribute of <FLocat> below:
In this image we see the relevant <div> entry in the <structMap>. The file pointer tag <fptr> simply points to the ID of the object as recorded in the <fileSec> in its FILEID attribute:
It is essential to ensure that the exact paths and filenames for individual digital objects are correctly inputted in the <FLocat> and that IDs entered in the <fileSec> match the corresponding entry in the <structMap>.
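Because an XML editor will not catch a mismatch between these IDs, a small script can do the cross-check before ingest. This is a sketch under assumptions: the METS namespace URI is assumed, the sample manifest is hypothetical and heavily abbreviated, and `unmatched_ids` is an illustrative name rather than part of any Fedora tool.

```python
import xml.etree.ElementTree as ET

METS_NS = "{http://www.loc.gov/METS/}"  # assumed METS namespace

def unmatched_ids(mets_xml):
    """Return (FILEIDs with no matching <file>, file IDs never pointed to)."""
    root = ET.fromstring(mets_xml)
    file_ids = {f.get("ID") for f in root.iter(METS_NS + "file")}
    pointers = {p.get("FILEID") for p in root.iter(METS_NS + "fptr")}
    return sorted(pointers - file_ids), sorted(file_ids - pointers)

# Hypothetical manifest with one deliberate mismatch ("DC300" vs "DC3000").
SAMPLE = """
<METS:mets xmlns:METS="http://www.loc.gov/METS/">
  <METS:fileSec>
    <METS:fileGrp>
      <METS:file ID="DC3000"/>
    </METS:fileGrp>
  </METS:fileSec>
  <METS:structMap>
    <METS:div LABEL="Music">
      <METS:div LABEL="DC3000"><METS:fptr FILEID="DC300"/></METS:div>
    </METS:div>
  </METS:structMap>
</METS:mets>
"""

dangling, unused = unmatched_ids(SAMPLE)
print("Pointers with no matching file:", dangling)
print("Files never referenced:", unused)
```

Both lists should be empty for a consistent manifest; the comparison is exact, so differences in case or spacing are reported as mismatches.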
Validating METS files
The METS file must be valid before proceeding with the DirIngest. To ensure conformity, proper processing, and data interchange, the encoded document must be compared to the specifications of its schema to ensure that the markup adheres to the METS standard. This process involves two steps: parsing, which tells you whether your XML document is well formed, and validation, which tells you whether it is valid against the schema.
Error messages in XML editors
Different software packages use different parsers, and each parser may present a different error message for the same mistake. Most XML editors give the line number where the mistake has occurred, followed by the character number. In most cases this makes identifying the problem relatively straightforward.
Fixing the errors in an XML document can be a cumulative process. The XML editor cannot always present all error messages at once, so the person editing the file must often resolve one set of error messages only to re-validate and be presented with new ones. Gradually all the errors will be resolved and the XML will be ready for submission.
Common errors to look for include:
- Missing closing tags
- Mismatched element names - opening and closing tags must match
- Failure to close quotation marks for attribute values
- Incorrect ordering of elements (the sequence of elements is specified in the schema)
- Using the wrong case (XML is case sensitive and the metadata creator must adhere to the exact specification of elements in the schema)
- Badly nested elements (for example, if element A is opened first, and element B is a child of element A, you must close element B before closing A)
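Several of the errors listed above (missing or mismatched tags, unclosed quotation marks, bad nesting) are well-formedness errors, which any XML parser reports with a line and character position. The sketch below uses Python's standard-library parser as one example; `well_formed_error` is an illustrative name. Note that it checks well-formedness only: errors such as incorrect element order or wrong case in element names need validation against the METS schema, which requires a validating parser such as the one in your XML editor.

```python
import xml.etree.ElementTree as ET

def well_formed_error(xml_text):
    """Return None if the text parses, else (line, column, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # position reported by the parser
        return line, column, str(err)
    return None

good = "<a><b/></a>"
bad = "<a><b></a>"  # <b> is never closed, so </a> does not match

print(well_formed_error(good))
print(well_formed_error(bad))
```

The parser stops at the first problem it meets, which is why errors tend to surface one at a time across repeated checks.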
Zipping together the METS file and associated digital objects to create the SIP
Once the METS file is completely error free it will need to be zipped together with the directory of files which are to be ingested. The METS file needs to be in the root directory at the same level as the highest directory of the accession. If we were re-creating the three.zip SIP, we would highlight the METS file and My Documents folder, right click, and select 'send to | compressed (zipped) folder' from the drop down menu as shown below:
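The same packaging step can also be scripted, which avoids accidentally zipping an extra enclosing folder around the manifest. The following is a sketch only: it assumes a staging directory that holds METS.xml alongside the top-level accession folder, the file names in the demonstration are hypothetical, and `build_sip` is an illustrative helper. Empty folders are not written to the archive by this sketch, and whether DirIngest needs explicit folder entries has not been verified here.

```python
import os
import tempfile
import zipfile

def build_sip(source_dir, zip_path, manifest="METS.xml"):
    """Zip everything under source_dir, keeping paths relative to it."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as sip:
        for folder, _subdirs, files in os.walk(source_dir):
            for name in files:
                full = os.path.join(folder, name)
                # Zip entry names use forward slashes on every platform.
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                sip.write(full, arcname)
    with zipfile.ZipFile(zip_path) as sip:
        names = sip.namelist()
    if manifest not in names:
        raise ValueError(manifest + " is not in the root of the SIP")
    return names

# Demonstration on a throwaway directory (hypothetical file names).
work = tempfile.mkdtemp()
staging = os.path.join(work, "staging")
os.makedirs(os.path.join(staging, "My Documents", "Music"))
for rel in ("METS.xml", os.path.join("My Documents", "Music", "track.mp3")):
    with open(os.path.join(staging, rel), "w") as handle:
        handle.write("placeholder")

entries = build_sip(staging, os.path.join(work, "sip.zip"))
print(sorted(entries))
```

The zip file is written outside the staging directory so that the archive never tries to include itself.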
Using the DirIngest Fedora Interface
Browse to the DirIngest interface (http://localhost/DirIngest/ingestSIP) using Mozilla Firefox:
Note: Microsoft's Internet Explorer will work if there are no problems with your SIP, but if the DirIngest service returns errors, Internet Explorer cannot display them and returns a rather unhelpful 500 error instead.
Browse to the zipped files and ingest. After a few moments or longer (depending on the size of your accession) you will either see a list of PIDs or (perhaps more likely) error messages. Hopefully you'll see a list of PIDs like this:
What it looks like in Fedora's Admin Client
The PIDs in this list are the Persistent IDs that Fedora has allocated to your digital objects. So PID demo:850 refers to our 'DC3000' audio/mpeg file and the Dublin Core and Licence metadata about it:
The labels for the datastreams (the tabs on the left) should look familiar: we assigned 'DC3000', 'DC3000_LICENSE' and 'DC3000-DC' as IDs for objects in the METS <fileSec>. The two other datastreams are the Dublin Core that Fedora automatically generates for an object (labelled DC) and the relationship metadata which has been constructed from our METS file (labelled as RELS-EXT).
Let's take a closer look at the RELS-EXT datastream:
Using RDF (Resource Description Framework) metadata, Fedora has recorded that demo:850 (DC3000) has a parent called demo:849 (the 'Mine' folder). All the relationships in the submission will be recorded in this way.
Common errors generated during the Fedora ingest process
Chances are that you won't see a list of PIDs the first time you create your own SIP and submit it to DirIngest. If our experiences are typical, the most likely type of error will be incorrect information in the <METS:FLocat LOCTYPE="URL" xlink:href="file:///mydoc------/> tag. This error means that the DirIngest service cannot locate the object because the filepath supplied in the METS file is incorrect. It is vital that the filepath in the METS file replicates the file names and spacing exactly as they appear in the original directory structure, even if this means replicating spelling mistakes, random capitalisation and odd spacing. In particular, the match is case sensitive. It is also crucial that internal IDs are consistent and that the ID attributes in the <fileSec> correspond with the FILEID attributes in the <structMap>. It is worth remembering that your XML editor cannot check for these errors: checking is a manual process and requires fastidious attention to detail.
Another common error is a missing end tag or mismatched tags: these are fairly straightforward to resolve and should have been identified and rectified prior to ingest during XML editor validation. (See the first example given below.)
Deciphering the error messages generated by Fedora: some common examples
Message: org.xml.sax.SAXParseException: The element type "METS:fileSec" must be terminated by the matching end-tag
This message indicates that an end tag is missing. In order to locate where in the METS file an end tag is missing it is necessary to view the file in an XML editor and validate. The line and character number of the missing tag will be generated by the validation process.
The location of the missing tag in the METS file should be just before the <structMap>:
Message: java.io.IOException: Unable to locate METS in zip file. It must be in the root.
This error message tells you that the METS file must be in the root of the zipped directory. The likely cause is that the METS file was not at the top level when the SIP was zipped, for example because an extra enclosing folder was included. See above on zipping and ingesting.
Message: namespace declaration error (exact message text not recorded)
This error message indicates that the namespace declaration in the METS header which points to the URL location of the METS XML schema is missing or incorrect. This header should look like this and is included in our template.
Message: java.io.IOException: ZIP entry not found: My Documents/Music/Mine/MyAlbumCoooover.jpg
This error message is generated when there is a problem with the <METS:FLocat> tag, generally when a space has been omitted or added, a word misspelt, or the wrong case given for a letter. In this example the DirIngest service could not find the document 'MyAlbumCover.jpg' because the filename had been copied incorrectly into the METS file:
Dealing with error messages in Fedora
The METS file will need to be opened and corrected in an XML editor before being re-zipped with its associated digital objects and the ingest process repeated. As the XML editor cannot detect inaccurate file directory locations in the <METS:FLocat> tag it may be necessary to repeat this process a number of times to eliminate all typos.
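Some of these round trips can be avoided by comparing the paths in the manifest with the entries in the zip before submitting it. The sketch below is not part of DirIngest: it assumes that each xlink:href has the form `file:///` followed by the path inside the zip (as the error message above suggests), that paths are not URL-encoded, and that the namespace URIs shown are the ones in your manifest. `missing_entries` and the `COVER` ID are illustrative names.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

METS_NS = "{http://www.loc.gov/METS/}"              # assumed namespace
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"   # assumed namespace
PREFIX = "file:///"                                 # assumed form of href values

def missing_entries(sip, manifest="METS.xml"):
    """List FLocat paths that have no exactly matching entry in the zip."""
    with zipfile.ZipFile(sip) as archive:
        names = set(archive.namelist())
        if manifest not in names:
            raise ValueError(manifest + " must be in the root of the zip")
        root = ET.fromstring(archive.read(manifest))
    missing = []
    for flocat in root.iter(METS_NS + "FLocat"):
        href = flocat.get(XLINK_HREF, "")
        path = href[len(PREFIX):] if href.startswith(PREFIX) else href
        if path not in names:  # set membership is case sensitive
            missing.append(path)
    return missing

# In-memory demonstration with a deliberately misspelt path.
METS = """<METS:mets xmlns:METS="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink"><METS:fileSec><METS:fileGrp>
  <METS:file ID="COVER"><METS:FLocat LOCTYPE="URL"
    xlink:href="file:///My Documents/Music/Mine/MyAlbumCoooover.jpg"/></METS:file>
</METS:fileGrp></METS:fileSec></METS:mets>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("METS.xml", METS)
    archive.writestr("My Documents/Music/Mine/MyAlbumCover.jpg", b"")

print(missing_entries(buffer))
```

An empty list means every path in the manifest has an exact counterpart in the zip; it does not guarantee that the ingest will succeed.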
Larger, highly structured accessions are likely to generate more errors than smaller accessions: the manual creation of METS files inevitably introduces human error. It is hoped that extraction tools will be developed in future which will at least generate the framework of the <fileSec> and <structMap>, automatically populating filepaths and generating file IDs. This would speed up the ingest process considerably and dramatically reduce the scope for error in the METS file.
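To illustrate what such a tool might do, here is a rough sketch that walks a directory and produces a bare <fileSec> and nested <structMap>. It is an assumption-laden outline rather than a working DirIngest manifest generator: the namespace URIs, the `file:///` href form and the `FILE0001`-style IDs are all hypothetical choices, and the output omits the METS header, MIME types, the Dublin Core metadata and the per-folder file groups described earlier.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"        # assumed namespace
XLINK = "http://www.w3.org/1999/xlink"   # assumed namespace
ET.register_namespace("METS", METS)
ET.register_namespace("xlink", XLINK)

def skeleton(top_dir):
    """Build a bare <mets> element with a fileSec and a nested structMap."""
    mets = ET.Element("{%s}mets" % METS)
    file_sec = ET.SubElement(mets, "{%s}fileSec" % METS)
    struct_map = ET.SubElement(mets, "{%s}structMap" % METS)
    counter = [0]
    base = os.path.dirname(os.path.abspath(top_dir))

    def add(path, parent_div):
        div = ET.SubElement(parent_div, "{%s}div" % METS,
                            LABEL=os.path.basename(path))
        if os.path.isdir(path):
            # Folders become container <div>s; recurse in a stable order.
            for name in sorted(os.listdir(path)):
                add(os.path.join(path, name), div)
            return
        counter[0] += 1
        file_id = "FILE%04d" % counter[0]  # hypothetical ID scheme
        group = ET.SubElement(file_sec, "{%s}fileGrp" % METS)
        entry = ET.SubElement(group, "{%s}file" % METS, ID=file_id)
        rel = os.path.relpath(path, base).replace(os.sep, "/")
        ET.SubElement(entry, "{%s}FLocat" % METS,
                      {"LOCTYPE": "URL", "{%s}href" % XLINK: "file:///" + rel})
        # The same generated ID is reused in the structMap pointer.
        ET.SubElement(div, "{%s}fptr" % METS, FILEID=file_id)

    add(os.path.abspath(top_dir), struct_map)
    return mets

# Demonstration on a throwaway directory (hypothetical file name).
work = tempfile.mkdtemp()
top = os.path.join(work, "My Documents")
os.makedirs(os.path.join(top, "Music"))
with open(os.path.join(top, "Music", "track.mp3"), "w") as handle:
    handle.write("placeholder")

demo = skeleton(top)
print(ET.tostring(demo, encoding="unicode"))
```

Because the paths and IDs are generated from the directory itself and written to both sections in one pass, the two most common manual errors (mistyped paths and mismatched IDs) cannot arise in the generated framework; the descriptive metadata would still need to be added by hand.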
Moving on to creating your own SIP - a temporary template
To assist us in testing the DirIngest service with some of our exemplar collections we created a template based on the three.zip METS file:
You can download the DirIngest template (5KB) if you want to create your own METS files for ingesting other materials into Fedora using its DirIngest service.