Harvesting websites with Adobe Acrobat Professional 7.0
The Paradigm pilot experience
Using Adobe Acrobat Professional 7.0 to archive websites is straightforward and intuitive. We were particularly impressed with the speed with which the software captured entire sites, though most of the sites captured in the pilot were not very large or complex. The longest capture time recorded was 9 minutes, for Richard Allan's weblog, which comprised 1581 pages.
The Paradigm project used the default settings during its pilot, so we can't report on the impact of changing any of the settings in the Settings dialogue box. We did not test the Stay on same path or Stay on same server options because we were interested in capturing whole sites.
Initial captures were made using the default Get only 1 level(s) rather than Get entire site. Using this option, we found that once we had captured the home page (i.e. the first level), clicking on a link in the newly created Adobe PDF file would instruct the software to create a PDF file for the page selected. This feature can be used to capture entire websites manually by systematically clicking on each link from the site's home page menu bar(s).
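The level-by-level process described above amounts to a breadth-first walk over a site's links. The following is a minimal sketch in Python of that idea only; it is not how Acrobat works internally. The names `crawl_levels`, `LinkParser`, `fetch` and the `SITE` example data are all hypothetical, and restricting the walk to links on the same host is an assumption made for the illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_levels(start_url, fetch, max_levels=None):
    """Return {url: level} for pages reachable from start_url.

    fetch is a hypothetical callable that takes a URL and returns the
    page's HTML as text, or None if the page is unavailable.
    max_levels=1 records only the start page; None follows every link
    found on the same host (an assumption of this sketch).
    """
    host = urlparse(start_url).netloc
    seen = {start_url: 1}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        level = seen[url]
        if max_levels is not None and level >= max_levels:
            continue  # deepest requested level: record it, follow nothing
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target = urldefrag(urljoin(url, href)).url
            if urlparse(target).netloc == host and target not in seen:
                seen[target] = level + 1
                queue.append(target)
    return seen


# Hypothetical four-page site held in memory, so no network is needed.
SITE = {
    "http://example.org/": '<a href="about.html">About</a> <a href="news.html">News</a>',
    "http://example.org/about.html": '<a href="/">Home</a> <a href="team.html">Team</a>',
    "http://example.org/news.html": '<a href="http://other.example.com/">External</a>',
    "http://example.org/team.html": "",
}

first_level_only = crawl_levels("http://example.org/", SITE.get, max_levels=1)
whole_site = crawl_levels("http://example.org/", SITE.get)
```

Here `max_levels=1` corresponds loosely to capturing only the home page, while leaving it unset corresponds to following every menu link in turn until no new pages remain.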
An example - the Paradigm project website
This website was captured on 16 March 2006. As you can see from the image below, Adobe retains the structure of the website, but it is unable to render the institutions' logos in their proper place.
This is the full snapshot of the website: Download as PDF document (176kb)
Conclusions
Upside
- preserving a PDF representation of a website is potentially simpler than preserving all the component bit streams which compose it, although TIFF may be a more reliable format for this kind of preservation strategy
- the software is easy to use
Downside
- the software is expensive
- there is no option to schedule repeat harvesting of websites. This would be a very useful option for users taking periodic snapshots of websites
- the layout of most webpages does not sit comfortably in a page-oriented document, so webpages run over page boundaries
- PDF cannot capture more sophisticated website functionality
Adobe Acrobat Professional 7.0 is simple to use, but the manual intervention required means that it is most suitable for small-scale web preservation. The software costs around £250, unless your institution has a site licence.