Harvesting websites with Adobe Acrobat Professional 7.0


The PC must be connected to the internet and have Adobe Acrobat Professional 7.0 installed.


Open Adobe Acrobat.

From the menu bar, select Create PDF.

Select the From Web Page option from the drop-down menu (alternatively, press Shift+Ctrl+O).

The following dialogue box will appear:

[Figure: Create PDF from Web Page dialogue box]

Enter the required URL, or select a URL from the drop-down menu of previously archived sites.

Beneath the URL box there are several options. The default is Get only 1 level(s); the number of levels is preset to 1 and can be altered using the up and down arrows. The alternative, which is not selected by default, is Get entire site. Two further settings can also be selected: Stay on same path and Stay on same server.
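The effect of the two Stay on ... settings can be pictured as a filter applied to each link found during the capture. The following is a minimal Python sketch of that idea, not Acrobat's own logic: the function name in_scope, its parameters and the example.org URLs are all hypothetical, and it assumes that "same path" means at or below the directory of the starting page and "same server" means the same host name.

```python
from urllib.parse import urlsplit


def _directory(path):
    """Return the directory part of a URL path, always ending in '/'."""
    path = path or "/"
    return path[: path.rfind("/") + 1]


def in_scope(start_url, candidate_url, same_path=True, same_server=True):
    """Decide whether a linked page would be followed (illustrative only).

    same_server: the candidate must have the same host as the start URL.
    same_path:   the candidate must sit at or below the start URL's directory.
    """
    start = urlsplit(start_url)
    candidate = urlsplit(candidate_url)

    if same_server and candidate.netloc.lower() != start.netloc.lower():
        return False

    if same_path:
        base = _directory(start.path)
        if not (candidate.path or "/").startswith(base):
            return False

    return True
```

Under these assumptions, a capture starting at http://example.org/blog/index.html with both settings selected would follow http://example.org/blog/2007/post.html but not http://example.org/about.html.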

At the bottom of the dialogue box there is a Settings button, which allows the user to specify the output. Selecting this button displays the following dialogue box:

[Figure: web page conversion dialogue box]

This dialogue box has two tabs: General and Page Layout. In the General tab it is possible to select the file format that will be created. The default setting is Acrobat PDF Format, but there are a number of other options, including GIF Image Format, HTML and Plain Text. The General tab also contains the PDF settings. Three are selected by default: Create bookmarks, Place headers & footers on new pages and Save refresh commands. A fourth, Create PDF tags, is not selected by default.

The Page Layout tab gives the option to change the page size (A4, A5, Legal, Letter, etc.); the default selection is Letter. Width, height and margins can all be specified here, along with scaling options. The default scaling options are Scale wide contents to fit page and Switch to landscape if scaled smaller than 70%.
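The interaction between the two scaling options can be illustrated with a little arithmetic. The sketch below is one plausible reading of the option names, not a description of how Acrobat actually lays out pages: the function fit_page and its parameters are hypothetical, all measurements are in inches, the margin value is a placeholder, and the page dimensions default to Letter (8.5 x 11).

```python
def fit_page(content_width, page_width=8.5, page_height=11.0,
             margin=0.0, landscape_threshold=0.70):
    """Return (orientation, scale) for web content of the given width.

    Wide content is shrunk to fit the printable width. If the required
    scale falls below the threshold (70% by default), the page is turned
    to landscape and the scale is recomputed against the longer side.
    """
    def scale_for(width):
        printable = width - 2 * margin
        # Never enlarge content that already fits.
        return min(1.0, printable / content_width)

    scale = scale_for(page_width)
    if scale < landscape_threshold:
        return "landscape", scale_for(page_height)
    return "portrait", scale
```

For example, under these assumptions content 10 inches wide would be scaled to 85% and stay in portrait, while content 17 inches wide would need 50% in portrait and would therefore switch to landscape.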

To capture a website, once the required settings have been selected, simply click on Create.

Once the creation of the PDF files is complete, any errors encountered during the capture are reported in a dialogue box entitled There were errors. There is an option to copy this error log to the clipboard, from where it can be pasted into a new file using a simple text editor. The error log should be saved for each capture: use a meaningful title and save the log as a plain text file (.txt) in the same folder as the archived website.
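Where many captures are made, saving the pasted log could be scripted rather than done by hand in a text editor. The following is a minimal Python sketch under that assumption; the function save_error_log and its parameter names are hypothetical, and the log text is assumed to have been copied from the There were errors dialogue box beforehand.

```python
from pathlib import Path


def save_error_log(log_text, archive_folder, title):
    """Write the copied error log as a plain text file next to the capture.

    log_text:       the text copied from the error dialogue box
    archive_folder: the folder holding the archived website
    title:          a meaningful title for the log, without extension
    """
    folder = Path(archive_folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    log_path = folder / f"{title}.txt"
    log_path.write_text(log_text, encoding="utf-8")
    return log_path
```

The function returns the path of the saved file, so the location of each log can be recorded alongside the capture.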

[Figure: There were errors dialogue box, displaying the error log after capturing Boris Johnson's blog]

The default title for the archived website is the URL of the archived site; we decided to retain this but to prefix it with the date of capture (yyyymmdd) so that the snapshots of an archived website would be presented chronologically.
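The reason a yyyymmdd prefix gives chronological order is that such dates sort correctly as plain text. A minimal Python sketch of the naming convention follows; the function snapshot_title, the underscore separator and the example URL are assumptions made for illustration, since only the date prefix itself is specified above.

```python
from datetime import date


def snapshot_title(url, capture_date):
    """Prefix the default title (the URL) with the capture date as yyyymmdd."""
    # The underscore separator is an assumption, not part of the convention.
    return f"{capture_date:%Y%m%d}_{url}"


# Hypothetical captures of the same site on three dates, listed out of order.
titles = [
    snapshot_title("http://www.example.org/", date(2007, 11, 2)),
    snapshot_title("http://www.example.org/", date(2006, 12, 15)),
    snapshot_title("http://www.example.org/", date(2007, 3, 5)),
]

# An ordinary alphabetical sort now lists the snapshots oldest first.
chronological = sorted(titles)
```

Because the year comes first and the month and day are zero-padded, alphabetical and chronological order coincide, which would not be the case for a ddmmyyyy prefix.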
