
Who will save the Olympics?

The Pandora archive and other digital preservation case studies at the National Library of Australia

Authored by Colin Webb
Director, Preservation, National Library of Australia
Presented by Lydia Preiss
Manager, Collections Preservation, National Library of Australia

1. Introduction

Who remembers the Games of the 27th Olympiad held in Sydney, less than 12 months ago? These Games generated a huge Web presence, not just official information, but also an outpouring of sites associated with all the interests surrounding the Olympics and their impact on Sydney and Australia.

Well, the Sydney Olympics are dead and gone - and so is most of the Web presence associated with them. Almost faster than the athletes themselves, Web sites started changing, disappearing, as Sydney, Australia and the world moved on. Hardly the end of civilisation as we know it, but a loss echoed many times over as digital information resources turn and walk away from their own closing ceremonies.

The purpose of this presentation is to tell you of some of the experiences of the National Library of Australia in trying to ensure long-term retention of digital information resources, including those associated with the Sydney Olympics. In doing so, I will need to talk about NLA's intentions and experiences with its own digital collections, and about our attempts to build a national model.

Somewhere along the way I am sure to mention collaboration at an international level as well. Our ability to take effective action is as much a reflection of the help and influence of others as it is a reflection of the National Library's own capabilities. We feel it is very much a part of our case study.

I've been asked to focus some attention on planning, policy, implementation and costs. Although that sequence makes perfect sense in terms of project planning, I am going to reverse some of it, because I would like you to know what we're talking about - the implementation - before discussing the why of it, which will come out in looking at planning and policy. At the end I will try to say something about costs, although it is hard to say anything meaningful.

2. Implementation

Let's start with PANDORA. (For this section I am particularly indebted to the manager of our Electronic Unit, Ms Julie Whiting, for the latest details on Pandora.)

Pandora has already been described to you in an earlier presentation, but to remind you, it is an archive of Australian online publications selected from the Web, managed by the NLA with a number of partners, at present consisting of the State Libraries of Victoria and South Australia and the national film and sound archive, ScreenSound Australia.

Currently the archive holds around 1,300 titles, made up of approximately five million files. This represents about 134 Gigabytes of data.

It is interesting to look at the way the archive has grown, from 0.2 GB in 1997, to 7.1 GB in 1999, to 134 GB in mid-2001.
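
As a rough illustration of what those figures imply, the yearly growth factor can be worked out from the three sizes quoted above. This is only a sketch: treating "mid-2001" as 2001.5 and assuming smooth compound growth between measurements are simplifications made for the example, not a description of how the archive actually grew.

```python
# Archive sizes in gigabytes, as quoted in the text.
# Assumption: "mid-2001" is treated as the year 2001.5.
sizes_gb = [(1997.0, 0.2), (1999.0, 7.1), (2001.5, 134.0)]


def yearly_growth_factor(start, end):
    """Return the compound growth factor per year between two (year, size) points."""
    (year0, size0), (year1, size1) = start, end
    return (size1 / size0) ** (1.0 / (year1 - year0))


# Compare each measurement with the one that follows it.
for start, end in zip(sizes_gb, sizes_gb[1:]):
    factor = yearly_growth_factor(start, end)
    print(f"{start[0]:.1f} to {end[0]:.1f}: about {factor:.1f}x per year")
```

On these assumptions the archive grew roughly sixfold a year between 1997 and 1999, and a little over threefold a year between 1999 and mid-2001.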

When selecting an item for inclusion in the archive, NLA staff decide, usually in negotiation with the owner of the material, whether it should be captured on a regular basis to reflect changes, or simply captured once. Around two-thirds of the titles in the archive are captured once only. The other titles are gathered regularly, in accordance with a schedule agreed with the owner: this could be anything from weekly to annually.
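
The capture-once versus capture-regularly decision could be recorded along the following lines. This is a minimal sketch, not a description of the Library's systems: the `Title` class, the schedule names and the day counts are all hypothetical, chosen only to cover the "weekly to annually" range mentioned above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical schedule names mapped to an interval in days.
# "once" means the title is captured a single time and never re-gathered.
INTERVAL_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 91, "annually": 365}


@dataclass
class Title:
    name: str
    schedule: str                       # "once" or a key of INTERVAL_DAYS
    last_gathered: Optional[date] = None


def next_gather_date(title: Title) -> Optional[date]:
    """Return the date the title is next due, or None if no gather is due."""
    if title.last_gathered is None:
        return date.min                 # never captured: due immediately
    if title.schedule == "once":
        return None                     # already captured once, nothing further
    return title.last_gathered + timedelta(days=INTERVAL_DAYS[title.schedule])


def due_titles(titles, today):
    """List the titles whose next gather date is today or earlier."""
    due = []
    for t in titles:
        when = next_gather_date(t)
        if when is not None and when <= today:
            due.append(t)
    return due
```

A title marked "once" drops out of the work list after its first capture, while the others reappear whenever their agreed interval has passed.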

The tasks of identifying and taking in material are managed by a small team of around six people in what we call our Electronic Unit, which is part of our Technical Services area (the part of the Library responsible for acquiring and cataloguing material). They are supported by our IT section and by a small Digital Preservation team in Preservation Services that has been involved in both the planning and oversight of Pandora, and is responsible for guiding the long-term maintenance of the archive.

Most items in PANDORA are freely available on the Web and are gathered from the publisher's site using website copying software. This "pull" approach has been taken in an effort to minimise the impact of archiving on publishers. Our aim is to make the process as simple and painless as possible.

Over the years, Library staff have struggled with some rather clunky tools for gathering files and for managing the ingest processes. Currently we use software called HTTrack for harvesting - it seems to manage quite well, although like all the file-gathering tools we have used, it does not cope with database-driven, dynamic material.

We are also about to start using a Library-developed suite of tools called PANDAS (Pandora Digital Archiving System) that will manage a number of the processes involved in ingesting and providing access to archived publications.

Pandas will record some metadata for resource discovery and for administration of the material in the archive. It will also record some technical metadata about file formats although at this stage it is unlikely to include all the preservation metadata we would like.

The exciting thing from the point of view of a national system is that Pandas will support distributed archiving, in that it can be used at a number of sites so all participants can use the same centrally-managed system. This will work regardless of whether the gathered files are stored on the NLA's central server or on the archiving partner's own local server.

I must emphasise that NLA's digital preservation activity is not confined to online publications and Pandora. We are also trying to manage ongoing accessibility for a collection of around 3,000 "offline" publications issued on physical format carriers such as diskettes and CD-ROMs, and a growing number of unidentified diskettes turning up in recent Manuscripts acquisitions. We document hardware and software dependencies where we can, and copy material from diskettes to either more stable carriers like CD-R or more stable systems like a backed-up hard disk.

Where we are unable to provide access using current systems, we have had to resort to data recovery tools that try to recognise file formats and offer translations into formats we can handle. Of course, sometimes we have to decide that the benefits simply aren't worth the effort.

The other major kind of digital material that we need to maintain is the mountain of files coming out of our digital imaging and digital audio archiving programs. I won't discuss them further here, except to say that we don't expect them to be as difficult to maintain as the digital publications I have been talking about. Unlike publications coming in from somewhere else, we have control over the way our own digitisation files are made, and we're in a position to choose file formats for which we expect to find viable migration paths.

All of these forms of digital collections are meant to be managed through two major systems innovations the Library is procuring. One is a digital object storage system, to store, back-up and maintain the integrity of all the byte-streams. The other is a digital object management system, to manage the masses of metadata that will control the collections. NLA is very keen to share experiences in setting up such systems with other institutions building similar technical infrastructure.
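
One common way to check the integrity of stored byte-streams is to record a checksum for every file and compare it again later. The sketch below illustrates that general technique only; it is not the Library's storage system, and the function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Compare the files under root with a stored manifest.

    Returns (missing, changed): paths that have disappeared, and paths
    whose content no longer matches the recorded digest.
    """
    current = build_manifest(root)
    missing = sorted(set(manifest) - set(current))
    changed = sorted(
        p for p in manifest if p in current and current[p] != manifest[p]
    )
    return missing, changed
```

A manifest built at ingest and re-verified after every restore or migration gives early warning that a file has been lost or altered.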

Planning processes

Planning is surely based on recognition of need, so NLA's planning for digital preservation could be said to have started in the early 1990s when it recognised that it needed to extend its collection building and collection maintenance roles to incorporate digital publications.

While we are pleased to be where we are now in 2001, you need to know that we went through a period of almost debilitating confusion over how to approach digital resources. We knew they were being produced, we knew people were starting to use them, we knew it didn't take much imagination to see many of them as publications, but just how important was this stuff? The more we looked at the complex issues involved, the less possible it seemed. As one colleague said recently: "If you look at all the issues at the one time, you'd have to be crazy to ever start!"

In the mid-1990s, we started. We started small, modestly and tentatively. The Library decided to walk a path that would allow us to take relatively simple steps in dealing with quite complex issues. From the beginning we were determined to build a close relationship between our evolving conceptual framework and our practical experience.

Our approach developed through a series of modelling exercises, starting in 1995 as an attempt to define a very modest pilot project, and leading to the PANDORA proof-of-concept project, which was launched in 1996. From 1997 the Library spent a lot of time modelling business processes and a logical data model for the growing archive.

Through the following two years the Library deepened its commitment to PANDORA, recognising that it could only be secured and progressed with greatly increased technical capability. The Library also began taking a serious interest in the OAIS Reference Model. Having already developed many concepts in its own planning exercises, the Library has used the Reference Model as a way of refining and validating the model for PANDORA.

In late 1998 the Library published its Digital Services Project Information Paper which sets out requirements for a technical infrastructure to collect, store, provide access to, and manage the PANDORA Archive, as well as to support the management of other digital and analogue collections. Over the past two years these requirements and understandings have been refined through a series of public procurement processes aimed at delivering greatly improved collecting tools, metadata searching, archive management, and digital object storage systems - the systems I have mentioned under Implementation.

What are some of the characteristics revealed through these planning stages?

A first thing to say is that planning has been driven by the Library's objectives. When I suggest the kinds of costs that have been involved, it will be clear that such investments would not have been made if the outcomes were not very important to the Library.

A second thing to note about our planning is persistence. Within NLA there is a kind of dogged refusal to give up on good ideas. So long as the ideas are sufficiently grounded in need, sufficiently tied to our objectives, and sufficiently forward thinking, we see no reason to despair just because we can't do or get what we want straight away.

We are certainly prepared to apply this kind of long-range persistence to achieving the preservation required for our digital collections.

A third characteristic is our commitment to learning. While pleased with Pandora, we have never thought we could afford to ignore other ideas. Evaluation and feedback, and the sharing of information, are two of our most powerful planning tools.

Such thinking drives the PADI subject gateway on digital preservation which we maintain. PADI was set up specifically to ensure that anyone who is struggling with digital preservation issues, most certainly including NLA, would have ready and enduring access to a wide range of quality information to inform planning and decision-making.

Sometimes the Library's planning involves action aimed at making our implementation work better. There are examples of this in our digitisation policy, which governs the qualities that will make our own digital output preservable and worth preserving. There are examples in the work we are doing with commercial publishers to define a code of practice for submitting publications under legal deposit, if and when digital legal deposit comes our way.

NLA's planning continues. Currently we're planning on three main fronts:

  1. The first is the detailed implementation of the new technical systems to support PANDORA and our other digital collections, which I have already mentioned.
  2. A second current focus relates to national action. We initially hoped that partners would simply come forward and declare their own interest in joining us to preserve our shared digital documentary heritage. On the contrary, it has required quite an effort to engage others. This is true despite a history of excellent cooperation in the Australian library system.

    In a couple of weeks' time, we hope to take a next important step as NLA meets with all State and Territory libraries, with a mandate from the Council of Australian State Libraries to "get on with it" - in other words, to sign commitments to take practical steps towards setting up digital archiving programs in all the concerned libraries.
  3. The third area of planning activity concerns long-term accessibility. By digital preservation the Library means preserving the byte-streams, preserving accessibility, and maintaining the ability to find and connect.

The National Library's digital collections are subject to the same threats and challenges discussed in earlier presentations, and we are interested in the same approaches to addressing those threats. Our planning aims to move us out of the stasis of wondering when workable strategies will appear, and into action that will help take us forward.

Some of our digital collections, especially older "offline" materials, already present accessibility challenges. For most of our digital collections, however, the risk of loss is in the future - a future coming rapidly towards us, but still hypothetical.

Last year we made an assessment of files in the Pandora archive. We found in excess of 200 file formats, and approximately half a million HTML pages. Of the HTML documents, 127 pages contained tags that have been eliminated from HTML version 4 - in theory, version 4 browsers should not recognise these tags. There were also 7 million instances of tags that had been deprecated, or marked for future elimination in the next version of the HTML standard. Is this going to be a problem? So far, browsers have proved to be extremely tolerant of non-standard HTML, so we may not have a short-term problem; on the other hand, we could take it as a warning sign that action will be needed in the future to keep these files accessible.
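
An assessment of this kind can be sketched with a simple tag counter. The example below is illustrative only and is not the tool used for the Pandora assessment: the element lists are those the HTML 4.01 specification marks as deprecated and as obsolete, and the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements marked "deprecated" in the HTML 4.01 specification.
DEPRECATED = {"applet", "basefont", "center", "dir", "font",
              "isindex", "menu", "s", "strike", "u"}
# Elements the HTML 4.01 specification lists as obsolete.
OBSOLETE = {"listing", "plaintext", "xmp"}


class TagAudit(HTMLParser):
    """Count deprecated and obsolete start tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.deprecated = Counter()
        self.obsolete = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method.
        if tag in DEPRECATED:
            self.deprecated[tag] += 1
        elif tag in OBSOLETE:
            self.obsolete[tag] += 1


def audit(html_text):
    """Return (deprecated, obsolete) tag counts for one page of HTML."""
    parser = TagAudit()
    parser.feed(html_text)
    parser.close()
    return parser.deprecated, parser.obsolete
```

Run over every HTML file in an archive and summed, counts like these give a crude but repeatable measure of how much material depends on browser tolerance.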

For all of our digital collections, there are some things we believe we can do that will make it easier to plan and implement effective preservation strategies when they are needed. The things we are doing include:

  • Getting to know our collections better, understanding what formats we hold and their dependencies.
  • Defining the "significant properties" we will have to maintain to achieve effective preservation.
  • Understanding the nature of the threats to long-term access.
  • Developing good risk indicators, hopefully giving sufficient warning to allow us to take action in a manageable way.
  • Testing and recording preservation metadata to support both effective presentation, and good management of our digital collections.
  • Postulating and where possible testing, appropriate preservation pathways for different formats. (Presumably the way we define "appropriate preservation pathways" will change over time, but for now we think it includes criteria such as: providing access to the significant properties defined for the resource; maintaining evidence of authenticity; compliance with intellectual property and other legal and moral rights; cost-effectiveness; and supporting ongoing access and preservation - in other words, not looking like a dead-end that will block access when the operating environment changes again in the future.)
  • Investigating the feasibility and benefits of building a collaborative archive of the software needed to run material in the archive.
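
The first and fourth of these activities (knowing what formats are held, and flagging the ones that need watching) can be sketched very simply. The example below is a rough illustration under stated assumptions: it treats the file extension as a stand-in for the format, which is much cruder than real format identification, and the function names and the watch list are hypothetical.

```python
from collections import Counter
from pathlib import Path


def format_inventory(root):
    """Count files under root by lower-cased extension.

    Assumption: the extension is used as a stand-in for the format.
    Files with no extension are counted under "(none)".
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def at_risk(counts, watch_list):
    """Return the inventory entries whose extension is on a watch list."""
    return {ext: n for ext, n in counts.items() if ext in watch_list}
```

The watch list is where judgement comes in: it would be maintained by hand as particular formats or their viewers start to lose support.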

These actions do not add up to a preservation strategy in themselves, but they will help us take action at the right time. The Library is currently preparing a Digital Preservation Policy statement that will give coherence to these plans.

Policy frameworks

One thing to say about policy in NLA is that there is a lot of it! The Library accepts a large degree of experimentation, and values pragmatism ahead of purism, but we also believe in coherence and accountability. Eventually, everything we do should be explainable by reference to publicly available policy.

Of many important policy areas relevant to our digital archiving and preservation aspirations, I want to mention just one: the business principles behind Pandora.

Business principles of Pandora

Very early on, we decided that we needed some principles that would guide us through the complexities of setting up and managing a digital archive. Currently they include:

  1. Online only. Pandora only accepts publications that appear solely in an online form. Where an online publication contains the same information as a print or CD-ROM version, Pandora has not accepted the online version. This principle is driven by purely pragmatic considerations, as we seek to limit the size of the task. Applying this principle has eliminated about 70% of the material that would otherwise have been selected.

    Personally, I think this principle will change as we recognise an increasing demand for digital access. It also appears that the number of online publications with print equivalents is declining rapidly.
  2. There is a principle of retaining the look and feel of publications, as well as their intellectual content. This principle takes us in two interesting directions. It implies that we need to define the preservable essence of what we are seeking to preserve - what the Cedars project calls the "significant properties"; it also implies a high level of quality control to ensure that the publications we are taking into the archive actually do work.
  3. Another principle we apply is that archived publications are part of the national bibliography. We should use the National Bibliographic Database to record their existence and to record explicit preservation intentions.
  4. Pandora operates on a principle of respect for intellectual property and other rights. At the same time, Pandora collects and maintains material so that it can be accessed, even if not immediately. Consequently, capturing only proceeds when we have negotiated access agreements with rights owners, and our systems are designed to manage access in accord with those agreements. Access restrictions are applied to a number of items for commercial reasons, for privacy or cultural reasons, or as part of a policy decision for certain categories of material such as adult material. Access can be restricted to onsite use only, for a specified time period, or can be password protected so that only designated researchers can obtain access.
  5. We also value adequate version control, so that it is easy for users (including NLA staff managing the archive) to know what they are looking at. A special application of this principle is ensuring that users realise they are looking at the archived version rather than the live version available from the owner's site, which is likely to contain more current information.
  6. We intend to use some kind of persistent identifier for publications archived in Pandora. The Library is currently investigating the approach it wants to take, and recently released a public document setting out some of the principles we think should apply.
  7. Finally, the archive is selective, and does not attempt to collect any more than a small portion of all Australian information available on the Web. Its focus is on material deemed to have national significance and enduring value.

Selectivity is a reasonably controversial issue amongst national libraries - presumably other people are free to choose whatever fits their collecting purposes.

There isn't time to discuss the benefits we achieved from being selective - I have included some notes in an appendix to the printed version of this paper. However, I will mention that we have recently employed a consultant specifically to help us explore the feasibility of taking comprehensive snapshots of the entire Australian domain, in addition to our ongoing work with Pandora.

Costs

Finally, I need to say something about costs. There are so many unknowns in our ongoing management of digital collections that it is too early to say anything definitive, except that it is likely to be expensive, and the expenses are certain to be ongoing.

The Library's budget for procuring the two systems to store and manage our digital collections that I mentioned earlier is well over $1 million Australian, and I guess we expect those systems to be viable for just a few years.

Those costs do not take account of staff costs to get us where we are now, and to manage the archive on an ongoing basis. If we multiply the average cost of the staff directly involved by the number of staff and by the number of years, we would probably arrive at a figure of around $3.5 million Australian, so far. (Of course, these don't add up to nearly as much in American dollars.)

The National Library of Australia has invested in a great deal of R&D work to get where it is with its digital collections. The ball park costs I've just suggested are ones that could be greatly reduced once the R&D is finished. On the other hand, it's hard to say just when that will be. It is also hard to say what future R&D work will be needed as digital technology continues to develop and change.

For NLA, the costs are truly significant, but we don't believe we have a choice - if we do have one, it is the choice of how much to collect and preserve, not whether. For a National Library, digital materials are as much a part of the job as any other information resource.

Digital archiving and preservation is deeply embedded in the National Library, being funded from our normal recurrent budgets rather than one-off special project funding. That arrangement has its advantages and disadvantages: naturally, it feels like quite a burden to have to manage these complex challenges without any additional funding, but it has also been a great advantage to manage these challenges not as an add-on dependent on outside funds, but as core Library business.

Oh yes, for all that cost you'll be pleased to know that we've kept the Olympics, Web-style - and a lot else besides.

Appendix

Additional information about the National Library of Australia's digital archiving and preservation programs

Managing Pandora

Growth rates
Growth rates have varied, partly reflecting things like the Sydney Olympics and elections, both of which have been great generators of Web activity. Last year we archived 372 new titles and about 400 re-gatherings of material for titles already in the archive. This year we expect to archive around 500 new titles and around the same number of re-gathers.

Functions of PANDAS
PANDAS is designed to do a number of key things:

  • manage the metadata about titles that have been selected or rejected for inclusion in the archive
  • initiate gathering of titles
  • manage the process of quality checking and fixing problems
  • prepare the item for public display and generate a title entry page
  • manage access restrictions, and
  • provide management reports.

Quality control
Link checking is a useful tool for identifying missing files so that they can be gathered. Staff also do a manual check of the whole site to identify problems such as JavaScript not working correctly, missing Shockwave files (these don't show up in link checks), broken links due to coding errors or case sensitivity, and RealMedia files that have not been captured (often only a metafile is delivered). This is a very time-consuming process, and the problems cannot always be completely rectified. However, as these items have been selected as nationally significant, we seek to ensure that the archived version is as faithful to the original as possible.
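
A basic link check of an archived page can be sketched as follows. This is an illustration of the general technique, not the checker used for Pandora: the names are hypothetical, only relative `href` and `src` links are examined, and external links and site-absolute paths are deliberately skipped.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect the href and src attribute values found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def missing_targets(page_path):
    """Return relative links in page_path whose target file is absent.

    Each result is (link, reason), where reason is "missing" or
    "case mismatch" (a file exists whose name differs only in case).
    """
    page_path = Path(page_path)
    collector = LinkCollector()
    collector.feed(page_path.read_text(encoding="utf-8", errors="replace"))
    problems = []
    for link in collector.links:
        url, _fragment = urldefrag(link)
        parsed = urlparse(url)
        if parsed.scheme or parsed.netloc or not parsed.path:
            continue  # external link or same-page anchor: not checked here
        if parsed.path.startswith("/"):
            continue  # site-absolute path: skipped in this sketch
        target = page_path.parent / unquote(parsed.path)
        folder = target.parent
        # Compare against the directory listing so that case differences
        # are detected even on a case-insensitive file system.
        names = [p.name for p in folder.iterdir()] if folder.is_dir() else []
        if target.name in names:
            continue
        lowered = [n.lower() for n in names]
        reason = "case mismatch" if target.name.lower() in lowered else "missing"
        problems.append((link, reason))
    return problems
```

As the text notes, a check like this cannot see files that are requested by scripts or plug-ins rather than by ordinary links, which is why the manual check is still needed.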

Access agreements
The Library's preferred model is for the publisher to allow onsite use only in all of the Pandora partners for an agreed time, preferably a maximum of 5 years. After this time, the title becomes freely available to external users anywhere. If the publisher is unwilling to allow this, then other options are offered, such as onsite use only, restricted to a single Pandora partner, and/or restrictions for a longer time period.
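
The preferred model and its fallbacks amount to a small decision rule, sketched below. This is a hedged illustration rather than the way PANDAS manages restrictions: the `AccessAgreement` fields, the function names and the site codes in the usage note are hypothetical, and password-protected access for designated researchers is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Set


@dataclass
class AccessAgreement:
    archived_on: date
    # Years of onsite-only use; None means restricted indefinitely.
    onsite_only_years: Optional[int] = 5
    # Partner sites allowed during the restriction; None means all partners.
    permitted_sites: Optional[Set[str]] = None


def open_date(agreement):
    """Return the date the title becomes freely available, or None if never."""
    if agreement.onsite_only_years is None:
        return None
    d = agreement.archived_on
    try:
        return d.replace(year=d.year + agreement.onsite_only_years)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 1 March.
        return date(d.year + agreement.onsite_only_years, 3, 1)


def may_access(agreement, today, user_site=None):
    """Decide whether a request may be served.

    user_site is the partner library the user is sitting in, or None
    for an external user on the open Web.
    """
    opens = open_date(agreement)
    if opens is not None and today >= opens:
        return True                 # restriction period has expired
    if user_site is None:
        return False                # external users wait for the open date
    if agreement.permitted_sites is None:
        return True                 # onsite use at any partner
    return user_site in agreement.permitted_sites
```

For example, `AccessAgreement(date(2001, 7, 1))` models the preferred arrangement: onsite use at any partner until 1 July 2006, then open access, while passing `permitted_sites={"NLA"}` restricts onsite use to a single partner.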

Storage
The titles in Pandora are currently stored in a UNIX file system. The archive is fully backed-up, as one might expect. Having had disk hardware failures that required complicated re-building of the archive, we fully appreciate the importance of having well-organised, well-managed and well-secured back-ups in place.

Selectivity
NLA's experience is that a selective approach makes it possible for us to:

  • Inject a high level of quality control into the process
  • Negotiate explicit access rights
  • Develop a cooperative relationship with creators and publishers
  • Develop and maintain an intimate working knowledge of new web design features likely to impact on archiving and preservation
  • Realistically commit to recording metadata
  • Realistically commit to maintaining accessibility.

Because these are things we value, we believe we have to pursue selection.

Others, including the Internet Archive, and the National Library of Sweden's Kulturarw3 program, take a more comprehensive approach, aiming to collect a complete picture of the Web in the domain in which they are capturing.

The selection process we currently use for Pandora is not mysterious. Based on our existing Collection Development Policy, we drafted a set of selection guidelines in the early days of Pandora, and we regularly review and revise them. They emphasise Australianness - and I suggest you look on our web site, if you want to know what that means.

About the presenter

Lydia Preiss

Lydia Preiss is a graduate from the first intake to the 'Conservation of Cultural Materials' applied science degree in Canberra, Australia, in the early 1980s. She has a varied background in paper conservation and library preservation and is currently responsible for the National Library's collection preservation management programs. These include collection treatments, preservation reformatting, and preventive preservation programs such as collection housing, environmental and disaster management, as well as staff and user education in the care and use of the collections.

Her relationship with the National Library of Australia commenced with an internship in the Library's conservation laboratory. During her time at the Library, she was part of the development and transformation of Library conservation practice to the broader Library Preservation Management approach. She has found it a challenging and rewarding experience to come full circle from conservation practitioner to preservation manager.

Whilst Lydia's responsibilities remain with the more traditional library preservation programs, she is finding herself and her staff drawn into the digital world by the need for the preservation profession to get involved and respond to the challenges of digital technology.

She has a keen interest in the evolving future of digitisation for preservation and access, alongside maintaining and integrating with traditional library conservation and preservation practice.


About the author

Colin Webb is the Director of Preservation at the National Library of Australia, Canberra. After professional training as a bookbinder and as a book, paper and photographic conservator, he worked for the National Archives of Australia as a preservation manager for more than a decade before moving to the National Library in the early 1990s to set up a new program in information preservation. He created the first - and still the only - specialist positions in digital preservation in Australia (partly because he couldn't stand the thought of spending the rest of his life in endless discussions about metadata), and he has sought to bring a strong preservation perspective to the National Library's digital initiatives.

He is a member of the RLG Preserv Advisory Council, and is very pleased to be associated with OCLC through membership of the joint OCLC/RLG Working Groups on Preservation Metadata, and on the Attributes of Reliable Digital Repositories.


Additional resources

Presentations from the Digital Preservation Resources Symposium 2001