I’ve finally had time to consider David Rosenthal’s response to my argument in favour of format normalisation as a preservation strategy. While I largely agree with his position on format obsolescence (with some caveats I’ll return to in a future post), we do appear to disagree on a more fundamental level – on what it is we are actually trying to preserve. In David’s post, my normalisation approach is described as simply ‘improving access’, and by implication, not about preservation. Certainly, if access is not taken into account at all, then preservation is just about keeping the bits safe (by definition). However, in many cases a blind bit-bucket is not sufficient, because access is what we are really trying to preserve.
Most institutions will have to understand the content of their own collections – at least well enough to support full-text search and discovery. Most will also be tasked with helping their user community to interpret and re-use the items that they hold, and for some this may mean providing the entire access infrastructure. For example, when providing access to digitised material, libraries and archives tend not to just pass the TIFF files to the end user and call the job done. They create rich interfaces that place the items in context, include additional perspectives or layers (such as OCR text), and are appropriately optimised for particular platforms.
Unfortunately, I fear my description of normalisation may have confused things, as David appears to think I was talking about the creation of an access surrogate. This was not my intention. Rather, I was attempting to argue for the creation of a ‘master surrogate’ or ‘preservation master’ from which access surrogates can be generated. This distinction is important because the goal of the preservation master is to be a lossless clone of the original. In practice this goal may not be fully achievable, and even a high-fidelity clone may not be completely lossless. Where the degree of loss is uncertain or unacceptable, we may wish to keep the original, but I do not consider this to be essential in all cases.
Under this kind of normalisation strategy, the choice of format for the preservation masters is primarily driven by what the institution is willing to commit to supporting. Of course, for the exact reasons David is presenting, the natural choice will lean towards formats that are currently widely used for access because they are likely to be cheaper to support in the future. With that in mind, I think the root cause of our disagreement is revealed in this quote:
Unlike the original bits, the surrogate can be re-created at any time by re-running the tool that created it in the first place. If you argue for preserving the access surrogate, you are in effect saying that you don’t believe you will be able to re-run the tool in the future.
I am arguing for preserving the preservation master, but this is not because I believe no-one would ever be able to re-run the tool in the future. Rather, I see the preservation master as a viable alternative to having to maintain the ability to run all the tools, all of the time, both now and in the future. Each format represents a significant commitment in terms of the cost of procuring, supporting and sustaining the necessary access software and infrastructure. Therefore, it may be preferable to migrate a collection in order to reduce the variety of archival formats, so that the overall access framework is cheaper to develop and maintain.
Of course, if we also keep the original objects, this will increase the cost of storage, and the argument comes down to a rather complicated cost comparison between storing extra preservation master copies versus sustaining the access infrastructure, per format. I’m not convinced there are any easy answers here, as the cost of storing data and the cost of maintaining the software stack will depend on the nature and growth of the collection, and on a number of messy institutional and economic factors, like the expertise of the staff you have available or can attract. Even at a single institution, the optimal strategy will depend on the collection – for example, while the Internet Archive do not attempt to normalise the content of their web archive, they have chosen to generate derivative copies of content uploaded by contributors.
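To make the shape of that comparison a little more concrete, here is a back-of-the-envelope sketch in Python. Every figure in it (the storage price per terabyte-year, the per-format support cost, the numbers of formats, the one-off migration cost) is a hypothetical placeholder rather than a real institutional number, and the model deliberately ignores the messier factors described above.

```python
# A hypothetical cost comparison over a fixed planning horizon.
# All figures are invented for illustration only.

def cost_no_normalisation(support_per_format_year, n_original_formats, years):
    # Keep only the originals, but sustain access software for every format.
    return years * n_original_formats * support_per_format_year

def cost_with_normalisation(extra_master_tb, storage_per_tb_year, migration_cost,
                            support_per_format_year, n_master_formats, years):
    # Keep the originals plus extra preservation masters (more storage), pay a
    # one-off migration cost, but sustain access software for only a handful
    # of master formats.
    return (migration_cost
            + years * (extra_master_tb * storage_per_tb_year
                       + n_master_formats * support_per_format_year))

# Example over a ten-year horizon with made-up numbers.
print(cost_no_normalisation(support_per_format_year=2000,
                            n_original_formats=40, years=10))        # 800000
print(cost_with_normalisation(extra_master_tb=50, storage_per_tb_year=100,
                              migration_cost=30000,
                              support_per_format_year=2000,
                              n_master_formats=5, years=10))         # 180000
```

With these particular (invented) figures, normalisation wins comfortably, but lowering the per-format support cost or raising the storage growth can easily reverse the conclusion – which is exactly why I don’t think there are easy answers.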
In summary, the need to adopt a normalisation strategy is driven primarily by the requirement to facilitate or maintain access to the content. It is not directly concerned with obsolescence, except to the extent that the progression towards obsolescence drives up access costs. Even when obsolescence is of no concern, a normalisation strategy can help mitigate the cost of preserving access to digital items by reducing the complexity of the access infrastructure. To implement such a strategy, we need reliable, trustworthy format migration tools. I believe that the OPF has an important role to play here, finding ways to build that trust.
Normalization as an Access or Preservation Strategy
Hi Andy,
I liked that you got to the heart of the question quickly in this post:
“we do appear to disagree on a more fundamental level – on what it is we are actually trying to preserve.”
I would argue this is the crux of the matter, and I would like to pose a few more questions which I hope might help to clarify it:
The issue I am trying to highlight is that the “preservation master” created through normalisation may not actually convey all of the same content to the user as the original did when rendered using the intended rendering software or the original creating software.
In some cases this may not matter, but how can we ever know when it will matter without manually checking each file? What about when such an object has to be provided as evidence in a court of law? Will it be ok to say: “we normalised this and rendered the result in different software to the original, and we know for sure that it might convey different messages and include different content, but we don’t know if it does or not, or what might have been lost or added, but it’s the best we could do”?
So getting back to that original question, “what are we preserving access to?”, the normalisation answer appears to be: whatever “content” we could read out of the original file(s) and put into new files that open in current software.
It just seems a bit arbitrary, but then again, it may be the best we can do (at the moment).
There was also one other issue I encountered in your post: the idea of reducing complexity through normalisation. This seems to be a common misconception with migration/normalisation. Just as “Each format represents a significant commitment in terms of the cost of procuring, supporting and sustaining the necessary access software and infrastructure”, in order to do effective normalisation for preservation (i.e. preservation actions you can trust and that produce objects with integrity and authenticity) you will need a different set of code for each format/format variant, and with the variety of formats out there, that gives a lot of complexity and cost right there.
Also, if you are going to take normalisation seriously, you will have to decide how long you will keep your normalisation paths open for. Will you always support the ability to normalise a Word 95 .doc file? Or will you cut off support for such files at some point?
The point I am trying to make here is that normalisation for preservation is at least as complex as the alternatives, and possibly more complex.
In contrast to normalisation for preservation, normalisation that is used to give users some of the “stuff” (text, images, whatever we can easily do) from within a digital object in an easily accessible/reusable form is arguably a great thing to do if you can afford it.
Regards,
Euan Cochrane
Normalisation is not the norm!
Excellent points. Indeed, normalisation for preservation only makes sense if the spectrum of formats one receives is expected to change over time, and if you are comfortable about the amount and nature of any information you risk discarding by normalising.
A good example of this is provided by the UK Data Archive. They archive scientific data, and this means dealing with niche formats where the niche may be as narrow as a single person. But the data is valuable, and they understand the ‘performance’ required to enable re-use, and so to make the necessary access economically sustainable they attempt to migrate the data they receive to a normalised format wherever possible. This illustrates why I’m a little uncomfortable with the ‘widely used’ argument – the rarity of the format does not necessarily negate the value of the content.
But in general, there are very few types of data format where I am confident we can reliably measure the degree of loss of information (never mind the effect of that loss upon the performance!) without lots of expensive manual work. For example, with some difficulty, we can QA many image formats automatically (as we do for our TIFF to JP2 conversion), and I think the same can be said of many audio and video formats. However, I cannot say that about a conversion from Word 95 to ODF, and I’m certainly not arguing that normalisation should be the norm.
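To give a flavour of what that automated QA can mean in the image case, here is a minimal sketch in Python using Pillow and NumPy. It assumes single-image TIFF originals, losslessly encoded JP2 surrogates that Pillow can decode (via the OpenJPEG plugin), and illustrative file names; this is not our actual workflow, and for lossy encodings you would compare against a tolerance (such as a PSNR threshold) rather than demand exact equality.

```python
# A minimal sketch of automated image-migration QA. It assumes single-image
# TIFF originals and losslessly encoded JP2 surrogates; file names are
# purely illustrative.
from PIL import Image
import numpy as np

def pixels_identical(tiff_path: str, jp2_path: str) -> bool:
    """Return True if the JP2 surrogate decodes to exactly the same pixels."""
    original = np.asarray(Image.open(tiff_path))
    surrogate = np.asarray(Image.open(jp2_path))
    # np.array_equal also returns False if the image dimensions differ.
    return np.array_equal(original, surrogate)

if __name__ == "__main__":
    print(pixels_identical("page_001.tif", "page_001.jp2"))
```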