I’d like to start this post with a bold claim:
The only currently practical solution for preserving access to archived websites over the long term is to maintain old web browsers and access the archived sites using those browsers running on emulated or virtualised hardware.
The truth of that claim could easily be debated but I’m not going to do that here. Instead I’d like to assume that it’s true and look at what stands in the way of doing that right now.
Given this assumption, what happens when we try to do this right now?
(www.archive.org running in Netscape Navigator on Windows 98 on VMware).
In other words, it doesn’t work. It doesn’t work because our web-archive interfaces are not designed for old browsers. So this is the first issue.
This is assuming a workflow in which users load up an old browser and browse amongst links from web archives within the old browser. Another option might be to have the browsing done via a modern host and the viewing/rendering passed to the emulated/virtualised browser. But either way, this seems to be something practical web archivists could be working on.
Another issue is security. Providing web archives via old software may mean giving old software access to the internet; for example, it might mean giving Windows 95 with IE 5 access to the internet. This should be manageable through good use of firewalls etc., and in theory most modern host environments can be made immune to viruses that might attack the old operating systems. The emulated systems could also be set to “snapshot” mode to ensure any damage done can be undone simply by restarting the emulated desktop.
The third and more challenging issue is the ever-present issue for emulation solutions: licensing. Many old browsers require old proprietary operating systems on which to run. This is a legitimate issue that desperately needs to be dealt with if we are to make emulation a viable solution more widely. However, this is actually slightly less of a problem with browsers than with other software. Many browsers were freely available and many can be run directly on old Linux distributions or indirectly through API emulators such as WINE on Linux. Most old browsers can still be downloaded via sites like OldVersion.com and OldApps.com, or are included in old Linux distributions and repositories, as David Rosenthal likes to point out.
I may try to build a version of a Damn Small Linux disk image for use in the Emulation Framework with old versions of browsers running on it via WINE.
UPDATE #2: I resized the PuppyLinux disk image provided with the Emulation Framework, added Wine 1.5 and installed Internet Explorer 3.01 and Netscape Communicator 4.80 for Windows. The disk image is available here and can be added to the Emulation Framework by following the instructions in this document to “add software”. I may try to add more browsers in the future if anyone is interested.
This is a quick post to put a question out there for discussion. The (partly rhetorical) question that I have been pondering over and raising with others is:
How long do we have to maintain a migration path for any particular format?
It’s probably safe to assume that most digital preservation institutions will continue to receive files in old formats for a long time after the format is considered to be obsolete. If we are going to use a migration strategy to preserve these files then the current best practice seems to be that we should migrate them as soon as we believe that their format is obsolete.
For example, say a digital preservation institution has a set of WordPerfect 5.1 files and in 2012 it realizes/decides that they are obsolete and decides to migrate them all to ODT files to preserve them. This would seem to be a reasonable and practical approach for preserving these files. However if we apply the question posed above to this example: what happens if the institution receives more WordPerfect 5.1 files?
Presumably if the institution receives the additional WordPerfect 5.1 files while the tool(s) they used to migrate the original set are still functioning then they should be able to migrate them as they ingest them or soon after. But what happens when those migration tools are obsolete? Will they have to find or create new migration tools? Will they have to migrate the old migration tools?
There are a lot of answers to these questions including the option of refusing to accept any more files in formats that they have migrated away from. But to me it gives two good reasons to maintain emulation tools:
1. Migration through emulation tools (such as the UFC Migrate tool created as part of the Planets project) could help to ensure that files that come into the repository in obsolete formats can always be migrated. These tools do partly beg the question: Why bother migrating them if you are maintaining the ability to render them anyway (as you probably are if you are maintaining the ability to migrate them using original software)? - one answer is that you might migrate for reusability of parts of the content in modern software.
2. Using an emulation strategy to preserve the objects would make this issue redundant.
This post is intended to be speculative and may well be full of errors, both in the writing (spelling/grammar/typos) and in the content (I could be way off-mark). I am putting it out here as a thought piece to stimulate commentary and ideas. Some of this came out of recent discussions at the Future Perfect 2012 conference with many people including Jeff Rothenberg and Dirk von Suchodoletz.
What would it mean to take emulation seriously as a digital preservation strategy?
Most major digital preservation systems are currently based around having migration as the main long term preservation strategy. Some may argue that they are all in fact based on a strategy of hedging bets by way of retaining the original files and implementing migration, and this may be so; however none that I am aware of are based around using only emulation as a digital preservation strategy. I believe there is merit in some institutions using only emulation as a digital preservation strategy. They may wish to also use migration for providing access derivatives, much as we use a photocopier for providing access derivatives of paper records today. However there are some interesting and potentially cost-saving differences when implementing an emulation based digital preservation strategy instead of a migration based strategy.
This post is an attempt to highlight some of the differences in implementing a purely emulation based approach.
What would a business as usual digital preservation workflow look like?
At point of transfer or earlier, digital preservation practitioners (DPPs) would try to ascertain the necessary rendering environment or environments for each digital object. This might be as simple as knowing that the object was a PDF file from a certain era and so would have been intended to be rendered in one of x versions of Acrobat Reader, or a Microsoft Word document file from a certain era, created with OpenOffice, and therefore intended to be rendered with either OpenOffice or one of the versions of Microsoft Word that was available at the time. Or it may be far more complex. The decision on how accurate the rendering environment has to be will depend on the context in which the object was normally used. If it was normally used by many users on many different systems then one or more representative rendering environments may be appropriate. If it was normally used by multiple users via a specialised environment, then a copy of that environment may need to be made and transferred with the object.
Any necessary environments or environment components would be checked off against the preservation institution’s inventory (e.g. Microsoft word xx, java version xx, environment xx). Any components that had to be transferred from the agency would be packaged for transfer. Where full environments had to be transferred disk images would be made or virtual appliances would be transferred.
Files would go into the repository with some (digital) preservation metadata consisting of their age, rendering environment ID(s), date of last modification and any relevant fixity information (other metadata would be transferred for access restriction and discovery purposes etc). The date of last modification would be used when configuring the rendering environment to ensure active date fields were contemporaneous with the file (i.e. the emulated environment would have the system date set to the date the file was last modified).
The files would then have bit-preservation routines applied to them as per usual (copies made, checksums checked, media refreshment and replacement, etc).
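The metadata and checks described in the last few paragraphs can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the record layout, the field names, the environment IDs and the choice of SHA-256 for fixity are all hypothetical, not taken from any existing repository system.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical minimal preservation metadata for one file."""
    filename: str
    rendering_environment_ids: tuple  # IDs from the institution's inventory
    last_modified: date               # date of last modification
    sha256: str                       # fixity value recorded at ingest


def make_record(filename, content, environment_ids, last_modified):
    """Build a record, computing the fixity value from the file's bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return PreservationRecord(filename, tuple(environment_ids), last_modified, digest)


def fixity_ok(record, content):
    """Bit-preservation check: do the bytes still match the recorded checksum?"""
    return hashlib.sha256(content).hexdigest() == record.sha256


def emulator_clock_date(record):
    """System date to set in the emulated environment before rendering."""
    return record.last_modified


def missing_environments(record, inventory):
    """Environment IDs the record needs that are not in the inventory."""
    return [env for env in record.rendering_environment_ids if env not in inventory]


# Made-up example values.
content = b"example file content"
record = make_record("report.wpd", content, ["ENV-0001", "ENV-0002"], date(1996, 5, 14))
inventory = {"ENV-0001"}
```

With these example values, `missing_environments(record, inventory)` reports that `ENV-0002` would still have to be configured or transferred, and `emulator_clock_date(record)` gives the date the emulated system clock would be set to.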
If an appropriate rendering environment was not available in the inventory of the transferring agency one would either have to be configured or selected from a provider. Testing of the environment could be done in conjunction with the transferring organisation or individual, or could be done automatically using standard software installation testing routines. That one environment could then be used to render any object that was associated with it in the future. An average DPP (archivist, librarian) with basic IT skills should be able to be trained on how to configure most environments. In many cases it will only require knowledge of how to install applications on a base-operating system image.
When a user requested access to the original object there would be a number of options available:
1. They could be provided access to the object automatically rendered in the associated rendering environment within a controlled environment, e.g. in a reading room.
2. They could be provided access to the object automatically rendered in the associated rendering environment remotely, either through a custom application or through a web-browser.
3. They could be provided with the files that make up the object and information about the rendering environment, e.g. a unique ID for the environment or a list of the components. This could then be provided by the user (e.g. the transferring agency may still have the environment running) or by an external service provider.
4. They could be provided an access derivative created as part of a non-preservation value-add process to facilitate greater reuse.
Throughout all of these options (aside from 4) the user could be given a number of ways to interact with the object and move content from it to a more modern environment (these may depend on confidentiality or commercial constraints):
a) They could be given the option of printing objects to a file or printer.
b) They could be given the option of selecting and copying content to paste into the modern host environment.
c) They could be given the option of saving the object in a different format and moving the result to the modern host environment.
How does this process differ from standard, migration-based, approaches?
1. There is no validating of files against format standards (JHOVE would be unnecessary). Format validation only matters if you want to be able to consistently apply migration tools across a large set of files. If you are employing an emulation strategy this variance is not a problem. Intra-format variance generally results from different creating applications creating files differently but with the intention of them adhering to the same formatting standard. This variance is useful for identifying the rendering application but a problem for validation tools.
2. Format analysis becomes less important. Strictly speaking format identification is unnecessary when implementing an emulation strategy. The only format-like information that is necessary is an identifier for the rendering environment(s) to be used to render the object. File format identification tools could be used to infer the rendering environment(s) for the files. For example tools like DROID could be repurposed to identify patterns relating to creating applications and from there the intended rendering environment(s) could be inferred.
3. Identifying the rendering environment would be much more important and testing that environment at point of transfer could be more important. Doing this at point of transfer would make any issues apparent immediately rather than putting them off to a later date. In theory it would make it easier to consult with the original content owners to confirm decisions made (something that is harder to do each time a migration is conducted).
4. Preservation planning would involve tracking systems architecture etc, not software “obsolescence”. I.e. preservation planning would require ensuring that your emulation tools ran on your current host environment(s).
5. Preservation actions would involve writing new emulation hosts to host the old virtual hardware or writing new emulators to run the old environment images. This could be a significant process but would be relatively rare and would only need to be done once per emulator (which might emulate many different architectures & hundreds or thousands of environments).
6. Decisions about the content presented to users (e.g. as a result of migration or emulation) are made early in the preservation process (at point of transfer) as opposed to when a migration action is deemed necessary.
7. Access to the digital original could be more complicated for the average user and various mechanisms may have to be put in place to overcome this. Providing basic instructions for interacting with each environment would be an initial step. Old software documentation could be digitised and made available. Old software manuals often assumed no knowledge of computers and could be repurposed for future users. Interactive walk-through overlays could be added to the software (thanks to Jeff Rothenberg for suggesting this) leading users through the main steps necessary to interact with the objects (e.g. when mice no longer exist). Access to derivative versions may also be provided if required.
In general the steps involved in implementing a digital preservation strategy involving only emulation are quite different from those involved in implementing a migration strategy. Without solid examples of the practice of each, and metrics on costs and results, it is hard to say which would be more efficient.
I welcome comments and am very aware of the many gaps in this quite hurriedly written post. I chose to post this here rather than on the OPF or elsewhere because of its very raw nature, its speculative content and because I do not want it in any way associated with my kind employers.
I forgot an important point
The digital preservation institution would not necessarily have to hold copies of any or every environment. They would only need to have access to them or to ensure that users could access them. Initially this may be possible with no work whatsoever. For example the environment for a PDF file may be limited to any current version of Acrobat Reader that a user would likely have at home, running on any OS that supported it. In the future, if external emulation services were available, the preservation institution may only have to check that the particular environment was available or request that it be configured and made available by the service provider. After that they may not need to actively do a lot besides tracking the health of the service providers (besides the usual bit-preservation routines).
Adobe changed the PDF specification several times and continues to develop new specifications with new versions of Adobe Acrobat. There have been nine versions of PDF with corresponding Acrobat releases:
- (1993) – PDF 1.0 / Acrobat 1.0
- (1994) – PDF 1.1 / Acrobat 2.0
- (1996) – PDF 1.2 / Acrobat 3.0
- (1999) – PDF 1.3 / Acrobat 4.0
- (2001) – PDF 1.4 / Acrobat 5.0
- (2003) – PDF 1.5 / Acrobat 6.0
- (2005) – PDF 1.6 / Acrobat 7.0
- (2006) – PDF 1.7 / Acrobat 8.0
- (2008) – PDF 1.7, Adobe Extension Level 3 / Acrobat 9.0
- (2009) – PDF 1.7, Adobe Extension Level 5 / Acrobat 9.1
The ISO standard ISO 32000-1:2008 is equivalent to Adobe’s PDF 1.7. Adobe declared that it is not producing a PDF 1.8 Reference. The future versions of the PDF Specification will be produced by ISO technical committees. However, Adobe published documents specifying what extended features for PDF, beyond ISO 32000-1 (PDF 1.7), are supported in its newly released products. This makes use of the extensibility features of PDF as documented in ISO 32000-1 in Annex E. Adobe declared all extended features in Adobe Extension Level 3 and 5 have been accepted for a new proposal of ISO 32000-2 (a.k.a. PDF 2.0).
The specifications for PDF are backward inclusive. The PDF 1.7 specification includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. Where Adobe removed certain features of PDF from their standard, they too are not contained in ISO 32000-1.
PDF documents conforming to ISO 32000-1 carry the PDF version number 1.7. Documents containing Adobe extended features still carry the PDF base version number 1.7 but also contain an indication of which extension was followed during document creation.”
I added the emphasis to make a point. For understanding how files are internally structured it is not always enough to just know the formatting standard adhered to when files were created (e.g. PDF version 1.7). Sometimes we need more information about how the particular application chose to interpret or, as in the example above, implement the standard. This information could be represented in many cases simply by knowing what the creating application was. This is implicitly acknowledged in the Wikipedia extract, which pairs each PDF version in the list with the corresponding Acrobat release.
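To make the point concrete, here is a rough Python sketch that reads both pieces of information from a PDF file's bytes. It rests on assumptions that should be checked against the specification: that the base version appears in a `%PDF-x.y` header near the start of the file, and that the Adobe extension is recorded with an `/ExtensionLevel` key in the document catalog. The function name and the sample bytes are made up for illustration, and a plain byte search like this would miss an extension entry stored inside a compressed object stream.

```python
import re

# Assumed layout: "%PDF-1.7" header, and an "/ExtensionLevel N" entry
# somewhere in the uncompressed part of the file.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
EXTENSION_RE = re.compile(rb"/ExtensionLevel\s+(\d+)")


def pdf_version_info(data):
    """Return (base_version, extension_level or None), or None if no header is found."""
    header = HEADER_RE.search(data[:1024])
    if header is None:
        return None
    base = f"{header.group(1).decode()}.{header.group(2).decode()}"
    extension = EXTENSION_RE.search(data)
    level = int(extension.group(1)) if extension else None
    return base, level


# Made-up fragments standing in for real files.
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
extended = (
    b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Extensions "
    b"<< /ADBE << /BaseVersion /1.7 /ExtensionLevel 3 >> >> >>\nendobj\n"
)
plain_info = pdf_version_info(plain)
extended_info = pdf_version_info(extended)
```

Both sample fragments report base version 1.7, but only the second reports an extension level, which is exactly the kind of difference a version number alone hides.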
This post and all others on this blog are my personal thoughts and opinions and are not necessarily those of any organisation I work for or have worked for.
Now to the post.
Firstly, the clarification:
If we assume that “the aim of digital preservation is to maintain our (the preserving organisation’s) ability to render digital objects over time”, then digital objects become at risk when there is potential for them not to be rendered by us at a point in the future, and they become issues when they can’t be rendered by us.
Maintaining the ability to render digital objects means maintaining access to a software environment that can render the objects. In other words this means we have to have at least one copy of the software and dependencies that are needed to render the objects.
In order to mitigate the risk that objects won’t be renderable we have at least two options:
1. migrate content from files that make up the objects to other files that can be rendered in environments that we currently support.
2. maintain access to environments indefinitely using emulation/virtualization.
So there is the clarification. Now some conjectures regarding it:
- For any reasonably sized volume of digital objects that require the same rendering environment, it may be simpler and cheaper to just continue to maintain access to one environment by emulating or virtualizing it. All this takes is the ability for somebody to install the required software in a virtual/emulated machine and for that machine image to continue to be renderable by emulation/virtualisation software in the future.
- Maintaining one copy of a compatible environment suffices for preservation purposes as it enables us to say we have preserved the objects, but is probably not good enough for access. There are reasons why we should provide viewers for digital objects, and also reasons why we should try to make sure users can access objects using their own modern/contemporary software. For these reasons we may also have to perform migration where it is cheap/fundable and provide access to the preservation master through reading rooms (either physical or virtual) in which we can restrict the number of concurrent users to as many as we have licenses for the emulated environments.
As part of on-going research I have recently been working a lot with emulated desktop environments.
One of the somewhat surprising things to come out of this work has been the realisation that having a set of emulated desktops with various old applications installed on them (an emulation workbench) is a really valuable tool for digital preservation practitioners.
When faced with a digital object with an unknown format that DROID, JHOVE etc. cannot identify, one of the most useful approaches I have found for discovering the format of the object is to try opening it in a number of applications of roughly the same era. Often applications will suggest an open-parameter to use when opening a file e.g.:
Or they may obviously produce errors when opening a file e.g:
Both of which can be useful for understanding the types of objects you are dealing with.
Some applications specify explicitly that they are converting an object from one format to another, implying that the application decided that the object was of the first format.
Admittedly this approach can be time consuming. But if you have a set of files that you think are the same type it may be worthwhile spending the time attempting to open the files in different applications. Also, with some research it may be possible to automate this process so that an object can be automatically opened in a range of applications from its era and the results automatically analysed to see which gave the fewest errors, or to analyse the conversion messages provided to see whether all the applications agree on the original format. Jay Gattuso has discussed something similar here.
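The analysis step of such an automated process might look something like the following Python sketch. It assumes the hard part, actually opening the object in each emulated application and capturing the errors and conversion messages, has already been done; the `OpenAttempt` structure, the function name and the sample results are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpenAttempt:
    application: str                # application that tried to open the file
    error_count: int                # errors or warnings observed while opening
    reported_format: Optional[str]  # format named in a conversion message, if any


def summarise_attempts(attempts):
    """Return (application with fewest errors, most-reported format, all reports agree?)."""
    best = min(attempts, key=lambda a: a.error_count)
    votes = Counter(a.reported_format for a in attempts if a.reported_format)
    if not votes:
        return best.application, None, False
    top_format, _ = votes.most_common(1)[0]
    return best.application, top_format, len(votes) == 1


# Made-up results for one unidentified file; not real observations.
attempts = [
    OpenAttempt("Application A", 0, "Format X"),
    OpenAttempt("Application B", 3, None),
    OpenAttempt("Application C", 1, "Format X"),
]
summary = summarise_attempts(attempts)
```

Here two applications name the same source format and none disagree, so the summary flags agreement. A real workflow would want to record the disagreeing cases too, since those are the ones worth a practitioner's time.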
Given the obsolescence of hardware, and difficulty setting up old hardware, this use-case highlights the need for a set of emulated desktops for digital preservation practitioners to add to their tool-set. Such a tool-set or “workbench” would be extremely helpful for adding to format databases such as Pronom and UDFR.
Comments appreciated via @euanc on twitter
I’ve been working on an application and installed environment database.
As part of this I have been documenting the save-as, open, export and import parameters (options) for many business applications.
For example, the following are the open parameters available for Lotus 1-2-3 97 edition installed on Windows 95:
ANSI Metafile (CGM)
Lotus 1-2-3 PIC (PIC)
Lotus 1-2-3 SmartMaster Template (12M)
Lotus 1-2-3 Workbook (123;WK*)
Quattro Pro (WQ1;WB1;WB2)
Windows Metafile (WMF)
Recently I realised that this might be a good source for intelligence about file formats. Let me explain what I mean.
Different applications differentiate in different ways between versions of file formats in their open and save-as parameters. The logic behind the differentiation may be able to be analysed to discover when format variants are significant or not.
For example, Microsoft Word Version 6.0c (running on Windows 3.11) has the following open parameters for Word for MS-DOS files:
Word for MS-DOS 3.x - 5.x
Word for MS-DOS 6.0
In contrast to this WordPerfect 5.2 for Windows (running on Windows 3.11) has these open parameters:
MS Word 4.0; 5.0 or 5.5
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b
Of which the first may be referring to MS-DOS versions.
Lotus Word Pro 96 Edition for Windows (running on Windows 3.11) has the following open parameter for Word for MS-DOS files:
MS Word for DOS 3;4;5;6 (*.doc)
And Corel WordPerfect Version 6.1 for Windows (running on Windows 3.11) has these open parameters:
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b; 2.0c
MS Word for Windows 6.0
None of which refer to any MS-DOS variants.
This pattern continues through more recent variants of each office suite.
The interesting finding from this is that the Microsoft suites differentiate between versions 3, 4 and 5 (as a group) and version 6, but not between versions 3, 4 and 5, whereas the other suites (when they have a relevant parameter) do not differentiate between any of versions 3, 4, 5 or 6.
If every office suite differentiated between the variants in the same way, this would indicate that there were significant differences between them. However, as they don’t, the evidence is inconclusive in this case.
As Microsoft wrote the standards in this example then their suites ought to have the most reliable information and therefore it may be sensible to conclude that version 6 is significantly different to versions 3, 4 or 5.
This pattern also holds for save-as parameters. The Microsoft suites differentiate between version 6 and the group of versions 3, 4 and 5 whereas the other suites don’t differentiate this way.
As the database gets more populated more analysis will be possible. Where there is general agreement in both open and save-as parameters across multiple applications then this will give digital preservation practitioners very good reason to believe that there are significant differences between the formats in question.
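As a rough illustration of the kind of analysis I have in mind, the following Python sketch records each application's open parameters as groups of format versions and asks, for a pair of versions, which applications list them separately. The data structure and function names are my own invention, and only the two Word for MS-DOS examples quoted above are included (with "3.x - 5.x" read as versions 3, 4 and 5).

```python
# Each open parameter is modelled as the set of Word for MS-DOS versions it covers.
open_parameters = {
    "Microsoft Word 6.0c": [{"3", "4", "5"}, {"6"}],
    "Lotus Word Pro 96": [{"3", "4", "5", "6"}],
}


def distinguishes(groups, a, b):
    """True if versions a and b are both listed but never share a parameter.

    Returns None when the application has no parameter covering a or b.
    """
    covers_a = any(a in group for group in groups)
    covers_b = any(b in group for group in groups)
    if not (covers_a and covers_b):
        return None
    return not any(a in group and b in group for group in groups)


def verdicts(parameters, a, b):
    """Map each application with a relevant parameter to its verdict on a vs b."""
    results = {}
    for application, groups in parameters.items():
        verdict = distinguishes(groups, a, b)
        if verdict is not None:
            results[application] = verdict
    return results


five_vs_six = verdicts(open_parameters, "5", "6")
three_vs_four = verdicts(open_parameters, "3", "4")
```

For versions 5 and 6 the two applications disagree, which is the inconclusive case described above, while for versions 3 and 4 neither application draws a distinction. The same function could be run over save-as parameters once they are in the database.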
I am carefully suggesting that these findings only give us reason to believe that there are differences. There may not actually be differences. Just because particular applications allow users to differentiate between these parameters/file formatting options does not mean that the applications themselves actually do. It may, for example, be a marketing tool to enable the vendor of the product to state that the tool is “compatible with many formats” even though it may use the same code to open them all.
Hopefully finding similar differences across many vendors’ tools enables us to mitigate this issue but it should be noted that this approach does not provide definitive results.
Comments would be appreciated via twitter @euanc
Digital preservation practitioners often talk about digital preservation actions that they are planning or thinking about doing (rarely do they talk about ones that they have conducted, but that’s another post).
Unfortunately I have found that, when questioned about potential issues with their approaches, digital preservation practitioners often fall back to saying either:
"well, we are keeping the originals as well", or
"well, we are also doing ‘x’",
both of which are really unsatisfying replies.
It has led me to conclude that we need a new term (or newly-redefined way of using an old term): “Permanent Preservation”.
Permanent preservation means actions that are taken that are intended to be the real-applied solution for digital preservation and which have the trust and approval of the organisation involved. Permanent preservation actions are those in which the organisation trusts the outcome and is willing to defend the authenticity of it.
- Permanent preservation using migration
Any migration action that an organisation is not willing to defend to the extent that they will dispose of the original files should not be considered a permanent preservation action.
- Permanent preservation using emulation
Any emulation solution that an organisation is not willing to defend to the extent that they will not perform any other (non-bit stream) preservation actions on objects that rely on the solution should not be considered a permanent preservation action.
Under this understanding of permanent preservation, migration for access is not a permanent preservation action as it is not intended to be a digital preservation solution and will generally involve retaining the original.
If we use this term in the way outlined above then when practitioners talk about digital preservation approaches they can now differentiate between those that are permanent and those that are not (yet) permanent and not (yet) worth our trust.
For Archives in particular (keepers of evidential records), trust and authenticity are key to their very business, so all preservation solutions should have the potential to be permanent.
Of course for now we may not have any possibly permanent preservation actions. But we should also use the above definition to distinguish between those with the potential for permanence and those without.
Some preservation actions will never be able to be trusted without extensive and costly ongoing/long-term manual checking. Others may be able to be trusted with minimal (and therefore inexpensive) ongoing/long-term manual checking. Given that all digital preservation actions currently involve a degree of up-front cost, those that may be able to be trusted at some point in the future are arguably worth more upfront investment than those that never will be able to be trusted or those that won’t be able to be trusted without significant ongoing or long-term cost.
After a discussion about the cost of digital preservation the other day I thought I would try to do a quick and dirty estimate of the cost of providing the “simple” Google search box:
Google took in US$ 7,286,000,000 in revenues in the 3rd Quarter of this year. Of that 65% was Costs and Expenses including (in millions US$):
Cost of Revenues: $2,552 (35% of revenues). This is the amount it cost to provide the services that gained the revenues (I think).
Research & Development: $994 (14% of revenues). Cost for ongoing R & D.
Sales & Marketing: $661 (9% of revenues). Many people think we need more of this; we can’t really not count it as a cost.
General & Administrative: $532 (7% of revenues). Unclear what this means but seems reasonable.
Total Costs & Expenses: $4,739 (65% of revenues). That is 4.739 billion US$!!!
$4,833 million (US$4.833 billion) in revenues came from its Google search service/advertising. That is 66% of its revenues. Assuming it spends 66% of its money to get those revenues, we can multiply the total costs by 66% and get a vague notion of how much it spends on its search:
US$4.739 billion x 66% = US$3.128 billion
So next time someone asks for a Google-like solution, ask them for 3.1 billion US dollars per quarter.
(Admittedly this analysis has gaping holes but it’s kinda fun to think about. We probably don’t need to index all of the internet’s information for any particular solution, and this probably includes the cost of the advertising infrastructure that gathers the revenues for Google, but I suspect the start-up cost to get something close to Google’s power would still be astronomical.)
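For anyone who wants to check the back-of-the-envelope arithmetic, here it is as a short Python sketch. It uses only the quarterly figures quoted above (in millions of US$) and keeps the same crude assumption that costs are spread across products in proportion to revenue; the variable names are my own.

```python
# Quarterly figures in millions of US$, as quoted in the post.
total_revenue = 7286
search_revenue = 4833
total_costs = 4739

# Share of revenue attributed to search, rounded to two places as in the post.
search_share = round(search_revenue / total_revenue, 2)

# Crude assumption: costs are spread across products in proportion to revenue.
estimated_search_cost = total_costs * search_share

print(f"Search share of revenue: {search_share:.0%}")
print(f"Estimated search cost: about US${estimated_search_cost:,.0f} million per quarter")
```

On these figures the estimate comes out at a little over 3.1 billion US dollars per quarter.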
How do you preserve an Outlook calendar? It would be quite a resource for future researchers.
Gizmodo are running a great series of posts on digital continuity issues:
Hi and welcome to anyone who has stumbled upon this blog!
My name is Euan Cochrane and I’m a digital preservation professional based in Wellington, New Zealand. I intend to use this blog to talk about issues related to digital continuity generally, including issues and news around digital preservation, metadata and related technologies such as XML and RDF.
It may take me a little while to get this started so please bear with me. I don’t want to officially launch the blog until I have prepared it more fully and am ready to regularly post.
I look forward to starting a dialogue with you all.