This post and all others on this blog are my personal thoughts and opinions and are not necessarily those of any organisation I work for or have worked for.
Now to the post.
Firstly, the clarification:
If we assume that “the aim of digital preservation is to maintain our (the preserving organisation’s) ability to render digital objects over time”, then digital objects become at risk when there is potential for us to be unable to render them at some point in the future, and become issues when we can no longer render them.
Maintaining the ability to render digital objects means maintaining access to a software environment that can render the objects. In other words this means we have to have at least one copy of the software and dependencies that are needed to render the objects.
In order to mitigate the risk that objects won’t be renderable, we have at least two options:
1. migrate content from files that make up the objects to other files that can be rendered in environments that we currently support.
2. maintain access to environments indefinitely using emulation/virtualization.
So there is the clarification. Now some conjectures regarding it:
- For any reasonably sized volume of digital objects that require the same rendering environment, it may be simpler and cheaper to just continue to maintain access to one environment by emulating or virtualizing it. All this takes is the ability for somebody to install the required software in a virtual/emulated machine and for that machine image to continue to be renderable by emulation/virtualisation software in the future.
- Maintaining one copy of a compatible environment suffices for preservation purposes, as it enables us to say we have preserved the objects, but it is probably not good enough for access. There are reasons why we should provide viewers for digital objects, and also reasons why we should try to make sure users can access objects using their own modern/contemporary software. For these reasons we may also have to perform migration where it is cheap/fund-able, and provide access to the preservation master through reading rooms (either physical or virtual) in which we can restrict the number of concurrent users to as many as we have licenses for the emulated environments.
As part of on-going research I have recently been working a lot with emulated desktop environments.
One of the somewhat surprising things to come out of this work has been the realisation that having a set of emulated desktops with various old applications installed on them (an emulation workbench) is a really valuable tool for digital preservation practitioners.
When faced with a digital object in an unknown format that DROID, JHOVE, etc. cannot identify, one of the most useful approaches I have found for discovering the format is to try opening the object in a number of applications of roughly the same era. Often an application will suggest an open parameter to use when opening a file, e.g.:
Or it may produce obvious errors when opening the file, e.g.:
Both of which can be useful for understanding the types of objects you are dealing with.
Some applications specify explicitly that they are converting an object from one format to another, implying that the application decided that the object was of the first format.
Admittedly this approach can be time consuming. But if you have a set of files that you think are the same type, it may be worthwhile spending the time attempting to open the files in different applications. Also, with some research it may be possible to automate this process, so that an object can be automatically opened in a range of applications from its era and the results automatically analysed to see which gave the fewest errors, or to analyse the conversion messages provided to see whether all the applications agree on the original format. Jay Gattuso has discussed something similar here.
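To make the idea concrete, here is a minimal sketch of what the analysis step of such an automated process might look like. It assumes we already have some harness that drives each emulated application and captures the messages it emits when asked to open the file; the application names and messages below are illustrative, not output from any real tool:

```python
from collections import Counter

def rank_candidates(open_results):
    """Rank applications by how cleanly they opened an unknown file.

    open_results maps an application name to the list of error/warning
    messages it produced when opening the file. Fewer messages suggests
    the application recognised the format.
    """
    scores = Counter({app: len(errors) for app, errors in open_results.items()})
    # Sort ascending: the application with the fewest complaints first.
    return sorted(scores.items(), key=lambda kv: kv[1])

# Hypothetical results from opening one unidentified file in three
# era-appropriate applications inside an emulated desktop.
results = {
    "WordPerfect 5.2": ["Unknown file format"],
    "MS Word 6.0": [],  # opened silently
    "Lotus Word Pro 96": ["Conversion failed", "Unsupported version"],
}

ranking = rank_candidates(results)
print(ranking[0][0])  # the most promising candidate application
```

The real work, of course, is in capturing the dialog messages from inside the emulated environment; the ranking itself is trivial once that is solved.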
Given the obsolescence of hardware, and difficulty setting up old hardware, this use-case highlights the need for a set of emulated desktops for digital preservation practitioners to add to their tool-set. Such a tool-set or “workbench” would be extremely helpful for adding to format databases such as Pronom and UDFR.
Comments appreciated via @euanc on twitter
I’ve been working on an application and installed environment database.
As part of this I have been documenting the save-as, open, export and import parameters (options) for many business applications.
For example, the following are the open parameters available for Lotus 1-2-3 97 edition installed on Windows 95:
ANSI Metafile (CGM)
Lotus 1-2-3 PIC (PIC)
Lotus 1-2-3 SmartMaster Template (12M)
Lotus 1-2-3 Workbook (123;WK*)
Quattro Pro (WQ1;WB1;WB2)
Windows Metafile (WMF)
Recently I realised that this might be a good source for intelligence about file formats. Let me explain what I mean.
Different applications differentiate in different ways between versions of file formats in their open and save-as parameters. The logic behind this differentiation may be analysable to discover when format variants are significant and when they are not.
For example, Microsoft Word Version 6.0c (running on Windows 3.11) has the following open parameters for Word for MS-DOS files:
Word for MS-DOS 3.x - 5.x
Word for MS-DOS 6.0
In contrast, WordPerfect 5.2 for Windows (running on Windows 3.11) has these open parameters:
MS Word 4.0; 5.0 or 5.5
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b
Of these, the first may refer to the MS-DOS versions.
Lotus Word Pro 96 Edition for Windows (running on Windows 3.11) has the following open parameter for Word for MS-DOS files:
MS Word for DOS 3;4;5;6 (*.doc)
And Corel WordPerfect Version 6.1 for Windows (running on Windows 3.11) has these open parameters:
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b; 2.0c
MS Word for Windows 6.0
None of these refers to any MS-DOS variant.
This pattern continues through more recent variants of each office suite.
The interesting finding is that the Microsoft suites differentiate between versions 3, 4 and 5 (as a group) and version 6, but not among versions 3, 4 and 5, while the other suites (when they have a relevant parameter) do not differentiate between any of versions 3, 4, 5 or 6.
If every office suite differentiated between the variants in the same way, this would indicate that there were significant differences between them. As they don’t, the evidence is inconclusive in this case.
As Microsoft wrote the formats in this example, their suites ought to have the most reliable information, so it may be sensible to conclude that version 6 is significantly different from versions 3, 4 and 5.
This pattern also holds for save-as parameters. The Microsoft suites differentiate between version 6 and the group of versions 3, 4 and 5, whereas the other suites don’t differentiate this way.
As the database gets more populated more analysis will be possible. Where there is general agreement in both open and save-as parameters across multiple applications then this will give digital preservation practitioners very good reason to believe that there are significant differences between the formats in question.
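One way this analysis might eventually be automated (a sketch only, using the groupings quoted from the dialogs above) is to treat each application's open parameters as a grouping of format versions, and then ask which applications ever separate two given versions:

```python
# Each application's open parameters group Word for MS-DOS versions
# into sets it treats as equivalent. Groupings are taken from the
# dialogs quoted above; this is an illustrative sketch, not a real tool.
partitions = {
    "MS Word 6.0c": [{"3", "4", "5"}, {"6"}],
    "Lotus Word Pro 96": [{"3", "4", "5", "6"}],
    "WordPerfect 6.1": [],  # no MS-DOS open parameter at all
}

def distinguishers(a, b, partitions):
    """Return the applications whose parameters separate versions a and b."""
    out = []
    for app, groups in partitions.items():
        for g in groups:
            # A group containing exactly one of the two versions means
            # this application treats them as different formats.
            if (a in g) != (b in g) and (a in g or b in g):
                out.append(app)
                break
    return out

# Which applications treat version 6 as different from version 5?
print(distinguishers("5", "6", partitions))
```

Broad agreement across applications from many vendors, computed this way over a well-populated database, is exactly the kind of signal the paragraph above describes.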
I am carefully suggesting that these findings only give us reason to believe that there are differences. There may not actually be differences. Just because particular applications allow users to differentiate between these parameters/file formatting options, that does not mean that the applications themselves actually do. It may, for example, be a marketing device that lets the vendor show that the tool is “compatible with many formats”, even though it may use the same code to open them all.
Hopefully finding similar differences across many vendors’ tools mitigates this issue, but it should be noted that this approach does not provide definitive results.
Comments would be appreciated via twitter @euanc
Digital preservation practitioners often talk about digital preservation actions that they are planning or thinking about doing (rarely do they talk about ones that they have actually conducted, but that’s another post).
Unfortunately I have found that when questioned about potential issues with their approaches digital preservation practitioners often fall back to saying either:
"well, we are keeping the originals as well",
"well, we are also doing ‘x’",
both of which are really unsatisfying replies.
It has led me to conclude that we need a new term (or a newly-redefined way of using an old term): “Permanent Preservation”.
Permanent preservation means actions that are intended to be the actual, applied solution for digital preservation and which have the trust and approval of the organisation involved. Permanent preservation actions are those in which the organisation trusts the outcome and is willing to defend its authenticity.
- Permanent preservation using migration
Any migration action that an organisation is not willing to defend to the extent that they will dispose of the original files should not be considered a permanent preservation action.
- Permanent preservation using emulation
Any emulation solution that an organisation is not willing to defend to the extent that they will not perform any other (non-bit-stream) preservation actions on objects that rely on the solution should not be considered a permanent preservation action.
Under this understanding of permanent preservation, migration for access is not a permanent preservation action as it is not intended to be a digital preservation solution and will generally involve retaining the original.
If we use this term in the way outlined above then when practitioners talk about digital preservation approaches they can now differentiate between those that are permanent and those that are not (yet) permanent and not (yet) worth our trust.
For Archives in particular (keepers of evidential records), trust and authenticity are key to their very business, so all preservation solutions should have the potential to be permanent.
Of course, for now we may not have any potentially permanent preservation actions. But we should still use the above definition to distinguish between those with the potential for permanence and those without.
Some preservation actions will never be able to be trusted without extensive and costly ongoing/long-term manual checking. Others may be able to be trusted with minimal (and therefore inexpensive) ongoing/long-term manual checking. Given that all digital preservation actions currently involve a degree of up-front cost, those that may be able to be trusted at some point in the future are arguably worth more upfront investment than those that never will be able to be trusted or those that won’t be able to be trusted without significant ongoing or long-term cost.
Open Planets Foundation is proud to present Fido.jar, a Java port of the Python version of Fido (Format Identification for Digital Objects). This first version runs on all platforms with Java 6 update 23 or later installed.
We would like you to give this first Fido in a jar a try. If you encounter any bugs, please submit them to the OPF Labs Jira. Installation and usage instructions are included in the zipfile.
Download Fido.jar @ Github:IdentificationToolsFido
Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux. Most importantly, Fido is very fast.
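The core approach can be illustrated in a few lines of Python. The pattern below is a hand-written, simplified magic-number check for PDF, used only for illustration; real Pronom signatures are considerably more elaborate, and Fido generates the regular expressions from them automatically:

```python
import re

# A simplified signature: PDF files begin with "%PDF-" followed by a
# version number, e.g. "%PDF-1.4". Pronom expresses signatures as byte
# sequences; Fido translates them into regular expressions like this.
PDF_SIGNATURE = re.compile(rb"\A%PDF-\d\.\d")

def looks_like_pdf(data):
    """Match the signature against the first bytes of an object."""
    return PDF_SIGNATURE.match(data) is not None

print(looks_like_pdf(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3"))  # True
print(looks_like_pdf(b"PK\x03\x04"))                   # False (a ZIP header)
```

Because the matching is delegated to the regular expression engine, the per-file work is essentially a single pass over the leading bytes, which is consistent with the performance figures below.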
In a subsequent post, I’ll describe the implementation in more detail. For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October. The core is a couple of hundred lines of code in three files. It is shorter than these blog posts!
I was stunned by Fido’s performance. Its memory usage is very small. Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.
I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP. In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at the rate of 60 files per second.
As a point of comparison, I also benchmarked the file (cygwin 5.0.4 implementation) command in the same environment against the same set of 5000 files. File does a job similar to Droid or Fido – it identifies types of files, but more from the perspective of the Unix system administrator than a preservation expert (e.g., it is very good about compiled programmes, but not so good about types of Office documents). I invoked file as follows:
time find . -type f | file -k -i -f - > file.out
This reports 1m24s or 84 seconds. I compared this against:
time python -m fido.run -q -r . > fido.csv
This reports 1m18s or 78 seconds.
In my benchmark environment, Fido 0.7.1 is about the same speed as file. This is an absolute shock. Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool. Memory usage is rock solid and tiny for both Fido and file.
Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo CPU E7500 @ 2.93 GHz (1 of 2 CPUs used in virtual setup); RAM: 1 GB). He observed Fido devour a collection of about 34000 files at a rate of 230 files per second.
Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.
For me, performance in this range is a surprise, a relief, and an important step forward. It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections. A rate of 200 files per second is equivalent to 17.28 million files in a day – on a single processor. Fido 0.7 is already fast enough for most current collections.
Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive. Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.
Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux. It is still beta code – we welcome your comments, feedback, ideas, bug reports - and contributions!
After a discussion about the cost of digital preservation the other day I thought I would try to do a quick and dirty estimate of the cost of providing the “simple” Google search box:
Google took in US$7,286 million in revenues in the 3rd quarter of this year. Of that, 65% went on costs and expenses, broken down as follows (in millions of US$):
Cost of Revenues: $2,552 (35% of revenues). This is the amount it cost to provide the services that gained the revenues (I think).
Research & Development: $994 (14% of revenues). The cost of ongoing R&D.
Sales & Marketing: $661 (9% of revenues). Many people think we need more of this; we can’t really not count it as a cost.
General & Administrative: $532 (7% of revenues). Unclear what this covers, but it seems reasonable.
Total Costs & Expenses: $4,739 (65% of revenues, i.e. US$4.739 billion!)
$4,833 million of those revenues came from Google’s search service/advertising. That is 66% of its revenues. Assuming it spends 66% of its money to earn those revenues, we can multiply the total costs by 66% and get a vague notion of how much it spends on search:
$4,739 million x 66% ≈ $3,128 million
So next time someone asks for a Google-like solution, ask them for roughly 3.1 billion US dollars per quarter.
(admittedly this analysis has gaping holes but its kinda fun to think about— We probably don’t need to index all of the internet’s information for any particular solution, and this probably includes the cost of the advertising infrastructure that gathers the revenues for Google, but I suspect the start-up cost to get something close to Google’s power would still be astronomical).
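The back-of-the-envelope arithmetic can be reproduced in a few lines (figures in millions of US$, taken from the quarterly report quoted above):

```python
# Q3 figures in millions of US$, as quoted above.
revenues = 7286
total_costs = 2552 + 994 + 661 + 532  # cost of revenues + R&D + S&M + G&A
search_share = 0.66                   # ~= 4833 / 7286, search's share of revenue

print(total_costs)                         # total quarterly costs and expenses
print(round(total_costs * search_share))   # rough quarterly cost of "the search box"
```

Changing any of the assumptions (say, attributing costs in proportion to something other than revenue share) moves the number around, but not by enough to make it small.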
How do you preserve an Outlook calendar? It would be quite a resource for future researchers.
Gizmodo are running a great series of posts on digital continuity issues:
Hi and welcome to anyone who has stumbled upon this blog!
My name is Euan Cochrane and I’m a digital preservation professional based in Wellington, New Zealand. I intend on using this blog to talk about issues related to digital continuity generally, including issues and news around digital preservation, metadata and related technologies such as XML and RDF.
It may take me a little while to get this started so please bear with me. I don’t want to officially launch the blog until I have prepared it more fully and am ready to regularly post.
I look forward to starting a dialogue with you all.