Research Activities and Open Questions at Archives New Zealand
At Archives New Zealand we are currently working on a number of digital preservation research activities including:
- The collation of a sample set of files for use in testing tools and approaches, and in other digital preservation experiments.
- Documenting software applications and environments.
- Developing an evidence base of migration/normalization and emulation tests.
The purpose of this post is to raise awareness of our work with the wider community and to get some feedback on the activities we are undertaking. There are a number of specific questions throughout that we would particularly appreciate feedback on.
Collating a sample set of files
As has been discussed a lot in the community recently, most digital preservation research requires some sample files for use in testing. We have recognised this at Archives NZ too, and have taken it upon ourselves to produce and make available a set of files that may be of interest to the wider community. We plan on making the set available via the internet and hope to have the first set available by the end of June. Currently the sample is made up of files from a number of sources:
- Files from actual archives that we have had transferred to us.
- Files provided by a Crown Research Institute (http://www.sciencenewzealand.org/).
- Personal files from members of our team.
The hope/aim with the sample is to be able to provide items that have varying but real value to individuals, agencies and the wider public. This value component is mostly lacking in other sample sets that are available, and for good reason: files that are of real value are often difficult to make available publicly for confidentiality or intellectual property reasons. This has also been a problem for us and is one of the reasons it is taking us quite a long time to make the set available. The set has to be checked to ensure that it is appropriate for public release.
The set looks like it will be around 550 MB in size, so we have been debating internally what the best option is for making it available. I would like to ask the community for suggestions on this. There are three main questions we have:
- What would be the best way to provide this information: in a compressed container file (e.g. zip, tar, rar, etc.), in an ISO file, as individual files, or in some other form?
- Where should the information be posted? We could potentially make it available via the Archives New Zealand website but are there better places for it to live?
- What information should be included with it to describe the files? This will be limited by what we have available, but, for example, we have gotten much of the information from floppy disks and other portable media, and sometimes these disks were labelled. Would this information be of use to the community?
Documenting Software Applications and Environments
We are currently doing some development of our archival description tool/database, and as part of that we are looking to include the ability to document the creating application/environment and/or intended rendering application/environment for all digital items we control. Unfortunately, neither of these fields is going to be easy to populate, as the population will have to be conducted automatically in most cases (due to volume), and there are either no tools available to infer that information or the tools that are available are not really up to the challenge.
In order to fill this gap and so we can document other experimentation we are doing (more on that below) we have been experimenting with an application/environment documentation database which is intended to document every application we hold, its dependencies, and the various parameters that the applications have which may be useful to know for digital preservation purposes. These parameters are things such as save-as parameters, open parameters, and import and export parameters.
Something like this would be useful for many purposes, but in doing this experimentation we have already learnt quite a lot about the complexity/volume of the different sets of code that are used to open or save files captured with different “formatting standards”. For example, out of the 33 applications documented so far there are 505 save-as parameters, and 77 of those have a .doc extension associated with them. This implies that many different sets of code are being used to write “.doc” files, which explains some of the changes that are found when files are subsequently opened in different software environments (i.e. the internal structure of a file differs from that expected by the rendering application). A similar example can be given for the open parameters and the volume of different sets of code used to open files structured with the same “format”.
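To make the shape of such a documentation database concrete, here is a minimal sketch in Python using SQLite. The schema, application names, and parameter labels are illustrative assumptions, not the actual Archives NZ database:

```python
# Hypothetical sketch of an application/environment documentation database.
# Schema and example rows are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE application (
    app_id   INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    version  TEXT NOT NULL
);
CREATE TABLE save_as_parameter (
    param_id  INTEGER PRIMARY KEY,
    app_id    INTEGER REFERENCES application(app_id),
    label     TEXT NOT NULL,   -- e.g. the entry shown in the Save As dialog
    extension TEXT             -- e.g. ".doc"
);
""")

# Two different applications, each with its own code path that writes ".doc".
conn.execute("INSERT INTO application VALUES (1, 'Microsoft Word', '97')")
conn.execute("INSERT INTO application VALUES (2, 'WordPerfect', '8')")
conn.execute(
    "INSERT INTO save_as_parameter VALUES (1, 1, 'Word 97 Document', '.doc')")
conn.execute(
    "INSERT INTO save_as_parameter VALUES (2, 2, 'MS Word 97 export', '.doc')")

# How many distinct save-as routines claim the .doc extension?
count, = conn.execute(
    "SELECT COUNT(*) FROM save_as_parameter WHERE extension = '.doc'"
).fetchone()
print(count)  # 2 distinct sets of code for one "format"
```

Counting rows per extension in this way is how figures like the 77 .doc save-as parameters mentioned above could be derived once the database is populated.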
There is a bit of a chicken-and-egg (http://en.wikipedia.org/wiki/Chicken_or_the_egg) problem with application/environment documentation in the digital preservation community at the moment, as there is no equivalent of a PUID for applications/environments. This means that it is difficult to create tools to identify which app/environment is needed to render a file (or set of files), or was used to create them, as there is no standard way of identifying/documenting the app/environment (that I am aware of). [Edit: PRONOM does have information about, and PUIDs for, applications, e.g. http://www.nationalarchives.gov.uk/PRONOM/Software/proSoftwareSearch.aspx?status=detailReport&id=14 – we will have to synchronise with these.]
The database we have been creating is fairly rudimentary and simple at the moment, but it is helping us to understand what requirements we have in this area. It would be great to see the Open Planets Foundation provide something to fill the gap here, and there is potential for us to share our requirements if such a project were to be undertaken.
Developing a Migration/Normalization and Emulation Evidence Base
The third area of research that we have been investigating at Archives New Zealand has been in the development of a migration/normalization and emulation evidence base for use in making decisions about preservation strategies. This work is currently getting started and involves testing the rendering of digital objects across a number of different rendering environments. For any one object we may test its rendering on the following:
- (What we believe or know to be) the object’s original creating or rendering software running on representative hardware from the era that it was created.
- The object’s original creating or rendering software running in an emulated or virtualised set of hardware (QEmu, VMware or VirtualBox)
- Various current software applications such as Open Office, Libre Office, Microsoft Office 2007, Corel WordPerfect X5 etc in order to represent the object as migrated using those applications.
We are using a LimeSurvey survey to document each test rendering. This survey has questions about the different variables that may change across different renderings. It also has conditional questions that change depending on what kind of object you are testing the rendering of (it doesn’t ask about slide transitions for spreadsheets, for example). The survey currently has around 140 (mainly yes/no/comment) questions in total, though for any one test only a portion of these will appear to the tester (as few as ~6-10). We are also endeavouring to take screenshots of the process when needed.
As you might imagine, this experimentation is quite time consuming and would benefit from being replicated and/or extended by others, as we simply won’t have the time/resources to run these tests across as many objects as we would like. At the conclusion of the testing we intend to publish the results and a description of the methodology so that others can do similar tests.
Test set of files really appreciated!
Great to hear about a sample set of test files. We are desperately in need of these to test different view-paths in our prototypes. They become especially interesting when comparing different view-paths with each other regarding rendering quality and completeness of performance. With overly simple or artificially created objects it is easy to overlook important object properties that are handled differently in different view-paths. A future challenge could be to document and order the files in a way that allows searching for certain properties, like “lots of footnotes”, “complex formatting”, “large number of different fonts” or “different languages requiring special fonts”. It would be great to host this set at OPF. Additionally, it would be nice if extension by (OPF) users and comments (e.g. for adding metadata) were possible.
Test files and software description
Really interesting post, Euan! At the BL, we’re also in the process of making some sample datasets available, in part to support activities on the AQuA and SCAPE projects (http://wiki.opf-labs.org/display/AQuA, http://www.scape-project.eu/). Various SCAPE partners will be contributing datasets to support preservation tool development on the project, and some datasets will be freely available where this is possible. These should be appearing over the next month or so.
I believe OPF are keen on hosting sample data for just the purposes you describe, so that might be an option for you. Bram will be able to advise on this.
As regards your software description work, you might be interested in an ongoing project in this area called SWORD. It sounds like you have a slightly different approach, but there might be synergy. There are a lot of projects named SWORD, so I’ll give you the URL:
http://theswo.sourceforge.net/
(Past) software description
Thank you for the hint regarding the software ontology, and thank you, Euan, for starting a software database (besides the software archiving)! Some database/tool registry is definitely required to properly describe migration-by-emulation and create-view services using original environments. These consist of a range of different software components which relate to each other in a certain way and together produce the ability to handle digital objects of a certain type. The number of supported input and output file formats of a given original environment is typically much larger than for many migration tools.
Identifying binaries via hashes?
I’ve been wondering whether we should start using something like the SHA-256 hash sums of the binary artefacts as the identifiers for applications. In the majority of cases, exactly the same binary executables are distributed to the majority of users, and thus we would be able to spot whether people were using the same software without having to mint identifiers in a centralised manner. We could also do this for all the DLLs for a given binary and perhaps even for the kernel, and end up with a pretty good software environment fingerprint.
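As a rough sketch of this idea (the file paths and the way per-file hashes are combined into one environment fingerprint are my own assumptions, not an agreed scheme):

```python
# Sketch: fingerprint a software environment by SHA-256-hashing its binary
# artefacts (executables, DLLs, perhaps the kernel) and combining the results.
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path):
    """SHA-256 of a single binary artefact, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def environment_fingerprint(paths):
    """Combine per-file hashes (sorted, so file order doesn't matter)."""
    combined = hashlib.sha256()
    for digest in sorted(sha256_of(p) for p in paths):
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()

# Demo with stand-in artefacts; real use would walk an installed system.
d = Path(tempfile.mkdtemp())
(d / "app.exe").write_bytes(b"MZ-fake-executable")
(d / "lib.dll").write_bytes(b"MZ-fake-library")
fp = environment_fingerprint(d.iterdir())
print(fp)
```

Because the per-file digests are sorted before combining, two machines with the same binaries produce the same fingerprint regardless of directory enumeration order, which is what lets matches be spotted without a central registry.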
Having said that, it’s also worth noting that the Windows registry is full of unique keys that ensure binary compatibility, called ProgID GUIDs. My recent digging around in Word files showed me this, and made me wonder if we shouldn’t just reuse Microsoft’s identifiers.
Capturing the hardware environment fully is another thing entirely!
SWORD, GUIDs and hash sums
Hi all,
These are great comments, thanks. I have tried to follow the software ontology work and would like to be involved, but it’s hard to get information on it from over here in NZ. I did fill in the survey and sign up for the newsletter.
Andy, your ideas around identifying software are great! It sounds like the beginnings of a Droid-like tool for identifying applications/environments. It would be great to be able to run a tool across a transferring agency or depositor’s computer that would identify all the software installed on it including version variants and dependencies and give an ID for that particular environment (ignoring the unimportant other files on the computer).
I will look into the ProgID GUIDs; they will definitely be helpful for Windows applications, but we will probably also need a separate system (e.g. the PRONOM application IDs) that will apply across all OSs.
Hash sums for executables are as good a place as any to start for identifying applications. It may require more complexity than that, due to other non-executable dependencies and configuration files, but it would be a great start. Perhaps I will start to add those to our database.
Regards,
Euan Cochrane
Sorry, I meant ProgID and CLSID…
The ProgID are strings like Word.Document.8 (for Word 2007 I think) and unique class identifiers (CLSID) are strings like 8611AFE4-99CB-4dd5-9D7B-81E1F56E7151. See my AQuA project for more rough notes: https://github.com/openplanets/AQuA/tree/master/office-analyser
Identifying software components
Using preexisting identifiers is definitely a good point to start from. Are those MS program IDs documented somewhere and used in a reliable manner across the different versions of Windows and applications? Checksumming is a good idea, and we will be able to distinguish the different variants and versions by it. I have no idea if this can be properly matched to the aforementioned IDs. But I see some problems here too: when doing the software archiving at Archives NZ we got two WordPerfect variants of the same version. They silently updated some files when issuing a newer set of floppy disks. How should the software be treated for checksumming: looking at the whole medium, or at single files (binaries and DLLs)?
Fuzzy hashes and function signatures
I’m having trouble finding an authoritative list of ProgIDs and CLSIDs – I could have sworn I found one on MSDN, but maybe I’m mis-remembering.
I had two ideas for the troublesome cases. We could find non-identical but very similar binaries using fuzzy hashes, or by extracting the set of function signatures from each binary artefact and making a digital fingerprint out of that.
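As a crude, stdlib-only illustration of the “non-identical but very similar binaries” idea (real fuzzy-hashing tools such as ssdeep or sdhash are the proper choice; `difflib` here is just a stand-in to show the property being measured):

```python
# Crude stand-in for fuzzy hashing: score byte-level similarity between two
# near-identical binaries. A real fuzzy hash (e.g. ssdeep) lets you compute
# this comparison from small signatures without keeping the originals.
from difflib import SequenceMatcher

def binary_similarity(a: bytes, b: bytes) -> float:
    """Return a 0.0-1.0 similarity score between two binary artefacts."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

original = bytes(range(256)) * 16       # stand-in for a 4 KB binary
patched = bytearray(original)
patched[100:110] = b"\x00" * 10         # a small "silent update"
print(f"similarity: {binary_similarity(original, bytes(patched)):.3f}")
```

A score just below 1.0 flags the two artefacts as plausibly the same software with a minor patch (like the silently updated WordPerfect floppies mentioned above), whereas unrelated binaries score much lower.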
So much to do, so little time!
Fuzzy hashes for QA of content?
Fuzzy hashing is a new concept to me. Wondering if it would have any application in QA of (possibly damaged or changed) content? For example, the QA of audio files we looked at in AQuA:
http://wiki.opf-labs.org/display/AQuA/Audit+audio+batch+against+criteria
Some work already underway…
I learned about the tool from the Digital Lives research blog, which indicates some related work is already underway.
[EDIT: Darn, hit the wrong ‘reply’ button. Please imagine this post is on the correct thread.]
To answer your questions…
To answer your questions about providing test files…
1. In the cases where the files were pulled from difficult media, e.g. floppy disks, I would like the raw images to be available. Apart from that, in general, I’d like each resource to be made available individually, so each has a unique URI that we can use to refer to and annotate it. Ideally, the server would have gzip compression switched on to reduce the download size, and each resource would have some authority-independent identifier so we can spot the same files if the server moves, changes name, or if the content is mirrored elsewhere (e.g. locally). This identifier could be as complex as a DataCite DOI, or just the hash sum/digest of the bitstream (although note that only MD5 is supported as a standard HTTP header).
2. Don’t really mind, particularly if all of 1. above is met. On Archives NZ, but with the OPF and maybe http://digitalcorpora.org/ hosting mirrors?
3. Difficult to narrow this down. Any ancillary information is potentially interesting and useful, but just the files alone would be immensely helpful. Links to the migration/normalisation evidence base, perhaps? I think we’d need a better idea of the particular questions we have in mind in order to pin this down. For example, for testing identification tools, I’d like the corpus to have format metadata (e.g. PUIDs) that has been manually validated.
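The authority-independent identifier idea in point 1 above could be sketched like this (the `sha256:` prefix is an illustrative convention of mine, not a standard; Content-MD5 is the legacy HTTP header mentioned):

```python
# Derive identifiers from the bitstream itself, so the same bytes get the
# same ID wherever they are hosted or mirrored.
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """Value for the (legacy) Content-MD5 HTTP header: base64 of raw MD5."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

def bitstream_id(data: bytes) -> str:
    """Authority-independent identifier; "sha256:" prefix is illustrative."""
    return "sha256:" + hashlib.sha256(data).hexdigest()

sample = b"\xffWPC..."  # stand-in for one sample file's bytes
print(content_md5(sample))
print(bitstream_id(sample))
```

A mirror can recompute both values from the bytes alone, so the same file can be recognised across Archives NZ, OPF, and digitalcorpora.org hosting without any centrally minted identifier.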