A couple of months ago I reported on this blog that the OPF was beginning a project to investigate options for a new approach to file format registries. We’ve just released the second report of this activity: “A New Registry for Digital Preservation: Conceptual Overview” (330kB PDF). It explains our vision of a ‘registry ecosystem’ that will enable organisations worldwide to contribute and share information about file formats, while maintaining the ability to make independent and local decisions about preservation policies.
Please take a look and tell us what you think – leave a comment here or send me a mail ([email protected]).
The document explains the reasons why we think a new approach to file format registries is required and outlines the main requirements for such a registry. Our plan for the rest of 2011 is to start putting some of this into action: see the ‘Planning’ section of the report for details.
Most importantly, we’d like to ask for your input and contributions to this process:
please review the document and give us your feedback to help us refine our plans
consider whether the registry ecosystem we propose would be a good match to the needs of your institution
let us know if you would like to participate in making this idea a reality and what you could contribute.
We’re hoping for the usual insightful and thought-provoking comments from the readers of the OPF blog!
Practical Issues with Currently Available File Format Software
To get a better and more thorough impression of the activities and challenges a national archive faces, I undertook a few weeks of research with Archives New Zealand, the national archives of New Zealand. The original idea was to dig into emulation and emulation workflows, but pretty quickly another topic emerged: the file format/rendering software detection problem. Thus it is great to see some activity to solve this problem!
At the beginning of my research trip we conducted a small survey of the digital files Archives New Zealand holds that had already been copied from old media, in order to get a number of suitable digital objects with which to test emulated original environments. It produced a few hundred objects, which were run through the Droid detector (running on Windows XP). Some files were (re)checked with the Linux file command. Working on that primary set of example files, we discovered a number of issues. For the older files in particular, PRONOM/Droid and Linux file fail pretty badly. This is particularly important for the older files, as there is often less information or metadata on those files than on more recent ones. The detection tools do not come up with reasonable results for older WordStar, WordPerfect, MS-Word, … and files we found from old databases.
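To make the detection problem concrete, here is a minimal sketch in Python of identification by leading byte signature ("magic number"), which is roughly the principle that tools like Droid and file build on. The signature table is a tiny illustrative subset of widely documented signatures, not PRONOM's actual data, and the names SIGNATURES, identify and summarise are made up for this example. A file whose first bytes match nothing in the table simply comes back as unknown, which is the situation described above for many of the older files.

```python
from typing import Dict, Optional

# Illustrative subset of well-known leading byte signatures.
# Assumption: this is NOT PRONOM's signature data, only a few common examples.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP container"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format label if data starts with a known signature, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def summarise(samples: Dict[str, bytes]) -> Dict[str, int]:
    """Count how many samples were identified and how many stayed unknown."""
    counts = {"identified": 0, "unknown": 0}
    for data in samples.values():
        key = "identified" if identify(data) is not None else "unknown"
        counts[key] += 1
    return counts
```

Running summarise over the first few bytes of each file in a test set gives a quick picture of how much of a collection a signature table covers. Real tools use far richer signatures (offsets, wildcards, container inspection), so this only illustrates the principle.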
A very interesting case was a set of files from the late 1980s. They had names like CAT23BB.DAT. It was not possible to get any meaningful results from PRONOM and Linux’s file. They were originally sitting on old DEC-structured 5.25″ disks and could not be read with a standard PC floppy drive (fortunately someone had already copied them to a shared storage device, otherwise we would not have been aware of them in the first round). Only a dubious sheet with some short notes on it (within the floppy disk box) gave a hint about the data format: it had a scrawled note on top of other text that could just be read as saying “dataflex database”. Of course this application is not available here and was not archived with the files. There are ODBC importers available for dataflex, but the set of files that we had was missing some important structural files that are required to open the database itself. This case of digital archaeology is still to be solved; we are attempting to acquire a copy of dataflex via ebay to see if this will help.
In a next round, files from one of the Crown Research Institutes of New Zealand were cursorily reviewed. Besides offering a collection of document files similar to that in the archive, there were some more peculiar files, such as programs written in Turbo Pascal and Perl that formed part of research projects and theses.
In a further investigation the holdings of the archive were checked for material which had not yet been copied off the original media. This turned up material on 3.5″, 5.25″ and 8″ floppy disks and ZIP disks. The content has not been evaluated yet (and this is not easily possible for the old 8″ disks; by the way, does any institution hold such a device?).
In sync with Archivematica project design/requirements
Having participated in some of the informal discussions about this at iPres2010 I am very happy to see how this report has turned out. Nicely done. The conceptual overview is very much in sync with the Archivematica project’s requirements and design for interacting with external registries. In particular:
In the Archivematica project we are especially interested in the ability to share our default project format policies (http://archivematica.org/preservation) and any institutional policy customizations made by Archivematica users together with policies from the digital curation community at large. It is difficult without a lot of cumbersome research to get a decent sense of community consensus/trends in relation to ‘best practice’ preservation and access formats. A community-based file format policy registry should make this easier.
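As a rough illustration of how shared default policies and local customizations could sit side by side, here is a minimal Python sketch. Everything in it is hypothetical: the format names are placeholders, the names COMMUNITY_DEFAULTS and effective_policy are invented, and it reflects neither Archivematica's actual policies nor any planned registry data model.

```python
from typing import Dict

# Hypothetical community-wide defaults: source format -> preservation format.
# The entries are placeholders, not real Archivematica or OPF policies.
COMMUNITY_DEFAULTS: Dict[str, str] = {
    "format-a": "preservation-format-x",
    "format-b": "preservation-format-y",
}


def effective_policy(defaults: Dict[str, str],
                     local_overrides: Dict[str, str]) -> Dict[str, str]:
    """Layer an institution's overrides on top of shared defaults.

    The shared defaults are copied, never modified, so the community view
    and the local view can be published and compared independently.
    """
    merged = dict(defaults)
    merged.update(local_overrides)
    return merged
```

An institution that prefers a different target for format-b would call effective_policy(COMMUNITY_DEFAULTS, {"format-b": "preservation-format-z"}) and keep the community choice for everything else. Collecting such overrides across institutions is one way a sense of consensus could be derived.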
The primary registry requirements on our 2011 development roadmap are:
We’d much rather implement these requirements under a wider community banner (e.g. OPF) than maintain our own registry.
Given the OPF 2011 plans for further work on the registry we might be able to collaborate (at least give further feedback) on the registry data model work you are planning as well as to test read/write of your API via Archivematica.
Cheers,
–peter
Peter Van Garderen. http://archivematica.org project manager.
P.S. For our initial stages of adding registry interaction to Archivematica, format policy changes/updates are expected to arise from manual analysis and decisions (e.g. availability of a new open-source tool that allows us to add a previously unsupported preservation format, or changes in community ‘consensus’ about the risk/sustainability of preservation format x). In a second phase we would focus on integration with format risk assessment registries that allow for more comprehensive and systematic risk alerts and/or recommended format policy changes.
KB-NL feedback on OPF registry
Barbara Sierman
OPF/NA wrote a proposal for a new Registry for Digital Preservation and asked for feedback. The KB-NL gained a lot of experience with this material while setting up the development and implementation of the Preservation Manager, developed by IBM (but currently not in use by the KB, as it is not compatible with our e-Depot DIAS version). The following remarks are mainly based on this experience, and we thought that by sharing them we might add some interesting points to the discussion.