As I briefly mentioned in a previous blog post, one of the objectives of the SCAPE project is to develop an architecture that will enable large-scale characterisation of digital file objects. As a first step, we are evaluating existing characterisation tools. The overall aim of this work is twofold. First, we want to establish which tools are suitable candidates for inclusion in the SCAPE architecture. Second, as the enhancement of existing tools is another goal of SCAPE, the evaluation is also aimed at getting a better idea of the specific strengths and weaknesses of each individual tool. The outcome of this will be helpful for deciding what modifications and improvements are needed. Also, many of these tools are widely used outside of the SCAPE project, which means that the results will most likely be relevant to a wider audience (including the original tool developers).
Over the last few months, work on this has focused on format identification tools. This has resulted in a report, which is attached to this blog post. We have evaluated the following tools:
All tools were evaluated against a set of 22 criteria. Extensive testing using real data has been a key part of the work. One area which, I think, we haven’t been able to tackle sufficiently so far is the accuracy of the tools. Measuring accuracy is difficult, since it requires a test corpus in which the format of each file object is known a priori. In most large data sets this information will have been derived from the very same tools that we are trying to test, so we need to see if we can say anything meaningful about this in a follow-up.
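To make the accuracy problem concrete, here is a minimal Python sketch of how a tool could be scored against a corpus with independently known formats. Everything in it is an assumption for illustration: the function name `accuracy`, the file names and the format labels are made up, and this is not the method used in the report.

```python
def accuracy(ground_truth, tool_output):
    """Return the fraction of files whose tool-reported format
    matches the independently known format."""
    if not ground_truth:
        raise ValueError("ground truth corpus is empty")
    correct = sum(
        1 for path, fmt in ground_truth.items()
        if tool_output.get(path) == fmt
    )
    return correct / len(ground_truth)


# Hypothetical corpus: formats established independently of any tool.
ground_truth = {"a.pdf": "pdf", "b.jpg": "jpeg", "c.dat": "tiff", "d.png": "png"}

# Hypothetical tool results: None means the tool made no identification.
tool_output = {"a.pdf": "pdf", "b.jpg": "jpeg", "c.dat": None, "d.png": "gif"}

print(accuracy(ground_truth, tool_output))  # 0.5

# If the "known" formats were themselves produced by the tool under test,
# the score is trivially perfect and tells us nothing.
print(accuracy(tool_output, tool_output))   # 1.0
```

The second call shows why a corpus labelled by the tools themselves cannot be used: the comparison becomes circular.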
Over the past months we’ve been sending earlier drafts of this document to the developers of DROID, FIDO, FITS and JHOVE2, and we have received a lot of feedback on them. In the case of FIDO, a new version is underway, and this should correct most (if not all) of the problems that are mentioned in the report. For the other tools we have also received confirmation that some of the issues we found will be fixed in upcoming releases.
The attached report should be seen as a living document. There will probably be one or more updates at some later point, and we may decide to include more tests using additional data. Meanwhile, as always, we appreciate any feedback you have on this!
Evaluation of characterisation tools – Part 1: Identification
Johan van der Knijff
KB / National Library of the Netherlands
| Attachment | Size |
|---|---|
| | 1.28 MB |
Comments
Johan, this is an excellent report. It helped us to see where FITS (fits.googlecode.com) could use some improvements. The latest release of FITS (version 0.6) included some enhancements called for in your report. FITS now runs all its embedded tools in parallel for better performance – previously they were called sequentially. There are now also options to run FITS against directories of files. We appreciate getting feedback on the tool – let us know if there are additional features or improvements that could make the tool more useful to the community.
Great to see response from the tool developers
It’s great to see that the report turns out to be useful for the tool developers outside of the SCAPE project – responses like these are exactly what we were hoping for!
Of course this also means that the report is getting outdated quickly – apart from FITS there’s now also a new version of FIDO (with fixes for issues that were mentioned in the report), so we have to see how we can keep this document more or less up to date. I won’t be able to dedicate much time (if any at all) to testing the updated tool versions until the end of this year, but we should think of a way to provide some kind of continuity, either directly under the SCAPE flag or in some other way. I expect to be able to say more about this in a month or so …
Johan,
this is a very interesting and thorough report. As the architect of DROID 5 and DROID 6, I found that your comments really resonated with me. I think the criticisms presented are mostly fair, and it’s really good to have an honest third-party appraisal of DROID and the other tools in its space.
Many of the options which are awkward or missing in DROID (e.g. XML output, better command-line support) were on the development roadmap. Changes to the financial rules on employing contractors in government meant we were unable to complete everything we wanted to – we were down a development resource in the last stages of the project.
One statement you made isn’t quite true though:
“The output provides no information at all on whether an object could be identified in the first place: if DROID encounters an unidentifiable object, most of the output fields are simply left blank.”
In fact, DROID reports the number of identifications it made in the FORMAT_COUNT column. If this column is zero, it means that the file was processed successfully, but no identification could be made using the signatures. A number greater than zero indicates that positive identifications were made. If there is no number at all, then either no format identifications were attempted (e.g. it is a folder), or an error occurred trying to process the resource (in which case, the STATUS column tells you if there was any issue in processing).
This probably isn’t explained as well as it could be (despite the 50+ pages of help!), and it does represent a change from the earlier methods of reporting identification results. Earlier versions of DROID reported identification results like “positive”, “tentative”, “unidentified”, etc. These were at least clear in what they were communicating, although they were quite subjective. We wanted to transition DROID to only reporting statements of fact, rather than giving subjective judgements on what those facts might mean in your own context. Our research indicated that all institutions had very different attitudes to risk, so one size did not fit all.
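As a rough illustration of the FORMAT_COUNT rules described above, the following Python sketch classifies rows of a DROID CSV export. It is a sketch under assumptions, not part of DROID: the function names `classify_row` and `summarise` are hypothetical, the sample rows are made up, and the sample uses only a reduced set of columns, with STATUS values that are purely illustrative.

```python
import csv
import io


def classify_row(row):
    """Interpret one row of a DROID CSV export via its FORMAT_COUNT value."""
    count = (row.get("FORMAT_COUNT") or "").strip()
    if count == "":
        # No number at all: identification was not attempted (e.g. a folder)
        # or an error occurred; the STATUS column should be checked.
        return "not_attempted_or_error"
    if int(count) == 0:
        # Processed successfully, but no signature matched.
        return "unidentified"
    # One or more positive identifications were made.
    return "identified"


def summarise(csv_text):
    """Count how many rows fall into each category."""
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        label = classify_row(row)
        counts[label] = counts.get(label, 0) + 1
    return counts


# Made-up sample export (illustrative values only).
SAMPLE = """NAME,STATUS,FORMAT_COUNT
report.pdf,Done,1
mystery.bin,Done,0
some_folder,Done,
"""

print(summarise(SAMPLE))
```

For rows in the last category, the STATUS value would still need to be inspected to tell a folder apart from a processing error.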
I should probably add that I no longer work at the National Archives, but I remain interested in the field and what happens to DROID and other preservation tools in this space.
Regards,
Matt Palmer
Thanks for commenting on this
Hi Matt,
Thanks for your comments!
As for my statement on DROID not giving any information on whether a file could be identified at all: you’re completely right! I must have somehow overlooked this, as the meaning of the values in the FORMAT_COUNT column is actually described in the documentation. The change from the old ‘positive’/‘tentative’ qualifications also makes perfect sense. I’ll correct this in upcoming versions of the report. Thanks for pointing this out.
Johan