Call for a Test Set of Files

Memory institutions planning to realise a digital preservation strategy and set up suitable systems face the problem of missing evaluation components. A number of tools for object characterization, migration or rendering in emulated original environments are available or being developed at the moment. But to evaluate or compare them, a proper set of sample objects is required. Those objects could be taken from each organisation's individual holdings, but this strategy has some shortcomings:

  1. The objects might be classified and restricted from being removed from the organisation's premises, and not all staff might be allowed access to them.
  2. Significant backlogs in reading deprecated media to direct storage or in the delivery of objects might draw a biased picture of the relevant object types or lead to some being completely overlooked.

Additionally, it might be favourable to hand out sample sets to developers or contractors. This could offer an opportunity to save time, offload tasks and direct the development as desired. Those sets should be free from any restrictions and privacy concerns, as they are to be made publicly available to everyone.

A public sample set containing a number of objects of different file types is relevant for several reasons:

  1. Checking and extending existing filetype detection libraries or creating new tools (as suggested in Bill's blog),
  2. Definition of a core test set for migration tools and emulated original environments,
  3. Defining a minimum set for a software archive of creating and rendering applications.

Creating those files with the original applications might not serve the purpose, as such artificial objects might lack the complexity or features found in the original material.

The kind of content sought after could be hosted/curated by the OPF ...



Excellent idea, Dirk, such a thing exists in the form of the PLANETS corpus. It can now be accessed through the Testbed. We could make an effort to make it more accessible and to expand upon it.

Andy Jackson's picture

The Testbed actually hosts two corpora, the XCL Corpus and the ONB Digitisation Corpus. I'm not sure it's 100% clear what terms the corpora are available under. In particular, the XCL corpus is an amalgam of sets of files drawn from elsewhere, and so has various different terms of use. Although having them in the Testbed is nice, we should look into making them more easily available.

Other organisations are also doing some interesting work building up test corpora with clear licensing arrangements. I started collecting some links here. I think the govdocs1 corpus is particularly interesting.

Euan Cochrane's picture

Hi Everyone,

I am in agreement with Dirk on this. 

I have not done a thorough analysis, but it appears that existing corpora don't have much diversity in file formats. In particular they do not have many older file formats, such as WordPerfect, dBase, Word for DOS, etc., or variation in creating applications for more recent formats, e.g. examples of Word 97-2003 files created with the WordPerfect Suite (10, X1--X5 etc), NeoOffice or Office for Mac.

This lack of diversity is a problem because the common formats are the easiest to deal with and the most well supported by current tools. It is the uncommon formats that cause the greatest problems for the tools.

Additionally, what we really need in my opinion are signatures that register not just the file format selected as a setting when a user "saved" a file but also the creating application used to "save" that file (as well as tools to recognise those signatures). It is this combination that defines the structure of files and also sets limits on/defines how any particular piece of software will behave when trying to render or migrate content saved in that file. Currently we have no tools that can tell us both the creating application and the file-format setting used when creating files. 

Unfortunately the Planets Testbed is practically unresponsive from over here on the other side of the world. I have been able to log in, but doing anything else is near impossible. If I manage to get any response from the site I will look in more detail at the test corpora available there.



Andy Jackson's picture

According to this breakdown, the govdocs1 corpus does hold some WordPerfect, dBase and Word for DOS files. There are not very many of them, but it's a start. Having said that, I do agree that a more systematically produced corpus that records the creating application(s) would be very useful, and indeed necessary if we are going to properly understand the long tail of formats that are not widely used/shared.

We've been talking quite a bit about test datasets over here at NLNZ.

The big problem with using this stuff as a source to develop/check new tools etc is knowing how representative the sample files are of the format they stand for. For example, I have a large collection of MP3 format variants, all of which should be similar enough to group them under a single identifier (e.g. fmt/134). What I don't have is confidence that my collection represents all the variants of 'MP3' structure that I am likely to encounter.

I think what’s missing is a ground-truthed / ‘trusted’ dataset. The corpora that I have found seem to have lots of files (which is great) but not so much by way of ground truth (supporting information that tells me that the file is of file type fmt/xxx and perhaps something about the creating app & environment). This means that I either have to assume the format identity through manual inspection, or I use format ID tools (which may or may not report the format accurately....)

The big value in the ground truth is the authenticated format identity (PUID or other) that comes with the files, meaning we can inspect the exemplar files and compare them to what we have in front of us with a high degree of confidence that we are comparing 'apples to apples'.
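To make the 'apples to apples' comparison concrete, here is a minimal Python sketch of how a ground-truthed set could be used to score a format ID tool. Everything in it is an assumption for illustration: the function name, the file names, and the identifier fmt/999 are made up, and the "reported" mapping stands in for whatever a real tool would output.

```python
from collections import Counter


def score_identification(ground_truth, reported):
    """Compare a tool's reported format IDs with the ground-truth IDs.

    Both arguments map a file name to a format identifier (e.g. a PUID).
    Files the tool gave no answer for are counted as misses.
    """
    confusion = Counter()  # (expected, reported) -> number of files
    correct = 0
    for name, expected in ground_truth.items():
        actual = reported.get(name)  # None when the tool gave no answer
        if actual == expected:
            correct += 1
        else:
            confusion[(expected, actual)] += 1
    accuracy = correct / len(ground_truth) if ground_truth else 0.0
    return accuracy, confusion


# Hypothetical data: fmt/134 is the MP3 example above, fmt/999 is a placeholder.
ground_truth = {"a.mp3": "fmt/134", "b.mp3": "fmt/134", "c.dat": "fmt/999"}
reported = {"a.mp3": "fmt/134", "b.mp3": "fmt/999"}

accuracy, confusion = score_identification(ground_truth, reported)
print(f"accuracy: {accuracy:.2f}")
for (expected, actual), count in sorted(confusion.items(), key=str):
    print(f"expected {expected}, tool reported {actual}: {count} file(s)")
```

The confusion counts are probably the more useful output, since they show which formats a tool mistakes for which, rather than just an overall score.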

If we go down the broad corpus route, I suggest there should be some management of the dataset that results in an open development subset and an equivalent closed test/evaluation set that is used in a very controlled way to validate work completed via the development set. Perhaps OPF could be the custodian of the eval set, and offer an informal/formal tool appraisal function that validates the format ID performance? This would go some way towards reducing the skew that fully open datasets could introduce.
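One possible way to get an open development subset and an 'equivalent' closed eval set is to split within each format, so both halves have the same format mix. The Python sketch below is only an illustration under that assumption; the function name, the 50/50 default and the sample records are hypothetical, and hashing the file ID is just one simple way to make the split repeatable.

```python
import hashlib
from collections import defaultdict


def split_corpus(records, eval_fraction=0.5):
    """Split (file_id, format_id) pairs into development and evaluation lists.

    The split is done per format so both sets keep a similar format mix,
    and it is deterministic: the same input always gives the same split.
    """
    by_format = defaultdict(list)
    for file_id, format_id in records:
        by_format[format_id].append(file_id)

    dev, evaluation = [], []
    for format_id, ids in by_format.items():
        # Ordering by a hash of the ID gives a stable, arbitrary order.
        ids.sort(key=lambda i: hashlib.sha256(i.encode("utf-8")).hexdigest())
        cut = int(len(ids) * eval_fraction)
        evaluation.extend((i, format_id) for i in ids[:cut])
        dev.extend((i, format_id) for i in ids[cut:])
    return dev, evaluation


# Hypothetical records: four files of one format, two of another.
records = [
    ("file-001", "fmt/134"), ("file-002", "fmt/134"),
    ("file-003", "fmt/134"), ("file-004", "fmt/134"),
    ("file-005", "fmt/999"), ("file-006", "fmt/999"),
]
dev_set, eval_set = split_corpus(records)
print(len(dev_set), "development files,", len(eval_set), "evaluation files")
```

The custodian would then publish only the development list and keep the evaluation list back for controlled tool appraisal.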

What I think would be really useful is a place where we could crowd source example formats... We could arrange a public facing wiki/repository. Anyone can get files, but there is a trusted depositor status, assigned to suitable people/institutions, that allows uploading files. Each file is tagged with authenticated format ID data (at minimum something like a PUID, but more useful is the content creation app, environment, OS etc, and the date it was created if known). We also handle the IP at this time: the depositor grants a right to use the content, and everything is 'donated' under a suitable CCL. If we can get enough content into the repository there are lots of cool things we could do: provide access to clumps of files of particular types, randomised sample sets, bulk download etc.
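To give a feel for how little is needed per file, here is a rough Python sketch of a deposit record and a randomised sample by format. The field names, the licence string and the sample entries are all assumptions made up for illustration, not a proposed schema.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deposit:
    """Metadata a trusted depositor supplies with one file (illustrative)."""
    file_name: str
    format_id: str                      # e.g. a PUID; the minimum requirement
    depositor: str
    licence: str                        # the licence the file is donated under
    creating_application: Optional[str] = None
    operating_system: Optional[str] = None
    date_created: Optional[str] = None  # left empty when not known


def sample_by_format(deposits, format_id, size, seed=None):
    """Return up to `size` randomly chosen deposits of the given format."""
    matching = [d for d in deposits if d.format_id == format_id]
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(matching, min(size, len(matching)))


# Hypothetical repository contents.
deposits = [
    Deposit("one.mp3", "fmt/134", "Institution A", "CC licence"),
    Deposit("two.mp3", "fmt/134", "Institution B", "CC licence",
            creating_application="Example Encoder 1.0"),
    Deposit("three.mp3", "fmt/134", "Institution A", "CC licence"),
    Deposit("four.dat", "fmt/999", "Institution C", "CC licence"),
]
picked = sample_by_format(deposits, "fmt/134", size=2, seed=1)
print([d.file_name for d in picked])
```

Asking for more files than exist for a format simply returns all of them, which seems a friendlier default for rare formats than raising an error.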

I'd be interested in thoughts on this approach...

Bram van der Werf's picture


Actually OPF would be very happy to be custodian or steward of a test eval set and a crowd sourced validating mechanism.

OPF also strongly believes that a crowd sourcing process for appraising, validating and peer reviewing within a community of experts is potentially key to bringing us closer to effective solutions. See also Andy Jackson's proposal for building a Collaborative Format Registry Editor.

This was related to his blogging on 

Today's challenge in our community of libraries and archives is a lack of collaboration in sharing a part of our daily work with a wider community via a process that meets basic requirements for trusted information. This is not rocket science and does not require "projects" or "software development", but rather common sense and an open mind for practical solutions.

For Andy's practical collaborative initiative for a "crowd sourced format registry editor", we saw very few registrations and little response, while at the same time we all know that the real problem with existing tools (PRONOM, JHOVE, DROID etc.) lies only to a limited extent in the functionality and far more in the quality of the content.

OPF would be very happy to endorse and sponsor your idea of crowd sourcing the validation of files in test corpora, just as OPF is very supportive of Andy's Format Registry Editor.

Tools like PRONOM, JHOVE2 and DROID all depend heavily on the quality of the underlying content. This can only be provided by a community of technically competent reviewers, as is the case with validation of test corpora. Wikis and other collaborative editing tools can be of great help, but we actually need to motivate our community to make this sharing of information and peer reviewing part of their daily work. OPF would be happy to explore, with you and many other colleagues around the globe, practical actions to make this happen.