Corpora

Identification tools, an evaluation

We have created a testing framework based on the Govdocs1 corpus from Digital Corpora (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truth.

We have tested Tika 1.0, Fido 0.9.6 and Droid 6.0 with the V45 signature file.

Tika generally performs best across the 20 most common formats. For text files (text/plain) in particular, it is the only tested tool that identifies the files correctly.

Tika is the fastest of the tools, and Fido is the slowest.
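
The comparison against ground truth described above can be sketched roughly as follows. This is only an illustration, not the framework's actual code: the function name `per_format_accuracy`, the file names and the sample values are all made up, and it assumes that both the ground truth and a tool's output have already been reduced to a mapping from file name to MIME type.

```python
from collections import defaultdict


def per_format_accuracy(ground_truth, identified):
    """Return {format: fraction of files of that format identified correctly}.

    Both arguments map a file name to a MIME type (an assumed, simplified
    representation of the ground truth and of a tool's report).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for name, true_format in ground_truth.items():
        totals[true_format] += 1
        # A file the tool did not report at all counts as a miss.
        if identified.get(name) == true_format:
            correct[true_format] += 1
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


# Hypothetical sample data, not results from the evaluation.
ground_truth = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "text/plain",
}
tool_output = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "application/octet-stream",
}

accuracy = per_format_accuracy(ground_truth, tool_output)
print(accuracy)
```

Grouping by the ground-truth format, rather than by what the tool reported, keeps the score for each format independent of how often a tool over-reports that format.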

Project to Identify files with linked dependencies

Many office suites and other applications allow information to be embedded via a link to another file. Linked spreadsheets are common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.
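
As a rough illustration of what detecting such a dependency might look like for one format, the sketch below checks an .xlsx package for external-link parts. It is not the project's method. It assumes the part naming that Excel conventionally uses (`xl/externalLinks/`), which other producers are not obliged to follow, and the function name `external_link_parts` is hypothetical.

```python
import zipfile

# Assumption: Excel-style packages keep external workbook references
# in parts under this folder; other producers may name parts differently.
EXTERNAL_LINK_PREFIX = "xl/externalLinks/"


def external_link_parts(xlsx_file):
    """Return the names of package parts that look like external links.

    `xlsx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlsx_file) as package:
        return sorted(
            name
            for name in package.namelist()
            # Skip the accompanying relationship (.rels) parts.
            if name.startswith(EXTERNAL_LINK_PREFIX) and name.endswith(".xml")
        )
```

A non-empty result only suggests that the workbook refers to another file. It does not show whether that file is still available.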

Future Perfect 2012

The draft programme for Future Perfect 2012 is now available online!

http://bit.ly/uVsvDl

The theme of Future Perfect 2012 - Digital Preservation by Design - seeks to stimulate discussion about how, when and why influencing the design of systems can support digital preservation and ultimately ensure that today’s information is available tomorrow. Future Perfect 2012 will be a two-day conference featuring many exciting international speakers.

Our audience will hear presentations from:

Date: 26 March 2012 to 27 March 2012
Location: Museum of New Zealand Te Papa Tongarewa, Wellington, New Zealand

Evaluation of identification tools: first results from SCAPE

As I briefly mentioned in a previous blog post, one of the objectives of the SCAPE project is to develop an architecture that will enable large-scale characterisation of digital file objects. As a first step, we are evaluating existing characterisation tools. The overall aim of this work is twofold.

Call for a Test Set of Files

Memory institutions planning to realise a digital preservation strategy and set up suitable systems face the problem of missing evaluation components. A number of tools for object characterization, migration or rendering in emulated original environments are available or currently being developed. But to evaluate or compare them, a proper set of sample objects is required. Those objects could be taken from each organization's individual holdings, but this strategy has some shortcomings: