Identification

SPRUCE Mashup: Batch File Identification using Apache Tika

My last post discussed the benefits of collaboration, centred around a SCAPE hackathon. I argued that, in general, it was the collaborative, collocated nature of the developers working together that made demo development quicker; more people staring at the same problem results in multiple and varied viewpoints, ideas, and solutions.

Identification tools, an evaluation

We have created a testing framework based on the Govdocs1 corpus from Digital Corpora (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truth.

We have tested Tika 1.0, Fido 0.9.6 and Droid 6.0 with the V45 signature file.

Tika generally performs best across the 20 most common formats. For text files (text/plain) in particular, it is the only tested tool that identifies the files correctly.

Tika is the fastest of the tools, and Fido is the slowest.
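The comparison against ground truth described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the framework used for the evaluation: the function name, the file names, and the toy data are all assumptions, and a real run would load the ground truth and each tool's output from their respective report files.

```python
from collections import defaultdict


def per_format_accuracy(ground_truth, tool_output):
    """Return {mime_type: fraction of files the tool identified correctly}.

    Both arguments map a file name to a MIME type. The format keys come
    from the ground truth, so results are grouped by the expected type.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for filename, expected in ground_truth.items():
        totals[expected] += 1
        # A file the tool did not report at all counts as a miss.
        if tool_output.get(filename) == expected:
            correct[expected] += 1
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


# Hypothetical toy data; file names and results are made up for illustration.
ground_truth = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "text/plain",
}
tool_output = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "application/octet-stream",
}

accuracy = per_format_accuracy(ground_truth, tool_output)
for fmt, score in sorted(accuracy.items()):
    print(f"{fmt}: {score:.0%}")
```

Grouping by the expected type keeps the per-format figures comparable between tools, since every tool is scored against the same set of files for each format.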

Digital Archaeology and Forensics

Archives New Zealand and the University of Freiburg are cooperating on a data recovery project. The archive received a set of 5.25 inch floppy disks from the early 1990s that contained records of a public organization dating back to the mid 1980s. These floppies were not readable in standard x86 machines with a 5.25 inch floppy drive attached.

A prototype JP2 validator and properties extractor

A few months ago I wrote a blog post on a simple JP2 file structure checker. This led to some interesting online discussions on JP2 validation. Some people asked me about the feasibility of expanding the tool to a full-fledged JP2 validator. Despite some initial reservations, I eventually decided to dedicate a couple of weeks to writing a rough prototype.

Project to Identify files with linked dependencies

Many office suites and other applications allow information to be embedded in a document via a link to another file. The use of linked spreadsheets is common amongst data-intensive agencies, and large documents are often managed by linking multiple office documents to form a single final product.
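As a rough illustration of what finding such links can look like, the sketch below covers one case only: Office Open XML packages (.docx, .xlsx, .pptx), which are ZIP archives whose relationship parts (`.rels`) mark targets outside the package with `TargetMode="External"`. It is not the project's method; the function name and the demo package are assumptions, and older binary formats or other office suites would need a different approach. Ordinary hyperlinks are also stored as external relationships, so the result is a list of candidates rather than confirmed dependencies.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by relationship parts in Open Packaging Conventions packages.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_targets(source):
    """List (relationship part, target) pairs marked TargetMode="External".

    `source` is a path or a binary file-like object holding the package.
    """
    found = []
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(package.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    found.append((name, rel.get("Target")))
    return found


# Build a tiny in-memory package for demonstration; the part name and the
# target path are made up and do not come from a real spreadsheet.
rels_xml = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/externalLinkPath" '
    'Target="file:///C:/data/budget.xlsx" TargetMode="External"/>'
    "</Relationships>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("xl/externalLinks/_rels/externalLink1.xml.rels", rels_xml)
buffer.seek(0)

links = external_targets(buffer)
for part, target in links:
    print(part, "->", target)
```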

Future Perfect 2012

The draft programme for Future Perfect 2012 is now available online!

http://bit.ly/uVsvDl

The theme of Future Perfect 2012 - Digital Preservation by Design - seeks to stimulate discussion about how, when and why influencing the design of systems can support digital preservation and ultimately ensure that today’s information is available tomorrow. Future Perfect 2012 will be a two-day conference featuring many exciting international speakers.

Our audience will hear presentations from:

Date: 26 March 2012 to 27 March 2012
Location: Museum of New Zealand Te Papa Tongarewa, Wellington, New Zealand