blekinge’s blog

Identification tools, an evaluation

We have created a testing framework based on the Govdocs1 corpus from Digital Corpora (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as the ground truth.

We have tested Tika 1.0, Fido 0.9.6 and Droid 6.0 with the V45 signature file.
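
To make the idea of scoring a tool against a ground truth concrete, here is a minimal sketch in Python. It is not the code of our testing framework; the function name, the dictionary layout and the sample file names are assumptions made up for illustration. It simply counts, per format, how often a tool's reported MIME type matches the ground truth.

```python
from collections import defaultdict


def per_format_accuracy(ground_truth, reported):
    """Return {mime_type: fraction of files of that type identified correctly}.

    Both arguments map a file name to a MIME type string (an assumed layout,
    not the actual format used by any of the tools).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for filename, true_type in ground_truth.items():
        totals[true_type] += 1
        # A file the tool gave no answer for counts as a miss.
        if reported.get(filename) == true_type:
            correct[true_type] += 1
    return {fmt: correct[fmt] / totals[fmt] for fmt in totals}


# Made-up sample data, for illustration only.
truth = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "text/plain",
}
tool_output = {
    "000001.pdf": "application/pdf",
    "000002.txt": "text/plain",
    "000003.txt": "application/octet-stream",
}
scores = per_format_accuracy(truth, tool_output)
```

Running the same function over the output of each tool would give one accuracy table per tool, which can then be compared format by format.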

Tika generally performs best across the 20 most common formats. For text files (text/plain) in particular, it is the only tested tool that correctly identifies the files.

Tika is the fastest of the tools, and Fido is the slowest.

A new direction in file characterisation

While thinking about the Dev8D challenge (which I cannot compete in 🙁), I got to thinking about the way we do file characterisation.

I am not old enough to know the history of this field, but it seems that the grand old tool is the file(1) tool from Unix. When “file” was developed, files were expected to contain a few magic bytes in the header to help identification tools. We still see this pattern.
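
As a rough illustration of the magic-bytes pattern, here is a minimal sketch in Python. It is not how file, Tika, Fido or Droid are implemented; the table and the function name are my own assumptions, and the table only holds a handful of well-known header signatures.

```python
from typing import Optional

# A tiny, hand-picked table of (header bytes, MIME type) pairs.
# Real tools ship far larger signature sets.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def identify_by_magic(data: bytes) -> Optional[str]:
    """Return the MIME type whose signature starts the data, or None."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    # No known header matched, for example plain text,
    # which carries no magic bytes at all.
    return None


def identify_file(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and identify it."""
    with open(path, "rb") as handle:
        return identify_by_magic(handle.read(16))
```

A file without any header signature simply falls through to None here, so a tool has to use some other strategy for such files.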