Feed aggregator

Identification tools, an evaluation

Fido Blog Feed - 23 February 2012 - 9:09am
The Scape Characterisation Tool Testing Suite

This information has also been published in the Scape Deliverable D9.1.

We have created a testing framework based on the Govdocs1 digital corpora (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truths.

The framework we used for this evaluation can be found at

https://github.com/openplanetsfoundation/Scape-Tool-Tester

Each of the tested tools uses a different identification scheme for file formats. As a common denominator, we have decided to use mimetypes. Mimetypes are not detailed enough to contain all the relevant information about a file format, but all the tested tools are capable of reducing their more complete results to mimetypes. This ensures a level playing field.

The ground truths and the corpus

The govdocs1 corpus is a set of about 1 million files, freely available (http://digitalcorpora.org/corpora/files). Forensic Innovations, Inc. (http://www.forensicinnovations.com/) have kindly provided the ground truths for this testing framework, in the form of http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip. Unfortunately, they do not list a mimetype for each file, but rather a numeric ID, which seems to be vendor specific. They do, however, provide a mapping (http://www.forensicinnovations.com/formats-mime.html), which allows us to match IDs to mimetypes. The list is not complete, as they have not provided mimetypes for certain formats (which they claim do not have mimetypes). For the testing suite, we have chosen to disregard files that Forensic Innovations, Inc. do not provide mimetypes for, as they make up a very small part of the collection. The remaining files number 977885.
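
The reduction described above can be sketched roughly as follows. This is only an illustration, not the code from the testing framework: the function names, the tuple layout and the sample IDs are assumptions made for the example, and the real ground truth files may be structured differently.

```python
def load_id_to_mime(rows):
    """Build {format_id: mimetype}, skipping IDs that have no mimetype."""
    return {format_id: mime for format_id, mime in rows if mime}


def reduce_ground_truth(truth_rows, id_to_mime):
    """Keep only the files whose format ID maps to a mimetype."""
    return {
        filename: id_to_mime[format_id]
        for filename, format_id in truth_rows
        if format_id in id_to_mime
    }


# Made-up sample data: ID "2" stands for a format without a mimetype.
mapping = load_id_to_mime([("1", "application/pdf"), ("2", ""), ("3", "text/plain")])
truth = reduce_ground_truth(
    [("000001", "1"), ("000002", "2"), ("000003", "3")], mapping
)
print(len(truth))  # number of files left after dropping unmapped formats
```

Files with an unmapped ID are dropped silently here; a real run would probably also want to count them, to confirm that they are indeed a very small part of the collection.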

The reduced govdocs1 corpus contains files of 87 different formats. These are not evenly distributed, however. Some formats are only represented by a single file, while others make up close to 25% of the corpus.

To display the results, we have chosen to focus on the 20 most common file formats in the corpus, and to list the remaining 67 as the long tail, as these only make up 0.56% of the total number of files in the corpus.
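
A split into the most common formats and a long tail could be computed along these lines. It is a minimal sketch under the assumption that the ground truth is available as a plain list of mimetypes, one per file; the function name is made up for this example.

```python
from collections import Counter


def top_n_with_long_tail(mimetypes, n=20):
    """Return (top n formats with counts, tail format count, tail file count)."""
    counts = Counter(mimetypes)
    top = counts.most_common(n)
    tail_formats = len(counts) - len(top)
    tail_files = sum(counts.values()) - sum(count for _, count in top)
    return top, tail_formats, tail_files


# Tiny made-up sample: two common formats and two rare ones.
sample = ["application/pdf"] * 5 + ["text/html"] * 3 + ["image/gif", "text/sgml"]
top, tail_formats, tail_files = top_n_with_long_tail(sample, n=2)
print(top, tail_formats, tail_files)
```

Dividing the tail file count by the total number of files gives the share of the corpus covered by the long tail.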

 Format Distribution in Govdocs

One interesting characteristic of the ID-to-mime table from Forensic Innovations, Inc. is that each format has only one mimetype. Now, in the real world, this is patently untrue. Many formats have several mimetypes, the best known example probably being text/xml and application/xml. To solve this problem, we have introduced the mimetype-equivalent list, which amends the ground truths with additional mimetypes for certain formats. It should be noted that this list has been constructed by hand, simply by looking at the results of the characterisation tools. Any result that does not match the ground truth is recorded as an error, but later inspection of the logs has allowed us to pick out the results that should not have been errors, but rather alias results.
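
Scoring with such an equivalence list might look like the sketch below. The single entry shown is the text/xml and application/xml example from the text; the data structure and function names are assumptions for illustration and are not taken from the testing framework.

```python
# Hypothetical excerpt of a hand-built equivalence list:
# ground truth mimetype -> other mimetypes accepted as aliases.
EQUIVALENTS = {
    "application/xml": {"text/xml"},
}


def is_match(truth, reported, equivalents=EQUIVALENTS):
    """True if the reported mimetype equals the ground truth or is an alias of it."""
    return reported == truth or reported in equivalents.get(truth, set())


def match_rate(pairs, equivalents=EQUIVALENTS):
    """Fraction of (truth, reported) pairs that count as correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for truth, reported in pairs if is_match(truth, reported, equivalents))
    return hits / len(pairs)
```

Keeping the aliases in a separate table, rather than editing the ground truths, leaves the original data untouched and makes it easy to see which matches were accepted only as aliases.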

The test iterator

We have endeavoured to use the tools in a production-like way for benchmarking purposes. This means that we have attempted to use the tools’ own built-in recursion features, to avoid redundant program startups (most relevant for the Java-based tools). Likewise, we have, where possible, disabled those parts of the tools that are not needed for format identification (most relevant for Tika). We have hidden the filenames from the tools (by simply renaming the data files), in order to test their format identification capabilities without recourse to file extensions.
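
One way to hide filenames is sketched below. Note that it copies the files under opaque names into a separate directory instead of renaming them in place, so that the originals survive; this choice, the function name and the numbering scheme are assumptions of the example and not a description of how the test suite does it.

```python
import os
import shutil


def hide_filenames(source_dir, hidden_dir):
    """Copy each file to hidden_dir under a numeric name without extension.

    Returns {opaque_name: original_name} so results can be mapped back.
    """
    os.makedirs(hidden_dir, exist_ok=True)
    files = sorted(
        name
        for name in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, name))
    )
    mapping = {}
    for index, name in enumerate(files):
        opaque = "%08d" % index  # no extension, so tools must look at the content
        shutil.copyfile(
            os.path.join(source_dir, name), os.path.join(hidden_dir, opaque)
        )
        mapping[opaque] = name
    return mapping
```

The returned mapping is needed afterwards to join the tool output with the ground truths, which are keyed by the original names.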

Versions

Tika: 1.0 release

Droid: 6.0 release, Signature version 45

Fido: 0.9.6 release

Tika – a special note

For this test, Tika has been used as a Java library, and has been wrapped in a specialised Java program (https://github.com/blekinge/Tika-identification-Wrapper). This way, we can ensure that only the relevant parts of Tika are being invoked (i.e. identification) and not the considerably slower metadata extraction parts. By letting Java, rather than the test framework, handle the iteration over the files in the archive, we have also been able to measure the performance in a real mass-processing situation, rather than the large overhead of starting the JVM for each file.

Results

We have tested how precisely the tools have been able to produce results to match the ground truths. As stated, we have focused on the 20 most common formats in the corpus, and bundled the remainder into a bar called the Long Tail.

Precision

As can be seen from this graph, Tika generally performs best across the 20 most common formats. For text files (text/plain) in particular, it is the only tested tool that correctly identifies the files. For office files, especially Excel and PowerPoint, Droid seems to be more precise. Tika is almost as precise, but Fido loses greatly here. Given that Fido is based on the Droid signatures, it might be surprising that it outperforms Droid for certain formats, but it clearly does so for PDF, PostScript and Rich Text Format. The authors will not speculate on why this is so.

Comma/tab separated files are fairly common in the corpus. Tika cannot detect this feature of the files, and recognizes them as text/plain files. Fido and Droid fail to identify the files, just as they do for text/plain files.

The dBase files, a somewhat important feature of the corpus, are not detected by any of the tools.

Only Tika identifies any files as rfc2822, and even then it misses a lot. All three tools are equally bad at identifying sgml files.

Interestingly, Droid and Fido seem to work much better than Tika on the long tail of formats.

The Long tail

We feel that the long tail of formats is worth looking more closely at.

 

The long tail 

In this table, we have removed any format where none of the tools managed to identify any files. So, this table is to show the different coverage of the tools. We see that it is not just different levels of precision that matter, but which formats are supported by which tools.

Droid and Fido support the FITS image format. Tika does not. Tika, however, supports the OpenXML document format, which Fido and Droid do not.

Application pdf and application xml are some rather odd files (otherwise the ground truths would have marked them as normal pdfs or xmls). Here Tika is worse than the other tools. Tika, however, is able to recognize RDF, as shown by the application/rdf+xml format.

It is clear that while the overall precision in the long tail is almost equivalent for the three tools, the coverage differs greatly. If Tika, for example, gained support for the FITS image format, it would outperform Droid and Fido on the long tail. Droid and Fido, however, would score much higher if they gained Tika's support for Microsoft OpenXML documents.

The speed of the tools

For production use of these tools, not just the precision but also the performance of the tools is critical. For each tool, we timed the execution to obtain the absolute time in which the tool is able to parse the archive. Of course, getting precise numbers here is difficult, as keeping an execution totally free of delays is almost impossible on modern computer systems.

We ran each of the tools on a dell poweredge m160 blade server, with two Intel(R) Xeon(R) CPU X5670 @ 2.93GHz. The server had 70 GB RAM, in the form of 1333MHz Dual Ranked LV RDIMMs.

The corpus was stored on a file server and accessed through a mounted Network File System over a Gigabit network interface.

Each of the tools was allowed to run as the only significant process on the given machine, but we could not ensure that no delays were caused by the network, as this was shared with other processes in the organisation.

 Speed test

To establish baselines, we have added two additional "tools": the Unix File tool and the md5 tool.

The Unix File tool checks the file headers against a database of signatures. If a tool is significantly faster than the File tool, this indicates that it was able to identify the files without reading their contents. To do so, it would probably have to rely on filenames. Tika seems to be faster, but such small differences are covered by the uncertainties in the system.

Md5 is not a characterisation tool. Rather, it is a checksumming tool. To checksum a file, the tool needs to read the entire file. For the system in question the actual checksum calculation is negligible, so Md5 gives a baseline for reading the entire archive.
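
To illustrate why a checksum is a reasonable baseline for "read everything", here is a minimal sketch that computes an MD5 digest by streaming a file in chunks. It uses Python's hashlib rather than the md5 command line tool used in the test, and the chunk size is an arbitrary assumption.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Every byte of the file passes through the loop exactly once, so when the hashing itself is cheap, the running time is dominated by the read.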

As can be seen, Tika is the fastest of the tools, and Fido is the slowest. That the showdown would be between Tika and Droid was expected. Python does not have a Just In Time compiler, and will not be able to compete with Java for such long-running processes. That Fido was even slower than Md5 came as a surprise, but again, Md5 is written in very optimised C, and Fido is still Python.

Preservation Topics: Identification, Web Archiving, Corpora, SCAPE, Fido

Attachments: Format distribution in Govdocs (17.48 KB), Tool precision (19.12 KB), The long tail (17.43 KB), Tool speed (7.17 KB)

New FIDO version: 0.9.6

Fido Blog Feed - 4 October 2011 - 1:50pm

The new FIDO (Format Identification for Digital Objects) is here, version 0.9.6.

Improvements:

  • reports if match is based on signature, extension or no match (fail)
  • reports if file is empty (to stderr)
  • reporting of mime-types fixed (special thanks to Derek Higgins)
  • shows help upon invocation without arguments
  • PDF signatures updated from the PRONOM files; with the previous signatures FIDO failed to recognize some PDF versions
  • extra information available in output via matchprintf: file format version, alias, Apple UTI, group index and group size (in case of multiple -tentative- hits) and current file count


Changes:

  • extension switch removed, this is a builtin default now
  • mime-types added to standard match output
  • match type added to standard match output
  • STDOUT/STDERR printing is now backward/forward compatible with old and future Python versions
  • Windows installer and site-package installer removed due to incompatibility problems


Additionally there is a new script 'to_xml.py' which converts FIDO's csv output to XML. This script also reports the FIDO version and PRONOM signature version. You can pipe FIDO's output to this script while it runs or use it afterwards to convert the CSV output file. More information on how to invoke this converter can be found in the script. Please note that the XML template in this script is only compatible with the default matchprintf output, but you are free to change this template yourself if needed.
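
The general idea of such a conversion can be sketched as follows. This is not the 'to_xml.py' script and does not reproduce its XML template: the column names, the element names and the sample row are all assumptions made up for the illustration, and FIDO's actual output columns may differ.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical column names; the real FIDO CSV layout may differ.
FIELDS = ["status", "puid", "mimetype", "filename"]


def csv_to_xml(csv_text, fields=FIELDS):
    """Turn CSV rows into a simple XML document, one <file> element per row."""
    root = ET.Element("results")
    for row in csv.reader(io.StringIO(csv_text)):
        item = ET.SubElement(root, "file")
        for name, value in zip(fields, row):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample row, not real FIDO output.
print(csv_to_xml("OK,fmt/18,application/pdf,report.pdf\n"))
```

Because the field list is a parameter, a changed output layout only requires passing a different list of names.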


Next tasks on the list are cleaning up the code, creating an easy Pythonic installer, having FIDO recognize compound documents better and improving the Prepare script (to generate FIDO compatible signatures). Please consult the FIDO JIRA for more information on these subjects.


You can pull the new version via https://github.com/openplanets/fido or download the zip directly: https://github.com/openplanets/fido/zipball/master


If you find any bugs or have any questions or requests, please submit them to the FIDO JIRA:

http://jira.opf-labs.org/browse/FIDO

Preservation Topics: Identification, Tools, Open Planets Foundation, Fido

Fido in the jar

Fido Blog Feed - 3 March 2011 - 4:37pm

Open Planets Foundation is proud to present Fido.jar, a Java port of the Python version of Fido (Format Identification for Digital Objects). This first version runs on all platforms with Java 6 update 23 or later installed.

We would like you to give this first Fido in a jar a try. If you encounter any bugs, please submit them to the OPF Labs Jira. Installation and usage instructions are included in the zipfile.

Download Fido.jar @ Github:

https://github.com/downloads/openplanets/fido/fido_jar-0.9.5.zip

Preservation Topics: Identification, Tools, Fido

Fido – a high performance format identifier for digital objects

Fido Blog Feed - 3 November 2010 - 7:57am

Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux.  Most importantly, Fido is very fast.
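
The approach of applying signatures as regular expressions can be illustrated with a small sketch. The three patterns below are hand-written magic-number checks chosen for the example; they are not taken from the Pronom signature files, and the function is not Fido's actual implementation.

```python
import re

# Hypothetical, hand-written signatures for illustration only.
# \A anchors each pattern at the start of the data.
SIGNATURES = [
    ("application/pdf", re.compile(rb"\A%PDF-")),
    ("image/png", re.compile(rb"\A\x89PNG\r\n\x1a\n")),
    ("image/gif", re.compile(rb"\AGIF8[79]a")),
]


def identify(data):
    """Return the mimetype of the first signature that matches, else None."""
    for mimetype, pattern in SIGNATURES:
        if pattern.search(data):
            return mimetype
    return None


def identify_file(path, head_size=4096):
    """Identify a file from its first head_size bytes."""
    with open(path, "rb") as handle:
        return identify(handle.read(head_size))
```

Compiling each pattern once and reusing it for every file means the per-file work is mostly a read plus a handful of regular expression searches over bytes.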

In a subsequent post, I’ll describe the implementation in more detail.  For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October.  The core is a couple of hundred lines of code in three files.  It is shorter than these blog posts!

I was stunned by Fido’s performance.  Its memory usage is very small.  Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.

I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP.  In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at the rate of 60 files per second.

As a point of comparison, I also benchmarked the file (cygwin 5.0.4 implementation) command in the same environment against the same set of 5000 files.  File does a job similar to Droid or Fido – it identifies types of files, but more from the perspective of the Unix system administrator than a preservation expert (e.g., it is very good about compiled programmes, but not so good about types of Office documents).  I invoked file as follows:

       time find . -type f | file -k -i -f - > file.out

This reports 1m24s or 84 seconds.  I compared this against:

       time python -m fido.run -q -r . > fido.csv

This reports 1m18s or 78 seconds.

In my benchmark environment, Fido 0.7.1 is about the same speed as file.  This is an absolute shock.  Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool.  Memory usage is rock solid and tiny for both Fido and file.

Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo CPU E7500 @ 2.93 GHz (1 of 2 CPU's used in virtual setup);  RAM: 1 GB).  He observed Fido devour a collection of about 34000 files at a rate of 230 files per second.

Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.

For me, performance in this range is a surprise, a relief, and an important step forward.  It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections.  A rate of 200 files per second is equivalent to 17.28 million files in a day – on a single processor. Fido 0.7 is already fast enough for most current collections.
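
The daily figure follows directly from the number of seconds in a day. A minimal sketch of the arithmetic, assuming the rate is sustained around the clock on a single processor:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def files_per_day(files_per_second):
    """Files processed in 24 hours at a constant rate."""
    return files_per_second * SECONDS_PER_DAY


print(files_per_day(200))  # 17280000, i.e. 17.28 million files
```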

Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive.  Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.

Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux.  It is still beta code – we welcome your comments, feedback, ideas,  bug reports - and contributions!

Preservation Topics: Identification, Fido