FIDO (Format Identification for Digital Objects) is a Python command line tool to identify the file formats of digital objects. This release brings many improvements to both the code and the functionality.
What’s new?
FIDO now uses the PRONOM container signature file to determine the container (compound) type if the file is of container type “OLE2” (fmt/111) or “ZIP” (fmt/189 or x-fmt/263). If so, FIDO performs a deep (complete) scan of the object. A side effect of this deep scan is that FIDO sometimes detects objects which are embedded in the scanned object. So if you cut and paste a PowerPoint slide into a Word document, FIDO might detect this. Tests revealed that this depends on the way the creating application (in this case MS Office 2003) structures the file while saving; this is not always straightforward and seems to differ on a per-file basis with only slight variations. Nice food for thought, I’d suppose!
Another side effect is that deep scanning can slow FIDO down significantly on big container-type files. You can disable deep scanning by invoking FIDO with the ‘-nocontainer’ argument. While disabling deep scanning of containers speeds up identification, it may reduce accuracy.
Signature update script
FIDO now comes with an interactive CLI script to update PRONOM signatures to the latest version. This script queries the PRONOM SOAP webservice for the latest available version. If a new version is available, it downloads the latest signature file and signatures. Please note that the script WILL NOT update the container signature file. The reason for this is that the PRONOM container signature file contains special types of sequences which need to be tested before FIDO can use them. If an update for the PRONOM container signature file becomes available, it will be tested before it shows up in the next commit. However, if you are up for a challenge, you can always try to update the file yourself.
Other changes in this version
The functionality to convert and check the quality of PRONOM signatures has been removed in this version. These functions are now available through the prepare script, which ships alongside the signature update script.
The future
FIDO should only do what it was intended for in the first place: Format Identification for Digital Objects. As an example, don’t expect FIDO to ever validate JPEG2000 files; we have JPYLYZER for that. The OPF philosophy is to create small applications for separate kinds of digital preservation tasks. This way every application can be used standalone, or, and here comes the best part, they can be built into a workflow. In fact, at NANETH we already use FIDO in one of our ingest tools. The publication of FIDO version 1.0 does not mean development stops. A next release will add support for running FIDO multithreaded, for even better performance. A next step in container identification could be to unpack compound files and have foreign objects scanned separately. There has also been a request to provide an interface to the values of the inner functions, so that FIDO can be incorporated into larger Python frameworks more easily. By the way: at the moment it is already possible to invoke FIDO from your own scripts, read more here.
Try it and report!
If you already use DROID or other format identification tools on a regular basis, give FIDO a try and report back to us. We’d like to know what features are missing, whether you use FIDO in a workflow, or whether you have found a bug. Even report back if you are just happy using FIDO.
DOWNLOAD NOW!
https://github.com/openplanets/fido/zipball/master
More FIDO…
This information has also been published in the SCAPE Deliverable D9.1.
We have created a testing framework based on the Govdocs1 digital corpus (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truths.
The framework we used for this evaluation can be found at
https://github.com/openplanetsfoundation/Scape-Tool-Tester
All the tested tools use different identification schemes for file formats. As a common denominator, we have decided to use MIME types. MIME types are not detailed enough to contain all the relevant information about a file format, but all the tested tools are capable of reducing their more complete results to MIME types. This ensures a level playing field.
The ground truths and the corpus
The govdocs1 corpus is a set of about 1 million files, freely available (http://digitalcorpora.org/corpora/files). Forensic Innovations, Inc. (http://www.forensicinnovations.com/) have kindly provided the ground truths for this testing framework, in the form of http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip. Unfortunately, they do not list MIME types for each file, but rather a numeric ID, which seems to be vendor specific. They do, however, provide a mapping (http://www.forensicinnovations.com/formats-mime.html) which allows us to match IDs to MIME types. The list is not complete, as they have not provided MIME types for certain formats (which they claim do not have MIME types). For the testing suite, we have chosen to disregard files for which Forensic Innovations, Inc. do not provide MIME types, as they make up a very small part of the collection. The remaining files number 977,885.
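The reduction step above is easy to sketch. The snippet below is illustrative only: the IDs, names, and the ID-to-MIME mapping are invented stand-ins for the Forensic Innovations data, not actual entries from it.

```python
# Sketch: map vendor-specific format IDs to MIME types and drop
# files whose ID has no known MIME type (all data here is invented).
id_to_mime = {
    271: "application/pdf",
    385: "text/plain",
    999: None,  # a format for which no MIME type is provided
}

ground_truth = [
    ("000001", 271),
    ("000002", 385),
    ("000003", 999),
]

# Keep only files whose format ID maps to a real MIME type.
usable = [(name, id_to_mime[fid])
          for name, fid in ground_truth
          if id_to_mime.get(fid)]

print(usable)  # the file with the unmapped format is dropped
```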
The reduced govdocs1 corpus contains files of 87 different formats. These are not evenly distributed, however. Some formats are only represented by a single file, while others make up close to 25% of the corpus.
To display the results, we have chosen to focus on the 20 most common file formats in the corpus, and to list the remaining 67 as the long tail, as these make up only 0.56% of the total number of files in the corpus.
One interesting characteristic of the ID-to-MIME table from Forensic Innovations, Inc. is that each format has only one MIME type. In the real world, this is patently untrue. Many formats have several MIME types, the best-known example probably being text/xml and application/xml. To solve this problem, we have introduced the MIME-type-equivalence list, which amends the ground truths with additional MIME types for certain formats. It should be noted that this list has been constructed by hand, simply by looking at the results of the characterisation tools. Any result that does not match the ground truth is recorded as an error, but later inspection of the logs has allowed us to pick out the results that should not have been errors, but rather alias results.
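An alias-aware comparison of this kind can be sketched in a few lines. The alias table below is a minimal illustration (just the text/xml pair mentioned above), not the actual equivalence list used in the evaluation.

```python
# Sketch: treat known MIME-type aliases as equivalent when comparing
# a tool's answer with the ground truth (alias table is illustrative).
ALIASES = {
    "text/xml": {"application/xml"},
    "application/xml": {"text/xml"},
}

def matches(result, truth):
    """True if the tool's MIME type equals the truth or a known alias."""
    if result == truth:
        return True
    return result in ALIASES.get(truth, set())

assert matches("application/xml", "text/xml")   # alias, not an error
assert not matches("text/html", "text/xml")     # genuine mismatch
```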
The test iterator
We have endeavoured to use the tools in a production-like way for benchmarking purposes. This means that we have attempted to use the tools’ own built-in recursion features, to avoid redundant program startups (most relevant for the Java-based tools). Likewise, we have, where possible, disabled those parts of the tools that are not needed for format identification (most relevant for Tika). We have hidden the filenames from the tools (by simply renaming the data files), in order to test their format identification capabilities without recourse to file extensions.
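Hiding the filenames could be done with a small renaming pass like the one below. This is a sketch of the idea, not the framework's actual code; the sample filenames are invented.

```python
# Sketch: hide file extensions from the tools by renaming every file
# in the corpus directory to an opaque, numbered name.
import os
import tempfile

# Build a tiny stand-in corpus with telltale extensions.
corpus = tempfile.mkdtemp()
for name in ("report.pdf", "data.xls"):
    open(os.path.join(corpus, name), "w").close()

# Rename each file to a zero-padded index with no extension.
for i, name in enumerate(sorted(os.listdir(corpus))):
    os.rename(os.path.join(corpus, name),
              os.path.join(corpus, "%06d" % i))

print(sorted(os.listdir(corpus)))  # ['000000', '000001']
```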
Versions
Tika: 1.0 release
Droid: 6.0 release, Signature version 45
Fido: 0.9.6 release
Tika – a special note
For this test, Tika has been used as a Java library, wrapped in a specialised Java program (https://github.com/blekinge/Tika-identification-Wrapper). This way, we can ensure that only the relevant parts of Tika are being invoked (i.e. identification) and not the considerably slower metadata extraction parts. By letting Java, rather than the test framework, handle the iteration over the files in the archive, we have also been able to measure performance in a real mass-processing situation, rather than measuring the large overhead of starting the JVM for each file.
Results
We have tested how precisely the tools have been able to match the ground truths. As stated, we have focused on the 20 most common formats in the corpus, and bundled the remainder into a bar called the Long Tail.
As can be seen from this graph, Tika generally performs best for the 20 most common formats. Especially for text files (text/plain), it is the only tested tool that correctly identifies the files. For office files, especially Excel and PowerPoint, Droid seems to be more precise. Tika is almost as precise, but Fido loses greatly here. Given that Fido is based on the Droid signatures, it is surprising that it seems to outperform Droid for certain formats, but it clearly does for PDF, PostScript and Rich Text Format. The authors will not speculate on why this is so.
Comma/tab-separated files are fairly common in the corpus. Tika cannot detect this feature of the files, and recognizes them as text/plain files. Fido and Droid fail to identify these files, just as they do for text/plain files.
The dBase files, a somewhat important part of the corpus, are not detected by any of the tools.
Only Tika identifies any files as rfc2822, and even then it misses a lot. All three tools are equally bad at identifying sgml files.
Interestingly, Droid and Fido seem to work much better than Tika on the long tail of formats.
The Long tail
We feel that the long tail of formats is worth looking at more closely.
In this table, we have removed every format for which none of the tools managed to identify any files. The table thus shows the differing coverage of the tools: it is not just levels of precision that matter, but which formats are supported by which tools.
Droid and Fido support the FITS image format; Tika does not. Tika, however, supports the OpenXML document format, which Fido and Droid do not.
The application/pdf and application/xml entries here are some rather odd files (otherwise the ground truths would have marked them as normal PDFs or XMLs). Here Tika is worse than the other tools. Tika, however, is able to recognize RDF, as shown by the application/rdf+xml format.
It is clear that while the overall precision in the long tail is almost equivalent for the three tools, the coverage differs greatly. If Tika, for example, gained support for the FITS image format, it would outperform Droid and Fido on the long tail. Droid and Fido, however, would score much higher if they gained Tika’s support for Microsoft OpenXML documents.
The speed of the tools
For production use of these tools, not just the precision but also the performance is critical. For each tool, we timed the execution, to obtain the absolute time in which the tool can parse the archive. Of course, getting precise numbers here is difficult, as keeping an execution totally free of delays is almost impossible on modern computer systems.
We ran each of the tools on a Dell PowerEdge m160 blade server, with two Intel(R) Xeon(R) X5670 CPUs @ 2.93GHz. The server had 70 GB RAM, in the form of 1333MHz dual-ranked LV RDIMMs.
The corpus was stored on a file server and accessed through a mounted Network File System over a Gigabit network interface.
Each of the tools was allowed to run as the only significant process on the machine, but we could not ensure that no delays were caused by the network, as it was shared with other processes in the organisation.
To establish baselines, we have added two additional ”tools”, the unix File tool and the md5 tool.
The Unix file tool checks file headers against a database of signatures. A tool that runs significantly faster than file has probably identified the files without reading their contents; to do so, it would most likely have to rely on filenames. Tika seems to be faster, but such small differences are covered by the uncertainties in the system.
Md5 is not a characterisation tool but a checksumming tool. To checksum a file, the tool needs to read the entire file. For the system in question the actual checksum calculation is negligible, so md5 gives a baseline for reading the entire archive.
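The reason md5 works as an I/O baseline is visible in code: checksumming cannot skip any bytes. A minimal sketch (not the benchmark's actual md5 tool, which is a C binary):

```python
# Sketch: MD5 forces the whole file to be read, chunk by chunk,
# so its runtime approximates the pure cost of reading the archive.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

A format identifier that beats this baseline must be skipping most of each file; one that loses to it, like Fido in this test, is spending its time on something other than I/O.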
As can be seen, Tika is the fastest of the tools, and Fido is the slowest. That the showdown would be between Tika and Droid was expected: Python does not have a Just In Time compiler, and cannot compete with Java for such long-running processes. That Fido was even slower than md5 came as a surprise, but again, md5 is written in very optimised C, and Fido is still Python.
Preservation Topics: Identification, Web Archiving, Corpora, SCAPE, Fido

The new FIDO (Format Identification for Digital Objects) is here, version 0.9.6.
Improvements:
Changes:
Additionally, there is a new script ‘to_xml.py’ which converts FIDO’s CSV output to XML. This script also reports the FIDO version and PRONOM signature version. You can pipe FIDO’s output to this script while it runs, or use it afterwards to convert the CSV output file. More information on how to invoke this converter can be found in the script. Please note that the XML template in this script is only compatible with the default matchprintf output, but you are free to change this template yourself if needed.
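The general CSV-to-XML idea is easy to sketch with the standard library. The field names and sample row below are invented for illustration; they are not to_xml.py's actual template or FIDO's actual output columns.

```python
# Sketch: turn one CSV record into an XML element using only the
# standard library (field names here are invented, not Fido's).
import csv
import io
import xml.etree.ElementTree as ET

row = next(csv.reader(io.StringIO('"OK","fmt/18","application/pdf"')))

elem = ET.Element("file")
for tag, value in zip(("status", "puid", "mimetype"), row):
    ET.SubElement(elem, tag).text = value

print(ET.tostring(elem).decode())
```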
Next tasks on the list are cleaning up the code, creating a Pythonic easy installer, having FIDO recognize compound documents better, and improving the Prepare script (to generate FIDO-compatible signatures). Please consult the FIDO JIRA for more information on these subjects.
You can pull the new version via https://github.com/openplanets/fido or download the zip directly: https://github.com/openplanets/fido/zipball/master
If you find any bugs or have any questions or requests, please submit them to the FIDO JIRA:
http://jira.opf-labs.org/browse/FIDO
Preservation Topics: Identification, Tools, Open Planets Foundation, Fido

Open Planets Foundation is proud to present: Fido.jar, a Java port of the Python version of Fido (Format Identification for Digital Objects). This first version runs on all platforms with Java 6 update 23 or later installed.
We would like you to give this first Fido-in-a-jar a try. If you encounter any bugs, please submit them to the OPF Labs Jira. Installation and usage instructions are included in the zip file.
Download Fido.jar @ Github:
https://github.com/downloads/openplanets/fido/fido_jar-0.9.5.zip
Preservation Topics: Identification, Tools, Fido

Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux. Most importantly, Fido is very fast.
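The signatures-as-regular-expressions idea can be illustrated in a few lines. The pattern below is a hand-written stand-in for a PDF header signature, not an actual PRONOM signature, and the matching is far simpler than Fido's real engine.

```python
# Sketch: a byte signature compiled to a regex and matched against
# a file's leading bytes (illustrative pattern, not a real PRONOM one).
import re

# "%PDF-1.x" anchored at offset 0, where x is any digit.
pdf_sig = re.compile(rb"\A%PDF-1\.[0-9]")

def identify(data):
    """Return a MIME type if the signature matches, else 'unknown'."""
    return "application/pdf" if pdf_sig.search(data) else "unknown"

assert identify(b"%PDF-1.4 ...") == "application/pdf"
assert identify(b"GIF89a...") == "unknown"
```

Because the regex engine is mature, optimised C, matching many such patterns against a file header is cheap, which is where much of Fido's speed comes from, as discussed below.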
In a subsequent post, I’ll describe the implementation in more detail. For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October. The core is a couple of hundred lines of code in three files. It is shorter than these blog posts!
I was stunned by Fido’s performance. Its memory usage is very small. Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.
I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP. In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at a rate of 60 files per second.
As a point of comparison, I also benchmarked the file (cygwin 5.0.4 implementation) command in the same environment against the same set of 5000 files. File does a job similar to Droid or Fido – it identifies types of files, but more from the perspective of the Unix system administrator than a preservation expert (e.g., it is very good about compiled programmes, but not so good about types of Office documents). I invoked file as follows:
time find . -type f | file -k -i -f - > file.out
This reports 1m24s or 84 seconds. I compared this against:
time python -m fido.run -q -r . > fido.csv
This reports 1m18s or 78 seconds.
In my benchmark environment, Fido 0.7.1 is about the same speed as file. This is an absolute shock. Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool. Memory usage is rock solid and tiny for both Fido and file.
Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo E7500 @ 2.93 GHz, 1 of 2 CPUs used in the virtual setup; RAM: 1 GB). He observed Fido devour a collection of about 34,000 files at a rate of 230 files per second.
Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.
For me, performance in this range is a surprise, a relief, and an important step forward. It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections. A rate of 200 files per second is equivalent to 17.28 million files in a day – on a single processor. Fido 0.7 is already fast enough for most current collections.
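The daily-throughput figure above is simple arithmetic, worth making explicit:

```python
# Check of the throughput claim: 200 files/second sustained for a day.
files_per_second = 200
seconds_per_day = 24 * 60 * 60   # 86,400 seconds

per_day = files_per_second * seconds_per_day
print(per_day)  # 17280000, i.e. 17.28 million files per day
```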
Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive. Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.
Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux. It is still beta code – we welcome your comments, feedback, ideas, bug reports – and contributions!
Preservation Topics: Identification, Fido