FIDO version 1.0.0 released

FIDO (Format Identification for Digital Objects) is a Python command line tool to identify the file formats of digital objects. This release brings many improvements to the code and functionality.

What’s new?

  • Scanning of container objects
  • Signature update script

Scanning of container objects

FIDO now uses the PRONOM container signature file to determine the container (compound) type if the file is of container type “OLE2” (fmt/111) or “ZIP” (fmt/189 or x-fmt/263). If so, FIDO will perform a deep (complete) scan of the object.
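As a rough illustration of the rule above, the decision to run a container scan can be sketched in Python. The function names and the lookup table are hypothetical and are not taken from FIDO's source; only the PUIDs come from the text.

```python
# Hypothetical sketch: these names are illustrative, not FIDO's actual API.
# PUIDs that this post lists as container (compound) types.
CONTAINER_PUIDS = {
    "fmt/111": "OLE2",
    "fmt/189": "ZIP",
    "x-fmt/263": "ZIP",
}


def container_type(puid):
    """Return 'OLE2' or 'ZIP' when the PUID is a container type, else None."""
    return CONTAINER_PUIDS.get(puid)


def needs_deep_scan(puid, nocontainer=False):
    """Decide whether a deep container scan should run for this PUID.

    `nocontainer` mirrors the idea of switching container scanning off.
    """
    return not nocontainer and container_type(puid) is not None
```

With this sketch, `needs_deep_scan("fmt/111")` is true, while any non-container PUID skips the deep scan.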

A side effect of this deep scan is that FIDO sometimes detects objects that are embedded in the scanned object. So if you cut’n’paste a PowerPoint slide into a Word document, FIDO might detect this. Tests revealed that this depends on how the creating application (in this case MS Office 2003) structures the file when saving. This is not always straightforward and seems to differ from file to file, with only slight variations. Nice food for thought, I suppose!

Another side effect is that deep scanning can cause FIDO to slow down significantly with big container-type files. You can disable deep scanning by invoking FIDO with the ‘-nocontainer’ argument. While disabling deep scan of containers speeds up identification, it may reduce accuracy.
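If you call FIDO from another Python script, the switch can be added when building the command line. This is only a sketch: the script name fido.py and passing file paths as positional arguments are assumptions, and the helper name is hypothetical. Only the ‘-nocontainer’ argument comes from this post.

```python
import sys


def build_fido_command(paths, deep_scan=True, script="fido.py"):
    """Build an argument list for running FIDO as a subprocess.

    `script` is an assumed location of the FIDO script; adjust as needed.
    """
    cmd = [sys.executable, script]
    if not deep_scan:
        # Skip the deep scan of container files: faster, possibly less accurate.
        cmd.append("-nocontainer")
    cmd.extend(paths)
    return cmd


# Example (not executed here):
# import subprocess
# subprocess.run(build_fido_command(["report.doc"], deep_scan=False), check=True)
```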

Signature update script

FIDO now comes with an interactive CLI script to update PRONOM signatures to the latest version. This script queries the PRONOM SOAP webservice for the latest version available. If a newer version is available, it downloads the latest signature file and signatures.
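The update logic described above boils down to a version comparison followed by a download. The sketch below shows only that control flow. The SOAP query and the download are passed in as callables because their details are not given here; all names are hypothetical, and integer version numbers are an assumption.

```python
def update_if_newer(current_version, fetch_latest_version, download):
    """Download a newer signature version when one is available.

    fetch_latest_version: callable returning the latest version number
        (a stand-in for the query to the PRONOM webservice).
    download: callable taking a version number and fetching its files.
    Returns the version that is in use afterwards.
    """
    latest = fetch_latest_version()
    if latest > current_version:
        download(latest)
        return latest
    # Already up to date: nothing is downloaded.
    return current_version
```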

Please note that the script WILL NOT update the container signature file. The reason for this is that the PRONOM container signature file contains special types of sequences which need to be tested before FIDO can use them. If there is an update available for the PRONOM container signature file, it will be tested before it shows up in the next commit. However, if you are up for a challenge, you can always try and update the file yourself.

Other changes in this version

The functionality to convert and check the quality of PRONOM signatures has been taken out in this new version. These functions are now available through the prepare script which comes with the signature update script.

The future

FIDO should only do what it was intended for in the first place: Format Identification for Digital Objects. As an example, don’t expect FIDO to ever validate JPEG2000 files; we have jpylyzer for that. The OPF philosophy is to create small applications for separate kinds of digital preservation tasks. This way every application can be used standalone, or, and here comes the best part, they can be built into a workflow. In fact, at NANETH we already use FIDO in one of our ingest tools.

The release of FIDO version 1.0 does not mean that development stops. A future release will add support for running FIDO multithreaded for even better performance. A next step in container identification could be to unpack compound files and have foreign objects scanned separately. There has also been a request to provide an interface to the values of the inner functions, so that FIDO can be incorporated into larger Python frameworks more easily. BTW: at the moment it is already possible to invoke FIDO from your own scripts, read more here.

Try it and report!

If you already use DROID or other format identification tools on a regular basis, give FIDO a try and report back to us. We’d like to know what features are missing, if you use FIDO in a workflow or if you have found a bug. Even report back if you are just happy using FIDO.

DOWNLOAD NOW!

http://www.openplanetsfoundation.org/software/fido/

Comments

 

Can I ask a couple of questions about container signature processing? 1. You say that container sigs can potentially slow down identification, referencing that a deep scan is made – presumably this is why it is slow. What benefit does a deep scan give us? 2. What is the need to test certain container signatures? They certainly use a richer syntax than the binary sigs, which I guess is a bit more work to parse, but is this the only issue, or are there other problems with container signatures?

mauricederooij:

Hi Matt,

You are always welcome to ask questions!

1. In special cases OLE2 files have a header that is beyond the maximum offset as defined in the container signature file. This seems to occur when the file contains one or more enclosed binaries. Dropping the maximum offset limit immediately gave better results. More research is needed into why this header is sometimes further away.

2. The reason for hand-checking the container signature file is that, while I was testing the parser, it refused to return results for Visio documents. While narrowing down the problem I found out that no priority is set in the Visio signatures to override the OLE2 supertype. The best solution was to patch the Visio signature using the signature extensions file. Signatures in this file overwrite ‘original’ signatures having the same PUID. I am going to contact PRONOM about this priority issue.
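The override behaviour described in point 2 can be pictured with a small Python sketch. The data layout (a list of dictionaries with a "puid" key) and the function name are assumptions made for illustration; FIDO's real signature handling may differ.

```python
def merge_signatures(originals, extensions):
    """Merge two signature lists; extensions win when PUIDs collide.

    Each signature is assumed to be a dict with at least a "puid" key.
    """
    merged = {sig["puid"]: sig for sig in originals}
    for sig in extensions:
        # An extension signature replaces the original with the same PUID.
        merged[sig["puid"]] = sig
    return list(merged.values())
```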

  

1. I think I get the container slowdown now – you are not parsing OLE2 files into their component objects then scanning them – you are simply scanning the buffers of the OLE2 file itself. Since Fido works on scanning buffers at the beginning and end of the files, you have to use bigger chunks (due to the potentially big OLE2 header offset) than you normally do.

Would it not be worth doing what DROID does here, and actually walking the OLE2 file system and matching specific file streams within it? I believe there is a reasonable Python library for parsing OLE2 files. Then you can stick with your normal buffer limit for the files inside the OLE2 file, and all should be well (and faster too). In addition, like file systems, OLE2 files can fragment (bits of the internal files can be mixed up in the OLE2 stream). This means that simply running signatures against OLE2 streams will not always match, as bits of the signature can be physically in different places in the OLE2 file. Again, walking the OLE2 as a file system gets around the fragmentation issues.
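To make the "walk the container, then scan each member with the normal buffer limit" idea concrete, here is a minimal sketch for ZIP containers, which Python's standard library can read directly. OLE2 would need a third-party parser, so it is not shown. The function name, the `match` callable and the default buffer size are all hypothetical.

```python
import io
import zipfile


def scan_zip_members(data, match, bufsize=128 * 1024):
    """Apply `match` to the first `bufsize` bytes of every file in a ZIP.

    data: the ZIP container as bytes.
    match: callable taking a bytes buffer and returning a result
        (a stand-in for running signatures against the buffer).
    Returns a list of (member_name, result) tuples.
    """
    results = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # zipfile reassembles and decompresses the member, so the
            # buffer holds the member's logical content.
            with archive.open(info) as member:
                head = member.read(bufsize)
            results.append((info.filename, match(head)))
    return results
```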

2. Good point about the priority format oversight in PRONOM. I’m sure TNA will correct this when they get a chance. However, is there then any real reason not to auto-update container signatures in Fido? Any set of signature files can contain errors and omissions, including the binary ones!

mauricederooij:

1. You are correct about the container slowdown. FIDO scans the entire file to prevent missing stuff.

And I agree, walking OLE2 files and scanning the components would be the only correct way. As I wrote earlier in my blog post, this is going to be a next step in container identification.
I did know that OLE2 is a filesystem, but was not aware that fragmentation could occur. Very interesting.

2. At the moment the most important reason for not updating the container signature file is that the parser function for containers is not mature enough yet (bear in mind it’s the first version) and could potentially cause FIDO to crash. Another reason is to check whether all the necessary properties are correct. Errors there won’t make it crash, but they will cripple the results.

You are right about errors and omissions. This is the reason we are planning to create unit tests for new versions of both the normal and container signature files.

 

Have you considered a multiple method approach to identifying OLE2 files? The internal CLSID value is often accurate enough to identify the file type without having to read past the internal stream directory list. Every object in an OLE2 file has the possibility of being fragmented. That includes the sector allocation tables. For dependable reading of the streams, you need to support fragmentation. Are you currently identifying files by the stream names inside, or the application ID strings? I recommend using all 3 approaches, easiest/fastest to hardest/slowest. I find that no one approach can identify all OLE2 files.
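As a sketch of the first (fastest) approach, the CLSID of the root storage can be read with the standard library alone, because the root entry is always the first entry in the first directory sector. The offsets below follow the compound file header and directory entry layout; the function name is hypothetical, and this is not how FIDO or DROID does it.

```python
import struct
import uuid

# Signature found in the first 8 bytes of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def root_clsid(data):
    """Return the root storage CLSID of an OLE2 file, or None.

    data: the file content as bytes (the header plus the first
    directory sector is enough).
    """
    if len(data) < 512 or data[:8] != OLE2_MAGIC:
        return None
    # Header offset 30: sector shift; offset 48: first directory sector.
    (sector_shift,) = struct.unpack_from("<H", data, 30)
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)
    sector_size = 1 << sector_shift
    # Sector 0 starts after the header, which fills one sector-sized block.
    entry_offset = (first_dir_sector + 1) * sector_size
    # The CLSID occupies bytes 80..95 of the 128-byte directory entry.
    clsid = data[entry_offset + 80:entry_offset + 96]
    if len(clsid) < 16:
        return None
    return uuid.UUID(bytes_le=clsid)
```

An all-zero CLSID means the creating application did not set one, in which case the slower stream-name approach is still needed.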

If you can point me to a list of the ID values that you are using for OLE2 files, I can provide some more ID values to help extend your list.

Rob