Fido – a high performance format identifier for digital objects

Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux.  Most importantly, Fido is very fast.

In a subsequent post, I’ll describe the implementation in more detail.  For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October.  The core is a couple of hundred lines of code in three files.  It is shorter than these blog posts!

I was stunned by Fido’s performance.  Its memory usage is very small.  Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.

I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP.  In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at a rate of 60 files per second.

As a point of comparison, I also benchmarked the file command (Cygwin implementation of file 5.0.4) in the same environment against the same set of 5000 files.  File does a job similar to Droid or Fido – it identifies types of files, but more from the perspective of the Unix system administrator than a preservation expert (e.g., it is very good with compiled programmes, but not so good with types of Office documents).  I invoked file as follows:

       time find . -type f | file -k -i -f - > file.out

This reports 1m24s or 84 seconds.  I compared this against:

       time python -m fido.run -q -r . > fido.csv

This reports 1m18s or 78 seconds.

In my benchmark environment, Fido 0.7.1 is about the same speed as file.  This is an absolute shock.  Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool.  Memory usage is rock solid and tiny for both Fido and file.

Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo E7500 @ 2.93 GHz (1 of 2 CPUs used in the virtual setup); RAM: 1 GB).  He observed Fido devour a collection of about 34000 files at a rate of 230 files per second.

Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.
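As a sketch of the idea (illustrative only, with a made-up two-entry signature table; this is not Fido’s actual code), matching with Python’s standard re module looks like this:

```python
import re

# Made-up, two-entry signature table; the real Pronom set has hundreds
# of entries.  Patterns are compiled once and reused for every file.
SIGNATURES = [
    ("fmt/14 (PDF)", re.compile(rb"\A%PDF-")),
    ("x-fmt/263 (ZIP)", re.compile(rb"\APK\x03\x04")),
]

def identify(buffer):
    """Return the names of every signature whose pattern matches the buffer."""
    return [name for name, pattern in SIGNATURES if pattern.search(buffer)]
```

The heavy lifting – scanning the bytes – happens inside the compiled pattern objects, which are implemented in optimised C.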

For me, performance in this range is a surprise, a relief, and an important step forward.  It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections.  A rate of 200 files per second is equivalent to 17.28 million files in a day – on a single processor. Fido 0.7 is already fast enough for most current collections.
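The daily-throughput figure is simple arithmetic:

```python
files_per_second = 200
seconds_per_day = 24 * 60 * 60                 # 86,400 seconds in a day
files_per_day = files_per_second * seconds_per_day
print(files_per_day)                           # 17280000, i.e. 17.28 million
```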

Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive.  Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.

Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux.  It is still beta code – we welcome your comments, feedback, ideas,  bug reports – and contributions!

15 comments

renevoorburg wrote 13 weeks 3 days ago

Enhancement request for Fido

Hello,


I am very happy with this initiative.  Speed is an important requirement when doing file identification and fido is speedy. Thanks for making this tool available!


I do have one wish though. I would like to be able to run fido on data supplied through a pipe, like

cat file.xyz | fido -q -

That would make it easy (and fast)  to implement it in scripts like:

#!/bin/bash
Arcfile=$1
ARCREADER=~/bin/heritrix/bin/arcreader
for i in `$ARCREADER $Arcfile | grep -v "^CDX" | awk '{print $7}'`; do
    $ARCREADER --format=nohead --offset=$i $Arcfile | fido -q - | awk -F , '{print $3}'
done

(Which gives a list of PUIDs found in the .arc file supplied to the script. Using DROID this gets way too slow; file is speedy but doesn’t do PUIDs; fido would be perfect).


Thanks again,


René Voorburg

adam farquhar wrote 13 weeks 3 days ago

Piping to Fido

René,

You can already pipe filenames to Fido as in:

find . -type f | python -m fido.run -i -

This goes part of the way.  I had considered accepting content from a pipe, but hadn’t included it in the current release as it would only work with a single object at a time.  It wasn’t clear to me how to demarcate between the objects.  E.g.

cat *.* | python -m fido.run -input-from -

wouldn’t do the right thing. It would try to match the start of the first file and the end of the last file, rather than the start and end of each one.

Perhaps you, or another reader, could suggest a good way to do this; it would be a very powerful addition.

For your specific problem, however, I would like to see direct support for WARC, just as Fido currently supports Zip and Tar.  Perhaps you can point to a Python module (ideally) that could do that.  It should be quite easy.

renevoorburg wrote 13 weeks 3 days ago

reading arcs and warcs

Hello Adam,


I would like to use fido through a pipe on single objects, the same way I now use the Unix ‘file’ tool:


$ARCREADER --format=nohead --offset=$i $Arcfile | file -bi -


To read WARCs (which we don’t produce yet) I use a slightly modified arcreader script (arcreader is part of the Heritrix package). It differs from arcreader by using the ‘org.archive.io.warc.WARCReader’ class instead of ‘org.archive.io.arc.ARCReader’.


Having a tool that is able to read an arc/warc file at once has never been important for the processes I am working on, thanks to the little shell script I posted and the arcreader and warcreader scripts.


Thanks, René

adam farquhar wrote 13 weeks 3 days ago

Piping to Fido 2

It is very easy to add single-stream ID; I’ll add this.  The only disadvantage is that it will incur a per-object invocation overhead, which will be much higher than the actual time to identify the types.  To avoid this, I’m considering adding an option to handle HTTP headers, which would require Content-Length to work.  E.g., you could do “curl -i some.url.com | fido.sh -u” or some such.  This approach could allow you to pipe multiple objects in a single invocation.  It should also work well with arc.

adam farquhar wrote 12 weeks 4 days ago

Fido 0.8.0 supports content streaming

In response to René’s suggestion, I’ve added the ability to stream content to fido.  Just provide a ‘-’ instead of a filename to tell Fido to read the content from stdin.  For this to work, you must use the ‘-u’ option to python as:

    cat doc.pdf | python -u -m fido.run -q -

The filename appears as “STDIN”.  This incurs quite a bit of overhead if you have a lot of content to check, but it will enable Fido to be dropped into some workflows very easily.
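For readers who want to see the shape of the idea, here is a minimal sketch of stream identification (illustrative only; the single PDF signature stands in for the full Pronom set):

```python
import re

# One illustrative BOF signature; Fido compiles the full Pronom set.
PDF_BOF = re.compile(rb"\A%PDF-")

def identify_stream(stream, bufsize=128 * 1024):
    """Read up to bufsize bytes from a binary stream and test signatures."""
    head = stream.read(bufsize)
    return "fmt/14" if PDF_BOF.search(head) else None
```

Called on a binary stream such as sys.stdin.buffer, this mirrors the pipeline above.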

 

Rob Sharpe wrote 12 weeks 3 days ago

Is speed the most important quality of an identification tool?

Hi Adam,


 


This is an interesting development.  However, I have some comments…


You make the statement that “performance in this range is a surprise, a relief, and an important step forward. It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections”. 


This seems to suggest that we have been waiting on a faster file format identification tool before we can have automated workflows.  I have two general points to make on scalability/automation here before coming back to the issue of the tool itself. 


The first is that automation is already possible and, indeed, available in practice.  We can already run automated ingest workflows that perform a variety of steps (including running a variety of characterisation tools, including DROID and, where applicable, Jhove).  For example SDB (our system) can, with a fairly typical digital collection profile, deal with > 1TB per day (on a single cheap-ish server).  We believe this can be scaled across more servers (although we are still in the process of proving this) and are working on identifying and eliminating bottlenecks.


The second point is that, although our scalability work is not finished, it is already clear that the format identification step (using DROID in our case) is so far from being rate determining in real workflows that its speed is essentially irrelevant.  The things that really take time in practice are:


(i) Moving files around the network.


(ii) Format validation / property extraction tools like Jhove.  They inevitably have a more complex job to do than format identification tools (which normally just need to look at small bits of a file).


(iii) Migration tools (where relevant). Again, they have to read and process every bit in a file.


Hence, unless the rate determining steps are addressed (and these are much harder problems) speeding up the file format identification is not that important (although, of course, welcome).


What is important is that format identification is accurate.  Hence, I also have a few technical questions about the tool.  DROID has a fairly complex set of requirements that have been built up over the years.  These are important and any alternative tool would need to cover these requirements.  These include (off the top of my head):


– Does it deal with all of the potentially hard things about signatures?  These include signatures consisting of multiple byte sequences, those containing a bit sequence at an arbitrary position in a file (thereby requiring a full file scan) and signatures that are linked to the position of other signatures.  Incidentally, my memory is that it is the need for full file scans that dominate the time taken.


– If >1 signature is hit does it apply the precedence rules in PRONOM (if they exist) to decide which format is reported?


– If a signature is shared by >1 format or a file is otherwise non-uniquely identified does it report them all?  (These are normally then resolvable via a subsequent format validation step).


– What does it do for zero-length files?


– Does it report discrepancies in extensions?


 


I’m sure there are more, and I suspect it would be a useful exercise to compare Fido’s functions against those of DROID.  I guess my point is that it is more important that format identification is as accurate and complete as possible (even though this may take a bit longer) than that it does a partial job quickly.


 


Rob

adam farquhar wrote 12 weeks 2 days ago

Speed – it’s not everything, but it is something!

As always, Rob Sharpe asks some excellent questions and raises some good points.   I’ll see if I can address some of them.

(1)    Have we been waiting for faster file format identification?  Actually, yes.  Invoking Droid 5.0 from the command line is not performant, nor is invoking it through the GUI.  I don’t believe that I am alone in finding this to be the case.

(2)    Is file format identification the most expensive operation?  No.  But it can take significant amounts of time.  At 1s/file, a million files takes over 11 days.  That is much too long – even if something else is slower.

(3)    I agree that moving files around the network is slow.  That is one of the main reasons to have a small efficient tool that can be executed close to content and return a compact result.

(4)     I agree that format validation and property extraction is more expensive than identification.  One of the benefits of having a quick identification step is to do a better job of targeting the expensive validation/extraction steps.

(5)    Is accuracy important?  Well, sort of.  The Pronom signatures are heuristics.  For example, it is a heuristic that a PDF file starts with %PDF and ends with %%EOF\n.  I can easily construct non-PDFs that also fit that pattern, and some will even randomly occur.  But the heuristics have a reasonable level of accuracy – good enough for incorporation into our workflows.

(6)    Does Fido correctly execute the Pronom signatures?  Well, I think so! If there was a regression test suite for the Pronom signatures, it would be very interesting to run Fido against it to see if there are any gaps.

(7)    Does Fido use exactly the same algorithm as Droid?  Well, not precisely.  It does, however, handle most of the points that Rob mentioned.  For example, it handles signatures with multiple byte sequences, including ones that start at an arbitrary position.  It does not currently handle signatures that are linked to the position of other signatures.  This is not hard to do, but as far as I can tell there are no such signatures in Pronom and Droid 5.0 does not handle them either.

(8)    I believe that Fido correctly implements the Pronom format precedence relationships.

(9)    Fido currently reports multiple format matches after removing identifications with lower precedence.

(10)    Zero-length files are not treated specially.  They will fail all matches; if the ‘-ext’ option was provided, the file extension will be used for matching.

(11)    Fido does not report mismatches between extensions and patterns.  It would obviously be easy to add and, if requested, it could be included in a subsequent release. Or anyone else can make the extension; Fido is sitting on http://github.com/openplanets/fido

I think that we agree that accurate and comprehensive identification is possible.  I hope that a simple command-line tool like Fido will encourage use of the Pronom signature information and help us to validate and extend the coverage.
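The PDF heuristic from point (5) can be written in a couple of lines (a sketch of the heuristic as stated, not Fido’s code; the 1024-byte tail window is my assumption):

```python
def looks_like_pdf(data):
    """Heuristic only: checks the marker bytes described above, nothing more."""
    return data.startswith(b"%PDF") and b"%%EOF" in data[-1024:]
```

As point (5) notes, any bytes that merely carry these markers will also match; that is the accepted cost of a heuristic.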

Rob Sharpe wrote 12 weeks 2 days ago

DROID speed

First of all, isn’t it great to have a forum like this to exchange views (even if we disagree!) and perhaps more importantly information?


For example, it is news to me that several users of DROID are getting performance in the range of 1 file per second!!!   I agree that such performance is far from acceptable / usable in real life.  If people have been suffering with such performance for some time, I wish I’d known.


When we use DROID we get performance around (I think) a hundred times better than that.  I’ll need to check exact figures but we can certainly complete end-to-end ingest workflows (of which format identification is only a tiny part) of small (about a hundred file) accessions within a few seconds.


It is not immediately obvious to me why this is the case.  I assume the content is local so that file access is not the issue, and that there is fairly decent hardware.  One other thing that does occur to me is that we don’t invoke DROID from the command line interface but, instead, call the underlying Java class directly.  I wonder if this avoids overheads occurring every time it is invoked?  This is more brittle in that the class methods are not guaranteed against change, but if it delivers much greater performance this is a price worth paying.


I’ll get this checked out and get back to you.


Rob

Rob Sharpe wrote 11 weeks 3 days ago

Fido / DROID comparison

Thanks to Pauline Sinclair and James Carr, we have now conducted a series of tests on Fido and various versions of DROID. 


To do this we used two test sets: 


1. “DROID test set”.  The test set used to test the accuracy of DROID identifications (159 files). 


2. “Large file set”.  A 1GB test set of 100 files of about 10MB in size (to test performance with large files).


The tests were run on a fairly standard PC, simply recording the clock time at the start and finish.  Note that most of the tests have not been repeated, so the odd run could be affected by other activity on the PC.


Results 


The raw stats are below (all times in seconds).

Tool         DROID test set   Large file set
Fido 0.8.1   2.9              2.3
DROID 2      4.5              25.1
DROID 3      4.5              22.0
DROID 4      8.0              27.3
DROID 5      51.7             182.3


Taking the large file set test further using DROID 2, we have also run 300 SIPs (each of 100 10MB files), i.e. a 300GB ingest, through a realistic ingest workflow using SDB installed on a multi-threaded (fairly standard) server set-up.


We find that, in this case, a DROID 2 thread running a single SIP takes, on average, about 45 seconds.  This is equivalent to 2.23 files/sec (or about 22MB/sec) per thread.  However, there are 12 simultaneous threads so the machine speed is equivalent to 26.7 files (267MB) per second (or 23TB per day).  As I stated in a previous post, this is far from rate limiting (i.e. other steps and network issues tend to slow the ingest rate down far more). 


As an aside, Fido does not successfully identify JPEG 2000 files (although this does seem to be a known issue).  Also note that DROID 2 – DROID 4 do not check within ZIP, TAR etc. (we have a separate part of our framework to detect and characterise “embedded bytestreams” in containers).  (We could have an interesting side-debate on what is wanted here but that’s for a separate thread!). 


Analysis 


It is clear that Fido definitely wins on speed. 


 


One thing that surprised me was that Fido was as quick for large files as it was for small files.  Taking a quick look at what Fido is doing, it appears to be: 


1. Checking BOF and floating byte sequences in the first 128k of the file. 


2. Checking EOF byte sequences in the last 128k of the file. 


This will probably pick up most signatures.  However, it is not obvious to me that this is correct behaviour.   Perhaps Adam can confirm this is what Fido is doing (or not)?  This would certainly explain why it is much faster than DROID since DROID does full file scans for floating sequences (i.e. it appears that DROID is checking 10MB while Fido is checking 128k). 


It is also clear that later versions of DROID have significantly degraded performance (especially the latest release, DROID 5).  However, given that performance is much better in earlier releases, this should be a fixable problem (indeed, having been alerted to the issue, TNA are now on the case).


 


Rob

adam farquhar wrote 11 weeks 2 days ago

Fido and Droid comparisons

Thanks to Pauline and Rob for doing this analysis.  It shows that Fido is about 2x faster than Droid 2 and that Droid 4 and 5 have significantly degraded performance compared with earlier versions. It would also be very interesting to benchmark on a larger set of files.  My tests were on 5000+ files with 10GB of content.  This helped to confirm that Fido’s performance was independent of the number of files processed, which is not the behaviour that I observed in Droid 5.

I infer from Rob’s comment that Fido’s identification agreed with Droid’s for all of the test files except the JP2K images.  As he mentioned, this is a known issue for Fido.  The Pronom JP2K signature is the only one that uses big-endian byte order.  I thought that it would be easier to re-write the signature, rather than implement this feature, as I was not completely clear on either the need or the semantics.

Rob also points out that Fido does not do a full scan of each file.  That is correct!  This was a definite design choice on my part.  There is a --bufsize argument that can be used to specify an arbitrary buffer size (either larger or smaller than the default, which is 128k).  It is my current opinion that if you have to look at more than the first and last 128k of a file, then the problem is the signature!   In fact, I think that even this is probably too much.  There is a longer discussion to have on this point about what Droid and Fido are really doing.
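A sketch of the first-and-last-buffer approach (illustrative; it operates on a seekable binary file object and is not Fido’s actual I/O code):

```python
def head_and_tail(f, bufsize=128 * 1024):
    """Return the first and last bufsize bytes of a seekable binary file.

    BOF and positional signatures are applied to the head, EOF signatures
    to the tail; nothing in between is ever read.
    """
    f.seek(0, 2)                 # seek to the end to learn the size
    size = f.tell()
    f.seek(0)
    head = f.read(bufsize)
    if size > bufsize:
        f.seek(max(bufsize, size - bufsize))
        tail = f.read(bufsize)
    else:
        tail = head              # small file: one buffer covers everything
    return head, tail
```

This is why a 10MB file costs about the same as a 10KB file: at most 256k of it is ever read.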

Rob Sharpe wrote 11 weeks 2 days ago

So what is the best identification algorithm?

I think we are now at the heart of the debate. It’s the identification algorithm that counts.
I think the two comments Adam makes (Fido is faster than DROID, and it makes a simplification to the algorithm) are probably the same thing. In other words, I suspect Fido is faster because it simplifies the algorithm. I know for sure DROID would be much faster without the full file scan!
So the question is: is this simplification justified? I’m no expert on trying to create DROID signatures so it would be nice to hear from people who are (although the relative paucity of signatures means that there aren’t many people who fall into this category). In particular, are there circumstances where formats can’t be identified by information in the first and last 128k (or some other definable buffer size)?
While we are on the subject of algorithms, another thing DROID (and I think Fido) does is check every file against every signature. This is often unnecessary because an EOF- or BOF-based signature gives the correct identification quickly. It also means that as the number of signatures goes up, identification speed goes down: although, for DROID (as stated above), it is really the signatures with floating byte sequences that currently eat up the time (so adding new fixed signatures matters less). However, the fact that files can conform to multiple signatures means it is not obvious which rules to apply to stop once a hit has been made. This has bugged us in the past but, given that DROID (earlier versions at least) was sufficiently performant for every use it has realistically been put to (and normally much faster than surrounding steps in a workflow), we decided not to worry too much!! However, it is certainly an interesting intellectual challenge to work out the correct rules for the order of checking signatures, and for when a decision can be made to stop checking further.

adam farquhar wrote 11 weeks 23 hours ago

What makes Fido fast?

Once again, Rob raises some interesting points.  I was travelling yesterday, and was unable to post this reply due to connection issues.

“I think the two comments Adam makes: Fido is faster than DROID and it makes a simplification to the algorithm are probably the same thing. In other words, I suspect Fido is faster because it simplifies the algorithm. I know for sure DROID would be much faster without the full file scan!”

Fido and Droid use a similar approach: read in a file, apply a set of patterns to the bytes that constitute the file, and provide information on which ones match.  I don’t think that the buffer size choice constitutes a different algorithm.  There are some very different designs and algorithms that could be applied to the problem, but more on that another day.

Fido is mostly faster because of design philosophy.  First, it relies on a mature existing library to match the signatures.  I did not spend a single hour implementing a clever pattern matcher, or even a dumb one.  Fido just uses the standard off-the-shelf module.  Second, Fido is a small tool that does a well-defined job well.  The core is 400 lines in one file, not 400 files.  Third, Fido is designed to be incorporated in automated workflows.  It is very thrifty about resource consumption and follows typical (command-line) interface patterns.

“So the question is: is this simplification justified? I’m no expert on trying to create DROID signatures so it would be nice to hear from people who are (although the relative paucity of signatures means that there aren’t many people who fall into this category). In particular, are there circumstances where formats can’t be identified by information in the first and last 128k (or some other definable buffer size)?”

Is a smaller default buffer size (128k) justified?  I think so.  As far as I understand, Fido correctly handles the Droid test suite.  I’ve run Fido over thousands of files and am comfortable with the results. I’ve also looked at a substantial number of Pronom signatures in their regex form, and every instance of a variable-offset signature that I’ve seen could be approached in another way.  I guess that I’m the only person in years who has been able to look at the entire set of signatures easily.

Rob is spot-on here – there are only a very few people who have created Pronom signatures.  This is not a coincidence.  This is a symptom. The pattern language is idiosyncratic and the version of signatures distributed with Droid is opaque and compiled.  In contrast, Fido uses regular expressions.  There are dozens of books on how to understand, use, and write regular expressions.  There are certainly hundreds of thousands of people who have used them including almost every programmer and system admin in the world.  There are even plenty of tools to test regular expressions on sample content (although most of these are for text, rather than binary material).  I hope that shifting to a well understood and relatively easy-to-read language will help increase the number of people who can write and test signatures.
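To illustrate the readability point, here is a toy translator for a simplified slice of the Pronom pattern syntax (plain hex pairs, ‘??’ wildcards, ‘{n-m}’ bounded gaps); it is a sketch of the idea, not Fido’s actual converter:

```python
import re

def pronom_to_regex(sig):
    """Translate a simplified slice of Pronom syntax into a regex string.

    Handles plain hex pairs ('25' -> '\\x25'), '??' wildcards (any byte)
    and '{n-m}' bounded gaps.  Real Pronom syntax has more constructs.
    """
    out = []
    i = 0
    while i < len(sig):
        if sig[i] == "{":
            j = sig.index("}", i)
            low, high = sig[i + 1:j].split("-")
            out.append(".{%s,%s}" % (low, high))   # bounded gap
            i = j + 1
        elif sig[i:i + 2] == "??":
            out.append(".")                        # any single byte
            i += 2
        else:
            out.append("\\x" + sig[i:i + 2].lower())  # literal byte
            i += 2
    return "".join(out)

# '255044462D312E' is hex for '%PDF-1.'
pattern = re.compile(pronom_to_regex("255044462D312E").encode("ascii"))
```

pattern.match(b"%PDF-1.4") succeeds, and the intermediate string \x25\x50\x44\x46\x2d\x31\x2e is something any regex user can read and test.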

“While we are on the subject of algorithms, …”

I actually started out with the goal of optimising Droid by looking at some cross-pattern compilation.  I had assumed that Droid had already covered the easy bits and that I would do some fun algorithm hacking in Java. Once I had dug in a bit, it just all seemed too hard – or perhaps Droid reflected a different vision of the problem being solved than I had.  That’s when I thought I would try an alternate approach and see how it would go.

At this point, Fido is sufficiently fast (200+ files per second in a modest production environment) and I do not think there is any immediate need to be clever.  The next challenge is to improve the coverage, accuracy, and speed of the signatures.  I hope that the OPF and broader preservation community will be able to pitch in and work on this.  If we have ten times the number of signatures, we may need to rethink the algorithmic approach – but there is a lot of hard work building up the content before we need to change the algorithms.

Shaul Zevin wrote 9 weeks 6 days ago

Droid vs. FIDO speed

Hello.

We have been using DROID for several years for the format identification purposes.

 In our preservation pipeline DROID is one of the major time consumers.

From the algorithmic perspective, DROID should be faster than FIDO. The Boyer-Moore-Horspool algorithm used by DROID has linear O(N) complexity on average, where N is the length of the examined string, i.e. the identified file. The performance of regular expression matching is hard to predict because of potential backtracking, and depends on the match implementation in a particular programming language. A good reference on regular expression matching can be found at http://swtch.com/~rsc/regexp/regexp1.html
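For reference, Boyer-Moore-Horspool for a single fixed pattern is only a few lines (a textbook sketch in Python, not DROID’s Java implementation):

```python
def horspool(haystack, needle):
    """Boyer-Moore-Horspool search: average-case linear single-pattern match."""
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # Bad-character shift table built from all but the last pattern byte.
    shift = {byte: m - idx - 1 for idx, byte in enumerate(needle[:-1])}
    i = m - 1
    while i < n:
        j = 0
        while j < m and haystack[i - j] == needle[m - 1 - j]:
            j += 1
        if j == m:
            return i - m + 1            # match found at this offset
        i += shift.get(haystack[i], m)  # skip ahead using the table
    return -1
```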

So why is DROID so slow?

The answer is simple: version 38 of the DROID signature file has 44 variable-position byte sequences, which are killing DROID performance. A simple experiment can be done to prove that fact. Convert each variable-position byte sequence to be fixed-positioned (add a Reference attribute) and add a maximal offset to the first subsequence: SubSeqMaxOffset="131072" (the 128 KB used by FIDO).

The performance will improve dramatically. I have run a test on a huge 10 GB mpg file with the original signature file, which took 125 sec.

After converting the variable-positioned byte sequences to be fixed-positioned, the elapsed time for DROID was 0.5 sec: a 250 times improvement!!!

I will be happy to run the same test on the data set used for the FIDO/DROID speed comparison published in this blog.

Another problem is that PRONOM signatures are clearly underoptimized.

Take for example signature with ID 40 from DROID signature file.

It has a fixed position byte sequence with 1024 maximum offset which is unique in PRONOM and is 45 bytes long.

3C21444F43545950452068746D6C205055424C494320222D2F2F5733432F2F445444205848544D4C20312E31 

This signature also has another, variable-position byte sequence. Is the second byte sequence really needed? The probability of finding a given 45-byte sequence, assuming a uniform distribution, somewhere in positions 1 to 1024 is ((1 / 256) ^ 45) * 1024, i.e. zero for any practical purposes. We do not need the second byte sequence.
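Shaul’s estimate is easy to check:

```python
p_one_offset = (1 / 256.0) ** 45    # 45 specified bytes at one fixed offset
p_any_offset = p_one_offset * 1024  # union bound over the 1024 candidate offsets
print(p_any_offset)                 # roughly 4e-106
```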

Shaul Zevin , Exlibris

mike45 wrote 9 weeks 4 days ago

FIDO Output?

Greetings and salutations; I beg your collective pardon if this is the incorrect forum for this question.

I was curious as to what the first two columns of FIDO’s output refer to (the first two fields in the example below):

OK,7,fmt/189,Microsoft Office Open XML,Microsoft Office Open XML,198933,”./documents/Iowa_Collection.docx”

Thank you for your consideration–and the impressive resource!

adam farquhar wrote 9 weeks 4 days ago

Fido output

mike45 asked about Fido’s default output.  The first columns are:

Status – this is either OK if the file type was identified, or KO if it was not.

Time – the second column is the time to identify the file in milliseconds.  For multiple identifications, it is the same value for each one, so simply adding the times will give too high a value.  You can see the current definition for the matchprintf and nomatchprintf values with -show defaults:

> python -m fido.run -show defaults

I’m very pleased if you find the tool of value.
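Since the output is plain CSV, the columns are easy to pull apart; a sketch using the column order shown in the question (the names for the middle columns are my labels, inferred from context):

```python
import csv
import io

# The example line from the question, parsed with the stdlib csv module
# (which copes with the quoted filename field).
line = ('OK,7,fmt/189,Microsoft Office Open XML,Microsoft Office Open XML,'
        '198933,"./documents/Iowa_Collection.docx"')
status, time_ms, puid, format_name, sig_name, size, filename = \
    next(csv.reader(io.StringIO(line)))
print(status, time_ms, puid)  # OK 7 fmt/189
```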
