MUPPET: MUlti Pass file Properties Extraction Tool

This tool needs some explanation of how it came about. At Nationaal Archief we faced various bottlenecks at ingest for our digital repository (which we call e-Depot). Characterization was one of them, and when the OPF released the first prototype of FIDO we happily jumped on board for its development. Seeing the potential for significant speed increases, Nationaal Archief put in a substantial amount of work, freeing me for development of FIDO, which led to a wrapped Java version. At that point we faced the question of whether we would replace DROID with FIDO in our e-Depot, and we paused for a moment, for various reasons.

First, we had identified Java wrappers as a cause of bottlenecks in the system in their own right, which favoured a command-line approach. Secondly, at that time Johan van der Knijff of KB was working on his excellent comparison report on DROID, FIDO, the Unix file utility, FITS and JHOVE2, which gave us more insight into the matter.

Thirdly, by that time Dave Tarrant (University of Southampton) had released OPF REF (Open Planets Foundation Result Evaluation Framework), a PHP interface for hooking up command-line characterization tools in order to compare their results. After that we started experimenting with a commercial tool called File Investigator from Forensic Innovations, which looks very promising for future deployment in our e-Depot.

At some point Maurice van den Dobbelsteen proposed taking the best of the above and doing characterization as a multi-pass process: start with FIDO to profit from its amazing speed, then run DROID on what FIDO didn’t tackle, and then invoke other tools for more granularity, all in one workflow. After refining the idea we worked out the following concept: MUPPET (MUlti Pass file Properties Extraction Tool).
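To make the multi-pass idea concrete, here is a minimal sketch in Python of the fall-through part of such a workflow: each pass only sees the files that earlier passes could not identify. It is not the MUPPET design itself. The function names, the `(tool name, format)` result shape and the two stand-in identifiers are assumptions made for illustration, and the stand-ins merely look at file extensions rather than calling FIDO or DROID. The later refinement step ("more granularity") is left out.

```python
import os
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# An identifier takes a file path and returns a format label,
# or None when it cannot identify the file.
Identifier = Callable[[str], Optional[str]]


def multi_pass_identify(
    paths: Iterable[str],
    passes: List[Tuple[str, Identifier]],
) -> Tuple[Dict[str, Tuple[str, str]], List[str]]:
    """Run the identifiers in order; each pass only sees unresolved files.

    Returns a mapping of path -> (tool name, format label) and the list
    of paths that no pass could identify.
    """
    results: Dict[str, Tuple[str, str]] = {}
    remaining = list(paths)
    for tool_name, identify in passes:
        unresolved = []
        for path in remaining:
            label = identify(path)
            if label is None:
                unresolved.append(path)  # hand over to the next pass
            else:
                results[path] = (tool_name, label)
        remaining = unresolved
        if not remaining:
            break  # nothing left for the slower passes
    return results, remaining


def fast_pass(path: str) -> Optional[str]:
    """Hypothetical stand-in for the fast first pass (the role FIDO plays)."""
    known = {".pdf": "PDF", ".png": "PNG"}
    return known.get(os.path.splitext(path)[1].lower())


def thorough_pass(path: str) -> Optional[str]:
    """Hypothetical stand-in for a slower second pass (the role DROID plays)."""
    known = {".doc": "Word document"}
    return known.get(os.path.splitext(path)[1].lower())


files = ["report.pdf", "letter.doc", "mystery.bin"]
identified, unidentified = multi_pass_identify(
    files, [("fast", fast_pass), ("thorough", thorough_pass)]
)
print(identified)
print(unidentified)
```

Ordering the passes from fastest to slowest means the expensive tools only run on the files the cheap ones left over, which is where the speed gain of the approach would come from.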

You will find an overview of MUPPET and proposed screenshots in this proposal document. We estimate that creating a prototype of the API and an initial GUI will take approximately 80 hours, depending on whether it is an individual or a team effort.

We are happy that this relates to the “What do we mean by format?” discussion here on the OPF blogs, where Rob Zirnstein commented on a layered approach to the identification process. This is exactly the idea that MUPPET will operationalise. Feedback welcome!
