Sven Schlarb’s blog

DROID file format identification using Hadoop

The DROID software tool, developed by The National Archives (UK), performs automated batch identification of file formats by assigning PRONOM Unique Identifiers (PUIDs) and MIME types to files. It relies on so-called signature files, which contain information drawn from the PRONOM technical registry.
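To make the idea of signature-based identification concrete, here is a minimal sketch in Python. It is not DROID's implementation: the hard-coded `SIGNATURES` table and the `identify` function are hypothetical stand-ins for what DROID reads from a PRONOM signature file, and the sketch reports only MIME types, not PUIDs.

```python
from typing import Optional

# Illustrative signature table: (byte offset, magic bytes, MIME type).
# Assumption: DROID loads far richer patterns from a PRONOM signature file;
# this small hard-coded list exists only for the example.
SIGNATURES = [
    (0, b"%PDF-", "application/pdf"),
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"GIF87a", "image/gif"),
    (0, b"GIF89a", "image/gif"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the MIME type of the first matching signature, or None."""
    for offset, magic, mime in SIGNATURES:
        # Compare the bytes at the expected offset with the magic sequence.
        if data[offset:offset + len(magic)] == magic:
            return mime
    return None


if __name__ == "__main__":
    print(identify(b"%PDF-1.4\n..."))
    print(identify(b"plain text"))
```

In a batch setting, a function like this would be applied to the leading bytes of every file, which is the kind of per-file work that can be distributed across a cluster.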

Here I present some considerations for using the tool on the Hadoop platform, together with a performance evaluation of job execution on a Hadoop cluster using the publicly available Govdocs1 corpus.

Big data processing: chaining Hadoop jobs using Taverna

Processing very large data sets is a core challenge of the SCAPE project. Using the SCAPE platform and a variety of services and tools, the SCAPE Testbeds are developing solutions for real-world institutional scenarios dealing with big data. Here I present a simple way of chaining Hadoop jobs using Taverna’s Tool service invocation mechanism.