DROID file format identification using Hadoop
The DROID software tool is developed by The National Archives (UK) to perform automated batch identification of file formats by assigning PRONOM Unique Identifiers (PUIDs) and MIME types to files. The tool bases its identification on so-called signature files, which are derived from the PRONOM technical registry.
Here I present some considerations for using the tool on the Hadoop platform, together with a performance evaluation of job execution on a Hadoop cluster using the publicly available Govdocs1 corpus.
Why I would not "mix" Hadoop and Taverna
This blog post is an answer to willp-bl’s post “Mixing Hadoop and Taverna” and builds on some of the ideas that I presented in my blog post “Big data processing: chaining Hadoop jobs using Taverna”.
Big data processing: chaining Hadoop jobs using Taverna
How does lossy JP2 image compression influence OCR?
Many institutions have carried out large-scale digitisation projects during the last decade, and the question of how to store the digital master images cost-effectively has made the JPEG2000 image format more popular in the library, museum, and archive community.