Sven Schlarb's blog | Open Planets Foundation

DROID file format identification using Hadoop

The DROID software tool is developed by The National Archives (UK) to perform automated batch identification of file formats by assigning PRONOM Persistent Unique Identifiers (PUIDs) and MIME types to files. The tool relies on so-called signature files, which are derived from the PRONOM technical registry.

Here I present some considerations for using the tool on the Hadoop platform, together with a performance evaluation of the job execution on a Hadoop cluster using the publicly available Govdocs1 corpus.

Submitted by Sven Schlarb on 24 May 2013 – 11:44am

Why I would not "mix" Hadoop and Taverna

This blog post answers willp-bl’s post “Mixing Hadoop and Taverna” and builds on some of the ideas that I presented in my blog post “Big data processing: chaining Hadoop jobs using Taverna”.

Submitted by Sven Schlarb on 4 March 2013 – 12:10pm

Big data processing: chaining Hadoop jobs using Taverna

Processing very large data sets is a core challenge of the SCAPE project. Using the SCAPE platform and a variety of services and tools, the SCAPE Testbeds are developing solutions for real-world institutional scenarios dealing with big data. Here I present a simple way of chaining Hadoop jobs using Taverna’s Tool service invocation mechanism.

Submitted by Sven Schlarb on 7 August 2012 – 10:07am

How does lossy JP2 image compression influence OCR?

Many institutions have carried out large-scale digitisation projects during the last decade, and the question of how to store the digital master images in a cost-effective way has made the JPEG2000 image format more popular in the library, museum, and archive community.

Submitted by Sven Schlarb on 13 February 2012 – 11:29am
