Mixing Hadoop and Taverna

As part of our work on test-beds for the SCAPE project, we have been investigating the various ways in which a large-scale file format migration workflow could be implemented.  The underlying technologies chosen for the platform are Hadoop and Taverna.  One of the aims of the SCAPE project is to allow the automatic generation of Taverna workflows, which will then be executed via Hadoop.

The four methods for implementing a file format migration workflow that we tested were:

  1. Batch execution of a shell script (no parallelisation)
  2. A workflow written in/controlled from Java, run on Hadoop
  3. A workflow written in/controlled from Taverna, run on Hadoop
  4. A workflow written in Taverna, calling an XML-defined unit of execution in Hadoop

The code is a generic wrapper for Hadoop, set up so that the type of workflow can be chosen at runtime.

The example workflow we used is a file format migration from TIFF to JPEG2000, followed by some validation of the file structure and image data.

The structure of the workflows is broadly the same as that for SCAPE LSDRT-3 (a minimal sketch of these steps follows the list):

  1. Use Exiftool to extract metadata from the original TIFF
  2. Use OpenJPEG to migrate the TIFF to JP2
  3. Use Exiftool to extract metadata from the new JP2
  4. Use Jpylyzer on the new JP2 to check validity and that the JP2 profile matches the encoding settings
  5. Use Matchbox to generate an SSIM score between the TIFF and the JP2
  6. Generate a short XML report on the migration
  7. Checksum all files
  8. Zip all generated files together in a BagIt-like structure
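
Below is a minimal sketch of the per-file steps, assuming each tool is shelled out as an external process from Java.  The tool names and flags shown (exiftool, opj_compress, jpylyzer) are illustrative assumptions, not the exact command lines used in the SCAPE code:

    import java.io.IOException;

    // Sketch: run the core migration steps for one TIFF as external processes.
    public class MigrateOneFile {

        static int run(String... cmd) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(cmd);
            pb.inheritIO();                              // show tool output on the console
            return pb.start().waitFor();                 // exit code of the tool
        }

        public static void main(String[] args) throws Exception {
            String tiff = args[0];
            String jp2 = tiff.replaceAll("\\.tiff?$", ".jp2");

            run("exiftool", tiff);                       // 1. metadata from the original TIFF
            run("opj_compress", "-i", tiff, "-o", jp2);  // 2. migrate to JP2 with OpenJPEG
            run("exiftool", jp2);                        // 3. metadata from the new JP2
            run("jpylyzer", jp2);                        // 4. validity/profile check
            // Steps 5-8 (Matchbox SSIM, report, checksums, zip) would follow here.
        }
    }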

The batch execution shell script

The shell script contains a rudimentary version of the above workflow.  No reporting is performed by this implementation.

The Java workflow via Hadoop (CommandLineJob)

This workflow is controlled from a Java class, CommandLineJob, and contains the full workflow above.  Java code produces the report, generates a zip file containing an empty SUCCESS/FAILURE file, and returns the success/failure status through to the HDFS result file.
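
As a rough illustration of this approach, the sketch below shows a Hadoop mapper that reads one input file path per line, shells out to the migration tool and records SUCCESS/FAILURE in the job output.  The class name and command line here are assumptions for illustration, not the actual CommandLineJob code:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper in the spirit of CommandLineJob: each input line is
    // the path of one TIFF; the mapper runs the migration and emits the file
    // name with SUCCESS or FAILURE, which ends up in the HDFS result file.
    public class MigrationMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String tiff = value.toString().trim();
            String jp2 = tiff.replaceAll("\\.tiff?$", ".jp2");

            int exitCode = new ProcessBuilder("opj_compress", "-i", tiff, "-o", jp2)
                    .start().waitFor();

            context.write(new Text(tiff), new Text(exitCode == 0 ? "SUCCESS" : "FAILURE"));
        }
    }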

The Taverna workflow via Hadoop (TavernaCommandLineJob)

The Taverna workflow contains the full workflow above, and is executed by Hadoop calling the Taverna command line client.  The reporting and zip generation steps are written in Beanshell/shell script.  Success/failure is reported in one of two ways: an empty SUCCESS/FAILURE file in the zip, or, if the failure was more serious, a log file ending in “.error” stored in HDFS with no other outputs.
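
Calling the Taverna command line client from a Hadoop task might look roughly like the sketch below.  The flag names follow the Taverna Command Line Tool, but the installation path, input port name and workflow file are assumptions:

    import java.io.IOException;

    // Sketch: invoke the Taverna command line client for one input file.
    public class TavernaInvoker {

        static int runWorkflow(String workflowFile, String tiffPath, String outDir)
                throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder(
                    "/opt/taverna/executeworkflow.sh",   // assumed install location
                    "-inputvalue", "tiffFile", tiffPath, // hypothetical input port
                    "-outputdir", outDir,
                    workflowFile);
            pb.inheritIO();
            // A non-zero exit code here would trigger writing a ".error" log to HDFS.
            return pb.start().waitFor();
        }
    }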

There is also code in the repository to call Taverna Server from Hadoop, instead of the Taverna command line.  This code has not been tested for a while and may not work.  Note that input files for the workflow are not currently uploaded to the server by this class, so it will only work on a single-node Hadoop machine at the moment.

The XML workflow, via Taverna calling Hadoop (XMLCommandLineJob & XMLWorkflowReport)

This method incorporates more technologies.  Initially, the workflow is loaded and run from Taverna (note that only Taverna Workbench has been tested so far).  The workflow contains several calls to Hadoop to execute an XMLCommandLineJob, with a command line and parameters defined in XML files.  Generated files from each call are stored in HDFS and tracked with a JobTracker class.  Messages regarding the success of each step are queued to an ActiveMQ instance.  The final step in the workflow runs an XMLWorkflowReport via Hadoop, which wraps up the processing by collecting all the messages for the previous steps, generating a short report, checksumming all the generated files and producing a zip.
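
For the per-step status messages, queueing to ActiveMQ can be done with the standard JMS API, as in the sketch below.  The broker URL, queue name and message format are assumptions, not what XMLCommandLineJob actually sends:

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    // Sketch: queue a per-step status message to ActiveMQ via JMS.
    public class StepReporter {

        static void reportStep(String stepName, boolean success) throws Exception {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed broker
            Connection connection = factory.createConnection();
            try {
                connection.start();
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer =
                        session.createProducer(session.createQueue("scape.workflow.steps"));
                producer.send(session.createTextMessage(
                        stepName + ":" + (success ? "SUCCESS" : "FAILURE")));
            } finally {
                connection.close();
            }
        }
    }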

Execution time

The above methods were tested on a single-core Debian Wheezy VM with 2GB RAM.  The test files were thirty files from our JISC1 newspaper collection.  When encoded to JP2 with certain settings using OpenJPEG, one of the files is known to produce a bad encode.  The setting that makes the difference is coder bypass being enabled (-M 1); this has been reported to OpenJPEG.

                           Shell script   Hadoop->Java   Hadoop->Taverna   Taverna->Hadoop
  Runtime (mm:ss)          36:08          41:59          76:58             77:54
  Runtime/file (mm:ss)     01:12          01:23          02:33             02:35
  MB/hour                  772.14         664.55         362.49            358.15
  Errors (true positive)   NA~            1              1                 1
  Errors (false positive)  NA~            0              1*                5^

~ No reporting was present in the shell script.

* The Matchbox SSIM was 0.85 (i.e. < 0.9).  When the JP2 file was first on the “compare” command line the result was 0.85; when the TIFF file was first, the result was 0.97.  Both orderings need to be checked.

^ All five had exit code 137 from the Matchbox SIFT comparison, which indicates out of memory.

Conclusions

From the numbers above it’s clear that adding more layers of processing increases execution time.  Despite this overhead, it is hoped that we will be able to develop tools that allow less technical people to develop and execute their own workflows.  Some of the steps still need a good technical understanding, such as the Beanshell and Java code needed to glue the Taverna workflows together.
