ARC to WARC migration: How to deal with de-duplicated records?

In my last blog post about ARC to WARC migration I compared the performance of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop. I also noted that the next point to investigate was resolving contextual dependencies in order to create self-contained WARC files.
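As a rough sketch of the record-level mapping such a migration involves (this is not the SCAPE migration tool itself), the function below converts a v1 ARC URL-record header line into WARC response-record headers. It is deliberately simplified: real ARC files begin with a version block, and a valid WARC record also needs a `WARC-Record-ID`, both omitted here.

```python
from datetime import datetime

def arc_to_warc_headers(arc_header_line):
    """Map a v1 ARC URL-record header line to WARC response-record headers.

    ARC v1 header fields: URL IP-address archive-date content-type length
    (space-separated, archive-date as YYYYMMDDHHMMSS).
    """
    url, ip, arc_date, content_type, length = arc_header_line.split(" ")
    # WARC-Date uses the ISO 8601 (W3C) profile rather than ARC's compact form.
    warc_date = datetime.strptime(arc_date, "%Y%m%d%H%M%S").strftime(
        "%Y-%m-%dT%H:%M:%SZ")
    return [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", url),
        ("WARC-Date", warc_date),
        ("WARC-IP-Address", ip),
        ("Content-Type", content_type),
        ("Content-Length", length),
    ]

# Assemble the header block of a WARC record (payload would follow the blank line).
headers = arc_to_warc_headers(
    "http://example.org/ 192.0.2.1 20140321100000 text/html 1234")
record = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
print(record)
```

The interesting part in practice is not this header mapping but exactly the contextual dependencies mentioned above, since a self-contained WARC also needs the related request, metadata, and de-duplication (revisit) records resolved.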

A Tika to ride; characterising web content with Nanite

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, consisting of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper (the last of these essentially being the *nix `file` tool wrapped for use from Java)
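To give a sense of what the `file`-style identification step does at the byte level, here is a toy magic-number sniffer. It is purely illustrative: Nanite delegates this work to Droid, Tika and libmagic, and the few signatures below are hand-picked rather than drawn from any real signature database.

```python
# A small hand-picked subset of magic-byte signatures, mapped to MIME types.
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]

def sniff(payload: bytes) -> str:
    """Identify a payload by its leading magic bytes, as `file` does."""
    for magic, mime in MAGIC:
        if payload.startswith(magic):
            return mime
    return "application/octet-stream"

print(sniff(b"%PDF-1.4\n..."))  # application/pdf
```

In a Nanite-Hadoop job, each mapper would run identifiers like this over the payload of every record in a web archive container and emit the resulting format profile.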

Some reflections on scalable ARC to WARC migration

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving, where long-term preservation is directly relevant to task areas such as harvesting, storage, and access.

SCAPE Webinar: ToMaR – The Tool-to-MapReduce Wrapper: How to Let Your Preservation Tools Scale


When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution for running preservation tools on a Hadoop MapReduce cluster in a scalable fashion.

ToMaR makes it possible to use existing command-line tools and Java applications in Hadoop's distributed environment much as they would be used on a desktop computer. Using SCAPE tool specification documents, ToMaR lets users express complex command-line patterns as simple keywords, which can then be executed on a computer cluster or a single machine. ToMaR is a generic MapReduce application and does not require any programming skills.
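As a rough illustration of how a tool specification maps a keyword to a command-line pattern, a fragment along the following lines might describe an identification operation based on the `file` tool. The element names here are approximations of the SCAPE tool specification, not the exact schema:

```xml
<!-- Illustrative toolspec-style fragment; element names are approximate. -->
<tool name="file">
  <operations>
    <operation name="identify">
      <description>Identify the MIME type of a file</description>
      <command>file --mime-type ${input}</command>
      <inputs>
        <input name="input" required="true"/>
      </inputs>
      <outputs>
        <stdout/>
      </outputs>
    </operation>
  </operations>
</tool>
```

With such a specification in place, a ToMaR job would only need to refer to the `identify` keyword and supply the input paths, rather than spelling out the full command line for every file.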

This webinar will introduce you to the core concepts of Hadoop and ToMaR and show by example how to apply them to a file format migration scenario.

Learning outcomes

1. Understand the basic principles of Hadoop
2. Understand the core concepts of ToMaR
3. Apply knowledge of Hadoop and ToMaR to the file format migration scenario

Who should attend?

Practitioners and developers who are:

• dealing with command-line tools (preferably from the digital preservation domain) in their daily work
• interested in Hadoop and how it can be used with binary content and third-party tools

Session Lead: Matthias Rella, Austrian Institute of Technology

Time: 10:00 GMT / 11:00 CET

Duration: 1 hour

21 March 2014
SCAPE QA Tool: Technologies behind Pagelyzer - I: Support Vector Machine

Managing Digital Preservation: A SCAPE & OPF Seminar

This event will focus on the issues that managers face when implementing digital preservation in their organisations. It will explore the tension between stable business processes and the introduction of new technologies. Many managers have a responsibility for digital preservation without necessarily being technical experts in the field.

Why attend?

  • Meet peers who are managing digital preservation
  • Learn about the approach of others who are embedding digital preservation in business practices
  • Hear about strategic approaches and policies in the field of digital preservation
  • Meet experts in digital preservation
  • Find out about research and development project developments

Who should attend?

Managers with a responsibility for digital preservation in large or small organisations

2 April 2014