SCAPE

Using Kanban at the SCAPE Developer Workshop

The SCAPE project is into its final 6 months and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability, however with everyone together it provided an opportune time to make progress with other SCAPE related activities.

Breaking down walls in digital preservation (part 2)

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014.

ARC to WARC migration: How to deal with de-duplicated records?

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further.

A Tika to ride; characterising web content with Nanite

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)