Blogs

CSV Validator - beta releases

For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against user defined schema.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under Mozilla Public Licence version 2.0.

A Tika to ride; characterising web content with Nanite

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Some reflections on scalable ARC to WARC migration

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.