Webinar: Tools for uncovering preservation risks in large repositories

An important part of digital preservation is analysing content to uncover the risks that hinder its preservation. This analysis entails answering diverse questions, for example: Which file formats do I have? Are there any invalid files? Are there any files that violate my defined policies? And many more.
The threats to preserving content come from many distinct domains, from the technological to the organisational, economic and political, and can relate to the content holder, the producers, or the target communities for which the content is primarily destined.
Scout, the preservation watch system, centralizes all the necessary knowledge on the same platform, cross-referencing this knowledge to uncover all preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to get large-scale characterization profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, to allow the community to bring together all the necessary information into the system. The sharing of information opens new opportunities for joining forces against common problems.
This webinar demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
Learning outcomes
In this webinar you will learn how to:
  • characterise collections and use C3PO to easily inspect the content characteristics
  • integrate C3PO with Scout and publish content profiles online
  • use Scout to automatically monitor your content profile
  • monitor preservation risks by cross-referencing your content profile with policies, information from the world, and even content profiles from peers
There are 23 places available on a first come, first served basis.
Date: Thursday 26 June
Time: 14:00 BST / 15:00 CET
Duration: 1 hour
Session Lead: Luis Faria, KEEP SOLUTIONS
26 June 2014

A Weekend With Nanite

Well over a year ago I wrote the “A Year of FITS” blog post describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata, and basically concluded that FITS didn't fit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including its code base and the organisation of its development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I'll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Using Kanban at the SCAPE Developer Workshop

The SCAPE project is into its final 6 months, and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability; however, with everyone together, it was also an opportune time to make progress on other SCAPE-related activities.

Breaking down walls in digital preservation (part 2)

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014.

ARC to WARC migration: How to deal with de-duplicated records?

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further.

A Tika to ride; characterising web content with Nanite

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for DROID
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper (the last essentially being the *nix `file` tool wrapped for reuse in Java)
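All three identification back-ends boil down to the same idea: sniffing a format from the leading bytes of a stream. As a toy illustration only (this is not Nanite's actual API), the JDK's built-in `URLConnection.guessContentTypeFromStream` performs the same kind of magic-byte sniffing:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.net.URLConnection;

public class MimeSniffExample {

    // Toy stand-in for a Nanite-style identifier: guess a MIME type from
    // the first bytes of a stream. (Nanite itself delegates to DROID,
    // Apache Tika and libmagic; here we only use the JDK's basic sniffer.)
    static String sniff(byte[] content) throws IOException {
        return URLConnection.guessContentTypeFromStream(
                new ByteArrayInputStream(content));
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>hello</body></html>".getBytes("UTF-8");
        System.out.println(sniff(html)); // prints "text/html"
    }
}
```

The real tools go far beyond this: DROID matches byte signatures from the PRONOM registry, and Tika combines magic bytes with container and metadata parsing, which is why Nanite chains them rather than relying on any single identifier.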