Preservation Actions

A Tika to ride; characterising web content with Nanite

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper (the last of these essentially being the *nix `file` tool wrapped for reuse in Java)

Some reflections on scalable ARC to WARC migration

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.

SCAPE Webinar: ToMaR – The Tool-to-MapReduce Wrapper: How to Let Your Preservation Tools Scale


When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.
ToMaR lets you use existing command-line tools and Java applications in Hadoop’s distributed environment in much the same way as on a desktop computer. By utilizing SCAPE tool specification documents, ToMaR allows users to specify complex command-line patterns as simple keywords, which can be executed on a computer cluster or a single machine. ToMaR is a generic MapReduce application that requires no programming skills to use.

This webinar will introduce you to the core concepts of Hadoop and ToMaR and show you by example how to apply it to the scenario of file format migration.
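
The idea of replacing a complex command-line pattern with a simple keyword can be sketched roughly as follows. This is only an illustration of the concept, not ToMaR's actual implementation or file format: the `TOOLSPECS` dictionary, the keyword names and the `expand` helper are all hypothetical, and real SCAPE tool specification documents carry considerably more detail.

```python
import shlex
from string import Template

# Hypothetical tool specifications: keyword -> command-line pattern.
# These entries are illustrative assumptions, not real SCAPE toolspecs.
TOOLSPECS = {
    "image-to-tiff": Template("convert $input $output"),
    "identify-format": Template("file --mime-type $input"),
}


def expand(keyword, **params):
    """Expand a keyword and its parameters into an argument list."""
    pattern = TOOLSPECS[keyword]  # raises KeyError for an unknown keyword
    # Quote each value so that file names containing spaces stay one argument.
    quoted = {name: shlex.quote(value) for name, value in params.items()}
    return shlex.split(pattern.substitute(quoted))


command = expand("image-to-tiff", input="scan 01.png", output="scan 01.tif")
print(command)
```

The resulting argument list could then be handed to something like `subprocess.run` on a single machine; in a cluster setting, each input record would be expanded and executed by a separate map task.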

Learning outcomes

1. Understand the basic principles of Hadoop
2. Understand the core concepts of ToMaR
3. Apply knowledge of Hadoop and ToMaR to the file format migration scenario

Who should attend?

Practitioners and developers who are:

• dealing with command-line tools (preferably from the digital preservation domain) in their daily work
• interested in Hadoop and how it can be used for binary content and 3rd-party tools

Session Lead: Matthias Rella, Austrian Institute of Technology

Time: 10:00 GMT / 11:00 CET

Duration: 1 hour

21 March 2014

SCAPE Training - Preserving Your Preservation Tools


Learning to Think Like a Package Maintainer

Lots of great digital preservation applications and services exist; however, very few are actively maintained and thus preserved. This is a big problem! By introducing the steps needed to develop these tools and engage the support of the community, this training course looks at what can be done to improve the situation. Specifically, it looks at how to prepare packages for submission into the very heart of many digital environments: the operating system and its directly associated “app-stores”. Attendees will be given hands-on experience of developing and maintaining packages rather than software, and the key differences will be discussed and evaluated. Better preservation of preservation tools means better preservation of our digital history.

Learning Outcomes (by the end of the training event the attendees will be able to):

  1. Understand the complexities of package management and distinguish between the different practices relating to both package objectives and chosen programming language. 
  2. Be able to carry out advanced package management operations in order to critically appraise current packages and propose changes. 
  3. Understand the importance of clearly defined versioning and licenses and the role of clear documentation and examples. 
  4. Apply best practice techniques in order to create a simple package suitable for long term maintenance. 
  5. Evaluate a number of options for managing package configuration and behavior relating to package installation, removal, upgrade and re-installation. 
  6. Analyse opportunities for automating package management and releases, maintaining a clear focus on the user and not the developer. 
  7. Critically evaluate opportunities to generalise package management to allow the easy building and maintenance of packages on multiple platforms.
  8. Assess the potential to apply package management techniques in your own environment. 

Delegates will receive a certificate of attendance for the training course.

The agenda can be seen here:

Registration is now open!

26 March 2014 to 27 March 2014

Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna

More than 20 developers attended the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, which took place from 2nd to 4th December 2013 in the baroque room called "Oratorium" at the Austrian National Library.

Fund it, Solve it, Keep it (with SPRUCE)

How to fund and solve your digital preservation challenges


What will the event do for me?

This event will help to make your digital preservation more effective by demonstrating the best community-focused approaches and results from the JISC-funded SPRUCE Project. You'll hear from the SPRUCE Team experts and from the practitioners and developers who have been tackling digital preservation challenges in targeted SPRUCE Award projects. We'll also hear from you, so we can take on board what you need from our future work.

  • If you're taking your first steps in preserving your digital assets we will demonstrate how to get started, where to get help, and how to make the case to resource your work more effectively.
  • If you're already engaged in digital preservation we'll show how your efforts can be supported more effectively with help from the community.

Key topics we will be covering include:

  • Securing funding for your digital preservation activities with the Digital Preservation Business Case Toolkit
  • Community approaches to solving digital preservation challenges
  • SPRUCE guides on how to assess your digital collections
  • Stabilising data stored on obsolete hand-held media
  • Results from the SPRUCE Award Projects

Who is this for?

Practitioners, developers and middle managers who are engaged (or would like to be engaged) in preserving their organisation's digital assets.

When, where and how do I register?

The free event will take place at 11am on the 25th November at the brand new Library of Birmingham. Register your attendance here. Please note that anyone who registers for the event and then fails to attend without giving at least one week's notice will be liable for a £50 cancellation charge. Places are limited, so please don't waste them!

25 November 2013

POSTPONED Digital Preservation Without Tears

The ‘Digital Preservation Without Tears’ Mash-up will appeal to collection owners and developers.  The programme offers two connected strands – a hack and a sprint.

  • In the hack, developers will have two days to develop, test and enhance practical tools for digital preservation. Collection owners will be invited to bring problem elements of their digital collections for analysis using the latest digital forensic and characterisation tools.  This will help the collection owners develop practical workflows for management and preservation while helping developers spot and refine solutions that will enable better tools.
  • In the sprint, collection owners will examine current thinking on digital preservation policy and planning in their organisations.  Collections owners will present their own digital preservation policies and will be invited to assess these against each other and against emerging good practice, providing a managed environment for policy development and peer review.  Collection owners will then be invited to pool their wisdom to create a Digital Preservation Policy Building Toolkit that can be shared.

This mashup will:

  • Provide a forum for practical problem solving for analysis of digital collections
  • Provide a forum for discussion, review and development of digital preservation policy
  • Bring together developers and collection owners from across the DPC and OPF to address shared challenges
  • Extend and enhance the corpus of digital preservation tools
  • Deliver a simple beginners’ guide for the development of digital preservation policies

This event will be of interest to:

  • Collections managers, librarians, curators and archivists and policy makers in all institutions with an interest in digital preservation
  • Techies, tools developers, IT officers, database managers and systems analysts with an interest in long term data management
  • Innovators and researchers in digital preservation
  • Vendors and providers of digital preservation services
  • CEOs, CTOs and CIOs seeking to develop institutional capacity for digital preservation

Everyone attending needs to bring a laptop computer.  In addition:

  • Collection owners will need to bring a data set that is giving them trouble in terms of characterisation or identification and be prepared to present their institutional policy on digital preservation
  • Techies will need to tell us about the skills they have and bring a knowledge of existing digital forensic and characterisation tools

Also, because elements of the mash-up include peer review of existing practice, participants need to understand and consent to working under the ‘Chatham House Rule’ for parts of the programme.

Places are strictly limited and should be booked in advance.  Priority will be given to DPC and OPF members, who can attend at no cost.  Non-members are welcome at a cost of £150 per person. Lunch and refreshments are provided on three days and dinner on the first night.  Accommodation will be recommended but is not included in the cost.  Register online at:

Can’t make it?

Parts of the event will be available as a webcast. We’ll publish the slides after each event and will tweet live from the event using the hashtag #DPnoTears.  


OPF Webinar - Digital library development and practice at the London School of Economics

This webinar will present a case study of digital preservation and digital library development at the London School of Economics. It will cover:

  • the nature of the digital library collections we are working with now, and a bit about our experiments and future directions for other kinds of born-digital material
  • the high-level architecture and functional components we have in place, our general approach, and what we feel we can avoid having an opinion about for now
  • our user experience design process and how we are integrating this way of thinking into other areas of the library, such as our main website
  • how we made the case to fund digital preservation and the development of our core team, and how we involve others within the library

Session lead: Ed Fay, Digital Library Manager, London School of Economics

Time: 14:00 BST / 15:00 CET

There are 25 places available, which will be allocated on a first come, first served basis. Registration will open soon.

23 September 2013