Preservation Actions

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

Documented provenance, and the ability for researchers to locate and view the original versions of digital records as they were transferred into an archive, are concepts central to archival theory. Our continuing ability to deliver this is challenged by the sheer number of digital records we face, the requirement to follow good digital preservation practice, the need to provide access, the complexity of modern file formats, and the cost of doing all of it. Technological solutions, and techniques borrowed from other disciplines, can help reduce costs throughout the transfer process and into the maintenance of digital objects in a digital repository, without compromising the integrity demanded by archival theory. Binary diff and binary patch mechanisms are one such solution: they can reduce costs and provide a sound method of documenting all file modifications, from the trivial to the complex, so that the original record can always be recovered.
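As a rough illustration of the idea (not the specific workflow described in the full post), the sketch below shells out to the standard `bsdiff` and `bspatch` command-line tools, which are assumed to be installed: it stores a reverse patch alongside a pre-conditioned file, then shows that applying the patch recovers a bit-identical copy of the original record. File names are invented for the example.

```python
#!/usr/bin/env python3
"""Hypothetical sketch: keep a reverse binary patch alongside a pre-conditioned file.

Assumes the external `bsdiff` and `bspatch` command-line tools are installed;
the file names are illustrative, not taken from the original post.
"""
import hashlib
import subprocess
from pathlib import Path

def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record_reverse_patch(original: Path, preconditioned: Path, patch: Path) -> None:
    """Store a binary patch that turns the pre-conditioned file back into the original."""
    subprocess.run(["bsdiff", str(preconditioned), str(original), str(patch)], check=True)

def recover_original(preconditioned: Path, patch: Path, recovered: Path) -> None:
    """Apply the stored patch to reconstruct the original record."""
    subprocess.run(["bspatch", str(preconditioned), str(recovered), str(patch)], check=True)

if __name__ == "__main__":
    original = Path("record_as_transferred.doc")          # illustrative names
    preconditioned = Path("record_repaired.doc")
    patch = Path("record_repaired.doc.reverse.bsdiff")
    recovered = Path("record_recovered.doc")

    record_reverse_patch(original, preconditioned, patch)
    recover_original(preconditioned, patch, recovered)
    assert sha256(recovered) == sha256(original), "recovered file must match the original"
```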

Webinar: Tools for uncovering preservation risks in large repositories

Overview
An important part of digital preservation is analysing content to uncover the risks that hinder its preservation. This analysis entails answering diverse questions, for example: Which file formats do I have? Are there any invalid files? Are there any files that violate my defined policies? And many others.
 
The threats to preserving content come from many distinct domains, from the technological to the organisational, economic and political, and can relate to the content holder, the producers, or the target communities for which the content is primarily destined.
 
Scout, the preservation watch system, centralises all the necessary knowledge on a single platform and cross-references it to uncover preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to obtain large-scale characterisation profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, allowing the community to bring all the necessary information together in the system. Sharing this information opens new opportunities for joining forces against common problems.
 
This webinar demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
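Scout's own interfaces are not shown here, but as a simplified, hypothetical illustration of what cross-referencing a content profile with a policy means in practice, the sketch below checks a format distribution (of the kind a characterisation profile provides) against a policy listing approved PRONOM identifiers. The profile and policy structures are invented for the example and are not C3PO's or Scout's actual data model.

```python
"""Hypothetical illustration of cross-referencing a content profile with a policy.

This is not Scout's or C3PO's actual API; the profile and policy structures
below are invented for the example.
"""

# Format distribution of the kind a characterisation profile provides:
# PRONOM identifier (PUID) -> number of files.
content_profile = {
    "fmt/17": 12000,   # PDF 1.3
    "fmt/40": 800,     # Word 97-2003
    "x-fmt/111": 150,  # plain text
    "fmt/11": 5,       # PNG 1.0
}

# Policy: the PUIDs this repository has agreed to accept.
approved_puids = {"fmt/17", "x-fmt/111", "fmt/11"}

def find_policy_violations(profile: dict[str, int], approved: set[str]) -> dict[str, int]:
    """Return the formats in the profile that fall outside the policy."""
    return {puid: count for puid, count in profile.items() if puid not in approved}

if __name__ == "__main__":
    for puid, count in find_policy_violations(content_profile, approved_puids).items():
        print(f"Preservation risk: {count} files with unapproved format {puid}")
```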
 
Learning outcomes
In this webinar you will learn how to:
  • characterise collections and use C3PO to easily inspect the content characteristics
  • integrate C3PO with Scout and publish content profiles online
  • use Scout to automatically monitor your content profile
  • monitor preservation risks by cross referencing your content profile with policies, information from the world, and even content profiles from peers
There are 23 places available on a first come, first served basis. 
Date: Thursday 26 June 2014
Time: 14:00 BST / 15:00 CEST
Duration: 1 hour
Session Lead: Luis Faria, KEEP SOLUTIONS

An Analysis Engine for the DROID CSV Export

I have been working on some code to ensure accurate and consistent analysis of any file format report based on the DROID CSV export. The tool produces summary information about any DROID export, plus more detailed listings of content of interest, such as files with potentially problematic file names or duplicate content identified by MD5 hash value. I describe some of the rationale and ask for advice on where to go next.
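The tool itself is linked from the full post; purely as a minimal sketch of the same idea, the following reads a DROID CSV export with Python's standard csv module and reports format counts, risky file names and duplicate content. It assumes a profile exported with hashing enabled, so the usual FILE_PATH, NAME, PUID and HASH columns are present.

```python
"""Minimal sketch of summarising a DROID CSV export (not the tool from the post).

Assumes the export was produced with hashing enabled, so the usual
FILE_PATH, NAME, PUID and HASH columns are present.
"""
import csv
import sys
from collections import defaultdict

def summarise(droid_csv: str) -> None:
    puid_counts: dict[str, int] = defaultdict(int)
    by_hash: dict[str, list[str]] = defaultdict(list)

    with open(droid_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            if row.get("TYPE") == "Folder":
                continue  # only summarise files
            puid_counts[row.get("PUID", "") or "none"] += 1
            if row.get("HASH"):
                by_hash[row["HASH"]].append(row.get("FILE_PATH", ""))
            # Flag file names likely to cause trouble downstream.
            name = row.get("NAME", "")
            if any(ch in name for ch in '<>:"|?*'):
                print(f"Problematic file name: {name}")

    print("Format identifier (PUID) counts:")
    for puid, count in sorted(puid_counts.items(), key=lambda item: -item[1]):
        print(f"  {puid}: {count}")

    print("Duplicate content (same hash, multiple paths):")
    for digest, paths in by_hash.items():
        if len(paths) > 1:
            print(f"  {digest}: {len(paths)} copies")

if __name__ == "__main__":
    summarise(sys.argv[1])
```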

A Weekend With Nanite

Well over a year ago I wrote the "A Year of FITS" blog post (http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits) describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Set (FITS) from Harvard University. I presented the technique and the technical metadata, and basically concluded that FITS didn't fit that kind of heterogeneous data at such scale. In the time since that experiment, FITS has been improved in several areas, including the code base and the organisation of its development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I'll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Preserving PDF: identify, validate, repair

Overview
This event will focus on the PDF file format. Participants are encouraged to contribute requirements, for instance sample files with errors or anomalies for investigation. Currently available identification and validation tools will be demonstrated, with the opportunity to compare results using your own collections and identify gaps for future development.
 
OPF members have identified specific tasks for the event:
  • check the validity of the files and whether they are encrypted (see the sketch after this list);
  • perform quality assurance checks after migration, using comparison tools; 
  • investigate error messages, repair the problems, and build a knowledge base; and
  • document and improve open source tool functionality e.g. JHOVE validation.
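As a rough sketch of the first task, and of the kind of JHOVE output mentioned in the last, the following wraps the JHOVE command line. It assumes `jhove` is installed with its PDF module (PDF-hul) available; the output parsing is deliberately naive, reading only the reported status line and looking for any mention of encryption.

```python
"""Rough sketch: batch-check PDFs with JHOVE's PDF module.

Assumes the `jhove` command-line tool is installed with the PDF-hul module;
the parsing below is deliberately naive and only inspects the text report.
"""
import subprocess
import sys
from pathlib import Path

def check_pdf(path: Path) -> None:
    result = subprocess.run(
        ["jhove", "-m", "PDF-hul", str(path)],
        capture_output=True, text=True, check=False,
    )
    report = result.stdout
    status = next(
        (line.split(":", 1)[1].strip() for line in report.splitlines()
         if line.strip().startswith("Status:")),
        "unknown",
    )
    encrypted = "encrypt" in report.lower()  # naive hint only
    print(f"{path}: status={status!r}, mentions encryption={encrypted}")

if __name__ == "__main__":
    for pdf in Path(sys.argv[1]).rglob("*.pdf"):
        check_pdf(pdf)
```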
 
There will also be discussion sessions, and the opportunity to share experiences with peer organisations.
 
Olaf Drümmer (Chairman of the PDF Association, CEO of callas software GmbH, and DIN delegate to all PDF-related working groups in ISO TC 171 and ISO TC 130 since 1999) will present the work of the ISO standards body, including efforts related to PDF and PDF/A, and share the industry perspective on tool development.
 
Why attend?
  • Learn about PDF and PDF/A standards 
  • Document and prioritise known preservation problems with PDF files
  • Assess state of the art identification and validation tools
  • Test the tools on sample files and compare the results 
  • Define organisational requirements and policies for conformance
  • Identify requirements for future development work (road-mapping)
  • Help improve current PDF tools (hacking)
 
Who should attend? 
Collection owners with a responsibility to preserve PDFs. Bring along your problem files! 
Developers interested in hacking PDF identification and validation tools.
 
Agenda
 
Registration
OPF members are invited free-of-charge (please use the code issued to your main point of contact at your organisation). Non-members are welcome at the rate of EUR 150.
 
 
Date: 1 September 2014 to 2 September 2014

A Tika to ride: characterising web content with Nanite

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for DROID   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper (the last one essentially being the *nix `file` tool wrapped for reuse in Java)
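Nanite-Hadoop drives these libraries as Java APIs inside MapReduce; purely as a hedged illustration of what the Tika side adds on top of format identification, the snippet below shells out to the standalone tika-app jar instead. The jar location is an assumption, not something taken from the post.

```python
"""Hedged illustration: what the Tika parsers add on top of format identification.

Nanite-Hadoop calls Tika as a Java library inside MapReduce; this standalone
sketch instead shells out to the tika-app jar, assumed to sit at TIKA_APP.
"""
import subprocess
import sys

TIKA_APP = "tika-app.jar"  # adjust to the local path of the Tika application jar

def detect_type(path: str) -> str:
    """Return the MIME type Tika detects for a file."""
    return subprocess.run(
        ["java", "-jar", TIKA_APP, "--detect", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

def extract_metadata(path: str) -> str:
    """Return the metadata the Tika parsers extract (one 'key: value' pair per line)."""
    return subprocess.run(
        ["java", "-jar", TIKA_APP, "--metadata", path],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "->", detect_type(path))
        print(extract_metadata(path))
```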

Some reflections on scalable ARC to WARC migration

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One application area is web archiving, where long-term preservation is directly relevant to different task areas such as harvesting, storage, and access.

SCAPE Webinar: ToMaR – The Tool-to-MapReduce Wrapper: How to Let Your Preservation Tools Scale

Overview

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.
ToMaR makes it possible to use existing command-line tools and Java applications in Hadoop's distributed environment in much the same way as on a desktop computer. By utilising SCAPE tool specification documents, ToMaR allows users to specify complex command-line patterns as simple keywords, which can be executed on a computer cluster or a single machine. ToMaR is a generic MapReduce application that requires no programming skills.
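ToMaR itself is a Java MapReduce application driven by toolspec documents, so no coding is required; purely to illustrate the underlying pattern it generalises, running an existing command-line tool once per input record on a Hadoop cluster, here is a hypothetical Hadoop Streaming mapper. The migration command (ImageMagick's `convert`) and the input layout are invented for the example and are not part of ToMaR.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming mapper, shown only to illustrate the pattern
ToMaR generalises: run an existing command-line tool once per input record.

Each input line is assumed to hold a source path and a target path separated
by a tab; the migration command (`convert`) is an invented stand-in.
"""
import subprocess
import sys

def main() -> None:
    for line in sys.stdin:
        line = line.strip()
        if "\t" not in line:
            continue
        source, target = line.split("\t", 1)
        result = subprocess.run(
            ["convert", source, target],   # stand-in for any preservation tool
            capture_output=True, text=True, check=False,
        )
        status = "OK" if result.returncode == 0 else f"FAILED({result.returncode})"
        # Hadoop Streaming expects tab-separated key/value pairs on stdout.
        print(f"{source}\t{status}")

if __name__ == "__main__":
    main()
```

With ToMaR, the same pattern is expressed declaratively: the tool invocation lives in a SCAPE toolspec document and is referenced by a simple keyword rather than hand-written wrapper code.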

This webinar will introduce you to the core concepts of Hadoop and ToMaR and show, by example, how to apply them to a file format migration scenario.

Learning outcomes

1. Understand the basic principles of Hadoop
2. Understand the core concepts of ToMaR
3. Apply knowledge of Hadoop and ToMaR to the file format migration scenario

Who should attend?

Practitioners and developers who are:

• dealing with command-line tools (preferably from the digital preservation domain) in their daily work
• interested in Hadoop and how it can be used for binary content and third-party tools

Session Lead: Matthias Rella, Austrian Institute of Technology

Time: 10:00 GMT / 11:00 CET

Duration: 1 hour

Date: 21 March 2014