Preservation Actions

Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation

We’ve been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the groundwork laid by colleague Mick Crouch and former Archives New Zealand colleague Euan Cochrane. Earlier this year we received some disks from New Zealand’s Department of Conservation (DoC), which we successfully imaged, extracting what the department needed. While it was a fairly straightforward exercise, there was enough of interest to make it worth documenting another facet of the digital preservation work we’re doing, especially in the spirit of providing another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, and what we’ll have to keep in mind moving forward.

User-Driven Digital Preservation

We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes.

When (not) to migrate a PDF to PDF/A

It is well-known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process itself can be quite risky. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up.

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

Documented provenance and the ability for researchers to locate and view original versions of digital records as transferred into an archive are concepts central to archival theory. The continuing ability to enable this is challenged by the volume of digital records we’re facing; the requirement to follow good digital preservation practice; the need to provide access; the complexity of modern file formats; and the cost of doing all of it. Technological solutions and techniques borrowed from other disciplines can help reduce costs, from the transfer process through to the maintenance of digital objects in a digital repository, without compromising the integrity demanded by archival theory. Using binary diffs and binary patching mechanisms is one such solution: it can reduce costs and provide a sound method of documenting all file modifications, from the trivial to the complex, enabling the original record to always be recovered.
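
As a concrete illustration of the mechanism (a minimal sketch, not drawn from the paper itself), the following Java snippet shells out to the standard bsdiff and bspatch command-line tools to create a reverse patch from a modified file back to the file as transferred, and then regenerates the original from that patch. The file names are placeholders, and the sketch assumes bsdiff and bspatch are installed and on the PATH.

```java
import java.io.IOException;

// Minimal sketch: store a reverse binary patch alongside a modified record so
// that the file as originally transferred can always be regenerated.
public class BsdiffRoundTrip {

    static void run(String... command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(command).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("Command failed: " + String.join(" ", command));
        }
    }

    public static void main(String[] args) throws Exception {
        // original.doc  : the record as transferred into the archive (placeholder name)
        // modified.doc  : the record after a documented modification (placeholder name)
        // recover.patch : reverse patch turning the modified file back into the original
        run("bsdiff", "modified.doc", "original.doc", "recover.patch");

        // At any later point the original can be reconstructed from the
        // modified file plus the stored patch, and verified by checksum.
        run("bspatch", "modified.doc", "recovered.doc", "recover.patch");
    }
}
```

For files that differ only slightly, the patch is typically a small fraction of the size of the files involved, which is where the potential storage saving comes from.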

Webinar: Tools for uncovering preservation risks in large repositories

Overview
An important part of digital preservation is analysing content to uncover the risks that hinder its preservation. This analysis entails answering diverse questions, for example: Which file formats do I have? Are there any invalid files? Are there any files violating my defined policies? And many others.
 
The threats to preserving content come from many distinct domains, from the technological to the organisational, economic and political, and can relate to the content holder, the producers, or the target communities for which the content is primarily destined.
 
Scout, the preservation watch system, centralises all the necessary knowledge on one platform and cross-references it to uncover preservation risks. Scout automatically fetches information from several sources to populate its knowledge base; for example, it integrates with C3PO to obtain large-scale characterisation profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, allowing the community to bring all the necessary information together in the system. Sharing this information opens new opportunities for joining forces against common problems.
 
This webinar demonstrates how to identify preservation risks in your content and, at the same time, share your content profile information with others to open new opportunities.
 
Learning outcomes
In this webinar you will learn how to:
  • characterise collections and use C3PO to easily inspect the content characteristics
  • integrate C3PO with Scout and publish content profiles online
  • use Scout to automatically monitor your content profile
  • monitor preservation risks by cross-referencing your content profile with policies, information from the world, and even content profiles from peers
There are 23 places available on a first come, first served basis. 
Date: Thursday 26 June 2014
Time: 14:00 BST / 15:00 CET
Duration: 1 hour
Session Lead: Luis Faria, KEEP SOLUTIONS

An Analysis Engine for the DROID CSV Export

I have been working on some code to ensure accurate and consistent output from any file format analysis based on the DROID CSV export. The tool produces summary information about any DROID export, plus more detailed listings of content of interest, such as files with potentially problematic file names or duplicate content identified by MD5 hash value. I describe some of the rationale and ask for advice on where to go next.
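
To illustrate the duplicate-detection idea (a rough sketch, not the tool’s actual code), the following Java snippet groups rows of a DROID CSV export by MD5 hash and reports any hash that appears more than once. The column names MD5_HASH and FILE_PATH are assumptions about the export profile in use, and the naive comma split would need swapping for a proper CSV parser to cope with quoted paths containing commas.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

// Rough sketch: list potential duplicates in a DROID CSV export by MD5 hash.
public class DroidDuplicates {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        List<String> header = Arrays.asList(lines.get(0).replace("\"", "").split(","));
        int hashCol = header.indexOf("MD5_HASH");   // assumed column name
        int pathCol = header.indexOf("FILE_PATH");  // assumed column name
        if (hashCol < 0 || pathCol < 0) {
            System.err.println("Expected columns not found in export");
            return;
        }

        // Group file paths by their MD5 hash value.
        Map<String, List<String>> byHash = new HashMap<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.replace("\"", "").split(",");
            if (cells.length <= Math.max(hashCol, pathCol) || cells[hashCol].isEmpty()) {
                continue; // folders and unhashed rows carry no MD5 value
            }
            byHash.computeIfAbsent(cells[hashCol], k -> new ArrayList<>()).add(cells[pathCol]);
        }

        // Any hash shared by more than one path is a candidate duplicate.
        byHash.forEach((hash, paths) -> {
            if (paths.size() > 1) {
                System.out.println(hash + ": " + paths);
            }
        });
    }
}
```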

A Weekend With Nanite

Well over a year ago I wrote the “A Year of FITS” blog post (http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits) describing how we, over the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the resulting technical metadata, and basically concluded that FITS didn’t suit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including the code base and the organisation of its development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I’ll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Preserving PDF: identify, validate, repair

Overview
This event will focus on the PDF file format. Participants are encouraged to contribute requirements, for instance sample files with errors or anomalies for investigation. Currently available identification and validation tools will be demonstrated, with the opportunity to compare results using your own collections and identify gaps for future development.
 
OPF members have identified specific tasks for the event:
  • check the validity of the files and whether they are encrypted;
  • perform quality assurance checks after migration, using comparison tools; 
  • investigate error messages, repair the problems, and build a knowledge base; and
  • document and improve open source tool functionality, e.g. JHOVE validation.
 
There will also be discussion sessions, and the opportunity to share experiences with peer organisations.
 
Olaf Drümmer, Chairman of the PDF Association / CEO of callas software GmbH / DIN delegate to all PDF related working groups in ISO TC 171 and ISO TC 130 since 1999, will present the work of the ISO standards body, including efforts related to PDF and PDF/A, and share the industry perspective on tool development.
 
Why attend?
  • Learn about PDF and PDF/A standards 
  • Document and prioritise known preservation problems with PDF files
  • Assess state of the art identification and validation tools
  • Test the tools on sample files and compare the results 
  • Define organisational requirements and policies for conformance
  • Identify requirements for future development work (road-mapping)
  • Help improve current PDF tools (hacking)
 
Who should attend? 
Collection owners with a responsibility to preserve PDFs. Bring along your problem files! 
Developers interested in hacking PDF identification and validation tools.
 
Agenda
 
Registration
OPF members are invited free-of-charge (please use the code issued to your main point of contact at your organisation). Non-members are welcome at the rate of EUR 150.
 
 
Date: 1 September 2014 to 2 September 2014

A Tika to ride; characterising web content with Nanite

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for DROID
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper (the last essentially being the *nix `file` tool wrapped for reuse in Java); a sketch of the kind of Tika parser call involved follows below
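
To give a flavour of the Tika side of this, here is a small, self-contained Java sketch (illustrative only, not Nanite’s actual code) of the kind of parser call Nanite-Hadoop issues for each web archive record: Tika auto-detects the format and collects characterisation metadata. The input path is a placeholder, and the usage reflects the Tika 1.x API current at the time.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Illustrative sketch: characterise a single resource with the Tika parsers,
// roughly what Nanite-Hadoop does for every record in a web archive.
public class TikaCharacterise {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // A write limit of -1 stops Tika truncating long text extractions.
        BodyContentHandler handler = new BodyContentHandler(-1);

        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // The detected MIME type plus any format-specific properties the
        // parser recorded (page counts, image dimensions, encodings, ...).
        System.out.println("Detected type: " + metadata.get(Metadata.CONTENT_TYPE));
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```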