We recently posted an article on the UK Web Archive blog that may be of interest here: User-Driven Digital Preservation. In it we summarise our work with the SCAPE Project on a small prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes.
It is well known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can itself be quite risky. As I often get questions on this subject, I thought a short write-up might be worthwhile.
This blog post continues a series of posts on the web archiving topic "ARC to WARC migration". It is a follow-up to the posts "ARC to WARC migration: How to deal with de-duplicated records?" and "Some reflections on scalable ARC to WARC migration".
- characterise collections and use C3PO to easily inspect the content characteristics
- integrate C3PO with Scout and publish content profiles online
- use Scout to automatically monitor your content profile
- monitor preservation risks by cross-referencing your content profile with policies, information from the world, and even content profiles from peers
- check the validity of the files and whether they are encrypted;
- perform quality assurance checks after migration, using comparison tools;
- investigate error messages, repair the problems, and build a knowledge base; and
- document and improve open source tool functionality e.g. JHOVE validation.
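As a rough illustration of the first check in the list above (validity and encryption), here is a minimal Python sketch. It is only a byte-level heuristic and not how JHOVE or any other validator works: the function name `quick_pdf_check` and the 1024-byte search windows are assumptions made for this example. It looks for the `%PDF-` header, the `%%EOF` marker, and an `/Encrypt` key, which encrypted PDFs normally reference from the trailer.

```python
def quick_pdf_check(data: bytes) -> dict:
    """Heuristic pre-check of raw PDF bytes (hypothetical helper).

    This does not validate the file; it only flags candidates for
    closer inspection with a real validation tool.
    """
    # A PDF is expected to carry the "%PDF-" marker near the start of the file.
    has_header = b"%PDF-" in data[:1024]
    # A complete PDF is expected to carry an "%%EOF" marker near the end.
    has_eof = b"%%EOF" in data[-1024:]
    # Encrypted PDFs normally reference an /Encrypt dictionary from the trailer.
    # A plain byte search can give false positives and can miss unusual encodings.
    looks_encrypted = b"/Encrypt" in data
    return {
        "has_header": has_header,
        "has_eof": has_eof,
        "looks_encrypted": looks_encrypted,
    }
```

To screen a file on disk, pass its contents as bytes, for example `quick_pdf_check(open("sample.pdf", "rb").read())`, where `sample.pdf` is a placeholder name. Any file that fails the header or end-of-file check, or that looks encrypted, would then go to a proper validator for confirmation.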
- Learn about PDF and PDF/A standards
- Document and prioritise known preservation problems with PDF files
- Assess state of the art identification and validation tools
- Test the tools on sample files and compare the results
- Define organisational requirements and policies for conformance
- Identify requirements for future development work (road-mapping)
- Help improve current PDF tools (hacking)
This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:
If you have grown used to a command-line tool and suddenly need to apply it to a large number of files on a Hadoop cluster, without any experience of writing distributed programs, ToMaR will be your friend.