This blog post continues a series of posts about the web archiving topic „ARC to WARC migration“; it follows up on the posts „ARC to WARC migration: How to deal with de-duplicated records?“ and „Some reflections on scalable ARC to WARC migration“.
- characterise collections and use C3PO to easily inspect the content characteristics
- integrate C3PO with Scout and publish content profiles online
- use Scout to automatically monitor your content profile
- monitor preservation risks by cross-referencing your content profile with policies, information from the world, and even content profiles from peers
- check the validity of the files and whether they are encrypted;
- perform quality assurance checks after migration, using comparison tools;
- investigate error messages, repair the problems, and build a knowledge base; and
- document and improve open source tool functionality, e.g. JHOVE validation.
- Learn about PDF and PDF/A standards
- Document and prioritise known preservation problems with PDF files
- Assess state of the art identification and validation tools
- Test the tools on sample files and compare the results
- Define organisational requirements and policies for conformance
- Identify requirements for future development work (road-mapping)
- Help improve current PDF tools (hacking)
This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:
If you have got used to a command-line tool and suddenly need to apply it to a large number of files on a Hadoop cluster, without any experience of writing distributed programs, ToMaR will be your friend.
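As a rough local analogue of that idea (not ToMaR's own interface, which is not shown here), the sketch below maps one command-line tool over a list of files in Python. The names `run_tool` and `map_tool_over_files` are hypothetical, and a thread pool on a single machine stands in for the Hadoop cluster.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_tool(command, path):
    """Run one invocation of a command-line tool on one file.

    `command` is the tool and its fixed arguments as a list of strings;
    the file path is appended as the final argument (an assumption about
    how the tool takes its input).
    """
    result = subprocess.run(
        [*command, str(path)],
        capture_output=True,
        text=True,
        check=False,  # keep going even if the tool fails on one file
    )
    return str(path), result.returncode, result.stdout.strip()


def map_tool_over_files(command, paths, workers=4):
    """Apply the same command to every path, several files at a time.

    Results come back in the same order as `paths`, one
    (path, exit code, standard output) tuple per file.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: run_tool(command, p), paths))
```

For example, `map_tool_over_files(["file", "--mime-type"], paths)` would return one tuple per file, assuming the `file` utility is installed. Splitting the input list, shipping the tool to the worker nodes, and collecting the output on a cluster are the parts this sketch leaves out.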
The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.