In my last blog post about ARC to WARC migration, I compared the performance of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop. I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate.
This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:
If you have got used to a command-line tool and suddenly need to apply it to a large number of files on a Hadoop cluster, without any idea how to write distributed programs, ToMaR will be your friend.
The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.
First things first. The GitHub repository with the Audio QA workflows is here: https://github.com/statsbiblioteket/scape-audio-qa. And version 1 is working. "Version" is really the wrong word here, though. I should call it Workflow 1, which is this one:
This event will focus on the issues that managers face when implementing digital preservation in their organisation. It will explore the tension between stable business processes and the introduction of new technologies. Many managers have a responsibility for digital preservation but are not necessarily technical experts in the field.
- Meet peers who are managing digital preservation
- Learn about the approach of others who are embedding digital preservation in business practices
- Hear about strategic approaches and policies in the field of digital preservation
- Meet experts in digital preservation
- Find out about developments in research and development projects
Who should attend?
Managers with a responsibility for digital preservation in large or small organisations