This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:
Whenever you have got used to a command-line tool and suddenly need to apply it to a large number of files on a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.
The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.
First things first. The GitHub repository with the Audio QA workflows is here: https://github.com/statsbiblioteket/scape-audio-qa. And version 1 is working. Version is really all wrong here. I should call it Workflow 1, which is this one:
This event will focus on the issues that managers face when implementing digital preservation in their organisation. It will explore the tension between stable business processes and the introduction of new technologies. Many managers have a responsibility for digital preservation but are not necessarily technical experts in the field.
- Meet peers who are managing digital preservation
- Learn about the approach of others who are embedding digital preservation in business practices
- Hear about strategic approaches and policies in the field of digital preservation
- Meet experts in digital preservation
- Find out about developments in research and development projects
Who should attend?
Managers with a responsibility for digital preservation in large or small organisations
One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!
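To make the launcher problem concrete, here is a minimal sketch of how a launcher can locate its own installation folder instead of relying on the current working directory. It is written in Python purely for illustration; it is not how any of the evaluated tools are actually launched, and the names `resolve_install_dir`, `build_command` and `tool.jar` are hypothetical.

```python
import os
import sys


def resolve_install_dir(script_path):
    """Return the absolute directory that contains the launcher script."""
    # realpath makes the path absolute and follows symlinks, so the result
    # does not depend on the directory the user happens to be in.
    return os.path.dirname(os.path.realpath(script_path))


def build_command(script_path, jar_name, args):
    """Build a java command line with the jar located next to the launcher."""
    # jar_name is a hypothetical file name; a real tool would use its own.
    jar_path = os.path.join(resolve_install_dir(script_path), jar_name)
    return ["java", "-jar", jar_path] + list(args)


if __name__ == "__main__":
    # sys.argv[0] is the path the launcher was invoked with.
    print(build_command(sys.argv[0], "tool.jar", sys.argv[1:]))
```

Because the jar path is derived from the launcher's own location, the command comes out the same whether the script is started from the installation folder or from anywhere else on the file system.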