Writing preservation policies might be a daunting task, but support is underway!
The SCAPE project is into its final 6 months and with that came our final developer workshop. The main focus of this event was demonstrations, productisation and sustainability, however with everyone together it provided an opportune time to make progress with other SCAPE related activities.
Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014.
People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as as a core feature of the digital library.
In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further.
This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.
Nanite is a Java project lead by Andy Jackson from the UK Web Archive, formed of two main subprojects:
Whenever you run into the situation that you have got used to a command line tool and all of a sudden need to apply it to a large amount of files over a Hadoop cluster without having any clue of writing distributed programs ToMaR will be your friend.