Open Planets Foundation Blogs

The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and the Planets consortium.

New Memorandum strengthens global collaboration in digital preservation

6 May 2014 - 10:26pm
The Open Planets Foundation (OPF) and the Digital Preservation Coalition (DPC) are delighted to announce a new memorandum of understanding that strengthens their ongoing collaboration to tackle digital preservation challenges. Signed by OPF Executive Director, Ed Fay, and DPC Executive Director, William Kilbride, at the DPC offices in Glasgow, the MoU commits both organisations to share knowledge and expertise, deliver joint events, and support the development of tools and best practices.

‘The OPF and DPC have worked together on a number of stand-alone activities and events over the past few years’, explained Ed Fay, ‘and we were both partners in the JISC-funded SPRUCE (Sustainable Preservation Using Community Engagement) project. The MoU will help us to better align our activities and offer further value to members of both organisations’.

‘Our first official joint initiative is already underway’, revealed William Kilbride. ‘The DPC has expanded the Digital Preservation Awards to offer five awards this year. The OPF is one of our partners, sponsoring the Research and Innovation award’.

To view the new Memorandum see: OPF_DPC_MOU

Preservation Topics: Open Planets Foundation
Categories: Planet DigiPres

Preservation Health Check: Monitoring Threats to Digital Repository Content

30 April 2014 - 1:01pm

OCLC have published a report presenting the preliminary findings of its Phase 1 investigation of preservation monitoring as part of the Preservation Health Check (PHC) Project.

In collaboration with the Open Planets Foundation and the Bibliothèque Nationale de France, the project aims to evaluate the usefulness of the preservation metadata created and maintained by operational repositories for assessing basic preservation properties.

Written by Wouter Kool, Brian Lavoie and Titia van der Werf, the report suggests that there is an opportunity to use PREMIS preservation metadata as an evidence base to support a threat assessment exercise based on the Simple Property-Oriented Threat (SPOT) model.

Key highlights:

  • There is a need for digital preservation repositories to perform periodic "health checks" as a routine part of preservation activities
  • Preservation Health Check activities serve the day-to-day planning and operations of digital repositories
  • A certain level of predictability and harmonization is necessary for threat assessment applications that rely on automated data evaluation
  • Analysis reveals a variety of gaps in current preservation metadata coverage, which might be filled by other metadata schema
  • Findings suggest an opportunity to use PREMIS preservation metadata as an evidence base to support a threat assessment exercise
  • The results of preservation actions (PREMIS Events) represent a crucial part of the information needed for assessment—whether this information is under the direct control of the repository itself, or whether it is created and maintained by parties external to the repository.
  • The flexibility of the PREMIS standard allows for a large diversity in implementations and leaves much room for encoding relevant metadata in other formats and schemas—all of which impedes the implementation of a threat assessment logic that generalizes over many repositories. 
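To make the "health check" idea concrete, here is a minimal sketch of how a repository might screen PREMIS-style event records against one SPOT property (integrity). The record layout, field names and policy threshold below are simplified illustrations invented for this example, not actual PREMIS semantic units or rules from the report.

```python
from datetime import date, timedelta

# Illustrative sketch only: these dicts are simplified stand-ins for
# PREMIS Event records (eventType, eventDateTime, eventOutcome).
events = [
    {"object": "obj-001", "type": "fixity check", "date": date(2014, 3, 1), "outcome": "pass"},
    {"object": "obj-002", "type": "fixity check", "date": date(2013, 1, 15), "outcome": "fail"},
    {"object": "obj-003", "type": "ingestion",    "date": date(2012, 6, 7), "outcome": "pass"},
]

def integrity_threats(events, today, max_age_days=365):
    """Flag objects whose latest fixity check failed, is stale, or is
    missing: a toy assessment of the SPOT model's 'integrity' property."""
    checks = {}
    # Keep only the most recent fixity check per object.
    for ev in events:
        if ev["type"] == "fixity check":
            prev = checks.get(ev["object"])
            if prev is None or ev["date"] > prev["date"]:
                checks[ev["object"]] = ev
    threats = {}
    for ev in events:
        obj = ev["object"]
        check = checks.get(obj)
        if check is None:
            threats[obj] = "no fixity check on record"
        elif check["outcome"] != "pass":
            threats[obj] = "last fixity check failed"
        elif today - check["date"] > timedelta(days=max_age_days):
            threats[obj] = "fixity check older than policy allows"
    return threats

print(integrity_threats(events, today=date(2014, 4, 30)))
```

A real implementation would parse PREMIS XML rather than dicts, and would cover the other SPOT properties as well; the point here is only that event metadata, if recorded consistently, can drive an automated periodic check.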

This report will be of interest to digital repository managers, digital preservation practitioners, and PREMIS implementers.

Phase 2 of the Preservation Health Check Pilot will extend the logic diagrams developed in Phase 1 to other SPOT properties and test them against a data set of "real-world" preservation metadata provided by the Bibliothèque Nationale de France.

For more information please visit:

Download the report:

8.5x11" format (.pdf: 307K/20pp.)

A4 format (.pdf: 302K/20pp.)

Preservation Topics: Projects, Open Planets Foundation
Categories: Planet DigiPres

Poznan Supercomputing and Networking Center Joins the OPF

29 April 2014 - 10:02am

We are delighted to welcome the Poznan Supercomputing and Networking Center (PSNC) as our latest affiliate member.

‘PSNC brings new expertise and tools to the OPF membership’, explained Ed Fay, Executive Director of OPF. ‘They have developed dArceo, a long-term preservation system which is already used by a number of institutions in Poland to preserve their digital content. As part of their membership contribution they will publicise the service as an open source package. In addition, PSNC will submit improvements to the FITS code base, a tool which is widely used by both OPF members and the wider community’.

‘As we speak about long-term preservation, our long-term goal is to improve PSNC's excellence in the field and share our expertise and tools with the preservation community’, said Tomasz Parkoła, long-term preservation specialist and member of PSNC’s Digital Libraries Team. ‘We believe that OPF can help us bring this to reality, by providing an excellent collaboration and networking environment. We are especially looking forward to sharing knowledge, creating new initiatives, investigating new ideas and running new projects. Our main focus is obviously on research and development activities.’

The Poznan Supercomputing and Networking Center is a public ICT research and development institution working on a broad range of topics, including networks, storage, computing, applications and network services. It has been active in the long-term preservation and archiving domain for several years.

PSNC becomes the 19th member of the Open Planets Foundation, joining libraries, archives, research institutions, universities, and service providers collaborating on shared approaches to digital preservation.

Preservation Topics: Open Planets Foundation
Categories: Planet DigiPres

News from COPTR and DCH-RP on digital preservation tool registries

22 April 2014 - 4:45pm
DCH-RP (Digital Cultural Heritage – Roadmap for Preservation) and the partners of the Community Owned digital Preservation Tool Registry (COPTR), i.e. the Digital Curation Centre (DCC), the Digital Curation Exchange (DCE), the National Digital Stewardship Alliance (NDSA), the Open Planets Foundation (OPF) and the Preserving digital Objects With Restricted Resources (POWRR) project, are investigating the possibility of joining their efforts to set up a common registry of services and tools useful for preserving digital information for the long term. The aim of this registry is to help decision makers select quality, mature, sustainable (maintained) and portable tools with which to plan and implement their digital preservation strategy.

The idea currently under discussion is to merge and integrate the work done to set up the DCH-RP registry into COPTR, in order to collate the knowledge on preservation tools in one single place, thus providing a unique and sustainable reference point for the whole digital preservation community.

The DCH-RP registry of services and tools

The DCH-RP registry collects and describes information and knowledge related to tools, technologies and systems that can be applied for the purposes of digital cultural heritage preservation. It also reviews existing and emerging services developed and offered by R&D projects, public organisations and commercial solution vendors. Whilst providing a broad overview of the existing solutions, the registry initiative focuses on analysing those services and tools that can enable cultural heritage institutions to benefit from the capacities of e-Infrastructures, including cloud and grid systems. Tools and services are categorised by purpose, technologies required, resource formats supported and domain-specific application, among many other criteria.
Alongside this functional description, an attempt has been made (for a subset of the tools and services covered) to provide assessments of each. In the first iteration, the assessment criteria chosen were: popularity, support level, portability, scalability, licensing model, and modularity/openness of architecture.

Help us to select the most relevant and used services for digital preservation!

In order to improve the registry, we have prepared a survey to rank the services that are listed. The questionnaire, which is anonymous, is intended to determine which services are especially interesting to and used by the DCH community. Please help us identify the most relevant services by filling in this survey! The results will be taken into account in the next iteration of the DCH-RP registry and in the set-up of the common registry of services and tools for digital preservation, together with the partners of the Community Owned digital Preservation Tool Registry.

Claudio Prandoni and Paul Wheatley
Categories: Planet DigiPres

Digital Preservation Awards 2014 - nominations now open!

22 April 2014 - 11:56am

We are pleased to announce we are partnering with the Digital Preservation Coalition (DPC) to sponsor the Award for Research and Innovation in the Digital Preservation Awards 2014.

This award is one of five categories available this year. The Award for Research and Innovation recognises significant technical or intellectual accomplishments which practically lower the barriers to effective digital preservation. It will be awarded to the project, initiative or person that, in the eyes of the judges, has produced a tool, framework, standard, service, or approach that has (or will have) the greatest impact in ensuring our digital memory is available tomorrow.

Ed Fay, OPF Executive Director comments:

‘We’re excited to be associated with the awards this year and we’re delighted that this award retains such a realistic focus. Digital preservation could not progress without innovative and practical problem solving. This work is often taken for granted, so it’s important that we celebrate it properly.’

The Digital Preservation Awards were founded in 2004 to recognise the people and organisations that have made significant and innovative contributions to digital preservation.

William Kilbride, Executive Director of the DPC, explains the background to the awards:

‘In its early years, the Digital Preservation Award was a niche category in the Conservation Awards but in each round the judges have been impressed by the increasing quality, range and number of nominations. Last time we added two new awards. This time there will be five. The expansion is a direct result of the growth in importance and sophistication of digital preservation solutions.

 We run these awards for the whole community of people interested in digital preservation. So we’re asking that whole community to spread the word and to support the awards.’

The full criteria for each category and the rules of entry are listed on the DPC website:

The deadline for entries is 28th July 2014. The awards ceremony will take place on Monday 17th November, hosted by the Wellcome Trust, London.

Preservation Topics: Tools, Projects, Open Planets Foundation, Software
Categories: Planet DigiPres

Breaking down walls in digital preservation (part 2)

7 April 2014 - 10:44am

Here is part 2 of the digital preservation seminar which identified ways to break down walls between research & development and daily operations in libraries and archives (continued from Breaking down walls in digital preservation, part 1). The seminar was organised by SCAPE and the Open Planets Foundation in The Hague on 2 April 2014. – Report & photographs by Inge Angevaare, visualisations by Elco van Staveren



Ross King of the Austrian Institute of Technology (and of OPF) kicking off the afternoon session by singlehandedly attacking the wall between daily operations and R&D

Experts meet managers

Ross King of the Austrian Institute of Technology described the features of the (technical) SCAPE project which intends to help institutions build preservation environments which are scalable – to bigger files, to more heterogeneous files, to a large volume of files to be processed. King was the one who identified the wall that exists between daily operations in the digital library and research & development (in digital preservation):


The Wall between Production & R&D as identified by Ross King

Zoltan Szatucsek of the Hungarian National Archives shared his experiences with one of the SCAPE tools from a manager’s point of view: ‘Even trying out the Matchbox tool from the SCAPE project was too expensive for us.’ King admitted that the Matchbox case had not yet been entirely successful. ‘But our goal remains to deliver tools that can be downloaded and used in practice.’

Maureen Pennock of the British Library sketched her organisation’s journey to embed digital preservation [link to slides to follow]. Her own digital preservation department (now at 6 fte) was moved around a few times before it was nested in the Collection Care department which was then merged with Collection management. ‘We are now where we should be: in the middle of the Collections department and right next to the Document processing department. And we work closely with IT, strategy development, procurement/licensing and collection security and risk management.’

The British Library’s strategy calls for further embedding of digital preservation, without taking the formal step of certification

Pennock elaborated on the strategic priorities mentioned above (see slides) by noting that the British Library has chosen not to strive for formal certification within the European Framework (unlike, e.g., the Dutch KB). Instead, the BL intends to hold bi-annual audits to measure progress. The BL intends to ensure that ‘all staff working with digital content understand preservation issues associated with it.’ Questioned by the Dutch KB’s Hildelies Balk, Pennock confirmed that the teaching materials the BL is preparing could well be shared with the wider digital preservation community. Here is Pennock’s concluding comment:


Digital preservation is like a bicycle – one size doesn’t fit everyone … but everybody still recognises the bicycle

Marcin Werla from the Poznan Supercomputing and Networking Centre (PSNC) provided an overview of the infrastructure PSNC is providing for research institutions and for cultural heritage institutions. It is a distributed network based on Poland's fast (20 Gb/s) optical network:



The PSNC network includes facilities for long-term preservation

Interestingly, the network services mostly smaller institutions. The Polish National Library and Archives have built their own systems.

Werla stressed that proper quality control at the production stage is difficult because of the bureaucratic Polish public procurement system.

Heiko Tjalsma of the Dutch research data archive DANS pitched the 4C project, which was established to ‘create a better understanding of digital curation costs through collaboration.’


Tjalsma: ‘We can only get a better idea of what digital curation costs by collaborating and sharing data’

At the moment there are several cost models available in the community (see, e.g., earlier posts), but they are difficult to compare. The 4C project intends to a) establish an international curation cost exchange framework, and b) build a Cost Concept Model – which will define what to include in the model and what to exclude.

The need for a clearer picture of curation costs is undisputed, but, Tjalsma added, ‘it is very difficult to gather detailed data, even from colleagues.’ Our organisations are reticent to make their financial data available. And both ‘time’ and ‘scale’ make matters more difficult. The only way forward seems to be anonymisation of data, and for that to work, the project must attract as many participants as possible. So: please register at – and participate.

Building bridges between expert and manager

The last part of the day was devoted to building bridges between experts and managers. Dirk von Suchodeletz of the University of Freiburg introduced the session with a subject that is often considered ‘expert-only’: emulation.


Dirk von Suchodeletz: ‘The EaaS project intends to make emulation available for a wider audience by providing it as a service.’

The emulation technique has been around for a while, and it is considered one of the few preservation methods available for very complex digital objects – but take-up by the community has been slow, because it is seen as too complex for non-experts. The Emulation as a Service project intends to bridge the gap to practical implementation by taking away many of the technical worries from memory institutions. A demo of Emulation as a Service is available for OPF members. Von Suchodeletz encouraged his audience to have a look at it, because the service can only be made to work if many memory institutions decide to participate.



Getting ready for the last roundtable discussion about the relationship between experts and managers

How R&D and the library business relate

‘What inspired the EaaS project,’ Hildelies Balk (KB) wanted to know from von Suchodeletz, ‘was it your own interest or was there some business requirement to be met?’ Von Suchodeletz admitted that it was his own research interest that kicked off the project; business requirements entered the picture later.

Birgit Henriksen of the Royal Library, Denmark: ‘We desperately need emulation to preserve the games in our collection, but because it is such a niche, funding is hard to come by.’ Jacqueline Slats of the Dutch National Archives echoed this observation: ‘The NA and the KB together developed the emulation tool Dioscuri, but because there was no business demand, development was halted. We may pick it up again as soon as we start receiving interactive material for preservation.’

This is what happened next, as visualised by Elco van Staveren:

Some highlights from the discussions:

  • Timing is of the essence. Obviously, R&D is always ahead of operations, but if it is too far ahead, funding will be difficult. Following user needs is no good either, because then R&D becomes mere procurement. Are there any cases of proper just-in-time development? Barbara Sierman of the KB suggested Jpylyzer (translation of Jpylyzer for managers) – the need arose for quality control in a massive TIFF-to-JPEG 2000 migration at the KB intended to cut costs, and R&D delivered.
  • Another successful implementation: the PRONOM registry. The National Archives had a clear business case for developing it. On the other hand, the GDFR technical registry did not tick the boxes of timeliness, impetus and context.
  • For experts and managers to work well together, managers must start accepting a certain amount of failure. We are breaking new ground in digital preservation; failures are inevitable. Can we make managers understand that even failures make us stronger, because the organisation gains a lot of experience and knowledge? And what is an acceptable failure rate? Henriksen suggested that managing expectations can do the trick: ‘Do not expect perfection.’


Some of the panel members (from left to right) Maureen Pennock (British Library), Hildelies Balk (KB), Mies Langelaar (Rotterdam Municipal Archives), Barbara Sierman (KB) and Mette van Essen (Dutch National Archives)


  • We need a new set of metrics to define success in the ever changing digital world.
  • Positioning the R&D department within Collections can help make collaboration between the two more effective (Andersen, Pennock). Henriksen: ‘At the Danish Royal Library we have started involving both R&D and collections staff in scoping projects.’
  • And then again … von Suchodeletz suggested that sometimes a loose coupling between R&D and business can be more effective, because staff in operations can get too bogged down by daily worries.
  • Sometimes breaking down the wall is just too much to ask, suggested van Essen. We may have to decide to jump the wall instead, at least for the time being.
  • Bridge builders can be key to making projects succeed, staff members who speak both the languages of operations and of R&D. Balk and Pennock stressed that everybody in the organisation should know about the basics of digital preservation.
  • Underneath all of the organisation’s doings must lie a clear common vision to inspire individual actions, projects and collaboration.

In conclusion: participants agreed that this seminar had been a fruitful counterweight to technical hackathons in digital preservation. More seminars may follow. If you participated (or read these blogs), please use the commentary box for any corrections and/or follow-up.

‘In an ever changing digital world, we must allow for projects to fail – even failures bring us lots of knowledge.’

Slides from presentations at

Preservation Topics: SCAPE
Categories: Planet DigiPres

Breaking down walls in digital preservation (part 1)

5 April 2014 - 11:20am

People & knowledge are the keys to breaking down the walls between daily operations and digital preservation (DP) within our organisations. DP is not a technical issue, but information technology must be embraced as a core feature of the digital library. Such were some of the conclusions of the seminar organised by the SCAPE project/Open Planets Foundation at the Dutch National Library (KB) and National Archives (NA) on Wednesday 2 April. - Report & photographs by Inge Angevaare, visualisations by Elco van Staveren


Newcomer questions some current practices

Menno Rasch was appointed Head of Operations at the Dutch KB 6 months ago – but ‘I still feel like a newcomer in digital preservation.’ His division includes the Collection Care department, which is responsible for DP, but there are close working relationships with the Research and IT departments in the Innovation Division. Rasch’s presentation about embedding DP in business practices at the KB posed some provocative questions:


Menno Rasch: 'Do correct me if I'm wrong'


  • We have a tendency to cover up our mistakes and failures rather than expose them and discuss them in order to learn as a community. That is what pilots do. The platform is there, the Atlas of Digital Damages set up by the KB's Barbara Sierman, but it is being underused. Of course lots of data are protected by copyright or privacy regulations, but there surely must be some way to anonymise the data.

  • In libraries and archives, we still look upon IT as 'the guys that make tools for us'. 'But IT = the digital library.'

  • We need to become more pragmatic. Implementing the OAIS standard is a lot of work - perhaps it is better to take this one step at a time.

  • 'If you don't do it now, you won't do it a year from now.'

  • 'Any software we build is temporary - so keep the data, not the software.'

  • Most metadata are reproducible - so why not store them in a separate database and put only the most essential preservation metadata in the OAIS information package? That way we can continue improving the metadata. Of course these must be backed up too (an annual snapshot?), but may tolerate a less expensive storage regime than the objects.

  • About developments at the KB: 'To replace our old DIAS system, we are now developing software to handle all of our digital objects - which is an enormous challenge.'




SCAPE/OPF seminar on Managing Digital Preservation, 2 April 2014, The Hague

Digital collections and the Titanic

Zoltan Szatucsek from the Hungarian National Archives used the Titanic as his presentation’s metaphor, without necessarily implying that we are headed for the proverbial iceberg, he added. Although… ‘many elements from the Titanic story can illustrate how we think’:

  • Titanic received many warnings about ice formations, and yet it was sailing at full speed when disaster struck.
  • Our ship – the organisation – is quite conservative. It wants to deal with digital records in the same way it deals with paper records. And at the Hungarian National Archives IT and archivist staff are in the same department, which does not work because they do not speak each other’s language.



Zoltan Szatucsek argued that putting IT staff and archivists together in the Hungarian National Archives caused ‘language’ problems; his Danish colleagues felt that in their case close proximity had rather helped improve communications


  • The captain must acquire new competences. He must learn to manage staff, funding, technology, equipment, etc. We need processes rather than tools.
  • The crew is in trouble too. Their education has not adapted to digital practices. Underfunding in the sector is a big issue. Strangely enough, staff working with medieval resources were much quicker to adopt digital practices than those working with contemporary material. They seem to want to put off any action until legal transfer to the archives actually occurs (15-20 years).
  • Echoing Menno Rasch’s presentation, Szatucsek asked the rhetorical question: ‘Why do we not learn from our mistakes?’ A few months after Titanic, another ship went down in similar circumstances.
  • Without proper metadata, objects are lost forever.
  • Last but not least: we have learned that digital preservation is not a technical challenge. We need to create a complete environment in which to preserve.

Are our digital collections heading for the iceberg as well? A visualisation of Szatucsek's presentation


OPF: trust, confidence & communication

Ed Fay was appointed director of the Open Planets Foundation (OPF) only six weeks ago. But he presented a clear vision of how the OPF should function within the community, squarely in the middle: as a steward of tools, a champion of open communications, trust & confidence, and a broker between commercial and non-commercial interests:


Ed Fay’s vision of the Open Planets Foundation’s role in the digital preservation community

Fay also shared some of his experiences in his former job at the London School of Economics:


Ed Fay illustrated how digital preservation was moved around a few times in the London School of Economics Library, until it found its present place in the Library division

So, what works, what doesn’t?

The first round-table discussion was introduced by Bjarne Andersen of the Statsbiblioteket Aarhus (DK). He sketched his institution’s experiences in embedding digital preservation.


Bjarne Andersen (right) conferring with Birgit Henriksen (Danish Royal Library, left) and Jan Dalsten Sorensen (Danish National Archives). ‘SCRUM has helped move things along’

He mentioned the recently introduced SCRUM-based methodology as really having helped to move things along – it is an agile way of working which allows for flexibility. The concept of ‘user stories’ helps to make staff think about the ‘why’. Menno Rasch (KB) agreed: ‘SCRUM works especially well if you are not certain where to go. It is a step-by-step methodology.’

Some other lessons learned at Aarhus:

  • The responsibility for digital preservation cannot be with the developers implementing the technical solutions
  • The responsibility needs to be close to ‘the library’
  • Don’t split the analogue and digital library entirely – the two have quite a lot in common
  • IT development and research are necessary activities to keep up with a changing landscape of technology
  • Changing the organisation a few times over the years helped us educate the staff by bringing traditional collection/library staff close to IT for a period of time.



Group discussion. From the left: Jan Dalsten Sorensen (DK), Ed Fay (OPF), Menno Rasch (KB), Marcin Werla (PL), Bjarne Andersen (DK), Elco van Staveren (KB, visualising the discussion), Hildelies Balk (KB) and Ross King (Austria)

And here is how Elco van Staveren visualised the group discussion in real time:

Some highlights from the discussion:

  • Embedding digital preservation is about people
  • It really requires open communication channels.
  • A hierarchical organisation and/or an organisation with silos only builds up the wall. Engaged leadership is called for. And result-oriented incentives for staff rather than hierarchical incentives.
  • Embedding digital preservation in the organisation requires a vision that is shared by all.
  • Clear responsibilities must be defined.
  • Move the budgets to where the challenges are.
  • The organisation’s size may be a relevant factor in deciding how to organise DP. In large organisations, the wheels move slowly (no. of staff in the Hungarian National Archives 700; British Library 1,500; Austrian National Library 400; KB Netherlands 300, London School of Economics 120, Statsbiblioteket Aarhus 200).
  • Most organisations favour bringing analogue and digital together as much as possible.
  • When it comes to IT experts and librarians/archivists learning each other’s languages, it was suggested that hard-core IT staff need not get too deeply involved in library issues – in fact, some IT staff might consider it bad for their careers. Software developers, however, do need to get involved in library/archive affairs.
  • Management must also be taught the language of the digital library and digital preservation.

(Continued in Breaking down walls in digital preservation, part 2) [link to follow]

Seminar agenda and links to presentations


Preservation Topics: SCAPE
Categories: Planet DigiPres

Web-Scale Data Mining for Digital Preservation

3 April 2014 - 5:50pm

Recent years have seen ever-increasing interest in developing Data Mining methods that allow us to find structured information in very large collections of data ("Big Data"). In this complex and emerging field, the digital preservation community may play an interesting role:

1. Information needs. On the one hand, the digital preservation community is actively developing tools to identify preservation risks, events and opportunities. As I highlight further on, this points to diverse and complex information needs that Big Data Analytics methods may help address.

2. Large scale data and processing. On the other hand - and this is even more significant - the digital preservation community has both a unique access to very large data sets and the necessary infrastructure and experience to perform data-parallel processing on this data.

Taken together, this points to the potential of actively leveraging the data we preserve in order to make more informed digital preservation decisions.


An Example from SCAPE

In the SCAPE project, we are investigating scenarios in which we address information needs from digital preservation using large-scale data mining. We presented one such scenario at last year's iPres conference (slides here, paper here). In this scenario, we mined the Web for a simple piece of information:

Which publisher is responsible for which content?

Such information is currently aggregated in repositories such as the Keepers Registry – check out the following screenshot from Keepers, which shows how they archive a journal from the area of "Big Data". This includes the journal title, its ISSN, its publisher and the archiving agency:

Our goal was to automatically find more journals and their publishers in order to make such repositories more complete than they currently are.


Example Continued: Information Extraction on the Web

We implemented an Information Extraction (IE) system and executed it on a collection of crawled Web pages from the area of preservation. We were especially interested in sentences like the following:


"In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK."


"The American Journal of Preventive Medicine is the official journal of the American College of Preventive Medicine and the Association for Prevention Teaching and Research."


We performed deep syntactic analysis on such sentences and applied so-called lexico-syntactic patterns (such as "X acquired Y" or "Y is journal of X") to extract structured information from matching sentences. As a result, we extracted thousands of journal-publisher pairs, examples of which are given in the following table: 


Journal                            Publisher
A Journal of Human Environment     Royal Swedish Academy of Sciences
AAPS Journal                       American Association of Pharmaceutical Scientists
Acta Radiologica                   Scandinavian Society of Radiology
...                                ...
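To give a feel for how such patterns work, here is a deliberately crude surface-level sketch in the shell. The real system performs deep syntactic analysis rather than plain regular-expression matching, and the sample sentences and file names below are invented for illustration:

```shell
# Two example sentences of the kind found in the crawled corpus (made up here).
cat > sentences.txt <<'EOF'
The American Journal of Preventive Medicine is the official journal of the American College of Preventive Medicine.
In 1991 Elsevier acquired Pergamon Press in the UK.
EOF

# Apply the surface pattern "X is the official journal of Y" and emit "X -> Y".
sed -nE 's/^(.+) is the official journal of (the )?([^.]+)\..*$/\1 -> \3/p' \
    sentences.txt > pairs.txt
cat pairs.txt
```

Only the first sentence matches the pattern, so a single journal–publisher pair is extracted; the acquisition sentence would need its own pattern ("X acquired Y").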


A manual evaluation revealed that 50% of all journal-publisher pairs found with this method were not in the Keepers Registry, but were correct and should be added. This shows how IE can be used to address information needs from the digital preservation community.


Try it Out: Build Your Own Extractor!

For demonstration purposes, our Information Extraction system is now online as a workbench HERE. It executes IE on-the-fly on a very large corpus of over 160 million sentences crawled from the Web.

1. Try the examples. At the top left corner, there are some examples that you may choose from to get introduced to the system. Besides the journal-publisher use case, we have created an example extractor that identifies which tool supports which file format.

2. Try creating your own. By selecting lexico-syntactic patterns and entity type restrictions, you can create your own extractors. You can export the result tables using the export link at the bottom right. You can also create a permalink to share an extractor you have created by clicking on the icon at the bottom left.

Try it out! By clicking on the question mark in the top middle of the page you get more detailed usage instructions for the workbench.



We will demonstrate the system at the upcoming SCAPE Developer Workshop in Den Haag. Until then, some GUI details may change to make its use more intuitive. I look forward to many interesting discussions :)

Categories: Planet DigiPres

Two or more things that I learned at the "Preserving Your Preservation Tools" workshop

2 April 2014 - 6:37am
These have been two busy days in Den Haag, where Carl Wilson from the OPF tried to show us how to use tools in order to have clean environments and well-behaved installation procedures that will "always" work.

The use of Vagrant (connected to the appropriate provider, in our experimental case VirtualBox) allows us to start from a pristine box and to experiment with installation procedures. With everything scripted or, better said, automatically provisioned, we can repeat our attempts until we reach an exactly clean and complete installation. The important point is that, once this goal is attained, sharing it is easy: just publish the steps in a code repository.

The second day was devoted to real experiments. We began by looking at how this has been done for jpylyzer, the indispensable tool for validating JPEG2000 files, created by Johan van der Knijff from the National Library of the Netherlands, which hosted the event with the traditional Dutch welcome. Then we turned to the old but precious JHOVE tool, from Gary McGath, which recently migrated to GitHub and is actively being transformed to use a Maven build process thanks to the efforts of Andrew Jackson and Will Palmer. A first (not so quick but dirty) Debian package was obtained at the end of the session, providing an automatic installation of the tool on Linux boxes. It takes care of installing the script that hides the infamous Java idiomatics and of providing a default configuration file, so that you can launch a simple jhove -v just after installing it, and it works!

Another thing that attracted my attention was the use of Vagrant as a simple way of making sure that every developer works against the same environment, so that there can be no misconfiguration. If other tools are needed, an automatic provision can be established and distributed around. Of course, the same process can be applied in production, making deployment as smooth as possible. So it now appears easy to define a base (or reference) environment together with the exact list of extra dependencies that allow a given program to run.

From a preservation perspective this is quite enlightening, and it is very close to the work done by the PREMIS group on describing environments. We can then think about transforming the provisioning script into a PREMIS environment description, so that we have not only an operational way of performing emulation but also a standard description of it. The base environment could be collected in a repository and properly described. The extra steps needed to revive a program could then be embedded in the AIP of the program or of the data we try to preserve. Incidentally, while we were working on these virtual environments, Microsoft announced the release of the source code of MS-DOS 2.0. This makes me wonder whether we could rebuild an MS-DOS box from scratch and use it as a base reference environment for all those "old" (only some thirty years old) programs. All in all, these two days went by so quickly that we only just had time for a Dutch break along the Plein; but they were fruitful in giving us the aim of coming up with easier-to-use and better-documented tools that we can rely on to build great preservation repositories.
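To make the provisioning idea concrete, here is a minimal sketch of the kind of shell provisioner one would wire into a Vagrantfile. The package names are my assumptions, not the workshop's exact list, and the script is only written to disk and syntax-checked here, since actually running it requires root on a fresh box:

```shell
# Hypothetical Vagrant shell provisioner: every `vagrant up` replays the same
# steps on a pristine box, so every developer gets an identical environment.
cat > provision.sh <<'EOF'
#!/bin/sh
set -e                              # stop at the first failing step
apt-get update
apt-get install -y openjdk-7-jdk maven git
EOF
bash -n provision.sh && echo "provision.sh parses OK"
```

Publishing provision.sh alongside the Vagrantfile in a code repository is what makes the clean installation shareable and repeatable.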


Preservation Topics: Packaging, SCAPE, jpylyzer
Categories: Planet DigiPres

ARC to WARC migration: How to deal with de-duplicated records?

24 March 2014 - 4:13pm

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further. This is why I am now proposing one possible way to deal with de-duplicated records in an ARC to WARC migration scenario.

Before entering into specifics, let me briefly recall what is meant by "de-duplication": it is a mechanism used by a web crawler to reference identical content that was already stored when visiting a web site at a previous point in time. Its main purpose is to avoid storing content redundantly and thereby reduce the required storage capacity.

The Netarchive Suite uses a Heritrix module for de-duplication, which takes place on the level of a harvest definition. The following diagram roughly outlines the most important information items and their dependencies.


The example shows two subsequent jobs executed as part of the same harvest definition. Depending on configuration parameters, such as the desired size of the ARC files, each crawl job creates one or more ARC container files and a corresponding crawl metadata file. In the example above, the first crawl job (1001) produced two ARC files, each containing ARC metadata, a DNS record and one HTML page. Additionally, the first ARC file contains a PNG image file that was referenced in the HTML file. The second crawl job (1002) produced equivalent content, except that the PNG image file is not contained in the first ARC file of this job; it is only referred to as a de-duplicated item in the crawl-metadata using the notation {job-id}-{harvest-id}-{serialno}.

The question is: Do we actually need the de-duplication information in the crawl-metadata file? If an index (e.g. a CDX index) is created over all ARC container files, we know – or better: the wayback machine knows – where a file can be located, and in this sense the de-duplication information could be considered obsolete. We would only lose the information about which crawl job the de-duplication actually took place in, and this concerns the informational integrity of a crawl job because external dependencies would no longer be explicit. Therefore, the following is a proposed way to preserve this information in a WARC-standard-compliant way.

Each content record of the original ARC file is converted to a response-record in the WARC file, as illustrated in the bottom left box in the diagram above. Any request/response metadata can be added as a header block to the record payload or as a separate metadata-record that relates to the response-record.

The de-duplicated information items available in the crawl-metadata file are converted to revisit-records, as illustrated in the bottom right box, in a separate WARC file (one per crawl-metadata file). The payload digest must be equal and should state that the completeness of the referred record was checked successfully. The WARC-Refers-To property refers to the WARC record that contains the record payload; additionally, the fact that Content-Length is 0 explicitly states that the record payload is not available in the current record and that it is to be located elsewhere.
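As a rough sketch, such a revisit-record could look like the following. The URI, date, digest and record id are placeholders, and the header set is trimmed to the properties discussed above; the identical-payload-digest profile URI is the one defined for this purpose by the WARC specification:

```
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://www.example.org/images/logo.png
WARC-Date: 2014-03-24T10:00:00Z
WARC-Payload-Digest: sha1:<digest of the referenced payload>
WARC-Refers-To: <urn:uuid:record-id-of-the-response-record>
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
Content-Type: application/http; msgtype=response
Content-Length: 0
```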

Taxonomy upgrade extras: SCAPE, SCAPEProject, SCAPE-Project, Migration, Web Archiving, ARC, WARC, ARC to WARC
Preservation Topics: Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

CSV Validator - beta releases

21 March 2014 - 2:51pm

For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against a user-defined schema.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under the Mozilla Public Licence version 2.0.

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.

Preservation Topics: Tools
Categories: Planet DigiPres

A Tika to ride; characterising web content with Nanite

21 March 2014 - 1:58pm

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test that checks that the input files are valid gzip files.  This is very quick (it takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
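A stand-alone approximation of that sanity check can be sketched in the shell (the file names here are invented): `gzip -t` returns non-zero for corrupt members, so broken (W)ARCs can be filtered out before committing to a multi-hour run.

```shell
# Make one valid and one deliberately invalid "gz" file, then test each the
# way a pre-flight check might: gzip -t fails on corrupt input.
printf 'payload' | gzip > good.warc.gz
printf 'not really gzip' > bad.warc.gz
for f in good.warc.gz bad.warc.gz; do
  if gzip -t "$f" 2>/dev/null; then echo "$f OK"; else echo "$f INVALID"; fi
done > gzcheck.txt
cat gzcheck.txt
```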

We have been working on Nanite to add different characterisation libraries and to improve their coverage.  As the tools used are all Java, or called via native libraries, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster – less than 1ms per file.  Libraries can be turned on/off relatively easily by editing the source or the jar.

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to have been files in that corpus that were corrupt or otherwise broken that would cause crashes in Tika or its dependencies. 

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser  that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok, however, it made use of Thread.stop() – a deprecated API call to stop a Java Thread.  Use of this API call is thoroughly not recommended as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further. 

I should note that it has subsequently been suggested that an alternative to using Thread.stop() is to just leave the thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracts a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read – although such errors should be logged by Nanite in the Hadoop Reducer output.

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by how long each process is used for.  Aside from the cost of transferring files to the Tika-server, the overheads are a larger jar file, a longer initial start-up time for Mappers, and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds added by using a process-isolated Tika do not amount to a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with each file, Tika detect() is first run on the file and the resulting mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST as it would be if we called them directly from Java code.
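The request shape is roughly the following sketch. The :9998 port and /meta endpoint are tika-server defaults to the best of my knowledge, and the file name is invented; since no server is running here, the block only assembles and records the request rather than sending it:

```shell
# Result of an earlier Tika detect() call (hypothetical value).
DETECTED="text/html"

# Build the request the library would send over loopback; note the detected
# mimetype travelling as a Content-Type header.
cat > request.txt <<EOF
curl -s -X PUT --data-binary @page.html -H "Content-Type: ${DETECTED}" http://localhost:9998/meta
EOF
cat request.txt
```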

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node); the results are as follows:

Identification tools used: Nanite-core (Droid), Tika detect() (mimetype only), ProcessIsolatedTika parsers

WARC files: 1000

Total WARC size: 59.4GB (63,759,574,081 bytes)

Total files in WARCs (# input records): –

Runtime (hh:mm:ss): –

Total Tika parser output size (compressed): 765MB (801,740,734 bytes)

Tika parser failures/crashes: –

Misc failures: Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into C3PO, but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed PDF version detection within the PDF parser.

As an aside - if FITS offered a REST interface, then the ProcessIsolatedTika code could easily be modified to replace Tika with FITS.  This would be worth considering, if there were interest and someone were to create such a REST interface.

Apologies for the puns.

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres

Three years of SCAPE

18 March 2014 - 12:24pm

SCAPE is proud to look back at another successful project year. During the third year the team produced many new tools, e.g. ToMaR, a tool which wraps command line tools into Hadoop MapReduce jobs. Other tools like xcorrSound and C3PO have been developed further.

This year’s All-Staff Meeting took place in mid-February in Póvoa de Varzim, Portugal. The team organised a number of general sessions, during which the project partners presented demos of, and elevator pitches for, the tools and services they have developed in SCAPE. It was very interesting for all meeting participants to see the results achieved so far. The demos and pitches were also useful for re-focusing on the big picture of SCAPE. During the main meeting sessions the participants mainly focused on uptake and productization of SCAPE tools.

Another central topic of the meeting was integration. Until the end of the project the partners will put an emphasis on integrating the results further. To prove the scalability of the tools, the team set up a number of operational Hadoop cluster instances (both central and local), which are currently being used for the evaluation of the tools and workflows.

Another focus lies on the sustainability of SCAPE tools. The SCAPE team is working towards documenting the tools for both developers and users. SCAPE outcomes will be curated by the Open Planets Foundation beyond the end of the project, which will keep them available.

In September 2014 SCAPE is organising a final event in collaboration with APARSEN. The workshop is planned to take place at the Digital Libraries 2014 conference in London, where SCAPE will have its final, overall presentation. The workshop is directed towards developers, content holders, and data managers. The SCAPE team will present tools and services developed since 2011. A special focus will lie on newly and further developed open source tools for scalable preservation actions; SCAPE’s scalable Platform architecture; and its policy-based Planning and Watch solutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

ToMaR - How to let your preservation tools scale

14 March 2014 - 4:01pm

Whenever you find that you have got used to a command line tool and all of a sudden need to apply it to a large number of files over a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.

Mathilda is working at the department for digital preservation at a famous national library. In her daily work she has to cope with various well-known tasks like data identification, migration and curation. She is experienced in using the command shell on a Unix system and occasionally has to write small scripts to perform a certain workflow effectively.

When she has to deal with a few hundred files she usually invokes her shell script on one file after the other, using a simple loop for automation. But today she has been put in charge of a much bigger data set than she is used to. There are one hundred thousand TIFF images which need to be migrated to JPEG2000 images in order to save storage space. Intuitively she knows that processing these files one after the other, with each single migration taking about half a minute, would take over a month to run.
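Her sequential loop might look like the sketch below. The directory layout is invented, and the block only emits the command list rather than running the OpenJPEG encoder, since `image_to_j2k` is unlikely to be installed here; piping migrate_all.sh to `sh` would run the migrations one after another:

```shell
# Sequential baseline: one image_to_j2k invocation per TIFF (hypothetical paths).
mkdir -p tiffs && touch tiffs/page1.tif tiffs/page2.tif
for f in tiffs/*.tif; do
  echo "image_to_j2k -i $f -o ${f%.tif}.jp2"
done > migrate_all.sh
cat migrate_all.sh
```

This is exactly the shape of workload that parallelising over a cluster speeds up: each iteration is independent of the others.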

Luckily Mathilda has heard of the Hadoop cluster that colleagues of hers have recently set up in order to do some data mining on a large collection of text files. "Would there be a way to run my file migration tool on that cluster thing?", she thinks. "If I could run it in parallel on all these machines, that would speed up my migration task tremendously!" Only one thing makes her hesitate: she has hardly any Java programming skills, not to mention any idea of that MapReduce programming paradigm they are using in their data mining task. How to make her tool scale?

That's where ToMaR, the Tool-to-MapReduce Wrapper, comes in!

What can ToMaR do?

If you have a running Hadoop cluster, you are only three little steps away from running your preservation tools on thousands of files almost as efficiently as with a native single-purpose Java MapReduce application. ToMaR wraps a command line tool into a Hadoop MapReduce job which executes the command on all the worker nodes of the Hadoop cluster in parallel. Depending on the tool you want to use through ToMaR, it might be necessary to install it on each cluster node beforehand. Then all you need to do is:

  1. Specify your tool so that ToMaR can understand it using the SCAPE Tool Specification Schema.
  2. Itemize the parameters of the tool invocation for each of your input files in a control file.
  3. Run ToMaR.

Through MapReduce, your list of parameter descriptions in the control file will be split up and assigned to the nodes portion by portion. For instance, ToMaR could be configured to create splits of 10 lines each from the control file. Each node then parses its portion line by line and invokes the tool with the parameters specified therein.

File Format Migration Example

So how might Mathilda tackle her file format migration problem? First she will have to make sure that her tool is installed on each cluster node; her colleagues who maintain the Hadoop cluster will take care of this requirement. It is up to her to create the Tool Specification Document (ToolSpec) using the SCAPE Tool Specification Schema and to itemize the tool invocation parameter descriptions. The following figure depicts the required workflow:

Create the ToolSpec

The ToolSpec is an XML file which contains several operations. An operation consists of a name, a description, a command pattern and input/output parameters. The operation for Mathilda's file format migration tool might look like this:

<operation name="image-to-j2k">
  <description>Migrates an image to jpeg2000</description>
  <command>
    image_to_j2k -i ${input} -o ${output} -I -p RPCL -n 7
    -c [256,256],[256,256],[128,128],[128,128],[128,128],[128,128],[128,128]
    -b 64,64
    -r 320.000,160.000,80.000,40.000,20.000,11.250,7.000,4.600,3.400,2.750,2.400,1.000
  </command>
  <inputs>
    <input name="input" required="true">
      <description>Reference to input file</description>
    </input>
  </inputs>
  <outputs>
    <output name="output" required="true">
      <description>Reference to output file. Only *.j2k, *.j2c or *.jp2!</description>
    </output>
  </outputs>
</operation>

In the <command> element she has put the actual command line with a long tail of static parameters. This example highlights another advantage of the ToolSpec: You gain the ease of wrapping complex command lines in an atomic operation definition which is associated with a simple name, here "image-to-j2k". Inside the command pattern she puts placeholders which are replaced by various values. Here ${input} and ${output} denote such variables so that the value of the input file parameter (-i) and the value of the output file parameter (-o) can vary with each invocation of the tool.

Along with the command definition, Mathilda has to describe these variables in the <inputs> and <outputs> sections. For ${input}, being the placeholder for an input file, she has to add an <input> element with the name of the placeholder as an attribute. The same goes for the ${output} placeholder. Additionally she can add some descriptive text to these input and output parameter definitions.

More constructs are possible with the SCAPE Tool Specification Schema than can be covered here. The full contents of this ToolSpec can be found in the file attachments.

Create the Control File

The other essential requirement Mathilda has to fulfil is the creation of the control file. This file contains the real values for the tool invocation, which are mapped to the ToolSpec by ToMaR. Given the example above, her control file will look something like this:

openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"
openjpeg image-to-j2k --input="hdfs://myFile2.tif" --output="hdfs://myFile2.jp2"
openjpeg image-to-j2k --input="hdfs://myFile3.tif" --output="hdfs://myFile3.jp2"
...

The first word refers to the name of the ToolSpec ToMaR shall load. In this example the ToolSpec is called "openjpeg.xml", but only the name without the .xml extension is needed for the reference. The second word refers to an operation within that ToolSpec: the "image-to-j2k" operation described in the ToolSpec example snippet above.

The rest of the line contains references to input and output parameters. Each reference starts with a double dash followed by a parameter name/value pair. So --input (and likewise --output) refers to the parameter named "input" in the ToolSpec, which in turn refers to the ${input} placeholder in the command pattern. The values are file references on Hadoop's Distributed File System (HDFS).

As Mathilda has 100k TIFF images, she will have 100k lines in her control file. Since she knows how to use the command shell, she quickly writes a script which generates this file for her.
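Her generator script might look something like this sketch. The listing file and HDFS paths are invented to match the example above; in real life the listing would come from something like `hadoop fs -ls` rather than a hand-written file:

```shell
# A plain listing of the HDFS TIFF paths to migrate (hypothetical paths).
cat > tiff-list.txt <<'EOF'
hdfs://myFile1.tif
hdfs://myFile2.tif
hdfs://myFile3.tif
EOF

# One control-file line per TIFF: ToolSpec name, operation, then parameters.
while read -r tif; do
  echo "openjpeg image-to-j2k --input=\"$tif\" --output=\"${tif%.tif}.jp2\""
done < tiff-list.txt > controlfile.txt
cat controlfile.txt
```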

Run ToMaR

Having created the ToolSpec openjpeg.xml and the control file controlfile.txt, she copies openjpeg.xml into the HDFS directory "hdfs:///user/mathilda/toolspecs" and executes the following command on the master node of the Hadoop cluster:

hadoop jar ToMaR.jar -i controlfile.txt -r hdfs:///user/mathilda/toolspecs

Here she feeds in controlfile.txt and the location of her ToolSpecs, and ToMaR does the rest. It splits up the control file and distributes a certain number of lines per split to each node. The ToolSpec is loaded and the parameters are mapped to the command line pattern contained in the named operation. Input files are copied from HDFS to the local file system. Once the placeholders have been replaced by the values, the command line can be executed by the worker node. After that, the resulting output file is copied back to HDFS to the given output location.

Finally Mathilda has got all the migrated JPEG2000 images on HDFS in a fraction of the time it would have taken when run sequentially on her machine.

To sum up, with ToMaR you can:

  • easily take up external tools, with a clear mapping between the instructions and the physical invocation of the tool
  • use the SCAPE ToolSpec, as well as existing ToolSpecs, and its advantage of associating simple keywords with complex command-line patterns
  • get by without programming skills, as the minimum requirement is only to set up the control file

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.

ToMaR offers the possibility to use existing command-line tools in Hadoop's distributed environment very similarly to a desktop computer. By utilizing SCAPE Tool Specification documents, ToMaR allows users to associate complex command-line patterns with simple keywords, which can be referenced for execution on a computer cluster. ToMaR is a generic MapReduce application which does not require any programming skills.

Check out the following blog posts for further usage scenarios of ToMaR:


Preservation Topics: Preservation Actions, SCAPE
Categories: Planet DigiPres

Some reflections on scalable ARC to WARC migration

7 March 2014 - 1:56pm

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving, where long-term preservation is of direct relevance to different task areas, like harvesting, storage, and access.

Web archives usually consist of large data collections of multi-terabyte size, the largest archive being the Internet Archive, which according to its own statements stores about 364 billion pages occupying around 10 petabytes of storage. And the International Internet Preservation Consortium (IIPC), with its over 40 members worldwide, shows how diverse these institutions are, each with a different focus regarding the selection of content or the type of material they are interested in.

It is up to this international community and to the individual member institutions to ensure that archived web content can be accessed and displayed correctly in the future. And this is a real challenge; the reason lies in the nature of the content, which is like the internet itself: diverse in the use of formats for publishing text and multimedia content, using a rich variety of standards and programming languages, enabling interactive user experiences, database-driven web sites, strongly interlinked functionality, involving social media content from numerous sources, etc. That is to say, apart from the sheer size, it is the heterogeneity and complexity of the data that pose the significant challenge for collection curators, developers, and long-term preservation specialists.

One of the topics the International Internet Preservation Consortium (IIPC) is dealing with is the question of how web archive content should actually be stored for the long term. Originally, content used to be stored in the ARC format as proposed by the Internet Archive. The format was designed to hold multiple web resources aggregated in a single – optionally compressed – container file. But it was never intended as a format for storing content for the long term: it lacked features for adding contextual information in a standardised way. For this reason, the new WARC format was created as an ISO standard to provide additional features, especially the ability to hold harvested content, as well as any metadata related to it, in a self-contained manner.

An important pragmatic aspect of web archiving is that while some content changes continuously, much of it remains static between visits. In order to preserve the changes, web pages are harvested by crawl jobs run at a certain frequency. Storing the same content at each visit would store it redundantly and make inefficient use of storage.

For this reason, the Netarchive Suite, originally developed by the Royal Library and the State and University Library, and in the meantime used by other libraries as well, provides a mechanism called “deduplication” which detects that content was already retrieved and therefore references the existing payload content. The information about where the referenced content is actually stored is only available in the crawl log files, which means that if a crawl log file is missing, there is no knowledge of any referenced content. In order to display a single web page with various images, for example, the wayback machine needs to know where to find content that may be scattered over various ARC container files. An index file, e.g. an index in the CDX file format, contains the required information, and to build this index it is currently necessary to involve both ARC files and crawl log files in the index building process.
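To make the index lookup concrete, here is a minimal sketch against an invented two-line CDX file. The field layout shown (URL key, date, original URL, MIME type, status, digest, redirect, offset, container filename) is one common CDX profile; the filenames, offsets and digests are made up:

```shell
# Hypothetical CDX lines: url-key date original-url mime status digest
# redirect offset container-filename (real CDX profiles vary).
cat > sample.cdx <<'EOF'
example.org/img/logo.png 20100301120000 http://example.org/img/logo.png image/png 200 AAAABBBB - 1234 IAH-20100301-00001.arc.gz
example.org/index.html 20100301120001 http://example.org/index.html text/html 200 CCCCDDDD - 98765 IAH-20100301-00002.arc.gz
EOF

# For a given URL key, report which container holds the payload and at
# which byte offset -- the lookup the wayback machine performs per resource.
lookup() {
  awk -v key="$1" '$1 == key { print $9 " @ offset " $8 }' sample.cdx
}

lookup "example.org/img/logo.png"
# prints: IAH-20100301-00001.arc.gz @ offset 1234
```

If the offset and container columns can only be rebuilt with the help of the crawl logs, losing those logs means losing the ability to resolve deduplicated records.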

From a long-term preservation perspective, this is a problematic dependency. The ARC container files are not self-describing; they depend on operative data (log files generated by the crawl software) in a non-standardised manner. Web archive operators and developers know where to get the information, and the dependency might be well documented. But it involves the risk of losing information that is essential for displaying and accessing the content.

This is one of the reasons why the option to migrate from ARC to the new WARC format is being considered by many institutions. But, as often happens, what looks like a simple format transformation at first glance rapidly turns into a project with complex requirements that are not easy to fulfil.

In the SCAPE project, there are several aspects that in our opinion deserve closer attention:

  1. The migration from ARC to WARC typically deals with large data sets, so a solution must provide an efficient, reliable and scalable transformation process. There must be the ability to scale out, which means it should be possible to increase processing power by using a computing cluster of appropriate size, enabling organisations to complete the migration in a given time frame.

  2. Reading and writing the large data sets comes at a cost. Sometimes the data must even be shifted to a (remote) cluster first. It should therefore be possible to easily hook in other processes that extract additional metadata from the content.

  3. The migration from one format to another conveys the risk of information loss. Quality assurance measures, such as calculating payload hashes and comparing content between corresponding ARC and WARC instances, or doing rendering tests in the Wayback machine on subsets of migrated content, are possible approaches in this regard.

  4. Resolving the dependencies of the ARC container files on any external information entities is a necessary requirement. A solution should therefore not only look at a one-to-one mapping between ARC and WARC, but should involve contextual information in the migration process.
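As an illustration of the quality assurance idea in point 3, here is a minimal sketch of a payload hash comparison, assuming the payload of a record has already been extracted from the source ARC and from the migrated WARC into plain files (all paths and contents invented):

```shell
# Stand-ins for extracted record payloads; in a real run these would be
# produced by a WARC/ARC record reader, not by printf.
printf 'payload bytes' > record-0001.arc.payload
printf 'payload bytes' > record-0001.warc.payload

# Compare SHA-1 digests of the corresponding payloads.
arc_hash=$(sha1sum record-0001.arc.payload | cut -d' ' -f1)
warc_hash=$(sha1sum record-0001.warc.payload | cut -d' ' -f1)

if [ "$arc_hash" = "$warc_hash" ]; then
  echo "record-0001: payload identical"
else
  echo "record-0001: payload MISMATCH" >&2
fi
# prints: record-0001: payload identical
```

In a production migration the digests would be computed record by record while streaming the containers, rather than via temporary files.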

The first concrete step regarding this activity was to find the right approach to the first of the above mentioned aspects.

In the SCAPE project, the Hadoop framework is an essential element of the so-called SCAPE platform. Hadoop is the core component responsible for efficiently distributing processing tasks to the available workers in a computing cluster.

Taking advantage of software development outcomes from the SCAPE project, there were different options for implementing a solution. The first option was to use a module of the SCAPE platform called ToMaR, a Map/Reduce Java application that makes it easy to distribute command-line application processing over a computing cluster (in the following: ARC2WARC-TOMAR). The second option was a Map/Reduce application with a customised reader for the ARC format and a customised writer for the WARC format, so that the Hadoop framework can handle these web archive file formats directly (in the following: ARC2WARC-HDP).

An experiment was set up to test the performance of the two approaches; the main question was whether the native Map/Reduce job implementation had a significant performance advantage over using ToMaR with an underlying command-line tool execution.

The reason why this advantage would have to be “significant” is that the ARC2WARC-HDP option has an important limitation: achieving the transformation with a native Map/Reduce implementation requires a Hadoop representation of a web archive record, the intermediate representation between reading records from the ARC files and writing records to WARC files. As it uses a byte array field to store the record's payload content, there is a theoretical limit of around 2 GB, since a Java byte array is indexed by an Integer and its maximum length is a value near Integer.MAX_VALUE. In reality, the payload size limit might be much lower depending on the hardware setup and configuration of the cluster.
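A quick check of that ceiling: a Java byte array is indexed by a 32-bit int, so the largest possible array holds Integer.MAX_VALUE elements:

```shell
# Integer.MAX_VALUE bytes expressed in GiB -- the theoretical upper bound
# on a single in-memory record payload in the ARC2WARC-HDP approach.
awk 'BEGIN { printf "%d bytes = %.2f GiB\n", 2147483647, 2147483647 / (1024 ^ 3) }'
# prints: 2147483647 bytes = 2.00 GiB
```

Any record whose payload exceeds this cannot be held in a single byte array, which is why larger records would need a separate code path.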

This limitation would require an alternative solution for records with large payload content. And such a separation between "small" and "large" records would increase the complexity of the application, especially when contextual information across different container files must be involved in the migration process.

The implementations used to do the migration are proof-of-concept tools, which means they are not intended for running a production migration at this stage. They have the following limitations:

  1. As already mentioned, ARC2WARC-HDP has a file size limit regarding the in-memory representation of a web archive record. The largest ARC file in the data sets used in these experiments is around 300MB, so record payload content can easily be stored in byte array fields.

  2. Exceptions are caught and logged, but there is no gathering of processing errors or any other analytic results. As the focus here lies on performance evaluation, details of record processing are not taken into consideration.

  3. The current implementations neither do quality assurance nor involve contextual information, both of which were mentioned above as important aspects of the ARC to WARC migration.

The implementations are based on the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files and iterating over their records.

As an example of a process applied while the data is being read, the implementations include Apache Tika to identify the payload content as an optional feature. All Hadoop job executions are therefore tested with and without payload content identification enabled.

As already mentioned, the ARC2WARC-HDP application was implemented as a Map/Reduce application which is started from the command line as follows:

hadoop jar arc2warc-migration-hdp-1.0-jar-with-dependencies.jar \
-i hdfs:///user/input/directory -o hdfs:///user/output/directory

The ARC2WARC-TOMAR workflow uses a command-line Java implementation executed via ToMaR. One bash script was used to prepare the input needed by ToMaR and another to execute the ToMaR Hadoop job; a combined representation of the workflow is available as a Taverna workflow.

A so-called “tool specification” is needed to start an action in a ToMaR Hadoop job; it specifies the inputs and outputs and the Java command to be executed:

<?xml version="1.0" encoding="utf-8" ?>
<tool xmlns:xsi="" xsi:schemaLocation=" tool-1.0_draft.xsd" xmlns="" xmlns:xlink="" schemaVersion="1.0" name="bash">
  <operations>
    <operation name="migrate">
      <description>ARC to WARC migration using arc2warc-migration-cli</description>
      <command>java -jar /usr/local/java/arc2warc-migration-cli-1.0-jar-with-dependencies.jar -i ${input} -o ${output}</command>
      <inputs>
        <input name="input" required="true">
          <description>Reference to input file</description>
        </input>
      </inputs>
      <outputs>
        <output name="output" required="true">
          <description>Reference to output file</description>
        </output>
      </outputs>
    </operation>
  </operations>
</tool>
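The bash script that prepares ToMaR's input could be sketched along these lines. The control-line format below is illustrative of the idea (toolspec name, operation, input/output references), not ToMaR's verbatim input syntax, and the file names are invented:

```shell
# One control line per ARC file, referencing the "migrate" operation of the
# toolspec above (line format illustrative, paths invented).
for f in sample-0001.arc sample-0002.arc; do
  printf 'bash migrate --input="hdfs:///user/input/%s" --output="hdfs:///user/output/%s.warc"\n' \
    "$f" "$f"
done > tomar-input.txt

cat tomar-input.txt
```

A second script would then hand tomar-input.txt to the ToMaR Hadoop job for distributed execution.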

All commands allow use of a “-p” flag to enable Apache Tika identification of payload content.

The cluster used in the experiment has one controller machine (Master) and 5 worker machines (Slaves). The master node has two quadcore CPUs (8 physical/16 HyperThreading cores) with a clock rate of 2.40GHz and 24 Gigabyte RAM. The slave nodes have one quadcore CPU (4 physical/8 HyperThreading cores) with a clock rate of 2.53GHz and 16 Gigabyte RAM. Regarding the Hadoop configuration, five processor cores of each worker machine have been assigned to Map tasks, two cores to Reduce tasks, and one core is reserved for the operating system. This is a total of 25 processing cores for Map tasks and 10 cores for Reduce tasks.

The experiment was executed using two data sets of different size: one with 1000 ARC files totalling 91.58 Gigabyte, and one with 4924 ARC files with a total size of 445.47 Gigabyte.

A summary of the results is shown in the table below.




                             Obj./hour   Throughput (GB/min)   Avg. time/item (s)
Baseline                           834                1.2727                 4.32
Map/Reduce, 1000 ARC files        4592                7.0089                 0.78
Map/Reduce, 4924 ARC files        4645                7.0042                 0.77
ToMaR, 1000 ARC files             4250                6.4875                 0.85
ToMaR, 4924 ARC files             4320                6.5143                 0.83

With Tika payload identification enabled:

Baseline                           545                0.8321                 6.60
Map/Reduce, 1000 ARC files        2761                4.2139                 1.30
Map/Reduce, 4924 ARC files        2813                4.2419                 1.28
ToMaR, 1000 ARC files             3318                5.0645                 1.09
ToMaR, 4924 ARC files             2813                4.2419                 1.28


The Baseline value was determined by executing a standalone Java application that shifts content and metadata from one container to the other using JWAT. It was executed on one worker node of the cluster and serves as a point of reference for the distributed processing.

Some observations on these data: compared to the local Java application processing, the cluster processing shows a significant increase in performance for all Hadoop jobs – which should not be a surprise, as this is the purpose of distributed processing. The throughput does not change significantly between the two data sets of different size, which suggests that execution time grows linearly with the number of objects. Between the two approaches, ARC2WARC-HDP and ARC2WARC-TOMAR, there is only a slight difference which, given the above mentioned caveats of the Map/Reduce implementation, highlights ToMaR as an interesting candidate for the tool of choice. Finally, the figures show that enabling Apache Tika increases the processing time by more than 50%.
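As a cross-check of these figures (my own arithmetic, reading the table's decimal commas as decimal points), the Map/Reduce row for the 1000-file data set can be derived from the objects-per-hour value alone:

```shell
# 1000 ARC files, 91.58 GB total, 4592 objects/hour reported for Map/Reduce.
awk 'BEGIN {
  files = 1000; size_gb = 91.58; obj_per_hour = 4592
  minutes = files / obj_per_hour * 60       # implied job duration
  printf "duration: %.1f min, throughput: %.2f GB/min\n", minutes, size_gb / minutes
}'
# prints: duration: 13.1 min, throughput: 7.01 GB/min
```

The computed 7.01 GB/min agrees with the 7.0089 GB/min reported in the table, so the rate and throughput columns are mutually consistent.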

As an outlook on further work, and following the arguments outlined here, resolving contextual dependencies in order to create self-contained WARC files is the next point to look into.

As a final remark, the proof-of-concept implementations presented here are far from a workflow that can be used in production. There is an ongoing discussion in the web archiving community about whether it makes any sense for memory institutions to tackle such a project at all. Ensuring backwards-compatibility of the wayback machine and safely preserving contextual information is a viable alternative.

Many thanks to colleagues sitting near to me and in the SCAPE project who gave me useful hints and support.

Taxonomy upgrade extras: SCAPE
Preservation Topics: Preservation Actions, Identification, Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

Interview with a SCAPEr - Pavel Smrz

7 March 2014 - 1:07pm
Who are you?

My name is Pavel Smrz. I work as an associate professor at the Faculty of Information Technology, Brno University of Technology (BUT) in the Czech Republic. Our team joined the SCAPE project in September 2013.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

I lead a work package dealing with the Data Centre Testbed. Together with other new project partners, we aim at extending the current SCAPE development towards preserving large-scale computing experiments that take place in modern data centres. Our team particularly focuses on preservation scenarios and workflows related to large-scale video processing and interlinking.

Why is your organisation involved in SCAPE?

BUT has a long tradition and a proven research track record in the fields of large-scale parallel and distributed computing, knowledge technologies and big data analysis. We have participated in many European projects, other international and national research and development activities, and industrial projects relevant to this domain. That is why we were invited to join the proposal to extend the SCAPE project as part of the special EC Horizontal Action – Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union. The proposal was accepted and the SCAPE project was successfully extended in 2013.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a complex project, so there are many technological challenges. Being new to the project, I was pleasantly surprised by the high level of technical expertise of professionals from libraries and other institutions dealing with preservation. To mention just one example from our domain, concepts of advanced distributed computing are well understood and commonly employed by the experts. I believe this technical excellence will help us to meet all the challenges in the remaining project time.

In addition to the technical area, I see a key challenge of the project in integrating partners and individuals with very different backgrounds and perspectives. SCAPE is a truly interdisciplinary project, so people from various fields need to make a special effort to find common ground. I am glad that this works in the project and I really enjoy being part of the community.

What do you think will be the most valuable outcome of SCAPE?

SCAPE will deliver a new platform and a set of tools for various preservation contexts. I would stress the diversity of the tools as a particular outcome. My experience shows that “one-size-fits-all” solutions are often too scary to be used. Although funding agencies believe otherwise, research and development projects seldom deliver solutions that can be used as a whole. It is often the case that what seemed to be a minor contribution becomes the next big thing for business. I believe that at least some components developed within the project have this great potential.

Having an interest in large-scale parallel and distributed computing, I cannot forget scalability as a key attribute of the SCAPE development. Today’s public and private cloud and cluster infrastructures make it possible to realise large-scale preservation scenarios. What would have been a year-long preservation project a few years ago can be solved in a day on these platforms. However, many tools are not ready to take advantage of existing computing infrastructures – scalability does not come ‘automagically’.

In my opinion, the most valuable outcome of the SCAPE project consists in providing a diverse set of preservation tools and showing that they scale-up in real situations.

Contact information

Pavel SMRZ
Brno University of Technology
Faculty of Information Technology
Bozetechova 2, 61266 Brno
Czech Republic

Preservation Topics: SCAPE
Categories: Planet DigiPres

A Nailgun for the Digital Preservation Toolkit

24 February 2014 - 2:17am

Fifteen days was the estimate I gave for completing an analysis on roughly 450,000 files we were holding at Archives New Zealand. Approximately three seconds per file for each round of analysis:

3 x 450,000 = 1,350,000 seconds
1,350,000 seconds = 15.625 days

My bash script included calls to three Java applications, Apache Tika, 1.3 at the time, twice, running the -m and -d flags:

-m or --metadata    Output only metadata
-d or --detect      Detect document type

It also made a call to Jhove 1.11 in standard mode. The script also calculates a SHA1 for de-duplication purposes and to match Archives New Zealand's chosen fixity standard; computes a V4 UUID per file; and outputs the result of the Linux File command in two separate modes: standard, and with the -i flag to attempt to identify the MIME type.

Each application receives a path to a single file as an argument from a directory manifest. The script outputs five CSV files that can be further analysed.

The main function used in the script is as follows:

dp_analysis () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}file-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}tika-md-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}tika-type-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}jhove-analysis.log
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}sha-one-analysis.log
}

What I hadn't anticipated was the expense of starting the Java Virtual Machine (JVM) three times each loop, 450,000 times. The performance is prohibitive and so I immediately set out to find a solution. Either cut down the number of tools I was using, or figure out how to avoid starting the JVM each time. Fortunately a Google search led me to a solution, and a phrase, that I had heard before – Nailgun.

It has been mentioned on various forums, including in comments on various OPF blogs, and it is even found in the Fits release notes. The phrase resonated, and it turned out that it provided a single, accessible approach to do what we need.

One of the things we haven't seen yet is a guide on using it within the digital preservation workflow. I'll describe how to make best use of this tool, and try to demonstrate its benefits, during the remainder of this blog.

For testing purposes we will be generating statistics on a laptop that has the following specification:

Product: Acer Aspire V5-571PG (Aspire V5-571PG_072D_2.15)
CPU: Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
Width: 64 bits
Memory: 8GiB
OS: Ubuntu 13.10
Release: 13.10
Codename: saucy
Java version "1.7.0_21"
Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)

JVM Startup Time

First, let's demonstrate the startup cost of the JVM. If we take two functionally equivalent programs, the first in Java and the second in C++, we can look at the time taken to run them 1000 times consecutively.

The purpose of each application is to run and then exit with a return code of zero.


Java: SysExitApp.java:

public class SysExitApp {
    public static void main(String[] args) {
        System.exit(0);
    }
}

C++: SysExitApp.cpp:

int main() { return(0); }

The Script to run both, and output the timing for each cycle, is as follows:

#!/bin/bash
time (for i in {1..1000}
do
   java -jar SysExitApp.jar
done)

time (for i in {1..1000}
do
   ./SysExitApp.bin
done)

The source code can be downloaded from GitHub. Further information about how to build the C++ and Java applications is available in the README file. The output of the script is as follows:

real 1m26.898s
user 1m14.302s
sys 0m13.297s

real 0m0.915s
user 0m0.093s
sys 0m0.854s

With the C++ binary, the average time taken per execution is 0.915ms. The execution time of the Java application rises from this to 86.898ms on average. One can reasonably put this down to the cost of the JVM startup.

Both C++ and Java are compiled languages. C++ compiles down to machine code: instructions that can be executed directly by the CPU (Central Processing Unit). Java compiles down to bytecode. Bytecode lends itself to portability across many devices, where the JVM provides an abstraction layer handling differences in hardware configuration before interpreting it down to machine code.

A good proportion of the tools in the digital preservation toolkit are implemented in Java, e.g. DROID, Jhove, Tika, Fits. As such, we currently have to take this performance hit, and optimizations must focus on handling that effectively.

Enter Nailgun

Nailgun is a client/server application that removes the overhead of starting the JVM by running it once within the server and enabling all command-line based Java applications to run within that single instance. The Nailgun client then relays those applications' calls to the server: the command line (stdin) one normally associates with a particular application, e.g. running Tika with the -m flag and passing it a reference to a file. The application runs and Nailgun directs its stdout and stderr back to the client, which outputs them to the console.

With the exception of the command line being executed within a call to the Nailgun client, behaviour remains consistent with that of the standalone Java application. The Nailgun background information page provides a more detailed description of the process.

How to build Nailgun

Before running Nailgun it needs to be downloaded from GitHub and built using Apache Maven to build the server, and the GNU Make utility to build the client. The instructions in the Nailgun README describe how this is done.

How to start Nailgun

Once compiled the server needs to be started. The command line to do this looks like this:

java -cp /home/digital/dp-toolkit/nailgun/nailgun-server/target/nailgun-server-0.9.2-SNAPSHOT.jar -server com.martiansoftware.nailgun.NGServer

The classpath needs to include the path to the Nailgun server Jar file. The command to start the server can be expanded to include any further application classes you want to run. There are other ways it can be modified as well. For further information please refer to the Nailgun Quick Start Guide. For simplicity we start the server using the basic startup command.

Loading the tools (Nails) into Nailgun

As mentioned above, the tools you want to run can be loaded into Nailgun at startup. For my purposes, and to provide a useful and simple overview for all, I found it easiest to load them via the client application.

Applications loaded into Nailgun need to have a main class. It is possible to find out if the application has a main class by opening the Jar in an archive manager capable of opening Jars, such as 7-Zip. Locate the META-INF folder, and within that the MANIFEST.MF file. This will contain a line similar to this example from the Tika Jar’s MANIFEST.MF in tika-app-1.5.jar.

Main-Class: org.apache.tika.cli.TikaCLI
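The same check can be scripted rather than done in an archive manager. A sketch, simulated here against a local copy of the manifest (the unzip command in the comment assumes the Jar sits in the current directory):

```shell
# Reading the manifest straight out of the Jar would look like:
#   unzip -p tika-app-1.5.jar META-INF/MANIFEST.MF | grep '^Main-Class'
# Simulated with a local copy of the manifest file:
cat > MANIFEST.MF <<'EOF'
Manifest-Version: 1.0
Main-Class: org.apache.tika.cli.TikaCLI
EOF

grep '^Main-Class' MANIFEST.MF
# prints: Main-Class: org.apache.tika.cli.TikaCLI
```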

Confirmation of a main class means that we can load Tika into Nailgun with the command:

ng ng-cp /home/digital/dp-toolkit/tika-1.5/tika-app-1.5.jar

Before working with our digital preservation tools, we can try running the Java application created to baseline the JVM startup time alongside the functionally comparable C++ application.

MANIFEST.MF within the SysExitApp.jar file reads as follows:

Manifest-Version: 1.0
Created-By: 1.7.0_21 (Oracle Corporation)
Main-Class: SysExitApp

As it has a main class we can load it into the Nailgun server with the following command:

ng ng-cp /home/digital/Desktop/dp-testing/nailgun-timing/exit-apps/SysExitApp.jar

The command ng-cp tells Nailgun to add it to its classpath. We provide an absolute path to the Jar we want to execute. We can then call its main class from the Nailgun client.

Calling a Nail from the Command Line

Following that, we want to call our application from within the terminal. Previously we have used the command:

java -jar SysExitApp.jar

This calls Java directly and thus the JVM. We can replace this with a call to the Nailgun client and our application's main class:

ng SysExitApp

We don't expect to see any output at this point; provided no error occurs, it will simply return a new input line on the terminal. On the server, however, we will see the following:

NGSession 1: SysExitApp exited with status 0

And that's it. Nailgun is up and running with our application!

We can begin to see the performance improvement gained by removing the expense of the JVM startup when we execute this command using our 1000 loop script. We simply add the following lines:

time (for i in {1..1000}
do
   ng SysExitApp
done)

This generates the output:

real 0m2.457s
user 0m0.157s
sys 0m1.312s

Compare that to running the Jar, and compiled binary files before:

real 1m26.898s
user 1m14.302s
sys 0m13.297s

real 0m0.915s
user 0m0.093s
sys 0m0.854s

It is not as fast as the compiled C++ code, but it represents an improvement of well over a minute compared to starting the JVM on each iteration of the loop.

The Digital Preservation Toolkit Comparison Script

Up and running we can now baseline Nailgun with the script used to run our digital preservation analysis tools.

We define two functions: one that calls the Jars we want to run without Nailgun, and the other to call the same classes, with Nailgun:

dp_analysis_no_ng () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -m "$file") >> ${LOGNAME}${TIKAMDLOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(java -jar ${TIKA_HOME}/tika-app-1.5.jar -d "$file") >> ${LOGNAME}${TIKATYPELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(java -jar ${JHOVE_HOME}/bin/JhoveApp.jar "$file") >> ${LOGNAME}${JHOVELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
}

dp_analysis_ng () {
   FUID=$(uuidgen)
   DIRN=$(dirname "$file")
   BASN=$(basename "$file")
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "file-5.11" '\t' $(file -b -n -p "$file") '\t' $(file -b -i -n -p "$file") >> ${LOGNAME}${FILELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-md" '\t' $(ng org.apache.tika.cli.TikaCLI -m "$file") >> ${LOGNAME}${TIKAMDLOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "tika-1.5-type" '\t' $(ng org.apache.tika.cli.TikaCLI -d "$file") >> ${LOGNAME}${TIKATYPELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "jhove-1_11" '\t' $(ng Jhove "$file") >> ${LOGNAME}${JHOVELOG}
   echo -e ${FUID} '\t' "$file" '\t' ${DIRN} '\t' ${BASN} '\t' "sha-1-8.20" '\t' $(sha1sum -b "$file") >> ${LOGNAME}${SHAONELOG}
}

Before we define the functions, we load the applications into Nailgun with the following commands:

#Load JHOVE and TIKA into Nailgun CLASSPATH
$(ng ng-cp ${JHOVE_HOME}/bin/JhoveApp.jar)
$(ng ng-cp ${TIKA_HOME}/tika-app-1.5.jar)

The complete script can be found on GitHub with more information in the README file.


For the purpose of this blog I reworked the Open Planets Foundation Test Corpus and have used a branch of that to run the script across. There are 324 files in the corpus with numerous different formats. The script produces the following results:

real 13m48.227s
user 26m10.540s
sys 0m59.861s

real 1m32.801s
user 0m4.548s
sys 0m16.847s

Stderr is piped to a file called errorlog.txt using ‘2>’ syntax, to capture the output of all the tools and to avoid any expense of the tools printing to the screen. Errors shown in the log relate to the tools' ability to parse certain files in the corpus rather than to Nailgun. The errors should be reproducible with the same format corpus and tool set.
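A minimal illustration of that ‘2>’ redirection, with an invented stand-in for the analysis tools:

```shell
# Hypothetical stand-in for a characterisation tool: results on stdout,
# parse warnings on stderr.
analyse() {
  echo "some-file: result"
  echo "some-file: parse warning" >&2
}

analyse 2> errorlog.txt   # only the result line reaches the terminal
cat errorlog.txt          # the warning ends up in the log
```

The same pattern keeps tool chatter out of the CSV outputs while preserving it for later inspection.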

There is a marked difference in performance when running the script with Nailgun and without. Running the tools as-is we find that each pass takes approximately 2.52 seconds per file on average.

Using Nailgun this is reduced to approximately 0.28 seconds per file on average.


The timing results collected here will vary quite widely on different systems, and even on the same system. However, the disparity between applications that start the JVM on each execution and applications run through Nailgun should remain just as stark.

While I hope this blog provides a useful Nailgun tutorial, my concern, having now worked in anger with the tools we talk about in the digital preservation community on a daily basis, is understanding what smaller institutions with smaller IT departments, and potentially fewer IT capabilities, are doing, and whether they are even able to make use of the tools out there given the overheads described.

It is possible to throw more technology and more resources at this issue but it can't be expected that this will always be possible. The reason I sought this workaround is that I can't see that capability being developed at Archives New Zealand without significant time and investment, and that capability can't always be delivered in short-order within the constraints of working within government. My analysis, on a single collection of files, needs to be complete within the next few weeks. I need tools that are easily accessible and far more efficient to be able to do this. 

It is something I'll have to think about some more. 

Nailgun gives me a good short-term solution, and hopefully this blog opens it up as a solution that will prove useful to others too.

It will be interesting to learn, following this work, how others have conquered similar problems, or equally interesting, if they are yet to do so.  



Loops: I experimented with various loops for recursing the directories in the opf-format-corpus expecting to find differences in performance within each. Using the Linux time command I was unable to find any material difference in either loop. The script used for testing is available on GitHub. The loop executes a function that calls two Linux commands, ‘sha1sum’ and ‘file’. A larger test corpus may help to reveal differences in either approach. I opted to stick with iterating over a manifest as this is more likely to mirror processes within our organization.

Optimization: I recognize a naivety in my script. It was produced to collect quick and dirty results from a test set that I only have available for a short period of time. The first surprise running the script was the expense of the JVM startup. After finding a workaround for that, I now need to look at other optimizations to continue to approach the analysis this way. Failing that, I need to understand from others why this approach might not be appropriate, and/or sustainable. Comments and suggestions along those lines as part of this blog are very much appreciated.

And Finally...

All that glisters is not gold: Nailgun comes with its own overhead. Running the tool on a server at work with the following specification:

Product: HP ProLiant ML310 G3
CPU: Intel(R) Pentium(R) 4 CPU 3.20GHz
Width: 64 bits
Memory: 5GiB
OS: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid
Java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) Client VM (build 24.51-b03, mixed mode)

We find it running out of heap space around the 420,000th call to the server, with the following message:

java.lang.OutOfMemoryError: Java heap space

If we look at the system monitor we can see that the server has maxed out the amount of RAM it can address. The amount of memory it uses grows with each call to the server. I don't have a mechanism to avoid this at present, other than chunking the file set and restarting the server periodically. Users adopting Nailgun might want to take note of this issue up-front. Throwing memory at the problem will help to some extent, but a more sustainable solution is needed, and indeed welcomed. This might require optimizing Nailgun, or instead, further optimization of the digital preservation tools that we are using.
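A minimal sketch of that chunk-and-restart workaround, in Python. The batch size and the process/restart callables are placeholders, not real Nailgun commands; how a batch is processed and how the server is relaunched would depend on your setup:

```python
def chunks(paths, batch_size):
    """Split a list of file paths into fixed-size batches."""
    for i in range(0, len(paths), batch_size):
        yield paths[i:i + batch_size]

# Hypothetical driver: process each batch, then restart the Nailgun
# server so its heap is released before it reaches OutOfMemoryError.
def process_in_batches(paths, batch_size, process_batch, restart_server):
    for batch in chunks(paths, batch_size):
        process_batch(batch)   # e.g. call the tool via `ng` per file
        restart_server()       # stop and relaunch the JVM server
```

Given that the heap gave out around the 420,000th call on the hardware above, a batch size comfortably below that threshold would be the pragmatic choice.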


Preservation Topics: Identification, Characterisation, Tools, Software
Categories: Planet DigiPres

A spot of Gardening: Weeding the Open Planets Foundation Format Corpus

20 February 2014 - 6:57am

Conducting some research into the chaining of digital preservation tools using a Linux shell script, I once again found it difficult to source a set of files that I could use as a stake in the ground and allow my work to be in some way replicated by others wishing to confirm results and find future optimisations. Read: Scientific method.

The Open Planets Foundation (OPF) format corpus is a great set of files. Arguably it isn't as complete as it could be, but it contains files being used for testing digital preservation tools, that I can attach a count to, that others can easily access, and it gives me some level of complexity to test against. That is, there is some structure to the collection in terms of the number and depth of folders, format coverage, and a decent enough number of files that they are not simply processed at light-speed by the tools in question, allowing us to collect useful, comparative timing values.

But the corpus in its current form on GitHub is difficult to use without some additional effort. Like a garden needs watering, it sometimes needs a little weeding too, and I think that’s the case with the format corpus. Ideally if we can extract the weeds from it, it can become completely standalone, and useful to anyone who comes along who needs to use it for a number of ever-so-slightly different purposes.

The purposes of having a format corpus with a broad range of files, with known characteristics and attributes in many different combinations, have been expounded before. Other reasons include:

  • Monitoring the consistency of the behaviour of tools and their output.
  • Monitoring improvements in capability and performance.
  • Understanding the behaviour of tools against files with more obscure characteristics and attributes.
  • Enabling other tools to be developed and measured alongside existing tools using the same baseline for testing.
  • And, in this instance, measuring (and demonstrating) the performance of the tools within the digital preservation toolkit, and enabling those same tests to be run by other users and organisations.

It is possible to do this with the corpus in its current form, but it contains two types of file. The first are functional files, those meant specifically for testing a tool, e.g. exemplars of JP2, PDF, and ODF.

Compare these to non-functional files, those that contain metadata, or even the results of other experiments on the files with other tools. To explain further:

A process one might follow at present to experiment on the corpus is to download the repository from GitHub. This will give you a folder that looks like this:

.gitattributes
.gitignore
.opf.yml
.project
.pydevproject
<DIR> desktop-publishing
<DIR> ebooks
<DIR> file-archive
<DIR> filesys-trials
<DIR> govdocs1-error-pdfs
<DIR> jp2k-formats
<DIR> jp2k-test
<DIR> knowledge-management
<DIR> office
<DIR> office-examples
<DIR> pcraster
<DIR> pdfCabinetOfHorrors
<DIR> statistica
<DIR> tiff-examples
<DIR> tools
<DIR> variations
<DIR> video

The files that we are interested in testing against exist inside the lower level directories, e.g. ‘office-examples/’.

A directory at this top level, sitting amongst the others, that does not constitute part of the corpus is ‘tools’. This folder does what it says on the tin: it contains tools for working with the format corpus, not corpus files.

Files like .gitattributes and .gitignore can be 'handled' by any of the tools we might test against, usefully or not, but they're not part of the corpus of files that we want to be measuring against.

This presents us with an issue in counting the number of functionally useful files in the corpus for presenting back useful, repeatable results. As we move deeper into the structure of the repository we find other *.md files, and other metadata objects in inconsistent formats, e.g. comma-separated-values files, inconsistently distributed within the folders where contributors may or may not have created them.
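As a rough illustration, counting the functional files might look like the sketch below. The skip lists are my own assumptions about what counts as non-functional here, not a rule drawn from the corpus itself:

```python
import os

# Illustrative filters only: which names count as "non-functional" is a
# judgement call, and these sets are assumptions for this sketch.
SKIP_DIRS = {"tools", ".git"}
SKIP_SUFFIXES = (".md", ".csv")

def count_functional_files(root):
    """Count corpus files, skipping utility directories, hidden files,
    and metadata files such as *.md and *.csv."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # pruning dirnames in place stops os.walk descending into them
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name.startswith(".") or name.endswith(SKIP_SUFFIXES):
                continue
            count += 1
    return count
```

The point of the weeding is precisely to make a filter like this unnecessary: a clean corpus can simply be counted as-is.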

The two sets of objects need to be presented back in such a way that the testing objects can simply be picked up and used by others. The descriptive and utility objects should not clutter this other collection and should sit, modularly, somewhere else.

Following these principles, I remixed the corpus as-is to create the following structure at the top level:

diagram.png
<DIR> format-corpus
<DIR> tools

The ‘tools/’ directory now contains all descriptive information previously found about the files in ‘format-corpus/’, including metadata about the objects and licensing information where it has been provided by the content creator. diagram.png is part of this new top level.

Further issues inside the corpus include the inclusion of tool output, such as Jpylyzer output and Jhove output.

As examples of XML objects these files would prove useful; other than that, they clutter the corpus repository somewhat and serve little purpose. They do fulfil some of the suggested metadata requirements outlined by Andy Jackson of the British Library, i.e. including some or all of the following information:

formatName:
formatVersion:
extensions:
mimeType:
mimeTypeAliases:
pronomId:
xmlNameSpace:
creatorTool:
creatorToolUrl:
formatSpecUrl:

The tool output does not constitute part of the collection that we’re interested in testing against.

This output is variable over time, but the corpus files are constants. The specific version of a tool generating output from these files is constant too (although the platform might not be). Given correct source control of these utilities, we don't need to store any output; we simply need to run older or newer versions of them over time, depending on perspective, to recreate these ‘metadata’ objects.

Should they be viewed as the result of ‘testing’, and those results be important to keep with the format corpus, then perhaps we can create a ‘testing-results’ folder in the top level of the repository.

Tool output has been removed entirely.

The corpus now sits in a standalone space. When it is downloaded via Git, the user will receive a folder containing the test files and the directory structure wrapping those files; none of the extraneous non-functional and descriptive data. The ‘format-corpus/’ directory can simply be passed to any tool being tested, free from the additional pollution.

I am not sure this is a perfect model yet. The README contains the file index, simply because GitHub renders it on the front page of a repository folder by default. We could still extract all of this information into individual files to be managed that way. Work could also be done to generate files matching the metadata schema suggested by Andy Jackson - a task for a hackathon maybe? Not as cool as coding, but maintaining such a useful resource is equally important.

Regardless, I hope that in my remix I’ve demonstrated some principles that other contributors will be happy to follow, or perhaps some ideas that can open up a discussion about alternative ways to do this.

To conclude, I’ve branched the current corpus here:

I need this to be able to make a write-up in reference to a set of files fixed to a point in time, and the full corpus is here:

Overall, I’d be extremely happy if I’ve managed to keep this work in a state that enables it to be forked and placed somewhere more useful to the community, and am happy to see it move back into the Open Planets Foundation GitHub where the community can continue to work on it.

Happy gardening! 



Structure: Because it is difficult to rework structure en masse using Git, we do need to think quite carefully up front about our work. Unfortunately the remix required a full extract, generation of a new repo, and re-upload. Ideally I would have worked on a fork of the current collection toward the same aim.


Preservation Topics: Resources, Corpora
Categories: Planet DigiPres

SCAPE QA Tool: Technologies behind Pagelyzer - II Web Page Segmentation

12 February 2014 - 1:21pm

Web pages are getting more complex than ever. Thus, identifying the different elements of a web page, such as main content, menus, user comments, and advertising, becomes difficult. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments called Blocks or Segments. Detecting these different blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval, and change detection between versions in the web archive context.

Web Page Segmentation at a Glance

For a web page (W), the output of its segmentation is the semantic tree of the page (W'). Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its children blocks. All leaf blocks are atomic units and form a flat segmentation of the web page. Each block is identified by a block-id value (see Figure 1 for an example).

Fig. 1
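The semantic tree described above can be sketched as a tiny data structure. This is an illustration of the block/leaf relationships only, not Pagelyzer's actual implementation; the block-id values are made up:

```python
class Block:
    """A node in the segmentation tree: the root is the whole page,
    inner blocks aggregate their children, leaves are atomic units."""
    def __init__(self, block_id, children=None):
        self.block_id = block_id
        self.children = children or []

    def leaves(self):
        """Flat segmentation of the page: the leaf blocks, left to right."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]
```

For example, a page whose root aggregates a two-leaf menu block and a single content block would yield a flat segmentation of three leaves.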

An efficient web page segmentation approach is important for several reasons:

  • Processing different parts of a web page according to their type of content.

  • Assigning importance to one region of a web page over the rest.

  • Understanding the structure of a web page.

Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. It takes two URLs, two browser types (e.g. Firefox, Chrome), and one comparison type (image-based, hybrid or content-based) as input. If the browser types are not set, it uses Firefox by default. The SVM-based comparison is discussed in the post (SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine). Based on the segmentation, hyperlinks are extracted from each block and the Jaccard distance between them is calculated.
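The Jaccard distance between the hyperlink sets of two corresponding blocks is straightforward to compute. A minimal sketch (not Pagelyzer's actual code; the convention that two empty sets are identical is an assumption):

```python
def jaccard_distance(links_a, links_b):
    """Jaccard distance between two sets of hyperlinks:
    1 - |A intersect B| / |A union B|.
    Two empty sets are treated as identical (distance 0)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

A distance of 0 means the two blocks link to exactly the same targets; 1 means they share no links at all.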

In this post, I will try to explain what web page segmentation does, specifically for Pagelyzer. It provides information about the web page content.

Web page Segmentation Algorithm

Here we present the details of the Block-o-Matic web page segmentation algorithm, used by Pagelyzer to perform the segmentation. It is a hybrid between the visual-based approach and the document processing approach.

The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filtering, mapping and combining. It produces three structures: the DOM structure, the content structure and the logic structure. The main aspect of the whole process is producing these structures, where the logic structure represents the final segmentation of the web page.

The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logical structure (Wlog) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (the Rec function):


W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG)).
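The three-phase pipeline composes as follows. The d2c, c2l and Rec bodies here are placeholder stand-ins (each structure is just a labelled dict) that only show how the structures flow through the phases, not the real algorithms:

```python
def d2c(dom):
    """Analysis: build the content structure W_cont from the DOM tree."""
    return {"content_of": dom}

def c2l(content, granularity):
    """Understanding: map the content structure to the logic structure
    W_log at the requested granularity pG."""
    return {"logic_of": content, "pG": granularity}

def rec(dom, content, logic):
    """Reconstruction: gather the three structures into W'."""
    return {"dom": dom, "content": content, "logic": logic}

def segment(dom, pG):
    """W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG))."""
    content = d2c(dom)
    return rec(dom, content, c2l(content, pG))
```

Note that d2c is computed once and reused: the logic structure is derived from the same content structure that reconstruction receives.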


To integrate the segmentation outcome into Pagelyzer, an XML representation is used: ViDIFF. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.


The Block-o-Matic algorithm is available:

References

Structural and Visual Comparisons for Web Page Archiving. M. T. Law, N. Thome, S. Gançarski, M. Cord. 12th ACM Symposium on Document Engineering (DocEng), 2012.

Structural and Visual Similarity Learning for Web Page Archiving. M. T. Law, C. Sureda Gutierrez, N. Thome, S. Gançarski, M. Cord. 10th Workshop on Content-Based Multimedia Indexing (CBMI), 2012.

Block-o-Matic: A Web Page Segmentation Framework. A. Sanoja, S. Gançarski. Accepted for oral presentation at the International Conference on Multimedia Computing and Systems (ICMCS'14), Morocco, April 2014.

Block-o-Matic: a Web Page Segmentation Tool and its Evaluation. A. Sanoja, S. Gançarski. BDA, Nantes, France, 2013.

Yet another Web Page Segmentation Tool. A. Sanoja, S. Gançarski. Proceedings of iPRES 2012, Toronto, Canada, 2012.

Understanding Web Pages Changes. Z. Pehlivan, M. B. Saad, S. Gançarski. International Conference on Database and Expert Systems Applications, DEXA (1), 2010: 1-15.

Preservation Topics: Software
Categories: Planet DigiPres

SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine

7 February 2014 - 1:15pm


The Web is constantly evolving over time. Web content, such as text and images, is updated frequently. One of the major problems encountered by archiving systems is to understand what happened between two different versions of a web page. We want to underline that the aim is not to compare two web pages like this (however, the tool can also do that):




but web page versions:




An efficient change detection approach is important for several reasons:


  • Crawler optimization, by deciding on the fly whether a page should be crawled or not.

  • Discovering new crawl strategies, e.g. based on patterns.

  • Quality assurance for crawlers, for example by comparing the live version of a page with the just-crawled one.

  • Detecting format obsolescence caused by evolving technologies, e.g. whether a web page renders visually identically using different versions of a browser, or different browsers.

  • Archive maintenance: different operations like format migration can change the rendering of archived versions.

Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. It takes two URLs, two browser types (e.g. Firefox, Chrome), and one comparison type (image-based, hybrid or content-based) as input. If the browser types are not set, it uses Firefox by default.


It is based on two different technologies:


1 – Web page segmentation (let's keep the details for another blog post)

2 – Supervised Learning with Support Vector Machine(SVM).


In this blog, I will try to explain simply (without any equations) what SVM does, specifically for Pagelyzer. You have two URLs, let's say url1 and url2, and you would like to know if they are similar (1) or dissimilar (0).


You calculate the distance (or similarity) as a vector based on the comparison type. If it is image-based, your vector will contain features related to images (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarities (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions (two features). One feature is SIFT and the other one is HSV. They are both colour descriptors.


To make your system learn, you should first provide annotated data to your system. In our case, we need a list of url pairs <url1,url2> annotated manually as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). With a part of your dataset (ideally 1/3) you train your system; with the other part you test your results.



Let's start training:



First, you put all your vectors in input space.


As this data is annotated, you know which one is similar (in green) and which one is dissimilar (in red).


You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).



Let's classify:



Your system is intelligent now! When you have a new pair of urls without any annotation, based on the decision boundary, you can say if they are similar or not.

The pair of urls in blue will be considered as dissimilar, the one in orange will be considered as similar by pagelyzer.
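To make the train-then-classify idea concrete without pulling in an SVM library, here is a nearest-centroid classifier: it also produces a linear decision boundary from annotated 2-D feature vectors, but, unlike a true SVM, it does not maximize the margin. The feature values below are made up for illustration:

```python
# A nearest-centroid stand-in for SVM training: the boundary is the
# perpendicular bisector of the two class centroids, giving a linear
# decision function w.x + b like a linear SVM, but without margin
# maximization. Feature vectors are 2-D (e.g. a SIFT score and an
# HSV score), labels are 1 (similar) or 0 (dissimilar).
def train(vectors, labels):
    """Return a decision function built from the two class centroids."""
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    cp = [sum(xs) / len(pos) for xs in zip(*pos)]  # centroid of class 1
    cn = [sum(xs) / len(neg) for xs in zip(*neg)]  # centroid of class 0
    w = [p - n for p, n in zip(cp, cn)]            # normal to the boundary
    mid = [(p + n) / 2 for p, n in zip(cp, cn)]    # point on the boundary
    b = -sum(wi * mi for wi, mi in zip(w, mid))
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Points on the centroid-rich side of the boundary come out as 1 (similar), the rest as 0 (dissimilar), mirroring the green/red picture above.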


When you choose different types of comparison, you choose different types of features and dimensions. The current version of Pagelyzer uses the results of an SVM learned with 202 pairs of web pages provided by IMF, 147 in the positive class and 55 in the negative class. As it is a supervised system, increasing the training set size will generally lead to better results.

An image to show what happens when you have more than two dimensions:




Preservation Topics: Web Archiving, Tools, SCAPE, Software
Categories: Planet DigiPres