Planet DigiPres

HTML and fuzzy validity

File Formats Blog - 8 July 2014 - 5:54pm

Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.

When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.

There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.

The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.

The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. These are very apt to have vendor-specific features, and they may have major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.

It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without them all the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.

So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.

It’s a mess, and I don’t think anyone has a good solution.

Categories: Planet DigiPres

Professional update

File Formats Blog - 8 July 2014 - 10:08am

Just to keep everyone up to date on what I’m doing professionally:

Currently I’m back in consulting mode, offering my services for software development and consultations. Those of you who’ve been following this blog regularly know I’ve been working with libraries for a long time and I’m familiar with the technology. I’ve updated my business home page at and moved it to new hosting, which will allow me to put demos and other materials of interest on the site.

The key to success is, of course, networking. so if you happen to hear of a situation where my skills could be put to good use, please let me know.

Tagged: business
Categories: Planet DigiPres

The SCAPE Project video is out!

Open Planets Foundation Blogs - 3 July 2014 - 8:31am

Do you want a quick intro to what SCAPE is all about?

Then you should watch the new SCAPE video!

The video will be used at coming SCAPE events like SCAPE demonstration days and workshops and it will be available on Vimeo for everyone to use. You can help us to disseminate this SCAPE video by tweeting using this link



Standard tools become overtaxed.....          ....SCAPE addresses these challenges 


The production of this SCAPE video was part of the final project presentation. The idea behind the video is to explain what SCAPE is about to both technical and non-technical audiences. In other words, the overall outcomes and unique selling points of the project in a short and entertaining video. But how do you condense a four year project with 19 partners and lots of different tools and other SCAPE products in just two minutes?

We started with formulating the SCAPE overall messages and unique selling points, from which a script was distilled. This was the basis for the voice over text and a story board, after which the animation work began. There were lots of adjustments to be made in order to stay close to the actual SCAPE situation. It was great that SCAPErs from different areas of the project were kind enough to look at what we from the Take Up team came up with. 

Please take a look and use this video to tell everyone how SCAPE helps you to bring your digital preservation into the petabyte dimension!


SCAPE Project - Digital Preservation into the Petabyte Dimension from SCAPE project on Vimeo.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Introducing Flint

Open Planets Foundation Blogs - 2 July 2014 - 12:53pm

Hi, this is my first blog post in which I want to introduce the project I am currently working on: Flint.


Flint (File/Format Lint) has developed out of DRMLint, a lightweight piece of Java software that makes use of different third party tools (Preflight, iText, Calibre, Jhove) to detect DRM in PDF-files and EPUBs. Since its initial release we have added validation of files against an institutional policy, making use of Johan’s pdfPolicyValidate work, restructured it to be modular and easily extendible, and found ourselves having developed a rather generic file format validation framework.  

what does Flint do?

Flint is an application and framework to facilitate file/format validation against a policy. It's underlying architecture is based on the idea that file/format validation has nearly always a specific use-case with concrete requirements that may differ from say a validation against the official industry standard of a given format. We discuss the principal ideas we've implemented in order to match such requirements.

The code centres on individual file format modules, and thus takes a different approach to FITS; for example - the PDF module makes use of its own code and external libraries to check for DRM. Creating a custom module for your own file formats is relatively straight-forward.

The Flint core, and modules, can be used via a command line interface, graphical user interface or as a software library. A MapReduce/Hadoop program that makes use of Flint as a software library is also included.

The following focuses on the main features:


The core module provides an interface for new format-specific implementations, which makes it easy to write a new module. The implementation is provided with a straight-forward core workflow from the input-file to standardised output results. Several optional functionalities (e.g. schematron-based validation, exception and time-out handling of the validation process of corrupt files) help to build a robust validation module.


Visualisation of Flints core functionality; a format-specific implementation can have domain-specific validation logic on code-level (category C) or on configuration level (categories A and B). The emphasis is on a simple workflow from input-file to standardised check results that bring everything together.

Policy-focused validation

The core module optionally includes a schematron-based policy-focused validator. 'Policy' in this context means a set of low-level requirements in form of a schematron xml file, that is meant to validate against the xml output of other third-party programs. In this way domain-specific validity requirements can be customised and reduced to the essential.  For example: does this PDF require fonts that are not embedded?

We make use of Johan’s work for a Schematron check of Apache Preflight outputs, introduced in this blog post. Using Schematron it is possible to check the XML output from tools, filtering and evaluating them based on a set of rules and tests that describe the concrete requirements of *your* organisation on digital preservation.



Aside from its internal logic, Flint contains internal wrapper code around a variety of third-party libraries and tools to make them easier to use, ensuring any logic to deal with them is in one place only:

* Apache PDFBox

* Apache Tika

* Calibre

* EPUBCheck

* iText - if this library is enabled note that it is AGPL3 licensed

These tools all do (a) something slightly different or (b) do not have full coverage of the file formats in some respects or (c) they do more than what one actually needs. All these tools relate more or less to the fields of PDF and EPUB validation, as these are the two existing implementations we're working on at the moment.

Format-specific Implementations
  • flint-pdf: validation of PDF files using configurable, schematron-based validation of Apache Preflight results and additionally internal logic and all tools in the list above to focus on DRM and Wellformedness
  • flint-epub: validation of EPUB files using configurable, schematron-based validation of EPUBCheck results and additionally internal logic and all tools in the list above to focus on DRM and Wellformedness

NOTE: both implementations are work-in-progress, and should be a good guide for how to implement your own format-validation implementation using Flint.  It would be easy to add a Microsoft Office file format module that looked for DRM etc, for example.



Visualisation of the Flint ecosystem with different entry points and several format/feature-specific implementations (deep blue: existing ones, baby blue: potentially existing ones); the core, as visualised in Figure 1 connects the different ends of the ‘ecosystem’ with each other


how we are using it

Due to the recent introduction of non print legal deposit The British Library is preparing to receive large numbers of PDF and EPUB files. Development of this tool has been discussed with operational staff within the British Library and we aim for it to be used to help determine preservation risks within the received files.

what’s next

Having completed some initial large-scale testing of a previous version of Flint we plan on running more large-scale tests with the most recent version.  We are also interested in the potential of adding additional file format modules; work is underway on some geospatial modules.

help us make it better

It’s all out there (the schematron utils are part of our tools collection at, Flint is here:, please use it, please help us to make it better.

Preservation Topics: CharacterisationPreservation RisksSCAPE
Categories: Planet DigiPres

How much of the UK's HTML is valid?

Open Planets Foundation Blogs - 2 July 2014 - 12:05pm

I thought OPF members might be interested in this UK Web Archive blog post I wrote on format identification and validation of our historical web archives: How much of the UK's HTML is valid?

Preservation Topics: Identification
Categories: Planet DigiPres

OOXML: The good and the bad

File Formats Blog - 27 June 2014 - 12:05pm

An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.

On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.

Tagged: Microsoft, standards, XML
Categories: Planet DigiPres

SCAPE Demo Day at Statsbiblioteket

Open Planets Foundation Blogs - 27 June 2014 - 8:38am

Statsbiblioteket (The State and University Library, Aarhus, hereafter called SB) welcomed a group of people from The Royal Library, The National Archives, and Danish e-Infrastructure Cooperation on June 25, 2014. They were invited for our SCAPE Demo day where some of SCAPE’s results and tools were presented. Bjarne S. Andersen, Head of IT Technologies, welcomed everybody and then our IT developers presented and demonstrated SB’s SCAPE work.

The day started with a nice introduction to the SCAPE project by Per Møldrup-Dalum, including short presentations of some of the tools which would not be presented in a demo.  Among others this triggered questions about how to log in to Plato – a Preservation Planning Tool developed in SCAPE.

Per continued with a presentation about Hadoop and its applications. Hadoop is a large and complex technology, which was already decided to use before the project started. This has resulted in some discussion during the project, but Hadoop has proven really useful for large-scale digital preservation. Hadoop is available both as open source and as commercial distributions. The core concept of Hadoop is the MapReduce algorithm which was presented in the paper “MapReduce: Simplified Data Processing on Large Clusters” in 2004 by Jeffrey Dean and Senjay Ghemawat. This paper prompted Cutting and Cafarella to implement Hadoop and they published their system under an open source license. Writing jobs for Hadoop has traditionally been done by using the Java programming language, but in the recent years several alternatives to Java have been introduced, e.g. Pig Latin  and Hive  Other interesting elements in a Hadoop cluster are HBase, Mahout, Giraph, Zookeeper and a lot more. At SB we use an Isilon Scale-Out NAS storage cluster which enables us to make a lot of different experiments on the four 96GB RAM CPU nodes each with a 2 Gbit Ethernet interface. This setup potentially makes the complete online storage of SB reachable for the Hadoop cluster.

                                            Sometimes it is hard to fit an elephant in a library


Bolette A. Jurik was next in line and told the story about how Statsbiblioteket wanted to migrate audio files using Hadoop (and Taverna…. and xcorrSound Waveform Compare). The files were supposed to be migrated from mp3 to wav. Checking this collection in Plato gave us the result ‘Do nothing’ – meaning leave the files as mp3. But we still wanted to perform the experiment – to test that we have the tools to migrate, extract and compare properties, validate the file format and compare the content of the mp3 and wav files, and that we can create a scalable workflow for this.". We did not have a tool for the content comparison, so we had to develop one, xcorrSound waveform-compare. The output shows which files need special attention – as an example one of the files failed the waveform comparison although it looked right. This was due to a lack of content in some parts of the file so Waveform Compare had no sound to compare! Bolette also asked her colleagues to create "migrated" soundfiles with problems that the tool would not find – read more about this small competition in this blog post.

Then Per was up for yet another presentation – this time describing the experiment: Identification and feature extraction of web archive data based on Nanite. The test was to extract different kinds of metadata (like authors, GPS coordinates for photographs etc.) using Apache Tika, DROID, (and libmagic) . The experiment was run on the Danish Netarchive (archiving of the Danish web – a task undertaken by The Royal Library and SB together). For the live demo a small job with only three ARC files was used – taking all of the 80,000 files in the original experiment would have lasted 30 hours.  Hadoop generates loads of technical metadata that enables us to analyse such jobs in detail after the execution. Per’s presentation was basically a quick review of what is described in the blog post A Weekend With Nanite.

An analysis of the original Nanite experiment was done live in Mathematica presenting a lot of fun facts and interesting artefacts. For one thing we counted the number of unique MIME types in the 80,000 ARC files or 260,603,467 individual documents to

  • 1384 different MIME types were reported by the HTTP server at harvest time,
  • DROID counted 319 MIME types,
  • Tika counted 342 MIME types.

A really weird artefact was that approx. 8% of the identification tasks were complete before they started! The only conclusion to this is that we’re experiencing some kind of temporal shift that would also explain the great performance of our cluster…

Two years ago SB concluded a job that had run for 15 months. 15 months of FITS characterising 12TB of web archive data. The experiment with Nanite characterised 8TB in 30 hours. Overall this extreme shift in performance is due to our involvement in the SCAPE project.

After sandwiches and a quick tour to the library tower Asger Askov Blekinge took over to talk about Integrating the Fedora based DOMS repository with Hadoop. He described Bitmagasinet (SB’s data repository) and DOMS (SB’s Digital Object Management System based on Fedora) and how our repository is integrated with Hadoop.

SB is right now working on a very large project to digitize 32 million pages of newspapers. The digitized files are delivered in batches and we run Hadoop map/reduce jobs on each batch to do quality assurance. An example is to run Jpylyzer on a batch (Map runs Jpylyzer on each file, Reduce stores the results back in DOMS). The SCAPE way to do it includes three steps:

  • Staging – retrieves records
  • Hadooping – reads, works and writes new updated records
  •  Loading  - stores updated records in DOMS

The SCAPE Data model is mapped with the newspapers in the following way:

                                           SCAPE Data Model mapped with newspapers

SCAPE Stager/Loader creates a sequence file which can then be read and each record updated by Hadoop and after that the records are stored in DOMS.

The last demo was presented by Rune Bruun Ferneke-Nielsen. He described the policy driven validation of JPEG 2000 files based on Jpylyzer and performed on SB’s Newspaper digitization project. The newspapers are scanned from microfilms by a company called Ninestars, and then quality assured by SB’s own IT department. We need to make sure that the content conforms to the corresponding file format specifications and that the file format profile conforms to our institutional policies.


530,000 image files have been processed within approx. five hours.

We want to be able to receive 50,000 newspaper files per day and this is more than one server can handle. All access on data for quality assurance etc. is done via Hadoop. Ninestars runs a quality assurance before they send the files back to SB and then the files are QA’ed again inhouse.

                                                               Fuel for the afternoon (Photo by Per Møldrup-Dalum)

One of the visitors at the demo is working at The Royal Library with the NetArchive and would like to make some crawl log analyses. These could perhaps be processed by using Hadoop - this is definitely worth discussing after today to see if our two libraries can work together on this.

All in all this was a very good day, and the audience learned a lot about SCAPE and the benefits of the different workflows and tools. We hope they will return for further discussion on how they can best use SCAPE products at their own institutions.

Preservation Topics: SCAPE AttachmentSize ElephantOnSB.png662 KB DataModel.png70.11 KB NewspaperQA.png160.83 KB Fuel for the afternoon (Photo by Per Møldrup-Dalum)45.11 KB
Categories: Planet DigiPres

Bulk disk imaging and disk-format identification with KryoFlux

Open Planets Foundation Blogs - 26 June 2014 - 3:15pm
The problem

We have a large volume of content on floppy disks that we know are degrading but which we don't know the value of.

  1. We don't want to waste time/resources on low-value content.
  2. We don't know the value of the content.
  3. We want to be able to back up the content on the disks to ensure it doesn't degrade any more than it already has.
  4. Using unskilled students to do the work is cost-effective.
  5. Unskilled students have often never seen "floppy" disks, let alone can distinguish between different formats of floppy disk. So we need a solution that doesn't require them to differentiate (e.g. between apple formats, PC formats, Amiga, etc).
  1. Make KryoFlux stream files using the KryoFlux hardware and software.
  2. Use the KryoFlux software to create every variant of disk image from those streams
  3. Use the mount program on Linux to mount each disk image using each variant of file system parameter. 
  4. Keep the disk images that can mount in Linux (as that ability implies that they are the right format).

Very rough beginnings of a program to perform the automatic format identification using the KryoFlux software and Mount are available here.

Issues with the solution
  1. When you use the KryoFlux to create raw stream files it only seems to do one pass of each sector. Whereas when you specify the format it will try to re-read sectors that it identifies as "bad sectors" in the first pass. This can lead to it successfully reading those sectors when it otherwise wouldn't. So using the KryoFlux stream files may not lead to as much successful content preservation as you would get if you specified the format of the disk before beginning the imaging process. I'm trying to find out whether using "multiple" in the output options in the KryoFlux software might help with this
  2. Mount doesn't mount all file-systems - though as this is improved in the future the process could be re-run
  3. Mount can give false positives
  4. I don't know whether there is a difference between disk images created with Kroflux using many of the optional parameters or using the defaults. For example there doesn't appear to be a difference in mount-ability of disk images created where the number of sides is specified or disk images when it is not and defaults to both sides (for e.g. MFM images the results of both seem to mount successfully).
  5. Keeping the raw streams is costly. A disk image for a 1.44mb floppy is ~1.44mb. The stream files are in the 10s of MBs
Other observations:
  1. It might be worth developing signatures for use in e.g. DROID to identify the format of the stream files directly in the future. Some e.g. emulators can directly interact with the stream files already I believe
  2. The stream files might provide a way of over-coming bad-sector based copy protection, (e.g. the copy protection used in Lotus 1-2-3 and Lotus Jazz) by enabling the use of raw stream files (which -i believe- contain the "bad" sectors as well as good) in emulators

Thoughts/feedback appreciated

Preservation Topics: IdentificationPreservation RisksBit rotTools
Categories: Planet DigiPres

Will the real lazy pig please scale up: quality assured large scale image migration

Open Planets Foundation Blogs - 24 June 2014 - 9:12am

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another.

There are many reasons why such a scenario may be of relevance in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly decrease storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality assured process is essential to allow for reconstruction of the original file instances and especially to determine when deletion of original uncompressed files is needed – this is the only way to realize the advantage of reducing storage costs. Based on these assumptions we have developed the following use case: Uncompressed TIFF image files are converted into compressed JPEG2000 files; the quality of the converted file is assured by applying a pixel for pixel comparison between the original and the converted image.

For this, a sequential Taverna concept workflow was first developed, which was then modelled into a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing paths to the TIFF files to be converted. This text file is then transformed into a list that allows the sequential conversion of each file, hence simulating a non-scalable process. Before the actual migration commences, validity of the TIFF file is checked. This step is realized by using FITS - a wrapper that applies different tools to extract the identification information of a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information. If the file is valid, migration from TIFF to JPEG2000 can begin. The tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer – a validator as well as feature extractor for JPEG2000 images created within the SCAPE Project – is employed. Again, an Xpath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, the file is reconverted into a TIFF file, again using OpenJPEG 2.0. Finally, in a last step the reconverted and the original TIFF files are compared pixel for pixel using LINUX based ImageMagick. Only through the successful execution of this final step can the validity as well as the possibility of a complete reconversion be assured. taverna workflow

Figure 1 (above): Taverna concept workflow

In order to identify how much time was consumed by each element of this workflow, we ran a test consisting of the migration of 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and five minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison was finished in 62 minutes. The SCAPE developed tool Jypylizer required only 18 minutes and was thus much faster than the above mentioned steps. diagram taverna workflow

Figure 2 (above): execution times of each of the concept workflows' steps

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster containing five Data Nodes and one Name Node (specification: see below). Besides having economic advantages – Hadoop runs on commodity hardware – it also bears the advantage of being designed for failure, hence reducing the problems associated with hardware crashes.

The distribution of tasks for each core is implemented via MapReduce jobs. A Map job splits the handling of a file. For example, if a large text file is to be processed, a Map job divides the file into several parts. Each part is then processed on a different node. Hadoop Reduce jobs then aggregates the outputs of the processing nodes again to a single file.

But writing MapReduce jobs is a complex matter. For this reason, the programming language Apache Pig is used. Pig was built for Hadoop and translates a set of commands in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce jobs much easier or, as Professor Jimmy Lin described the powerful tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”

Hadoop HDFS, Hadoop MapReduce and Apache Pig make up the foundation of the scalability on which the SCAPE tools ToMaR and XPath Service are based. ToMaR wraps command line tasks for parallel execution as Hadoop MapReduce jobs. These are in our case the execution of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, the simultaneous execution of these tools on several nodes is possible. This has a great impact on execution times as Figure 3 (below) shows.

The blue line represents the non-scalable Taverna workflow. It is clearly observable how the time needed for file migration increases in proportion to the number of files that are converted. The scalable workflow, represented by the red line, shows a much smaller increase in time needed, thus suggesting that scalability has been achieved. This means that, by choosing the appropriate size for the cluster, it is possible to migrate a certain number of image files within a given time frame. Performance_image_migration

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

Below is the the specification of the Hadoop Cluster where the master node runs the jobtracker and namenode/secondary namenode daemons, and the worker nodes each runs a tasktracker and a data node daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x Xeon E5620@2.40GHz
  • Quadcore CPU (16 HyperThreading cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy); 2TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x Xeon X3440@2.53GHz
  • Quadcore CPU (8 HyperThreading cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; configured as RAID0 (performance); 2TB effective disk space

However, the throughput we can reach using this cluster and pig/hadoop job configuration is limited; as figure 4 shows, the throughput (measured in Gigabytes per hour - GB/h) is rapidly growing when the number of files being processed is increased, and then stabilises at a value around slightly more than 90 Gigabytes per hour (GB/h) when processing more than 750 image files. throughput_gb_per_h

Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed

As our use case shows, by using a variety of tools developed in the SCAPE Project together with the Hadoop framework it is possible to distribute the processing on various machines thus enabling the scalability of large scale image migration and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to fit the size of the job so that it can be completed within a given time frame.

Apart from the authors of this blog post, the following SCAPE Project partners contributed to this experiment:

  • Alan Akbik, Technical University of Berlin
  • Matthias Rella, Austrian Institute of Technology
  • Rainer Schmidt, Austrian Institute of Technology
Preservation Topics: MigrationSCAPEjpylyzer
Categories: Planet DigiPres

Library of Congress format recommendations

File Formats Blog - 23 June 2014 - 8:39pm

The Library of Congress has issued a set of recommendations for formats for both physical and digital documents. The LoC’s digital preservation blog has an interview with Ted Westervelt of the LoC on their development. They’re not just for the library’s own staff, he explains, but for “all stakeholders in the creative process.”

The guidelines repeatedly state: “Files must contain no measures that control access to or use of the digital work (such as digital rights management or encryption).” That’s pushback that can’t be ignored. In some cases, though, the message is mixed. For theatrically released films, standard or recordable Blu-Ray is accepted, but the boilerplate against DRM is included. I don’t know where they expect to get DRM-free Blu-Ray, but DRM-free options are few when it comes to big-name movies.

It’s also interesting that software, specifically games and learning materials, is included. This has been a growing area of interest in recent years. Rather than relying on emulation, the recommendations call for source code, documentation, and a specification of the exact compiler used to build the application.

There’s material here to fuel constructive debate and expansion for years.

Tagged: libraries, preservation, standards
Categories: Planet DigiPres

Interview with a SCAPEr - Leïla Medjkoune

Open Planets Foundation Blogs - 20 June 2014 - 11:59am
Leïla MedjkouneWho are you?

My name is Leïla Medjkoune and I am responsible for the Web Archiving projects and activities at Internet Memory.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

My involvement in Scape is twofold. I am working as a project manager, following the project and ensuring that Internet Memory as a partner fulfils the project plan. I am also involved as a functional expert, representing web archivists’ needs. I am therefore working within several areas of the project such as Quality Assurance and the Web Testbed work, where I contribute to the development of tools and workflows in relation to web archiving.

Why is your organisation involved in SCAPE?

Since its creation in 2004 Internet Memory actively participates in improving the preservation of the Internet. It supports cultural institutions involved in web archiving projects through its large scale shared platform, by building its own web archive and also by developing innovative methods and tools, such as its own crawler, MemoryBot, either internally or as a result of participation in EU-funded research projects, aiming to tackle web archiving and large scale preservation challenges. As part of SCAPE, Internet Memory is willing to test, develop and hopefully implement within its infrastructure, preservation tools and methods, including an automated visual quality tool applied to web archives.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a very interesting project with a quite complex organisation. This is due to the fact that we are looking at a broad range of tools and methods trying to tackle a variety of preservation issues. Beyond the organisational aspects, one of the biggest challenges is as stated within the acronym, to answer the scalability issues currently met by most archives and libraries. This is even more critical for web archives as the amount of the heterogeneous content to preserve and to provide access to is constantly growing in size. Another challenge will be to disseminate SCAPE's outcomes so that they reach the preservation community and will be used within libraries, archives and preservation institutions in general.

What do you think will be the most valuable outcome of SCAPE?

As most web archives, we are willing to implement robust automated tools within our infrastructure that could not only facilitate operations but would also reduce costs. Improving characterisation tools so that they scale and developing QA tools designed for web archives, such as the Pagelyzer, are the most useful outcomes from our perspective. We are also strongly involved within the SCAPE platform work and believe this platform is a useful example of how several preservation tools and systems can be integrated within one single infrastructure.

Contact information:

Leïla Medjkoune

Preservation Topics: SCAPE
Categories: Planet DigiPres

Update on JHOVE

File Formats Blog - 18 June 2014 - 12:50pm

I’ve updated the UTF-8 module in the JHOVE source on Github to include the new code blocks for Unicode 7.0.0. Also, I’ve recently fixed the pom.xml file so it will put both the command line and the GUI JAR files into the local repository.

I need more input before I’m comfortable with creating a release 1.12 of JHOVE. I don’t have any prior experience with creating a public, open-source project that’s built with Maven, and I don’t know how much of the baggage of the SourceForge project really needs to be kept. There are some specialty JARs in the old project, but I don’t know if anyone uses them. Most importantly, there still needs to be a distribution in Zip and Tar formats. New features would be interesting, but the first thing is to make a JHOVE that was as useful as it was before.

Comments, suggestions, and code contributions are welcome, as always.

Tagged: JHOVE, preservation, software, Unicode
Categories: Planet DigiPres

New blocks in Unicode 7

File Formats Blog - 18 June 2014 - 10:04am

Unicode 7.0.0 has been released, with 2.834 new character codes. It’s been fascinating looking into some of the blocks that have been added; here’s a sampling.

Bassa Vah is a really obscure script from what is now Liberia, possibly predating the country. Old Permic is supposed to be a close relative of Cyrillic, but any visual resemblance is lost on me.

Some of the writing systems came from a religious impulse. Mende Kikakui was devised by an Islamic scholar and was once widely used for the Mende language in Africa. It’s been mostly displaced by the Latin alphabet. Shong Lue Yang introduced the Pahawh Hmong writing system for the Hmong language in southeast Asia, claiming to have received it from God. Pau Cin Hau, named after its creator, was a 20th century system used for religious writings in Burma. Its original version had over a thousand characters, but the Unicode block is based on the 57-character alphabetic system. The Manichaean alphabet is fascinating just because of its name, recalling the conflicts in early Christianity. According to tradition, Mani, the founder of Manichaeanism, created the alphabet.

Finally, one of the oldest writing systems in the world, Linear A, is new in Unicode 7. It’s from ancient Crete, and no one knows how to read its texts. Now you can create computer documents in it, if you’re a scholar of old languages or just like confusing people.

Still no Klingon, though.

Now the JHOVE UTF-8 module needs to be updated for all these new blocks.

Tagged: standards, Unicode
Categories: Planet DigiPres

What to call non-Google-worthy questions

Digital Continuity Blog - 17 June 2014 - 5:54pm

Here is something I’ve been discussing frequently recently and would love some input on. 

I believe we need a word for questions that you don’t know the answer to, but which are so unimportant/insignificant/uninteresting that you can’t even be bothered googling them. 

Since realising that there is a need for this I’ve noticed that the concept/situation comes up quite frequently (though it might just be a case of the Baader-Meinhof Phenomenon).

Any ideas?

Categories: Planet DigiPres

APARSEN Webinar: Interoperability of Persistent Identifiers

Alliance for Permanent Access News - 11 June 2014 - 8:38am

25th June 2014, 15:00 CET, 2 pm UK time, 9 am East Coast US. The webinar is scheduled to last 2 hours.

Come and join the next APARSEN Webinar to learn about new interoperability solutions for Persistent Identifier systems for digital objects, authors and contributors. In the current Web environment, many initiatives have been launched for Persistent Identifiers and a variety of different communities are working with their own set of schemas and resolution services. For digital preservation it is of upmost importance that Persistent Identifiers schemes remain operable in the future and become interoperable among each other in order to ensure long-term accessibility, exchange and reuse of scientific and cultural data. In the Webinar experts will present different approaches to defragment the current situation and potential benefits for users of implementing services across PI domains and communities.
These results will feed into the Virtual Centre of Expertise network under development by the APARSEN project.


  1. Maurizio Lunghi & Emanuele Bellini, Fondazione Rinascimento Digitale: “A model for interoperability of PI systems and new services across domains
  2. Sunje Dallmeier-Tiessen & Samuele Carli, CERN: “Implementation of persistent identifiers in a large scale digital library
  3. Juha Hakala, Finnish National Library: “Revision of the URN standards and its implications to other PIDs
  4. Anila Angjeli, Bibliotheque Nationale de France, ISNI consortium: “ISNI- abridge PID, a hub of links
  5. Laure Haak, ORCID: “Connecting persistent identifiers for people, places, and things: ORCID as an information hub
  6. Barbara Bazzanella, University of Trento: “The Entity Name System (ENS): a technical platform for implementing a Digital Identifier Interoperability Infrastructure
  7. David Giaretta, APA: “The APARSEN Virtual Centre of Expertise


Moderator: Eefke Smit (STM Association)

Further information about the event will be added shortly.

The meeting will be a web-meeting and will take place on megameeting.

Participation is free, no registration required!

Read more about previous APARSEN webinars here.

Categories: Planet DigiPres

APARSEN Webinar: Interoperability of Persistent Identifiers

Alliance for Permanent Access News - 11 June 2014 - 8:38am

25th June 2014, 15:00 CET, 2 pm UK time, 9 am East Coast US. The webinar is scheduled to last 2 hours.

Come and join the next APARSEN Webinar to learn about new interoperability solutions for Persistent Identifier systems for digital objects, authors and contributors. In the current Web environment, many initiatives have been launched for Persistent Identifiers and a variety of different communities are working with their own set of schemas and resolution services. For digital preservation it is of upmost importance that Persistent Identifiers schemes remain operable in the future and become interoperable among each other in order to ensure long-term accessibility, exchange and reuse of scientific and cultural data. In the Webinar experts will present different approaches to defragment the current situation and potential benefits for users of implementing services across PI domains and communities.
These results will feed into the Virtual Centre of Expertise network under development by the APARSEN project.


  1. Maurizio Lunghi & Emanuele Bellini, Fondazione Rinascimento Digitale: “A model for interoperability of PI systems and new services across domains
  2. Sunje Dallmeier-Tiessen & Samuele Carli, CERN: “Implementation of persistent identifiers in a large scale digital library
  3. Juha Hakala, Finnish National Library: “Revision of the URN standards and its implications to other PIDs
  4. Anila Angjeli, Bibliotheque Nationale de France, ISNI consortium: “ISNI- abridge PID, a hub of links
  5. Laure Haak, ORCID: “Connecting persistent identifiers for people, places, and things: ORCID as an information hub
  6. Barbara Bazzanella, University of Trento: “The Entity Name System (ENS): a technical platform for implementing a Digital Identifier Interoperability Infrastructure
  7. David Giaretta, APA: “The APARSEN Virtual Centre of Expertise


Moderator: Eefke Smit (STM Association)

Further information about the event will be added shortly.

The meeting will be a web-meeting and will take place on megameeting.

Participation is free, no registration required!

Read more about previous APARSEN webinars here.

Categories: Planet DigiPres

An Analysis Engine for the DROID CSV Export

Open Planets Foundation Blogs - 3 June 2014 - 7:20am

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base. 

The reason for pushing the code and this blog out now is to seek feedback on what else might be useful for users. Specifically, formatting of output, and seeking other use-cases for how others analyse the same results to understand if I can incorporate these methods into my work to the benefit of the community. 

The hope is that the results output by the tool can be used by digital preservation researchers, analysts, coders, archivists and digital archivists alike - where there is such a distinction to be drawn. 

The tool is split into two or three components depending on how you break it down.

This places DROID CSV export data into a SQLite database with the same filename as the input CSV file.

The process adds two additional columns to the saved table, URI_SCHEME so we can query to a greater granularity the URI scheme used by the various URIs output by DROID; and DIR_NAME to enable analysis on base directory names, e.g. to help us understand the break-down of directories in a collection.

This combines the functions of by calling droid2sqlite's primary class. Further it provides a query layer on top of the DROID SQLite database, outputting the results of various queries we might ask of the dataset to the command line.

This is a class created to help spot potentially difficult to handle file names from any DROID CSV output. The class is based on a Microsoft Development Network article but also checks for non-ascii characters and a handful of other characters that can prove problematic, such as square brackets.

Database and Analysis Engine

This work mirrors some of that done by DROID already. DROID outputs an Apache Derby database for its Profile file format. Information on connecting to it can be found on the droid-list mailing list. For my purposes I had a desire to learn the database management system SQLite and more practically I found a greater amount of support for it in terms of libraries available in Python or the applications I can use to access it. Instead of attempting to access the DROID Derby database and build on top of that, I decided to map the results to a SQLite database. SQLite also has features that I like that might lend itself better to long-term preservation enabling the long term storage of the database alongside any collection analysis documentation outside of the digital preservation system, if necessary.

DROID also enables filtering and the generation of reports, however I haven’t found the way it collects information to be useful in the past and so needed a different approach; an approach that gives me greater flexibility to create more reports or manipulate output.

The DROID CSV export is as simple as it needs to be and provides a lot of useful information and so provided an adequate platform for this work in its own right.

The database engine doesn’t have a hard coded schema; it simply reads the column headers in the CSV provided to the tool. Given the appearance of particular columns it creates two additional columns on top to provide greater query granularity, as mentioned above.

The analysis output includes summary statistics, along with listings of PUIDS and file paths depending on the query that we’re interested in. On top of the summary statistics, the following information is output:

  • Identified PUIDs and format names
  • PUID frequency
  • Extension only identification in the collection and frequency
  • ID method frequency
  • Unique extensions identified across the collection
  • Multiple identification listing
  • MIME type frequency
  • Zero-byte object listing
  • No identification listing
  • Top signature and identified PUIDs list
  • Container types in collection
  • Duplicate content listing
  • Duplicate filename listing
  • Listing of potentially difficult filenames

An example analysis, based on a DROID scan from the re-factored opf-format-corpus I host, can be found here. The summary statistics generated are as follows:

Total files: 500 Total container objects: 14 Total files in containers: 176 Total directories: 85 Total unique directory names: 75 Total identified files (signature and container): 420 Total multiple identifications (signature and container): 1 Total unidentified files (extension and blank): 80 Total extension ID only count: 17 Total extension mismatches: 32 Total signature ID PUID count: 54 Total distinct extensions across collection: 64 Total files with duplicate content (MD5 value): 155 Total files with duplicate filenames: 117 Percentage of collection identified: 84.0 Percentage of collection unidentified: 16.0

One point to note is that DROID can analyse the contents of container files, or not. In the former case it makes it difficult to generate a count of top-level objects (objects not stored within a container). It is, however, useful to understand both counts where possible, but duplication of reports might be undesirable. The creation of a URI_SCHEME column in the database enables this count to be calculated without the need to run DROID twice. The number of top-level objects in the opf-format-corpus can be calculated by subtracting the number of files in containers from the total number of files, so: 324.

Questions that we’re asking…

As we get into the analysis of a number of collections that we hope will be our first born-digital transfers at Archives New Zealand, we find ourselves asking questions of them to ensure they can be ingested with minimal issue into our long-term preservation system. We also want to ensure access is uninhibited for end-users once it arrives there.

Our first attempt at born-digital transfer sees us do this analysis up-front in an attempt to understand the transfers completely, looking at the issues likely to be thrown up on ingest and what pre-conditioning we are likely to have to do before that stage. Some of the questions are also part of a technical appraisal that will help us to understand what to do with examples of files with duplicate content and those that might otherwise be considered non-records or non-evidential e.g. zero-byte files.

The output of the tool represents ALL of the questions that we have considered so far. We do expect there to be more and better questions to be asked as well. Throwing the code and this blog into the public domain can help us build on this work through public input, so:

  • What other information might be useful to query the DROID export for?
  • What output format might be most useful?
  • What formatting of that output format will best lend itself to data visualisation?
  • What other comments and questions do readers of this blog have?

There is still some work to do here. I need to incorporate unit tests into the code base so that everything is as robust as I seek. I imagine future releases of DROID might initially break this tool's compatibility with DROID CSV exports and so that will have to be catered for at the time.

An important note about maintenance is that having created this tool for my day-to-day work I do hope to continue to maintain it for my day-to-day job as necessary.

One of the things I like about accessing DROID results via a database is that we don’t need an analysis layer on top of it. If users have a different requirement of the database than I have catered for then they can simply use the database and use their own queries on top, using their preferred flavour of programming language. Other ways of using such a database might include re-mapping the output to be suitable for cataloguing and archival description, if one desires.

I have considered adding a temporal angle to the database by enabling the storage of multiple DROID reports relating to the same transfer. This could be used to monitor the result of pre-conditioning or analysis of a collection using progressively up-to-date DROID Signature Files. This lends itself to reporting and demonstration of progress to management. The realization of this is more difficult as there doesn’t seem to be a single immutable piece of information we can hook into to make this possible with MD5 hashes likely to change on pre-conditioning, and the potential for file paths to change depending on machine being used to complete a scan. Thoughts on this matter are appreciated.

The tool is licensed under the Zlib license and so can be easily re-used and incorporated into other’s work without issue. 

Preservation Topics: Preservation ActionsIdentificationCharacterisationPreservation RisksToolsSoftware
Categories: Planet DigiPres

A Weekend With Nanite

Open Planets Foundation Blogs - 28 May 2014 - 9:30pm
Well over a year ago I wrote the ”A Year of FITS”( blog post describing how we, during the course of 15 months, characterised 400 million of harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn’t fit that kind of heterogenic data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas including the code base and organisation of the development and it could be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing on today. Today I’ll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.Setting the Scene: The Tools and the DataThe Hardware

When we at the Danish State and University Library (SB) started our work on the SCAPE project, we acquired four machines to support this work. It was on those four machines that the FITS experiment was performed and it is on the same four machines that the present experiment was performed. The big difference being the software that handles the processes and how the data is accessed.

The four machines are all Blade servers each with two six core Intel Xeon 2.93GHz CPUs, 96GB RAM, and 2Gbit network interfaces (details at Much to the contrary of a traditional node for a Hadoop cluster, these machines do not have local data storage. At SB we very much rely on NAS for data storage, more specifically our storage is based on the Isilon scale out NAS solution from EMC. The Isilon system is a cluster designed for storage and at SB it is at the present storing several PB of digital preservation data.

As NAS is a prerequisite for data storage at SB, we have been doing a lot of experiments on how best to integrate Hadoop with that kind of infrastructure. We have tested two Hadoop distributions, Cloudera and Pivotal HD, in different hardware and software configurations and the setup used for this experiment is the best so far while in no way being good enough. For the experiment we used the Cloudera distribution version 4.5.0, which builds upon Hadoop 2.0.0.

The Data

During the last months we have been moving our copy of the Danish Web Archive onto the above-mentioned Isilon cluster and providing online read-only access through NFS. This makes it possible for relevant jobs to access more than half a PB of web documents, roughly more than 18 billion documents harvested during the last decade. That is a lot of data!

These documents are stored as ARC or WARC container files in a shallow directory tree with the ARC and WARC files in the nethermost directory. A few months ago we shifted from storing data in the ARC format to storing the data in the newer WARC format as this format is superior to the older ARC format. This will become relevant later in this post.

To select the data for this large scale experiment, a few simple UNIX find commands were issued giving a file with the file paths of 147,776 ARC files amounting to almost 15TB, roughly around 450 million documents.

The Software

For this large-scale characterisation experiment I substituted FITS with the Nanite project. This project is lead by Andy Jackson from the UK Web Archive and enables DROID, Apache Tika, and libmagic to be effectively used on the Hadoop platform. For this experiment I will not use the libmagic component. Will Palmer of the British Library (BL) has written a blog post on this project ( and how to integrate the Apache Tika parser into Nanite. Will Palmer has also been very helpful with uncovering and rectifying the problems uncovered during the present experiment.

Before unleashing Nanite on the big data set, I did a lot of preliminary tests and tweaks in the source code of Nanite, more specifically the nanite-hadoop module that compiles to a JAR containing a self-contained Hadoop job. (In this blog post Nanite is a synonymy for nanite-hadoop)

One tweak was to write code for extracting the ARC metadata for each document and storing it along side the properties extracted by Tika.

Another tweak was to substitute the newlines in the extracted HTML title elements with spaces as those newlines breaks the data structure of one key/value pair per line in one of the output data files.

Both these two changes are now in the main Nanite code base.

After disabling a small hack that enables the code to run on the BL Hadoop cluster and disabling a preliminary check to ensure that all input files are compressed, which ours are not, the code should be ready for some real test runs.

Not quite so! The Nanite project depends upon another UK Web Archive project called warc-hadoop-recordreaders, which again depends on a version of the Heritrix project that has a bug when dealing with uncompressed ARC files. The Heritrix code simply cannot handle such uncompressed ARC files. When this bug was uncovered (with help from Will and Andy), it was easy fixing the problem by using a newer, but unreleased version of Heritrix (3.1.2-SNAPSHOT) and rebuilding the dependencies.

During some fun and interesting Skype conversations with Will, we decided to have three different output formats for Nanite. So, basically Nanite produces two kinds of outputs. It produces a traditional data set that are created by the mappers and aggregated in the reducers. This data set contains lines of MIME type, format version, and, for DROID, PUID values. The other kind of output is the extracted features of the documents. Each mapper task handles one ARC file at a time and creates a single data set with all the extracted features from all the documents contained in the ARC file. This data set is then stored by the mapper in three different containers.

  • A ZIP file per ARC file that contains one file per document with key/value pairs, one per line.
  • A ZIP file per ARC file containing the serialised metadata objects for each document.
  • A sequence file per ARC files with the serialised metadata objects for each document.

When we know more about who will use this data and to what purpose, those three formats could be reduced to one or substituted by something entirely different, maybe even improving the performance of the process. The above three output formats are only available in Will’s fork of Nanite at

As a last experiment in this prologue I ran Nanite on 221GB of ARC files to test the performance. This test showed a processing speed at 4.5GB/minute. The test created circa 1GB of extracted metadata. Extrapolating these values it seemed that the complete 15TB could be processed in less than three days and giving less than 100GB of new data. The latter would have no problem in fitting on the available HDFS space and three days processing time would be impressive, to say the least.

Running the Experiment

Only thing left was to execute

$ hadoop jar jars/nanite-1.1.5-74-SNAPSHOT-job-will.jar ~/working/pmd/netarkiv-147776 out-147776

and wait three days.

I just forgot that this was big data. I forgot that with sheer data size comes complexity. That with 147,776 ARC files, something had inevitable to go wrong.

Before initiating this first run I had decided that I wouldn’t change code or fiddle with the cluster configuration more that absolutely necessary. My primary focus was to get the job to run by jumping as many fences as I had to get the job to terminate without failure.

So when I after a few seconds got a heap space exception, I didn’t change heap space configuration. Instead I split the input file into two equal sized chunks with the intention of running two separate Hadoop jobs.

When I subsequent got an “Exceeded max jobconf size” I didn’t experiment with the mapred.user.jobconf.limit Hadoop parameter but split the original input file into chunks with only 30,000 ARC references each.

After that I started getting “ZIP file must have at least one entry” exceptions. First I removed 4832 ARC files that didn’t contain documents but instead contained metadata regarding the original web harvest. I didn’t know these files were present until I got the above error. That didn’t fix the problem. Next I discovered that I unexpectedly also had WARC files in my data set and the code presumable couldn’t handle those. The set of 147,776 files was now reduced to 79,831, split into three chunks. As a last guard against the job failing I did something bad. I surrounded the failing code with a try-catch instead of understanding the problem—actually I just postponed the real bug-hunt for another opportunity.

Late Saturday evening the first job was about to complete, but as I didn’t want to stay up too late just for starting a Hadoop job, I dared to try to run two Hadoop jobs at the same time. Wrong decision! The second job started spitting out a lot of ”Error in configuring object”. So I killed it and went to bed. The next morning the first job had, luckily, completed with success and I had the first set of results. Off course I then started the second job yet again on a idle Hadoop cluster, but, very worrying, got the same error as in the previous evening! It being Sunday, I closed the ssh connection and tried to forget all about this for the rest of the weekend.

Monday morning at work this second job started without hick-ups and I still don’t know what went wrong!

The second job ran for ten hours, completing 29,999 ARC files out of 30,000 and then it failed and cleaned everything out. Upon examination of the log files and the input data I discovered an ARC file of size zero! That is a problem, not only for my experiment, but certainly also for the web archive and this discovery has been flagged in the organisation.

I removed the reference to this zero size ARC file, ran the job again with success. In the third job I also discovered two more zero size ARC files whose references were removed and the job completed with success.

After 32 hours of processing time, all 79,829 ARC files was processed and a lot of interesting metadata created for further analysis in addition to new experience and knowledge gained.


As mentioned above the job produces two kinds of metadata, but it is actually three kinds: MIME type identifications of the documents stored in tab separated text files, extracted metadata for each document stored in ZIP files, and data about the job run itself. I will analyse the last kind first.

To count the number of processed documents I ran this Bash script

for f in $(cat all-files.txt) do rn=$(basename $f .zip) s=$(ls -l $f|awk '{print $5}') c=$(unzip -l $f|tail -1|awk '{print $2}') echo "$f, $c, $s" done

that creates a list of ARC file name and document count pairs. This data gives the total amount of processed documents, which were 260,603,467, and the following distribution of number of documents per ARC file

Number of documents per ARC

In the tail now shown we have 350 ARC files with more than 15,000 documents. The ARC file with the most documents counts 41,140 pieces. Again, an anomaly that would be interesting to dive into as well as many other questions this distribution rises.

The execution time for the three sub jobs were

job idamount of data readnumber of ARC files readprocessing time00592.764TB30,00012hrs, 43mins, 53sec00692.763TB30,00010hrs, 46mins, 21sec00731.820TB19,8297hrs, 44mins, 38sec 7.347TB79,82931.24hrs

Basically the cluster processed 3.92GB/minute, or 44 ARC files per minute, or 63,000 ARC files a day, or 2317 documents per second—which ever sounds most impressive.

To dig a bit deeper into this data set, I collected the run time of each of the 79,839 map tasks. This data can be explored in an HTML table on the job level in the Cloudera Manager, but I wrote a small Java job that converts this data into a CSV file: The data was then read into R for analysis. For the interested reader, I created a Gist with the unordered R code that generated the analysis and figures at

The processing time spans from -43s [sic] to 1462s. A very weird observation is that 6,490 tasks, i.e. 8%, were reported as having completed in negative time. As I don’t think the Hadoop cluster utilises temporal shifts, though that would explained the fast processing, this should be investigated further.

Only 83 map task took more than 500s. Ignoring that long thin tail, the distribution of the processing times looks as below

This looks like an expected distribution. The median of this data set is 57s but that includes 8% negative values so it’s unclear what valuable information that gives.

The MIME type identifications are aggregated in 10 files, one per reducer. These files are easily aggregated further into one big file. Eyeball examination of this data file reveals that it contains lots of error messages. The course of these errors should be investigated, and Will has been doing that, but I’ll jump the fence once more, this time using grep

grep -v ^Exception all-parts | grep -v ^IOException > all-parts-cleaned

Still, that was not enough because after trying to read the file into R, I discovered that I needed to remove all single and double quotes

sed 's/"//g' < all-parts-cleaned | sed "s/\'//g" > all-parts-really-cleaned

With that data cleaning completed, the following command finally reads the data into R

r<-read.table("data/all-parts-really-really-cleaned",sep="\t",header=FALSE, comment.char="",col.names=c("?","http","droid","tika","tikap","year","count"))

Observe the comment.char=”” argument. This is necessary as R assumes everything after a # is a comment and in three instances in the data an http server reported a MIME type value as a series of #s. Why? The same server? This just gives rise to even more interesting questions that can be asked about the uncovered anomalies in this data.

Apart from the above mentioned error messages, the data also contain another kind of error. The Tika parser has timed-out on a lot of the documents, 177 million of them, to be exact. That is a 67% failure rate for the Tika parser and I would consider that a serious error. Fortunately this error has already been dealt with by Will and pushed to the main Nanite project.

Had I chosen to include the harvest year into this data during the job, the detected MIME types could be correlated with harvest year. Instead I can compare the distribution of MIME types between the values reported by the server that served the documents, the DROID detection, and the Tika detection.

The web servers reported 1370 different MIME types; Tika detected 342; and DROID 319.

A plot of the complete set of these MIME types is presented for completeness

If we select the top 20 MIME types we get a more clear picture

Top 10 with MIME type names



It seems that the web servers generally claim that they deliver a few documents with a lot of few different MIME types and a lot of documents with a few MIME types, primarily HTML pages, when in fact we receive quite a diversified set of document types as detected by Tika and DROID. Also, this data set can actually give a confidence factor for the trustworthiness the Danish web servers, even a time series plot of the evolution of such a confidence factor. Or…, Or… The questions just keep coming…

Manually examination of the MIME type data set also reveals fun stuff. E.g. how did a Microsoft Windows Shortcut end up on the Internet? Or what about doing a complete virus scan of all the documents. This could give data for research in the evolution of computer viruses?

There’s no end to the interesting questions when browsing this data. And that’s just looking at one feature, namely the MIME type.

The third kind of data created in the Hadoop job gives a basis for much more detailed observations. We’ve got EXIF data, including GPS if available, HTML title and keyword elements, authors, last modification time, etc. I selected at random a data file from a typical ARC file, i.e. one that contains circa the average of 3500 documents. This data file has 430 unique properties!

These extracted properties are collected in 79,829 ZIP files and before any analysis would be feasible, these files should be combined into one big sequence file and a few Pig Latin UDFs should be written. This would facilitate a kind of explorative analysis. How close to real-time read, eval, print, loop, such a investigation could be, remains to be seen, as this is, unfortunately, a task for another day.

Lessons learned

First off, thank you if you’ve read this far. I did put quite a bit of detail into this blog post—without any TL;DR warning.

The process of going from the idea of this experiment to finishing this blog post has been long but very rewarding.

I’ve uncovered bugs in and added features to Nanite in fun collaboration with Will Palmer and this is far from over. I will continue working with Will on Nanite as I see great value in this tool. As a start I would like to address all the issues uncovered as described in this blog post.

I’ve learned that CPU time actually is much cheaper than developer time. If the cluster should run for 10 hours and fail in the last few minutes, who cares? It’s better than me spending hours trying to dig up a small bug that might be hidden somewhere down in the deepest Java dependencies. Up till a certain break-even threshold, that is!

I’ve learnt a valuable lesson: Know thy data! When performing jobs that potentially could run for weeks, it is very important that you know the data you’re trying to crunch. Still, it’s equally as important to have your tools be indeed very robust. They should be able to survive anything that might be thrown at them, because, running on large data amounts like this, every kind of valid and non-valid data will be encountered. I.e. if one file in a million is a serious problem, you will encounter 17,000 serious problems processing the web archive. If the job runs for 3 months, that’s almost 10 serious problems an hour.

All this being said and done, we must not forget why we do this. It’s not for the sake of creating fast tools, nor reliable tools. It is to enable the curators to preserve our shared data as easy and trustworthy as possible. Also, especially relevant for web documents, it’s for the benefit of the researches in the humanities. To enable them to get answers to all the different questions they could possible imagine asking to such a huge corpora of documents from the last decade.

Even though this experiment answered some questions, I now stand with even more questions to be answered. Oh, and I need to run Nanite on 18 billion documents that are just waiting for me on a NFS mount point…

Preservation Topics: Preservation ActionsIdentificationCharacterisationWeb ArchivingToolsSCAPE
Categories: Planet DigiPres

New SCAPE project business case tools

Open Planets Foundation Blogs - 21 May 2014 - 8:06pm

Over the last month I've been working with the Open Planets Foundation, as part of the SCAPE Project, to develop some guidance materials to help practitioners understand and leverage the business context to some exciting new SCAPE technologies. The SCAPE Business Case Templates provide detailed guidance on building a business case focused on the application of SCAPE tools. It's all about understanding, applying and selling relevant benefits, costs and risks.

Each template introduces one of the three SCAPE technologies before exploring the business context and the case for putting them into action. The templates cover:

  • One of the focused preservation tools developed by SCAPE: Jpylyzer
  • The scalable architecture and toolset developed by SCAPE to address the preservation processing of large datasets: The SCAPE Platform
  • The assessment and decision making tools and approaches developed by SCAPE: The Planning and Watch Suite

A core part of each of the templates is a set of concise business benefits of relevance to each technology. A business case is of course all about that cost/benefit ratio so the detailed benefits are accompanied by notes on costs and risks. This information provides many of the raw materials needed in a business case, but they need to be adapted and applied carefully so that they align with organisational objectives, use appropriate language for the target audience and sell aspects of the work specific to the case in question. Each template therefore includes a business case example illustrating how benefits can be applied to build a strong business case in a specific context.

The templates can be viewed as a stand alone deliverable from SCAPE, but they've also been carefully integrated with an existing resource: the Digital Preservation Business Case Toolkit (DPBCT). This provides users of the SCAPE work with lots of useful background guidance and examples. It also ensures that these SCAPE results are able to live on, post project, as part of a toolkit under the stewardship of the Digital Preservation Coalition. As I said, the Templates can be viewed as standalone guides, but elements of the text are also embedded (using Mediawiki transclusion) in relevant sections of the DPBCT. This makes for a mutually beneficial collaboration with enhancement for DPBCT while still delivering detailed business case examples focused on SCAPE tech. I'm very keen of project work that builds on, extends or otherwise enhances existing resources rather than reinventing the wheel and then dying out at project end, and I think we've managed to pull that off here.

Paul Wheatley, Paul Wheatley Consulting Limited

Preservation Topics: SCAPE
Categories: Planet DigiPres

Catalogue of Policy Elements

Open Planets Foundation Blogs - 20 May 2014 - 8:55pm

Writing preservation policies might be a daunting task, but support is underway!

The SCAPE project has produced  a Catalogue of Preservation Policy elements, which is now available as a wiki. Here you will find an explanation of the SCAPE Policy Framework, consisting of 3 levels of policies. From high level or Guidance Policies to Preservation Procedure Policies to a very detailed level of policies useful for automatic workflows in preservation.

Preservation Procedure Policies, the intermediate level,  is designed to assist you to  create or update  your preservation policies. This level describes the approach an organisation intends to take in order to achieve their high level goals. It is this level that you can find in the Catalogue of Preservation Policy Elements. The catalogue offers a unique overview of policy elements that could be part of your preservation policy.

How does this work?

Each Guidance Policy has a set of related Preservation Procedure Policies. The details of these are described, using a template with a variety of information, for example a definition of the policy, the life cycle phase in which this policy will be relevant, a suggestion of who should be involved in creating the policy etc. For more inspiration you can also have a look at our collection of published preservation policies.

Although the SCAPE project is mainly focused on libraries, data centers and web archives, we believe that this Catalogue is also relevant for other disciplines with a preservation task.

We are convinced that, the Catalogue of Preservation Policy Elements will need to be updated based on new insights in digital preservation, even after the SCAPE project finishes in September. So we would like to invite you to send us your feedback, either by adding this in the Catalogue (each page offers an opportunity to add comments), or to send feedback to Barbara.Sierman@KB.NL.

The final version of the Catalogue of Preservation Policy Elements was created by:  Barbara Sierman (National Library of the Netherlands),   Catherine Jones (Science and Technologies Facilities Council, UK) and Gry Elstrøm (State and University Library, Aarhus, Denmark)

Interested? Join the SCAPE Webinar on Preservation Policies on May 28, 14.00 hrs CET.

Preservation Topics: SCAPE
Categories: Planet DigiPres