Open Planets Foundation Blogs

The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.

Meet SCAPE, APARSEN and many more….

27 August 2014 - 9:03am

SCAPE and APARSEN have joined forces and are hosting a free workshop, ‘Digital Preservation Sustainability on the EU Policy Level’, in connection with the upcoming DL2014 conference in London.

The first part of the workshop will be a panel session at which David Giaretta (APARSEN), Ross King (SCAPE), and Ed Fay (OPF) will be discussing digital preservation.

After this a range of digital preservation projects will be presented at different stalls. This part will begin with an elevator pitch session at which each project will have exactly one minute to present their project.

Everybody is invited to visit all stalls and learn more about the different projects, their results and thoughts on sustainability. At the same time there will be a special ‘clinic’ stall at which different experts will be ready to answer any questions you have on their specific topic – for instance PREMIS metadata or audit processes.

The workshop takes place at City University London, 8 September 2014, 1pm to 5pm.

Looking forward to meeting you!


Read more about the workshop

Register for the workshop (please note: registration for this workshop should not be done via the DL2014 registration page)

Read more about DL2014

Oh, did I forget? We also have a small competition going on… Read more.


Preservation Topics: SCAPE
Categories: Planet DigiPres

When is a PDF not a PDF? Format identification in focus.

21 August 2014 - 10:40am

In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.


I'm considering format identification in its simplest role: first contact with a file about which little, if anything, is known. In these circumstances the aim is to identify the format as quickly and accurately as possible, then pass the file to format-specific tools for deeper analysis.

I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.

Software and data

I performed the tests using the selected GovDocs corpus (that's a large download, BTW) that I mentioned in my last post. I chose four format identification tools to test:

  • the fine free file utility (also known simply as file),
  • DROID,
  • FIDO, and
  • Apache Tika.

I used versions as up to date as possible but will spare the details until I publish the results in full.

So is this a PDF?

There was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post. Google Chrome won't open it, but my Ubuntu-based document viewer does. It's a three-page PDF about rumen microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.

What's happening here?

It appears we have a malformed PDF, and this is indeed the case. The issue is caused by a difference in the way the tools go about identifying PDFs in the first place. This is where it gets a little dull, but bear with me. All of these tools use "magic" or "signature" based identification, meaning that they look for (hopefully) unique strings of characters at specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:

<match value="%PDF-" type="string" offset="0"/>

What this says is: look for the string %PDF- (the value) at the start of the file (offset="0") and, if it's there, identify the file as a PDF. The attached file indeed starts:

%PDF-1.2


meaning it's a PDF version 1.2. Now we can have a look at the DROID signature (signature file version 77) for PDF 1.2:
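The core of a Tika-style check is small enough to sketch in a few lines of Python (an illustration of the principle only, not how Tika is actually implemented):

```python
def looks_like_pdf(path):
    """Tika-style check: match the string '%PDF-' at offset 0."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```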

<InternalSignature ID="125" Specificity="Specific">
    <ByteSequence Reference="BOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="0" SubSeqMinOffset="0">
            <Sequence>255044462D312E32</Sequence>
            <DefaultShift>9</DefaultShift>
            <Shift Byte="25">8</Shift>
            <Shift Byte="2D">4</Shift>
            <Shift Byte="2E">2</Shift>
            <Shift Byte="31">3</Shift>
            <Shift Byte="32">1</Shift>
            <Shift Byte="44">6</Shift>
            <Shift Byte="46">5</Shift>
            <Shift Byte="50">7</Shift>
        </SubSequence>
    </ByteSequence>
    <ByteSequence Reference="EOFoffset">
        <SubSequence MinFragLength="0" Position="1"
            SubSeqMaxOffset="1024" SubSeqMinOffset="0">
            <Sequence>2525454F46</Sequence>
            <DefaultShift>-6</DefaultShift>
            <Shift Byte="25">-1</Shift>
            <Shift Byte="45">-3</Shift>
            <Shift Byte="46">-5</Shift>
            <Shift Byte="4F">-4</Shift>
        </SubSequence>
    </ByteSequence>
</InternalSignature>

This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2, which our sample does. This is the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: the file must contain the string %%EOF within 1024 bytes of the end of the file. There are two things that are different here. The start condition change, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", supports DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf MIME type, so it has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. This is NOT the cause of the problem, though. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF.
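To see the difference in behaviour, DROID's two conditions can be sketched the same way (again just an illustration of the logic, not DROID's actual implementation; real signature matching uses the shift tables shown above):

```python
def droid_style_pdf12_match(path):
    """Sketch of DROID signature 125: the file must start with
    '%PDF-1.2' AND contain '%%EOF' within the last 1024 bytes."""
    with open(path, "rb") as f:
        data = f.read()
    return data.startswith(b"%PDF-1.2") and b"%%EOF" in data[-1024:]
```

Our attached file satisfies the first condition but not the second, which is why DROID and FIDO report no match while file and Tika are happy.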
A hex search of our PDF confirms that it doesn't contain a %%EOF marker.

So who's right?

An interesting question. The PDF 1.3 Reference states:

The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)

The referenced implementation note reads:

3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.

What's next?

I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished, it may have to wait until next week.

Preservation Topics: Identification
Attachment: It looks like a PDF to me.... (44.06 KB)
Categories: Planet DigiPres

Win an e-book reader!

21 August 2014 - 7:30am

On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Policy Level will be held at City University London in connection with the DL2014 conference.

The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.

In connection with the workshop Digital Preservation Sustainability on the EU Policy Level, SCAPE and APARSEN are launching a competition:


Which message do YOU want to send to the EU for the future of Digital Preservation projects?


You can join the competition on Twitter. Only tweets including the hashtag #DP2EU will be considered. You may include a link to a text OR one picture with your message. Note, however, that messages containing more than 300 characters in total are excluded from the competition.

The competition will close on September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as the winner. The winner will receive an e-book reader as a prize.


There are only a few places left for the workshop. Registration is FREE and must be completed by filling out the form here. Please don't register for this workshop on the DL2014 registration page, since this workshop is free of charge!


Categories: Planet DigiPres

Coming to "Preserving PDF - identify, validate, repair" in Hamburg?

12 August 2014 - 10:01am

The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A, and open source tools that can help. This is a quick list of things you can do to prepare for the event if you're attending and looking to get the most out of it.


The Wikipedia entry on PDF provides a readable overview of the format's history with some technical details. Adobe provide a brief PDF 101 post that avoids technical detail.

Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:

This MacTech article is still a reasonable introduction to PDF for developers. Finally, if you really want a detailed look you could try the Adobe specification page, but it's heavyweight reading.


Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools; they all require Java and none of them have user-friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment to allow you to experiment in a friendly, throwaway setting. See the Software section a little further down.


JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.

Apache Tika

The Apache Foundation's Tika project is an application / toolkit that can be used to identify, parse, extract metadata, and extract content from many file formats.  

Apache PDFBox

Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module that verifies PDF/A-1 documents that has a command line utility.

These libraries are of particular interest to Java developers, who can incorporate them into their own programs; Apache Tika uses the PDFBox libraries for PDF parsing.

Test Data

These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:

PDFs from GovDocs selected dataset

The original GovDocs corpora is a test set of nearly 1 million files and is nearly half a terabyte in size. The corpus was reduced in size by David Tarrant, who removed similar items as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.

Isartor PDF/A test suite

The Isartor test suite is published by the PDF Association's PDF/A competency centre. In their own words:

This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.
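In other words, a validator run over the Isartor suite should flag every file as invalid; any file it passes points at a gap in the validator. A minimal harness for that idea might look like this (the trivial `is_valid_pdfa` stand-in below is hypothetical; in practice you'd call out to a real PDF/A validator):

```python
import os

def is_valid_pdfa(path):
    """Hypothetical stand-in for a real PDF/A validator."""
    with open(path, "rb") as f:
        data = f.read()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]

def validate_the_validator(suite_dir):
    """Run the validator over a test suite where every file is known
    to be invalid; return the files the validator wrongly passes."""
    missed = []
    for name in sorted(os.listdir(suite_dir)):
        path = os.path.join(suite_dir, name)
        if name.lower().endswith(".pdf") and is_valid_pdfa(path):
            missed.append(name)  # violation not detected
    return missed
```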

More information about the suite can be found on the PDF Association's website along with a download link.

PDFs from OPF format corpus

The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs, these show problem PDFs from the GovDocs corpus and this is a collection of PDFs with features that are "undesirable" in an archive setting.


If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations/exercises, we'll be providing light virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part, but it does offer the best way to try things for yourself, particularly if you're not a techie. Both are available for Windows, Mac, and Linux and should run on most people's laptops; download links are shown below.

Vagrant downloads page:

Oracle VirtualBox downloads page:

Be sure to also install the VirtualBox extensions; it's the same download for all platforms.

What next?

I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.

Categories: Planet DigiPres

EaaS: Image and Object Archive — Requirements, Implementation and Example Use-Cases

23 July 2014 - 10:33am
bwFLA's Emulation-as-a-Service makes emulation widely available for non-experts and could prove emulation a valuable tool in digital preservation workflows. Providing these emulation services to access preserved and archived digital objects poses further challenges to data management. Digital artifacts are usually stored and maintained in dedicated repositories, and object owners want to – or are required to – stay in control of their intellectual property. This article discusses the problem of managing virtual images, i.e. virtual hard disks bootable by an emulator, and derivatives thereof, but the solution proposed can be applied to any digital artifact.

Requirements

Once a digital object is stored in an archive and an appropriate computing environment has been created for access, this environment should be immutable and should not be modified except explicitly through an administrative interface. This guarantees that a memory institution's digital assets are unaltered by the EaaS service and remain available in the future. Immutability, however, is not easy to handle for most emulated environments. Just booting the operating system may change an environment in unpredictable ways, and when the emulated software writes parts of this data and reads it again, it expects the read data to reflect the modifications. Furthermore, users who want to interact with the environment should be able to change or customize it. Therefore, data connectors have to provide write access for the emulation service while not writing the data back to the archive. The distributed nature of the EaaS approach requires efficient network transport of data to allow for immediate data access and usability. However, digital objects stored in archives can be quite large: for a hard disk image, the installed operating system together with installed software can easily grow to several GBs.
Even with today's network bandwidths, copying these digital objects in full to the EaaS service may take minutes and affects the user experience. While the archived amount of data is usually large, the data that is actually accessed frequently can be very small. In a typical emulator scenario, read access to virtual hard disk images is block-aligned and only very few blocks are actually read by the emulated system. Transferring only these blocks instead of the whole disk image file is typically more efficient, especially for larger files. Therefore, the network transport protocol has to support random seeks and sparse reads without actually copying the whole data file. While direct filesystem access provides these features if a digital object is locally available to the EaaS service, such access is not available in the general case of separate emulation and archive servers connected via the internet.

Implementation

The Network Block Device (NBD) protocol provides a simple client/server architecture that allows direct access to single digital objects as well as random access to the data stream within these objects. Furthermore, it can be completely implemented in userspace and does not require a complex software infrastructure to be deployed to the archives. In order to access digital objects, the emulation environment needs to reference these objects. Individual objects are identified in the NBD server by unique export names. While the NBD URL schema directly identifies the digital object and the archive where it can be found, these data references are bound to the actual network location. In a long-term preservation scenario, where emulation environments, once curated, should outlast the single computer system that acts as the NBD server, this approach has obvious drawbacks.
Furthermore, the cloud structure of EaaS allows any component that participates in the preservation effort to be interchanged, allowing for load balancing and fail-safety. This advantage of distributed systems is offset by static, hostname-bound references.

Handle It!

To detach the references from the object's network location, the Handle System is used as the persistent object identifier throughout our reference implementation. The Handle System provides a complete technological framework to deal with these identifiers (or "Handles" (HDL) in the Handle System) and constitutes a federated infrastructure that allows the resolution of individual Handles using decentralized Handle Services. Each institution that wants to participate in the Handle System is assigned a prefix and can host a Handle Service. Handles are then resolved by a central resolver by forwarding requests to these services according to the Handle's prefix. As the Handle System, as a sole technological provider, does not pose any strict requirements on the data associated with Handles, this system was used as a PI technology.

Persistent User Sessions and Derivatives

As digital objects (in this case the virtual disk image) are not to be modified directly in the archive by the EaaS service, a mechanism to store modifications locally while reading unchanged data from the archive has to be implemented. Such a transparent write mechanism can be achieved using a copy-on-write access strategy. While NBD allows arbitrary parts of the data to be read on request, without requiring any data to be available locally, data that is written through the data connector is tracked and stored in a local data structure. If a read operation requests a part of the data that is already in this data structure, the previously changed version of the data is returned to the emulation component. Similarly, parts of the data that are not in this data structure were never modified and must be read from the original archive server.
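The copy-on-write logic described here can be sketched as a small block store: writes land in a local map, and reads consult that map first before falling back to the (remote) original. This is a simplified model assuming fixed-size blocks; the real implementation works through NBD and qcow2:

```python
class CopyOnWriteDisk:
    """Simplified copy-on-write view over an immutable base image."""

    def __init__(self, read_base_block, block_size=4096):
        self.read_base_block = read_base_block  # e.g. a fetch from the archive
        self.block_size = block_size
        self.local_blocks = {}  # block number -> locally modified bytes

    def write_block(self, n, data):
        # Modifications never reach the archive; they stay local.
        self.local_blocks[n] = data

    def read_block(self, n):
        # Prefer the locally modified version, else read the original.
        if n in self.local_blocks:
            return self.local_blocks[n]
        return self.read_base_block(n)
```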
Over time, a running user session has its own local version of the data, but only those parts of the data that were written are actually copied. We used the qcow2 container format from the QEMU project to keep track of local changes to the digital object. Besides supporting copy-on-write, it features open documentation as well as a widely used and tested reference implementation with a comprehensive API, the QEMU Block Driver. The qcow2 format stores all changed data blocks and the respective metadata for tracking these changes in a single file. To define where the original blocks (before copy-on-write) can be found, a backing file definition is used. The Block Driver API provides a continuous view of this qcow2 container, transparently choosing either the backing file or the copy-on-write data structures as the source. This mechanism allows modifications to be stored separately and independently from the original digital object during an EaaS user session, keeping every digital object in its original state as it was preserved. Once the session has finished, these changes can be retrieved from the emulation component and used to create a new, derived data object. As any Block Driver format is allowed in the backing file of a qcow2 container, the backing file can itself be another qcow2 container. This allows "chaining" a series of modifications as copy-on-write files that only contain the actually modified data, which greatly facilitates efficient storage of derived environments: a single qcow2 container can directly be used in a binding without having to combine the original data and the modifications into a consolidated stream of data. However, this makes such bindings rely not only on the availability of the qcow2 container with the modifications, but also on the original data it refers to. Consolidation is therefore still possible and directly supported by the tools that QEMU provides to handle qcow2 files.
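For illustration, creating such a derivative with QEMU's own tooling is a one-liner; the sketch below only builds the `qemu-img` command line rather than running it (the file names are hypothetical, and newer QEMU versions may also want the backing format given explicitly with `-F qcow2`):

```python
def derivative_command(base_image, derived_image):
    """Build the qemu-img invocation that creates a copy-on-write
    overlay whose backing file is the (unmodified) base image."""
    return ["qemu-img", "create", "-f", "qcow2",
            "-b", base_image, derived_image]
```

On a machine with QEMU installed, something like `subprocess.run(derivative_command("win98-base.qcow2", "session-42.qcow2"), check=True)` would create the overlay.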
Once the data modifications and the changed emulation environment are retrieved after a session, both can be stored again in an archive to make this derivative environment available. Only those chunks of data that were actually changed by the user have to be retrieved; these, however, reference and remain dependent on the original, unmodified digital object. The derivative can then be accessed like any other archived environment. Since all derivative environments contain (stable) references to their backing files, modifications can be stored in a different image archive, as long as the backing file is available. Therefore, each object owner is in charge of providing storage for individualized system environments but is also able to protect its modifications without losing the benefits of shared base images.

Examples and Use-Cases

To provide a better understanding of the image archive implementation, the following three use-cases demonstrate how the current implementation works. Firstly, a so-called derivative is created: a tailored system environment suitable to render a specific object. In the second scenario, a container object (CD-ROM) is injected into the environment, which is then modified for object access, i.e. installation of a viewer application and adding the object to the autostart folder. Finally, an existing hard disk image (e.g. an image copy of a real machine) is ingested into the system. This last case requires, besides the technical configuration of the hardware environment, private files to be removed before public access.

Derivatives – Tailored Runtime Environments

Typically, an EaaS provider provides a set of so-called base images. These images contain a basic OS installation which has been configured to run on a certain emulated platform. Depending on the user's requirements, additional software and/or configuration may be required, e.g. the installation of certain software frameworks or text processing or image manipulation software.
This can be done by uploading a software installation package or making one available. On our current demo instance this is done either by uploading individual files or a CD ISO image. Once the software is installed, the modified environment can be saved and made accessible for object rendering or similar purposes.

Object Specific Customization

In the case of complex CD-ROM objects with rich multimedia content from the 90s and early 2000s, e.g. encyclopedias and teaching software, a custom viewer application typically has to be installed to render their content. For these objects, an already prepared environment (installed software, autostart of the application) is useful and surely improves the user experience during access, as "implicit" knowledge about using an outdated environment is no longer required to make use of the object. Since the number of archived media is large, duplicating, for instance, a Microsoft Windows environment for every one of them would add a few GBs of data to each object. Usually, neither the object's information content nor the current or expected user demand justifies this extra cost. Using derivatives of base images, however, only a few MBs are required for each customized environment, since only changed parts of the virtual image are stored for each object. In the case of the aforementioned collection of multimedia CD-ROMs, the derivative size varies between 348KB and 54MB.

Authentic Archiving and Restricted Access to Existing Computers

Sometimes it makes sense to preserve a complete user system, like the personal computer of Vilèm Flusser in the Vilèm Flusser Archive. Such complete system environments can usually be captured by creating a hard disk image of the existing computer and using this image as the virtual hard disk for EaaS. Such hard disk images can, however, contain personal data of the computer's owner.
While EaaS aims at providing interactive access to complete software environments, it is impossible to restrict this "interactiveness", e.g. to forbid access to a certain directory directly from the user interface. Instead, our approach to this problem is to create a derivative work with all the personal data stripped from the system. This allows users with sufficient access permissions (e.g. family or close friends) to access the original system including personal data, while the general public sees only a computer with all the personal data removed.

Conclusion

With our distributed architecture and an efficient network transport protocol, we are able to provide Emulation as a Service quite efficiently while at the same time allowing owners of digital objects to remain in complete control of their intellectual property. Using copy-on-write technology it is possible to create a multitude of different configurations and flavours of the same system with only minimal storage requirements. Derivatives and their respective "parent" system can be handled completely independently of each other, and withdrawing access permissions for a parent will automatically invalidate all existing derivatives. This allows for very efficient and flexible handling of curation processes that involve the installation of (licensed) software, personal information and user customizations.

Open Planets members can test aforementioned features using the bwFLA demo instance. Get the password here:

Taxonomy upgrade extras: EaaS
Preservation Topics: Emulation
Categories: Planet DigiPres


17 July 2014 - 2:36pm

We have just set up a Vagrant environment for C3PO. It starts a headless VM where the C3PO-related functionalities (MongoDB, Play, a downloadable command-line jar) are manageable from the host's browser. Further, the VM itself has all relevant processes configured at start-up independently of Vagrant, so once created it can be downloaded and used as a stand-alone C3PO VM. We think this could be a scenario applicable to other SCAPE projects as well. The following is a summary of the ideas we've had and the experience we've gained.

The Result

The Vagrantfile and a directory containing all Vagrant-relevant files live directly in the root directory of the C3PO repository. So after installing Vagrant and cloning the repository, a simple 'vagrant up' should do all the work, such as downloading the base box, installing the necessary software and booting the new VM.

After a few minutes one should have a running VM that is accessible from the host's browser at localhost:8000. This opens a central welcome page that contains information about the VM-specific aspects and links to the Play framework's URL (localhost:9000) and the MongoDB admin interface (localhost:28017). It also provides a download link for the command-line jar, which has to be used in order to import data. This can be done from outside the VM, as the MongoDB port is mapped as well. So I can import and analyse data with C3PO without having to fiddle through the setup challenges myself, and, believe me, that way can be long and stony.

The created image is self-contained in the sense that, if I put it on a server, anyone who has VirtualBox installed can download and use it, without having to rely on Vagrant working on their machine.

General Setup

The provisioning script has a number of tasks:

  • it downloads all required dependencies for building the C3PO environment
  • it installs a fresh C3PO (from /vagrant, the shared folder connecting the git repository and the VM) and assembles the command-line app
  • it installs and runs a MongoDB server
  • it installs and runs the Play framework
  • it creates a port-forwarded static welcome page with links to all the functionalities above
  • it adds all of the above to the native Ubuntu startup (using /etc/rc.local where necessary), so that an image of the VM can theoretically run independently of the Vagrant environment

These are all trivial steps, but it can make a difference not having to implement them all manually.

Getting rid of proxy issues

In case you're behind one of those very common NTLM company proxies, you'll really like that the only thing you have to provide is a config script with some details about your proxy. If the setup script detects this file, it will download the necessary software and configure Maven to use it. This is actually the first time I've got Maven running smoothly on a Linux VM behind our proxy.
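The idea can be sketched roughly like this: read a simple key=value proxy config (the file name and keys here are hypothetical, not necessarily what the C3PO scripts use) and render the corresponding <proxy> block for Maven's settings.xml:

```python
def render_maven_proxy(config_text):
    """Turn 'key=value' proxy settings into a Maven settings.xml
    <proxy> element. Assumed keys: host, port, user, password."""
    cfg = dict(line.split("=", 1) for line in config_text.splitlines()
               if "=" in line)
    return (
        "<proxy>\n"
        "  <active>true</active>\n"
        "  <protocol>http</protocol>\n"
        f"  <host>{cfg['host']}</host>\n"
        f"  <port>{cfg['port']}</port>\n"
        f"  <username>{cfg.get('user', '')}</username>\n"
        f"  <password>{cfg.get('password', '')}</password>\n"
        "</proxy>"
    )
```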

Ideas for possible next steps

There is loads left to do, here are a few ideas:

  • provide interesting initial test data that ships with the box, so that people can play around with C3PO without having to install/import anything at all.
  • why not have a VM for more SCAPE projects? We could quickly create a repository for something like a SCAPE base VM configuration that is usable as a base for other VMs. The central welcome page could be pre-configured (SCAPE-branded), as could all the proxy- and development-environment-related stuff mentioned above.
  • I'm not sure about the sustainability of shell provisioning scripts as the bootstrap process grows in complexity. Grouping the shell commands in functions is certainly an improvement, but it might be worth checking out other, more dynamic provisioners. One I find particularly interesting is Ansible.
  • currently there's no way of testing that the VM works with the current development trunk; a test environment that runs the VM and tests all the relevant connection bits would be handy


Preservation Topics: SCAPE
Categories: Planet DigiPres

CSV Validator version 1.0 release

15 July 2014 - 12:10pm

Following on from my previous brief post announcing the beta release of the CSV Validator, today we've made the formal version 1.0 release of the CSV Validator and the associated CSV Schema Language. I've described this in more detail on The National Archives' blog.

Preservation Topics: Tools
Categories: Planet DigiPres

New QA tool for finger detection on scans

10 July 2014 - 11:49am

I would like to draw your attention to a new QA tool for finger detection on scans, developed by AIT within the scope of the SCAPE project.


Checking scans manually to identify fingers is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet.

Fingerdet is an open source tool which:

  • provides decision-making support for detecting fingers on scans, within or across collections
  • identifies fingers independently of file format, scan quality, finger size, direction, shape, colour and lighting conditions
  • applies state-of-the-art image processing
  • is useful in assembling collections from multiple sources, and in identifying corrupted files

Fingerdet brings the following benefits:

  • Automated quality assurance
  • Reduced manual effort and error
  • Saved time
  • Lower costs, e.g. storage, effort
  • Open source, standalone tool, also available as a Taverna component for easy invocation
  • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
  • May be applied to wide range of image collections, not just print images
Preservation Topics: Web Archiving, Preservation Risks, SCAPE, Software
Categories: Planet DigiPres

Quality assured ARC to WARC migration

10 July 2014 - 10:44am

This blog post continues a series of posts on the web archiving topic "ARC to WARC migration"; it is a follow-up on the posts "ARC to WARC migration: How to deal with de-duplicated records?" and "Some reflections on scalable ARC to WARC migration".

Especially the last of these posts, which described how SCAPE tools can be used for multi-terabyte web archive data migration, is the basis for this post from a subject point of view. One consequence of evaluating alternative approaches for processing web archive records with the Apache Hadoop framework was to abandon the native Hadoop job implementation (the arc2warc-migration-hdp module was deprecated and removed from the master branch): it had some disadvantages without bringing significant benefits in terms of performance and scale-out capability compared to the command line application arc2warc-migration-cli used together with the SCAPE tool ToMaR for parallel processing. While the previous post did not elaborate on quality assurance, it is the main focus of this one.

The workflow diagram in figure 1 illustrates the main components and processes that were used to create a quality assured ARC to WARC migration workflow.


Figure 1: Workflow diagram of the ARC to WARC migration workflow

The basis of the main components used in this workflow is the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files. Based on this toolkit, the "hawarp" tool set was developed in the SCAPE project; it bundles several components for preparing and processing web archive data, especially data stored in ARC or WARC container files. Special attention was given to making sure that data can be processed using the Hadoop framework, an essential part of the SCAPE platform for distributed data processing on computer clusters.

The input of the workflow is an ARC container file, a format originally proposed by the Internet Archive to persistently store web archive records. The ARC to WARC migration tool is a Java command line application which takes an ARC file as input and produces a WARC file. The tool basically performs a procedural mapping of metadata between the ARC and WARC format (see the constructors of the eu.scape_project.hawarp.webarchive.ArchiveRecord class). One point worth highlighting is that the records of ARC container files from the Austrian National Library's web archive were not structured homogeneously. When creating ARC records (using the Netarchive Suite/Heritrix web crawler in this case), the usual procedure was to strip the HTTP response metadata from the server's response and store it as part of the header of the ARC record. Possibly due to malformed URLs, this was not applied to all records, so in some the HTTP response metadata was still part of the payload content, as was actually defined later for the WARC standard. The ARC to WARC migration tool handles these special cases accordingly. Generally, and as table 1 shows, HTTP response metadata is transferred from the ARC header to the WARC payload header and therefore becomes part of the payload content.

  • ARC header → WARC header
  • HTTP response metadata (from the ARC header) → WARC payload, as the payload header
  • ARC payload → WARC payload

Table 1: HTTP response metadata is transferred from ARC header to WARC payload header
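The relocation shown in table 1 can be sketched in a few lines. This is an illustrative sketch, not the actual JWAT/hawarp code; the function name and sample values are invented:

```python
# Illustrative sketch (not the actual hawarp/JWAT API): moving HTTP response
# metadata from an ARC record header into the WARC payload, as table 1 shows.

def arc_to_warc_payload(http_response_metadata: bytes, arc_payload: bytes) -> bytes:
    """In WARC, the HTTP response header becomes part of the payload content:
    it is prepended to the original body, separated by a blank line."""
    return http_response_metadata.rstrip(b"\r\n") + b"\r\n\r\n" + arc_payload

arc_http_header = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
arc_body = b"<html>...</html>"

warc_payload = arc_to_warc_payload(arc_http_header, arc_body)
print(warc_payload.startswith(b"HTTP/1.1 200 OK"))  # True: metadata now leads the payload
```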

The CDX-Index Creation module, which is also part of the "hawarp" tool set, is used to create a file in the CDX file format, storing selected attributes of web archive records aggregated in ARC or WARC container files - one line per record - in a plain text file. The main purpose of the CDX index is to provide a lookup table for the wayback software. The index file contains the information (URL, date, offset, i.e. record position in the container file, container identifier, etc.) needed to retrieve the data required for rendering an archived web page and its dependent resources from the container files.

Apart from serving the purpose of rendering web resources using the wayback software, the CDX index file can also be used for a basic verification of whether the container format migration was successful, namely by comparing the CDX fields of the ARC CDX file and the WARC CDX file. The basic assumption is that, apart from the offset and container identifier fields, all other fields must have the same values for corresponding records. The payload digest in particular allows verifying that the digest computed over the binary data (payload content) is the same for corresponding records in the two container formats.
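The field-by-field comparison can be sketched as follows. This is a simplified illustration, not the actual SCAPE tooling, and the field positions chosen for the offset and container identifier are hypothetical:

```python
# Simplified sketch of the CDX-based verification: compare corresponding CDX
# records from the ARC and the WARC index, ignoring the fields that are
# expected to differ (offset and container file name). The field positions
# used here are illustrative, not a fixed CDX profile.

IGNORED = {5, 6}  # hypothetical positions of offset and container identifier

def records_match(arc_line: str, warc_line: str) -> bool:
    a, w = arc_line.split(), warc_line.split()
    if len(a) != len(w):
        return False
    return all(x == y for i, (x, y) in enumerate(zip(a, w)) if i not in IGNORED)

arc_cdx  = "example.org/ 20140710 text/html 200 ABCDEF123 1024 old.arc.gz"
warc_cdx = "example.org/ 20140710 text/html 200 ABCDEF123 2048 new.warc.gz"
print(records_match(arc_cdx, warc_cdx))  # True: only offset/container differ
```

A differing payload digest (position 4 here) would make the check fail, flagging the record for inspection.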

An additional step of the workflow to verify the quality of the migration is to compare the rendering results of selected resources when retrieved from the original ARC and from the migrated WARC container files. To this end, the CDX files are first deployed to the wayback application. In a second step the PhantomJS framework is used to take snapshots of the same resource rendered once from the ARC container and once from the WARC container file.

Finally, the snapshot images are compared using Exiftool (basic image properties) and ImageMagick (measure AE: absolute error) in order to determine if the rendering result is equal for both instances. Randomized manual verification of individual cases may then conclude the quality control process.
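As a rough illustration of the AE (absolute error) idea, here is a simplified stand-in for the ImageMagick comparison: counting differing pixels between two equally sized grayscale snapshots. The pixel data is invented:

```python
# Simplified stand-in for ImageMagick's AE (absolute error) metric: count the
# differing pixels between two equally-sized grayscale snapshots, optionally
# within a fuzz tolerance.

def absolute_error(img_a, img_b, fuzz=0):
    """img_a, img_b: flat sequences of pixel values; returns the count of
    pixels whose difference exceeds the fuzz threshold."""
    if len(img_a) != len(img_b):
        raise ValueError("snapshots must have identical dimensions")
    return sum(1 for a, b in zip(img_a, img_b) if abs(a - b) > fuzz)

snapshot_arc  = [0, 0, 128, 255, 64]
snapshot_warc = [0, 0, 128, 254, 64]

print(absolute_error(snapshot_arc, snapshot_warc))          # 1 differing pixel
print(absolute_error(snapshot_arc, snapshot_warc, fuzz=2))  # 0 within tolerance
```

An AE of zero (possibly within a small tolerance) indicates the two renderings are equal for practical purposes.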

There is an executable Taverna workflow available on myExperiment. The Taverna workflow is configured by adapting the constant values (light-blue boxes) which define the paths to configuration files, deployment files, and scripts in the processing environment. However, as Taverna is used here just as an orchestration tool to build a sequence of bash script invocations, it is also possible to use the individual scripts of this workflow directly and replace the Taverna variables (enclosed by two percent symbols) accordingly.

The following prerequisites must be fulfilled to be able to execute the Taverna workflow and/or the bash scripts it contains:

The following screencast demonstrates the workflow using a simple "Hello World!" crawl as example:

Taxonomy upgrade extras: SCAPE, SCAPEProject, SCAPE-Project, Web Archiving. Preservation Topics: Preservation Actions, Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

EaaS in Action — And a short meltdown due to a friendly DDoS

9 July 2014 - 4:23pm

On June 24th, 9.30 AM EST, Dragan Espenschied, Digital Conservator at Rhizome NY, released an editorial featuring a restored home computer previously owned by Cory Arcangel. The article uses an embedded emulator powered by the bwFLA Emulation as a Service framework and the University of Freiburg's computing center. The embedded emulator allows readers to explore and interact with the re-enacted machine.

Currently the bwFLA test and demo infrastructure runs on an old, written-off IBM blade cluster, using 12 blades for demo purposes, each equipped with 8-core Intel(R) Xeon(R) CPUs (E5440 @ 2.83GHz). All instances are booted diskless (network boot) with the latest bwFLA codebase deployed. Additionally, there is an EaaS gateway running on 4 CPUs, delegating requests and providing a web container framework (JBoss) for IFrame delivery. To ensure decent performance of individual emulation sessions, we assign one emulation session to each available physical CPU. Hence, our current setup can handle 96 parallel sessions.

Due to some social media propaganda and good timing (US breakfast/coffee time) our resources were exhausted within minutes.

The figure above shows the total number of sessions for June 26th. Between 16:00 and 20:00 CEST, however, we were unable to deal with the demand.

After two days, however, load normalized again, albeit at a higher level than before.

Lessons learned

The bwFLA EaaS framework is able to scale with demand, but not with our available (financial) resources. Our cluster is suitable for "normal" load scenarios. For peaks like the one we experienced with Dragan's post, a temporary deployment in the cloud (e.g. using PaaS/IaaS services) is a cost-effective strategy, since these heavy-load situations last only a few days. The "local" infrastructure should be scaled to average demand, keeping the cost of running EaaS at bay. For instance, Amazon EC2 charges about €0.50 per hour for an 8-CPU machine. In the case of Dragan's post, the average session time of a user playing with the emulated machine was 15 minutes, hence the average cost per user is about €0.02 if a machine is fully utilized.


Taxonomy upgrade extras: EaaS. Preservation Topics: Emulation
Categories: Planet DigiPres

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

9 July 2014 - 12:31am

During my time at The National Archives UK, my colleague Adam Retter developed a methodology for the reversible pre-conditioning of complex binary objects. The technique was required to avoid doubling the storage for malformed JPEG2000 objects numbering in the hundreds of thousands. The difference between a malformed JPEG2000 file and a corrected, well-formed one was in this instance a handful of bytes, yet the objects themselves were many megabytes in size. The cost of storage means that doubling it in such a scenario is not desirable in today's fiscal environment – especially if it can be avoided.

As we approach ingest of our first born-digital transfers at Archives New Zealand, we also have to think about such issues. We’re also concerned about the documentation of any comparable changes to binary objects, as well as any more complicated changes to objects in any future transfers.

The reason for making changes to a file pre-ingest – in our process terminology, pre-conditioning – is to ensure that well-formed, valid objects are ingested into the long-term digital preservation system. Using processes to ensure changes are:

  • Reversible
  • Documented
  • Approved

we can counter any issues identified as digital preservation risks in the system's custom rules up-front, ensuring we don't have to perform any preservation actions in the short to medium term. Such issues may be raised through format identification, validation, or characterisation tools. The issues can be trivial or complex, and the objects that contain exceptions may themselves be trivial or complex.

At present, if pre-conditioning is approved, it will result in a change being made to the digital object and written documentation of the change, associated with the file, in its metadata and in the organisation’s content management system outside of the digital preservation system.

As example documentation for a change, we can look at a provenance note I might write to describe a change in a plain text file. The reason for the change is that the digital preservation system expects the object to be encoded as UTF-8. A conversion gives us stronger confidence about what this file is in future. Such a change, converting the object from ASCII to UTF-8, can be completed either as a pre-conditioning action pre-ingest, or as a preservation migration post-ingest.

Provenance Note

“Programmers Notepad 2.2.2300-rc used to convert plain-text file to UTF-8. UTF-8 byte-order-mark (0xEFBBBF) added to beginning of file – file size +3 bytes. Em-dash (0x97 ANSI) at position d1256 replaced by UTF-8 representation 0xE28094 at position d1256+3 bytes (d1259-d1261) – file size +2 bytes.”

Such a small change is deceptively complex to document. Without the presence of a character outside the ASCII range we might simply have written, "UTF-8 byte-order-mark added to beginning of file." – but with its presence we have to provide a description complete enough to ensure that the change can be observed, and reversed, by anyone accessing the file in future.
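The change the note describes can be reproduced mechanically. A minimal sketch, assuming a Windows-1252-style input where the em-dash is the single byte 0x97:

```python
# Sketch reproducing the change documented in the provenance note: prepend a
# UTF-8 byte-order-mark and replace the ANSI (Windows-1252) em-dash 0x97 with
# its UTF-8 encoding 0xE2 0x80 0x94. Net growth: +3 bytes (BOM) +2 (em-dash).

BOM = b"\xef\xbb\xbf"

def ansi_to_utf8(data: bytes) -> bytes:
    return BOM + data.replace(b"\x97", b"\xe2\x80\x94")

original = b"plain text \x97 with an em-dash"
converted = ansi_to_utf8(original)

print(converted.startswith(BOM))       # True
print(len(converted) - len(original))  # 5 bytes (+3 BOM, +2 em-dash)
```

Note how the +3 and +2 byte counts in the provenance note fall directly out of the byte arithmetic.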

Pre-conditioning vs. Preservation Migration

As pre-conditioning is a form of preservation action that happens outside the digital preservation system, we don't have adequate tools to complete the action and document it for us – especially for complex objects. We're relying on good hand-written documentation being provided on ingest. The temptation, therefore, is to let the digital preservation system handle this using its inbuilt capability to record and document all additions to a digital object's record, including the generation of additional representations; but the biggest reason not to rely on this is the cost of storage, and how it increases with the likelihood of so many objects requiring this sort of treatment over time.

Proposed Solution

It is important to note that the proposed solution can be implemented either pre- or post-ingest, removing the emphasis from where in the digital preservation process it occurs. Incorporating it post-ingest, however, requires changes to the digital preservation system, whereas doing it pre-ingest enables it to be done manually, with immediate challenges addressed. Consistent usage and proven advantages over time might see it included in a digital preservation mechanism at a higher level.

The proposed solution is to use a patch file, specifically a binary diff (patch file) which stores instructions about how to convert one bitstream to another. We can create a patch file by using a tool that compares an original bitstream to a corrected (pre-conditioned) version of it and stores the result of the comparison. Patch files can add and remove information as required and so we can apply the instructions created to a corrected version of any file to re-produce the un-corrected original.

The tool we adopted at The National Archives, UK was called BSDIFF. This tool is distributed with the popular operating system, FreeBSD, but is also available under Linux, and Windows.

The tool was created by Colin Percival, and there are two utilities required: one to create a binary diff – BSDIFF itself – and the other to apply it – BSPATCH. The manual instructions are straightforward, but the important part of the solution in a digital preservation context is to flip the terms <oldfile> and <newfile>; so, for example, in the manual:

  • $ bsdiff <oldfile> <newfile> <patchfile>

Can become:

  • $ bsdiff <newfile> <oldfile> <patchfile>

Further, in the descriptions below, I will replace <newfile> and <oldfile> with <pre-conditioned-file> and <malformed-file> respectively, e.g.

  • $ bsdiff <pre-conditioned-file> <malformed-file> <patchfile>


BSDIFF generates a patch <patchfile> between two binary files. It compares <pre-conditioned-file> to <malformed-file> and writes a <patchfile> suitable for use by BSPATCH.


BSPATCH applies a patch built with BSDIFF, it generates <malformed-file> using <pre-conditioned-file>, and <patchfile> from BSDIFF.


For my examples I have been using the Windows port of BSDIFF referenced from Colin Percival’s site.

To begin with, a non-archival example simply re-producing a binary object:

If I have the plain text file, hen.txt:

  • The quick brown fox jumped over the lazy hen.

I might want to correct the text to its more well-known pangram form – dog.txt:

  • The quick brown fox jumped over the lazy dog.

I create dog.txt and using the following command I create hen-reverse.diff:

  • $ bsdiff dog.txt hen.txt hen-reverse.diff

We have two objects we need to look after, dog.txt and hen-reverse.diff.

If we ever need to look at the original again we can use the BSPATCH utility:

  • $ bspatch dog.txt hen-original.txt hen-reverse.diff

We end up with a file that matched the original byte for byte and can be confirmed by comparing the two checksums.

$ md5sum hen.txt
84588fd6795a7e593d0c7454320cf516 *hen.txt
$ md5sum hen-original.txt
84588fd6795a7e593d0c7454320cf516 *hen-original.txt
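The checksum comparison can of course also be scripted. A minimal sketch using Python's hashlib, with the file contents inlined to stand in for hen.txt and the bspatch output hen-original.txt:

```python
# Script the md5sum comparison: hash both byte streams and check they are
# identical. The contents here stand in for hen.txt and the bspatch output.

import hashlib

def md5_of(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

original  = b"The quick brown fox jumped over the lazy hen.\n"
recreated = b"The quick brown fox jumped over the lazy hen.\n"  # bspatch output

print(md5_of(original) == md5_of(recreated))  # True: byte-for-byte identical
```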

Used as an illustration, we can re-create the original binary object, but we’re not saving any storage space at this point as the patch file is bigger than the <malformed-file> and <pre-conditioned-file> together:

  • hen.txt – 46 bytes
  • dog.txt – 46 bytes
  • hen-reverse.diff – 159 bytes

The savings we can begin to make, however, using binary diff objects to store pre-conditioning instructions can be seen when we begin to ramp up the complexity and size of the objects we’re working with. Still working with text, we can convert the following plain-text object to UTF-8 complementing the pre-conditioning action we might perform on archival material as described in the introduction to this blog entry:

  • Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quam lacus, tincidunt sit amet lobortis eget, auctor non nibh. Sed fermentum tempor luctus. Phasellus cursus, risus nec eleifend sagittis, odio tellus pretium dui, ut tincidunt ligula lorem et odio. Ut tincidunt, nunc ut volutpat aliquam, quam diam varius elit, non luctus nulla velit eu mauris. Curabitur consequat mauris sit amet lacus dignissim bibendum eget dignissim mauris. Nunc eget ullamcorper felis, non scelerisque metus. Fusce dapibus eros malesuada, porta arcu ut, pretium tellus. Pellentesque diam mauris, mollis quis semper sit amet, congue at dolor. Curabitur condimentum, ligula egestas mollis euismod, dolor velit tempus nisl, ut vulputate velit ligula sed neque. Donec posuere dolor id tempus sodales. Donec lobortis elit et mi varius rutrum. Vestibulum egestas vehicula massa id facilisis.

Converting the passage to UTF-8 doesn't require converting any characters within the text itself, just adding the UTF-8 byte-order-mark at the beginning of the file. Using Programmers Notepad we can open lorem-ascii.txt and re-save it with a different encoding as lorem-utf-8.txt. As with dog.txt and hen.txt, we can then create the patch and apply it to see the original again, using the following commands:

  • $ bsdiff lorem-utf-8.txt lorem-ascii.txt lorem-reverse.diff
  • $ bspatch lorem-utf-8.txt lorem-ascii-original.txt lorem-reverse.diff

Again, confirmation that bspatch outputs a file matching the original can be seen by looking at their respective MD5 values:

$ md5sum lorem-ascii.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii.txt
$ md5sum lorem-ascii-original.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii-original.txt

The file sizes here are much more illuminating:

  • lorem-ascii.txt – 874 bytes
  • lorem-utf-8.txt – 877 bytes
  • lorem-reverse.diff – 141 bytes

Just one more thing… Complexity!

We can also demonstrate the complexity of the modifications we can make to digital objects that BSDIFF affords us. Attached to this blog is a zip file containing supporting files, lorem-ole2-doc.doc and lorem-xml-docx.docx.

The files are used to demonstrate a migration exercise from an older Microsoft OLE2 format to the up-to-date OOXML format.

I’ve also included the patch file lorem-word-reverse.diff.

Using the commands as documented above:

  • $ bsdiff lorem-xml-docx.docx lorem-ole2-doc.doc lorem-word-reverse.diff
  • $ bspatch lorem-xml-docx.docx lorem-word-original.doc lorem-word-reverse.diff

We can observe that application of the diff file to the ‘pre-conditioned’ object, results in a file identical to the original OLE2 object:

$ md5sum lorem-ole2-doc.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-ole2-doc.doc
$ md5sum lorem-word-original.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-word-original.doc

The file-sizes involved in this example are as follows:

  • lorem-ole2-doc.doc – 65,536 bytes
  • lorem-xml-docx.docx – 42,690 bytes
  • lorem-word-reverse.diff – 16,384 bytes

The neat part of this solution – if it wasn't enough that the most complex of modifications are reversible – is that the provenance note remains the same for all transformations between all digital objects. The tools and techniques are documented instead, and the rest remains consistent, and perhaps even more accessible to users who can understand this documentation more readily than the complex, byte-by-byte narrative breakdowns that might otherwise be necessary.


Given the right problem, and the insight of Adam – at the time an individual from outside the digital preservation sphere – we have been shown an innovative solution that helps us demonstrate provenance in a technologically and scientifically sound manner, more accurately and more efficiently than we might otherwise be able to do using current approaches. The solution:

  • Enables more complex pre-conditioning actions on more complex objects
  • Prevents us from doubling storage space
  • Encapsulates pre-conditioning instructions more efficiently and more accurately - there are fewer chances to make errors

While it is unclear whether Archives New Zealand will be able to incorporate this technique into its workflows at present, the solution will be presented alongside our other options so that it can be discussed and taken into consideration by the organisation as appropriate.

Work does need to be done to incorporate it properly, e.g. respecting original file naming conventions, and some consideration should be given to where and when in the transfer/digital preservation process the method should be applied. However, it should prove an attractive and useful option for many archives performing pre-conditioning or preservation actions on future digital objects, trivial and complex alike.




Documentation for the BSDIFF file format is attached to this blog; it was created by Colin Percival and is BSD licensed.


Preservation Topics: Preservation Actions, Migration, Preservation Strategies, Normalisation
Attachments: BSDIFF format documentation provided by Colin Percival [PDF] (29.93 KB); Support files for OLE2 to OOXML example [ZIP] (69.52 KB)
Categories: Planet DigiPres

The SCAPE Project video is out!

3 July 2014 - 8:31am

Do you want a quick intro to what SCAPE is all about?

Then you should watch the new SCAPE video!

The video will be used at coming SCAPE events like SCAPE demonstration days and workshops and it will be available on Vimeo for everyone to use. You can help us to disseminate this SCAPE video by tweeting using this link



Standard tools become overtaxed.....          ....SCAPE addresses these challenges 


The production of this SCAPE video was part of the final project presentation. The idea behind the video is to explain what SCAPE is about to both technical and non-technical audiences – in other words, the overall outcomes and unique selling points of the project in a short and entertaining video. But how do you condense a four-year project with 19 partners and lots of different tools and other SCAPE products into just two minutes?

We started by formulating the SCAPE overall messages and unique selling points, from which a script was distilled. This was the basis for the voice-over text and a storyboard, after which the animation work began. There were lots of adjustments to be made in order to stay close to the actual SCAPE situation. It was great that SCAPErs from different areas of the project were kind enough to look at what we in the Take Up team came up with.

Please take a look and use this video to tell everyone how SCAPE helps you to bring your digital preservation into the petabyte dimension!


SCAPE Project - Digital Preservation into the Petabyte Dimension from SCAPE project on Vimeo.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Introducing Flint

2 July 2014 - 12:53pm

Hi, this is my first blog post in which I want to introduce the project I am currently working on: Flint.


Flint (File/Format Lint) developed out of DRMLint, a lightweight piece of Java software that makes use of different third-party tools (Preflight, iText, Calibre, Jhove) to detect DRM in PDF and EPUB files. Since its initial release we have added validation of files against an institutional policy, making use of Johan's pdfPolicyValidate work, restructured it to be modular and easily extendible, and found ourselves having developed a rather generic file format validation framework.

what does Flint do?

Flint is an application and framework to facilitate file/format validation against a policy. Its underlying architecture is based on the idea that file/format validation nearly always has a specific use-case with concrete requirements that may differ from, say, validation against the official industry standard of a given format. We discuss the principal ideas we've implemented in order to match such requirements.

The code centres on individual file format modules, and thus takes a different approach to FITS; for example, the PDF module makes use of its own code and external libraries to check for DRM. Creating a custom module for your own file formats is relatively straightforward.

The Flint core, and modules, can be used via a command line interface, graphical user interface or as a software library. A MapReduce/Hadoop program that makes use of Flint as a software library is also included.

The following focuses on the main features:


The core module provides an interface for new format-specific implementations, which makes it easy to write a new module. The implementation is provided with a straightforward core workflow from the input file to standardised output results. Several optional functionalities (e.g. Schematron-based validation, exception and time-out handling of the validation process for corrupt files) help to build a robust validation module.


Visualisation of Flints core functionality; a format-specific implementation can have domain-specific validation logic on code-level (category C) or on configuration level (categories A and B). The emphasis is on a simple workflow from input-file to standardised check results that bring everything together.

Policy-focused validation

The core module optionally includes a Schematron-based, policy-focused validator. 'Policy' in this context means a set of low-level requirements in the form of a Schematron XML file, meant to validate the XML output of other third-party programs. In this way, domain-specific validity requirements can be customised and reduced to the essential. For example: does this PDF require fonts that are not embedded?

We make use of Johan’s work for a Schematron check of Apache Preflight outputs, introduced in this blog post. Using Schematron it is possible to check the XML output from tools, filtering and evaluating them based on a set of rules and tests that describe the concrete requirements of *your* organisation on digital preservation.
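Schematron itself requires an XSLT processor, but the principle can be illustrated with the standard library alone. The report structure and attribute names below are invented for illustration, loosely modelled on a Preflight-style font report:

```python
# Illustrating the policy idea: evaluate a rule ("all fonts must be embedded")
# against a hypothetical, Preflight-style XML report. Element and attribute
# names are invented; Schematron would express the same rule declaratively.

import xml.etree.ElementTree as ET

report = ET.fromstring("""
<report>
  <fonts>
    <font name="Helvetica" embedded="true"/>
    <font name="ComicSans" embedded="false"/>
  </fonts>
</report>
""")

violations = [f.get("name") for f in report.iter("font")
              if f.get("embedded") != "true"]
print(violations)  # ['ComicSans'] -> the file fails this policy rule
```

The point is that the policy filters the tool's raw output down to exactly the checks your organisation cares about.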



Aside from its internal logic, Flint contains wrapper code around a variety of third-party libraries and tools to make them easier to use, ensuring any logic to deal with them is in one place only:

* Apache PDFBox

* Apache Tika

* Calibre

* EPUBCheck

* iText – note that if this library is enabled, it is AGPL3 licensed

These tools all do (a) something slightly different or (b) do not have full coverage of the file formats in some respects or (c) they do more than what one actually needs. All these tools relate more or less to the fields of PDF and EPUB validation, as these are the two existing implementations we're working on at the moment.

Format-specific Implementations
  • flint-pdf: validation of PDF files using configurable, Schematron-based validation of Apache Preflight results, plus internal logic and all the tools in the list above, focusing on DRM and well-formedness
  • flint-epub: validation of EPUB files using configurable, Schematron-based validation of EPUBCheck results, plus internal logic and all the tools in the list above, focusing on DRM and well-formedness

NOTE: both implementations are work in progress, and should be a good guide for how to implement your own format-validation module using Flint. It would be easy to add a Microsoft Office file format module that looked for DRM, for example.
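To give a feel for the plugin idea, here is a hypothetical sketch of such a format module. Flint itself is Java and its real interface differs; all names below are invented:

```python
# Hypothetical sketch of the plugin idea (Flint itself is Java; these names
# are invented): each format module implements one validation entry point and
# returns standardised check results.

from abc import ABC, abstractmethod

class FormatModule(ABC):
    format_name: str

    @abstractmethod
    def validate(self, path: str) -> dict:
        """Return {check_name: passed} for the given file."""

class OfficeModule(FormatModule):
    format_name = "MS Office"

    def validate(self, path: str) -> dict:
        # Real logic would call wrapped third-party tools here.
        return {"well-formed": True, "drm-free": path.endswith(".docx")}

results = OfficeModule().validate("report.docx")
print(results)  # {'well-formed': True, 'drm-free': True}
```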



Visualisation of the Flint ecosystem with different entry points and several format/feature-specific implementations (deep blue: existing ones, baby blue: potentially existing ones); the core, as visualised in Figure 1 connects the different ends of the ‘ecosystem’ with each other


how we are using it

Due to the recent introduction of non-print legal deposit, the British Library is preparing to receive large numbers of PDF and EPUB files. Development of this tool has been discussed with operational staff within the British Library, and we aim for it to be used to help determine preservation risks within the received files.

what’s next

Having completed some initial large-scale testing of a previous version of Flint, we plan to run more large-scale tests with the most recent version. We are also interested in the potential of adding more file format modules; work is underway on some geospatial modules.

help us make it better

It's all out there – the Schematron utils are part of our tools collection, and Flint is available as well. Please use it, and please help us to make it better.

Preservation Topics: Characterisation, Preservation Risks, SCAPE
Categories: Planet DigiPres

How much of the UK's HTML is valid?

2 July 2014 - 12:05pm

I thought OPF members might be interested in this UK Web Archive blog post I wrote on format identification and validation of our historical web archives: How much of the UK's HTML is valid?

Preservation Topics: Identification
Categories: Planet DigiPres

SCAPE Demo Day at Statsbiblioteket

27 June 2014 - 8:38am

Statsbiblioteket (The State and University Library, Aarhus, hereafter called SB) welcomed a group of people from The Royal Library, The National Archives, and the Danish e-Infrastructure Cooperation on June 25, 2014. They were invited to our SCAPE demo day, where some of SCAPE's results and tools were presented. Bjarne S. Andersen, Head of IT Technologies, welcomed everybody, and then our IT developers presented and demonstrated SB's SCAPE work.

The day started with a nice introduction to the SCAPE project by Per Møldrup-Dalum, including short presentations of some of the tools which would not be presented in a demo. Among other things, this triggered questions about how to log in to Plato – a preservation planning tool developed in SCAPE.

Per continued with a presentation about Hadoop and its applications. Hadoop is a large and complex technology, which had already been chosen before the project started. This has resulted in some discussion during the project, but Hadoop has proven really useful for large-scale digital preservation. Hadoop is available both as open source and as commercial distributions. The core concept of Hadoop is the MapReduce algorithm, presented in the paper "MapReduce: Simplified Data Processing on Large Clusters" in 2004 by Jeffrey Dean and Sanjay Ghemawat. This paper prompted Cutting and Cafarella to implement Hadoop, and they published their system under an open source license. Writing jobs for Hadoop has traditionally been done in the Java programming language, but in recent years several alternatives to Java have been introduced, e.g. Pig Latin and Hive. Other interesting elements in a Hadoop cluster are HBase, Mahout, Giraph, ZooKeeper and a lot more. At SB we use an Isilon scale-out NAS storage cluster, which enables us to run a lot of different experiments on the four 96GB RAM CPU nodes, each with a 2 Gbit Ethernet interface. This setup potentially makes the complete online storage of SB reachable for the Hadoop cluster.
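The MapReduce idea at the heart of Hadoop can be illustrated with a tiny in-process word count. This is not Hadoop code, just the algorithm's two phases in miniature:

```python
# The MapReduce algorithm in miniature: a map phase emitting (key, value)
# pairs and a reduce phase aggregating them per key. A real Hadoop job
# distributes these phases across the cluster.

from collections import defaultdict

def map_phase(records):
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    grouped = defaultdict(int)
    for key, value in pairs:  # the shuffle/sort step groups pairs by key
        grouped[key] += value
    return dict(grouped)

counts = reduce_phase(map_phase(["to be", "or not to be"]))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```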

                                            Sometimes it is hard to fit an elephant in a library
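The MapReduce idea described above can be illustrated in miniature with the classic word count. This is a minimal plain-Python sketch of the map/shuffle/reduce phases (the helper names are invented for illustration, not Hadoop's API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(documents, mapper):
    # Each "mapper task" turns one document into a list of (key, value) pairs.
    return list(chain.from_iterable(mapper(doc) for doc in documents))

def shuffle(pairs):
    # The framework groups all values emitted under the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) pairs, the reducer sums them.
def wc_mapper(doc):
    return [(word, 1) for word in doc.split()]

def wc_reducer(word, counts):
    return sum(counts)

docs = ["hadoop hadoop pig", "pig hive"]
result = reduce_phase(shuffle(map_phase(docs, wc_mapper)), wc_reducer)
# result == {"hadoop": 2, "pig": 2, "hive": 1}
```

On a real cluster the map and reduce calls run in parallel on different nodes; the simulation above only shows the data flow.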


Bolette A. Jurik was next in line and told the story of how Statsbiblioteket wanted to migrate audio files using Hadoop (and Taverna… and xcorrSound Waveform Compare). The files were to be migrated from mp3 to wav. Checking this collection in Plato gave us the result ‘Do nothing’ – meaning leave the files as mp3. But we still wanted to perform the experiment – to test that we have the tools to migrate, extract and compare properties, validate the file format and compare the content of the mp3 and wav files, and that we can create a scalable workflow for this. We did not have a tool for the content comparison, so we had to develop one: xcorrSound Waveform Compare. The output shows which files need special attention – as an example, one of the files failed the waveform comparison although it looked right. This was due to a lack of content in some parts of the file, so Waveform Compare had no sound to compare! Bolette also asked her colleagues to create "migrated" sound files with problems that the tool would not find – read more about this small competition in this blog post.
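The core idea behind a waveform comparison like this can be sketched as a normalised cross-correlation. The sketch below is a zero-lag simplification of what the real tool does (the function name is illustrative), and it also shows why silence is a problem: a stretch with no energy yields no usable correlation, mirroring the "no sound to compare" case above.

```python
import math

def normalised_xcorr(a, b):
    """Zero-lag normalised cross-correlation of two equal-length signals.
    Values near 1.0 suggest the same content; 0.0 means nothing to compare."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # Silence (or any constant signal) has zero energy: no sound to compare.
    return num / den if den else 0.0

original = [0.0, 0.8, 0.0, -0.8] * 4
migrated = [0.0, 0.4, 0.0, -0.4] * 4   # same waveform, lower gain
silence = [0.0] * 16
# normalised_xcorr(original, migrated) is 1.0; against silence it is 0.0
```

The real xcorrSound tool works block-wise and searches across lags, but the failure mode is the same: a comparison score is only meaningful where both signals actually contain sound.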

Then Per was up for yet another presentation, this time describing the experiment 'Identification and feature extraction of web archive data based on Nanite'. The test was to extract different kinds of metadata (like authors, GPS coordinates of photographs, etc.) using Apache Tika, DROID (and libmagic). The experiment was run on the Danish Netarchive (the archive of the Danish web – a task undertaken by The Royal Library and SB together). For the live demo a small job with only three ARC files was used – processing all 80,000 files from the original experiment would have taken 30 hours. Hadoop generates loads of technical metadata that enables us to analyse such jobs in detail after execution. Per's presentation was basically a quick review of what is described in the blog post A Weekend With Nanite.

An analysis of the original Nanite experiment was done live in Mathematica, presenting a lot of fun facts and interesting artefacts. For one thing, we counted the number of unique MIME types in the 80,000 ARC files (260,603,467 individual documents):

  • 1384 different MIME types were reported by the HTTP server at harvest time,
  • DROID counted 319 MIME types,
  • Tika counted 342 MIME types.
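That disagreement between the three sources can be reproduced in miniature with a few lines of Python (the records here are invented for illustration, not taken from the experiment):

```python
from collections import Counter

# Hypothetical per-document records: (server_mime, droid_mime, tika_mime).
records = [
    ("text/html", "text/html", "text/html"),
    ("text/html; charset=utf-8", "text/html", "text/html"),
    ("application/octet-stream", "image/jpeg", "image/jpeg"),
]

server_types = Counter(r[0] for r in records)
droid_types = Counter(r[1] for r in records)
tika_types = Counter(r[2] for r in records)

# The server-reported vocabulary is typically much larger (and noisier)
# than what the identification tools emit, just as in the figures above.
```

Servers report free-text strings with parameters and outright wrong values, while DROID and Tika normalise to a controlled vocabulary, which is one plausible reason the harvest-time count (1384) dwarfs the tool counts (319 and 342).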

A really weird artefact was that approx. 8% of the identification tasks were complete before they started! The only conclusion to this is that we’re experiencing some kind of temporal shift that would also explain the great performance of our cluster…

Two years ago SB completed a job that had run for 15 months: 15 months of FITS characterising 12TB of web archive data. The experiment with Nanite characterised 8TB in 30 hours. This extreme shift in performance is largely due to our involvement in the SCAPE project.

After sandwiches and a quick tour to the library tower Asger Askov Blekinge took over to talk about Integrating the Fedora based DOMS repository with Hadoop. He described Bitmagasinet (SB’s data repository) and DOMS (SB’s Digital Object Management System based on Fedora) and how our repository is integrated with Hadoop.

SB is right now working on a very large project to digitize 32 million pages of newspapers. The digitized files are delivered in batches and we run Hadoop map/reduce jobs on each batch to do quality assurance. An example is to run Jpylyzer on a batch (Map runs Jpylyzer on each file, Reduce stores the results back in DOMS). The SCAPE way to do it includes three steps:

  • Staging – retrieves records
  • Hadooping – reads, works on and writes new updated records
  • Loading – stores updated records in DOMS

The SCAPE Data model is mapped with the newspapers in the following way:

                                           SCAPE Data Model mapped with newspapers

The SCAPE Stager/Loader creates a sequence file, which can then be read and each record updated by Hadoop; after that the updated records are stored in DOMS.

The last demo was presented by Rune Bruun Ferneke-Nielsen. He described the policy driven validation of JPEG 2000 files based on Jpylyzer and performed on SB’s Newspaper digitization project. The newspapers are scanned from microfilms by a company called Ninestars, and then quality assured by SB’s own IT department. We need to make sure that the content conforms to the corresponding file format specifications and that the file format profile conforms to our institutional policies.


530,000 image files have been processed within approx. five hours.

We want to be able to receive 50,000 newspaper files per day, which is more than one server can handle. All access to data for quality assurance etc. is done via Hadoop. Ninestars runs a quality assurance check before sending the files back to SB, and then the files are QA'ed again in-house.

                                                               Fuel for the afternoon (Photo by Per Møldrup-Dalum)

One of the visitors at the demo is working at The Royal Library with the NetArchive and would like to make some crawl log analyses. These could perhaps be processed by using Hadoop - this is definitely worth discussing after today to see if our two libraries can work together on this.

All in all this was a very good day, and the audience learned a lot about SCAPE and the benefits of the different workflows and tools. We hope they will return for further discussion on how they can best use SCAPE products at their own institutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Bulk disk imaging and disk-format identification with KryoFlux

26 June 2014 - 3:15pm
The problem

We have a large volume of content on floppy disks that we know are degrading but which we don't know the value of.

  1. We don't want to waste time/resources on low-value content.
  2. We don't know the value of the content.
  3. We want to be able to back up the content on the disks to ensure it doesn't degrade any more than it already has.
  4. Using unskilled students to do the work is cost-effective.
  5. Unskilled students have often never seen "floppy" disks, let alone being able to distinguish between different formats of floppy disk. So we need a solution that doesn't require them to differentiate (e.g. between Apple formats, PC formats, Amiga, etc.).

The solution

  1. Create KryoFlux stream files using the KryoFlux hardware and software.
  2. Use the KryoFlux software to create every variant of disk image from those streams.
  3. Use the mount program on Linux to try to mount each disk image using each variant of file-system parameter.
  4. Keep the disk images that mount successfully in Linux (as that ability implies that they are in the right format).

Very rough beginnings of a program to perform the automatic format identification using the KryoFlux software and Mount are available here.
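The mount-trial idea can be sketched as a brute-force loop over image-variant/file-system pairs. This is a hypothetical sketch, not the linked program: the variant and file-system lists are invented, and a fake runner stands in for `mount` so the logic can be exercised without root access.

```python
import itertools
import subprocess

# Hypothetical lists; the real ones would cover every disk-image variant
# the KryoFlux software can emit and every file system mount supports.
IMAGE_VARIANTS = ["disk.mfm.img", "disk.fm.img", "disk.apple.img"]
FS_TYPES = ["vfat", "msdos", "hfs"]

def mount_command(image, fs_type, mountpoint="/mnt/floppy"):
    # Loop-mount read-only so a false positive cannot alter the image.
    return ["mount", "-t", fs_type, "-o", "loop,ro", image, mountpoint]

def identify(images, fs_types, run=subprocess.run):
    """Return the (image, fs_type) pairs whose mount attempt succeeds."""
    hits = []
    for image, fs_type in itertools.product(images, fs_types):
        if run(mount_command(image, fs_type)).returncode == 0:
            hits.append((image, fs_type))
            run(["umount", "/mnt/floppy"])
    return hits

# Dry run with a fake `mount` that only accepts the MFM image as FAT:
class _Result:
    def __init__(self, rc):
        self.returncode = rc

def _fake_run(cmd):
    ok = cmd[0] == "umount" or ("disk.mfm.img" in cmd and "vfat" in cmd)
    return _Result(0 if ok else 1)

hits = identify(IMAGE_VARIANTS, FS_TYPES, run=_fake_run)
# hits == [("disk.mfm.img", "vfat")]
```

Note that mounting read-only also sidesteps the risk that a false-positive mount writes file-system journal data back into the image.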

Issues with the solution
  1. When you use the KryoFlux to create raw stream files it only seems to do one pass over each sector, whereas when you specify the format it will retry sectors it identifies as "bad" in the first pass. This can lead to it successfully reading sectors it otherwise wouldn't, so using the KryoFlux stream files may not preserve as much content as specifying the disk format before imaging. I'm trying to find out whether using "multiple" in the output options of the KryoFlux software might help with this.
  2. Mount doesn't mount all file-systems - though as this is improved in the future the process could be re-run
  3. Mount can give false positives
  4. I don't know whether there is a difference between disk images created with KryoFlux using many of the optional parameters and those created using the defaults. For example, there doesn't appear to be a difference in mountability between disk images where the number of sides is specified and those where it defaults to both sides (for e.g. MFM images both seem to mount successfully).
  5. Keeping the raw streams is costly. A disk image for a 1.44MB floppy is ~1.44MB; the stream files are in the tens of MBs.
Other observations:
  1. It might be worth developing signatures for use in e.g. DROID to identify the format of the stream files directly in the future. Some emulators, for example, can already interact directly with the stream files, I believe.
  2. The stream files might provide a way of overcoming bad-sector based copy protection (e.g. the copy protection used in Lotus 1-2-3 and Lotus Jazz) by enabling the use of raw stream files (which, I believe, contain the "bad" sectors as well as the good) in emulators.

Thoughts/feedback appreciated

Preservation Topics: Identification, Preservation Risks, Bit rot, Tools
Categories: Planet DigiPres

Will the real lazy pig please scale up: quality assured large scale image migration

24 June 2014 - 9:12am

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another.

There are many reasons why such a scenario may be of relevance in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly decrease storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality assured process is essential to allow for reconstruction of the original file instances and especially to determine when deletion of original uncompressed files is needed – this is the only way to realize the advantage of reducing storage costs. Based on these assumptions we have developed the following use case: Uncompressed TIFF image files are converted into compressed JPEG2000 files; the quality of the converted file is assured by applying a pixel for pixel comparison between the original and the converted image.

For this, a sequential Taverna concept workflow was first developed, which was then modelled into a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing paths to the TIFF files to be converted. This text file is then transformed into a list that allows the sequential conversion of each file, hence simulating a non-scalable process. Before the actual migration commences, the validity of each TIFF file is checked. This step is realized using FITS – a wrapper that applies different tools to extract the identification information of a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information. If the file is valid, migration from TIFF to JPEG2000 can begin. The tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer – a validator as well as feature extractor for JPEG2000 images created within the SCAPE Project – is employed. Again, an XPath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, the file is reconverted into a TIFF file, again using OpenJPEG 2.0. Finally, in a last step the reconverted and the original TIFF files are compared pixel for pixel using the Linux-based ImageMagick. Only through the successful execution of this final step can the validity as well as the possibility of a complete reconversion be assured.

Figure 1 (above): Taverna concept workflow
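The per-file conversion and QA steps of the concept workflow (FITS validation omitted) can be sketched as a list of shell invocations. Tool names and flags below are indicative of OpenJPEG 2.0, Jpylyzer and ImageMagick, not necessarily the project's exact calls, and the path handling is simplified:

```python
def migration_qa_commands(tiff_path):
    """Build the per-file command-line steps of the concept workflow."""
    stem = tiff_path.rsplit(".", 1)[0]
    jp2 = stem + ".jp2"
    restored = stem + "_restored.tif"
    return [
        ["opj_compress", "-i", tiff_path, "-o", jp2],      # TIFF -> JPEG2000
        ["jpylyzer", jp2],                                 # validate the JP2
        ["opj_decompress", "-i", jp2, "-o", restored],     # JPEG2000 -> TIFF
        # Absolute-error metric: 0 differing pixels means a lossless round trip.
        ["compare", "-metric", "AE", tiff_path, restored, "null:"],
    ]

cmds = migration_qa_commands("page.tif")
```

Only when the final `compare` step reports zero differing pixels can the original uncompressed TIFF safely be deleted, which is the whole point of the quality-assured migration.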

In order to identify how much time was consumed by each element of this workflow, we ran a test consisting of the migration of 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and five minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison finished in 62 minutes. The SCAPE-developed tool Jpylyzer required only 18 minutes and was thus much faster than the above-mentioned steps.

Figure 2 (above): execution times of each of the concept workflows' steps

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster containing five Data Nodes and one Name Node (specification: see below). Besides having economic advantages – Hadoop runs on commodity hardware – it also bears the advantage of being designed for failure, hence reducing the problems associated with hardware crashes.

The distribution of tasks across the cluster is implemented via MapReduce jobs. A Map job splits the handling of a file: for example, if a large text file is to be processed, a Map job divides the file into several parts, each of which is then processed on a different node. Hadoop Reduce jobs then aggregate the outputs of the processing nodes back into a single file.

But writing MapReduce jobs is a complex matter. For this reason Apache Pig is used. Pig was built for Hadoop and translates a set of commands written in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce jobs much easier or, as Professor Jimmy Lin described the powerful tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”

Hadoop HDFS, Hadoop MapReduce and Apache Pig make up the foundation of the scalability on which the SCAPE tools ToMaR and XPath Service are based. ToMaR wraps command line tasks for parallel execution as Hadoop MapReduce jobs. These are in our case the execution of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, the simultaneous execution of these tools on several nodes is possible. This has a great impact on execution times as Figure 3 (below) shows.

The blue line represents the non-scalable Taverna workflow. It is clearly observable how the time needed for file migration increases in proportion to the number of files converted. The scalable workflow, represented by the red line, shows a much smaller increase in time needed, thus suggesting that scalability has been achieved. This means that, by choosing the appropriate size for the cluster, it is possible to migrate a certain number of image files within a given time frame.

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

Below is the specification of the Hadoop cluster, where the master node runs the jobtracker and namenode/secondary namenode daemons, and each worker node runs a tasktracker and a datanode daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x Xeon E5620@2.40GHz
  • Quadcore CPU (16 HyperThreading cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy); 2TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x Xeon X3440@2.53GHz
  • Quadcore CPU (8 HyperThreading cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; configured as RAID0 (performance); 2TB effective disk space

However, the throughput we can reach using this cluster and Pig/Hadoop job configuration is limited; as Figure 4 shows, the throughput (measured in gigabytes per hour, GB/h) grows rapidly as the number of files being processed increases, and then stabilises at slightly more than 90 GB/h when processing more than 750 image files.

Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed

As our use case shows, by using a variety of tools developed in the SCAPE Project together with the Hadoop framework it is possible to distribute the processing on various machines thus enabling the scalability of large scale image migration and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to fit the size of the job so that it can be completed within a given time frame.

Apart from the authors of this blog post, the following SCAPE Project partners contributed to this experiment:

  • Alan Akbik, Technical University of Berlin
  • Matthias Rella, Austrian Institute of Technology
  • Rainer Schmidt, Austrian Institute of Technology
Preservation Topics: Migration, SCAPE, jpylyzer
Categories: Planet DigiPres

Interview with a SCAPEr - Leïla Medjkoune

20 June 2014 - 11:59am
Leïla Medjkoune

Who are you?

My name is Leïla Medjkoune and I am responsible for the Web Archiving projects and activities at Internet Memory.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

My involvement in Scape is twofold. I am working as a project manager, following the project and ensuring that Internet Memory as a partner fulfils the project plan. I am also involved as a functional expert, representing web archivists’ needs. I am therefore working within several areas of the project such as Quality Assurance and the Web Testbed work, where I contribute to the development of tools and workflows in relation to web archiving.

Why is your organisation involved in SCAPE?

Since its creation in 2004, Internet Memory has actively participated in improving the preservation of the Internet. It supports cultural institutions involved in web archiving projects through its large-scale shared platform, by building its own web archive, and by developing innovative methods and tools – such as its own crawler, MemoryBot – either internally or through participation in EU-funded research projects aiming to tackle web archiving and large-scale preservation challenges. As part of SCAPE, Internet Memory intends to test, develop and hopefully implement within its infrastructure preservation tools and methods, including an automated visual quality tool applied to web archives.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a very interesting project with a quite complex organisation. This is due to the fact that we are looking at a broad range of tools and methods trying to tackle a variety of preservation issues. Beyond the organisational aspects, one of the biggest challenges is as stated within the acronym, to answer the scalability issues currently met by most archives and libraries. This is even more critical for web archives as the amount of the heterogeneous content to preserve and to provide access to is constantly growing in size. Another challenge will be to disseminate SCAPE's outcomes so that they reach the preservation community and will be used within libraries, archives and preservation institutions in general.

What do you think will be the most valuable outcome of SCAPE?

Like most web archives, we want to implement within our infrastructure robust automated tools that could not only facilitate operations but also reduce costs. Improving characterisation tools so that they scale, and developing QA tools designed for web archives, such as Pagelyzer, are the most useful outcomes from our perspective. We are also strongly involved in the SCAPE Platform work and believe this platform is a useful example of how several preservation tools and systems can be integrated within a single infrastructure.

Contact information:

Leïla Medjkoune

Preservation Topics: SCAPE
Categories: Planet DigiPres

An Analysis Engine for the DROID CSV Export

3 June 2014 - 7:20am

I have been working on some code to ensure the accurate and consistent output of any file format analysis based on the DROID CSV export, example here. One way of looking at it is an executive summary of a DROID analysis, except I don't think executives, as such, will be its primary user-base. 

The reason for pushing the code and this blog post out now is to seek feedback on what else might be useful to users: specifically, on the formatting of the output, and on other use cases for analysing the same results, so that I can understand whether I can incorporate those methods into my work for the benefit of the community.

The hope is that the results output by the tool can be used by digital preservation researchers, analysts, coders, archivists and digital archivists alike - where there is such a distinction to be drawn. 

The tool is split into two or three components depending on how you break it down.

droid2sqlite: this places DROID CSV export data into a SQLite database with the same filename as the input CSV file.

The process adds two additional columns to the saved table: URI_SCHEME, so we can query the URI scheme used by the various URIs output by DROID at greater granularity; and DIR_NAME, to enable analysis of base directory names, e.g. to help us understand the breakdown of directories in a collection.

This combines the functions of droid2sqlite by calling its primary class. Further, it provides a query layer on top of the DROID SQLite database, outputting to the command line the results of various queries we might ask of the dataset.

This is a class created to help spot potentially difficult-to-handle file names in any DROID CSV output. The class is based on a Microsoft Developer Network (MSDN) article, but also checks for non-ASCII characters and a handful of other characters that can prove problematic, such as square brackets.
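The gist of such a check can be sketched in a few lines. This is an illustrative simplification, not the tool's actual class; the reserved character and name sets follow the MSDN guidance, extended with square brackets and a non-ASCII test as described above:

```python
# Characters from the MSDN reserved set, plus square brackets, which have
# proven problematic in practice (e.g. in URL and glob handling).
RESERVED_CHARS = set('<>:"/\\|?*') | {"[", "]"}
RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL"} \
    | {f"COM{i}" for i in range(1, 10)} \
    | {f"LPT{i}" for i in range(1, 10)}

def difficult_filename(name):
    """Flag file names likely to need pre-conditioning before ingest."""
    stem = name.rsplit(".", 1)[0].upper()
    if stem in RESERVED_NAMES:
        return True
    if any(c in RESERVED_CHARS for c in name):
        return True
    # Control characters and anything outside printable ASCII (non-ASCII).
    return any(ord(c) < 32 or ord(c) > 126 for c in name)
```

Run over the FILE_PATH column of the database, a check like this yields the "listing of potentially difficult filenames" report mentioned below.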

Database and Analysis Engine

This work mirrors some of what DROID does already: DROID outputs an Apache Derby database as its profile file format, and information on connecting to it can be found on the droid-list mailing list. For my purposes I wanted to learn the database management system SQLite and, more practically, I found greater support for it in terms of libraries available in Python and the applications I can use to access it. So instead of attempting to access the DROID Derby database and build on top of that, I decided to map the results to a SQLite database. SQLite also has features that I like which might lend themselves well to long-term preservation, enabling the database to be stored alongside any collection analysis documentation outside of the digital preservation system, if necessary.

DROID also enables filtering and the generation of reports, however I haven’t found the way it collects information to be useful in the past and so needed a different approach; an approach that gives me greater flexibility to create more reports or manipulate output.

The DROID CSV export is as simple as it needs to be and provides a lot of useful information and so provided an adequate platform for this work in its own right.

The database engine doesn’t have a hard coded schema; it simply reads the column headers in the CSV provided to the tool. Given the appearance of particular columns it creates two additional columns on top to provide greater query granularity, as mentioned above.

The analysis output includes summary statistics, along with listings of PUIDS and file paths depending on the query that we’re interested in. On top of the summary statistics, the following information is output:

  • Identified PUIDs and format names
  • PUID frequency
  • Extension only identification in the collection and frequency
  • ID method frequency
  • Unique extensions identified across the collection
  • Multiple identification listing
  • MIME type frequency
  • Zero-byte object listing
  • No identification listing
  • Top signature and identified PUIDs list
  • Container types in collection
  • Duplicate content listing
  • Duplicate filename listing
  • Listing of potentially difficult filenames

An example analysis, based on a DROID scan from the re-factored opf-format-corpus I host, can be found here. The summary statistics generated are as follows:

Total files: 500
Total container objects: 14
Total files in containers: 176
Total directories: 85
Total unique directory names: 75
Total identified files (signature and container): 420
Total multiple identifications (signature and container): 1
Total unidentified files (extension and blank): 80
Total extension ID only count: 17
Total extension mismatches: 32
Total signature ID PUID count: 54
Total distinct extensions across collection: 64
Total files with duplicate content (MD5 value): 155
Total files with duplicate filenames: 117
Percentage of collection identified: 84.0
Percentage of collection unidentified: 16.0

One point to note is that DROID can either analyse the contents of container files or not. In the former case it is difficult to generate a count of top-level objects (objects not stored within a container). It is, however, useful to understand both counts where possible, even though duplication of reports might be undesirable. The creation of a URI_SCHEME column in the database enables this count to be calculated without the need to run DROID twice. The number of top-level objects in the opf-format-corpus can be calculated by subtracting the number of files in containers from the total number of files, so: 500 - 176 = 324.
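That subtraction can be expressed directly as queries against the URI_SCHEME column. The toy table below is invented for illustration; it assumes (as the blog describes) that entries inside containers carry a compound scheme such as 'zip' rather than 'file':

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE droid (URI_SCHEME TEXT, TYPE TEXT)")
# Three top-level files plus one container holding two entries.
conn.executemany(
    "INSERT INTO droid VALUES (?, ?)",
    [("file", "File"), ("file", "File"), ("file", "File"),
     ("file", "Container"), ("zip", "File"), ("zip", "File")],
)

total = conn.execute(
    "SELECT COUNT(*) FROM droid WHERE TYPE = 'File'"
).fetchone()[0]
in_containers = conn.execute(
    "SELECT COUNT(*) FROM droid WHERE TYPE = 'File' AND URI_SCHEME != 'file'"
).fetchone()[0]
top_level = total - in_containers
# top_level == 3, the same shape of calculation as 500 - 176 = 324 above.
```

A single DROID run with container analysis enabled therefore yields both counts, avoiding the duplicate-report problem mentioned above.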

Questions that we’re asking…

As we get into the analysis of a number of collections that we hope will be our first born-digital transfers at Archives New Zealand, we find ourselves asking questions of them to ensure they can be ingested with minimal issue into our long-term preservation system. We also want to ensure access is uninhibited for end-users once it arrives there.

Our first attempt at born-digital transfer sees us do this analysis up-front in an attempt to understand the transfers completely, looking at the issues likely to be thrown up on ingest and what pre-conditioning we are likely to have to do before that stage. Some of the questions are also part of a technical appraisal that will help us to understand what to do with examples of files with duplicate content and those that might otherwise be considered non-records or non-evidential e.g. zero-byte files.

The output of the tool represents ALL of the questions that we have considered so far. We do expect there to be more and better questions to be asked as well. Throwing the code and this blog into the public domain can help us build on this work through public input, so:

  • What other information might be useful to query the DROID export for?
  • What output format might be most useful?
  • What formatting of that output format will best lend itself to data visualisation?
  • What other comments and questions do readers of this blog have?

There is still some work to do here. I need to incorporate unit tests into the code base so that everything is as robust as I'd like. I imagine future releases of DROID might initially break this tool's compatibility with DROID CSV exports, so that will have to be catered for at the time.

An important note about maintenance: having created this tool for my day-to-day work, I hope to continue to maintain it as necessary.

One of the things I like about accessing DROID results via a database is that we don’t need an analysis layer on top of it. If users have a different requirement of the database than I have catered for then they can simply use the database and use their own queries on top, using their preferred flavour of programming language. Other ways of using such a database might include re-mapping the output to be suitable for cataloguing and archival description, if one desires.

I have considered adding a temporal angle to the database by enabling the storage of multiple DROID reports relating to the same transfer. This could be used to monitor the results of pre-conditioning, or the analysis of a collection using progressively more up-to-date DROID signature files, which lends itself to reporting and demonstrating progress to management. Realizing this is more difficult, as there doesn't seem to be a single immutable piece of information we can hook into to make it possible: MD5 hashes are likely to change on pre-conditioning, and file paths may change depending on the machine used to complete a scan. Thoughts on this matter are appreciated.

The tool is licensed under the Zlib license and so can be easily re-used and incorporated into other’s work without issue. 

Preservation Topics: Preservation Actions, Identification, Characterisation, Preservation Risks, Tools, Software
Categories: Planet DigiPres

A Weekend With Nanite

28 May 2014 - 9:30pm
Well over a year ago I wrote the "A Year of FITS" blog post describing how we, during the course of 15 months, characterised 400 million harvested web documents using the File Information Tool Kit (FITS) from Harvard University. I presented the technique and the technical metadata and basically concluded that FITS didn't fit that kind of heterogeneous data in such large amounts. In the time that has passed since that experiment, FITS has been improved in several areas, including the code base and the organisation of the development, and it would be interesting to see how far it has evolved for big data. Still, FITS is not what I will be writing about today. Today I'll present how we characterised more than 250 million web documents, not in 9 months, but during a weekend.

Setting the Scene: The Tools and the Data

The Hardware

When we at the Danish State and University Library (SB) started our work on the SCAPE project, we acquired four machines to support this work. It was on those four machines that the FITS experiment was performed and it is on the same four machines that the present experiment was performed. The big difference being the software that handles the processes and how the data is accessed.

The four machines are all blade servers, each with two six-core Intel Xeon 2.93GHz CPUs, 96GB RAM, and 2Gbit network interfaces. Much to the contrary of a traditional node for a Hadoop cluster, these machines do not have local data storage. At SB we very much rely on NAS for data storage; more specifically, our storage is based on the Isilon scale-out NAS solution from EMC. The Isilon system is a cluster designed for storage and at SB it currently stores several PB of digital preservation data.

As NAS is a prerequisite for data storage at SB, we have been doing a lot of experiments on how best to integrate Hadoop with that kind of infrastructure. We have tested two Hadoop distributions, Cloudera and Pivotal HD, in different hardware and software configurations and the setup used for this experiment is the best so far while in no way being good enough. For the experiment we used the Cloudera distribution version 4.5.0, which builds upon Hadoop 2.0.0.

The Data

During the last months we have been moving our copy of the Danish Web Archive onto the above-mentioned Isilon cluster and providing online read-only access through NFS. This makes it possible for relevant jobs to access more than half a PB of web documents, roughly more than 18 billion documents harvested during the last decade. That is a lot of data!

These documents are stored as ARC or WARC container files in a shallow directory tree, with the ARC and WARC files in the nethermost directory. A few months ago we shifted from storing data in the ARC format to the newer, superior WARC format. This will become relevant later in this post.

To select the data for this large-scale experiment, a few simple UNIX find commands were issued, producing a file with the paths of 147,776 ARC files amounting to almost 15TB – roughly 450 million documents.

The Software

For this large-scale characterisation experiment I substituted FITS with the Nanite project. This project is led by Andy Jackson from the UK Web Archive and enables DROID, Apache Tika, and libmagic to be used effectively on the Hadoop platform. For this experiment I did not use the libmagic component. Will Palmer of the British Library (BL) has written a blog post on this project and on how to integrate the Apache Tika parser into Nanite. Will has also been very helpful with uncovering and rectifying the problems found during the present experiment.

Before unleashing Nanite on the big data set, I did a lot of preliminary tests and tweaks in the source code of Nanite, more specifically the nanite-hadoop module, which compiles to a JAR containing a self-contained Hadoop job. (In this blog post, Nanite is a synonym for nanite-hadoop.)

One tweak was to write code for extracting the ARC metadata for each document and storing it alongside the properties extracted by Tika.

Another tweak was to substitute spaces for the newlines in the extracted HTML title elements, as those newlines break the one-key/value-pair-per-line data structure of one of the output data files.

Both changes are now in the main Nanite code base.

After disabling a small hack that enables the code to run on the BL Hadoop cluster, and disabling a preliminary check that all input files are compressed (ours are not), the code should have been ready for some real test runs.

Not quite so! The Nanite project depends on another UK Web Archive project called warc-hadoop-recordreaders, which in turn depends on a version of the Heritrix project that has a bug: it simply cannot handle uncompressed ARC files. When this bug was uncovered (with help from Will and Andy), the problem was easily fixed by using a newer but unreleased version of Heritrix (3.1.2-SNAPSHOT) and rebuilding the dependencies.

During some fun and interesting Skype conversations with Will, we decided to have three different output formats for Nanite. Basically, Nanite produces two kinds of output. One is a traditional data set that is created by the mappers and aggregated in the reducers; it contains lines of MIME type, format version, and, for DROID, PUID values. The other is the set of features extracted from the documents. Each mapper task handles one ARC file at a time and creates a single data set with all the extracted features from all the documents contained in the ARC file. This data set is then stored by the mapper in three different containers.

  • A ZIP file per ARC file that contains one file per document with key/value pairs, one per line.
  • A ZIP file per ARC file containing the serialised metadata objects for each document.
  • A sequence file per ARC file with the serialised metadata objects for each document.

When we know more about who will use this data and for what purpose, those three formats could be reduced to one or substituted by something entirely different, maybe even improving the performance of the process. The above three output formats are only available in Will’s fork of Nanite.

As a last experiment in this prologue I ran Nanite on 221GB of ARC files to test the performance. This test showed a processing speed of 4.5GB/minute and created circa 1GB of extracted metadata. Extrapolating these values, it seemed that the complete 15TB could be processed in less than three days, producing less than 100GB of new data. The latter would easily fit on the available HDFS space, and three days’ processing time would be impressive, to say the least.

Running the Experiment

The only thing left was to execute

$ hadoop jar jars/nanite-1.1.5-74-SNAPSHOT-job-will.jar ~/working/pmd/netarkiv-147776 out-147776

and wait three days.

I just forgot that this was big data. I forgot that with sheer data size comes complexity, and that with 147,776 ARC files, something was inevitably going to go wrong.

Before initiating this first run I had decided that I wouldn’t change code or fiddle with the cluster configuration more than absolutely necessary. My primary focus was to get the job to run, jumping as many fences as necessary to get the job to terminate without failure.

So when, after a few seconds, I got a heap space exception, I didn’t change the heap space configuration. Instead I split the input file into two equal-sized chunks with the intention of running two separate Hadoop jobs.

When I subsequently got an “Exceeded max jobconf size” error, I didn’t experiment with the mapred.user.jobconf.limit Hadoop parameter but split the original input file into chunks of only 30,000 ARC references each.
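The chunking itself is one split invocation. A self-contained sketch (the file names are hypothetical):

```shell
# Mock input list of 70,000 ARC file paths.
seq -f "/archive/dir/file-%.0f.arc" 1 70000 > /tmp/all-arcs.txt

# Split into numbered chunks of at most 30,000 lines each,
# one chunk per Hadoop job.
split -l 30000 -d /tmp/all-arcs.txt /tmp/arcs-chunk-

# Expect three chunks: 30,000 + 30,000 + 10,000 lines.
wc -l /tmp/arcs-chunk-*
```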

After that I started getting “ZIP file must have at least one entry” exceptions. First I removed 4,832 ARC files that didn’t contain documents but instead contained metadata about the original web harvest. I didn’t know these files were present until I got the above error. That didn’t fix the problem. Next I discovered that I unexpectedly also had WARC files in my data set, and the code presumably couldn’t handle those. The set of 147,776 files was now reduced to 79,831, split into three chunks. As a last guard against the job failing I did something bad: I surrounded the failing code with a try-catch instead of understanding the problem—actually I just postponed the real bug-hunt for another opportunity.

Late Saturday evening the first job was about to complete, but as I didn’t want to stay up too late just to start a Hadoop job, I dared to run two Hadoop jobs at the same time. Wrong decision! The second job started spitting out a lot of ”Error in configuring object” messages. So I killed it and went to bed. The next morning the first job had, luckily, completed successfully and I had the first set of results. Of course I then started the second job again on an idle Hadoop cluster but, very worryingly, got the same error as the previous evening! It being Sunday, I closed the ssh connection and tried to forget all about it for the rest of the weekend.

Monday morning at work, the second job started without hiccups and I still don’t know what went wrong!

The second job ran for ten hours, completing 29,999 ARC files out of 30,000, then failed and cleaned everything out. Upon examination of the log files and the input data I discovered an ARC file of size zero! That is a problem, not only for my experiment but certainly also for the web archive, and this discovery has been flagged in the organisation.

I removed the reference to this zero-size ARC file and ran the job again with success. In the third job I also discovered two more zero-size ARC files; their references were removed and the job completed successfully.
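Zero-size files like these can also be weeded out of the input list up front. A minimal sketch (all paths are hypothetical):

```shell
# Mock: two normal ARC files and one truncated, zero-byte one.
mkdir -p /tmp/arcguard
printf 'arcdata' > /tmp/arcguard/ok1.arc
printf 'arcdata' > /tmp/arcguard/ok2.arc
: > /tmp/arcguard/broken.arc

# Keep only paths that point at files larger than 0 bytes.
find /tmp/arcguard -name '*.arc' -size +0c | sort > /tmp/arcs-nonempty.txt

cat /tmp/arcs-nonempty.txt
```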

After 32 hours of processing time, all 79,829 ARC files were processed and a lot of interesting metadata was created for further analysis, in addition to new experience and knowledge gained.


As mentioned above, the job produces two kinds of metadata, but actually it is three kinds: MIME type identifications of the documents, stored in tab-separated text files; extracted metadata for each document, stored in ZIP files; and data about the job run itself. I will analyse the last kind first.

To count the number of processed documents I ran this Bash script

for f in $(cat all-files.txt); do
  rn=$(basename $f .zip)
  s=$(ls -l $f | awk '{print $5}')
  c=$(unzip -l $f | tail -1 | awk '{print $2}')
  echo "$f, $c, $s"
done

which creates a list of (ARC file name, document count, file size) triples. This data gives the total number of processed documents, 260,603,467, and the following distribution of documents per ARC file

Number of documents per ARC

In the tail, not shown, we have 350 ARC files with more than 15,000 documents each. The ARC file with the most documents contains 41,140 of them. Again, an anomaly that would be interesting to dive into, as would many other questions this distribution raises.

The execution times for the three sub-jobs were

job id   amount of data read   number of ARC files read   processing time
0059     2.764TB               30,000                     12hrs, 43mins, 53sec
0069     2.763TB               30,000                     10hrs, 46mins, 21sec
0073     1.820TB               19,829                     7hrs, 44mins, 38sec
total    7.347TB               79,829                     31.24hrs

Basically the cluster processed 3.92GB per minute, or 44 ARC files per minute, or 63,000 ARC files a day, or 2,317 documents per second—whichever sounds most impressive.

To dig a bit deeper into this data set, I collected the run time of each of the 79,829 map tasks. This data can be explored in an HTML table at the job level in the Cloudera Manager, but I wrote a small Java job that converts it into a CSV file. The data was then read into R for analysis. For the interested reader, I created a Gist with the unordered R code that generated the analysis and figures.

The processing time spans from -43s [sic] to 1462s. A very weird observation is that 6,490 tasks, i.e. 8%, were reported as having completed in negative time. As I don’t think the Hadoop cluster utilises temporal shifts, though that would explain the fast processing, this should be investigated further.
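With one runtime per line, extremes like these fall out of a single awk pass. A sketch against mock data (the real CSV layout may differ):

```shell
# Mock per-task runtimes in seconds, including a negative one as reported.
printf '%s\n' -43 12 57 130 1462 88 > /tmp/task-times.csv

# Track the minimum and maximum in one pass over the file.
awk 'NR==1 {min=$1; max=$1}
     {if ($1 < min) min=$1; if ($1 > max) max=$1}
     END {print min, max}' /tmp/task-times.csv
```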

Only 83 map tasks took more than 500s. Ignoring that long thin tail, the distribution of the processing times looks as below

This looks like an expected distribution. The median of this data set is 57s, but that includes the 8% negative values, so it is unclear how much that figure tells us.

The MIME type identifications are aggregated in 10 files, one per reducer. These files are easily aggregated further into one big file. Eyeball examination of this data file reveals that it contains lots of error messages. The cause of these errors should be investigated, and Will has been doing that, but I’ll jump the fence once more, this time using grep

grep -v ^Exception all-parts | grep -v ^IOException > all-parts-cleaned

Still, that was not enough, because after trying to read the file into R I discovered that I needed to remove all single and double quotes

sed 's/"//g' < all-parts-cleaned | sed "s/\'//g" > all-parts-really-cleaned

With that data cleaning completed, the following command finally reads the data into R

r<-read.table("data/all-parts-really-cleaned",sep="\t",header=FALSE, comment.char="",col.names=c("?","http","droid","tika","tikap","year","count"))

Observe the comment.char="" argument. This is necessary because R assumes everything after a # is a comment, and in three instances in the data an HTTP server reported a MIME type value as a series of #s. Why? The same server? This just gives rise to even more interesting questions that can be asked about the anomalies uncovered in this data.

Apart from the above-mentioned error messages, the data also contains another kind of error. The Tika parser timed out on a lot of the documents, 177 million of them, to be exact. That is a 67% failure rate for the Tika parser, which I would consider a serious error. Fortunately this error has already been dealt with by Will and pushed to the main Nanite project.

Had I chosen to include the harvest year in this data during the job, the detected MIME types could have been correlated with harvest year. Instead I can compare the distribution of MIME types between the values reported by the servers that served the documents, the DROID detection, and the Tika detection.

The web servers reported 1370 different MIME types; Tika detected 342; and DROID 319.
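Counts like these drop straight out of the cleaned tab-separated file with cut/sort/uniq. A sketch against mock rows shaped like the read.table columns above (the field positions are an assumption):

```shell
# Mock rows: marker, server MIME, DROID PUID, Tika, Tika parser, year, count.
printf 'x\ttext/html\tfmt/96\ttext/html\ttext/html\t2009\t12\n' >  /tmp/mime.tsv
printf 'x\ttext/html\tfmt/96\ttext/html\ttext/html\t2010\t7\n'  >> /tmp/mime.tsv
printf 'x\timage/gif\tfmt/4\timage/gif\timage/gif\t2009\t3\n'   >> /tmp/mime.tsv

# Distinct MIME types as reported by the web servers (field 2).
cut -f2 /tmp/mime.tsv | sort -u | wc -l
```

Swapping the field number gives the corresponding counts for the DROID and Tika columns.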

A plot of the complete set of these MIME types is presented for completeness

If we select the top 20 MIME types we get a clearer picture

Top 10 with MIME type names



It seems that the web servers generally claim to deliver a few documents in each of a lot of different MIME types and a lot of documents in a few MIME types, primarily HTML pages, when in fact we receive quite a diversified set of document types, as detected by Tika and DROID. This data set could also give a confidence factor for the trustworthiness of the Danish web servers, even a time series plot of the evolution of such a confidence factor. Or… Or… The questions just keep coming…

Manual examination of the MIME type data set also reveals fun stuff. E.g. how did a Microsoft Windows Shortcut end up on the Internet? Or what about doing a complete virus scan of all the documents? That could provide data for research into the evolution of computer viruses.

There’s no end to the interesting questions when browsing this data. And that’s just looking at one feature, namely the MIME type.

The third kind of data created in the Hadoop job gives a basis for much more detailed observations. We’ve got EXIF data, including GPS if available, HTML title and keyword elements, authors, last modification time, etc. I selected at random a data file from a typical ARC file, i.e. one containing circa the average of 3,500 documents. This data file has 430 unique properties!
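Counting unique properties in such a file is straightforward when each line is one key/value pair. A sketch assuming a "key: value" layout (the property names here are made up):

```shell
# Mock extract with one "key: value" pair per line, keys repeating.
printf 'Content-Type: text/html\ntitle: Front page\nContent-Type: text/plain\n' > /tmp/props.txt

# Number of unique property names = unique text before the first colon.
cut -d: -f1 /tmp/props.txt | sort -u | wc -l
```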

These extracted properties are collected in 79,829 ZIP files, and before any analysis would be feasible, these files should be combined into one big sequence file and a few Pig Latin UDFs written. This would facilitate a kind of explorative analysis. How close to a real-time read-eval-print loop such an investigation could be remains to be seen, as this is, unfortunately, a task for another day.

Lessons learned

First off, thank you if you’ve read this far. I did put quite a bit of detail into this blog post—without any TL;DR warning.

The process of going from the idea of this experiment to finishing this blog post has been long but very rewarding.

I’ve uncovered bugs in and added features to Nanite in fun collaboration with Will Palmer, and this is far from over. I will continue working with Will on Nanite, as I see great value in this tool. As a start I would like to address all the issues described in this blog post.

I’ve learned that CPU time is actually much cheaper than developer time. If the cluster runs for 10 hours and fails in the last few minutes, who cares? It’s better than me spending hours trying to dig up a small bug that might be hidden somewhere down in the deepest Java dependencies. Up to a certain break-even threshold, that is!

I’ve learned a valuable lesson: know thy data! When performing jobs that could potentially run for weeks, it is very important that you know the data you’re trying to crunch. Still, it’s equally important that your tools be very robust. They should be able to survive anything that might be thrown at them, because at data volumes like this, every kind of valid and invalid data will be encountered. I.e. if one file in a million is a serious problem, you will encounter 17,000 serious problems processing the web archive. If the job runs for 3 months, that’s almost 10 serious problems an hour.

All this being said and done, we must not forget why we do this. It’s not for the sake of creating fast tools, nor reliable tools. It is to enable the curators to preserve our shared data as easily and as trustworthily as possible. And, especially relevant for web documents, it’s for the benefit of researchers in the humanities: to enable them to get answers to all the different questions they could possibly imagine asking of such a huge corpus of documents from the last decade.

Even though this experiment answered some questions, I now stand with even more questions to be answered. Oh, and I need to run Nanite on 18 billion documents that are just waiting for me on an NFS mount point…

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres