A day after running our Characterisation Hackathon (and helping out with a lively DPC event on PDF/A-3), I'm still feeling exhausted. This was a developer-only event and not as taxing on my facilitation skills as our usual mashups, but it's still been an action-packed few days. All this moaning is of course somewhat irrelevant, as these events are all about the participants, and it was certainly those guys who did the hard work.

After a year predominantly focussed on external SCAPE events (Open Research Challenges Workshop @ iPres, the First SCAPE Training Event in Guimarães) we finally organised another project-internal scenario and developers’ workshop. As always, the event provided a great opportunity for developers and SCAPE scenario holders to get together and talk face-to-face. It also gave those new to the project an opportunity to meet some of the people they’d only spoken to over Skype (or even only over email!).
Flexibility in the workshop agenda enabled each day’s activities to be adjusted slightly so that everyone had the opportunity to discuss and gain awareness of the topics that matter at this point in the project. The main topics were the scenario status updates, scenario refinement, policy representation work, and the functional review criteria. Giving team members an appreciation of these subjects will help improve project-wide understanding and communication, especially as the project steps up integration of the various components we have been working on.
The first day started by reviewing the existing scenarios. Plenty of excellent work has gone on across the project, and much of this is driven and directed by content holders who express their needs and assess solutions through the various Scenarios. From assessing this list, it was clear that we have a good spread of Scenarios underway, and completing these should form the focus for upcoming work. As it stood at the meeting, across the three TestBeds (Large Scale Digital Repositories, Web Content, and Research Datasets), we have 11 active scenarios, 7 just starting to be worked on, with a further 10 not started, postponed or unknown (2 unknowns due to no feedback before the meeting).
Understanding the breadth and status of scenarios helps us understand and prioritise forthcoming work. Combined with the recent gap analysis and the scenario refinement work, which aims to succinctly identify the issue with each scenario and the associated solution’s requirements, it will also help make on-going development work clearer, better directed and easier to measure.
Catherine Jones (STFC) presented an overview of the policy representation work going on within the project. This aims to break down high-level organisational policies into low-level, machine-actionable policies which can be monitored and reacted to by the automated watch component (for example, reacting to risks and constraints derived from policies), as well as used by preservation planning to ensure that Preservation Plans meet institutional needs and requirements (for example, using institutional policies to select appropriate plans or components for plans). As a means to understand this better, the group worked on developing example policies for the three defined policy levels (High-level guidance policies, Preservation policies, and Control Level policies) using an existing scenario (TIFF to JPEG 2000 migration).
On the final day, a couple of demonstrations were given to the group. One focussed on the development of the Component Plugin functionality in Taverna (to enable the creation and use of SCAPE Preservation Components within the Taverna Workbench) and the associated Component Catalogue APIs for storage, access and discovery of these components within the web-based catalogue. The second demo was of Matchbox, a tool for detecting near-similar images, for example where one image is a rotated or scaled transformation of another. In particular this demo focussed on the detection of duplicate images in a collection of content, with excellent results.
To wrap things up, Carl Wilson (OPF) gave a presentation on the Functional Coding Review aspects of the project, which serve to ensure that the software we produce is of good quality and easily maintainable. He covered many related aspects such as documentation, licensing, unit tests, bug tracking and packaging. He also discussed the need for a team of developers who will be responsible for reviewing code against our coding guidelines (as well as reviewing and iterating the guidelines themselves) – if you can spare some time to help with this, please get in touch.
On the whole the workshop was very successful. Flexible arrangements allowed a lot of important work to be covered, with the participants I spoke to all having positive things to say about this approach and the meet-up in general. Ultimately, the benefits of such flexibility probably reflect the fact that constructive participation is what matters in setting the direction and success of any endeavour.
On this note, if you have a scenario with a scalability challenge which SCAPE should be working on, are able to help with functional review, or have feedback on SCAPE outputs (especially from those external to the project) then please get in touch.
Preservation Topics: SCAPE
Part of my work on the SCAPE testbeds involves producing a workflow for the large-scale migration of TIFF files to JP2, with validation. The tests I have run all involve the lossy compression of files.
Two tools that could be used for the validation of image payload, and therefore success of a migration, are Matchbox, developed for SCAPE by AIT, and ImageMagick’s “compare” tool. One of Matchbox’s tests gives a result of SSIM, a value between 0 and 1. The metric I chose to use from “compare” was PSNR, a value in decibels.
I ran some tests using thirty master TIFF files (approximately 28 megapixels in size) to see how effective Matchbox and “compare” were when calculating the sameness of a TIFF and an altered version of that TIFF, for example, with added noise, blur, pixellation and horizontally shifted pixels.
Baseline figures for a high quality lossy compression of a TIFF to JP2, using what is essentially the BL newspaper profile, give a PSNR value of 52 dB (good) and a Matchbox SSIM result of 0.996 (good).
The tests showed that Matchbox successfully identified that the files were similar, despite the alterations. Mean results were greater than 0.995, indicating a good match. For the same comparisons using “compare”, the mean PSNR was between 29 and 39 dB, indicating that “compare” was better able to identify noise within the files, i.e. corruption in this use case, and that they were not identical. Runtime was a major difference between the tools: Matchbox took about five times as long for its comparison as ImageMagick.
Using Matchbox as a means of ascertaining whether the images are exactly the same is not quite what the tool is designed for. It is designed to identify whether the image content within files is the same, for example, in two scans of the same document: “near duplicates”. There is a presentation about its impressive abilities. ImageMagick’s “compare”, on the other hand, cannot compare two files that have different dimensions. These tests are an attempt to identify the success of an image migration (sameness), and the altered files produced clearly lower PSNR scores while SSIM stayed high. PSNR may therefore be the better metric here, as it is more sensitive to small differences between images.
If a migration did not involve the lossy compression of files there are other tests that could be used, such as direct comparison of pixel values.
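To make the PSNR figures above more concrete, here is a minimal sketch of the textbook PSNR definition, 10·log10(MAX² / MSE), applied to two equal-length lists of 8-bit sample values. The function name and the tiny sample data are illustrative assumptions; this is not ImageMagick's implementation, and a real comparison would read the decoded pixel data of both images. An exact match (the lossless case mentioned above) shows up as a mean squared error of zero.

```python
import math


def psnr(reference, candidate, max_value=255):
    """Return the PSNR in dB between two equal-length sequences of samples."""
    if len(reference) != len(candidate):
        raise ValueError("images must have the same number of samples")
    # Mean squared error over all samples.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # pixel data is identical
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-sample "images", purely for illustration.
original = [10, 20, 30, 40]
noisy = [12, 18, 33, 40]

print(psnr(original, list(original)))  # identical data
print(round(psnr(original, noisy), 2))  # slightly altered data
```

The smaller the differences between the two images, the smaller the MSE and the higher the PSNR, which is why a lossy but faithful migration scores higher than a file with added noise.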
Preservation Topics: Preservation Actions, Migration, Tools, SCAPE
This blog post is an answer to willp-bl's post "Mixing Hadoop and Taverna" and is building on some of the ideas that I presented in my blog post "Big data processing: chaining Hadoop jobs using Taverna".
First of all, it is very interesting to see willp-bl's variants of implementing a large scale file format migration workflow using Taverna and Hadoop, and it is also interesting to see the implications which different integration patterns can have on performance and throughput of a workflow run.
However, while the final conclusion that "it’s clear that adding more processing increases execution time" is logically true, I will argue that reading it to mean that using Taverna together with Hadoop necessarily brings a significant performance decrease can be very misleading. In the following, I will explain why whether I should actually care about this depends heavily on the system architecture. And I will tell you why I don't.
The intended use of Hadoop is with a cluster of machines where the MapReduce programming model together with the Hadoop Distributed File System (HDFS) provide a powerful backend for processing large amounts of data.
Even with the help of such a backend, typical preservation tasks, like file format migration of millions of image files or mime type identification of billions of objects in a web archive, are long running processes that can take hours or even days depending on the type of processing, the size of the input data set, hardware specifications of the cluster, etc.
When using Taverna to start this kind of Hadoop job, the only additional cost is the startup time caused by the launcher component, such as Taverna’s Tool service. This cost can be minimized by using the server version of Taverna (Taverna Server) deployed to a servlet container instead of running Taverna in headless mode.
Should I actually care about the cost of 30 seconds additional startup time for initiating a Hadoop job that runs 24 hours? A Taverna workflow managing a sequence of 4 Hadoop jobs can create some minutes of overhead. However, using the wrong integration approach for preservation tools on the record processing level can have much more serious implications when processing 25 million records, for example. I really do care about the latter.
In my opinion, Taverna’s strength is not batch processing performance; in this regard its list processing will always lag behind direct batch processing. I therefore see Taverna’s role in the orchestration layer which - just to stay with willp-bl’s words here - “should not be mixed” with the large scale processing layer.
As I understand it, the real impact on processing time and throughput lies in the way the preservation tools are invoked in the iterative Map execution phase, because any increase here is multiplied by the number of records being processed. Compared to this, the startup time of Taverna is negligible.
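The arithmetic behind this argument can be sketched in a few lines. The 30 seconds, 24 hours and 25 million records come from the text above; the 10 ms of extra cost per record is purely an assumed figure for illustration, and the result is total processing time summed over all records, before it is divided across parallel map tasks.

```python
def startup_overhead_fraction(startup_seconds, job_seconds):
    """Fraction of the job runtime that is spent on launcher startup."""
    return startup_seconds / job_seconds


def per_record_overhead_hours(extra_ms_per_record, records):
    """Total extra processing time, in hours, summed over all records."""
    return extra_ms_per_record * records / 1000 / 3600


# 30 s startup for a Hadoop job that runs 24 hours.
fraction = startup_overhead_fraction(30, 24 * 3600)

# Assumed 10 ms of extra invocation cost per record, 25 million records.
hours = per_record_overhead_hours(10, 25_000_000)

print(f"startup overhead: {fraction:.4%} of the job")
print(f"per-record overhead: {hours:.1f} hours of extra processing")
```

Under these assumptions the startup cost is a few hundredths of a percent of the job, while a small per-record cost adds up to days of processing, which is the point of the comparison.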
Let us quickly look at another example to make this clearer. Considering alternatives for doing a mime type detection on 1 Terabyte of archived web content using Apache Tika and the unix tool “file” we observed the following differences regarding throughput on our experimental cluster with 5 nodes:
In this sense I see willp-bl’s post as a study on alternative integration patterns for a combination of different preservation tools including the Taverna workflow engine itself. But, regarding system integration, especially when it comes to large scale processing, I prefer using Taverna Server separated from the backend in an orchestration layer where it takes care of scheduling a sequence of long-running jobs.
Admittedly, the workflow processing alone could also be done in a batch script; no workflow execution engine is needed for that. It must therefore be noted that Taverna’s functionality is being extended in the SCAPE project with the ability to add semantic annotations to the inputs, outputs, and components of a workflow. In the long term this will help developers – and, I hope, not only less-technical people – to find and use the right preservation components when designing digital preservation workflows.
Background
In 2002 the UK government introduced regulation that required all UK local authorities to provide the British Library with a copy of the electoral register every year. However, the legislation did not require this data to be provided in any particular format and, as a result, the data is sent to the British Library in a variety of digital formats.
This has presented some challenges to the British Library concerning how this information is stored for the long term.
Problems/Challenges
Most of the data is sent to the British Library in the form of delimited text files; however, there are also Excel spreadsheets, Microsoft Word documents and a significant number of PDF files.
In addition, some of the files are zipped and others are password protected. There are also some with incorrect file types; for example, some files have suffixes that indicate an Excel spreadsheet but are in fact text files.
To date the work at the British Library has concentrated on the data supplied as delimited text files, although some work has been done on PDFs.
Delimited Text
Of the delimited text files the nature of the data varies :-
· Some have headers, although the header names are not consistent, while others have no headers.
· Some are comma delimited, while others are tab delimited.
· Some have data values enclosed in quotes, others don’t, while some have a mixture of both quoted and unquoted values.
· Most files include more data than the BL is required to keep and some of it is duplicated.
The Solution
The aim was to create a tool that can take the supplied electoral register data as input and produce a single data file containing just the items of information the BL is required to hold in a normalised format.
Although this specific problem relates to UK Electoral Register data it represents a more general problem of processing disparate text files which need to be rationalised. The software has therefore been designed to be as generic as possible in order that it can be reused with other datasets of this type.
The solution consists of 4 stages :
· Identification
· Characterisation
· Migration
· Collation
Identification
The identification stage involves identifying the type of the incoming file using Apache Tika. At present only text files are supported.
Characterisation
This is the most complex stage as it is responsible for mapping columns from the incoming files to corresponding columns in the normalised output file. In order to do this it is necessary to determine the nature of the data in the incoming file. For example :-
· Is it tabular (delimited)?
· Does it have a header line?
If there is a header line we need to match header names in the incoming file to the required column headers in the output file; this is done using pattern matching.
If there is no header we need to identify the contents of columns using the data in those columns. This is done using a mixture of pattern matching e.g. postcodes, and comparing the contents of columns with expected values, e.g. common surnames.
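The two questions above (is the file delimited, and does it have a header?) can be illustrated with a small sketch. This is not the BL tool's code: the candidate delimiters, the simplified postcode pattern and the header heuristic are all assumptions chosen for illustration, and the input lines are assumed to be non-empty.

```python
import csv
import re

# Simplified UK postcode pattern (illustrative, not a full validator).
POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$")
CANDIDATE_DELIMITERS = [",", "\t"]


def detect_delimiter(lines):
    """Return the delimiter that splits every line into the same number
    (more than one) of columns, or None if the data does not look tabular."""
    best, best_columns = None, 1
    for delimiter in CANDIDATE_DELIMITERS:
        # csv.reader copes with a mixture of quoted and unquoted values.
        widths = {len(row) for row in csv.reader(lines, delimiter=delimiter)}
        if len(widths) == 1:
            columns = widths.pop()
            if columns > best_columns:
                best, best_columns = delimiter, columns
    return best


def has_header(rows):
    """Guess that the first row is a header if it holds no postcode-like
    value while at least one later row does."""
    def looks_like_data(row):
        return any(POSTCODE.match(cell.strip().upper()) for cell in row)
    return not looks_like_data(rows[0]) and any(looks_like_data(r) for r in rows[1:])


# Hypothetical sample with a header and mixed quoting.
sample = [
    '"Surname","Forename","Postcode"',
    '"Smith","Ann","SW1A 1AA"',
    "Jones,Bob,LS1 4AP",
]
delimiter = detect_delimiter(sample)
rows = list(csv.reader(sample, delimiter=delimiter))
print(repr(delimiter), has_header(rows))
```

A production version would need more evidence than postcodes alone, for example the lists of expected values (such as common surnames) mentioned above.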
The information used to carry out this characterisation is held in a normalisation properties file that is specific to the context. It effectively contains a description of the normalised output file.
Examples of the kind of information held in the properties file are :-
· the column headers of the output file,
· whether each item (column) of data is mandatory or optional,
· the kind of delimiter used,
· whether output data is enclosed in quotes,
· regular expressions describing the format of the expected data, e.g. postcode,
· validation information in order to carry out quality assurance; this might take the form of pattern matching or lists of values.
The aim of the characterisation stage is to produce a mapping that describes how the input data maps to the output data, i.e. which columns in the input file need to be extracted and copied to the output file.
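As a rough illustration of such a mapping, the sketch below matches incoming header names against one regular expression per output column and then uses the resulting mapping to copy the wanted columns from a row. The column names, patterns and sample headers are hypothetical stand-ins for what the normalisation properties file would describe; they are not taken from the actual tool.

```python
import re

# Hypothetical description of the normalised output file: one pattern per
# output column, standing in for entries in the normalisation properties file.
OUTPUT_COLUMNS = {
    "surname": re.compile(r"^(sur|last)[ _]?name$", re.I),
    "forename": re.compile(r"^(fore|first)[ _]?name$", re.I),
    "postcode": re.compile(r"^post[ _]?code$", re.I),
}


def build_mapping(input_headers):
    """Map each output column to the index of the matching input column."""
    mapping = {}
    for index, header in enumerate(input_headers):
        for column, pattern in OUTPUT_COLUMNS.items():
            if column not in mapping and pattern.match(header.strip()):
                mapping[column] = index
                break
    return mapping


def migrate_row(row, mapping):
    """Copy only the mapped columns, in the order of the output file."""
    return [row[mapping[column]] for column in OUTPUT_COLUMNS if column in mapping]


headers = ["ElectorNo", "Last Name", "First_Name", "Address", "PostCode"]
mapping = build_mapping(headers)
print(mapping)
print(migrate_row(["17", "Smith", "Ann", "1 High St", "SW1A 1AA"], mapping))
```

Columns that are not described in the output definition (here the elector number and address) are simply left out of the mapping, which is how surplus data gets dropped.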
Migration
The next step is migration which involves copying data from the input file to the output file. The mapping, produced by the characterisation stage, makes this a relatively trivial task, although it also incorporates QA using validation information stored in the properties file.
Collation
The first three stages, identification, characterisation and migration, are carried out by the Hadoop map process. The more usual Hadoop scenario is that of one large file that is split between a number of Hadoop nodes. In this case we have a large number of relatively small files and the input to the Hadoop map method is a file containing a list of all the incoming files to be processed. These files are then distributed between the available nodes in the Hadoop cluster. The output from the map process is a single normalised output file per incoming file. The Hadoop reduce process then merges these files together to produce a single normalised output file.
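The shape of this map and reduce split can be mimicked in plain Python. The sketch below does not use the Hadoop API: the in-memory "files", the normalise_file helper and the upper-casing that stands in for real normalisation are all assumptions for illustration. The map step produces one normalised output per incoming file, and the reduce step merges them into a single output.

```python
def map_phase(file_list, normalise_file):
    """Map step: produce one normalised output (a list of rows) per file."""
    return {path: normalise_file(path) for path in file_list}


def reduce_phase(per_file_rows):
    """Reduce step: merge the per-file outputs into a single list of rows."""
    merged = []
    for path in sorted(per_file_rows):
        merged.extend(per_file_rows[path])
    return merged


# Hypothetical in-memory stand-ins for the incoming files.
FAKE_FILES = {
    "authority_a.csv": ["Smith,SW1A 1AA"],
    "authority_b.csv": ["Jones,LS1 4AP", "Brown,M1 1AE"],
}


def normalise_file(path):
    """Placeholder for identification, characterisation and migration."""
    return [line.upper() for line in FAKE_FILES[path]]


# The input to the map step is a list of file names, not the file contents.
merged = reduce_phase(map_phase(sorted(FAKE_FILES), normalise_file))
print(merged)
```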
Results
The tool has worked well in producing normalised output from incoming electoral register data. In order to check how generic it is the tool was also tested against another, completely different, set of government data files. All that should be needed to generate a normalised dataset from the new input files is a new version of the normalisation properties file. When the tool was tested against the new dataset it successfully produced the required normalised data file.
Preservation Topics: Characterisation
We are pleased to welcome two new affiliate organisations to our membership: Portico, a digital preservation service, and the School of Information and Library Science (SILS) at the University of North Carolina at Chapel Hill, a number one ranked school in the United States.
“We are delighted these organisations have decided to join us,” said Bram van der Werf, Executive Director of OPF. “Both organisations have a significant impact on the digital preservation practice on a global scale. The UNC brings expertise in research and education, and Portico for digital preservation services. As members of OPF, UNC and Portico will both play a major role in the further development of tools which are relevant for the OPF community.”
Both Portico and SILS are enthusiastic to be a part of OPF. “Portico values OPF’s focus on practical solutions and its emphasis on a broad-based and active community of practise. We share OPF's belief that the digital preservation community as a whole benefits from the wide-spread sharing of experience, tools, and techniques,” said Kate Wittenberg, Managing Director of Portico. “I am very excited that SILS is joining the OPF and will be helping to extend the reach of the very successful OPF model and community into the US,” said Christopher (Cal) Lee, Associate Professor at SILS.
SILS is the first iSchool to join the OPF, and Portico the first digital preservation service organisation. The OPF now has three member organisations from the US. The current list of members may be viewed at: http://openplanetsfoundation.org/members. For more information about UNC SILS, visit: http://sils.unc.edu/. For more information about Portico, visit: http://www.portico.org/digital-preservation/.
Preservation Topics: Open Planets Foundation
If you are, what are you retaining and why?
If not, why not?
There is more to come from us on this topic - but for now I'd love to hear any opinions / thoughts.
And what do I mean by technical provenance?
Good question. I mean any filename sanitisation, QA changes to (meta)data, file structure moves, normalisation data, or details of any technical process that has touched the original bitstream as it was found (at rest, if applicable) on its source medium.
Preservation Topics: Preservation Risks
Building a Debian package from a program written in Ruby is not a straightforward task. This post is intended as a step-by-step practical guide to packaging Ruby programs, based on the lessons we learned during the debianization process.
In this guide we will use a sample program: Pagelyzer (http://wiki.opf-labs.org/display/TR/Pagelyzer). It is an interesting example because of its complexity: it contains Ruby code, Java and JavaScript, as well as some binary libraries in C.
Packaging Ruby scripts is not that different from packaging other software, but it follows different rules. Debian packaging software relies on standard Linux development tools, such as make, and this build step is crucial to constructing a deb package.
As Ruby is normally interpreted (it can be compiled, but that is unusual), the make command will not work because there is no Makefile. The Ruby community has therefore put in place an alternative for going through the whole process. The proutils Ruby project provides everything needed to create a deb package. Its goal is to work in the same way as the make command, so that the packaging software won't complain during the process.
This tutorial is a summary based on the tutorial by Ubuntu developer David Green, posted in Sep 2012 (https://wiki.ubuntu.com/PackagingGuideDeprecated/Ruby).
In this section we will describe the software and file structures needed to make the package.
Setting up the environment
Here's what we need to begin packaging our software:
The corresponding apt-get command:
$ sudo apt-get install ruby1.9.1-full wget dh-make build-essential fakeroot cdbs debhelper ruby-pkg-tools
Some of the tools introduced below will look for two environment variables to guess your name and email address to put in the package metadata, let's set them up here:
$ export DEBFULLNAME="Your Name"
$ export DEBEMAIL="Your.Email@address.here"
You should also add these to your .bashrc or other shell startup script if you want them to be set up automatically.
Creating the Source Archive
To create the source archive we need to:
Create a directory named in the format package-name-version (we will use pagelyzer-ruby-0.9) and change into the new directory:
$ mkdir pagelyzer-ruby-0.9
$ cd pagelyzer-ruby-0.9
We need to download the archive containing the setup.rb file:
$ wget http://i.loveruby.net/archive/setup/setup-3.4.1.tar.gz
We only need the setup.rb file itself; after extracting the archive, the rest of its files can be deleted.
Alternatively, download it from the attachment (bottom of the page): http://www.openplanetsfoundation.org/system/files/setup-3.4.1.zip
Create the Directory Structure
The directory structure used by setup.rb is as follows:
PackageTop/
    lib/
        (ruby scripts)
    ext/
        (ruby extensions)
    bin/
        (commands)
    data/
        (data files)
    etc/
        (configuration files)
    man/
        (manual pages)
    test/
        (tests)
(taken from the setup.rb manual)
Create these directories:
$ mkdir lib ext bin data etc man test
Create other directories that will be used:
$ mkdir man/man1 data/pagelyzer-ruby data/pagelyzer-ruby/js data/doc data/doc/pagelyzer-ruby
Add the Files
The following table shows where each file goes in the directory structure:
File                        Folder
pagelyzer_analyzer          bin
pagelyzer_capture           bin
pagelyzer_changedetection   bin
pagelyzer_block.rb          lib
pagelyzer_convex_hull.rb    lib
pagelyzer_dimension.rb      lib
pagelyzer_driver.rb         lib
pagelyzer_heuristic.rb      lib
pagelyzer_point.rb          lib
pagelyzer_separator.rb      lib
pagelyzer_url_utils.rb      lib
pagelyzer_util.rb           lib
js/compress_js.rb           data/pagelyzer-ruby/js
js/decorate.js              data/pagelyzer-ruby/js
js/decorate_mini.js         data/pagelyzer-ruby/js
marcalizer.zip              data/pagelyzer-ruby
pagelyzer_diff.jar          data/pagelyzer-ruby
Note: All .rb files in the bin and lib folders should be executable. Otherwise, setup.rb will not include them.
We need to create a manpage for each executable file in /usr/bin. To do this, edit man/man1/pagelyzer_changedetection.1. Here is a small example; a real manpage should be more extensive.
.TH pagelyzer_changedetection 1 "JAN 20 2013" "Andrés Sanoja"
.SH NAME
pagelyzer_changedetection \- a tool for detecting changes in web pages and their rendering
.SH SYNOPSIS
.B pagelyzer_changedetection
.BR [string]
.PP
.SH DESCRIPTION
Covers the change detection process: capture, segmentation, version analysis (visual and structural)
.PP
.SH AUTHOR
.TP
Andrés SANOJA <andres.sanoja@lip6.fr>
Myriam Ben Saad <myriam.ben-saad@lip6.fr>
Marc Law <marc.law@lip6.fr>
Carlos Sureda <carlos.sureda@lip6.fr>
Jordi Creus <Jordi.Creus@lip6.fr>
Note: manpages are written in the nroff format. You can also use other formats such as ri or pod and convert them to nroff.
Test That it Works
Install pagelyzer-ruby on your system using setup.rb directly:
$ ruby setup.rb config
$ sudo ruby setup.rb install
Next run:
$ capture.rb --url=http://www.lip6.fr
which should output the web page screenshot, decorated file and source code in the ~/pagelyzer/out folder.
Also, test that the manpage works:
$ man pagelyzer_analyzer
To uninstall run:
$ sudo rm -rfi `cat InstalledFiles`
Delete the '.config' file:
$ rm .config
Create the Tarball
Create a gzipped tar archive of the working folder:
$ cd ..
$ tar cavf pagelyzer-ruby-0.9.tar.gz pagelyzer-ruby-0.9
This should create your source archive, pagelyzer-ruby-0.9.tar.gz.
The Packaging Process
To create a package we need to:
We are going to use dh_make, which will create a template for us to work from. Run it from inside the pagelyzer-ruby-0.9 directory:
$ dh_make -c lgpl -s -r cdbs -f ../pagelyzer-ruby-0.9.tar.gz
which means: -c lgpl tells it that the package is licensed under the LGPL; -s tells it that we just want one binary package; -r cdbs tells it to use CDBS, the Common Debian Build System, which keeps our packaging simple so we can concentrate on the Ruby-specific things; and -f ../pagelyzer-ruby-0.9.tar.gz tells it that we are using the ../pagelyzer-ruby-0.9.tar.gz file as our source.
You should see something like:
Maintainer name : Your Name
Email-Address : Your.Email@address.here
Date : Wed, 24 Jan 2013 19:53:51 +0530
Package Name : pagelyzer-ruby
Version : 0.9
License : lgpl3
Using dpatch : no
Using quilt : no
Type of Package : cdbs
Hit <enter> to confirm:
Currently there is no top level Makefile. This may require additional tuning.
Please edit the files in the debian/ subdirectory now.
Before we look at what has happened inside the pagelyzer-ruby-0.9/ directory, let's see what has happened to the directory above it:
$ ls ..
You'll notice that there is a file here that we haven't created: pagelyzer-ruby_0.9.orig.tar.gz. Packaging programs, in addition to the binary package, also generate a source package, which consists of three files: ${PKGNAME}_${VER}.orig.tar.gz (the original upstream tarball), ${PKGNAME}_${VER}-${PKGVER}.diff.gz (a diff file for the debian/ directory) and ${PKGNAME}_${VER}-${PKGVER}.dsc (a signed summary of the source package). Because we told dh_make where our upstream source tarball was, it renamed it appropriately (${PKGNAME}_${VER}.orig.tar.gz). We could very well have renamed it ourselves and not passed the -f option, but we chose to be lazy!
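The naming scheme is easy to get wrong by hand, so here is a small sketch that expands the three patterns for a given package. The function name is made up for illustration, and the Debian revision "1" in the example is an assumption (in this guide the actual revision comes from the changelog).

```python
def source_package_files(name, upstream_version, debian_revision):
    """Expand the three source package file name patterns."""
    base = f"{name}_{upstream_version}"
    return [
        f"{base}.orig.tar.gz",                    # original upstream tarball
        f"{base}-{debian_revision}.diff.gz",      # diff for the debian/ directory
        f"{base}-{debian_revision}.dsc",          # summary of the source package
    ]


# Assumed Debian revision "1", for illustration only.
for filename in source_package_files("pagelyzer-ruby", "0.9", "1"):
    print(filename)
```

Note the underscore between the package name and the version: the working directory uses a hyphen (pagelyzer-ruby-0.9), but the source package files do not.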
Rename debian/postinst.ex file
postinst.ex is a template we need later; rename it to postinst (without extension):
$ mv debian/postinst.ex debian/postinst
Remove Unnecessary Files
Some of the files created are examples and not required. We can delete those with this command:
$ rm debian/*.ex debian/*.EX debian/READ*
Edit debian/rules
Set the contents of debian/rules to this:
#!/usr/bin/make -f
# -*- mode: makefile; coding: utf-8 -*-
include /usr/share/cdbs/1/rules/debhelper.mk
include /usr/share/ruby-pkg-tools/1/class/ruby-setup-rb.mk
This tells the packager to use setup.rb to create the package.
Edit debian/control
Edit the contents of debian/control to something like this:
Source: pagelyzer-ruby
Section: misc
Priority: extra
Maintainer: Andrés Sanoja <andres.sanoja@lip6.fr>
Build-Depends: cdbs, debhelper (>= 8.0.0), ruby-pkg-tools
# ruby1.9.1-full, libxslt-dev, libxml2-dev, openjdk-7-jdk, imagemagick, ruby1.9.1-dev
Standards-Version: 3.9.2
Homepage: http://wiki.opf-labs.org/display/TR/Pagelyzer
#Vcs-Git: git://git.debian.org/collab-maint/pagelyzer.git
#Vcs-Browser: http://git.debian.org/?p=collab-maint/pagelyzer.git;a=summary

Package: pagelyzer-ruby1.9.1
Architecture: amd64
Depends: ruby1.9.1, cdbs, debhelper (>= 8.0.0), ruby-pkg-tools, libxslt-dev, libxml2-dev, openjdk-6-jdk, imagemagick, ruby1.9.1-dev, ${shlibs:Depends}, ${misc:Depends}
# ruby1.9.1-full
# openjdk-7-jdk
Description: Suite of tools for detecting changes and its rendering
 Tool for the web pages comparison based on structural and visual approach.
 Research challenge for this tool is the learning algorithm based on frequency.
 .
 Pagelyzer is a tool which compares two web pages versions and decides if they
 are similar or not.
 .
 It is based on:
  * a combination of structural and visual comparison methods embedded in a
    statistical discriminative model,
  * a visual similarity measure designed for Web pages that improves change
    detection,
  * a supervised feature selection method adapted to Web archiving.
 .
 We train a Support Vector Machine model with vectors of similarity scores
 between successive versions of pages. The trained model then determines whether
 two versions, defined by their vector of similarity scores, are similar or not.
 Experiments on real Web archives validate our approach.

Package: pagelyzer-ruby
Architecture: amd64
Depends: pagelyzer-ruby1.9.1, ${misc:Depends}
# , ruby1.9.1-full, cdbs, debhelper (>= 8.0.0), ruby-pkg-tools, libxslt-dev, libxml2-dev, openjdk-6-jdk, imagemagick,ruby1.9.1-dev, ${shlibs:Depends}
# openjdk-7-jdk
Description: Suite of tools for detecting changes and its rendering
 metapackage
 Suite of tools for detecting changes and its rendering.
 Dummy package for pagelyzer-ruby1.9.1
Note that we need to split the package into a Ruby-version-dependent package (dependent on ruby1.9.1) and a dummy package that depends on it. If we don't do this, the packaging process will seem to work OK, but none of the files we created will be in the resulting .deb files! (Remark made by SevenMachines on the Ubuntu Forums thread.)
Edit debian/postinst actions
Some Ruby gems must be present for the software to work properly. In the debian/postinst file (the template renamed from postinst.ex) add the following:
...
# dh_installdeb will replace this with shell code automatically
# generated by other debhelper scripts.
sudo ln -sf /usr/bin/ruby1.9.1 /usr/bin/ruby
sudo ln -sf /usr/bin/gem1.9.1 /usr/bin/gem
sudo gem install --version '= 0.8.6' hpricot
sudo gem install --version '= 1.5.5' nokogiri
sudo gem install --version '= 2.0.3' sanitize
sudo gem install --version '= 2.29.0' selenium-webdriver
#DEBHELPER#
...
Edit debian/changelog and debian/copyright. Make sure you edit these correctly - especially the debian/copyright file.
The SCAPE project uses Git for version control, so all the changelog information is already there. The easiest approach is to download the gitlog-to-deblog script (https://github.com/rackerhacker/gitlog-to-deblog), change into a Git working folder and generate the changelog file.
Pay close attention to the package name and version number in the generated file: they must match those of the package. In our case the version is 0.9, but the Git-derived value can differ. For example,
pagelyzer (initial-11-gbbcc12f) unstable; urgency=low
* Including performance test and enhacements in change_detection.rb
should be changed to something like this:
pagelyzer-ruby (0.9-11-gbbcc12f) unstable; urgency=low
* Including performance test and enhacements in change_detection.rb
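If the generated changelog has many entries, this rewrite can be scripted. The sketch below is one possible approach, not part of gitlog-to-deblog: it assumes every header line has the form "name (tag-commits-ghash) rest", that the tag itself contains no hyphen, and the function name is made up for illustration.

```python
import re

# Matches a changelog header such as:
#   pagelyzer (initial-11-gbbcc12f) unstable; urgency=low
HEADER = re.compile(r"^(?P<name>\S+) \((?P<version>[^)]+)\) (?P<rest>.*)$")


def fix_changelog_header(line, package, upstream_version):
    """Set the package name and replace the leading tag in the version,
    keeping the '-<commits>-g<hash>' suffix. Other lines pass through."""
    match = HEADER.match(line)
    if not match:
        return line
    version = match.group("version")
    # Assumes the tag part (e.g. "initial") contains no hyphen.
    suffix = version.split("-", 1)[1] if "-" in version else ""
    new_version = upstream_version + ("-" + suffix if suffix else "")
    return f"{package} ({new_version}) {match.group('rest')}"


before = "pagelyzer (initial-11-gbbcc12f) unstable; urgency=low"
print(fix_changelog_header(before, "pagelyzer-ruby", "0.9"))
```

Entry lines that start with "*" do not match the header pattern, so they are returned unchanged.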
And an example of 'copyright' file:
Format: http://dep.debian.net/deps/dep5
Upstream-Name: pagelyzer-ruby
Source: https://github.com/openplanets/pagelyzer

Files: *
Copyright: 2011, 2012 Andrés Sanoja <afsanoja@gmail.com>
           2011, 2012 Stéphane Gançarski <Stephane.Gancarski@lip6.fr>
           2011, 2012 Zeynep Pehlivan <zeynep.pehlivan@gmail.com>
           2011, 2012 Denis Pitzalis <denis.pitzalis@gmail.com>
           2011, 2012 Marc Law <marc.law@lip6.fr>
License: LGPL-3.0+

Files: debian/*
Copyright: 2013 Jordi Creus Tomàs <Jordi.Creus@lip6.fr>
License: LGPL-3.0+

License: LGPL-3.0+
 This package is free software; you can redistribute it and/or
 modify it under the terms of the GNU Lesser General Public
 License as published by the Free Software Foundation; either
 version 3 of the License, or (at your option) any later version.
 .
 This package is distributed in the hope that it will be useful,
 but WITHOUT ANY WARRANTY; without even the implied warranty of
 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 Lesser General Public License for more details.
 .
 You should have received a copy of the GNU Lesser General Public License
 along with this program. If not, see <http://www.gnu.org/licenses/>.
 .
 On Debian systems, the complete text of the GNU Lesser General
 Public License can be found in "/usr/share/common-licenses/LGPL-3".
Make sure you are in the project root directory and use the 'debuild' command to create the packages:
$ debuild -us -uc
That will build the source and binary packages. The -us and -uc options tell debuild not to sign the source package and the .changes file. (We would need to sign them to upload them to Ubuntu/Debian/your PPA, but we'll skip that in this tutorial. If you are interested in PPAs, there was a session about them at https://wiki.ubuntu.com/MeetingLogs/openweekhardy/LaunchpadPPAs.)
It should create two separate packages, pagelyzer-ruby1.9.1_0.9-rr-cccc_i386.deb and pagelyzer-ruby_0.9-rr-cccc_i386.deb, where rr is the last revision number and cccc the hash of that revision (both taken from the changelog file).
For some reason this command alone was not enough for us; we also had to run:
$ dpkg-buildpackage
Note: be careful if you use a virtual machine (e.g. VirtualBox). Your files should be in a folder where you have “real” write permissions (e.g. a folder inside your virtual disk). If you keep your files in a shared folder instead, you will come across a 'Read-only file system' error.
Signing your package
First you need a GPG key. Follow the steps at http://keyring.debian.org/creating-key.html. If you have never run gpg before, run it once:
$ gpg
This will create the ~/.gnupg directory, then you will be able to modify the ~/.gnupg/gpg.conf file according to the tutorial. Finally, do not forget to make your new key publicly available to the pgp server (it will be automatically distributed on the pgp network in a few minutes):
$ gpg --keyserver subkeys.pgp.net --send-key 12345678
Now, you can finally sign your package by running the command:
$ debuild -k12345678
Copying GPG keys across different machines
If you are building a package for several architectures (i386, amd64, ...), you must sign all of them with the same key. To copy a GPG key from one machine to another, simply copy all the *.gpg files (pubring.gpg, secring.gpg and trustdb.gpg) and gpg.conf from the ~/.gnupg directory on one machine to the other. (The random_seed file is not mandatory.)
Uploading your package to the repository
Install and configure dupload according to the tutorial at http://wiki.opf-labs.org/display/SP/Submitting+Your+Package. Do not forget to enable FTP_PASSIVE mode!
$ export FTP_PASSIVE=1
Finally, you can upload your files:
$ dupload pagelyzer-ruby_0.9-12-gbbcc12f_amd64.changes
...
$ dupload pagelyzer-ruby_0.9-12-gbbcc12f_i386.changes
...
To install the tool, double-click the .deb file:
pagelyzer-ruby1.9.1_0.9-xx-yyyyyyy_i386.deb
This will install everything you need as well as the tool itself.
Authors: Andrés Sanoja & Jordi Creus
As part of our work on test-beds for the SCAPE project we have been investigating the various ways in which a large scale file format migration workflow could be implemented. The underlying technologies chosen for the platform are Hadoop and Taverna. One of the aims of the SCAPE project is to allow the automatic generation and execution of Taverna workflows, which will be executed via Hadoop.
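The four methods for implementing a file format migration workflow that we tested were:
* A batch execution shell script
* A Java workflow via Hadoop (CommandLineJob)
* A Taverna workflow via Hadoop (TavernaCommandLineJob)
* An XML workflow, via Taverna calling Hadoop (XMLCommandLineJob & XMLWorkflowReport)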
The code is a generic wrapper for Hadoop, set up so the different types of workflow can be chosen at runtime.
The example workflow we used is a file format migration from TIFF to JPEG2000, followed by some validation of the file structure and image data.
The structure of the workflows is broadly the same as that for SCAPE LSDRT-3.
The batch execution shell script
The shell script contains a rudimentary version of the above workflow. No reporting is performed by the implementation of the shell script used.
The Java workflow via Hadoop (CommandLineJob)
This workflow is controlled from a Java class: CommandLineJob. It contains the full workflow above. There is Java code to produce a report, generate a zip file containing an empty SUCCESS/FAILURE file and return success/failure back through to the HDFS result file.
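The success/failure convention described above, an empty marker file inside the result zip, can be illustrated as follows. This is a Python sketch of the idea only; the actual implementation is the Java class CommandLineJob, and the `report.txt` entry name and both function names are assumptions made for the example.

```python
import io
import zipfile


def build_result_zip(succeeded, report_text):
    """Return the bytes of a zip holding a report and an empty status marker."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("report.txt", report_text)  # hypothetical report name
        # The marker carries no content; only its name matters.
        archive.writestr("SUCCESS" if succeeded else "FAILURE", b"")
    return buffer.getvalue()


def read_status(zip_bytes):
    """Return True for SUCCESS, False for FAILURE, based on the marker file."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = set(archive.namelist())
    if "SUCCESS" in names:
        return True
    if "FAILURE" in names:
        return False
    raise ValueError("zip contains no SUCCESS/FAILURE marker")


if __name__ == "__main__":
    print(read_status(build_result_zip(True, "migration ok")))
```

A consumer of the results only needs to list the zip entries to learn the outcome, without parsing the report.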
The Taverna workflow via Hadoop (TavernaCommandLineJob)
The Taverna workflow contains the full workflow, and is executed by Hadoop calling the Taverna command line client. The reporting and zip generation steps are written in Beanshell/shell script. Success/failure is reported in one of two ways: an empty SUCCESS/FAILURE file in the zip, or if the failure was more serious, a log file ending in “.error” is stored in HDFS, with no other outputs.
There is code in the repository to call Taverna Server from Hadoop, instead of the Taverna command line. The code has not been tested for a while and may not work. Note that input files for the workflow are not currently uploaded to the server by this class so it will only work on a single node Hadoop machine at the moment.
The XML workflow, via Taverna calling Hadoop (XMLCommandLineJob & XMLWorkflowReport)
This method incorporates more technologies. Initially, the workflow is loaded and run from Taverna (note that only Taverna Workbench has been tested so far). The workflow contains several calls to Hadoop to execute an XMLCommandLineJob, with a command line and parameters defined in XML files. Generated files from each call are stored in HDFS and tracked with a JobTracker class. Messages regarding the success of each step are queued to an ActiveMQ instance. The final step in the workflow runs an XMLWorkflowReport via Hadoop. It wraps up the processing by gathering all the messages for the previous steps, generating a short report, checksumming all the generated files and producing a zip.
Execution time
The above methods were tested on a single-core Debian Wheezy VM with 2 GB RAM. The test files were thirty files from our JISC1 newspaper collection. When encoded to JP2 with certain settings using OpenJPEG, one of the files is known to produce a bad encode. The setting that makes the difference is coder bypass being enabled (-M 1); this has been reported to OpenJPEG.
| | Shell script | Hadoop->Java | Hadoop->Taverna | Taverna->Hadoop |
|---|---|---|---|---|
| Runtime (mm:ss) | 36:08 | 41:59 | 76:58 | 77:54 |
| Runtime/file (mm:ss) | 01:12 | 01:23 | 02:33 | 02:35 |
| MB/hour | 772.14 | 664.55 | 362.49 | 358.15 |
| Errors (true positive) | NA~ | 1 | 1 | 1 |
| Errors (false positive) | NA~ | 0 | 1* | 5^ |
~ No reporting was present in the shell script
*Matchbox SSIM was 0.85 (i.e. <0.9). When the jp2 file was first on the “compare” command line the result was 0.85, when the TIFF file was first the result was 0.97. Both orderings need to be checked.
^All had exit code 137 from Matchbox SIFT comparison, which indicates out of memory
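The Runtime/file and MB/hour figures above can be reproduced from the total runtimes. The sketch below assumes thirty files and a total input size of about 465 MB. The size is not stated in the post; it is inferred here from the MB/hour figures. Per-file times are truncated to whole seconds, which is what matches the reported values.

```python
def to_seconds(mmss):
    """Convert a 'mm:ss' string (minutes may exceed 59) to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def runtime_per_file(total_mmss, file_count):
    """Average runtime per file as 'mm:ss', truncated to whole seconds."""
    per_file = to_seconds(total_mmss) // file_count
    return f"{per_file // 60:02d}:{per_file % 60:02d}"


def mb_per_hour(total_mmss, total_mb):
    """Throughput in MB per hour for a run that processed total_mb megabytes."""
    return total_mb * 3600 / to_seconds(total_mmss)


if __name__ == "__main__":
    ASSUMED_TOTAL_MB = 465  # inferred from the table, not given in the post
    for label, runtime in [
        ("Shell script", "36:08"),
        ("Hadoop->Java", "41:59"),
        ("Hadoop->Taverna", "76:58"),
        ("Taverna->Hadoop", "77:54"),
    ]:
        print(label, runtime_per_file(runtime, 30),
              round(mb_per_hour(runtime, ASSUMED_TOTAL_MB), 2))
```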
Conclusions
From the numbers above it’s clear that adding more processing increases execution time. Despite this increase, we hope to be able to develop tools that allow less technical people to develop and execute their own workflows. Some of the steps still need a good technical understanding, such as the Beanshell and Java code needed to glue the Taverna workflows together.
Preservation Topics: Preservation Actions, Migration, Tools, SCAPE, jpylyzer
Following a few interesting conversations recently, I got interested in the idea of 'bit flip' - an occasion where a single binary bit changes state from a 0 to a 1 or from a 1 to a 0 inside a file.
I wrote a very inefficient script that sequentially flipped every bit in a JPEG file, saved the new bitstream as a JPEG, attempted to render it with the [im] Python library and, if that succeeded, calculated an RMSe error value for the new file.
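The core of such an experiment can be sketched in a few lines of Python. This is not the original script: the rendering step is left out (it would need an imaging library to decode each damaged file into pixel values), and the function names and the bit-numbering convention are assumptions made for illustration.

```python
import math


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted.

    Bit 0 is taken to be the most significant bit of the first byte.
    """
    byte_index, offset = divmod(bit_index, 8)
    flipped = bytearray(data)
    flipped[byte_index] ^= 0x80 >> offset
    return bytes(flipped)


def every_single_bit_flip(data):
    """Yield (bit_index, damaged_bytes) for each possible single-bit flip."""
    for bit_index in range(len(data) * 8):
        yield bit_index, flip_bit(data, bit_index)


def rmse(original_pixels, damaged_pixels):
    """Root-mean-square error between two equal-length pixel sequences."""
    if len(original_pixels) != len(damaged_pixels):
        raise ValueError("pixel sequences differ in length")
    squared = sum((a - b) ** 2 for a, b in zip(original_pixels, damaged_pixels))
    return math.sqrt(squared / len(original_pixels))


if __name__ == "__main__":
    sample = b"\xff\xd8"  # just the JPEG start-of-image marker, for illustration
    for index, damaged in every_single_bit_flip(sample):
        print(index, damaged.hex())
```

Each damaged bitstream would then be written to disk, decoded if possible, and compared against the pixels of the undamaged image with `rmse`.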
I've not really had much time to take this further at the moment, but it's an academic notion I'd be interested in exploring some more.
I'm not sure if a bit flip is a theoretical or a 'real' threat on modern storage devices - in the millions of digital objects that have passed through my hands in the past 10 years, I've never knowingly handled a file damaged by a random bit flip. I'd be interested in any thoughts / experiences / observations on the topic.
Please see the attached file for some pretty pictures.
Feel free to get in touch if you want any more data - images, RMSe data or scripts.
Preservation Topics: Bit rot
The kick-off meeting of the Succeed project (http://www.succeed-project.eu) took place last Friday, February 1, in Paris.
Succeed is a project coordinated by the Universidad de Alicante and supported by the European Commission with a contribution of €1.8 million. The core objective of Succeed is to promote the take-up of the research results generated by technological companies and research centres in Europe in a strategic field for Europe: digitisation and preservation of its cultural heritage. Succeed will foster the take-up of the most recent tools and techniques by libraries, museums and archives through the organisation of meetings of experts in digitisation, competitions to evaluate techniques, technical conferences to broadcast results and through the maintenance of an online platform for the demonstration and evaluation of tools. Succeed will contribute in this way to the coordination of efforts for the digitisation of cultural heritage and to the standardisation of procedures. It will also propose measures to the European Union to foster the dissemination of European knowledge through centres of competence in digitisation, such as Open Planets Foundation, PrestoCentre, APARSEN, 3D-COFORM Virtual Competence Centre, and V-MusT.net. In addition to the University of Alicante, the consortium includes the following European institutions: the National Library of the Netherlands, the Dutch Institute of Lexicology, the Fraunhofer Gesellschaft, the Poznań Supercomputing Centre, the University of Salford, the Foundation Biblioteca Virtual Miguel de Cervantes Saavedra, the French National Library and the British Library. For additional information, please contact Rafael Carrasco (Universidad de Alicante) or send an email to succeed@ua.es.
Last week I had the honour to host the OPF Webinar "Digital Preservation at your command, part II".
During the Webinar attendees were shown the differences and similarities between the command line interfaces of MS DOS, Linux and Apple.
Here is a short summary of the Webinar:
* Comparison of command line interfaces (MS DOS, Linux, Apple) and a little history
* Invoking a command line application, arguments and argument length, input/output and redirection
* Caveats of the MS DOS command line (UTF8, forking)
* Managing performance and load balancing of processes
* Using command line applications and FIDO to create custom functionality
If you missed it, the video is now available on YouTube.
The DIY custom directory listing tool "FIDIR", as demonstrated in the Webinar, shows you the PUIDs and mime-types of files. The source code for MS DOS and Linux/Mac can be found at the OPF GitHub. You'll also need FIDO to run FIDIR.
Also see the OPF Labs Wiki for a page on command line trickery. Note that this page is brand new and far from complete, more trickery and comments are very welcome.
The following post is based on my contribution to the Dagstuhl Seminar "Is the Future of Preservation Cloudy?" in November 2012:
A growing number of archival institutions are turning towards POSE (Pay Once, Store Eternally) or Endowment models for funding their long-term digital archiving and preservation activities. The endowment model has a number of seductive advantages. First, it fits in nicely with project-oriented digitisation efforts, as the endowment costs can be included in a project budget and do not have to be added to annual running budgets. Endowment models also allow simple budget calculations based on total storage volume, which in turn support business models based on archival services. As archival institutions face pressure to become self-sustaining, such business models are in great demand.
However, there may be a number of dangerous assumptions behind simple endowment models. There is a pervading view that data centres with endowment models are like pension plans, in which incoming endowments (workers) will pay for old data (retirees). This analogy is clearly false because, unlike unfortunate pensioners, old data never dies. The reply to this is usually, "but the old data is so much smaller than the new data." This is true, but the only way in which an endowment model can handle the ever-increasing volumes of new data is by basing the cost model on a careful analysis of storage costs, such as detailed in Rosenthal et al. Too many endowment models simply assume that per-volume storage costs will continue to decrease (Kryder's law) forever, which simply cannot be the case. Rather, the storage capacity per unit cost, which is presently in an exponential growth phase, will eventually reach a stationary phase and level off, just as every other exponential growth scenario in nature. It is incumbent upon an endowment model to at least attempt to predict when this stationary phase will occur, and at what rate storage capacity will continue to grow.
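The effect of the stationary phase on an endowment can be shown with a toy calculation. The sketch below is not the model of Rosenthal et al.; it simply sums discounted yearly storage costs, with the cost falling by a fixed fraction each year until an optional plateau year. All parameter names and numbers are hypothetical.

```python
def endowment(first_year_cost, annual_decline, interest_rate, years, plateau_year=None):
    """Sum of discounted yearly storage costs over a fixed number of years.

    The yearly cost falls by annual_decline (e.g. 0.3 for 30%) each year.
    If plateau_year is given, the cost stops falling from that year on.
    """
    total = 0.0
    for year in range(years):
        declining_years = year if plateau_year is None else min(year, plateau_year)
        cost = first_year_cost * (1 - annual_decline) ** declining_years
        total += cost / (1 + interest_rate) ** year
    return total


if __name__ == "__main__":
    # Hypothetical figures: 100 units in year one, 30% yearly decline, 2% interest.
    forever = endowment(100, 0.30, 0.02, years=100)
    levelled = endowment(100, 0.30, 0.02, years=100, plateau_year=10)
    print(round(forever, 2), round(levelled, 2))
```

With a perpetual decline the sum converges quickly, so a one-off payment looks affordable. Once the decline stops, every further year adds a cost that only the interest rate discounts, and the required endowment grows accordingly.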
Commercial storage providers such as Google and Amazon are well aware of these difficulties, and offer business models that are highly advantageous to themselves as a result (the decreases in service costs offered over the past five years are still much higher than the real decrease in storage costs). The question is, are libraries, archives, and other data centres equally aware? Endowment models are complex, and probably more expensive than we think. The inevitable conclusion is that we can no longer afford to archive everything.
Last year I blogged about my frustrations related to digital preservation tool registries. Rather than pooling all of our knowledge in one place and creating a valuable community resource, we've spread our knowledge about tools thinly across the web. Instead of seeing collaboration between organisations working in digital preservation, we're actually seeing competition! Virtually every organisation involved in the field promotes its own registry or tool list. This is a ridiculous state of affairs. As I observed at IPRES last year in my least eloquent but most frequently quoted moment, it's a big fail for our community.
Two weeks ago I presented a proposal for the creation of a community owned tool registry to the latest workshop on Aligning National Approaches to Digital Preservation, graciously hosted by the lovely people at IDCC. I'm pleased to say that the proposal was one of four key areas prioritised for further action, and I'm now leading some initial activities to take things forward with backing from ANADP (note that a full report from ANADP on the workshop outcomes will be available here shortly).
However, I'd like to get even broader support for this community proposal from everyone who has their own registry or tool list, whether it's a quick blog post or a full on registry. If that applies to you/your organisation then I'd like you to participate in the following way:
Exactly where the new registry will be hosted and maintained is yet to be decided (quite possibly a "neutral" URL/location. Whatever meets our requirements!). This will require some practical work to establish but is certainly not insurmountable. The key issue is to get buy in from the community. As I note in the proposal, we already have support in principle from the Library of Congress, the Digital Curation Centre and the Open Planets Foundation. This is a great start, but for this to be a success we need a lot more organisations to get involved.
Over the next couple of weeks I'll be putting together an outline and roadmap as an initial talking point for comment and requirements and sharing it via this blog. So this is my call to arms for COPTR: a Community Owned digital Preservation Tool Registry. Who would like to voice their support and commitment, create a valuable tool registry for us all, and kick off some vital community collaboration in the process?
On 6-7 December 2012, the first SCAPE training event was held in the beautiful city of Guimarães, Portugal. The event was supported by the European Capital of Culture 2012, who kindly provided the venue, the Archaeological Museum of the Martins Sarmento Society.
The focus of the training event was identification and characterisation. The event began with an introduction to file formats and to some of the resources and tools that are used for identification, such as PRONOM, FILE, FIDO and Tika, and how they are applied in different scenarios. This was followed by a session on wrapping tools using FITS and a group discussion on the advantages and disadvantages of wrapping tools. The second day began with an introduction to content profiling and planning and looked at the c3po tool. There was then a demonstration of the matchbox tool, which is used to identify duplicate images in digital collections. The afternoon focussed on using tools as part of a workflow with Taverna and Hadoop.
The event was very hands-on; the trainers provided virtual machines and sample data so the attendees could run a number of identification and characterisation experiments using the command line. The practitioners and developers worked in pairs to complete the exercises, which were followed by group discussions.
Twenty-one participants from Portugal, Norway, Germany and the UK attended the event. There was a fairly equal balance of digital preservation practitioners and developers who came from archives, universities, vendors, government, public sector organisations, business and industry.
The participants were asked to complete an event evaluation survey. Nineteen of the 21 participants completed the survey, and the response to the event was very positive.
When asked what they thought were the strong points of the training event, attendees commented:
The resources, presentations and training materials have been uploaded to the SCAPE wiki pages here: http://wiki.opf-labs.org/display/SP/Resources+-+SCAPE+Training+event+-+Guimaraes.
The next SCAPE training event will take place on 16-17 September at The British Library, London. Further details will be announced on the SCAPE website.
Organisational readiness for Open Source
The demand for mature tools and services that support the digital preservation process is strong and growing stronger – and for a good reason.
Sure, a substantial number of research and grant-funded projects have delivered software and tools – and these tools are freely available on SourceForge, in accordance with the requirements of funders. But, what has happened to them? Most have become orphans because they have been abandoned. This is why I sometimes call SourceForge half-jokingly a “software cemetery”.
The tools and open source projects that have “survived” beyond the grant period, have done so thanks to some degree of prolonged investment in maintenance and development by one or two institutions who have “adopted” the software, after the project ended. Typical examples are DROID (National Archives UK), JHOVE (JSTOR/Harvard) and FITS (Harvard).
It is claimed that some of these tools are widely deployed in the community and firmly embedded in institutional digital preservation processes. But the number of software downloads is not a measure of deployment. You will read in most digital preservation survey reports that these same tools are not meeting the needs of the community. At conferences, you will hear complaints about the performance of the tools. Most strikingly, when visiting the sites where these tools are downloadable for free, you will see no signs of an active user community reporting bugs and submitting feature requests. The forums are silent. The open source code is sometimes absent and there are neither community building approaches nor procedures in place for committing code to the open source project.
Why is it so difficult for digital preservation tools and software in the open source to achieve maturity? I contend that this is due to the fact that, in our community, we talk a lot about open source but in fact we don’t “do open source”. Most institutions prefer to act as critical “consumers” of open source software, instead of being constructive “contributors”. Most developers of open source tools are not actively engaged in a shared open source project, they are developing for the needs of the institution that employs them.
In this blog I propose to explain how the community might benefit more from a reciprocal approach to Open Source and the organisational consequences.
Understanding Open Source
At the heart of the “consumer” behaviour lies a very limited understanding of what open source development entails.
I mentioned the funders of software development projects, who have realised that sharing software deliverables as open source increases the chances of re-use. What they fail to understand however is that dumping code on SourceForge is not a guarantee for sustainability.
Within research and cultural heritage institutions – and in particular at middle and higher management levels, open source is quite popular. To a certain extent this is due to bad experiences with vendor lock-in solutions in past years, but it also fits in the spirit of the time. The polarised stance against vendor solutions and for open source solutions is based on quite naïve assumptions. Moreover it is not conducive to constructive collaboration based on trust – which lies at the basis of successful partnerships with vendors and open source communities alike. On a more opportunistic note, open source is also conceived as “software for free” and thus an attractive option to reduce the costs of ICT-solutions. What the managers fail to understand is that consuming open source is not for free, that it can actually be very costly if they do not have the right expertise and skills in house to integrate open source in their ICT-environments.
The ICT staff (system administrators, software developers and their managers) in such organisations are largely selected to manage standard office automation environments and as a result they have little affinity with open source and lack the necessary skill set to carry out open source development. They use a lot of stable and mature open source tools and software on a daily basis, such as Apache web servers, but again, in such cases they act as consumers and are hardly aware of the development model driving the product. As software consumers, they are by default more inclined to use commercial solutions because of the support they get and the SLA-based guarantees.
In academic research units, developers are usually employed on a project basis to build tools that support the short-term needs of research (e.g. a database, a visualisation tool, etc.). These developers are often working on their own and largely dependent on open source software – again, as consumers. Ask your favourite developer what he thinks of open source development and you will be surprised to hear mostly negative reactions. Open source developers are often considered by their colleagues (non open source developers) as small entrepreneurs, who develop bad code deliberately so that they can sell their expertise to consumers who have trouble in making the software work in their environment. A lot of the misunderstanding has to do with fear of loss of control and fear of sharing code with peers.
All these observations lead me to the conclusion that there is widespread misconception of what open source software development entails. As long as this is the case, there will continue to be too little cultural and organisational readiness in the community to really embrace such an approach for digital preservation.
Doing Open Source the right way
We can distinguish between 3 types of open source activities: 1) using available open source software (consumer), 2) developing software and making it available for free in the open source (contributor) and 3) community-wide open source development (open source project).
From the above, it is clear that in the digital preservation community we primarily act as consumers (1) and contributors (2). Both are unilateral forms of activity. The third type of activity, however, is reciprocal and therefore much more effective and interesting. Most successful open source software activities are of this type, to name just a few: Apache, Linux, Firefox, Drupal.
What are the characteristics of community-wide open source development?
First of all, such activity is not a temporary, grant-driven project. Although misleadingly referred to as “open source project” – it is not a project at all but rather a process activity. A project is by definition limited and constrained by scope, funds and deadlines. A process, in terms of quality management, is a continuous cycle of improvement. Open source is a software development process. The software is incrementally improved in cycles (inception, elaboration, construction, transition) which are the Digital Age incarnation of Quality Circles and PDCA (Plan, Do, Check, Act) – see my previous blog.
Digital preservation is not a project either: it has no predefined deliverables or predetermined results. Preservation is a continuous process that tries to respond to the challenges of the day. The underlying software tools need to be developed in close relation with the preservation process and practices. Both, the digital preservation process and the software development process, go hand in hand. Remember, it is all about learning by doing. And it is about responsiveness to change – in software development this translates into short and fast moving iterative cycles.
Secondly, the Open source approach is all about sharing and collaborating. It presumes a shared interest and a shared purpose. It leads to shared benefits and shared rights. An open source community collaborates at all levels: at the strategy, software development, testing and maintenance levels. Most importantly, there is agreement on shared requirements. In the digital preservation community we tend to cultivate our differences and to think in terms of “what are the benefits of open source to my organization?”. Even national libraries, which form a closely knit community, are unable to join hands in defining their process requirements. Each is focused on its own, specific and customized Ingest process. Each chooses and implements different approaches and divergent solutions - ranging from open source to vendor and in-house development solutions. There is little open exchange of lessons learned from which the community as a whole can learn. These are not signs of a strong community that is able to develop and maintain its own robust and sustainable solutions. It is the sharing by many that makes the load bearable – digital preservation is a task that no organization can carry out on its own – not the big ones, not the small ones. For open source and for commercial solutions alike, the same economic principles apply. It is all about scale. The more users choose the same solution, the more economical it becomes. Our goal should be to turn digital preservation tools into commodities.
Thirdly, Open source is based on trust. It is an environment without contracts, SLAs and formal liabilities. All community members are equally responsible and share the successes and the failures. All are peers. There is no vendor-customer relationship. All contribute and consume; the big players and the small ones alike. They share their experts and resources on the basis of reciprocity. Experts and leaders in this environment are selected on the basis of meritocracy and not on the basis of seniority or institutional affiliation. The open source approach has to do with the ability of organizations to learn from and contribute to their peers without expecting to get something back. It is all about unselfishness and trust.
Building constructive and successful partnerships with vendors is also based on trust, not on SLAs. Defining joint requirements as a user group of vendor solutions is also necessary for turning “bespoke” commercial tools into commodities. In many ways, open source and vendor approaches are alike in terms of the pre-conditions necessary to be successful and to achieve economies of scale. There are however, a few fundamental differences between a vendor and an open source approach. One difference has to do with innovation. A commercial solution is constrained by SLAs and revenue models, etc. It will try to avoid bleeding-edge technology and will tend to be driven by more conservative demands – to keep the customer base happy. In contrast, the free and collaborative nature of open source communities can be more conducive to out-of-the-box thinking and responding to technology trends. Another difference has to do with investment of resources. At the end of the day, one buys products and services from a vendor, but one invests expertise and personnel in open source solutions. The benefit of open source is not immediate: it is an investment. More specifically, it is a long-term investment in people, an investment in the organization. This is why the title of this blog is “Organizational readiness for Open Source”.
Are we ready?
One would expect that the open source approach would be a perfect fit for the public sector: no spending of tax-money on expensive commercial solutions, opportunity to deploy own resources in optimal ways, etc. In reality however, the public sector has institutionalized bureaucratic organizational and financial practices that are in every respect conflicting with Open source practices: measures that favor the outsourcing of tasks and expertise, thresholds to public expenses and requirements to tender, limited flexibility to deploy human resources, limited investment possibilities, etc. It is often far more difficult for public sector institutions to be involved in open source activities than it is to buy commercial solutions.
The underlying philosophy of open source originates from the Free Software movement and the concept of the digital commons – which is based on trust, diversity and reciprocity. The digital commons only exist by virtue of self-control, collaboration, intellectual freedom and freedom to act. During the last decade many public institutions have embraced the use of Open Source Software and started to adopt Creative Commons licenses to make cultural heritage freely available in the digital commons. Europeana’s advocacy for CC0-licensing is an illustrative example. Yet, the logic of rights & obligations and the drive to control, regulate, standardize and register is deeply embedded in the organizational DNA of public sector institutions. Will the organizational culture in the public sector be able to adapt to the uncontrolled, trust-based open source software practice?
I have highlighted many aspects and characteristics that demonstrate why the digital preservation community is probably not yet ready to embrace a full open source approach – but more importantly, why it might not even be a desirable or realistic objective. Still, there are compelling arguments for the digital preservation community to start working together as an open source software community: namely, to foster innovation, to break through the resistance of industry and memory organisations to change, and to invest in a shared pool of experts and skilled people. The OPF Hackathons serve as a venue for digital preservation practitioners, (open source) tool developers and vendor participants – a venue where the needs, the possibilities and the constraints are brought together with the goal to arrive at deployable solutions.
Ideally, the digital preservation community will make use of the best of both worlds: the open source solutions to drive innovation and commercial solutions to deploy and commoditise robust services.
We have got in excess of 300 TB of essentially unknown data. At the State and University Library in Denmark we recently passed 300 TB of harvested web resources in our web archive. These web resources have been harvested by crawling the Danish part of the internet since 2005, i.e. from every publicly available URL on the Danish top-level domain “.dk”. This harvesting is done in a couple of different ways. We have scheduled crawls four times a year that do a complete harvest of the whole of .dk. We also harvest selected sites on, for example, an hourly schedule during big national events such as elections or royal weddings.
Due to privacy and copyright concerns these harvested web resources are stored in a so-called “dark archive”, which basically means that these resources are inaccessible to anyone but the most serious researchers. Still, we are obliged to ensure these web resources remain accessible, and in order to do so it is imperative that we know the content of our archive. Such knowledge is generally expressed through format identifiers such as MIME types and PRONOM IDs.
During the harvest we also collect available metadata from the web servers, e.g. the MIME type of the documents. This MIME type is deduced by the web server from simple attributes like document extension and it is therefore considered unreliable. We recently did a very informal check on that assumption. We ran the Apache Tika tool on a few thousand web resources and compared the extracted MIME type with that of the web server. We found that for most of the web resources the two MIME types actually matched. Surprising? Still, knowing what we have got is not limited to identification by MIME type. We also need to acquire information on document version, image types, sizes, bit rates etc. These values are not served by the web server during harvesting.
To acquire such data one has a wide range of tools to select from. The SCAPE report Characterisation technology Release 1 — release report outlines some of these tools and their pros and cons.
Running FITS
As part of the SCAPE project we were asked by the Planning and Watch sub-project to produce input data for an experimental project they worked on, a project which later became C3PO. C3PO is a tool used for repository profiling based on characterisation data from the repository. Planning and Watch asked us to use FITS on a selected part of our web archive to produce such characterisation data. The arguments behind this choice of FITS are detailed in the article To FITS or not to FITS. To comply, we selected representative parts of our web archive for each year we have harvested data.
Our web archive data is stored in ARC files. An ARC file contains an arbitrary number of records, each record consisting of a header followed by the raw web resource.
The representative corpus is described in numbers in the following table:
Year of harvest | Number of ARC files
2005 | 4024
2006 | 20497
2007 | 17139
2008 | 30685
2009 | 23019
2010 | 14090
2011 | 13386
We started this job in November 2011 and as of November 2012, when the analysis described herein was performed, the job had processed more than 100000 ARC files amounting to almost 12TB or just above 400 million web resources.
The platform on which we were able to execute the job consisted of five Blade servers (Intel® Xeon® Processor X5670), each with twelve cores and a total memory of 288GB. The servers are connected to a SAN through 1GB ethernet. To handle the distribution and load balancing for the job we wrote a simple system in Bash. The code for this system is published as SB-Fits-webarchive on Github. Keep in mind that this system was created solely for this specific experimental data gathering task.
One thing we noticed when gathering data for this analysis is that FITS completely lacks any performance metrics. From the FITS data itself we cannot know anything about how long each ARC file took to process etc. Initially, we also did not record the numerous times we had to restart a FITS process either due to a crash or some infinite loop. When running the job, this was never a concern to us as the intention of this experiment solely was to produce FITS metadata as input to the C3PO project.
Fortunately, this does not mean that we are completely without performance metadata. Examining the result files we can deduce some information on the performance.
The Result files
An ARC file contains a certain number of records where each record is a web resource. Such a resource can have a format like HTML, XHTML, XML, PDF, Flash, DOC, mp3, GIF, MPG, EXE, etc. For each record FITS produces an XML file including all the characterisation information extracted by all the modules FITS is configured to run on a given format. After the FITS run, the XML files produced from a single ARC file are packed into a tarball for later processing.
Getting the performance data
Timing from the TGZ files
To get the timing metrics of the FITS jobs we look at the modification time of the individual XML files within the tarball of FITS results corresponding to a given ARC file. This has some implications, e.g. we are not able to get timing information for the first XML file or for ARC files containing a single record. Still, given the amount of data created by the FITS experiment, this should not give rise to any significant problem.
The source code for this timing extraction tool can be found at the fits-analysis Github page.
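The timing idea described above can be sketched in a few lines of Python. This is not the code from the fits-analysis repository, only a minimal illustration under some assumptions: the function name `tarball_processing_seconds` is made up for this example, and the FITS result files are assumed to carry an `.xml` extension inside the tarball.

```python
import tarfile


def tarball_processing_seconds(tar_path):
    """Estimate FITS processing time for one ARC file from its result tarball.

    Returns (record_count, seconds). The time is the difference between the
    newest and the oldest modification time of the XML members, so it is
    None when fewer than two XML files are present.
    """
    with tarfile.open(tar_path, "r:*") as tar:  # "r:*" auto-detects compression
        mtimes = [member.mtime
                  for member in tar.getmembers()
                  if member.isfile() and member.name.endswith(".xml")]
    if len(mtimes) < 2:
        return len(mtimes), None
    return len(mtimes), max(mtimes) - min(mtimes)
```

Note that this measures the span between the first and the last result file, so the time spent on the very first record is not included, as mentioned above.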
ARC file sizes
An ARC file consists of any number of records. The harvesting system is configured to produce ARC files with a size of 100MB, but, as will be seen in the following, not all files adhere to this limitation. Therefore we would like to use the precise size in the analysis and need to obtain these sizes from the original ARC files.
As the cluster we used for the experiment does not have local storage we had to apply a rather cumbersome process of transferring the ARC files from the production storage system to a work area. Both storage systems are located on a SAN. After copying an ARC file to the work area it was unpacked for FITS to run on each unpacked record. Lastly, the ARC file and all its records were deleted before fetching the next ARC file.
To obtain the ARC file sizes for this analysis we used the original configuration file listing the data corpus and then had a Bash job use 'ls' over ssh to the production storage for each ARC file that had been characterised. Fortunately, this only took a few days to complete.
Analysis
After the above data extractions we have the following data set.
Parameter name | Description
ARC file name | this name could be used for further data acquisition
count | the number of records
length | the size of the FITS results file (not used)
time | the time in seconds it took to process the ARC file **
size | the size of the ARC file in MB
** As mentioned above, the processing time is calculated as the difference in modification time between the oldest and the youngest XML file in the tarball.
Every ARC file contains an arbitrary number of records that each has an arbitrary size and type. Therefore we do not expect any correlation between the ARC file size and the number of records in a given ARC file nor the time it took to process each file. Still, scatter plots might show some interesting artefacts.

Looking closely at the region below the 100MB limit gives us a fine linear correlation but one might wonder why we see such a big difference below and above the 100MB peak.



Another way of looking at the data would be simple histograms like the following three charts





In all of the data visualisations above, it is evident that we are dealing with a lot of outliers and very long tails. As this is just a preliminary examination of the performance of FITS, we will choose to ignore some of these features.
The hard peak at around 100MB in ARC file size arises from a configuration of the harvest process. 87.3978% of the files lie in the interval from 99MB to 120MB.
The hard peak in the processing time histogram probably stems from problems in the FITS program. Furthermore, it might be affected by our load balancing system. If we choose to ignore this long tail, i.e. only look at data samples with a processing time below 8000 seconds, we reduce the corpus to 84.1339% of the original corpus.
If we reduce the corpus by both rules above, we get a corpus 72.3009% of the original size, which for this purpose is enough.
A deeper investigation of the long tail might reveal some of the bottlenecks of FITS combined with our load balancing system. That investigation will not be accounted for here.
We can now calculate the processing speed which has the following distribution

The processing speed is distributed from 0.75 MB/minute to 6000 MB/minute, but the tail already begins below 10 MB/minute. Again, closer examination of this very long tail, which could be done by e.g. including MIME types, might reveal some of the problems with FITS.
Another apparent feature in this visualisation is the two distinct maxima. What features in ARC files can give rise to such a phenomenon?
A few statistical numbers from this speed data sample:
Parameter | Value
1st quartile | 1.448 MB/minute
median | 2.738 MB/minute
mean | 3.951 MB/minute
3rd quartile | 4.024 MB/minute
So, using the above corpus reductions we get a median for the speed at 2.7 MB/minute, but it is important to note that the corpus exhibits very long tails and outliers in all parameters. One should therefore be very careful before drawing any generalised conclusions.
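For readers who want to repeat this kind of summary on their own data, here is a minimal Python sketch of the two reduction rules and the speed calculation. The function name `speed_summary` and the input layout (pairs of ARC file size in MB and processing time in seconds) are assumptions made for this example. The quartile convention used by `statistics.quantiles` may also differ slightly from the tool used for the numbers above.

```python
import statistics


def speed_summary(samples, min_mb=99, max_mb=120, max_seconds=8000):
    """Summarise processing speed in MB/minute.

    samples: iterable of (size_mb, seconds) pairs, one per ARC file.
    Only files between min_mb and max_mb that took less than max_seconds
    are kept, mirroring the two reduction rules described in the text.
    At least two samples must survive the filtering.
    """
    speeds = [size_mb / (seconds / 60.0)
              for size_mb, seconds in samples
              if min_mb <= size_mb <= max_mb and 0 < seconds < max_seconds]
    q1, median, q3 = statistics.quantiles(speeds, n=4)
    return {
        "kept": len(speeds),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(speeds),
        "q3": q3,
    }
```

A call such as `speed_summary([(100, 1200), (100, 2400), (100, 3000), (100, 6000)])` returns the quartiles and mean for four hypothetical ARC files.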
Wall time
As I stated in the introduction, we have been running this FITS job for a year to gather this data. If we take the 11.64TB of ARC files and divide that amount by a year we get a processing speed of 30 MB/minute. The median speed calculated in the previous section is for a single process. This would seem to indicate that we have got a roughly tenfold speed increase by running on a cluster of up to four processes on five 12-core servers, which sounds reasonable considering the instability of FITS.
Conclusion and what's next
Before concluding anything about this experiment and analysis, a comment on the FITS tool versus our corpus is due. We have employed FITS on perhaps the most difficult corpus there is: 12TB of random, unknown and very heterogeneous data spanning most known digital formats, or at least those known in Denmark. The critique of FITS is only related to how it performs on such a corpus. FITS does have lots of qualities; most importantly it aggregates characterisation data from a whole range of tools into a common format. That being said, if we were to use the presently available FITS tool to characterise our web archive, it would take us 300 TB divided by 10 TB per year per five servers, i.e. roughly 30 years. On the present cluster the job would not finish until after we had all retired.
We have stated in the first SCAPE evaluation that we want to be able to characterise a harvest of the complete Danish Internet in weeks, preferably less than three weeks. Such a harvest amounts to around 25TB and would thus take us two and a half years with the present set-up. In other words, we would need to acquire a cluster the size of 40 times what we have used for this experimental FITS job. That is probably not going to happen! So if we want to do this kind of characterisation we need to improve on the software and the general platform while, of course, still looking into up-sizing our hardware within the economically possible.
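The back-of-the-envelope numbers above can be reproduced with a small Python sketch. The helper names are invented for this example, the throughput is rounded to 10 TB per year as in the text, and the calculation assumes that throughput scales linearly with cluster size, which a real system would not necessarily do.

```python
WEEKS_PER_YEAR = 52


def years_to_process(data_tb, throughput_tb_per_year):
    """Years needed to characterise data_tb at the observed throughput."""
    return data_tb / throughput_tb_per_year


def cluster_scale_factor(data_tb, throughput_tb_per_year, target_weeks):
    """How many times larger the cluster must be to finish in target_weeks.

    Assumes (optimistically) that throughput scales linearly with hardware.
    """
    needed_weeks = years_to_process(data_tb, throughput_tb_per_year) * WEEKS_PER_YEAR
    return needed_weeks / target_weeks


# Whole archive (300 TB) and one full harvest (25 TB) at roughly 10 TB per year.
print(years_to_process(300, 10))
print(years_to_process(25, 10))
print(round(cluster_scale_factor(25, 10, target_weeks=3), 1))
```

With these inputs a full harvest needs a cluster somewhat more than 40 times the present one to meet the three week target, in line with the estimate above.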
In short, the FITS tool as it is now is not fast enough for real world web archives. On the other hand we think that the data presented here might be of help in optimising both the FITS tool and how we use it. As we know the MIME type of each and every data point, a deeper investigation of the outliers and the long tails might reveal the bottlenecks of this setup and maybe even where to avoid using FITS or specific FITS modules. During the job we have also been gathering lists of files which caused FITS to crash. In other words, a lot of data that could be used for bug reporting and fixing. So this blog post is not only presenting the data from an experiment; just as importantly, we want to share code, data, and ideas for digital preservation.
The job is still running and it is now characterising web resources for 2012 so the story will be continued…
Preservation Topics: Identification, Characterisation, SCAPE
The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-2 and PDF/A-1, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated on-line discussions. But what do we actually mean by embedded files? As it turns out, the answer to this question isn't as straightforward as you might think. One of the reasons for this is that in colloquial use we often talk about "embedded files" to describe the inclusion of any "non-text" element in a PDF (e.g. an image, a video or a file attachment). On the other hand, the term "embedded files" in the PDF standards (including PDF/A) refers to something much more specific, which is closely tied to PDF's internal structure.
Embedded files and embedded file streams
When the PDF standard mentions "embedded files", what it really refers to is a specific data structure. PDF has a File Specification Dictionary object, which in its simplest form is a table that contains a reference to some external file. PDF 1.3 extended this, making it possible to embed the contents of referenced files directly within the body of the PDF using Embedded File Streams. They are described in detail in Section 7.11.4 of the PDF Specification (ISO 32000). A File Specification Dictionary that refers to an embedded file can be identified by the presence of an EF entry.
Here's an example (source: ISO 32000). First, here's a file specification dictionary:
31 0 obj
<</Type /Filespec /F (mysvg.svg) /EF <</F 32 0 R>> >>
endobj
Note the EF entry, which references another PDF object. This is the actual embedded file stream. Here it is:
32 0 obj
<</Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72>>
stream
…SVG Data…
endstream
endobj
Note that the part between the stream and endstream keywords holds the actual file data, here an SVG image, but this could really be anything!
So, in short, when the PDF standard mentions "embedded files", this really means Embedded File Streams.
So what about "embedded" images?
Here's the first source of confusion: if a PDF contains images, we often colloquially call these "embedded". However, internally they are not represented as Embedded File Streams, but as so-called Image XObjects. (In fact the PDF standard also includes yet another structure called inline images, but let's forget about those just to avoid making things even more complicated.)
Here's an example of an Image XObject (again taken from ISO 32000):
10 0 obj
<< /Type /XObject /Subtype /Image /Width 100 /Height 200 /ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 /Filter /DCTDecode >>
stream
…Image data…
endstream
endobj
Similar to embedded file streams, the part between the stream and endstream keywords holds the actual image data. The difference is that only a limited set of pre-defined formats is allowed. These are defined by the Filter entry (see Section 7.4 in ISO 32000). In the example above, the value of Filter is DCTDecode, which means we are dealing with JPEG encoded image data.
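To make the role of the Filter entry more concrete, here is a small Python sketch that pulls the filter name(s) out of the text of a stream dictionary and maps them to a plain-language description. The function name `image_filters` and the lookup table are just illustrations (the filter names themselves come from ISO 32000), and the regular expression is deliberately naive: it is not a PDF parser and does not handle indirect references or nested dictionaries.

```python
import re

# Standard stream filters from ISO 32000 and the kind of data they indicate.
FILTER_MEANING = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax (Group 3/4)",
    "JBIG2Decode": "JBIG2",
    "FlateDecode": "zlib/deflate compressed data",
    "LZWDecode": "LZW compressed data",
    "RunLengthDecode": "run-length encoded data",
}


def image_filters(dictionary_text):
    """Return the filter names found in the text of a stream dictionary.

    Handles both a single name (/Filter /DCTDecode) and an array of names
    (/Filter [/ASCII85Decode /FlateDecode]).
    """
    match = re.search(r"/Filter\s*(\[[^\]]*\]|/\w+)", dictionary_text)
    if not match:
        return []
    return re.findall(r"/(\w+)", match.group(1))


example = ("<< /Type /XObject /Subtype /Image /Width 100 /Height 200 "
           "/ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 "
           "/Filter /DCTDecode >>")
for name in image_filters(example):
    print(name, "->", FILTER_MEANING.get(name, "unknown filter"))
```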
Embedded file streams and file attachments
Going back to embedded file streams, you may now start wondering what they are used for. According to Section 7.11.4.1 of ISO 32000, they are primarily intended as a mechanism to ensure that external references in a PDF (i.e. references to other files) remain valid. It also states:
The embedded files are included purely for convenience and need not be directly processed by any conforming reader.
This suggests that the usage of embedded file streams is simply restricted to file attachments (through a File Attachment Annotation or an EmbeddedFiles entry in the document’s name dictionary).
Here's a sample file (created in Adobe Acrobat 9) that illustrates this:
http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf
Looking at the underlying code we can see the File Specification Dictionary:
37 0 obj
<</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>
endobj
Note the /EF entry, which means the referenced file is embedded (the actual file data are in a separate stream object).
Further digging also reveals an EmbeddedFiles entry:
33 0 obj
<</EmbeddedFiles 34 0 R/JavaScript 35 0 R>>
endobj
However, careful inspection of ISO 32000 reveals that embedded file streams can also be used for multimedia! We'll have a look at that in the next section...
Embedded file streams and multimedia
Section 13.2.1 (Multimedia) of the PDF Specification (ISO 32000) describes how multimedia content is represented in PDF (emphases added by me):
Rendition actions (...) shall be used to begin the playing of multimedia content.
A rendition action associates a screen annotation (...) with a rendition (...)
The actual data for a media object are defined by Media Clip Objects, and more specifically by the media clip data dictionary. Its description (Section 13.2.4.2) contains a note, saying that this dictionary "may reference a URL to a streaming video presentation or a movie embedded in the PDF file". The description of the media clip data dictionary (Table 274) also states that the actual media data are "either a full file specification or a form XObject".
In plain English, this means that multimedia content in PDF (e.g. movies that are meant to be rendered by the viewer) may be represented internally as an embedded file stream.
The following sample file illustrates this:
http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf
This PDF 1.7 file was created in Acrobat 9, and if you open it you will see a short Quicktime movie that plays upon clicking on it.
Digging through the underlying PDF code reveals a Screen Annotation, a Rendition Action and a Media clip data dictionary. The latter looks like this:
41 0 obj
<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
endobj
It contains a reference to another object (42 0), which turns out to be a File Specification Dictionary:
42 0 obj
<</EF<</F 43 0 R>>/F(<embedded file>)/Type/Filespec/UF(<embedded file>)>>
endobj
What's particularly interesting here is the /EF entry, which means we're dealing with an embedded file stream here. (The actual movie data are in a stream object (43 0) that is referenced by the file specification dictionary.)
So, the analysis of this sample file confirms that embedded file streams are actually used by Adobe Acrobat for multimedia content.
What does PDF/A say on embedded file streams?
In PDF/A-1, embedded file streams are not allowed at all:
A file specification dictionary (...) shall not contain the EF key. A file's name dictionary shall not contain the EmbeddedFiles key
In PDF/A-2, embedded file streams are allowed, but only if the embedded file itself is PDF/A (1 or 2) as well:
A file specification dictionary, as defined in ISO 32000-1:2008, 7.11.3, may contain the EF key, provided that the embedded file is compliant with either ISO 19005-1 or this part of ISO 19005.
Finally, in PDF/A-3 this last limitation was dropped, which means that any file may be embedded (source: this unofficial newsletter item, as at this moment I don't have access to the full specification of PDF/A-3).
Does this mean PDF/A-3 supports multimedia?
No, not at all! Even though nothing stops you from embedding multimedia content (e.g. a Quicktime movie), you wouldn't be able to use it as a renderable object inside a PDF/A-3 document. The reason is that the annotations and actions that are needed for this (e.g. Screen annotations and Rendition actions, to name but a few) are not allowed in PDF/A-3. So effectively you are only able to use embedded file streams as attachments.
Adobe adding to the confusion
A few weeks ago the embedding issue came up again in a blog post by Gary McGath. One of the comments there is from Adobe's Leonard Rosenthol (who is also the Project Leader for PDF/A). After correctly pointing out some mistakes in both the original blog post and in an earlier comment by me, he nevertheless added to the confusion by stating that objects that are rendered by the viewer (movies, etc.) all use Annotations, and that embedded files (which he apparently uses as a synonym for attachments) are handled in a completely different manner. This doesn't appear to be completely accurate: at least one class of renderable objects (screen annotations/rendition actions) may be using embedded file streams. Also, embedded files that are used as attachments may be associated with a File Attachment Annotation, which means that "under the hood" both cases are actually more similar than first meets the eye (which is confirmed by the analysis of the 2 sample files in the preceding sections). Contributing to this confusion is also the fact that Section 7.11.4 of ISO 32000 erroneously states that embedded file streams are only used for non-renderable objects like file attachments, which is contradicted by their allowed use for multimedia content.
Does any of this matter, really?
Some might argue that the above discussion is nothing but semantic nitpicking. However, details like these do matter if we want to do a proper assessment of preservation risks in PDF documents. As an example, in this previous blog post I demonstrated how a PDF/A validator tool can be used to profile PDFs for "risky" features. Such tools typically give you a list of features. It is then largely up to the user to further interpret this information.
Now suppose we have a pre-ingest workflow that is meant to accept PDFs with multimedia content, while at the same time rejecting file attachments. By only using the presence of an embedded file stream (reported by both Apache's and Acrobat's Preflight tools) as a rejection criterion, we could end up unjustly rejecting files with multimedia content as well. To avoid this, we also need to take into account what the embedded file stream is used for, and for this we need to look at what annotation types are used, and the presence of any EmbeddedFiles entry in the document’s name dictionary. However, if we don't know precisely which features we are looking for, we may well arrive at the wrong conclusions!
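As a rough illustration of the point above, the following Python sketch looks not only for an EF entry but also for the features that indicate what the embedded file stream is used for. It is a deliberately naive byte scan, not a PDF parser: it assumes the relevant dictionaries are stored uncompressed (it will miss anything inside compressed object streams), the function name `embedded_file_usage` is made up for this example, and it uses the media clip data dictionary (/S /MCD, as in the sample above) as its only hint for multimedia.

```python
import re


def embedded_file_usage(pdf_bytes):
    """Naively report how embedded file streams appear to be used in a PDF.

    Scans the raw bytes for:
      - an EF entry in a file specification dictionary,
      - attachment hints: a File Attachment Annotation or an
        EmbeddedFiles entry (name dictionary),
      - a multimedia hint: a media clip data dictionary (/S /MCD).
    """
    def found(pattern):
        return re.search(pattern, pdf_bytes) is not None

    return {
        "embedded_file_stream": found(rb"/EF\s*<<"),
        "attachment": (found(rb"/Subtype\s*/FileAttachment\b")
                       or found(rb"/EmbeddedFiles\b")),
        "multimedia": found(rb"/S\s*/MCD\b"),
    }
```

A pre-ingest rule could then reject a file only when `attachment` is true, rather than whenever `embedded_file_stream` is true. A real workflow would of course rely on a proper PDF parser or validator output instead of pattern matching.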
This is made all the worse by the fact that preservation issues are often formulated in vague and non-specific ways. An example is this issue on the OPF Wiki on the detection of "embedded objects". The issue's description suggests that images and tables are the main concern (both of which aren't strictly speaking embedded objects). The corresponding solution page subsequently complicates things further by also throwing file attachments in the mix. In order to solve issues like these, it is helpful to know that images are (mostly) represented as Image XObjects in PDF. The solution should then be a method for detecting Image XObjects. However, without some background knowledge of PDF's internal data structure, solving issues like these becomes a daunting, if not impossible task.
Final note
In this blog post I have tried to shed some light on a number of common misconceptions about embedded content in PDF. I might have inadvertently created some new ones in the process, so feel free to contribute any corrections or additions using the comment fields below.
The PDF specification is vast and complex, and I have only addressed a limited number of its features here. For instance, one might argue that a discussion of embedding-related features should also include fonts, metadata, ICC profiles, and so on. The coverage of multimedia features here is also incomplete, as I didn't include Movie Annotations or Sound Annotations (which preceded the Screen Annotations, which are now more commonly used). These things were all left out here because of time and space constraints. This also means that further surprises may well be lurking ahead!
Johan van der Knijff
KB / National Library of the Netherlands
The emergence of a wide range of cloud services, with end users interacting with them remotely through standardized (web) client applications on their various devices, together with remote access to emulation, offers the chance to combine both into an integrated access system for various obsolete computer environments. In order to provide a wide range of services, especially in combination with authentic performance and user experience, a distributed system model and architecture is helpful. It can be run as a cloud service allowing for the specialization both of memory institutions and third party service providers. Such offerings could help to shift the usually non-trivial task of emulating obsolete software environments from the end user to specialized providers. Ideally, instead of installing a huge number of software packages which are difficult to maintain even for a small number of relevant platforms, the user should be able to install a simple access application made available for a wide range of today's and future platforms.
Towards Emulation-as-a-Service
When trying to use several different emulators or the KEEP emulation framework, it becomes clear that it is challenging to provide various hardware emulators and framework services on diverse computer architectures for several, mostly technical, reasons. Architectural and technical differences between powerful desktop machines, netbooks, laptops, thin clients, tablets or even next generation TV screens are significant and thus difficult to bridge with a one-size-fits-all solution. End user devices in particular are a fast moving target as they are replaced quite regularly. Additionally, secondary software components (like operating systems, ROMs, firmware, drivers, etc.) are often required on the user's side. A similar problem arises from access-restricted digital artefacts. If a user is interested in such an object, a memory institution may not be able to hand over the complete object, e.g. due to legal reasons (IPR, copyrights, regulations, ...) or privacy issues.
Emulation-as-a-Service (EaaS, a term originally coined by J. van der Hoeven) can provide a convenient solution for both: reducing the variance of host systems for original environments and solving IPR compliance. Development and maintenance of emulators and their corresponding digital preservation frameworks can be focused on only a few current architectures, leading to a controlled and well understood environment. This avoids a number of the complexities of cross-platform development and allows easier testing, as fewer targets with less variety have to be considered. Original environments and their various software components, as well as the requested artefacts, never leave the controlled systems of service providers. EaaS can offer users of memory institutions the various contents from their digital collections like digital art, encyclopedias, primary scientific data, teaching material or famous persons' original working environments without giving them direct access to the primary data itself. The access to certain digital artefacts or complete environments can be controlled in a more effective fashion, preventing the user from copying material or analysing it in an undesired way.
A remote access application and protocol (like Guacamole or OnLive) need to be defined to abstract from, and translate, the actual capabilities of the chosen local platform to the interfaces of the remotely running service. This is not a particular challenge, as the same basic principles apply as for accessing recent environments over computer networks, for example on the different virtual machine platforms. Emulated original environments could then "blend" in seamlessly with current services. Such considerations could provide a solution which allows access to various 1985 home computer games running in MESS, early 1990s artwork running on some Motorola 68k powered Apple Macintosh, various Windows or Linux desktops and some early electronic arcade games, all through the same application representing a front-end interface to emulation service back ends. The use of standardized and well-established remote access applications and protocols to link front and back ends provides platform independence. It also makes it possible to adapt to the different input/output methods used by the end user devices and required by the original environments.
The separation of the service from the user interface allows a distributed environment. Services like emulation components, software archives and authentication services can be shared and split among several institutions and third party providers to enable specialization following the division of labour principle. While EaaS does not primarily require large computing power or storage capacity, the model benefits considerably from cloud technology in terms of remote access, distribution of services and established authentication and accounting frameworks. The EaaS model can help to leverage typical cloud advantages for better user-centered access services in digital preservation: scalable, on-demand services, less waste of compute resources, optimization of costs and solving IPR related challenges. Resources can be scaled to the actual needs of the organization and thus provide great flexibility, e.g. pay-per-use business models usually target short-term contracts.
Emulation in EaaS
Of course emulators are at the core of EaaS as they bridge the outdated computer platforms to current software and hardware. With EaaS their operation and maintenance can be simplified, as the provisioning of emulators takes place in a well-controlled environment which is easier to define and to maintain compared to end user systems. It focuses the available resources and allows the specialization and division of labour among the involved institutions and service providers. The number of emulators required depends on the artefacts or original environments to be accessed or to be rendered. In EaaS they provide both the base layer for the original system environment and the user interfaces. Special functionality might be required for (large scale) framework integration and automation, like the proper configuration of the emulators and workflows to prepare and transport the original artefacts to be used in and extracted from the emulated environments. While a cloud approach to emulation simplifies a number of issues, several challenges of the emulation strategy still need to be solved. For example, appropriate hardware emulators need to be available to ensure that original environments are compatible with the host system of the cloud service. Additional components are required, like original software applications, operating systems, firmware and drivers, including the appropriate rights to use them within EaaS. As the original environments are meant to be available to users on their current devices, access components to emulators also need to translate machine input and output concepts (as discussed in some previous post). Not only have screen resolution and color depth increased over time, but keyboard layouts have also changed, and other types of input like a wide range of mice, joysticks or, more recently, position sensors have been added with the rather new class of mobile devices like tablets and smart phones.
Acquiring new Audiences through EaaSEmulation-as-a-Service offers several advantages. It allows new stakeholders to enter the market as services can be offered to a wide range of different customers remotely. Memory institutions can use their knowledge and advantage in the field of digital preservation and access to provide paid services to commercial entities requiring authentic reproduction of digital objects and processes for e.g. legal reasons.
In addition to merely providing access to deprecated computer systems, new types of services might be established. Running systems can be frozen and resumed by different users, or offered to be run from a certain execution state. Furthermore, several users can access the same system in parallel, e.g. for performance measurements or for scientific, guidance or teaching purposes.
The bwFLA project for functional long-term archival and access has started implementing and integrating EaaS as part of a state-wide initiative. Currently, bwFLA EaaS supports 8 different emulators, which can run 15 distinct legacy computer platforms. The platforms range from MacOS 7 running on a Motorola 68k (m68k) system emulator and PPC-based platforms to various x86-based platforms. Each emulation component is available to be used in various archival workflows through a common web service interface. Remote interaction is provided via VNC at the moment; it will be replaced by a more general and versatile HTML5 component. The project implemented a first prototype of an authentication and authorization component using the existing state-wide Shibboleth infrastructure deployed by universities and libraries. Some of the bwFLA EaaS components will be made available through OPF for testing in the near future.
Learning by doing in digital preservation
Libraries, archives and museums have been extremely successful in preserving centuries old paper-based, cultural and scientific heritage. How well are they doing with the growing and rapidly ageing digital-based heritage?
This question has been haunting us (the digital preservation community) for a while now, even though the digital era has only just begun. We are still unsure about so many things: Are we keeping the right information? Should we be more selective? What is the right preservation strategy: safeguarding the original containers and carriers, transferring the data to long-lasting media, emulating the hardware before it becomes obsolete? Which metadata should we record? Et cetera.
At iPRES 2012, keynote speaker Steve Knight set the tone by observing that “we are still asking the same questions as 10 years ago” and not making much progress. Paul Wheatley pointed to the duplication of effort in research projects and tool-building, and called it “a big fail”.
The conference proceedings do not reflect this discussion (they are a compilation of the papers that were accepted by the Scientific Program Committee), but you will find blogs and tweets that have captured the mood and voices of the participants. The concerns in the community are very real and deserve attention. In a series of blog posts, I will attempt to address these concerns and to foster the informal conversation about the way forward.
Benchmarking
The concerns voiced at iPRES can be listed as follows: the gap between research and practice is too large; we need to move away from short-term project funding and move towards long-term investments; we start lots of initiatives and most of them do the same: there is too much duplication of effort for such a niche area and there is a lot of waste; we need to align ourselves and work together to achieve enough scale and to make the work more cost-effective. How do we know we are heading in the right direction? How can we measure progress? What are our benchmarks? How well do I perform in comparison with other digital archives and repositories? Et cetera. The many methods and tools developed over the past 10 years, for the audit, assessment and certification of “trusted digital repositories” are evidence of such concerns. Just tally the occurrence of the words “risk”, “standard” and “certification” in recent conference proceedings on digital preservation: you will be overwhelmed! And the sheer number of surveys carried out to determine the state of preservation practices is astonishing. Everyone is talking about benchmarking and how to become a trustworthy repository, but benchmarking is neither a goal in itself nor a research question.
Let us take a step back and try to understand better what it is we are trying to do.
How did we do it in the paper era?
For centuries, we have assured the preservation of books, journals, newspapers, music sheets, maps and many more paper-based containers of information. To this day, we are able to provide access to most of these materials and the information therein is still mostly human-readable. This is a Herculean achievement that has been possible only thanks to a continuous and dedicated process of learning and improvement over centuries. This was neither a scientific process nor a standard-setting process. Organizations that have proven to be trusted keepers of the paper-based heritage have done so on the basis of grass-roots practices that have matured over hundreds of years. Today, these good practices are woven into the fabric of the memory institutions. The setting of standards did not play a part in this evolutionary development. Preservation standards and regulations appeared only very recently and, in most countries, they have not (yet) been enforced. In the Netherlands, for example, the regulation of storage conditions in public archives was set as recently as 2002, but before that, most public archives already adhered to the requirements. Research into the degradation and embrittlement of paper only started in the 1930s. It has made significant progress in the past decades and is still ongoing, but it is a background process in library preservation programs.
What are we doing differently now?
In digital preservation, most effort has been focused on research, modeling, risk assessment and standardization. This seems to indicate that we are proceeding in a different order: research is leading and applied to the design and engineering of processes and systems. Research informs the standard-setting process, the results of which are then put into practice on the ground. The way in which the OAIS model has evolved from a reference framework (2002) into a recommended practice (2012) that underpins most audit and certification approaches to digital preservation illustrates this very well. In contrast to the bottom-up development of good practices in the paper era, we are now trying to standardize “best practices” that have been developed by research, in a top-down fashion, very much along the principles of scientific management developed by Frederick Winslow Taylor (1856–1915). In this order of things, there is very little room for feedback from practitioners on the ground and for learning by doing.
Learning by doing and the importance of failure
In quality management circles, it is widely accepted that the top-down approach does not work: not in the long run and not for complex and ICT-dense systems. Henry Mintzberg (born 1939), who was critical of Taylor’s method, argued that effective managing requires some balanced combination of art (visioning), craft (venturing) and science (planning). This balance can only be achieved after years of experience and learning on the job.

Mintzberg’s managerial style triangle
Research cannot solve all the problems in advance. It was Joseph Moses Juran (1904–2008) who championed the importance of the learning process and who added the human dimension to quality management. Practitioners are part of the learning process: they have the skill sets and the work experience that can contribute to increased knowledge and improved workflows. Failure is also part of the learning process. Organizations should deal positively with failure because it leads to improvement. Workers’ participation in the continuous improvement of work processes was taken forward by Masaaki Imai (born 1930) in his concept of “Kaizen”. William Edwards Deming (1900–1993) finally helped to popularize the concept of quality cycles, which is most commonly known as PDCA (Plan, Do, Check, Act). The notion that continuous improvement moves in repetitive cycles (also called iterations) was introduced some 20 years ago in the software development industry, with the RUP process, Extreme Programming and various agile software development frameworks.
It is clear that digital preservation-as-a-process, evolving in an ICT-dense context, would benefit greatly from adopting the quality management approach of continuous improvement. In this approach the practitioners are driving the learning process and research is facilitating. OPF’s philosophy is based on this approach. OPF Hackathons bring together practitioners and researchers and aim to move the practice of digital preservation forward through “learning by doing together”.