<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xml:base="http://openplanetsfoundation.org"  xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
 <title>Open Planets Foundation</title>
 <link>http://openplanetsfoundation.org</link>
 <description>The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium.</description>
 <language>en</language>
<item>
 <title>bwFLA EaaS: Releasing Digital Art into the Wild</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-20-bwfla-eaas-releasing-digital-art-wild</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p class=&quot;p1&quot;&gt;With bwFLA Emulation-as-a-Service you can enable users to view your (interactive) objects without actually handing the environment and object over to the user. This is a nice feature, especially for digital art and similar material: you can give an almost unlimited number of people the ability to view, use and interact with a piece of digital art without being able to copy it. The owner remains in control of the object and is able to restrict access at any time.&lt;/p&gt;&lt;p class=&quot;p1&quot;&gt;Jon Thomson and Alison Craighead (&lt;a class=&quot;moz-txt-link-abbreviated&quot; href=&quot;http://www.thomson-craighead.net&quot;&gt;www.thomson-craighead.net&lt;/a&gt;) have nicely prepared and integrated two of their art pieces for public access using the bwFLA EaaS infrastructure. Please take a look at:&lt;/p&gt;&lt;p class=&quot;p3&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;a href=&quot;http://www.thomson-craighead.net/docs/thal.html&quot;&gt;http://www.thomson-craighead.net/docs/thal.html&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class=&quot;p5&quot;&gt;and &lt;/p&gt;&lt;p class=&quot;p3&quot;&gt;&lt;span class=&quot;s1&quot;&gt;&lt;a href=&quot;http://www.triggerhappy.org/&quot;&gt;http://www.triggerhappy.org&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;&lt;p class=&quot;p5&quot;&gt;Please be patient; it may take a minute or so to load. 
&lt;/p&gt;&lt;p class=&quot;p5&quot;&gt;For more information about bwFLA and Emulation as a Service please take a look at our website: &lt;a href=&quot;http://bw-fla.uni-freiburg.de/&quot;&gt;&lt;span class=&quot;s2&quot;&gt;http://bw-fla.uni-freiburg.de/&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/emulation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Emulation&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 20 Dec 2013 09:29:25 +0000</pubDate>
 <dc:creator>klaus</dc:creator>
 <guid isPermaLink="false">1070 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-20-bwfla-eaas-releasing-digital-art-wild#comments</comments>
</item>
<item>
 <title>The SCAPE Developers Workshop in Brno</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-18-scape-developers-workshop-brno</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;&lt;img alt=&quot;First day of the meeting: lots of presentations&quot; src=&quot;http://www.openplanetsfoundation.org/system/files/developers%20meeting%20BRNO.jpg&quot; style=&quot;width: 350px; height: 263px; float: left;&quot; /&gt;&lt;/p&gt;&lt;p&gt;From Tuesday 19th November until Thursday 21st November the internal SCAPE Developers Workshop was held at the Brno University of Technology.&lt;/p&gt;&lt;p&gt;The workshop had several aims: first, to get everyone, in particular new partners, up to speed and aligned with project work and developments; second, to get a clear understanding of how the new partners&#039; work and existing project work will integrate and what needs to be done to make this happen; furthermore, to identify, understand and work on issues surrounding PT/PC/PW integration; and, last but not least, to productise existing SCAPE tools.&lt;/p&gt;&lt;p&gt;It is always good to meet so many SCAPErs and to work together face to face. It was really inspiring to hear all the demonstrations and presentations and to get a good overview of all the SCAPE activities. The overall feeling was very positive, there were lots of discussions and everyone worked very hard on the needed next steps in this last year of SCAPE.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://www.openplanetsfoundation.org/system/files/brno%20meeting%202.jpg&quot; style=&quot;width: 350px; height: 263px;&quot; /&gt;&lt;/p&gt;&lt;p&gt;I was there as a representative of the Take Up sub project. 
My main focus was the productisation of the SCAPE tools and the need for more general information about the tools and SCAPE overall. It was good to see that this need for productisation was recognised, and we immediately started working on adjusting the Readme files. There will be more follow-up actions to get all the general information in place, but the workshop was a good start in reaching a very important group of SCAPErs: the developers.&lt;/p&gt;&lt;p&gt;This was just one of the many, many things that people were working on.&lt;/p&gt;&lt;p&gt;Overall, it was a good workshop and our new partner Brno University made sure that we felt very welcome: the hosting was excellent in this former cloister.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://www.openplanetsfoundation.org/system/files/BRNO%20cloister.jpg&quot; style=&quot;width: 350px; height: 467px;&quot; /&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Wed, 18 Dec 2013 10:01:41 +0000</pubDate>
 <dc:creator>MelanieImming</dc:creator>
 <guid isPermaLink="false">1069 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-18-scape-developers-workshop-brno#comments</comments>
</item>
<item>
 <title>Scout - a preservation watch system</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-16-scout-preservation-watch-system</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;h3&gt;What is preservation watch?&lt;/h3&gt;&lt;p&gt;The reason why we should worry about preservation of digital content and why some preservation action needs to be done is closely related to the idea that content is at risk. The risk relates to the potential of losing something of value, weighted against the potential of gaining something of value. In digital preservation, the risk relates to losing long-term and continuous access to (or usability of) content by the intended users, and it is weighted against the cost (or profit) of maintaining such access. The long-term and continuous aspects of this access mean that there should be a continuous and long-term process that knows when content is misaligned with the requirements of the intended users; this process is preservation watch.&lt;/p&gt;&lt;p&gt;In practice, preservation watch becomes even more complex, as long-term and continuous access are often conflicting requirements. To tackle this, an institution would normally define a &quot;preservation format&quot; which tries to fulfill the long-term access requirement, and create &quot;access&quot; or &quot;dissemination&quot; copies, which are optimized for the user community.&lt;/p&gt;&lt;p&gt;Monitoring whether content is aligned with the long-term and continuous access requirements, i.e. whether the selected preservation and access formats are still adequate, is a big endeavor that quickly becomes infeasible with large-scale content. 
Institutions are normally able to tackle the usual suspects, like images and text documents, but are unable to process the long tail of file formats that almost all institutions have.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;Scout illustration&quot; src=&quot;http://i.imgur.com/V2Nyc6h.png&quot; style=&quot;width: 380px; height: 354px; float: right;&quot; /&gt;&lt;/p&gt;&lt;h3&gt;Scout – a preservation watch system&lt;/h3&gt;&lt;p&gt;&lt;a href=&quot;http://openplanets.github.io/scout/&quot;&gt;http://openplanets.github.io/scout/&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Scout is a preservation watch system being developed within the &lt;a href=&quot;http://www.scape-project.eu&quot;&gt;SCAPE project&lt;/a&gt;. It provides an ontological knowledge base to centralize all the information necessary to detect preservation risks and opportunities. It uses plugins to allow easy integration of new sources of information, such as file format registries, tools for characterization, migration and quality assurance, policies, human knowledge and others. The knowledge base can be easily browsed, and triggers can be installed to automatically notify users of new risks and opportunities. Examples of such notifications are: content fails to conform to defined policies, a format has become obsolete, or new tools able to render your content are available.&lt;/p&gt;&lt;p&gt;For example, you can continuously monitor your content file formats and other characteristics, e.g. compression scheme. 
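A trigger over such characteristics can be pictured as a simple predicate. Here is a minimal sketch that flags files whose compression scheme is not lossless; the record structure, field names and compression vocabulary are illustrative assumptions, not Scout's actual API:

```python
# Minimal sketch of a characteristic-based check -- illustrative only,
# not Scout's actual API. Assumes per-file characterization records.
records = [
    {"path": "img001.tif", "compression": "LZW"},      # lossless
    {"path": "img002.jpg", "compression": "JPEG"},     # lossy
    {"path": "img003.png", "compression": "Deflate"},  # lossless
]

LOSSLESS = {"LZW", "Deflate", "None"}  # illustrative vocabulary only

def violations(records):
    """Return paths of files whose compression breaks a lossless-only policy."""
    return [r["path"] for r in records if r["compression"] not in LOSSLESS]

print(violations(records))  # -> ['img002.jpg']
```

In a real deployment, this kind of predicate would run over the harvested content profile and feed a notification trigger rather than a print statement.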
Scout can monitor your content profile over time and allow you to compare it with other institutions, see how content evolves and cross-reference that information with your policies, file format registries (like PRONOM), and any other information that can be provided to Scout.&lt;/p&gt;&lt;p&gt;This will give you an invaluable insight into your content and how it relates to the outside world.&lt;/p&gt;&lt;h3&gt;What information does Scout currently have?&lt;/h3&gt;&lt;h4&gt;Content&lt;/h4&gt;&lt;p&gt;Scout is able to monitor the content profile, which is a summary of the content characterization. Scout fetches information about file format distribution, file size, and file characteristics like compression scheme. Scout does this using C3PO and FITS: you can run FITS on every file of your content to get the characterization output, and run C3PO to generate the content profile XML that can be monitored by Scout.&lt;/p&gt;&lt;p&gt;Here is an example of the data gathered from a web archive collection:&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Internet Memory Foundation web archive collection&lt;/strong&gt; (harvests of a confidential domain from 2009 to 2012)&lt;/p&gt;&lt;p&gt;Content size (on each harvest):&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/8CfiaEC.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;p&gt;Format distribution (table with latest status):&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/O7a2LxN.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;p&gt;... 
and a long tail of other formats.&lt;/p&gt;&lt;p&gt;Format distribution (diagram with history on each harvest):&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/fWlLTfE.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;p&gt;Compression scheme (on each harvest):&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/XW4dttq.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;h4&gt;Policies&lt;/h4&gt;&lt;p&gt;Scout allows upload of preservation control policies in an RDF model created in the SCAPE project. Check the &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/Preservation%20Policy%20Levels%20in%20SCAPE.pdf&quot;&gt;Preservation Policy Levels in SCAPE&lt;/a&gt; paper for more information about the Preservation Policy model. These control policies define requirements on the content that can be automatically checked for conformance with monitored content. For example, you can upload to Scout a policy that defines that the compression scheme must be lossless, and monitor your content to be warned whenever a lossy format is added to your content.&lt;/p&gt;&lt;p&gt;To add policies you have to log into Scout and upload your policy RDF model in the Scout dashboard.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/TSmNXyn.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;p&gt;Note that the current version of Scout does not support multiple users (so you must download your own version of Scout to do this). Note also that not all policies can be checked for conformance, as they might depend on non-existing information, but you can add more information to Scout at any time (via source adaptors). 
Finally, please note that you might need to create a new trigger that cross-references a control policy with the content profile, but default triggers and common vocabularies are currently being developed to make this cross-referencing easier.&lt;/p&gt;&lt;h4&gt;Registries&lt;/h4&gt;&lt;p&gt;Scout currently monitors the PRONOM registry via its SPARQL endpoint. It currently holds 843 file formats:&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/Pta7XSQ.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;h4&gt;Web&lt;/h4&gt;&lt;p&gt;An experiment with &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/Automatic%20Preservation%20Watch%20using%20Information%20Extraction%20on%20the%20Web.pdf&quot;&gt;Automatic Preservation Watch using Information Extraction on the Web&lt;/a&gt; was presented at the last iPRES conference (2013). In this experiment, the journals that are provided by a publisher are automatically extracted from the Web by focused crawls (using journal and publisher names), and relations are calculated from natural language statements using information extraction tools.&lt;/p&gt;&lt;p&gt;In the experiment, 500,000 web pages containing about 18 million sentences were crawled, resulting in 2,000 journal titles and 500 journal-publisher relations. Comparing the results with the eDepot and the Keepers registry gave the following results.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://i.imgur.com/Bwdcrmb.png&quot; style=&quot;width: 100%;&quot; /&gt;&lt;/p&gt;&lt;p&gt;Comparing the results with the eDepot, we found that 86% of the gathered journal titles were not in the eDepot and should be added, 10% were already registered and 4% were false positives. Manually comparing a sample of the results with the Keepers registry, we also estimate that around 50% of all found journal-publisher relationships were already registered, and 35% needed to be added. 
We also estimate that there are more false positives in the journal-publisher results, because detecting journal(title)-publisher(name) relations is more complex and error-prone than just detecting the journal titles.&lt;/p&gt;&lt;p&gt;This experiment demonstrates that information extraction technologies can be a good complement to registries, and can even serve as a substitute information source when no registries exist on some subject. Nevertheless, some work is needed to reduce the error rate; several suggestions on how to do this are available in &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/Automatic%20Preservation%20Watch%20using%20Information%20Extraction%20on%20the%20Web.pdf&quot;&gt;the paper&lt;/a&gt;.&lt;/p&gt;&lt;h3&gt;How can I use Scout?&lt;/h3&gt;&lt;p&gt;Plans exist to create a central instance of Scout, which could serve as a central hub for digital preservation information. For now, there is no such central instance, but you can check out the demonstration instance at &lt;a href=&quot;http://scout.scape.keep.pt&quot;&gt;http://scout.scape.keep.pt&lt;/a&gt; (please be aware that this is a development/demonstration site and may go down at any moment).&lt;/p&gt;&lt;p&gt;You can also download and install your own instance of Scout, gather information and monitor your content. To find out how, check the development site: &lt;a href=&quot;http://openplanets.github.io/scout/&quot;&gt;http://openplanets.github.io/scout/&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Finally, you can send us your content profile and be an early adopter. To know more, please contact me at lfaria[AT]keep.pt&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Mon, 16 Dec 2013 17:31:52 +0000</pubDate>
 <dc:creator>lfaria</dc:creator>
 <guid isPermaLink="false">1067 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-16-scout-preservation-watch-system#comments</comments>
</item>
<item>
 <title>Web Archive FITS Characterisation using ToMaR</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-16-web-archive-fits-characterisation-using-tomar</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;From the very beginning of the SCAPE project, it was a requirement that the SCAPE Execution Platform be able to leverage the functionality of existing command-line applications. The solution for this is &lt;a href=&quot;https://github.com/openplanets/tomar&quot;&gt;ToMaR&lt;/a&gt;, a Hadoop-based application which, amongst other things, allows for the execution of command-line applications in a distributed way on a computer cluster. This blog post describes the combined usage of a set of SCAPE tools for characterising and profiling web-archive data sets.&lt;/p&gt;&lt;!--break--&gt;&lt;p dir=&quot;ltr&quot;&gt;We decided to use &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; (File Information Tool Set) as a test case for ToMaR for two reasons. First, the &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; approach of producing “normalised” output on the basis of various file format characterisation tools makes sense, and therefore enabling the execution of this tool on very large data sets will be of great interest to many people working in the digital preservation domain. Second, the application is challenging from a technical point of view, because it starts several tools as sub-processes. Even if a process takes only one second per file, we have to keep in mind that web archives usually have potentially billions of files to process.&lt;/p&gt;&lt;p&gt;The workflow in figure 1 is an integrated example of using several SCAPE outcomes in order to create a profile of web archive content. 
It shows the complete process, from unpacking a web archive container file to viewing aggregated statistics about the individual files it contains using the &lt;a href=&quot;https://github.com/peshkira/c3po&quot;&gt;SCAPE profiling tool C3PO&lt;/a&gt;:&lt;br /&gt;&lt;img height=&quot;333px;&quot; src=&quot;https://lh3.googleusercontent.com/wDKv1xaXsIC2eJKZby1wVfOHDLyKBsmTXCRCme9eC2esbUUr6zBaLbh1u5xmGCOQSUtsX_6-bPG6eQl7xxHiMaFmedXa_zSP4mhVMUtDD1h3GYBJ-1vEsJqj3g&quot; width=&quot;487px;&quot; /&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;em&gt;Figure 1: Web Archive FITS Characterisation using ToMaR, available on myExperiment: &lt;a href=&quot;http://www.myexperiment.org/workflows/3933&quot;&gt;www.myexperiment.org/workflows/3933&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;The inputs in this workflow are defined as follows:&lt;/p&gt;&lt;ul style=&quot;list-style-type:disc;&quot;&gt;&lt;li dir=&quot;ltr&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;“c3po_collection_name”: The name of the &lt;a href=&quot;https://github.com/peshkira/c3po&quot;&gt;C3PO&lt;/a&gt; collection to be created.&lt;/p&gt;&lt;/li&gt;&lt;li dir=&quot;ltr&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;“hdfs_input_path”: A Hadoop Distributed File System (HDFS) path to a directory which contains text file(s) with absolute HDFS paths to ARC files.&lt;/p&gt;&lt;/li&gt;&lt;li dir=&quot;ltr&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;“num_files_per_invocation”: The number of items to be processed per &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; invocation.&lt;/p&gt;&lt;/li&gt;&lt;li dir=&quot;ltr&quot;&gt;&lt;p dir=&quot;ltr&quot;&gt;“fits_local_tmp_dir”: The local directory where the &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; output XML files will be stored.&lt;/p&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p dir=&quot;ltr&quot;&gt;The workflow uses the Map-only Hadoop job &lt;a href=&quot;https://github.com/shsdev/spacip&quot;&gt;Spacip&lt;/a&gt; to unpack the ARC container 
files into HDFS and to create input files which can subsequently be used by &lt;a href=&quot;https://github.com/openplanets/tomar&quot;&gt;ToMaR&lt;/a&gt;. After merging the Mapper output files from &lt;a href=&quot;https://github.com/shsdev/spacip&quot;&gt;Spacip&lt;/a&gt; into one single file (MergeTomarInput), the &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; characterisation process is launched by &lt;a href=&quot;https://github.com/openplanets/tomar&quot;&gt;ToMaR&lt;/a&gt; as a MapReduce job. &lt;a href=&quot;https://github.com/openplanets/tomar&quot;&gt;ToMaR&lt;/a&gt; uses an &lt;a href=&quot;https://github.com/openplanets/scape-toolspecs/blob/master/toolspec.xsd&quot;&gt;XML tool specification&lt;/a&gt; document which defines the inputs, outputs and execution of the tool. The &lt;a href=&quot;http://dl.dropboxusercontent.com/u/19171456/fits.xml&quot;&gt;tool specification document for FITS&lt;/a&gt; used in this experiment defines two operations, one for single-file invocation and one for directory invocation.&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; comes with a command-line interface that allows a single file to be used as input to produce the &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; XML characterisation result. But if the tool were started from the command line for each individual file in a large web archive, the start-up time of &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt;, including its sub-processes, would accumulate and result in poor performance. Therefore, it comes in handy that &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; allows the definition of a directory which is traversed recursively to process each file in the same JVM context. 
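The effect of amortising that start-up cost can be sketched with a toy cost model; the start-up and per-file times below are illustrative assumptions, not measurements from this experiment:

```python
# Toy cost model: JVM start-up cost is paid once per invocation, not once
# per file, so batching files per FITS invocation reduces total time.
# startup_s and per_file_s are illustrative assumptions, not measurements.
import math

def total_minutes(n_files, batch_size, startup_s=20.0, per_file_s=1.0):
    """Total processing time in minutes for a given batch size."""
    invocations = math.ceil(n_files / batch_size)
    return (invocations * startup_s + n_files * per_file_s) / 60.0

n = 42223  # individual files in the experiment's 5 ARC containers
for batch in (1, 10, 100, 400):
    print(batch, round(total_minutes(n, batch), 1))
```

Whatever the exact constants, the model shows why per-file invocation is dominated by start-up overhead while directory invocation approaches the pure per-file processing time.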
&lt;a href=&quot;https://github.com/openplanets/tomar&quot;&gt;ToMaR&lt;/a&gt; permits making use of this functionality by defining an operation which processes a set of input files and produces a set of output files.&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;The question of how many files should be processed per &lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; invocation can be addressed by setting up a &lt;a href=&quot;http://www.taverna.org.uk/&quot;&gt;Taverna&lt;/a&gt; experiment like the one shown in figure 2. The workflow presented above is embedded in a new workflow in order to generate a test series. A list of 40 values for the number of files to be processed per invocation, ranging from 10 to 400 in steps of 10, is given as input to the “num_files_per_invocation” parameter. &lt;a href=&quot;http://www.taverna.org.uk/&quot;&gt;Taverna&lt;/a&gt; will then automatically iterate over the list of input values by combining them as a cross product and launching 40 runs of the embedded workflow.&lt;/p&gt;&lt;p dir=&quot;ltr&quot; style=&quot;line-height:1;margin-top:6pt;margin-bottom:6pt;text-align: center;&quot;&gt;&lt;img height=&quot;454px;&quot; src=&quot;https://lh3.googleusercontent.com/I-J_-0cd0erg5HCSGRVh_4RBON7hvZ6ZBsU6bKh-xzKGiOg9m5-FTc6Odz4_MsiS-kXfv1Q3mIF5ttwjRzOvG1R5_Wxzy6aOaVgX94VoatN8L91M51Ri1CZkmDdtPGTmdVs&quot; width=&quot;509px;&quot; /&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;em&gt;Figure 2: Wrapper workflow to produce a test series.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;5 ARC container files with a total size of 481 Megabytes and 42223 individual files were used as input for this experiment. 
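The test series described above is easy to reproduce; this sketch just rebuilds the parameter list Taverna iterates over:

```python
# The parameter sweep from the experiment: batch sizes from 10 to 400
# in steps of 10, one embedded-workflow run per value.
num_files_per_invocation = list(range(10, 401, 10))

print(len(num_files_per_invocation))  # 40 workflow runs
print(num_files_per_invocation[0], num_files_per_invocation[-1])  # 10 400
```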
The 40 workflow cycles were completed in around 24 hours and led to the result shown in figure 3.&lt;br /&gt;&lt;img height=&quot;306px;&quot; src=&quot;https://lh6.googleusercontent.com/o3pFRuql24i7w9DWg0eaFQIJrSrbrqoh9VOJWN6bWc8F2xzxXYsYvERLp-oKH9W-d6B1HjkWA8U2b4uWYIlRzlVx8RL8GVaKaZ9k2yPOW_pUjWdkAch_wIBQMA&quot; width=&quot;610px;&quot; /&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;em&gt;Figure 3: Execution time vs. number of files processed per invocation.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;The experiment shows a range of values with the execution time stabilising at about 30 minutes. Additionally, the evolution of the execution time of the average and worst performing task is illustrated in figure 4 and can be taken into consideration to choose the right parameter value.&lt;/p&gt;&lt;p&gt;&lt;img height=&quot;285px;&quot; src=&quot;https://lh3.googleusercontent.com/FNGlYr-8IKINIeXhNqoWUwe41fqETjmTA5NGu0jCfXHybHsCh2vgiOmExkioaSfIu-OYr95s_VOFXziahzzVzuxBCQHncEflYtr--FFc-30D7To0ro7noCM_Nw&quot; width=&quot;612px;&quot; /&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;em&gt;Figure 4: Average and worst performing tasks.&lt;/em&gt;&lt;/p&gt;&lt;p&gt;As a reference point, the 5 ARC files were processed locally on one cluster node in a single-threaded application run, which took 8 hours and 55 minutes.&lt;/p&gt;&lt;p&gt;The cluster used in this experiment has one controller machine (Master) and 5 worker machines (Slaves). The master node has two quadcore CPUs (8 physical/16 HyperThreading cores) with a clock rate of 2.40GHz and 24 Gigabytes of RAM. The slave nodes each have one quadcore CPU (4 physical/8 HyperThreading cores) with a clock rate of 2.53GHz and 16 Gigabytes of RAM. Regarding the Hadoop configuration, five processor cores of each slave machine have been assigned to Map tasks, two cores to Reduce tasks, and one core is reserved for the operating system. This gives a total of 25 processing cores for Map tasks and 10 cores for Reduce tasks. 
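Putting the figures above together gives a rough sense of the parallel speedup, ignoring scheduling and start-up overheads:

```python
# Rough speedup implied by the reported figures: 8 h 55 min single-threaded
# on one node vs. roughly 30 min best case on the cluster's 25 Map cores.
single_threaded_min = 8 * 60 + 55   # 535 minutes
cluster_min = 30                    # approximate best cluster time
map_cores = 25

speedup = single_threaded_min / cluster_min
efficiency = speedup / map_cores

print(round(speedup, 1))            # about 17.8x
print(round(efficiency * 100))      # about 71 percent per-core efficiency
```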
The best execution time on the cluster was about 30 minutes, which compares to the single-threaded execution time as illustrated in figure 5.&lt;/p&gt;&lt;p&gt;&lt;img height=&quot;176px;&quot; src=&quot;https://lh5.googleusercontent.com/KqiVwDHLBGAwVlewVl5t1IRgIfemr5X72tmhpM2X3PIu4WB4bbr5MJ0X3nAF0DRxraj_qrqs6E7zqDQvFLXbLoe3rkwsqUQNTzrL2foeO-WO6pjriv_I7v-Y9w&quot; width=&quot;593px;&quot; /&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;&lt;em&gt;Figure 5: Single-threaded execution on one cluster node vs. cluster execution.&lt;/em&gt;&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;Processing larger data sets can be done in a similar manner to that shown in figure 2, except that a list of input directory HDFS paths determines the sequence of workflow runs and the number of files per FITS invocation is set to a single fixed value.&lt;/p&gt;&lt;p dir=&quot;ltr&quot;&gt;The following screencast shows a brief demo of the workflow using a tiny ARC file containing the harvest of an HTML page referencing a PNG image. 
It demonstrates how Taverna orchestrates the Hadoop jobs using tool service components.&lt;/p&gt;&lt;p&gt;&lt;object height=&quot;315&quot; width=&quot;560&quot;&gt;&lt;param name=&quot;movie&quot; param=&quot;&quot; value=&quot;https://youtube.googleapis.com/v/i5XXS-sZ4-c?version=2&amp;amp;fs=1&quot; /&gt;&lt;param name=&quot;allowFullScreen&quot; value=&quot;true&quot; /&gt;&lt;param name=&quot;allowScriptAccess&quot; value=&quot;always&quot; /&gt;&lt;embed allowfullscreen=&quot;true&quot; allowscriptaccess=&quot;always&quot; height=&quot;315&quot; src=&quot;https://youtube.googleapis.com/v/i5XXS-sZ4-c?version=2&amp;amp;fs=1&quot; type=&quot;application/x-shockwave-flash&quot; width=&quot;560&quot;&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomyextra field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Taxonomy upgrade extras:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/characterisation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Characterisation&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/web-archiving&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label 
skos:prefLabel&quot; datatype=&quot;&quot;&gt;Web Archiving&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Mon, 16 Dec 2013 15:13:31 +0000</pubDate>
 <dc:creator>shsdev</dc:creator>
 <guid isPermaLink="false">1066 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-16-web-archive-fits-characterisation-using-tomar#comments</comments>
</item>
<item>
 <title>Interview with a SCAPEr - Zeynep Pehlivan</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-13-interview-scaper-zeynep-pehlivan</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://www.openplanetsfoundation.org/system/files/ZeynepPehlivan.jpeg&quot; style=&quot;float: left; width: 150px; height: 166px; margin: 5px;&quot; /&gt;&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;Who are you?&lt;/h3&gt;&lt;p&gt;My name is Zeynep PEHLIVAN. I joined University Pierre and Marie Curie (UPMC) for a master&#039;s degree in 2009. I recently received my PhD at the same university. 
I have been involved in the SCAPE project since September 2012.&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?&lt;/h3&gt;&lt;p&gt;I ensure the work package lead position for the Quality Assurance Components work package within the Preservation Components sub project under the supervision of Stéphane Gançarski and Matthieu Cord. In coming months, I will be also involved in the development of Quality Assurance tools for UPMC.&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;Why is your organisation involved in SCAPE?&lt;/h3&gt;&lt;p&gt;Our team at UPMC has been conducting research on digital preservation, especially on web archiving, since a while. 
As a university, participating in this project allows us to better evaluate the users’ real needs, to see how our research results are used in real life and to collaborate with different institutions.&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;What are the biggest challenges in SCAPE as you see it?&lt;/h3&gt;&lt;p&gt;I think, due to its size and its international position, a project like SCAPE will have several administrative challenges. However, above all, the most important challenges for me are the technical ones. There are so many useful tools developed in the project answering different issues related to digital preservation in different development environments. Integration of these tools into one single system is a big challenge but I think, today, through the last year of the project, we see the light at the end of the tunnel.&lt;/p&gt;&lt;p&gt;In addition, digital objects are ephemeral depending on different reasons. Taking this ephemeral nature into account while designing our solutions is another challenge. 
Although it is well studied in the project, we can not predict all issues based on ephemerality for the durability of the system.&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;What do you think will be the most valuable outcome of SCAPE?&lt;/h3&gt;&lt;p&gt;The size of the digital collections is getting larger each day. Thus, when we talk about digital collections, in fact, we refer to “big data”. As indicated also in the name of the project, &lt;em&gt;scalability &lt;/em&gt;will be the most valuable outcome of the project, in my opinion.&lt;/p&gt;&lt;p&gt;Digital collections represent a huge information source. If access to these collections is not provided, unfortunately the preservation effort can ultimately become irrelevant. Previous works show that users of digital collections need to analyze, compare and evaluate the information. 
It will be interesting to see developed access tools to let users search, evaluate, and visualize these huge collections.&lt;/p&gt;&lt;h3 style=&quot;margin: 10px 0px; font-family: &#039;Trebuchet MS&#039;, Helvetica, sans-serif; font-weight: bold; line-height: 1.25em; color: rgb(0, 0, 0); text-rendering: optimizelegibility; font-size: 1.5em; font-style: normal; font-variant: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgba(255, 255, 255, 0.901961);&quot;&gt;&lt;em&gt;Contact information&lt;/em&gt;&lt;/h3&gt;&lt;p&gt;Zeynep PEHLIVAN&lt;/p&gt;&lt;p&gt;University Pierre and Marie Curie&lt;/p&gt;&lt;p&gt;4 Place Jussieu, 75005 Boite 169&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;mailto:Zeynep.pehlivan@lip6.fr&quot;&gt;Zeynep.pehlivan@lip6.fr&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Linkedin: &lt;a href=&quot;http://www.linkedin.com/profile/view?id=7183444&quot;&gt;http://www.linkedin.com/profile/view?id=7183444&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 13 Dec 2013 09:00:32 +0000</pubDate>
 <dc:creator>Jette Junge</dc:creator>
 <guid isPermaLink="false">1065 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-13-interview-scaper-zeynep-pehlivan#comments</comments>
</item>
<item>
 <title>Impressions of the ‘Hadoop-driven digital preservation Hackathon’ in Vienna</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-06-impressions-%E2%80%98hadoop-driven-digital-preservation-hackathon%E2%80%99-vienna</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;More than 20 developers visited the ‘&lt;a href=&quot;http://wiki.opf-labs.org/pages/viewpage.action?pageId=32604217&quot;&gt;Hadoop-driven digital preservation Hackathon&lt;/a&gt;’ in Vienna which took place in the baroque room called &quot;Oratorium&quot; of the &lt;a href=&quot;http://www.onb.ac.at/&quot;&gt;Austrian National Library&lt;/a&gt; from 2nd to 4th of December 2013. It was really exciting to hear people vividly talking about &lt;a href=&quot;http://hadoop.apache.org/&quot;&gt;Hadoop&lt;/a&gt;, &lt;a href=&quot;http://pig.apache.org/&quot;&gt;Pig&lt;/a&gt;, &lt;a href=&quot;http://hive.apache.org/&quot;&gt;Hive&lt;/a&gt;, &lt;a href=&quot;http://hbase.apache.org/&quot;&gt;HBase&lt;/a&gt; followed by silent phases of concentrated coding accompanied by the background noise of mouse clicks and keyboard typing.&lt;/p&gt;&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;de&quot; xml:lang=&quot;de&quot;&gt;&lt;p&gt;In &lt;a href=&quot;https://twitter.com/search?q=%23scapeproject&amp;amp;src=hash&quot;&gt;#scapeproject&lt;/a&gt; we always have a high ceiling &lt;a href=&quot;https://twitter.com/search?q=%23hadoop4DP&amp;amp;src=hash&quot;&gt;#hadoop4DP&lt;/a&gt; &lt;a href=&quot;https://twitter.com/search?q=%23fb&amp;amp;src=hash&quot;&gt;#fb&lt;/a&gt; &lt;a href=&quot;http://t.co/mVRsBfArFC&quot;&gt;pic.twitter.com/mVRsBfArFC&lt;/a&gt;&lt;/p&gt;— Per Møldrup-Dalum (@perdalum) &lt;a href=&quot;https://twitter.com/perdalum/statuses/407526492806324224&quot;&gt;2. 
Dezember 2013&lt;/a&gt;&lt;/blockquote&gt;&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;&lt;p&gt;There were Hadoop newbies, people from the &lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;SCAPE Project&lt;/a&gt; with some knowledge of Apache Hadoop-related technologies, and, finally, &lt;a href=&quot;http://www.umiacs.umd.edu/~jimmylin/&quot;&gt;Jimmy Lin&lt;/a&gt;, who currently works as an associate professor at the &lt;a href=&quot;http://www.umd.edu/&quot;&gt;University of Maryland&lt;/a&gt; and was previously employed as a research scientist at &lt;a href=&quot;https://twitter.com/&quot;&gt;Twitter&lt;/a&gt;. There is no doubt that his profound knowledge of using Hadoop in an ‘industrial’ big data context gave this event its special character.&lt;/p&gt;&lt;p&gt;The topic of this Hackathon was large-scale digital preservation in the &lt;a href=&quot;http://wiki.opf-labs.org/download/attachments/32604244/01-scenario1-webarchiving.ppt&quot;&gt;web archiving&lt;/a&gt; and &lt;a href=&quot;http://wiki.opf-labs.org/download/attachments/32604244/ABO_intro_usecase.pdf&quot;&gt;digital books quality assurance&lt;/a&gt; domains. People from the Austrian National Library presented application scenarios and challenges and introduced the sample data, which was provided for both areas on a virtual machine together with a pseudo-distributed Hadoop installation and some other useful tools from the Apache Hadoop ecosystem.&lt;/p&gt;&lt;p&gt;I am sure that Jimmy’s talk about Hadoop was the reason why so many participants became curious about &lt;a href=&quot;http://pig.apache.org/&quot;&gt;Apache Pig&lt;/a&gt;, a powerful tool which Jimmy humorously characterised as the tool for lazy pigs aiming for hassle-free MapReduce. 
Jimmy gave a live demo running &lt;a href=&quot;http://wiki.opf-labs.org/download/attachments/32604244/hackathon-demo.txt&quot;&gt;some Pig scripts&lt;/a&gt; on the cluster at his university, explaining how Pig can be used to find out which links point to each web page in a web archive data sample from the &lt;a href=&quot;http://www.loc.gov/&quot;&gt;Library of Congress&lt;/a&gt;. Asking Jimmy for his opinion on Pig and Hive as two alternatives for data science to choose from, I found it interesting that he did not seem to have a strong preference for Pig. If an organisation has a lot of experienced SQL experts, he said, Hive is a very good choice. On the other hand, from the perspective of the data scientist, Pig offers a more flexible, procedural approach to manipulating and analysing data.&lt;/p&gt;&lt;p&gt;Towards the end of the first day we started to split up. People gathered ideas in a brainstorming session, which ultimately led to several working groups:&lt;/p&gt;&lt;p&gt;·  &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Cropping+error+detection&quot;&gt;&lt;span id=&quot;childrenspan36012130-0&quot;&gt;Cropping error detection&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;·  &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Full-text+search+on+top+of+warcbase&quot;&gt;&lt;span id=&quot;childrenspan36012142-0&quot;&gt;Full-text search on top of warcbase&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;·  &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Hadoop-based+Identification+and+Characterisation&quot;&gt;&lt;span id=&quot;childrenspan36012045-0&quot;&gt;Hadoop-based Identification and Characterisation&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;·  &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/OCR+Quality&quot;&gt;&lt;span id=&quot;childrenspan36012049-0&quot;&gt;OCR Quality&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;·  &lt;a 
href=&quot;http://wiki.opf-labs.org/display/SP/PIG+User+Defined+Functions+to+operate+on+extracted+web+content&quot;&gt;&lt;span id=&quot;childrenspan36012047-0&quot;&gt;PIG User Defined Functions to operate on extracted web content&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;·  &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/PIG+User+Defined+Functions+to+operate+on+METS&quot;&gt;&lt;span id=&quot;childrenspan36012128-0&quot;&gt;PIG User Defined Functions to operate on METS&lt;/span&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;Many participants took their first steps in Pig scripting during the event, so one cannot expect code that is ready for a production environment, but the results offer many starting points for planning projects with similar requirements.&lt;/p&gt;&lt;p&gt;On the second day, there was another talk by Jimmy about HBase and his project &lt;a href=&quot;https://github.com/lintool/warcbase&quot;&gt;WarcBase&lt;/a&gt;, which looks like a very promising approach to providing a scalable HBase storage backend with a very responsive user interface offering the basic functionality of the &lt;a href=&quot;http://archive.org/web/&quot;&gt;WayBack machine&lt;/a&gt; for rendering ARC and WARC web archive container files. In my opinion, the upside of his talk was seeing HBase as a tremendously powerful database on top of Hadoop’s distributed file system (HDFS), with Jimmy brimming over with ideas about possible use cases for scalable content delivery using HBase. The downside was hearing from his experience how complex the administration of a large HBase cluster can become. 
First, in addition to the Hadoop administration tasks, it is necessary to keep further daemons (ZooKeeper, RegionServer) up and running, and he explained how the need to compact data stored in HFiles, just when you believe that the HBase cluster is well balanced, can lead to what the community calls a “compaction storm” that blows up your cluster - luckily this only manifests itself in endless Java stack traces.&lt;/p&gt;&lt;p&gt;One group provided a &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Full-text+search+on+top+of+warcbase&quot;&gt;full-text search for WarcBase&lt;/a&gt;, picking up the core ideas from the developer groups and presentations to build a cutting-edge environment in which the web archive content was indexed by the &lt;a href=&quot;http://terrier.org/&quot;&gt;Terrier&lt;/a&gt; search engine and the index was enriched with metadata from the Apache Tika MIME type and language detection. There were two ways to add metadata to the index. The first option was to run a pre-processing step that uses a Pig user defined function to output the metadata of each document. The second option was to use Apache Tika during indexing to detect both the MIME type and the language. In my view, this group won the prize for the fanciest set-up, sharing resources and daemons running on their laptops.&lt;/p&gt;&lt;p&gt;I was impressed by how dynamically the outcomes were shared between developers in the largest working group: one developer implemented a &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/PIG+User+Defined+Functions+to+operate+on+extracted+web+content&quot;&gt;Pig user defined function (UDF) making use of Apache Tika’s language detection API&lt;/a&gt; (see section MIME type detection) which the next developer used in a &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/PIG+User+Defined+Functions+to+operate+on+extracted+web+content&quot;&gt;Pig script for mime type and language detection&lt;/a&gt;. 
Also &lt;a href=&quot;http://www.dima.tu-berlin.de/menue/staff/alan_akbik/&quot;&gt;Alan Akbik&lt;/a&gt;, SCAPE project member, computer linguist and Hadoop researcher from the Technical University of Berlin, reused building blocks from this group to develop Pig scripts for old German language analysis, using dictionaries as a means to determine the quality of noisy OCRed text. As an experienced Pig scripter he produced impressive results and deservedly won the Hackathon’s competition for the best presentation of outcomes.&lt;/p&gt;&lt;p&gt;The last group experimented with the functionality of classical digital preservation tools for file format identification, like Apache Tika, DROID, and Unix file, and looked into ways to improve their performance on the Hadoop platform. It’s worth highlighting that digital preservation guru &lt;a href=&quot;http://www.openplanetsfoundation.org/users/carl&quot;&gt;Carl Wilson&lt;/a&gt; found a way to replace the command-line invocation of &lt;a href=&quot;http://en.wikipedia.org/wiki/File_%28command%29&quot;&gt;unix file&lt;/a&gt; in &lt;a href=&quot;http://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; with a Java API invocation, which proved to be far more efficient.&lt;/p&gt;&lt;p&gt;Finally, &lt;a href=&quot;http://www.researchgate.net/profile/Roman_Graf/&quot;&gt;Roman Graf&lt;/a&gt;, researcher and software developer from the &lt;a href=&quot;http://www.ait.ac.at/&quot;&gt;Austrian Institute of Technology&lt;/a&gt;, took images from the Austrian Books Online project in order to develop Python scripts which can be used to detect page cropping errors and which were especially designed to run on a Hadoop platform.&lt;/p&gt;&lt;p&gt;On the last day, we had a panel session with people talking about their experiences of day-to-day work with Hadoop clusters and their plans for the future of their cluster infrastructure.&lt;/p&gt;&lt;blockquote class=&quot;twitter-tweet&quot; lang=&quot;de&quot; 
xml:lang=&quot;de&quot;&gt;&lt;p&gt;Panel session, sharing experiences -adventures in implementing &lt;a href=&quot;https://twitter.com/search?q=%23Hadoop4DP&amp;amp;src=hash&quot;&gt;#Hadoop4DP&lt;/a&gt; &lt;a href=&quot;https://twitter.com/search?q=%23SCAPEProject&amp;amp;src=hash&quot;&gt;#SCAPEProject&lt;/a&gt; &lt;a href=&quot;http://t.co/Mjssc1pSSy&quot;&gt;pic.twitter.com/Mjssc1pSSy&lt;/a&gt;&lt;/p&gt;— OPF (@openplanets) &lt;a href=&quot;https://twitter.com/openplanets/statuses/408172463772954624&quot;&gt;4. Dezember 2013&lt;/a&gt;&lt;/blockquote&gt;&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;&lt;p&gt;I really enjoyed these three days and I was impressed by the knowledge and ideas that people brought to this event.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Preservation Actions&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/identification&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Identification&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/characterisation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Characterisation&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a 
href=&quot;/category/preservation-topics/web-archiving&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Web Archiving&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/tools&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Tools&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 06 Dec 2013 16:30:44 +0000</pubDate>
 <dc:creator>shsdev</dc:creator>
 <guid isPermaLink="false">1064 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-06-impressions-%E2%80%98hadoop-driven-digital-preservation-hackathon%E2%80%99-vienna#comments</comments>
</item>
<item>
 <title>A Sustainable Future for FITS</title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-06-sustainable-future-fits</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div&gt;As Paul mentioned &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-11-06-fits-blitz&quot;&gt;here&lt;/a&gt;, &lt;a href=&quot;http://coptr.digipres.org/FITS_(File_Information_Tool_Set)&quot;&gt;FITS&lt;/a&gt; is a classic case of a great digital preservation tool that many of us use and benefit from but that wasn’t set up to accept community code contributions. Different versions of FITS were proliferating instead of dovetailing into a better product. For this reason we decided to take a look at the situation to see what we could do to change it.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;!--break--&gt;&lt;div&gt;First we looked at the current FITS codebase and all of the forks out there with the aim of merging all existing stable features and patches. While merging appears to be a rather trivial task, ensuring that the existing functionality is not broken afterwards isn’t. This is especially tricky when there aren’t (m)any unit tests. Writing unit tests post factum usually involves refactoring code for testability. As any seasoned developer out there will likely agree - refactoring a large code base without unit tests usually means one thing: bugs…&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So how do you verify, with a relatively high level of confidence, that the code base still works as expected following the merge? Blackbox testing and git-bisect to the rescue!&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;To work around this in the limited time we had available for the FITS Blitz, we decided to use blackbox testing. We created a FITS XML comparator, which compares the output files produced by different FITS versions. 
We also created an accompanying script that combines this comparison tool with git-bisect. For those of you who don’t know git-bisect, it&#039;s a tool that is able to pinpoint the specific commit within a git repository that introduced a problem. This is done with the help of a simple binary search and a test suite - in our case the FITS XML comparator.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;We were able to go through the different branches and take the ones that didn’t break functionality, but leave the ones that still needed more work. As a result of all this merging during the FITS Blitz, the next version of FITS will include:&lt;/div&gt;&lt;ul&gt;&lt;li&gt;A few minor performance optimisations&lt;/li&gt;&lt;li&gt;The possibility to run FITS in a nailgun server&lt;/li&gt;&lt;li&gt;Droid updated to version 6&lt;/li&gt;&lt;li&gt;Apache Tika enhancements&lt;/li&gt;&lt;li&gt;Numerous bug fixes&lt;/li&gt;&lt;li&gt;Better error reporting&lt;/li&gt;&lt;li&gt;Logging&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;And the best thing is: these are all community improvements! Unfortunately, not all of the contributors have dared to hit the Pull Request button on GitHub, and that is something we have to improve as a community.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;In any case, having this simple way of validating that nothing major is broken has another advantage. We can now set up a continuous integration infrastructure that will help FITS maintainers to get further insight into future patches before merging them. Note that this doesn’t mean that no unit tests should be written. 
Quite the opposite, creating a unit test suite and refactoring the core of FITS where necessary is the next logical step.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;From this foundation, made possible with a Jisc-funded SPRUCE award, we will now work in partnership with interested members of the community to develop and maintain FITS in a way that we hope will give its users much greater belief in its reliability and ability to accept code contributions. To that end we&#039;re in the process of establishing a Steering Group that will meet regularly to review the status of FITS, manage a more sustainable development process, develop and champion community contributions to FITS, and create a development roadmap for the toolset. The Group will be composed of a variety of experienced FITS developers and users, and we&#039;ll be aiming to be as inclusive as possible within (in particular) the developer community.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So how will all this work in practice? When we&#039;ve added the finishing touches to this phase of the work, Carl will be back to blog about the new development process and how you can get involved to make FITS better. We are in the process of setting up a new website for FITS to centralize (and improve!) the FITS documentation.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Our ultimate aim is to make FITS a community-maintained tool that is kept up to date with a reliable build at everyone&#039;s fingertips, and hopefully demonstrate a better way to sustain community-created preservation tools.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Petar Petrov, Carl Wilson, Andrea Goethals, Spencer McEwen and Paul Wheatley&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 06 Dec 2013 11:37:04 +0000</pubDate>
 <dc:creator>paul</dc:creator>
 <guid isPermaLink="false">1063 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-06-sustainable-future-fits#comments</comments>
</item>
<item>
 <title>Week 48: A SCAPE Developer Short Story </title>
 <link>http://openplanetsfoundation.org/blogs/2013-12-04-week-48-scape-developer-short-story</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;It&#039;s been two weeks since the internal SCAPE developer workshop in Brno, &lt;span class=&quot;kno-ecr-st-val&quot;&gt;Czech Republic&lt;/span&gt;. It was a great workshop. We had a lot of presentations and demos, and were brought up to date on what&#039;s going on in the other corners of the SCAPE project. We also had some (loud) discussions, but I think we came to some good agreements on where we as developers are going next. And we started a number of development and productisation activities. I came home with a long list of things to do next week (this ended up not at all being what I did last week, but I still have the list, so next week, fingers crossed). Tasks for week 48:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;xcorrSound&lt;ul&gt;&lt;li&gt;make versioning stable and meaningful (this I looked at together with my colleague in week 48)&lt;/li&gt;&lt;li&gt;release new version (this one we actually did)&lt;/li&gt;&lt;li&gt;finish writing nice microsite&lt;/li&gt;&lt;li&gt;tell my colleague to finish writing small website, where you can test the xcorrSound tools without installing them yourself&lt;/li&gt;&lt;li&gt;write unit tests&lt;/li&gt;&lt;li&gt;introduce automatic rpm packaging?&lt;/li&gt;&lt;li&gt;finish xcorrSound Hadoop job&lt;/li&gt;&lt;li&gt;do the xcorrSound Hadoop Testbed Experiment&lt;ul&gt;&lt;li&gt;Update the corresponding user story on the wiki&lt;/li&gt;&lt;li&gt;Write the new evaluation on the wiki&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;finish the full Audio Migration + QA Hadoop job&lt;/li&gt;&lt;li&gt;do the full Audio Migration + QA Hadoop Testbed Experiment&lt;ul&gt;&lt;li&gt;Update the corresponding user story on the wiki&lt;/li&gt;&lt;li&gt;Write the new evaluation on the 
wiki&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;write a number of new blog posts about xcorrsound and SCAPE testbed experiments&lt;/li&gt;&lt;li&gt;new demo of xcorrsound for the SCAPE all-staff meeting in February&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;SCAPE testbed demonstrations&lt;ul&gt;&lt;li&gt;define the demos that we at SB are going to do as part of testbed (this one we also did in week 48; the actual demos we&#039;ll make next year)&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;FITS experiment (hopefully not me, but a colleague)&lt;/li&gt;&lt;li&gt;JPylyzer experiment (hopefully not me, but a colleague)&lt;/li&gt;&lt;li&gt;Mark FFprobe experiment as &lt;strong&gt;not&lt;/strong&gt; active&lt;/li&gt;&lt;li&gt;... there are some more points for the next months, but I&#039;ll spare you...&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;So what did I do in week 48? Well, I sort of worked on the JPylyzer experiment, which &lt;strong&gt;is&lt;/strong&gt; on the list above. In the Digital Preservation Technology Development department at SB we are currently working on a large-scale digitized newspapers ingest workflow including QA. As part of this work we run JPylyzer from Hadoop on all the ingested files, and then validate a number of properties using Schematron. These properties come from the requirements given to the digitization company, but in the SCAPE context they should come from policies, so there is still some work to do for the experiment. But running JPylyzer from Hadoop, and validating properties from the JPylyzer output using Schematron, now seems to work in the SB large-scale digitized newspapers ingest project :-)&lt;/p&gt;&lt;p&gt;And for now I&#039;ll put week 50 on the above list, and when I have finished a sufficient number of bullet points I&#039;ll blog again! 
This post is missing links, so I hope you can read it without them.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Wed, 04 Dec 2013 10:35:33 +0000</pubDate>
 <dc:creator>BoletteJurik</dc:creator>
 <guid isPermaLink="false">1062 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-12-04-week-48-scape-developer-short-story#comments</comments>
</item>
<item>
 <title>SPRUCE project Award: Lovebytes Media Archive Project</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-28-spruce-project-award-lovebytes-media-archive-project</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;&lt;strong&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.opf-labs.org/download/attachments/30998628/201001.jpg&quot; style=&quot;width: 375px; height: 250px; margin: 7px; float: left;&quot; title=&quot;Image courtesy of Lovebytes&quot; /&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;a href=&quot;http://www.lovebytes.org.uk/&quot;&gt;Lovebytes&lt;/a&gt; currently holds an archive of digital media assets representing 19 years of the organisation’s activities in the field of digital art and a rich historical record of emerging digital culture at the turn of the century. It contains original artworks in a wide variety of formats, video and audio documentation of events alongside websites and print objects.&lt;/p&gt;&lt;p&gt;In June 2013 we were delighted to receive an award from SPRUCE, which enabled us to devise and test a digital preservation plan for the archive through auditing, migrating and stabilising a representative sample of material, concentrating on migrating digital video and Macromedia Director files.&lt;/p&gt;&lt;!--break--&gt;&lt;p&gt;Alongside this we developed a Business Case, which makes the case for preserving the archive and describes the work that needs to be done to make it accessible for the benefit of current and future generations, with a view to this forming the basis of applications for funding to continue this work.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Context&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Lovebytes was set up to explore the cultural and creative impact of digitalisation across the whole gamut of artistic and creative practice through a festival of exhibitions, talks, workshops, performances, film screenings and commissions of new artwork.&lt;/p&gt;&lt;p&gt;We wanted the festival to be a forum to pose 
open questions about the impact of digitalisation for artists and audiences, in an attempt to find commonalities in working practice, identify new themes and highlight new and emerging forms and trends in creative digital practice, and also to provide support for artists to disseminate and distribute their own work through commissions.&lt;/p&gt;&lt;p&gt;This was a groundbreaking model for a UK media festival and established Lovebytes as a key player amongst a new wave of international arts festivals.&lt;/p&gt;&lt;p&gt;The intention in developing a plan for the Lovebytes Media Archive is to look at how best to capture the &#039;shape&#039; of the festival and how best to represent this in creating an accessible version of the archive.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Main Objectives&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The objectives of the project funded through SPRUCE are outlined below:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Develop a workflow for the migration of the digital files and interactive content, progressing from work done during the SPRUCE Mashup London.&lt;/li&gt;&lt;li&gt;Tackle issues around dealing with obsolete formats and authoring platforms used by artists (such as Macromedia Director Projector files) and look at ways of making this content more accessible whilst also maintaining original copies for authenticity.&lt;/li&gt;&lt;li&gt;Research and develop systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.&lt;/li&gt;&lt;li&gt;Report on progress and share our findings for the benefit of the digital preservation community.&lt;/li&gt;&lt;li&gt;Develop a digital preservation Business Case, with a view to approaching funders.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;Approach&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;We started by developing a research plan for a representative sample of the archive (see below), focusing on one festival rather than a range of samples from across the 19 years. 
We selected the year 2000 as this included a limited edition CD-ROM / Audio CD publication which contains specially commissioned interactive and generative artwork in a variety of formats.&lt;/p&gt;&lt;p&gt;Additional assets in the representative sample include video documentation of panel sessions, printed publicity, photographs, press cuttings and audience interviews in a wide variety of formats.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Research plan for the representative sample&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Auditing the archive.&lt;/li&gt;&lt;li&gt;Choosing a representative sample.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;strong&gt;Stabilising and migrating&lt;/strong&gt;&lt;/p&gt;&lt;ul&gt;&lt;li&gt;Reviewing content to assess problems and risk.&lt;/li&gt;&lt;li&gt;Stabilising again with a view to rectifying problems.&lt;/li&gt;&lt;li&gt;Cataloguing and naming.&lt;/li&gt;&lt;li&gt;Planning for future accessibility and interpretation.&lt;/li&gt;&lt;li&gt;Extracting metadata.&lt;/li&gt;&lt;li&gt;Prototyping a search interface to provide access to the archive (with Mark Osborne from &lt;a href=&quot;http://www.nooode.co.uk/&quot;&gt;Nooode&lt;/a&gt;).&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;Data integrity is paramount in digital preservation and requires utmost scrutiny when dealing with &#039;born digital&#039; artworks, where every aspect of the artist&#039;s original intentions should be considered a matter for preservation and any re-presentation of a digital artwork can be regarded as a reinterpretation of the work.&lt;/p&gt;&lt;p&gt;In all cases, the most urgent work was the migration of data to stabilise and secure it. 
Amongst the wide range of formats we hold, CDs and CD-ROMs are prone to bit rot, while magnetic formats can degrade gradually, be damaged by electrical and environmental conditions, or be easily damaged during attempts at reading or playback.&lt;/p&gt;&lt;p&gt;The majority of our preservation work was to migrate from a wide variety of formats to hard drive, essentially consolidating our collection into one storage medium, which is then duplicated as part of a back-up routine.&lt;/p&gt;&lt;p&gt;Our research focused on the following six areas:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;Macromedia Director Projector files&lt;ul&gt;&lt;li&gt;Migrating obsolete files and addressing compatibility issues.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;DV Tapes&lt;ul&gt;&lt;li&gt;Migrating DV tapes and transcribing panel sessions with a view to researching how transcriptions could be used for text-based searches of video content, and how these can be embedded as subtitles using YouTube.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Restoring the Lovebytes website&lt;ul&gt;&lt;li&gt;The Lovebytes website is currently offline, although it is captured on the &lt;a href=&quot;http://www.webarchive.org.uk/wayback/archive/20130206132446/2012.lovebytes.org.uk/&quot;&gt;British Library&#039;s UK Web Archive&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt;Developing naming systems for assets&lt;/li&gt;&lt;li&gt;Prototyping a searchable web interface and exploring the potential for using ready-made, free and accessible tools for transcription dissemination.&lt;/li&gt;&lt;li&gt;Writing a Business Case for the Lovebytes Media Archive&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;We learned some valuable lessons along the way that we&#039;d like to share with like-minded organisations, especially those who have limited resources and are looking to preserve their own digital legacy on a tight budget.&lt;/p&gt;&lt;p&gt;Our findings have been compiled into a &lt;a 
href=&quot;http://wiki.opf-labs.org/download/attachments/30998628/LBMA+SPRUCE+report+ed+copy.rtf&quot;&gt;detailed report&lt;/a&gt;, providing a workflow model which makes recommendations for capturing, cataloguing and preserving material. It outlines our research into preserving artwork on obsolete formats and authoring platforms, as well as systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.&lt;/p&gt;&lt;p&gt;We wanted to begin looking at the preservation issues for our collection and devise our own systems and best practice; therefore, the recommendations reached for preserving digital assets in various media formats reflect the organisational needs of Lovebytes and might not align with another organisation&#039;s goals.&lt;/p&gt;&lt;p&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.opf-labs.org/download/attachments/30998628/200801.jpg&quot; style=&quot;width: 375px; height: 250px; margin: 7px; float: right;&quot; title=&quot;Image courtesy of Lovebytes&quot; /&gt;&lt;strong&gt;Business Case&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;We used the &lt;a href=&quot;http://wiki.dpconline.org/index.php?title=Digital_Preservation_Business_Case_Toolkit&quot;&gt;Digital Preservation Business Case Toolkit&lt;/a&gt; to help us get started on our &lt;a href=&quot;http://wiki.dpconline.org/index.php?title=Lovebytes_case_study&quot;&gt;Business Case&lt;/a&gt;. This was a fantastic resource and helped us shape our Case and consider all the information and options we needed to include.&lt;/p&gt;&lt;p&gt;The Business Case will form the foundation for applications for public and private funding and will be tailored to meet specific requirements. 
Through writing this, we were able to identify the potential risks to the archive, its value and how we might restage artworks or commission artists to use data from it within the preservation process.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Conclusions&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;As non-experts in digital preservation we knew we were about to encounter some steep climbs and were initially apprehensive about what lay ahead, given that most of our material had sat in a garage for ten years. Our collection, until then, had remained largely un-catalogued and, aside from being physically sealed in oversized tupperware, the digital assets had been neglected. Many items were the only copy, stored in one location, in danger of decay, damage or loss. As a small arts organisation recently hit by cuts to arts funding, Lovebytes and its archives were in a precarious position: unsupported and vulnerable.&lt;/p&gt;&lt;p&gt;The SPRUCE Award gave us the opportunity to take a step back and re-evaluate these assets, making us aware of their value and of the need to save them, and to start the preservation process. 
It has given us the opportunity to explore solutions and devise our own systems for best practice within the limited resources and funding options available to us.&lt;/p&gt;&lt;p&gt;It has allowed us to crystallize our thoughts around using the Lovebytes Media Archive to investigate digital archivism as a creative process, and specifically how digital preservation techniques may be used to capture and preserve the curatorial shape and context of arts festivals.&lt;/p&gt;&lt;p&gt;By using available resources and bringing in external expertise where necessary, we found this process rewarding, both in terms of developing new skills and in reaffirming our past, current and future curatorial practice.&lt;/p&gt;&lt;p&gt;Having undertaken this research we now feel positive about the future of the archive, and have a clear strategy for preservation and a case to take to funders and partners to secure it as an exemplar born-digital archive project which attempts to capture, preserve and represent the history of Lovebytes as a valuable record of early international digital arts practice at the turn of the century.&lt;/p&gt;&lt;p&gt;Jon Harrison and Janet Jennings of &lt;a href=&quot;http://www.lovebytes.org.uk/&quot;&gt;Lovebytes&lt;/a&gt;, and Mark Osborne of &lt;a href=&quot;http://www.nooode.co.uk/&quot;&gt;Nooode&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/spruce&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SPRUCE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 28 Nov 2013 18:32:04 +0000</pubDate>
 <dc:creator>paul</dc:creator>
 <guid isPermaLink="false">1061 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-28-spruce-project-award-lovebytes-media-archive-project#comments</comments>
</item>
<item>
 <title>The OPF Appoints New Executive Director</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-20-opf-appoints-new-executive-director</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div&gt;&lt;span style=&quot;font-family: arial, sans-serif; font-size: small;&quot;&gt;The Board of the OPF has appointed Ed Fay as the new Executive Director. Ed will join the OPF in February 2014 and will lead the organisation in its efforts to address its members&#039; digital preservation challenges with a practical, community-led approach.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;font color=&quot;#222222&quot; face=&quot;arial, sans-serif&quot; size=&quot;2&quot;&gt; &lt;/font&gt;&lt;/div&gt;&lt;div&gt;&lt;font color=&quot;#222222&quot; face=&quot;arial, sans-serif&quot; size=&quot;2&quot;&gt;Ross King, Chair of the Board, said: &quot;&lt;/font&gt;&lt;span style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt;The Board was extremely gratified to receive qualified applications from Europe, the Middle East, India, and the United States. Four top candidates were selected by all board members and were interviewed personally by a board sub-committee. After evaluating these candidates, support for Ed Fay within the committee and the OPF board was unanimous. Ed has demonstrated his understanding of the different challenges facing both libraries and archives and has a refreshing take on digital preservation from an institutional perspective. 
We look forward to working with him to enhance the visibility and reputation of the OPF and to create more value for its members&quot;.&lt;/span&gt;&lt;/div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt;&lt;img alt=&quot;&quot; src=&quot;http://www.openplanetsfoundation.org/system/files/Ed.jpg&quot; style=&quot;font-size: small; width: 100px; height: 100px; float: left;&quot; /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt;Ed commented on his appointment: &quot;I’m thrilled to join the OPF and contribute towards the development of digital preservation practice at an important time for libraries, archives, and memory institutions everywhere. The OPF’s mission is to enable collaboration and shared solutions and I look forward to working with members and the wider community to build capacity for the digital collections of the future.&quot;&lt;/div&gt;&lt;div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt; &lt;/div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt;Before his appointment, Ed was the Digital Library Manager of the London School of Economics (LSE) for 5 years. He successfully managed the development of LSE’s digital library from its inception to implementation. He also led digital preservation activities at LSE and the institution&#039;s participation in a number of related projects and working groups. 
Prior to this he worked on several mass digitisation projects funded by JISC.&lt;/div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt; &lt;/div&gt;&lt;div style=&quot;font-family: arial, sans-serif; font-size: 13px;&quot;&gt;Ed will take over the role from Bram van der Werf who has managed and grown the OPF from its foundation in 2010 to become a sustainable membership organisation.&lt;/div&gt;&lt;/div&gt;&lt;p&gt; &lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/open-planets-foundation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Open Planets Foundation&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Wed, 20 Nov 2013 08:55:34 +0000</pubDate>
 <dc:creator>becky</dc:creator>
 <guid isPermaLink="false">1060 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-20-opf-appoints-new-executive-director#comments</comments>
</item>
<item>
 <title>Establishing a Workflow Model for Audio CD Preservation</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-19-establishing-workflow-model-audio-cd-preservation</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;The preservation of audio CDs differs slightly from the preservation of CDs containing data other than audio. Data on audio CDs cannot be easily cloned for preservation, as the music industry has lobbied the main operating system developers to curtail the duplication of CDs to crack down on the mass production of pirate copies. While this is understandable from an intellectual property perspective, it is rather problematic from a preservation viewpoint.&lt;br /&gt;&lt;br /&gt;I have scoured published documents in this area but there are no comprehensive examples of best practice related to data preservation from audio CDs. There are guidebooks on the preservation of the CDs themselves but next to nothing about the preservation of the data on the audio CDs. This area requires urgent attention because audio CDs may contain at-risk and decaying audio data on a fragile medium. Certain types of audio CDs are nearing their end of life faster than others.&lt;br /&gt;&lt;br /&gt;At the SPRUCE London Mashup in July 2013 I proposed the creation of a workflow model for the preservation of audio CDs. Working mainly with Peter May (British Library) and Carl Wilson (OPF), with input from other developers at the mashup, we established that the main problem to be resolved was the lack of an open source tool to easily create a disk image or clone of the data on an audio CD.&lt;br /&gt;&lt;br /&gt;While this may seem like a straightforward project, it took no fewer than three experienced developers many hours of work before a practical solution, based on cdrdao, was proposed. 
(See: &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/Audio+CD+Preservation&quot; target=&quot;_blank&quot;&gt;an outline of the initial solution&lt;/a&gt;.)&lt;br /&gt;&lt;br /&gt;Having resolved the basic need to create a clone or disk image from an audio CD, the next step in this project was to explore how to catalogue the disk image and its contents, as well as normalise the audio files into the standard BWAV format. This was supported by a SPRUCE award (funded by JISC) covering the period August-October 2013, involving Carl Wilson and Toni Sant, with the participation of Darren Stephens from the University of Hull. Through further consultation with digital forensics experts at the British Library and elsewhere, as well as systematic development, this project has addressed this issue directly.&lt;br /&gt;&lt;br /&gt;Once the fundamental open solution was in hand, our attention could be turned to the development of a four-step workflow model for the preservation of audio CDs. The four steps are as follows:&lt;br /&gt;&lt;br /&gt;1.    Disk Imaging (stabilizing the data)&lt;br /&gt;2.    Cataloguing (through individual Cue sheets)&lt;br /&gt;3.    Data Ripping (normalising the data)&lt;br /&gt;4.    Open access to the catalogue (outputting the metadata)&lt;br /&gt;&lt;br /&gt;Working with a specific dataset (see: &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/M3P+audio+CD+Collection&quot; target=&quot;_blank&quot;&gt;an outline of the dataset&lt;/a&gt;), this project is now able to provide a practical workflow model using arcCD, the tool built on the solution proposed during the London SPRUCE mashup, for steps 1 &amp;amp; 3. An example of good practice has now been established in this under-explored area of preservation. &lt;a href=&quot;https://github.com/openplanets/arcCD&quot; target=&quot;_blank&quot;&gt;All materials produced for this project are available on GitHub&lt;/a&gt;. 
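Step 2 of the workflow hinges on cue sheets, which catalogue the tracks inside each disk image produced in step 1. A minimal sketch of reading one (the cue content and the parse_cue function are illustrative, not taken from arcCD or the M3P dataset):

```python
# Minimal cue-sheet reader illustrating step 2 (cataloguing).
# A cue sheet lists each TRACK in the disk image and the INDEX (start time,
# in mm:ss:ff frames) where it begins.

def parse_cue(text):
    """Return a list of {'number': int, 'type': str, 'indexes': {int: str}}."""
    tracks = []
    for line in text.splitlines():
        parts = line.strip().split()
        if not parts:
            continue
        if parts[0] == "TRACK":
            tracks.append({"number": int(parts[1]), "type": parts[2], "indexes": {}})
        elif parts[0] == "INDEX" and tracks:
            # INDEX 01 is the audible start of the track.
            tracks[-1]["indexes"][int(parts[1])] = parts[2]
    return tracks

# Illustrative cue sheet for a two-track audio disc image.
cue = """FILE "disc.bin" BINARY
  TRACK 01 AUDIO
    INDEX 01 00:00:00
  TRACK 02 AUDIO
    INDEX 01 04:31:20
"""

tracks = parse_cue(cue)
print(len(tracks))              # number of audio tracks catalogued
print(tracks[1]["indexes"][1])  # start time of track 2
```

Per-track metadata extracted this way can then feed the cataloguing and metadata-output steps of the workflow.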
Darren Stephens is also integrating further development on outputting the metadata into MediaWiki for easy access and editing of the catalogue, as part of his PhD research project entitled &#039;A Framework for Optimised Interaction Between Mediated Memory Repositories and Social Media Networks.&#039;&lt;br /&gt;&lt;br /&gt;The initial dataset used for the development of this project is managed by the &lt;a href=&quot;http://www.m3p.com.mt&quot;&gt;Malta Music Memory Project (M3P)&lt;/a&gt;, which seeks to provide an inclusive repository for memories of Maltese music and associated arts, ensuring that these are kept for posterity for current and future generations. M3P is one of the projects within the &lt;a href=&quot;http://www.hull.ac.uk/mamri&quot;&gt;Media and Memory Research Initiative (MaMRI)&lt;/a&gt; of the University of Hull and it is facilitated by the M3P Foundation, a voluntary organization registered in Malta.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/spruce&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SPRUCE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Tue, 19 Nov 2013 13:49:33 +0000</pubDate>
 <dc:creator>tonisant</dc:creator>
 <guid isPermaLink="false">1059 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-19-establishing-workflow-model-audio-cd-preservation#comments</comments>
</item>
<item>
 <title>COPTR tools registry beta launch</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-14-coptr-tools-registry-beta-launch</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div&gt;&lt;img alt=&quot;&quot; src=&quot;http://coptr.digipres.org/images/6/66/Coptrlogo2.png&quot; style=&quot;margin: 1px; float: left;&quot; /&gt;Almost a year ago, I presented a &lt;a href=&quot;http://openplanetsfoundation.org/blogs/2013-01-08-creating-community-owned-digital-preservation-tool-registry-coptr&quot;&gt;proposal to the Aligning National Approaches to Digital Preservation (ANADP) group&lt;/a&gt; to create a community tool registry. I was frustrated by the profusion of tool registries and the lack of coordination between them. Pooling the knowledge in one place would result in a far better resource. It would be easier to discover new tools, to share experience in using them and to help avoid the tool development duplication we&#039;ve seen so much of in the past. As &lt;a href=&quot;http://www.educopia.org/events/ANADPII&quot;&gt;ANADPII&lt;/a&gt; kicks off today in Barcelona, I&#039;m very pleased to announce the beta launch of &lt;a href=&quot;http://coptr.digipres.org&quot;&gt;COPTR: the Community Owned digital Preservation Tool Registry&lt;/a&gt;.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;!--break--&gt;&lt;div&gt;We&#039;ve been working to collate, combine, de-duplicate and align the contents of five existing tool registries from: the Open Planets Foundation (OPF), the National Digital Stewardship Alliance (NDSA), the Digital Curation Centre (DCC), the Digital Curation Exchange (DCE) and the Digital POWRR Project. There were of course quite a few duplicates to weed out, but the scope and depth of COPTR now supersedes anything out there that I&#039;ve seen previously, albeit with an inconsistency of depth between the resulting tool registry entries. Each source registry had its own differing characteristics. 
At one end of the scale the DCC registry had really strong detail but coverage of well under a hundred tools. The DCE registry included over three hundred tools, but with each tool described in far less detail.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;After much debate and consideration of &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-05-29-feedback-requested-collaborative-digital-preservation-tool-registry&quot;&gt;feedback&lt;/a&gt; from many sources (thanks to everyone who got in touch), we&#039;ve settled on the all-important tool registry structure and a technology with which to manage the data: MediaWiki. We&#039;ve kept the structure minimal to make creation of new entries relatively quick. We&#039;ve also kept things factual. Experiences and evidence of using tools can be captured elsewhere and referenced from COPTR. MediaWiki provides an environment that enables easy navigation of the registry (probably most usefully by &lt;a href=&quot;http://coptr.digipres.org/Category:Function&quot;&gt;browsing via a tool&#039;s function&lt;/a&gt;) and that is quite straightforward for managing the data and &lt;a href=&quot;http://coptr.digipres.org/Using_the_COPTR_data_feed&quot;&gt;providing a feed of the data&lt;/a&gt;. A nice touch in the registry is the use of RSS feeds and &lt;a href=&quot;http://www.ohloh.net/&quot;&gt;Ohloh&lt;/a&gt; widgets to indicate how well supported (or otherwise) the codebase of a particular tool is. See an example on the &lt;a href=&quot;http://coptr.digipres.org/Archivematica&quot;&gt;Archivematica page here&lt;/a&gt;.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So what happens next? The COPTR approach is not just to pool existing data in one place, but to remove the source registries from the web. The contributing organisations have committed to doing this, but first, of course, they need to be happy that COPTR is ticking all the right boxes as a genuine replacement. 
So the next phase will be to take on board any final comments and ensure everyone is completely happy to move forward. The onus will then be on the contributors to remove their registries and perhaps explore utilising a feed of data from COPTR on their own sites.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Some thought also needs to be given to differences in the aims and scope of the existing registries and COPTR. The Digital POWRR grid, for example, does a different job which doesn&#039;t easily align with COPTR. So there is some discussion to be had with the POWRR team over the next few days on how (and if) we might be able to bring things together more closely.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Perhaps most importantly we need *you* to help make this community resource a success. The data is still far from perfect. It needs tweaks, it needs more URLs, it needs entries for those important tools that are still missing. And most importantly it needs more references to your &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-23-digital-preservation-war-stories&quot;&gt;digital preservation war stories&lt;/a&gt;.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Looking ahead we need to develop a roadmap, think about bringing in other registries, look at how we can encourage further editing and enhancement of COPTR data, and sound out interest in a hackathon to do cool things with the data feed. We would of course also appreciate any feedback on this beta launch of COPTR.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;In a parallel action, an informal group of experts is looking to bring a question and answer site to digipres.org, to replace the abortive DP Stack Exchange. 
Watch this space....&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Massive thanks go to the organisations who made this initiative possible, and kudos also to Andy Jackson for his terribly clever MediaWiki skills.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Paul Wheatley&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/spruce&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SPRUCE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 14 Nov 2013 18:43:38 +0000</pubDate>
 <dc:creator>paul</dc:creator>
 <guid isPermaLink="false">1058 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-14-coptr-tools-registry-beta-launch#comments</comments>
</item>
<item>
 <title>SPRUCE Project Award: Northumberland Estates</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-13-spruce-project-award-northumberland-estates</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;h4&gt;&lt;strong&gt;Using the Digital Preservation Business Case Toolkit to justify Digital Repository investment&lt;/strong&gt;&lt;/h4&gt;&lt;div&gt;&lt;a href=&quot;http://www.northumberlandestates.co.uk/&quot;&gt;Northumberland Estates&lt;/a&gt; (NE) were delighted to be awarded a SPRUCE Award to carry out a detailed analysis of current digital repository solutions suitable for small to medium organisations. In conjunction with the University of London Computer Centre (&lt;a href=&quot;http://ulcc.ac.uk/&quot;&gt;ULCC&lt;/a&gt;), they created a toolkit justifying investment in a recommended solution. The business case argues for implementing a sustainable digital repository for the long-term management of Northumberland Estates&#039; digital content. With a particular focus on small to medium organisations, this project aims to address the digital preservation community&#039;s lack of knowledge about preservation-as-a-service (PaaS) providers.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;!--break--&gt;&lt;div style=&quot;margin-left: 320px;&quot;&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.dpconline.org/images/1/14/Icon_document_web.png&quot; style=&quot;width: 100px; height: 100px;&quot; /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;strong&gt;Methodology&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Objective: Produce a specification detailing exact requirements for procurement of a digital repository&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;There are a number of high-level requirements which the adopted solution must meet. For this purpose, we created an organisational and technical assessment based on the methodology of the OAIS Reference Model. 
The technical specification is essentially a “shopping list” of what the chosen system has to do to perform digital preservation. The overall aim was to keep the specification concise, manageable and realistic so that it would meet the immediate business needs of NE, while also adhering to best practice.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Objective: Case studies analysed against specification&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The specification was recast into a form that could be used for assessing a preservation solution. Before the product analysis was carried out, three potential solutions were identified:&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;1. &lt;em&gt;Open Source&lt;/em&gt;: Many Higher Education institutions already have mature repository instances through the use of open source software such as &lt;a href=&quot;http://www.dspace.org/&quot;&gt;DSpace&lt;/a&gt;, &lt;a href=&quot;http://www.eprints.org/&quot;&gt;EPrints&lt;/a&gt;, and &lt;a href=&quot;http://www.fedora-commons.org/&quot;&gt;Fedora&lt;/a&gt;.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;2. &lt;em&gt;Out of the Box&lt;/em&gt;: PaaS providers such as Tessella&#039;s &lt;a href=&quot;http://preservica.com/&quot;&gt;Preservica&lt;/a&gt; and Ex Libris&#039;s &lt;a href=&quot;http://www.exlibrisgroup.com/category/RosettaOverview&quot;&gt;Rosetta&lt;/a&gt; provide active preservation and curation of digital assets.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;3. &lt;em&gt;Hybrid&lt;/em&gt;: A combination of commercial in-house/open source systems. 
For example, &lt;a href=&quot;https://www.google.co.uk/search?q=arkivum&amp;amp;rlz=1C1TEUA_enGB507GB507&amp;amp;oq=arkivum&amp;amp;aqs=chrome..69i57j69i65j69i60j0l3.3292j0j4&amp;amp;sourceid=chrome&amp;amp;espv=210&amp;amp;es_sm=122&amp;amp;ie=UTF-8&quot;&gt;Arkivum&lt;/a&gt; provides bit-level preservation, while open source OAIS digital preservation systems such as &lt;a href=&quot;https://www.archivematica.org/wiki/Main_Page&quot;&gt;Archivematica&lt;/a&gt; can provide the extra level of preservation required for the creation of SIPs, AIPs, and DIPs.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;By conducting a product analysis for each of these options, a much greater understanding of their functional capabilities was formed.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Objective: ISO 16363 assessment of NE&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The product analysis provided a solid benchmark for the functional aspects of each repository option, but it was felt that the results tended to emphasise the performance of the software rather than the needs of the producers, consumers, or archivist. To balance this trend, the project team took on an extra objective that was not in the project’s original scope.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Broader requirements not captured by simply covering repository software functionality needed to be considered. In particular: storage and bit-preservation resilience (how many copies of each file, storage in different locations); who will ingest content and where; what user roles will exist; and how and where users will access the data. To cover these gaps, the Digital Curator expressed them in narrative form as a “basic information and workflow story” about the work of NE.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The project team agreed to address the requirements by conducting a cut-down ISO 16363 assessment. 
This organisational analysis was explicitly intended to complement and enhance the assessment of the repository solution, and it produced a mini gap analysis of the digital preservation capacity of NE. ULCC’s expertise was used to validate these assessments against wider expert opinion, so the results summarise how and whether each requirement has been met, or could be met in the future.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Objective: Business Case&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The final business case needed to be as concise and targeted as possible. The decision was made to take one recommendation forward based on the functional and organisational assessments:&lt;/div&gt;&lt;p&gt;1. &lt;em&gt;Open Source&lt;/em&gt;: Previous research undertaken by the Digital Curator indicated that the implementation of an open source digital repository would not be feasible due to the investment and expertise required.&lt;/p&gt;&lt;p&gt;2. &lt;em&gt;Out of the Box&lt;/em&gt; (&lt;strong&gt;recommended option&lt;/strong&gt;): Preservica scored very highly and also proved to be the most cost effective solution based on initial calculations. Other out of the box solutions were considered, such as Ex Libris Rosetta, but the cost of implementing this system in-house was prohibitive.&lt;/p&gt;&lt;p&gt;3. &lt;em&gt;Hybrid&lt;/em&gt;: The combination of the OAIS-compliant Archivematica with bit-level preservation provided by Arkivum was considered. However, this combination was neither as comprehensive nor as cost effective as an out of the box solution.&lt;/p&gt;&lt;div&gt;Once the recommended option was decided, the guidance of the Digital Preservation Business Case Toolkit was used to form the final business case. 
The result was a clear, to-the-point justification based on expert knowledge, which was presented internally to key stakeholders within NE.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div style=&quot;margin-left: 320px;&quot;&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.dpconline.org/images/b/b6/Icon_image_web.png&quot; style=&quot;width: 100px; height: 100px;&quot; /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Lessons Learnt&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;There is no one-size-fits-all solution!&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;ul&gt;&lt;li&gt;Much of what you conclude will depend on your own organisational context, which can influence the right approach to digital preservation. However, it is hoped that this project can establish a methodology which other small to medium organisations can adopt.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Identify existing business drivers/organisational goals.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;ul&gt;&lt;li&gt;Aligning organisational goals from the outset will save you a great deal of work further down the line. By identifying these key drivers you can begin to build up support for your recommended solution before the big pitch to senior management.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Use existing work already available.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;ul&gt;&lt;li&gt;There are a number of fantastic resources out there which can save you from reinventing the wheel. The first and most obvious point of contact is the new &lt;a href=&quot;http://wiki.dpconline.org/index.php?title=Digital_Preservation_Business_Case_Toolkit&quot;&gt;Digital Preservation Business Case Toolkit&lt;/a&gt;. 
It is a fantastic resource that includes everything you need to get started.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Lay out the options clearly and concisely.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;ul&gt;&lt;li&gt;Nail down upfront costs for at least the first three years. After all, you want a solution which can be sustained into the future. For any costs, include the benefits and any potential returns on investment that can be identified.&lt;/li&gt;&lt;/ul&gt;&lt;div style=&quot;margin-left: 320px;&quot;&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.dpconline.org/images/3/3d/Icon_video_web.png&quot; style=&quot;width: 100px; height: 100px; float: left;&quot; /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;strong&gt;Conclusions&lt;/strong&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;We believe that both the methodology and the actual outputs will have reuse value for other small organisations. With the Specification document and the Organisational Assessment form we have achieved a credible specification and assessment method that is a good fit for NE. These two forms are also provided blank, in the hope that other organisations can use them. Our methodology shows it would be possible for any small organisation to devise its own suitable specification. It is based not exclusively on OAIS, but on the business needs of NE and a simple understanding of the user workflow. There are other methods of assessment; for example, the MoSCoW method could be used instead of a weighted score.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;With a thorough assessment of the solutions, NE stands a better chance of selecting the right system for their business needs, using a process that can be repeated and objectively verified. This method should be regarded as quick and easy. 
Since we used supplier information, the success of the method depends on whether that information is accurate and truthful. But it would be a good first step towards selecting a supplier. More in-depth assessments of systems are possible.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;With the ISO 16363 assessment we can show that it is possible for an organisation to perform a credible, cut-down ISO self-assessment in a very short time. This could be a viable alternative to using an expensive consultant. It must be noted that these outputs do not represent a shortcut to a full ISO assessment. The methodology and outputs instead demonstrate how smaller organisations can carry out a similar process to assess their own digital preservation requirements.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The results from this project are clearly encouraging for small to medium organisations that wish to address the problems associated with digital preservation. There is a variety of emerging solutions, from out of the box systems like Preservica to open source digital preservation systems such as Archivematica. With the correct buy-in from stakeholders and investment in time, resources, and expertise, smaller organisations can implement solutions which will preserve digital content in a sustainable manner.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;However, procuring preservation systems is by no means a straightforward task. The current market remains relatively small and there are limited options to choose from. 
If small organisations (no matter which sector they belong to) are to be convinced of the worth of investing in digital preservation systems there needs to be greater advocacy within the wider digital preservation community, and increased competition amongst vendors who provide such solutions.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The full Northumberland Estates case study can be found at:&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;http://wiki.dpconline.org/index.php?title=Northumberland_estates_case_study&quot;&gt;http://wiki.dpconline.org/index.php?title=Northumberland_estates_case_study&lt;/a&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Christopher Fryer – &lt;em&gt;Digital Curator and Assistant Records Manager&lt;/em&gt;, Northumberland Estates&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Edward Pinsent – &lt;em&gt;Digital Archivist/Project Manager&lt;/em&gt;, University of London Computer Centre (ULCC)&lt;/div&gt;&lt;/div&gt;&lt;p&gt; &lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/spruce&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SPRUCE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Wed, 13 Nov 2013 13:40:50 +0000</pubDate>
 <dc:creator>Chris Fryer</dc:creator>
 <guid isPermaLink="false">1057 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-13-spruce-project-award-northumberland-estates#comments</comments>
</item>
<item>
 <title>FITS Blitz</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-06-fits-blitz</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div&gt;&lt;img alt=&quot;&quot; src=&quot;http://wiki.dpconline.org/images/d/d8/Analyse_web.png&quot; style=&quot;width: 300px; height: 255px; margin: 1px; float: left;&quot; title=&quot;Analysing digital objects&quot; /&gt;&lt;a href=&quot;https://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; is a classic case of a great digital preservation tool that was developed with an initial injection of resources, and whose creator (Harvard University) has subsequently struggled to maintain it. But let me be very clear: Harvard deserves no blame for this situation. They&#039;ve created a tool that many in our community have found particularly useful, but they have been left to maintain it largely on their own.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;Wouldn&#039;t it be great if different individuals and organisations in our community could all chip in to maintain and enhance the tool? Wrap new tools, upgrade outdated versions of existing tools, and so on? Well, many have started to do this, including some injections of effort from my own project, &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/Home&quot;&gt;SPRUCE&lt;/a&gt;. What a lovely situation to be in, seeing the community come together to drive this tool forward...&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;!--break--&gt;&lt;div&gt;Unfortunately we were perhaps a little naive about the effort and mechanics needed to make this happen as a genuine open source development. FITS is a complex beast, wrapping a good number of tools that extract a multitude of information about your files, which FITS then normalises. What happens when you tweak one bit of code? Does the rest of the codebase still work as it should? 
Obviously you need to have confidence in a tool if it plays a critical role in your preservation infrastructure.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;From the point of view of the SPRUCE Project, we&#039;d like to see all the &lt;a href=&quot;https://github.com/harvard-lts/fits/network&quot;&gt;latest tweaks and enhancements to FITS&lt;/a&gt; brought together so that the practitioners we&#039;re supporting get a more effective tool. We equally want future improvements to find their way into the codebase in a managed and dependable way, so that upgrading to a new FITS version doesn&#039;t involve lots of testing for every organisation using it.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two-week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. &quot;FITS Blitz&quot; will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven&#039;t been damaged by the changes.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;FITS Blitz commences next Monday. Please get in touch with me, or with Carl Wilson from the Open Planets Foundation, if you&#039;d like to find out more.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/spruce&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SPRUCE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Wed, 06 Nov 2013 11:31:26 +0000</pubDate>
 <dc:creator>paul</dc:creator>
 <guid isPermaLink="false">1056 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-06-fits-blitz#comments</comments>
</item>
<item>
 <title>SCAPE/OPF Continuous Integration update</title>
 <link>http://openplanetsfoundation.org/blogs/2013-11-01-scapeopf-continuous-integration-update</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;As previously &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-27-scape-software-needs-you&quot;&gt;blogged about by Carl&lt;/a&gt;, we now have virtually all SCAPE and OPF projects in continuous integration, building and unit testing in both Travis CI and &lt;a href=&quot;http://jenkins.opf-labs.org/&quot;&gt;Jenkins&lt;/a&gt;. &lt;/p&gt;&lt;ul&gt;&lt;li&gt;Travis compiles the projects and executes unit tests whenever a new commit is pushed to Github, or when a pull request is submitted to the project. &lt;/li&gt;&lt;li&gt;Jenkins builds are generally scheduled once per day. After a build, the software has its code quality analysed by &lt;a href=&quot;http://sonar.opf-labs.org/&quot;&gt;Sonar&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;!--break--&gt;&lt;p&gt;Complete details of how to build each non-Java project are contained within the .travis.yml files found in the project directories. As a side effect of this work, the .travis.yml files can be used as instructions for independently building the projects.&lt;/p&gt;&lt;p&gt;Matchbox, Xcorrsound and Jpylyzer have CI builds that are capable of generating an installable Debian package, which we are aiming to publish.  
Java projects have had their Maven GroupId and package names changed to the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Technical+Implementation+Guidelines#TechnicalImplementationGuidelines-NamingConventions&quot;&gt;appropriate SCAPE names&lt;/a&gt; so we can publish binary snapshots.&lt;/p&gt;&lt;p&gt;The daily Maven snapshots of code built in Jenkins are now (or soon will be) published to &lt;a href=&quot;https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/&quot;&gt;https://oss.sonatype.org/content/repositories/snapshots/eu/scape-project/&lt;/a&gt; and can be consumed by inheriting the Sonatype OSS parent POM in your pom.xml, which makes the snapshot repository available:&lt;/p&gt;
&lt;pre style=&quot;margin-left: 36pt;&quot;&gt;
&amp;lt;parent&amp;gt;
    &amp;lt;groupId&amp;gt;org.sonatype.oss&amp;lt;/groupId&amp;gt;
    &amp;lt;artifactId&amp;gt;oss-parent&amp;lt;/artifactId&amp;gt;
    &amp;lt;version&amp;gt;7&amp;lt;/version&amp;gt;
&amp;lt;/parent&amp;gt;&lt;/pre&gt;
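As a rough illustration of how a single snapshot could then be pulled down for testing without touching a pom.xml at all, the Maven dependency plugin can fetch an artifact directly from the snapshot repository above. Note that the groupId:artifactId:version coordinates below are placeholders for illustration only, not a real published SCAPE artifact:

```shell
# Illustrative sketch: fetch one snapshot artifact straight from the Sonatype
# OSS snapshot repository into the local Maven repository.
# Replace eu.scape-project:example-tool:1.0-SNAPSHOT with the real coordinates
# of the project you need (the coordinates shown here are hypothetical).
mvn org.apache.maven.plugins:maven-dependency-plugin:2.8:get \
    -DremoteRepositories=https://oss.sonatype.org/content/repositories/snapshots \
    -Dartifact=eu.scape-project:example-tool:1.0-SNAPSHOT
```

For regular builds, declaring the repository in the pom.xml (as the parent POM above arranges) is preferable, since snapshot updates are then resolved automatically at build time.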
&lt;p&gt;What you can do for your project:&lt;/p&gt;&lt;ol&gt;&lt;li&gt;&lt;strong&gt;Maintain your .travis.yml file&lt;/strong&gt; if project dependencies change&lt;/li&gt;&lt;li&gt;Ensure code matches the SCAPE/OPF functional review criteria – &lt;strong&gt;correct Java package names and Maven GroupIds&lt;/strong&gt; are essential to be able to publish snapshots&lt;/li&gt;&lt;li&gt;Ensure your project has an up to date &lt;strong&gt;README&lt;/strong&gt; that contains details of how to build and run your software (including dependencies)&lt;/li&gt;&lt;li&gt;Very importantly, ensure that your project has (at the very least) a top level &lt;strong&gt;LICENSE&lt;/strong&gt;; ideally source files should each contain a license header&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Add unit tests for your project&lt;/strong&gt;&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Ensure that unit tests for your project can easily be run using standard dependencies&lt;/strong&gt;. Relying on your particular installation for unit tests to pass means that they cannot be successfully run by Travis/Jenkins and will show up as test failures. Whilst it might not always be possible to have unit tests that can be run independently, if there have to be test dependencies then please document how these should be set up!&lt;/li&gt;&lt;li&gt;&lt;strong&gt;Check your project at &lt;a href=&quot;http://projects.opf-labs.org/&quot;&gt;http://projects.opf-labs.org/&lt;/a&gt; &lt;/strong&gt;&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;The CI days generally happen about once a month. If you are interested in joining us, do let us know as we could always do with more help.  
It’s an opportunity for you to work on CI with Travis/Jenkins, and to do other interesting (and rewarding) work, such as Debian packaging, that you might not normally get to do.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/tools&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Tools&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/packaging&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Packaging&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/open-planets-foundation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Open Planets Foundation&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/software&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Software&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/jpylyzer&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; 
datatype=&quot;&quot;&gt;jpylyzer&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Fri, 01 Nov 2013 10:19:20 +0000</pubDate>
 <dc:creator>willp-bl</dc:creator>
 <guid isPermaLink="false">1054 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-11-01-scapeopf-continuous-integration-update#comments</comments>
</item>
<item>
 <title>Software Museums (Archives)</title>
 <link>http://openplanetsfoundation.org/blogs/2013-10-10-software-museums-archives</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;During and around &lt;a href=&quot;http://ipres2013.ist.utl.pt/&quot;&gt;this year&#039;s iPRES&lt;/a&gt; a couple of discussions sprang up around the topic of proper software archiving, and it was also part of the &lt;a href=&quot;http://digitalpreservationchallenges.wordpress.com/&quot;&gt;DP challenges workshop&lt;/a&gt; discussions. With services emerging around emulation, such as those developed in the &lt;a href=&quot;http://bw-fla.uni-freiburg.de/wordpress/?page_id=7&quot;&gt;bwFLA project&lt;/a&gt; (see e.g. the blog posts on the &lt;a href=&quot;http://openplanetsfoundation.org/blogs/2013-08-01-bwfla-demo-emulation-based-ingest-and-access-workflows&quot;&gt;EaaS demo&lt;/a&gt; or &lt;a href=&quot;http://openplanetsfoundation.org/blogs/2013-03-18-bwfla-demo-emulation-service-eaas-and-digital-art-curation&quot;&gt;Digital Art curation&lt;/a&gt;), proper measures need to be taken to make them sustainable on the software side. There are hardware museums around; something similar might be desirable for software too.&lt;/p&gt;&lt;p&gt;Research data, business processes, digital art and generic digital artefacts often cannot be viewed or handled simply by themselves; instead they require a specific software and hardware environment to be accessed or executed properly. Software is a necessary mediator for humans to deal with and understand digital objects of any kind. In particular, artefacts based on any one of the many complex and domain-specific formats are often best handled by matching them with the application they were created with. Software can be seen as the ground truth for any file format: it is the software that creates files that truly defines how those files are formatted.  
&lt;/p&gt;&lt;p&gt;To make old software environments available on an automatable and scalable basis (for example, via &lt;a href=&quot;http://openplanetsfoundation.org/blogs/2013-01-08-emulation-service&quot;&gt;Emulation-as-a-Service&lt;/a&gt;), proper enterprise-scale software archiving is required. At first glance the task appears huge because of the large amount of software that has been produced in the past. Nevertheless, much of the software that has been created is standard software, used more or less all over the world, and there is a lot of low-hanging fruit that would be highly beneficial to preserve and make available. If components of software can be uniquely described, deduplication should also reduce the overall workload significantly. For at least a significant proportion of the software to be covered, licensing might complicate the whole issue considerably, as different software licensing variants were deployed in different domains and different parts of the world, and current copyright and patent law differs between jurisdictions in how it applies to older software.&lt;/p&gt;&lt;h3&gt;Types of Software&lt;/h3&gt;&lt;p&gt;Institutions and users have to decide which software needs to be preserved, how, and by whom. The answers to these questions will depend on the intended use cases. In simpler cases, all that may be needed to render preserved artefacts in emulated original environments is a few standard office or business environments with standard software. Complex use cases may require very special non-standard, custom-made software components from non-standard sources, such as use cases involving development systems or the preservation of complex business processes.&lt;br /&gt; &lt;br /&gt;Software components required to reproduce original environments for certain (complex) digital objects can be classified in several ways. 
Firstly, there are the standard software packages like operating systems and off-the-shelf applications sold in (significant) numbers to customers. Secondly, there can be different releases and various localized versions (the user interaction part of a software application is often translated into different languages, as in Microsoft Windows or Adobe products), but otherwise the copies are often exactly the same. In general it does not really matter if it is a French, English, or German Word Perfect version being used to interact with a document. But for the user dealing with it, or for an automated process such as migration-through-emulation, the different labeling of menu entries and error messages matters.&lt;br /&gt;&lt;br /&gt;The concept of versions is somewhat different for Open Source or Shareware-like software. Often there are many more &quot;releases&quot; available than with commercial software, as the software usually gets updated regularly and does not necessarily have a distinct release cycle. Also, unlike commercial software, open source packages often feature full localization, as they did not need to distinguish between different markets.&lt;br /&gt;&lt;br /&gt;In many domains custom-made software and user programming play a significant role. This can be scripts or applications written by scientists to run their analysis on gathered data, run specific computations, or extend existing standard software packages. Or it could be software tools written for governmental offices or companies to produce certain forms or implement and configure certain business processes. Such software needs to be taken care of and stored alongside the preserved base files of an object in order to ensure they can be accessed and interacted with in the future. 
The same applies to complex setups of standard components with lots of very specific configurations.&lt;br /&gt;&lt;br /&gt;If such standard software is required, it would make sense to be able to assign each instance a unique identifier. This would help to de-duplicate efforts to store copies. Even if a memory institution or commercial service maintains its own copy, it does not necessarily need to replicate the actual bits if other copies are already available somewhere. It may simply be able to manage its own licenses and use the bits/software copies provided by a central service. Additionally, it would simplify efforts to reproduce environments in an efficient way.&lt;/p&gt;&lt;h3&gt;What Should be Identified?&lt;/h3&gt;&lt;p&gt;Some ideas about how to identify and describe software have already been discussed for the upcoming PREMIS 3.0 standard, in particular for the section regarding environments. Suitable persistent identifiers would definitely be helpful for tagging software: something like the ISBNs and ISSNs that describe books and other media, or the DOIs that are becoming ubiquitous for digital artefacts. These tags would be useful for tool registries like TOTEM as well, or could map to PRONOM PUIDs. Three layers of identification could become relevant:&lt;br /&gt; &lt;/p&gt;&lt;ul&gt;&lt;li&gt;On the most abstract layer a software instance is described as a complete package, e.g. Windows 3.11 US Edition, Adobe Page Maker Version X or Command &amp;amp; Conquer II, containing all the relevant installation media, license keys etc. The ID of such a package could be the official product code or derived from it. However, with such an approach it might be difficult to distinguish hidden updates: for example, during the software archiving experiment at Archives New Zealand we acquired and identified two different package sets of Word Perfect 6.0. 
So a more nuanced approach may be required.&lt;/li&gt;&lt;li&gt;At the layer of the different media (relevant only if it is not just one downloaded installation package) each floppy disk or each optical medium (or USB media) could be distinguished. For example, Windows 3.11, as well as applications like Word Perfect, came with specific disks just for the printer drivers, and the CD (1 or 2) used in the Command &amp;amp; Conquer game determined which adversary in the game you were assigned to.&lt;/li&gt;&lt;li&gt;At the individual file layer, executables, libraries, and helper files like font sets could be distinguished. The number of items in this set is the largest. An approach centered on maintaining a collection of digital signatures of known, traceable software applications is followed e.g. by the NSRL (&lt;a href=&quot;http://www.nsrl.nist.gov/&quot;&gt;National Software Reference Library&lt;/a&gt;) and may be the most appropriate option at this level.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;&lt;br /&gt;Usually it is not trivial to map the installed files in an environment to files on the installation medium, as the files typically get packed (compressed in ‘archive’ files) on the medium and some files get created from scratch during the installation procedure.&lt;br /&gt;&lt;br /&gt;Depending on the actual goal, the focus of the IDs will be different. To derive what kind of application or operating system is installed on a machine, file-level identifiers will be needed. To reproduce a particular original environment (e.g. for emulation), package-level identifiers are more relevant. In some cases it may be useful to address a single carrier, e.g. 
to automate installation processes of standard environments consisting of an operating system and a couple of applications.&lt;br /&gt;&lt;br /&gt;For the description of software and environments it might be useful to investigate what can be learned from commercial software installation handling and lifecycle management. Large institutions and companies have well-defined workflows to create software environments for certain purposes, and their approaches may be directly applicable to the long-term preservation use case(s).&lt;/p&gt;&lt;h3&gt;Software Museum or Archive&lt;/h3&gt;&lt;p&gt;What should be archived, who are the stakeholders and users, and how can the archive be supported?&lt;br /&gt;&lt;br /&gt;A model for nearly full archiving of a domain is the Computer Games Museum in Berlin, which receives every computer game that requires a &lt;a href=&quot;http://www.usk.de/en/&quot;&gt;USK&lt;/a&gt; classification (USK is the German abbreviation for the Entertainment Software Self-Regulation Body, an organisation voluntarily established by the computer games industry to classify computer games). The collection is supplemented by donations of a wide range of software (operating systems, popular non-gaming applications) and hardware items (computers, gaming consoles, controllers). Thus, the museum has acquired a nearly complete collection of the domain. An upcoming problem is the rising number of browser and online games that never get a representation on a physical medium. Another unresolved issue is the maintenance of the collection. 
At the moment the museum does not even have enough funds for bitstream preservation and proper cataloguing of the collection.&lt;br /&gt;&lt;br /&gt;Archiving (of standard software) already takes place, for example, at the Computer History Museum, the Australian National Library, the National Archives of New Zealand or the &lt;a href=&quot;http://archive.org/details/software&quot;&gt;Internet Archive&lt;/a&gt;, to mention a few. Unfortunately, these activities are not coordinated. Neither the mostly &quot;dark archives&quot; of memory institutions nor the online sites offering deprecated software of questionable origin are sufficient for a sustainable strategy. Nevertheless, landmark institutions like national libraries and archives could be a good place to archive software in a general way. However, the archived software is only of use if it is properly described with standard metadata. Ideally, the software repositories would provide APIs to communicate with a central software archive and attach services to it. The service levels could range from just offering metadata to offering access to complete software packages. In addition to the basic services, museums could offer interactive access to selected original environments, as there is a significant difference between having a software package merely bit-stream preserved and having it available to explore and test interactively for a particular purpose. Often, specific, implicit knowledge is required to get a software item up and running, so keeping instances running permanently would be of great benefit. Archiving institutions like museums could try to build online communities around platforms and software packages. Live &quot;exhibition&quot; of software helps community exchange and can attract users with knowledge who would otherwise be difficult to find.&lt;br /&gt;&lt;br /&gt;Software museums can help to reduce the duplicated effort of archiving and describing standard software. 
It can at least help that not every archive needs to store multiple copies of standard software but can simply refer to other repositories. Software museums or archives could become brokers for (obsolete) software licenses. They could serve as a place to donate software (from public and private entities), firmware and platform documentation. Such institutions could simplify the proceedings for a software company to take care of its digital legacy. A one-stop institution might be much more attractive to software vendors and archival institutions than the possible alternative of having multiple parties negotiating license terms of legacy packages with multiple stakeholders (software companies might have a positive attitude towards such a platform, or lawmakers could be persuaded to push it a bit). Software escrow services (discussed e.g. within the &lt;a href=&quot;http://timbusproject.net/&quot;&gt;TIMBUS EU project&lt;/a&gt;) can complement these activities. A museum could operate in different modes, such as a not-for-profit branch for public presentation, community building, education etc. and a commercial branch to lend or lease out software to reproduce environments in emulators for commercial customers.&lt;br /&gt;&lt;br /&gt;The situation could be totally different for research institutions and users of custom-made software. Such packages do not necessarily make sense in a (public) repository. In such cases the question arises of how the licensing will be handled. If obsolete, they could be handed over to the archive managing the primary research data.&lt;br /&gt;&lt;br /&gt;Another issue is the handling of software versions. Products are updated until their announced end-of-life. Would it be necessary to keep every intermediate version, or to concentrate on general milestones? 
An operating system like &#039;&#039;Windows XP&#039;&#039; (32bit) was officially available in several flavors (like &#039;&#039;Home&#039;&#039; or &#039;&#039;Professional&#039;&#039;) from 2001 until 2014. In many cases a &#039;&#039;fuzzy matching&#039;&#039; would be acceptable, as a certain software package runs properly in all versions. Other software might require a very specific version to function properly. This needs to be addressable (and could be matched to the appropriate PRONOM environment identifiers). Plus, there are a couple of preservation challenges in the software lifecycle.&lt;/p&gt;&lt;h3&gt;Discussion&lt;/h3&gt;&lt;p&gt;There are a number of questions which arise when creating or running a software archive or museum:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;On which level should a software archive be run: institutional (e.g. for larger national research institutions), state, federal or global, or should a federated approach be favoured?&lt;/li&gt;&lt;li&gt;Does it make sense (at all) to run a centralized software archive of a relevant size, assuming that for modern, complex scientific environments the software components are much too individual? What kind of software would be useful in such an archive? Which versions should be kept?&lt;/li&gt;&lt;li&gt;Would it be possible to establish a PRONOM-like identifier system (agreed upon and shared among the relevant memory institutions)? 
Or use the DOI system to provide access to the base objects?&lt;/li&gt;&lt;li&gt;How, and through which APIs, should software and/or metadata be offered (or ingested)?&lt;/li&gt;&lt;li&gt;How should the software archive adapt to the ever-changing form of installation media, from tapes and floppies to optical media of different types to solely network-based installations?&lt;/li&gt;&lt;li&gt;Would it be possible to run the software archive as a backend, where locally ingested software is stored in the end?&lt;/li&gt;&lt;li&gt;Is the advantage gained by centralizing knowledge and storage of standard software components big enough to outweigh the effort required to run such an archive?&lt;/li&gt;&lt;li&gt;Do proper software license and handling models exist for such an archive, like donation of licenses, taking over abandoned packages, or escrow services? Would it be possible to bridge the diverse interests of users of a wide range of software and of software producers?&lt;/li&gt;&lt;li&gt;Would there be advantages in running such an archive as, or within, a non-profit organisation? What business model would make most sense for such an organisation?&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/characterisation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Characterisation&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/emulation&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; 
datatype=&quot;&quot;&gt;Emulation&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-actions/migration&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Migration&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Resources&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/software&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Software&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Thu, 10 Oct 2013 13:51:31 +0000</pubDate>
 <dc:creator>Dirk von Suchodoletz</dc:creator>
 <guid isPermaLink="false">1051 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-10-10-software-museums-archives#comments</comments>
</item>
<item>
 <title>Measuring Bigfoot</title>
 <link>http://openplanetsfoundation.org/blogs/2013-10-08-measuring-bigfoot</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;My previous blog &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-30-assessing-file-format-risks-searching-bigfoot&quot;&gt;&lt;em&gt;Assessing file format risks: searching for Bigfoot?&lt;/em&gt;&lt;/a&gt; resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to it directly using the comment fields. However, my reply turned out to be rather lengthier than I had intended, so I decided to turn it into a separate blog entry.&lt;/p&gt;
&lt;h3&gt;Numbers first?&lt;/h3&gt;
&lt;p&gt;Ross&#039;s overall point is that &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/511#comment-511&quot;&gt;we need the numbers first&lt;/a&gt;; he makes a plea for collecting more format-related data, and adding numbers to these. Although these data do not directly translate into risks, Ross argues that it might be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there&#039;s a pretty fundamental problem, which I&#039;ll try to explain below. To avoid any confusion, I will be speaking of &quot;format risk&quot; here in the sense used by &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/A%20Risk%20Analysis%20of%20File%20Formats%20for%20Preservation%20Planning.pdf&quot;&gt;Graf &amp;amp; Gordea&lt;/a&gt;, which follows from the idea of &quot;institutional obsolescence&quot; (which is probably worth a blog post by itself, but I won&#039;t go into this here).&lt;/p&gt;
&lt;h3&gt;The risk model&lt;/h3&gt;
&lt;p&gt;Graf &amp;amp; Gordea define institutional obsolescence in terms of &quot;the additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution&quot;. Let&#039;s call this effort &lt;em&gt;E&lt;/em&gt;. Now the aim is to arrive at an index that has some predictive power of &lt;em&gt;E&lt;/em&gt;. Let&#039;s call this index &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt;. For the sake of the argument it doesn&#039;t matter how &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; is defined precisely, but it&#039;s reasonable to assume it will be proportional to &lt;em&gt;E&lt;/em&gt; (i.e. as the effort to render a file increases, so does the risk):&lt;/p&gt;
&lt;p&gt;&lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; ∝ &lt;em&gt;E&lt;/em&gt; &lt;/p&gt;
&lt;p&gt;The next step is to find a way to estimate &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; (the dependent variable) as a function of a set of potential predictor variables:&lt;/p&gt;
&lt;p&gt;&lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;S&lt;/em&gt;, &lt;em&gt;P&lt;/em&gt;, &lt;em&gt;C&lt;/em&gt;, ... ) &lt;/p&gt;
&lt;p&gt;where &lt;em&gt;S&lt;/em&gt; = software count, &lt;em&gt;P&lt;/em&gt; = popularity, &lt;em&gt;C&lt;/em&gt; = complexity, and so on. To establish the predictor function we have two possibilities:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;use a statistical approach (e.g. multiple regression or something more sophisticated);&lt;/li&gt;
&lt;li&gt;use a conceptual model that is based on prior knowledge of how the predictor variables affect &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;The first case (statistical approach) is only feasible if we have actual data on &lt;em&gt;E&lt;/em&gt;. For the second case we also need observations on &lt;em&gt;E&lt;/em&gt;, if only to be able to say anything about the model&#039;s ability to predict &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; (verification).&lt;/p&gt;
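To make the first (statistical) case concrete, here is a minimal sketch of what such a fit would look like. The predictor values and, crucially, the observed effort values E are entirely invented for illustration; the argument that follows is precisely that no real observations of E exist to feed such a fit:

```python
# Illustrative sketch (not from the post): fitting a linear predictor
# function R_E = f(S, P, C) by least squares. All values below are
# invented; in reality there is almost no observed data on E.
import numpy as np

# Hypothetical observations: columns are S (software count),
# P (popularity), C (complexity), one row per format.
predictors = np.array([
    [120.0, 0.9, 0.3],
    [ 15.0, 0.2, 0.7],
    [ 60.0, 0.5, 0.5],
    [  3.0, 0.1, 0.9],
])
effort = np.array([1.0, 6.0, 2.5, 9.0])  # hypothetical observed E

# Add an intercept column and solve the least-squares problem.
X = np.column_stack([np.ones(len(predictors)), predictors])
coeffs, residuals, rank, _ = np.linalg.lstsq(X, effort, rcond=None)
predicted = X @ coeffs
print("coefficients:", np.round(coeffs, 3))
print("predicted R_E:", np.round(predicted, 2))
```

Without real data in the `effort` vector, the fitted coefficients are meaningless, and there is equally no way to verify a conceptually-derived model against them.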
&lt;h3 id=&quot;1&quot;&gt;No observed data on &lt;em&gt;E&lt;/em&gt;!&lt;/h3&gt;
&lt;p&gt;Either way, the problem here is that there&#039;s an almost complete lack of any data on &lt;em&gt;E&lt;/em&gt;. Although we may have a handful of isolated &#039;war stories&#039;, these don&#039;t even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?&lt;/p&gt;
&lt;p&gt;Looking at Ross&#039;s suggestions for collecting more data, all of the examples he provides fall into the &lt;em&gt;potential&lt;/em&gt; (!) predictor variables category. For instance, prompted by my observation on compression in &lt;em&gt;PDF&lt;/em&gt;, Ross suggests analysing large collections of &lt;em&gt;PDF&lt;/em&gt;s to establish patterns in the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don&#039;t tell you if &lt;em&gt;PDF&lt;/em&gt; is &quot;riskier&quot; than another format, but he argues that:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;once we&#039;ve got them  [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them. &lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Aside from the fact that it&#039;s debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there&#039;s a more fundamental issue here. Bearing in mind that, ultimately, the thing we&#039;re &lt;em&gt;really&lt;/em&gt; interested in here is &lt;em&gt;E&lt;/em&gt;, how could collecting more data on potential predictor variables of &lt;em&gt;E&lt;/em&gt; ever help here &lt;em&gt;in the near absence of any actual data&lt;/em&gt; on &lt;em&gt;E&lt;/em&gt;? No amount of clever maths or statistics  can compensate for that! Meanwhile, ongoing work on the prediction of &lt;em&gt;E&lt;/em&gt; mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross&#039;s suggestions), even though the purpose of these efforts remains largely unclear.  
&lt;/p&gt;
&lt;p&gt;Within this context I was quite intrigued by the grant proposal mentioned by &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/513#comment-513&quot;&gt;Andrea Goethals&lt;/a&gt; which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on &lt;em&gt;E&lt;/em&gt; (although like &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/513#comment-513&quot;&gt;Andy Jackson said here&lt;/a&gt; I&#039;m also wondering whether this may be too ambitious). &lt;/p&gt;
&lt;h3&gt;Obsolescence-related risks versus format instance risks&lt;/h3&gt;
&lt;p&gt;On a final note, Ross makes the following remark about the role of tools:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This is true to some extent, but a tool like &lt;em&gt;jpylyzer&lt;/em&gt; only provides information on format &lt;em&gt;instances&lt;/em&gt; (i.e. features of &lt;em&gt;individual files&lt;/em&gt;); it doesn&#039;t say anything about preservation risks of the JP2 format &lt;em&gt;in general&lt;/em&gt;. The same applies to tools that are able to &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-07-25-identification-pdf-preservation-risks-sequel&quot;&gt;detect features in individual PDF files&lt;/a&gt; that are risky from a long-term preservation point of view. Such risks affect file instances of &lt;em&gt;current&lt;/em&gt; formats, and this is an area that is covered by the &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry&quot;&gt;OPF File Format Risk Registry&lt;/a&gt; that is being developed within SCAPE (it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain that is being addressed by &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/&quot;&gt;&lt;em&gt;FFMA&lt;/em&gt;&lt;/a&gt;. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross&#039;s suggestion on compression in &lt;em&gt;PDF&lt;/em&gt; entails (if I&#039;m understanding him correctly) the analysis of large volumes of &lt;em&gt;PDF&lt;/em&gt;s in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for &#039;risky&#039; features.
&lt;/p&gt;

&lt;div class=&quot;footnotes&quot;&gt;
&lt;hr /&gt;&lt;ol&gt;&lt;li id=&quot;fn:1&quot;&gt;
&lt;p&gt;On a side note, even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically-derived models. &lt;a href=&quot;#fnref:1&quot; rev=&quot;footnote&quot;&gt;↩&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-risks&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Preservation Risks&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/format-registry&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Format Registry&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/representation-information&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Representation Information&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/corpora&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Corpora&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/tools&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Tools&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/projects/scape&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;SCAPE&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Tue, 08 Oct 2013 16:24:05 +0000</pubDate>
 <dc:creator>johan</dc:creator>
 <guid isPermaLink="false">1050 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-10-08-measuring-bigfoot#comments</comments>
</item>
<item>
 <title>Open-source Database Preservation Toolkit released! </title>
 <link>http://openplanetsfoundation.org/blogs/2013-10-07-open-source-database-preservation-toolkit-released</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;div&gt;The &lt;a href=&quot;http://keeps.github.io/db-preservation-toolkit/&quot;&gt;Database Preservation Toolkit&lt;/a&gt; allows conversion between database formats, including connection to live systems, for the purpose of digitally preserving databases. The toolkit allows conversion of live or backed-up databases into preservation formats such as DBML, an XML format created for the purpose of database preservation. The toolkit also allows conversion of the preservation formats back into live systems to allow the full functionality of databases. For example, it supports a specialized export into MySQL, optimized for PhpMyAdmin, so the database can be fully experimented with using a web interface.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;This toolkit was part of the &lt;a href=&quot;http://www.roda-community.org&quot;&gt;RODA project&lt;/a&gt; and has now been released as a project of its own due to the increasing interest in this particular feature.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The toolkit is created as a platform that uses input and output modules. Each module supports reading and/or writing a particular database format or live system. New modules can easily be added by implementing a new interface and adding new drivers.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;To download it, learn how to use it and check related publications, please visit:&lt;/div&gt;&lt;div&gt;&lt;a href=&quot;http://keeps.github.io/db-preservation-toolkit/&quot;&gt;http://keeps.github.io/db-preservation-toolkit/&lt;/a&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So give it a try, provide feedback on issues and requested features, and feel free to contribute!&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Mon, 07 Oct 2013 16:10:07 +0000</pubDate>
 <dc:creator>lfaria</dc:creator>
 <guid isPermaLink="false">1049 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-10-07-open-source-database-preservation-toolkit-released#comments</comments>
</item>
<item>
 <title>Published Preservation Policies</title>
 <link>http://openplanetsfoundation.org/blogs/2013-10-07-published-preservation-policies</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;One of the activities in the European project SCAPE is to create a catalogue of policy elements. At the last iPRES conference we explained our work, and you can &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/Preservation%20Policy%20Levels%20in%20SCAPE.pdf&quot;&gt;read about it&lt;/a&gt;. During our activities we started collecting existing, published policies and we have now put the current set on a wiki: &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Published+Preservation+Policies&quot;&gt;http://wiki.opf-labs.org/display/SP/Published+Preservation+Policies&lt;/a&gt;. Looking at the results of your colleagues might help you create or finalize your own preservation policies. As I said during my presentation at iPRES 2013, there are far more organizations dealing with digital preservation than published preservation policies on the internet – at least based on what we found!&lt;/p&gt;&lt;p&gt;If your organization has a digital preservation policy and you want to see yours in this list as well, please send an email to &lt;a href=&quot;mailto:Barbara.Sierman@kb.nl&quot;&gt;Barbara.Sierman@kb.nl&lt;/a&gt; and it will be added.&lt;/p&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Mon, 07 Oct 2013 09:32:24 +0000</pubDate>
 <dc:creator>Barbara Sierman</dc:creator>
 <guid isPermaLink="false">1048 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-10-07-published-preservation-policies#comments</comments>
</item>
<item>
 <title>Assessing file format risks: searching for Bigfoot?</title>
 <link>http://openplanetsfoundation.org/blogs/2013-09-30-assessing-file-format-risks-searching-bigfoot</link>
 <description>&lt;div class=&quot;field field-name-body field-type-text-with-summary field-label-hidden&quot;&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot; property=&quot;content:encoded&quot;&gt;&lt;p&gt;Last week someone pointed my attention to a recent &lt;em&gt;iPres&lt;/em&gt; paper by Roman Graf and Sergiu Gordea titled &quot;&lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/A%20Risk%20Analysis%20of%20File%20Formats%20for%20Preservation%20Planning.pdf&quot;&gt;A Risk Analysis of File Formats for Preservation Planning&lt;/a&gt;&quot;. The authors propose a methodology for assessing preservation risks for file formats using information in publicly available information sources. In short, their approach involves two stages:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Collect and aggregate information on file formats from data sources such as &lt;a href=&quot;http://www.nationalarchives.gov.uk/PRONOM&quot;&gt;PRONOM&lt;/a&gt;, &lt;a href=&quot;http://www.freebase.com/&quot;&gt;Freebase&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org&quot;&gt;DBPedia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format&#039;s complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score. &lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;This has resulted in the &quot;File Format Metadata Aggregator&quot; (&lt;em&gt;FFMA&lt;/em&gt;), which is an expert system aimed at establishing a &quot;&lt;em&gt;well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts&lt;/em&gt;&quot;.&lt;/p&gt;
&lt;p&gt;The paper caught my attention for two reasons: first, a number of years ago some colleagues at the &lt;em&gt;KB&lt;/em&gt; developed a &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf&quot;&gt;method for evaluating file formats&lt;/a&gt; that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on &lt;a href=&quot;http://www.ils.unc.edu/digccurr/ct_poster/Ryan.pdf&quot;&gt;a method for assessing &quot;File Format Endangerment&quot;&lt;/a&gt; which seems to be following a similar approach. Now let me start by saying that I&#039;m extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the &lt;em&gt;KB&lt;/em&gt;-developed method, which is similar to the assessment method behind &lt;em&gt;FFMA&lt;/em&gt;. I will use the remainder of this blog post to explain my reservations.  
&lt;/p&gt;
&lt;h3&gt;Criteria are largely theoretical&lt;/h3&gt;
&lt;p&gt;&lt;em&gt;FFMA&lt;/em&gt; implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, &lt;a href=&quot;http://www.digitalpreservation.gov/formats/sustain/sustain.shtml&quot;&gt;Library of Congress&#039; Sustainability Factors&lt;/a&gt; and &lt;a href=&quot;http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf&quot;&gt;UK National Archives&#039; format selection criteria&lt;/a&gt;. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown. &lt;/p&gt;
&lt;h3&gt;Appropriateness of measures&lt;/h3&gt;
&lt;p&gt;Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure this? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their &lt;em&gt;quality&lt;/em&gt; or suitability for a specific task. For example, &lt;em&gt;PDF&lt;/em&gt; is supported by a plethora of software tools, yet it is well known that few of them support &lt;em&gt;every&lt;/em&gt; feature of the format (possibly even none, with the exception of Adobe&#039;s implementation). Here&#039;s another example: quite a few (open-source) software tools support the &lt;em&gt;JP2&lt;/em&gt; format, but for this, many of them (including &lt;em&gt;ImageMagick&lt;/em&gt; and &lt;em&gt;GraphicsMagick&lt;/em&gt;) rely on &lt;a href=&quot;http://www.ece.uvic.ca/~frodo/jasper/&quot;&gt;&lt;em&gt;JasPer&lt;/em&gt;&lt;/a&gt;, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.&lt;/p&gt;
&lt;h3&gt;Risk model and weighting of scores&lt;/h3&gt;
&lt;p&gt;Just as the employed criteria are largely theoretical, so are the computation of the risk scores, the weights assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, &lt;em&gt;Software Count&lt;/em&gt; can be compensated by a high score on other factors. This doesn&#039;t strike me as very realistic, and it is also at odds with e.g. &lt;a href=&quot;http://blog.dshr.org/2009/01/are-format-specifications-important-for.html&quot;&gt;David Rosenthal&#039;s view&lt;/a&gt; of formats with open source renderers being immune from format obsolescence.&lt;/p&gt;
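The compensation effect can be seen in a minimal sketch of a weighted-sum aggregation. The factor names echo the paper's risk factors, but the weights and scores below are invented for illustration and do not come from FFMA:

```python
# Illustrative weighted-sum aggregation (invented weights and scores).
# Per-factor risk scores on a 0 (low risk) to 10 (high risk) scale.
scores = {
    "software_count": 10.0,  # no known software support: worst case
    "popularity": 1.0,
    "complexity": 2.0,
    "documentation": 1.0,
}
# Weights sum to 1.0, so the overall score is a weighted average.
weights = {
    "software_count": 0.4,
    "popularity": 0.3,
    "complexity": 0.2,
    "documentation": 0.1,
}

overall = sum(weights[k] * scores[k] for k in scores)
print(round(overall, 2))  # prints 4.8
```

Even with the worst possible score on software support, the other factors pull the overall result down to the middle of the scale, so a format with no supporting software at all can still appear to be of moderate risk. That is exactly the compensation behaviour criticised above.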
&lt;h3&gt;Accuracy of underlying data&lt;/h3&gt;
&lt;p&gt;A cursory look at the &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/&quot;&gt;web service implementation of &lt;em&gt;FFMA&lt;/em&gt;&lt;/a&gt; revealed some results that make me wonder about the data that are used for the risk assessment. According to &lt;em&gt;FFMA&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=png&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;PNG&lt;/em&gt;&lt;/a&gt;, &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jpg&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JPG&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=gif&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;GIF&lt;/a&gt; are uncompressed formats (they&#039;re not!);&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=pdf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;PDF&lt;/em&gt;&lt;/a&gt; is &lt;em&gt;not&lt;/em&gt; a compressed format (in reality text in &lt;em&gt;PDF&lt;/em&gt; nearly always uses &lt;a href=&quot;http://en.wikipedia.org/wiki/DEFLATE&quot;&gt;Flate compression&lt;/a&gt;, whereas a &lt;a href=&quot;http://www.prepressure.com/pdf/basics/compression&quot;&gt;whole array of compression methods&lt;/a&gt; may be used for images);&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jp2&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JP2&lt;/em&gt;&lt;/a&gt; is not supported by &lt;em&gt;any&lt;/em&gt; software (Software Count=0!), it doesn&#039;t have a &lt;em&gt;MIME&lt;/em&gt; type, it is frequently used, and it is supported by web browsers (all wrong, although arguably &lt;em&gt;some&lt;/em&gt; browser support exists if you account for external plugins);&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jpf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JPX&lt;/em&gt;&lt;/a&gt; is &lt;em&gt;not&lt;/em&gt; a compressed format and it is less complex than &lt;em&gt;JP2&lt;/em&gt; (in reality it is an extension of &lt;em&gt;JP2&lt;/em&gt; with added complexity).&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks down &lt;em&gt;PDF&lt;/em&gt; and &lt;em&gt;MS Word&lt;/em&gt; (!) as formats with a lower risk than &lt;em&gt;TIFF&lt;/em&gt; (&lt;em&gt;GIF&lt;/em&gt; has the overall lowest score). &lt;/p&gt;
&lt;h3&gt;What risks?&lt;/h3&gt;
&lt;p&gt;It is important to note that the concept of &#039;preservation risk&#039; as addressed by &lt;em&gt;FFMA&lt;/em&gt; is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the &quot;&lt;em&gt;additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution&lt;/em&gt;&quot;. However, in its current form &lt;em&gt;FFMA&lt;/em&gt; only provides generalized information about formats, without addressing specific risks &lt;em&gt;within&lt;/em&gt; formats. A good example of this is &lt;em&gt;PDF&lt;/em&gt;, which may contain various features that are &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-07-26-pdf-inventory-long-term-preservation-risks&quot;&gt;problematic&lt;/a&gt; for long-term preservation. Also note how &lt;em&gt;PDF&lt;/em&gt; is marked as a &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=pdf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;low-risk&lt;/em&gt; format&lt;/a&gt;, despite the fact that it can be a container for &lt;em&gt;JP2&lt;/em&gt;, which is considered &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jp2&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;high-risk&lt;/em&gt;&lt;/a&gt;. So doesn&#039;t that imply that a &lt;em&gt;PDF&lt;/em&gt; that contains &lt;em&gt;JPEG 2000&lt;/em&gt; compressed images is at a higher risk?&lt;/p&gt;
&lt;h3&gt;Encyclopedia replacing expertise?&lt;/h3&gt;
&lt;p&gt;A possible response to the objections above would be to refine &lt;em&gt;FFMA&lt;/em&gt;: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I&#039;m sure this could lead to some improvement, I&#039;m eerily reminded here of &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-13-registries-we-need&quot;&gt;this recent &lt;strike&gt;rant&lt;/strike&gt; blog post&lt;/a&gt; by Andy Jackson, in which he shares his concerns about the archival community&#039;s preoccupation with format, software, and hardware registries. Apart from the question of whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that &quot;&lt;em&gt;maybe we don&#039;t know what information we need&lt;/em&gt;&quot;, and that &quot;&lt;em&gt;maybe we don&#039;t even know who or what we are building registries for&lt;/em&gt;&quot;. He also wonders if we are &quot;&lt;em&gt;trying to replace imagination and expertise with an encyclopedia&lt;/em&gt;&quot;. I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn&#039;t do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here&#039;s an example.&lt;/p&gt;
&lt;h3&gt;KB example&lt;/h3&gt;
&lt;p&gt;About five years ago, some colleagues at the &lt;em&gt;KB&lt;/em&gt; developed a &quot;quantifiable file format risk assessment method&quot;, which is described in &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf&quot;&gt;this report&lt;/a&gt;. This method was &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/Alternative_File_Formats_for_Storing_Masters_2_1.pdf&quot;&gt;applied&lt;/a&gt; to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome of this was used to justify a change from uncompressed &lt;em&gt;TIFF&lt;/em&gt; to &lt;em&gt;JP2&lt;/em&gt;. It was only much later that we found out about a host of practical and standard-related problems with the format, some of which are discussed &lt;a href=&quot;http://jpeg2000wellcomelibrary.blogspot.nl/2010/12/guest-post-ensuring-suitability-of-jpeg.html&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;here&lt;/a&gt;. &lt;em&gt;None&lt;/em&gt; of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of &lt;em&gt;FFMA&lt;/em&gt; is covering similar ground, and this adds to my scepticism about addressing preservation risks in this manner. &lt;/p&gt;
&lt;h3&gt;Final thoughts&lt;/h3&gt;
&lt;p&gt;Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by &lt;em&gt;FFMA&lt;/em&gt; would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps being revisited. Similar to the format registry situation, is this perhaps another manifestation of the &quot;&lt;em&gt;trying to replace imagination and expertise with an encyclopedia&lt;/em&gt;&quot; phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation &quot;risks&quot; if these &quot;risks&quot; are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn&#039;t this all a bit like &lt;a href=&quot;http://www.searchingforbigfoot.com/&quot;&gt;searching for Bigfoot&lt;/a&gt;? Wouldn&#039;t the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found &lt;a href=&quot;http://unsustainableideas.wordpress.com/2012/10/15/ppt-4-adventure-learning/&quot;&gt;here&lt;/a&gt; (accessing old Powerpoint 4 files), &lt;a href=&quot;http://notepad.benfinoradin.info/2013/09/12/it-takes-a-village-to-save-a-hard-drive/&quot;&gt;here&lt;/a&gt; (recovering the contents of an old Commodore Amiga hard disk), &lt;a href=&quot;http://anjackson.github.io/keeping-codes/experiments/BBC%20Micro%20Data%20Recovery.html&quot;&gt;here&lt;/a&gt; (BBC Micro Data Recovery), or even &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry&quot;&gt;here&lt;/a&gt; (problems with contemporary formats)?&lt;/p&gt;
&lt;p&gt;That said, I think some of the &lt;em&gt;FFMA&lt;/em&gt;-related work could still play a valuable role here: the aggregation component of &lt;em&gt;FFMA&lt;/em&gt; looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.&lt;/p&gt;
&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;field field-name-taxonomy-vocabulary-5 field-type-taxonomy-term-reference field-label-above&quot;&gt;&lt;div class=&quot;field-label&quot;&gt;Preservation Topics:&amp;nbsp;&lt;/div&gt;&lt;div class=&quot;field-items&quot;&gt;&lt;div class=&quot;field-item even&quot;&gt;&lt;a href=&quot;/category/preservation-topics/preservation-risks&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Preservation Risks&lt;/a&gt;&lt;/div&gt;&lt;div class=&quot;field-item odd&quot;&gt;&lt;a href=&quot;/category/preservation-topics/resources/format-registry&quot; typeof=&quot;skos:Concept&quot; property=&quot;rdfs:label skos:prefLabel&quot; datatype=&quot;&quot;&gt;Format Registry&lt;/a&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</description>
 <pubDate>Mon, 30 Sep 2013 15:49:25 +0000</pubDate>
 <dc:creator>johan</dc:creator>
 <guid isPermaLink="false">1046 at http://openplanetsfoundation.org</guid>
 <comments>http://openplanetsfoundation.org/blogs/2013-09-30-assessing-file-format-risks-searching-bigfoot#comments</comments>
</item>
</channel>
</rss>