Planet DigiPres

EaaS in Action — And a short meltdown due to a friendly DDoS

Open Planets Foundation Blogs - 9 July 2014 - 4:23pm

On June 24th, 9.30 AM EST, Dragan Espenschied, Digital Conservator at Rhizome NY, released an editorial featuring a restored home computer previously owned by Cory Arcangel. The article uses an embedded emulator powered by the bwFLA Emulation as a Service framework and the University of Freiburg’s computing center. The embedded emulator allows readers to explore and interact with the re-enacted machine. 

Currently the bwFLA test and demo infrastructure runs on an old, written-off IBM blade cluster, using 12 blades for demo purposes, each equipped with 8 physical Intel(R) Xeon(R) CPUs (E5440 @ 2.83GHz). All instances are booted diskless (network boot) with the latest bwFLA codebase deployed. Additionally, there is an EaaS gateway running on 4 CPUs, delegating requests and providing a web container framework (JBoss) for IFrame delivery. To ensure decent performance of individual emulation sessions, we assign one emulation session to every available physical CPU. Hence, our current setup can handle 96 parallel sessions. 

Due to some social media propaganda and good timing (US breakfast/coffee time) our resources were exhausted within minutes.

The figure above shows the total number of sessions for June 26th. Between 16:00 and 20:00 CEST, however, we were unable to deal with the demand.

After two days, however, the load normalized again, though at a higher level.

Lessons learned

The bwFLA EaaS framework is able to scale with demand, but not with our available (financial) resources. Our cluster is suitable for “normal” load scenarios. For peaks like the one we experienced with Dragan’s post, a temporary deployment in the cloud (e.g. using PaaS/IaaS services) is a cost-effective strategy, since these heavy-load situations last only a few days. The “local” infrastructure should be scaled to average demand, keeping the costs of running EaaS at bay. For instance, Amazon EC2 charges about €0.50 per hour for an 8-CPU machine. In the case of Dragan’s post, the average session time of a user playing with the emulated machine was 15 minutes; hence, the average cost per user is about €0.02 if a machine is fully utilized. 
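As a back-of-the-envelope check on those numbers (a sketch only; the €0.50/hour price, one session per physical CPU, and 15-minute average session length are the figures quoted above):

```python
# Sketch of the per-user cost estimate from the figures in the post.
price_per_hour = 0.50          # EUR, 8-CPU cloud machine (figure from the post)
sessions_per_machine = 8       # one emulation session per physical CPU
session_minutes = 15           # average observed session length

# A fully utilized machine serves this many sessions per hour.
sessions_per_machine_hour = sessions_per_machine * (60 / session_minutes)
cost_per_session = price_per_hour / sessions_per_machine_hour

print(f"~EUR {cost_per_session:.4f} per user")  # about EUR 0.02 when rounded
```

The estimate only holds at full utilization; idle CPU-hours still cost the full hourly rate.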


Taxonomy upgrade extras: EaaS | Preservation Topics: Emulation
Categories: Planet DigiPres

July Library of Congress Digital Preservation Newsletter

The Signal: Digital Preservation - 9 July 2014 - 1:20pm

The July issue of the Library of Congress Digital Preservation newsletter is now available!

In this issue:

  • Featuring “Digital Preservation and the Arts” including Web Archiving and Preserving the Arts, and Preserving Digital and Software-Based Artworks
  • An Interview with Marla Misunas (and friends) of SFMOMA, part 2
  • Community Approaches to Digital Stewardship
  • Exhibiting GIFs, with Jason Eppink
  • NDSA News with the latest reports
  • Residency Program updates
  • Conversation Corner, interviews with Ted Westervelt, Lisa Gitelman and Shannon Mattern
  • Upcoming events

To receive future newsletters, sign up here for the Digital Preservation Newsletter.

Categories: Planet DigiPres

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

Open Planets Foundation Blogs - 9 July 2014 - 12:31am

During my time at The National Archives UK, a colleague, Adam Retter, developed a methodology for the reversible pre-conditioning of complex binary objects. The technique was required to avoid the doubling of storage for malformed JPEG2000 objects numbering in the hundreds of thousands. The difference between a malformed JPEG2000 file and a corrected, well-formed JPEG2000 file was, in this instance, a handful of bytes, yet the objects themselves were many megabytes in size. The cost of storage means that doubling it in such a scenario is not desirable in today’s fiscal environment – especially if it can be avoided.

As we approach ingest of our first born-digital transfers at Archives New Zealand, we also have to think about such issues. We’re also concerned about the documentation of any comparable changes to binary objects, as well as any more complicated changes to objects in any future transfers.

The reason for making changes to a file pre-ingest (in our process terminology, pre-conditioning) is to ensure well-formed, valid objects are ingested into the long-term digital preservation system. Using processes to ensure changes are:

  • Reversible
  • Documented
  • Approved

We can counter any issues identified as digital preservation risks in the system’s custom rules up front, ensuring we don’t have to perform any preservation actions in the short to medium term. Such issues may be raised through format identification, validation or characterisation tools; they can be trivial or complex, and the objects that contain exceptions may themselves be trivial or complex.

At present, if pre-conditioning is approved, it will result in a change being made to the digital object and written documentation of the change, associated with the file, in its metadata and in the organisation’s content management system outside of the digital preservation system.

As example documentation for a change, we can look at a provenance note I might write to describe a change in a plain text file. The reason for the change is that the digital preservation system expects the object to be encoded as UTF-8; a conversion gives us stronger confidence about what this file is in future. Such a change, converting the object from ASCII to UTF-8, can be completed either as a pre-conditioning action pre-ingest, or as a preservation migration post-ingest. 

Provenance Note

“Programmers Notepad 2.2.2300-rc used to convert plain-text file to UTF-8. UTF-8 byte-order-mark (0xEFBBBF) added to beginning of file – file size +3 bytes. Em-dash (0x97 ANSI) at position d1256 replaced by UTF-8 representation 0xE28094 at position d1256+3 bytes (d1259-d1261) – file size +2 bytes.”

Such a small change is deceptively complex to document. Without the presence of a character sitting outside of the ASCII range we might have simply been able to write, “UTF-8 byte-order-mark added to beginning of file.” – but with its presence we have to provide a description complete enough to ensure that the change can be observed, and reversed by anyone accessing the file in future.
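To make the reversibility concrete, the documented change and its reversal can be sketched in a few lines of Python. The file content below is invented for illustration; only the BOM (0xEF 0xBB 0xBF) and the 0x97 → 0xE2 0x80 0x94 em-dash substitution come from the provenance note above, and a real tool would need to track byte positions rather than blindly substituting:

```python
# Sketch of the provenance note's change: prepend a UTF-8 BOM and replace
# a Windows-1252 em-dash byte (0x97) with its UTF-8 encoding (0xE2 0x80 0x94).
BOM = b"\xef\xbb\xbf"

def precondition(data: bytes) -> bytes:
    # +3 bytes for the BOM, +2 bytes per replaced em-dash
    return BOM + data.replace(b"\x97", b"\xe2\x80\x94")

def reverse(data: bytes) -> bytes:
    # Undo both steps to recover the original bitstream.
    if data.startswith(BOM):
        data = data[len(BOM):]
    return data.replace(b"\xe2\x80\x94", b"\x97")

original = b"plain text \x97 with an ANSI em-dash"   # hypothetical content
fixed = precondition(original)
assert fixed == BOM + b"plain text \xe2\x80\x94 with an ANSI em-dash"
assert len(fixed) == len(original) + 3 + 2
assert reverse(fixed) == original
```

Even this toy version shows why the note must pin down exact bytes and offsets: the reversal is only safe if we know precisely which substitutions were made.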

Pre-conditioning vs. Preservation Migration

As pre-conditioning is a form of preservation action that happens outside of the digital preservation system, we haven’t adequate tools to complete the action and document it for us – especially for complex objects. We’re relying on good hand-written documentation being provided on ingest. The temptation, therefore, is to let the digital preservation system handle this using its inbuilt capability to record and document all additions to a digital object’s record, including the generation of additional representations; but the biggest reason not to rely on this is the cost of storage, and how it increases with the likelihood of so many objects requiring this sort of treatment over time.

Proposed Solution

It is important to note that the proposed solution can be implemented either pre- or post-ingest, removing the emphasis from where in the digital preservation process this occurs. However, incorporating it post-ingest requires changes to the digital preservation system, while doing it pre-ingest enables it to be done manually, with immediate challenges addressed. Consistent usage and proven advantages over time might see it included in a digital preservation mechanism at a higher level.

The proposed solution is to use a patch file, specifically a binary diff, which stores instructions about how to convert one bitstream to another. We can create a patch file by using a tool that compares an original bitstream to a corrected (pre-conditioned) version of it and stores the result of the comparison. Patch files can add and remove information as required, so we can apply the instructions to a corrected version of any file to reproduce the uncorrected original.
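BSDIFF’s actual format is far more sophisticated (it emits compressed control, diff and extra blocks), but the core idea – a small set of instructions that rewrites one bitstream into another – can be sketched in Python. This toy version handles only a single differing region between a shared prefix and suffix, so it is an illustration of the principle, not a substitute for bsdiff:

```python
def make_patch(new: bytes, old: bytes):
    """Record how to turn `new` back into `old`: note the shared prefix
    and suffix lengths and store only the bytes that must be swapped in."""
    limit = min(len(new), len(old))
    p = 0
    while p < limit and new[p] == old[p]:
        p += 1
    s = 0
    while s < limit - p and new[len(new) - 1 - s] == old[len(old) - 1 - s]:
        s += 1
    return (p, s, old[p:len(old) - s])

def apply_patch(new: bytes, patch):
    """Rebuild `old` from `new` plus the stored middle section."""
    p, s, middle = patch
    return new[:p] + middle + new[len(new) - s:]

corrected = b"The quick brown fox jumped over the lazy dog.\n"
malformed = b"The quick brown fox jumped over the lazy hen.\n"
patch = make_patch(corrected, malformed)
assert apply_patch(corrected, patch) == malformed  # original recovered exactly
```

The preservation-relevant property is the same as with bsdiff: we keep only the corrected file plus a small patch, and the malformed original remains recoverable byte for byte.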

The tool we adopted at The National Archives, UK was called BSDIFF. This tool is distributed with the popular operating system, FreeBSD, but is also available under Linux, and Windows.

The tool was created by Colin Percival and comes as two utilities: one to create a binary diff, BSDIFF itself, and the other to apply it, BSPATCH. The manual instructions are straightforward, but the important part of the solution in a digital preservation context is to flip the terms <oldfile> and <newfile>, so, for example, in the manual:

  • $ bsdiff <oldfile> <newfile> <patchfile>

Can become:

  • $ bsdiff <newfile> <oldfile> <patchfile>

Further, in the descriptions below, I will replace <newfile> and <oldfile> with <pre-conditioned-file> and <malformed-file> respectively, e.g.

  • $ bsdiff <pre-conditioned-file> <malformed-file> <patchfile>


BSDIFF generates a patch <patchfile> between two binary files. It compares <pre-conditioned-file> to <malformed-file> and writes a <patchfile> suitable for use by BSPATCH.


BSPATCH applies a patch built with BSDIFF: it generates <malformed-file> using <pre-conditioned-file> and the <patchfile> from BSDIFF.


For my examples I have been using the Windows port of BSDIFF referenced from Colin Percival’s site.

To begin with, a non-archival example simply re-producing a binary object:

If I have the plain text file, hen.txt:

  • The quick brown fox jumped over the lazy hen.

I might want to correct the text to its more well-known pangram form – dog.txt:

  • The quick brown fox jumped over the lazy dog.

I create dog.txt and using the following command I create hen-reverse.diff:

  • $ bsdiff dog.txt hen.txt hen-reverse.diff

We have two objects we need to look after, dog.txt and hen-reverse.diff.

If we ever need to look at the original again we can use the BSPATCH utility:

  • $ bspatch dog.txt hen-original.txt hen-reverse.diff

We end up with a file that matches the original byte for byte, which can be confirmed by comparing the two checksums.

$ md5sum hen.txt
84588fd6795a7e593d0c7454320cf516 *hen.txt
$ md5sum hen-original.txt
84588fd6795a7e593d0c7454320cf516 *hen-original.txt

Used as an illustration, we can re-create the original binary object, but we’re not saving any storage space at this point as the patch file is bigger than the <malformed-file> and <pre-conditioned-file> together:

  • hen.txt – 46 bytes
  • dog.txt – 46 bytes
  • hen-reverse.diff – 159 bytes

The savings from using binary diff objects to store pre-conditioning instructions begin to show when we ramp up the complexity and size of the objects we’re working with. Still working with text, we can convert the following plain-text object to UTF-8, complementing the pre-conditioning action we might perform on archival material as described in the introduction to this blog entry:

  • Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quam lacus, tincidunt sit amet lobortis eget, auctor non nibh. Sed fermentum tempor luctus. Phasellus cursus, risus nec eleifend sagittis, odio tellus pretium dui, ut tincidunt ligula lorem et odio. Ut tincidunt, nunc ut volutpat aliquam, quam diam varius elit, non luctus nulla velit eu mauris. Curabitur consequat mauris sit amet lacus dignissim bibendum eget dignissim mauris. Nunc eget ullamcorper felis, non scelerisque metus. Fusce dapibus eros malesuada, porta arcu ut, pretium tellus. Pellentesque diam mauris, mollis quis semper sit amet, congue at dolor. Curabitur condimentum, ligula egestas mollis euismod, dolor velit tempus nisl, ut vulputate velit ligula sed neque. Donec posuere dolor id tempus sodales. Donec lobortis elit et mi varius rutrum. Vestibulum egestas vehicula massa id facilisis.

Converting the passage to UTF-8 doesn’t require the conversion of any characters within the text itself, just the addition of the UTF-8 byte-order-mark at the beginning of the file. Using Programmers Notepad we can open lorem-ascii.txt and re-save it with a different encoding as lorem-utf-8.txt. As with dog.txt and hen.txt, we can then create the patch and apply it to recover the original, using the following commands:

  • $ bsdiff lorem-utf-8.txt lorem-ascii.txt lorem-reverse.diff
  • $ bspatch lorem-utf-8.txt lorem-ascii-original.txt lorem-reverse.diff

Again, confirmation that bspatch outputs a file matching the original can be seen by looking at their respective MD5 values:

$ md5sum lorem-ascii.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii.txt
$ md5sum lorem-ascii-original.txt
ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii-original.txt

The file sizes here are much more illuminating:

  • lorem-ascii.txt – 874 bytes
  • lorem-utf-8.txt – 877 bytes
  • lorem-reverse.diff – 141 bytes
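That three-byte difference between the two text files is just the BOM. Python’s `utf-8-sig` codec performs the same re-save Programmers Notepad does here; a quick sketch (using placeholder text, not the actual lorem file):

```python
# For pure-ASCII text, saving as UTF-8 with a BOM changes nothing except
# prepending the 3-byte byte-order-mark (0xEF 0xBB 0xBF).
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."

ascii_bytes = text.encode("ascii")       # plain ASCII, no BOM
utf8_bytes = text.encode("utf-8-sig")    # same bytes with the BOM prepended

assert utf8_bytes == b"\xef\xbb\xbf" + ascii_bytes
assert len(utf8_bytes) - len(ascii_bytes) == 3   # matches 877 - 874 above
```

This is why the patch file stays so small: the diff only needs to record the removal of those three bytes.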

Just one more thing… Complexity!

We can also demonstrate the complexity of the modifications we can make to digital objects that BSDIFF affords us. Attached to this blog is a zip file containing supporting files, lorem-ole2-doc.doc and lorem-xml-docx.docx.

The files are used to demonstrate a migration exercise from an older Microsoft OLE2 format to the up-to-date OOXML format.

I’ve also included the patch file lorem-word-reverse.diff.

Using the commands as documented above:

  • $ bsdiff lorem-xml-docx.docx lorem-ole2-doc.doc lorem-word-reverse.diff
  • $ bspatch lorem-xml-docx.docx lorem-word-original.doc lorem-word-reverse.diff

We can observe that applying the diff file to the ‘pre-conditioned’ object results in a file identical to the original OLE2 object:

$ md5sum lorem-ole2-doc.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-ole2-doc.doc
$ md5sum lorem-word-original.doc
3bb94e23892f645696fafc04cdbeefb5 *lorem-word-original.doc

The file-sizes involved in this example are as follows:

  • lorem-ole2-doc.doc – 65,536 bytes
  • lorem-xml-docx.docx – 42,690 bytes
  • lorem-word-reverse.diff – 16,384 bytes

The neat part of this solution, if it wasn’t enough that the most complex of modifications are reversible, is that the provenance note remains the same for all transformations between all digital objects. The tools and techniques are documented instead, and the rest is consistently consistent – and perhaps even more accessible to users who can understand this documentation than the complex narrative, byte-by-byte breakdowns that might otherwise be necessary.


Given the right problem, and the insight of an individual who was (at the time) outside the digital preservation sphere, Adam has shown us an innovative solution that helps us to demonstrate provenance in a technologically and scientifically sound manner, more accurately and more efficiently than we might otherwise be able to using current approaches. The solution:

  • Enables more complex pre-conditioning actions on more complex objects
  • Prevents us from doubling storage space
  • Encapsulates pre-conditioning instructions more efficiently and more accurately - there are fewer chances to make errors

While it is unclear whether Archives New Zealand will be able to incorporate this technique into its workflows at present, the solution will be presented alongside our other options so that it can be discussed and taken into consideration by the organisation as appropriate.

Work does need to be done to incorporate it properly, e.g. respecting original file-naming conventions, and some consideration should be given to where and when in the transfer/digital preservation process the method should be applied. However, it should prove an attractive and useful option for many archives performing pre-conditioning or preservation actions on future digital objects, trivial and complex alike.




Documentation for the BSDIFF file format, created by Colin Percival and BSD-licensed, is attached to this blog. 


Preservation Topics: Preservation Actions | Migration | Preservation Strategies | Normalisation
Attachments: BSDIFF format documentation provided by Colin Percival [PDF], 29.93 KB; Support files for OLE2 to OOXML example [ZIP], 69.52 KB
Categories: Planet DigiPres

HTML and fuzzy validity

File Formats Blog - 8 July 2014 - 5:54pm

Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.

When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.
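For instance, Python’s standard-library HTML parser, lenient in roughly the same spirit as browsers, happily consumes input containing no tags at all. This is an illustrative sketch (not JHOVE’s actual HTML module):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects every start tag seen; raises nothing on tag-free input."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
p.feed("just a plain text file with no markup whatsoever")
p.close()
assert p.tags == []   # parsed without complaint, yet contains no HTML at all
```

A validator built on such a parser can only tell you what it found, not whether the document "is" HTML – which is exactly the problem described above.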

There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.

The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.

The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. These are very apt to have vendor-specific features, and they may have major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.

It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without all of them the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.

So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.

It’s a mess, and I don’t think anyone has a good solution.

Categories: Planet DigiPres

Tag and Release: Acquiring & Making Available Infinitely Reproducible Digital Objects

The Signal: Digital Preservation - 8 July 2014 - 5:17pm

What does it mean to acquire something, like a set of animated .gifs,  that are already widely available on the web? Archives and Museums are often focused on acquiring, preserving and making accessible rare or unique documents, records, objects and artifacts. While someone might take a photo of an object, or reproduce it in any number of ways, the real object would reside in the institution. How does this perspective shift when we switch to working with rare and unique born-digital materials?

"I code therefore I am" by user circle_hk on Flickr.


Given that digital objects function on a completely different logic where (for nearly all intents and purposes) any copy is as original as the original, rare and unique is a somewhat outmoded notion for digital material. Any accurate copy of a digital object is as much the object as the original. So, if it is trivial to create lots of copies of unique materials how does that change what it means to acquire and make them available?

Consideration of Cooper Hewitt’s acquisition of the source code of an iPad application offers an opportunity to rethink some of what acquisition can mean for digital materials, and in the process rethink part of what the functions of cultural heritage organizations are and can be in this area. What follows are reflections largely inspired by thinking through Sebastian Chan and Aaron Straup Cope’s recent essay Collecting the present: digital code and collections and Doug Reside’s recent essay File Not Found: Rarity in the Age of Digital Plenty (pdf). Together, I think these two essays suggest a potential shift for thinking about digital artifacts. Potentially, a shift away from a mindset of Acquire and Make Available to a mindset of Tag and Release. It may be that the best thing cultural heritage organizations can do with rare and unique born-digital materials is to make it so that they are no longer rare and unique at all. To make it easy for anyone to interact with and validate copies of these materials. This is some formative thinking on the topic, so I look forward to discussing/talking about these issues with anyone interested in the comments.

Copy the Source, Let Others Copy the Source

In 2013 the Smithsonian Cooper-Hewitt National Design Museum acquired Planetary, an iPad application that creates visualizations of collections of music. In practice, this involved acquiring its source code and making that code available through the museum’s GitHub account. Note that the acquisition did not involve a commitment of resources to ensure that people will be able to experience the application as users did on iPads. In fact, in that sense, the Planetary software is already obsolete: new versions of the iOS software will not run it.

However, by acquiring the source code under version control, along with all of the bug reports and tickets associated with its development, Cooper Hewitt is preserving and making available both the raw material for anyone to make use of and an extensive record of the design and development process. As Doug Reside, digital curator for the performing arts at the New York Public Library, recently suggested in an essay in Rare Books and Manuscripts, “the source code behind the program might be considered a manuscript.” In a case like this, where documentation of the entire history of the software’s development is present, the Planetary files might be better understood as an archive, a manuscript collection or, as they are textual in nature, even a documentary edition.

Each commit message with changes and edits to the source is itself a record of the production and creation of the software. In this vein, the acquisition, in a way, escapes the limitations of screen essentialism, i.e. privileging the single representation of a digital object on a screen as its essential form instead of respecting the myriad ways that digital objects manifest themselves. To this end, forgoing the complex issues of attempting to keep the software functional and instead focusing on the ease of collecting the source code and representative documentary materials such as screencaptures will provide future users a base from which to understand and potentially recreate or expand the app.

Anyone can download the entirety of the Planetary acquisition. You can save it to your computer and you too will have, in a sense, acquired the application as well. That is, the copy of the “real” object on the shelf, or on Cooper-Hewitt’s servers, is no more or less authentic than any other copy of it. The digital objects that make up the acquisition are themselves infinitely, perfectly reproducible. Much like the GeoCities special collection, anyone is welcome to do what they like with it: exhibit it, revise it, etc. So, what role does the museum as repository play in this case? Using GitHub to provide access to the source code and its history, Cooper Hewitt has put a stake in the ground and offered resources to steward the code, but it opens up a broader question about what it means to acquire something when anyone can have a perfect copy, indistinguishable from the original.

The Acquisition of a Sequence of Symbols

In Collecting the present: digital code and collections, Sebastian Chan and Aaron Straup Cope of the Cooper-Hewitt Design Museum offer a wealth of information contextualizing and explaining the acquisition of the Planetary app. Of particular relevance to the question of uniqueness and acquisition they point to an even more symbolic acquisition, the Museum of Modern Art’s acquisition of the @ symbol.

In 2010, the Museum of Modern Art acquired the @ symbol. Not a representation of it, but the symbol itself. As Paola Antonelli, Senior Curator, Department of Architecture and Design, explains: The acquisition of @ “relies on the assumption that physical possession of an object as a requirement for an acquisition is no longer necessary, and therefore it sets curators free to tag the world and acknowledge things that ‘cannot be had’—because they are too big (buildings, Boeing 747’s, satellites), or because they are in the air and belong to everybody and to no one, like the @—as art objects befitting MoMA’s collection.” While the @ symbol is significantly more ethereal than a digital object, I think the story of this acquisition has some interesting lessons for thinking about acquiring digital materials which are infinitely and perfectly reproducible.

Software source code is much more concrete than the @ symbol; it consists of a range of digital files. With that said, the acquisition of source code is functionally the acquisition of a sequence of symbols. The non-rivalrous nature of digital objects means that one organization having a copy of a file doesn’t in any way preclude another organization or individual from having exactly the same thing. The logic of the Planetary acquisition is one of pinning these digital objects down, providing some context, and making some commitments to ensuring access to the data. It is a logic of non-rivalrous acquisition: simply making a commitment to ensure long-term access to these materials.

Tag and Release

The idea of “tagging the world,” in Antonelli’s remarks about the acquisition of the @ symbol, can open up a fruitful way of thinking about digital acquisitions. As I’ve suggested before, I think it’s important for cultural heritage organizations to start letting go of the requirement to provide the definitive interface. Instead, cultural heritage organizations can focus more on selection and working to ensure long term preservation and integrity of data. The Planetary case pushes that idea even further. The Planetary acquisition includes a set of materials that document the experience of the application. They include things like screenshots and descriptions of how it functioned. While these assets offer a sense of what the experience of using the app was, the source code provides a rich set of materials for future users to use to understand how it worked and potentially reenact it.

Instead of wading into the complex issues of attempting to keep the software functional in perpetuity, they have acquired a copy of its source code, made a commitment to ensure long-term access to the data, and made it available under the most liberal license they could. The curatorial function of selection, identifying digital objects that matter and should be preserved, persists without the need to be the only entity that “owns” the object.

In this scenario, the library, archive or museum identifies objects of significance — tagging them in Antonelli’s terms — and then works to broker the right to collect and acquire records and other artifacts that document the object, providing as unrestricted access as possible to what they acquire. Just as a design museum might collect the blueprints for a building instead of collecting the building itself, an institution can collect the source code of a piece of software instead of, or alongside, collecting a copy of the software in its executable form, and then work to make that material available in the broadest way possible. From there on, the institution serves to provide authentic copies and validate the authenticity of copies, while also providing provenance and context and ensuring ongoing preservation of an authentic copy.
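One concrete way an institution can validate the authenticity of copies without being the object’s sole holder is to publish cryptographic digests of what it stewards. A minimal sketch (the content bytes are invented for illustration; a real repository would use fixity records tied to its catalogue):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Any faithful copy of a digital object yields the same digest, so a
    repository can vouch for authenticity without holding the 'only' copy."""
    return hashlib.sha256(data).hexdigest()

institutional_copy = b"source code of the acquired work"   # hypothetical bytes
downloaded_copy = bytes(institutional_copy)                # a perfect copy

# A downloader compares the published digest with their own copy's digest.
assert fingerprint(downloaded_copy) == fingerprint(institutional_copy)
```

The digest, not possession, becomes the anchor of authenticity: anyone’s copy that matches the published fingerprint is as authentic as the institution’s.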

The future of collecting and preserving born-digital special collections, collections of rare or unique materials like manuscripts, drafts and original source code, might upset some of the core ideas of custodianship. I think the best thing cultural heritage organizations can do with these rare and unique born-digital materials may be to make it so that they are no longer rare and unique at all. By making a set of unique and rare materials easy for anyone to see and copy, the institution can help ensure both the broadest use of and access to the materials.

Categories: Planet DigiPres

Professional update

File Formats Blog - 8 July 2014 - 10:08am

Just to keep everyone up to date on what I’m doing professionally:

Currently I’m back in consulting mode, offering my services for software development and consultations. Those of you who’ve been following this blog regularly know I’ve been working with libraries for a long time and I’m familiar with the technology. I’ve updated my business home page at and moved it to new hosting, which will allow me to put demos and other materials of interest on the site.

The key to success is, of course, networking, so if you happen to hear of a situation where my skills could be put to good use, please let me know.

Tagged: business
Categories: Planet DigiPres

NDSA Standards and Practices Survey: Ranking Stumbling Blocks for Video Preservation

The Signal: Digital Preservation - 7 July 2014 - 2:05pm

Have You Taken the NDSA S&P Video Stumbling Blocks Survey bookmark, designed by Kara Van Malssen, AVPreserve



A new thread emerged during the recent monthly conference calls of the Standards and Practices Working Group of the National Digital Stewardship Alliance (NDSA). What do we do about preserving video? It’s a problem for many of our members. One participant even commented that video is often the last content type to be added to digital repositories.

There are many potential reasons why video is problematic to preserve – the files are big and complex, providing access can be challenging, the equipment can be specialized and expensive – the list goes on. But out of all these, what are the most frequent and challenging stumbling blocks in preserving video? We’ve decided to find out.

The Standards and Practices Working Group has created a short survey – only three required questions – to help us identify and rank some of the issues that may hinder digital video preservation.

The results of this survey will help the Standards and Practices Working Group to explore more effectively these issues, better inform the creation and stewardship of digital video, and help us establish best practices for the long-term sustainability and accessibility of video assets.

We made our survey brief, knowing that it could only touch upon the high-level issues that video production, digitization and preservation professionals currently face. We hope we will receive responses from a wide range of participants.

You can find the survey at Responses are requested before August 2, 2014. We appreciate your time in filling this out and look forward to sharing the results with you soon.

Categories: Planet DigiPres

Research and Development for Digital Cultural Heritage Preservation: A (Virtual and In-Person) Open Forum

The Signal: Digital Preservation - 3 July 2014 - 1:17pm

The following is a guest post from Joshua Sternfeld, National Endowment for the Humanities and Gail Truman, Truman Technologies. The statements and ideas expressed here are attributed solely to the authors and do not necessarily reflect those of any federal agency or institution.


The collection on the wall from user patentboy on Flickr.

As the National Digital Stewardship Alliance prepares to add new categories of content to its 2015 National Agenda for Digital Stewardship, including digital art and software, now is the ideal opportunity to assess the state of research and development for the preservation of digital cultural heritage.  In many respects, digital cultural heritage is dependent on some of the same systems, standards and tools used by the entire digital preservation community.  Practitioners in the humanities, arts, and information and social sciences, however, are increasingly beginning to question common assumptions, wondering how the development of cultural heritage-specific standards and best practices would differ from those used in conjunction with other disciplines.  As many in the humanities and arts point out, digital cultural heritage materials encompass a dizzying array of formats, genres, disciplines, and institution repository types, which bring with them unique intellectual and technical challenges for their preservation.  Most would agree that preserving the bits alone is not enough, and that a concerted, continual effort is necessary to steward these materials over the long term.

We might think of the development of digital cultural heritage standards and practices as a two-way street.  On the one hand, a humanistic or artistic perspective may challenge digital preservation norms that often originate from industry leaders in the private sector or disciplines that seem distant, or even antithetical, to the needs of the humanities and arts user communities.  On the other hand, by elevating the needs of this user community – from artists and scholars, to educators, curators, media makers and students – we may be able to influence a combination of public and private interests to support more targeted user-centric development.  For example, we are just now beginning to consider how adjustments to conventional storage architectures, such as use of abstraction and distributed, cloud-based services, may result in radically different means of organizing, sharing and visualizing cultural heritage data.

The humanities and arts can also bring heightened clarity or awareness of practices and concepts — including selection of content, appraisal, and authenticity — inherent to all digital preservation.  As pressure mounts to ingest exponentially increasing amounts of data, repository stewards are facing difficult decisions to streamline the acquisition and preservation of their collections.  By their nature, the humanities encourages critical interrogation of selection practices, even as they move toward automation.  Similarly the appraisal of digital data by preservationists and users alike has exceeded the capacity of human intervention alone, which has necessitated creative solutions to generating metadata, mining and visualizing “big data,” and accessing complex audiovisual and interactive media.

For the 2014 Digital Preservation Conference hosted by NDSA, the two of us, on behalf of the NDSA Arts and Humanities Content Working Group, will lead an open discussion to identify pervasive issues found in digital cultural heritage that in time may lead to standards and practices adopted widely by those working in museums, archives, libraries, arts organizations, universities and beyond.

In many ways, the session will serve as a follow-up to the 2012 Digital Preservation plenary session “Preserving Digital Culture.” During that session, Megan Winget, then at the University of Texas at Austin, characterized the preservation of digital cultural heritage as a series of “wicked problems,” each of which is “novel and unique” and for which no single solution is “right or wrong, but [only] better and worse.”  If there was one message from the session, it was that work in digital cultural heritage requires a creative balance of intellectual, theoretical, technical, social, and aesthetic matters.  Building upon a spate of initiatives, conferences and studies in recent years, this year’s session will pose whether and how we can both embrace the novel properties intrinsic to each work or collection, while investigating the possibility of developing shared practices and standards.

At the heart of the discussion we’ll pose this question:  What elements contribute to a successful research and development project in digital cultural heritage that results in the adoption of standards and practices?  While it may seem obvious that an interdisciplinary project team comprised of members with diverse backgrounds ought to be a given, finding just the right balance – not to mention resolving differences in methodologies, vocabulary, and theories — may seem more elusive.  Expanded adoption of a new standard or practice requires significant buy-in from the community by tapping into an ever-evolving scaffolding of knowledge, data, case studies, education and tools in order to sustain continued growth and investment.  In short, a more organized and concerted effort is needed, which historically has proven difficult in the arts and humanities-related preservation fields.

The second half of the discussion will move toward areas of current or possible future interest.  The recent work underway by a team assembled by the Smithsonian in the area of time-based media and digital art can serve as a model in building a collaborative, on-the-ground framework for research and development.  Similarly, a series of conferences investigating the preservation of software, including Preserving.exe, has revealed the importance of integrating diverse voices from the cultural heritage community.  Other areas open for discussion that may benefit (or have already benefited) from enhanced attention from the humanities and arts communities may include digital forensics, web archiving, mass digitization, sustainability or metadata schema development, to name just a few.

In true humanistic fashion, the forum will likely raise more questions than provide answers.  Nonetheless, as session chairs we hope that a framework for future discussion and action will emerge.  This blog posting, therefore, is intended to serve as an open invitation to the NDSA community and beyond to offer ideas, discussion points, challenges, areas of research and examples that may be submitted in the comments section below, and which will help inform the in-person session in July.  For those unable to attend the conference, the chairs will make any session materials accessible afterwards.

Categories: Planet DigiPres

The SCAPE Project video is out!

Open Planets Foundation Blogs - 3 July 2014 - 8:31am

Do you want a quick intro to what SCAPE is all about?

Then you should watch the new SCAPE video!

The video will be used at coming SCAPE events, such as SCAPE demonstration days and workshops, and it will be available on Vimeo for everyone to use. You can help us disseminate the SCAPE video by tweeting using this link.



Standard tools become overtaxed... ...SCAPE addresses these challenges


The production of this SCAPE video was part of the final project presentation. The idea behind the video is to explain what SCAPE is about to both technical and non-technical audiences: in other words, to convey the overall outcomes and unique selling points of the project in a short and entertaining video. But how do you condense a four-year project with 19 partners, and lots of different tools and other SCAPE products, into just two minutes?

We started by formulating SCAPE's overall messages and unique selling points, from which a script was distilled. This formed the basis for the voice-over text and a storyboard, after which the animation work began. There were lots of adjustments to be made in order to stay close to the actual SCAPE situation. It was great that SCAPErs from different areas of the project were kind enough to review what we in the Take Up team came up with. 

Please take a look and use this video to tell everyone how SCAPE helps you to bring your digital preservation into the petabyte dimension!


SCAPE Project - Digital Preservation into the Petabyte Dimension from SCAPE project on Vimeo.

Preservation Topics: SCAPE
Categories: Planet DigiPres

End-of-Life Care for Aging, Fragile CDs and Their Data Content

The Signal: Digital Preservation - 2 July 2014 - 1:40pm

Institutions and individuals that use CDs as a storage medium are now concerned because information technologists have deemed the medium unsuitable for long-term use. As a result, institutions are racing to get the data off the discs as quickly and safely as possible and into a more reliable digital storage environment.

Two years ago, Butch Lazorchak wrote about the Library of Congress’s Tangible Media Project and its efforts to transfer data off CDs for just that reason. And last month The Atlantic profiled Fenella France, chief of preservation research and testing at the Library of Congress, about the Library’s research into the physical and chemical properties of CDs and how CDs age.

At the upcoming Digital Preservation 2014, John Passmore, archives manager at New York Public Radio, will give a presentation about NYPR’s experiences in transferring the contents of their archive of over 30,000 CD-Rs. Passmore said that some of the older discs exhibit “end-of-life symptoms,” which creates an urgency at NYPR to move the content off the CD-Rs and into the organization’s asset management system. [Trevor Owens interviewed Passmore earlier this year on the subject.]

John Passmore, NYPR.

NYPR is gathering statistical material in the course of their data transfers, and they are running forensics tools to generate data so that researchers can look for possible correlations between disc failures and specific errors. The archive uses commercial tools and custom software to automate the process.

Passmore said that there is a lot to be learned regarding the chemical composition and materials of the discs, the brands, the batches, and the number and severity of errors encountered during the process.

As we learn more about the physical and digital properties of CDs, it may be possible to perform triage on an at-risk collection. An archive with a dauntingly large collection may be able to evaluate a batch of discs and sort them by their relative degradation and stability, essentially creating piles, for example, of discs that are “stable for now” and “high risk of inaccessibility.”

Passmore said, “Our hope is that by sharing our data at this presentation, other organizations can learn how to better assess the long-term storage of their CD-Rs.”

Categories: Planet DigiPres

Introducing Flint

Open Planets Foundation Blogs - 2 July 2014 - 12:53pm

Hi, this is my first blog post in which I want to introduce the project I am currently working on: Flint.


Flint (File/Format Lint) has developed out of DRMLint, a lightweight piece of Java software that makes use of different third-party tools (Preflight, iText, Calibre, Jhove) to detect DRM in PDF and EPUB files. Since its initial release we have added validation of files against an institutional policy (making use of Johan’s pdfPolicyValidate work), restructured it to be modular and easily extensible, and found ourselves having developed a rather generic file-format validation framework.  

what does Flint do?

Flint is an application and framework to facilitate file/format validation against a policy. Its underlying architecture is based on the idea that file/format validation nearly always has a specific use case with concrete requirements, which may differ from, say, validation against the official industry standard for a given format. Below we discuss the principal ideas we've implemented in order to meet such requirements.

The code centres on individual file-format modules, and thus takes a different approach from FITS; for example, the PDF module makes use of its own code and external libraries to check for DRM. Creating a custom module for your own file formats is relatively straightforward.

The Flint core and modules can be used via a command-line interface, a graphical user interface, or as a software library. A MapReduce/Hadoop program that uses Flint as a software library is also included.

The following focuses on the main features:


The core module provides an interface for new format-specific implementations, which makes it easy to write a new module. An implementation is provided with a straightforward core workflow from input file to standardised output results. Several optional features (e.g. Schematron-based validation, and exception and time-out handling when validating corrupt files) help in building a robust validation module.
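To make the idea of that core workflow concrete, here is a hypothetical sketch in Java. Note that the names used (FormatCheck, CheckResult, FormatModule) are invented for illustration only and are not Flint's actual API; the point is simply how a module turns an input file into standardised results while absorbing failures on corrupt files.

```java
// Hypothetical sketch only: these names are invented for illustration
// and are not Flint's actual API.
import java.io.File;
import java.util.List;

interface FormatCheck {
    String name();                          // e.g. "DRM", "well-formedness"
    boolean passed(File input) throws Exception;
}

class CheckResult {
    final String checkName;
    final boolean passed;
    CheckResult(String checkName, boolean passed) {
        this.checkName = checkName;
        this.passed = passed;
    }
}

class FormatModule {
    private final List<FormatCheck> checks;
    FormatModule(List<FormatCheck> checks) { this.checks = checks; }

    // Core workflow: run every check, converting exceptions thrown
    // while validating (e.g. on corrupt files) into a failed result
    // rather than aborting the whole run.
    List<CheckResult> validate(File input) {
        return => {
            try {
                return new CheckResult(, c.passed(input));
            } catch (Exception e) {
                return new CheckResult(, false);
            }
        }).collect(java.util.stream.Collectors.toList());
    }
}
```

A format-specific module would then supply its own list of checks (DRM detection, well-formedness, policy conformance) while reusing this standardised input-to-results pipeline.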


Visualisation of Flint's core functionality; a format-specific implementation can have domain-specific validation logic at code level (category C) or at configuration level (categories A and B). The emphasis is on a simple workflow from input file to standardised check results that bring everything together.

Policy-focused validation

The core module optionally includes a Schematron-based, policy-focused validator. 'Policy' in this context means a set of low-level requirements, in the form of a Schematron XML file, that is validated against the XML output of other third-party programs. In this way domain-specific validity requirements can be customised and reduced to the essentials. For example: does this PDF require fonts that are not embedded?

We make use of Johan’s work on Schematron checks of Apache Preflight output, introduced in this blog post. Using Schematron it is possible to check the XML output from tools, filtering and evaluating it against a set of rules and tests that describe *your* organisation's concrete digital preservation requirements.
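As a toy illustration of what such a rule can look like, the following Schematron pattern flags non-embedded fonts in a tool's XML report. The element and attribute names (`fonts/font`, `@embedded`, `@name`) are invented for illustration and do not reflect Apache Preflight's actual output vocabulary.

```xml
<!-- Hypothetical policy rule: element/attribute names are illustrative,
     not the real Preflight output schema. -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern id="font-embedding">
    <sch:rule context="fonts/font">
      <sch:assert test="@embedded = 'true'">
        Font '<sch:value-of select="@name"/>' is not embedded.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

Each failed assert becomes one policy violation in the validation report, so the policy file alone defines what "valid" means for your organisation.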



Aside from its internal logic, Flint contains wrapper code around a variety of third-party libraries and tools to make them easier to use, ensuring that any logic to deal with them is in one place only:

* Apache PDFBox

* Apache Tika

* Calibre

* EPUBCheck

* iText (note: if this library is enabled, be aware that it is AGPL3-licensed)

These tools all (a) do something slightly different, (b) lack full coverage of the relevant file formats in some respect, or (c) do more than one actually needs. All of them relate more or less to PDF and EPUB validation, as these are the two existing implementations we're working on at the moment.

Format-specific Implementations
  • flint-pdf: validation of PDF files using configurable, Schematron-based validation of Apache Preflight results, plus internal logic and all the tools listed above, with a focus on DRM and well-formedness
  • flint-epub: validation of EPUB files using configurable, Schematron-based validation of EPUBCheck results, plus internal logic and all the tools listed above, with a focus on DRM and well-formedness

NOTE: both implementations are work in progress, but they should be a good guide to implementing your own format-validation module using Flint. It would be easy, for example, to add a Microsoft Office file format module that looks for DRM.



Visualisation of the Flint ecosystem, with different entry points and several format/feature-specific implementations (deep blue: existing ones, baby blue: potential ones); the core, as visualised in Figure 1, connects the different ends of the ‘ecosystem’ with each other


how we are using it

Due to the recent introduction of non-print legal deposit, the British Library is preparing to receive large numbers of PDF and EPUB files. Development of this tool has been discussed with operational staff within the British Library, and we aim for it to be used to help determine preservation risks within the received files.

what’s next

Having completed some initial large-scale testing of a previous version of Flint, we plan on running more large-scale tests with the most recent version. We are also interested in the potential of adding further file-format modules; work is underway on some geospatial modules.

help us make it better

It’s all out there (the Schematron utils are part of our tools collection, and Flint is here); please use it, and please help us to make it better.

Preservation Topics: Characterisation, Preservation Risks, SCAPE
Categories: Planet DigiPres

How much of the UK's HTML is valid?

Open Planets Foundation Blogs - 2 July 2014 - 12:05pm

I thought OPF members might be interested in this UK Web Archive blog post I wrote on format identification and validation of our historical web archives: How much of the UK's HTML is valid?

Preservation Topics: Identification
Categories: Planet DigiPres

Preserving Folk Cultures of the Digital Age: An interview with Folklorist Trevor J. Blank, Pt. 2

The Signal: Digital Preservation - 1 July 2014 - 1:43pm

Trevor J. Blank, assistant professor of communication at the State University of New York at Potsdam

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects leading up to CurateCamp Digital Culture in July. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Part One of this interview appeared on June 30, 2014.

In the first half of my interview with Trevor Blank, we learned about the kinds of Photoshopped memes and online interactions that make up the vernacular web of digital folklore. Today, in this continuation of our Insights Interview, I am excited to explore where the records of digital folklore are and what roles libraries, archives and museums might play in ensuring long term access to those records. Folklorist Trevor J. Blank is an assistant professor of communication at the State University of New York at Potsdam, where he researches the hybridization of folk culture in the digital age with a particular focus on emergent narrative genres and vernacular expression.

Julia: In a recent NPR interview, you point out that “you have to rely on institutions in order to express yourself in the digital medium”, and that people use those commercial institutions for folk expression.  You also recently asked on Twitter, “Can folk culture circumvent institutional constraints?” What do you think is the answer to that question? What role do institutions and individuals each play in creating a folk culture online?

Trevor: Great question! Fundamentally, we rely on institutions for a number of aspects of everyday life: we look to our government to protect us and keep us moving forward as a society; we expect children to learn something valuable when they go to school; we look for law enforcement to ensure that citizens play by the rules, just to name a few. Folk culture–the informal, unofficial expressive dynamics that constitute everyday life within a group–resides outside of these institutions yet it is inherently aware of and shaped by them. The two unavoidably intermingle in the context of modern American life. For instance, connecting to the Internet requires navigating through an institutional barrier, like a cable company or Internet service provider, before one can even begin to engage in vernacular expression online.

To follow that thread, dynamic folk discourse can take place in the comment sections of institutional websites like YouTube. Or, an individual can publish beautiful, original prose (essentially folk expression) on their blog, which may have been built from a template provided by WordPress (which is institutional). The point is that folk expression and institutions are not inherently antagonistic; in fact, they frequently play off one another or become hybridized in the process of generating folklore. A good, illustrative example can be found in the emerging digital tradition of crafting and publishing humorous, fake product reviews on (I have a forthcoming article in the Journal of American Folklore about this very phenomenon!).


Image of Three Wolf Moon t-shirt, which became an internet phenomena through its rating and reviews on Amazon. From user jimgroom on Flickr.

Amazon, the world’s largest retailer, is a huge corporation with significant institutional power and influence, and they partner with independent companies to make countless products available for purchase by consumers. One major, institutionally crafted feature of Amazon comes in the form of product reviews, which are meant to allow regular people (decidedly non-institutional folks) to provide their own feelings and opinions about a given product following a transaction. The idea behind the system is to make consumers feel as though they have a stake in the Amazon community, which is meant to feel outside the institutional boundaries of the site. In theory, by virtue of being  composed by individuals who are unaffiliated with Amazon, the reviews appear to benefit other customers more than they directly benefit Amazon; in practice, they invariably influence consumers to buy a given item through Amazon’s marketplace. Regardless, the popular “vernacular” review feature has become a ubiquitous part of purchasing something from the “institutional” site.

Some crafty individuals soon realized that they could use the familiar format to write incredibly vivid product reviews that ruthlessly mocked certain items for sale, building narrative repertoires through collaborative engagement. The expressive patterns emblazoned in many of these faux reviews arose from their widespread performance and vernacular deliberation online. So, this creative arena was essentially born out of folk culture circumventing the institutional constraints and participation expectations imposed by Amazon, using the site’s official structure to stake out a means for vernacular expression to come through. Amazon is only one example of this back-and-forth, of course, but I’d say it demonstrates that folklore–as it has always done before–will find a way to rise above institutional constraints in the digital age. Identifying how that is accomplished is a particularly compelling aspect of studying contemporary folklore.

Julia: In the same interview, you argue that the internet can be a means for preserving folklore. While born digital content may seem ephemeral, you note that “it is nevertheless able to be archived in a very vibrant way.” What makes a particular archive “vibrant”?

Trevor: I think the criteria is probably subjective from one individual to the next. Personally, I find the most vibrant archived folklore on (and from) the Internet to stem from vernacular discourse that proliferates in response to an event or phenomenon that has been widely covered in the mass media, such as natural disasters, acts of terrorism, and celebrity sex scandals, among others. In the digital age, practically every form of social media constantly beckons individuals to contribute new content based on their own thoughts about what’s going on in the world (both locally and globally). As you might suspect, many folks happily oblige, posting pictures, sharing news stories, uploading video clips, writing personal updates and perpetually commenting on their peers’ (and their own) offerings. Thus, when a news event attracts excessive media coverage, we start to see jokes, stories, rumors, rants, memes, conspiracy theories, etc. fly through the digital ether right away. Coming across archived discussion forums, virtual community deliberations, circulated image macros, old listservs or even long abandoned tweets reveal so much about a salient moment in time where people turned to one another to process the gravity of their living contexts.

But beyond archived vernacular discourse, I’m also very interested in tracing the evolution of vernacular expression in online settings in order to demonstrate the traditionality of emergent forms and patterns. So, for example, I see special value in looking back at how something like visual parodies made in response to the 9/11 Tourist Guy hoax seem to be thematically present in the creative manipulations of the famous Obama Situation Room photo in an effort to get a better sense of how people use folk knowledge about popular culture and existing digital parody traditions to artfully rebrand how a powerful image is subsequently perceived in the present.

Julia: If librarians, archivists and curators wanted to learn more about approaches like yours what examples of other scholars’ work would you suggest?

Trevor: I’m glad to say that there are a number of folklore scholars out there who are doing really great work in studying folklore and folk culture in the digital age. Robert Glenn Howard, of course, has been prolific. Anthony Buccitelli is another scholar who is heavily invested in the study of folklore and new media. Ever the renaissance men, Simon Bronner and Bill Ellis have each contributed provocative and important research in this and numerous other areas as well. I’ve never read anything by Lynne S. McNeill that I didn’t absolutely love. Andrea Kitta has also introduced really insightful scholarship on risk perception and public health concerns with an eye towards the influence of technologically-mediated communication. Merrill Kaplan recently authored a fantastic essay about the curation of tradition online for Tradition in the Twenty-First Century: Locating the Role of the Past in the Present, which Robert Glenn Howard and I edited. Tok Thompson and Elizabeth Tucker have each published several excellent think pieces. Russell Frank certainly shares my interest in documenting the relationship between digital folklore and mass media institutions. Outside of folklore studies, I’d say that the work of Nancy Baym matches up well with my approaches and interests. The open access e-journal New Directions in Folklore is another source that has published a number of thoughtful articles emphasizing digital culture in recent years.

Julia: Could you tell us a bit about the kinds of digital primary sources folklorists are using to study culture on the web? Do you have a sense of how they are likely collecting and organizing these materials? I ask, in part, because many folklife collections in archives are built around acquiring “ethnographic field collections” and I am curious to learn a bit about what the born digital equivalents of those might be in contemporary study of the web.

Trevor: Folklorists use a variety of sources to study the Internet, but I’d say most approach finding and engaging primary sources the same way they would with face-to-face communication settings. That is, they gravitate towards communities (from those centered around fandom to Christian fundamentalists who congregate to passionately discuss shared and contrasting religious beliefs) and other major intersections of vernacular expression, including narrative-based wikis; hoaxes, rumors, and legends spread by email and social media; and even the comments posted in response to articles and videos (not to mention their own newsfeeds on Facebook).

Ethnographic methods are often generously utilized. Those of us who primarily specialize in the study of Internet folklore often use each other as sounding boards for interesting texts and websites we come across. Two websites that I (and several other folklorists) frequently visit are, the urban legends reference page, and KnowYourMeme since they are both such excellent databases for comparing and contextualizing new and recycled narratives and visual folklore circulating online. I also really like examining Twitter feeds and public posts on Facebook to gather general (and occasionally specific) ideas of the major themes and impressions individuals choose to performatively share with peers.

Julia: I realize collecting and preserving content isn’t your area, but from your perspective as a folklorist, what kinds of online content do you think is the most critical for cultural heritage organizations to preserve for folklorists of the future to study this moment in history? It’s a really broad question, so feel free to take it in any number of directions. Are there particular kinds of digital content you think need to be focused on? Are there particular sub cultures or movements that aren’t getting enough attention?

Trevor: I think that the first major hurdle has already been passed just by simply getting the majority of folklorists to accept the study of folklore and technologically-mediated communication as a valuable area of inquiry. There is also no longer any controversy over whether the Internet should be conceived as a “field” in which legitimate fieldwork can take place. In this context, it’s easier for folklorists to meaningfully contribute to the preservation of cultural heritage as it manifests online. As you pointed out, ethnographic field collections are always sought after when organizing folkloric material for curation. Folklorists are collectors, and I am hopeful that the growing interest in chronicling the changing dynamics of vernacular expression in the digital age will yield a greater collective commitment to the process of preserving cultural heritage.

What that will entail remains to be seen, although it’s clear that things like memes, virtual communities (broadly conceived), creative narrative text genres and the websites/threads that host them likely present the richest possibilities for expansive collection and annotation. Then again, remembering the overarching aesthetic trends that graced the web domains of yesteryear shouldn’t be neglected either (the Internet Archive Wayback Machine helps on that front). There’s always a chance that subcultures and movements may slip through the cracks, especially against such an ever-changing, hybridized backdrop. The real challenge for folklorists will be to keep up and stay motivated, allowing individuals and communities to guide their scholarly gazes to the emically important dimensions of contemporary folk culture.

Categories: Planet DigiPres

Understanding Folk Culture in the Digital Age: An interview with Folklorist Trevor J. Blank, Pt. 1

The Signal: Digital Preservation - 30 June 2014 - 3:42pm

Trevor J. Blank, assistant professor of communication at the State University of New York at Potsdam

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects leading up to CurateCamp Digital Culture in July. This is the first of a series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

When most people think of “folklore,” they tend to think of fairy tales and urban legends. Trevor Blank thinks of photoshopped memes and dark humor. Folklorist Trevor J. Blank is an assistant professor of communication at the State University of New York at Potsdam, where he researches the hybridization of folk culture in the digital age with a particular focus on emergent narrative genres and vernacular expression. In this installment of the Insights Interview series, I talk with Trevor about his approach to studying folklore on the internet. Tomorrow, in part two of this interview, I talk with Trevor about the implications of this line of thinking for institutions working to collect and preserve records of folk culture in the digital age.

Julia: Why does it make sense to approach the web and communication on the web as a folklorist? What do we gain from this approach that we wouldn’t get from other humanities or social science perspectives? Trevor Owens previously interviewed Robert Glenn Howard about his work on the “vernacular web”; do you see your work as being largely in the same vein as Robert’s? Or are there significant differences?

Trevor: Let me begin by expressing my gratitude for the opportunity to chat with you about all of this!

Contrary to popular belief, folklore is just as much, if not more, of an agent of the present as it is of the past. As a folklorist, I am interested in vernacular expression; understanding how people forge traditions, share knowledge, and make meaning in everyday life is central to my work. For me, that centrally involves working with new media technologies and observing the ways in which they’re implemented by individuals and groups in everyday life.

It is critical to document the myriad ways in which folk culture adapts, influences, rejects and responds to changing cultural tides, especially amid the exponential growth of computer-mediated communication technologies. Folklorists are uniquely positioned to comment on emergent forms of communicative expression, noting traditionality and innovation in seemingly new material while contextualizing and interpreting the forms and meanings behind its deployment. Whereas other humanities and social science fields may favor statistical analysis, data mining and text collection/comparison, folklorists employ interdisciplinary approaches, often using ethnographic methods, that strive for a more holistic representation of research subjects. At the end of the day, the emphasis remains on individuals and groups – even if they’re united in an online venue.

Now, not all folklorists have always been keen on studying Internet folklore, preferring instead to focus their energies on oral traditions and expressive culture observed in face-to-face communication. For that reason, Robert Glenn Howard’s work on the vernacular web was a revelation to me, and continues to greatly influence my thinking on approaching my own research on folk culture in the digital age. His work on electronic hybridity also directly informed and prompted my own research on the subject. Like Howard, I’m interested in underscoring everyday communication and interaction by analyzing the interconnected webs of meaning and participatory actions that comprise it. Our individual interests have led us to different research projects (and sometimes different conclusions), but I’d say that my work definitely operates in conversation with the framework he has so masterfully crafted over the years.

Julia: At this point you have been studying folklore and the internet since 2007. How has your approach and perspective developed over time? What (if any) changes have you observed in how folk culture objects are created and disseminated online?

Trevor:  I was initially drawn to studying folklore and the Internet as a graduate student, precisely because it seemed that only a small handful of folklorists were deeply invested in this area of inquiry at the time (Robert Glenn Howard being one of them). To my eager eyes, it seemed that there was a lot of expressive content and cultural phenomena that had been inadequately chronicled by folklorists to that point. Indeed, many folklore scholars were skeptical of the value of folklore collected in online settings, as I mentioned. Of course, this was nothing new; there was similar angst over the study of photocopylore, or “Xeroxlore,” before the Internet was commercially available. In any case, I saw an opportunity to contribute something new to the folklore discipline, or at least a chance to invite greater attention to this rich yet neglected area of study. Ultimately, that resulted in my editing of the anthology Folklore and the Internet: Vernacular Expression in a Digital World, which came out in 2009 and featured essays written by a number of tremendous folklore scholars.

Whether it was the book or the passage of time, the study of folklore in the digital age–in all of its iterations–has since come to enjoy a far warmer reception among folklorists, and now many more scholars are contributing new and exciting perspectives on the ever-changing digital landscape. Back then, my standard approach and perspective was to carefully explain why digital folklore was every bit as legitimate as its face-to-face correlates and passionately advocate for its further study to anyone who would listen. As a result, most of my early publications end with a rant about the need for folklorists to jump into the digital fray! Fortunately, that is no longer necessary these days, which is a big deal. I now focus my energies on developing new anthologies, special issues of peer-reviewed journals and my own independent case studies and theoretical research aimed at broadening the scholarly literature on folklore in the digital age as well as its profile.


Example of a Demotivational poster. Adversity by user cadsonline on Flickr.

Since 2007, I’ve noticed definite shifts in how folklore and various elements of folk culture are created and transmitted online. For one, there has been a greater shift towards “visuality,” meaning that a greater part of the folkloric content we find in circulation online tends to have some kind of eye-catching component that renders it traditional in the context of vernacular expression. Image macros (what most folks simply conceptualize as “memes”), humorous Facebook posts, de-motivational posters, etc. all utilize the online medium’s increasingly proficient ability to host and share visual data quickly and effectively.

By the same token, Vine and Snapchat have also spawned new expressive modes – something that wouldn’t have been feasible a short time ago (even in the YouTube era). What we see now are people adapting to the new expressive tools they have at their disposal. These new tools haven’t displaced older ones necessarily, but they have undoubtedly drawn greater attention to a burgeoning trajectory in the dynamics of technologically-mediated communication. Another example of a popular and developing genre of Internet folklore comes from “creepypastas,” or short horror stories, often paired with a corroborating image or two, that are shared with the intent of gleefully creeping out readers. Here again, many of the stories and visuals echo expressive genres and patterns found in oral traditions, only this time they’ve made it to the digital realm. I think this powerfully speaks to the medium’s adaptive capabilities when it comes to contemporary folklore.

Julia: In your most recent book The Last Laugh: Folk Humor, Celebrity Culture, and Mass-Mediated Disasters in the Digital Age, you focus on the concept of “hybridization.” You define hybridization as “the blending of analog and digital forms in the course of their dissemination and enactment,” which you argue helps people “adapt to the progressing culture by merging the old and familiar with the emergent capabilities of a new medium.” Can you expand a bit more on hybridization in a folklore context? Why is it such an important concept for you?


“Snapchat silliness” by user jessycat_techie on Flickr.

Trevor: Sure! Folklore thrives through the process of repetition and variation, meaning that certain expressive patterns or traits continuously and consistently “repeat” over time (demonstrating/establishing traditionality) and they will also “vary” or undergo some adaptive modification in the course of their dissemination, usually to suit a new context. This is how and why so many people can recall legends or ghost stories that share many similar motifs yet contain components that render them distinct from other versions.

Take the legend of “Bloody Mary,” for example: most versions of the narrative also entice listeners, usually children and adolescents, to carry out a ritual involving a mirror, though the consequences of completing the task range from a friendly apparition appearing to being disemboweled by said apparition. While the themes are related, the specific details of the story change from teller to teller, context to context. My point in mentioning this tale is to draw attention to the adaptive capabilities of folklore. As communicative beings, we tailor our repertoires to befit a particular context and look for opportunities to maximize our abilities to convey information effectively.

In the context of vernacular expression on the Internet, individuals rely on their oral/face-to-face/analog conceptualizations of language and communication to inform their corresponding actions in the digital realm. No matter how hard you stare at an abstract body of text online, you won’t always be able to see if the words were infused with the hint of a smile, a sarcastic crack, or genuine anger. As a remedy, people started incorporating emoticons or initialisms like “LOL” to convey laughing out loud or a lighthearted chuckle in online settings. Then, curiously, some folks started exclaiming “LOL” (phonetically as one word, not L-O-L) or “lulz” out loud, in face-to-face communication settings, to convey mild amusement among peers.

These kinds of happenings, which are quite common, reveal the hybridization of folk culture. Because technologically-mediated communication is so ubiquitously and integrally rooted in everyday life (for most individuals), the cognitive boundaries between the corporeal and virtual have been blurred. When we send text messages to a friend or family member, we typically think “I’m sending this text” instead of “these glowing dots of phosphorus are being converted into tiny signals and beamed across several cell towers before being decoded and received on a peer’s phone.” The message is perceived as an authentic extension of our communicative selves without much thought over the medium in which it was sent.

On a more nuanced level there are obvious differences between oral and electronic transmission, but both formats are often equally relied upon and valued for everyday communication while simultaneously shaping each other’s forms. This hybridization is incredibly important because it entails the reciprocal amalgamation of tradition, innovation and adaptation of folk culture across face-to-face and digital venues. As technology continues to improve at exponential rates and more sophisticated opportunities for electronic transmission and digital expression become available, this boundary blurring hybridization will become increasingly pronounced and will continue to complicate existing notions of face-to-face communication and folk culture. This isn’t automatically a bad thing, but it does stress the need for continued monitoring in order to more accurately capture the bustling dynamics of contemporary folklore.

Part two of this interview will appear on July 1, 2014.

Categories: Planet DigiPres

OOXML: The good and the bad

File Formats Blog - 27 June 2014 - 12:05pm

An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is in. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.

On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.

Tagged: Microsoft, standards, XML
Categories: Planet DigiPres

SCAPE Demo Day at Statsbiblioteket

Open Planets Foundation Blogs - 27 June 2014 - 8:38am

Statsbiblioteket (The State and University Library, Aarhus, hereafter called SB) welcomed a group of people from The Royal Library, The National Archives, and Danish e-Infrastructure Cooperation on June 25, 2014. They were invited for our SCAPE Demo day where some of SCAPE’s results and tools were presented. Bjarne S. Andersen, Head of IT Technologies, welcomed everybody and then our IT developers presented and demonstrated SB’s SCAPE work.

The day started with a nice introduction to the SCAPE project by Per Møldrup-Dalum, including short presentations of some of the tools which would not be presented in a demo. Among other things, this triggered questions about how to log in to Plato – a Preservation Planning Tool developed in SCAPE.

Per continued with a presentation about Hadoop and its applications. Hadoop is a large and complex technology; the decision to use it was made before the project started. This has caused some discussion during the project, but Hadoop has proven really useful for large-scale digital preservation. Hadoop is available both as open source and in commercial distributions. Its core concept is the MapReduce algorithm, presented in the 2004 paper “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat. This paper prompted Doug Cutting and Mike Cafarella to implement Hadoop, and they published their system under an open source license. Writing jobs for Hadoop has traditionally been done in the Java programming language, but in recent years several alternatives have been introduced, e.g. Pig Latin and Hive. Other interesting elements in a Hadoop cluster are HBase, Mahout, Giraph, ZooKeeper and a lot more. At SB we use an Isilon scale-out NAS storage cluster, which enables us to run many different experiments on the four compute nodes, each with 96 GB RAM and a 2 Gbit Ethernet interface. This setup potentially makes the complete online storage of SB reachable for the Hadoop cluster.
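As a toy illustration of the MapReduce idea (plain single-process Python, not actual Hadoop code): the map phase emits key/value pairs, a shuffle groups them by key, and the reduce phase aggregates each group – here counting MIME types, the same kind of job the web-archive experiment described later runs at scale.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Map: emit one (mime_type, 1) pair per harvested document.
    for doc in records:
        yield (doc["mime"], 1)

def reduce_phase(pairs):
    # Shuffle: sort by key so identical keys become adjacent,
    # then Reduce: sum the counts within each MIME-type group.
    for mime, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (mime, sum(count for _, count in group))

docs = [{"mime": "text/html"}, {"mime": "image/jpeg"}, {"mime": "text/html"}]
counts = dict(reduce_phase(map_phase(docs)))
# counts == {"image/jpeg": 1, "text/html": 2}
```

In a real cluster the map and reduce phases run on different nodes and the shuffle moves data between them; the single-process version only shows the data flow.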

                                            Sometimes it is hard to fit an elephant in a library


Bolette A. Jurik was next in line and told the story of how Statsbiblioteket wanted to migrate audio files using Hadoop (and Taverna… and xcorrSound Waveform Compare). The files were to be migrated from mp3 to wav. Checking this collection in Plato gave us the result ‘Do nothing’ – meaning leave the files as mp3. But we still wanted to perform the experiment – to test that we have the tools to migrate, extract and compare properties, validate the file format and compare the content of the mp3 and wav files, and that we can create a scalable workflow for this. We did not have a tool for the content comparison, so we had to develop one: xcorrSound Waveform Compare. Its output shows which files need special attention – as an example, one of the files failed the waveform comparison although it looked right. This was due to a lack of content in some parts of the file, so Waveform Compare had no sound to compare! Bolette also asked her colleagues to create "migrated" sound files with problems that the tool would not find – read more about this small competition in this blog post.
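The principle behind a waveform comparison of this kind can be sketched in a few lines of plain Python – sliding one signal across the other and scoring the normalized overlap. This is only an illustration of cross-correlation, not how xcorrSound is actually implemented:

```python
import math

def best_correlation(a, b, max_lag=32):
    """Return (lag, score): the normalized dot product over the
    overlapping samples, maximized across candidate lags.
    A score near 1.0 means the waveforms match at that offset."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(a[i], b[i - lag]) for i in range(len(a)) if 0 <= i - lag < len(b)]
        if not pairs:
            continue
        dot = sum(x * y for x, y in pairs)
        na = math.sqrt(sum(x * x for x, _ in pairs))
        nb = math.sqrt(sum(y * y for _, y in pairs))
        if na == 0 or nb == 0:
            continue  # silence in the overlap: nothing to compare
        score = dot / (na * nb)
        if score > best[1]:
            best = (lag, score)
    return best

wave = [0.0, 1.0, 0.0, -1.0] * 16   # toy 'original' signal, 64 samples
lag, score = best_correlation(wave, wave)
# identical signals score ~1.0; a mismatched pair scores much lower
```

The `na == 0 or nb == 0` guard corresponds to the failure mode mentioned above: if one side of the overlap is pure silence there is literally no sound to compare.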

Then Per was up for yet another presentation – this time describing the experiment: identification and feature extraction of web archive data based on Nanite. The test was to extract different kinds of metadata (like authors, GPS coordinates for photographs, etc.) using Apache Tika, DROID (and libmagic). The experiment was run on the Danish Netarchive (archiving of the Danish web – a task undertaken by The Royal Library and SB together). For the live demo a small job with only three ARC files was used – processing all of the 80,000 files in the original experiment would have taken 30 hours. Hadoop generates loads of technical metadata that enables us to analyse such jobs in detail after the execution. Per’s presentation was basically a quick review of what is described in the blog post A Weekend With Nanite.

An analysis of the original Nanite experiment was done live in Mathematica, presenting a lot of fun facts and interesting artefacts. For one thing, we counted the number of unique MIME types in the 80,000 ARC files (260,603,467 individual documents):

  • 1384 different MIME types were reported by the HTTP server at harvest time,
  • DROID counted 319 MIME types,
  • Tika counted 342 MIME types.

A really weird artefact was that approx. 8% of the identification tasks were complete before they started! The only conclusion to this is that we’re experiencing some kind of temporal shift that would also explain the great performance of our cluster…

Two years ago SB concluded a job that had run for 15 months: 15 months of FITS characterising 12 TB of web archive data. The experiment with Nanite characterised 8 TB in 30 hours. Overall, this extreme shift in performance is due to our involvement in the SCAPE project.

After sandwiches and a quick tour to the library tower Asger Askov Blekinge took over to talk about Integrating the Fedora based DOMS repository with Hadoop. He described Bitmagasinet (SB’s data repository) and DOMS (SB’s Digital Object Management System based on Fedora) and how our repository is integrated with Hadoop.

SB is right now working on a very large project to digitize 32 million pages of newspapers. The digitized files are delivered in batches and we run Hadoop map/reduce jobs on each batch to do quality assurance. An example is to run Jpylyzer on a batch (Map runs Jpylyzer on each file, Reduce stores the results back in DOMS). The SCAPE way to do it includes three steps:

  • Staging – retrieves records
  • Hadooping – reads, works on and writes new updated records
  • Loading – stores updated records in DOMS

The SCAPE Data model is mapped with the newspapers in the following way:

                                           SCAPE Data Model mapped with newspapers

SCAPE Stager/Loader creates a sequence file which can then be read and each record updated by Hadoop and after that the records are stored in DOMS.

The last demo was presented by Rune Bruun Ferneke-Nielsen. He described the policy-driven validation of JPEG 2000 files, based on Jpylyzer and performed on SB’s newspaper digitization project. The newspapers are scanned from microfilm by a company called Ninestars and then quality assured by SB’s own IT department. We need to make sure that the content conforms to the corresponding file format specifications and that the file format profile conforms to our institutional policies.
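Such a policy check can be sketched by parsing a jpylyzer-style XML report with the standard library. The element names below (`isValidJP2`, `levels`) and the flat structure are illustrative assumptions – real jpylyzer output differs between versions and the policy-relevant properties sit deeper in the report:

```python
import xml.etree.ElementTree as ET

# Illustrative, simplified jpylyzer-style report; treat the tag
# names and structure as assumptions, not the real schema.
REPORT = """
<jpylyzer>
  <isValidJP2>True</isValidJP2>
  <properties>
    <levels>5</levels>
  </properties>
</jpylyzer>
"""

def passes_policy(report_xml, min_levels=5):
    """True when the file is valid JP2 and meets an institutional
    policy threshold (here: a minimum number of resolution levels)."""
    root = ET.fromstring(report_xml)
    valid = root.findtext(".//isValidJP2") == "True"
    levels = int(root.findtext(".//levels", default="0"))
    return valid and levels >= min_levels

# passes_policy(REPORT) -> True
```

Validation (does the file conform to the format spec?) and policy conformance (does its profile match what the institution requires?) are separate questions, which is why the check tests both.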


530,000 image files have been processed within approx. five hours.

We want to be able to receive 50,000 newspaper files per day, and this is more than one server can handle. All access to data for quality assurance etc. is done via Hadoop. Ninestars runs quality assurance before they send the files back to SB, and then the files are QA’ed again in-house.

                                                               Fuel for the afternoon (Photo by Per Møldrup-Dalum)

One of the visitors at the demo is working at The Royal Library with the NetArchive and would like to make some crawl log analyses. These could perhaps be processed by using Hadoop - this is definitely worth discussing after today to see if our two libraries can work together on this.

All in all this was a very good day, and the audience learned a lot about SCAPE and the benefits of the different workflows and tools. We hope they will return for further discussion on how they can best use SCAPE products at their own institutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

NDSR Selects the Next Class of Residents for New York and Boston

The Signal: Digital Preservation - 26 June 2014 - 6:23pm

The National Digital Stewardship Residency program has recently announced the next group of 10 residents selected for this prestigious program.  This Residency program, funded by the IMLS, has just completed its inaugural year, with 10 residents working in various organizations in the Washington, DC area.  The next round of the NDSR will begin in September 2014 and will take place in two cities – New York and Boston.

The NDSR program offers recent master’s program graduates in specialized fields — library science, information science, museum studies, archival studies and related technology — the opportunity to gain valuable professional experience in digital preservation.  At the same time, it offers the host institutions a highly qualified individual who can provide focused effort on digital preservation needs for the institution (see a recent blog post on this year’s host institutions and their projects).  The selection process for these residencies is very competitive – and all who are selected have proven to be highly skilled and possess a strong commitment to the field of digital preservation.

The residents chosen for both cities, five each for New York and Boston, are listed below.  We will also be posting more information on The Signal about these residency projects in the months to come.  Congratulations to all!

For New York City

  • Karl-Rainer Blumenthal
    Host institution: New York Art Resources Consortium
  • Peggy Griesinger
    Host institution: Museum of Modern Art
  • Julia Kim
    Host institution: New York University Libraries
  • Shira Peltzman
    Host institution: Carnegie Hall
  • Victoria (Vicky) Steeves
    Host institution: American Museum of Natural History

For Boston

  • Samantha DeWitt
    Host Institution:  Tufts University
  • Rebecca Fraimow
    Host Institution: WGBH
  • Joey Heinen
    Host Institution: Harvard University
  • Jen LaBarbera
    Host Institution: Northeastern University
  • Tricia Patterson
    Host Institution:  MIT Libraries

There will be a panel discussion on the residency program at this year’s Digital Preservation 2014 meeting to be held July 22-24 in Washington, DC. More details will be available soon about the panel, as well as the meeting itself, so keep an eye out for future blog posts.

Categories: Planet DigiPres

Bulk disk imaging and disk-format identification with KryoFlux

Open Planets Foundation Blogs - 26 June 2014 - 3:15pm
The problem

We have a large volume of content on floppy disks that we know are degrading but which we don't know the value of.

  1. We don't want to waste time/resources on low-value content.
  2. We don't know the value of the content.
  3. We want to be able to back up the content on the disks to ensure it doesn't degrade any more than it already has.
  4. Using unskilled students to do the work is cost-effective.
  5. Unskilled students have often never seen "floppy" disks, let alone know how to distinguish between different formats of floppy disk. So we need a solution that doesn't require them to differentiate (e.g. between Apple formats, PC formats, Amiga, etc.).

The solution
  1. Make KryoFlux stream files using the KryoFlux hardware and software.
  2. Use the KryoFlux software to create every variant of disk image from those streams.
  3. Use the mount program on Linux to mount each disk image using each variant of file system parameter.
  4. Keep the disk images that can be mounted in Linux (as that ability implies that they are in the right format).
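Steps 3 and 4 – probing each candidate file system and keeping images that mount – might look roughly like the sketch below. The file-system list, mount point and mount options are assumptions, and the mount runner is injectable so the keep-if-it-mounts logic can be exercised without root privileges:

```python
import subprocess

# Candidate file-system types to probe; purely illustrative.
CANDIDATE_FS = ["vfat", "hfs", "affs"]  # PC, Apple, Amiga

def run_mount(image, fstype):
    """Try to loop-mount a disk image read-only; True on success.
    /mnt/probe is an assumed, pre-created mount point."""
    cmd = ["mount", "-t", fstype, "-o", "loop,ro", image, "/mnt/probe"]
    ok = subprocess.run(cmd).returncode == 0
    if ok:
        subprocess.run(["umount", "/mnt/probe"])
    return ok

def identify(image, mounter=run_mount):
    """Return the file systems under which the image mounts
    (an empty list means: discard this disk-image variant)."""
    return [fs for fs in CANDIDATE_FS if mounter(image, fs)]
```

Because mount can give false positives (issue 3 below), a real run would keep the result list for review rather than treating a single successful mount as proof of format.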

Very rough beginnings of a program to perform the automatic format identification using the KryoFlux software and Mount are available here.

Issues with the solution
  1. When you use the KryoFlux to create raw stream files it only seems to do one pass of each sector, whereas when you specify the format it will try to re-read sectors that it identifies as "bad sectors" in the first pass. This can lead to it successfully reading those sectors when it otherwise wouldn't. So using the KryoFlux stream files may not preserve as much content as you would get if you specified the format of the disk before beginning the imaging process. I'm trying to find out whether using "multiple" in the output options in the KryoFlux software might help with this.
  2. Mount doesn't support all file systems – though as this improves in the future, the process could be re-run.
  3. Mount can give false positives
  4. I don't know whether there is a difference between disk images created with KryoFlux using many of the optional parameters and those created using the defaults. For example, there doesn't appear to be a difference in the mountability of disk images created where the number of sides is specified versus ones where it is not and defaults to both sides (e.g. for MFM images, the results of both seem to mount successfully).
  5. Keeping the raw streams is costly. A disk image for a 1.44 MB floppy is ~1.44 MB; the stream files are in the tens of MBs.
Other observations:
  1. It might be worth developing signatures for use in e.g. DROID to identify the format of the stream files directly in the future. Some emulators, I believe, can already interact directly with the stream files.
  2. The stream files might provide a way of overcoming bad-sector-based copy protection (e.g. the copy protection used in Lotus 1-2-3 and Lotus Jazz) by enabling the use of raw stream files (which, I believe, contain the "bad" sectors as well as the good ones) in emulators.

Thoughts/feedback appreciated

Preservation Topics: Identification, Preservation Risks, Bit rot, Tools
Categories: Planet DigiPres

Web Archiving and Preserving the Performing Arts in the Digital Age

The Signal: Digital Preservation - 25 June 2014 - 1:08pm

The following is a guest post from Gavin Frome, an intern for the Web Archiving Team at the Library of Congress.


Philadelphia Orchestra at American premiere of Mahler’s 8th Symphony (1916). Source: Wikimedia Commons

Performing artists are by necessity a traveling people. They journey far and wide in the pursuit of their respective crafts, working, learning and weaving a fabric of loose cultural connections that help bind people together. In the digital age, the role and nature of a professional performing artist has been transformed by advances in recording and distribution technologies, which have enabled a wider population of amateurs to assemble audiences and affect the artistic landscape.

In recent years the internet has increasingly become the space where performers and enthusiasts alike go to build communities and present their work or ideas.  The upshot of this development is that creative materials have never been more accessible, allowing unprecedented levels of artistic exchange and influence. Yet for the massive amount of original content being produced, very little of it is being preserved. Like so much other material on the web, performing arts sites are at risk of vanishing in the event that the party responsible for their maintenance is no longer interested in, or capable of, seeing to their upkeep.


Portrait of Martha Graham and Bertram Ross, faces touching, in Visionary recital, June 27 1961. Source: Wikimedia Commons

The Library of Congress’ Music Division, with the help of the Web Archiving Team is currently engaged in a project to preserve performing arts web sites – primarily those relating in some manner to American dance, theater or music – so that the original content they exhibit will be accessible for future generations. Sites containing articles, blogs, videos, pictures, essays, discussions, interviews or any other variety of material that cannot be found elsewhere online or in the print world are collected regularly, with more being added each week.

These include not only websites of artists, but also those of musicologists, scholars, critics, fans and organizations which support their work. The archived collections aren’t available yet, but among the sites that we have received permission to crawl are,, and, which attract both professionals and enthusiasts with their distinct content.

Given the enormous number of sites that contain some type of original product, the Performing Arts Web Archive has a strong quality-over-quantity focus when determining which new sites to add, the goal being to create a collection that may serve as a representation for the larger population of existent sites. Granted, this selection process is fairly subjective; however, the intention is not to exclude worthy sites, but to provide the best possible resources for future researchers. If a site is rich in original content that seems significant and it has not yet been added to the collection, odds are it will be eventually – provided that the site owners grant permission, which is a requirement for all sites in the collection.

I began working on this project only a few months ago during my internship with the Web Archiving Team, but over that time I’ve become quite attached to it. As a cultural historian and lover of art in all its media, I recognize the significant power art holds in shaping one’s cultural and personal identity. To be a part of ensuring the survival of my generation’s creative output is an honor and gives me hope that the brilliant artistry I have witnessed on the internet will not be lost with the passing of memory.

Categories: Planet DigiPres

Will the real lazy pig please scale up: quality assured large scale image migration

Open Planets Foundation Blogs - 24 June 2014 - 9:12am

Authors: Martin Schaller, Sven Schlarb, and Kristin Dill

In the SCAPE Project, the memory institutions are working on practical application scenarios for the tools and solutions developed within the project. One of these application scenarios is the migration of a large image collection from one format to another.

There are many reasons why such a scenario may be of relevance in a digital library. On the one hand, conversion from an uncompressed to a compressed file format can significantly decrease storage costs. On the other hand, particularly from a long-term perspective, file formats may be in danger of becoming obsolete, which means that institutions must be able to undo the conversion and return to the original file format. In this case a quality-assured process is essential: it allows for reconstruction of the original file instances and, in particular, determines when the original uncompressed files can safely be deleted – and only deleting them realizes the advantage of reduced storage costs. Based on these assumptions we have developed the following use case: uncompressed TIFF image files are converted into compressed JPEG2000 files; the quality of the converted file is assured by applying a pixel-for-pixel comparison between the original and the converted image.

For this, a sequential Taverna concept workflow was first developed, which was then modelled into a scalable procedure using different tools developed in the SCAPE Project.

The Taverna Concept Workflow

The workflow input is a text file containing paths to the TIFF files to be converted. This text file is then transformed into a list that allows the sequential conversion of each file, hence simulating a non-scalable process. Before the actual migration commences, the validity of the TIFF file is checked. This step is realized using FITS – a wrapper that applies different tools to extract the identification information of a file. Since the output of FITS is an XML-based validation report, an XPath service extracts and checks the validity information. If the file is valid, migration from TIFF to JPEG2000 can begin; the tool used in this step is OpenJPEG 2.0. In order to verify the output, Jpylyzer – a validator as well as feature extractor for JPEG2000 images created within the SCAPE Project – is employed, and again an XPath service is used to extract the validity information. This step concludes the file format conversion itself, but in order to ensure that the migrated file is indeed a valid surrogate, the file is reconverted into a TIFF file, again using OpenJPEG 2.0. Finally, in a last step, the reconverted and the original TIFF files are compared pixel for pixel using ImageMagick under Linux. Only through the successful execution of this final step can the validity, as well as the possibility of a complete reconversion, be assured.

Figure 1 (above): Taverna concept workflow
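The per-file sequence above can be sketched as the shell commands such a pipeline would issue. The tool names and flags below (FITS's fits.sh, OpenJPEG's opj_compress/opj_decompress, jpylyzer, ImageMagick's compare) are illustrative assumptions based on typical installations, not the exact invocations used in the experiment:

```python
from pathlib import Path

def migration_commands(tiff):
    """Build the shell commands for one file's migrate-and-verify cycle.

    Tool names and flags are illustrative; the actual SCAPE wrappers may differ.
    """
    tiff = Path(tiff)
    jp2 = tiff.with_suffix(".jp2")
    back = tiff.with_name(tiff.stem + "_restored.tif")
    return [
        f"fits.sh -i {tiff}",                       # 1. validate the source TIFF (FITS)
        f"opj_compress -i {tiff} -o {jp2}",         # 2. migrate TIFF -> JPEG2000 (OpenJPEG)
        f"jpylyzer {jp2}",                          # 3. validate the JPEG2000 (Jpylyzer)
        f"opj_decompress -i {jp2} -o {back}",       # 4. reconvert JPEG2000 -> TIFF (OpenJPEG)
        f"compare -metric AE {tiff} {back} null:",  # 5. pixel-wise comparison (ImageMagick)
    ]

cmds = migration_commands("scan_0001.tif")
print(cmds[1])  # opj_compress -i scan_0001.tif -o scan_0001.jp2
```

In the Taverna workflow each of these steps is a separate service invocation, with the XPath extraction steps sitting between them to decide whether the next step may run.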

In order to identify how much time each element of this workflow consumed, we ran a test migrating 1,000 files. Executing the described workflow on the 1,000 image files took about 13 hours and five minutes. Rather unsurprisingly, conversion and reconversion of the files took the longest: the conversion to JPEG2000 took 313 minutes and the reconversion 322 minutes. FITS validation needed 70 minutes and the pixel-wise comparison finished in 62 minutes. Jpylyzer, the SCAPE-developed tool, required only 18 minutes and was thus much faster than the above mentioned steps.

Figure 2 (above): execution times of each of the concept workflow's steps
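The reported step times add up to the overall wallclock figure; a quick cross-check:

```python
# Step timings from the 1,000-file test run, in minutes.
step_minutes = {
    "FITS validation": 70,
    "TIFF -> JPEG2000 (OpenJPEG)": 313,
    "Jpylyzer validation": 18,
    "JPEG2000 -> TIFF (OpenJPEG)": 322,
    "pixel-wise compare (ImageMagick)": 62,
}
total = sum(step_minutes.values())
print(f"{total // 60} h {total % 60} min")               # 13 h 5 min
print(f"{total * 60 / 1000:.1f} s per file on average")  # 47.1 s per file on average
```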

Making the Workflow Scale

The foundation for the scalability of the described use case is a Hadoop cluster containing five Data Nodes and one Name Node (specification: see below). Besides having economic advantages – Hadoop runs on commodity hardware – it also bears the advantage of being designed for failure, hence reducing the problems associated with hardware crashes.

The distribution of tasks across the cores is implemented via MapReduce jobs. A Map job splits the handling of a file: for example, if a large text file is to be processed, the input is divided into several parts, and each part is processed on a different node. A Reduce job then aggregates the outputs of the processing nodes into a single result.
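The split/process/aggregate pattern can be illustrated with a minimal in-memory sketch in Python (the records and keys are invented for illustration only):

```python
from functools import reduce

# Toy input: per-file validation results, one record per line.
lines = ["scan_0001 ok", "scan_0002 invalid", "scan_0003 ok", "scan_0004 ok"]

# Map phase: each node processes its chunk independently, emitting (key, 1) pairs.
mapped = [(line.split()[-1], 1) for line in lines]

# Reduce phase: aggregate the partial results by key into a single output.
def combine(acc, pair):
    key, count = pair
    acc[key] = acc.get(key, 0) + count
    return acc

totals = reduce(combine, mapped, {})
print(totals)  # {'ok': 3, 'invalid': 1}
```

Hadoop does the same thing at scale: the map phase runs in parallel on the nodes holding the data blocks, and the framework shuffles the intermediate pairs to the reducers.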

But writing MapReduce jobs is a complex matter, which is why Apache Pig is used. Pig was built for Hadoop and translates a set of commands written in a language called “Pig Latin” into MapReduce jobs, thus making the handling of MapReduce jobs much easier or, as Professor Jimmy Lin described the powerful tool during the ‘Hadoop-driven digital preservation Hackathon’ in Vienna, easy enough “… for lazy pigs aiming for hassle-free MapReduce.”
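To give a flavour of the abstraction, here is an illustrative Pig Latin fragment (pseudocode; the relation names and the validate() UDF are hypothetical, not part of our actual workflow):

```
-- Illustrative Pig Latin only; relation names and the validate() UDF are hypothetical.
files   = LOAD 'tiff_paths.txt' AS (path:chararray);
checked = FOREACH files GENERATE path, validate(path) AS valid;
good    = FILTER checked BY valid == true;
STORE good INTO 'validated_paths';
```

Each of these statements is compiled by Pig into one or more MapReduce jobs, so the author never writes map or reduce functions by hand.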

Hadoop HDFS, Hadoop MapReduce and Apache Pig make up the foundation of the scalability on which the SCAPE tools ToMaR and the XPath service are based. ToMaR wraps command line tasks for parallel execution as Hadoop MapReduce jobs; in our case these are the executions of FITS, OpenJPEG 2.0, Jpylyzer and ImageMagick. As a result, these tools can be executed simultaneously on several nodes, which has a great impact on execution times, as Figure 3 (below) shows.

The blue line represents the non-scalable Taverna workflow: the time needed for file migration increases in proportion to the number of files converted. The scalable workflow, represented by the red line, shows a much smaller increase, suggesting that scalability has been achieved. This means that, by choosing an appropriate cluster size, it is possible to migrate a given number of image files within a given time frame.

Figure 3 (above): Wallclock times of concept workflow and scalable workflow

Below is the specification of the Hadoop cluster, where the master node runs the jobtracker and namenode/secondary namenode daemons, and the worker nodes each run a tasktracker and a datanode daemon.

Master node: Dell Poweredge R510

  • CPU: 2 x Xeon E5620@2.40GHz
  • Quadcore CPU (16 HyperThreading cores)
  • RAM: 24GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 3 x 1TB DISKs; configured as RAID5 (redundancy); 2TB effective disk space

Worker nodes: Dell Poweredge R310

  • CPU: 1 x Xeon X3440@2.53GHz
  • Quadcore CPU (8 HyperThreading cores)
  • RAM: 16GB
  • NIC: 2 x GBit Ethernet (1 used)
  • DISK: 2 x 1TB DISKs; configured as RAID0 (performance); 2TB effective disk space

However, the throughput we can reach with this cluster and Pig/Hadoop job configuration is limited. As Figure 4 shows, the throughput (measured in gigabytes per hour, GB/h) grows rapidly as the number of files processed increases, then stabilises at slightly more than 90 GB/h once more than 750 image files are processed.

Figure 4 (above): Throughput of the distributed execution measured in Gigabytes per hour (GB/h) against the number of files processed
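The plotted quantity is simply data volume over wall-clock time. As a sketch (the per-file size below is a hypothetical value chosen to match the observed plateau, not a measured figure):

```python
def throughput_gb_per_h(total_gb, wallclock_hours):
    """Throughput as plotted in Figure 4: data volume over wall-clock time."""
    return total_gb / wallclock_hours

# Hypothetical sizing: 1,000 images of ~0.09 GB each processed in one hour
# would match the observed plateau of slightly more than 90 GB/h.
print(round(throughput_gb_per_h(1000 * 0.09, 1.0), 1))  # 90.0
```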

As our use case shows, using a variety of tools developed in the SCAPE Project together with the Hadoop framework makes it possible to distribute the processing across several machines, enabling large-scale image migration to scale and significantly reducing the time needed for data processing. In addition, the size of the cluster can be tailored to fit the size of the job so that it can be completed within a given time frame.

Apart from the authors of this blog post, the following SCAPE Project partners contributed to this experiment:

  • Alan Akbik, Technical University of Berlin
  • Matthias Rella, Austrian Institute of Technology
  • Rainer Schmidt, Austrian Institute of Technology
Preservation Topics: Migration, SCAPE, jpylyzer
Categories: Planet DigiPres