Planet DigiPres

What Could Curation Possibly Mean?

The Signal: Digital Preservation - 25 March 2014 - 1:47pm

My colleague Trevor Owens wrote a great blog post entitled “What Do you Mean by Archive?” This led to a follow-up discussion where I publicly announced on Facebook that I wanted to write about the term “curation.”  Its seemingly widespread use in popular culture in the past 4-5 years has fascinated me.

Curated, photo by Leslie Johnston

Every time I get a new device I need to teach the spell checker “curate” and “curation.”  In fact, I just had to add “curation” to this blog software’s dictionary.  Even my email and word processor dictionaries do not know what these words mean.  So why is it everywhere?

I have seen “curated” collections at retail shops. On web sites. In magazines. In the names of companies.  I have seen curated menus. I don’t mean collections of menus – I mean a menu describing the meal as “curated.”  Social media is curated into stories.  Music festivals are curated.  Brands are now curated. There is a great “Curating the Curators” tumblr that documents encounters with the many varied uses of the terms.  And I cannot fail to mention the fabulous meta “Curate Meme” tumblr, “Where Curators Curate Memes about Curation. Where will the absurdity of our use of the term Curation go next?”

Apparently we are in the age of “Curated Consumption.” Or of “Channelization.”

The most famous discourse on this topic is Pete Martin’s “You Are Not a Curator,” originally written for (now seemingly defunct).  For a great general discussion on this, read the Chicago Tribune piece “Everybody’s a Curator.”  The Details article “Why Calling Yourself a Curator is the New Power Move” is an interesting take on personal brand management. Perhaps most interesting is “The Seven Needs of Real-Time Curators,” a guide for people interested in becoming a real-time blogger/curator.  My colleague Butch has already weighed in on digital curation versus stewardship.  But what is it we really mean when we say curation?

Curation is Acquisition

My academic introduction to the concept of curation came in my initial museum studies courses when I was an undergraduate. I was taught that curation was an act of selection, one of understanding the scope of a collection and a collection development policy, seeing the gaps in the collection and selectively acquiring items to make the collection more comprehensive and/or focused in its coverage. In this a curator is considering and appraising, researching, contextualizing and selecting, constantly searching and refining.  The curator is always auditing the collection and reappraising the items, refining the collection development policy as to its scope.

Curation is Exhibition

I think my first understanding of the word curation came as a museum-going child, when I realized that an actual person was responsible for the exhibits. I came to understand that someone identified a context or message that one wanted to present, and brought together a set of objects that represented a place or time or context, or provided examples to illustrate a point. The exhibition wall texts and labels and catalog and web site are carefully crafted to contextualize the items so they do not appear to be a random selection.  These are acts of conceptualization, interpretation and transformation of objects into an illustration, making the objects and the message accessible to a wide audience. In some cases, curators are transforming objects, some of which may seem quite mundane, into objects of desire through exhibition, by showcasing their cultural value.

Curation is Preservation

In a preservation context, curation is the act of sustaining the collection. Curation is about the storage and care of collections, sometimes passive and sometimes active. The more passive activity is one where items have been stored (ingested) in the safest long-term manner that ensures they require as little interaction as possible for their sustainability, while having an inventory in place so that you always know what you have and where and what it is.

In the more active act of auditing and reappraising a collection, the curator will always be looking for  issues in the collections, such as items exhibiting signs of deterioration or items that no longer meet the scope of the collection. In this curation is the progression of actions from auditing to reappraisal to taking some sort of preservation action, or perhaps de-accessioning items that no longer meet the collection criteria or require care that cannot be provided.

For those of us in the federal sphere, there is this commentary on the definition of curator from a blog post by Steven Lubar, a former curator at the National Museum of American History:

“The official OPM “position classification standard” [pdf] for curators is not of much use. It was written in 1962, and states that “Moreover, unlike library science, the techniques of acquisitioning, cataloging, storing, and displaying objects, and the methods of museum management have not been standardized into formal disciplines and incorporated into formal college courses of training and education.” Shocking, really; someone should update this.”

And what about digital curation? My colleague Doug Reside has written the most succinct description of what it is we do. And the Digital Curation Centre has a great brief guide to the activities and life cycle of digital curation.

How do I feel about what some call the appropriation of these terms?  On one side I dislike the use, as it seems that everyone thinks that they are a curator, which might dilute the professional meaning of the terms. On the other side, the retail usage of curation is not that different from part of what we do – the selection and showcasing and explication of the value of items – with vastly different criteria of course.

And people may indeed curate aspects of their own lives, auditing and reviewing and sustaining and de-accessioning music or books or clothing. So as much as I may get prickly over some uses, they seem to fit in the spirit of the word. Perhaps their use in popular culture will lead to a more widespread understanding of the use of the terms as we use them, and of what we do.

Categories: Planet DigiPres

“Digital Culture is Mass Culture”: An interview with Digital Conservator Dragan Espenschied

The Signal: Digital Preservation - 24 March 2014 - 5:30pm
Dragan Espenschied, Digital Conservator @ Rhizome cc-by, Flo Köhler

At the intersection of digital preservation, art conservation and folklore you can find many of Dragan Espenschied’s projects. After receiving feedback and input from Dragan for a recent post on interfaces to digital collections and geocities I heard that he is now stepping into the role of digital conservator at Rhizome. To that end, I’m excited to talk with him as part of our ongoing NDSA innovation group’s Insights interview series about some of his projects and perspectives and his new role as a digital conservator.

Trevor: When I asked you to review the post on some of your work with the Geocities data set, you responded “I agree, archivists should act dumb and take as much as possible, leaving open as many (unforeseeable) opportunities as possible for access design.” Now that you are moving into a role as a digital conservator, I would be curious to hear to what extent you think that perspective might animate and inform your work.

Dragan: I believe that developing criteria of relevance and even selecting what artifacts are allowed into archives poses a problem of scale. The wise choice might be not trying to solve this problem, but to work on techniques for capturing artifacts as a whole – without trying to define significant properties, what the “core” of an artifact might be, or making too many assumptions about the future use of the artifact. The fewer choices are made during archiving, the more choices are open later, when the artifact will be accessed.

While at Rhizome I want to focus on how access to legacy data and systems located in an archive can be designed in a meaningful way. For Digital Culture, “access” means finding a way for a whole class of legacy artifacts to fulfill a function in contemporary Digital Culture. How to do that is one of the most pressing issues when it comes to developing an actually meaningful history of Digital Culture. We are still fixated on a very traditional storytelling, along the lines of great men creating groundbreaking innovations that changed the world. I hope I can help by turning the focus to users.

Trevor: BitTorrent has been the primary means by which the geocities archive has been published and shared. Given the recent announcement of AcademicTorrents as a potential way for sharing research data, I would be curious to hear what role you think BitTorrent can or should play in providing access to this kind of raw data.

Dragan: The torrent was an emergency solution in the Geocities case, but the Archive Team’s head Jason Scott turned the disadvantage of not having any institutional support into a powerful narrative device. Today the torrent’s files can be downloaded from The Internet Archive and this is in fact much more comfortable, though less exciting.

In general distribution via BitTorrent is problematic because once nobody is interested in a certain set of data anymore, even temporarily, a torrent dries up and simply vanishes. But torrents can help rescuing stuff trapped in an institution or distributing stuff that no institution would ever dare to touch. One of them is this big pile of Digital Folklore of Geocities, it poses so many problems for institutions: there are literally hundreds of thousands of unknown authors who could theoretically claim some copyright violations if their work would show up under the banner of an institution; there is hardly anybody in charge who would recognize the immense cultural value of the digital vernacular; it is so much material that no-one could ever look inside and check each and every byte for “offensive content” and so forth …

Trevor: In my earlier post on digital interfaces I had called the website One Terabyte of Kilobyte Age, which you and artist Olia Lialina run, an interpretation. You said you think of it as “a carefully designed mass re-enactment, based on this scale of authenticity/accessibility.” Could you unpack that for us a bit? What makes it a re-enactment and what do you see as the core claim in your approach to authenticity and accessibility?

Dragan: As much as Digital Culture is Mass Culture, it is also more about practices than objects. In order for artifacts to survive culturally, they need to become useful again in contemporary digital culture. Since, at the moment, “content” that is isolated, de-contextualized and shuffled around in databases of social networking sites is the main form of communication, to be useful an artifact has to work as a “post,” it has to become impartible and be brought into a format that is accepted everywhere. And that is a screenshot.

I have a great setup with emulators and proxy servers and whatnot to access the processed Geocities archive, but this won’t bring it anywhere close to executing its important cultural role, being a testimony of a pre-industrialized web. Even public archives like the rather excellent ReoCities or the Wayback Machine cannot serve as a mediator for 1990s web culture. The screenshots are easily accessible, sharable and usable: they work as cultural signatures users can assign to themselves by re-blogging them, they can be used to spark discussions and harvest likes and favorites, and so forth.

Some decisions of how these screenshots are automatically created are coming from this perspective of accessibility; for example, although the typical screen resolution of web users increased around the turn of the century, One Terabyte Of Kilobyte Age will continue to serve 800×600 pixel images for the foreseeable future. Larger images would burst many blogs’ layouts and cause unrecognizable elements on downsizing.

Other decisions, like the choice of MIDI replay plugin installed in the browser, are about making the screenshots as narrative as possible. The MIDI replay plugin shipped with Netscape would play MIDI music in the background without any visual representation; if the music was embedded in the page, it would show simple play controls. The “Crescendo” plugin I used always shows the file name of the MIDI file being played, most of the time in a popup window.


Reenactment of CapitolHill/1455/ via One Terabyte of Kilobyte Age

On the Geocities site CapitolHill/1455/ there is a piece of music playing called “2001.mid”. You might guess this is the title theme of the movie 2001, “Also Sprach Zarathustra” by Richard Strauss – and that is indeed the case (see this recording that artist Olia Lialina made). This screenshot of the – some might say – annoying, even not very “authentic” popup window makes the tune play in your head.


“Access to the remains of Geocities can be measured on two axis: authenticity (how realistic can the harvested data be presented again) and ease of access (what technical requirements and what knowledge are needed to gain access on a certain level of authenticity).” Image from Dragan.

So, while the screenshots have some “authenticity issues,” this is greatly outweighed by their accessibility and therefore impact. And experiencing the iconic, graphically impressive Netscape browser is something otherwise only achievable by complicated emulator setups. (Olia observed that tumblr users also reblog screenshots of empty Netscape windows, probably because its very dominant interface looks explicitly historic today.)

Trevor: In the announcement of your new position you are quoted as saying “I strongly believe that designing the access to complex legacy digital artifacts and systems is the largest contemporary challenge in digital culture. Digital culture is mass culture, and collection and preservation practices have to change to reflect this fact.” Could you unpack that a bit more for us? What are the implications of mass digital culture for collecting and preserving it?

Dragan: The grief I have with the creation of history in digital culture is that it is in many cases located outside of digital culture itself. Digital culture is regarded as too flimsy (or the classic “ephemeral”) to take care of itself, so conservation is done by removing artifacts from the cultural tempest they originated in and putting them into a safe place. The problem is that this approach doesn’t scale – sorry for using this technical term. I won’t argue that a privileged, careful handling of certain artifacts deemed of high importance or representative value is the wrong way; actually, this approach is the most narrative. But practiced too rigidly it doesn’t do digital culture any justice. Firstly because there simply are no resources to do this with a large amount of artifacts, and secondly because many artifacts can only blossom in their environment, in concert or contrast with a vernacular web, commercial services and so forth.

The other extreme is to write history with databases, pie charts and timelines, like in Google’s Zeitgeist. Going there I can find out that in January 2013 the top search requests in my city were “silvester” and “kalender 2013” – big data, little narration. With the presentation of such decontextualized data points, the main narrative power lies in the design of the visual template they end up in. This year it is a world map, next year it might be a 3D timeline – but in fact users typed in their queries into the Google search box. That is why the popular Google Search autocomplete screen shots, as a part of digital folklore, are more powerful, and typing into the Google search box yourself and watching the suggestions appear is the best way to explore what is being searched for.

Example of Autocomplete screenshot provided by Dragan.

Mass Digital Culture is posing this challenge: can there be a way of writing its history that does it justice? How to cope with the mass without cynicism and with respect for the users, without resorting to methods of market analysis?

Trevor: I spoke with Ben Fino-Radin, your predecessor in this role, about Rhizome and his take on what being a digital conservator means. I’d be curious to hear your response to that question. Could you tell us a bit about how you define this role? To what extent do you think this role is similar and different to analog art conservation? Similarly, to what extent is this work similar or different to roles like digital archivist or digital curator?

Dragan: I have very little experience with conserving analog art in general so I will spare myself the embarrassment of comparing. The point I agree whole-heartedly with Ben is about empathy for the artifacts. “New Media” is always new because the symbols buzzing around in computers don’t have any meaning by themselves, and digital culture is about inventing meanings for them. A digital conservator will need to weave the past into the present and constantly find new ways of doing so. This touches knowledge of history, curation, and artistic techniques. While I believe the field of digital conservation needs to build an identity still, I see my personal role as ultimately developing methods and practices for communities to take care of their own history.

Categories: Planet DigiPres

ARC to WARC migration: How to deal with de-duplicated records?

Open Planets Foundation Blogs - 24 March 2014 - 4:13pm

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further. This is why I am now proposing one possible way to deal with de-duplicated records in an ARC to WARC migration scenario.

Before entering into specifics, let me briefly recall what is meant by “de-duplication”: it is a mechanism used by a web crawler to reference identical content that was already stored when visiting a web site at a previous point in time; its main purpose is to avoid storing content redundantly, thereby reducing the required storage capacity.
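
The core of that mechanism can be sketched in a few lines of Python (a toy model, not NetarchiveSuite's actual implementation; the function and record layout are invented for illustration): keep a map from payload digest to the location of the first stored copy, and write only a reference for every later hit.

```python
import hashlib

def make_deduplicating_store():
    """Toy crawl-time de-duplication: store a payload only the first time
    its digest is seen; afterwards record a pointer to the earlier copy."""
    seen = {}      # payload digest -> location of the stored copy
    records = []   # what would actually be written to the archive

    def store(url, payload, location):
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            # duplicate payload: write only a reference, not the bytes again
            records.append({"url": url, "duplicate-of": seen[digest]})
        else:
            seen[digest] = location
            records.append({"url": url, "payload": payload, "location": location})

    return store, records
```

Crawling the same unchanged image in two subsequent jobs would then yield one full record and one pointer record.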

The Netarchive Suite uses a Heritrix module for de-duplication, which takes place on the level of a harvest definition. The following diagram roughly outlines the most important information items and their dependencies.


The example shows two subsequent jobs executed as part of the same harvest definition. Depending on configuration parameters, such as the desired size of the ARC files, each crawl job creates one or more ARC container files and a corresponding crawl metadata file. In the example above, the first crawl job (1001) produced two ARC files, each containing ARC metadata, a DNS record and one HTML page. Additionally, the first ARC file contains a PNG image file that was referenced in the HTML file. The second crawl job (1002) produced equivalent content, except that the PNG image file is not contained in the first ARC file of this job; it is only referred to as a de-duplicated item in the crawl-metadata using the notation {job-id}-{harvest-id}-{serialno}.
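
Unpacking such a reference is straightforward; a small sketch (the helper name and the example values are hypothetical):

```python
def parse_dedup_ref(ref):
    """Split a {job-id}-{harvest-id}-{serialno} reference into its parts."""
    job_id, harvest_id, serialno = ref.split("-")
    return {"job": int(job_id), "harvest": int(harvest_id), "serial": int(serialno)}
```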

The question is: do we actually need the de-duplication information in the crawl-metadata file? If an index (e.g. a CDX index) is created over all ARC container files, we know – or better: the wayback machine knows – where a file can be located, and in this sense the de-duplication information could be considered obsolete. We would only lose the information about which crawl job the de-duplication actually took place in, and this concerns the informational integrity of a crawl job, because external dependencies would no longer be explicit. Therefore, the following is a proposed way to preserve this information in a WARC-standard-compliant way.

Each content record of the original ARC file is converted to a response-record in the WARC file, as illustrated in the bottom left box in the diagram above. Any request/response metadata can be added as a header block to the record payload or as a separate metadata-record that relates to the response-record.

The de-duplicated information items available in the crawl-metadata file are converted to revisit-records, as illustrated in the bottom right box, in a separate WARC file (one per crawl-metadata file). The payload digest must be equal and should state that the completeness of the referred record was checked successfully. The WARC-Refers-To property points to the WARC record that contains the record payload; additionally, the fact that Content-Length is 0 explicitly states that the record payload is not available in the current record and is to be located elsewhere.
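
As a sketch, the named headers of such a revisit record might look as follows (built here as a plain Python dict; the profile URI is the identical-payload-digest revisit profile from the WARC 1.0 specification, and the digest and record ID values used in any call would be placeholders):

```python
def revisit_record_headers(target_uri, payload_digest, refers_to_record_id):
    """Named headers for a WARC revisit record: a Content-Length of 0 states
    that the payload is not stored here, and WARC-Refers-To points at the
    response record that does contain it."""
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": target_uri,
        "WARC-Profile": "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest",
        "WARC-Payload-Digest": payload_digest,
        "WARC-Refers-To": refers_to_record_id,
        "Content-Length": "0",
    }
```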

Taxonomy upgrade extras: SCAPE, SCAPE Project, Migration, Web Archiving, ARC, WARC, ARC to WARC. Preservation Topics: Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

New look and feel for Viewshare

The Signal: Digital Preservation - 21 March 2014 - 5:55pm

Earlier this month Trevor Owens announced that a new version of Viewshare is open for user testing and comment. Following this public beta, our plan is to move all users over to the new platform in the next few months.  When this happens your Viewshare account, data and views will all transition seamlessly. You will, however, notice visual and functional improvements to the views. The overall look and feel has been modernized and more functionality has been added to some of the views, particularly those that use pie charts.

Trevor gave an overview of all of the new features in his post announcing these changes. In this post I’ll focus on the visual updates. I would encourage everyone with a Viewshare account to check out how your views look in the new version of Viewshare and let us know what you think or if you have any issues.

Responsive Design

The new version of Viewshare implements responsive design which will allow your views to look good and be functional on any computer or device. You can see this in action with a view of digital collections preserved by NDIIPP partners on both a large and small screen. The view can fill up a large computer monitor screen and be equally functional and usable on a smartphone. This added feature will require no action from users and will work automatically.

NDIIPP Collections on a smartphone using the new version of Viewshare.

NDIIPP Collections view on large monitor
Changes for Charts

Bar chart views are available in the new version of Viewshare. The pie charts have also been greatly improved. Visually, they are clearer and the text is more legible. Functionally, users are able to click through to items that are represented in different areas of the pie chart. This isn’t possible in the current Viewshare. Check out the two versions of the same data from the East Texas Research Center and you’ll see the improvements.

I do want to point out that in the current version of Viewshare there’s an option to switch between two different pie charts on the same view by using a “view by” drop-down menu. To simplify the building process, that option was eliminated in the new version of Viewshare; if you want two pie charts, simply create two views. If your current pie chart view has options to view more than one chart in the same view, the chart listed first will be the one that displays in the new version. To restore the missing chart, simply create an additional pie chart view.

Current pie chart view

New version of pie charts
Share Filtered or Subsets of Results

The new version of Viewshare allows users to share results of a particular state in a view. An example of this is shown in the Carson Monk-Metcalf view of birth and death records. The view below shows a scatterplot chart of birth years vs. death years and their race and religion (religion data not shown below but accessible in the view). The view is limited to show records for those who were 75 years and above at the time of their death. The user could cite or link to this particular view in the data by clicking the red bookmark icon in the upper right and share or save the link provided.

Carson Monk-Metcalf bookmarked results

Again, be sure to check out your views in the new Viewshare; your current login credentials will work. As always, let us know what you think in the comments of this post or in the user feedback forums for Viewshare.

Categories: Planet DigiPres

CSV Validator - beta releases

Open Planets Foundation Blogs - 21 March 2014 - 2:51pm

For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against a user-defined schema.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under the Mozilla Public Licence version 2.0.
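
To give a flavour of what schema-driven CSV validation means, here is a toy sketch in Python; this is not the tool's actual schema language (which is formally specified elsewhere), just the general idea of checking every value in a column against a rule:

```python
import csv
import io
import re

def validate_csv(text, schema):
    """Toy validator: schema maps column name -> regex that every value in
    that column must match. Returns (line number, column, bad value) tuples."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, pattern in schema.items():
            value = row.get(column, "")
            if not re.fullmatch(pattern, value):
                errors.append((line_no, column, value))
    return errors
```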

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.

Preservation Topics: Tools
Categories: Planet DigiPres

A Tika to ride; characterising web content with Nanite

Open Planets Foundation Blogs - 21 March 2014 - 1:58pm

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test to check that the input files are valid gz files.  This is very quick (it takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
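
That pre-flight check can be very cheap indeed; a minimal sketch (assuming a simple magic-bytes test, which is not necessarily exactly what Nanite-Hadoop does; a thorough check would also decompress the whole stream):

```python
GZIP_MAGIC = b"\x1f\x8b"

def looks_like_gzip(first_bytes):
    """Reject inputs lacking the gzip magic bytes before starting an
    hours-long job; reading two bytes per file takes only seconds."""
    return first_bytes[:2] == GZIP_MAGIC
```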

We have been working on Nanite to add different characterisation libraries and improve them and their coverage.  As the tools used are all Java, or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file.  Libraries can be turned on/off relatively easily by editing the source or the jar.

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight, sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to be files in that corpus that were corrupt or otherwise broken and would cause crashes in Tika or its dependencies.

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc. were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok; however, it made use of Thread.stop() – a deprecated API call to stop a Java Thread.  Use of this API call is thoroughly not recommended as it may corrupt the internal state of the JVM or produce other undesired effects.  Details can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further.

I should note that it has subsequently been suggested that an alternative to using Thread.stop() is to just leave the hung Thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue linked above, I produced a library that abstracts a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read – although such errors should be logged by Nanite in the Hadoop Reducer output.
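
The underlying idea, reduced to a miniature Python sketch (ProcessIsolatedTika itself runs a tika-server and talks to it over REST; here a trivial child process, spelled out inline, stands in for the parser):

```python
import subprocess
import sys

# hypothetical "parser": a child process that upper-cases its input
ECHO_PARSER = [sys.executable, "-c",
               "import sys; sys.stdout.write(sys.stdin.read().upper())"]

def parse_isolated(cmd, text, timeout=30):
    """Run the parser as a separate OS process: a crash or hang in it
    cannot corrupt this interpreter, because the operating system cleans
    up and we simply report a failed file."""
    try:
        proc = subprocess.run(cmd, input=text, capture_output=True,
                              text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # parser hung; subprocess.run has killed it
    if proc.returncode != 0:
        return None  # parser crashed
    return proc.stdout
```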

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by how long each process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, a longer initial start-up time for Mappers, and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds added by using a process-isolated Tika are not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and the resulting mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST like it would be if we called them directly from the Java code.

Another approach could have been to use Nailgun, as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them), and if a Tika parser crashed or caused the JVM to hang then it might corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’s experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 CPU/map slot per node); the results are as follows:

Identification tools used: Nanite-core (Droid), Tika detect() (mimetype only), ProcessIsolatedTika parsers

WARC files: 1000

Total WARC size: 59.4GB (63,759,574,081 bytes)

Total files in WARCs (# input records):

Runtime (hh:mm:ss):

Total Tika parser output size (compressed): 765MB (801,740,734 bytes)

Tika parser failures/crashes:

Misc failures: Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle, so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed PDF version detection within the PDF parser.

As an aside, if FITS offered a REST interface then the ProcessIsolatedTika code could easily be modified to replace Tika with FITS.  This would be worth considering if there were interest and someone were to create such a REST interface.

Apologies for the puns.

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres

Nominations Now Open for the 2014 NDSA Innovation Awards

The Signal: Digital Preservation - 20 March 2014 - 8:19pm

“12 year old girl wins Medal of Honor, Washington, D.C., Sept. 12.” Library of Congress, Prints & Photographs Collection. LC-DIG-hec-33759.

 The National Digital Stewardship Alliance Innovation Working Group is proud to open the nominations for the 2014 NDSA Innovation Awards. As a diverse membership group with a shared commitment to digital preservation, the NDSA understands the importance of innovation and risk-taking in developing and supporting a broad range of successful digital preservation activities. These awards are an example of the NDSA’s commitment to encourage and recognize innovation in the digital stewardship community.

This slate of annual awards highlights and commends creative individuals, projects, organizations and future stewards demonstrating originality and excellence in their contributions to the field of digital preservation. The program is administered by a committee drawn from members of the NDSA Innovation Working Group.

Last year’s winners are exemplars of the diversity and collaboration essential to supporting the digital stewardship community as it works to preserve and make available digital materials. For more information on the details of last year’s recipients, please see the blog post announcing last year’s winners.

The NDSA Innovation Awards focus on recognizing excellence in one or more of the following areas:

  • Individuals making a significant, innovative contribution to the field of digital preservation;
  • Projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship;
  • Organizations taking an innovative approach to providing support and guidance to the digital preservation community;
  • Future stewards, especially students, but including educators, trainers or curricular endeavors, taking a creative approach to advancing knowledge of digital preservation theory and practices.

Acknowledging that innovative digital stewardship can take many forms, eligibility for these awards has been left purposely broad. Nominations are open to anyone or anything that falls into the above categories and any entity can be nominated for one of the four awards. Nominees should be US-based people and projects or collaborative international projects that contain a US-based partner. This is your chance to help us highlight and reward novel, risk-taking and inventive approaches to the challenges of digital preservation.

Nominations are now being accepted and you can submit a nomination using this quick, easy online submission form. You can also submit a nomination by emailing a brief description, justification and the URL and/or contact information of your nominee to ndsa (at)

Nominations will be accepted until Friday May 2, 2014 and winners announced in mid-May. The prizes will be plaques presented to the winners at the Digital Preservation 2014 meeting taking place in the Washington, DC area on July 22-24, 2014. Winners will be asked to deliver a very brief talk about their activities as part of the awards ceremony and travel funds are expected to be available for these invited presenters.

Help us recognize and reward innovation in digital stewardship and submit a nomination!

Categories: Planet DigiPres

Long term accessibility of digital resources in theory and practice

Alliance for Permanent Access News - 20 March 2014 - 3:25pm

The APARSEN project is organising a Satellite Event on “Long Term Accessibility of Digital Resources in Theory and Practice” on 21st May 2014 in Vienna, Austria.

It takes place in the context of the 3rd LIBER Workshop on Digital Curation “Keeping data: The process of data curation” (19-20 May 2014)

The programme is organised by the APARSEN project together with the SCAPE Project.

09:00 – 10:30
Sabine Schrimpf (German National Library): Digital Rights Management in the context of long-term preservation
Ross King (Austrian Institute of Technology): The SCAPE project and Scalable Quality Control
David Wang (SBA Research): Understanding the Costs of Digital Curation

11:00 – 12:30
Sven Schlarb (Austrian National Library): Application scenarios of the SCAPE project at the Austrian National Library
Krešimir Đuretec (Vienna University of Technology): The SCAPE Planning and Watch Suite
David Giaretta (Alliance for Permanent Access): Digital Preservation: How APARSEN can help answer the key question “Who pays and Why?”
Categories: Planet DigiPres

A Regional NDSA?

The Signal: Digital Preservation - 19 March 2014 - 5:59pm

The following is a guest post by Kim Schroeder, a lecturer at the Wayne State University School of Library and Information Science.

Several years ago, before the glory of the annual NDSA conference, professionals across America were seeking more digital curation literature and professional contacts.  Basic questions like ‘what is the first step in digital preservation?’ and ‘how do I start to consistently manage digital assets?’ were at the forefront.

As we have worked toward increased information sharing, including the invaluable annual NDSA and IDCC conferences, we see a disconnect as we return home.  As we try to implement new tools and processes, we inevitably hit bumps beyond our immediate knowledge.  This is increasingly being alleviated by local meetings hosted in regions to gather professionals for hands-on and hand-waving process sharing.

Lance Stuchell, Digital Preservation Librarian at the University of Michigan, and I began the Regional Digital Preservation Practitioners (RDPP) meetings as an opportunity to talk through our challenges and solutions.  The result is that over 100 professionals have signed up for our listserv since our call one year ago. We sent announcements out to Windsor, Toledo, Ann Arbor and throughout Metro Detroit to let people know that there is an untapped community of professionals that want and need to share their progress on digital curation.

Kevin Barton in the Wayne State SLIS Lab. Photo credit: Mary Jane Murawka.

In the last year we have held three meetings with more planned this year.  The initial meeting included a discussion and eventually a survey to define our biggest issues as well as how best to craft the group.  Other topics included a digital projects lab tour, a DSpace installation overview, and a demonstration of a mature Digital Asset Management system.  Coming later this year, we plan to focus on metadata issues and a symposium on how to create workflows.  Further information about the meetings is available at the Regional Digital Preservation Practitioners site.

The development of the list has been one of the more helpful pieces with folks posting jobs, practicum ideas, latest articles and technical questions.  The volume of discussion is not there yet but it is off to a healthy start.

Mid-Michigan has also created a similar group that works with us to schedule events and share information.  Ed Busch, the Electronic Records Archivist at Michigan State University (MSU) held a successful conference last summer at MSU and he said:  “What my co-worker Lisa Schmidt and I find so useful with our Mid-Michigan regional meeting is the chance to network with other professionals trying to solve the same situations as we are with digital assets; hearing what they’ve tried with success and failure; and finding new ways to collaborate. All institutions with digital assets, regardless of size, are in the same boat when it comes to dealing with this material. It’s really nice to hear that from your peers.”  They held another conference on March 14th of this year and the agenda is available (pdf).

The NDSA is also encouraging regions to join together beyond the annual meeting. Butch Lazorchak, a Digital Archivist at the National Digital Information Infrastructure and Preservation Program, shared his thoughts on this. “The NDSA regional meetings are a great opportunity for NDSA members to share the work they’ve done to advance digital stewardship practice,” he said. “At the same time, the meetings help to build community by bringing together regional participants who may not usually find an opportunity to get together to discuss digital stewardship practice and share their own triumphs and challenges.”

Beginning a regional group is fairly easy as you send out announcements to professional listservs, but the tougher part is administration.  Deciding who keeps the minutes, manages the list, hosts the next meeting and how to maintain momentum is a necessity.  With the explosion in research, professional literature and expanding conferences we have more avenues to explore but we need the hands-on lessons learned from local colleagues to continue successful experimentation.  We would encourage you to think about starting your own local group!

Categories: Planet DigiPres

Things to Know About Personal Digital Archiving 2014

The Signal: Digital Preservation - 18 March 2014 - 8:44pm

Personal Digital Archiving 2014 will be held at the Indiana State Library in Indianapolis, Indiana, April 10-11, 2014.  This is THE conference that raises awareness among individuals, public institutions and private companies engaged in the creation, preservation and ongoing use of personal digital content.  A key overarching topic will be how libraries, archives and other cultural heritage organizations can support personal digital archiving within our own community as well as reaching out to specific communities. We invite you to come out and join the conversation.

The two-day conference will feature a diverse range of presentations on topics such as: archiving and documentation practices of local communities; tools and techniques to process digital archives; investigations of building, managing and archiving scholarly practices and family history archives; and the challenges of communicating personal digital archiving benefits to a variety of audiences. The full list of presentations, lightning talks and posters can be found here.

Tag cloud of PDA14 presentation titles.

Here are a few quick things to know about upcoming conference:

  • Keynote speakers will explore preservation challenges from the perspectives of both researchers and creators of personal digital information.  Andrea Copeland from the School of Informatics and Computing, Indiana University-Purdue, will talk about her research looking into public library users’ digital preservation practices. Charles R. Cross, a music historian & author, will talk about the value of personal archives from a biographer’s perspective.
  • Many organizations lack adequate infrastructure to implement preservation of personal digital records.  There will be a number of presentations on the practical side of doing personal digital preservation using specific tools and services.  Some will be on consumer-level services that help individuals build their own personal digital archives. Other presentations will be from librarians, archivists and researchers who are using certain tools to help their institutions manage personal digital records.
  • Knowledge related to accession, donor or legal requirements, researchers’ interests, and practical preservation strategies for personal digital archives is equally lacking. To help understand some of these issues, practitioners, scholars and individuals from different fields will share their current research on personal digital archiving topics.  For the first time, the conference will feature a panel discussion from contemporary architects and landscape architects talking about preserving their work and transferring it to archives.  This is a community of professionals not regularly represented at the PDA conference and provides a great opportunity to hear about their specific challenges.

Registration is open!  We hope you can join us to explore and help raise awareness of the need for personal digital archiving in your own communities.

Categories: Planet DigiPres

Three years of SCAPE

Open Planets Foundation Blogs - 18 March 2014 - 12:24pm

SCAPE is proud to look back at another successful project year. During the third year the team produced many new tools, e.g. ToMaR, a tool which wraps command line tools into Hadoop MapReduce jobs. Other tools like xcorrSound and C3PO have been developed further.

This year’s All-Staff Meeting took place in mid-February in Póvoa de Varzim, Portugal. The team organised a number of general sessions, during which the project partners presented demos of and elevator pitches for the tools and services they developed in SCAPE. It was very interesting for all meeting participants to see the results achieved so far. The demos and pitches were also useful for re-focusing on the big picture of SCAPE. During the main meeting sessions the participants focused on the take-up and productization of SCAPE tools.

Another central topic of the meeting was integration. Until the end of the project the partners will put an emphasis on further integrating the results. To prove the scalability of the tools, the team set up a number of operational Hadoop cluster instances (both central and local), which are currently being used for the evaluation of the tools and workflows.

Another focus lies on the sustainability of SCAPE tools. The SCAPE team is working towards documenting the tools for both developers and users. SCAPE outcomes will be curated by the Open Planets Foundation, which will keep them available beyond the end of the project.

In September 2014 SCAPE is organising a final event in collaboration with APARSEN. The workshop is planned to take place at the Digital Libraries 2014 conference in London, where SCAPE will have its final, overall presentation. The workshop is directed towards developers, content holders, and data managers. The SCAPE team will present tools and services developed since 2011. A special focus will be on newly and further developed open source tools for scalable preservation actions; SCAPE’s scalable Platform architecture; and its policy-based Planning and Watch solutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Mavenized JHOVE

File Formats Blog - 16 March 2014 - 2:19pm

I’m not a Maven maven, but more of a Maven klutz. Nonetheless, I’ve managed to push a Mavenized version of JHOVE to Github that compiles for me. I haven’t tried to do anything beyond compiling. If anyone would like to help clean it up, please do.

This kills the continuity of file histories which Andy worked so hard to preserve, since Maven has its own ideas of where files should be. The histories are available under the deleted files in their old locations, if you look at the original commit.

Tagged: JHOVE, software
Categories: Planet DigiPres

ToMaR - How to let your preservation tools scale

Open Planets Foundation Blogs - 14 March 2014 - 4:01pm

Whenever you find yourself needing to apply a familiar command line tool to a large number of files across a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.

Mathilda is working at the department for digital preservation at a famous national library. In her daily work she has to cope with various well-known tasks like data identification, migration and curation. She is experienced in using the command shell on a Unix system and occasionally has to write small scripts to perform a certain workflow effectively.

When she has got to deal with a few hundred files she usually invokes her shell script on one file after the other using a simple loop for automation. But today she has been put in charge of a much bigger data set than she is used to. There are one hundred thousand TIFF images which need to be migrated to JPEG2000 images in order to save storage space. Intuitively she knows that processing these files one after the other, with each single migration taking about half a minute, would take weeks to run.

Luckily Mathilda has heard of the Hadoop cluster her colleagues have recently set up in order to do some data mining on a large collection of text files. "Would there be a way to run my file migration tool on that cluster thing?", she thinks. "If I could run it in parallel on all these machines, that would speed up my migration task tremendously!" Only one thing makes her hesitate: she has hardly got any Java programming skills, not to mention any idea of that MapReduce programming paradigm they are using in their data mining task. How to let her tool scale?

That's where ToMaR, the Tool-to-MapReduce Wrapper comes in!

What can ToMaR do?

If you have a running Hadoop cluster, you are only three little steps away from letting your preservation tools run on thousands of files almost as efficiently as a native single-purpose Java MapReduce application. ToMaR wraps a command line tool into a Hadoop MapReduce job which executes the command on all the worker nodes of the Hadoop cluster in parallel. Depending on the tool you want to use through ToMaR, it might be necessary to install it on each cluster node beforehand. Then all you need to do is:

  1. Specify your tool so that ToMaR can understand it using the SCAPE Tool Specification Schema.
  2. Itemize the parameters of the tool invocation for each of your input files in a control file.
  3. Run ToMaR.

Through MapReduce, the list of parameter descriptions in your control file is split up and assigned to the nodes portion by portion. For instance, ToMaR could be configured to create splits of 10 lines each from the control file. Each node then parses its portion line by line and, for each line, invokes the tool with the parameters specified therein.

File Format Migration Example

So how might Mathilda tackle her file format migration problem? First she will have to make sure that her tool is installed on each cluster node. Her colleagues who maintain the Hadoop cluster will take care of this requirement. What remains for her is the creation of the Tool Specification Document (ToolSpec) using the SCAPE Tool Specification Schema and the itemization of the tool invocation parameter descriptions. The following figure depicts the required workflow:

Create the ToolSpec

The ToolSpec is an XML file which contains several operations. An operation consists of a name, a description, a command pattern and input/output parameters. The operation for Mathilda's file format migration tool might look like this:

<operation name="image-to-j2k">
  <description>Migrates an image to jpeg2000</description>
  <command>
    image_to_j2k -i ${input} -o ${output} -I -p RPCL -n 7
      -c [256,256],[256,256],[128,128],[128,128],[128,128],[128,128],[128,128]
      -b 64,64
      -r 320.000,160.000,80.000,40.000,20.000,11.250,7.000,4.600,3.400,2.750,2.400,1.000
  </command>
  <inputs>
    <input name="input" required="true">
      <description>Reference to input file</description>
    </input>
  </inputs>
  <outputs>
    <output name="output" required="true">
      <description>Reference to output file. Only *.j2k, *.j2c or *.jp2!</description>
    </output>
  </outputs>
</operation>

In the <command> element she has put the actual command line with a long tail of static parameters. This example highlights another advantage of the ToolSpec: You gain the ease of wrapping complex command lines in an atomic operation definition which is associated with a simple name, here "image-to-j2k". Inside the command pattern she puts placeholders which are replaced by various values. Here ${input} and ${output} denote such variables so that the value of the input file parameter (-i) and the value of the output file parameter (-o) can vary with each invocation of the tool.

Along with the command definition, Mathilda has to describe these variables in the <inputs> and <outputs> sections. Since ${input} is the placeholder for an input file, she has to add an <input> element with the name of the placeholder as an attribute. The same applies to the ${output} placeholder. Additionally she can add some description text to these input and output parameter definitions.

There are more constructs possible with the SCAPE Tool Specification Schema which cannot be covered here. The full contents of this ToolSpec can be found in the file attachments.

Create the Control File

The other essential requirement Mathilda has to achieve is the creation of the control file. This file contains the real values for the tool invocation which are mapped to the ToolSpec by ToMaR. Together with the above example her control file will look something like this:

openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"
openjpeg image-to-j2k --input="hdfs://myFile2.tif" --output="hdfs://myFile2.jp2"
openjpeg image-to-j2k --input="hdfs://myFile3.tif" --output="hdfs://myFile3.jp2"
...

The first word refers to the name of the ToolSpec ToMaR shall load. In this example the ToolSpec is called "openjpeg.xml", but only the name without the .xml extension is needed for the reference. The second word refers to an operation within that ToolSpec: the "image-to-j2k" operation described in the ToolSpec example snippet above.

The rest of the line contains references to input and output parameters. Each reference starts with a double dash followed by a parameter name and value pair. So --input (and likewise --output) refers to the parameter named "input" in the ToolSpec, which in turn refers to the ${input} placeholder in the command pattern. The values are file references on Hadoop's Distributed File System (HDFS).
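The line format is mechanical enough to sketch. The following is a hypothetical illustration (not ToMaR's actual source) of how one control-file line might be split into the ToolSpec name, the operation name and the parameter map:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of parsing one control-file line into the pieces a
// wrapper like ToMaR needs: the ToolSpec name, the operation name, and a
// map of parameter names to values.
class ControlLineParser {

    public static Map<String, String> parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<String, String> result = new HashMap<>();
        result.put("toolspec", tokens[0]);   // e.g. "openjpeg" -> openjpeg.xml
        result.put("operation", tokens[1]);  // e.g. "image-to-j2k"
        for (int i = 2; i < tokens.length; i++) {
            // each remaining token has the form --name="value"
            String[] pair = tokens[i].substring(2).split("=", 2);
            result.put(pair[0], pair[1].replaceAll("^\"|\"$", ""));
        }
        return result;
    }
}
```

Note the sketch assumes values contain no whitespace, which holds for HDFS file references like those above.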

As Mathilda has 100k TIFF images she will have 100k lines in her control file. As she knows how to use the command shell she quickly writes a script which generates this file for her.

Run ToMaR

Having created the ToolSpec openjpeg.xml and the control file controlfile.txt, she copies openjpeg.xml into the HDFS directory "hdfs:///user/mathilda/toolspecs" and executes the following command on the master node of the Hadoop cluster:

hadoop jar ToMaR.jar -i controlfile.txt -r hdfs:///user/mathilda/toolspecs

Here she feeds in controlfile.txt and the location of her ToolSpecs, and ToMaR does the rest. It splits up the control file and distributes a certain number of lines per split to each node. The ToolSpec is loaded and the parameters are mapped to the command line pattern contained in the named operation. Input files are copied from HDFS to the local file system. Once the placeholders have been replaced by the values, the command line can be executed by the worker node. After that, the resulting output file is copied back from the local file system to the given output location on HDFS.
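The parameter-to-placeholder mapping at the heart of this step can be illustrated with a small sketch (hypothetical code, not ToMaR's implementation):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the core mapping step: replace ${name} placeholders
// in an operation's command pattern with the values parsed from one
// control-file line. (Not ToMaR's actual code.)
class CommandPatternResolver {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    public static String resolve(String pattern, Map<String, String> params) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                // a required input/output parameter is missing from the line
                throw new IllegalArgumentException("missing parameter: " + m.group(1));
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

Applied to the "image-to-j2k" operation, the ${input} and ${output} placeholders in the command pattern would be filled with the local paths of the files just copied from HDFS.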

Finally Mathilda has got all the migrated JPEG2000 images on HDFS in a fraction of the time it would have taken when run sequentially on her machine.

To sum up, with ToMaR you can:

  • easily take up external tools with a clear mapping between the instructions and the physical invocation of the tool
  • use the SCAPE Toolspec, as well as existing Toolspecs, and its advantage of associating simple keywords with complex command-line patterns
  • avoid needing any programming skills, as the minimum requirement is only to set up the control file

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.

ToMaR offers the possibility to use existing command-line tools in Hadoop's distributed environment very similarly to a desktop computer. By utilizing SCAPE Tool Specification documents, ToMaR allows users to associate complex command-line patterns with simple keywords, which can be referenced for execution on a computer cluster. ToMaR is a generic MapReduce application which does not require any programming skills.

Check out the following blog posts for further usage scenarios of ToMaR:


Preservation Topics: Preservation Actions, SCAPE
Attachments: Full openjpeg ToolSpec (1.02 KB), ToMaR-image_to_j2k-workflow.png (158.29 KB), ToMaR-overview.png (67.97 KB), logo.png (74.65 KB)
Categories: Planet DigiPres

Upcoming NDSR Symposium “Emerging Trends in Digital Stewardship”: Speaker Announcements

The Signal: Digital Preservation - 14 March 2014 - 1:23pm

The following is a guest post by Jaime McCurry, National Digital Stewardship Resident at the Folger Shakespeare Library.

It’s certainly been an exciting and busy few months for the National Digital Stewardship Residents and although we are well into the final portion of our projects, we’re showing no signs of slowing down.

Residents see time-based media art at the American Art Museum and National Portrait Gallery. Photo courtesy of Emily Reynolds.

In addition to our regularly scheduled programming on The Signal, you can find the residents on the web and elsewhere talking digital stewardship and providing project updates:

  • Julia Blase discusses her project status at the National Security Archive in Forward…March!
  • Heidi Dowding discusses the most recent NDSR Enrichment Session, hosted this month at Dumbarton Oaks.
  • Continuing with the Resident-to-Resident Interview Series, Maureen McCormick Harlow (National Library of Medicine) interviews Emily Reynolds on the specifics of her project at the World Bank.
  • Emily Reynolds recaps a recent NDSR site-visit to the United States Holocaust Memorial Museum.
  • Erica Titkemeyer (Smithsonian Institute) discusses Handling Digital Assets in Time Based Media Art.
  • I’m talking web archiving at the Folger Shakespeare Library.
  • You can catch Lauren Work (PBS) and Julia Blase (National Security Archive) at the Spring CNI meeting later this month.
  • And finally, residents Margo Padilla (MITH), Molly Schwartz (ARL), Erica Titkemeyer (Smithsonian Institute), and Lauren Work (PBS) are New Voices in Digital Curation in April.

Emerging Trends in Digital Stewardship Symposium: Speaker Announcements!

As previously announced, the inaugural cohort of National Digital Stewardship Residents will present a symposium titled “Emerging Trends in Digital Stewardship” on April 8, 2014. This event, hosted by the Library of Congress, IMLS, and the National Library of Medicine will be located at the National Library of Medicine’s Lister Hill Auditorium and will consist of panel presentations on topics related to digital stewardship.

At this time, we are delighted to release a final program, including guest speakers and panel participants:

Tuesday, April 8, 2014

8:30-9:30         Registration
9:30-9:45         Opening Remarks

  • George Coulbourne and Kris Nelson, Library of Congress

9:45-10:45       BitCurator Demonstration

  • Cal Lee, UNC-Chapel Hill School of Information and Library Science

11:00-Noon     Panel Discussion:  Social Media, Archiving, and Preserving Collaborative Projects

  • Leslie Johnston, Library of Congress
  • Janel Kinlaw, NPR: National Public Radio
  • Laura Wrubel, George Washington University

Noon-1:15       Lunch Break

1:15-2:15         Panel Discussion:  Open Government and Open Data

  • Daniel Schuman, Citizens for Responsibility and Ethics in Washington
  • Jennifer Serventi, National Endowment for the Humanities
  • Nick Shockey, Scholarly Publishing and Academic Resources Coalition

2:45-3:45       Panel Discussion:  Digital Strategies for Public and Non-Profit Institutions

  • Carl Fleischhauer, Library of Congress
  • Eric Johnson, Folger Shakespeare Library
  • Matt Kirschenbaum, Maryland Institute for Technology in the Humanities
  • Kate Murray, Library of Congress
  • Trevor Owens, Library of Congress

3:45             Closing Remarks

We’re thrilled to have such wonderful participants and look forward to sparking some exciting discussions on all things digital stewardship. As a reminder, the symposium is free and open to the public and pre-registration is strongly encouraged. More information can be found here. We hope to see you there!

Categories: Planet DigiPres

Happy Birthday, Web!

The Signal: Digital Preservation - 13 March 2014 - 2:24pm

This is a guest post by Abbie Grotke, Library of Congress Web Archiving Team Lead and Co-Chair of the National Digital Stewardship Alliance Content Working Group

Yesterday we celebrated the 25th anniversary of the creation of the World Wide Web.

How many of you can remember the first time you saw a website, clicked on a hyperlink, or actually edited an HTML page? My “first web” story is still pretty fresh in my mind: It was probably around October 1993, in D.C. My brother and his friends were fairly tech savvy (they’d already set me up with an email account). We went over to his friend Miles’s house in Dupont Circle to visit, and while there he excitedly showed us this thing called Mosaic. I remember the gray screen and the strange concept of hyperlinks; my brother remembers seeing a short quicktime movie of a dolphin doing a flip.

We were all really excited.

Screenshot from the Election 2000 Web Archive of, captured October 23, 2000.

Flash forward to 2014: Although I vaguely remember life without the web (however did we find out “who that actor is that looks so familiar on the TV right now and what role she played in that movie that is on the tip of my tongue”?), I certainly can’t imagine a life without it in the future.  I’m in a job, preserving parts of the Internet, which would not exist had it not been for Tim Berners-Lee 25 years ago. For more on the 25th anniversary of the World Wide Web, check out Pew Research Internet Project’s “The Web at 25”.

As evidenced by Pew’s handy timeline of the Web, a lot has changed since the Internet Archive (followed by national libraries) began preserving the web in 1996. If you haven’t seen this other Web Archives timeline, I encourage you to check it out. Since those early days, the number of organizations archiving the web has grown.

The Library of Congress started its own adventure in preserving web content in 2000. For an institution that began in 1800, that certainly counts as a small amount of “Library” time. Although we’re not quite sure what our first archived website was (following the lead of our friends at the British Library), the first websites we crawled are from our Election 2000 Web Archive, and they include campaign websites from both George W. Bush and Al Gore, among others.

As you can see from the screenshots, and if you click through to those archived pages, certain things didn’t archive very well. Even something as simple as images wasn’t always captured comprehensively, and the full sites certainly weren’t archived. We’ve spent years since, with our partners around the globe, working to make “archival-quality” web archives that include much more than just the text of a site.

We’re all preserving more content than ever, but those charged with preserving this content still face challenges: keeping up with the scale of content being generated, navigating legal issues surrounding preservation of websites, and keeping up with the technologies used on the web (even if we want to preserve it, can we?), as has been discussed before on this blog. We’ve still got a lot of work to do.

Screenshot from the Election 2000 Web Archive of, captured August 3, 2000.

It’s also unclear what researchers of the future will want, or how they will want to use our archives and access the data we’ve preserved. More researchers are interested in access to the raw data for data-mining projects than we ever envisioned when we first started out. The International Internet Preservation Consortium has been reaching out at the last few General Assembly sessions to engage researchers during its “open days,” which have been incredibly interesting as we learn more about research use of our archives.

Twenty-five years in is as good a time as any to reflect on things, whether it’s the founding of the Web or the efforts to preserve the future web. Please feel free to share your stories and thoughts in the comments.

Categories: Planet DigiPres

JHOVE on Github

File Formats Blog - 12 March 2014 - 11:15am

The JHOVE repository on Github is now live. The SourceForge site is still there and holds the documentation. The Github site is a work in progress.

Categories: Planet DigiPres

Preserving News Apps

The Signal: Digital Preservation - 11 March 2014 - 2:01pm

Natl. Press Bldg. newsstand. Photo by Harris & Ewing, ca. 1940.

On Sunday, March 2, I had the opportunity to attend an OpenNews Hack Day event at the Newseum in Washington, D.C., sponsored by Knight-Mozilla OpenNews, PopUp Archive and the Newseum. The event was held in conjunction with the NICAR (National Institute for Computer-Assisted Reporting) conference on working with datasets and developing interactive applications in journalism.

This was not a hackathon, but what they termed a “designathon,” where the goal was to brainstorm about end-to-end approaches for archiving and preserving data journalism projects.  The problem of disappearing applications is very well outlined in blog posts by Jacob Harris and Matt Waite, which are part of “The Source Guide to the Care and Feeding of News Apps.”  From the introduction to the Guide:

“Any news app that relies on live or updated data needs to be built to handle change gracefully. Even relatively simple interactive features often require special care when it comes time to archive them in a useful way. From launch to retirement and from timeliness to traffic management, we offer a collection of articles that will help you keep your projects happy and healthy until it’s time to say goodbye.”

For some, awareness of the need for digital preservation in this community came from a desire to participate in a wonderful Tumblr called “News Nerd First Projects.” Developers wanted to share their earliest works through this collaborative effort — whether to brag or admit to some embarrassment — and many discovered that their work was missing from the web or still online but irreparably broken. Many were lucky if they had screenshots to document their work. Some found static remnants through the Internet Archive but nothing more.

The event brought together journalists, researchers, software developers and archivists. The group of about 50 attendees broke out into sub-groups discussing topics that included best practices for coding, documenting and packaging apps, saving and documenting the interactive experience, and documenting cultural context. Not too surprisingly, a lot of the conversation centered on best practices for coding, metadata, documentation, packaging and dealing with external dependencies.

There was a discussion about web harvesting, which captures static snapshots of rendered data and the design but not the interaction or the underlying data. Packaging up the underlying databases and tables captures the vital data so that it can be used for research, but loses the design and the interaction. Packaging up the app and the tables together with a documented environment means that it might run again, perhaps in an emulated environment, but if the app requires interactions with open or commercial external web service dependencies, such as for geolocation or map rendering, that functionality is likely lost. Finding the balance of preserving the data and preserving the interactivity is a difficult challenge.
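
The three trade-offs above can be sketched with placeholder commands. Everything here is hypothetical (the URL, database name, runtime versions and file names are illustrative, not from the event), and the first two approaches are shown as comments since they need a live site and database:

```shell
# Approach 1: web harvesting. A static snapshot of the rendered pages;
# the design survives, but interaction and underlying data do not.
#   wget --mirror --convert-links --page-requisites https://example.org/news-app/

# Approach 2: data packaging. Dump the underlying tables; the data
# survives, but the design and interaction do not.
#   pg_dump newsapp_db > archive/dump/newsapp_db.sql

# Approach 3: environment packaging. Bundle the app, its data and a
# documented runtime together, so the app might run again later,
# perhaps in an emulated environment.
mkdir -p archive/app archive/dump
echo "SELECT 1;" > archive/dump/newsapp_db.sql              # stand-in for a real dump
printf "Ruby 2.1\nPostgreSQL 9.3\n" > archive/ENVIRONMENT.md  # documented environment
tar czf newsapp-archive.tar.gz -C archive .
tar tzf newsapp-archive.tar.gz                              # lists the packaged files
```

Note that none of the three approaches captures external web-service dependencies such as geocoding or map tiles, which is exactly the gap the discussion identified.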

All in all, it’s early days for the conversation in this community, but awareness of the need for digital preservation has already been established and next steps are planned. I am looking forward to seeing this community move forward in its efforts to keep digital news sources alive.

Categories: Planet DigiPres

JHOVE, continued

File Formats Blog - 10 March 2014 - 11:30pm

There’s been enough encouragement in email and Twitter to my proposal to move JHOVE to Github that I’ll be going ahead with it. Andy Jackson has told me he has some almost-finished work to migrate the CVS history along with the project, so I’m waiting on that for the present. Watch this space for more news.

Tagged: JHOVE, software
Categories: Planet DigiPres

A New Viewshare in the Works: Public Beta for Existing Users

The Signal: Digital Preservation - 10 March 2014 - 5:12pm

The Viewshare team has been hard at work on an extensive revision of the Viewshare platform. Almost every part of the workflow and interface is being tweaked and revised; so much so that we didn’t just want to foist the whole thing on all the existing users at once. So, we have set up a sandbox for current Viewshare users to give it a try.

You can start trying out some of the new features at the beta site, where you can kick the tires and help us identify the bugs that are likely to emerge from such an extensive rework of the platform. Please post any questions, comments and issues in the feedback and troubleshooting forums.

Here is a quick set of notes on some of the major enhancements:

Get to building interfaces faster: From observing folks use the tool and talking with a lot of users, it was clear that there was way too much process. People want to pull in their data and see something. To that end, we have moved a few things around so that users see something sooner rather than later. You now upload your data and start building your interface straight away. The biggest impact of this is that there is no longer a distinction between “data” and “views” of data. You just build views.

An example of the interface for configuring a map in the new beta version of Viewshare.

Start fiddling with the Dials: Throughout revisions to the interface, we saw a lot of users respond well to situations where they could make configuration changes and directly see what those changes would mean in their interface. As a result, wherever possible we have tried to build a system that gives you a live preview of exactly what any interface decision or change will look like. You can tweak any of the presentation features and see, in real time, exactly what they will look like.

Embedded Audio and Video Players (HTML5): For a long time, users have been able to mark image URLs as such and then have them show up as images in their interfaces. We have extended this functionality to work the same way for audio and video links. If you have links to audio and video files in your collection data, users can now just click and start listening and viewing in modern browsers.

An example of the new HTML5 media player for wrapping links to audio and video files in a player.

Responsive Design: Viewshare interfaces have long been constrained to a maximum width. By switching to a different layout framework, a view in the new version can fill up whatever screen you have available. On the other end of the size spectrum, this also makes views look a lot better on mobile devices.

Bar Charts and Better Pie Charts: The pie charts were the weakest part of the whole platform; there was a lot of potential there, but they just didn’t really fit. To that end, we’ve added a whole new set of pie charts along with bar charts. These charts are particularly useful because they are actually interfaces to the collection items: you can click on a bar or a slice of a pie to see each of the items in that slice and click through to their item records.


An example of a new dynamic bar chart in a Viewshare view.

Share a Particular State in a View: Lots of Viewshare users have told us that when they get down to a particular set of records (say, those from a particular date range, by a particular author or from a particular region), they would love to be able to share a link to exactly that subset. Now you can. There is a little bookmark icon on each view; if you click it, you get a URL (admittedly not a particularly pretty one) that links directly to that subset of the view.

So if you want to write a blog post comparing one subset of the data with another you can link directly to each of those subsets. Similarly, you can just email a link to part of a view to a colleague if there is a subset that you think they would be interested in.

So, those of you out there with Viewshare accounts, please help us out and give it a try. We have made a copy of all the existing data, so you can go in and just check what your views will look like. Do what you like with the data in this beta instance without any fear; this will not affect any of your actual active Viewshare data sets or views. Eventually, when the whole system moves to the new version, your data will be migrated over.

Categories: Planet DigiPres

The state of JHOVE

File Formats Blog - 8 March 2014 - 6:23pm

As you may have noticed, I’ve been neglectful of JHOVE since last September, when 1.11 came out. Issues continue to arise, and people are still using it, but I’m not getting anything done about them.

The problem is that my current job has rather long hours, and when I come home from it, looking at more Java code isn’t at the top of my list of things to do. I’m very glad people are still using JHOVE, close to a decade after I started work on it as a contractor to the Harvard Library, but I’m not getting anything actually done.

It would help if there were more contributions from others, and its being on the moribund SourceForge isn’t helping. I think I could muster the energy to move it to Github, where more contributors might be interested. There’s already a Mavenized version by Andy Jackson there, which doesn’t include the Java source code but provides some important scaffolding and pom.xml files. It probably makes sense to start by forking this. The migration should also make the horrible JHOVE build procedure easier.

If this is something you’d like to see, let me know. I’d like some reassurance that this will actually help before I start.

Tagged: JHOVE, software
Categories: Planet DigiPres