Planet DigiPres

New look and feel for Viewshare

The Signal: Digital Preservation - 21 March 2014 - 5:55pm

Earlier this month Trevor Owens announced that a new version of Viewshare is open for user testing and comment. Following this public beta, our plan is to move all users over to the new platform in the next few months.  When this happens your Viewshare account, data and views will all transition seamlessly. You will, however, notice visual and functional improvements to the views. The overall look and feel has been modernized and more functionality has been added to some of the views, particularly those that use pie charts.

Trevor gave an overview of all of the new features in his post announcing these changes. In this post I’ll focus on the visual updates. I would encourage everyone with a Viewshare account to check out how your views look in the new version of Viewshare and let us know what you think or if you have any issues.

Responsive Design

The new version of Viewshare implements responsive design which will allow your views to look good and be functional on any computer or device. You can see this in action with a view of digital collections preserved by NDIIPP partners on both a large and small screen. The view can fill up a large computer monitor screen and be equally functional and usable on a smartphone. This added feature will require no action from users and will work automatically.

NDIIPP Collections on a smartphone using the new version of Viewshare.

NDIIPP Collections view on large monitor

Changes for Charts

Bar chart views are available in the new version of Viewshare. The pie charts have also been greatly improved. Visually, they are clearer and the text is more legible. Functionally, users are able to click through to items that are represented in different areas of the pie chart. This isn’t possible in the current Viewshare. Check out the two versions of the same data from the East Texas Research Center and you’ll see the improvements.

I do want to point out that in the current version of Viewshare there’s an option to switch between two different pie charts on the same view by using a “view by” drop-down menu. To simplify the building process for these views, that option was eliminated in the new version of Viewshare, so if you want two pie charts all you have to do is create two views. If your current pie chart view offers more than one chart, the chart listed first will be the one that displays in the new version. To restore the missing chart, simply create an additional pie chart view.

Current pie chart view

New version of pie charts

Share Filtered or Subsets of Results

The new version of Viewshare allows users to share the results of a particular state in a view. An example of this is shown in the Carson Monk-Metcalf view of birth and death records. The view below shows a scatterplot chart of birth years vs. death years, along with race and religion (religion data is not shown below but is accessible in the view). The view is limited to records for those who were 75 years or older at the time of their death. The user can cite or link to this particular state of the view by clicking the red bookmark icon in the upper right and sharing or saving the link provided.

Carson Monk-Metcalf bookmarked results

Again, be sure to check out your views in the new Viewshare; your current login credentials will work. As always, let us know what you think in the comments of this post or in the user feedback forums for Viewshare.

Categories: Planet DigiPres

CSV Validator - beta releases

Open Planets Foundation Blogs - 21 March 2014 - 2:51pm

For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against user-defined schemas.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under Mozilla Public Licence version 2.0.

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.

Preservation Topics: Tools

A Tika to ride; characterising web content with Nanite

Open Planets Foundation Blogs - 21 March 2014 - 1:58pm

This post covers two related topics: characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test to check that the input files are valid gz files.  This is very quick (takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
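
The kind of up-front check described above can be sketched in a few lines. This is an illustration in Python rather than Nanite's own Java code, and the function name and probe size are made up for the example. It only checks the gzip magic bytes and decompresses the first few kilobytes, so it stays cheap but would not notice damage further into a file.

```python
import gzip
import zlib


def looks_like_gzip(path, probe_bytes=4096):
    """Cheap sanity check: gzip magic bytes plus a short trial decompression.

    Hypothetical helper for illustration; it does not read the whole file,
    so truncation or corruption beyond the probe goes undetected.
    """
    try:
        with open(path, "rb") as raw:
            if raw.read(2) != b"\x1f\x8b":  # gzip magic number
                return False
        with gzip.open(path, "rb") as stream:
            stream.read(probe_bytes)  # forces header parsing and some inflation
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

Reading each file to the end instead of probing would catch more problems, at the cost of decompressing the entire input before the real job starts.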

We have been working on Nanite to add different characterisation libraries and to improve them and their coverage.  As the tools that are used are all Java, or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file.  Libraries can be turned on or off relatively easily by editing the source or in the jar.
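
As a quick check of the per-file figure, the arithmetic can be written out as follows. It uses only the two numbers quoted above and gives elapsed time on the cluster divided by the number of files, not CPU time per file.

```python
files = 93_000_000      # files in the 1TB of WARCs
runtime_hours = 17.5    # elapsed time on the cluster

ms_per_file = runtime_hours * 3600 * 1000 / files
print(f"{ms_per_file:.2f} ms/file")  # prints "0.68 ms/file"
```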

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to have been files in that corpus that were corrupt or otherwise broken that would cause crashes in Tika or its dependencies. 

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that the existence of one bad file of a particular format meant that no characterisation of that format could be carried out.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok; however, it made use of Thread.stop(), a deprecated API call to stop a Java Thread.  Use of this API call is strongly discouraged as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further.

I should note that subsequently it has been suggested that an alternative to using Thread.stop() is to just leave the stuck Thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.
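
The "leave it alone and start a new thread" idea can be illustrated with a small sketch. It is written in Python rather than Java, the helper name is hypothetical, and, like the suggestion itself, it has not been tried against Tika. The caller waits for a limited time and then simply abandons the worker thread instead of stopping it.

```python
import threading


def run_with_timeout(func, args=(), timeout_seconds=1.0):
    """Run func(*args) in a background thread and stop waiting after a timeout.

    Returns (True, result) on success and (False, None) if the call raised
    an exception or did not finish in time. A timed-out thread is abandoned,
    never forcibly stopped.
    """
    outcome = {}

    def worker():
        try:
            outcome["result"] = func(*args)
        except Exception as exc:  # a failing parser must not take the caller down
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_seconds)
    if thread.is_alive() or "error" in outcome:
        return False, None
    return True, outcome["result"]
```

The obvious cost is that a truly hung worker keeps its memory, and possibly a CPU core, until the process exits, so this only works if hangs are rare relative to the lifetime of the process.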

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracted a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read, although such errors should be logged by Nanite in the Hadoop Reducer output.
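
The principle of process isolation can be shown with a minimal sketch. This is not the ProcessIsolatedTika code: that library keeps a long-running Tika-server and talks to it over REST, whereas this hypothetical Python helper starts a fresh child process per call and passes data over stdin and stdout. What it does share is the key property that a crash or hang in the child is cleaned up by the operating system and never touches the caller's state.

```python
import subprocess


def run_isolated(argv, payload, timeout_seconds=30.0):
    """Run a command in its own OS process, feeding it payload on stdin.

    Returns the child's stdout bytes, or None if the child exited with a
    non-zero status (crash) or exceeded the time limit (hang).
    """
    try:
        completed = subprocess.run(
            argv,
            input=payload,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        return None
    if completed.returncode != 0:
        return None
    return completed.stdout
```

Starting a process per file would be far too slow for millions of records, which is presumably why a long-lived server that is only restarted on failure is the better fit in practice.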

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by the time that process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, longer initial start-up time for Mappers and additional time for restarts of the Tika-server on failed files.  Given average runtime per WARC is slightly over 5 minutes, the few additional seconds that are included for using a process isolated Tika is not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and that mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST like it would be if we called them directly from the Java code.

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node); the results are as follows:

Identification tools used: Nanite-core (Droid), Tika detect() (mimetype only), ProcessIsolatedTika parsers
WARC files: 1000
Total WARC size: 59.4GB (63,759,574,081 bytes)
Total files in WARCs (# input records): (value missing)
Runtime (hh:mm:ss): (value missing)
Total Tika parser output size (compressed): 765MB (801,740,734 bytes)
Tika parser failures/crashes: (value missing)
Misc failures: Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed PDF version detection within the PDF parser.

As an aside: if FITS offered a REST interface then the ProcessIsolatedTika code could be easily modified to replace Tika with FITS. This is worth considering, if there was interest and someone were to create such a REST interface.

Apologies for the puns.

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE

Nominations Now Open for the 2014 NDSA Innovation Awards

The Signal: Digital Preservation - 20 March 2014 - 8:19pm

“12 year old girl wins Medal of Honor, Washington, D.C., Sept. 12.” Library of Congress, Prints & Photographs Collection. LC-DIG-hec-33759.

 The National Digital Stewardship Alliance Innovation Working Group is proud to open the nominations for the 2014 NDSA Innovation Awards. As a diverse membership group with a shared commitment to digital preservation, the NDSA understands the importance of innovation and risk-taking in developing and supporting a broad range of successful digital preservation activities. These awards are an example of the NDSA’s commitment to encourage and recognize innovation in the digital stewardship community.

This slate of annual awards highlights and commends creative individuals, projects, organizations and future stewards demonstrating originality and excellence in their contributions to the field of digital preservation. The program is administered by a committee drawn from members of the NDSA Innovation Working Group.

Last year’s winners are exemplars of the diversity and collaboration essential to supporting the digital stewardship community as it works to preserve and make available digital materials. For more information on the details of last year’s recipients, please see the blog post announcing last year’s winners.

The NDSA Innovation Awards focus on recognizing excellence in one or more of the following areas:

  • Individuals making a significant, innovative contribution to the field of digital preservation;
  • Projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship;
  • Organizations taking an innovative approach to providing support and guidance to the digital preservation community;
  • Future stewards, especially students, but including educators, trainers or curricular endeavors, taking a creative approach to advancing knowledge of digital preservation theory and practices.

Acknowledging that innovative digital stewardship can take many forms, eligibility for these awards has been left purposely broad. Nominations are open to anyone or anything that falls into the above categories and any entity can be nominated for one of the four awards. Nominees should be US-based people and projects or collaborative international projects that contain a US-based partner. This is your chance to help us highlight and reward novel, risk-taking and inventive approaches to the challenges of digital preservation.

Nominations are now being accepted and you can submit a nomination using this quick, easy online submission form. You can also submit a nomination by emailing a brief description, justification and the URL and/or contact information of your nominee to ndsa (at)

Nominations will be accepted until Friday May 2, 2014 and winners announced in mid-May. The prizes will be plaques presented to the winners at the Digital Preservation 2014 meeting taking place in the Washington, DC area on July 22-24, 2014. Winners will be asked to deliver a very brief talk about their activities as part of the awards ceremony and travel funds are expected to be available for these invited presenters.

Help us recognize and reward innovation in digital stewardship and submit a nomination!


Long term accessibility of digital resources in theory and practice

Alliance for Permanent Access News - 20 March 2014 - 3:25pm

The APARSEN project is organising a Satellite Event on “Long Term Accessibility of Digital Resources in Theory and Practice” on 21st May 2014 in Vienna, Austria.

It takes place in the context of the 3rd LIBER Workshop on Digital Curation “Keeping data: The process of data curation” (19-20 May 2014)

The programme is organised by the APARSEN project together with the SCAPE Project.

09:00 – 10:30
Sabine Schrimpf (German National Library): Digital Rights Management in the context of long-term preservation
Ross King (Austrian Institute of Technology): The SCAPE project and Scalable Quality Control
David Wang (SBA Research): Understanding the Costs of Digital Curation

11:00 – 12:30
Sven Schlarb (Austrian National Library): Application scenarios of the SCAPE project at the Austrian National Library
Krešimir Đuretec (Vienna University of Technology): The SCAPE Planning and Watch Suite
David Giaretta (Alliance for Permanent Access): Digital Preservation: How APARSEN can help answer the key question “Who pays and Why?”

A Regional NDSA?

The Signal: Digital Preservation - 19 March 2014 - 5:59pm

The following is a guest post by Kim Schroeder, a lecturer at the Wayne State University School of Library and Information Science.

Several years ago before the glory of the annual NDSA conference, professionals across America were seeking more digital curation literature and professional contacts.  Basic questions like ‘what is the first step in digital preservation?’ and ‘how do I start to consistently manage digital assets?’ were at the forefront.

As we have worked toward increased information sharing including the invaluable annual NDSA and IDCC conferences, we see a disconnect as we return home.  As we try to implement new tools and processes, we inevitably hit bumps beyond our immediate knowledge.  This is being alleviated more and more by local meetings being hosted in regions to gather professionals for hands-on and hand-waving process sharing.

Lance Stuchell, Digital Preservation Librarian at the University of Michigan, and I began the Regional Digital Preservation Practitioners (RDPP) meetings as an opportunity to talk through our challenges and solutions.  The result is that over 100 professionals have signed up for our listserv since our call one year ago. We sent announcements out to Windsor, Toledo, Ann Arbor and throughout Metro Detroit to let people know that there is an untapped community of professionals that want and need to share their progress on digital curation.

Kevin Barton in the Wayne State SLIS Lab. Photo credit: Mary Jane Murawka

In the last year we have held three meetings with more planned this year.  The initial meeting included a discussion and eventually a survey to define our biggest issues as well as how best to craft the group.  Other topics included a digital projects lab tour, a DSpace installation overview, and a demonstration of a mature Digital Asset Management system.  Coming later this year, we plan to focus on metadata issues and a symposium on how to create workflows.  Further information about the meetings is available at the Regional Digital Preservation Practitioners site.

The development of the list has been one of the more helpful pieces with folks posting jobs, practicum ideas, latest articles and technical questions.  The volume of discussion is not there yet but it is off to a healthy start.

Mid-Michigan has also created a similar group that works with us to schedule events and share information.  Ed Busch, the Electronic Records Archivist at Michigan State University (MSU) held a successful conference last summer at MSU and he said:  “What my co-worker Lisa Schmidt and I find so useful with our Mid-Michigan regional meeting is the chance to network with other professionals trying to solve the same situations as we are with digital assets; hearing what they’ve tried with success and failure; and finding new ways to collaborate. All institutions with digital assets, regardless of size, are in the same boat when it comes to dealing with this material. It’s really nice to hear that from your peers.”  They held another conference on March 14th of this year and the agenda is available (pdf).

The NDSA is also encouraging regions to join together beyond the annual meeting. Butch Lazorchak, a Digital Archivist at the National Digital Information Infrastructure and Preservation Program, shared his thoughts on this. “The NDSA regional meetings are a great opportunity for NDSA members to share the work they’ve done to advance digital stewardship practice,” he said. “At the same time, the meetings help to build community by bringing together regional participants who may not usually find an opportunity to get together to discuss digital stewardship practice and share their own triumphs and challenges.”

Beginning a regional group is fairly easy as you send out announcements to professional listservs, but the tougher part is administration.  Deciding who keeps the minutes, manages the list, hosts the next meeting and how to maintain momentum is a necessity.  With the explosion in research, professional literature and expanding conferences we have more avenues to explore but we need the hands-on lessons learned from local colleagues to continue successful experimentation.  We would encourage you to think about starting your own local group!


Things to Know About Personal Digital Archiving 2014

The Signal: Digital Preservation - 18 March 2014 - 8:44pm

Personal Digital Archiving 2014 will be held at the Indiana State Library in Indianapolis, Indiana, April 10-11, 2014.  This is THE conference that raises awareness among individuals, public institutions and private companies engaged in the creation, preservation and ongoing use of personal digital content.  A key overarching topic will be how libraries, archives and other cultural heritage organizations can support personal digital archiving within our own community as well as reaching out to specific communities. We invite you to come out and join the conversation.

The two-day conference will feature a diverse range of presentations on topics such as: archiving and documentation practices of local communities; tools and techniques to process digital archives; investigations of building, managing and archiving scholarly practices and family history archives; and the challenges of communicating personal digital archiving benefits to a variety of audiences. The full list of presentations, lightning talks and posters can be found here.

Tag cloud of PDA14 presentation titles.

Here are a few quick things to know about the upcoming conference:

  • Keynote speakers will explore preservation challenges from the perspectives of both researchers and creators of personal digital information.  Andrea Copeland from the School of Informatics and Computing, Indiana University-Purdue, will talk about her research looking into public library users’ digital preservation practices. Charles R. Cross, a music historian & author, will talk about the value of personal archives from a biographer’s perspective.
  • Adequate infrastructure in many organizations to implement preservation of personal digital records is lacking.  There will be a number of presentations on the practical side of doing personal digital preservation using specific tools and services.  Some will be on consumer-level services that help individuals build their own personal digital archives. Other presentations will be from librarians, archivists and researchers who are using certain tools to help their institutions manage personal digital records.
  • Knowledge related to accession, donor or legal requirements, researchers’ interests, and practical preservation strategies for personal digital archives is equally lacking. To help understand some of these issues, practitioners, scholars and individuals from different fields will share their current research on personal digital archiving topics.  For the first time, the conference will feature a panel discussion from contemporary architects and landscape architects talking about preserving their work and transferring it to archives.  This is a community of professionals not regularly represented at the PDA conference and provides a great opportunity to hear about their specific challenges.

Registration is open!  We hope you can join us and explore and help raise awareness of the need for personal digital archiving in your own communities.


Three years of SCAPE

Open Planets Foundation Blogs - 18 March 2014 - 12:24pm

SCAPE is proud to look back at another successful project year. During the third year the team produced many new tools, e.g. ToMaR, a tool which wraps command line tools into Hadoop MapReduce jobs. Other tools like xcorrSound and C3PO have been developed further.

This year’s All-Staff Meeting took place mid-February in Póvoa de Varzim, Portugal. The team organised a number of general sessions, during which the project partners presented demos of and elevator pitches for the tools and services they developed in SCAPE. It was very interesting for all meeting participants to see the results achieved so far. The demos and pitches were also useful for re-focusing on the big picture of SCAPE. During the main meeting sessions the participants mainly focused on take up and productization of SCAPE tools.

Another central topic of the meeting was integration. Until the end of the project the partners will put an emphasis on integrating the results further. To prove scalability of the tools, the team set up a number of operative Hadoop cluster instances (both central and local), which are currently being used for the evaluation of the tools and workflows.

Another focus lies on the sustainability of SCAPE tools. The SCAPE team is working towards documenting the tools for both developers and users. The Open Planets Foundation will curate SCAPE outcomes until the end of the project and will keep them available.

In September 2014 SCAPE is organising a final event in collaboration with APARSEN. The workshop is planned to take place at the Digital Libraries 2014 conference in London, where SCAPE will have its final, overall presentation. The workshop is directed towards developers, content holders, and data managers. The SCAPE team will present tools and services developed since 2011. A special focus will lie on newly and further developed open source tools for scalable preservation actions; SCAPE’s scalable Platform architecture; and its policy-based Planning and Watch solutions.

Preservation Topics: SCAPE

Mavenized JHOVE

File Formats Blog - 16 March 2014 - 2:19pm

I’m not a Maven maven, but more of a Maven klutz. Nonetheless, I’ve managed to push a Mavenized version of JHOVE to Github that compiles for me. I haven’t tried to do anything beyond compiling. If anyone would like to help clean it up, please do.

This kills the continuity of file histories which Andy worked so hard to preserve, since Maven has its own ideas of where files should be. The histories are available under the deleted files in their old locations, if you look at the original commit.

Tagged: JHOVE, software

ToMaR - How to let your preservation tools scale

Open Planets Foundation Blogs - 14 March 2014 - 4:01pm

Whenever you have got used to a command line tool and all of a sudden need to apply it to a large number of files on a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.

Mathilda is working at the department for digital preservation at a famous national library. In her daily work she has to cope with various well-known tasks like data identification, migration and curation. She is experienced in using the command shell on a Unix system and occasionally has to write small scripts to perform a certain workflow effectively.

When she has to deal with a few hundred files she usually invokes her shell script on one file after the other, using a simple loop for automation. But today she has been put in charge of a much bigger data set than she is used to. There are one hundred thousand TIFF images which need to be migrated to JPEG2000 images in order to save storage space. Intuitively she knows that processing these files one after the other, with each single migration taking about half a minute, would take about five weeks of continuous running (100,000 × 30 seconds is roughly 35 days).

Luckily Mathilda has heard of the Hadoop cluster that colleagues of hers have recently set up in order to do some data mining on a large collection of text files. "Would there be a way to run my file migration tool on that cluster thing?", she thinks, "If I could run it in parallel on all these machines then that would speed up my migration task tremendously!" Only one thing makes her hesitate: she has hardly any Java programming skills, not to mention any idea of the MapReduce programming paradigm they are using in their data mining task. How can she let her tool scale?

That's where ToMaR, the Tool-to-MapReduce Wrapper, comes in!

What can ToMaR do?

If you have a running Hadoop cluster you are only three little steps away from letting your preservation tools run on thousands of files almost as efficiently as with a native single-purpose Java MapReduce application. ToMaR wraps command line tools into a Hadoop MapReduce job which executes the command on all the worker nodes of the Hadoop cluster in parallel. Depending on the tool you want to use through ToMaR, it might be necessary to install it on each cluster node beforehand. Then all you need to do is:

  1. Specify your tool so that ToMaR can understand it using the SCAPE Tool Specification Schema.
  2. Itemize the parameters of the tool invocation for each of your input files in a control file.
  3. Run ToMaR.

Through MapReduce your list of parameter descriptions in the control file will be split up and assigned to each node portion by portion. For instance ToMaR could have been configured to create splits of 10 lines each taken from the control file. Then each node parses the portion line by line and invokes the tool with the parameters specified therein each time.
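
The splitting step can be pictured with a short sketch. It is plain Python for illustration, not ToMaR or Hadoop code, and the function name is hypothetical. It cuts the lines of a control file into portions of ten lines, the portion size used in the example above.

```python
def make_splits(control_lines, lines_per_split=10):
    """Group non-empty control-file lines into consecutive portions.

    Each portion stands for the chunk of work one node would receive and
    then process line by line.
    """
    lines = [line for line in control_lines if line.strip()]
    return [
        lines[start:start + lines_per_split]
        for start in range(0, len(lines), lines_per_split)
    ]
```

With 25 control-file lines this yields three portions of 10, 10 and 5 lines. In the real system Hadoop performs the splitting and hands the portions to the nodes.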

File Format Migration Example

So how may Mathilda tackle her file format migration problem? First she will have to make sure that her tool is installed on each cluster node. Her colleagues who maintain the Hadoop cluster will take care of this requirement. It is up to her to create the Tool Specification Document (ToolSpec) using the SCAPE Tool Specification Schema and to itemize the tool invocation parameter descriptions. The following figure depicts the required workflow:

Create the ToolSpec

The ToolSpec is an XML file which contains several operations. An operation consists of a name, a description, a command pattern and input/output parameters. The operation for Mathilda's file format migration tool might look like this:

<operation name="image-to-j2k">
  <description>Migrates an image to jpeg2000</description>
  <command>
    image_to_j2k -i ${input} -o ${output} -I -p RPCL -n 7 -c [256,256], [256,256],[128,128],[128,128],[128,128],[128,128],[128,128] -b 64,64 -r 320.000,160.000,80.000,40.000,20.000,11.250,7.000,4.600,3.400,2.750, 2.400,1.000
  </command>
  <inputs>
    <input name="input" required="true">
      <description>Reference to input file</description>
    </input>
  </inputs>
  <outputs>
    <output name="output" required="true">
      <description>Reference to output file. Only *.j2k, *.j2c or *.jp2!</description>
    </output>
  </outputs>
</operation>

In the <command> element she has put the actual command line with a long tail of static parameters. This example highlights another advantage of the ToolSpec: You gain the ease of wrapping complex command lines in an atomic operation definition which is associated with a simple name, here "image-to-j2k". Inside the command pattern she puts placeholders which are replaced by various values. Here ${input} and ${output} denote such variables so that the value of the input file parameter (-i) and the value of the output file parameter (-o) can vary with each invocation of the tool.

Along with the command definition Mathilda has to describe these variables in the <inputs> and <outputs> sections. Since ${input} is the placeholder for an input file, she has to add an <input> element with the name of the placeholder as an attribute. The same applies to the ${output} placeholder. Additionally she can add some description text to these input and output parameter definitions.

There are more constructs possible with the SCAPE Tool Specification Schema which cannot be covered here. The full contents of this ToolSpec can be found in the file attachments.

Create the Control File

The other essential requirement Mathilda has to fulfil is the creation of the control file. This file contains the real values for the tool invocation which are mapped to the ToolSpec by ToMaR. For the above example her control file will look something like this:

openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"
openjpeg image-to-j2k --input="hdfs://myFile2.tif" --output="hdfs://myFile2.jp2"
openjpeg image-to-j2k --input="hdfs://myFile3.tif" --output="hdfs://myFile3.jp2"
...

The first word refers to the name of the ToolSpec ToMaR shall load. In this example the ToolSpec is called "openjpeg.xml" but only the name without the .xml extension is needed for the reference. The second word refers to an operation within that ToolSpec; here it is the "image-to-j2k" operation described in the ToolSpec example snippet above.

The rest of the line contains references to input and output parameters. Each reference starts with a double dash followed by a parameter name and value pair. So --input (and likewise --output) refers to the parameter named "input" in the ToolSpec which in turn refers to the ${input} placeholder in the command pattern. The values are file references on Hadoop's Distributed File System (HDFS).
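The structure of a control file line can be illustrated with a small Python sketch. This is only a reading aid under the assumption that values are wrapped in plain double quotes; it does not reproduce ToMaR's own parser, and the function name parse_control_line is made up for this example.

```python
import shlex
from typing import Dict, Tuple


def parse_control_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Split one control-file line into (toolspec, operation, parameters)."""
    tokens = shlex.split(line)  # strips the double quotes around the values
    if len(tokens) < 2:
        raise ValueError(f"expected '<toolspec> <operation> ...', got: {line!r}")
    toolspec, operation, *references = tokens
    params: Dict[str, str] = {}
    for reference in references:
        # Each reference looks like --name=value
        if not reference.startswith("--") or "=" not in reference:
            raise ValueError(f"malformed parameter reference: {reference!r}")
        name, value = reference[2:].split("=", 1)
        params[name] = value
    return toolspec, operation, params


line = 'openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"'
toolspec, operation, params = parse_control_line(line)
print(toolspec, operation, params)
```

The first two tokens select the ToolSpec and the operation, and the remaining tokens become the name/value pairs that fill the placeholders.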

As Mathilda has 100k TIFF images she will have 100k lines in her control file. As she knows how to use the command shell she quickly writes a script which generates this file for her.
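Such a generator script could be as short as the following Python sketch. Mathilda's actual script is not shown in this post, so everything here is an assumption: the function name control_lines, the HDFS directories and the choice of .jp2 as the output extension.

```python
from pathlib import PurePosixPath
from typing import Iterable, Iterator


def control_lines(tiff_paths: Iterable[str], output_dir: str,
                  toolspec: str = "openjpeg",
                  operation: str = "image-to-j2k") -> Iterator[str]:
    """Yield one control-file line per input TIFF."""
    for path in tiff_paths:
        stem = PurePosixPath(path).stem  # file name without the .tif extension
        yield f'{toolspec} {operation} --input="{path}" --output="{output_dir}/{stem}.jp2"'


# Hypothetical input listing; in practice this would come from an HDFS directory listing.
tiffs = [f"hdfs:///user/mathilda/tiff/page{i}.tif" for i in range(1, 4)]
lines = list(control_lines(tiffs, "hdfs:///user/mathilda/jp2"))
print("\n".join(lines))
```

Writing the joined lines to controlfile.txt would give the input that ToMaR expects.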

Run ToMaR

Having created the ToolSpec openjpeg.xml and the control file controlfile.txt, she copies openjpeg.xml into the directory "hdfs:///user/mathilda/toolspecs" of HDFS and executes the following command on the master node of the Hadoop cluster:

hadoop jar ToMaR.jar -i controlfile.txt -r hdfs:///user/mathilda/toolspecs

Here she feeds in the controlfile.txt and the location of her ToolSpecs and ToMaR does the rest. It splits up the control file and distributes a certain number of lines per split to each node. The ToolSpec is loaded and the parameters are mapped to the command line pattern contained in the named operation. Input files are copied from HDFS to the local file system. As the placeholders are replaced by the values the command line can be executed by the worker node. After that the result output file is copied back to HDFS to the output location given.
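The mapping of parameter values onto the command pattern can be pictured with Python's string.Template, which happens to use the same ${name} placeholder syntax. This is a sketch of the principle only, not ToMaR's implementation; the command pattern is shortened, and the local /tmp paths are hypothetical stand-ins for the files copied from HDFS.

```python
from string import Template
from typing import Dict


def build_command(pattern: str, values: Dict[str, str]) -> str:
    """Replace ${name} placeholders in a command pattern with concrete values."""
    # substitute() raises KeyError if a placeholder has no matching value
    return Template(pattern).substitute(values)


# Shortened version of the command pattern from the ToolSpec example.
pattern = "image_to_j2k -i ${input} -o ${output} -I -p RPCL -n 7"
command = build_command(pattern, {
    "input": "/tmp/myFile1.tif",   # hypothetical local copy of the HDFS input
    "output": "/tmp/myFile1.jp2",  # hypothetical local output, copied back afterwards
})
print(command)  # image_to_j2k -i /tmp/myFile1.tif -o /tmp/myFile1.jp2 -I -p RPCL -n 7
```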

Finally Mathilda has all the migrated JPEG2000 images on HDFS in a fraction of the time it would have taken to run the migration sequentially on her machine.

  • easily take up external tools with a clear mapping between the instructions and the physical invocation of the tool
  • use the SCAPE Toolspec, as well as existing Toolspecs, and its advantage of associating simple keywords with complex command-line patterns
  • no programming skills needed, as the minimum requirement is only to set up the control file

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.

ToMaR offers the possibility to use existing command-line tools in Hadoop's distributed environment very similarly to a desktop computer. By utilizing SCAPE Tool Specification documents, ToMaR allows users to associate complex command-line patterns with simple keywords, which can be referenced for execution on a computer cluster. ToMaR is a generic MapReduce application which does not require any programming skills.

Check out the following blog posts for further usage scenarios of ToMaR:


Preservation Topics: Preservation Actions, SCAPE

Attachments:

  • Full openjpeg ToolSpec (1.02 KB)
  • ToMaR-image_to_j2k-workflow.png (158.29 KB)
  • ToMaR-overview.png (67.97 KB)
  • logo.png (74.65 KB)
Categories: Planet DigiPres

Upcoming NDSR Symposium “Emerging Trends in Digital Stewardship”: Speaker Announcements

The Signal: Digital Preservation - 14 March 2014 - 1:23pm

The following is a guest post by Jaime McCurry, National Digital Stewardship Resident at the Folger Shakespeare Library.

It’s certainly been an exciting and busy few months for the National Digital Stewardship Residents and although we are well into the final portion of our projects, we’re showing no signs of slowing down.

NDSRs view time-based media art at the American Art Museum and National Portrait Gallery. Photo courtesy of Emily Reynolds.

In addition to our regularly scheduled programming on The Signal, you can find the residents on the web and elsewhere talking digital stewardship and providing project updates:

  • Julia Blase discusses her project status at the National Security Archive in Forward…March!
  • Heidi Dowding discusses the most recent NDSR Enrichment Session, hosted this month at Dumbarton Oaks.
  • Continuing with the Resident-to-Resident Interview Series, Maureen McCormick Harlow (National Library of Medicine) interviews Emily Reynolds on the specifics of her project at the World Bank.
  • Emily Reynolds recaps a recent NDSR site-visit to the United States Holocaust Memorial Museum.
  • Erica Titkemeyer (Smithsonian Institution) discusses Handling Digital Assets in Time Based Media Art.
  • I’m talking web archiving at the Folger Shakespeare Library.
  • You can catch Lauren Work (PBS) and Julia Blase (National Security Archive) at the Spring CNI meeting later this month.
  • And finally, residents Margo Padilla (MITH), Molly Schwartz (ARL), Erica Titkemeyer (Smithsonian Institution), and Lauren Work (PBS) are New Voices in Digital Curation in April.

Emerging Trends in Digital Stewardship Symposium: Speaker Announcements!

As previously announced, the inaugural cohort of National Digital Stewardship Residents will present a symposium titled “Emerging Trends in Digital Stewardship” on April 8, 2014. This event, hosted by the Library of Congress, IMLS, and the National Library of Medicine, will be held at the National Library of Medicine’s Lister Hill Auditorium and will consist of panel presentations on topics related to digital stewardship.

At this time, we are delighted to release a final program, including guest speakers and panel participants:

Tuesday, April 8, 2014

8:30-9:30         Registration
9:30-9:45         Opening Remarks

  • George Coulbourne and Kris Nelson, Library of Congress

9:45-10:45       BitCurator Demonstration

  • Cal Lee, UNC-Chapel Hill School of Information and Library Science

11:00-Noon     Panel Discussion:  Social Media, Archiving, and Preserving Collaborative Projects

  • Leslie Johnston, Library of Congress
  • Janel Kinlaw, NPR: National Public Radio
  • Laura Wrubel, George Washington University

Noon-1:15       Lunch Break

1:15-2:15         Panel Discussion:  Open Government and Open Data

  • Daniel Schuman, Citizens for Responsibility and Ethics in Washington
  • Jennifer Serventi, National Endowment for the Humanities
  • Nick Shockey, Scholarly Publishing and Academic Resources Coalition

2:45-3:45       Panel Discussion:  Digital Strategies for Public and Non-Profit Institutions

  • Carl Fleischhauer, Library of Congress
  • Eric Johnson, Folger Shakespeare Library
  • Matt Kirschenbaum, Maryland Institute for Technology in the Humanities
  • Kate Murray, Library of Congress
  • Trevor Owens, Library of Congress

3:45             Closing Remarks

We’re thrilled to have such wonderful participants and look forward to sparking some exciting discussions on all things digital stewardship. As a reminder, the symposium is free and open to the public and pre-registration is strongly encouraged. More information can be found here. We hope to see you there!

Categories: Planet DigiPres

Happy Birthday, Web!

The Signal: Digital Preservation - 13 March 2014 - 2:24pm

This is a guest post by Abbie Grotke, Library of Congress Web Archiving Team Lead and Co-Chair of the National Digital Stewardship Alliance Content Working Group

Yesterday we celebrated the 25th anniversary of the creation of the World Wide Web.

How many of you can remember the first time you saw a website, clicked on a hyperlink, or actually edited an HTML page? My “first web” story is still pretty fresh in my mind: It was probably around October 1993, in D.C. My brother and his friends were fairly tech savvy (they’d already set me up with an email account). We went over to his friend Miles’s house in Dupont Circle to visit, and while there he excitedly showed us this thing called Mosaic. I remember the gray screen and the strange concept of hyperlinks; my brother remembers seeing a short quicktime movie of a dolphin doing a flip.

We were all really excited.

Screenshot from the Election 2000 Web Archive of, captured October 23, 2000.

Flash forward to 2014: Although I vaguely remember life without the web (however did we find out “who that actor is that looks so familiar on the TV right now and what role she played in that movie that is on the tip of my tongue”?), I certainly can’t imagine a life without it in the future. I’m in a job, preserving parts of the Internet, which would not exist had it not been for Tim Berners-Lee 25 years ago. For more on the 25th anniversary of the World Wide Web, check out Pew Research Internet Project’s “The Web at 25”.

As Pew’s handy timeline of the Web shows, a lot has changed since the Internet Archive (followed by national libraries) began preserving the web in 1996. If you haven’t seen this other Web Archives timeline, I encourage you to check it out. Since those early days, the number of organizations archiving the web has grown.

The Library of Congress started its own adventure preserving web content in 2000. For an institution that began in 1800, it certainly counts as a small amount of “Library” time. Although we’re not quite sure what our first archived website was (following the lead of our friends at the British Library) the first websites we crawled are from our Election 2000 Web Archive, and include campaign websites from both George W. Bush and Al Gore, among others.

As you can see by the screenshots, and if you click off to those archived pages, certain things didn’t archive very well. Something as simple as images weren’t always captured comprehensively, and the full sites certainly weren’t archived. We’ve spent years, with our partners around the globe, working to make “archival-quality” web archives that include much more than just the text of a site.

We’re all preserving more content than ever, but those charged with preserving it still face challenges: keeping up with the scale of content being generated, the legal issues surrounding preservation of websites, and the technologies used on the web (even if we want to preserve it, can we?), as has been discussed before on this blog. We’ve still got a lot of work to do.

Screenshot from the Election 2000 Web Archive of, captured August 3, 2000.

It’s also unclear what researchers of the future will want, and how they will want to use our archives and access the data that we’ve preserved. More researchers have interest in access to the raw data for data mining projects than we ever envisioned when we first started out. The International Internet Preservation Consortium has been reaching out at the last few General Assembly sessions to engage researchers during their “open days,” which have been incredibly interesting as we learn more about research use of our archives.

Twenty-five years in is as good a time as any to reflect on things, whether it’s the founding of the Web or the efforts to preserve the future web. Please feel free to share your stories and thoughts in the comments.

Categories: Planet DigiPres

JHOVE on Github

File Formats Blog - 12 March 2014 - 11:15am

The JHOVE repository on Github is now live. The SourceForge site is still there and holds the documentation. The Github site is a work in progress.

Categories: Planet DigiPres

Preserving News Apps

The Signal: Digital Preservation - 11 March 2014 - 2:01pm

Natl. Press Bld. Newstand. Photo by Harris & Ewing, ca. 1940.

On Sunday, March 2, I had the opportunity to attend an OpenNews Hack Day event at the Newseum in Washington DC, sponsored by Knight-Mozilla OpenNews, PopUp Archive, and the Newseum. The event was held in conjunction with the NICAR (National Institute for Computer-Assisted Reporting) conference on working with datasets and developing interactive applications in journalism.

This was not a hackathon, but what they termed a “designathon,” where the goal was to brainstorm about end-to-end approaches for archiving and preserving data journalism projects.  The problem of disappearing applications is very well outlined in blog posts by Jacob Harris and Matt Waite, which are part of “The Source Guide to the Care and Feeding of News Apps.”  From the introduction to the Guide:

“Any news app that relies on live or updated data needs to be built to handle change gracefully. Even relatively simple interactive features often require special care when it comes time to archive them in a useful way. From launch to retirement and from timeliness to traffic management, we offer a collection of articles that will help you keep your projects happy and healthy until it’s time to say goodbye.”

For some, awareness of the need for digital preservation in this community came from a desire to participate in a wonderful Tumblr called “News Nerd First Projects.” Developers wanted to share their earliest works through this collaborative effort — whether to brag or admit to some embarrassment — and many discovered that their work was missing from the web or still online but irreparably broken. Many were lucky if they had screenshots to document their work. Some found static remnants through the Internet Archive but nothing more.

The event brought together journalists, researchers, software developers and archivists. The group of about 50 attendees broke out into sub-groups, discussing topics including best practices for coding, documenting and packaging up apps, saving and documenting the interactive experience and documenting cultural context. Not too surprisingly, a lot of the conversation centered around best practices around coding, metadata, documentation, packaging and dealing with external dependencies.

There was a discussion about web harvesting, which captures static snapshots of rendered data and the design but not the interaction or the underlying data. Packaging up the underlying databases and tables captures the vital data so that it can be used for research, but loses the design and the interaction. Packaging up the app and the tables together with a documented environment means that it might run again, perhaps in an emulated environment, but if the app requires interactions with open or commercial external web service dependencies, such as for geolocation or map rendering, that functionality is likely lost. Finding the balance of preserving the data and preserving the interactivity is a difficult challenge.

All in all, it’s early days for the conversation in this community, but the awareness-building around the need for digital preservation is already achieved and next steps are planned. I am looking forward to seeing this community move forward in its efforts to keep digital news sources alive.

Categories: Planet DigiPres

JHOVE, continued

File Formats Blog - 10 March 2014 - 11:30pm

There’s been enough encouragement in email and Twitter to my proposal to move JHOVE to Github that I’ll be going ahead with it. Andy Jackson has told me he has some almost-finished work to migrate the CVS history along with the project, so I’m waiting on that for the present. Watch this space for more news.

Tagged: JHOVE, software
Categories: Planet DigiPres

A New Viewshare in the Works: Public Beta for Existing Users

The Signal: Digital Preservation - 10 March 2014 - 5:12pm

The Viewshare team has been hard at work on an extensive revision of the Viewshare platform. Almost every part of the workflow and interface is being tweaked and revised; so much so that we didn’t just want to foist the whole thing on all the existing users at once. So, we have set up a sandbox for current Viewshare users to give it a try.

You can start trying out some of the new features at There you can kick the tires and help us identify any of the bugs that are likely to emerge from this extensive a rework of the platform. Please post any questions, comments and issues in the feedback and troubleshooting forums.

Here is a quick set of notes on some of the major enhancements:

Get to building interfaces faster: Through observing folks use the tool and talking with a lot of users it was clear that there was way too much process. People want to pull in their data and see something. To that end, we have moved a few things around to get users to seeing something sooner rather than later. You now upload your data and start building your interface straight away. The biggest impact of this is that there is no longer a distinction between “data” and “views” of data. You just build views.

An example of the interface for configuring a map in the new beta version of Viewshare.

Start fiddling with the Dials: Throughout revisions to the interface we saw a lot of users respond well to situations where they could make configuration changes and directly see what those changes would mean in their interface. As a result, wherever possible we have tried to model a system where you get a live preview of exactly what any interface decision or change you make is going to look like. So you can tweak any of the presentation features and in real time see exactly what they will look like.

Embedded Audio and Video Players (HTML5): For a long time, users have been able to mark image URLs as such and then have them show up as images in their interfaces. We have extended this functionality to work the same way for audio and video links. If you have links to audio and video files in your collection data, users can now just click and start listening and viewing in modern browsers.

An example of the new HTML5 media player for wrapping links to audio and video files in a player.

Responsive Design: Viewshare interfaces have long been restrained to a maximum width. In the new version, a view can fill up the whole size of whatever screen you have available. By switching to a different framework for layouts you can now create views that fill the whole screen. On the other end of the size spectrum, this also makes views look a lot better on mobile devices.

Bar Charts and Better Pie Charts: The pie charts were the weakest part of the whole platform. There was a lot of potential there but it just didn’t really fit. To that end, we’ve now added in a whole different set of pie charts and bar charts. These charts are particularly useful as they are actually interfaces to collection items. So, you can click on a bar or a slice of a pie and see each of the items that are part of that slice and click through to see each of their item records.


An example of a new dynamic bar chart in a Viewshare view.

Share a Particular State in a View: Lots of Viewshare users have told us that when they get down to a particular set of records (say those from a particular date range, by a particular author or from a particular region) they would love to be able to share a link to exactly that subset. Now you can do that. There is a little bookmark icon on each view, and if you click it you get a URL (admittedly not a particularly pretty URL) that you can use to link directly to that subset of the view.

So if you want to write a blog post comparing one subset of the data with another you can link directly to each of those subsets. Similarly, you can just email a link to part of a view to a colleague if there is a subset that you think they would be interested in.

So, those of you out there with Viewshare accounts, please help us out and give it a try. We have made a copy of all the existing data, so you can go in and just check in on what your views will look like. Do what you like with the data in this beta instance without any fear. This will not affect any of your actual active viewshare data sets or views. Eventually, when the whole system moves to the new version your data on will be migrated to the new version.

Categories: Planet DigiPres

The state of JHOVE

File Formats Blog - 8 March 2014 - 6:23pm

As you may have noticed, I’ve been neglectful of JHOVE since last September, when 1.11 came out. Issues are continuing to arise, and people are still using it, and I’m not getting anything done about them.

The problem is that my current job has rather long hours, and when I come home from it, looking at more Java code isn’t at the top of my list of things to do. I’m very glad people are still using JHOVE, close to a decade after I started work on it as a contractor to the Harvard Library, but I’m not getting anything actually done.

It would help if there were more contributions from others, and its being on the moribund SourceForge isn’t helping. I think I could muster the energy to move it to Github, where more contributors might be interested. There’s already a Mavenized version by Andy Jackson there, which doesn’t include the Java source code but provides some important scaffolding and pom.xml files. It probably makes sense to start by forking this. This migration should also make the horrible JHOVE build procedure easier.

If this is something you’d like to see, let me know. I’d like some reassurance that this will actually help before I start.

Tagged: JHOVE, software
Categories: Planet DigiPres

Farewell to NDIIPP

The Signal: Digital Preservation - 7 March 2014 - 4:44pm

It’s finally come–my last day at the Library of Congress. I’ve got plenty of mixed emotions. On the one hand I’ll miss working with my Library colleagues and with the NDIIPP partners–we spent 12 years working together on projects that made a difference. On the other hand, I could not have asked for a better send-off: I’m touched by all the messages of congratulations and support, delivered both in person and, fittingly, in digital form.

Farewell, by Harco Rutgers, on Flickr

I’ve been lucky during my career to work on innovative efforts with federal government agencies to identify, preserve and make available information with enduring value. Now, it’s true that national institutions can struggle with new demands on them for leadership, and that pain accompanies change. But I feel deeply grateful for the unique opportunities I’ve had to help chart a new course in the direction of building collections and providing enhanced access to them.

I’d like to acknowledge a few special people with whom I’ve worked at the Library. My own team deserves thanks for building what I think is the best digital preservation communications program anywhere. Mike Ashenfelder is a fine writer with a special gift for presenting the human side of things; his series of articles on digital preservation pioneers is wonderful (he is the only person on Earth who can weave a great story about archives, radiation and cat physical therapy). Erin Engle has shown initiative and determination to establish herself as an authority on personal digital archiving and the National Digital Stewardship Alliance as a whole. Butch Lazorchak has long been the linchpin for an amazing variety of NDIIPP initiatives, and is demonstrating his usual level  of commitment and enthusiasm in taking over from me as editor and chief cajoler for The Signal. Sue Manus keeps us all focused on the big picture with her awareness of how NDIIPP is reported across all communications platforms; she also provides indispensable support for our various student internship programs (more on that in a minute).

I’d also like to thank Carl Fleischhauer for his sage advice and help over the years; he is without question one of the Library’s most valuable resources. Leslie Johnston has helped in many ways, including as a dedicated and prolific blogger, a source of technical expertise and as a wellspring of optimism and good humor. In terms of those who have retired before me, I’d like to pay special tribute to Caroline Arms, who remains an unparalleled and indefatigable source of wisdom about so many things (she is the only one I know who can effortlessly use both “whitespace” and “namespace” in the same sentence).

I mentioned our student interns, and working with them has been a special joy for me over the years. They include Sally Whiting Kerrigan, Candace LaPlante, Madeline Shelton, Emily Reynolds, Chelsie Rowell, Gloria Gonzalez, Victoria Priester, Kristin Snawder, Cristina Bilmanis and Tess Webre. All of them are outstanding people, brimming with enthusiasm and talent, and all have bright futures ahead of them.

Goodbye and good luck!




Categories: Planet DigiPres

Some reflections on scalable ARC to WARC migration

Open Planets Foundation Blogs - 7 March 2014 - 1:56pm

The SCAPE project is developing solutions to enable the processing of very large data sets with a focus on long-term preservation. One of the application areas is web archiving where long-term preservation is of direct relevance for different task areas, like harvesting, storage, and access.

Web archives usually consist of large data collections of multi-terabyte size, the largest archive being the Internet Archive which according to its own statements stores about 364 billion pages that occupy around 10 petabytes of storage. And the International Internet Preservation Consortium (IIPC) with its over 40 members worldwide shows how diverse the institutions are, each with a different focus regarding the selection of content or type of material they are interested in.

It is up to this international community and to the individual member institutions to ensure that archived web content can be accessed and displayed correctly in the future. This is a real challenge, and the reason lies in the nature of the content, which is like the internet itself: diverse in the use of formats for publishing text and multi-media content, using a rich variety of standards and programming languages, enabling interactive user experience, data-base driven web sites, strongly interlinked functionality, involving social media content from numerous sources, etc. This is to say that apart from the sheer size, it is the heterogeneity and complexity of the data that poses the significant challenge for collection curators, developers, and long-term preservation specialists.

One of the topics which the International Internet Preservation Consortium (IIPC) is dealing with is the question of how web archive content should actually be stored for the long term. Originally, content used to be stored in the ARC format as proposed by the Internet Archive. The format was designed to hold multiple web resources aggregated in a single (optionally compressed) container file. But this format was not designed as a format for storing content for the long term; it lacked features that support adding contextual information in a standardised way. For this reason, the new WARC format was created as an ISO standard to provide additional features, especially the ability to hold harvested content as well as any meta-data related to it in a self-contained manner.

An important pragmatic aspect of web archiving is the fact that while content is continuously changing on one side, it remains static on the other. In order to preserve the changes, web pages are harvested with a certain frequency of crawl jobs. Storing the same content at each visit would store content redundantly and not make efficient use of storage.

For this reason, the Netarchive Suite, originally developed by The Royal Library and The State and University Library, and used in the meantime by other libraries as well, provides a mechanism called “deduplication” which detects that content was already retrieved and therefore references the existing payload content. The information where the referenced content is actually stored is available in the crawl log files, which means that if the crawl log file is missing, there is actually no knowledge of any referenced content. In order to display a single web page with various images, for example, the wayback machine needs to know where to find content that may be scattered over various ARC container files. An index file, e.g. an index in the CDX file format, contains the required information, and to build this index it is currently necessary to involve ARC files and crawl log files in the index building process.
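The principle behind such a deduplication mechanism can be sketched in a few lines of Python. This is a simplified illustration and not the Netarchive Suite implementation: the record layout, the use of SHA-1 and the container names and offsets are all assumptions made for the example.

```python
import hashlib
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (container file name, record offset), both hypothetical


def deduplicate(records: List[Tuple[str, bytes, Location]]):
    """Split harvested records into stored payloads and references to earlier copies."""
    seen: Dict[Tuple[str, str], Location] = {}
    stored: List[Tuple[str, Location]] = []
    references: List[Tuple[str, Location]] = []
    for url, payload, location in records:
        key = (url, hashlib.sha1(payload).hexdigest())
        if key in seen:
            # Unchanged content: only remember where the first copy lives.
            references.append((url, seen[key]))
        else:
            seen[key] = location
            stored.append((url, location))
    return stored, references


records = [
    ("http://example.org/logo.png", b"PNG-bytes", ("crawl1.arc", 1024)),
    ("http://example.org/logo.png", b"PNG-bytes", ("crawl2.arc", 2048)),
    ("http://example.org/index.html", b"<html>v2</html>", ("crawl2.arc", 4096)),
]
stored, references = deduplicate(records)
print(stored)
print(references)
```

The references list plays the role that the crawl log plays in the text above: if it is lost, nothing else records that the second harvest of the logo points back to the first container file.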

From a long-term-preservation perspective, this is a problematic dependency. The ARC container files are not self-describing; they depend on operative data (log files generated by the crawl software) in a non-standardised manner. Web archive operators and developers know where to get the information, and the dependency might be well documented. But it involves the risk of losing information that is essential for displaying and accessing the content.

This is one of the reasons why the option to migrate from ARC to the new WARC format is being considered by many institutions. But, as often happens, what looks like a simple format transformation at first glance rapidly turns into a project with complex requirements that are not easy to fulfil.

In the SCAPE project, there are several aspects that in our opinion deserve closer attention:

  1. The migration from ARC to WARC is typically dealing with large data sets, therefore a solution must provide an efficient, reliable and scalable transformation process. There must be the ability to scale out, which means that it should be possible to increase processing power by using a computing cluster of appropriate size to enable organisations to complete the migration in a given time frame.

  2. Reading and writing the large data sets comes with a cost. Sometimes, data must even be shifted to a (remote) cluster first. It should therefore be possible to easily hook in other processes that are used to extract additional meta-data from the content.

  3. The migration from one format to another conveys the risk of information loss. Measures of quality assurance like calculating the payload hash and comparing content between corresponding ARC and WARC instances or doing rendering tests in the Wayback machine of subsets of migrated content are possible approaches in this regard.

  4. Resolving dependencies of the ARC container files to any external information entities is a necessary requirement. A solution should therefore not only look into a one-to-one mapping between ARC and WARC, but it should involve contextual information in the migration process.
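The payload hash comparison mentioned in the third point could look roughly like the following sketch. It assumes that the payloads of corresponding ARC and WARC records have already been extracted into dictionaries keyed by some record identifier; the function names and the choice of SHA-1 are illustrative assumptions, not part of the tools described here.

```python
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return the hexadecimal SHA-1 digest of a record payload."""
    return hashlib.sha1(payload).hexdigest()


def find_mismatches(arc_payloads, warc_payloads):
    """Return the sorted keys of ARC records whose migrated payload
    is missing from the WARC side or has a different digest.

    Both arguments are dicts mapping a record key to payload bytes
    (a hypothetical, simplified representation).
    """
    mismatches = []
    for key, payload in arc_payloads.items():
        migrated = warc_payloads.get(key)
        if migrated is None or payload_digest(migrated) != payload_digest(payload):
            mismatches.append(key)
    return sorted(mismatches)
```

An empty result means that every ARC payload has a byte-identical counterpart on the WARC side; it says nothing about headers or contextual information.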

The first concrete step regarding this activity was to find out the right approach to the first of the above mentioned aspects.

In the SCAPE project, the Hadoop framework is an essential element of the so-called SCAPE platform. Hadoop is the core that is responsible for efficiently distributing processing tasks to the available workers in a computing cluster.

Building on software development outcomes from the SCAPE project, there were different options for implementing a solution. The first option was to use a module of the SCAPE platform called ToMaR, a Map/Reduce Java application that makes it easy to distribute command line application processing on a computing cluster (in the following: ARC2WARC-TOMAR). The second option was to use a Map/Reduce application with a customised reader for the ARC format and a customised writer for the WARC format, so that the Hadoop framework is able to handle these web archive file formats directly (in the following: ARC2WARC-HDP).

An experiment was set up to test the performance of the two approaches. The main question was whether the native Map/Reduce job implementation had a significant performance advantage compared to using ToMaR with an underlying command line tool execution.

The reason why this advantage should be “significant” is that the ARC2WARC-HDP option has an important limitation: in order to achieve the transformation based on a native Map/Reduce implementation, it is necessary to use a Hadoop representation of a web archive record. This is the intermediate representation between reading the records from the ARC files and writing the records to WARC files. As it uses a byte array field to store the payload content of a web archive record, there is a theoretical limit of around 2 GB, because the length of a Java byte array is an int and therefore cannot exceed a value near Integer.MAX_VALUE. In reality, the limit on payload content size might be much lower, depending on the hardware setup and configuration of the cluster.

This limitation would require an alternative solution for records with large payload content. Such a separation between "small" and "large" records would possibly increase the complexity of the application, especially when contextual information across different container files has to be involved in the migration process.
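To make the size limit concrete, the following sketch works through the arithmetic and shows what a separation into "small" and "large" records might look like. It is an illustration only: the function name, the `(record_id, payload_size)` tuples and the idea of routing by a single threshold are assumptions, not the behaviour of ARC2WARC-HDP.

```python
# Integer.MAX_VALUE in Java: the upper bound for the length of a byte array.
JAVA_INT_MAX = 2**31 - 1


def split_by_payload_size(records, limit=JAVA_INT_MAX):
    """Partition (record_id, payload_size_in_bytes) pairs into records that
    fit into a single byte array and records that need another code path.

    In practice `limit` would be set far below JAVA_INT_MAX, depending on
    the memory available to each task (hypothetical parameter).
    """
    small, large = [], []
    for record_id, size in records:
        if size <= limit:
            small.append(record_id)
        else:
            large.append(record_id)
    return small, large
```

With the largest ARC file in the experiments at around 300MB, every record falls into the "small" group, which is why the limitation did not affect the measurements reported here.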

The implementations used to do the migration are proof-of-concept tools, which means that they are not intended to be used to run a production migration at this stage. They have the following limitations:

  1. Related to ARC2WARC-HDP, as already mentioned, there is a size limit regarding the in-memory representation of a web archive record. The largest ARC file in the data sets used in these experiments is around 300MB, so record payload content can easily be stored in byte array fields.

  2. Exceptions are caught and logged, but there is no gathering of processing errors or any other analytic results. As the focus here lies on performance evaluation, details regarding record processing are not taken into consideration.

  3. The current implementations neither do quality assurance nor involve contextual information, both of which have been mentioned above as important aspects of the ARC to WARC migration.

The implementations are based on the Java Web Archive Toolkit (JWAT), which is used for reading web archive ARC container files and iterating over the records.

As an example of a process that runs while the data is being read, the implementations include Apache Tika as an optional feature to identify the payload content. All Hadoop job executions are therefore tested with and without payload content identification enabled.

As already mentioned, the ARC2WARC-HDP application was implemented as a Map/Reduce application which is started from the command line as follows:

hadoop jar arc2warc-migration-hdp-1.0-jar-with-dependencies.jar \
-i hdfs:///user/input/directory -o hdfs:///user/output/directory

The ARC2WARC-TOMAR workflow uses a command line Java implementation that is executed using ToMaR. One bash script was used to prepare the input needed by ToMaR and another bash script to execute the ToMaR Hadoop job; a combined representation of the workflow is available as a Taverna workflow.

A so-called “tool specification” is needed to start an action in a ToMaR Hadoop job. It specifies the inputs and outputs and the java command to be executed:

<?xml version="1.0" encoding="utf-8" ?>
<tool xmlns:xsi="" xsi:schemaLocation=" tool-1.0_draft.xsd" xmlns="" xmlns:xlink="" schemaVersion="1.0" name="bash">
  <operations>
    <operation name="migrate">
      <description>ARC to WARC migration using arc2warc-migration-cli</description>
      <command>
        java -jar /usr/local/java/arc2warc-migration-cli-1.0-jar-with-dependencies.jar -i ${input} -o ${output}
      </command>
      <inputs>
        <input name="input" required="true">
          <description>Reference to input file</description>
        </input>
      </inputs>
      <outputs>
        <output name="output" required="true">
          <description>Reference to output file</description>
        </output>
      </outputs>
    </operation>
  </operations>
</tool>

All commands allow use of a “-p” flag to enable Apache Tika identification of payload content.

The cluster used in the experiment has one controller machine (Master) and 5 worker machines (Slaves). The master node has two quadcore CPUs (8 physical/16 HyperThreading cores) with a clock rate of 2.40GHz and 24 Gigabyte RAM. Each slave node has one quadcore CPU (4 physical/8 HyperThreading cores) with a clock rate of 2.53GHz and 16 Gigabyte RAM. Regarding the Hadoop configuration, five processor cores of each worker machine have been assigned to Map tasks, two cores to Reduce tasks, and one core is reserved for the operating system. This gives a total of 25 processing cores for Map tasks and 10 cores for Reduce tasks.
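The core allocation can be checked with a few lines of arithmetic. The variable names below are chosen for illustration only; the numbers are taken from the cluster description above.

```python
WORKERS = 5                  # worker machines in the cluster
HT_CORES_PER_WORKER = 8      # HyperThreading cores per worker
MAP_CORES_PER_WORKER = 5     # cores assigned to Map tasks
REDUCE_CORES_PER_WORKER = 2  # cores assigned to Reduce tasks
OS_CORES_PER_WORKER = 1      # core reserved for the operating system

# The per-worker allocation uses exactly the available HyperThreading cores.
allocated_per_worker = (
    MAP_CORES_PER_WORKER + REDUCE_CORES_PER_WORKER + OS_CORES_PER_WORKER
)

total_map_cores = WORKERS * MAP_CORES_PER_WORKER
total_reduce_cores = WORKERS * REDUCE_CORES_PER_WORKER
```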

The experiment was executed using two data sets of different size, one with 1000 ARC files and a total size of 91,58 Gigabyte, and one with 4924 ARC files and a total size of 445,47 Gigabyte.

A summary of the results is shown in the table below.




                                       Obj./hour   Throughput   Avg. time/item
                                       (num)       (GB/min)     (s)
  Baseline                             834         1,2727       4,32
  Map/Reduce, 1000 ARC files           4592        7,0089       0,78
  Map/Reduce, 4924 ARC files           4645        7,0042       0,77
  ToMaR, 1000 ARC files                4250        6,4875       0,85
  ToMaR, 4924 ARC files                4320        6,5143       0,83

  Baseline                             545         0,8321       6,60
  Map/Reduce w. Tika, 1000 ARC files   2761        4,2139       1,30
  Map/Reduce w. Tika, 4924 ARC files   2813        4,2419       1,28
  ToMaR w. Tika, 1000 ARC files        3318        5,0645       1,09
  ToMaR w. Tika, 4924 ARC files        2813        4,2419       1,28


The Baseline value was determined by executing a standalone Java application that shifts content and metadata from one container to the other using JWAT. It was executed on one worker node of the cluster and serves as a point of reference for the distributed processing.

Several observations can be made from these data. First, compared to the local Java application, cluster processing shows a significant increase in performance for all Hadoop jobs, which should not be a surprise, as this is the purpose of distributed processing. Second, the throughput does not change significantly between the two data sets of different size, which allows the assumption that execution time grows linearly with the number of objects. Third, there is only a slight difference between the two approaches ARC2WARC-HDP and ARC2WARC-TOMAR, which, given the above mentioned caveats of the Map/Reduce implementation, highlights ToMaR as an interesting candidate for the tool of choice. Finally, the figures show that using Apache Tika increases the processing time by more than 50%.
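The derived figures in the table can be reproduced from the objects-per-hour column alone. The helper functions below are a small sketch with names chosen for illustration; they are not part of the experiment's tooling.

```python
def seconds_per_item(objects_per_hour):
    """Average processing time per object in seconds."""
    return 3600 / objects_per_hour


def speedup(objects_per_hour, baseline_objects_per_hour):
    """How many times faster a job is than the single-node baseline."""
    return objects_per_hour / baseline_objects_per_hour


def overhead_percent(objects_per_hour_plain, objects_per_hour_with_extra):
    """Relative increase in time per item caused by an extra processing
    step, such as enabling Apache Tika identification."""
    plain = seconds_per_item(objects_per_hour_plain)
    extra = seconds_per_item(objects_per_hour_with_extra)
    return (extra / plain - 1) * 100
```

For example, `seconds_per_item(834)` rounds to the 4,32 s reported for the baseline, `speedup(4592, 834)` gives roughly 5.5 for the Map/Reduce job on the 1000 ARC files, and `overhead_percent(834, 545)` gives roughly 53% for the baseline with Tika enabled.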

As an outlook on further work, and following the arguments outlined here, resolving contextual dependencies in order to create self-contained WARC files is the next point to look into.

As a final remark, the proof-of-concept implementations presented here are far from a workflow that can be used in production. There is an ongoing discussion in the web archiving community about whether it makes sense for memory institutions to tackle such a migration project at all. Ensuring backwards compatibility of the Wayback machine and safely preserving contextual information is a viable alternative.

Many thanks to colleagues sitting near to me and in the SCAPE project who gave me useful hints and support.

Taxonomy upgrade extras: SCAPE
Preservation Topics: Preservation Actions, Identification, Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

Interview with a SCAPEr - Pavel Smrz

Open Planets Foundation Blogs - 7 March 2014 - 1:07pm
Who are you?

My name is Pavel Smrz. I work as an associate professor at the Faculty of Information Technology, Brno University of Technology (BUT) in the Czech Republic. Our team joined the SCAPE project in September 2013.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

I lead a work package dealing with the Data Centre Testbed. Together with other new project partners, we aim at extending the current SCAPE development towards preserving large-scale computing experiments that take place in modern data centres. Our team particularly focuses on preservation scenarios and workflows related to large-scale video processing and interlinking.

Why is your organisation involved in SCAPE?

BUT has a long tradition and a proven research track record in the fields of large-scale parallel and distributed computing, knowledge technologies and big data analysis. We have participated in many European projects, other international and national research and development activities and industrial projects relevant to this domain. That is why we were invited to join the proposal to extend the SCAPE project as a part of the special EC Horizontal Action – Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union. The proposal was accepted and the SCAPE project was successfully extended in 2013.

What are the biggest challenges in SCAPE as you see it?

SCAPE is a complex project, so there are many technological challenges. Being new to the project, I was pleasantly surprised by the high level of technical expertise of professionals from libraries and other institutions dealing with preservation. To mention just one example from our domain, concepts of advanced distributed computing are well understood and commonly employed by the experts. I believe this technical excellence will help us to meet all the challenges in the remaining project time.

In addition to the technical area, I see a key challenge of the project in the integration of partners and individuals with very different backgrounds and perspectives. SCAPE is a truly inter-disciplinary project, so people from various fields need to make a special effort to find common ground. I am glad that this works in the project and I really enjoy being part of the community.

What do you think will be the most valuable outcome of SCAPE?

SCAPE will deliver a new platform and a set of tools for various preservation contexts. I would stress the diversity of tools as a particular outcome. My experience shows that “one-size-fits-all” solutions are often too scary to be used. Although funding agencies believe the opposite, research and development projects seldom deliver solutions that can be used as a whole. It is often the case that what seemed to be a minor contribution becomes the next big thing for business. I believe that at least some components developed within the project have this great potential.

Having an interest in large-scale parallel and distributed computing, I cannot forget scalability as a key attribute of the SCAPE development. Today’s public and private cloud and cluster infrastructures make it possible to realise large-scale preservation scenarios. What would have been a year-long preservation project a few years ago can be solved in a day on these platforms. However, many tools are not ready to benefit from existing computing infrastructures; scalability does not come ‘automagically’.

In my opinion, the most valuable outcome of the SCAPE project consists in providing a diverse set of preservation tools and showing that they scale up in real situations.

Contact information

Pavel SMRZ
Brno University of Technology
Faculty of Information Technology
Bozetechova 2, 61266 Brno
Czech Republic

Preservation Topics: SCAPE
Categories: Planet DigiPres