Planet DigiPres

Personal Digital Archiving: The Basics of Scanning

The Signal: Digital Preservation - 27 March 2014 - 6:30pm
A digital image consists of tens of thousands of tiny dots or squares called pixels.


Although the National Digital Information Infrastructure and Preservation Program and the National Digital Stewardship Alliance focus on digital preservation and access, many of the personal digital archiving questions that the general public ask us are about scanning. Though scanning is a separate issue from digital preservation, scanning does generate digital files that need to be preserved. In the interest of helping people create the best possible digitization of their photos and documents for preservation, we have produced a “how to” video that we will be releasing soon. In the meantime here is a brief, basic introduction to scanning that we hope will demystify the process.

When you scan a paper photograph, the scanning device creates a digital version of the photo made up of tens of thousands of tiny dots or squares called pixels. This paper-to-digital conversion process is digitizing, though the act of digitizing is not limited to images or text; you can also digitize video and audio. In this post we will just look at scanning and digitizing photos.

Step 1: Prepare Scanner and Photos
The first step in the process is to clean the scanner and photos. Smudges, dust and hair will scan into your digital photo and ruin its purity. Wipe the scanner glass with a plain, lint-free cloth dampened with water. Do not spray the water directly onto the scanner; spray it on the cloth. Wipe the inside of the scanner lid too.

Next, lightly wipe off the photograph with a dry anti-static cloth. You can find these cloths in a camera store. They not only clean the photo, they help reduce static electricity on the photo and prevent it from attracting more dust particles and hair. Place the cleaned photo face down on the scanner. Do not touch the glass when you place the photo. The natural oils on your fingers may smudge the glass and you’ll have to clean the glass again. Try to slide the photo against the side of the scanner glass and up into the corner for the best alignment.

Some software will detect separate photos or items and scan them as individual files. Leave about one-half inch between the photos to help the software recognize them as separate items. Close the lid gently so the photos remain aligned with the scanner glass edges.

Step 2: Set Scan Properties – DPI and Bit Depth
Once you have prepared your scanner and photos, set the properties for the photo scan. On your computer, open the scanner software. There are two important settings to look for:

  • details of the digital image, such as the number of dots per inch and whether the image is color or grayscale.
  • the file format to save the image as – such as TIFF or JPEG – and the type of image compression (if any) you want on that file type.

Dots per inch – or DPI – is a measurement of pixel density. Image specialists use the more precise term “pixels per inch” or PPI. However, since documentation for commercial scanners almost exclusively uses the term DPI, we will stick with the term “DPI.”

The more pixels packed into a one-square-inch space, the greater the potential detail an image can hold. An image with 200 dots per inch potentially displays more detail than the same image with 72 dots per inch. There are optimum DPI settings for different photo sizes and types, but more DPI is not always better; there is a DPI limit or threshold. Beyond that limit, increased DPI adds nothing of value. You can only scan so much detail from a photo.

  • For most personal work, 300 to 400 DPI is satisfactory for snapshot prints and for common enlargements at 4″x6″, 5″x7″ or 8″x10″ in size.
  • Since very small prints or photographic slides contain a lot of detail in a small area, capture more dots per inch, around 1400 to 1500 DPI.
  • Photo negatives also hold a lot of detail, so for negatives, select a minimum of 1500 to 2000 DPI. Remember that increasing dots per inch increases data and increases the file size.
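To make the link between print size, DPI and image size concrete, here is a minimal sketch of the arithmetic. The function name is made up for illustration, and it assumes the scanner captures exactly the stated number of dots along each inch of the print.

```python
def scan_pixel_dimensions(width_in, height_in, dpi):
    """Return (width_px, height_px) for a print scanned at the given DPI.

    Hypothetical helper: assumes the scanner samples exactly `dpi` dots
    along every inch of the print in both directions.
    """
    return round(width_in * dpi), round(height_in * dpi)


# A 4x6 inch snapshot at the two ends of the suggested 300 to 400 DPI range.
for dpi in (300, 400):
    w, h = scan_pixel_dimensions(4, 6, dpi)
    print(f"{dpi} DPI: {w} x {h} pixels ({w * h / 1e6:.2f} megapixels)")
```

Going from 300 to 400 DPI raises the pixel count from 2.16 to 3.84 million, because the DPI increase applies in both directions.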

Some software may enable you to adjust the bit depth, the number of bits of data per pixel. The more bits per pixel, the more information the pixel contains and the richer the digital palette you have to work with. The most commonly used scan settings are 8 bits per pixel for grayscale (some scanners may also offer 16) and 24 bits per pixel for color (some scanners may also offer 48). More bits per pixel give you more to work with if you intend to edit your digital photos later. But for routine scanning, where you do not plan to edit much or where the quality of the outcome is not critical, select 8-bit grayscale or 24-bit color. Remember that increasing bits per pixel increases the amount of data in the file and so increases the size of the file.
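As a rough illustration of how bit depth drives file size, the sketch below estimates the uncompressed size of a scan. The function name is hypothetical, and the estimate ignores file headers, metadata and any compression, so real files will differ somewhat.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate raw image data size; ignores headers, metadata and compression."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# A 4x6 inch print at 300 DPI at the bit depths mentioned above.
for label, bits in (("8-bit grayscale", 8), ("24-bit color", 24), ("48-bit color", 48)):
    size = uncompressed_size_bytes(4, 6, 300, bits)
    print(f"{label}: {size / 1_000_000:.2f} MB")
```

Doubling the bit depth doubles the raw data, which is why 48-bit color is usually worth it only for scans you expect to edit heavily.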

If the paper photo you want to scan is black and white, and you see a menu choice of grayscale or color, select “grayscale.” If the paper photo is color, select “color.”

Step 3: File Format and Compression
Scanner software saves your scanned photo as a digital file and the most common file-type options are TIFF and JPEG. TIFF, the preferred format for digital photo preservation, retains the maximum amount of digital data that your scanner captures. If you have a choice, save your original master scan as a TIFF.

If file storage space is an issue, you can compress a file and reduce its file size. Scanning software may offer an option of LZW compression for a TIFF file, which will cut the size of the TIFF file without the loss of digital data. This is called “lossless” compression. By contrast, saving an image as a JPEG employs “lossy” compression, so named because JPEG compression discards some of the digital data that the scanner captured. You can select JPEG quality levels and degrees of compression, from “least compression” (the least lost data and the highest JPEG quality) to “most compression” (the most lost data and the lowest JPEG quality).
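The defining property of lossless compression is that decompressing gives back exactly the bytes you started with. The sketch below demonstrates that round trip. Python's standard library has no LZW codec, so it substitutes zlib (DEFLATE), a different lossless algorithm, and the "pixels" are a synthetic gradient rather than a real scan.

```python
import zlib

# Stand-in for scanned pixel data: a 100 x 100 grayscale gradient, 8 bits per pixel.
pixels = bytes((x + y) % 256 for y in range(100) for x in range(100))

# zlib (DEFLATE) is used here only as an example of a lossless codec; it is not LZW.
compressed = zlib.compress(pixels, 9)
restored = zlib.decompress(compressed)

print(len(pixels), "bytes before,", len(compressed), "bytes after")
print("identical after round trip:", restored == pixels)
```

A lossy format such as JPEG cannot pass this check: the decoded pixels are close to the originals but not byte-for-byte identical.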

We recommend that if you intend to modify or work with a digital photo, you save two versions of it: a master version and a working copy. Keep a TIFF file as the master file and store it safely with your other personal digital archives; use a JPEG version as the working copy. The JPEG file will be smaller and more convenient to email or post on social media sites. Edit, modify and work with the JPEG. You can always make a fresh JPEG copy of the master TIFF file.

Once you have selected the file type and set your bit depth and DPI, you are ready to scan your photo. Preview the scan, if you have that option, and look it over to make sure you haven’t picked up any dust, hair or artifacts. And check that the photo is aligned properly. Then select “scan.”

Renaming a file does not affect the contents of the file.


After scanning the file, some scanning software will prompt you to assign a file name. Some software will automatically assign a file name to your file. If it assigns a file name (usually some alphanumeric name like “DC2148793.jpg”), you can either keep that file name or you can change it. To change the file name on a PC, right-click the file and select “rename.” On a Mac, control-click and select “rename.” Renaming the file will not affect the contents of the file. We recommend that you rename the file to help you find the file later. Many people include the date in the file name, at least the year or the combined year and month. If your file names lead off with year-month, followed by a descriptive word or two, then the files will sort in chronological order in your computer folder.
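Here is a small sketch of why leading with year-month makes files sort chronologically. The naming pattern (year, month, underscore, hyphenated description) and the function name are just one possible convention, chosen for illustration.

```python
from datetime import date


def archive_name(taken, description, extension):
    """Build a name like '2014-03_grandma-birthday.tif' (hypothetical convention)."""
    slug = "-".join(description.lower().split())
    return f"{taken.year:04d}-{taken.month:02d}_{slug}.{extension}"


names = [
    archive_name(date(2014, 3, 27), "Grandma birthday", "tif"),
    archive_name(date(1998, 12, 25), "Family dinner", "tif"),
    archive_name(date(2005, 7, 4), "Beach trip", "tif"),
]

# Plain alphabetical sorting now matches chronological order.
for name in sorted(names):
    print(name)
```

The zero-padded month matters: without it, "2014-10" would sort before "2014-3".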

Remove each photo from the scanner by slipping a piece of paper under it and lifting it. Avoid touching the glass with your fingers.

As soon as possible, back up your digital photos in a few separate places. Every five years or so, migrate your personal digital archives to a new storage medium in order to avoid having your collection stuck on some obsolete media.
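If you are comfortable with a little scripting, the backup step can be automated. The sketch below copies one file into several folders and confirms each copy by comparing SHA-256 checksums. The checksum comparison is an added safeguard not described above, and the function names are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source, destination_dirs):
    """Copy source into each directory; return the copies whose checksum matches."""
    source = Path(source)
    expected = sha256_of(source)
    verified = []
    for directory in destination_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        copy = directory / source.name
        shutil.copy2(source, copy)  # copy2 also preserves file timestamps
        if sha256_of(copy) == expected:
            verified.append(copy)
    return verified
```

You would call it with the master TIFF and a list of folders on different devices, for example an external drive and a second computer. The folder paths are yours to choose.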

Categories: Planet DigiPres

Where are the Born-Digital Archives Test Data Sets?

The Signal: Digital Preservation - 26 March 2014 - 3:23pm

By Butch Lazorchak and Trevor Owens

We’ve talked in the past on the Signal about the need for more applied research in digital preservation and stewardship. This is a key issue addressed by the 2014 National Agenda for Digital Stewardship, which dives in a little deeper to suggest that there’s a great need to strengthen the evidence base for digital preservation.

But what does this mean exactly?

Scientific fields have a long tradition of applied research and have amassed common bodies of evidence that can be used to systematically advance the evaluation of differing approaches, tools and services.

This approach is common in some areas of library and archives study, such as the Text Retrieval Conferences and their common data pools, but is less common in the digital preservation community.

As the Agenda mentions, there’s a need for some open test sets of digital archival material for folks to work on bench-marking and evaluating tools against, but the first step should be to establish the criteria for data collections.  What would make a good digital preservation test data set?

1. Needs to be real-world messy stuff: The whole point of establishing digital preservation test data sets is to have actual data to be able to run jobs against. An ideal set would be sanitized, processed or normalized to the least extent possible. Ideally, these data sets would come with some degree of clearly stated provenance and a set of checksums to allow researchers to validate that they are working on real stuff.

2. Needs to be public: The data needs to be publicly-accessible in order to encourage the widest use, and should be available via a URL without having to ask permission. This will allow anyone (even inspired amateurs) to take cracks at the data.

3. Needs to be legal to work with: There are many exciting honey pots of data out there that satisfy the first two requirements but live in legal grey areas. Many of the people working with these data sets will operate in government agencies and academia where clear legality is key.
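The checksums mentioned in the first criterion could be published as a simple manifest alongside the data. Here is a minimal sketch of how a researcher might build and check one. The manifest layout (relative path mapped to SHA-256 digest) and the function names are assumptions, not an established convention for any of the data sets discussed here.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest.

    Reads whole files into memory, which is fine for a sketch; very large
    files should be hashed in chunks instead.
    """
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(root, manifest):
    """Return paths that were added, removed or altered relative to the manifest."""
    current = build_manifest(root)
    return sorted(
        path
        for path in set(current) | set(manifest)
        if current.get(path) != manifest.get(path)
    )
```

An empty result from `changed_files` means the local copy matches the published manifest exactly.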

There are some data sets currently available that meet most of the above criteria, though most are not designed specifically as digital preservation testbeds. Still, these provide a beginning to building a more comprehensive list of available research data, on the way to tailor-made digital preservation testbeds.

Some Initial Data Set Suggestions:

The social life of email at Enron - a new study from user chieftech on Flickr.


Enron Email Dataset:  This dataset consists of a large set of email messages that was made public during the legal investigation concerning the Enron corporation. It was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes) under the auspices of the Defense Advanced Research Projects Agency. The collection contains a total of about ½ million messages and was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

NASA NEX: The NASA Earth Exchange is a platform for scientific collaboration and research for the earth science community. NEX users can explore and analyze large Earth science data sets, run and share modeling algorithms and collaborate on new or existing projects and exchange workflows. NEX has a number of datasets available, but three large sets have been made readily available to public users. One, the NEX downscaled climate simulations, provides high-resolution climate change projections for the 48 contiguous U.S. states. The second, the MODIS (or Moderate Resolution Imaging Spectroradiometer) data, offers a global view of Earth’s surface, while the third, the Landsat data record from the U.S. Geological Survey, provides the longest existing continuous space-based record of Earth’s land.

GeoCities Special Collection 2009: GeoCities was an important outlet for personal expression on the Web for almost 15 years, but was discontinued on October 26, 2009. This partial collection of GeoCities personal websites was rescued by the Archive Team and is about a terabyte of data. For more on the GeoCities collection see our interview with Dragan Espenschied from March 24.

There are other collections, such as the September 11 Digital Archive hosted by the Center for History and New Media at George Mason University, that have been used as testbeds in the past, most notably in the NDIIPP-supported Archive Ingest and Handling Test, but the data is not readily available for bulk download.

There are also entities that host public data sets that anyone can access for free, but further investigation is needed to see whether they meet all the criteria above.

We need testbeds like this to explore digital preservation solutions. Let us know about other available testbeds in the comments.

Categories: Planet DigiPres

What Could Curation Possibly Mean?

The Signal: Digital Preservation - 25 March 2014 - 1:47pm

My colleague Trevor Owens wrote a great blog post entitled “What Do you Mean by Archive?” This led to a follow-up discussion where I publicly announced on Facebook that I wanted to write about the term “curation.”  Its seemingly widespread use in popular culture in the past 4-5 years has fascinated me.

Curated, photo by Leslie Johnston


Every time I get a new device I need to teach the spell checker “curate” and “curation.”  In fact, I just had to add “curation” to this blog software dictionary.  So even my email and word processor dictionaries do not know what these words mean.  So why is it everywhere?

I have seen “curated” collections at retail shops. On web sites. In magazines. In the names of companies.  I have seen curated menus. I don’t mean collections of menus – I mean a menu describing the meal as “curated.”  Social media is curated into stories.  Music festivals are curated.  Brands are now curated. There is a great “Curating the Curators” tumblr that documents encounters with the many varied uses of the terms.  And I cannot fail to mention the fabulous meta “Curate Meme” tumblr, “Where Curators Curate Memes about Curation. Where will the absurdity of our use of the term Curation go next?”

Apparently we are in the age of “Curated Consumption.” Or of “Channelization.”

The most famous discourse on this topic is Pete Martin’s “You Are Not a Curator,” originally written for (now seemingly defunct).  For a great general discussion on this, read the Chicago Tribune piece “Everybody’s a Curator.”  The Details article “Why Calling Yourself a Curator is the New Power Move” is an interesting take on personal brand management. Perhaps most interesting is “The Seven Needs of Real-Time Curators,” a guide for people interested in becoming a real-time blogger/curator.  My colleague Butch has already weighed in on digital curation versus stewardship.  But what is it we really mean when we say curation?

Curation is Acquisition

My academic introduction to the concept of curation came in my initial museum studies courses when I was an undergraduate. I was taught that curation was an act of selection, one of understanding the scope of a collection and a collection development policy, seeing the gaps in the collection and selectively acquiring items to make the collection more comprehensive and/or focused in its coverage. In this a curator is considering and appraising, researching, contextualizing and selecting, constantly searching and refining.  The curator is always auditing the collection and reappraising the items, refining the collection development policy as to its scope.

Curation is Exhibition

I think my first understanding of the word curation came as a museum-going child, when I realized that an actual person was responsible for the exhibits. I came to understand that someone identified a context or message that one wanted to present, and brought together a set of objects that represented a place or time or context, or provided examples to illustrate a point. The exhibition wall texts and labels and catalog and web site are carefully crafted to contextualize the items so they do not appear to be a random selection.  These are acts of conceptualization, interpretation and transformation of objects into an illustration, making the objects and the message accessible to a wide audience. In some cases, curators are transforming objects, some of which may seem quite mundane, into objects of desire through exhibition, by showcasing their cultural value.

Curation is Preservation

In a preservation context, curation is the act of sustaining the collection. Curation is about the storage and care of collections, sometimes passive and sometimes active. The more passive activity is one where items have been stored (ingested) in the safest long-term manner that ensures they require as little interaction as possible for their sustainability, while having an inventory in place so that you always know what you have and where and what it is.

In the more active act of auditing and reappraising a collection, the curator will always be looking for  issues in the collections, such as items exhibiting signs of deterioration or items that no longer meet the scope of the collection. In this curation is the progression of actions from auditing to reappraisal to taking some sort of preservation action, or perhaps de-accessioning items that no longer meet the collection criteria or require care that cannot be provided.

For those of us in the federal sphere, there is this commentary on the definition of curator from a blog post by Steven Lubar, a former curator at the National Museum of American History:

“The official OPM “position classification standard” [pdf] for curators is not of much use. It was written in 1962, and states that “Moreover, unlike library science, the techniques of acquisitioning, cataloging, storing, and displaying objects, and the methods of museum management have not been standardized into formal disciplines and incorporated into formal college courses of training and education.” Shocking, really; someone should update this.”

And what about digital curation? My colleague Doug Reside has written the most succinct description of what it is we do. And the Digital Curation Centre has a great brief guide to the activities and life cycle of digital curation.

How do I feel about what some call the appropriation of these terms?  On one side I dislike the use, as it seems that everyone thinks that they are a curator, which might dilute the professional meaning of the terms. On the other side, the retail usage of curation is not that different from part of what we do – the selection and showcasing and explication of the value of items – with vastly different criteria of course.

And people may indeed curate aspects of their own lives, auditing and reviewing and sustaining and de-accessioning music or books or clothing. So as much as I may get prickly over some uses, they seem to fit in the spirit of the word. Perhaps their use in popular culture will lead to a more widespread understanding of the use of the terms as we use them, and of what we do.

Categories: Planet DigiPres

“Digital Culture is Mass Culture”: An interview with Digital Conservator Dragan Espenschied

The Signal: Digital Preservation - 24 March 2014 - 5:30pm

Dragan Espenschied, Digital Conservator @ Rhizome cc-by, Flo Köhler

At the intersection of digital preservation, art conservation and folklore you can find many of Dragan Espenschied’s projects. After receiving feedback and input from Dragan for a recent post on interfaces to digital collections and geocities I heard that he is now stepping into the role of digital conservator at Rhizome. To that end, I’m excited to talk with him as part of our ongoing NDSA innovation group’s Insights interview series about some of his projects and perspectives and his new role as a digital conservator.

Trevor: When I asked you to review the post on some of your work with the Geocities data set, you responded “I agree, archivists should act dumb and take as much as possible, leaving open as many (unforeseeable) opportunities as possible for access design.” Now that you are moving into a role as a digital conservator, I would be curious to hear to what extent you think that perspective might animate and inform your work.

Dragan: I believe that developing criteria of relevance and even selecting what artifacts are allowed into archives poses a problem of scale. The wise choice might be not trying to solve this problem, but to work on techniques for capturing artifacts as a whole – without trying to define significant properties, what the “core” of an artifact might be, or making too many assumptions about the future use of the artifact. The fewer choices are made during archiving, the more choices are open later, when the artifact will be accessed.

While at Rhizome I want to focus on designing how access to legacy data and systems located in an archive can be designed in a meaningful way. For Digital Culture, “access” means finding a way for a whole class of legacy artifacts to fulfill a function in contemporary Digital Culture. How to do that is one of the most pressing issues when it comes to developing an actually meaningful history of Digital Culture. We are still fixated on a very traditional storytelling, along the lines of great men creating groundbreaking innovations that changed the world. I hope I can help by turning the focus to users.

Trevor: BitTorrent has been the primary means by which the geocities archive has been published and shared. Given the recent announcement of AcademicTorrents as a potential way for sharing research data, I would be curious to hear what role you think BitTorrent can or should play in providing access to this kind of raw data.

Dragan: The torrent was an emergency solution in the Geocities case, but the Archive Team’s head Jason Scott turned the disadvantage of not having any institutional support into a powerful narrative device. Today the torrent’s files can be downloaded from The Internet Archive and this is in fact much more comfortable, though less exciting.

In general distribution via BitTorrent is problematic because once nobody is interested in a certain set of data anymore, even temporarily, a torrent dries up and simply vanishes. But torrents can help rescuing stuff trapped in an institution or distributing stuff that no institution would ever dare to touch. One of them is this big pile of Digital Folklore of Geocities, it poses so many problems for institutions: there are literally hundreds of thousands of unknown authors who could theoretically claim some copyright violations if their work would show up under the banner of an institution; there is hardly anybody in charge who would recognize the immense cultural value of the digital vernacular; it is so much material that no-one could ever look inside and check each and every byte for “offensive content” and so forth …

Trevor: In my earlier post on digital interfaces I had called the website One Terabyte of Kilobyte Age, which you and artist Olia Lialina run, an interpretation. You said you think of it as “a carefully designed mass re-enactment, based on this scale of authenticity/accessibility.” Could you unpack that for us a bit? What makes it a re-enactment and what do you see as the core claim in your approach to authenticity and accessibility?

Dragan: As much as Digital Culture is Mass Culture, it is also more about practices than objects. In order for artifacts to survive culturally, they need to become useful again in contemporary digital culture. Since, at the moment, “content” that is isolated, de-contextualized and shuffled around in databases of social networking sites is the main form of communication, to be useful an artifact has to work as a “post,” it has to become impartible and be brought into a format that is accepted everywhere. And that is a screenshot.

I have a great setup with emulators and proxy servers and whatnot to access the processed Geocities archive, but this won’t bring it anywhere close to executing its important cultural role, being a testimony of a pre-industrialized web. Even public archives like the rather excellent ReoCities or the Wayback Machine cannot serve as a mediator for 1990’s web culture. The screenshots are easily accessible, sharable and usable: they work as cultural signatures users can assign to themselves by re-blogging them, they can be used to spark discussions and harvest likes and favorites, and so forth.

Some decisions of how these screenshots are automatically created are coming from this perspective of accessibility; for example, although the typical screen resolution of web users increased around the turn of the century, One Terabyte Of Kilobyte Age will continue to serve 800×600 pixel images for the foreseeable future. Larger images would burst many blogs’ layouts and cause unrecognizable elements on downsizing.

Other decisions, like the choice of MIDI replay plugin installed in the browser, are about making the screenshots as narrative as possible. The MIDI replay plugin shipped with Netscape would play MIDI music in the background without any visual representation; if the music was embedded in the page, it would show simple play controls. The “crescendo” plugin I used always shows the file name of the MIDI file being played, most of the time in a popup window.


Reenactment of CapitolHill/1455/ via One Terabyte of Kilobyte Age

On the Geocities site CapitolHill/1455/ there is music playing called “2001.mid”. You might think this might be the title theme of the movie 2001, “Also Sprach Zarathustra” by Richard Strauss – and that’s really the case (see this recording artist Olia Lialina made). This screenshot of the – some might say – annoying, even not very “authentic” popup window makes the tune play in your head.


“Access to the remains of Geocities can be measured on two axis: authenticity (how realistic can the harvested data be presented again) and ease of access (what technical requirements and what knowledge are needed to gain access on a certain level of authenticity).” Image from Dragan.

So, while the screenshots have some “authenticity issues,” this is greatly outweighed by their accessibility and therefore impact. And experiencing the iconic, graphically impressive Netscape browser is something otherwise only achievable by complicated emulator setups. (Olia observed that tumblr users also reblog screenshots of empty Netscape windows, probably because its very dominant interface looks explicitly historic today.)

Trevor: In the announcement of your new position you are quoted as saying “I strongly believe that designing the access to complex legacy digital artifacts and systems is the largest contemporary challenge in digital culture. Digital culture is mass culture, and collection and preservation practices have to change to reflect this fact.” Could you unpack that a bit more for us? What are the implications of mass digital culture for collecting and preserving it?

Dragan: The grief I have with the creation of history in digital culture is that it is in many cases located outside of digital culture itself. Digital culture is regarded as too flimsy (or the classic “ephemeral”) to take care of itself, so conservation is done by removing artifacts from the cultural tempest they originated in and putting them into a safe place. The problem is that this approach doesn’t scale – sorry for using this technical term. I won’t argue that a privileged, careful handling of certain artifacts deemed of high importance or representative value is the wrong way; actually, this approach is the most narrative. But practiced too rigidly it doesn’t do digital culture any justice. Firstly because there simply are no resources to do this with a large amount of artifacts, and secondly because many artifacts can only blossom in their environment, in concert or contrast with a vernacular web, commercial services and so forth.

The other extreme is to write history with databases, pie charts and timelines, like in Google’s Zeitgeist. Going there I can find out that in January 2013 the top search requests in my city were “silvester” and “kalender 2013” – big data, little narration. With the presentation of such decontextualized data points, the main narrative power lies in the design of the visual template they end up in. This year it is a world map, next year it might be a 3D timeline – but in fact users typed in their queries into the Google search box. That is why the popular Google Search autocomplete screen shots, as a part of digital folklore, are more powerful, and typing into the Google search box yourself and watching the suggestions appear is the best way to explore what is being searched for.

Example of Autocomplete screenshot provided by Dragan.


Mass Digital Culture is posing this challenge: can there be a way of writing its history that does it justice? How to cope with the mass without cynicism and with respect for the users, without resorting to methods of market analysis?

Trevor: I spoke with Ben Fino-Radin, your predecessor in this role, about Rhizome and his take on what being a digital conservator means. I’d be curious to hear your response to that question. Could you tell us a bit about how you define this role? To what extent do you think this role is similar and different to analog art conservation? Similarly, to what extent is this work similar or different to roles like digital archivist or digital curator?

Dragan: I have very little experience with conserving analog art in general so I will spare myself the embarrassment of comparing. The point I agree whole-heartedly with Ben is about empathy for the artifacts. “New Media” is always new because the symbols buzzing around in computers don’t have any meaning by themselves, and digital culture is about inventing meanings for them. A digital conservator will need to weave the past into the present and constantly find new ways of doing so. This touches knowledge of history, curation, and artistic techniques. While I believe the field of digital conservation needs to build an identity still, I see my personal role as ultimately developing methods and practices for communities to take care of their own history.

Categories: Planet DigiPres

ARC to WARC migration: How to deal with de-duplicated records?

Open Planets Foundation Blogs - 24 March 2014 - 4:13pm

In my last blog post about ARC to WARC migration I did a performance comparison of two alternative approaches for migrating very large sets of ARC container files to the WARC format using Apache Hadoop, and I said that resolving contextual dependencies in order to create self-contained WARC files was the next point to investigate further. This is why I am now proposing one possible way to deal with de-duplicated records in an ARC to WARC migration scenario.

Before entering into specifics, let me briefly recall what is meant by “de-duplication”: it is a mechanism used by a web crawler to reference identical content that was already stored when visiting a web site at a previous point in time. Its main purpose is to avoid storing content redundantly and thereby reduce the required storage capacity.
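The general idea can be sketched in a few lines. This is only an illustration of digest-based de-duplication, not the logic of the Heritrix module: the function name, the in-memory index and the use of SHA-1 over the raw payload are all assumptions.

```python
import hashlib


def crawl(pages, seen_digests):
    """Store new payloads; record only a reference for payloads seen before.

    pages: dict of URL -> payload bytes fetched in this crawl job.
    seen_digests: dict of payload digest -> URL of the earlier stored copy
                  (updated in place so later jobs can refer back to it).
    """
    stored, duplicates = {}, {}
    for url, payload in pages.items():
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        if digest in seen_digests:
            duplicates[url] = digest      # identical content already stored
        else:
            seen_digests[digest] = url
            stored[url] = payload
    return stored, duplicates


index = {}
first_stored, first_dups = crawl(
    {"http://example.org/": b"<html>v1</html>", "http://example.org/logo.png": b"PNG"},
    index,
)
second_stored, second_dups = crawl(
    {"http://example.org/": b"<html>v2</html>", "http://example.org/logo.png": b"PNG"},
    index,
)
print(sorted(second_stored))  # only the changed HTML page is stored again
print(sorted(second_dups))    # the unchanged image is recorded as a duplicate
```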

The Netarchive Suite uses a Heritrix module for de-duplication, which takes place on the level of a harvest definition. The following diagram roughly outlines the most important information items and their dependencies.


The example shows two subsequent jobs executed as part of the same harvest definition. Depending on configuration parameters such as the desired size of ARC files, each crawl job creates one or more ARC container files and a corresponding crawl metadata file. In the example above, the first crawl job (1001) produced two ARC files, each containing ARC metadata, a DNS record and one HTML page. Additionally, the first ARC file contains a PNG image file that was referenced in the HTML file. The second crawl job (1002) produced equivalent content, except that the PNG image file is not contained in the first ARC file of this job; it is only referred to as a de-duplicated item in the crawl-metadata using the notation {job-id}-{harvest-id}-{serialno}.

The question is: Do we actually need the de-duplication information in the crawl-metadata file? If an index (e.g. CDX index) is created over all ARC container files, we know – or better: the wayback machine knows – where a file can be located, and in this sense the de-duplication information could be considered obsolete. We would only lose the information about the crawl job in which the de-duplication actually took place, and this concerns the informational integrity of a crawl job because external dependencies would not be explicit any more. Therefore, the following is a proposed way to preserve this information in a WARC-standard-compliant way.

Each content record of the original ARC file is converted to a response-record in the WARC file, as illustrated in the bottom left box in the diagram above. Any request/response metadata can be added as a header block to the record payload or as a separate metadata-record that relates to the response-record.

The de-duplicated information items available in the crawl-metadata file are converted to revisit-records, as illustrated in the bottom right box, in a separate WARC file (one per crawl-metadata file). The payload-digest must be equal to that of the referred record and should state that the completeness of the referred record was checked successfully. The WARC-Refers-To property refers to the WARC record that contains the record payload. Additionally, the fact that Content-Length is 0 explicitly states that the record payload is not available in the current record and that it is to be located elsewhere.
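To make the shape of such a revisit-record concrete, here is a minimal sketch that writes one as text. The header names and the profile URI come from the WARC 1.0 standard; the function name, the digest value and the record IDs in the usage are placeholders, and nothing here claims to be how the migration tool itself builds records.

```python
import uuid
from datetime import datetime, timezone

# Profile URI defined by WARC 1.0 for revisits identified by payload digest.
REVISIT_PROFILE = "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"

def revisit_record(target_uri, payload_digest, refers_to_id, date=None):
    """Return a WARC revisit record with an empty content block."""
    date = date or datetime.now(timezone.utc)
    headers = [
        ("WARC-Type", "revisit"),
        ("WARC-Target-URI", target_uri),
        ("WARC-Date", date.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Record-ID", "<urn:uuid:%s>" % uuid.uuid4()),
        ("WARC-Refers-To", refers_to_id),      # record that holds the payload
        ("WARC-Payload-Digest", payload_digest),
        ("WARC-Profile", REVISIT_PROFILE),
        ("Content-Length", "0"),               # payload is located elsewhere
    ]
    lines = ["WARC/1.0"] + ["%s: %s" % kv for kv in headers]
    # Header lines, a blank line, the (empty) block, then two CRLFs end the record.
    return "\r\n".join(lines) + "\r\n\r\n" + "\r\n\r\n"

rec = revisit_record("http://example.org/logo.png",
                     "sha1:PLACEHOLDERDIGEST",
                     "<urn:uuid:00000000-0000-0000-0000-000000000000>")
```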

Taxonomy upgrade extras: SCAPE, SCAPEProject, SCAPE-Project, Migration, Web Archiving, ARC, WARC, ARC to WARC
Preservation Topics: Migration, Web Archiving, SCAPE
Categories: Planet DigiPres

New look and feel for Viewshare

The Signal: Digital Preservation - 21 March 2014 - 5:55pm

Earlier this month Trevor Owens announced that a new version of Viewshare is open for user testing and comment. Following this public beta, our plan is to move all users over to the new platform in the next few months.  When this happens your Viewshare account, data and views will all transition seamlessly. You will, however, notice visual and functional improvements to the views. The overall look and feel has been modernized and more functionality has been added to some of the views, particularly those that use pie charts.

Trevor gave an overview of all of the new features in his post announcing these changes. In this post I’ll focus on the visual updates. I would encourage everyone with a Viewshare account to check out how your views look in the new version of Viewshare and let us know what you think or if you have any issues.

Responsive Design

The new version of Viewshare implements responsive design which will allow your views to look good and be functional on any computer or device. You can see this in action with a view of digital collections preserved by NDIIPP partners on both a large and small screen. The view can fill up a large computer monitor screen and be equally functional and usable on a smartphone. This added feature will require no action from users and will work automatically.

NDIIPP Collections on a smartphone using the new version of Viewshare.

NDIIPP Collections view on large monitor

Changes for Charts

Bar chart views are available in the new version of Viewshare. The pie charts have also been greatly improved. Visually, they are clearer and the text is more legible. Functionally, users are able to click through to items that are represented in different areas of the pie chart. This isn’t possible in the current Viewshare. Check out the two versions of the same data from the East Texas Research Center and you’ll see the improvements.

I do want to point out that in the current version of Viewshare there’s an option to switch between two different pie charts on the same view by using a “view by” drop-down menu. To simplify the building process for these views, that option was eliminated in the new version of Viewshare, so if you want two pie charts all you have to do is create two views. If your current pie chart view has options to view more than one chart in the same view, the chart listed first will be the one that displays in the new version. To restore the missing chart, simply create an additional pie chart view.

Current pie chart view

New version of pie charts

Share Filtered or Subsets of Results

The new version of Viewshare allows users to share results of a particular state in a view. An example of this is shown in the Carson Monk-Metcalf view of birth and death records. The view below shows a scatterplot chart of birth years vs. death years and their race and religion (religion data not shown below but accessible in the view). The view is limited to show records for those who were 75 years or older at the time of their death. The user could cite or link to this particular view in the data by clicking the red bookmark icon in the upper right and sharing or saving the link provided.

Carson Monk-Metcalf bookmarked results

Again, be sure to check out your views in the new Viewshare; your current login credentials will work. As always, let us know what you think in the comments of this post or in the user feedback forums for Viewshare.

Categories: Planet DigiPres

CSV Validator - beta releases

Open Planets Foundation Blogs - 21 March 2014 - 2:51pm

For quite some time at The National Archives (UK) we've been working on a tool for validating CSV files against user-defined schemas.  We're now at the point of making beta releases of the tool generally available (1.0-RC3 at the time of writing), along with the formal specification of the schema language.  The tool and source code are released under Mozilla Public Licence version 2.0.

For more details, links to the source code repository, release code on Maven Central, instructions and schema specification, see

Feedback is welcome.  When we make the formal version 1.0 release there will be a fuller blog post on The National Archives blog.

Preservation Topics: Tools
Categories: Planet DigiPres

A Tika to ride; characterising web content with Nanite

Open Planets Foundation Blogs - 21 March 2014 - 1:58pm

This post covers two main topics that are related; characterising web content with Nanite, and my methods for successfully integrating the Tika parsers with Nanite.

Introducing Nanite

Nanite is a Java project led by Andy Jackson from the UK Web Archive, formed of two main subprojects:

  • Nanite-Core: an API for Droid   
  • Nanite-Hadoop: a MapReduce program for characterising web archives that makes use of Nanite-Core, Apache Tika and libmagic-jna-wrapper  (the last one here essentially being the *nix `file` tool wrapped for reuse in Java)

Nanite-Hadoop makes use of UK Web Archive Record Readers for Hadoop, to enable it to directly process ARC and WARC files from HDFS without an intermediate processing step.  The initial part of a Nanite-Hadoop run is a test to check that the input files are valid gz files.  This is very quick (takes seconds) and ensures that there are no invalid files that could crash the format profiler after it has run for several hours.  More checks on the input files could potentially be added.
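A pre-flight check of this kind can be sketched in a few lines. The version below is an assumption about what "valid gz" might mean: it decompresses each file fully, which is stricter (and slower) than a header-only test, so Nanite's own check is probably lighter. The function name is hypothetical.

```python
import gzip
import zlib

def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard in chunks so large files do not fill memory.
            while fh.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header, EOFError a truncated file,
        # zlib.error a corrupted compressed stream.
        return False
```

Running this over every input before the job starts lets bad files be excluded up front rather than discovered hours into a run.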

We have been working on Nanite to add different characterisation libraries and improve them/their coverage.  As the tools that are used are all Java or use native library calls, Nanite-Hadoop is fast.  Retrieving a mimetype from Droid and Tika for all 93 million files in 1TB (compressed size) of WARC files took 17.5hrs on our Hadoop cluster.  This is less than 1ms/file.  Libraries can be turned on/off relatively easily by editing the source or in the jar.
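The per-file figure follows directly from the numbers quoted. Note that it is wall-clock time for the whole cluster divided by the number of files, not the time a single node spends on one file.

```python
files = 93_000_000       # files in the WARC corpus
runtime_hours = 17.5     # wall-clock runtime on the cluster

ms_per_file = runtime_hours * 3600 * 1000 / files
print(round(ms_per_file, 2))  # prints 0.68
```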

That time does not include any characterisation, so I began to add support for characterisation using Tika’s parsers.  The process I followed to add this characterisation is described below.

(Un)Intentionally stress testing Tika’s parsers

In hindsight sending 93 million files harvested from the open web directly to Tika’s parsers and expecting everything to be ok was optimistic at best.  There were bound to have been files in that corpus that were corrupt or otherwise broken that would cause crashes in Tika or its dependencies. 

Carnet let you do that; crashing/hanging the Hadoop JVM

Initially I began by using the Tika Parser interface directly.  This was ok until I noticed that some parsers (or their dependencies) were crashing or hanging.  As that was rather undesirable I began to disable the problematic parsers at runtime (with the aim of submitting bug reports back to Tika).  However, it soon became apparent that the files contained in the web archive were stressing the parsers to the point I would have had to disable ever-increasing numbers of them.  This was really undesirable as the logic was handcrafted and relied on the state of the Tika parsers at that particular moment.  It also meant that a single bad file of a particular format would prevent any characterisation of that format.  The logic to do this is still in the code, albeit not currently used.

Timing out Tika considered harmful; first steps

The next step was to error-proof the calls to Tika.  Firstly I ensured that any Exceptions/Errors/etc were caught.  Then I created a TimeoutParser that parsed the files in a background Thread and forcibly stopped the Tika parser after a time limit had been exceeded.  This worked ok; however, it made use of Thread.stop() – a deprecated API call to stop a Java Thread.  Use of this API call is strongly discouraged as it may corrupt the internal state of the JVM or produce other undesired effects.  Details about this can be read in an issue on the Tika bug tracker.  Since I did not want to risk a corruption of the JVM I did not pursue this further.

I should note that subsequently it has been suggested that an alternative to using Thread.stop() is to just leave the Thread alone for the JVM to deal with and create a new Thread.  This is a valid method of dealing with the problem, given the numbers of files involved (see later), but I have not tested it.

The whole Tika, and nothing but the Tika; isolating the Tika process

Following a suggestion by a commenter in the Tika issue, linked above, I produced a library that abstracted a Tika-server as a separate operating system process, isolated from the main JVM: ProcessIsolatedTika.  This means that if Tika crashes it is the operating system’s responsibility to clean up the mess and it won’t affect the state of the main JVM.  The new library controls restarting the process after a crash, or after processing times out (in case of a hang).  An API similar to a normal Tika parser is provided so it can be easily reused.  Communication by the library with the Tika-server is via REST, over the loopback network interface.  There may be issues if more than BUFSIZE bytes (currently 20MB) are read – although such errors should be logged by Nanite in the Hadoop Reducer output.
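The principle of process isolation can be shown with a minimal sketch. It is not ProcessIsolatedTika: it starts a fresh child process per call instead of talking to a long-running Tika-server, and a Python one-liner stands in for the parser. The function name and the status strings are hypothetical.

```python
import subprocess
import sys

def run_isolated(code, timeout_s):
    """Run `code` in a separate Python process; return (status, stdout)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a hang
        # cannot leak into the parent process.
        return ("timeout", "")
    if proc.returncode != 0:
        # A crash only takes down the child; the parent just sees the exit code.
        return ("crashed", proc.stdout)
    return ("ok", proc.stdout)

ok = run_isolated("print('parsed')", 10)
crash = run_isolated("import os; os._exit(3)", 10)
hang = run_isolated("import time; time.sleep(60)", 1)
```

In all three cases the calling process carries on unaffected, which is the property the separate Tika process is meant to provide.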

Although the main overhead of this approach is having a separate process and JVM per WARC file, that is mitigated somewhat by the time that process is used for.  Aside from the cost of transferring files to the Tika-server, the overhead is a larger jar file, longer initial start-up time for Mappers and additional time for restarts of the Tika-server on failed files.  Given that the average runtime per WARC is slightly over 5 minutes, the few additional seconds incurred by using a process-isolated Tika are not a great deal extra.

The output from the Tika parsers is kept in a sequence file in HDFS (one per input (W)ARC) – i.e. 1000 WARCs == 1000 Tika parser sequence files.  This output is in addition to the output from the Reducer (mimetypes, server mimetypes and extension).

To help the Tika parsers with the file, Tika detect() is first run on the file and that mimetype is passed to the parsers via an HTTP header.  A Metadata object cannot be passed to the parsers via REST like it would be if we called them directly from the Java code.
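A request of this kind might look as follows. This is a sketch under assumptions: it supposes a Tika-server listening on its default port 9998 with a /tika resource, and it supposes that the detected type travels in the Content-Type header, which the post does not spell out. The request is built but not sent, and the function name is hypothetical.

```python
import urllib.request

# Assumed endpoint: tika-server's default port and its /tika resource.
TIKA_URL = "http://127.0.0.1:9998/tika"

def build_parse_request(payload, detected_mimetype, url=TIKA_URL):
    """Build (but do not send) a PUT request carrying the detected type."""
    return urllib.request.Request(
        url,
        data=payload,
        method="PUT",
        # The detected mimetype is passed as a hint in an HTTP header.
        headers={"Content-Type": detected_mimetype, "Accept": "text/plain"},
    )

req = build_parse_request(b"%PDF-1.4 ...", "application/pdf")
```

Sending it would be a call such as `urllib.request.urlopen(req, timeout=30)`, with the timeout acting as the guard against a hung parser.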

Another approach could have been to use Nailgun as described by Ross Spencer in a previous blog post here.  I did not take that approach as I did not want to set up a Nailgun server on each Hadoop node (we have 28 of them) and if a Tika parser crashed or caused the JVM to hang then it may corrupt the state of the Nailgun JVM in a similar way to the TimeoutParser above.  Finally, with my current test data each node handles ~3m files – much more than the 420k calls that caused Nailgun to run out of heap space in Ross’ experiment.

Express Tika; initial benchmarks

I ran some initial benchmarks on 1000 WARC files using our test Hadoop cluster (28 nodes with 1 cpu/map slot per node). The results are as follows:

Identification tools used: Nanite-core (Droid); Tika detect() (mimetype only); ProcessIsolatedTika parsers
WARC files:
Total WARC size: 59.4GB (63,759,574,081 bytes)
Total files in WARCs (# input records):
Runtime (hh:mm:ss):
Total Tika parser output size (compressed): 765MB (801,740,734 bytes)
Tika parser failures/crashes:
Misc failures: Malformed records: 122; IOExceptions*: 3224; Other Exceptions: 430; Total: 3776

*This may be due to files being larger than the buffer – to be investigated.

The output has not been fully verified but should give an initial indication of speed.

Conceivably the information from the Tika parsers could be loaded into c3po but I have not looked into that.

Conclusion; if the process isolation FITS, where is it?

We are now able to use Tika parsers for characterisation without being concerned about crashes in Tika.  This research will also allow us to identify files that Tika’s parsers cannot handle so we can submit bug reports/patches back to Tika.  When Tika 1.6 comes out it will include detailed pdf version detection within the pdf parser.

As an aside - if FITS offered a REST interface then the ProcessIsolatedTika code could be easily modified to replace Tika with FITS – this is worth considering, if there was interest and someone were to create such a REST interface.

Apologies for the puns.

Preservation Topics: Preservation Actions, Identification, Characterisation, Web Archiving, Tools, SCAPE
Categories: Planet DigiPres

Nominations Now Open for the 2014 NDSA Innovation Awards

The Signal: Digital Preservation - 20 March 2014 - 8:19pm

“12 year old girl wins Medal of Honor, Washington, D.C., Sept. 12.” Library of Congress, Prints & Photographs Collection. LC-DIG-hec-33759.

 The National Digital Stewardship Alliance Innovation Working Group is proud to open the nominations for the 2014 NDSA Innovation Awards. As a diverse membership group with a shared commitment to digital preservation, the NDSA understands the importance of innovation and risk-taking in developing and supporting a broad range of successful digital preservation activities. These awards are an example of the NDSA’s commitment to encourage and recognize innovation in the digital stewardship community.

This slate of annual awards highlights and commends creative individuals, projects, organizations and future stewards demonstrating originality and excellence in their contributions to the field of digital preservation. The program is administered by a committee drawn from members of the NDSA Innovation Working Group.

Last year’s winners are exemplars of the diversity and collaboration essential to supporting the digital stewardship community as it works to preserve and make available digital materials. For more information on the details of last year’s recipients, please see the blog post announcing last year’s winners.

The NDSA Innovation Awards focus on recognizing excellence in one or more of the following areas:

  • Individuals making a significant, innovative contribution to the field of digital preservation;
  • Projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship;
  • Organizations taking an innovative approach to providing support and guidance to the digital preservation community;
  • Future stewards, especially students, but including educators, trainers or curricular endeavors, taking a creative approach to advancing knowledge of digital preservation theory and practices.

Acknowledging that innovative digital stewardship can take many forms, eligibility for these awards has been left purposely broad. Nominations are open to anyone or anything that falls into the above categories and any entity can be nominated for one of the four awards. Nominees should be US-based people and projects or collaborative international projects that contain a US-based partner. This is your chance to help us highlight and reward novel, risk-taking and inventive approaches to the challenges of digital preservation.

Nominations are now being accepted and you can submit a nomination using this quick, easy online submission form. You can also submit a nomination by emailing a brief description, justification and the URL and/or contact information of your nominee to ndsa (at)

Nominations will be accepted until Friday May 2, 2014 and winners announced in mid-May. The prizes will be plaques presented to the winners at the Digital Preservation 2014 meeting taking place in the Washington, DC area on July 22-24, 2014. Winners will be asked to deliver a very brief talk about their activities as part of the awards ceremony and travel funds are expected to be available for these invited presenters.

Help us recognize and reward innovation in digital stewardship and submit a nomination!

Categories: Planet DigiPres

Long term accessibility of digital resources in theory and practice

Alliance for Permanent Access News - 20 March 2014 - 3:25pm

The APARSEN project is organising a Satellite Event on “Long Term Accessibility of Digital Resources in Theory and Practice” on 21st May 2014 in Vienna, Austria.

It takes place in the context of the 3rd LIBER Workshop on Digital Curation “Keeping data: The process of data curation” (19-20 May 2014)

The programme is organised by the APARSEN project together with the SCAPE Project.

09:00 – 10:30
Sabine Schrimpf (German National Library): Digital Rights Management in the context of long-term preservation
Ross King (Austrian Institute of Technology): The SCAPE project and Scalable Quality Control
David Wang (SBA Research): Understanding the Costs of Digital Curation
11:00 – 12:30
Sven Schlarb (Austrian National Library): Application scenarios of the SCAPE project at the Austrian National Library
Krešimir Đuretec (Vienna University of Technology): The SCAPE Planning and Watch Suite
David Giaretta (Alliance for Permanent Access): Digital Preservation: How APARSEN can help answer the key question “Who pays and Why?”
Categories: Planet DigiPres

A Regional NDSA?

The Signal: Digital Preservation - 19 March 2014 - 5:59pm

The following is a guest post by Kim Schroeder, a lecturer at the Wayne State University School of Library and Information Science.

Several years ago before the glory of the annual NDSA conference, professionals across America were seeking more digital curation literature and professional contacts.  Basic questions like ‘what is the first step in digital preservation?’ and ‘how do I start to consistently manage digital assets?’ were at the forefront.

As we have worked toward increased information sharing including the invaluable annual NDSA and IDCC conferences, we see a disconnect as we return home.  As we try to implement new tools and processes, we inevitably hit bumps beyond our immediate knowledge.  This is being alleviated more and more by local meetings being hosted in regions to gather professionals for hands-on and hand-waving process sharing.

Lance Stuchell, Digital Preservation Librarian at the University of Michigan, and I began the Regional Digital Preservation Practitioners (RDPP) meetings as an opportunity to talk through our challenges and solutions.  The result is that over 100 professionals have signed up for our listserv since our call one year ago. We sent announcements out to Windsor, Toledo, Ann Arbor and throughout Metro Detroit to let people know that there is an untapped community of professionals that want and need to share their progress on digital curation.

Kevin Barton in the Wayne State SLIS Lab. Photo credit: Mary Jane Murawka

In the last year we have held three meetings with more planned this year.  The initial meeting included a discussion and eventually a survey to define our biggest issues as well as how best to craft the group.  Other topics included a digital projects lab tour, a DSpace installation overview, and a demonstration of a mature Digital Asset Management system.  Coming later this year, we plan to focus on metadata issues and a symposium on how to create workflows.  Further information about the meetings is available at the Regional Digital Preservation Practitioners site.

The development of the list has been one of the more helpful pieces with folks posting jobs, practicum ideas, latest articles and technical questions.  The volume of discussion is not there yet but it is off to a healthy start.

Mid-Michigan has also created a similar group that works with us to schedule events and share information.  Ed Busch, the Electronic Records Archivist at Michigan State University (MSU) held a successful conference last summer at MSU and he said:  “What my co-worker Lisa Schmidt and I find so useful with our Mid-Michigan regional meeting is the chance to network with other professionals trying to solve the same situations as we are with digital assets; hearing what they’ve tried with success and failure; and finding new ways to collaborate. All institutions with digital assets, regardless of size, are in the same boat when it comes to dealing with this material. It’s really nice to hear that from your peers.”  They held another conference on March 14th of this year and the agenda is available (pdf).

The NDSA is also encouraging regions to join together beyond the annual meeting. Butch Lazorchak, a Digital Archivist at the National Digital Information Infrastructure and Preservation Program, shared his thoughts on this. “The NDSA regional meetings are a great opportunity for NDSA members to share the work they’ve done to advance digital stewardship practice,” he said. “At the same time, the meetings help to build community by bringing together regional participants who may not usually find an opportunity to get together to discuss digital stewardship practice and share their own triumphs and challenges.”

Beginning a regional group is fairly easy as you send out announcements to professional listservs, but the tougher part is administration.  Deciding who keeps the minutes, manages the list, hosts the next meeting and how to maintain momentum is a necessity.  With the explosion in research, professional literature and expanding conferences we have more avenues to explore but we need the hands-on lessons learned from local colleagues to continue successful experimentation.  We would encourage you to think about starting your own local group!

Categories: Planet DigiPres

Things to Know About Personal Digital Archiving 2014

The Signal: Digital Preservation - 18 March 2014 - 8:44pm

Personal Digital Archiving 2014 will be held at the Indiana State Library in Indianapolis, Indiana, April 10-11, 2014.  This is THE conference that raises awareness among individuals, public institutions and private companies engaged in the creation, preservation and ongoing use of personal digital content.  A key overarching topic will be how libraries, archives and other cultural heritage organizations can support personal digital archiving within our own community as well as reaching out to specific communities. We invite you to come out and join the conversation.

The two-day conference will feature a diverse range of presentations on topics such as: archiving and documentation practices of local communities; tools and techniques to process digital archives; investigations of building, managing and archiving scholarly practices and family history archives; and the challenges of communicating personal digital archiving benefits to a variety of audiences. The full list of presentations, lightning talks and posters can be found here.

Tag cloud of PDA14 presentation titles.

Here are a few quick things to know about the upcoming conference:

  • Keynote speakers will explore preservation challenges from the perspectives of both researchers and creators of personal digital information.  Andrea Copeland from the School of Informatics and Computing, Indiana University-Purdue, will talk about her research looking into public library users’ digital preservation practices. Charles R. Cross, a music historian & author, will talk about the value of personal archives from a biographer’s perspective.
  • Adequate infrastructure in many organizations to implement preservation of personal digital records is lacking.  There will be a number of presentations on the practical side of doing personal digital preservation using specific tools and services.  Some will be on consumer-level services that help individuals build their own personal digital archives. Other presentations will be from librarians, archivists and researchers who are using certain tools to help their institutions manage personal digital records.
  • Knowledge related to accession, donor or legal requirements, researchers’ interests, and practical preservation strategies for personal digital archives is equally lacking. To help understand some of these issues, practitioners, scholars and individuals from different fields will share their current research on personal digital archiving topics.  For the first time, the conference will feature a panel discussion from contemporary architects and landscape architects talking about preserving their work and transferring it to archives.  This is a community of professionals not regularly represented at the PDA conference and provides a great opportunity to hear about their specific challenges.

Registration is open!  We hope you can join us and explore and help raise awareness of the need for personal digital archiving in your own communities.

Categories: Planet DigiPres

Three years of SCAPE

Open Planets Foundation Blogs - 18 March 2014 - 12:24pm

SCAPE is proud to look back at another successful project year. During the third year the team produced many new tools, e.g. ToMaR, a tool which wraps command line tools into Hadoop MapReduce jobs. Other tools like xcorrSound and C3PO have been developed further.

This year’s All-Staff Meeting took place mid-February in Póvoa de Varzim, Portugal. The team organised a number of general sessions, during which the project partners presented demos of and elevator pitches for the tools and services they developed in SCAPE. It was very interesting for all meeting participants to see the results achieved so far. The demos and pitches were also useful for re-focusing on the big picture of SCAPE. During the main meeting sessions the participants mainly focused on take up and productization of SCAPE tools.

Another central topic of the meeting was integration. Until the end of the project the partners will put an emphasis on integrating the results further. To prove scalability of the tools, the team set up a number of operational Hadoop cluster instances (both central and local), which are currently being used for the evaluation of the tools and workflows.

Another focus lies on the sustainability of SCAPE tools. The SCAPE team is working towards documenting the tools for both developers and users. SCAPE outcomes will be curated by the Open Planets Foundation until the end of the project, and the Foundation will keep them available.

In September 2014 SCAPE is organising a final event in collaboration with APARSEN. The workshop is planned to take place at the Digital Libraries 2014 conference in London, where SCAPE will have its final, overall presentation. The workshop is directed towards developers, content holders, and data managers. The SCAPE team will present tools and services developed since 2011. A special focus will lie on newly and further developed open source tools for scalable preservation actions; SCAPE’s scalable Platform architecture; and its policy-based Planning and Watch solutions.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Mavenized JHOVE

File Formats Blog - 16 March 2014 - 2:19pm

I’m not a Maven maven, but more of a Maven klutz. Nonetheless, I’ve managed to push a Mavenized version of JHOVE to Github that compiles for me. I haven’t tried to do anything beyond compiling. If anyone would like to help clean it up, please do.

This kills the continuity of file histories which Andy worked so hard to preserve, since Maven has its own ideas of where files should be. The histories are available under the deleted files in their old locations, if you look at the original commit.

Tagged: JHOVE, software
Categories: Planet DigiPres

ToMaR - How to let your preservation tools scale

Open Planets Foundation Blogs - 14 March 2014 - 4:01pm

Whenever you find that you have got used to a command line tool and all of a sudden need to apply it to a large number of files over a Hadoop cluster, without having any clue about writing distributed programs, ToMaR will be your friend.

Mathilda is working at the department for digital preservation at a famous national library. In her daily work she has to cope with various well-known tasks like data identification, migration and curation. She is experienced in using the command shell on a Unix system and occasionally has to write small scripts to perform a certain workflow effectively.

When she has to deal with a few hundred files she usually invokes her shell script on one file after the other using a simple loop for automation. But today she has been put in charge of a much bigger data set than she is used to. There are one hundred thousand TIFF images which need to be migrated to JPEG2000 images in order to save storage space. Intuitively she knows that processing these files one after the other, with each single migration taking about half a minute, would take about five weeks to run (100,000 × 30 seconds is roughly 35 days).

Luckily Mathilda has heard of the Hadoop cluster that colleagues of hers have recently set up in order to do some data mining on a large collection of text files. "Would there be a way to run my file migration tool on that cluster thing?", she thinks, "If I could run it in parallel on all these machines then that would speed up my migration task tremendously!" Only one thing makes her hesitate: She has hardly got any Java programming skills, not to mention any idea of that MapReduce programming paradigm they are using in their data mining task. How to let her tool scale?

That's where ToMaR, the Tool-to-MapReduce Wrapper, comes in!

What can ToMaR do?

If you have a running Hadoop cluster you are only three little steps away from letting your preservation tools run on thousands of files almost as efficiently as with a native one-purpose Java MapReduce application. ToMaR wraps command line tools into a Hadoop MapReduce job which executes the command on all the worker nodes of the Hadoop cluster in parallel. Depending on the tool you want to use through ToMaR, it might be necessary to install it on each cluster node beforehand. Then all you need to do is:

  1. Specify your tool so that ToMaR can understand it using the SCAPE Tool Specification Schema.
  2. Itemize the parameters of the tool invocation for each of your input files in a control file.
  3. Run ToMaR.

Through MapReduce your list of parameter descriptions in the control file will be split up and assigned to each node portion by portion. For instance ToMaR could have been configured to create splits of 10 lines each taken from the control file. Then each node parses the portion line by line and invokes the tool with the parameters specified therein each time.
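The splitting idea can be illustrated in a few lines of plain Python. This is only a sketch of the concept, not ToMaR's or Hadoop's actual code: the function name split_control_lines and the 25 sample lines are made up for illustration, and the split size of 10 is taken from the example above.

```python
from typing import Iterator, List


def split_control_lines(lines: List[str], lines_per_split: int = 10) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `lines_per_split` control-file lines."""
    if lines_per_split < 1:
        raise ValueError("lines_per_split must be at least 1")
    for start in range(0, len(lines), lines_per_split):
        yield lines[start:start + lines_per_split]


# 25 hypothetical control lines, one tool invocation per line
control_lines = [
    f'openjpeg image-to-j2k --input="hdfs://myFile{i}.tif" --output="hdfs://myFile{i}.jp2"'
    for i in range(1, 26)
]

splits = list(split_control_lines(control_lines, 10))

# Each split would be handed to one worker, which processes it line by line
for number, split in enumerate(splits, start=1):
    print(f"split {number}: {len(split)} lines")
```

With 25 lines and a split size of 10, the last split simply holds the remaining 5 lines.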

File Format Migration Example

So how can Mathilda tackle her file format migration problem? First she will have to make sure that her tool is installed on each cluster node. Her colleagues who maintain the Hadoop cluster will take care of this requirement. What is left to her is the creation of the Tool Specification Document (ToolSpec) using the SCAPE Tool Specification Schema and the itemization of the tool invocation parameter descriptions. The following figure depicts the required workflow:

Create the ToolSpec

The ToolSpec is an XML file which contains several operations. An operation consists of a name, a description, a command pattern and input/output parameters. The operation for Mathilda's file format migration tool might look like this:

<operation name="image-to-j2k">
  <description>Migrates an image to jpeg2000</description>
  <command>
    image_to_j2k -i ${input} -o ${output} -I -p RPCL -n 7 -c [256,256],[256,256],[128,128],[128,128],[128,128],[128,128],[128,128] -b 64,64 -r 320.000,160.000,80.000,40.000,20.000,11.250,7.000,4.600,3.400,2.750,2.400,1.000
  </command>
  <inputs>
    <input name="input" required="true">
      <description>Reference to input file</description>
    </input>
  </inputs>
  <outputs>
    <output name="output" required="true">
      <description>Reference to output file. Only *.j2k, *.j2c or *.jp2!</description>
    </output>
  </outputs>
</operation>

In the <command> element she has put the actual command line with a long tail of static parameters. This example highlights another advantage of the ToolSpec: You gain the ease of wrapping complex command lines in an atomic operation definition which is associated with a simple name, here "image-to-j2k". Inside the command pattern she puts placeholders which are replaced by various values. Here ${input} and ${output} denote such variables so that the value of the input file parameter (-i) and the value of the output file parameter (-o) can vary with each invocation of the tool.

Along with the command definition Mathilda has to describe these variables in the <inputs> and <outputs> sections. Since ${input} is the placeholder for an input file, she has to add an <input> element with the name of the placeholder as an attribute. The same applies to the ${output} placeholder. Additionally she can add some description text to these input and output parameter definitions.

There are more constructs possible with the SCAPE Tool Specification Schema which cannot be covered here. The full contents of this ToolSpec can be found in the file attachments.
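To make the structure of an operation more tangible, here is a small Python sketch that reads an operation element and lists its name, command pattern and declared parameters. It is not part of ToMaR. It assumes a shortened command pattern and an operation element without an XML namespace, as in the snippet above; a complete ToolSpec file may well declare a namespace, in which case the lookups would need adjusting. The helper name summarize_operation is made up.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the operation shown above (static parameters omitted)
OPERATION_XML = """
<operation name="image-to-j2k">
  <description>Migrates an image to jpeg2000</description>
  <command>image_to_j2k -i ${input} -o ${output}</command>
  <inputs>
    <input name="input" required="true">
      <description>Reference to input file</description>
    </input>
  </inputs>
  <outputs>
    <output name="output" required="true">
      <description>Reference to output file. Only *.j2k, *.j2c or *.jp2!</description>
    </output>
  </outputs>
</operation>
"""


def summarize_operation(xml_text: str) -> dict:
    """Return the name, command pattern and parameter names of one operation."""
    operation = ET.fromstring(xml_text.strip())
    return {
        "name": operation.get("name"),
        "command": (operation.findtext("command") or "").strip(),
        "inputs": [element.get("name") for element in operation.findall("inputs/input")],
        "outputs": [element.get("name") for element in operation.findall("outputs/output")],
    }


summary = summarize_operation(OPERATION_XML)
print(summary)
```

Every placeholder in the command pattern should have a matching entry in the inputs or outputs list; a check like this can catch a misspelled placeholder name before a job is started.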

Create the Control File

The other essential requirement Mathilda has to fulfil is the creation of the control file. This file contains the real values for the tool invocation which are mapped to the ToolSpec by ToMaR. Together with the above example her control file will look something like this:

openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"
openjpeg image-to-j2k --input="hdfs://myFile2.tif" --output="hdfs://myFile2.jp2"
openjpeg image-to-j2k --input="hdfs://myFile3.tif" --output="hdfs://myFile3.jp2"
...

The first word refers to the name of the ToolSpec ToMaR shall load. In this example the ToolSpec is called "openjpeg.xml" but only the name without the .xml extension is needed for the reference. The second word refers to an operation within that ToolSpec; here it is the "image-to-j2k" operation described in the ToolSpec example snippet above.

The rest of the line contains references to input and output parameters. Each reference starts with a double dash followed by a parameter name and its value. So --input (and likewise --output) refers to the parameter named "input" in the ToolSpec, which in turn refers to the ${input} placeholder in the command pattern. The values are file references on Hadoop's Distributed File System (HDFS).
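The anatomy of a control line can be shown with a short Python sketch. It is an illustration of the format as described here, not ToMaR's own parser, and the function name parse_control_line is hypothetical. It assumes straight double quotes around the values and uses the standard shlex module to strip them.

```python
import shlex
from typing import Dict, Tuple


def parse_control_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Split a control line into ToolSpec name, operation name and parameters."""
    tokens = shlex.split(line)
    if len(tokens) < 2:
        raise ValueError("a control line needs a ToolSpec name and an operation name")
    toolspec, operation = tokens[0], tokens[1]
    params: Dict[str, str] = {}
    for token in tokens[2:]:
        # Each parameter reference has the form --name=value
        if not token.startswith("--") or "=" not in token:
            raise ValueError(f"unexpected token: {token!r}")
        name, value = token[2:].split("=", 1)
        params[name] = value
    return toolspec, operation, params


line = 'openjpeg image-to-j2k --input="hdfs://myFile1.tif" --output="hdfs://myFile1.jp2"'
toolspec, operation, params = parse_control_line(line)
print(toolspec, operation, params)
```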

As Mathilda has 100k TIFF images she will have 100k lines in her control file. As she knows how to use the command shell she quickly writes a script which generates this file for her.
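Mathilda's generator script is not shown in the post. A minimal Python sketch of such a script could look as follows, assuming the list of TIFF locations is already known and that each output file simply swaps the extension for .jp2. The function names and the three sample URIs are made up for illustration.

```python
from typing import Iterable


def control_line(tiff_uri: str, toolspec: str = "openjpeg", operation: str = "image-to-j2k") -> str:
    """Build one control-file line for a single TIFF file."""
    if not tiff_uri.lower().endswith((".tif", ".tiff")):
        raise ValueError(f"not a TIFF file: {tiff_uri}")
    stem = tiff_uri.rsplit(".", 1)[0]
    return f'{toolspec} {operation} --input="{tiff_uri}" --output="{stem}.jp2"'


def build_control_file(tiff_uris: Iterable[str]) -> str:
    """Join one control line per input file into the text of a control file."""
    return "\n".join(control_line(uri) for uri in tiff_uris) + "\n"


# Three hypothetical inputs; the real list would hold all 100k file references
uris = [f"hdfs://myFile{i}.tif" for i in range(1, 4)]
control_text = build_control_file(uris)
print(control_text, end="")
```

Writing control_text to controlfile.txt then gives the file that is passed to ToMaR.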

Run ToMaR

Having created the ToolSpec openjpeg.xml and the control file controlfile.txt, she copies openjpeg.xml into the directory "hdfs:///user/mathilda/toolspecs" of HDFS and executes the following command on the master node of the Hadoop cluster:

hadoop jar ToMaR.jar -i controlfile.txt -r hdfs:///user/mathilda/toolspecs

Here she feeds in the controlfile.txt and the location of her ToolSpecs and ToMaR does the rest. It splits up the control file and distributes a certain number of lines per split to each node. The ToolSpec is loaded and the parameters are mapped to the command line pattern contained in the named operation. Input files are copied from HDFS to the local file system. Once the placeholders have been replaced by the values, the command line can be executed by the worker node. After that the resulting output file is copied back to HDFS to the output location given.
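The placeholder replacement step can be sketched in Python, whose string.Template class happens to use the same ${name} syntax as the command pattern. This is only an illustration of the mapping, not ToMaR's implementation. The pattern is shortened, the local paths are hypothetical stand-ins for the files copied from HDFS, and build_command is a made-up helper.

```python
import shlex
from string import Template
from typing import Dict

# Shortened command pattern from the ToolSpec operation (static parameters omitted)
COMMAND_PATTERN = "image_to_j2k -i ${input} -o ${output}"


def build_command(pattern: str, params: Dict[str, str]) -> str:
    """Replace ${name} placeholders in the pattern with shell-quoted values."""
    quoted = {name: shlex.quote(value) for name, value in params.items()}
    # substitute() raises KeyError if a placeholder has no value
    return Template(pattern).substitute(quoted)


# Hypothetical local copies of the HDFS files on a worker node
cmd = build_command(
    COMMAND_PATTERN,
    {"input": "/tmp/myFile1.tif", "output": "/tmp/myFile1.jp2"},
)
print(cmd)
```

Using substitute rather than safe_substitute is a deliberate choice in this sketch: a missing parameter fails loudly instead of leaving a literal ${output} in the command line.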

Finally Mathilda has got all the migrated JPEG2000 images on HDFS in a fraction of the time it would have taken when run sequentially on her machine.

  • easily take up external tools with a clear mapping between the instructions and the physical invocation of the tool
  • use the SCAPE Toolspec, as well as existing Toolspecs, and its advantage of associating simple keywords with complex command-line patterns
  • no programming skills needed, as the only minimum requirement is to set up the control file

When dealing with large volumes of files, e.g. in the context of file format migration or characterisation tasks, a standalone server often cannot provide sufficient throughput to process the data in a feasible period of time. ToMaR provides a simple and flexible solution to run preservation tools on a Hadoop MapReduce cluster in a scalable fashion.

ToMaR offers the possibility to use existing command-line tools in Hadoop's distributed environment very similarly to a desktop computer. By utilizing SCAPE Tool Specification documents, ToMaR allows users to associate complex command-line patterns with simple keywords, which can be referenced for execution on a computer cluster. ToMaR is a generic MapReduce application which does not require any programming skills.

Check out the following blog posts for further usage scenarios of ToMaR:


Preservation Topics: Preservation Actions, SCAPE
Attachments: Full openjpeg ToolSpec (1.02 KB), ToMaR-image_to_j2k-workflow.png (158.29 KB), ToMaR-overview.png (67.97 KB), logo.png (74.65 KB)
Categories: Planet DigiPres

Upcoming NDSR Symposium “Emerging Trends in Digital Stewardship”: Speaker Announcements

The Signal: Digital Preservation - 14 March 2014 - 1:23pm

The following is a guest post by Jaime McCurry, National Digital Stewardship Resident at the Folger Shakespeare Library.

It’s certainly been an exciting and busy few months for the National Digital Stewardship Residents and although we are well into the final portion of our projects, we’re showing no signs of slowing down.

Residents see time-based media art at the American Art Museum and National Portrait Gallery. Photo courtesy of Emily Reynolds.

In addition to our regularly scheduled programming on The Signal, you can find the residents on the web and elsewhere talking digital stewardship and providing project updates:

  • Julia Blase discusses her project status at the National Security Archive in Forward…March!
  • Heidi Dowding discusses the most recent NDSR Enrichment Session, hosted this month at Dumbarton Oaks.
  • Continuing with the Resident-to-Resident Interview Series, Maureen McCormick Harlow (National Library of Medicine) interviews Emily Reynolds on the specifics of her project at the World Bank.
  • Emily Reynolds recaps a recent NDSR site-visit to the United States Holocaust Memorial Museum.
  • Erica Titkemeyer (Smithsonian Institution) discusses Handling Digital Assets in Time Based Media Art.
  • I’m talking web archiving at the Folger Shakespeare Library.
  • You can catch Lauren Work (PBS) and Julia Blase (National Security Archive) at the Spring CNI meeting later this month.
  • And finally, residents Margo Padilla (MITH), Molly Schwartz (ARL), Erica Titkemeyer (Smithsonian Institution), and Lauren Work (PBS) are New Voices in Digital Curation in April.

Emerging Trends in Digital Stewardship Symposium: Speaker Announcements!

As previously announced, the inaugural cohort of National Digital Stewardship Residents will present a symposium titled “Emerging Trends in Digital Stewardship” on April 8, 2014. This event, hosted by the Library of Congress, IMLS, and the National Library of Medicine, will be held at the National Library of Medicine’s Lister Hill Auditorium and will consist of panel presentations on topics related to digital stewardship.

At this time, we are delighted to release a final program, including guest speakers and panel participants:

Tuesday, April 8, 2014

8:30-9:30         Registration
9:30-9:45         Opening Remarks

  • George Coulbourne and Kris Nelson, Library of Congress

9:45-10:45       BitCurator Demonstration

  • Cal Lee, UNC-Chapel Hill School of Information and Library Science

11:00-Noon     Panel Discussion:  Social Media, Archiving, and Preserving Collaborative Projects

  • Leslie Johnston, Library of Congress
  • Janel Kinlaw, NPR: National Public Radio
  • Laura Wrubel, George Washington University

Noon-1:15       Lunch Break

1:15-2:15         Panel Discussion:  Open Government and Open Data

  • Daniel Schuman, Citizens for Responsibility and Ethics in Washington
  • Jennifer Serventi, National Endowment for the Humanities
  • Nick Shockey, Scholarly Publishing and Academic Resources Coalition

2:45-3:45       Panel Discussion:  Digital Strategies for Public and Non-Profit Institutions

  • Carl Fleischhauer, Library of Congress
  • Eric Johnson, Folger Shakespeare Library
  • Matt Kirschenbaum, Maryland Institute for Technology in the Humanities
  • Kate Murray, Library of Congress
  • Trevor Owens, Library of Congress

3:45             Closing Remarks

We’re thrilled to have such wonderful participants and look forward to sparking some exciting discussions on all things digital stewardship. As a reminder, the symposium is free and open to the public, and pre-registration is strongly encouraged. More information can be found here. We hope to see you there!

Categories: Planet DigiPres

Happy Birthday, Web!

The Signal: Digital Preservation - 13 March 2014 - 2:24pm

This is a guest post by Abbie Grotke, Library of Congress Web Archiving Team Lead and Co-Chair of the National Digital Stewardship Alliance Content Working Group

Yesterday we celebrated the 25th anniversary of the creation of the World Wide Web.

How many of you can remember the first time you saw a website, clicked on a hyperlink, or actually edited an HTML page? My “first web” story is still pretty fresh in my mind: It was probably around October 1993, in D.C. My brother and his friends were fairly tech savvy (they’d already set me up with an email account). We went over to his friend Miles’s house in Dupont Circle to visit, and while there he excitedly showed us this thing called Mosaic. I remember the gray screen and the strange concept of hyperlinks; my brother remembers seeing a short quicktime movie of a dolphin doing a flip.

We were all really excited.

Screenshot from the Election 2000 Web Archive of, captured October 23, 2000.

Flash forward to 2014: Although I vaguely remember life without the web (however did we find out “who that actor is that looks so familiar on the TV right now and what role she played in that movie that is on the tip of my tongue”?), I certainly can’t imagine a life without it in the future. I’m in a job, preserving parts of the Internet, which would not exist had it not been for Tim Berners-Lee 25 years ago. For more on the 25th anniversary of the World Wide Web, check out Pew Research Internet Project’s “The Web at 25”.

As Pew’s handy timeline of the Web shows, a lot has changed since the Internet Archive (followed by national libraries) began preserving the web in 1996. If you haven’t seen this other Web Archives timeline, I encourage you to check it out. Since those early days, the number of organizations archiving the web has grown.

The Library of Congress started its own adventure preserving web content in 2000. For an institution that began in 1800, it certainly counts as a small amount of “Library” time. Although we’re not quite sure what our first archived website was (following the lead of our friends at the British Library), the first websites we crawled are from our Election 2000 Web Archive, and include campaign websites from both George W. Bush and Al Gore, among others.

As you can see from the screenshots, and if you click through to those archived pages, certain things didn’t archive very well. Even something as simple as images wasn’t always captured comprehensively, and the full sites certainly weren’t archived. We’ve spent years, with our partners around the globe, working to make “archival-quality” web archives that include much more than just the text of a site.

We’re all preserving more content than ever, but those charged with preserving it still face challenges: keeping up with the scale of content being generated, handling the legal issues surrounding preservation of websites, and keeping pace with the technologies used on the web (even if we want to preserve it, can we?), as has been discussed before on this blog. We’ve still got a lot of work to do.

Screenshot from the Election 2000 Web Archive of, captured August 3, 2000.

It’s also unclear what researchers of the future will want, and how they will want to use our archives and access the data that we’ve preserved. More researchers are interested in access to the raw data for data mining projects than we ever envisioned when we first started out. The International Internet Preservation Consortium has been reaching out at the last few General Assembly sessions to engage researchers during their “open days,” which have been incredibly interesting as we learn more about research use of our archives.

Twenty-five years in is as good a time as any to reflect on things, whether it’s the founding of the Web or the efforts to preserve the future web. Please feel free to share your stories and thoughts in the comments.

Categories: Planet DigiPres

JHOVE on Github

File Formats Blog - 12 March 2014 - 11:15am

The JHOVE repository on Github is now live. The SourceForge site is still there and holds the documentation. The Github site is a work in progress.

Categories: Planet DigiPres

Preserving News Apps

The Signal: Digital Preservation - 11 March 2014 - 2:01pm

Natl. Press Bld. Newstand. Photo by Harris & Ewing, ca. 1940.

On Sunday, March 2, I had the opportunity to attend an OpenNews Hack Day event at the Newseum in Washington DC, sponsored by Knight-Mozilla OpenNews, PopUp Archive, and the Newseum. The event was held in conjunction with the NICAR (National Institute for Computer-Assisted Reporting) conference on working with datasets and developing interactive applications in journalism.

This was not a hackathon, but what they termed a “designathon,” where the goal was to brainstorm about end-to-end approaches for archiving and preserving data journalism projects. The problem of disappearing applications is well outlined in blog posts by Jacob Harris and Matt Waite, which are part of “The Source Guide to the Care and Feeding of News Apps.” From the introduction to the Guide:

“Any news app that relies on live or updated data needs to be built to handle change gracefully. Even relatively simple interactive features often require special care when it comes time to archive them in a useful way. From launch to retirement and from timeliness to traffic management, we offer a collection of articles that will help you keep your projects happy and healthy until it’s time to say goodbye.”

For some, awareness of the need for digital preservation in this community came from a desire to participate in a wonderful Tumblr called “News Nerd First Projects.” Developers wanted to share their earliest works through this collaborative effort — whether to brag or admit to some embarrassment — and many discovered that their work was missing from the web or still online but irreparably broken. Many were lucky if they had screenshots to document their work. Some found static remnants through the Internet Archive but nothing more.

The event brought together journalists, researchers, software developers and archivists. The group of about 50 attendees broke out into sub-groups, discussing topics including best practices for coding, documenting and packaging up apps, saving and documenting the interactive experience and documenting cultural context. Not too surprisingly, a lot of the conversation centered on best practices for coding, metadata, documentation, packaging and dealing with external dependencies.

There was a discussion about web harvesting, which captures static snapshots of rendered data and the design but not the interaction or the underlying data. Packaging up the underlying databases and tables captures the vital data so that it can be used for research, but loses the design and the interaction. Packaging up the app and the tables together with a documented environment means that it might run again, perhaps in an emulated environment, but if the app requires interactions with open or commercial external web service dependencies, such as for geolocation or map rendering, that functionality is likely lost. Finding the balance of preserving the data and preserving the interactivity is a difficult challenge.

All in all, it’s early days for the conversation in this community, but awareness of the need for digital preservation has already been raised and next steps are planned. I am looking forward to seeing this community move forward in its efforts to keep digital news sources alive.

Categories: Planet DigiPres

JHOVE, continued

File Formats Blog - 10 March 2014 - 11:30pm

There’s been enough encouragement in email and Twitter to my proposal to move JHOVE to Github that I’ll be going ahead with it. Andy Jackson has told me he has some almost-finished work to migrate the CVS history along with the project, so I’m waiting on that for the present. Watch this space for more news.

Tagged: JHOVE, software
Categories: Planet DigiPres