Planet DigiPres

Cookbooks vs. learning books

File Formats Blog - 8 August 2014 - 12:30pm

A contract lead got me to try learning more about a technology which the client needs. Working from an e-book which I already had, I soon was thinking that it’s a really confused, ad hoc library. But then I remembered having this feeling before, when the problem was really the book. I looked for websites on the software and found one that explained it much better. The e-book had a lot of errors, using JavaScript terminology incorrectly and its own terminology inconsistently.

A feeling came over me, the same horrified realization the translator of To Serve Man had: “It’s a cookbook!” It wasn’t designed to let you learn how the software works, but to get you turning out code as quickly as possible. There are too many of these books, designed for developers who think that understanding the concepts is a waste of time. Or maybe the fault belongs less to the developers than to managers who want results immediately.

A book that introduces a programming language or API needs to start with the lay of the land. What are its basic concepts? How is it different from other approaches? It has to get the terminology straight. If it has functions, objects, classes, properties, and attributes, make it clear what each one is. There should be examples from the start, so you aren’t teaching arid theory, but you need to follow up with an explanation.

If you’re writing an introduction to Java, your “Hello world” example probably has a class, a main() function, and some code to write to System.out. You should at least introduce the concepts of classes, functions, and importing. That’s not the place to give all the details; the best way to teach a new idea is to give a simple version at first, then come back in more depth later. But if all you say is “Compile and run this code, and look, you’ve got output!” then you aren’t doing your job. You need to present the basic ideas simply and clearly, promise more information later, and keep the promise.

Don’t jump into complicated boilerplate before you’ve covered the elements it’s made of. The point of the examples should be to teach the reader how to use the technology, not to provide recipes for specific problems. The problem the developer has to solve is rarely going to be the one in the book. They can tinker with the examples until they fit their own problem, not really understanding them, but that usually results in complicated, inefficient, unmaintainable code.

Expert developers “steal” code too, but we know how it works, so we can take it apart and put it back together in a way that really suits the problem. The books we can learn from are the ones that put the “how it works” first. Cookbooks are useful too, but we need them after we’ve learned the tech, not when we’re trying to figure it out.

Tagged: books, writing
Categories: Planet DigiPres

Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress

The Signal: Digital Preservation - 6 August 2014 - 2:18pm

The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection.

The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the passions the profession shares with the gaming community. Video game enthusiasts have long been fascinated by unreleased games and “lost levels,” gameplay levels which are partially developed but left out of the final release of the game. Discovery is, of course, a key component to gameplay. Players revel in the thrill of unlocking the secret door or uncovering Easter eggs hidden in the game by developers. In many ways, the fascination with obtaining access to unreleased games or levels brings this thrill of discovery into the real world. In a recent article written for The Atlantic, Heidi Kemps discusses the joy in obtaining online access to playable lost levels from the 1992 Sega Genesis game, Sonic The Hedgehog 2, reveling in the fact that access to these levels gave her a glimpse into how this beloved game was made.

Original source disc as it was received by the Library of Congress.

Original source disc as it was received by the Library of Congress.

Since 2006, the Moving Image section of the Library of Congress has served as the custodial unit for video games. In this capacity, we receive roughly 400 video games per year through the Copyright registration process, about 99% of which are physically published console games. In addition to the games themselves we sometimes receive ancillary materials, such as printed descriptions of the game, DVDs or VHS cassettes featuring excerpts of gameplay, or the occasional printed source code excerpt. These materials are useful, primarily for their contextual value, in helping to tell the story of video game development in this country and are retained along with the games in the collection.

Several months ago, while performing an inventory of recently acquired video games, I happened upon a DVD-R labeled Duke Nukem: Critical Mass (PSP). My first assumption was that the disc, like so many others we have received, was a DVD-R of gameplay. However, a line of text on the Copyright database record for the item intrigued me. It reads: Authorship: Entire video game; computer code; artwork; and music. I placed the disc into my computer’s DVD drive to discover that the DVD-R did not contain video, but instead a file directory, including every asset used to make up the game in a wide variety of proprietary formats. Upon further research, I discovered that the Playstation Portable version of Duke Nukem: Critical Mass was never actually released commercially and was in fact a very different beast than the Nintendo DS version of the game which did see release. I realized then that in my computer was the source disc used to author the UMD for an unreleased PlayStation Portable game. I could feel the lump in my throat. I felt as though I had solved the wizard’s riddle and unlocked the secret door.

Excerpt of code from boot.bin including game text.

Excerpt of code from boot.bin including game text.

The first challenge involved finding a way to access the proprietary Sony file formats contained within the disc, including, but not limited to, graphics files in .gim format and audio files in .AT3 format. I enlisted the aid of Packard Campus Software Developer Matt Derby and we were able to pull the files off of the disc and get a clearer sense of the file structure contained within. Through some research on various PSP homebrew sites we discovered Noesis, a program that would allow us to access the .gim and .gmo files which contain the 3D models and textures used to create the game’s characters and 3D environments. With this program we were able to view a complete 3D view of Duke Nukem himself, soaring through the air on his jetpack and a pre-composite 3D model of one of the game’s nemeses, the Pig Cops. Additionally, we employed Mediacoder and VLC in order to convert the Sony .AT3 (ATRAC3) audio files to MP3 in order to have access to the game’s many music cues.


3D model for Duke Nukem equipped with jetpack. View an animated gif of the model here.

Perhaps the most exciting discovery came when we used a hex editor to access the ASCII text held in the boot.bin folder in the disc’s system directory. Here we located the full text and credit information for the game along with a large chunk of un-obfuscated software code. However, much of what is contained in this folder was presented as compiled binaries. It is my hope that access to both the compiled binaries and ASCII code will allow us to explore future preservation options for video games. Such information becomes even more vital in the case of games such as this Duke Nukem title which were never released for public consumption. In many ways, this source disc can serve as an exemplary case as we work to define preferred format requirements for software received by the Library of Congress. Ultimately, I feel that access to the game assets and source code will prove to be invaluable both to researchers who are interested in game design and mechanics and to any preservation efforts the Library may undertake.

Providing access to the disc’s content to researchers will, unfortunately, remain a challenge. As mentioned above, it was difficult enough for Library of Congress staff to view the proprietary formats found on the disc before seeking help from the homebrew community. The legal and logistical hurdles related to providing access to licensed software will continue to present themselves as we move forward but I hope that increased focus on the tremendous research value of such digital assets will allow for these items to be more accessible in the future. For now the assets and code will be stored in our digital archive at the Packard Campus in Culpeper and the physical disc will be stored in temperature-controlled vaults.

The source disc for the PSP version of Duke Nukem: Critical Mass stands out in the video game collection of the Library of Congress as a true digital rarity. In Doug Reside’s recent article “File Not Found: Rarity in the Age of Digital Plenty” (pdf), he explores the notion of source code as manuscript and the concept of digital palimpsests that are created through the various layers that make up a Photoshop document or which are present in the various saved “layers” of a Microsoft Word document. The ability to view the pre-compiled assets for this unreleased game provides a similar opportunity to view the game as a work-in-progress, or at the very least to see the inner workings and multiple layers of a work of software beyond what is presented to us in the final, published version. In my mind, receiving the source disc for an unreleased game directly from the developer is analogous to receiving the original camera negative for an unreleased film, along with all of the separate production elements used to make the film. The disc is a valuable evidentiary artifact and I hope we will see more of its kind as we continue to define and develop our software preservation efforts.

The staff of the Moving Image section would love the opportunity to work with more source materials for games and I hope that game developers who are interested in preserving their legacy will be willing to submit these kinds of materials to us in the future. Though source discs are not currently a requirement for copyright, they are absolutely invaluable in contributing to our efforts towards stewardship and long term access to the documentation of these creative works.

Special thanks to Matt Derby for his assistance with this project and input for this post.

Categories: Planet DigiPres

National Geospatial Advisory Committee: The Shape of Geo to Come

The Signal: Digital Preservation - 5 August 2014 - 1:24pm

World Map 1689 — No. 1 from user caveman_92223 on Flickr.

Back in late June I attended the National Geospatial Advisory Committee (NGAC) meeting here in DC. NGAC is a Federal Advisory Committee sponsored by the Department of the Interior under the Federal Advisory Committee Act. The committee is composed of (mostly) non-federal representatives from all sectors of the geospatial community and features very high profile participants. For example, ESRI founder Jack Dangermond, the 222nd richest American, has been a member since the committee was first chartered in 2008 (his term has since expired). Current committee members include the creator of Google Earth (Michael Jones) and the founder of OpenStreetMap (Steve Coast).

So what is the committee interested in, and how does it coincide with what the digital stewardship community is interested in? There are number of noteworthy points of intersection:

  • In late March of this year the FGDC released the “National Geospatial Data Asset Management Plan – a Portfolio Management Implementation Plan for the OMB Circular A–16” (pdf). The plan “lays out a framework and processes for managing Federal NGDAs [National Geospatial Data Assets] as a single Federal Geospatial Portfolio in accordance with OMB policy and Administration direction. In addition, the Plan describes the actions to be taken to enable and fulfill the supporting management, reporting, and priority-setting requirements in order to maximize the investments in, and reliability and use of, Federal geospatial assets.”
  • Driven by the release of the NGDA Management Plan, a baseline assessment of the “maturity” of various federal geospatial data assets is currently under way. This includes identifying dataset managers, identifying the sources of data (fed only/fed-state partnerships/consortium/etc.) and determining the maturity level of the datasets across a variety of criteria. With that in mind, several “maturity models” and reports were identified that might prove useful for future work in this area. For example, the state of Utah AGRC has developed a one-page GIS Data Maturity Assessment; the American Geophysical Union has a maturity model for assessing the completeness of climate data records (behind a paywall, unfortunately); the National States Geographic Information Council has a Geospatial Maturity Assessment; and the FGDC has “NGDA Dataset Maturity Annual Assessment Survey and Tool” that is being developed as part of their baseline assessment These maturity models have a lot in common with the NDSA Levels of Preservation work.
  • Lots of discussion on a pair of reports on big data and geolocation privacy. The first, Big Data – Seizing Opportunities, Preserving Values Report from the Executive Office of the President, acknowledges the benefits of data but also notes that “big data technologies also raise challenging questions about how best to protect privacy and other values in a world where data collection will be increasingly ubiquitous, multidimensional, and permanent.” The second, the PCast report on Big Data and Privacy (PCAST is the “President’s Council of Advisors on Science and Technology” and the report is officially called “Big Data: A Technology Perspective”) “begins by exploring the changing nature of privacy as computing technology has advanced and big data has come to the forefront.  It proceeds by identifying the sources of these data, the utility of these data — including new data analytics enabled by data mining and data fusion — and the privacy challenges big data poses in a world where technologies for re-identification often outpace privacy-preserving de-identification capabilities, and where it is increasingly hard to identify privacy-sensitive information at the time of its collection.” The importance of both of these reports to future library and archive collection and access policies regarding data can not be understated.
  • The Spatial Data Transfer Standard is being voted on for withdrawal as an FGDC-endorsed standard. FGDC maintenance authority agencies were asked to review the relevance of the SDTS, and they responded that the SDTS is no longer used by their agencies. There’s a Federal Register link to the proposal. The Geography Markup Language (GML), which the FGDC has endorsed, now satisfies the encoding requirements that SDTS once provided. NARA revised their transfer guidance for geospatial information in April 2014 to make SDTS files “acceptable for imminent transfer formats” but it’s clear that they’ve already moved away from them.  As a side note, GeoRSS is coming up for a vote soon to become an FGDC-endorsed standard.
  • The Office of Management and Budget is reevaluating the geospatial professional classification. The geospatial community has an issue similar to that being faced by the library and archives community, in that the jobs are increasingly information technology jobs but are not necessarily classified as such. This coincides with efforts to reevaluate the federal government library position description.
  • The Federal Geographic Data Committee is working with federal partners to make previously-classified datasets available to the public.  These datasets have been prepared as part of the “HSIP Gold” program. HSIP Gold is a compilation of over 450 geospatial datasets of U.S. domestic infrastructure features that have been assembled from a variety of Federal agencies and commercial sources. The work of assembling HSIP Gold has been tasked to the Homeland Infrastructure Foundation-Level Data (HIFLD) Working Group (say it as “high field”). Not all of the data in HSIP Gold is classified, so they are working to make some of the unclassified portions available to the public.

The next meeting of the NGAC is scheduled for September 23 and 24 in Shepherdstown, WV. The meetings are public.

Categories: Planet DigiPres

Making Scanned Content Accessible Using Full-text Search and OCR

The Signal: Digital Preservation - 4 August 2014 - 12:48pm

This following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would with perfect text but since the content in question is otherwise completely unsearchable anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:


This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

  • A plain text version for the search engine, which does not understand HTML markup
  • A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as such Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents — i.e. complete records — which are used to build an inverted index — essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TD-IDF) works).

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.

(The django-haystack Solr grouped search backend with Field Collapsing support used on has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:


The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:


2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

adams080414image7Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page.

Once the image has been loaded, the original text is replaced with the image:


4. Finally, we add a partially transparent overlay over each highlighted word:


  • The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
  • You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load this design improves performance because a given image segment can be reused for multiple results on the same page(rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
  • This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book highlighting for the current page will be immediately updated without having to wait for the server to respond.


Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

  • The process described above allows the OCR process to be improved considerably. This provides plenty of room to improve results with technical improvements such as more sophisticated image processing, OCR engine training, and workflow systems incorporating human review and correction.
  • For collections such as WDL’s which include older items OCR accuracy is reduced by the condition of the materials and typographic conventions like the long s (ſ) or ligatures which are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
  • Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
  • This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.
Categories: Planet DigiPres

Computational Linguistics & Social Media Data: An Interview with Bryan Routledge

The Signal: Digital Preservation - 1 August 2014 - 1:15pm
Bryan Routledge, Associate Professor of Finance Tepper School of Business Carnegie Mellon University.

Bryan Routledge, Associate Professor of Finance, Tepper School of Business, Carnegie Mellon University.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that utilize a computational linguistic lens to analyze vast quantities of social media data. You can read the article on word choice used in online restaurant reviews here. The article about using Twitter as a predictive tool as compared with traditional public opinion polls here (PDF).

Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.”  Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?

Bryan: Noah Smith, my colleague in the school of computer science at CMU, runs that lab.  He is kind enough to let me hang out over there.  The research we are working on looks at the connection between text and social science (e.g., economics, finance).  The idea is that looking at text through the lens of a forecasting problem — the statistical model between text and some social-science measured variable — gives insight into both the language and social parts.  Online and easily accessed text brings new data to old questions in economics.  More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text.  What words in corporate 10K annual reports correlate with stock volatility and how that changes over time is cool.


Different metaphors for expensive and inexpensive restaurants in Yelp reviews. From: Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19:4.

Julia: Your work with social media—Yelp and Twitter—are notable for their large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?

Bryan: That is a good question.  What makes it very hard for archivists is that collecting the right data without knowing the research questions is hard.  The usual answer of “keep everything!” is impractical.  Google’s n-gram project is a good illustration.  They summarized a huge volume of books with word counts (two word pairs, …) by time.  This is great for some research.  But not for the more recent statistical models that use sentences and paragraph information.

Julia:  Your background and most of your work is in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people . How do you see financial research being influenced by social media and other born digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?

 in most locations, the word “baby” is neutral -- it suggests neither high nor low price.  Except in the Wall Street area of lower Manhattan where it is associated with higher priced steak.

In Yelp reviews of Manhattan restaurants with “steak” in the menu (an example). Predict the (log) menu item price using the words used to describe the item by location. For example: in most locations, the word “baby” is neutral — it suggests neither high nor low price. Except in the Wall Street area of lower Manhattan where it is associated with higher priced steak.

Bryan:  Finance (and economics) is about the collective behavior of large number of people in markets.  To make research possible you need simple models of individuals.  Getting the right mix of simplicity and realism is age-old and ongoing research in the area.  More data helps.  Macroeconomic data like GDP and stock returns is informative about the aggregate.  Data on, say, individual portfolio choices in 401K plans lets you refine models.  Social media data is this sort of disaggregated data.  We can get a signal, very noisy, about what is behind an individual decision.  Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.

More generally, working across disciplines is interesting and fun.  It is not always “additive.”  The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items).  But if we can combine, for example, decision theory finance with generative text models, we get some cool insights into purposefully drafted documents.

Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.

Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data!  Given that much of the data is valuable to the companies (and privacy, of course), I understand that is not possible.  But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest.  Twitter has encouraged a whole ecosystem that has helped them grow.  Many companies have an API for that purpose that happens to work nicely for academic research.  In general, open access is most preferred in academic settings so that all researchers have access to the same data.  Interesting papers using proprietary access to Facebook are less helpful than Twitter.

Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?

Bryan: This is not my strong suit.  But at a high-level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.”  The “get” varies with the data source (an API).  The “clean” step is just a matter of being careful with special characters and making sure data are lining up into fields right.  If the API is sensible, the “clean” is easy.  We usually store things in a JSON format that is flexible.  This is usually a good format to share data.  The “extract” and “experiment” steps depend on what you are interested in.  Word counts? Phrase counts? Other?  The key is not to jump from “get” to “extract” — storing the data in as raw form as possible makes thing flexible.

Julia:  What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be able to be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?

 Linking Text Sentiment to Public Opinion Time Series. Brendan O'Connor,Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Sentiment on Twitter as compared to Gallup Poll. Appeared in From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Bryan: This is also a great question and also one for which I do not have a great answer.  I do not know a lot about the research in “digital humanities,” but that would be a good place to look.  People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask.  Similarly, economic data might provide some hints.  Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP).  The hard part for libraries is figuring out which parts to keep.  Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.

Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?

Bryan: There are many limitations, of course.  Twitter and Yelp are not just providing a window into things, they are changing the way the world works.  “Big data” is not just about larger sample sizes of draws from a fixed distribution.  Things are non-stationary.  (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter).  Similarly, the data we see in social media is not a representative sample of society cross section.  But both of these are the sort of things good modeling – statistical, economic – should, and do, aim to capture.  The possibilities of all this new data are exciting.  Language is a rich source of data with challenging models needed to turn it into useful information.  More generally, social media is an integral part of many economic and social transactions.  Capturing that in a tractable model makes for an interesting research agenda.

Categories: Planet DigiPres

Digital Preservation 2014: It’s a Thing

The Signal: Digital Preservation - 30 July 2014 - 12:56pm

“Digital preservation makes headlines now, seemingly routinely. And the work performed by the community gathered here is the bedrock underlying such high profile endeavors.” – Matt Kirschenbaum

 Erin Engle.

The registration table at Digital Preservation 2014. Photo credit: Erin Engle.

The annual Digital Preservation meeting, held each summer in Washington, DC, brings together experts in academia, government and the private and non-profit sectors to celebrate key work and share the latest developments, guidelines, best practices and standards in digital preservation.

Digital Preservation 2014, held July 22-24,  marked the 13th major meeting hosted by NDIIPP in support of the broad community of digital preservation practitioners (NDIIPP held two meetings a year from 2005-2007), and it was certainly the largest, if not the best. Starting with the first combined NDIIPP/National Digital Stewardship Alliance meeting in 2011, the annual meeting has rapidly evolved to welcome an ever-expanding group of practitioners, ranging from students to policy-makers to computer scientists to academic researchers. Over 300 people attended this year’s meeting.

“People don’t need drills; they need holes,” stated NDSA Coordinating Committee chairman Micah Altman, the Director of Research at the Massachusetts Institute of Technology Libraries,  in an analogy to digital preservation in his opening talk. As he went on to explain, no one needs digital preservation for its own sake, but it’s essential to support the rule of law, a cumulative evidence base, national heritage, a strategic information reserve, and to communicate to future generations. It’s these challenges that face the current generation of digital stewardship practitioners, many of which are addressed in the 2015 National Agenda for Digital Stewardship, which Altman previewed during his talk (and which will appear later this fall).

 Erin Engle.

A breakout session at Digital Preservation 2014. Photo credit: Erin Engle.

One of those challenges is the preservation of the software record, which was eloquently illuminated by Matt Kirschenbaum, the Associate Director of the Maryland Institute for Technology in the Humanities, during his stellar talk, “Software, It’s a Thing.” Kirschenbaum ranged widely across computer history, art, archeology and pop culture with a number of essential insights. One of the more piquant was his sorting of software into different categories of “things” (software as asset, package, shrinkwrap, notation/score, object, craft, epigraphy, clickwrap, hardware, social media, background, paper trail, service, big data), each of which with its own characteristics. As Kirschenbaum eloquently noted, software is many different “things,” and we’ll need to adjust our future approaches to preservation accordingly.

Associate Professor at the New School Shannon Mattern took yet another refreshing approach, discussing the aesthetics of creative destruction and the challenges of preserving ephemeral digital art. As she noted, “by pushing certain protocols to their extreme, or highlighting snafus and ‘limit cases’ these artists’ work often brings into stark relief the conventions of preservation practice, and poses potential creative new directions for that work.”

 Erin Engle.

Stephen Abrams, Martin Klein, Jimmy Lin and Michael Nelson during the “Web Archiving” panel. Photo credit: Erin Engle.

These three presentations on the morning of the first day provided a thoughtful intellectual substrate upon which a huge variety of digital preservation tools, services, practices and approaches were elaborated over the following days. As befits a meeting that convenes disparate organizations and interests, collaboration and community were big topics of discussion.

A Tuesday afternoon panel on “Community Approaches to Digital Stewardship” brought together a quartet of practitioners who are working collaboratively to advance digital preservation practice across a range of organizations and structures, including small institutions (the POWRR project); data stewards (the Research Data Alliance); academia (the Academic Preservation Trust); and institutional consortiums (the Five College Consortium).

Later, on the second day, a well-received panel on the “Future of Web Archiving” displayed a number of clever collaborative approaches to capturing the digital materials from the web, including updates on the Memento project and Warcbase, an open-source platform for managing web archives.

 Erin Engle.

CurateCamp: Digital Culture. Photo credit: Erin Engle.

In between there were plenary sessions on stewarding space and research data, and over three dozen lightning talks, posters and breakout sessions covering everything from digital repositories for museum collections to a Brazilian digital preservation network to the debut of a new digital preservation questions and answers tool. Additionally, a CurateCamp unconference on the topic of “Digital Culture” was held on a third day at Catholic University, thanks to the support of the CUA Department of Library and Information Science.

The main meeting closed with a thought-provoking presentation from artist and digital conservator Dragan Espenschied. Espenschied utilized emulation and other novel tools to demonstrate some of the challenges related to presenting works authentically, in particular works from the early web and those dependent on a range of web services. Espenschied, also the Digital Conservator at Rhizome, has an ongoing project, One Terabyte of Kilobyte Age, that explores the material captured in the Geocities special collection. Associated with that project is a Tumblr he created that automatically generates a new screenshot from the Geocities archive collection every 20 minutes.

Web history, data stewardship, digital repositories; for digital preservation practitioners it was nerd heaven. Digital preservation 2014, it’s a thing. Now on to 2015!

Categories: Planet DigiPres

Art is Long, Life is Short: the XFR Collective Helps Artists Preserve Magnetic and Digital Works

The Signal: Digital Preservation - 29 July 2014 - 2:44pm

XFR STN (“Transfer Station”) is a grass-roots digitization and digital-preservation project that arose as a response from the New York arts community to rescue creative works off of aging or obsolete audiovisual formats and media. The digital files are stored by the Library of Congress’s NDIIPP partner the Internet Archive and accessible for free online. At the recent Digital Preservation 2014 conference, the NDSA gave XFR STN the NDSA Innovation Award. Last month, members of the XFR collective — Rebecca Fraimow, Kristin MacDonough, Andrea Callard and Julia Kim — answered a few questions for the Signal.

"VHS 1" from XFR Collective.

“VHS 1,” courtesy of Walter Forsberg.

Mike: Can you describe the challenges the XFR Collective faced in its formation?

XFR: Last summer, the New Museum hosted a groundbreaking exhibit called XFR STN.  Initiated by the artist collective Colab and the resulting MWF Video Club, the exhibit was a major success. By the end of the exhibition over 700 videos had been digitized with many available online through the Internet Archive.

It was clear  for all of us involved that there was a real demand for these services, that there are many under-served artists who were having difficulty preserving and accessing their own media. Many of the people involved with the exhibit became passionate about continuing the service of preserving obsolete magnetic and digital media for artists.  We wanted to offer a long-term, non-commercial, grassroots solution.

Using the experience of working on XFR STN as a jumping-off point, we began developing XFR Collective as a separate nonprofit initiative to serve the need that we saw.  Over the course of our development, we’ve definitely faced — and are still facing — a number of challenges in order to make ourselves effective and sustainable.

"VHS 3" by XFR Collective.

“VHS 2,” courtesy of Walter Forsberg.

Perhaps the biggest challenge has simply been deciding what form XFR Collective was going to take.  We started out with a bunch of borrowed equipment and a lot of enthusiasm, so the one thing we knew we could do was digitize, but we had to sit down and really think about things like organizational structure, sustainable pricing for our services, and the convoluted process of becoming a non-profit.

Eventually, we settled on a membership-based structure in order to be able to keep our costs as low as possible.  A lot of how we’re operating is still very experimental — this summer wraps up our six-month test period, during which we limited ourselves to working with only a small number of partners to allow us to figure out what our capacity was and how we could design our projects in the future.

We’ve got a number of challenges still ahead of us — finding a permanent home is a big one — and we still feel like we’re only just getting started, in terms of what we can do for the community of artists who use our services.  It’s going to be interesting for all of us to see how we develop.  We’ve started thinking of ourselves as kind of a grassroots preservation test kitchen. We’ll try almost any kind of project once to see if it works!

Mike: Where are the digital files stored? Who maintains them?

XFR: Our digital files will be stored with the membership organizations and uploaded to the Internet Archive for access and for long-term open-source preservation.  This is an important distinction that may confuse some people: XFR Collective is not an archive.

While we advocate and educate about best practices, we will not hold any of the digital files ourselves; we just don’t have the resources to maintain long-term archival storage.  We encourage material to go onto the Internet Archive because long-term accessibility is part of our mission and because the Internet Archive has the server space to store uncompressed and lossless files as well as access files.  That way if something happens to the storage that our partners are using for their own files, they can always re-download them.  But we can’t take responsibility for those files ourselves. We’re a service point, not a storage repository.

"VHS 2" by XFR Collective

“VHS 3,” courtesy of Walter Forsberg.

Mike: Regarding public access as a means of long-term preservation and sustainability, how do you address copyrighted works?

XFR: This is a great question that confounds a lot of our collaborators initially.  Access-as-preservation creates a lot of intellectual property concerns.  Still, we’re a very small organization, so we can afford to take more risks than a more high-profile institution.  We don’t delve too deeply into the area of copyright; our concern is with the survival of the material.  If someone has a complaint, the Internet Archive will give us a warning in time to re-download the content and then remove it. But so far we haven’t had any complaints.

Mike: What open access tools and resources do you use?

XFR: The Internet Archive itself is something of an open access resource and we’re seeing it used more and more frequently as a kind of accessory to preservation, which is fantastic.  Obviously it’s not the only solution, and you wouldn’t want to rely on that alone any more than you would any kind of cloud storage, but it’s great to have a non-commercial option for streaming and storage that has its own archival mission and that’s open to literally anyone and anything.

Mike:  If anyone is considering a potential collaboration to digitally preserve audiovisual artwork, what can they learn from the experiences of the XFR Collective?

XFR: Don’t be afraid to experiment!  A lot of what we’ve accomplished is just by saying to ourselves that we have to start doing something, and then jumping in and doing it.  We’ve had to be very flexible. A lot of the time we’ll decide something as a set proposition and then find ourselves changing it as soon as we’ve actually talked with our partners and understood their needs.  We’re evolving all the time but that’s part of what makes the work we do so exciting.

We’ve also had a lot of help and we couldn’t have done any of what we’ve accomplished without support and advice from a wide network of individuals, ranging from the amazing team at XFR STN to video archivists across New York City.  None of these collaborations happen in a vacuum, so make friendships, make partnerships, and don’t be nervous about asking for advice.  There are a lot of people out there who care about video preservation and would love to see more initiatives out there working to make it happen.

Categories: Planet DigiPres

The MH17 Crash and Selective Web Archiving

The Signal: Digital Preservation - 28 July 2014 - 4:34pm

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

//">Internet Archive Wayback Machine</a>.

Screenshot of 17 July 2014 15:57 UTC archive snapshot of deleted VKontakte Strelkov blog post regarding downed aircraft, on Internet
Archive Wayback Machine

The Internet Archive Wayback Machine has been mentioned in several news articles within the last week  (see here, here and here) for having archived a since-deleted blog post from a Ukrainian separatist leader touting his shooting down a military transport plane which may have actually been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

Categories: Planet DigiPres

Song identification on GitHub

File Formats Blog - 24 July 2014 - 11:42am

The code for my song identification “nichesourcing” web application is now available on GitHub. It’s currently aimed at one project, as I’d mentioned in my earlier post, but has potential for wide use. It allows the following:

  • Users can register as editors or contributors. Only registered users have access.
  • Editors can post recording clips with short descriptions.
  • Contributors can view the list of clips and enter reports on them.
  • Reports specify type of sound, participants, song titles, and instruments. Contributors can enter as much or as little information as they’re able to.
  • Editors can modify clip metadata, delete clips, and delete reports.
  • Contributors and editors can view reports.
  • More features are planned, including an administrator role.

This is my first PHP coding project of any substance, so I’m willing to accept comments about my overall coding approach. It’s inevitable that, to some degree, I’m writing PHP as if it’s Java. If there are any standard practices or patterns I’m overlooking, let me know.

Tagged: music, software, songid
Categories: Planet DigiPres

Understanding the Participatory Culture of the Web: An Interview with Henry Jenkins

The Signal: Digital Preservation - 24 July 2014 - 10:51am
Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, a joint professorship at the USC Annenberg School for Communication and the USC School of Cinematic Arts.

Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, with USC Annenberg School for Communication and the USC School of Cinematic Arts.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects related to CurateCamp Digital Culture. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Anyone who has ever liked a TV show’s page on Facebook or proudly sported a Quidditch t-shirt knows that being a fan goes beyond the screen or page.  With the growth of countless blogs, tweets, Tumblr gifsets, Youtube videos, Instagram hashtags, fanart sites and fanfiction sites, accessing fan culture online has never been easier. Whether understood as a vernacular web or as the blossoming of a participatory culture individuals across the world are using the web to respond to and communicate their own stories.

As part of the NDSA Insights interview series, I’m delighted to interview Henry Jenkins, professor at the USC Annenberg School for Communication and self-proclaimed Aca-Fan. He is the author of one of the foundational works exploring fan cultures, “Textual Poachers: Television Fans and Participatory Culture,”  as well as a range of other books, including “Convergence Culture: Where Old and New Media Collide,” and most recently the co-author (with Sam Ford and Joshua Green) “Spreadable Media: Creating Value and Meaning in a Networked Culture.” He blogs at Confessions of an Aca-Fan.

Julia: You state on your website that your time at MIT, “studying culture within one of the world’s leading technical institutions” gave you “some distinctive insights into the ways that culture and technology are reshaping before our very eyes.”  How so? What are some of the changes you’ve observed, from a technical perspective and/or a cultural one?

Henry: MIT was one of the earliest hubs in the Internet. When I arrived there in 1989, Project Athena was in its prime; the MIT Media Lab was in its first half decade and I was part of a now legendary Narrative Intelligence Reading Group (PDF) which brought together some of the smartest of their graduate students and a range of people interested in new media from across Cambridge; many of the key thinkers of early network culture were regular speakers at MIT; and my students were hatching ideas that would become the basis for a range of Silicon Valley start ups. And it quickly became clear to me that I had a ringside seat for some of the biggest transfomations in the media landscape in the past century, all the more so because through my classes, the students were helping me to make connections between my work on fandom as a participatory culture and a wide array of emerging digital practices (from texting to game mods).

Kresge Auditorium, MIT, Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey, Library of Congress Prints and Photographs Division,

Kresge Auditorium, MIT, Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey, Library of Congress Prints and Photographs Division,

Studying games made sense at MIT because “Spacewar,” one of the first known uses of computers for gaming, had been created by the MIT Model Railroad club in the early 1960s. I found myself helping to program a series that the MIT Women’s Studies Program was running on gender and cyberspace, from which the materials for my book, “From Barbie to Mortal Kombat” emerged. Later, I would spend more than a decade as the housemaster of an MIT dorm, Senior House, which is known to be one of the most culturally creative at the Institute.

Through this, I was among the first outside of Harvard to get a Facebook account; I watched students experimenting with podcasting, video-sharing and file-sharing. Having MIT after my name opened doors at all of the major digital companies and so I was able to go behind the scenes as some of these new technologies were developing, and also see how they were being used by my students in their everyday lives.

So, through the years, my job was to place these developments in their historical and cultural contexts — often literally as Media Lab students would come to me for advice on their dissertation projects, but also more broadly as I wrote about these developments through Technology Review, the publication for MIT’s alumni network. It was there where many of the ideas that would form “Convergence Culture” were first shared with my readers. And the students that came through the Comparative Media Studies graduate program have been at ground zero for some of the key developments in the creative industries in recent years — from the Veronica Mars Kickstarter campaign to the community building practices of Etsy, from key developments in the games and advertising industry to cutting edge experiments in transmedia storytelling. The irony is that I had been really reluctant about accepting the MIT job because I suffer from fairly serious math phobia. :-)

Today, I enjoy another extraordinary vantage point as a faculty member at USC, who is embedded in both the Annenberg School of Communication and Journalism and the Cinema School, and thus positioned to watch how Hollywood and American journalism are responding to the changes that networked communication have forced upon them. I am able to work with future filmmakers who are trying to grasp a shift from a focus on individual stories to an emphasis on world-building, journalists who are trying to imagine new relationships with their publics, and activists who are seeking to make change by any media necessary.

Julia: Much of your work has focused on reframing the media audience as active and creative participants in creating media, rather than passive consumers.  You’ve critiqued use of the terms “viral” and “memes” to describe  internet phenomena as “stripping aside the concept of human agency,” and that the biological language “confuses the actual power relations between producers, properties, brands and consumers.” Can you unpack some of your critiques for us? What is at stake?

Henry: At the core of “Spreadable Media” is a shift in how media travels across the culture. On the one hand, there is distribution as we have traditionally understood it in the era of mass media where content flows in patterns regulated by decisions made by major corporations who control what we see, when we see it and under what conditions. On the other hand, there is circulation, a hybrid system, still shaped top-down by corporate players, but also bottom-up by networks of everyday people, who are seeking to move media that is meaningful to them across their social networks, and will take media where they want it when they want it through means both legal and illegal. The shift towards a circulation-based model for media access is disrupting and transforming many of our media-related practices, and it is not explained well by a model which relies so heavily on metaphors of infection and assumptions of irrationality.

The idea of viral media is a way that the broadcasters hold onto the illusion of their power to set the media agenda at a time when that power is undergoing a crisis. They are the ones who make rational calculations, able to design a killer virus which infects the masses, so they construct making something go viral as either arcane knowledge that can be sold at a price from those in the know or as something that nobody understands, “It just went viral!” But, in fact, we are seeing people, collectively and individually, make conscious decisions about what media to pass to which networks for what purposes with what messages attached through which media channels and we are seeing activist groups, religious groups, indie media producers, educators and fans make savvy decisions about how to get their messages out through networked communications.

Julia: Cases like the Harry Potter Alliance suggest the range of ways that fan cultures on the web function as a significant cultural and political force. Given the significance of fandom, what kinds of records of their online communities do you think will be necessary in the future for us to understand their impact? Said differently, what kinds of records do you think cultural heritage organizations should be collecting to support the study of these communities now and into the future?

Henry: This is a really interesting question. My colleague, Abigail De Kosnik at UC-Berkeley, is finishing up a book right now which traces the history of the fan community’s efforts to archive their own creative output over this period, which has been especially precarious, since we’ve seen some of the major corporations which fans have used to spread their cultural output to each other go out of business and take their archives away without warning or change their user policies in ways that forced massive numbers of people to take down their content.

Image of Paper Print Films in Library of Congress collection.

Image of Paper Print Films in Library of Congress collection. Jenkins notes this collection of prints likely makes it easier to write the history of the first decade of American cinema than to write the history of the first decade of the web.

The reality is that it is probably already easier to write the history of the first decade of American cinema, because of the paper print collection at the Library of Congress, than it is to write the history of the first decade of the web. For that reason, there has been surprisingly little historical research into fandom — even though some of the communication practices that fans use today go back to the publication practices of the Amateur Press Association in the mid-19th century. And even recently, major collections of fan-produced materials have been shunted from library to archive with few in your realm recognizing the value of what these collections contain.

Put simply, many of the roots of today’s more participatory culture can be traced back to fan practices over the last century. Fans have been amongst the leading innovators in terms of the cultural uses of new media. But collecting this material is going to be difficult: fandom is a dispersed but networked community which does not work through traditional organizations; there are no gatekeepers (and few recordkeepers) in fandom, and the scale of fan production — hundreds of thousands if not millions of new works every year — dwarfs that of commercial publishing. And that’s to focus only on fan fiction and would does not even touch the new kinds of fan activism that we are documenting for my forthcoming book, By Any Media Necessary. So, there is an urgent need to archive some of these materials, but the mechanisms for gathering and appraising them are far from clear.

Julia: Your New Media Literacy project aims in part to “provide adults and youth with the opportunity to develop the skills, knowledge, ethical framework and self-confidence needed to be full participants in the cultural changes which are taking place in response to the influx of new media technologies, and to explore the transformations and possibilities afforded by these technologies to reshape education.” In one of your pilot programs, for instance, students studied “Moby-Dick” by updating the novel’s Wikipedia page. Can you tell us a little more about this project? What are some of your goals? Further, what opportunities do you think libraries have to enable this kind of learning?

Henry: We documented this project through our book, “Reading in a Participatory Culture,” and through a free online project, Flows of Reading. It was inspired by the work of Ricardo Pitts-Wiley, the head of the Mixed Magic Theater in Rhode Island, who was spending time going into prisons to get young people to read “Moby-Dick” by getting them to rewrite it, imagining who these characters would be and what issues they would be confronting if they were part of the cocaine trade in the 21st century as opposed to the whaling trade in the 19th century. This resonated with the work I have been doing on fan rewriting and fan remixing practices, as well as what we know about, for example, the ways hip hop artists sample and build on each other’s work.

So, we developed a curriculum which brought together Melville’s own writing and reading practices (as the master mash-up artist of his time) with Pitts-Wiley’s process in developing a stage play that was inspired by his work with the incarcerated youth and with a focus on the place of remix in contemporary culture. We wanted to give young people tools to think ethically and meaningfully about how culture is actually produced and to give teachers a language to connect the study of literature with contemporary cultural practices. Above all, we wanted to help students learn to engage with literary texts creatively as well as critically.

We think libraries can be valuable partners in such a venture, all the more so as regimes of standardized testing make it hard for teachers to bring complex 19th century novels like “Moby-Dick” into their classes or focus student attention on the process and cultural context of reading and writing as literacy practices. Doing so requires librarians to think of themselves not only as curators of physical collections but as mentors and coaches who help students confront the larger resources and practices opened up to them through networked communication. I’ve found librarians and library organizations to be vital partners in this work through the years.

Julia: Your latest book is on the topic of “spreadable media,” arguing that “if it doesn’t spread, it’s dead.”  In a nutshell, how would you define the term “spreadable media”?

Henry:  I talked about this a little above, but let me elaborate. We are proposing spreadable media as an alternative to viral media in order to explain how media content travels across a culture in an age of Facebook, Twitter, YouTube, Reddit, Tumblr, etc. The term emphasizes the act of spreading and the choices which get made as people appraise media content and decide what is worth sharing with the people they know. It places these acts of circulation in a cultural context rather than a purely technological one. At the same time, the word is intended to contrast with older models of “stickiness,” which work on the assumption that value is created by locking down the flow of content and forcing everyone who wants your media to come to your carefully regulated site. This assumes a kind of scarcity where we know what we want and we are willing to deal with content monopolies in order to get it.

But, the reality is that we have more media available to us today that we can process: we count on trusted curators — primarily others in our social networks but also potentially those in your profession — to call media to our attention and the media needs to be able to move where the conversations are taking place or remain permanently hidden from view. That’s the spirit of “If it doesn’t spread, it’s dead!” If we don’t know about the media, if we don’t know where to find it, if it’s locked down where we can’t easily get to it, it becomes irrelevant to the conversations in which we are participating. Spreading increases the value of content.

Julia: What does spreadable media mean to the conversations libraries, archives and museums could  have with their patrons? How can archives be more inclusive of participatory culture?

Henry:  Throughout the book, we use the term “appraisal” to refer to the choices everyday people make, collectively and personally, about what media to pass along to the people they know. Others are calling this process “curating.” But either way, the language takes us immediately to the practices which used to be the domain of “libraries, archives, and museums.” You were the people who decided what culture mattered, what media to save from the endless flow, what media to present to your patrons. But that responsibility is increasingly being shared with grassroots communities, who might “like” something or “vote something up or down” through their social media platforms, or simply decide to intensify the flow of the content through tweeting about it.

We are seeing certain videos reach incredible levels of circulation without ever passing through traditional gatekeepers. Consider “Kony 2012,” which reached more than 100 million viewers in its first week of circulation, totally swamping the highest grossing film at the box office that week (“Hunger Games”) and the highest viewed series on American television (“Modern Family”), without ever being broadcast in a traditional sense. Minimally, that means that archivists may be confronting new brokers of content, museums will be confronting new criteria for artistic merit, and libraries may be needing to work hand in hand with their patrons as they identify the long-term information needs of their communities. It doesn’t mean letting go of their professional judgement, but it does mean examining their prejudices about what forms of culture might matter and it does mean creating mechanisms, such as those around crowd-sourcing and perhaps even crowd-funding, which help to insure greater responsiveness to public interests.

Julia: You wrote in 2006 that there is a lack of fan involvement with works of high culture because “we are taught to think about high culture as untouchable,” which in turn has to do with “the contexts within which we are introduced to these texts and the stained glass attitudes which often surround them.” Further, you argue that this lack of a fan culture makes it difficult to engage with a work, either intellectually or emotionally. Can you expand on this a bit? Do you still believe this to be the case, or has this changed with time? Does the existence of transformative works like “The Lizzie Bennet Diaries” on Youtube or vibrant Austen fan communities on Tumblr reveal a shift in attitudes? Finally, how can libraries, museums, and other institutions help foster a higher level of emotional and intellectual engagement?

Henry:  Years ago, I wrote “Science Fiction Audiences” with the British scholar John Tulloch in which we explored the broad range of ways that fans read and engaged with “Star Trek” and “Doctor Who.” Tulloch then went on to interview audiences at the plays of Anton Checkov and discovered a much narrower range of interpretations and meanings — they repeated back what they had been taught to think about the Russian playwright rather than making more creative uses of their experience at the theater. This was probably the opposite of the way many culture brokers think about the high arts — as the place where we are encouraged to think and explore — and popular arts — as works that are dummied down for mass consumption. This is what I meant when I suggested that the ways we treat these works cut them off from popular engagement.

At the same time, I am inspired by recent experiments which merge the high and the low. I’ve already talked about Mixed Magic’s work with “Moby-Dick,” but “The Lizzie Bennett Diaries” is another spectacular example. It’s inspired to translate Jane Austen’s world through the mechanisms of social media: gossip and scandal plays such a central role in her works; she’s so attentive to what people say about each other and how information travels through various social communities. And the playful appropriation and remixing of “Pride and Prejudice” there has opened up Austen’s work to a whole new generation of readers who might otherwise have known it entirely through Sparknotes and plodding classroom instruction. There are certainly other examples of classical creators — from Gilbert and Sullivan to Charles Dickens and Arthur Conan Doyle — who inspire this kind of fannish devotion from their followers, but by and large, this is not the spirit with which these works get presented to the public by leading cultural institutions.

I would love to see libraries and museums encourage audiences to rewrite and remix these works, to imagine new ways of presenting them, which make them a living part of our culture again. Lawrence Levine’s “Highbrow/Lowbrow” contrasts the way people dealt with Shakespeare in the 19th century — as part of the popular culture of the era — with the ways we have assumed across the 20th century that an appreciation of the Bard is something which must be taught because it requires specific kinds of cultural knowledge and specific reading practices. Perhaps we need to reverse the tides of history in this way and bring back a popular engagement with such works.

Julia: You’re a self-described academic and fan, so I’d be interested in what you think are some particularly vibrant fan communities online that scholars should be paying more attention to.

 A Vlogbrothers FAQ”

Screenshot of the VlogBrothers, Hank and John Green, as they display a symbol of their channel in a video titled “How To Be a Nerdfighter: A Vlogbrothers FAQ”

Henry: The first thing I would say is that librarians, as individuals, have long been an active presence in the kinds of fan communities I study; many of them write and read fan fiction, for example, or go to fan conventions because they know these as spaces where people care passionately about texts, engage in active debates around their interpretation, and often have deep commitments to their preservation. So, many of your readers will not need me to point out the spaces where fandom are thriving right now; they will know that fans have been a central part of the growth of the Young Adult Novel as a literary category which attracts a large number of adult readers so they will be attentive to “Harry Potter,” “Hunger Games,” or the Nerdfighters (who are followers of the YA novels of John Green); they will know that fans are being drawn right now to programs like “Sleepy Hollow” which have helped to promote more diverse casting on American television; and they will know that now as always science fiction remains a central tool which incites the imagination and creative participation of its readers. The term, Aca-Fan, has been a rallying point for a generation of young academics who became engaged with their research topics in part through their involvement within fandom. Whatever you call them, there needs to be a similar movement to help librarians, archivists and curators come out of the closet, identify as fans, and deploy what they have learned within fandom more openly through their work.

Categories: Planet DigiPres

Future Steward on Stewardship’s Future: An Interview with Emily Reynolds

The Signal: Digital Preservation - 23 July 2014 - 10:44am
Emily Reynolds, Winner of 2014 Future Steward NDSA Innovation Award.

Emily Reynolds, Winner of 2014 Future Steward NDSA Innovation Award.

Each year, the NDSA Innovation Working Group reviews nominations from members and non-members alike for the Innovation Awards. Most of those awards are focused on recognizing individuals, projects and organizations that are at the top of their game.

The Future Steward award is a little different. It’s focused on emerging leaders, and while the recipients of the future steward award have all made significant accomplishments and achievements, they have done so as students, learners and professionals in the early stages of their careers. Mat Kelly’s work on WARCreate, Martin Gengebach’s work on forensic workflows and now Emily Reynolds work in a range of organizations on digital preservation exemplify how some of the most vital work in digital preservation is being taken on and accomplished by some of the newest members of our workforce.

I’m thrilled to be able to talk with Emily, who picked up this year’s Future Steward award yesterday during the Digital Preservation 2014 meeting, about the range of her work and her thoughts on the future of the field. Emily was recognized for the quality of her work in a range of internships and student positions with the Interuniversity Consortium for Political and Social Research, the University of Michigan Libraries, the Library of Congress, Brooklyn Historical Society, StoryCorps, and, in particular, her recent work on the World Bank’s eArchives project.

Screenshot of the Arab American National Museum's web archive collections.

Screenshot of the Arab American National Museum’s web archive collections.

Trevor: You have a bit of experience working with web archives at different institutions; scoping web archive projects with the Arab American National Museum, putting together use cases for the Library of Congress and in your coursework at the University of Michigan. Across these experiences, what are your reflections and thoughts on the state of web archiving for cultural heritage organizations?

Emily: It seems to me that many cultural heritage organizations are still uncertain as to where their web archive collections fit within the broader collections of their organization. Maureen McCormick Harlow, a fellow National Digital Stewardship Resident, often spoke about this dynamic; the collections that she created have been included in the National Library of Medicine’s general catalog. But for many organizations, web collections are still a novelty or a fringe part of the collections, and aren’t as discoverable. Because we’re not sure how the collections will be used, it’s difficult to provide access in a way that will make them useful.

I also think that there’s a bit of a skills gap, in terms of the challenges that web archiving can present, as compared to the in-house technical skills at many small organizations. Tools like Archive-It definitely lower the barrier to entry, but still require a certain amount of expertise for troubleshooting and understanding how the tool works. Even as the tools get stronger, the web becomes more and more complex and difficult to capture, so I can’t imagine that it will ever be a totally painless process.

Trevor: You have worked on some very different born-digital collections, processing born-digital materials for StoryCorps in New York and on a TRAC self-audit at ICPSR, one of the most significant holders of social science data sets. While very different kinds of materials, I imagine there are some similarities there too. Could you tell us a bit about what you did and what you learned working for each of these institutions? Further, I would be curious to hear what kinds of parallels or similarities you can draw from the work.


Image of a StoryCorps exhibit at the New Museum which Emily participate in.

Emily: At StoryCorps, I did a lot of hands-on work with incoming interviews and data, so I saw first-hand the amount of effort that goes into making such complex collections discoverable. Their full interviews are not currently available online, but need to be accessible to internal staff. At ICPSR, I was more on the policy side of things, getting an overview of their preservation activities and documenting compliance with the TRAC standard.

StoryCorps and ICPSR are an interesting pair of organizations to compare because there are some striking similarities in the challenges they face in terms of access. The complexity and variety of research data held by ICPSR requires specialized tools and standards for curation, discovery and reuse. Similarly, oral history interviews can be difficult to discover and use without extensive metadata (including, ideally, full transcripts). They’re specialized types of content, and both organizations have to be innovative in figuring out how to preserve and provide access to their collections.

ICPSR has a strong infrastructure and systems for normalizing and documenting the data they ingest, but this work still requires a great deal of human input and quality control. Similarly, metadata for StoryCorps interviews is input manually by staff. I think both organizations have done great work towards finding solutions that work for their individual context, although the tools for providing access to research data seem to have developed faster than those for oral history. I’m hopeful that with tools like Pop Up Archive that will change.

Trevor: Most recently, you’ve played a leadership role in the development of the World Bank’s eArchives project. Could you tell us about this project a little and suggest some of the biggest things you learned from working on it?

Julia Blase and Emily Reynolds present on “Developing Sustainable Digital Archive Systems.” at ALA 2013 Midwinter Meeting. Photo by Jaime McCurry.

Emily: The eArchives program is an effort to digitize the holdings of the World Bank Group Archives that are of greatest interest to researchers. We don’t view our digitization as a preservation action (only insofar as it reduces physical wear and tear on the records), and are primarily interested in providing broader access to the records for our international user base. We’ve scanned around 1500 folders of records at this point, prioritizing records that have been requested by researchers and cleared for public disclosure through the World Bank’s Access to Information Policy.

The project has also included a component of improving the accessibility of digitized records and archival finding aids. We are in the process of launching a public online finding aid portal, using the open-source Access to Memory (AtoM) platform, which will contain the archives’ ISAD(G) finding aids as well as links to the digitized materials. Previously, the finding aids were contained in static HTML pages that needed to be updated manually; soon, the AtoM database will sync regularly with our internal description database. This is going to be a huge upgrade for the archivists, in terms of reducing duplication of work and making their efforts more visible to the public.

It’s been really interesting to collaborate with the archives staff throughout the process of launching our AtoM instance. I’ve been thinking a lot about how compliance with archival standards can actually make records less accessible to the public, since the practices and language involved in finding aids can be esoteric and confusing to an outsider. It has been an interesting balance to ensure that the archivists are happy with the way the descriptions are presented, while also making the site as user-friendly as possible. Anne-Marie Viola, of Dumbarton Oaks, has written a couple of blog posts about the process of conducting usability testing on their AtoM instance, which have been a great resource for me.

Trevor: As I understand it, you are starting out a new position as a program specialist with the Institute for Museum and Library Services. I realize you haven’t started yet, but could you tell us a bit about what you are going to be doing? Along with that, I would be curious to hear you talk a bit about how you see your experience thus far fitting into working for the federal funding for libraries and museums?

Emily: As a Program Specialist, I’ll be working in IMLS’s Library Discretionary Programs division, which includes grant programs like the Laura Bush 21st Century Librarian Program and the National Leadership Grants for Libraries. Among other things, I will be supporting the grant review process, communicating with grant applicants, and coordinating grant documentation. I’ll also have the opportunity to participate in some of the outreach that IMLS does with potential and existing grant applicants.

Even though I haven’t been in the profession for a very long time, I’ve had the opportunity to work in a lot of different areas, and as a result feel that I have a good understanding of the broad issues impacting all kinds of libraries today. I’m excited that I’ll be able to be involved in a variety of initiatives and areas, and to increase my involvement in the professional community. I’ve also been spoiled by the National Digital Stewardship Residency’s focus on professional development, and am excited to be moving on to a workplace where I can continue to attend conferences and stay up-to-date with the field.

Trevor: Staffing is a big concern for the future of access to digital information. The NDSA staffing survey gets into a lot of these issues. Based on your experience, what words of advice would you offer to others interested in getting into this field? How important do you think particular technical capabilities are? What made some of your internships better or more useful than others? What kinds of courses do you think were particularly useful? At this point you’ve graduated among a whole cohort of students in your program. What kinds of things do you think made the difference for those who had an easier time getting started in their careers?

Emily: I believe that it is not the exact technical skills that are so important, but the ability to feel comfortable learning new ones, and the ability to adapt what one knows to a particular situation. I wouldn’t expect every LIS graduate to be adept at programming, but they should have a basic level of technical literacy. I took classes in GIS, PHP and MySQL, Drupal and Python, and while I would not consider myself an expert in any of these topics, they gave me a solid understanding of the basics, and the ability to understand how these tools can be applied.

I think it’s also important for recent graduates to be flexible about what types of jobs they apply for, rather than only applying for positions with “Librarian” or “Archivist” in the title. The work we do is applicable in so many roles and types of organizations, and I know that recent grads who were more flexible about their search were generally able to find work more quickly. I enjoyed your recent blog post on the subject of digital archivists as strategists and leaders, rather than just people who work with floppy discs instead of manuscripts. Of course this is easy for me to say, as I move to my first job outside of archives – but I think I’ll still be able to support and participate in the field in a meaningful way.

Categories: Planet DigiPres

EaaS: Image and Object Archive — Requirements, Implementation and Example Use-Cases

Open Planets Foundation Blogs - 23 July 2014 - 10:33am
bwFLA's Emulation-as-a-Service makes emulation widely available for non-experts and could prove emulation as a valuable tool in digital preservation workflows. Providing these emulation services to access preserved and archived digital objects poses further challenges to data management. Digital artifacts are usually stored and maintained in dedicated repositories and object owners want to – or are required to – stay in control over their intellectual property. This article discusses the problem of managing virtual images, i.e. virtual harddisks bootable by an emulator, and derivatives thereof but the solution proposed can be applied to any digital artifact.RequirementsOnce a digital object is stored in an archive and an appropriate computing environment has been created for access, this environment should be immutable and should not be modified except explicitly by an administrational interface. This guarantees that a memory institution's digital assets are unaltered by the EaaS service and remain available in the future. Immutability, however, is not easy to handle for most emulated environments. Just booting the operating system may change an environment in unpredictable ways. When the emulated software writes parts of this data and reads it again, however, it probably expects the read data to represent the modifications. Furthermore, users that want to interact with the environment should be able to change or customize it. Therefore, data connectors have to provide write access for the emulation service while they cannot write the data back to the archive. The distributed nature of the EaaS approach requires an  efficient network transport of data to allow for immediate data access and usability. However, digital objects stored in archives can be quite large in size. When representing a hard disk image, the installed operating system together with installed software can easily grow up to several GBs in size. Even with today's network bandwidths, copying these digital objects in full to the EaaS service may take minutes and affects the user experience. While the archived amount of data is usually large, the data that is actually accessed frequently can be very small. In a typical emulator scenario, read access to virtual hard disk images is block-aligned and only very few blocks are actually read by the emulated system. Transferring only these blocks instead of the whole disk image file is typically more efficient, especially for larger files. Therefore, the network transport protocol has to support random seeks and sparse reads without the need for actually copying the whole data file. While direct filesystem access provides these features if a digital object is locally available to the EaaS service, such access it is not available in the general case of separate emulation and archive servers that are connected via the internet.ImplementationThe Network Block Device (NBD) protocol provides a simple client/server architecture that allows direct access to single digital objects as well as random access to the data stream within these objects. Furthermore, it can be completely implemented in userspace and does not require a complex software infrastructure to be deployed to the archives.  In order to access digital objects, the emulation environment needs to reference these objects in the emulation environment. Individual objects are identified in the NBD server by using unique export names. While the NBD URL schema directly identifies the digital object and the archive where the digital object can be found, the data references are bound to the actual network location. In a long-term preservation scenario, where emulation environments, once curated, should last longer than a single computer system that acts as the NBD server, this approach has obvious drawbacks. Furthermore, the cloud structure of EaaS allows for interchanging any component that participates in the preservation effort, thus allowing for load balancing and fail-safety. This advantage of distributed systems is offset by static, hostname-bound references.Handle It!To detach the references from the object's network location, the Handle System is used as persistent object identifier throughout our reference implementation. The Handle System provides a complete technological framework to deal with these identifiers (or "Handles'' (HDL) in the Handle System) and constitutes a federated infrastructure that allows the resolution of individual Handles using decentralized Handle Services. Each institution that wants to participate in the Handle System is assigned a prefix and can host a Handle Service. Handles are then resolved by a central resolver by forwarding requests to these services according to the Handle's prefix. As the Handle System, as a sole technological provider, does not pose any strict requirements to the data associated with Handles, this system was used as a PI technology.Persistent User Sessions and DerivativesAs digital objects (in this case the virtual disk image) are not to be modified directly in the archive by the EaaS service, a mechanism to store modifications locally  while reading unchanged data from the archive has to be implemented. Such a transparent write mechanism can be achieved using a copy-on-write access strategy. While NBD allows for arbitrary parts of the data to be read upon request, not requiring any data to be provided locally, data that is written through the data connector is tracked and stored in a local data structure. If a read operation requests a part of data that is already in this data structure, the previously changed version of the data should be returned to the emulation component. Similarly, parts of data that are not in this data structure were never modified and must be read from the original archive server. Over time, a running user session has its own local version of the data, but only those parts of data that were written are actually copied. We used the qcow2 container format from the QEMU project to keep track of local changes to the digital object. Besides supporting copy-on-write, it features an open documentation as well as a widely used and tested reference implementation with a comprehensive API, the QEMU Block Driver. The qcow2 format allows to store all changed data blocks and the respective metadata for tracking these changes in a single file. To define where the original blocks (before copy-on-write) can be found, a backing file definition is used. The Block Driver API provides a continuous view on this qcow2 container,  transparently choosing either the backing file or the copy-on-write data structures as source. This mechanism allows modifications of data to be stored separately and independent from the original digital object during an EaaS user session, allowing to keep every digital object in its original state as it was preserved  Once the session has finished, these changes can be retrieved from the emulation component and used to create a new, derived data object. As any Block Driver format is allowed in the backing file of a qcow2 container, the backing file can also be a qcow2 container again. This allows „chaining" a series of modifications as copy-on-write files that only contain the actually modified data. This greatly facilitates efficient storage of derived environments as a single qcow2 container can directly be used in a binding without having to combine the original data and the modifications to a  consolidated stream of data. However, this makes such bindings rely not only on the availability of the qcow2 container with the modifications, but also on the original data the qcow2 container refers to. Therefore, consolidation is still possible and directly supported by the tools that QEMU provides to handle qcow2 files. Once the data modifications and the changed emulation environment are retrieved after a session, both can be stored again in an archive to make this derivate environment available. Only those chunks of data that actually  were changed by the user have to be retrieved. These, however, reference and  remain dependent on the original, unmodified digital object. The derivate can then be accessed like any other archived environment. Since all derivate environments contain (stable) references to their backing files, modifications can be stored in  a different image archive, as long as the backing file is available. Therefore, each object owner is in charge for providing storage for individualized system environments but is also  able to protect its modification without loosing the benefits of shared base images. Examples and Use-CasesTo provide a better understanding of the image archive implementation, the following three use-cases demonstrate how the current implementation works. Firstly, a so called derivative is created, a tailored system environment suitable to render a specific object. In a second scenario, a container object (CD-ROM) is injected into the environment which is then modified for object access, i.e. installation of a  viewer application and adding the object to the autostart folder. Finally, an existing harddisk image (e.g. an image copy of a real machine) is ingested into the system. This last case requires, besides the technical configuration of the hardware environment, private files to be removed before public access.Derivatives – Tailored Runtime EnvironmentsTypically, an EaaS provider provides a set of so-called base images. These images contain a basic OS installation which has been configured to be run on a certain emulated platform. Depending on the user's requirements, additional software and/or configuration may be required, e.g. the installation of certain software frameworks or text processing or image manipulation software. This can be done by uploading or making available a software installation package. On our current demo instance this is done either by uploading individual files or a CD ISO image. Once the software is installed the modified environment can be saved and made accessible for object rendering or similar purposes. Object Specific CustomizationIn case of complex CD-ROM objects with rich multimedia content from the 90s and early 2000s, e.g. encyclopedias and teaching software, typically a custom viewer application has to be installed to be able to render its content. For these objects, an already prepared environment (installed software, autostart of the application) would be useful and would surely improve the user experience during access as „implicit“ knowledge on using an outdated environment is not required anymore to make use of the object. Since the number of archived media is large, duplicating for instance a Microsoft Windows environment for every one of them would add a few GBs of data to each object. Usually, neither the object’s information content nor the current or expected user demand justify these extra costs. Using derivatives of base images, however, only a few MBs are required for each customized environment since only changed parts of the virtual image are to be stored for each object. In the case of the aforementioned collection of multimedia CD-ROMs, the derivate size varies between 348KBs and 54MBs.  Authentic Archiving and Restricted Access to Existing ComputersSometimes it makes sense to preserve a complete user system like the personal computer of Vilèm Flusser in the Vilèm Flusser Archive. Such complete system environments usually can be achieved by creating a hard disk image of the existing computer and use this image as the virtual hard disk for EaaS. Such hard disk images can, however, contain personal data of the computer's owner. While EaaS aims at providing interactive access to complete software environments, it is impossible to restrict this "interactiveness", e.g. to forbid access to a certain directory directly from the user interface. Instead, our approach to this problem is to create a derivative work with all the personal data being stripped from the system. This allows users with sufficient access permissions (e.g. family or close friends) to access the original system including personal data, while the general public access only sees a computer with all the personal data removed.Conclusion

With our distributed architecture and an efficient network transport protocol, we are able to provide Emulation as a Service quite efficiently while at the same time allowing owners of digital objects to remain in complete control over their intellectual property. Using copy-on-write technology it is possible to create a multitude of different configurations and flavors of the same system with only minimal storage requirements. Derivatives and their respective "parent" system can be handled completely independent from each other and withdrawing access permissions for a parent will automatically invalidate all existing derivatives. This allows for a very efficient and flexible handling of curation processes that involve the installation of (licensed) software, personal information and user customizations.

Open Planets members can test aforementioned features using the bwFLA demo instance. Get the password here:

Taxonomy upgrade extras: EaaSPreservation Topics: Emulation
Categories: Planet DigiPres

Archiving video

File Formats Blog - 19 July 2014 - 10:59am

Suppose you see a cop beating someone up for jaywalking, or you’re stopped at one of the Border Patrol’s internal checkpoints. You’ve got your camera, phone, or tablet, so you make a video record of the incident. What do you do next? The Activists’ Guide to Archiving Video has some solid advice. Its purpose is to help you “make sure that the video documentation you have created or collected can be used for advocacy, as evidence, for education or historical memory – not just now but into the future.” The advice is solid, and most of it applies to any video recording that has long-term importance. In essence, it’s the same advice you’d get from Files that Last or from the Library of Congress. It includes considerations that especially apply to sensitive video, such as encryption and information that might put people at risk, but it’s a valuable addition to anyone’s digital preservation library.

There’s a PDF version of the guide for people who don’t like hopping around web pages. Versions in Spanish and Arabic are also provided.

Tagged: metadata, preservation, video
Categories: Planet DigiPres


Open Planets Foundation Blogs - 17 July 2014 - 2:36pm

We have just set up a vagrant environment for C3PO. It starts a headless vm where the C3PO related functionalities (Mongodb, Play, a downloadable commandline jar) are managable from the host's browser. Further, the vm itself has all relevant processes configured at start-up independently from vagrant, so it can be, once created, downloaded and used as a stand-alone C3PO vm. We think this could be a scenario applicable to other SCAPE projects as well. The following is a summary of the ideas we've had and the experiences we've made.

The Result

The Vagrantfile and a directory containing all vagrant-relevant files live directly in the root directory of the C3PO repository. So after installing Vagrant and cloning the repository a simple 'vagrant up' should do all the work, as downloading the base box, installing the necessary software and booting the new vm.

After a few minutes one should have a running vm that is accessible from the hosts browser at localhost:8000. This opens a central welcome page that contains information about the vm-specific aspects and links to the playframework's url (localhost:9000) and the Mongodb admin interface (localhost:28017). It also provides a download link for the command-line jar, which has to be used in order to import data. This can be used from the outside of the vm as the Mongodb port is mapped as well. So I can import and analyse data with C3PO without having to fiddle through the setup challenges myself, and, believe me, that way can be long and stony.

The created image is self-contained in that sense that, if I put it on a server, anyone who has Virtualbox installed can download it and use it, without having to rely on vagrant working on their machine.

General Setup

The provisioning script has a number of tasks:

  • it downloads all required dependencies for building the C3PO environment
  • it installs a fresh C3PO (from /vagrant, which is the shared folder connection between the git repository and the vm) and assembles the command-line app
  • it installs and runs a Mongodb server
  • it installs and runs the Playframework
  • it creates a port-forwarded static welcome page with links to all the functionalities above
  • it adds all above to the native ubuntu startup (using /etc/rc.local, if necessary), so that an image of the vm can theoretically be run independently from the vagrant environment

These are all trivial steps, but it can make a difference not having to manually implement all of them.

Getting rid of proxy issues

In case you're behind one of those very common NTLM company proxies, you'll really like that the only thing you have to provide is a config script with some some details around your proxy. If the setup script detects this file, it will download the necessary software and configure maven to use it. Doing it in this way has been actually the first time I got maven running smoothly on a linux VM behind our proxy.

Ideas for possible next steps

There is loads left to do, here are a few ideas:

  • provide interesting initial test-data that ships with the box, so that people can play around with C3PO without having to install/import anything at all.
  • why not having a vm for more SCAPE projects? we could quickly create a repository for something like a SCAPE base vm configuration that is useable as a base for other vms. The central welcome page could be pre-configured (SCAPE branded) as well as all the proxy- and development-environment-related stuff mentioned above.
  • I'm not sure about the sustainablity of shell provisioning scripts with increasing complexity of the bootstrap process. Grouping the shell commands in functions is certainly an improvement, it might be worth though to check out other, more dynamic provisioners. One I find particularly interesting is Ansible.
  • currently there's no way of testing that the vm works with the current development trunk; a test environment that runs the vm and tests for all the relevant connection bits would be handy


Preservation Topics: SCAPE
Categories: Planet DigiPres

CSV Validator version 1.0 release

Open Planets Foundation Blogs - 15 July 2014 - 12:10pm

Following on from my previous brief post announcing the beta release of the CSV Validator,, today we've made the formal version 1.0 release of the CSV Validator and the associated CSV Schema Language.  I've described this in more detail on The NAtional Archives' blog,

Preservation Topics: Tools
Categories: Planet DigiPres

Crowdsourcing song identification

File Formats Blog - 14 July 2014 - 10:04am

Some friends of mine are pulling together a project for crowdsourcing identification of a large collection of music clips. At least a couple of us are professional software developers, but I’m the one with the most free time right now, and it fits with my library background, so I’ve become lead developer. In talking about it, we’ve realized it can be useful to librarians, archivists, and researchers, so we’re looking into making it a crowdfunded open source project.

A little background: “Filk music” is songs created and sung by science fiction and fantasy fans, mostly at conventions and in homes. I’ve offered a definition of filk on my website. There are some shoestring filk publishers; technically they’re in business, but it’s a labor of love rather than a source of income. Some of them have a large backlog of recordings from past conventions. Just identifying the songs and who’s singing them is a big task.

This project is, initially, for one of these filk publishers, who has the biggest backlog of anyone. The approach we’re looking at is making short clips available to registered crowdsource contributors, and letting them identify as much as they can of the song, the author, the performer(s), the original tune (many of these songs are parodies), etc. Reports would be delivered to editors for evaluation. There could be multiple reports on the same clip; editors would use their judgment on how to combine them. I’ve started on a prototype, using PHP and MySQL.

There’s a huge amount of enthusiasm among the people already involved, which makes me confident that at least the niche project will happen. The question is whether there may be broader interest. I can see this as a very useful tool for professionals dealing with archives of unidentified recordings: folk music, old jazz, transcribed wax cylinder collections, whatever. There’s very little in the current design that’s specific to one corner of the musical world.

The first question: Has anyone already done it? Please let me know if something like this already exists.

If not, how interesting does it sound? Would you like it to happen? What features would you like to see in it?

Update: On the Code4lib mailing list, Jodi Schneider pointed out that nichesourcing is a more precise word for what this project is about.

Tagged: archiving, crowdsourcing, filk, music
Categories: Planet DigiPres

New QA tool for finger detection on scans

Open Planets Foundation Blogs - 10 July 2014 - 11:49am

I would like to draw your attention to the new QA tool for finger detection on scans: This tool was developed by AIT in scope of the SCAPE project.


Checking to identify fingers on scan manually is a very time-consuming and error-prone process. You need a tool to help you: Fingerdet.

Fingerdet is an open source tool which:

  • provides decision-making support for finger on scans detection in or across collections
  • identifies fingers even independent from file format, scan quality, finger sizes, direction, shape, colour and lighting conditions
  • applies state-of-the art image processing
  • is useful in assembling collections from multiple sources, and identifying corrupted files

Fingerdet brings the following benefits:

  • Automated quality assurance
  • Reduced manual effort and error
  • Saved time
  • Lower costs, e.g. storage, effort
  • Open source, standalone tool. Also as Taverna component for easy invocation
  • Invariant to format, rotation, scale, translation, illumination, resolution, cropping, warping and distortions
  • May be applied to wide range of image collections, not just print images
Preservation Topics: Web ArchivingPreservation RisksSCAPESoftware
Categories: Planet DigiPres

Quality assured ARC to WARC migration

Open Planets Foundation Blogs - 10 July 2014 - 10:44am

This blog post continues a series of posts about the weeb archiving topic „ARC to WARC migration“, namely it is a follow-up on the posts „ARC to WARC migration: How to deal with de-duplicated records?“, and „Some reflections on scalable ARC to WARC migration“.

Especially the last one of these posts ,which described how SCAPE tools can be used for multi-Terabyte web archive data migration, is the basis for this post from a subject point of view. One consequence of evaluating alternative approaches for processing web archive records using the Apache Hadoop framework was to abandon the native Hadoop job implementation (the arc2warc-migration-hdp module was deprecated and removed from the master branch) because of having some disadvantages without bringing significant benefits in terms of performance and scale-out cabability compared to the command line application arc2warc-migration-cli used together with the SCAPE tool ToMaR for parallel processing. While this previous post did not elaborate on quality assurance, it will be the main focus of this post.

The workflow diagram in figure 1 illustrates the main components and processes that were used to create a quality assured ARC to WARC migration workflow.


Figure 1: Workflow diagram of the ARC to WARC migration workflow

The basis of the main components used in this workflow is the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files. Based on this toolkit the „hawarp“ tool set was developed in the SCAPE project which bundles several components for preparing and processing web archive data, especially data that is stored in ARC or WARC container files. Special attention was given to making sure that data can be processed using the Hadoop framework, an essential part of the SCAPE platform for distributed data processing using computer clusters.

The input of the workflow is an ARC container file, a format originally proposed by the Internet Archive to persistently store web archive records. The ARC to WARC migration tool is a JAVA command line application which takes an ARC file as input and produces a WARC file. The tool basically performs a procedural mapping of metadata between the ARC and WARC format (see constructors of the eu.scape_project.hawarp.webarchive.ArchiveRecord class). One point that is worth highlighting is that the records of ARC container files from the Austrian National Library's web archive were not structured homogeneously. When creating ARC records (using the Netarchive Suite/Heritrix web crawler in this case), the usual procedure was to strip off the HTTP response metadata from the server's response and store these data as part of the header of the ARC record. Possibly due to malformed URLs this was not applied to all records, so that the HTTP response metadata were still part of the payload content as it was actually defined later for the WARC standard. The ARC to WARC migration tool handles these special cases accordingly. Generally, and as table 1 shows, HTTP response metadata is transferred from ARC header to WARC payload header and therefore becomes part of the payload content.

ARC Header
→ HTTP Response MetadataWARC HeaderARC Payload→ HTTP Response Metadata
WARC Payload

Table 1: HTTP response metadata is transferred from ARC header to WARC payload header

The CDX-Index Creation module, which is also part of the „hawarp“ tool set, is used to create a file in the CDX file format to store selected attributes of web archive records aggregated in ARC or WARC container files - one line per record - in a plain text file. The main purpose of the CDX-index is to provide a lookup table for the wayback software. The index file contains the necessary information (URL, date, offset: record position in container file, container identifier, etc) to retrieve data required for rendering an archived web page and depending ressources from the container files.

Apart from serving the purpose of rendering web ressources using the wayback software, the CDX index file can also be used to do a basic verification if the container format migration process was successful or not, namely by comparing the CDX fields of the ARC CDX file and the WARC CDX file. The basic assumption here is that apart from the offset and container identifier fields all the other fields must have the same values for corresponding records. Especially the payload digest allows verifying if the digest computed for the binary data (payload content) are the same for all records in the two container formats respectively.

An additional step of the workflow in order to verify the quality of the migration is to compare the rendering results of selected ressources when being retrieved from the original ARC and the migrated WARC container files. To this end, the CDX files are deployed to the wayback application in a first step. In a second step the PhantomJS framwork is used to take snapshots from rendering the same ressource retrieved once from the ARC container and once from the WARC container file.

Finally, the snapshot images are compared using Exiftool (basic image properties) and ImageMagick (measure AE: absolute error) in order to determine if the rendering result is equal for both instances. Randomized manual verification of individual cases may then conclude the quality control process.

There is an executable Taverna workflow available on myExperiment. The Taverna workflow is configured by adapting the values of the constant values (light-blue boxes) which define the paths to configuration files, deployment files, and scripts in the processing environment. However, as in this workflow Taverna is just used as an orchestration tool to build a sequence of bash script invocations, it is also possible to just use the individual scripts of this workflow and replace the Taverna variables (embraced by two per cent symbols) accordingly.

The following prerequisites must be fulfilled to be able to execute the Taverna workflow and/or the bash scripts it contains:

The following screencast demonstrates the workflow using a simple "Hello World!" crawl as example:

Taxonomy upgrade extras: SCAPESCAPEProjectSCAPE-ProjectWeb ArchivingPreservation Topics: Preservation ActionsMigrationWeb ArchivingSCAPE
Categories: Planet DigiPres

EaaS in Action — And a short meltdown due to a friendly DDoS

Open Planets Foundation Blogs - 9 July 2014 - 4:23pm

On June 24th 9.30 AM EST Dragan Espenschied, Digital Conservator at Rhizome NY, released an editorial on featuring a restored home computer previously owned by Cory Arcangel. The article uses an embedded emulator powered by the bwFLA Emulation as a Service framework  and the University of Freiburg’s computing center. The embedded emulator allows readers to explore and interact with the re-enacted machine. 

Currently the bwFLA test and demo infrastructure runs an old, written off IBM Blade Cluster, using 12 blades for demo purposes, each equipped with 8 core Intel(R) Xeon(R) CPUs (E5440  @ 2.83GHz). All instances are booted diskless (network boot) with the latest bwFLA codebase deployed. Additionally, there is a EaaS gateway running on 4 CPUs delegating request and providing a Web container framework (JBoss) for IFrame delivery. To ensure, a decent performance of individual emulation sessions we assign one emulation session to every available physical CPU. Hence, our current setup can handle 96 parallel session. 

Due to some social media propaganda and good timing (US breakfast/coffee time) our resources were exhausted within minutes.

The figure above shows the total number of sessions for June 26th. Between 16:00 and 20:00 CEST, however, we were unable to deal with the demand.

After two days however, load has normalized again, even though on a higher level.

Lessons learned

The bwFLA EaaS framework is able to scale with demand, however, not with our available (financial) resources. Our cluster is suitable for „normal“ load scenarios. For peaks like we have experienced with Dragan’s post, a temporary deployment in the Cloud (e.g. using PaaS / IaaS Cloud services) is cost effective strategy, since these heavy load situations last only for a few days. The „local“ infrastructure should be scaled to average demand, keeping costs running EaaS at bay. For instance, Amazon EC2 charges for an 8 CPU machine about € 0.50 per hour. In case of Dragan’s post, the average session time of a user playing with the emulated machine was 15 minutes, hence, the average cost per user is about 0.02€ if a machine is fully utilized. 


Taxonomy upgrade extras: EaaSPreservation Topics: Emulation
Categories: Planet DigiPres

BSDIFF: Technological Solutions for Reversible Pre-conditioning of Complex Binary Objects

Open Planets Foundation Blogs - 9 July 2014 - 12:31am

During my time at The National Archives UK, colleague, Adam Retter, developed a methodology for the reversible pre-conditioning of complex binary objects. The technique was required to avoid the doubling of storage for malformed JPEG2000 objects numbering in the hundreds of thousands. The difference between a malformed JPEG2000 file and a corrected, well-formed JPEG2000 file, in this instance was a handful of bytes, yet the objects themselves were many megabytes in size. The cost of storage means that doubling it in such a scenario is not desirable in today’s fiscal environment – especially if it can be avoided.

As we approach ingest of our first born-digital transfers at Archives New Zealand, we also have to think about such issues. We’re also concerned about the documentation of any comparable changes to binary objects, as well as any more complicated changes to objects in any future transfers.

The reason for making changes to a file pre-ingest, in our process terminology - pre-conditioning, is to ensure well-formed, valid objects are ingested into the long term digital preservation system. Using processes to ensure changes are:

  • Reversible
  • Documented
  • Approved

We can counter any issues identified as digital preservation risks in the system’s custom rules up-front ensuring we don’t have to perform any preservation actions in the short to medium term. Such issues may be raised through format identification, validation, or characterisation tools. Such issues can be trivial or complex and the objects that contain exceptions may also be trivial or complex themselves.

At present, if pre-conditioning is approved, it will result in a change being made to the digital object and written documentation of the change, associated with the file, in its metadata and in the organisation’s content management system outside of the digital preservation system.

As example documentation for a change we can look at a provenance note I might write to describe a change in a plain text file. The reason for the change is the digital preservation system looking for the object to be encoded as UTF-8. A conversion can give us stronger confidence about what this file is in future. Such a change, converting the object from ASCII to UTF-8, can be completed as either a pre-conditioning action pre-ingest, or preservation migration post-ingest.

Provenance Note

“Programmers Notepad 2.2.2300-rc used to convert plain-text file to UTF-8. UTF-8 byte-order-mark (0xEFBBBF) added to beginning of file – file size +3 bytes. Em-dash (0x97 ANSI) at position d1256 replaced by UTF-8 representation 0xE28094 at position d1256+3 bytes (d1259-d1261) – file size +2 bytes.”

Such a small change is deceptively complex to document. Without the presence of a character sitting outside of the ASCII range we might have simply been able to write, “UTF-8 byte-order-mark added to beginning of file.” – but with its presence we have to provide a description complete enough to ensure that the change can be observed, and reversed by anyone accessing the file in future.

Pre-conditioning vs. Preservation Migration

As pre-conditioning is a form of preservation action that happens outside of the digital preservation system we haven’t adequate tools to complete the action and document it for us – especially for complex objects. We’re relying on good hand-written documentation being provided on ingest. The temptation, therefore, is to let the digital preservation system handle this using its inbuilt capability to record and document all additions to a digital object’s record, including the generation of additional representations; but the biggest reason to not rely on this is the cost of storage and how this increases with the likelihood of so many objects requiring this sort of treatment over time.

Proposed Solution

It is important to note that the proposed solution can be implemented either pre- or post-ingest therefore removing the emphasis from where in the digital preservation process this occurs, however, incorporating it post-ingest requires changes to the digital preservation system. Doing it pre-ingest enables it to be done manually with immediate challenges addressed. Consistent usage and proven advantages over time might see it included in a digital preservation mechanism at a higher level.

The proposed solution is to use a patch file, specifically a binary diff (patch file) which stores instructions about how to convert one bitstream to another. We can create a patch file by using a tool that compares an original bitstream to a corrected (pre-conditioned) version of it and stores the result of the comparison. Patch files can add and remove information as required and so we can apply the instructions created to a corrected version of any file to re-produce the un-corrected original.

The tool we adopted at The National Archives, UK was called BSDIFF. This tool is distributed with the popular operating system, FreeBSD, but is also available under Linux, and Windows.

The tool was created by Colin Percival and there are two utilities required; one to create a binary diff - BSDIFF itself, and the other to apply it BSPATCH. The manual instructions are straightforward, but the important part of the solution in a digital preservation context is to flip the terms <oldfile> and <newfile>, so for example, in the manual:

  • $ bsdiff <oldfile> <newfile> <patchfile>

Can become:

  • $ bsdiff <newfile> <oldfile> <patchfile>

Further, in the below descriptions, I will replace <newfile> and <oldfile> for <pre-conditioned-file> and <malformed-file> respectively, e.g.

  • $ bsdiff <pre-conditioned-file> <malformed-file> <patchfile>


BSDIFF generates a patch <patchfile> between two binary files. It compares <pre-conditioned-file> to <malformed-file> and writes a <patchfile> suitable for use by BSPATCH.


BSPATCH applies a patch built with BSDIFF, it generates <malformed-file> using <pre-conditioned-file>, and <patchfile> from BSDIFF.


For my examples I have been using the Windows port of BSDIFF referenced from Colin Percival’s site.

To begin with, a non-archival example simply re-producing a binary object:

If I have the plain text file, hen.txt:

  • The quick brown fox jumped over the lazy hen.

I might want to correct the text to its more well-known pangram form – dog.txt:

  • The quick brown fox jumped over the lazy dog.

I create dog.txt and using the following command I create hen-reverse.diff:

  • $ bsdiff dog.txt hen.txt hen-reverse.diff

We have two objects we need to look after, dog.txt and hen-reverse.diff.

If we ever need to look at the original again we can use the BSPATCH utility:

  • $ bspatch dog.txt hen-original.txt hen-reverse.diff

We end up with a file that matched the original byte for byte and can be confirmed by comparing the two checksums.

$ md5sum hen.txt 84588fd6795a7e593d0c7454320cf516 *hen.txt $ md5sum hen-original.txt 84588fd6795a7e593d0c7454320cf516 *hen-original.txt

Used as an illustration, we can re-create the original binary object, but we’re not saving any storage space at this point as the patch file is bigger than the <malformed-file> and <pre-conditioned-file> together:

  • hen.txt – 46 bytes
  • dog.txt – 46 bytes
  • hen-reverse.diff – 159 bytes

The savings we can begin to make, however, using binary diff objects to store pre-conditioning instructions can be seen when we begin to ramp up the complexity and size of the objects we’re working with. Still working with text, we can convert the following plain-text object to UTF-8 complementing the pre-conditioning action we might perform on archival material as described in the introduction to this blog entry:

  • Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas quam lacus, tincidunt sit amet lobortis eget, auctor non nibh. Sed fermentum tempor luctus. Phasellus cursus, risus nec eleifend sagittis, odio tellus pretium dui, ut tincidunt ligula lorem et odio. Ut tincidunt, nunc ut volutpat aliquam, quam diam varius elit, non luctus nulla velit eu mauris. Curabitur consequat mauris sit amet lacus dignissim bibendum eget dignissim mauris. Nunc eget ullamcorper felis, non scelerisque metus. Fusce dapibus eros malesuada, porta arcu ut, pretium tellus. Pellentesque diam mauris, mollis quis semper sit amet, congue at dolor. Curabitur condimentum, ligula egestas mollis euismod, dolor velit tempus nisl, ut vulputate velit ligula sed neque. Donec posuere dolor id tempus sodales. Donec lobortis elit et mi varius rutrum. Vestibulum egestas vehicula massa id facilisis.

Converting the passage to UTF-8 doesn’t require the conversion of any characters within the text itself, rather just the addition of the UTF-8 byte-order-mark at the beginning of the file. Using Programmers Notepad we can open lorem-ascii.txt and re-save it with a different encoding as lorem-utf-8.txt. As with dog.txt and hen.txt we can then create the patch, and then apply to see the original again using the following commands:

  • $ bsdiff lorem-utf-8.txt lorem-ascii.txt lorem-reverse.diff
  • $ bspatch lorem-utf-8.txt lorem-ascii-original.txt lorem-reverse.diff

Again, confirmation that bspatch outputs a file matching the original can be seen by looking at their respective MD5 values:

$ md5sum lorem-old.txt ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii.txt $ md5sum lorem-ascii.txt ec6cf995d7462e20f314aaaa15eef8f9 *lorem-ascii-original.txt

The file sizes here are much more illuminating:

  • lorem-ascii.txt – 874 bytes
  • lorem-utf-8.txt – 877 bytes
  • lorem-reverse.diff – 141 bytes

Just one more thing… Complexity!

We can also demonstrate the complexity of the modifications we can make to digital objects that BSDIFF affords us. Attached to this blog is a zip file containing supporting files, lorem-ole2-doc.doc and lorem-xml-docx.docx.

The files are used to demonstrate a migration exercise from an older Microsoft OLE2 format to the up-to-date OOXML format.

I’ve also included the patch file lorem-word-reverse.diff.

Using the commands as documented above:

  • $ bsdiff lorem-xml-docx.docx lorem-ole2-doc.doc lorem-word-reverse.diff
  • $ bspatch lorem-xml-docx.docx lorem-word-original.doc lorem-word-reverse.diff

We can observe that application of the diff file to the ‘pre-conditioned’ object, results in a file identical to the original OLE2 object:

$ md5sum lorem-ole2-doc.doc 3bb94e23892f645696fafc04cdbeefb5 *lorem-ole2-doc.doc $ md5sum lorem-word-original.doc 3bb94e23892f645696fafc04cdbeefb5 *lorem-word-original.doc

The file-sizes involved in this example are as follows:

  • Lorem-ole2-doc.doc – 65,536 bytes
  • Lorem-xml-docx.docx – 42.690 bytes
  • Lorem-word-reverse.diff – 16,384 bytes

The neat part of this as a solution, if it wasn’t enough that the most complex of modifications are reversible, is that the provenance note remains the same for all transformations between all digital objects. The tools and techniques are documented instead, and the rest is consistently consistent, and perhaps even more accessible to all users who can understand this documentation over more complex narrative, byte-by-byte breakdowns that might otherwise be necessary.


Given the right problem and the insight of an individual from outside of the digital preservation sphere (at the time) in Adam, we have been shown an innovative solution that helps us to demonstrate provenance in a technologically and scientifically sound manner, more accurately, and more efficiently than we might otherwise be able to do so using current approaches. The solution:

  • Enables more complex pre-conditioning actions on more complex objects
  • Prevents us from doubling storage space
  • Encapsulates pre-conditioning instructions more efficiently and more accurately - there are fewer chances to make errors

While it is unclear whether Archives New Zealand will be able to incorporate this technique into its workflows at present, the solution will be presented alongside our other options so that it can be discussed and taken into consideration by the organisation as appropriate.

Work does need to be done to incorporate it properly, e.g. respecting original file naming conventions; and some consideration should be given as to where and when in the transfer / digital preservation process the method should be applied, however, it should prove to be an attractive, and useful option for many archives performing pre-conditioning or preservation actions on all future, trivial and complex digital objects.




Documentation for the BSDIFF file format is attached to this blog and was created by Colin Percival and is BSD licensed. 


Preservation Topics: Preservation ActionsMigrationPreservation StrategiesNormalisation AttachmentSize BSDIFF format documentation provided by Colin Percival [PDF]29.93 KB Support files for OLE2 to OOXML example [ZIP]69.52 KB
Categories: Planet DigiPres