Planet DigiPres

2014 NDSA Philly Regional Meeting: January 23-24

The Signal: Digital Preservation - 18 February 2014 - 4:19pm

The following is a guest post by Nicole Scalessa, IT manager at The Library Company of Philadelphia, an NDSA member.

Digital stewardship is a prime topic for small institutions trying to keep pace with the increasing demands for digital content. The Library Company of Philadelphia, a special collections library founded by Benjamin Franklin in 1731, hosted the National Digital Stewardship Alliance Philly Regional meeting to inform and connect mid-Atlantic institutions so they may consider new collaborations to meet digital preservation demands.

The intent of the NDSA Philly Regional meeting was to present a slate of speakers representing some of the most influential thinking and trends in digital preservation today. The event was opened to new audiences to spread the word about NDSA’s accomplishments and ongoing activities. Members of the Philadelphia Area Consortium of Special Collections Libraries, PhillyDH, and the Delaware Valley Archivists Group were in attendance. The event, on the cusp of ALA Mid-Winter, also drew attendees from as far away as North Carolina, Florida, Colorado and Washington State.

Things kicked off Thursday evening, January 23rd, with a welcome by Library Company Director John C. Van Horne and an introduction by Erin Engle, digital archivist with NDIIPP.  Erin provided a clear presentation of the NDSA mission to advocate for common needs among members through reports, guidance, meetings, events and webinars. As an example, she mentioned the 2014 National Agenda for Digital Stewardship, an insightful look into the trends and current state of digital preservation and a tool to help decision makers and funders.


Emily Gore presenting on DPLA during the NDSA Philadelphia Regional Meeting. Credit: Nicole Joniec

This was followed by an enthusiastic and compelling keynote by Emily Gore, DPLA director for content, entitled “Building the Digital Public Library of America: Successes, Challenges and Future Directions.” A theme of sustainability resonated through her talk as she described the development of the DPLA and how it became clear that the hub model was the most successful strategy for the long-term success of the project. The establishment of hubs is driven by the idea that asking a few existing digital repositories to aggregate content is the most efficient way to bring more institutions into DPLA. Hubs help DPLA manage data aggregation, maintain metadata consistency, provide continual repository services, promote new digitization, encourage community engagement, and support self-evaluation aimed at improving existing hubs and developing new ones.

The evening progressed into a series of lightning talks that focused on standards for preservation, digitization and description. This was a natural transition in the conversation and established a complete picture of the issues that must be addressed in any collaborative digitization strategy. Consistency was the prevailing message for success when conformity is often unattainable.

Meg Phillips, NARA’s external affairs liaison, initiated the lightning part of the evening with a presentation on the NDSA Levels of Digital Preservation, “a tiered set of recommendations for how organizations should begin to build or enhance their digital preservation activities.” She emphasized the importance of this document as a tool for self-assessment, program planning, institutional advocacy, strategic planning, and as a way to open communication with content creators. The success of the document lies in its simple descriptive format that is content agnostic. It includes four levels of preservation – protect your data, know your data, monitor your data, and repair your data – across five functional areas: storage and geographic location, file fixity and data integrity, information security, metadata, and file formats.


Ian Bogus talking about the ALCTS Minimum Digitization Capture Recommendations. Credit: Nicole Joniec

Ian Bogus, MacDonald Curator of Preservation at the University of Pennsylvania Libraries, gave a lightning talk entitled “Why Create a Standard on Digitization? An Experience Creating the Association for Library Collections and Technical Services Minimum Digitization Capture Recommendations.” The goal of this project was to establish an acceptable minimum standard that would resonate with staff with different degrees of digitization experience. With this standard, libraries can create digital surrogates that are sustainable into the future. The guiding principles of the project were to create a standard high enough to be adequate, keep in line with other recommendations and projects, avoid duplicating existing work, remain basic enough for novices to use, and be accurate enough for experts.

The evening concluded with a fun discussion on metadata, with some serious undertones. George Blood of George Blood Audio|Video|Film discussed how we as librarians are “Describing Ourselves to Death” and the “Failures of Metadata.” He began by affirming he is a metadata pessimist because no one asks “what problem are we trying to solve?” or “what are we trying to provide metadata for?” Most metadata is collected “just because we can,” and because of this we do not test our metadata. The variety of metadata standards across and within institutions is staggering. Sometimes metadata standardization costs more than digitization itself. He encouraged the audience to consider what a standard is, whether a standard needs to be perfect, what the implications of local modifications are, and whether there is a one-size-fits-all solution. This was quite a formidable list of questions to end the evening, but a wonderful starting point for the Friday unconference the next morning.


A session during the unconference portion at the NDSA Philadelphia Regional Meeting. Credit: Nicole Joniec


Unconference agenda comes together. Credit: Nicole Joniec

Approximately 50 attendees convened to propose and vote upon the unconference sessions.  The largest sessions included “making the case for digital preservation,” “let’s discuss a consortium data center,” and “how do we approach becoming a regional hub of DPLA.” The smaller breakout sessions included discussions on minimal standards for archival description, engaging leadership and encouraging organizational responsibility for digital projects, approaching rights and access issues, metrics for evaluation of digital archival resources, new technologies in digitization, and teaching digital preservation in library science and graduate archival programs. Notes from these sessions will be forthcoming on the event web page here.

The two-day event was attended by nearly one hundred and fifty people from around the country and ended in promising collaboration discussions and new friendships. This experience demonstrates that NDSA Regional meetings offer opportunities for local institutions to connect with one another while becoming informed on trends in digital stewardship on a national scale.

Categories: Planet DigiPres

The Heart of the Matter: An NDSR Project and Program Update

The Signal: Digital Preservation - 14 February 2014 - 6:26pm

The following is a guest post by Maureen McCormick Harlow, a National Digital Stewardship Resident at the National Library of Medicine in Bethesda, Maryland.  She is working on a project to build a thematic web collection.


Maureen McCormick Harlow

Greetings from the National Library of Medicine!  It’s hard to believe it, but I’m heading into the fourth quarter of my residency here.  I thought it was time to give an update on what I’ve been doing for my project, even though it’s not terribly Valentine’s Day-related!

The Project

My NDSR project is to build a thematic web collection at NLM that will be incorporated into the History of Medicine Division collection.  HMD has extensive digital and modern manuscript collections, and this little collection that I’m working on will be accessioned into it as a curated, intentional collection.

The Theory

Thematic collections can provide institutions with an opportunity to close known collection gaps.  If institutions can identify areas of weakness within their collections, they can intentionally collect on the topics as they exist today on the Internet.  This is an especially attractive option for topics that are in flux, or whose understanding is changing frequently.

Another benefit of thematic web collections is that they allow institutions to collect material that may be ephemeral.  Blogs come and go frequently, and once they are taken down, the information contained in them is gone as well.  Collecting websites can be akin to collecting gray literature.

The Model

My project is limited to creating one thematic collection to add to the HMD holdings, but I wanted to also establish a framework that could be used in the future for other thematic collections.  The framework that we eventually settled on is a thematic collection that represents two sides of the same coin, so to speak.  In this case, Autism Spectrum Disorders are brain disorders generally diagnosed at the beginning of life, while the brain is developing, whereas Alzheimer’s Disease is a brain disorder diagnosed at the end of life, in old age.

Although the two diseases are not related, they are diagnosed during the organ’s development and decay.  Future thematic web collections could explore diagnoses in a particular body system or region made during the system/region’s development and at the end of life, or two extremes of the same issue.  Some examples include:

  • Teen pregnancy and infertility
  • Diabetes type 1 and type 2
  • Scoliosis and osteoarthritis
  • Eating disorders and obesity

Each of these issues is one of strategic importance to NIH and, in some cases, the nation (see: the Let’s Move project by Michelle Obama and the Teen Pregnancy Prevention Resource Center in HHS’s Office of Adolescent Health).  More importantly, many of these topics represent areas of great change and understandings that are in flux, making websites a viable way for future researchers to examine change over time.

The Details

Picking a Theme
Before you can create a thematic web collection, you’ve got to have a theme.  This process took a while.  My first step was to look over the various collecting documents.  In my section at NLM, there were three to consider: the NIH Research Priorities, the NLM Collection Development Policy and Manual, and an internal document that deals with known collection gaps (for example, the caregiver perspective).  Each of these helped to inform and narrow my possibilities.  For instance, the Research Priorities at NIH report indicated several areas of interest to the larger NIH audience, alerting me to trends in research and some of the most prevalent problems in medicine.  It stood to reason that, since these were priorities for NIH, there would be scholarly work produced about the diseases, and that the understanding of the diseases was in a period of flux, making web collecting more important than ever.  Since this is a bit of a pioneer collection, I wanted it to fit squarely within each of these areas.


Screenshot of the National Library of Medicine Collection Development Manual relevant to the History of Medicine Division

After reviewing all of these documents and spending a significant amount of time looking at internet resources, I came up with three proposals:

  • Eating disorders
  • Sexual assault
  • Autism and Alzheimer’s

My last step in the process was to plug each potential topic into the NLM catalog and the HMD finding aid search to see what kind of resources we already had on each topic.  Since one of my personal goals was to help fill some of the collecting gaps, I wanted to see that the web collection would be contributing something original to the HMD collection.  In each case, I found that, while NLM collected extensively on each topic, the HMD holdings were limited.

We ended up going with the third option, and I’m calling the collection Disorders of the Developing and Aging Brain: Autism and Alzheimer’s on the Web.

The results when I searched the HMD finding aids for “autism.”


Picking the Seeds
The scope of my collection was limited to approximately 40-60 seeds (individual websites/URLs that will be added to the collection).  I decided to split the seeds roughly in half (a total of 64 seeds) and divided the ~30 per topic into six or seven different areas:

  • Current understanding
  • Caretakers (first-person resources, primarily blogs of caretakers)
  • Patients/sufferers (also first-person, also primarily blogs)
  • Research
  • Causes
  • Treatment
  • Prevention (for Alzheimer’s only)

For the first-person categories, I tried to make sure to cover a wide variety of ages, diagnoses, and roles/perspectives to represent a range of experiences.

Collecting the Material
After picking the seeds, we went about collecting permissions for the blogs.  Although we have a strong argument for use under the ARL Best Practices for Fair Use guidelines, we’re proceeding with an abundance of caution and collecting as many permissions as possible for the blogs in the collection.

Almost two weeks ago, I started crawling the seeds using Archive-It.  NLM has used Archive-It for several years for its web collections, and has two other public web collections.

Describing the Collection
This is where I am now.  My preliminary plan is to use the following methods to describe the new collection:

  • Create a catalog record so that the collection is discoverable through the NLM catalog;
  • Fully arrange and describe the collection using a finding aid and adhering to DACS principles and local implementations.

There are very few examples that I’ve found of web collections described in this manner, so it’s going to be a lot of work creating standards and best practices that will be robust and durable enough to make the collection usable to researchers, while also being flexible enough for archivists at NLM to use into the future.

That’s where my project stands now!  I’m looking forward to finishing it, and I welcome the challenge of describing the collection and getting it incorporated into the HMD collection!

Other residents in the blogosphere: Heidi Dowding discusses digital asset management at cultural institutions in Baltimore, Emily Reynolds recaps her presentation at ALA Mid-Winter with Julia Blase and shares her slides, and Lauren Work shares her ALA Mid-Winter slides.

Categories: Planet DigiPres

Capturing and Preserving the Olympic Spirit Via Web Archiving

The Signal: Digital Preservation - 12 February 2014 - 3:15pm

Image from the IIPC website

Every two years there is a fresh opportunity for excitement in following the Olympic games – not only for the thrill of the sports themselves, and rooting for hometown heroes, but for the fascination and variety of all the international culture in one place.   And now, there is an effort going on behind the scenes to capture the highlights, the competition, and the general cultural history surrounding the Olympic Games.  That is, a project to archive the 2014 Olympics web sites. This effort may not be well known, but the resultant archive will be invaluable for researchers in the future.

This web archiving project is being produced by the International Internet Preservation Consortium.  The IIPC has been around since 2003, and it’s a collaborative organization dedicated to improving the tools, standards and best practices of web archiving while promoting international collaboration and the broad access and use of web archives for research and cultural heritage.  Membership in the IIPC currently includes almost 50 organizations: libraries (including the Library of Congress), archives, museums and other cultural heritage organizations, representing over 25 countries. This Olympics project is being coordinated through the IIPC Access Working Group, and the project leaders are Nicola Bingham and Helen Hockx-Yu, both of the British Library.

The IIPC has produced similar projects before: there is an archive of the 2010 Winter Olympics in Vancouver and the 2012 Summer Olympics and Paralympics in London.  And this current effort aims to preserve a range of web sites relating to the 2014 Olympics in Sochi, Russia.

A little bit about the process – IIPC member institutions all contribute their own list of suggested web sites (referred to as “seeds”) for inclusion in the collection.  With so many member organizations around the world, the aim is to include Olympics-related sites from many countries, in a variety of languages and from a variety of viewpoints.

A torch relay map from the 2012 site collection


The previous IIPC project to capture the 2012 Olympics in London included many British sites that provide an overall view of the host country preparations.  These archived sites are not available yet, but include the official London 2012 Olympic and Paralympic Games sites as well as the British Olympic Association which includes details of the Olympics bid, and a local council’s 2012 Olympic and Paralympic website.  It also includes the Hidden London site showing the building stages of the Olympic stadium, as well as blogs and commentaries related to arts and culture, featuring such things as a torch from the 1948 London Olympics acquired by the Victoria and Albert Museum.

For this current 2014 Olympics project, the various IIPC member institutions are all recommending their own list of websites to be included.   For example, the Library of Congress has recommended 131 web sites.  As described by Michael Neubert, Supervisory Digital Projects Specialist here at the Library: “The selection of most sites for such collections is mechanical, in that we know we want sites for the various US teams – each team sport has its own site, for example, then along with that site there will be various social media sites/channels.  In order to optimize the crawls, we nominate the social media separately. In addition to the team sites, we also chose a limited number of news media sites where the coverage of the Olympics seemed segregated from the rest of the site.”

New Zealand Olympic committee site for 2014


Nicola Bingham of the British Library, one of the project coordinators, emphasizes additional contributions to this project.  “The IIPC 2014 Winter Olympics project is being supported by the Internet Archive who are crawling the seeds (sites) and the University of North Texas who are supporting the nomination tool. A common subject scheme is being used to categorize websites according to producer type and Olympic sport. Crawling began in mid December 2013, and to date 745 seeds have been nominated by 17 IIPC member institutions.”

“The Internet Archive has taken on the role of crawling, without which the project would have been much more difficult. Many other IIPC members would not have been able to perform the crawling, not necessarily for technical reasons but due to legal and/or political considerations.”  For more about the web archiving process, see the IIPC “About Archiving” page.

As stated on the group’s Access Working Group page, “It is hoped that the project will enable institutions to continue to experiment with tools and processes that facilitate collaborative definition, collection and accessibility of web data.”

Over the next year or so, the IIPC will be working on creating wider access to all these Olympic archives.  For the latest updates on this and other IIPC projects, follow @netpreserve.

See other Olympics-related blog posts here at the Library, from Poetry and Literature, Teaching, and the Law Library.

Categories: Planet DigiPres

SCAPE QA Tool: Technologies behind Pagelyzer - II Web Page Segmentation

Open Planets Foundation Blogs - 12 February 2014 - 1:21pm

Web pages are getting more complex than ever. Thus, identifying different elements of web pages, such as the main content, menus, user comments, and advertising, becomes difficult. Web page segmentation refers to the process of dividing a web page into visually and semantically coherent segments called Blocks or Segments. Detecting these different blocks is a crucial step for many applications, for example content visualization on mobile devices, information retrieval, and change detection between versions in the web archive context.

Web Page Segmentation at a Glance

For a web page W, the output of segmentation is the semantic tree of the page, W'. Each node represents a data region in the web page, which is called a block. The root block represents the whole page. Each inner block is the aggregation of all its children blocks. All leaf blocks are atomic units and form a flat segmentation of the web page. Each block is identified by a block-id value (see Figure 1 for an example).

Fig. 1
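To make the tree structure concrete, here is a minimal Python sketch (illustrative only; the class and method names are ours, not from Block-o-Matic): each block carries a block-id, inner blocks aggregate their children, and the leaves form the flat segmentation.

```python
# Minimal sketch of a segmentation tree: each node is a block with a
# block-id; the root covers the whole page, leaves form the flat segmentation.
class Block:
    def __init__(self, block_id, children=None):
        self.block_id = block_id
        self.children = children or []

    def leaves(self):
        """Return the atomic (leaf) blocks, i.e. the flat segmentation."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

# A toy page: root B0 with a header block B1 and a content block B2
# that is further split into two leaf blocks.
page = Block("B0", [Block("B1"), Block("B2", [Block("B2.1"), Block("B2.2")])])
print([b.block_id for b in page.leaves()])  # ['B1', 'B2.1', 'B2.2']
```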

An efficient web page segmentation approach is important for several reasons:

  • Processing different parts of a web page according to their type of content.

  • Assigning more importance to some regions of a web page than to others.

  • Understanding the structure of a web page.

Pagelyzer is a tool containing a supervised framework that decides whether two web page versions are similar or not. Pagelyzer takes two URLs, two browser types (e.g. Firefox, Chrome) and one comparison type (image-based, hybrid or content-based) as input. If browser types are not set, it uses Firefox by default. The SVM-based comparison is discussed in the post SCAPE QA Tool: Technologies behind Pagelyzer - I Support Vector Machine. Based on the segmentation, hyperlinks are extracted from each block and the Jaccard distance between them is calculated.
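The block-level link comparison can be sketched in a few lines of Python (an illustration of the measure, not Pagelyzer's actual code): the Jaccard distance between two blocks' link sets is 0.0 when the sets are identical and 1.0 when they share nothing.

```python
# Sketch: Jaccard distance between the hyperlink sets of two corresponding
# blocks (0.0 = identical link sets, 1.0 = completely different).
def jaccard_distance(links_a, links_b):
    a, b = set(links_a), set(links_b)
    if not a and not b:          # two blocks with no links count as identical
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

old_block = ["/home", "/news", "/about"]
new_block = ["/home", "/news", "/contact"]
print(jaccard_distance(old_block, new_block))  # 0.5
```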

In this post, I will explain what web page segmentation does, especially for Pagelyzer: it provides information about the content of the web page.

Web page Segmentation Algorithm

Here we present the details of the Block-o-Matic web page segmentation algorithm used by Pagelyzer to perform the segmentation. It is a hybrid of the visual-based approach and the document processing approach.

The segmentation process is divided into three phases: analysis, understanding and reconstruction. It comprises three tasks: filtering, mapping and combining. It produces three structures: the DOM structure, the content structure and the logical structure. The main aspect of the whole process is producing these structures, where the logical structure represents the final segmentation of the web page.

The DOM tree is obtained from the rendering of a web browser. The result of the analysis phase is the content structure (Wcont), built from the DOM tree with the d2c algorithm. Mapping the content structure into a logical structure (Wlog) is called document understanding. This mapping is performed by the c2l algorithm with a granularity parameter pG. Web page reconstruction gathers the three structures (the Rec function):


W' = Rec(DOM, d2c(DOM), c2l(d2c(DOM), pG))


To integrate the segmentation outcome into Pagelyzer, an XML representation called ViDIFF is used. It represents hierarchically the blocks, their geometric properties, and the links and text in each block.


The Block-o-Matic algorithm is available online.

References

  • Structural and Visual Comparisons for Web Page Archiving. M. T. Law, N. Thome, S. Gançarski, M. Cord. 12th edition of the ACM Symposium on Document Engineering (DocEng), 2012.
  • Structural and Visual Similarity Learning for Web Page Archiving. M. T. Law, C. Sureda Gutierrez, N. Thome, S. Gançarski, M. Cord. 10th workshop on Content-Based Multimedia Indexing (CBMI), 2012.
  • Block-o-Matic: A Web Page Segmentation Framework. A. Sanoja and S. Gançarski. Paper accepted for oral presentation at the International Conference on Multimedia Computing and Systems (ICMCS'14), Morocco, April 2014.
  • Block-o-Matic: a Web Page Segmentation Tool and its Evaluation. Sanoja A., Gançarski S. BDA, Nantes, France, 2013.
  • Yet another Web Page Segmentation Tool. Sanoja A., Gançarski S. Proceedings iPRES 2012, Toronto, Canada, 2012.
  • Understanding Web Pages Changes. Pehlivan Z., Saad M.B., Gançarski S. International Conference on Database and Expert Systems Applications, DEXA (1) 2010: 1-15.

Preservation Topics: Software
Categories: Planet DigiPres

Considering Emulation for Digital Preservation

The Signal: Digital Preservation - 11 February 2014 - 6:01pm

There was a week in January 2014 where I participated in three meetings/events where emulation came up as a digital preservation solution. Emulation has really hit its stride, 20 years after I first heard about it.

An emulator is an environment that imitates the behavior of a computer or other electronic system.  In recent years, this has come to be known as a Virtual Machine, which is a recreated computer environment — from the operating system to the video drivers and software — that can be run in an interactive manner using current technology, including a web browser in some instances.

ASCII keyboard emulator for Apple I Replica, flickr user llemarie, some rights reserved.


I was very much a fan of collecting hardware for digital preservation, until I participated in the Library of Congress Preserving.exe meeting in May of 2013. I wrote about my own conversion to Team Emulation in an earlier post on this blog, and my colleague Bill Lefurgy responded to my post with a post of his own.  (That said, we still need vintage hardware to read older media to bring operating systems and software into emulation environments.)

I am again going to refer to the Olive Executable Archive from Carnegie Mellon University, the Multiple Arcade Machine Emulator, and the emscripten project.  I would also like to point out recent advances by the bwFLA project out of the University of Freiburg, which has reached the demo stage of its Emulation As A Service. I saw an impressive live demonstration of this project at CurateGear 2014.  For some background,  Dirk von Suchodoletz was interviewed here on The Signal in 2012.  And I cannot leave out the remarkable work by the JSMESS project to emulate computing and game environments in a browser environment porting the MESS Emulator to JavaScript.

There are a few key articles on this topic:

  • Granger, Stewart. “Emulation as a Digital Preservation Strategy.” D-Lib Magazine 6.10 (2000).
  • Guttenbrunner, Mark, and Andreas Rauber. “A measurement framework for evaluating emulators for digital preservation.” ACM Transactions on Information Systems (TOIS) 30.2 (2012): 14.
  • Rechert, Klaus, Dirk von Suchodoletz, and Randolph Welte. “Emulation based services in digital preservation.” Proceedings of the 10th annual joint conference on Digital libraries. ACM, 2010.
  • Rothenberg, Jeffrey. “The Emulation Solution.” Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation. Washington, DC: Council on Library and Information Resources, 1998. Council on Library and Information Resources.
  • Van der Hoeven, Jeffrey, Bram Lohman, and Remco Verdegem. “Emulation for digital preservation in practice: The results.” International journal of digital curation 2.2 (2008): 123-132.

Don’t let some of the early dates fool you – this issue was debated in just as lively a way 15 years ago as it is now.

Categories: Planet DigiPres

Let’s Start at the Very Beginning: Guiding Principles for Creating Born Digital Video

The Signal: Digital Preservation - 10 February 2014 - 4:24pm

The beginning is a very fine place to start indeed for the Federal Agencies Digitization Guidelines Initiative Born Digital Video subgroup of the Audio-Visual Working Group. As mentioned in a previous blog post, the FADGI Born Digital Video subgroup is taking a close look at the range of decisions to be made throughout the lifecycle of born digital video objects, from file creation through archival ingest and access delivery. Through case histories from federal agencies and institutions such as the National Archives and Records Administration, the Smithsonian Institution Archives, the National Oceanic and Atmospheric Administration, the Library of Congress, the Voice of America and the American Folklife Center, we are exploring the “truth and consequences” of creating and archiving born digital video. In this blog post, we’ll look at some of our guiding principles for creating born digital video.


Camera operator setting up the video camera by jsawkins. Photo courtesy of Flickr.


But as Julie Andrews sings, let’s start at the very beginning. What do we mean by born digital video? Quite simply, it’s video that is recorded to a digital file at the point of creation. Born digital video is distinct from digitized or reformatted video, a label used to describe the result of translating the analog signal data emanating from a video object into a digitally encoded format.  FADGI’s Reformatted Video subgroup is developing a matrix which compares target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes.


The first set of FADGI BDV case histories highlights what we call advice for shooters (a.k.a. videographers), and by extension, the project managers within cultural heritage institutions who are responsible for the creation of new born digital video files – especially determining the technical file specifications. It’s important to recognize that the FADGI target audience for these case histories isn’t Hollywood or commercial entertainment producers. It’s the cultural heritage community and smaller archives who create non-broadcast classes of content such as oral history recordings. A great example is the Civil Rights History Project at AFC.  These types of projects have the opportunity to spec out the born digital video deliverable from the very beginning and end up with a file that is ingest-ready for preservation and access systems.


The goal of the case histories project is to use guiding principles to illustrate the advantages of starting with high quality data capture from the very start. Two examples of FADGI’s guiding principles for creating born digital video include:


  • Create uncompressed video instead of compressed video. Compressed video reduces the amount of data in a file or stream. Although a reduced amount of data can be beneficial for easing storage, transfer, and play-out requirements, it generally introduces additional technical complexity which can have a negative impact on usability of the file over time. Uncompressed video retains all the visual information captured at the selected resolution, which is preferable for preservation purposes.


  • If compression is required, use lossless compression over lossy compression.  Lossless compression uses algorithms that allow the original data to be fully restored after decompression. It is essentially reversible compression. Lossy compression permanently alters or deletes some of the data. If data reduction gains are significant enough to warrant the added complexity of compressed files, lossless compression is preferred to preserve video quality.
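The reversibility of lossless compression is easy to demonstrate. The sketch below uses Python's zlib (a general-purpose lossless codec, standing in here for lossless video codecs) to show that decompression restores the original bytes exactly; lossy codecs offer no such guarantee.

```python
import zlib

# Lossless compression is reversible: decompressing restores the original
# bytes exactly, so a fixity check on the restored data still passes.
original = b"uncompressed video frame data " * 1000
compressed = zlib.compress(original)

assert len(compressed) < len(original)          # data reduction achieved
assert zlib.decompress(compressed) == original  # and fully reversible
print("round trip OK")
```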


Video camera by la_salebete. Photo courtesy of Flickr.


These are just two examples that focus on the video encoding. The guiding principles also cover considerations for file wrapper or container capabilities, format sustainability and more general project concerns.


But here’s the thing: our case histories don’t always follow our own guiding principles. And that’s just fine by us. None of us live in a utopian world where digital storage is abundant and systems are completely interoperable. We all have to make choices and compromises to work within our constraints. Uncompressed video files can be huge and a burden to manage and maintain. Lossy compression can be appropriate for certain projects. The guiding principles should all be read with the caveat “if you have the option….” Sometimes, you simply don’t have the option for a myriad of reasons. But when you do have the option, the guiding principles highlight the advantages of high quality data capture.  The important take-away from the case histories project is that the choices made during the file creation process will have impacts on the long-term archiving and distribution processes, and it’s essential to understand what those impacts are and have a plan to resolve any conflicts.


Our hope is that these guiding principles and case histories help us start to flesh out more specific format guidance for born digital video but that’s in the future. The case history project, which will be published on the Federal Agencies Digitization Guidelines Initiative website this spring, is the first step towards understanding where we are as a community and what we can learn from each other.

Categories: Planet DigiPres

Check Yourself: How and When to Check Fixity

The Signal: Digital Preservation - 7 February 2014 - 4:45pm

How do I know if a digital file/object has been corrupted, changed or altered? Further, how can I prove that I know what I have? How can I be confident that the content I am providing is in good condition, complete, or reasonably complete? How do I verify that a file/object has not changed over time or during transfer processes?


Please consider reading and commenting on this draft document.

In digital preservation, a key part of answering these questions comes through establishing and checking the “fixity” or stability of digital content. At this point, many in the preservation community know they should be checking the fixity of their content, but how, when and how often?
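In practice, checking fixity usually means computing a cryptographic digest when content is ingested and recomputing it later to confirm nothing has changed. A minimal stdlib-only Python sketch (the function names and manifest format are illustrative, not a recommendation from the draft document):

```python
import hashlib

def file_digest(path, algorithm="sha256", chunk=1 << 16):
    """Stream the file through the hash so large objects need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def make_manifest(paths):
    """Record one digest per file at ingest; store the manifest safely."""
    return {p: file_digest(p) for p in paths}

def check_fixity(manifest):
    """Recompute digests and flag any file whose content no longer matches."""
    return {p: file_digest(p) == digest for p, digest in manifest.items()}
```

How often to run something like check_fixity, and against which copies, is exactly the question the draft document tries to help organizations answer.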

A team of individuals from the NDSA Infrastructure & Standards working groups has developed Checking Your Digital Content: How, What and When to Check Fixity? in an effort to help stewards answer these questions in a way that makes sense for their organization based on its needs and resources. We are excited to publicly share this draft document for broader open discussion and review here on The Signal. We welcome comments and questions; please post them at the bottom of this post for the working group to review.

Not Best Practices, but Guidance for Making Best Use of Resources at Hand

In keeping with work on the NDSA Levels of Digital Preservation, this document is not a benchmark or requirement. It is instead intended as a tool to help organizations develop a plan that fits resource constraints. Different systems and different collections are going to require different fixity checking approaches, and our hope is that this document can help.

Connection to National Agenda for Digital Stewardship

This guidance was developed in direct response to a need articulated in the infrastructure section of the inaugural National Agenda for Digital Stewardship. I’ll quote it below at length for context.

Fixity checking is of particular concern in ensuring content integrity. Abstract requirements for fixity checking can be useful as principles, but when applied universally can actually be detrimental to some digital preservation system architectures. The digital preservation community needs to establish best practices for fixity strategies for different system configurations. For example, if an organization were keeping multiple copies of material on magnetic tape and wanted to check fixity of content on a monthly basis, it might end up continuously reading its tape and thereby very rapidly push its tape systems to the limit of reads for the lifetime of the medium.

There is a clear need for use-case-driven examples of best practices for fixity in particular system designs and configurations established to meet particular preservation requirements. This would likely include descriptions of fixity strategies for all-spinning-disk systems and largely tape-based systems, as well as hierarchical storage management systems. A chart documenting the benefits of fixity checks for certain kinds of digital preservation activities would bring clarity and offer guidance to the entire community. A document modeled after the NDSA Levels of Digital Preservation would be a particularly useful way to provide guidance and information about fixity checks based on storage systems in use, as well as other preservation choices.

Again, please share your comments on this here, and consider forwarding this on to others who you think might have comments to share with us.

Categories: Planet DigiPres

SCAPE QA Tool: Technologies behind Pagelyzer - I: Support Vector Machine

Open Planets Foundation Blogs - 7 February 2014 - 1:15pm


The Web is constantly evolving over time. Web content such as text and images is updated frequently. One of the major problems encountered by archiving systems is understanding what happened between two different versions of a web page. We want to underline that the aim is not to compare two different web pages like this (however, the tool can also do that):




but web page versions:




An efficient change detection approach is important for several issues:


  • Crawler optimization, by deciding on the fly whether a page should be crawled.

  • Discovering new crawl strategies, e.g. based on patterns.

  • Quality assurance for crawlers, for example by comparing the live version of a page with the just-crawled one.

  • Detecting format obsolescence as technologies evolve, for example checking whether web pages render visually identically across different browsers or browser versions.

  • Archive maintenance, since operations like format migration can change the rendering of archived versions.

Pagelyzer is a tool built around a supervised framework that decides whether two web page versions are similar or not. Pagelyzer takes as input two urls, two browser types (e.g. firefox, chrome) and one comparison type (image-based, hybrid or content-based). If the browser types are not set, it uses firefox by default.


It is based on two different technologies:


1 – Web page segmentation (let's keep the details for another blog post)

2 – Supervised Learning with Support Vector Machines (SVM).


In this blog post, I will try to explain simply (without any equations) what SVM does, specifically for Pagelyzer. You have two urls, let's say url1 and url2, and you would like to know if they are similar (1) or dissimilar (0).


You calculate the distance (or similarity) as a vector based on the comparison type. If it is image-based, your vector will contain features related to images (e.g. SIFT, HSV). If it is content-based, your vector will contain features for text similarities (e.g. Jaccard distance for links, images and words). To better explain how it works, let's assume that we have two dimensions (two features): one is SIFT and the other is HSV. Both are color descriptors.
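To make one of these features concrete, here is a Jaccard-style comparison of the sets of links extracted from two page versions (a stdlib-only sketch; the link sets are invented, and Pagelyzer's actual feature extraction is more involved):

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|; 1.0 means identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical link sets extracted from two versions of a page.
links_v1 = {"/home", "/about", "/news"}
links_v2 = {"/home", "/about", "/contact"}

score = jaccard_similarity(links_v1, links_v2)  # 2 shared / 4 distinct = 0.5
```

The Jaccard distance mentioned above is simply 1 minus this similarity.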


To make your system learn, you must first provide annotated data. In our case, we need a list of url pairs <url1,url2> annotated manually as similar or not similar. For Pagelyzer, this dataset is provided by the Internet Memory Foundation (IMF). With a part of your dataset (ideally 1/3) you train your system; with the other part you test your results.



Let's start training:



First, you put all your vectors in input space.


As this data is annotated, you know which pairs are similar (in green) and which are dissimilar (in red).


You find the optimal decision boundary (hyperplane) in input space. Anything above the decision boundary should have label 1 (similar). Similarly, anything below the decision boundary should have label 0 (dissimilar).



Let's classify:



Your system is intelligent now! When you have a new pair of urls without any annotation, based on the decision boundary you can say whether they are similar or not.

The pair of urls in blue will be considered dissimilar; the one in orange will be considered similar by Pagelyzer.


When you choose different types of comparison, you choose different types of features and dimensions. The current version of Pagelyzer uses the results of an SVM trained with 202 pairs of web pages provided by IMF, 147 in the positive class and 55 in the negative class. As it is a supervised system, increasing the training set size will generally lead to better results.
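To show the train-then-classify mechanics without any external libraries, here is a deliberately simplified stand-in for the SVM: a perceptron learning a linear decision boundary on invented 2-D similarity vectors (a real SVM finds the maximum-margin boundary, and Pagelyzer's features and training data differ):

```python
# Toy annotated pairs: (SIFT-like score, HSV-like score) -> 1 similar, 0 dissimilar.
train = [((0.90, 0.80), 1), ((0.80, 0.90), 1), ((0.95, 0.85), 1),
         ((0.20, 0.10), 0), ((0.10, 0.30), 0), ((0.25, 0.20), 0)]

w, b, lr = [0.0, 0.0], 0.0, 0.1  # weights, bias, learning rate

def predict(x):
    """Side of the decision boundary: w.x + b > 0 means 'similar'."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Training: nudge the boundary towards each misclassified example.
for _ in range(100):
    for x, label in train:
        err = label - predict(x)
        w[0] += lr * err * x[0]
        w[1] += lr * err * x[1]
        b += lr * err

# The learned boundary now separates the annotated data.
assert all(predict(x) == label for x, label in train)
```

A new, unannotated pair is then classified simply by computing which side of the boundary its feature vector falls on.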

An image to show what happens when you have more than two dimensions:




Preservation Topics: Web Archiving, Tools, SCAPE, Software
Categories: Planet DigiPres

Saving Digital Mementos from Virtual Worlds

The Signal: Digital Preservation - 6 February 2014 - 7:40pm

The United States Capitol Replica by rodgermourtagh2

My two young teenage daughters spend hours playing Minecraft, building elaborate virtual landscapes and structures. They are far from alone; the game has millions of fans around the world. Teachers are seizing on Minecraft’s popularity with kids as a tool to teach both abstract and concrete subjects. What’s unique about this situation is not so much the product itself as the fact that a virtual world is functioning as both a fun, engaging activity and a viable teaching tool. We’re witnessing the birth of a new genre of tools and a new set of challenges for preserving the digital creations people build with those tools.

Like most parents, I save many of the things that my daughters create. From where I’m sitting in my home as I write this blog post, I can see their works dotting the room. On one wall is a framed pencil sketch one daughter drew of our family; on a shelf is a perfect clay replica she made of Moomintroll. Hanging above a window are drawings my other daughter did — a Sharpie drawing of tree houses and a pen doodle of kaleidoscopic patterns that disappear into a tunnel-like vanishing point. Huge snowflakes (no two alike) that they cut from paper dangle here and there around the room.

I never gave much thought to their virtual gaming activities, aside from monitoring how much time they spend on their electronic devices. But I like that Minecraft lets my kids invent universes and play inside them together and I can tell that it feeds an important part of their intellectual growth as they make things, investigate things and solve problems. So I decided that I’d like to save what I can of the worlds they create, just as I save the rest of their crafts and artwork, which raised questions about what I can save, how I can save it and why I would even want to save it.

Over the last decade, the Library of Congress and its NDIIPP and NDSA partners have led the research into preserving virtual worlds, from military simulations to consumer games. Many of the questions – technological and philosophical – have long been asked and answered, or at least the challenges have been identified and defined. That’s fine for institutions that recognize the cultural value of virtual worlds and have the resources to archive them, but what does it mean for a parent who just wants to save his or her kid’s virtual world creations?

A colleague at the Library of Congress, Trevor Owens, is part of the ongoing research on preserving virtual worlds and preserving software. In fact, Owens is one of the organizers of the preserving software conference. He said that the solution to the question of saving something from virtual worlds depends on whether you want to save:

  • the virtual world that you or someone else built
  • testimony about what the virtual world meant to you or them at a particular time
  • or documentation of the virtual world.

Preserving the virtual world itself is the most difficult and challenging option. The complexities of preserving virtual worlds are too much to go into in this blog post. And when it comes to talking about networked virtual worlds inhabited by live human participants, the subject often gets downright esoteric, like defining where “here” actually is and what “here” means in a shared virtual world and how telepresence applies to the virtual world experience. But to illustrate the basic technological dilemma of preserving a virtual world, here’s a simple example.

Let’s say I build an island, castle and estate in a virtual world and name it Balmy Island. If I want to save Balmy Island and be able to walk around it anytime I want to, I need all the digital files of which Balmy Island is constructed. I might need the exact version of the application or software that I used to build Balmy Island, as well as the exact operating system — and version of the OS — of the hardware device on which I built Balmy Island. And I might need the hardware device itself on which I created Balmy Island. So if I build Balmy Island on my computer, I have to preserve the computer, the software and the files just as they are. Never upgrade or modify anything. Just stick the whole computer in the closet, buy a new computer and pull out the old one whenever I wanted to revisit Balmy Island.

Another less-certain and less-authentic option is that I could save the Balmy Island files and hope that someday someone will build an emulator that will restore some approximate version of my original Balmy Island. It will not be exactly the same, but it might be close enough.

Saving the hardware and software for just this one purpose is unrealistic for the average person but for cultural institutions it makes perfect sense. Stanford University is the home of the Stephen M. Cabrinety Collection in the History of Microcomputing and it is also building a Forensics Lab with a library of software and electronic devices for extracting software from original media, so that it can be run later in native or emulated environments. Similar labs at other institutions include the Maryland Institute for Technology in the Humanities, the International Center for the History of Electronic Games at the Strong National Museum of Play and the UT Videogame Archive at the Dolph Briscoe Center for American History, University of Texas at Austin. The Briscoe Center was featured in the Signal post about video game music composer George Sanger. (Dene Grigar, who was the subject of another Signal blog post, created a similar lab devoted to her vintage electronic literature collection at Washington State University, Vancouver)


Henry Lowood. Photo from Stanford University.

Henry Lowood, curator for History of Science & Technology Collections and Film & Media Collections in the Stanford University Libraries, was a lead in the Preserving Virtual Worlds project. Lowood has a historical interest in games, virtual worlds and their role in society, and he makes a case for the option of recording testimony about what a virtual world means to its users and builders.

Lowood helped create the Machinima and Virtual Worlds collections, which are hosted by our NDIIPP/NDSA partner, the Internet Archive. These collections host video recordings of activities and events in virtual worlds and immersive games. As the users perform actions and navigate through the worlds, they sometimes give a running commentary about what is happening and their thoughts and observations about its meaning to them.

A parent or teacher could use this same approach by shooting a video of a child giving you a tour of their virtual world. It’s an opportunity to capture the context around their creation of the worlds and for them to tell you how they felt about it and what choices they made. If they interact with others in a shared virtual world, the child can describe his or her interactions and maybe even relate anecdotes about certain events and experiences.

The third option, saving documentation of the virtual world, is by far the easiest. Take screenshots and motion capture videos and save those with your other digital mementos.

Screenshots are easy to take on computers and most hand-held devices. PCs have a “print screen” button on the keyboard; for Macs, hold down the Apple key ⌘ plus shift plus 3. For iPods, press and hold the main button below the screen and the power button on the top edge of the device at the same time. And so on. Search online for how to take screen shots or screen captures for your device.

The screenshot will save as a graphic file, usually a JPEG or PNG. Transfer that file to your computer, and crop or modify it with a photo-processing program if you want. Maybe print the screenshots and put them on the refrigerator for you to admire. When you’re finished with the digital photo file, back it up with your other personal digital archives.

Recording a walkthrough of a virtual world can be a slightly more complex task than taking a screenshot, but not terribly so. Search online for “screencast software,” “motion capture” or “screen recording” to find commercial and freeware screencast software. Even version 10 of the QuickTime player includes a screen recording function. They all operate pretty much the same way: click a “Record” button, do your action on the computer and click “Stop” when you are finished. Everything that was displayed on the screen will be captured into a video file.

With the different screen-capture software programs, be aware of the video file type that the software generates. QuickTime saves the video as an MOV file, Jing saves it as an SWF file, and so on. Different file types require different digital video players, so if you have any difficulty playing the file back on your computer, search online to find software that will play your video file type. If you upload a copy of your video to YouTube, back up a master copy somewhere else. Don’t rely on the YouTube version as your master “archived” copy.

Although this story is about the challenges of saving mementos from digital virtual worlds, the essence of the challenge — trying to preserve an experience — is not new. If I go to Hawaii, snorkel, build sand castles and have the time of my life, I cannot capture or hold onto that experience. I can only document the experience with photos, video and maybe write in a journal about it. In a way, it even goes back to the dawn of humanity, where people recorded their experiences by means of cave paintings.

So you cannot capture the experience of a virtual world but you can document it. And virtual worlds are a lot more accessible in 2014 than they were in 1990. It’s a long way from Jaron Lanier‘s work, from VPL labs and data gloves and headsets and exclusive access in special labs. Kids now carry their personalized virtual worlds in their handheld devices. Minecraft is just the current cool tool. Who can tell what is yet to come?


Howard Rheingold. Photo by Joi Ito.

It seems appropriate to let Howard Rheingold have the last word on the subject. Rheingold is a writer, teacher, social scientist and thought-leader about the cultural impacts of technology. He is also an authority on virtual reality and virtual communities, having written the definitive books about both topics over twenty years ago. His current book is titled NetSmart.

In addition to his professional expertise, Rheingold is a caring father who dotes on his daughter. While he was researching and writing the books Virtual Reality (1991) and The Virtual Community: Homesteading on the Electronic Frontier (1993), his office walls were filled with her childhood artwork (she is now in her 20s). He brings a unique and authoritative perspective to this story.

Rheingold said, “I’ve been closely observing and writing about innovations in digital media and learning in recent years – and experiencing/experimenting directly through the classes I teach at Stanford and Rheingold U. Among my activities in this sphere is a video blog for DMLcentral, a site sponsored by the MacArthur Foundation’s Digital Media and Learning Initiative. It was there that I delved into the educational uses – and students and teachers’ passion for – Minecraft.

“In my interviews with teachers Liam O’Donnell and Sara Kaviar, it became clear that Minecraft was about much more than using computers to build things. It was a way to engage with a diverse range of abstract subject matter in concrete ways, from comparative religion to mathematics, and more importantly, a way for students to exercise agency in a schooling environment in which so much learning is dependent on what the teacher or textbook says.

“Minecraft artifacts are also important contributions to student e-portfolios, which will become more important than resumes in the not too distant future. Given the growing enthusiasm over Minecraft by students, teachers, and parents, and the pedagogical value of seeing these creations as artifacts and instruments of learning, it only makes sense to make it easy and inexpensive to preserve virtual world creations.”

Categories: Planet DigiPres

February Issue of Library of Congress Digital Preservation Newsletter Now Available

The Signal: Digital Preservation - 6 February 2014 - 5:38pm

The February issue of the Library of Congress Digital Preservation Newsletter (pdf) is now available!

Included in this issue:

  • Spotlight on Digital Collections, including an interview with Lisa Green on Machine Scale Analysis of collections, and a look at the Cultural Heritage of the Great Smoky Mountains
  • Digital Preservation Aid in Response to Tornado
  • NDSA Digital Content Area:  Web and Social Media
  • Wikipedia and Digital Preservation
  • AV Artifact Atlas, FADGI interview with Hanna Frost
  • Several updates on the Residency Program
  • Listing of upcoming events including the IDCC (Feb 24-27), Digital Maryland conference (March 7), Computers in Libraries (April 7-10), Personal Digital Archiving 2014 (April 10-11)
  • And other articles about data, preservation of e-serials, and more.

To subscribe to the newsletter, sign up here.

Categories: Planet DigiPres

Call for Proposals: Digital Preservation 2014

The Signal: Digital Preservation - 5 February 2014 - 9:11pm

We’ve started planning our annual meeting, Digital Preservation 2014, which will be held July 22-24 in the Washington, DC area, and we want to hear from you!  Any organization or individual with an interest in digital stewardship can propose ideas for potential inclusion in the meeting.


Lisa Green of Common Crawl addresses participants at Digital Preservation 2013. Credit: Shealah Craighead

The Library of Congress has hosted annual meetings with digital preservation partners, collaborators and others committed to stewardship of digital content for the past ten years.  The meetings have served as a forum for sharing achievements in the areas of technical infrastructure, innovation, content collection, standards and best practices and outreach efforts.

This year we’ve expanded participation from NDSA member organizations on the program committee. We’re delighted to have NDIIPP staff and NDSA members working together to contribute to the success of the meeting.

Digital Preservation 2014 Program Committee

  • Vickie Allen, PBS Media Library
  • Meghan Banach Bergin, University of Massachusetts Amherst
  • Erin Engle, NDIIPP
  • Abbie Grotke, NDIIPP
  • Barrie Howard, NDIIPP
  • Butch Lazorchak, NDIIPP
  • Vivek Navale, U.S. National Archives and Records Administration
  • Michael Nelson, Old Dominion University
  • Trevor Owens, NDIIPP
  • Abbey Potter, NDIIPP
  • Nicole Scalessa, The Library Company of Philadelphia

Call for Proposals

We are looking for your ideas, accomplishments and project updates that highlight, contribute to, and advance the community dialog.  Areas of interest include, but are not limited to:

  • Scientific data and other content at risk of obsolescence, and what methods, techniques, and tools are being deployed to mitigate risk;
  • Innovative methods of digital preservation, especially regarding sustainable practices, community approaches, and software solutions;
  • Collaboration successes and lessons learned highlighting a wide-range of digital preservation activities, such as best practices, open source solutions, project management techniques and emerging tools;
  • Practical examples of research and scholarly use of stewarded data or content;
  • Educational trends for emerging and practicing professionals.

You are invited to express your interest in any of the following ways:

  • Panels or presentations
  • 5-minute lightning talks
  • Demonstrations
  • Posters

A highlight of this past year was the release of the 2014 National Digital Stewardship Agenda at Digital Preservation 2013. The Agenda integrates the perspectives of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development. It suggests a number of important sets of issues for the digital stewardship community to consider prioritizing for development. We’d be particularly interested for you to share projects your organization has undertaken in the last year that address any of the issues listed in the Agenda.

To be considered, please send 300 words or less describing what you would like to present to ndiipp [at] by March 14. Those submitting accepted proposals will be notified on or around April 3.

The last day of the meeting, July 24, will be a CURATEcamp, which will take place off-site from the main meeting venue. The topic focus of this camp is still under discussion, so stay tuned for more information about the camp in the coming weeks.

Please let us know if you have any questions.  Your contributions are important in making this a community program and we’re looking forward to your participation.

Categories: Planet DigiPres

EDRMS across New Zealand’s Government – Challenges with even the most managed of records management systems!

Open Planets Foundation Blogs - 4 February 2014 - 5:21am
A while back I wrote a blog post, MIA: Metadata. I highlighted how difficult it was to capture certain metadata without a managed system - without an Electronic Document and Records Management System (EDRMS). I also questioned if we were doing enough with EDRMS by way of collecting data. Following that blog we sought out the help of a student from the local region’s university to begin looking at EDRMS systems, to understand what metadata they collected, and how to collect additional ‘technical’ metadata using the tools often found in the digital preservation toolkit.    Sarah McKenzie is a student at Victoria University. She has been working at Archives New Zealand on a 400 Hour, Summer Scholarship Programme that takes place during the university’s summer-break. Our department submitted three research proposals to the School of Engineering and Computer Science and out of them Sarah selected the EDRMS focussed project. She began work in December and her scholarship is set to be completed mid-February.  To add further detail, the title and focus of the project is as follows: Mechanism to connect the tools in the digital preservation toolset to content management and database systems for metadata extraction and generation Electronic and document records management systems (EDRMS) are the only legitimate mechanism for storing electronic documents with sufficient organisational context to develop archival descriptions but are not necessarily suited at the point of the creation of a record to store important technical information. Sitting atop database management technology we are keen to understand mechanisms of generating this technical metadata before ingest into a digital archive.  We are keen to understand the challenge of developing this metadata from an EDRMS and DBMS perspective where it is appreciated that mechanisms of access may vary from system to another. 
In the DBMS context, technical metadata and contextual organisational metadata may be entirely non-existent.  With interfaces to popular characterization tools biased towards that of the file system it is imperative that we create mechanisms to use tools central to the preservation workflow in alternative ways. This project will ask students to develop an interface to EDRMS and DBMS systems that can support characterization using multiple digital preservation tools. Metadata we’re seeking to gather includes format identification, characterisation reports along with other such data as SHA-1 checksums. Tools typical to the digital preservation workflow include DROID, JHOVE, FITS and TIKA.The blog continues with Sarah writing for the OPF on behalf of Archives New Zealand. She provides some insight into her work thus far, and insight into her own methods of research and discovery within a challenging government environment. EDRMS Systems An EDRMS is a system for controlling, and tracking the creation of documents from the point they are made through publication and possibly even destruction. They function as a form of version control for text documents, providing a way to accomplish a varying range of tasks in the management of documents. Some examples of tasks an EDRMS can perform are: •  Tracking creation date•  Changes and publication status•  Keeping a record of who has accessed the documents. EDRMS stores are the individual databases of documents that are maintained for management. They are usually in a proprietary format, and interfacing directly with them means having access to the appropriate Application Layer Interface (API) and Software Development Kit (SDK). In some cases these are merged together requiring only one package. The actual structure of the store varies from system to system. Some use the directory structure that is part of the computer's file system and then have an interface from there. Others utilise a database for storing the documents. 
Most EDRMS are running client/server architecture. Currently Archives New Zealand has dealt with three different EDRMS stores:  •  IBM Notes (formerly called Lotus Notes)•  Objective•  Summation ‘Notes’ has a publically available API and the latest version is built in Java, allowing for ease of use with metadata extraction tools, used in the digital preservation community - The majority I have found to be written in Java. There are many EDRMS systems, and it's simply not possible to code a tool enabling our preservation toolkit to interact with all of them without a comprehensive review of all New Zealand government agencies and their IT suites.  A survey has been partially completed by Archives New Zealand. The large number of systems suggested a more focused approach in my research project, i.e. a particular instance of EDRMS, over multiple systems.  Gathering Information on Systems in Use Within New Zealand, The Office of the Government Chief Information Officer (OGCIO) had already conducted a survey of electronic document management systems currently used by government agencies. This survey did not cover all government agencies, but with 113 agencies replying it was considered a large enough sample to understand the most widely used systems across government. Out of the 113, some  agencies did not provide any information, leaving only 69 cases where a form of EDRMS was explicitly named. These results were then turned into an alphabetical table listing: •  EDRMS names•  The company that created them•  Any notes on the entry•  A list of agencies using them In addition to the information provided by the OGCIO survey, some investigative work was done in looking through the records of the Archives' own document management system to find any reference to other EDRMS in use across government. Other active EDRMS systems were uncovered. 
For the purposes of this research it was assumed that if an agency has ever used a given EDRMS, it is still relevant to the work of Archives New Zealand, and considered ‘in-use’ until it is verified that there are no more document stores from that particular system which remain not archived, migrated to a new format, or destroyed. Obstacles were encountered in the process of converting the information into a reference table useful for this project. Some agencies provided the names of companies that built their EDRMS. This is understandable to some extent, since there has been a vanity in the software industry where companies name their flagship product after the company (or vice versa). However, in some cases it was difficult to discern what was meant because the company that made the original software had been bought out and their product was still being sold by the new owner under the same name – or the name had been turned into a brand for an arm of the new parent company which deals with all their EDRMS software (e.g. Autonomy Corporation has now become HP Autonomy, Hewlett-Packard's EDRMS branch).  In addition, sometimes there were multiple software packages for document management with the same name. While it was possible to deduce what some of these names meant, it was not possible to find all of them. In these cases the name provided by the agency was listed with a note explaining it was not possible to conclude what they meant, and some suggestions for further inquiry. Vendor acquisitions were listed to provide a path through to newer software packages that possibly have compatibility with the old software, and also provide a way to quickly track down current owners of an older piece of software. The varying needs of different agencies means there is no one-size-fits-all EDRMS system (e.g. a system designed for legal purposes may offer specialised features one for general document handling wouldn't have). 
But since there has been no overarching standard for EDRMS (it was assumed that agencies would make their own choices based on their business needs), there turned out to be a large number of systems in use, some of them obscure or old. The oldest system that could reasonably be verified as having been used was a 1990s version of a program originally created in the late 1980s, called Paradox. This was in the process of being upgraded, and its data migrated to a system called Radar, when the document mentioning it was written, but there was no clear note of this having been completed. At the time of writing it had been established that approximately 44 EDRMS were ‘in-use’. With 44 systems in use it was considered unfeasible to investigate automating metadata extraction from all of them at this time, so some boundaries were set as starting points. One boundary was: which EDRMS is the most used? The most common, according to the information gathered, looked to be Microsoft SharePoint, with perhaps 24 agencies using it, while Objective Corporation's Objective was associated with at least 12 agencies. A second way to view this was to ask: which systems have been recommended for use going forward? Archives New Zealand's parent department, The Department of Internal Affairs (DIA), has created a three-supplier panel for providing enterprise content management solutions to government agencies.
Those suppliers are:

  • Intergen
  • Open Text
  • Team Informatics

With two weeks remaining in the scholarship, and work already completed to connect a number of digital preservation tools together in a middle abstraction layer to provide a broad range of metadata for our digital archivists, it was decided that testing of the tool (that is, connecting it to an EDRMS and extracting technical metadata) would be best done on a working, in-use EDRMS from the proposed DIA supplier panel, one that would continue to add value to Archives New Zealand's work moving into the future.

Getting Things Out of an EDRMS

The following tools were considered a good set to start examining extraction of metadata from files:

  • DROID
  • ExifTool
  • JHOVE
  • National Library of New Zealand Metadata Extractor Tool (NLMET)
  • Tika

The tools are linked together via a Java application that uses each tool's command-line API to run them in turn. The files are identified first by DROID, and then each tool is run over the file to produce a collection of all available metadata in Comma Separated Values format. This showed that the tools extract information in different ways (date formatting, for instance, is not consistent) and that some tools can read data others cannot. For example, due to a character encoding issue, a particular PDF's Title, Author and Creator fields were not readable in JHOVE but were read correctly in Tika and NLMET; conversely, JHOVE still extracts information those tools do not. When a tool sends its output to standard out it is a simple matter of working with the text output as it is fed back to the calling function from the process. In some cases a tool produces an output file which had to be read back in. In the case of the NLMET, a handler for its XML format had to be built. Since the XML schema had separate fields for the date and time of creation and modification, the opportunity was taken to collate those into two single date-time fields so they would better fit into a schema.
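That date/time collation step can be sketched like this (a minimal illustration in Python rather than the Java of the actual unifier; the element names `CreatedDate` and `CreatedTime` are invented for the example, not NLMET's real schema):

```python
import xml.etree.ElementTree as ET

def collate_datetime(xml_text, date_tag, time_tag):
    """Merge a tool's separate date and time elements into one date-time string."""
    root = ET.fromstring(xml_text)
    date = root.findtext(date_tag)
    time = root.findtext(time_tag)
    if date is None or time is None:
        return None  # the extractor did not report both halves
    return f"{date}T{time}"

# Hypothetical extractor output with split date/time fields
sample = """<metadata>
  <CreatedDate>2014-01-23</CreatedDate>
  <CreatedTime>14:05:09</CreatedTime>
</metadata>"""

print(collate_datetime(sample, "CreatedDate", "CreatedTime"))  # 2014-01-23T14:05:09
```

The same handler can be pointed at the modification fields to produce the second collated date-time.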
The goal with the collated outputs is to have domain experts check over them to verify which tools produce the information they want. Once that is done, a schema specifying which piece of data to take from which tool can be introduced to the program, so that it can create and populate the Archives metadata schema for the files it analyses. The ideal goal for this tool is to connect it to an EDRMS via an API layer, enabling the extraction of metadata from the files within a store without having to export the files. For that purpose the next stage in this research is to set up a test example of one of DIA's proposed EDRMS solutions and try to access it with the tool unifier. It is hoped that this will provide an approach that can be applied to other document management systems moving forward.

Preservation Topics: Identification, Characterisation, Tools
Categories: Planet DigiPres

Developing an Audio QA workflow using Hadoop: Part II

Open Planets Foundation Blogs - 3 February 2014 - 11:35am

First things first. The GitHub repository with the Audio QA workflows is here: And version 1 is working. 'Version' is really the wrong word here; I should call it Workflow 1, which is this one:


To sum up, this workflow performs migration, conversion and content comparison. The top left box (a nested workflow) migrates a list of mp3s to wav files in a Hadoop map-reduce job using the command-line tool FFmpeg, and outputs a list of migrated wav files. The top right box converts the same list of mp3s to wav files in another Hadoop map-reduce job using the command-line tool mpg321, and outputs a list of converted wav files. The Taverna workflow then puts the two lists of wav files together, and the bottom box receives a list of pairs of wav files to compare. The bottom box compares the content of the paired files in a Hadoop map-reduce job using the xcorrSound waveform-compare command-line tool, and outputs the results of the comparisons.
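The pairing step in the middle of the workflow amounts to matching each migrated wav with the converted wav derived from the same source mp3. A sketch of that matching logic (plain Python for illustration only; the real workflow does this in Taverna, and the file paths here are invented):

```python
import os

def pair_for_comparison(migrated, converted):
    """Pair each FFmpeg-migrated wav with the mpg321-converted wav
    that came from the same source mp3, keyed on the file's base name."""
    by_stem = {os.path.splitext(os.path.basename(p))[0]: p for p in converted}
    pairs = []
    for m in migrated:
        stem = os.path.splitext(os.path.basename(m))[0]
        if stem in by_stem:
            pairs.append((m, by_stem[stem]))
    return pairs

migrated = ["out/ffmpeg/a.wav", "out/ffmpeg/b.wav"]
converted = ["out/mpg321/b.wav", "out/mpg321/a.wav"]
print(pair_for_comparison(migrated, converted))
```

Each resulting pair is then handed to a waveform-compare task.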

What we would like to do next is:

  • "Reduce" the output of the Hadoop map-reduce job using the waveform-compare commandline tool
  • Do an experiment on 1TB input mp3 files on the SB Hadoop cluster, and write an evaluation and a new blog post ;-)
  • Extend the workflow with property comparison. The waveform-compare tool only compares sound waves; it does not look at the header information. This should be part of a quality assurance of a migration. The reason this is not top priority is that FFprobe property extraction and comparison is very fast, and will probably not affect performance much...
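The property comparison in that last point could be as simple as diffing two FFprobe-style property dictionaries (a sketch with invented field values; properties expected to change in an mp3-to-wav migration, such as the codec, are simply left out of the checked fields):

```python
def compare_properties(src, dst, fields):
    """Return the checked fields whose values differ between source and target."""
    return {f: (src.get(f), dst.get(f)) for f in fields if src.get(f) != dst.get(f)}

# Hypothetical FFprobe-style output for an mp3 and its migrated wav
src = {"sample_rate": "44100", "channels": "2", "codec_name": "mp3"}
dst = {"sample_rate": "44100", "channels": "1", "codec_name": "pcm_s16le"}

# The codec is expected to change, so only rate and channel count are checked
print(compare_properties(src, dst, ["sample_rate", "channels"]))  # {'channels': ('2', '1')}
```

An empty result means the headers agree on everything the QA cares about.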
Preservation Topics: Preservation Actions, Migration, SCAPE
Categories: Planet DigiPres

The Latest from the NDSR: Presenting at ALA Midwinter

The Signal: Digital Preservation - 31 January 2014 - 3:56pm

The following is a guest post by Julia Blase, National Digital Stewardship Resident at the National Security Archive.


Julia Blase and Emily Reynolds present on “Developing Sustainable Digital Archive Systems.” Photo by Jaime McCurry.

In case you hadn’t heard, the ALA Midwinter Meeting took place in Philadelphia last weekend, attended by around 12,000 librarians and exhibitors. If you didn’t attend, or didn’t have friends there to take notes for you, the Twitter hashtag #alamw14 has it covered – enough content for days of exploration! If you’d like to narrow your gaze, and in the theme of this post, you could refine your search for tweets containing both #alamw14 and #NDSR, because the National Digital Stewardship Residents were there in force, attending and presenting.

Sessions Attended

Emily Reynolds, the Resident at the World Bank, was so kind as to compile a list of the sessions we aimed to attend before the conference. On Saturday, though none of us made it to every event, at least a few of us were at the Preservation Administrators Interest Group, Scholarly Communications Interest Group, Digital Conversion Interest Group, Digital Special Collections Discussion Group and Challenges of Gender Issues in Technology sessions.

The first session I attended, along with Lauren Work and Jaime McCurry, was the Digital Conversion Interest Group session, where we heard fantastic updates on audiovisual digital conversion practices and projects from the American Folklife Center, the American Philosophical Society Library, Columbia University Libraries and George Blood Audio and Video. I particularly enjoyed hearing about the successful APS effort to digitize audio samples of Native American languages, many of which are endangered, and about the positive reaction from the Native community. For audio, it seemed, sometimes digitization is the best form of preservation!

The second session I attended, with Emily Reynolds and Lauren Work, was the Gender Issues in Technology discussion group (see news for it at #libtechgender). We were surprised, but pleased, at the number of attendees and quality of the discussion around ways to improve diversity in the profession. Among the suggestions we heard were to include diverse staff members on search committees, to monitor the language within your own organization when you review candidates to ensure that code words like “gravitas” (meaning “male,” according to the panelists) aren’t being used to exclude groups of candidates, to put codes of conduct into place to help remind everyone of a policy of inclusiveness, and to encourage employees to respond positively to mentorship requests, especially from members of minority groups (women, non-white, not traditionally gendered). The discussion seemed to us to be a piece of a much larger, evolving, and extended conversation that we were glad to see happening in our professional community!

NDSR Presentations


Erica Titkemeyer presents on preserving time-based media art at the Smithsonian. Photo by Julia Blase.

On Sunday, though a few of us squeezed in a session or two, our primary focus was our individual project update presentations, given at the Digital Preservation Interest Group morning session, and also our extended project or topic presentations at the Library of Congress booth in the early afternoon. The individual presentations, I'm pleased to say, went very well! It would be impossible to recap each presentation here; however, many of us have posted project updates recently, so please be sure to look us up for details. Furthermore, searching Twitter for #alamw14 and #NDSR brings you to this list, in which you can find representative samples of the highlights from our individual presentations.

Presentations – Question and Answer Session

We concluded the session by taking some questions, all of which were excellent – particularly the one from Howard Besser, who wanted to know how we believed our projects (or any resident or fellowship temporary project) could be carried on at the conclusion of our project term. The general response was that we are doing our best to ensure they are continued by integrating the projects, and ourselves, into the general workflows of our organizations – keeping all stakeholders informed from an early stage of our progress, finding support from other divisions, and documenting all of our decisions so that any action may be picked up again as easily as possible.

We also had an excellent question about how important networking had been for the success of our projects, and all agreed that, while networking with the D.C. community has been essential (through our personal efforts and also through groups like the DCHDC meetup), almost more significant has been our ability to network with each other – to share feedback, resources, documents, websites, and connections to other networks, which has helped us accomplish our goals more efficiently and effectively. One of the goals of the NDSR program was, of course, to help institutions get valuable work done in the area of digital stewardship, which we are all doing. However, another goal was for the program to help build a professional community in digital stewardship. What is a community if not a group of diverse professionals who trust and rely on each other, who share successes and setbacks, resources and networks, and who support each other as we learn and grow? Though the language is my own, the sentiment is one I heard shared between us over and over during the ALA weekend.

NDSR Recent Activity

In recent news, Emily Reynolds and Lauren Work both discuss their take on our ALA experience, Emily’s here and Lauren’s here. Molly Swartz published some pictures and thoughts on ALA Midwinter over here. Jaime McCurry recently interviewed Maureen McCormick-Harlow about her work at the National Library of Medicine. And to conclude, I’ve recently posted two updates on my project, one on this page and another courtesy of the Digital Libraries Federation.

Thanks for listening, and be sure to tune in two weeks from now when Maureen McCormick-Harlow will be writing another NDSR guest post. If you, like us, were at ALA Midwinter last weekend, I hope you found it as enjoyable as we did!


Categories: Planet DigiPres

Why can't we have digital preservation tools that just work?

Open Planets Foundation Blogs - 31 January 2014 - 12:58pm

One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!

FITS 0.8

Fast-forward 2.5 years: this week I saw the announcement of the latest FITS release. This got me curious, also because of the recent work on this tool as part of the FITS Blitz. So I downloaded FITS 0.8, installed it in a directory called c:\fits\ on my Windows PC, and then typed (while in directory f:\myData\):


Instead of the expected helper message I ended up with this:

The system cannot find the path specified. Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits

Hang on, I've seen this before ... don't tell me this is the same bug that I already reported 2.5 years ago ? Well, turns out it is after all!

This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of DROID, JHOVE2 and Fido. As I was on a roll anyway, I gave JHOVE a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to run each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running Windows 7 with Java version 1.7.0_25. Here are the results.

DROID 6.1.3

First I installed DROID in a directory C:\droid\. Then I executed it using:


This started up a Java Virtual Machine Launcher that showed this message box:

The Running DROID text document that comes with DROID says:

To run DROID on Windows, use the "droid.bat" file. You can either double-click on this file, or run it from the command-line console, by typing "droid" when you are in the droid installation folder.

So, no progress on this for DROID either, then. I was able to get DROID running by circumventing the launcher script like this:

java -jar c:\droid\droid-command-line-6.1.3.jar

This resulted in the following output:

No command line options specified

This isn't particularly helpful. There is a help message, but to see it you have to give the -h flag on the command line; and you don't find out about the -h flag until you've seen the help message. Catch 22, anyone?
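The convention being violated here is decades old: a command-line tool invoked with no arguments (or with -h) should print its usage rather than a bare error. A minimal sketch of that behaviour (a generic illustration, not DROID's actual code):

```python
import sys

USAGE = """Usage: mytool [options] FILE...
  -h, --help    show this help message"""

def respond(argv):
    """Return usage text when called with no arguments or -h/--help;
    otherwise report what would be processed."""
    if not argv or argv[0] in ("-h", "--help"):
        return USAGE
    return f"processing {len(argv)} file(s)"

if __name__ == "__main__":
    print(respond(sys.argv[1:]))
```

With this pattern a first-time user always gets pointed toward the help, whatever they type first.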


JHOVE2

After installing JHOVE2 in c:\jhove2\, I typed:


This gave me 1393 (yes, you read that right: 1393!) Java deprecation warnings, each along the lines of:

16:51:02,702 [main] WARN TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor] found through deprecated global PropertyEditorManager fallback - consider using a more isolated form of registration, e.g. on the BeanWrapper/BeanFactory!

This was eventually followed by the (expected) JHOVE2 help message, and a quick test on some actual files confirmed that JHOVE2 does actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!

Fido 1.3.1

Fido doesn't make use of any launcher scripts any more, and the default way to run it is to use the Python script directly. After installing in c:\fido\ I typed:


Which resulted in ..... (drum roll) ... a nicely formatted Fido help message, which is exactly what I was hoping for. Beautiful!

JHOVE 1.11

I installed JHOVE in c:\jhove\ and then typed:


Which resulted in this:

Exception in thread "main" java.lang.NoClassDefFoundError: edu/harvard/hul/ois/jhove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultConfigFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.ConfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more


Final remarks

I limited my tests to a Windows environment only, and results may well be better under Linux for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even execute on today's most widespread operating system. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude "Hey, this tool doesn't work", give up, and move on to other things. Moreover, this means that much of the (often huge) amounts of development effort that went into these tools will simply fail to reach its potential audience, and I think this is a tremendous waste. I'm also wondering why there's been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that just work?

Preservation Topics: Identification, Characterisation, Tools, SCAPE
Categories: Planet DigiPres

Data: A Love Story in the Making

The Signal: Digital Preservation - 30 January 2014 - 5:45pm

Here’s a simple experiment that involves asking an average person two questions. Question one is: “how do you feel about physical books?” Question two is: “how do you feel about digital data?”


“I Love Data” She Wept, by bixentro, on Flickr

The first question almost surely will quickly elicit warm, positive exclamations about a life-long relationship with books, including the joy of using and owning them as objects. You may also hear about the convenience of reading on an electronic device, but I’ll wager that most people will mention that only after expounding on paper books.

The second question shifts to cooler, more uncertain ground. The addressee may well appear baffled and request clarification. You could help the person a bit by specifying digital materials of personal interest to them, such as content that resides on their tablet or laptop. “Oh, that stuff,” they might say with measured relief. “I’m glad it’s there.”

These divergent emotional reactions should be worrying to those of us who are committed to keeping digital cultural heritage materials accessible over time. Trying to make a case for something that lacks emotional resonance is difficult, as marketing people say. Most certainly, the issue of limited resources is a common refrain when it comes to assessing the state of digital preservation in cultural heritage institutions; see the Canadian Heritage Information Network’s Digital Preservation Survey: 2011 Preliminary Results, for example.

The flip side is that traditional analog materials are a formidable competitor for management resources because those materials are seen in a glowing emotional context. I don’t mean to say that analog materials are awash in preservation money; far from it. But physical collections still have to be managed even as the volume of digital holdings rapidly rises, and efforts to move away from reliance on the physical are vulnerable to impassioned attack by people such as Nicholson Baker.

What is curious is that even as we collectively move toward an ever deeper relationship with digital, there remains a strong nostalgic bond with traditional book objects. A perfect example of this is a recent article, Real books should be preserved like papyrus scrolls. The author fully accepts the convenience and the future dominance of ebooks, and is profoundly elegiac in his view of the printed word. But, far from turning away from physical books, he declares that “books have a new place as sacred objects, and libraries as museums.” One might see this idea as one person’s nostalgic fetish, but it’s more than that. We can only wonder how long and to what extent this kind of powerful, emotionally-propelled thinking will drive how cultural heritage institutions operate, and more importantly, how they are funded.

As I’ve written before, we’re at a point where intriguing ideas are emerging about establishing a potentially deeper and more meaningful role for digital collections. This is vitally important, as a fundamental challenge that lies before those who champion digital cultural heritage preservation is how to develop a narrative that can compete in terms of personal meaning and impact.

Categories: Planet DigiPres

SCAPE survey on preservation monitoring. Participate now!

Open Planets Foundation Blogs - 30 January 2014 - 10:05am

Anyone willing to preserve digital content must be aware of events that might constitute a relevant risk. In SCAPE we are developing tools that will allow you to detect risks before they cause any irreversible damage.

Help us understand the preservation events, threats and opportunities you find most relevant, and the ways you would like us to detect them.

Participate in our survey and help us develop tools that will let you automatically detect problems in your own content, and events that might put it at risk.

The survey has 30 short questions that should take about 10 minutes to complete.

Join the survey now!

Preservation Topics: Preservation Strategies, Preservation Risks, Bit rot, Format Registry, Representation Information, SCAPE
Categories: Planet DigiPres

Machine Scale Analysis of Digital Collections: An Interview with Lisa Green of Common Crawl

The Signal: Digital Preservation - 29 January 2014 - 8:11pm

Lisa Green, Director of Common Crawl

How do we make digital collections available at scale for today’s scholars and researchers? Lisa Green, director of Common Crawl, tackled this and related questions in her keynote address at Digital Preservation 2013. (You can view her slides and watch a video of her talk online.) As a follow up to ongoing discussions of what users can do with dumps of large sets of data, I’m thrilled to continue exploring the issues she raised in this insights interview.

Trevor: Could you tell us a bit about Common Crawl? What is your mission, what kinds of content do you have and how do you make it available to your users?

Lisa: Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data that is available for everyone to access and analyze. We believe that the web is an incredibly valuable dataset capable of driving innovation in research, business and education, and that the more people who have access to this dataset, the greater the benefit to society. The data is stored on public cloud platforms so that anyone with access to the internet can access and analyze it.

Common Crawl invites users to get started by starting a machine image, building examples, and joining in on discussions.


Trevor: In your talk, you described the importance of machine scale analysis. Could you define that term for us and give some examples of why you think that kind of analysis is important for digital collections?

Lisa: Let me start by describing human scale analysis. Human scale analysis means that a person ingests information with their eyes and then processes and analyzes it with their brain. Even if several people, or even hundreds of people, work on the analysis, it is not as fast as a computer program can ingest, process and analyze information. Machine scale analysis is when a computer program does the analysis. A computer program can analyze data millions to billions of times faster than a human. It can run 24 hours a day with no need for rest, and it can run simultaneously on multiple machines.

Machine scale analysis is important for digital collections because of the massive volume of data in most digital collections. Imagine that a researcher wanted to study the etymology of a word and planned to use a digital collection to answer questions such as:

  • What is the first occurrence of this word?
  • How did the frequency of occurrence change over time?
  • What type of publication did it first appear in?
  • When did it first appear in other types of publications and how did the types of publications it appeared in change over time?
  • What other words most commonly appear in the same sentence, paragraph or page with the word and how did that change over time?

Answering such questions using human scale analysis would take lifetimes of man hours to search the collection for the given word. Machine scale analysis could retrieve the information in seconds or minutes. And if the researcher wanted to change the questions or criteria, only a small amount of effort would be required to alter the software program; the program could then be rerun and return the new information in seconds or minutes. If we want to optimize the extraction of knowledge from the enormous amounts of data in digital collections, human analysis is simply too slow.
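The core of such an analysis is a short program. A toy sketch over a tiny invented corpus (a real collection would run the same logic in parallel over millions of documents):

```python
from collections import Counter

def word_history(corpus, word):
    """corpus is a list of (year, text) pairs. Returns the first year the
    word occurs and its occurrence count per year."""
    per_year = Counter()
    for year, text in corpus:
        per_year[year] += text.lower().split().count(word.lower())
    years_seen = sorted(y for y, n in per_year.items() if n > 0)
    return (years_seen[0] if years_seen else None), dict(per_year)

corpus = [
    (1990, "the web is young"),
    (1995, "crawl the web and index the web"),
    (2001, "archives of the web grow"),
]

print(word_history(corpus, "web"))  # (1990, {1990: 1, 1995: 2, 2001: 1})
```

Changing the research question means editing a few lines and rerunning, which is exactly the flexibility human scale analysis cannot offer.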

Trevor: What do you think libraries, archives and museums can learn from Common Crawl’s approach?

Lisa: I think it is of crucial importance to preserve data in a format that can be analyzed by computers. For instance, if material is stored as a PDF, it is difficult, and sometimes impossible, for software programs to analyze the material, and therefore libraries, archives and museums will be limited in the amount of information that can be extracted from the material in a reasonable amount of time.

Trevor: What kind of infrastructure do you think libraries, archives and museums need to have to be able to provide capability for machine scale analysis? Do you think they need to be developing that capacity on their own systems or relying on third party systems and platforms?

Lisa: The two components are storage and compute capacity. When one thinks of digital preservation, storage is always considered but compute capacity often is not. Storage is necessary for preservation, and the type of storage system influences access to the collection. Compute capacity is necessary for analysis. Building and maintaining the infrastructure for storage and compute can be expensive, so it doesn’t make much financial sense for each organization to develop it on their own.

One option would be a collaborative, shared system built and used by many organizations. This would allow the costs to be shared, avoid duplicative work and the storing of duplicate material, and, perhaps most importantly, maximize the number of people who have access to the collections.

Personally I believe a better option would be to utilize existing third party systems and platforms. This option avoids the cost of developing custom systems and often makes it easier to maintain or alter the system as there is a greater pool of technologists familiar with the popular third party platforms.

I am a strong believer in public cloud platforms because there is no upfront cost for the hardware, no need to maintain or replace hardware, and one only pays for the storage and compute that is used. I think it would be wonderful to see more libraries, museums and archives storing copies of their collections on public cloud platforms in order to increase access. The most interesting use of your data may be thought of by someone outside your organization, and the more people who can access the data, the more minds can work to find valuable insight within it.

Categories: Planet DigiPres

Interface, Exhibition & Artwork: Geocities, Deleted City and the Future of Interfaces to Digital Collections

The Signal: Digital Preservation - 28 January 2014 - 6:36pm

In 2009, a band of rogue digital preservationists called Archive Team did their best to collect and preserve Geocities. The resulting data has become the basis for at least two works of art: Deleted City and One Terabyte of Kilobyte Age. I think the story of this data set and these works offers insights into the future roles of cultural heritage organizations and their collections.

Let Them Build Interfaces


Screenshot of “One Terabyte of Kilobyte Age Photo Op.”

In short, Archive Team collected the data and made the dataset available for bulk download. If you like, you can also just access the 51,000 MIDI music files from the data set via the Internet Archive. Beyond that, because the data was available en masse, the corpus of personal websites became the basis for other works. Taking the Geocities data as a basis, Richard Vijgen’s Deleted City interprets and presents an interface to the data, and Olia Lialina & Dragan Espenschied’s One Terabyte of Kilobyte Age is, in effect, a designed reenactment grounded in an articulated approach to accessibility and authenticity.

An Artwork as the Interface to Your Collection

Some of the most powerful ways to interact with the Geocities collection are through works created by those who have access to the collection as a dataset. Working with digital objects means we don’t need to define in advance the way they will be accessed or made available. By making the raw data available on the web, and providing a point of reference for the data set, everyone is enabled to create interfaces to it.

How to make available digital collections and objects?


Deleted City, used with permission from Richard Vijgen

Access remains the burning question for cultural heritage organizations interested in the acquisition and preservation of digital artifacts and collections. What kinds of interfaces do they need in place to serve what kinds of users? If you don’t know in advance how you will make something available, what can you do with it? I’ve been in discussions with staff from a range of cultural heritage organizations who don’t really want to wade too deep into acquiring born-digital materials without having a plan for how to make them available.

The story of Geocities, Archive Team and these artists suggests that if you can make the data available, you can invite others to invent the interfaces. If users can help figure out and develop modes of access, as illustrated in this case, then cultural heritage organizations could potentially invite much larger communities of users to help figure out issues around migration and emulation as modes of access as well. By making the content broadly available, organizations broaden the network of people who might contribute to efforts to make digital artifacts accessible into the future.

Collections and Interfaces Inside and Outside

Deleted City Neighborhood Interface. Used with permission of Richard Vijgen.

An exciting model can emerge here. Through data dumps of full sets of raw data, cultural heritage organizations can embrace the fact that they don’t need to provide the best interface, or for that matter much of any interface at all, for digital content they agree to steward. Instead, an organization can acquire materials or collections it considers interesting and important, but for which it doesn’t necessarily have the resources or inclination to build sophisticated interfaces, so long as it is willing to simply provide canonical homes for the data, offer information about the data’s provenance, and invest in dedicated ongoing bit-level preservation. This approach resonates quite strongly with a “More Product, Less Process” approach to born-digital archival materials.
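The core technical commitment in this model, ongoing bit-level preservation, amounts to recording checksums when data arrives and re-verifying them over time. A minimal sketch of that fixity workflow (the function names here are my own, not from any particular repository system):

```python
import hashlib


def record_fixity(paths, algorithm="sha256"):
    """Compute a checksum manifest for a set of files.

    Storing a manifest like this alongside a raw data dump lets the
    repository (or any downstream user) verify bit-level integrity
    without needing any interface beyond the files themselves.
    """
    manifest = {}
    for path in paths:
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        manifest[str(path)] = digest.hexdigest()
    return manifest


def verify_fixity(manifest, algorithm="sha256"):
    """Return the files whose current checksum no longer matches the
    recorded one -- i.e. candidates for bit rot or tampering."""
    changed = []
    for path, recorded in manifest.items():
        digest = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        if digest.hexdigest() != recorded:
            changed.append(path)
    return changed
```

Running `verify_fixity` on a schedule is roughly what “dedicated ongoing bit-level preservation” means in practice, and it costs far less than building and maintaining a discovery interface.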

An Example: 4Chan Collection/Dataset @ Stanford

A screenshot of the 4chan collection available in Stanford’s Digital Repository.

For a sense of what it might look like for a cultural heritage organization to do something like this, we need look no further than a recent Stanford University Library acquisition. The acquisition of a collection of 4chan data into Stanford’s digital repository shows how a research library could go about exactly this sort of activity. The page for the dataset briefly describes the structure of the data and offers some context about the collector who offered it to Stanford. Stanford acts as the repository and makes the data available for others to explore, manipulate and create a multiplicity of interfaces to. How will others explore or interface with this content? Only time will tell. In any event, it likely did not take many resources to acquire, and it will likely not require much to maintain at a basic level into the future.

How to encourage rather than discourage this?

If we wanted to encourage this kind of behavior, how would we do it? First off, I think we need more data dumps of this kind of data, with the added note that bite-sized downloadable chunks are going to be the easiest thing for any potential user to right-click and save to their desktop. Beyond that, cultural heritage organizations could embrace this example and put up prizes and bounties for artists and designers to develop and create interfaces to different collections.
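Splitting a dump into bite-sized pieces is itself trivial, which is part of the argument: the barrier to this kind of access is institutional, not technical. A minimal sketch, with function names of my own invention:

```python
def split_dump(data: bytes, chunk_size: int) -> list[bytes]:
    """Split a large raw data dump into fixed-size pieces that are
    easy to download individually (the last piece may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def reassemble(chunks: list[bytes]) -> bytes:
    """Concatenate the pieces back into the original dump."""
    return b"".join(chunks)
```

Published alongside a checksum for the full dump, pieces like these let users with slow or unreliable connections fetch a collection one manageable file at a time.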

What I think is particularly exciting here is that by letting go of the requirement to provide the definitive interface, cultural heritage organizations could focus more on selection and on ensuring the long-term preservation and integrity of data. Who knows, some of the interfaces others create might be such great works of art that another cultural heritage organization features them in its own database of works.


Categories: Planet DigiPres

FITS website

File Formats Blog - 28 January 2014 - 6:17pm

Last spring, I attended a Hackathon at the University of Leeds, which resulted in my getting a SPRUCE Grant for a month’s work enhancing FITS, a tool which at the time was technically open source but which the Harvard Library treated a bit possessively. After I finished, it seemed for a while that nothing was happening with my work, but it was just a matter of being patient enough. Collaboration between Harvard and the Open Planets Foundation has resulted in a more genuinely open FITS, which now has its own website. There’s also a GitHub repository with five contributors, none of whom are me, since my work was on an earlier repository that was incorporated into this one.

It really makes me happy to see my work reach this kind of fruition, even if I’m so busy on other things now that I don’t have time to participate.

Tagged: FITS, Harvard, Open Planets Foundation, preservation, software
Categories: Planet DigiPres