Planet DigiPres

Content Matters Interview: The Montana State Library, Part One

The Signal: Digital Preservation - 5 December 2013 - 7:50pm

Diane Papineau. Photo credit: Patty Ceglio

In this installment of the Content Matters interview series of the National Digital Stewardship Alliance Content Working Group we’re featuring an interview with Diane Papineau, a geographic information systems analyst at the Montana State Library.

Diane was kind enough to answer questions, in consultation with other MSL staff and the state librarian, Jennie Stapp, about the MSL’s collecting mission, especially in regards to their geospatial data collections.

This is part one of a two-part interview. The second part will appear tomorrow, Friday, December 6, 2013.

Butch: Montana is a little unusual in that the geospatial services division of the state falls under the Montana State Library. How did this come about and what are the advantages of having it set up this way?

Diane: In addition to a traditional role of supporting public libraries and collecting state publications, the Montana State Library (MSL) hosts the Natural Resource Information System (NRIS), which is staffed by GIS Analysts.

NRIS was established by the Montana Legislature in 1983 to catalog the natural resource and water information holdings of Montana state agencies. In 1987, NRIS gained momentum (and funding) from the federal Environmental Protection Agency and Montana Department of Health and Environmental Sciences to support their mining clean-up work on the Superfund sites along the Clark Fork River between Butte and Missoula. This project generated a wealth of GIS data such as work area boundaries, contaminated area locations, and soil sampling sites, which NRIS used to make a multitude of maps for reports and project management. Storing the data and resulting maps at MSL made sense because it is a library and therefore a non-regulatory, neutral agency. Making the maps and data available via a library democratized a large collection of timely and important geographic information and minimized duplication of effort.

GIS was first employed at NRIS in 1987; from that point forward, NRIS functioned as the state’s GIS data clearinghouse, generating and collecting GIS data. NRIS operated for a decade essentially as a GIS service bureau for state government; during this period, NRIS grew into a comprehensive GIS facility, unique among state libraries. In fact, in the mid-1990s, NRIS participated in the first national effort to provide automated search and retrieval of map data. Today, beyond data clearinghouse activities, MSL is involved with state GIS Coordination as well as GIS leadership and education. We also are involved with data creation or maintenance for 10 of the 15 framework datasets (cadastral, transportation, hydrography, etc.) for Montana, and also host a GIS data archive, thanks to our participation as a full partner in the Geospatial Multistate Archive and Preservation Partnership (GeoMAPP)—a project of the National Digital Stewardship Alliance (NDSA).

Butch: Give us an example of some of the Montana State Library digital collections. Any particularly interesting digital mapping collections?

Diane: Our most important digital geographic collection is the full collection of GIS clearinghouse data gathered over the past 25 years. The majority of this data is “born digital” content made available for download and other types of access via our Data List. Within that collection, one of our most sought-after datasets is the Montana Cadastral framework—a statewide dataset of private land ownership illustrated by tax parcel boundaries. The dataset is updated monthly and is offered for download and as a web map service for desktop GIS users and online mapping. We have stored periodic snapshots of this dataset as it has changed through time and we also serve the most recent version of the data via the online Montana Cadastral map application. The map application makes this very popular data accessible to those without desktop GIS software or training in GIS. Another collection to note is our Clark Fork River Superfund site data, which may prove invaluable at some point in the future.

In terms of an actual digital map series, our Water Supply/Drought maps come to mind. For at least 10 years now, NRIS has partnered with the Montana Department of Natural Resources and Conservation (DNRC) to create statewide maps illustrating the soil moisture conditions in Montana by county. DNRC supplies the data; NRIS creates the map and maintains the website that serves the collection of maps through time.

Butch: Tell us a bit about how the collection is being (or might be) used. To what extent is it for the general public? To what extent is it for scholars and researchers?

Diane: Our GIS data collection serves the GIS community in Montana and beyond. Users could be GIS practitioners working on land management issues or city/county planning for example. Other collections, such as our land use and land cover datasets and our collection of aerial photos, may be of particular interest to researchers.  The general public also utilizes this data; because of phone inquiries we receive, we know that hunters, for example, frequently access the cadastral data in order to obtain landowner permission to hunt on private lands. Though we don’t track individual users due to requirements of library confidentiality, we know that the uses for this collection are virtually limitless.

The general public can access much of the geographic data we serve by using our online mapping applications. For example, patrons can use the Montana Cadastral application that I mentioned plus tools like our Digital Atlas to see GIS datasets for their area of interest. They can use our Topofinder to view topographic maps online or to find a place when, for example, all that’s known is the location’s latitude and longitude. In 2008, in partnership with the Montana Historical Society, we published the Montana Place Names Companion—an online map application that helps patrons to learn the name origin and history of places across the state.

Butch: What sparked the Montana State Library to join the National Digital Stewardship Alliance?

Diane: While we’ve played host to this large collection of GIS data and we have long been recognized as the informal GIS data archive for the state, we had yet to maintain an inventory of our holdings. Thankfully, we never threw data out.

We realized that in order to gain physical and intellectual control over this collection of current and superseded data, we needed to modernize our approach. The timing couldn’t have been better because it coincided with the concluding phase of GeoMAPP.  In 2010 MSL participated as an Information Partner, beginning our exposure to formal GIS data archiving issues. Then in 2011, MSL joined GeoMAPP as the project’s last Full Partner. This partnership permitted us to envision applying archivists’ best practices while we reworked and modernized our data management processes.

In some ways we were the GeoMAPP “guinea pig” and we are grateful for that role—so much research had already been done by the other partners and so much information was already available. In return, what MSL could offer to this group was the perspective of three important GeoMAPP target audiences: libraries, archives, and GIS shops.

Butch: Tell us about some of the archiving practices that the Montana State Library has defined as a result of its partnership with GeoMAPP and the National Digital Stewardship Alliance. Why is preservation important for GIS data?

Diane: I’ll start with the “why.” GIS data creation is expensive. By preserving geographic data via archiving, we store that investment of time and money. GIS data is often used to create public policy. Montana has incredibly strong “right to know” laws so preserving data that was once available to decision makers supports later inquiry about current laws and policies. Furthermore, making superseded data discoverable and accessible promotes historically-informed public policy decisions, wise land use planning, and effective natural disaster planning to name just a few use cases. From a state government perspective, the published GIS datasets created by state agencies are considered state publications. Our agency is statutorily mandated to preserve state publications and make them permanently accessible to the public.

To guide us in this modernization, MSL developed data management standards, policies, and procedures that require data preservation using archivists’ best practices. I’ll discuss a few highlights from these standards that illustrate our particular organizational needs as a GIS data collector and producer.

In order to appeal to the greater GIS community in Montana, we decided to use more GIS-friendly terms in place of the three “package” terms from the OAIS model. We think of a Submission Information Package (SIP) as “working data,” a Dissemination Information Package (DIP) as a Published Data Package, and an Archival Information Package (AIP) as an Archive Data Package.

MSL chose to take a “library collection development policy” approach to managing a GIS data collection rather than a “records management” approach, which makes use of records retention schedules. What this means is we’re on the lookout for data we want to collect—appraisal happens at the point of collection. If we take the data, we both archive it (creating an AIP) and make DIPs at the same time. The archive is just another data file repository, though a special one with its own rules. If the data acquired is not quite ready for distribution, we modify it from a SIP (our “working data”) to make it publishable. We do not archive the SIP.


Montana State Library Data Collection Management Flow

We’re employing the library discipline’s construct of series and collections and their associated parent/child metadata records, which is new to the GIS group here at MSL. In turn, that decision influenced the file structure of our archive. Though ISO topic categories were GeoMAPP suggestions for both data storage as well as for data discovery, MSL chose instead to organize archive data storage by the time period of content unless the data is part of a series (e.g. cadastral) or if it was generated as part of a discrete project and is considered a collection (e.g. the Superfund data). Additional consistency and structure should also come from the use of a new file naming convention (<extent><theme><timeframe>).

MSL is archiving data in its original formats rather than converting all data to an archival format (e.g. shapefile) because each data model offers useful spatial characteristics that we did not want to strip from the archived copy. For archive data packaging, we use the Library of Congress tool “Bagger” and we specifically chose to zip all the associated files together before “bagging” to save space in the archive. Zipping the data also permits us to produce one checksum for the entire package, which simplifies dataset management and dataset integrity checking in the workflow. We decided not to use Bagger’s zip function for this because the resulting AIP had an excessively deep file structure, burying the data in multiple folder levels. To document the AIP in our data management system, we’ve established new archive metadata fields such as date archived, checksum, data format, and data format version.
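The zip-then-checksum step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not MSL's actual workflow or the Bagger tool itself: the function names, the sample file names and the choice of SHA-256 are all assumptions, since the interview does not say which checksum algorithm is used.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(src_dir: Path, zip_path: Path) -> Path:
    """Zip every file under src_dir into one archive (hypothetical helper).

    zip_path should live outside src_dir so the archive does not include itself.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(src_dir.rglob("*")):
            if f.is_file():
                # Store paths relative to the dataset root to keep the zip shallow.
                zf.write(f, f.relative_to(src_dir).as_posix())
    return zip_path


def package_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return one SHA-256 hex digest for the whole package (algorithm assumed)."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def demo() -> None:
    """Build a tiny stand-in dataset, zip it, and print its single checksum."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        src = root / "dataset"
        src.mkdir()
        # Made-up placeholder files standing in for a multi-file GIS dataset.
        for name in ("parcels.shp", "parcels.shx", "parcels.dbf"):
            (src / name).write_bytes(name.encode())
        package = zip_dataset(src, root / "dataset.zip")
        print(package.name, package_checksum(package))


if __name__ == "__main__":
    demo()
```

Because the dataset's many component files become a single zip, one stored digest is enough to re-verify the whole package later; the resulting zip could then be handed to a bagging tool.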

Part two of this interview will appear tomorrow, Friday, December 6, 2013.

Categories: Planet DigiPres

Happenings in the Web Archiving World

The Signal: Digital Preservation - 4 December 2013 - 6:11pm

Recently, the world of web archiving has been a busy one. Here are some quick updates:

  • The National Library of Estonia released the Estonian Web Archive to the public. This is of particular note because the Legal Deposit Law in Estonia allows the archive to be publicly accessible online. If you read Estonian you can browse the 1003 records that make up the 1.6 TB of data in the archive. A broad crawl of the entire Estonian domain is planned in 2014.
  • Ed Summers from the Library of Congress gave the keynote address at the National Digital Forum in New Zealand titled The Web as  Preservation Medium. Ed is a software developer and offers a great perspective into some technical aspects of preserving the Web. He covers the durability of HTML, the fragility of links, how preservation is interlaced with access, the importance of community action and the value of “small data”.
  • The International Internet Preservation Consortium 2014 General Assembly will be held at the Bibliothèque nationale de France in Paris, May 19-23, 2014. There is still a little time to submit a proposal to speak at the public event on May 19th titled Building Modern Research Corpora: the Evolution of Web Archiving and Analytics.

Call for Proposals announcement from the IIPC:

Libraries, archives and other heritage or scientific organizations have been systematically collecting web archives for over 15 years. Early stages of web archiving projects were mainly focused on tackling the challenges of harvesting web content, trying to capture an interlinked set of documents, and to rebuild its different layers through time. Institutions, especially those on a national level, were also defining their legal and institutional mandates. Meanwhile, approaches to web studies developed and influenced researchers’ and academics’ use of web archives. New requirements have emerged. While the objective of building generic collections remains valid, web archiving institutions and researchers also need to collaborate in order to build specific corpora – from the live web or from web archives.

At the same time, “surfing the web the way it was” is no longer the only way of accessing archived web content. Methods developed to analyse large data sets – such as data or link mining – are applicable to web archives. Web archive collections can thus be a component of major humanities and social sciences projects and infrastructures. With relevant protocols and tools for analysis, they will provide invaluable knowledge of modern societies.

This conference aims to propose a forum where researchers, librarians, archivists and other digital humanists will exchange ideas, requirements, methods and tools that can be used to collaboratively build and exploit web archive corpora and data sets. Contributions are sought that will present:

  • models of collaboration between archiving institutions and researchers,
  • methods and tools to perform data analytics on web archives,
  • examples of studies performed on web archives,
  • alternative ways of archiving web content.

Abstracts (no longer than one page) should be sent to Peter Stirling (peter dot stirling at bnf dot fr) by Friday December 9, 2013. Full details are available at the IIPC website.

Categories: Planet DigiPres

Week 48: A SCAPE Developer Short Story

Open Planets Foundation Blogs - 4 December 2013 - 10:35am

It's been two weeks since the internal SCAPE developer workshop in Brno, Czech Republic. It was a great workshop. We had a lot of presentations and demos, and were brought up to date on what's going on in the other corners of the SCAPE project. We also had some (loud) discussions, but I think we came to some good agreements on where we as developers are going next. And we started a number of development and productisation activities. I came home with a long list of things to do next week (this ended up not at all being what I did last week, but I still have the list, so next week, fingers crossed). Tasks for week 48:

  • xcorrSound
    • make versioning stable and meaningful (this I looked at together with my colleague in week 48)
    • release new version (this one we actually did)
    • finish writing nice microsite
    • tell my colleague to finish writing small website, where you can test the xcorrSound tools without installing them yourself
    • write unit tests
    • introduce automatic rpm packaging?
    • finish xcorrSound Hadoop job
    • do the xcorrSound Hadoop Testbed Experiment
      • Update the corresponding user story on the wiki
      • Write the new evaluation on the wiki
    • finish the full Audio Migration + QA Hadoop job
    • do the full Audio Migration + QA Hadoop Testbed Experiment
      • Update the corresponding user story on the wiki
      • Write the new evaluation on the wiki
    • write a number of new blog posts about xcorrsound and SCAPE testbed experiments
    • new demo of xcorrsound for the SCAPE all-staff meeting in February
  • SCAPE testbed demonstrations
    • define the demos that we at SB are going to do as part of testbed (this one we also did in week 48; the actual demos we'll make next year)
  • FITS experiment (hopefully not me, but a colleague)
  • JPylyzer experiment (hopefully not me, but a colleague)
  • Mark FFprobe experiment as not active
  • ... there are some more points for the next months, but I'll spare you...

So what did I do in week 48? Well, I sort of worked on the JPylyzer experiment, which is on the list above. In the Digital Preservation Technology Development department at SB we are currently working on a large scale digitized newspapers ingest workflow including QA. As part of this work we run JPylyzer from Hadoop on all the ingested files, and then validate a number of properties using Schematron. These properties come from the requirements to the digitization company, but in SCAPE context these properties should come from policies, so there is still some work to do for the experiment. But running JPylyzer from Hadoop, and validating properties from the JPylyzer output using Schematron now seems to work in the SB large scale digitized newspapers ingest project :-)

And for now I'll put week 50 on the above list, and when I have finished a sufficient number of bullet points I'll blog again! This post is missing links, so I hope you can read it without.

Preservation Topics: SCAPE
Categories: Planet DigiPres

Digital Preservation Pioneer: Gary Marchionini

The Signal: Digital Preservation - 3 December 2013 - 8:27pm

Gary Marchionini. Photo by University of North Carolina at Chapel Hill.

In 1971, Gary Marchionini had an epiphany about educational technology when he found himself competing with teletype machines for his students’ attention.

Marchionini was teaching mathematics at a suburban Detroit junior high school the year that the school acquired four new teletype machines. The machines were networked to a computer, so a user could type something into a teletype and the teletype would transmit it to the computer for processing.

The school teletypes accessed “drill and practice” programs. The paper-based teletype would print a math problem, a student would type in the answer, wait patiently for the response over the slow, primitive network and eventually the teletype would print out, “Good” (if it was correct).

“The thing was noisy,” said Marchionini. “But the kids still wanted to leave my math classroom to go do this in the closet. There was something about this clickety clackety paper-based terminal that attracted them.

“Eventually I realized that there were two things going on. One was personalization; each kid was getting his own special attention. The other thing was interactivity; it was back and forth, back and forth with the kids. It was engaging.

“That’s what sparked my interest in computer interaction as a line of research.”

That interest became a lifelong mission for Marchionini. He went on to get his Masters and Doctorate in Math Education and Educational Computing from Wayne State University. He quit teaching public school in 1978, joined the faculty at Wayne State and trained teachers in computer literacy.

In 1983, Marchionini joined the faculty at the University of Maryland College of Library and Information Services; he also joined the Human-Computer Interaction Laboratory.

“It was easy to make the transition from education to library and information services because I always thought of information retrieval as a learning function,” said Marchionini. “The goal of my work was always to enhance learning. And information seeking, from a library perspective… well, people are learning. It could be casual or it could be critical but they are trying to learn something new.”

Marchionini’s research encompassed information science, library science, information retrieval, information architecture and human/computer interaction…interface research. He was especially keen on the power of graphics to help people visualize and conceptualize information, and to help people interact with computers to find that information. In fact, as early as 1979, before the explosion of graphic interfaces on personal computers, Marchionini was coding rudimentary graphic representations on his own.

“One of my projects [in 1979] involved addition ‘grouping’ and subtraction ‘regrouping’ – borrowing and carrying and all that stuff,” said Marchionini. “I wrote a computer program that graphically showed that process as a bundling and unbundling of little white dots on a Radio Shack screen.”

Marchionini is quick to point out that graphics were only a part of his interface research, and there is a time and a place for graphics and for declarative text in human/computer interaction. He said that the challenge for researchers was to determine the appropriate function of each.

One interface project that he worked on at UMd also marked his first involvement with the Library of Congress: working with UMd’s Nancy Anderson, professor of psychology (now retired), and Ben Shneiderman, professor of computer science, to add touch screens to the Scorpio and MUMS online catalog interfaces. UMd’s collaborative relationship with the Library continued on into the American Memory project.

“They contracted with us at Maryland to do a series of training events on the user-interface side of American Memory,” said Marchionini. “We did a lot of prototypes. This is some of the early dynamic-query work that Ben Shneiderman and his crew and those of us in the Human Computer Interaction lab were inventing. We worked on several of the sub-collections.”

Marchionini’s expertise is in creating the underlying data architecture and determining how the user will interact with the data; he leaves the interface design — the pretty page — to those with graphic arts talent.

A lot of analysis, thought, research and testing goes into developing appropriate visual cues and prompts to stimulate interactivity with the user. How can people navigate dense quantities of information to quickly find what they’re searching for? What kind of visual shorthand communicates effectively and what doesn’t?

When an interface is well-designed, it doesn’t call attention to itself and the user experience is smooth and seamless. Above all, a well-designed interface always answers the two questions “Where am I?” and “What are my options?”.

Regarding his work on cues and prompts, Marchionini cites another early UMd/Library of Congress online project, the Coolidge-Consumerism collection.

“We wanted to give people ‘look aheads’ and clues about what might happen and what they were getting themselves into if they click on something,” said Marchionini. “The idea was to see if we can show samples of what’s down deep in the collection right up front, either on the search page or on what was in those days the early search-and-results page. It was a lot of fun to work with Catherine Plaisant and UMd students on that. We made some good contributions to interface design.” Marchionini and Plaisant delivered a paper at the Computer-Human Interaction group’s CHI 97 conference titled, “Bringing Treasures to the Surface: Iterative Design for the Library of Congress National Digital Library Program,” which details UMd’s interface design process.

Marchionini has long had an interest in video as a unique means of conveying information. Indeed, he may have recognized video’s potential long before many of his peers did.

In 1994, he and colleagues from the UMd School of Education worked on a project called the Baltimore Learning Community that created a digital library of social studies and science materials for teachers in Baltimore middle schools.

Apple donated about 50 computers. The Discovery Channel offered 100 hours of video, which Marchionini and his colleagues planned to digitize, segment, index and map to the instructional objectives of the state of Maryland. It was an ambitious project and Marchionini said that he learned a lot about interactive video, emerging video formats, video copyrights and the programming challenges for online interactivity.

“We built some pretty neat interfaces,” said Marchionini. “At the time, Java was just coming out and we were developing dynamic query interfaces in the earliest version of Java. We were moving toward web-based applets. And we were building resources for the teachers to save their lesson plans, including comments on how they used the digital assets and wrote comments on them and shared them with other teachers. Basically we were building a Facebook of those days — getting these materials shared with one another and people making comments and adding to other people’s lesson plans so they could re-use them.”

Marchionini adds that the Baltimore Learning Community project is a good example of the need for digital preservation. Today, nothing remains from the project except for some printouts of screen displays of the user interfaces and website, and a few videotapes that show the dynamics.

“Today’s funding agencies’ data-management plan requirements are a step in the right direction of ensuring preservation,” said Marchionini.

In 1998, Marchionini joined the faculty at the School of Information and Library Science at the University of North Carolina, Chapel Hill, where he continued his video research along with his other projects. In 2000, he and Barbara Wildemuth and their students launched Open Video, a repository of rights-free videos that people could download for education and research purposes. Open Video acquired about 500 videos from NASA, which Open Video segmented and indexed. Archivist and filmmaker Rick Prelinger donated many films from his library to Open Video before he allied with the Internet Archive. Open Video even donated hundreds of videos to Google Video before Google acquired YouTube.

In 2000, around the time that NDIIPP was formed, Marchionini started discussing video preservation with his colleague Helen Tibbo and others. He concluded that one of the intriguing aspects of preserving video from online would be to also capture the context in which the video existed.

Marchionini said, “What kind of context would you need, say in 2250, if you see a video of some kids putting Mentos in Coke bottles and squirting stuff up in the air? You would understand the chemistry of it and all that but you would never understand why half a million people watched that stupid video at one time in history.”

“That’s where you need the context of knowing that this was the time when YouTube was happening and people were discovering ways to make their own videos without having to have a million dollar production lab or a few thousand dollars worth of equipment. The importance of it is that the video is associated with what was going on in the world at the time.”

With NDIIPP grant money, by way of the National Science Foundation, Marchionini and his colleagues created a tool called ContextMiner, a sort of tightly focused, specialized web harvester that is driven by queries rather than link following. A user gives ContextMiner a query or URL to direct to YouTube, Flickr, Twitter or other services. In the case of YouTube, ContextMiner then regularly downloads not only the video files returned from the search but whatever data on the page is associated with that video. A typical YouTube page will have comments, ratings and links to related videos. For a while, ContextMiner even harvested incoming links, which placed the video in a sort of contextual constellation of related topics.

The inherent educational value of video is that it can show a process. You can either read about how to juggle or how to tie your shoe laces, or you can watch a demonstration. Modelling communicates processes more effectively than written descriptions of processes.

Marchionini also sees video as a means of recording a process for research purposes. As an example, he described a situation where he wanted to capture and review the actions of users as they conducted queries and negotiated the search process.

He said, “I wanted to see a movie of a thousand people’s searches going through these states, from query specification to results examination and back to queries. Video is a way to preserve some things that have dynamics and interactions involved, things that you just can’t preserve in words. This is critical for showing processes, such as interaction dynamics, in a rapidly changing web environment. Because old code and old websites may no longer work, video is an important tool to capture those dynamics. That’s the only way I have of going back and saying, ‘Ten years ago, here were these interfaces we were designing and here’s why they worked the way they did.’ And I show a video.”

Today Marchionini is dean of the UNC School of Information and Library Science and he heads its Interaction Design Laboratory. The results of Marchionini’s research over the years have influenced our daily human/computer interaction in ways that we’ll never know.  Interfaces will continue to evolve and get refined but it is important to remember the work of people like Marchionini who did the early research and testing, labored on the prototypes and laid the foundation of effective human-computer interface design, making it possible for modern users to interact effortlessly with their devices.

Professors may not get the glory and attention that their work deserves but that’s not the point of being a teacher. Teachers teach. They pass their knowledge along to their students and often inspire them to create the Next Big Thing.

“University professors create ideas and prototypes and then the people who get paid to build real systems do that last difficult 10% of making something work at scale,” said Marchionini. “We train students. And it’s the students that we inspire, hopefully, who go on to industry or government work or libraries. And they put these ideas into place.

“My job is ideas and directions. Some stick and others do not. I hope they all get preserved so we can learn from both the good ones and the not-so-good ones.”

<<Digital Preservation Pioneer index

Categories: Planet DigiPres

BitCurator’s Open Source Approach: An Interview With Cal Lee

The Signal: Digital Preservation - 2 December 2013 - 2:50pm

Cal Lee, Associate Professor at the School of Information and Library Science at the University of North Carolina at Chapel Hill

Open source software is playing an important role in digital stewardship. In an effort to better understand the role open source software is playing, the NDSA infrastructure working group is reaching out to folks working on a range of open source projects. Our goal is to develop a better understanding of their work and how they are thinking about the role of open source software in digital preservation in general.

For background on discussions so far, review our interviews with Bram van der Werf on Open Source Software and Digital Preservation, Peter Van Garderen and  Courtney Mumma on Archivematica and the Open Source Mindset for Digital Preservation Systems and Mark Leggott on Islandora’s Open Source Ecosystem and Digital Preservation. In this interview, we talk with Cal Lee, Associate Professor at the School of Information and Library Science at the University of North Carolina at Chapel Hill about BitCurator.

Trevor: The title of your talk about BitCurator to the NDSA infrastructure working group explained it as “An Open-Source Project for Libraries and Archives that Takes Bitstreams Seriously.” Could you unpack that a bit for us? What does it mean to take bitstreams seriously and why is it important for archives to do so?

Cal: Computers store and process information through physical mechanisms, such as turning transistors on/off and changing/detecting the magnetic properties of the surface of a disk.  However, software is designed to deal with bitstreams, which are abstractions of those physical properties into sequences of 1s and 0s.  As I’ve expressed elsewhere, the bitstream is a powerful abstraction layer, because it allows any two computer components to reliably exchange data, even if the underlying structure of their physical components is quite different. In other words, even though the bits that make up the bitstream must be manifested through physical properties of computer hardware, the bitstreams are not inextricably tied to any specific physical manifestation.  So the bitstream will be treated the same, regardless of whether it came off a hard drive, solid state drive, CD or floppy disk.

The bitstreams can be (and often are) reproduced with complete accuracy.  By using well-established mechanisms – such as generation and comparison of cryptographic hashes (e.g. MD5 or SHA1) – one can verify that two different instances of a bitstream are exactly the same. This is more fundamental than simply saying that one has made a good copy. If the two hash values are identical, then the two instances are, by definition, the same bitstream.
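
To make the hash comparison concrete, here is a minimal Python sketch. It is an illustration only, not part of BitCurator: the function names are invented for this example, and SHA-256 is assumed as the default algorithm (the MD5 and SHA1 hashes mentioned above can be requested by name instead).

```python
import hashlib
import io


def bitstream_digest(stream, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of everything readable from a binary stream."""
    hasher = hashlib.new(algorithm)
    # Read fixed-size chunks so a large bitstream need not fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def same_bitstream(stream_a, stream_b, algorithm="sha256"):
    """Compare two bitstreams by comparing their hash values."""
    return bitstream_digest(stream_a, algorithm) == bitstream_digest(stream_b, algorithm)


# In-memory streams stand in for files read from two different media.
original = io.BytesIO(b"finding aid, draft 3")
copy = io.BytesIO(b"finding aid, draft 3")
altered = io.BytesIO(b"finding aid, draft 4")

print(same_bitstream(original, copy))
original.seek(0)  # rewind before reading the same stream a second time
print(same_bitstream(original, altered))
```

The same functions work on real files opened in binary mode, for example `open(path, "rb")`.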

In our everyday use of computers, we luckily don’t need to worry about bitstreams.  We focus on higher-level representations such as documents, pages and programs.  We click on things, copy things and open things, without having to worry about their constituent parts.  But those responsible for the long-term preservation of digital information need to attend to bitstreams.  They need to ensure the integrity of bitstreams over time by generating and then periodically verifying the cryptographic hashes that I mentioned earlier.  They also often need to view files through hex editors, which are programs that allow them to see the underlying bitstreams (presented in 8-bit chunks called bytes), so they can identify file types, extract data from otherwise unreadable files, figure out the underlying contents and structures of files, and even reverse engineer formats in order to bring otherwise obsolete files back to life.
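
As a rough illustration of what a hex editor shows, the Python sketch below prints bytes as offsets, hexadecimal values and printable characters, then guesses a file type from its leading bytes. The function names and the short signature table are assumptions made for this example; real format identification tools use much larger signature registries.

```python
def hex_dump(data, width=16):
    """Format bytes as offset, hex values and printable ASCII, hex-editor style."""
    hex_width = width * 3 - 1  # two hex digits per byte plus separating spaces
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{hex_width}}  {text_part}")
    return "\n".join(lines)


# A few widely documented leading-byte signatures ("magic numbers").
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP container",
}


def guess_type(data):
    """Return a label for the first matching signature, or "unknown"."""
    for signature, label in SIGNATURES.items():
        if data.startswith(signature):
            return label
    return "unknown"


# Stand-in for the first bytes of a file with a missing or misleading extension.
sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\n"
print(hex_dump(sample))
print(guess_type(sample))
```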

Bitstreams are also important when it comes to preserving the information acquired on removable media such as hard drives, flash drives, CDs or floppy disks.  Well-established practices in the field of digital forensics involve using a write blocker to ensure that none of the bits on the medium are accidentally changed or overwritten, and then creating a disk image.  A disk image is a perfect copy of the bitstream that is read off the disk through the computer’s input/output equipment.  It essentially allows librarians and archivists to retain all of the contents of a disk without having to rely on the physical medium.  This is important, because the medium will not be readable forever, so the bits need to be “lifted” off and placed in other storage.  It’s also important because there are many forms of data stored on the disk that may not be replicated correctly simply by copying and pasting the files from the disk.  The standard forensics software that creates a disk image also generates a cryptographic hash of the entire disk image (as opposed to the hashes of the individual files), so someone in the future can verify the disk image and ensure that none of the bits have changed.
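
The imaging step can be sketched in a few lines of Python. This is a simplified illustration, not how Guymager or any other forensic tool is implemented: it assumes the medium is already exposed as a readable binary source, it provides none of the protection of a write blocker, and the function names are invented for this example.

```python
import hashlib
import io
import os
import tempfile


def create_image(source, image_path, chunk_size=1024 * 1024):
    """Copy every byte from a readable binary source to image_path, hashing as it goes."""
    hasher = hashlib.sha256()
    with open(image_path, "wb") as image:
        for chunk in iter(lambda: source.read(chunk_size), b""):
            image.write(chunk)
            hasher.update(chunk)
    return hasher.hexdigest()


def verify_image(image_path, expected_digest, chunk_size=1024 * 1024):
    """Re-read the image and check that its hash still matches the recorded one."""
    hasher = hashlib.sha256()
    with open(image_path, "rb") as image:
        for chunk in iter(lambda: image.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected_digest


# Stand-in for a device: 1,474,560 zero bytes, the capacity of a 3.5 inch HD floppy.
fake_floppy = io.BytesIO(bytes(1474560))

with tempfile.TemporaryDirectory() as workdir:
    image_path = os.path.join(workdir, "floppy_001.img")
    digest = create_image(fake_floppy, image_path)
    print(os.path.getsize(image_path), digest)
    print(verify_image(image_path, digest))
```

Storing the digest alongside the image is what lets someone rerun the verification step years later.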

The process for creating a disk image begins by being able to read the physical media. For example, a 3.5 inch disk like this.

The process for creating a disk image begins by being able to read the physical media. For example, a 3.5 inch disk like this.

Trevor: Disk images are an important part of that bitstream focus. At its core, BitCurator functions to help create disk images and then enable a user to carry out a range of operations on disk images. Could you tell us a bit about how your team is thinking about disk images themselves as a format? For example, to what extent is the image the artifact and the process of creating an image a preservation action? Or, conceptually, is the image more akin to a derivative of the artifact?

Cal: As I explained earlier, a bitstream is the same bitstream regardless of how it’s physically stored.  So if you navigate to a file that’s stored on your computer and send it to me as an email attachment, and I then save it to my computer, my copy of the bitstream will be exactly the same as your copy.  The associated metadata, such as the file name and timestamps could be completely different, but the file as a bitstream will not change (assuming there has been no corruption of the file along the way).  We can verify this by generating hashes on the two copies and seeing that they match.

This same set of relationships applies to disk images.  If you create a disk image of a floppy disk and send it to me, I’ll then have the exact same bitstream that you have.  If you create another disk image of that disk, it should also be exactly the same (again, assuming no data loss due to hardware failure).  It is this disk image that we need to treat as the “original” in a digital environment.  This is true for two fundamental reasons.  First, software on your computer doesn’t have access to the underlying physical properties of a disk the same way that a reader has direct access to the physical properties of a printed page.  The bitstreams that computers read, manage and process are always mediated through the computers’ input/output equipment.  So, except in extremely rare cases of heroic recovery, there’s no practical value in treating the contents of a disk as anything other than the stream of bits that can be read through the I/O equipment.  In other words, for practical purposes, the disk image is the disk.

The second reason to treat the disk image as the original is that the physical disk will not be readable forever.  The industry will abandon support for the hardware and low-level software/firmware required to read it.  The performance of the medium (its storage capacity and input/output transfer rate) will become less acceptable over time – ever try to store a terabyte of data on floppy disks?  And the bits will eventually be lost through natural physical aging.

This doesn’t mean that the artifactual properties of hardware are never important.  Understanding the original hardware can be important to knowing what the user experience was like at the time.  And taking pictures of original media in order to reflect things written on them can be a good way to reflect aspects of the creator’s intentions and work habits.

Here you see the interface for Guymager, the tool BitCurator uses to create disk images.

Here you see the interface for Guymager, the tool BitCurator uses to create disk images.

Trevor: How is the BitCurator team approaching interoperability between this tool and other digital preservation tools?

Cal: Probably the most important answer to your question is that all of the BitCurator software is distributed under an open-source license.  This means that people can download, manipulate and redistribute whatever parts they find useful.

We’re also in regular contact and collaborate with people involved in various other development activities.  For example, Courtney Mumma from Artefactual Systems is on the BitCurator Development Advisory Group, and we work closely with Artefactual to ensure that the BitCurator software and its data output are structured and packaged in ways that can be incorporated into Archivematica.  Mark Matienzo is also on the DAG, and we’ve had many discussions with him about how the BitCurator software can play well with ArchivesSpace.  Similarly, we strive to stay abreast of related software development activities being carried out within collecting institutions, such as the valuable work of Peter Chan at Stanford, Don Mennerich at the New York Public Library, Mark Matienzo at Yale, and activities outside the US that are represented well by the documentation that Paul Wheatley has developed for the Open Planets Foundation.

Kam Woods, who is the BitCurator Technical Lead, carries out extremely important liaison activities between our team and not just developers in the cultural heritage sector but also developers of standards and software in the forensics industry.  This is particularly important for BitCurator, because we’re repurposing, adapting and repackaging many existing open-source digital forensics tools.  Identifying and managing software dependencies is an ongoing process.

Viewing reports on a disk image in Bulk Extractor

Viewing reports on a disk image in Bulk Extractor

Trevor: Could you tell us a bit about the design principles at work in the BitCurator project? That is, instead of trying to build things from scratch, you seem to be bringing together a lot of open source software created for somewhat different use cases and making it useful to archives. Why did your team develop this approach and what do you see as its benefits and limitations?

Cal: Almost twenty years ago, in a book called Darwin’s Dangerous Idea, Daniel Dennett argued that complex systems evolve through what he called the “accumulation of design.”  New products, services and theories and various other human products build off of existing ones.  Software development is no different.  Programmers know that it’s usually better to make use of existing code than to build it from scratch.  Why write the code required to write text to the screen, for example, if someone else has already done that?  Open-source software facilitates this process, because reusing someone else’s code doesn’t require the negotiation of permissions or payment.

Code adaptation and reuse is a particularly powerful proposition for the application of digital forensics to digital collections, because there is a great deal of powerful software that has already been developed, and it’s unlikely that collecting institutions would ever have sufficient resources to develop such tools completely on their own.  As someone who has been working with digital archives for many years, I’ve been amazed by how many tools being developed for digital forensics can be applied to the problems we face.  A great place to see leading-edge development in this space is the Digital Forensics Research Workshop, which is an annual conference that publishes its papers in a journal called Digital Investigation.  I’ve been particularly grateful for the open-source (or public domain) software developed by Simson Garfinkel at the Naval Postgraduate School and Brian Carrier of Basis Technologies.

Of course, all design decisions involve costs and benefits.  The main challenges of using software developed by others are that your specific use case may not have been the primary priority of those developers, and as I mentioned earlier, you have to stay on top of dependencies with that existing software as their (and your) software evolves over time.  The BitCurator team and I believe strongly that these costs are well worth the numerous benefits.  And we’re working to support the kinds of use cases that are most important to collecting institutions.

Visualizations of some of the file system metadata created through bulk extractor’s reporting functions.

Trevor: Could you tell us a bit about how you are thinking about the sustainability of BitCurator? For example, are you thinking about building a community of users and developers? What kinds of future funding streams are you looking to?

Cal: There are various elements of BitCurator that are designed to build capacity and ensure the sustainability of our activities. I’ve already explained that the software is distributed under an open source license, so diverse constituencies will be able to extend our tools at will.  Members of the BitCurator team have been offering a lot of continuing professional education opportunities (including a module for Rare Book School and classes for the Digital Archives Specialist program of the Society of American Archivists), which help to build and cultivate a community of users.  There’s a BitCurator user group that interested professionals can join, and our project wiki includes an increasing body of documentation to help people to install and use the software.

A significant focus of the second phase (October 2013 to October 2014) of BitCurator is to devise and implement a sustainability plan.  This is being overseen and coordinated by Porter Olsen, who is the Community Lead for BitCurator.  We’re currently exploring a variety of membership models.  We should have a much more detailed answer to your question in the coming year.

Trevor: Could you tell us a bit about how you are trying to engage and build a community around the software? What kinds of approaches are you taking and to what ends are you taking those approaches?

Cal: I’ve already talked about most of them within the context of sustainability.  The two issues (sustainability and community building) are closely related.  The products of the BitCurator project will ultimately be sustainable if there are professionals working in a variety of institutions who value them, use them, and contribute back to their ongoing development through evaluative feedback, bug reports and code revisions/enhancements.  In addition to our educational offerings and guidance resources, we’ve also published many papers/articles about this work and given talks at a variety of conferences and other professional events.

Porter Olsen is taking on many new engagement activities this year.  Among other things, this includes site visits and webinars.  The first two webinars that Porter is offering have filled up within a few days of announcing them, so there seems to be a lot of interest.

Trevor: It strikes me that one of the biggest opportunities and challenges here is that there is a significant literacy gap within the community around how to deal with born digital archival materials. For example, if you were making a tool to turn out finding aids there would be relatively solid requirements within the archives community of practice. In contrast, in working with born digital archival materials there is still an extensive need for developing those practices and a significant lack of knowledge about the issues at hand among many in the archives profession. First off, do you agree with this perspective? Second, if so, how are you approaching designing a tool while the archives community is still bootstrapping its way into working with these materials?

Cal: I agree with you that the landscape is currently undergoing dramatic evolution.  This is what makes the work so fun and so fulfilling.  Professionals in a diverse range of collecting institutions are developing workflows that involve digital forensics tools and methods.  They’re learning from each other and making changes as they go along.

This is also a very exciting situation for an educator.  I don’t know if they always believe me when I tell them this, but today’s students in a program like the one at UNC SILS will be defining and establishing archival practices of the future.  If you want to continuously take on new challenges and creatively develop entirely new ways of working, then this is a great profession to join right now.  If you want a profession that’s safe and predictable, I recommend looking elsewhere.

Trevor: How has your work on BitCurator shaped your general perspective on the role that open source software can and should play in digital preservation? I would be particularly interested in any comments and connections you have to some of the interviews we have already done in this series. For reference, those include Bram van der Werf on Open Source Software and Digital Preservation, Peter Van Garderen &  Courtney Mumma on Archivematica and the Open Source Mindset for Digital Preservation Systems and Mark Leggott on Islandora’s Open Source Ecosystem and Digital Preservation.

Cal: It’s hard for me to argue with much that Bram, Peter, Courtney or Mark have said to you.  I think we are of a like mind on many things.  The curation of digital collections is a collective endeavor, and it can benefit greatly from open-source software development.  But it’s definitely not a panacea.  We have to learn from each other, assist each other, and celebrate each other’s victories.

Categories: Planet DigiPres

SPRUCE project Award: Lovebytes Media Archive Project

Open Planets Foundation Blogs - 28 November 2013 - 6:32pm

Lovebytes currently holds an archive of digital media assets representing 19 years of the organisation’s activities in the field of digital art and a rich historical record of emerging digital culture at the turn of the century. It contains original artworks in a wide variety of formats, video and audio documentation of events alongside websites and print objects.

In June 2013 we were delighted to receive an award from SPRUCE, which enabled us to devise and test a digital preservation plan for the archive through auditing, migrating and stabilising a representative sample of material, concentrating on migrating digital video and Macromedia Director files.

Alongside this we developed a Business Case, which makes the case for preserving the archive and describes the work that needs to be done to make it accessible for the benefit of current and future generations, with a view to this forming the basis of applications for funding to continue this work.


Lovebytes was set up to explore the cultural and creative impact of digitalisation across the whole gamut of artistic and creative practice through a festival of exhibitions, talks, workshops, performances, film screenings and commissions of new artwork.

We wanted the festival to be a forum to pose open questions about the impact of digitalisation for artists and audiences, in an attempt to find commonalities in working practice, identify new themes and highlight new and emerging forms and trends in creative digital practice. We also wanted to provide support for artists to disseminate and distribute their own work through commissions.

This was a groundbreaking model for a UK media festival and established Lovebytes as a key player amongst a new wave of international arts festivals.

The intention in developing a plan for Lovebytes Media Archive is to look at how best to capture the 'shape' of the festival and how best to represent this in creating an accessible version of the archive.

Main Objectives

The Objectives of the project funded through SPRUCE are outlined below:

  1. Develop a workflow for the migration of the digital files and interactive content, progressing on from work done during SPRUCE Mashup London.
  2. Tackle issues around dealing with obsolete formats and authoring platforms used by artists (such as Macromedia Director Projector files) and look at ways of making this content more accessible whilst also maintaining original copies for authenticity.
  3. Research and develop systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.
  4. Report on progress and share our findings for the benefit of the digital preservation community.
  5. Develop a digital preservation Business case, with a view to approaching funders.


We started by developing a research plan for a representational sample of the archive (see below), focusing on one festival, rather than a range of samples from over the 19 years. We selected the year 2000 as this included a limited edition CD Rom / Audio CD publication which contains specially commissioned interactive and generative artwork in a variety of formats.

Additional assets in the representational sample include video documentation of panel sessions, printed publicity, photographs, press cuttings and audience interviews in a wide variety of formats.

Research plan for the representational sample

  • Auditing the archive.
  • Choosing a representative sample.
  • Stabilising and migrating.
  • Reviewing content to assess problems and risk
  • Stabilise again with a view to rectifying problems
  • Cataloguing and naming.
  • Planning for future accessibility and interpretation.
  • Extracting metadata.
  • Prototyping a search interface to provide access to the archive (with Mark Osbourne from Nooode).

Data integrity is paramount in digital preservation and requires utmost scrutiny when dealing with 'born digital' artworks, where every aspect of the artist's original intentions should be considered a matter for preservation and any re-presentation of a digital artwork can be regarded as a reinterpretation of the work.

In all cases, the most urgent work was the migration of data to stabilise and secure it. Amongst the wide range of formats we hold, CDs and CD ROMs are prone to bit rot, while magnetic formats such as tape can degrade gradually, be damaged by electrical and environmental conditions, or be easily damaged during attempts to read or play them back.

The majority of our preservation work was to migrate from a wide variety of formats to hard drive, essentially consolidating our collection into one storage medium, which is then duplicated as a part of a back up routine.

Our research focused on the following six areas:

  1. Macromedia Director Projector files
    • Migrating obsolete files and addressing compatibility issues.
  2. DV Tapes
    • Migrating DV tapes and transcribing panel sessions with a view to researching how transcriptions could be used for text based searches of video content, and how this can be embedded as subtitles using YouTube.
  3. Restoring Lovebytes website
  4. Developing naming systems for assets
  5. Prototyping a searchable web interface and exploring the potential for using ready-made, free and accessible tools for transcription dissemination.
  6. Writing a Business Case for Lovebytes Media Archive
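
As an illustration of area 2, the Python sketch below turns timed transcript segments into a subtitle file and runs a simple text search over them. It is not the workflow from our report: it assumes the SubRip (.srt) caption format, and the function names and sample lines are invented placeholders rather than real transcript content.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Build SubRip text from (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    # A blank line separates one numbered caption from the next.
    return "\n".join(blocks)


def search(segments, term):
    """Return (timestamp, text) pairs for segments that mention the term."""
    term = term.lower()
    return [(srt_timestamp(start), text)
            for start, _end, text in segments
            if term in text.lower()]


# Invented placeholder lines, not taken from any real panel session.
panel = [
    (0.0, 4.0, "Welcome to this afternoon's panel session."),
    (4.0, 9.5, "Our first speaker will talk about generative artwork."),
    (9.5, 15.0, "Questions from the audience will follow."),
]

print(to_srt(panel))
print(search(panel, "generative"))
```

Keeping the timed segments as data means the same transcription can feed both the subtitles and a text-based search of the video content.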

We learned some valuable lessons on the way that we'd like to share with likeminded organisations, especially those who have limited resources and are looking to preserve their own digital legacy on a tight budget.

Our findings have been compiled into a detailed report, providing a workflow model which makes recommendations for capturing, cataloguing and preserving material. It outlines our research into preserving artwork on obsolete formats and authoring platforms, as well as systems for transcription, data extraction and the use of metadata to increase accessibility of the archive.

We wanted to begin looking at the preservation issues for our collection and devise our own systems and best practice. The recommendations reached for preserving digital assets in various media formats therefore reflect the organisational needs of Lovebytes and might not align with another organisation's goals.

Business Case

We used the Digital Preservation Business Case Toolkit to help us get started on our Business Case. This was a fantastic resource and helped us shape our Case and consider all the information and options we needed to include.

The Business Case will form the foundation for applications for public and private funding and will be tailored to meet specific requirements. Through writing this, we were able to identify the potential risks to the archive, its value and how we might restage artworks or commission artists to use data from it within the preservation process.


As non-experts in digital preservation we knew we were about to encounter some steep climbs and were initially apprehensive about what lay ahead, given that most of our material had been sat in a garage for ten years. Our collection, until then, had remained largely un-catalogued and, aside from being physically sealed in oversized tupperware, the digital assets had been neglected. Many items were the only copy, stored in one location and in danger of decay, damage or loss. As a small arts organisation recently hit by cuts to arts funding, Lovebytes and its archives were in a precarious position: unsupported and vulnerable.

The SPRUCE Award gave us the opportunity to take a step back and re-evaluate these assets, making us aware of their value and the need to save them and to start the preservation process. It has given us the opportunity to explore solutions and devise our own systems for best practice within the limited resources and funding options available to us.

It has allowed us to crystallize our thoughts around using the Lovebytes Media Archive to investigate digital archivism as a creative process and specifically how digital preservation techniques may be used to capture and preserve the curatorial shape and context of arts festivals.

By using available resources and bringing in external expertise where necessary, we found this process rewarding both in terms of developing new skills and also reaffirming in terms of our past, current and future curatorial practice.

Having undertaken this research, we now feel positive about the future of the archive. We have a clear strategy for preservation and a case to take to funders and partners to secure it as an exemplar born-digital archive project, one which attempts to capture, preserve and represent the history of Lovebytes as a valuable record of early international digital arts practice at the turn of the century.

Jon Harrison and Janet Jennings of Lovebytes, and Mark Osbourne of Nooode

Preservation Topics: SPRUCE
Categories: Planet DigiPres

10 Tips To Preserve Your Holiday Digital Memories

The Signal: Digital Preservation - 27 November 2013 - 3:18pm

During Thanksgiving and the rest of the holiday season, you might take photos and video of friends and loved ones. You might make audio recordings of voices, conversations and music. Whatever you photograph or record, we hope you will take time to back up and preserve your digital stuff.

Thanksgiving on Flickr by martha chapas95

  1. As soon as you can, transfer the digital files off the camera, cell phone or other device and onto backup storage. That storage could be your computer, a thumb drive, a CD, a hard drive or an online cloud service. You should also back up a second copy somewhere else, preferably on a different type of storage device than the first.
  2. If you have time, browse your files and decide if you want to keep everything or just cull the best ones. Twenty photos of the same scene might be unnecessary, no matter how beautiful the scene might be. And despite who is in that video, if the video is blurry and dark and shaky, you probably will never watch it again.
  3. When you back your files up, organize them so you can easily find them. You can rename files without affecting the contents. And renaming a file will help you find it quickly when you search for it later.
  4. Organize file folders however you want but be consistent with your system. Label folders by date, description or file type (such as “Photos” or “Thanksgiving 2013”). Organization makes it easy to find your stuff later.
  5. You can add descriptions to your digital photos, much as you would write a description on a paper photo. We’ve gone into depth in a few blog posts to describe how it works.
  6. Similarly, if you make any digital audio recordings, you can add descriptive information into the audio files themselves, information that will display in the MP3 player.
  7. If you have a special correspondence with someone, you can archive the emails and cell phone texts much as you would a paper letter or card.
  8. Remember that all storage devices eventually become obsolete; maybe you can recall devices and disks from just a decade ago that are now either obsolete or on their way out of fashion. If you have valuable files still on those obsolete media, those files become increasingly difficult to access with every passing year.
  9. So in order to keep your files accessible, you should move your collection to a new storage medium about every five to seven years. That is about the average time for something new and different to come out. At the least, if you use the same backup device frequently — like a favorite thumb drive — get a new one.  Migrate your collection to new media periodically.
  10. Write down where you have important files, along with any passwords needed to access them, and keep that information in a secure place that a designated person can access if you aren’t around. Allow your memories to live on!
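
For readers comfortable with a little scripting, tips 1, 3 and 4 can be sketched in Python. This is only one possible approach, offered under stated assumptions: the function names, the date-plus-description folder pattern and the paths are all invented for this example, and the script only handles files directly inside one folder.

```python
import hashlib
import shutil
import tempfile
from datetime import date
from pathlib import Path


def sha256_of(path):
    """Hash a whole file at once (fine for photos; very large videos may need chunked reads)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source_dir, destinations, label, when=None):
    """Copy each file in source_dir into a dated, labelled folder on every destination."""
    when = when or date.today()
    folder_name = f"{when.isoformat()}_{label}"  # e.g. 2013-11-28_thanksgiving
    copied = []
    for destination in destinations:
        target_dir = Path(destination) / folder_name
        target_dir.mkdir(parents=True, exist_ok=True)
        for source in sorted(Path(source_dir).iterdir()):
            if not source.is_file():
                continue
            target = target_dir / source.name
            shutil.copy2(source, target)  # copy2 also keeps timestamps where it can
            # Confirm the copy holds exactly the same bytes as the original.
            if sha256_of(source) != sha256_of(target):
                raise IOError(f"copy of {source.name} does not match the original")
            copied.append(target)
    return copied


# Demonstration in a temporary folder; real use would point at a camera card and two drives.
with tempfile.TemporaryDirectory() as root:
    camera = Path(root) / "camera"
    camera.mkdir()
    (camera / "IMG_0001.jpg").write_bytes(b"placeholder image bytes")
    drives = [Path(root) / "drive_a", Path(root) / "drive_b"]
    for path in back_up(camera, drives, "thanksgiving", when=date(2013, 11, 28)):
        print(path.relative_to(root))
```

Writing to two destinations follows the advice in tip 1 to keep a second copy somewhere else.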

Treat your digital files responsibly, preserve those memorable moments and you can enjoy them again and again for years.

For more information on personal digital archiving, visit

Categories: Planet DigiPres

Personal Digital Archiving 2014: Building Stronger Personal Digital Archiving Communities

The Signal: Digital Preservation - 25 November 2013 - 7:42pm

2.7 meg file, by flickr user s2art

There is a growing community of individuals who are interested in the preservation of personal digital information.  Those individuals may include professionals working in libraries and archives who are receiving personal collections, scholars working with their own research materials and data, commercial companies working on consumer products to help people organize and save their digital content, and other people who create multitudes of personal digital content for various reasons.   They come together annually to share practical solutions to preserving and archiving all types of personal digital content.

Personal Digital Archiving 2014 will be held at the Indiana State Library in Indianapolis, Indiana, April 10-11, 2014.  This is the first time the conference will be held in the Midwest.  It was previously held in San Francisco, California (2010-2012) and in College Park, Maryland (2013).

The Personal Digital Archiving conference explores the intersections between individuals, public institutions, and private companies engaged in the creation, preservation, and ongoing use of the digital records of our daily lives. The conference reflects upon the current status of personal archiving, its achievements, challenges, issues, and needs as evidenced through research, education, case studies, practitioner experiences, best practices, the development of tools and services, storage options, curation, and economic sustainability. There is also interest in the role of libraries, archives and other cultural heritage organizations in supporting personal digital archiving through outreach or in conjunction with developing community history collections.

Some of the issues the conference committee is looking for the community to explore together are:

  • How do we preserve the ability to access digital content over time when every app/community/network has a lifecycle that involves the end of its existence?
  • How should libraries, museums and archives collect personal digital materials? How do we better share our knowledge and communicate about our work (including the failures as well as the successes)?
  • How are archivists, curators and genealogists using born-digital and/or digitized material in their research?
  • How can individuals be encouraged to undertake personal digital archiving activities?
  • What are effective strategies and best practices for personal digital archiving in social media and ecommerce settings?
  • What tools and services now exist to help with personal archiving? What do we need to make the process easier or more effective?

If you’re working with personal digital archives, please consider sharing your work at PDA2014.  The call for proposals is open and the submission deadline is December 2.

For those interested in attending, registration will open early in the new year on February 1, 2014.

PDA2014 is sponsored by the Indiana State Library and NDIIPP, in collaboration with the Coalition for Networked Information.

Categories: Planet DigiPres

From Analog to Digital: A Changing Picture of the Kennedy Assassination

The Signal: Digital Preservation - 22 November 2013 - 9:35pm

The first images I recall of the Kennedy Assassination are grainy black and white television broadcasts. I was in the fourth grade 50 years ago today, and after an anguished announcement on the public address system, we were sent home.

The TV was on in the living room with solemn reports. What followed over the next few days was a stunning flow of amazing events, all rendered in a few hundred flat lines of grey tones. I remember a strange mix of feelings, awash in horrible facts relayed by reassuringly familiar news correspondents. Those sober faces, rendered the same way as the thousands of hours of TV I had already consumed, helped me accept what had happened. Maybe it was my youth, but even the repeated rebroadcast of disturbing video clips–Jack Ruby’s shooting of Oswald in particular–eventually became an acceptable, if terribly sad, part of reality.

Aftermath of the shooting in Dallas, Cecil Stoughton. White House Photographs. John F. Kennedy Presidential Library and Museum, Boston

The Zapruder film upended that complacency. I first saw frames from the film in Life magazine shortly after the shooting, but their impact was minimal. They were static and in black and white. The full color version of the film was kept from public view for many years due to intellectual property restrictions, and it wasn’t until 1975 that it had a widespread public viewing. But even then most people saw the film on distinctly non-HD television, and perhaps not in color.

I didn’t see the film clearly until 1991 when it was used as part of the movie JFK. The lurid Kodachrome colors, the oddly intimate home movie jerkiness, the abrupt transition from banal to horrific–the film was a waking terror dream, something that couldn’t be happening actually was happening.

The nightmare quality was further enhanced by radical differences the film had from the original TV coverage: overly saturated colors in contrast with drab black and white; eerie silence in contrast with the soothing voices of newscasters; powerful, gut-churning visual reality in contrast with calm narrative descriptions.

With the internet came another change in my visual impression of the assassination. Beforehand, the Zapruder imagery was not in plain sight. But with digital versions proliferating on the web, the film was suddenly much more available in all kinds of different ways. It regularly showed up as images or clips in news stories and in essays; it was dissected in academic papers (such as A 3-D Lighting and Shadow Analysis of the JFK Zapruder Film (Frame 317) (PDF)). It became a staple of sites dedicated to video content, and anyone with an internet connection can view titles such as The Undamaged Zapruder Film, Zapruder Film Slow Motion (HIGHER QUALITY) or The Inky Face Trajectory In The Zapruder Film.

All this has altered my visual model of the assassination. I’ve moved from a purely rational, analog-based acceptance of what I originally saw on TV to a digitally-driven sense that the event lives in some strange, uncomfortable zone that resists clear-cut recognition or acknowledgement. While I have never seen compelling evidence of a conspiracy, I can easily see why people are drawn to the idea. Those 26.6 Zapruder seconds have a strange hallucinatory impact that seemingly builds each time you watch. It’s natural to try and explain what looks like a delusion, especially one that streams over and over again to your own computer screen.


Categories: Planet DigiPres

On the Road with FADGI: Recent Conference Presentations Highlight Current Audio and Video Projects

The Signal: Digital Preservation - 21 November 2013 - 4:40pm

One of the best things about the Federal Agencies Digitization Guidelines Initiative is that we are a community-oriented group. We work together to bring about solutions to real-world problems. Our efforts are focused on defining common guidelines, methods and practices for federal agencies digitizing historical content, and the impact of our projects and products often extends beyond the government sector into the wider audio and moving image preservation communities.

This fall, two of our FADGI Audio-Visual Working Group members hit the conference circuit to discuss some of our current efforts, and we couldn’t be more pleased by the positive responses.

The Interstitial Error is visible in the top row; the two rows would have the exact same waveform shape if there was no error. To hear an Interstitial Error, check out the AV Artifact Atlas.
Photo courtesy of AudioVisual Preservation Solutions from the FADGI Interstitial Error Study Volume I. The Study Report

In late October, FADGI’s work in audio preservation was highlighted at the Audio Engineering Society’s 135th International Convention in New York City. One of our expert consultants, Chris Lacinak of AudioVisual Preservation Solutions, included FADGI projects in his tutorial about audio performance systems testing. Part of the workshop covered the problem of Interstitial Errors (PDF), a term Chris coined to describe momentary artifacts caused by failure in a digital audio workstation’s writing of data to a storage medium which result in both lost content and a disruption in file integrity.
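The caption above notes that two captures of the same signal would have identical waveforms if no error had occurred. As a rough illustration of that comparison (not the method used in the FADGI study), the sketch below assumes two sample-aligned sequences of integer samples and reports the index ranges where they disagree. The function name and the sample values are hypothetical, and a real Interstitial Error may also drop samples and shift everything after it, which this simple comparison does not handle.

```python
def differing_runs(reference, capture):
    """Return (start, end) index pairs, end exclusive, where two
    sample-aligned sequences differ. Illustrative only."""
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(reference, capture)):
        if a != b:
            # Open a new run at the first mismatching sample.
            if start is None:
                start = i
        elif start is not None:
            # Samples agree again, so close the current run.
            runs.append((start, i))
            start = None
    if start is not None:
        # The mismatch ran to the end of the shorter sequence.
        runs.append((start, min(len(reference), len(capture))))
    return runs


if __name__ == "__main__":
    # Hypothetical sample values, for demonstration only.
    clean = [0, 1, 2, 3, 4, 5]
    damaged = [0, 1, 9, 9, 4, 5]
    print(differing_runs(clean, damaged))
```

A short run of mismatched samples in otherwise identical captures is the kind of momentary artifact the term describes; deciding whether a given run is audible or significant is a separate question.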

The workshop also illuminated the topic of analog-to-digital converter performance testing, highlighting the FADGI 2012 guideline on ADC metrics and testing, a document that built upon two foundational publications – the 2009 Guidelines on the Production and Preservation of Digital Audio Objects (TC04) from the International Association of Sound and Audiovisual Archives and the Audio Engineering Society’s AES-17: AES standard method for digital audio engineering — Measurement of digital audio equipment.

The FADGI 2012 guideline (PDF) will also serve as the starting point for a formal standards project by the AES Working Group on Digital Audio Measurement Techniques (SC-02-01), a project that will address both the development of test methods and performance criteria for the ADCs used in audio preservation systems. The prospect of an official standards project focused on the topic of Interstitial Errors is currently under discussion within this same working group.

Courtney Egan presenting the reformatted video matrix at the AMIA Poster Session.  Photo by Kate Murray

In early November, FADGI work was again on display at the Association of Moving Image Archivists Annual Conference in Richmond, Virginia. Courtney Egan from the National Archives and Records Administration’s Audio-Video Preservation Lab participated in a poster session about the eagerly anticipated and very-soon-to-be-released-for-public-comment matrix which compares target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes.

As mentioned in a previous blog post, the evaluation attributes in the matrix include format sustainability, system implementation, cost and settings and capabilities. Some features specific to video are also evaluated, such as the ability to store multiple or discontinuous time codes and the ability to support different color spaces and bit depths. The Working Group hopes that the matrix will be a helpful tool for those faced with the challenging choice of what target format they should use when migrating their legacy videotapes.

So what does all this mean for the future of the FADGI Audio-Visual Working Group? Both presentations were extremely well received. Chris’ tutorial made front page news in the AES Show Daily newspaper and Courtney’s poster session was mobbed. We’re proud, of course, that our efforts are helpful for our federal agency constituents. But we are thrilled that our work is appreciated and embraced by the audio and moving image preservation communities at large. Our collaborative approach to solving shared problems through community-based solutions is working – for everyone – and we wouldn’t have it any other way.

Categories: Planet DigiPres

Residency Program: From the Classroom to the Workplace

The Signal: Digital Preservation - 20 November 2013 - 3:07pm

The following is a guest post by Lyssette Vazquez-Rodriguez, Program Support Assistant & Valeria Pina, Communication Assistant, both with the Office of Strategic Initiatives at the Library of Congress

Residents in the inaugural class of the National Digital Stewardship Residency program have been busy at their host institutions since mid-September. The residents agree that during their first weeks of work they did what they know best: research.

This year’s class of Residents. (Photo credit: Molly Schwartz)

Jaime McCurry, resident at the Folger Shakespeare Library, explained, “Right now my work is very research-oriented. Over the course of the residency, I am preparing an annotated bibliography on various resources related to Web Archiving. I’m looking to provide an overview of the current landscape and also to find interesting sources pertaining to Web Archiving in the humanities, specifically. I’ve also performed Quality Assurance tasks on the Folger’s current Web Archive collections and I am in the process of discussing new collections to be added with our Collection Development team.”

Molly Schwartz (Photo credit: Molly Schwartz)

Erica Titkemeyer, resident at Smithsonian Institution Archives, who is working with time-based media and art, explains that, “a typical day at my office tends to be low-key, since I work alone researching at my own workstation. As of now I have carried out a significant amount of research related to the current state of time-based media art (works of art which depend on technology and have duration as a dimension) to conservators within museum settings.”

In addition to research, some of the residents have had the opportunity to attend conferences and network with scholars from the field of digital preservation. Molly Schwartz, who is a resident at the Association of Research Libraries, attended a lecture by Dr. Jonathan Lazar, Professor of Computer and Information Sciences at Towson University.

Margo Padilla (photo credit: Molly Schwartz)

Margo Padilla, a resident at the University of Maryland, said, “I recently conducted several interviews with electronic literature scholars on their expectations for access to born-digital literary collections. These interviews will help inform the development of the access models I will produce by the end of the residency.”

This is just the beginning of the residency; the residents are thrilled with what they have been doing so far and are eager to continue learning and helping their host institutions complete their objectives.

Categories: Planet DigiPres

The OPF Appoints New Executive Director

Open Planets Foundation Blogs - 20 November 2013 - 8:55am
The Board of the OPF has appointed Ed Fay as the new Executive Director. Ed will join the OPF in February 2014 and will lead the organisation in its efforts to address its members' digital preservation challenges with a practical and community-led approach.

Ross King, Chair of the Board, said: "The Board was extremely gratified to receive qualified applications from Europe, the Middle-East, India, and the United States. Four top candidates were selected by all board members and were interviewed personally by a board sub-committee. After evaluating these candidates, support for Ed Fay within the committee and the OPF board was unanimous. Ed has demonstrated his understanding of the different challenges facing both libraries and archives and has a refreshing take on digital preservation from an institutional perspective. We look forward to working with him to enhance the visibility and reputation of the OPF and to create more value for its members".

Ed commented on his appointment: "I’m thrilled to join the OPF and contribute towards the development of digital preservation practice at an important time for libraries, archives, and memory institutions everywhere. The OPF’s mission is to enable collaboration and shared solutions and I look forward to working with members and the wider community to build capacity for the digital collections of the future".

Before being appointed by the OPF, Ed was the Digital Library Manager of the London School of Economics (LSE) for 5 years. He successfully managed the development of LSE’s digital library from its inception to implementation. He also led digital preservation activities at LSE and their participation in a number of related projects and working groups. Prior to this he worked on several mass digitisation projects funded by JISC.

Ed will take over the role from Bram van der Werf, who has managed and grown the OPF from its foundation in 2010 to become a sustainable membership organisation.


Preservation Topics: Open Planets Foundation
Categories: Planet DigiPres

The Best Practices Exchange Conference is No Secret

The Signal: Digital Preservation - 19 November 2013 - 7:59pm

“What happens at BPE stays at BPE.”

So goes the oft-repeated mantra at the annual Best Practices Exchange conference, held this year under mostly-sunny skies in beautiful downtown Salt Lake City, UT.

Best Practices Conference Program. Photo Credit: Butch Lazorchak

The phrase holds a special meaning for BPE attendees. Unlike the reticent returnees from America’s sin capital who presumably have something to hide, BPE attendees have something to share, but want to share it in a non-judgmental environment where their experiences, positive or negative, help to move the digital stewardship community forward.

This year provided ample opportunities for sharing and discussion, with a solid program put together by the hosts from the State of Utah Division of Archives and Records Service and their compatriots from around the state. This even included hilarious archives-based fortune cookies at the evening reception.

BPE accepts all comers, but attendance is largely centered on the state and local government library, archives and records management communities. As such it tends to focus on practical solutions to real-world problems. This practical ethos was exemplified by the opening keynote from former Senator Robert F. Bennett, who encouraged the attendees to work closely with their legislators and funders to find digital stewardship solutions. Bennett provided three key thoughts on how to be an effective advocate:

  • Never ask anybody to do something that’s not in his or her best interest;
  • Always be nice;
  • Don’t put yourself in competition with other people’s budgets.

Practical advice was found everywhere. Jenny Mundy from Multnomah County, OR described a coordinated succession planning effort that helped them address critical needs in the hiring process. A session on “Making America’s Laws Available Now and in the Future” brought participants from the Utah State Library and the Utah Division of Administrative Rules together with Digital Preservation Pioneer Margaret Maes from the Legal Information Preservation Alliance for a spirited discussion on current approaches to preserving digital legal information.

The Arizona Library created awesome visual aids to help people understand format obsolescence. Photo Credit: Butch Lazorchak

Linda Reib from the Arizona State Library talked about the challenges they faced while working to seek sustainable funding for their state archives electronic records repository (an ongoing effort related to work they did on the NDIIPP-supported PeDALS project), while showing off their visual aids to help people understand format obsolescence. The State Archives of North Carolina discussed their work on preserving the social media accounts of elected officials and state organizations.

We hosted a session on Wednesday afternoon on the 2014 National Agenda for Digital Stewardship and brainstormed ways to leverage the energy of BPE to support the work of the National Digital Stewardship Alliance.

Thursday opened with a plenary session from Meg Phillips, the Electronic Records Lifecycle Coordinator at the U.S. National Archives and Records Administration (and a member of the NDSA Coordinating Committee). Phillips focused on NARA’s new “Make Access Happen” initiative, inviting the participants to share their ideas and approaches for new ways of looking at electronic records management.

Afternoon events looked at the challenges facing digital filmmakers, an active community in Utah due to the presence of the Sundance Institute in Park City and its associated film festival. Milt Shefter, a consultant to the Academy of Motion Picture Arts and Sciences, offered his insight on the digital preservation challenges facing the film industry and showcased a pair of NDIIPP-funded reports, the Digital Dilemma and the Digital Dilemma 2, that raise important concerns about the challenges of preserving digital motion pictures by both major studios and independent filmmakers.

Shefter’s presentation was followed by a showing of the film These Amazing Shadows, a documentary that discusses the history and importance of the Library of Congress’ National Film Registry.

The morning of the third day brought presentations from a couple of big players in the genealogy space. Genealogical research is a $2.3 billion per year industry and some of the most significant operations are located in Utah. FamilySearch, founded in 1894 as the Genealogical Society of Utah, is chiefly supported by the Church of Jesus Christ of Latter-day Saints and makes their material available free of charge. They have embraced digital stewardship to a significant degree, building and maintaining a state-of-the-art preservation repository for the more than 100 petabytes of data on tape in their Granite Mountain records vault. They’ve also been engaged in addressing file format challenges and we wrote about their work a couple of years ago here on the Signal.

NDIIPP has been involved in the Best Practices Exchange since the first event was held in Wilmington, NC in 2006 and it’s refreshing to see the progress that the BPE community has made since then to address digital stewardship issues. While “what happens at BPE stays at BPE,” it’s important to continue to showcase the work of the BPE community. And that’s no secret.

Categories: Planet DigiPres

Establishing a Workflow Model for Audio CD Preservation

Open Planets Foundation Blogs - 19 November 2013 - 1:49pm

The preservation of audio CDs is something that is slightly different from the preservation of CDs containing data other than audio. Data on audio CDs cannot be easily cloned for preservation, as the music industry has lobbied the main operating system developers to curtail the duplication of CDs to crack down on the mass production of pirate copies. While this is understandable from an intellectual property perspective, it is rather problematic from a preservation viewpoint.

I have scoured published documents in this area but there are no comprehensive examples of best practice related to data preservation from audio CDs. There are guidebooks on the preservation of the CDs themselves but next to nothing about the preservation of the data on the audio CDs. This area requires urgent attention because audio CDs may contain at-risk and decaying audio data on a fragile medium. Certain types of audio CDs are nearing their end of life faster than others.

At the SPRUCE London Mashup in July 2013 I proposed the creation of a workflow model for the preservation of audio CDs. Working mainly with Peter May (British Library) and Carl Wilson (OPF), with input from other developers at the mashup, we established that the main problem that needed to be resolved was the fact that there was no open source tool to easily create a disk image or clone of data on an audio CD.

While this may seem a straightforward project, it took no fewer than three experienced developers working on this problem many hours before a practical solution was proposed, based on cdrdao. (See: an outline of the initial solution)

Having resolved the basic need to create a clone or disk image from an audio CD, the next step in this project was to explore how to catalogue the disk image and its contents, as well as normalise the audio files into the standard BWAV format. This was supported by a SPRUCE award (funded by JISC) covering the period August-October 2013, involving Carl Wilson and Toni Sant, with the participation of Darren Stephens from the University of Hull. Through further consultation with digital forensics experts at the British Library and elsewhere, as well as systematic development, this project has addressed this issue directly.

Once the fundamental open solution was in hand, our attention could be turned to the development of a four-step workflow model for the preservation of audio CDs. The four steps are as follows:

1.    Disk Imaging (stabilizing the data)
2.    Cataloguing (through individual Cue sheets)
3.    Data Ripping (normalising the data)
4.    Open access to the catalogue (outputting the metadata)
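As a rough illustration of steps 1 and 2 only, the sketch below builds the command lines one might run with cdrdao and its companion toc2cue utility. It is not the project's own tool and does not reproduce its exact invocations; the function name, device path and file names are hypothetical, and the commands are only assembled and printed, never executed.

```python
from pathlib import PurePosixPath


def imaging_commands(device, out_dir, name):
    """Return command lines for imaging one audio CD.

    Illustrative sketch: nothing is executed here, and the options
    shown are an assumption about a reasonable cdrdao invocation.
    """
    base = PurePosixPath(out_dir) / name
    toc = f"{base}.toc"
    image = f"{base}.bin"
    cue = f"{base}.cue"
    return [
        # Step 1: read the disc into a single image file plus a TOC file.
        ["cdrdao", "read-cd", "--device", device, "--datafile", image, toc],
        # Step 2: derive a cue sheet from the TOC for cataloguing.
        ["toc2cue", toc, cue],
    ]


if __name__ == "__main__":
    # Hypothetical drive and output location.
    for cmd in imaging_commands("/dev/sr0", "images", "disc001"):
        print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to log exactly what was run against each disc, which can be useful provenance information alongside the image and cue sheet.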

Working with a specific dataset (see: an outline of the dataset) this project is now able to provide a practical workflow model utilizing the solution proposed during the London SPRUCE mashup as a tool for steps 1 & 3 called arcCD. An example of good practice has now been established in this under-explored area of preservation. All materials produced for this project are available on GitHub. Darren Stephens is also integrating further development on outputting the metadata into MediaWiki for easy access and editing of the catalogue, as part of his PhD research project entitled 'A Framework for Optimised Interaction Between Mediated Memory Repositories and Social Media Networks.'

The initial dataset used for the development of this project is managed by the Malta Music Memory Project (M3P), which seeks to provide an inclusive repository for memories of Maltese music and associated arts, ensuring that these are kept for posterity for current and future generations. M3P is one of the projects within the Media and Memory Research Initiative (MaMRI) of the University of Hull and it is facilitated by the M3P Foundation, a voluntary organization registered in Malta.

Preservation Topics: SPRUCE
Categories: Planet DigiPres

Beyond the Scanned Image: Assessing Scholarly Uses of Digital Collections

The Signal: Digital Preservation - 18 November 2013 - 2:52pm

Read the slides for a recent talk on this topic here.

The following is a guest post from Nicole Saylor, the head of the American Folklife Center‘s archives at the Library of Congress. Prior to her arrival at the Library, she was a member of the survey team while working as the head of Digital Research & Publishing at the University of Iowa Libraries.

It’s easy to see that digital collections are proliferating on the web. Just look at the growing corpora from Hathi Trust, Digital Public Library of America, ArtStor and Europeana, among many others. Providing online access to scholarship and cultural artifacts gathered in coherent aggregations in a variety of formats is increasingly driving the missions of many cultural heritage institutions. Yet what is less apparent is the degree to which these digital collections are meeting the needs of current scholars.

A recent study of humanities faculty at twelve research institutions, led by Harriett Green, English and Digital Humanities librarian at University of Illinois at Urbana-Champaign, aimed to find out more about uses of digital collections among humanities scholars. A primary goal of the study was to help inform the areas of digital collection work in which libraries have expertise, such as metadata, information retrieval and other access issues.

The survey, conducted during the 2011-2012 academic year, included a web questionnaire and 17 in-person interviews. It was conducted on behalf of Project Bamboo, a now-completed national initiative to address the question, “How can we advance arts and humanities research through the development of shared technology services?” Green and Angela Courtney, Head of Arts and Humanities and Head of Reference at Indiana University-Bloomington Libraries, presented the survey findings at Digital Humanities 2013 this summer in Lincoln, Neb.

More than 60 percent of those surveyed said that digital collections comprise at least half of the sources they use in their research. The uses of digital collections ranged from the more traditional (researching historic newspapers, government reports, legal cases, etc.) to exploring high definition images of papyrus as the basis for textual reconstructions.

Findings centered largely on the need for sustained access and discovery of digital collections, and the desire for scholars to mix and reuse digital materials. One respondent replied, “The easier objects are to repurpose, remix and reuse the better.” Green categorizes the major themes of the findings into two categories: curation and interoperability.

To make digital collections more useful in research, respondents generally said they would like more completeness of content and a better way to search digital collections for the content they need. Another prominent request was for improved tools to annotate and edit digital collection objects broadly. One respondent said he/she wants “the ability to control your collection, set up your own library and so on and go deeper and deeper, adding tags, etc. Where it’s less of a skill and more of an expectation.”

“Most immediately, this study provided information to the Project Bamboo team on things to consider how to shape digital collections for scholarly needs,” said Green. “But on a larger scale, we hoped these findings will be useful to libraries who are interested in who is using their digital collections and how they’re being used.”

Green and Courtney are working on a full paper that compares their findings to the extensive qualitative data gathered by Project Bamboo research team member Quinn Dombrowski about scholarly practices during the Project Bamboo workshops held in 2008-2009. They hope the paper will be published within the next year.

“The goal of our investigations is to offer concrete analyses of how scholars are integrating digital content into their research workflow and how their research practices are evolving with the growth in digital content,” Green said. “Our Digital Humanities 2013 presentation received a very favorable response, and we hope our forthcoming publications will be useful to libraries and cultural institutions seeking to increase the impact of their digital collections.”

Categories: Planet DigiPres

Lee Harvey Oswald’s Laptop: Forensics and Conspiracy

The Signal: Digital Preservation - 15 November 2013 - 9:00pm

What if the Kennedy assassination had happened during the era of smartphones and laptops? And, assuming the perpetrator left a digital trail, would that evidence uncover any associated conspiracy?

Found while driving on the way to Costco in Hackensack, NJ, by Ken L., on Flickr

As we approach the 50th anniversary of that awful day in Dallas, recent public opinion polls indicate that over 60 percent of Americans believe more than one person was involved with the assassination. These beliefs float on a steady stream of books and other media that scrutinize the various pieces of evidence available: recorded gunshots, photographs, bullets (both “magic” and regular) and the most famous home movie ever, the Zapruder film.

All manner of experts and enthusiasts have reviewed the evidence but agreement about what it means remains elusive: while 95 percent of all books on the subject depict a conspiracy, the purported conspirators are wildly varied and include Nazis, extraterrestrials and Corsican hitmen, among others. As The Atlantic noted a while back, much of this output is “popularized by a national appetite for mystery and entertainment.” Other studies have looked at the same evidence and concluded with certainty that Oswald acted alone.

If Oswald had lived in the digital age, he seems to me like the sort of person who would have actively participated in chat rooms, commented on blogs and broadcast his opinions via all kinds of social media. He probably would have left behind a device, such as a laptop, that documented his web browsing habits and his email contacts. Forensic investigators would have had a trove of information about who he knew and when he knew them. That evidence would have been critical both for the initial needs of law enforcement and for later researchers.

Ah, endlessly fascinating. Would there be emails from disgruntled government operators? Texts from organized crime figures? Photographs of other gunmen? Perhaps a series of tweets with darkly cryptic warnings? From a rational perspective, one would think that such details would go a long way to prove or disprove a conspiracy.

One thing is for sure: there would be lots of digital information to capture, examine and preserve. The question, however, remains open as to the research impact of this kind of evidence. Data from an Oswald laptop could disprove theories or throw open the door to a flood of conspiratorial prospects. Or some jumbled mix of both–in spite of William S. Burroughs’s proclamation that “the purpose of technology is not to confuse the brain, but to serve the body, to make life easier.”

Ultimately, as with any subject, it would come down to what researchers make of the preserved body of evidence.

At this point, most of the experience with digital forensics lies in the law enforcement world, although there is growing interest on the part of memory organizations to obtain this capability; see, for example, Digital Forensics and Preservation (PDF) and the BitCurator project. This is a good thing. Even though there is no Oswald laptop, there can be no doubt that digital forensic evidence will grow increasingly important for historical research.

Categories: Planet DigiPres

COPTR tools registry beta launch

Open Planets Foundation Blogs - 14 November 2013 - 6:43pm
Almost a year ago, I presented a proposal to the Aligning National Approaches to Digital Preservation (ANADP) group to create a community tool registry. I was frustrated by the profusion of tool registries and the lack of coordination between them. Pooling the knowledge in one place would result in a far better resource. It would be easier to discover new tools, to share experience in using them and to help avoid the tool development duplication we've seen so much of in the past. As ANADPII kicks off today in Barcelona, I'm very pleased to announce the beta launch of COPTR: the Community Owned digital Preservation Tool Registry. We've been working to collate, combine, de-duplicate and align the contents of 5 existing tool registries from: The Open Planets Foundation (OPF), The National Digital Stewardship Alliance (NDSA), The Digital Curation Centre (DCC), The Digital Curation Exchange (DCE) and the Digital POWRR Project. There were of course quite a few duplicates to weed out but the scope and depth of COPTR now supercedes anything out there that I've seen previously, albeit with an inconsistency of depth between resulting tool registry entries. Each source registry had it's own differing characteristics. At one end of the scale the DCC registry had really strong detail but coverage of well under a hundred tools. The DCE registry included over three hundred tools but with each tool described in far less detail. After much debate and consideration of feedback from many sources (thanks to everyone who got in touch), we've settled on the all important tool registry structure and a technology with which to manage the data: Mediwiki. We've kept the structure minimal to make creation of new entries relatively quick. We've also kept things factual. Experiences and evidence of using tools can be captured elsewhere and referenced from COPTR. 
MediaWiki provides an environment that enables easy navigation of the registry (probably most usefully by browsing via a tool's function) and that is quite straightforward for managing the data and providing a feed of the data. A nice touch in the registry is the use of RSS feeds and Ohloh widgets to indicate how well supported (or otherwise) the codebase of a particular tool is. See an example on the Archivematica page here.

So what happens next? The COPTR approach is not just to pool existing data in one place, but to remove the source registries from the web. The contributing organisations have committed to doing this, but first of course, they need to be happy that COPTR is ticking all the right boxes as a genuine replacement. So the next phase will be to take on board any final comment and ensure everyone is completely happy to move forward. The onus will then be on the contributors to remove their registries and perhaps explore utilising a feed of data from COPTR on their own sites. Some thought also needs to be given to differences in aims and scope of the existing repositories and COPTR. The Digital POWRR grid, for example, does a different job which doesn't easily align with COPTR. So there is some discussion to be had with the POWRR team over the next few days on how (and if) we might be able to bring things together more closely.

Perhaps most importantly we need *you* to help make this community resource a success. The data is still far from perfect. It needs tweaks, it needs more URLs, it needs entries for those important tools that are still missing. And most importantly it needs more references to your digital preservation war stories. Looking ahead we need to develop a roadmap, think about bringing in other registries, look at how we can encourage further editing and enhancement of COPTR data, and sound out interest in a hackathon to do cool things with the data feed. We would of course also appreciate any feedback on this beta launch of COPTR.
In a parallel action, an informal group of experts is looking to launch a question and answer site to replace the abortive DP Stack Exchange. Watch this space... Massive thanks go to the organisations who made this initiative possible, and kudos also to Andy Jackson for his terribly clever MediaWiki skills.

Paul Wheatley
Preservation Topics: SPRUCE
Categories: Planet DigiPres

Astronomical Data and Astronomical Digital Stewardship: An interview with Brian Schmidt

The Signal: Digital Preservation - 14 November 2013 - 5:32pm

Brian Schmidt, Astronomer at the Research School of Astronomy and Astrophysics at the Australian National University

The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.

As part of our ongoing series of insights discussions with individuals doing innovative work related to digital preservation and stewardship I am excited to talk with Brian Schmidt. Brian works as an astronomer at the Research School of Astronomy and Astrophysics at the Australian National University and his research is based on a lot of the “big data” that many individuals in the digital preservation and stewardship community have been keenly interested in.  Schmidt shared both the 2006 Shaw Prize in Astronomy and the 2011 Nobel Prize in Physics for providing evidence that the expansion of the universe is accelerating.

Jane: I read that you’ve predicted that IT specialists will be at the core of building new telescopes. For example, your SkyMapper project, which is currently scanning the southern sky, has a peak data rate of one terabyte per day. The Australian Square Kilometer Array Pathfinder, an array of 36 radio telescope dishes being built in Australia, will generate two terabytes per second. Can you talk about how you think astronomers and IT specialists will work together on these kinds of projects?

Image of the Sky Mapper project, Jamie Gilbert

Brian: New telescopes like SkyMapper are creating massive amounts of data, a terabyte of data each night. Processing a terabyte of data a night and making that data useful is as much an interesting computer science problem as it is an astronomy problem. In the past, astronomers did a lot of this kind of computer science work themselves. But the reality is, this has moved beyond what I can do sensibly myself. We need interdisciplinary groups of researchers to work together to meet these challenges. So astronomers need to be able to specify the scientific outcomes and algorithms. But implementation, and the design of systems and databases and how that data is served, is a computer science problem. So we work with computer scientists to meet our needs. If you have a lot of data, and you’re not a computer scientist, you really want to use the expertise that is out there.

Jane: Do you think that astronomers deal with data differently than other scientists?

Brian: Astronomers are very open with their data. This is one of the reasons that projects like the Sloan Digital Sky Survey work in our field. Alongside that, our data consists of representations of the night sky. Everyone knows what stars look like, which means that people understand what we do in a way that they might not with other sciences. Aside from that, much of our data, for example images of galaxies, is beautiful in a way that something like DNA sequences isn’t. These features are all important for our ability to create complex citizen science projects.

Jane: It is sometimes said that astronomers are the scientists who are closest to practitioners of digital preservation because they are interested in using and comparing historical data observations over time.  Do astronomers think they are in the digital preservation business?

Brian: Historical data is of the utmost importance in astronomy. Astronomers are often looking for subtle changes that occur over hundreds of years. For example, if we discover a new asteroid that might come close to Earth, we need to go back to the archives and see what data we have on it to figure out if it is a threat. The more years of data you have, the more accurately you can predict the orbit. Other sciences benefit from this kind of long view of historical data; however, we’re the discipline that has had our act together for the longest period of time.

Jane: What do you think the role of traditional libraries, museums and archives should be when dealing with astronomical data and artifacts?

Brian: I think we are still figuring out the role that libraries, archives and museums have to play in the contemporary work of astronomers. In 2003 a firestorm largely destroyed the library at the Mount Stromlo Observatory. Thanks to the work of IT and library staff, all of the digital information of the observatory had been backed up and was restored from off site. However, all the paper was just gone. Losing a library of resources is a major loss; however, at this point, astronomy is basically a completely digital field. We keep a small number of books around for reference, but when we want to read the literature we have the Harvard/Smithsonian Astrophysics Data System. Just about every interaction I and my colleagues have with papers and articles is through that portal. Just search and download the full text.

While we have digital access to research and reference material through services like the Astrophysics Data System, there are substantial information challenges we are facing that I think libraries, archives and museums could help with.  We’re even more information driven than in the past.  Our work could be substantially aided with libraries providing systems for working with and curating data.   Libraries need to figure out how to help curate and make available data and data products.   Ideally, we would have librarians taking on increasingly specialist niches, across many institutions.  In our library, we are bringing in more staff who have expertise in data management – trained astronomers who decide they want to be exporting data to the masses.  I think training people in library science curation is important too, and I imagine we will increasingly see individuals with these skill sets and background embedded in the teams that produce, maintain, and provide access to various data products.

Jane: “Big data” analysis is often cited as valuable for finding patterns and/or exceptions.  How does this relate to the work of astronomers?

Brian: Astronomers are often interested in very rare objects. For example, SkyMapper is cataloging 10 billion stars. And we want to find the earliest stars in the Milky Way, which have a specific color signature. We need that many stars to find enough of those stars to do our research, and as a result, we need to use data mining techniques to find those very few needles in that gigantic haystack. Those techniques allow us to do this.

Jane: What do you think astronomers have to teach others about generating and using the increasing amounts of data you are seeing now in astronomy?

Brian: Astronomers have been very good at developing standards (database and serving standards). There is a persistent danger that every library uses its own standards. You don’t want to have to work across hundreds of standards to make sense of what each piece of data means. You want it to be universal and also flexible enough to add things. Astronomy has been doing this for a good while and it’s not easy. Getting standards for data in place that work requires a consensus dictatorship. It requires collaborations between librarians and computer scientists to figure out how to create and maintain data hierarchies. Astronomers developed the FITS data standard in the 1980s and are still using it. In the last five to seven years it has diverged a bit in the field, which suggests we likely need to revisit and revise it. Every time an observatory observes something, there are stars in common between observations that can serve as a point of reference. Linking this data can be very complicated: cross-matching is a difficult problem for 10 billion objects. The obvious thing is to give every object an index number, but you have to allow for uncertainty.
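
To make the cross-matching problem concrete, here is a minimal sketch of matching objects between two catalogs while allowing for positional uncertainty. It is an illustration only, not a description of how SkyMapper or any real survey does it: the catalog contents, the identifiers and the one-arcsecond tolerance are all made-up assumptions, and the brute-force loop would not scale to billions of objects, where a spatial index would be needed.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(catalog_a, catalog_b, tolerance_arcsec=1.0):
    """Pair each object in catalog_a with its nearest neighbour in catalog_b.

    An object is left unmatched (None) when nothing in catalog_b lies within
    the tolerance. Brute force, O(len(a) * len(b)): fine for a sketch only.
    """
    tolerance_deg = tolerance_arcsec / 3600.0
    matches = {}
    for id_a, (ra_a, dec_a) in catalog_a.items():
        best_id, best_sep = None, tolerance_deg
        for id_b, (ra_b, dec_b) in catalog_b.items():
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches


# Hypothetical catalogs: identifier -> (right ascension, declination) in degrees.
survey_a = {"A1": (150.00000, 2.20000), "A2": (150.10000, 2.30000)}
survey_b = {"B7": (150.00010, 2.20005), "B9": (151.00000, 3.00000)}

matches = cross_match(survey_a, survey_b, tolerance_arcsec=1.0)
print(matches)
```

In this toy data, A1 and B7 lie well under an arcsecond apart and are treated as the same object, while A2 has no counterpart within the tolerance and stays unmatched.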

Jane: What do you think will be different about the type of data you will have available and use in 10 years or 20 years?

Brian: We are going to continue to have more and more data and information. We now have images of the sky, but in the future we will have images at thousands of wavelengths (compared to 5 or 6 now). We are going to have data cubes that record coordinates and intensity at 16k frequencies from radio telescopes. We are talking about instruments that generate a petabyte of data a night. This quantity of data is a challenge for every part of a system. It’s difficult to store, retrieve, process, and analyze, and exactly how we work with it is a work in progress. We very well may need to be processing this data in real time, finding the signal we care about and disregarding the noise, because the initial raw data is just too much to deal with if we let it pile up.

Jane: Speaking of raw data, do astronomers share raw data, and if so, how? When they do share, what are their practices for assigning credit and value to that work? Do you think this will change in the future?

Brian: Astronomers tend to store data in multiple formats. There is the raw data, as it comes off the telescope, and we tend to store a copy of that. However, the average researcher doesn’t care about that. They want it transformed into its final state: fully calibrated, so that we know where every pixel points to in the sky. At this point, all the data we provide access to is processed data. You can make a query and we give back “here’s this star and its properties.” It’s just too hard to query into the actual images we’ve collected. That isn’t how our systems are set up.

Jane: You’ve talked about the value of citizen science projects such as Galaxy Zoo.  How do you think these kinds of projects could make a case for preservation of data?

Brian: Citizen science, at its best, serves as outreach/education and the advancement of science simultaneously. We need to be careful that citizen science projects are doing scientifically useful work with the hours and efforts people are putting in. Ideally, we can leverage the work people put into these kinds of projects to calibrate algorithms to double the value of their efforts. Given the immense data challenges facing astronomy and other sciences, and the potential for citizen science projects to bring the public in to help us make sense of this data, I think we are entering into a brave new information world. At this point, we need library and information science to become a lot bolder to stay relevant. There are huge opportunities to do great things in this area. I think timidity is likely the biggest threat to the future potential role that libraries, archives and museums could play in the future of sciences like astronomy. There are huge opportunities here to do great things.

Categories: Planet DigiPres

Preserving Vintage Electronic Literature

The Signal: Digital Preservation - 13 November 2013 - 4:08pm

Electronic Literature Lab at Washington State University Vancouver

Dene (pronounced “Deenie”) Grigar’s mother was an artist who painted mainly with oils on canvas. But occasionally she painted on a different medium, such as wood or pottery. Once she experimented with painting on bamboo, a medium she was unfamiliar with.

“Bamboo is porous,” said Grigar. “It can absorb the paint. So my mother compensated by using very thick paint and very thick brushes to get the paint to stay on the surface.” Grigar’s mother fiddled with various materials and techniques until she figured out what worked and what did not. Within the constraints of the bamboo surface she created a lovely work of art.

Grigar tells that story to illustrate how artists can still create even when using material that is unfamiliar to them. And she should know. Like her mother, Grigar is an artist. She is also director and associate professor of the Creative Media & Digital Culture Program at Washington State University Vancouver. The medium to which she devotes herself is electronic literature, or eLit, particularly works from the period from the mid-1980s to the late 1990s.

During that period, personal computers proliferated and experimental artists were drawn to the ones with graphical user interfaces (as opposed to text-based command line screens) and interactive multimedia. Artists were lured to computers despite the unfamiliar material…or maybe because of it.

The new generation of personal computers in the 1980s, particularly Macintoshes, were a pleasure to use and play with, much like modern smart phones. Macs were not dry, business-only machines. There were no command lines to memorize, no “under the hood” technical details to fuss with. You simply turned Macs on and started playing. They invited play.

And artists did just that. They played. They explored. They tinkered. And from the palette of text, hyperlinks, audio and graphics arose, among other things, electronic literature.

The term “electronic literature” applies to works that are created on a computer and meant to be read and experienced on a computer. Grigar, a scholar and devotee of eLit, helped build a lab in which to preserve and enjoy works of vintage electronic literature.

Dene Grigar

She helped create the Electronic Literature Lab at Washington State University Vancouver, which houses a collection of over 300 works of eLit (one of the largest collections in the world) and twenty-eight vintage Macintosh computers on which to run them. Each computer has its appropriate OS version and, for browser-based works, appropriate browser versions.

The ELL is never closed. Students with access rights can come and go at any time. Despite the age of the computers, they are all in good working condition. Grigar has someone who maintains the lab computers and keeps them tuned and running, and she uses a local computer-repair specialist for more serious technical issues.

In addition to preserving the software disks on which the works reside, the ELL backs up and preserves their software in a repository. In some cases, the ELL keeps a copy of the software on the computer on which the work is played rather than go through the whole re-installation process; on the older computers that could require loading several disks. For CD-based works, they make an ISO image backup copy.

The ELL has a searchable database to track all the works, the computers, operating systems and software requirements. If a user wants to view a work, he or she would search for it and, according to its requirements, locate which lab computer to use.
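
The lookup described above can be pictured as a simple match between a work's requirements and each machine's configuration. The following is only a sketch of that idea, not the ELL's actual database: the record fields, station labels, work titles and OS values are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Computer:
    label: str          # hypothetical station name in the lab
    os_version: str     # operating system installed on the machine
    has_cd_drive: bool


@dataclass(frozen=True)
class Work:
    title: str
    required_os: str    # OS version the work was made to run on
    needs_cd: bool      # True for CD-based works


def find_computers(work, computers):
    """Return the labels of all machines that satisfy the work's requirements."""
    return [
        c.label
        for c in computers
        if c.os_version == work.required_os and (c.has_cd_drive or not work.needs_cd)
    ]


# Made-up inventory and catalog, for illustration only.
LAB = [
    Computer("Station 1", "System 6", has_cd_drive=False),
    Computer("Station 2", "System 7", has_cd_drive=True),
    Computer("Station 3", "System 7", has_cd_drive=False),
]
CATALOG = {
    w.title: w
    for w in [
        Work("Example floppy work", "System 6", needs_cd=False),
        Work("Example CD work", "System 7", needs_cd=True),
    ]
}


def locate(title):
    """Search the catalog by title and list the machines able to run the work."""
    return find_computers(CATALOG[title], LAB)


print(locate("Example CD work"))
print(locate("Example floppy work"))
```

A real catalog would also need to record software dependencies and browser versions, which the article says the ELL tracks as well.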

All of the electronic literature works at the ELL share one common element: they deviate from traditional literature. Unlike paper-bound literature with sequentially numbered pages and a beginning, middle and end, many works of eLit do not read linearly. There are underlying decision trees that enable users to decide where they want to go next; the experience is chunked into scene-like elements and it is up to the user which element to navigate to next. Navigation is often left to chance. In fact, the decision-making process that is standard for many games today has its roots in vintage eLit. (Think of first-person shooters and multi-player adventure games, the “where can I go and what are my options?” games.)

In vintage eLit, a work that was rich in content pushed the limits of the computers of the day: the richer the content, the slower the computer ran. One of the challenges the artist faced was to see how much she or he could pack into a piece.

“One of the coolest things about working with these early pieces from, say, StorySpace,” said Grigar, “is that when you put the 3 1/2 inch floppy in and as the work was loading, you got a little dialog box that said ‘This work has 2000 nodes and has 1600 links’ and you’re watching each link load, one at a time. Part of the excitement was seeing how many nodes and how many links there were and how big and intricate the work was.”

Grigar is dedicated to preserving the experience of each work as the author or artist originally intended it, under the same physical conditions in which you would have experienced it when it was first released. That includes experiencing the sluggishness and snags of the technology. Not only are the works historically and culturally significant, their limitations and affordances are too.

“All of the quirks, all of the glitches, all of the constraints are obvious to you,” said Grigar. “And it was kind of a badge of honor to artists that you did this much work. It’s like handing someone James Joyce’s Ulysses as opposed to handing them a forty-page article. It’s like ‘This is my novel. See how big it is? See how many nodes there are? See how many hyperlinks I had to make?’

“When you put all this on an emulator, all of those differences collapse. The slowness and glitchiness was part of the beauty of the work…I’m not convinced that emulators can capture a lot of that experience and the wonder of how things actually moved.”

The computers in the ELL are arranged in chronological order to demonstrate the evolution of the art form. For example, beginning in 1983, you can see that artists created works in grayscale and with ASCII characters. In time, computers acquired a palette of 256 colors, which spawned a different stage of creativity. Then came thousands of colors and another stage of creativity.

“The palette just kept getting bigger,” said Grigar. “And so they go crazy with that and have fun with that. CDs like the Voyager piece ‘Shining Flower’: it’s just exquisite. It’s just amazing. You could tear up, it’s just that gorgeous.”

In the earliest works of eLit, artists coordinated words with audio and graphics. As the technology evolved and artists could include motion pictures, the storytelling blurred the lines between literature, animation and movies. Still, no matter how much artists stretched the genres, vintage eLit works were still limited by the computer keyboard and mouse.

Newer works of interactive media, or participatory media, reach for other methods of interactivity. For example, “The Breathing Wall,” by Kate Pullinger, responds to the user’s rate of breathing, not the clicking of a mouse. And new advances in augmented reality enable interactivity with software without directly touching — even if only with your breath — any hardware objects. In some game systems and art installations users can interact with software through gestures and eye movements. Artistic expressions of human/computer interaction will clearly continue to evolve along with technology.

For now, Grigar is focused on protecting vintage electronic literature. She does not assume that the machines and software of vintage eLit will always be available, so she and hypertext author Stuart Moulthrop created Pathfinders, which demonstrates the user experience through video recordings of the artist and users reading works of early eLit.

“We have the authors perform their work on the computers and we videotape it,” said Grigar. “And the video will be archived for posterity so that one day when there are no more Macintoshes from 1983, we will at least have the video. It is better than just an emulator, because you could see the work unfold and have the author talking.”

Electronic Literature Showcase

In April 2013, Grigar, along with colleague Kathi Inman Berens and eight of Grigar’s students, presented the Electronic Literature Showcase at the Library of Congress. She brought several Macintoshes with her (she has extra vintage Macs as well as extra copies of software) to demonstrate some notable works of eLit, including a Mac Classic on which to show Shelley Jackson’s “Patchwork Girl” and Michael Joyce’s “Afternoon, A Story.” She also brought along a G3 iMac on which to run her original copy of “Myst.”

The ELL is one of several labs dedicated to the preservation of vintage multimedia. Others include the Media Archaeology Lab, The Trope Tank and especially the Maryland Institute for Technology in the Humanities.

Preservation and access are equally important in the curation of electronic literature. Grigar and her colleagues are committed not only to preserving vintage works of digital humanities (the software) but also to maintaining access to them, keeping the machines running and encouraging people to experience each work in its native technological context.

Grigar said, “What drives my research is how artists use the medium and the platforms and all the things to their advantage and work through the constraints so that the constraints do not look like weaknesses but actually are part of the beautiful aspect of the work.”

Categories: Planet DigiPres

SPRUCE Project Award: Northumberland Estates

Open Planets Foundation Blogs - 13 November 2013 - 1:40pm
Using the Digital Preservation Business Case Toolkit to justify Digital Repository investment

Northumberland Estates (NE) were delighted to be awarded a SPRUCE Award to carry out a detailed analysis of current digital repository solutions suitable for small to medium organisations. In conjunction with The University of London Computer Centre (ULCC) they created a toolkit justifying investment in a recommended solution. The business case will aim to implement a sustainable digital repository for the long term management of Northumberland Estates' digital content. With a particular focus on small to medium organisations this project aims to address the lack of knowledge in the digital preservation community on preservation as a service (PaaS) providers.

Methodology

Objective: Produce a specification detailing exact requirements for procurement of a digital repository

There are a number of high level requirements which the adopted solution must meet. For this purpose, we created an organisational and technical assessment based on the methodology of the OAIS Reference Model. The technical specification is essentially a “shopping list” of what the chosen system has to do to perform digital preservation. The overall aim was to keep the specification concise, manageable and realistic so that it would meet the immediate business needs of NE, while also adhering to best practice.

Objective: Case studies analysed against specification

The specification was recast into a form that could be used for assessing a preservation solution. Before the product analysis was carried out three potential solutions were identified:

1. Open Source: Many Higher Education institutions already have mature repository instances through the use of open source software such as DSpace, EPrints, and Fedora.

2. Out of the Box: The emergence of PaaS providers such as Tessella Preservica and Ex Libris Rosetta provide active preservation and curation of digital assets.

3. Hybrid: A combination of commercial in house/open source systems. For example, Arkivum provides bit level preservation while open source OAIS digital preservation systems such as Archivematica can provide the extra level of preservation required for the creation of SIPs, AIPs, and DIPs.

By conducting a product analysis for each of these options a much greater understanding of the functional capabilities was formed.

Objective: ISO 16363 assessment of NE

The product analysis provided a really good benchmark for the functional aspects of each repository option, but it was felt that the results were tending to emphasise the performance of the software, rather than the needs of the producers, consumers, or archivist. To balance this trend, the project team took on an extra objective that was not originally in scope of the project. Broader requirements not captured by simply covering repository software functionality needed to be considered. In particular: storage and bit preservation resilience; how many copies of each file, and storage in different locations; who will ingest content and where they will do it; whether they will have different user roles; and how and where users will access the data. To cover these gaps, the requirements were expressed by the Digital Curator in narrative form as a “basic information and workflow story” about the work of NE. The project team agreed to address the requirements by conducting a cut-down ISO 16363 assessment. This organisational analysis was explicitly intended to complement and enhance the assessment of the repository solution. The organisational assessment produced a mini gap analysis of the digital preservation capacity of NE. By using the expertise provided by ULCC to validate these assessments against wider expert opinion, the results represent a summary of how and whether each requirement has been met, or could be met in the future.

Objective: Business Case

The final business case needed to be as concise and targeted as possible. The decision was made to take one recommendation forward based on the functional and organisational assessments made:

1. Open Source: Previous research undertaken by the Digital Curator indicated that the implementation of an open source digital repository would not be feasible due to the investment and expertise required.

2. Out of the Box (recommended option): Preservica scored very highly and also proved to be the most cost effective solution based on initial calculations. Other out of the box solutions were considered such as Ex Libris Rosetta, but the cost of implementing this system in-house was prohibitive.

3. Hybrid: The combination of using the OAIS compliant Archivematica in conjunction with bit-level preservation provided by Arkivum was considered. However, the combination of these two solutions was not as comprehensive and cost effective in comparison to an out of the box solution.
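
As an illustration of how options can be compared against a specification with a weighted score, here is a minimal sketch. Everything in it is assumed for the example: the requirement names, the weights, the 0 to 5 scoring scale and the scores themselves are invented, and the anonymous "Option A/B/C" labels deliberately do not correspond to the products or results of this project.

```python
def weighted_score(scores, weights):
    """Weighted average of per-requirement scores.

    `weights` maps each requirement to its importance; `scores` maps each
    requirement to how well one option meets it (here on a 0-5 scale).
    """
    total_weight = sum(weights.values())
    return sum(weights[req] * scores[req] for req in weights) / total_weight


# Hypothetical requirements and weights (higher weight = more important).
WEIGHTS = {"ingest": 3, "bit_preservation": 5, "access": 2}

# Hypothetical scores for three anonymous candidate solutions.
OPTIONS = {
    "Option A": {"ingest": 2, "bit_preservation": 3, "access": 4},
    "Option B": {"ingest": 4, "bit_preservation": 4, "access": 3},
    "Option C": {"ingest": 3, "bit_preservation": 5, "access": 1},
}

# Rank the options from best to worst weighted score.
ranking = sorted(
    OPTIONS,
    key=lambda name: weighted_score(OPTIONS[name], WEIGHTS),
    reverse=True,
)

for name in ranking:
    print(f"{name}: {weighted_score(OPTIONS[name], WEIGHTS):.1f}")
```

Keeping the weights separate from the scores makes it easy to rerun the comparison when an organisation's priorities differ, for example if cost is added as a further weighted requirement.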

Once the recommended option was decided, it was a case of using the guidance of the Digital Preservation Business Case Toolkit to form the final business case. What resulted was a straight to the point and clear justification based on expert knowledge which was presented internally to key stakeholders within NE.

Lessons Learnt

 There is no one size fits all solution!
  • Much of what is concluded will be based on your own organisational context, all of which can influence the right approach towards digital preservation. However, it is hoped that this project can establish a methodology which other small to medium organisations can adopt.
 Identify existing business drivers/organisational goals. 
  • Aligning organisational goals from the outset will save you a great deal of work further down the line. By identifying these key drivers you can begin to build up support for your recommended solution before the big pitch to senior management.
 Use existing work already available. 
  • There are a number of fantastic resources out there which can save you reinventing the wheel. The first and most obvious point of contact is the new Digital Preservation Business Case Toolkit. A fantastic resource including everything you need to get started.
 Lay out the options clearly and concisely. 
  • Nail down upfront costs for at least the first three years. After all, you want a solution which can be sustained into the future. For any costs, include benefits and any potential returns on investment that can be identified.

Conclusions

We believe that both the methodology and the actual outputs will have reuse value for other small organisations. With the Specification document and the Organisational Assessment form we have achieved a credible specification and assessment method that is a good fit for NE. These two forms are also provided blank, in the hope that other organisations can use them.

Our methodology shows it would be possible for any small organisation to devise their own suitable specification. It is based not exclusively on OAIS, but on the business needs of NE and a simple understanding of the user workflow. There are other methods of assessment; for example the MoSCoW method instead of a weighted score. With a thorough assessment of the solutions NE stands a better chance of selecting the right system for their business needs, using a process that can be repeated and objectively verified. This method should be regarded as quick and easy. Since we used supplier information, success of the method depends on whether that information is accurate and truthful. But it would be a good first step to selecting a supplier. More in-depth assessments of systems are possible.

With the ISO 16363 assessment we can show that it is possible for an organisation to perform a credible cut down and restricted ISO self-assessment in a very short time. This could be a viable alternative to using an expensive consultant. It must be noted that these outputs do not represent a shortcut to carrying out a full ISO assessment. The methodology and outputs instead demonstrate how smaller organisations can carry out a similar process to assess their own digital preservation requirements.

The results from this project are clearly encouraging for small to medium organisations who wish to address the problems associated with digital preservation. There is a variety of emerging solutions, from out of the box solutions like Preservica to open source digital preservation systems such as Archivematica. With the correct buy-in from stakeholders and investment in time, resources, and expertise, smaller organisations can implement solutions which will preserve digital content in a sustainable manner. However, procuring preservation systems is by no means a straightforward task. The current market remains relatively small and there are limited options to choose from. If small organisations (no matter which sector they belong to) are to be convinced of the worth of investing in digital preservation systems, there needs to be greater advocacy within the wider digital preservation community, and increased competition amongst vendors who provide such solutions.

The full Northumberland Estates case study can be found at:

Christopher Fryer – Digital Curator and Assistant Records Manager, Northumberland Estates
Edward Pinsent – Digital Archivist/Project Manager, University of London Computer Centre (ULCC)


Preservation Topics: SPRUCE
Categories: Planet DigiPres