The Signal: Digital Preservation
The beginning is a very fine place to start indeed for the Federal Agencies Digitization Guidelines Initiative Born Digital Video subgroup of the Audio-Visual Working Group. As mentioned in a previous blog post, the FADGI Born Digital Video subgroup is taking a close look at the range of decisions to be made throughout the lifecycle of born digital video objects, from file creation through archival ingest and access delivery. Through case histories from federal agencies such as the National Archives and Records Administration, the Smithsonian Institution Archives, the National Oceanic and Atmospheric Administration, the Library of Congress, Voice of America and the American Folklife Center, we are exploring the “truth and consequences” of creating and archiving born digital video. In this blog post, we’ll look at some of our guiding principles for creating born digital video.
But as Julie Andrews sings, let’s start at the very beginning. What do we mean by born digital video? Quite simply, it’s video that is recorded to digital file at the point of creation. Born digital video is distinct from digitized or reformatted video, a label used to describe the result of translating the analog signal data emanating from a video object into a digitally encoded format. FADGI’s Reformatted Video subgroup is developing a matrix which compares target wrappers and encodings against a set list of criteria that come into play when reformatting analog videotapes.
The first set of FADGI BDV case histories highlights what we call advice for shooters (a.k.a. videographers), and by extension, the project managers within cultural heritage institutions who are responsible for the creation of new born digital video files – especially determining the technical file specifications. It’s important to recognize that the FADGI target audience for these case histories isn’t Hollywood or commercial entertainment producers. It’s the cultural heritage community and smaller archives who create non-broadcast classes of content such as oral history recordings. A great example is the Civil Rights History Project at AFC. These types of projects have the opportunity to spec out the born digital video deliverable from the very beginning and end up with a file that is ingest-ready for preservation and access systems.
The goal of the case histories project is to use guiding principles to illustrate the advantages of high quality data capture from the very start. Two examples of FADGI’s guiding principles for creating born digital video include:
- Create uncompressed video instead of compressed video. Compressed video reduces the amount of data in a file or stream. Although a reduced amount of data can be beneficial for easing storage, transfer, and play-out requirements, it generally introduces additional technical complexity which can have a negative impact on usability of the file over time. Uncompressed video retains all the visual information captured at the selected resolution, which is preferable for preservation purposes.
- If compression is required, use lossless compression over lossy compression. Lossless compression uses algorithms that allow the original data to be restored exactly upon decompression. It is essentially reversible compression. Lossy compression permanently alters or deletes data. If data reduction gains are significant enough to warrant the added complexity of compressed files, lossless compression is preferred to preserve video quality.
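The reversibility distinction above is easy to demonstrate in a few lines. The sketch below is illustrative only: it uses Python’s `zlib` (a DEFLATE implementation, standing in for a real video codec) on a hypothetical byte string in place of raw frame data, showing that losslessly compressed data decompresses back to an exact copy, while a toy lossy reduction discards information that can never be recovered.

```python
import zlib

# Hypothetical stand-in for raw, uncompressed frame data.
original = bytes(range(256)) * 100

# Lossless (DEFLATE): decompression restores the data exactly.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
assert restored == original  # fully reversible: no information lost

# Toy "lossy" reduction for contrast: keeping every other byte
# shrinks the data, but the dropped bytes are gone for good.
lossy = original[::2]
assert len(lossy) < len(original)

print(f"original: {len(original)} bytes, lossless: {len(compressed)} bytes")
```

Real video codecs are far more sophisticated, but the preservation trade-off is the same: a lossless encoding can always be walked back to the original bitstream; a lossy one cannot.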
These are just two examples that focus on the video encoding. The guiding principles also cover considerations for file wrapper or container capabilities, format sustainability and more general project concerns.
But here’s the thing: our case histories don’t always follow our own guiding principles. And that’s just fine by us. None of us live in a utopian world where digital storage is abundant and systems are completely interoperable. We all have to make choices and compromises to work within our constraints. Uncompressed video files can be huge and a burden to manage and maintain. Lossy compression can be appropriate for certain projects. The guiding principles should all be read with the caveat “if you have the option….” Sometimes, you simply don’t have the option for a myriad of reasons. But when you do have the option, the guiding principles highlight the advantages of high quality data capture. The important take-away from the case histories project is that the choices made during the file creation process will have impacts on the long-term archiving and distribution processes, and it’s essential to understand what those impacts are and have a plan to resolve any conflicts.
Our hope is that these guiding principles and case histories help us start to flesh out more specific format guidance for born digital video but that’s in the future. The case history project, which will be published on the Federal Agencies Digitization Guidelines Initiative website this spring, is the first step towards understanding where we are as a community and what we can learn from each other.
How do I know if a digital file/object has been corrupted, changed or altered? Further how can I prove that I know what I have? How can I be confident that the content I am providing is in good condition, complete, or reasonably complete? How do I verify that a file/object has not changed over time or during transfer processes?
In digital preservation, a key part of answering these questions comes through establishing and checking the “fixity” or stability of digital content. At this point, many in the preservation community know they should be checking the fixity of their content, but how, when and how often?
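In practice, a fixity check usually means computing a cryptographic hash of a file at ingest and recomputing it later (or after a transfer) to confirm nothing has changed. Here is a minimal sketch using Python’s standard `hashlib` module; the file and its contents are hypothetical stand-ins for stewarded content, and the helper name `file_checksum` is my own.

```python
import hashlib
import tempfile

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a checksum by streaming the file in chunks,
    so even very large files don't need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Simulate ingest: write a (stand-in) content file and record its checksum.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in for digital content")
    path = tmp.name

recorded = file_checksum(path)

# A later fixity check: recompute and compare against the recorded value.
assert file_checksum(path) == recorded, "fixity check failed: content changed"
```

The harder questions, and the ones the NDSA document takes up, are not how to compute a hash but where to store the recorded values, how often to recompute them, and what a given checking schedule costs on a given storage system.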
A team of individuals from the NDSA Infrastructure & Standards working groups has developed Checking Your Digital Content: How, What and When to Check Fixity? in an effort to help stewards answer these questions in a way that makes sense for their organization based on their needs and resources. We are excited to publicly share this draft document for broader open discussion and review here on The Signal. We welcome comments and questions; please post them at the bottom of this post for the working group to review.
Not Best Practices, but Guidance for Making Best Use of Resources at Hand
In keeping with work on the NDSA Levels of Digital Preservation, this document is not a benchmark or requirement. It is instead intended as a tool to help organizations develop a plan that fits resource constraints. Different systems and different collections are going to require different fixity checking approaches, and our hope is that this document can help.
Connection to National Agenda for Digital Stewardship
This guidance was developed in direct response to the need articulated in the infrastructure section of the inaugural National Agenda for Digital Stewardship. I’ll quote it at length below for context.
Fixity checking is of particular concern in ensuring content integrity. Abstract requirements for fixity checking can be useful as principles, but when applied universally can actually be detrimental to some digital preservation system architectures. The digital preservation community needs to establish best practices for fixity strategies for different system configurations. For example, if an organization were keeping multiple copies of material on magnetic tape and wanted to check fixity of content on a monthly basis, they might end up continuously reading their tape and thereby very rapidly push their tape systems to the limit of reads for the lifetime of the medium.
There is a clear need for use-case-driven examples of best practices for fixity in particular system designs and configurations established to meet particular preservation requirements. This would likely include description of fixity strategies for all-spinning-disk systems, largely tape-based systems, as well as hierarchical storage management systems. A chart documenting the benefits of fixity checks for certain kinds of digital preservation activities would bring clarity and offer guidance to the entire community. A document modeled after the NDSA Levels of Digital Preservation would be a particularly useful way to provide guidance and information about fixity checks based on storage systems in use, as well as other preservation choices.
Again, please share your comments on this here, and consider forwarding this on to others who you think might have comments to share with us.
My two young teenage daughters spend hours playing Minecraft, building elaborate virtual landscapes and structures. They are far from alone; the game has millions of fans around the world. Teachers are seizing on Minecraft’s popularity with kids as a tool to teach both abstract and concrete subjects. What’s unique about this situation is not so much the product as that a virtual world is functioning as both a fun, engaging activity and a viable teaching tool. We’re witnessing the birth of a new genre of tools and a new set of challenges for preserving the digital creations people build with those tools.
Like most parents, I save many of the things that my daughters create. From where I’m sitting in my home as I write this blog post, I can see their works dotting the room. On one wall is a framed pencil sketch one daughter drew of our family; on a shelf is a perfect clay replica she made of Moomintroll. Hanging above a window are drawings my other daughter did — a Sharpie drawing of tree houses and a pen doodle of kaleidoscopic patterns that disappear into a tunnel-like vanishing point. Huge snowflakes (no two alike) that they cut from paper dangle here and there around the room.
I never gave much thought to their virtual gaming activities, aside from monitoring how much time they spend on their electronic devices. But I like that Minecraft lets my kids invent universes and play inside them together and I can tell that it feeds an important part of their intellectual growth as they make things, investigate things and solve problems. So I decided that I’d like to save what I can of the worlds they create, just as I save the rest of their crafts and artwork, which raised questions about what I can save, how I can save it and why I would even want to save it.
Over the last decade, the Library of Congress and its NDIIPP and NDSA partners have led the research into preserving virtual worlds, from military simulations to consumer games. Many of the questions – technological and philosophical – have long been asked and answered, or at least the challenges have been identified and defined. That’s fine for institutions that recognize the cultural value of virtual worlds and have the resources to archive them, but what does it mean for a parent who just wants to save his or her kid’s virtual world creations?
A colleague at the Library of Congress, Trevor Owens, is part of the ongoing research on preserving virtual worlds and preserving software. In fact, Owens is one of the organizers of the preserving software conference. He said that the solution to the question of saving something from virtual worlds depends on whether you want to save:
- the virtual world that you or someone else built
- testimony about what the virtual world meant to you or them at a particular time
- or documentation of the virtual world.
Preserving the virtual world itself is the most difficult option. The complexities of preserving virtual worlds are too much to go into in this blog post. And when it comes to talking about networked virtual worlds inhabited by live human participants, the subject often gets downright esoteric, like defining where “here” actually is and what “here” means in a shared virtual world and how telepresence applies to the virtual world experience. But to illustrate the basic technological dilemma of preserving a virtual world, here’s a simple example.
Let’s say I build an island, castle and estate in a virtual world and name it Balmy Island. If I want to save Balmy Island and be able to walk around it anytime I want to, I need all the digital files of which Balmy Island is constructed. I might need the exact version of the application or software that I used to build Balmy Island, as well as the exact operating system — and version of the OS — of the hardware device on which I built Balmy Island. And I might need the hardware device itself on which I created Balmy Island. So if I build Balmy Island on my computer, I have to preserve the computer, the software and the files just as they are. Never upgrade or modify anything. Just stick the whole computer in the closet, buy a new computer and pull out the old one whenever I wanted to revisit Balmy Island.
Another less-certain and less-authentic option is that I could save the Balmy Island files and hope that someday someone will build an emulator that will restore some approximate version of my original Balmy Island. It will not be exactly the same, but it might be close enough.
Saving the hardware and software for just this one purpose is unrealistic for the average person, but for cultural institutions it makes perfect sense. Stanford University is the home of the Stephen M. Cabrinety Collection in the History of Microcomputing and it is also building a Forensics Lab with a library of software and electronic devices for extracting software from original media, so that it can be run later in native or emulated environments. Similar labs at other institutions include the Maryland Institute for Technology in the Humanities, the International Center for the History of Electronic Games at the Strong National Museum of Play and the UT Videogame Archive at the Dolph Briscoe Center for American History, University of Texas at Austin. The Briscoe Center was featured in the Signal post about video game music composer George Sanger. (Dene Grigar, who was the subject of another Signal blog post, created a similar lab devoted to her vintage electronic literature collection at Washington State University Vancouver.)
Henry Lowood, curator for History of Science & Technology Collections and Film & Media Collections in the Stanford University Libraries, was a lead in the Preserving Virtual Worlds project. Lowood has a historical interest in games, virtual worlds and their role in society, and he makes a case for the option of recording testimony about what a virtual world means to its users and builders.
Lowood helped create the Machinima and Virtual Worlds collections, which are hosted by our NDIIPP/NDSA partner, the Internet Archive. These collections host video recordings of activities and events in virtual worlds and immersive games. As the users perform actions and navigate through the worlds, they sometimes give a running commentary about what is happening and their thoughts and observations about its meaning to them.
A parent or teacher could use this same approach by shooting a video of a child giving you a tour of their virtual world. It’s an opportunity to capture the context around their creation of the worlds and for them to tell you how they felt about it and what choices they made. If they interact with others in a shared virtual world, the child can describe his or her interactions and maybe even relate anecdotes about certain events and experiences.
Screenshots are easy to take on computers and most hand-held devices. PCs have a “print screen” button on the keyboard; for Macs, hold down the Apple key ⌘ plus shift plus 3. For iPods, press and hold the main button below the screen and the power button on the top edge of the device at the same time. And so on. Search online for how to take screen shots or screen captures for your device.
The screenshot will save as a graphic file, usually a JPEG or PNG file. Transfer that JPEG to your computer, crop it and modify it with a photo processing program if you want. Maybe print the screen shots and put them on the refrigerator for you to admire. When you’re finished with the digital photo file, back it up with your other personal digital archives.
Recording a walk-through of a virtual world can be a slightly more complex task than taking a screenshot, but not terribly so. Search online for “screencast software,” “motion capture” or “screen recording” to find commercial and freeware screencast software. Even version 10 of the QuickTime player includes a screen recording function. They all pretty much operate the same way: click a “Record” button, do your action on the computer and click “Stop” when you are finished. Everything that was displayed on the screen will be captured into a video file.
With the different screen capture software programs, be aware of the video file type that the software generates. QuickTime saves the video as an MOV file, Jing saves the video as an SWF file and so on. Different file types require different digital video players, so if you have any difficulty playing the file back on your computer search online to find the software that will play your video file type. If you upload a copy of your video to YouTube, backup a master copy somewhere else. Don’t rely on the YouTube version as your master “archived” copy.
Although this story is about the challenges of saving mementos from digital virtual worlds, the essence of the challenge — trying to preserve an experience — is not new. If I go to Hawaii, snorkel, build sand castles and have the time of my life, I cannot capture or hold onto that experience. I can only document the experience with photos, video and maybe write in a journal about it. In a way, it even goes back to the dawn of humanity, where people recorded their experiences by means of cave paintings.
So you cannot capture the experience of a virtual world but you can document it. And virtual worlds are a lot more accessible in 2014 than they were in 1990. It’s a long way from Jaron Lanier’s work, from VPL labs and data gloves and headsets and exclusive access in special labs. Kids now carry their personalized virtual worlds in their handheld devices. Minecraft is just the current cool tool. Who can tell what is yet to come?
It seems appropriate to let Howard Rheingold have the last word on the subject. Rheingold is a writer, teacher, social scientist and thought-leader about the cultural impacts of technology. He is also an authority on virtual reality and virtual communities, having written the definitive books about both topics over twenty years ago. His current book is titled Net Smart.
In addition to his professional expertise, Rheingold is a caring father who dotes on his daughter. While he was researching and writing the books Virtual Reality (1991) and The Virtual Community: Homesteading on the Electronic Frontier (1993), his office walls were filled with her childhood artwork (she is now in her 20s). He brings a unique and authoritative perspective to this story.
Rheingold said, “I’ve been closely observing and writing about innovations in digital media and learning in recent years – and experiencing/experimenting directly through the classes I teach at Stanford and Rheingold U. Among my activities in this sphere is a video blog for DMLcentral, a site sponsored by the MacArthur Foundation’s Digital Media and Learning Initiative. It was there that I delved into the educational uses – and students and teachers’ passion for – Minecraft.
“In my interviews with teachers Liam O’Donnell and Sara Kaviar, it became clear that Minecraft was about much more than using computers to build things. It was a way to engage with a diverse range of abstract subject matter in concrete ways, from comparative religion to mathematics, and more importantly, a way for students to exercise agency in a schooling environment in which so much learning is dependent on what the teacher or textbook says.
“Minecraft artifacts are also important contributions to student e-portfolios, which will become more important than resumes in the not too distant future. Given the growing enthusiasm over Minecraft by students, teachers, and parents, and the pedagogical value of seeing these creations as artifacts and instruments of learning, it only makes sense to make it easy and inexpensive to preserve virtual world creations.”
The February issue of the Library of Congress Digital Preservation Newsletter (pdf) is now available!
Included in this issue:
- Spotlight on Digital Collections, including an interview with Lisa Green on Machine Scale Analysis of collections, and a look at the Cultural Heritage of the Great Smoky Mountains
- Digital Preservation Aid in Response to Tornado
- NDSA Digital Content Area: Web and Social Media
- Wikipedia and Digital Preservation
- AV Artifact Atlas, FADGI interview with Hanna Frost
- Several updates on the Residency Program
- Listing of upcoming events including the IDCC (Feb 24-27), Digital Maryland conference (March 7), Computers in Libraries (April 7-10), Personal Digital Archiving 2014 (April 10-11)
- And other articles about data, preservation of e-serials, and more.
To subscribe to the newsletter, sign up here.
We’ve started planning our annual meeting, Digital Preservation 2014, which will be held July 22-24 in the Washington, DC area, and we want to hear from you! Any organization or individual with an interest in digital stewardship can propose ideas for potential inclusion in the meeting.
The Library of Congress has hosted annual meetings with digital preservation partners, collaborators and others committed to stewardship of digital content for the past ten years. The meetings have served as a forum for sharing achievements in the areas of technical infrastructure, innovation, content collection, standards and best practices and outreach efforts.
This year we’ve expanded participation from NDSA member organizations on the program committee. We’re delighted to have NDIIPP staff and NDSA members working together to contribute to the success of the meeting.
Digital Preservation 2014 Program Committee
- Vickie Allen, PBS Media Library
- Meghan Banach Bergin, University of Massachusetts Amherst
- Erin Engle, NDIIPP
- Abbie Grotke, NDIIPP
- Barrie Howard, NDIIPP
- Butch Lazorchak, NDIIPP
- Vivek Navale, U.S. National Archives and Records Administration
- Michael Nelson, Old Dominion University
- Trevor Owens, NDIIPP
- Abbey Potter, NDIIPP
- Nicole Scalessa, The Library Company of Philadelphia
Call for Proposals
We are looking for your ideas, accomplishments and project updates that highlight, contribute to, and advance the community dialog. Areas of interest include, but are not limited to:
- Scientific data and other content at risk of obsolescence, and what methods, techniques, and tools are being deployed to mitigate risk;
- Innovative methods of digital preservation, especially regarding sustainable practices, community approaches, and software solutions;
- Collaboration successes and lessons learned highlighting a wide-range of digital preservation activities, such as best practices, open source solutions, project management techniques and emerging tools;
- Practical examples of research and scholarly use of stewarded data or content;
- Educational trends for emerging and practicing professionals.
You are invited to express your interest in any of the following ways:
- Panels or presentations
- 5-minute lightning talks
A highlight of this past year was the release of the 2014 National Digital Stewardship Agenda at Digital Preservation 2013. The Agenda integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development. It suggests a number of important sets of issues for the digital stewardship community to consider prioritizing for development. We’d be particularly interested to have you share projects your organization has undertaken in the last year that address any of the issues listed in the Agenda.
To be considered, please send 300 words or less describing what you would like to present to ndiipp [at] loc.gov by March 14. Notifications about accepted proposals will be sent on or around April 3.
The last day of the meeting, July 24, will be a CURATEcamp, which will take place off-site from the main meeting venue. The topic focus of this camp is still under discussion, so stay tuned for more information about the camp in the coming weeks.
Please let us know if you have any questions. Your contributions are important in making this a community program and we’re looking forward to your participation.
The following is a guest post by Julia Blase, National Digital Stewardship Resident at the National Security Archive.
In case you hadn’t heard, the ALA Midwinter Meeting took place in Philadelphia last weekend, attended by around 12,000 librarians and exhibitors. If you didn’t attend, or didn’t have friends there to take notes for you, the Twitter hashtag #alamw14 has it covered – enough content for days of exploration! If you’d like to narrow your gaze, and in the theme of this post, you could refine your search for tweets containing both #alamw14 and #NDSR, because the National Digital Stewardship Residents were there in force, attending and presenting.
Emily Reynolds, the Resident at the World Bank, was so kind as to compile a list of the sessions we aimed to attend before the conference. On Saturday, though none of us made it to every event, at least a few of us were at the Preservation Administrators Interest Group, Scholarly Communications Interest Group, Digital Conversion Interest Group, Digital Special Collections Discussion Group and Challenges of Gender Issues in Technology sessions.
The first session I attended, along with Lauren Work and Jaime McCurry, was the Digital Conversion Interest Group session, where we heard fantastic updates on audiovisual digital conversion practices and projects from the American Folklife Center, the American Philosophical Society library, Columbia University Libraries and George Blood Audio and Video. I particularly enjoyed hearing about the successful APS attempt to digitize audio samples of Native American languages, many of which are endangered, and about the positive reaction from the Native community. For audio, it seemed, sometimes digitization is the best form of preservation!
The second session I attended, with Emily Reynolds and Lauren Work, was the Gender Issues in Technology discussion group (see news for it at #libtechgender). We were surprised, but pleased, at the number of attendees and quality of the discussion around ways to improve diversity in the profession. Among the suggestions we heard were to include diverse staff members on search committees, to monitor the language within your own organization when you review candidates to ensure that code words like “gravitas” (meaning “male,” according to the panelists) aren’t being used to exclude groups of candidates, to put codes of conduct into place to help remind everyone of a policy of inclusiveness, and to encourage employees to respond positively to mentorship requests, especially from members of minority groups (women, non-white, not traditionally gendered). The discussion seemed to us to be a piece of a much larger, evolving, and extended conversation that we were glad to see happening in our professional community!
On Sunday, though a few of us squeezed in a session or two, our primary focus was our individual project update presentations, given at the Digital Preservation Interest Group morning session, and also our extended project or topic presentations at the Library of Congress booth in the early afternoon. The individual presentations, I’m pleased to say, went very well! It would be impossible to recap each presentation here; however, many of us have posted project updates recently, so please be sure to look us up for details. Furthermore, searching Twitter for #alamw14 and #NDSR brings you to this list, in which you can find representative samples of the highlights from our individual presentations.
Presentations – Question and Answer Session
We concluded the session by taking some questions, all of which were excellent – particularly the one from Howard Besser, who wanted to know how we believed our projects (or any resident or fellowship temporary project) could be carried on at the conclusion of our project term. The general response was that we are doing our best to ensure they are continued by integrating the projects, and ourselves, into the general workflows of our organizations – keeping all stakeholders informed from an early stage of our progress, finding support from other divisions, and documenting all of our decisions so that any action may be picked up again as easily as possible.
We also had an excellent question about how important networking had been for the success of our projects, and all agreed that, while networking with the D.C. community has been essential (through our personal efforts and also through groups like the DCHDC meetup), almost more significant has been our ability to network with each other – to share feedback, resources, documents, websites, and connections to other networks, which has helped us accomplish our goals more efficiently and effectively. One of the goals of the NDSR program was, of course, to help institutions get valuable work done in the area of digital stewardship, which we are all doing. However, another goal was for the program to help build a professional community in digital stewardship. What is a community if not a group of diverse professionals who trust and rely on each other, who share successes and setbacks, resources and networks, and who support each other as we learn and grow? Though the language is my own, the sentiment is one I heard shared between us over and over during the ALA weekend.
NDSR Recent Activity
In recent news, Emily Reynolds and Lauren Work both discuss their take on our ALA experience, Emily’s here and Lauren’s here. Molly Swartz published some pictures and thoughts on ALA Midwinter over here. Jaime McCurry recently interviewed Maureen McCormick-Harlow about her work at the National Library of Medicine. And to conclude, I’ve recently posted two updates on my project, one on this page and another courtesy of the Digital Libraries Federation.
Thanks for listening, and be sure to tune in two weeks from now when Maureen McCormick-Harlow will be writing another NDSR guest post. If you, like us, were at ALA Midwinter last weekend, I hope you found it as enjoyable as we did!
Here’s a simple experiment that involves asking an average person two questions. Question one is: “how do you feel about physical books?” Question two is: “how do you feel about digital data?”
The first question almost surely will quickly elicit warm, positive exclamations about a life-long relationship with books, including the joy of using and owning them as objects. You may also hear about the convenience of reading on an electronic device, but I’ll wager that most people will mention that only after expounding on paper books.
The second question shifts to cooler, more uncertain ground. The addressee may well appear baffled and request clarification. You could help the person a bit by specifying digital materials of personal interest to them, such as content that resides on their tablet or laptop. “Oh, that stuff,” they might say with measured relief. “I’m glad it’s there.”
These divergent emotional reactions should be worrying to those of us who are committed to keeping digital cultural heritage materials accessible over time. Trying to make a case for something that lacks emotional resonance is difficult, as marketing people say. Most certainly, the issue of limited resources is a common refrain when it comes to assessing the state of digital preservation in cultural heritage institutions; see the Canadian Heritage Information Network’s Digital Preservation Survey: 2011 Preliminary Results, for example.
The flip side is that traditional analog materials are a formidable competitor for management resources because those materials are seen in a glowing emotional context. I don’t mean to say that analog materials are awash in preservation money; far from it. But physical collections still have to be managed even as the volume of digital holdings rapidly rises, and efforts to move away from reliance on the physical are vulnerable to impassioned attack by people such as Nicholson Baker.
What is curious is that even as we collectively move toward an ever deeper relationship with digital, there remains a strong nostalgic bond with traditional book objects. A perfect example of this is a recent article, Real books should be preserved like papyrus scrolls. The author fully accepts the convenience and the future dominance of ebooks, and is profoundly elegiac in his view of the printed word. But, far from turning away from physical books, he declares that “books have a new place as sacred objects, and libraries as museums.” One might see this idea as one person’s nostalgic fetish, but it’s more than that. We can only wonder how long and to what extent this kind of powerful, emotionally-propelled thinking will drive how cultural heritage institutions operate, and more importantly, how they are funded.
As I’ve written before, we’re at a point where intriguing ideas are emerging about establishing a potentially deeper and more meaningful role for digital collections. This is vitally important, as a fundamental challenge that lies before those who champion digital cultural heritage preservation is how to develop a narrative that can compete in terms of personal meaning and impact.
How do we make digital collections available at scale for today’s scholars and researchers? Lisa Green, director of Common Crawl, tackled this and related questions in her keynote address at Digital Preservation 2013. (You can view her slides and watch a video of her talk online.) As a follow up to ongoing discussions of what users can do with dumps of large sets of data, I’m thrilled to continue exploring the issues she raised in this insights interview.
Trevor: Could you tell us a bit about Common Crawl? What is your mission, what kinds of content do you have and how do you make it available to your users?
Lisa: Common Crawl is a non-profit organization that builds and maintains an open repository of web crawl data that is available for everyone to access and analyze. We believe that the web is an incredibly valuable dataset capable of driving innovation in research, business, and education and that the more people who have access to this dataset, the greater the benefit to society. The data is stored on public cloud platforms so that anyone with access to the internet can access and analyze it.
Trevor: In your talk, you described the importance of machine scale analysis. Could you define that term for us and give some examples of why you think that kind of analysis is important for digital collections?
Lisa: Let me start by describing human scale analysis. Human scale analysis means that a person ingests information with their eyes and then processes and analyzes it with their brain. Even if several people – or even hundreds of people – work on the analysis, it is not as fast as a computer program can ingest, process, and analyze information. Machine scale analysis is when a computer program does the analysis. A computer program can analyze data millions to billions of times faster than a human. It can run 24 hours a day with no need for rest and it can simultaneously run on multiple machines.
Machine scale analysis is important for digital collections because of the massive volume of data in most digital collections. Imagine that a researcher wanted to study the etymology of a word and planned to use a digital collection to answer questions such as:
- What is the first occurrence of this word?
- How did the frequency of occurrence change over time?
- What type of publication did it first appear in?
- When did it first appear in other types of publications and how did the types of publications it appeared in change over time?
- What other words most commonly appear in the same sentence, paragraph or page with the word and how did that change over time?
Answering such questions using human scale analysis would take lifetimes of man hours to search the collection for the given word. Machine scale analysis could retrieve the information in seconds or minutes. And if the researcher wanted to change the questions or criteria, only a small amount of effort would be required to alter the software program; the program could then be rerun and return the new information in seconds or minutes. If we want to optimize the extraction of knowledge from the enormous amounts of data in digital collections, human analysis is simply too slow.
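The kind of machine scale analysis Lisa describes can be illustrated with a toy sketch. The corpus, function name and word choice below are invented for illustration; a real run would iterate over millions of records from a digital collection rather than three hand-written ones, but the logic is the same:

```python
from collections import Counter
import re

def word_frequency_by_year(documents, target):
    """Count occurrences of `target` per publication year.

    `documents` is an iterable of (year, text) pairs -- a stand-in
    for dated records drawn from a digital collection.
    """
    counts = Counter()
    # Word-boundary match so "telephone" doesn't count "telephones" partially twice.
    pattern = re.compile(r"\b%s\b" % re.escape(target), re.IGNORECASE)
    for year, text in documents:
        counts[year] += len(pattern.findall(text))
    return dict(counts)

# Tiny invented corpus standing in for a large dated collection.
corpus = [
    (1901, "The new telephone arrived. A telephone in every office!"),
    (1925, "Radio eclipsed the telephone in the public imagination."),
    (1950, "Television sets outsold radios for the first time."),
]

freq = word_frequency_by_year(corpus, "telephone")
# Earliest year in which the word occurs -- the researcher's first question.
first_year = min(year for year, n in freq.items() if n > 0)
```

Each of the researcher’s questions above is a small variation on this loop (group by publication type instead of year, count co-occurring words in the same sentence, and so on), which is exactly why rerunning with changed criteria costs minutes rather than lifetimes.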
Trevor: What do you think libraries, archives and museums can learn from Common Crawl’s approach?
Lisa: I think it is of crucial importance to preserve data in a format in which it can be analyzed by computers. For instance, if material is stored as a PDF, it is difficult – and sometimes impossible – for software programs to analyze the material, and therefore libraries, archives and museums will be limited in the amount of information that can be extracted from it in a reasonable amount of time.
Trevor: What kind of infrastructure do you think libraries, archives and museums need to have to be able to provide capability for machine scale analysis? Do you think they need to be developing that capacity on their own systems or relying on third party systems and platforms?
Lisa: The two components are storage and compute capacity. When one thinks of digital preservation, storage is always considered but compute capacity is not. Storage is necessary for preservation, and the type of storage system influences access to the collection. Compute capacity is necessary for analysis. Building and maintaining the infrastructure for storage and compute can be expensive, so it doesn’t make much financial sense for each organization to develop its own.
One option would be a collaborative, shared system built and used by many organizations. This would allow the costs to be shared, avoid duplicative work and duplicate storage of material, and – perhaps most importantly – maximize the number of people who have access to the collections.
Personally I believe a better option would be to utilize existing third party systems and platforms. This option avoids the cost of developing custom systems and often makes it easier to maintain or alter the system as there is a greater pool of technologists familiar with the popular third party platforms.
I am a strong believer in public cloud platforms because there is no upfront cost for hardware, no need to maintain or replace hardware, and one only pays for the storage and compute that is used. I think it would be wonderful to see more libraries, museums and archives storing copies of their collections on public cloud platforms in order to increase access. The most interesting use of your data may be thought of by someone outside your organization, and the more people who can access the data, the more minds can work to find valuable insight within it.
Interface, Exhibition & Artwork: Geocities, Deleted City and the Future of Interfaces to Digital Collections
In 2009, a band of rogue digital preservationists called Archive Team did their best to collect and preserve Geocities. The resulting data has become the basis for at least two works of art: Deleted City and One Terabyte of Kilobyte Age. I think the story of this data set and these works offers insights into the future roles of cultural heritage organizations and their collections.
Let Them Build Interfaces
In short, Archive Team collected the data and made the dataset available for bulk download. If you like, you can also access just the 51,000 MIDI music files from the data set at the Internet Archive. Beyond that, because the data was available in bulk, the corpus of personal websites became the basis for other works. Taking the Geocities data as a starting point, Richard Vijgen’s Deleted City interprets and presents an interface to the data, and Olia Lialina & Dragan Espenschied’s One Terabyte of Kilobyte Age is in effect a designed reenactment grounded in an articulated approach to accessibility and authenticity.
An Artwork as the Interface to Your Collection
Some of the most powerful ways to interact with the Geocities collection are through works created by those who have access to the collection as a dataset. Working with digital objects means we don’t need to define in advance the way they will be accessed or made available. By making the raw data available on the web and providing a point of reference for the data set, everyone is enabled to create interfaces to it.
How to make digital collections and objects available?
Access remains the burning question for cultural heritage organizations interested in the acquisition and preservation of digital artifacts and collections. What kinds of interfaces do they need in place to serve what kinds of users? If you don’t know in advance how to make it available, what can you do with it? I’ve been in discussions with staff from a range of cultural heritage organizations who don’t want to wade too deep into acquiring born digital materials without a plan for how to make them available.
The story of Geocities, Archive Team and these artists suggests that if you can make the data available, you can invite others to invent the interfaces. If users can help figure out and develop modes of access, as illustrated in this case, then cultural heritage organizations could potentially invite much larger communities of users to help figure out issues around migration and emulation as modes of access as well. By making the content broadly available, organizations can broaden the network of people who might contribute to efforts to make digital artifacts accessible into the future.
Collections and Interfaces Inside and Outside
An exciting model can emerge here. Through data dumps of full sets of raw data, cultural heritage organizations can embrace the fact that they don’t need to provide the best interface, or for that matter much of any interface at all, for digital content they agree to steward. Instead, an organization can acquire materials or collections it considers interesting and important, but lacks the resources or inclination to build sophisticated interfaces for, so long as it is willing to provide a canonical home for the data, offer information about the data’s provenance, and invest in dedicated ongoing bit-level preservation. This approach would resonate quite strongly with a “More Product, Less Process” approach to born digital archival materials.
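In practice, the "dedicated ongoing bit-level preservation" commitment largely comes down to recording checksums at acquisition and periodically re-verifying them. A minimal sketch, using only the Python standard library (the function names and manifest layout here are illustrative, not any particular repository's scheme):

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large data dumps don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's relative path to its checksum -- the fixity baseline
    recorded when a collection is acquired."""
    root = Path(root)
    return {str(p.relative_to(root)): sha256_of(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify(root, manifest):
    """Return relative paths that are missing, new, or whose bits have changed
    since the manifest was recorded."""
    current = build_manifest(root)
    changed_or_missing = {p for p in manifest if current.get(p) != manifest[p]}
    unexpected = set(current) - set(manifest)
    return sorted(changed_or_missing | unexpected)
```

An audit is then a scheduled call to `verify` against the stored manifest; an empty result means the bits are intact, and anything else is a repair-from-backup event.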
An Example: 4Chan Collection/Dataset @ Stanford
For a sense of what it might look like for a cultural heritage organization to do something like this, we need look no further than a recent Stanford University Library acquisition. The acquisition of a collection of 4chan data into Stanford’s digital repository shows how a research library could go about exactly this sort of activity. The page for the data set/collection briefly describes the structure of the data and offers some information and context about the collector who offered it to Stanford. Stanford acts as the repository and makes the data available for others to explore, manipulate and create a multiplicity of interfaces to. How will others explore or interface with this content? Only time will tell. In any event, it likely did not take many resources to acquire, and it will likely not require much to maintain at a basic level into the future.
How to encourage rather than discourage this?
If we wanted to encourage this kind of behavior, how would we do it? First off, I think we need more data dumps of this kind of data, with the added note that bite-sized downloadable chunks are going to be the easiest thing for any potential user to right-click and save to their desktop. Beyond that, cultural heritage organizations could embrace this example and put up prizes and bounties for artists and designers to develop and create interfaces to different collections.
What I think is particularly exciting here is that by letting go of the requirement to provide the definitive interface, cultural heritage organizations could focus more on selection and on ensuring the long-term preservation and integrity of data. Who knows, some of the interfaces others create might be such great works of art that another cultural heritage organization might feature them in its own database of works.
In western North Carolina, in the foothills of the Great Smoky Mountains, rests a boulder covered in prehistoric petroglyphs attributed to the Native Americans who have resided in the area for thousands of years. Experts debate the specific origin and meaning of the glyphs but the general interpretation describes Judaculla, a human-like giant with supernatural powers, who protects the Cherokee Nation and the land that nourishes and supports them. This cultural record of Cherokee society, called Judaculla Rock, has been accessible for millennia because it is recorded in stone. With protection and preservation, it might continue to be accessible for thousands of years to come.
A few miles away, at Western Carolina University (which was built on the site of a Cherokee village) in the town of Cullowhee, Anna Fariello has helped create digital cultural records of Cherokee society, which she has preserved and made accessible online as the Cherokee Traditions collection. Given the potential longevity of digital collections, Cherokee Traditions — with protection and appropriate preservation — might be accessible for many years to come. Maybe even as long as Judaculla Rock.
Fariello, the head of Digital Initiatives at Western Carolina University’s Hunter Library, does not limit her preservation work to the Cherokee culture. She is trying to digitally preserve as much of the rich cultural heritage of the western North Carolina Smoky Mountain region as she can and make those collections available online.
She spent the early part of her career creating exhibits for museums, which is evident in how she stages each of Hunter Library’s online collections. But the transition from displaying material objects in a museum to displaying digital objects online did not happen quickly for Fariello. Creating an appealing online collection involved more than just displaying photos and text in a browser; it required conceptualizing and planning for the browser medium and the user experience. The process also required some trial and error. For example, she points out the text-heaviness of Hunter Library’s first online collection, Craft Revival, and notes that with each collection they moved further away from dense explanatory text toward showcasing the richness of the cultural artifacts, within the limitations and the possibilities of the medium.
Soon after Fariello started working at Hunter Library in 2005, she began scouting her community for primary source material for possible collections to put online. There was the Craft Revival, of course. Cherokee culture was also an obvious choice. “When I first moved here, I knew of the Cherokee people here but I didn’t realize we were at the seat of the Cherokee homeland,” said Fariello. “That collection developed out of my growing awareness of that and reaching out to our partners, the Museum of the Cherokee Indian and Qualla Arts and Crafts, the Cherokee’s artisan guild. The project won a major recognition last year from the Association of Tribal Archives, Libraries and Museums.”
Each digital collection that she developed presented a new challenge. The Western Carolina University Herbarium seemed promising and uncomplicated because the content — 100,000 plant specimens — is archived at the university. And while the herbarium was historically relevant (among the specimens collected over 150 years, it contains specimens from the decimated American chestnut tree), funding was a challenge because it is a natural history collection and the grants that Fariello was pursuing applied to cultural history collections.
She traveled throughout Appalachia — county to county, museum to museum, library to library — to talk with archivists and librarians and gather material. When Fariello researched content for Hunter Library’s Great Smoky Mountains exhibit, she found very few historic photos and digitized artifacts online relating to the Great Smoky Mountains National Park. When she went to the national park to assess its collections, she discovered that they had many well-preserved photos and artifacts but they had no plans to put them online. “Digitization is outside the scope of what they can do in the current economic climate,” said Fariello. She took that as a confirmation that Hunter Library should digitize the Great Smoky Mountains National Park materials and develop a digital collection.
Some of the collections Fariello digitized were not organized to begin with. “There’s quite a bit of curating that needs to happen with those,” said Fariello. “How to tell a coherent story and find the important aspects of that story. How to figure out what to leave out in order to build a strong collection.”
Fariello gave a presentation last fall at the American Folklife Center’s Cultural Heritage Archives Symposium. During the presentation she spoke about how Hunter Library acquired and archived a unique oral history collection through serendipity and rescued it from possible digital loss. She said she was approached for an interview for Stories of Mountain Folk, a highly polished radio show produced near Cullowhee. She was impressed by the mission of the show, the professionalism of the interviewers and the show’s high production values. When the producers told her they record the show digitally and it had been around for five years, Fariello’s digital-preservation instincts kicked in. She said, “I asked them, ‘You’ve been doing this for five years? Where are all the sound files?’ And the answer was, ‘On GoDaddy.’ I was surprised, to say the least.”
Fariello immediately began to make arrangements to archive the program at the university, which resulted in Hunter Library hosting the Stories of Mountain Folk collection. Hunter Library’s website describes the collection as, “Over 200 half-hour and hour-long recordings capture ‘local memory’ detailing traditions, events, and the life stories of mountain people. A wide range of interviewees include down-home gardeners, herbalists, and farmers, as well as musicians, artists, local writers, and more.”
Except for the digital audio files from Stories of Mountain Folk, most of the digital files in Hunter Library’s digital repository are photographs and documents. The library’s Digital Production Team scans each photograph and document as a 600 dpi TIFF master copy for preservation; these TIFFs reside on servers at Western Carolina University and are also backed up onto gold CDs. The team also creates a 300 dpi JPEG copy of each scan to display online in the collections. They enter the related metadata into a database.
Hunter Library uses a content management system to transfer the JPEGs to a vendor, who displays each digitized item along with its metadata. Fariello likes the convenience and reliability of using a vendor — for which Hunter Library pays an annual fee — but doesn’t like that the URL changes in the browser from WCU’s to the vendor’s when a user is on a Hunter Library collection web page and they click on an item for a closer look. That “closer look” page displays its contents from the vendor’s server.
In other words, the collections’ top-level introductory pages reside at WCU and the individual item-level pages reside on the vendor’s server. Fariello would like to keep the entire online collection on campus, but Hunter Library lacks the financial and technological resources for that right now. The vendor service is an affordable compromise.
Like most libraries and museums in the U.S., Hunter Library’s small staff and tight funds limit the number of online collections it can create. Their vision exceeds their resources. Fariello said, “It seems to me that all over the country, digitization projects – and digital tools for preservation – are not always a funded part of core library services.” So she doggedly pursues grants. In the ten years she has been at Hunter Library, Fariello has raised more than a half million dollars to support their digital projects. She especially appreciates the way the state of North Carolina distributes the Library Service Technology Act funds, by way of IMLS. “In North Carolina those funds are administered by the state library,” said Fariello, “which created a grant program to get the funds out into the community at a local level.”
I asked Fariello if she saw Hunter Library’s online collections as a future direction for all libraries, and her response was both realistic and hopeful. She said that the determining factor is whether a library has archived any collections to begin with. “The next phase for them would be to make the collections accessible through digitization,” said Fariello. “Not all libraries have an archival focus. If they don’t have collections, digitization is not going to be part of their responsibility.”
She said that libraries are changing with the times and librarians, especially young librarians, accept digital services as a natural function of a modern library. “It’s no longer a future function, it’s a present function,” said Fariello. If a library is interested in developing digital collections, the tools are available and standards are in place.
“In 2005 when I started, the standards weren’t clear,” she said. “We wondered, ‘How do you do this?’ Now the standards are standard. Sites like the federal digitization standards site (Federal Agencies Digitization Guidelines Initiative) and the Northeast Document Conservation Center are well established. You don’t have to invent how to do it. If you want to achieve a certain level of professionalism, follow those guidelines. Things have changed. It’s not that hard once somebody figures out how to do it.”
Most researchers begin online and they expect to — or hope to — find what they are looking for or something related to it. Fariello said that, for researchers, online collections are as useful as eJournals and Wikipedia. Online collections do not replace research at a library or a museum, but they do make digital versions readily accessible.
“Access” has always been a guiding principle for Fariello in developing collections. She concentrates on making them useful and friendly for people. “The collections have been successful because I approach their development from the standpoint of someone who would use these collections,” said Fariello.
Librarians, curators, archivists and other information professionals provide a unique service by developing digital collections. And not just by digitizing the collections that reside within their institutions but also by looking outside, into the surrounding community, to rescue collections that are at risk.
“My position has never been to work within an ivory tower institution,” said Fariello. “I try to be aware of what is out in my community. Public institutions need to look to our communities and see where content is being created, especially by non-academic folks who don’t really know what to do with it once they pull it together.”
The following is a guest post from Andrea Goethals, Digital Preservation and Repository Services Manager at the Harvard University Library, with contributions from Stephen Paul Davis, Director of Columbia University Libraries Digital Program Division and Kate Zwaard, Supervisory IT Specialist, Repository Development, Library of Congress. Andrea and Kate co-chair the NDSA Standards and Practices Working Group.
When you hear about something that is new to you – where is the first place you usually go to learn more about it? If you’re like most of us, you usually find yourself reading a Wikipedia article. In fact, Wikipedia is the sixth most popular website. That was the inspiration behind the NDSA Standards and Practices Working Group’s project, started in 2012, to use Wikipedia as a platform to expose information about digital preservation standards and best practices. Since people are already going to Wikipedia for information, why not leverage it to build upon the information that is already there?
A Challenging Undertaking!
This idea proved more challenging than it first appeared. Wikipedia’s main article about digital preservation wasn’t in a state where the group could easily attach related articles on particular standards and best practices. Information about digital preservation in Wikipedia was spread across multiple articles; important areas were left out entirely, while others were fairly detailed but out of date, written from a non-library perspective, or poorly written and biased. In fact, the poor quality of the article hadn’t gone without notice by Wikipedia editors, and there were banners at the top of the page warning readers.
Digital Preservation WikiProject
The group decided that the first step was to improve Wikipedia’s core article about digital preservation to provide a more complete scaffolding from which subsidiary articles on standards and best practices could be hung, and a “WikiProject” was set up to organize the work. A small group took on the task of writing an outline for reorganizing and adding to the existing Digital Preservation article and then started writing new sections, including:
- Definition of digital preservation
- Challenges of digital preservation
- Intellectual foundations of digital preservation in libraries
- Specific tools and methodologies
- CRL certification and assessment of digital repositories
- Digital preservation best practices for audio, moving images and email
This was such an improvement to the quality of the Digital Preservation article that the disclaimers at the top of the article were removed.
This project couldn’t have been done without the dedication of Stephen Paul Davis and Dina Sokolova from Columbia University Libraries who provided the needed editorial oversight and wrote most of the new content. In addition, key contributions were made by Priscilla Caplan, formerly of the FCLA, Linda Tadic of the Audiovisual Archive Network and Chris Dietrich and Jason Lautenbacher, both from the U.S. Park Service.
What’s Next? How You Can Help
Wikipedia’s digital preservation articles need ongoing oversight, but this is a responsibility that should be distributed broadly. Please take a look at the article and outline and consider contributing in your areas of expertise. If you’re looking for a leadership opportunity in digital preservation, the NDSA is looking for someone who can help encourage participation in the WikiProject and act as a liaison to the coordinating committee. If you’re interested, please contact Stephen Paul Davis at email@example.com.
In this interview, FADGI talks with Hannah Frost, Digital Library Services Manager at Stanford Libraries and Manager, Stanford Media Preservation Lab and Jenny Brice, Preservation Coordinator at Bay Area Video Coalition about the AV Artifact Atlas.
One of my favorite aspects of the Federal Agencies Digitization Guidelines Initiative is its community-based ethos. We work collaboratively across federal agencies on shared problems and strive to share our results so that everyone can benefit. We’ve had a number of strong successes including the BWF MetaEdit tool, which has been downloaded from SourceForge over 10,000 times. In FADGI, we’re committed to making our products and processes as open as possible and we’re always pleased to talk with other like-minded folks such as Hannah Frost and Jenny Brice from the AV Artifact Atlas project.
The AV Artifact Atlas is another community-based project that grew out of a shared desire to identify and document the technical issues and anomalies that can afflict audio and video signals. What started out as a casual discussion about quality control over vegetarian po’boy sandwiches at the 2010 Association of Moving Image Archivists annual meeting has evolved into an online knowledge repository of audiovisual artifacts for in-house digitization labs and commercial vendors. It’s helping to define a shared vocabulary and will have a significant impact on codifying quality control efforts.
For an overview of AVAA, check out The AV Artifact Atlas: Two Years In on the Media Preservation blog from the Media Preservation Initiative at Indiana University Bloomington.
Kate: Tell me how the AV Artifact Atlas came about.
Hannah: When we get together, media preservation folks talk about the challenges we face in our work. One of the topics that seems to come up over and over again is quality and the need for better tools and more information to support our efforts to capture and maintain high quality copies of original content as it is migrated forward into new formats.
When creating, copying, or playing back a recording, there are so many chances for error, for things to go sideways, lowering the quality or introducing some imperfection to the signal. These imperfections leave behind audible or visible artifacts (though some are more perceptible than others). If we inspect and pay close attention, it is possible to discover the artifacts and consider what action, if any, can be taken to prevent or correct them.
The problem is that most archivists, curators and conservators involved in media reformatting are ill-equipped to detect artifacts, much less to understand their cause and ensure a high quality job. They typically don’t have deep training or practical experience working with legacy media. After all, why should we? This knowledge is by and large the expertise of video and audio engineers, and it is increasingly rare as the analog generation ages, retires and passes on. Over the years, engineers have sometimes used different words or imprecise language to describe the same thing, making the technical terminology even more intimidating or inaccessible to the uninitiated. We need a way to capture and codify this information into something broadly useful. Preserving archival audiovisual media is a major challenge facing libraries, archives and museums today, and it will challenge us for some time. We need all the legs up we can get.
AV Artifact Atlas is a leg up. We realized that we would benefit from a common place for accumulating and sharing our knowledge and questions about the kinds of issues revealed or introduced in media digitization, technical issues that invariably relate to the quality of the file produced in the workflow. A wiki seemed like a natural fit given the community orientation of the project. I got the term “artifact atlas” from imaging guru Don Williams, an expert adviser for the FADGI Still Image Working Group.
Initially we saw the AV Artifact Atlas as a resource to augment quality control processes and as a way to structure a common vocabulary for technical terms in order to help archivists, vendors and content users to communicate, to discuss, to demystify and to disambiguate. And people are using it this way: I’ve seen it on listservs.
But we have also observed that the Atlas is a useful resource for on-the-job training and archival and conservation education. It’s extremely popular with people new to the field who want to learn more and strengthen their technical knowledge.
Kate: How is the AVAA governed? What’s Stanford Media Preservation Lab’s role and what’s Bay Area Video Coalition’s role?
Hannah: The Stanford Media Preservation Lab team led the initial development of the site, which started in 2012, and we’ve been steadily adding content ever since. We approached BAVC as a partner because of its ongoing commitment to the media community and its genuine interest in furthering progress in the media archiving field.
Jenny: Up until this past year, BAVC’s role has primarily been to host the AVAA. We’ve always wanted to get more involved in adding content, but haven’t had the resources. When we started planning for the QC Tools project, we saw the AVAA as a great platform and dissemination point for the software we were developing. Through funding from the National Endowment for the Humanities, we now have the opportunity to focus on actively developing the analog video content in the AVAA. The team at SMPL have been a huge part of the planning process for this stage of the project, offering invaluable advice, ideas and feedback.
Over the next year, BAVC will be leading a project to solicit knowledge, expertise and examples of artifacts found in digitized analog video from the wider AV preservation community to incorporate into the AVAA. Although BAVC is leading this leg of the project, SMPL will be involved every step of the way.
Kate: You mentioned the Quality Control Tools for Video Preservation or QC Tools project. How does the AVAA fit into that?
Jenny: In 2013, BAVC received funding from the NEH to develop a software tool that analyzes video files to identify and graph errors and artifacts. You can drop a digital video file into the software program and it will produce a set of graphs from which various errors and artifacts can be pinpointed. QC Tools will show where a head clog happens and then connect the user to the AVAA to understand what a head clog is and whether it can be fixed. QC Tools will make it easier for technicians digitizing analog video to do quality control of their work. It will also make it easier for archivists and other people responsible for analog video collections to quality check video files they receive from vendors, as well as accurately document video files for preservation. The AVAA, by providing a common language for artifacts as well as detailed descriptions of their origin and resolution (if any), helps serve these same purposes.
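QC Tools itself does the heavy lifting, but the underlying idea – measure each frame, graph the values, and look for abrupt deviations – can be sketched in a few lines. This is an illustrative toy, not QC Tools’ actual analysis; the function name, threshold and sample values are all invented for the example.

```python
# Toy sketch of artifact flagging via per-frame statistics (not the real
# QC Tools algorithm): flag frames whose mean level deviates sharply from
# the clip-wide average, as a dropout or head clog frame would.
def flag_outlier_frames(frame_means, threshold=2.0):
    """Return indices of frames more than `threshold` standard
    deviations away from the mean of all frames."""
    n = len(frame_means)
    if n < 3:
        return []
    avg = sum(frame_means) / n
    std = (sum((v - avg) ** 2 for v in frame_means) / n) ** 0.5 or 1.0
    return [i for i, v in enumerate(frame_means)
            if abs(v - avg) > threshold * std]

# A sudden near-black frame (value 2) amid normal luma levels (~120)
# stands out as an outlier:
print(flag_outlier_frames([118, 121, 119, 2, 120, 122]))  # → [3]
```

A real tool would compute many such metrics per frame (luma, chroma, saturation and so on), but the pattern of “graph, then investigate the spikes” is the same.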
Kate: My favorite AVAA entry is probably the one for Interstitial Errors because it’s an issue that FADGI is actively working on. (In fact, when I mentioned this project in a previous blog post, you’ll notice a link to the AVAA in the Interstitial Error caption!) What topics stand out for you and why?
Jenny: When I first started interning at BAVC, I was totally new to video digitization. I relied heavily on the AVAA to help me understand what I was seeing on screen, why it was happening and what (if anything) could be done. The entries for Video Head Clog, Tracking Error and Tape Crease hold a special place in my heart because I saw them often when digitizing, and it took many, many repeat views of the examples in the AVAA before I could reliably tell them apart.
Hannah: There are so many to choose from! One highlight is SDI Spike, because it is a great example of a digitization error – and a pretty egregious one at that – and thus demonstrates exactly why careful quality control is critical in preservation workflows. The DV Head Clog entry is noteworthy, as the clip shows how dramatic digital media errors can be, especially when compared to analog ones. Other favorite entries include those that give the reader lots of helpful, practical information about resolving the problem, as seen in Crushed Setup and Head Switching Noise.
Kate: Where do you get your visual examples and data for the Atlas? Are there gaps you’re looking to fill?
Hannah: Many of the entries were created by SMPL staff, drawing on research we’ve done and our on-the-job experience, and most of the media clips and still images derive from issues we encountered in our reformatting projects. A few other generous folks have contributed samples and content, too. We are currently in the process of incorporating content from the Compendium of Image Errors in Analogue Video, a superb book published in 2012 that was motivated by the same need for information to support media art conservation. We are deeply grateful to authors Joanna Phillips and Agathe Jarczyk for working with us on that.
Our biggest content gaps are in the area of audio: we are very eager for more archivists, conservators, engineers and vendors to contribute entries with examples! The digital video area also needs more fleshing out. The analog video section is pretty well developed at this point, but we still need frames or clips demonstrating errors like Loss of Color Lock and Low RF. We keep a running list of existing entries that lack real-life examples on the Contributor’s Guide page.
Kate: I love the recently added audio examples to augment the visual examples. It’s great to not only see the error but also to hear it. How did this come about and what other improvements/next steps are in the works?
Hannah: Emily Perkins, a student at the University of Texas School of Information, approached us about adding the Sound Gallery as part of her final capstone project. Student involvement in Atlas development is clearly a win-win situation, so we encourage more of it! We are also planning to implement a new way to navigate the content by error origin. The new categories – operator error, device error, carrier error, production error – will help Atlas users who want to better understand the nature of these errors and how they come about.
Jenny: As part of the NEH project, we want to look closely at the terms and definitions and correlate them with other resources, such as the Compendium of Image Errors in Analogue Video that Hannah mentioned. We also want to include more examples – both still images and video clips – to help illustrate artifacts. As QC Tools becomes more developed, we want to include some of the graphs of common artifacts produced by the software. The hope is that users of the AVAA or of QC Tools will have more than one way to identify the artifacts they encounter.
Kate: It can be challenging to keep the content and enthusiasm going for community-based efforts. What have you learned since the project launched and how has it influenced your current approach?
Hannah: So true: keeping the momentum going is a real challenge. Most of the contributions made to date have been entirely voluntary, and while the NEH funding is a welcome and wonderful development – not to mention a vote of confidence that the Atlas is a valuable resource – we understand full well that generous donations of time and knowledge on the part of novice and expert practitioners will always be fundamental to the continued growth and success of the Atlas.
It definitely takes a core group of committed people to keep the momentum going, and you always need to beat the bushes for contributions. In our day-to-day work at SMPL, it has come to the point where I routinely ask myself about every problem we encounter: “Is this something we can add to the Atlas? Have we just learned something that we can share with others?” If more practitioners adopted this frame of mind, the wiki would certainly develop more rapidly! I also try to remind folks that you don’t have to be an expert engineer to contribute. Practical information from and for all levels of expertise is our primary goal.
Kate: Is there anything else you’d like to mention about the AVAA?
Jenny: We’re hiring! Thanks to funding from the NEH, we are able to hire someone part-time to work exclusively on building out content and community for the AV Artifact Atlas. If you are passionate and knowledgeable about video preservation, consider applying. We’re really excited to hire a dedicated AVAA Coordinator and to see how this position will help the Atlas grow!
The following is a guest post by Heidi Dowding, Resident at the Dumbarton Oaks Research Library in Washington, DC
As part of the National Digital Stewardship Residency program’s biweekly takeover of The Signal, I’m here to talk about my project at Dumbarton Oaks Research Library and Collection. And by the way, if you haven’t already checked out Emily Reynolds’ post on the residency four months in, go back and read that first as a primer. I’ll wait.
OK then, on we go.
My brief history in residence at this unique institution technically started in September, but really the project dates back a little over a year to a digital asset management information gathering survey that was undertaken by staff at Dumbarton Oaks. Concerned with DO’s shrinking digital storage capacity, they were hoping to find out how various departments were handling their digital assets. What they discovered was that, with no central policy guiding digital asset management within the institution, ad hoc practices were overlapping and causing manifold problems.
This is about where my project entered the scene. As part of the first cohort of NDSR residents, I’ve been tasked with identifying an institution-wide solution to digital asset management. This has first involved developing a deep (at times, file-level) understanding of Dumbarton Oaks’ digital holdings. These include the standard fare – image collections, digital books, etc. – but also more specialized content like the multimedia Oral History Project and the GIS Tree Care Inventory. I started my research with an initial survey sent to everyone around the institution, and then undertook interviews and focus groups with key staff in every department.
While I uncovered a lot of nuanced information about user behaviors, institutional needs, and the challenges we currently face, the top-level findings are threefold.
First, relationships within an institution make or break its digital asset management.
This is largely because each department has a different workflow for managing assets, but no department is an island. In interdepartmental collaborations, digital assets are being duplicated and inconsistently named. This is especially apparent in the editorial process at DO, wherein an Area of Study department acts as intermediary between the Publications department and various original authors. Duplicate copies are saved on various drives around the institution, with very little incentive to clean and organize files once a project has been completed.
In this case, defined policies will aid in the development of interdepartmental collaborations in digital projects. My recommendation of a Digital Asset Management System (DAMS) will also hopefully aid in the deduplication of DO’s digital holdings.
Second, file formats are causing big challenges. Sometimes I even ran into them in my own research.
Other times, format challenges proved more treacherous around the institution, caused by a lack of timely software updates for some of our more specialized systems or by a general proliferation of file formats. A lot of these issues could be addressed by a central policy based on the file format action plans discussed by NDSR resident Lee Nilsson. Effective plans should address migration schedules and file format best practices.
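A format action plan has to start with knowing what formats an institution actually holds. As a minimal sketch of that first step, the snippet below tallies file extensions under a directory tree; the path in the usage comment is hypothetical.

```python
import os
from collections import Counter

def format_inventory(root):
    """Walk `root` and tally file extensions -- a rough first census
    for a file format action plan."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Normalize case so .TIF and .tif count as one format
            ext = os.path.splitext(name)[1].lower() or "(no extension)"
            counts[ext] += 1
    return counts

# Usage (path is hypothetical): list the ten most common formats,
# a starting point for deciding what needs a migration schedule.
# for ext, n in format_inventory("/shared/drive").most_common(10):
#     print(f"{n:6d}  {ext}")
```

Extensions are only a proxy for true formats, of course; a production inventory would use a characterization tool, but even this rough census reveals where a proliferation problem lives.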
Finally, staff need to be more proactive in differentiating between archival digital assets and everyday files.
By archival digital assets, I mean materials such as images from the ICFA or photographs of the gardens, as opposed to everyday working files such as word processing documents. The failure to differentiate becomes particularly problematic depending on where items are saved: many of the departmental drives are only backed up monthly, while a bigger institutional drive collectively referred to as ‘the Shared Drive’ is backed up daily. So if everyday items are stored only on a departmental drive, there is a much higher likelihood of data loss because backups are so infrequent. Likewise, if archival assets are put on a departmental drive with no other copy kept until the scheduled backup runs, really important digital assets could be lost. Finally, this also becomes problematic when digital assets are stored long-term on the Shared Drive – they take up precious space and are not being properly organized and cared for.
My job over the next few months will be to look at potential Digital Asset Management Systems to determine whether a specific tool would assist Dumbarton Oaks’ staff in better managing digital files. I will also be convening a Digital Preservation Working Group to carry on my work after my residency ends in May.
Please check out NDSR at the upcoming ALA Midwinter Digital Preservation Interest Group meeting at 8:30am on Sunday, January 24 in the Pennsylvania Room.
In my work at the Library, one of my larger projects has to do with the acquisition and preservation of eserials. By this I don’t mean access to licensed and hosted eserials, but rather the acquisition and preservation of eserial article files that come to the Library.
In many ways, this is just like other acquisition streams and workflows: some specifications for the content are identified; electronic transfer mechanisms are put in place; processing includes automated and human actions including inspection, metadata extraction and enrichment, and organization; and files are moved to the appropriate storage locations.
But eserials have their own complications. They are serials, with a complex organization of files, articles, issues, volumes and titles. There are multiple formats, content standards and metadata standards in play. Publishers now often follow a frequent, article-based publishing model that includes versions and updates. And the packages of files to be transferred between and within organizations can take many forms.
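That nesting – files within articles, articles within issues, and so on, with article-level versioning – can be pictured as a simple data model. This is only an illustration of the structure an exchange protocol has to accommodate; the class and field names are invented for the sketch, not drawn from any standard.

```python
from dataclasses import dataclass, field

# Illustrative model of the serial hierarchy described above;
# names are invented, not taken from any exchange specification.
@dataclass
class Article:
    title: str
    version: int = 1                            # articles carry versions/updates
    files: list = field(default_factory=list)   # e.g. PDF, XML, supplements

@dataclass
class Issue:
    number: str
    articles: list = field(default_factory=list)

@dataclass
class Volume:
    number: str
    issues: list = field(default_factory=list)

@dataclass
class Title:
    name: str
    volumes: list = field(default_factory=list)

# One title holding one volume/issue/article with two component files:
t = Title("Journal of Examples", [Volume("12", [Issue("3", [
        Article("On Examples", 2, ["article.pdf", "article.xml"])])])])
print(len(t.volumes[0].issues[0].articles[0].files))  # → 2
```

Even this toy makes the exchange problem visible: a transfer package must keep every level of the hierarchy, plus versions, intact between sender and receiver.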
My Library of Congress colleague Erik Delfino reached out to our colleagues at the National Institutes of Health/National Library of Medicine, who operate PubMed Central and deal with similar issues. Out of our shared interest has come a NISO working group to develop a protocol for the transfer and exchange of files: PESC, the Protocol for Exchanging Serial Content. The group is co-chaired by the Library of Congress and NIH, and has representatives from publishers small and large, data normalizers and aggregators, preservation organizations, and organizations with an interest in copyright issues.
This group is making great progress identifying the scope of the problem, looking at how a variety of organizations solve the problem for their own operations, and drafting its ideas for solutions for exchange that support the effective management and preservation of serials.
If you are interested in the work, please read the Work Item description at the PESC web site, and check out who’s involved. There will also be a brief update presented as part of the NISO standards session at ALA Midwinter on Sunday, January 26 from 1-2:30 PM in Pennsylvania Convention Center room 118 C.
We hear a constant stream of news about how crunching massive data collections will change everything from soup to nuts. Here on The Signal, it’s fair to say that scientific research data is close to the heart of our hopes, dreams and fears when it comes to big data: we’ve written over two-dozen posts touching on the subject.
In the context of all this, it’s exciting to see some major projects getting underway that have big data stewardship closely entwined with their efforts. Let me provide two examples.
The Registry of Data Repositories seeks to become a global registry of “repositories for the permanent storage and access of data sets” for use by “researchers, funding bodies, publishers and scholarly institutions.” The activity is funded by the German Research Foundation through 2014 and currently has 400 repositories listed. With the express goal of covering the complete data repository landscape, re3data.org has developed a typology of repositories that complements existing information offered by individual institutions. The aim is to offer a “systematic and easy to use” service that will strongly enhance data sharing. Key to this intent is a controlled vocabulary that describes repository characteristics, including policies, legal aspects and technical standards.
In a bow to the current trend for visual informatics, the site also offers a set of icons with variable values that represent repository characteristics. The project sees the icons as helpful to users, and as a way to help repositories “identify strengths and weaknesses of their own infrastructures” and keep the information up to date.
I really like this model. It hits the trifecta in appealing to creators who seek to deposit data, to users who seek to find data and to individual repositories who seek to evaluate their characteristics against their peers. It remains to be seen if it will scale and if it can attract ongoing funding, but the approach is elegant and attractive.
The second example is ELIXIR, an initiative of the EMBL European Bioinformatics Institute. ELIXIR aims to “orchestrate the collection, quality control and archiving of large amounts of biological data produced by life science experiments,” and “is creating an infrastructure – a kind of highway system – that integrates research data from all corners of Europe and ensures a seamless service provision that is easily accessible to all.”
This is a huge undertaking, and it has the support of many nations that are contributing millions of dollars to build a “hub and nodes” network. It will connect public and private bioscience facilities throughout Europe and promote shared responsibility for biological data delivery and management. The intention is to provide a single interface to hundreds of distributed databases and a rich array of bioinformatics analysis tools.
ELIXIR is a clear demonstration of how a well-articulated need can drive massive investment in data management. The project has a well-honed business case that presents an irresistible message. “Biological information is of vital significance to life sciences and biomedical research, which in turn are critical for tackling the Grand Challenges of healthcare for an ageing population, food security, energy diversification and environmental protection,” reads the executive summary. “The collection, curation, storage, archiving, integration and deployment of biomolecular data is an immense challenge that cannot be handled by a single organisation.” This is what the Blue Ribbon Task Force on Sustainable Digital Preservation and Access termed “the compelling value proposition” needed to drive the enduring availability of digital information.
As a curious aside, it’s worth noting that projects such as ELIXIR may have an unexpected collateral impact on data preservation. Ewan Birney, a scientist and administrator working on ELIXIR, was so taken with the challenge of what he termed “a 10,000 year archive” holding a massive data store that he and some colleagues (over a couple of beers, no less) came up with a conjecture for how to store digital data using DNA. The idea was sound enough to merit a letter in Nature, Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. So, drawing the attention of bioinformaticians and other scientists to the digital preservation challenge may well lead to stunning leaps in practices and methods.
Perhaps one day the biggest of big data can even be reduced to the size of a bowl of alphabet soup or a bowl of mixed nuts!
The 2014 National Digital Stewardship Agenda, released in July 2013, is still a must-read (have you read it yet?). It integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development.
The Agenda suggests a number of important research areas for the digital stewardship community to consider, but the need for more coordinated applied research in cost modeling and sustainability is high on the list of areas prime for research and scholarship.
The section in the Agenda on “Applied Research for Cost Modeling and Audit Modeling” suggests some areas for exploration:
“Currently there are limited models for cost estimation for ongoing storage of digital content; cost estimation models need to be robust and flexible. Furthermore, as discussed below…there are virtually no models available to systematically and reliably predict the future value of preserved content. Different approaches to cost estimation should be explored and compared to existing models with emphasis on reproducibility of results. The development of a cost calculator would benefit organizations in making estimates of the long‐term storage costs for their digital content.”
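The arithmetic behind such a cost calculator can be sketched simply: cumulative cost depends on volume, replication and an assumed annual decline in storage prices. The default figures below are made-up assumptions for illustration, not numbers from the Agenda or from any published cost model.

```python
def storage_cost(tb, years, cost_per_tb_year=100.0,
                 annual_decline=0.15, copies=2):
    """Toy long-term storage cost estimate: `tb` terabytes kept for
    `years`, with `copies` replicas maintained and the per-TB cost
    falling by `annual_decline` each year. Defaults are illustrative."""
    total = 0.0
    yearly = cost_per_tb_year
    for _ in range(years):
        total += tb * copies * yearly      # all replicas cost money each year
        yearly *= 1.0 - annual_decline     # media gets cheaper over time
    return total

# Two replicated copies of 10 TB over two years at these assumed rates:
print(round(storage_cost(10, 2), 2))  # → 3700.0
```

Real models (LIFE, KRDS and others cited below) also account for staff time, ingest, migration and refreshment, which is exactly why the Agenda calls for something more robust than this back-of-the-envelope calculation.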
In June of 2012 I put together a bibliography of resources touching on the economic sustainability of digital resources. I’m pleasantly surprised at all the new work that’s been done in the meantime, but as the Agenda suggests, there’s more room for directed research in this area. Or perhaps, as Paul Wheatley suggests in this blog post, what’s really needed are coordinated responses to sustainability challenges that build directly on this rich body of work, and that effectively communicate the results out to a wide audience.
I’ve updated the bibliography, hoping that researchers and funders will explore the existing body of projects, approaches and research, note the gaps in coverage suggested by the Agenda and make efforts to address the gaps in the near future through new research or funding.
As always, we welcome any additions you might have to this list. Feel free to leave suggestions in the comments.
The Web site addresses listed here were all valid as of January 14, 2014.
Allen, Alexandra. “General Study 16 – Cost Benefit Models: Final Report.” InterPARES3 Project; 2013. Available at http://www.interpares.org/ip3/display_file.cfm?doc=ip3_canada_gs16_final_report.pdf
Arrow, Kenneth, Robert Solow, Paul R. Portney, Edward E. Leamer, Roy Radner, and Howard Schuman. “Report of the NOAA Panel on Contingent Valuation.” National Oceanic and Atmospheric Administration. 1993. Available at http://www.darrp.noaa.gov/library/pdf/cvblue.pdf
Ayris, P.; Davies, R.; McLeod, R.; Miao, R.; Shenton, H.; Wheatley, P. The LIFE2 final project report. LIFE Project: London, UK. 2008. Available at http://discovery.ucl.ac.uk/11758/
Barlow, John Perry. “The Economy of Ideas: Selling Wine Without Bottles on the Global Net.” See especially the section entitled Relationship and Its Tools. Available at http://homes.eff.org/~barlow/EconomyOfIdeas.html
Beagrie, N., Chruszcz, J., and Lavoie, B. Keeping Research Data Safe: A Cost Model and Guidance for UK Universities. Final Report. April 2008. Available at http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf
Beagrie, N., Lavoie, B., and Woollard, M. Keeping Research Data Safe 2. Final Report. April 2010. Available at http://www.jisc.ac.uk/media/documents/publications/reports/2010/keepingresearchdatasafe2.pdf
Blue Ribbon Task Force on Sustainable Digital Preservation and Access. Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digital Information. February 2010. Available at http://brtf.sdsc.edu/biblio/BRTF_Final_Report.pdf
Blue Ribbon Task Force on Sustainable Digital Preservation and Access. Sustaining the Digital Investment: Issues and Challenges of Economically Sustainable Digital Preservation. December 2008. Available at http://brtf.sdsc.edu/biblio/BRTF_Interim_Report.pdf
Botea, Juanjo, Belen Fernandez-Feijoo and Silvia Ruiz. “The Cost of Digital Preservation: A Methodological Analysis.” Procedia Technology, Vol. 5; 2012. Available at http://www.sciencedirect.com/science/article/pii/S2212017312004434
Brown, Adrian. “Cost Modeling: The TNA Experience.” The National Archives (UK). Powerpoint slides presented at the DCC/DPC joint Workshop on Cost Models, held July 26, 2005. Available at http://www.dpconline.org/docs/events/050726brown.pdf
Buckland, Michael K. “Information as Thing.” Journal of the American Society for Information Science; Jun 1991; 42, 5; pg. 351-360. Available at http://people.ischool.berkeley.edu/~buckland/thing.html
Cantor, Nancy, and Paul N. Courant. “Scrounging for Resources: Reflections on the Whys and Wherefores of Higher Education Finance.” New Directions for Institutional Research, Volume 2003, Issue 119 , Pages 3 – 12. Also available as “Scrounge We Must–Reflections on the Whys and Wherefores of Higher Education Finance” at http://www.provost.umich.edu/speeches/higher_education_finance.html
Chambers, Catherine M., Paul E. Chambers and John C. Whitehead. “Contingent Valuation of Quasi-Public Goods: Validity, Reliability, and Application to Valuing a Historic Site.” Available at http://faculty.ucmo.edu/pchambers/adobe/historical.pdf
Chapman, Stephen. “Counting the Costs of Digital Preservation: Is Repository Storage Affordable?” Journal of Digital Information, Volume 4 Issue 2. 2003. Available at http://journals.tdl.org/jodi/article/view/100
Charles Beagrie Ltd. and JISC. Keeping Research Data Safe Factsheet. 2011. Available at http://beagrie.com/KRDS_Factsheet_0711.pdf
Charles Beagrie Ltd and the Centre for Strategic Economic Studies (CSES), University of Victoria. “Economic Impact Evaluation of the Economic and Social Data Service.” 2012. Available at http://www.esrc.ac.uk/_images/ESDS_Economic_Impact_Evaluation_tcm8-22229.pdf
Crespo, Arturo, Hector Garcia-Molina. “Cost-Driven Design for Archival Repositories.” Joint Conference on Digital Libraries 2001 (JCDL’01); June 24-28, 2001; Roanoke, Virginia, USA. Available at http://www-db.stanford.edu/~crespo/publications/cost.pdf
Currall, James, Claire Johnson, and Peter McKinney. “The Organ Grinder and the Monkey. Making a business case for sustainable digital preservation.” Presentation given at EU DLM Forum Conference 5-7 October 2005 Budapest, Hungary. Available at http://hdl.handle.net/1905/455
Currall, James, Claire Johnson, and Peter McKinney. “The world is all grown digital…. How shall a man persuade management what to do in such times?” 2nd International Digital Curation Conference, Digital Data Curation in Practice, 21-22 November 2006, Hilton Glasgow Hotel, Glasgow. Available at http://hdl.handle.net/1905/690
Currall, James, and Peter McKinney. “Investing in Value: A Perspective on Digital Preservation.” D-Lib Magazine, Volume 12, Number 4; April 2006. Available at http://www.dlib.org/dlib/april06/mckinney/04mckinney.html
Davies, Richard, Paul Ayris, Rory McLeod, Helen Shenton and Paul Wheatley. “How much does it cost? The LIFE Project – Costing Models for Digital Curation and Preservation.” LIBER Quarterly, Vol. 17, no. 3/4, 2007. Available at http://liber.library.uu.nl/index.php/lq/article/view/7895
Digital Preservation Coalition. “Report for the DCC/DPC Workshop on Cost Models for Preserving Digital Assets.” Available at http://www.dpconline.org/events/previous-events/137-cost-models. A series of powerpoint presentations from a day-long workshop held on July 26, 2005.
“Erpa Guidance: Cost Orientation Tool.” 2003. Available at http://www.erpanet.org/guidance/docs/ERPANETCostingTool.pdf
“espida Handbook: Expressing project costs and benefits in a systematic way for investment in information and IT.” University of Glasgow/JISC. 2007. Available at https://dspace.gla.ac.uk/bitstream/1905/691/1/espida_handbook_web.pdf
Fontaine, Kathy, Greg Hunolt, Arthur Booth and Mel Banks. “Observations on Cost Modeling and Performance Measurement of Long-Term Archives.” NASA Goddard Space Flight Center, Greenbelt, MD. 2007. Available at http://www.pv2007.dlr.de/Papers/Fontaine_CostModelObservations.pdf
Ghosh, Rishab Aiyer. “Cooking Pot Markets: an Economic Model for the Trade in Free Goods and Services on the Internet.” First Monday, Vol. 3, No. 3; 1998. Available at http://www.firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1516/1431
Granger, Stewart, Kelly Russell, and Ellis Weinberger: “Cost elements of Digital Preservation (version 4).” October 2000. Available at http://www.webarchive.org.uk/wayback/archive/20050409230000/http://www.leeds.ac.uk/cedars/colman/costElementsOfDP.doc
Griffin, Vanessa, Kathleen Fontaine, Gregory Hunolt, Arthur Booth, and David Torrealba. “Cost Estimation Tool Set for NASA’s Strategic Evolution of ESE Data Systems.” NASA. Unknown date. Available at http://vds.cnes.fr/manifestations/PV2002/DATA/5-8_griffin.pdf
Guthrie, Kevin, Rebecca J. Griffiths, Nancy L. Maron. Sustainability and Revenue Models for Online Academic Resources. Ithaka; 2008. Available at http://www.sr.ithaka.org/research-publications/sustainability-and-revenue-models-online-academic-resources
Hahn, Robert W. and Paul C. Tetlock. “Using Information Markets to Improve Public Decision Making.” AEI-Brookings Joint Center for Regulatory Studies; 2005. Available at http://www.law.harvard.edu/students/orgs/jlpp/Vol29_No1_Hahn_Tetlock.pdf
Hendley, Tony. “Comparison of Methods & Costs of Digital Preservation.” British Library Research and Innovation Report 106; 1998. Available at http://www.ukoln.ac.uk/services/elib/papers/tavistock/hendley/hendley.html
Hunter, Laurie, Elizabeth Webster and Anne Wyatt. “Measuring Intangible Capital: A Review of Current Practice.” Intellectual Property Research Institute of Australia Working Paper No. 16/04; 2005. Available at http://www.ipria.net/publications/wp/2004/IPRIAWP16.2004.pdf
Hunter, Laurie. “DCC Digital Curation Manual: Investment in an Intangible Asset.” University of Glasgow. 2006. Available at http://www.era.lib.ed.ac.uk/bitstream/1842/3340/1/Hunter%20intangible-asset.pdf
Iansiti, Marco, and Gregory L. Richards. “The Business of Free Software: Enterprise Incentives, Investment, and Motivation in the Open Source Community.” Harvard Business School. 2006. Preliminary draft available at http://www.hbs.edu/research/pdf/07-028.pdf
Kaufman, Peter B. “Assessing the Audiovisual Archive Market: Models and Approaches for Audiovisual Content Exploitation.” Presto Centre. 2013. Available at https://www.prestocentre.org/library/resources/assessing-audiovisual-archive-market
Kaur, Kirnn, Patricia Herterich, Suenje Dallmeier-Tiessen, Karlheinz Schmitt, Sabine Schrimpf, Heiko Tjalsma, Simon Lambert and Sharon McMeekin. D32.1 Report on Cost Parameters for Digital Repositories. Alliance for Permanent Access to the Records of Science Network. 2013. Available at http://www.alliancepermanentaccess.org/wp-content/uploads/downloads/2013/03/APARSEN-REP-D32_1-01-1_0.pdf
Kejser, Ulla Bøgvad, Anders Bo Nielsen and Alex Thirifays. “Cost Model for Digital Preservation: Cost of Digital Migration.” International Journal of Digital Curation, Issue 1, Vol. 6; 2011. Available at http://www.ijdc.net/index.php/ijdc/article/viewFile/177/246
King, Dennis M., and Marisa Mazzotta. “Ecosystem Valuation.” Available at http://www.ecosystemvaluation.org/index.html. While this website pertains to considerations of natural environment valuation, its findings are applicable to the consideration of other intangible asset economies, such as the economic system surrounding digital preservation.
James, Hamish, Raivo Ruusalepp, Sheila Anderson, and Stephen Pinfield. “Feasibility and Requirements Study on Preservation of E-Prints.” JISC; 2003. Pg. 41-55. Available at http://www.sherpa.ac.uk/documents/feasibility_eprint_preservation.pdf
Lavoie, Brian. “Of Mice and Memory: Economically Sustainable Preservation for the Twenty-first Century.” Found in Access in the Future Tense. CLIR; 2004. Pg. 45-54. Available at http://www.clir.org/pubs/reports/pub126/pub126.pdf
Lavoie, Brian. “The Fifth Blackbird: Some Thoughts on Economically Sustainable Digital Preservation.” D‐Lib Magazine, Vol. 14, no. 3/4. March/April 2008. Available at http://www.dlib.org/dlib/march08/lavoie/03lavoie.html
Lavoie, Brian. “The Incentives to Preserve Digital Materials: Roles, Scenarios, and Economic Decision-Making.” OCLC Office of Research; 2003. Available at http://www.oclc.org/research/projects/digipres/incentives-dp.pdf
Lifecycle Information for E-literature: An Introduction to the third phase of the LIFE project. JISC/RIN. 2010. Available at http://www.life.ac.uk/3/docs/life3_report.pdf
Longhorn, Roger, and Michael Blakemore. “Re-visiting the Valuing and Pricing of Digital Geographic Information.” Journal of Digital Information 4, (2). 2003. Available at http://journals.tdl.org/jodi/article/viewFile/103/102
Machlup, Fritz. Knowledge: Its Creation, Distribution, and Economic Significance. Volume I: Knowledge and Knowledge Production. Princeton University Press; 1980.
Machlup, Fritz. Knowledge: Its Creation, Distribution, and Economic Significance. Volume III: The Economics of Information and Human Capital. Princeton University Press; 1984.
Maron, Nancy L., K. Kirby Smith, Matthew Loy. Sustaining Digital Resources: An On-the-Ground View of Projects Today. Ithaka; 2009. Available at http://www.sr.ithaka.org/research-publications/sustaining-digital-resources-ground-view-projects-today
Maron, Nancy L., Matthew Loy. Revenue, Recession, Reliance: Revisiting the SCA/Ithaka S+R Case Studies in Sustainability. Ithaka; 2011. Available at http://www.sr.ithaka.org/research-publications/revenue-recession-reliance-revisiting-scaithaka-sr-case-studies-sustainability
McLeod, Rory, Paul Wheatley, and Paul Ayris. “Lifecycle information for E-literature: Full Report from the LIFE Project.” LIFE Project, London, UK. 2006. Available at http://eprints.ucl.ac.uk/archive/00001854/01/LifeProjMaster.pdf
Moore, Richard L., Jim D’Aoust, Robert H. McDonald, and David Minor. Disk and Tape Storage Cost Models. San Diego Supercomputer Center, University of California San Diego; La Jolla, CA, USA. 2007. Available at http://users.sdsc.edu/~mcdonald/content/papers/dt_cost.pdf
Morrissey, Sheila. “The Economy of Free and Open Source Software in the Preservation of Digital Artifacts.” Library Hi Tech, Vol. 28 Iss: 2; 2010. Available at http://www.portico.org/digital-preservation/wp-content/uploads/2010/11/The-Economy-of-Free-and-Open-Source-Software-in-the-Preservation-of-Digital-Artifacts.pdf
Oltmans, Erik. “Cost Models in Digital Archiving.” Presentation at LIBER 2004 , Life Cycle Collection Management, St. Petersburg, July 1, 2004. Available at http://liber.library.uu.nl/index.php/lq/article/view/7789/7908
Oltmans, Erik, and Nanda Kol. “A Comparison Between Migration and Emulation in Terms of Costs.” RLG Diginews Volume 9, Number 2; 2005. Available at http://worldcat.org/arcviewer/2/OCC/2009/08/11/H1250012115408/viewer/file2.html
Palaiologk, Anna S., Anastasios A. Economides, Heiko D. Tjalsma and Laurents B. Sesink. “An Activity-based Costing Model for Long-term Preservation and Dissemination of Digital Research Data: the Case of DANS.” International Journal on Digital Libraries, Volume 12, Issue 4, 2012. Available at http://link.springer.com/article/10.1007%2Fs00799-012-0092-1
Palm, Jonas. “The Digital Black Hole.” Riksarkivet/National Archives Sweden. Available at http://www.tape-online.net/docs/Palm_Black_Hole.pdf
Perens, Bruce. “The Emerging Economic Paradigm of Open Source.” First Monday Special Issue #2: Open Source. October 3, 2005. Available at http://www.firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/1470/1385
Phillips, Margaret E. “Selective Archiving of Web Resources: A Study of Acquisition Costs at the National Library of Australia.” RLG DigiNews, Volume 9, Number 3. Available at http://www.nla.gov.au/openpublish/index.php/nlasp/article/view/1229
Rosenthal, David. “Modeling the Economics of Long-Term Storage.” DSHR’s Blog. 2011. Available at http://blog.dshr.org/2011/09/modeling-economics-of-long-term-storage.html
Sanett, Shelby. “The Cost to Preserve Authentic Electronic Records in Perpetuity: Comparing Costs across Cost Models and Cost Frameworks.” RLG Diginews, August 15, 2003, Volume 7, Number 4. Available at http://library.oclc.org/cdm/singleitem/collection/p267701coll33/id/366
Sanett, Shelby. “Toward Developing a Framework of Cost Elements for Preserving Authentic Electronic Records into Perpetuity.” College and Research Libraries 63 (5):388-404. 2002. Available at http://crl.acrl.org/content/63/5/388.full.pdf
Slats, Jacqueline and Remco Verdegem. “Cost Model for Digital Preservation.” Nationaal Archief of the Netherlands. 2005. Available at http://dlmforum.typepad.com/Paper_RemcoVerdegem_and_JS_CostModelfordigitalpreservation.pdf
Smith, David M. “The Cost of Lost Data.” Graziadio Business Report, Volume 6, Issue 3: 2003. Available at http://gbr.pepperdine.edu/033/dataloss.html
Strodl, Stephan, and Andreas Rauber. “A Cost Model for Small Scale Automated Digital Preservation Archives.” International Conference on Preservation of Digital Objects 2011. Available at http://www.ifs.tuwien.ac.at/~strodl/paper/strodl_ipres2011_costmodel.pdf
Throsby, David. “Determining the Value of Cultural Goods: How Much (or How Little) Does Contingent Valuation Tell Us?” Journal of Cultural Economics 27: 275–285, 2003. Available at http://culturalheritage.ceistorvergata.it/virtual_library/Art_THROSBY_D-Determining_the_Value_of_Cultural_Goods_-.pdf
Torre, Marta de la, editor. “Assessing the Values of Cultural Heritage: Research Report.” The Getty Conservation Institute; 2002. Available at http://www.getty.edu/conservation/publications_resources/pdf_publications/pdf/assessing.pdf
UC3 Curation Center. “Total Cost of Preservation (TCP): Cost and Price Modeling for Sustainable Services.” 2013. Available at https://wiki.ucop.edu/download/attachments/163610649/TCP-cost-price-modeling-for-sustainable-services-v2_1.pdf?version=4&modificationDate=1375721821000
Walters, Tyler and Katherine Skinner. “Economics, Sustainability, and the Cooperative Model in Digital Preservation.” Library High Tech, Vol. 28, no. 2, 2010. Available at http://www.emeraldinsight.com/journals.htm?articleid=1864753
Wellcome Trust. “Costs and business models in scientific research publishing.” SQW; 2004. Available at http://www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtd003184.pdf
Wellcome Trust. “Economic analysis of scientific research publishing: A report commissioned by the Wellcome Trust.” SQW; 2003. Available at http://www.wellcome.ac.uk/stellent/groups/corporatesite/@policy_communications/documents/web_document/wtd003182.pdf
Wheatley, Paul, P. Ayris, R. Davies, R. Mcleod and H. Shenton. “The LIFE Model v1.1. Discussion paper.” LIFE Project, London, UK. 2007. Available at http://eprints.ucl.ac.uk/4831/1/4831.pdf
Wheatley, Paul and Brian Hole. LIFE3: Predicting Long Term Digital Preservation Costs. LIFE3 Project, London, UK. 2009. Available at http://www.life.ac.uk/3/docs/ipres2009v24.pdf
Wright, Richard, Ant Miller and Matthew Addis. “The Significance of Storage in the “Cost of Risk” of Digital Preservation.” International Journal of Digital Curation, Vol. 4, No. 3; 2009. Available at http://www.ijdc.net/index.php/ijdc/article/view/138
This is a guest post by Abbie Grotke, Library of Congress Web Archiving Team Lead and Co-Chair of the National Digital Stewardship Alliance Content Working Group.
You may have seen the news on this blog and elsewhere that the National Digital Stewardship Alliance launched the first ever National Agenda for Digital Stewardship last July. One major section of that document addresses digital content areas. Here’s an excerpt:
Both born‐digital and digitized content present a multitude of challenges to stewards tasked with preservation: the size of data requiring preservation, the selection of content when the totality cannot be preserved, and the selection of modes of both content storage and format migration to ensure long‐term preservation.
Digital stewardship planning must go beyond a focus on content we already have and technology already in use. Even in the near term, a number of trends are evident. Given the ever growing quantity of digital content being produced, scalability is an immediate concern. More and more people globally have access to tools and technologies to create digital content, increasingly with mobile devices equipped with cameras and apps developed specifically for the generation and dissemination of digital content. Moreover, the web continues to be a publishing mechanism for individuals, organizations, and governments, as publishing tools become easier to use. In light of these trends, the question of how to deal with “big data” is a major concern for digital preservation communities.
Selection is increasingly a concern with digital content. With so much data, how do we decide what to preserve? Again, from the agenda:
Content selection policies vary widely depending on the organization and its mission, and when addressing its collections, each organization must discuss and decide upon approaches to many questions. While selection policies for traditional content are most often topically organized, digital content categories, described here, present specific challenges. In the first place, there is the challenge of countering the public expectation that everything digital can be captured and preserved ‐‐ stewards must educate the stakeholders on the necessity of selection. Then there are the general organizational questions that apply to all digital preservation collections. For example, how to determine the long‐term value of content?
Audiences increasingly desire not only access, but enhanced use options and tools for engaging with digital content. Usability is increasingly a fundamental driver of support for preservation, particularly for ongoing monetary support. Which stakeholders should be involved and represented in these determinations? Of the content that is of interest to stakeholders, what is at risk and must be preserved? What are appropriate deselection policies? What editions/versions, expressions and manifestations (e.g. items in different formats) should be selected?
Members of the NDSA’s Content Working Group contributed to the 2014 Agenda by discussing which content was particularly challenging to them. Report writers then drafted sections of the Agenda to focus on the particular challenges of each of the four identified content areas:
- Electronic Records
- Research Data
- Web and Social Media
- Moving Image and Recorded Sound
One simple thing we are doing within the NDSA Content Working Group is holding dedicated meetings focused on each of the four areas listed above, so that members can learn more and share information about specific challenges, tools in use or under development, and so forth.
The first of these meetings was held on December 4, 2013, and focused on web and social media. I provided an overview of web archiving: why web and social media are being archived, who is doing what, and what challenges we face, whether social, ethical, legal, or technical. A PDF of my slides is here. Kris Carpenter from the Internet Archive followed and spoke about the “Challenges of Collecting and Preserving the Social Web.” A PDF of her slides is here.
In January we’ll be focusing on electronic records, and later this spring we’ll have sessions on moving image and recorded sound as well as research data. If you’d like to get in on those conversations, join us in the NDSA!
We don’t claim that the issues surrounding any of the four content types will all be solved over the course of the year, or that these are the only content areas that our members and the broader digital preservation community are dealing with. Who knows what the 2015 Agenda will bring! But we do hope that by drawing more attention to the challenges we are facing, we will encourage more research, tool development and related efforts that advance the work of stewards charged with caring for these digital content areas.
The Library Company of Philadelphia will be hosting Philadelphia’s first National Digital Stewardship Alliance (NDSA) Regional Meeting and Unconference on January 23 and 24. This is part of an initiative across the country for NDSA member organizations to host day-long events, or “NDSA Regional Meetings,” that provide networking and collaboration opportunities for members and highlight the work of regional institutions.
If you’re local to the Philadelphia region or if you’ll be in town for ALA Midwinter, I’d encourage you to check out the program. It’s a free event (!) and there are a few excellent reasons you’ll want to attend.
Learn More About the NDSA
The NDSA is a dynamic organization with more than 150 partner organizations, including universities, government and nonprofit organizations, commercial businesses, and professional associations. It’s self-organized, with its work decided and driven by professionals contributing to five working groups. The NDSA recently celebrated its third birthday, and you can read more about its history and accomplishments here.
At the Philly Regional Meeting, there will be two talks on the NDSA: one on the Levels of Preservation and another on the NDSA and the National Agenda for Digital Stewardship. If you aren’t an NDSA member but you’re interested in hearing whether the NDSA is a good fit for your organization, please consider attending.
Everyone Wants Standards for Digital Preservation
A focus of this Regional Meeting will be standards in digital preservation and how different communities use them to preserve and manage their digital collections. The meeting is structured so that you’ll have the opportunity to hear from speakers, like Emily Gore from the Digital Public Library of America (DPLA), Ian Bogus from the University of Pennsylvania Libraries, and George Blood from George Blood Video, on the different approaches to metadata standards they use to manage their digital resources (Thursday evening). You’ll also have the opportunity to collaborate during the unconference (Friday morning) on specific challenges or issues, exploring any topic you want with your fellow practitioners in a fun, informal way.
Connect Locally to Your Professional Peers
Creating professional relationships is important, and staying connected to what’s going on in your field is equally so. NDSA Regional Meetings are particularly good professional development opportunities because they connect you with a local community of practice for digital stewardship. You’ll have the chance to meet face-to-face with your professional peers, ask for advice or help, share ideas and work, and generally broaden your knowledge of digital stewardship issues. If your organization is an NDSA member, this is a great time to meet with others in the area. And as I mentioned before, even if your organization isn’t a member but you are local to the Philly area, you’re encouraged to attend!
This is the third NDSA Regional Meeting. The Boston Regional Meeting took place in May 2013, organized and hosted by WGBH and Harvard Library. Metropolitan New York Library Council hosted the NYC Regional Meeting last June. Other NDSA member organizations have expressed interest in organizing and hosting regional meetings later in 2014 in other parts of the country (DC-metro area and in the Midwest).
For the Philadelphia Regional Meeting, registration for the unconference on Friday, January 24 is sold out, but there are plenty of spots open for the Thursday, January 23 reception and talks.
Register for #NDSAPhilly today. We’d love to see you there!
Following is a guest blog post from Lisa Shiota, a student at Drexel University School of Information and Library Science and a staff member in the Music Division at the Library of Congress. She explains how she used Viewshare in a digital library technologies class.
I am currently finishing classes toward a post-graduate certificate in digital libraries through Drexel University’s online program. This past fall, in my Digital Library Technologies class, the final project was to create a digital library prototype. After looking at several open source applications for digital libraries, I chose Viewshare for my project.
What was particularly appealing to me about Viewshare was the different ways (or “views”) that the information could be presented. I figured it would be worth a try to see how easy it was to use. I requested an account from the moderator using their online form, and once I was approved, I created a login.
Plan and Approach
My plan to build the prototype was fairly simple, at least on paper.
- Identify the physical collection to be used for this project
- Read through the Help pages to learn how to use the system (http://viewshare.org/about/help/)
- Upload smaller test files to see how the system works
- Decide what metadata to record
- Scan covers/title pages
- Upload data
- Build one “view”
- Test interfaces and analyze results
The items I chose for the digital library project are opera scores by Giuseppe Verdi, a 19th-century Italian composer best known for his operas. The Music Division of the Library of Congress, where I am currently working, has most of Verdi’s operas in one print format or another. Although it seemed somewhat limiting to focus on one composer, I wanted to note the contrasting aspects of the collection. For example, most of the items have text in the original language, but there are some that have been translated into other languages. Many of the scores are the first printed editions, but there are several reprints that are represented as well. There are many items that are in manuscript; these are mostly by copyists who had viewed a printed score that had not been available in the United States and had painstakingly made a handwritten copy to add to the library’s collection. These copies are often in extremely brittle condition; many of the handwritten copies and the first printed editions have been copied to microfilm so that a legible, more durable copy could be preserved and made available to library patrons.
After playing around with uploading different kinds of files, I opted to upload a spreadsheet with the items’ metadata. Much of the metadata I chose to compile in my spreadsheet is standard for library bibliographic records: composer, title, publication information, extent (number of pages/volumes), format, language, and call number. I added a couple of fields for internal tracking purposes: a link to the library’s OPAC record, where available, and the shelving number for the microfilm version. The notes for each record are mostly my own: basic points I found noteworthy about each item.
I scanned the covers (or title pages, in absence of a cover) of the opera scores on a flatbed scanner and saved the images as .jpgs on my personal webspace on my school server. I then added the image URLs to the spreadsheet.
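For illustration, a metadata spreadsheet like the one described above can be assembled programmatically before upload. In this minimal Python sketch, every column name, value, call number and URL is a made-up placeholder rather than the actual project data:

```python
import csv
import io

# Illustrative columns for a Viewshare-style metadata spreadsheet.
# All field names and values below are hypothetical examples.
FIELDS = [
    "composer", "title", "preferred_title", "publisher", "pub_date",
    "extent", "format", "language", "call_number", "image_url",
]

records = [
    {
        "composer": "Verdi, Giuseppe",
        "title": "Il trovatore",
        "preferred_title": "Trovatore",
        "publisher": "Ricordi",
        "pub_date": "1853",
        "extent": "1 vocal score (285 p.)",
        "format": "vocal score",
        "language": "Italian",
        "call_number": "M1503.V47 T7",  # placeholder call number
        "image_url": "http://example.edu/~student/scans/trovatore.jpg",
    },
]


def write_metadata_csv(rows, fields):
    """Serialize the metadata records to CSV text ready for upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


print(write_metadata_csv(records, FIELDS))
```

A file saved from this text could then serve as the uploaded spreadsheet of item metadata, with the image URL column pointing at the scanned cover images.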
Lastly, I chose to include certain metadata (preferred title, librettists, and performance dates) solely to be able to explore the available Viewshare presentations. I wanted to use the preferred title (or uniform title) of a work so that I could group items representing the same work together even if they had different titles on their covers or title pages. I wanted to highlight the names of the original librettists for searching purposes. I recorded the dates of each opera’s first performance (from the “Giuseppe Verdi” entry in Oxford Music Online) so that I could experiment with the timeline view.
My final version of my digital library prototype includes List, Table, Timeline, and Gallery views, as well as facets for browsing by score format, language, and librettists, and is publicly available at http://viewshare.org/views/lshiota/verdi-scores/.
This project taught me a lot about the many components involved in creating a digital library. Based on the results of this prototype, I concluded that digitizing the library’s entire opera collection of several hundred items and making them available through Viewshare would be too cumbersome. Smaller collections, such as the division’s archival collections containing short correspondence, sketches, or photographs, would work better here. Viewshare’s built-in interfaces for maps, timelines, and graphs would be great for letting users interact with the digital collection in ways they might not be able to with the physical collection.
The January 2014 issue of the Library of Congress Digital Preservation Newsletter (pdf) is now available!
Included in this issue:
- Two digital preservation pioneers: Steve Puglia and Gary Marchionini
- New NDSA Report: Staffing for Digital Preservation
- GIS Data at Montana State Library
- Upcoming events: NDSA regional meeting, ALA Midwinter, International Digital Curation Conference
- Interviews with W. Walter Sampson, Mitch Fraas and Cal Lee
- More NDSA news, articles on resources, web archiving and more
To subscribe to the newsletter, sign up here.