The Signal: Digital Preservation
We’re big fans and proponents of face-to-face meetings and conferences as a means to explore best practices and share lessons learned within and outside of the digital stewardship community. We can’t attend every event, which is why I’m thrilled to have the opportunity to go to the annual meeting of the Society of American Archivists, August 11-17 in New Orleans, LA. The program looks great (and packed) as usual!
I’ll be participating in a session on Friday, August 16 at 9:30am, “Building Better Bridges: Archivists Across the Digital Divide”, with a great line-up of professionals: Rebecca Goldman, Media and Digital Services Librarian, La Salle University; Rachel Lyons, Archivist, and Dolores Hooper, Archivist, New Orleans Jazz and Heritage Foundation; Jamie Seemiller, Acquisitions Librarian, Denver Public Library; Audra Eagle Yun, Acting Head, Special Collections and Archives, University of California, Irvine; and Rachel Donahue, Grad Student, University of Maryland. Eira Tansey, Library Associate, Tulane University, is chairing the session and Megan Phillips, Electronic Records Lifecycle Coordinator, NARA, will be moderating.
The session will explore the effects of the digital divide, focusing on the practical challenges archives face in managing born-digital materials, particularly the growing gap between the skills archivists have and the skills they are expected to have in managing digital archives.
We’ll start out with a series of lightning talks on topics such as starting an e-records repository from scratch, managing digital projects and staff in a smaller organization, and educational and outreach opportunities, as examples of solutions archivists and information professionals can learn from. A moderated discussion with the speakers and audience will follow, exploring the broader challenges and how the profession can work on solutions to bridge the gap.
For my lightning talk contribution to the session, I’ll talk about how outreach and programming opportunities can raise digital preservation awareness at the personal level, within organizations and with the general public, as one way to address the digital divide, using NDIIPP’s personal digital archiving resources. We developed these basic tips and guidance to help individuals save their digital materials. We started holding and participating in events to share the guidance broadly, during the National Book Festival and through ALA’s Preservation Week activities and webinars. But we also recognize that local libraries and archives are in a better position to connect directly with the public and their patrons. Libraries and archives are, after all, very focused on public service and community outreach. Through public programming and outreach events, archivists and information professionals have the opportunity to empower individuals to manage and preserve their digital information.
Admittedly, outreach and communication efforts don’t close all the gaps that exist. What are some of the other challenges we hope to explore in the session?
Archives of all sizes face many resource challenges in acquiring born-digital collections, including having the tools, services, workflows and IT support to process collections, and experienced or knowledgeable staff to manage digital materials. To address the latter, organizations routinely hire “digital archivists” with the expectation that all their dilemmas of managing the digital deluge will be solved. They’ll be able to build a digital archive or repository, train existing staff to manage digital collections, provide researcher and user services, and take on “other digital duties as assigned” (e.g. promoting collections with social media, creating online exhibits). But within the archives and information services profession, we know there is a need for constant learning about the emerging best practices, standards, tools and services that keep information (analog and digital) accessible over time. An organization can’t rely on one person to perform all of a modern archive’s responsibilities. Just because an archivist with “tech skills” has been hired doesn’t mean all of the organization’s digital archiving and preservation problems are immediately solved.
The session will have plenty of time for discussion, and we’re all really interested in delving deeper into how we can acquire the skills to perform the tasks and responsibilities asked or expected of us, and what some of those solutions might be. If you have any comments on the session topic that you’d be interested in sharing, I’d love to hear your thoughts below. You can follow all the action during the conference on Twitter at #saa13, and this particular session at #s301.
The following is a guest post by Madeline Sheldon, a 2013 Junior Fellow with NDIIPP.
My major project as a Library of Congress Junior Fellow was to identify and analyze digital preservation policies from cultural heritage institutions. This project was an update and extension of work done in 2011 by another Junior Fellow, Kristen Snawder. My full report is available here. What follows is an overview of my findings.
Several parameters were established for my project:
- Focused primarily on digital preservation, not digitization
- Published, or last updated, between 2008 and 2013
- Written (or translated) in(to) English
- Published on the internet
- Identified as a policy or a strategy
I located a total of 33 digital preservation policies/strategies from around the world. They were almost equally divided between archives and libraries, with only two documents located from museums. About half of the documents were from U.S. institutions, with the rest primarily from western European nations.
The bulk of my analysis focused on developing and applying a taxonomy to describe the topics covered by the documents. I prepared the taxonomy to permit a high-level comparison among the various policies with regard to their scope and coverage. Due to the larger number of documents I identified, and their more recent publication, I found it necessary to modify Snawder’s earlier taxonomy somewhat (see Table 1).

- Access and Use
- Accessioning and Ingest
- Audit
- Bibliography
- Collaboration
- Content Scope
- Glossary/Terminology
- Mandates
- Metadata or Documentation
- Policy/Strategy Review
- Preservation Model/Strategy
- Preservation Planning
- Rights and Restriction Management
- Roles and Responsibilities
- Security Management
- Selection/Appraisal
- Staff Training/Education
- Storage, Duplication, and Backup
- Sustainability Planning
Table 1: Digital Preservation Policies Taxonomy
I used the taxonomy to create a matrix in which I indicated coverage for each topic within each document. This involved a level of subjectivity, in that I indicated coverage only if the document dealt with the topic in what I judged to be a substantive manner. In other words, the treatment was detailed enough to potentially inform another institution in developing or revising their own policy document. The full results of this analysis are included in my report, but it is worth noting that the three most commonly used taxonomy elements were preservation strategy/model, collaboration and content scope. The three least commonly used were accessioning/ingest, audit and preservation planning.
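To make the methodology a bit more concrete, here is a minimal sketch, in Python, of how a coverage matrix like this might be recorded and tallied. It is my own illustration rather than Sheldon’s actual workflow, and the document names and shortened topic list are placeholders.

```python
import csv
from collections import Counter

# Placeholder taxonomy elements and coverage judgments; the real study
# applied the full taxonomy in Table 1 to 33 policy documents.
taxonomy = ["Access and Use", "Collaboration", "Content Scope",
            "Preservation Model/Strategy", "Audit", "Accessioning and Ingest"]

coverage = {
    "Example Library Policy": {"Access and Use", "Collaboration", "Content Scope"},
    "Example Archive Strategy": {"Preservation Model/Strategy", "Content Scope"},
}

# Write a document-by-topic matrix (1 = substantive coverage, 0 = none).
with open("coverage_matrix.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Document"] + taxonomy)
    for doc, topics in coverage.items():
        writer.writerow([doc] + [1 if t in topics else 0 for t in taxonomy])

# Tally how often each taxonomy element is covered across all documents.
counts = Counter(t for topics in coverage.values() for t in topics)
for topic, n in counts.most_common():
    print(f"{topic}: {n}")
```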
The policy documents I used for this study, along with current links, are listed in Table 2. Note that Archives New Zealand and the National Library of New Zealand co-authored a strategy, which I counted as two separate institutions. The United Kingdom Parliamentary Archives published two documents, one policy and one strategy, which I included as two separate documents, and chose to count the body as one institution, not two.
As Snawder found earlier, the state of institutional digital preservation policies is developmental. Given that many more than 33 institutions around the world are likely responsible for digital stewardship, it seems safe to say that most are still considering how best to define and document their policies. I hope my work in some small way helps move that process along.

- Archives New Zealand te Rua Mahara o te Kawanatanga and National Library of New Zealand Te Puna Matauranga o Aotearoa – Digital Preservation Strategy (PDF)
- Boston University Libraries: Digital Initiatives & Open Access – Digital Preservation Policy
- British Library – Digital Preservation Strategy (PDF)
- Cheshire Archives (UK) – Digital Preservation Policy
- Dartmouth College Library – Digital Preservation Policy
- Florida Digital Archive – FDA Policy and Procedures Guide, version 3.0 (PDF)
- Hampshire County Council Archives – Digital Preservation Policy
- HathiTrust Digital Library – Digital Preservation Policy
- Illinois Digital Environment for Access to Learning and Scholarship – IDEALS Digital Preservation Policy
- Johns Hopkins Sheridan Libraries – JScholarship Digital Preservation Policy
- London Metropolitan Archives – Interim Digital Preservation Policy (PDF)
- National Archives of Australia – Digital Preservation Policy
- National Library of Australia – Digital Preservation Policy 4th Edition
- National Library of Wales – Digital Preservation Policy and Strategy (PDF)
- National Museum Australia – Digital Preservation and Digitization Policy (PDF)
- North Carolina Department of Cultural Resources – Archival Process for Data and Image Preservation: The Management and Preservation of Digital Media (PDF)
- Plymouth City Council (UK) Plymouth and West Devon Record Office – Digital Preservation Policy
- Public Record Office of Northern Ireland – Digital Preservation Strategy (PDF)
- Purdue University Research Repository – PURR Digital Preservation Policy
- Rhizome at the New Museum – Digital Preservation Practices and the Rhizome Artbase (PDF)
- State Library of Queensland – Digital Preservation Policy (PDF)
- Statsbiblioteket State and University Library – Digital Preservation Strategy for State and University Library, Denmark, version 2.0 (PDF)
- Swiss Federal Archives – Digital Archiving Policy
- The Royal Library: The National Library of Denmark and Copenhagen University Library – Policy for long term preservation of digital materials at the Royal Library (PDF)
- United Kingdom Data Archive – Preservation Policy (PDF)
- United Kingdom Parliamentary Archives – A Digital Preservation Policy for Parliament (PDF)
- United Kingdom Parliamentary Archives – A Digital Preservation Strategy for Parliament (PDF)
- University of British Columbia Library – Digital Preservation Policy (Draft)
- University of Manchester Library – Digital Preservation Strategy (PDF)
- University of Massachusetts Amherst Libraries – Digital Preservation Policy (PDF)
- University of North Carolina at Chapel Hill: The Howard W. Odum Institute for Social Science – Digital Preservation Policies
- University of South Carolina Libraries – USCL Digital Preservation Policy Framework (PDF)
- University of Utah J. Willard Marriott Library – Digital Preservation Program: Digital Preservation Policy
Table 2: List of Digital Preservation Policies
What’s a Nice English Professor Like You Doing in a Place Like This: An Interview With Matthew Kirschenbaum
I’ve talked about Matthew Kirschenbaum’s work in a range of posts on digital objects here on The Signal. It seemed like it would be valuable to delve deeper into some of those discussions here in an interview.
If you are unfamiliar, Matthew G. Kirschenbaum is Associate Professor in the Department of English at the University of Maryland and Associate Director of the Maryland Institute for Technology in the Humanities. Much of his work now focuses on the critical and scholarly implications of the shift to born-digital textual and cultural production. He is the author of Mechanisms: New Media and the Forensic Imagination (MIT Press 2008). He was also a co-investigator on the NDIIPP-funded Preserving Virtual Worlds project, a co-author of Digital Forensics and Born-Digital Content in Cultural Heritage Collections, and oversees work on the Deena Larsen Collection at MITH, a personal archive of hardware and software furnishing a cross-section of the electronic literature community during its key formative years, roughly 1985-1995. Currently he is a co-investigator on the BitCurator project and a member of the faculty at the University of Virginia’s Rare Book School, where he co-teaches an annual course on born-digital materials.
Trevor: You have a Ph.D. in English literature and work in an English department. What are you doing so heavily involved in the cultural heritage digital archives and digital forensics community?
Matthew: When I teach at Rare Book School every summer, I introduce myself to our class by saying that I’m the English professor who instructs archivists and librarians about computer media. On the one hand, that makes me look very confused. On the other, though, it’s really just a linear and direct outgrowth of my scholarly training at the University of Virginia, an institution renowned for its tradition of attentiveness to texts in all their physical and material incarnations. That perspective was foundational to my first book, Mechanisms. So behind the glib throwaway line, I consider myself a textual scholar who is working with (very) recent cultural materials—primarily literary—many of which, by nature of their historical situation, are born-digital since so many writers are composing with computers just like the rest of us. Digital forensics, specifically, seems to me the natural companion of the rigorous evidence-based methodologies that have emerged in descriptive and analytical bibliography. My current book project, called Track Changes, is a literary history of word processing, and to write it I’m relying on both my training as a literary scholar and the knowledge I’ve gained from working with legacy media and file formats. So again, I simply see myself as following those disciplines using the tools and methods appropriate to the contemporary moment. That I’ve had the opportunity to teach and learn from so many practicing archivists is one of the great professional joys and privileges of my career.
Trevor: Within the digital preservation space there are some strong proponents of normalization of file formats and some equally strong proponents who eschew normalization in favor of staying true to the files one is given. When asked by a librarian or archivist for your perspective on this, how would you respond? From your perspective, what are the trade-offs?
Matthew: The most obvious trade-off is of course resources. Normalization is attractive because it lends itself to batch processing and creates the foundation for a long-term preservation strategy. Institutions always have limitations on their resources and capabilities, and so normalization, which I take to mean migrating data away from legacy formats and media, is going to form the basis of the preservation strategy in many instances. Yet as a scholar who is committed to what we term the “materiality” of all artifacts and media, even supposedly virtual ones, I want to see as much of the original context as possible. This is an easier argument to make in some domains than in others. Games are an obvious example of where “normalization” would defeat the purpose of preservation, thus the widespread use of emulation in those situations. Sometimes we think that documents like word processing files or email offer fewer trade-offs with regard to normalization, since what people really want to see there is presumably the content, not the context. But you can never really predict what your users are going to want. To take an example from my current work on the literary history of word processing: Terrence McNally, in his “papers” at the Ransom Center, has a WordPerfect file wherein he comments about his discomfiture watching his text scroll off the edge of the screen into digital limbo. That’s an instance where a researcher wants to know what the original software was, how many lines and characters it permitted on the screen, what the size of the screen was, and so forth. In fact, I can tell you that the writers who were early adopters often obsessed over such details. The difference between a 40- and an 80-character display could be decisive in a decision to purchase.
The most dramatic example of what’s achievable in this regard is likely still the remarkable emulation work done at Emory for Salman Rushdie’s personal computers. Not only can users look at his wallpaper and other incidentals of the system, they can see which files were stored together in which folders, how software was configured, and so on—all details analogous to the sorts of things researchers find compelling in the physical world. Yet one does not have to go to such lengths to preserve material context. Obtaining and then retaining a bitstream image of the original media will allow future users to reconstruct the original context in as much detail as they like. Such a measure is logically prior to normalization, and relatively easy to implement.
Trevor: The cultural context and physical and digital technologies of computing have evolved and continue to evolve so fast. To what extent do you think we can develop practices and principles that are going to work with materials over the long term?
Matthew: Certainly with regard to legacy media, specifically disk-based magnetic media, I consider myself an optimist. I don’t, for example, think floppy disks are deteriorating at quite the rate claimed by some of our colleagues. I also think that in one hundred years we will know what ASCII and HTML and C++ are, along with Word and Excel, if for no other reason than those things are well documented on acid-free paper (walk into any Barnes and Noble and browse the computer section). And I’ve often said that “love will find a way”: meaning that when committed people care intensely, sometimes even irrationally, about a particular object or artifact, they are often—very often—able to find ways to recover and conserve it. My own best example here is the work around William Gibson’s poem “Agrippa,” an electronic artifact famously designed to disappear, in which I was heavily involved. Ben Fino-Radin has demonstrated the same principle with his work on The Thing BBS. Jason Scott demonstrates it seemingly on a daily basis, but see, for example, his collaboration with Jordan Mechner on recovering the original source code for Prince of Persia. Of course each of these situations which I cite as exemplary was fraught with perils and contingencies which could have easily rendered them fruitless. But I tend not to like the analogy to, say, early cinema (80% of the films made before 1930 are lost) because we are all so exquisitely aware of both the perils and importance of our born-digital heritage. NDIIPP and the NDSA certainly testify to that awareness and commitment in the US.
By the same token, the rise of the so-called cloud presents obstacles that are not primarily technical—for the cloud, as we all know, is merely a hard drive in some other computer—but rather legal and contractual. Likewise, the increasing tendency towards preemptive data encryption—practices which will surely become even more commonplace in the wake of recent revelations—threatens to make archival preservation of personal digital content all but unthinkable for entities who lack the resources of the militarized surveillance state. I know of very little that archivists can do in either of these instances other than to educate and advocate (and agitate). They are societal issues and will be addressed through collective action, not technical innovation.
Trevor: How has your thinking about the role of digital forensics tools developed since the publication of Digital Forensics and Born-Digital Content in Cultural Heritage Collections? Are there any areas where your thinking has evolved or changed? If so, please describe.
Matthew: I admit that when I first began learning about digital forensics I was drawn to the sexy CSI scenarios: recovering deleted files and fragments of files, restoring lost manuscripts, and so on. I still think there’s going to be some gee-whiz stuff in that area, something akin to the wizardry displayed by the team who worked on the 1000 year-old Archimedes Palimpsest for example, but I’ve come to appreciate as well the far less glamorous but wholly indispensable archival functions digital forensics can assist with, such as appraisal, arrangement and description, and the ongoing maintenance of fixity and integrity. I still enjoy quoting the historian R. J. Morris, who back in the 1990s opined: “Within the next ten years, a small and elite band of e-paleographers will emerge who will recover data signal by signal.” (And how could I not quote that for this venue!) But it’s also true that we have yet to see any really compelling examples of revisions, variants, alternate readings recovered in this way. The best demonstrations that I know come from my former Maryland colleague, Doug Reside in his work on Jonathan Larson’s lyrics and compositions for RENT, originally done on a Mac. By contrast, I was disappointed to learn that for both Ralph Ellison’s Three Days Before the Shooting and David Foster Wallace’s The Pale King, two examples of the very highest literary significance where authors left behind relevant born-digital materials, the scholars who prepared the posthumous editions worked from hard copy transcripts of the digital files, not the original disks or bitstream disk images.
Trevor: What do you see as the biggest hurdles archives face in making born digital materials part of their primary operations? Is this largely about a need for tools, frameworks, education and training, examples of how scholars are using born digital materials, the need for new ways to think about materials or other factors?
Matthew: The single biggest hurdle archives face for these materials is users. For the other things you name, I think the field is increasingly healthy. Oh, certainly work remains to be done, but just look at how far we’ve come in just the last five years: there are instructional opportunities available through SAA, RBS and others. There’s a growing pool of expertise amongst both professional practitioners and iSchool faculty. The technical landscape has taken shape through funded projects, meetings, social networks, and a growing journal literature. But even allowing for the relatively small number of digital collections that have been processed and opened to end users, interest in the scholarly community seems slight to non-existent. Emory’s work on Salman Rushdie’s computers, which I praised earlier, has, to my knowledge anyway, produced no great uptick of interest in his digital manuscripts in literary studies. This will doubtless change over time, but it will be slow—you need scholars working on the right topics, they need to be aware of the existence and import of relevant born-digital materials, they need to have or to be motivated to acquire the training to work with them, and finally the materials themselves must turn out to bear fruit. In the meantime I fear that lack of users will be one more reason resource-poor institutions choose to defer the processing of born-digital collections in favor of other material. So I think we need users, or to put it more colloquially we need some big wins to point to in order to justify the expense and expertise processing these collections requires. Otherwise we may simply go the way of media conversion, outsourcing collections in bulk without regard for the material context of the data.
Trevor: The collection of computers at MITH bears some similarities to Lori Emerson’s Media Archaeology Lab and Wolfgang Ernst’s Media Archaeological Fundus. To what extent do you think your approach is similar to and different from theirs?
Matthew: That’s a great question. I think of places like MITH, which is a working digital humanities center, as well as the MAL, the MA Fundus, and Nick Montfort’s Trope Tank at MIT as inhabiting a kind of “third space” between manuscript repositories processing born-digital collections on the one hand, and computer history museums on the other. They’re really much more akin to fan sites and grassroots initiatives, like the Musee Mecanique penny arcade in San Francisco. Above all, these are entities whose commitment to the materiality of computer history is absolute. They adopt the media archaeological precept that not only does the materiality matter, but that the machines ought to be switched on. At MITH, you can fire up an Apple II or Amiga or Vectrex. You can also take a look at “Mysty,” a 200 lbs. IBM MT/ST word processor (1964!) that I have high hopes of one day restoring. We began collecting vintage computers when Doug Reside was still there, and over time the collection grew. They have been useful to us in several different funded projects over the years, and help distinguish us as a working digital humanities center. But what sets us apart is that we also have two individual author collections, Deena Larsen and Bill Bly—both early electronic literature pioneers—and we have worked to build a relationship to Library Special Collections to ensure their long-term safekeeping.
That last point is worth some further elaboration. I know that when MITH acquired first the Deena Larsen materials and then, more recently, Bill Bly, there were maybe a few eyebrows raised in the manuscripts world. Clearly here was digital humanities looking to usurp yet another role. But that wasn’t the motive at all. Rather, both Deena and Bill were attracted to the idea that we would be working with the collections, using them as research testbeds and sharing them with students. They saw them very much as teaching collections, not unlike the holdings at RBS where students are encouraged to handle the materials, sometimes even to take them apart (and put them back together again). Because MITH does not have other collections to process we were able to work at our own pace, experimenting and testing. But we’re also sensitive to the need for long-term stewardship, and so to that end have forged what may be a unique model of joint custody for these collections between MITH and University Special Collections. In an arrangement concretized by an MOU we are jointly developing procedures for processing these materials and eventually other born-digital collections at Maryland. MITH and the University Libraries are also fortunate enough to be hosting an NDSR fellow this coming year, and we have high hopes that our resident, Margo Padilla, will be able to help us think through the access portion of the model, by far (in my view) the toughest component. So while we align completely with the sensibilities of the MAL and other such places, we also have a rapidly maturing relationship with our institution’s special collections staff, and we hope that others may be able to benefit from that model.
Trevor: What projects or initiatives related to born digital materials are you most excited about now and why?
Matthew: Well, I would be remiss if I did not promote BitCurator, the open source digital forensics environment we’re developing along with a team at UNC SILS. BitCurator is not a tool or a set of tools, it’s an environment, specifically a custom Linux distribution that comes pre-configured with a range of open source tools, enhanced by additional scripting from us to link them together in a workflow and generate reports. We’re beginning a new phase of the project with a dedicated Community Lead in the coming year, and this will be critical for BitCurator in terms of its uptake. To that end we’re also developing an important relationship with Archivematica, where some of our BC scripts will be available as services.
Where I’ve really noticed the impact from BitCurator, though, is in my teaching. Permit me an anecdote. My first year at RBS, I attempted a SleuthKit installfest. I described the experience to Michael Suarez afterwards, and if you ever doubted that a distinguished bibliographer and Jesuit was capable of some salty language, his characterization of my description of the process would have disabused you. Lesson learned, the next two years I relied on screenshots and canned demos from the safety of the controlled environment on my laptop at the front of the room. Much safer, but not nearly as satisfying for the students. BitCurator, at least on my end, was born directly from those frustrations. Thus when this past year we were able to bring the students full circle—from analyzing a disk image, performing basic data triage like hashing and format identification, to searching the image for PII, generating DFXML metadata, and exporting it all as a series of human and machine-readable reports, it was hugely gratifying.
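For readers curious what that sort of triage looks like in practice, here is a minimal sketch in Python. It is not BitCurator (which bundles dedicated forensic tools), and it assumes a plain directory of already-extracted files rather than a forensic disk image; the directory name and the extension-based format guessing are simplified stand-ins for illustration only.

```python
import csv
import hashlib
import mimetypes
from pathlib import Path

def sha256(path: Path) -> str:
    """Compute a SHA-256 fixity value for one file, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def triage(root: str, report: str = "triage_report.csv") -> None:
    """Walk a directory and record size, hash and a naive format guess."""
    with open(report, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["path", "bytes", "sha256", "guessed_format"])
        for path in sorted(Path(root).rglob("*")):
            if path.is_file():
                fmt, _ = mimetypes.guess_type(path.name)
                writer.writerow([str(path), path.stat().st_size,
                                 sha256(path), fmt or "unknown"])

if __name__ == "__main__":
    triage("extracted_files")  # hypothetical directory of files to examine
```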
Trevor: You are active in the digital humanities community. With that said, I don’t necessarily see many folks in the digital humanities working so extensively with born-digital materials. What role do you think born-digital materials have to play in the digital humanities, and how do you think more digital humanists might be engaged in the research issues surrounding born-digital primary sources?
Matthew: I think the potential here is huge, and it dumbfounds me that there isn’t already more cross-over and collaboration. Most DH folks, though, tend to work on older materials, if nothing else than for the obvious reason of copyright. There are some exceptions: Lev Manovich and his idea of cultural analytics, for example. Matt Jockers is going to begin working on more 20th century material (and has the corpus to do it with, an amazing feat) and Ed Finn has been working on contemporary material for a while. Still, they’re the exception. Part of it may be what’s always struck me as a pernicious and pointless division between new media studies and digital humanities, with the former trending towards contemporary digital cultural studies and the latter towards more established ventures in literary criticism and historical studies. But the digital humanities projects of today are the born-digital collections of tomorrow, and the vernacular culture of the Web is no less suitable for DH analytics than the vernacular culture of two hundred years ago, which we now seek to apprehend through techniques such as distant reading and data mining. DH, it seems to me, is the natural ally for the digital archivist in the scholarly world. (DHers, for their part, can perhaps better learn that there is such a thing as an archives profession and that their own free use of the term does not necessarily endear them to its practitioners, who have their own benchmarks for professionalism.) I’ve written down some additional thoughts about this, and perhaps the best thing to do is to point folks to this article here, particularly the concluding 5th section.
Thanks, Trevor, for this opportunity: The Signal is a terrific platform for the community, and I’m honored to be included here alongside so many friends and digital preservation pioneers!
The following is a guest post by Madeline Sheldon, former Junior Fellow with NDIIPP
The Junior Fellows Summer Intern Program at the Library of Congress provides a unique opportunity for undergraduate and graduate students to work on special projects and collections within the world-renowned institution. The Library selects students with various educational backgrounds and specializations, including libraries, archives, humanities and sciences.
During the 10-week internship, fellows pair with a supervisor who guides them through their assignments, often acting as a professional mentor in the process. Junior Fellows Program coordinators also arrange a series of special tours, meetings and discussions around Washington, D.C., an added incentive of working at the Library.
As a Junior Fellow, I had the privilege of working with William LeFurgy, Acting Director of NDIIPP Program Management in the Office of Strategic Initiatives, who encouraged me to attend several meetings and allowed me to take on multiple projects during my tenure.
- At one of the first meetings organized for the Junior Fellows, Dr. James Billington, Librarian of Congress, and Roberta Shaffer, Associate Librarian for Library Services, spoke to us about their positions within the Library and answered questions from the audience. I sat in the front row, literally feet away from these esteemed professionals: an experience I will never forget.
- I also met with United States Senator Debbie Stabenow and Congressman John Dingell, who both took time out of their busy schedules to take a photograph with me, a new Michigander.
- In previous blog posts, I discussed my attendance at a talk given by Courtney Johnston, Director of the Dowse Art Museum, and summarized my experience at a symposium focused on conservation practices for time-based media within museums. Both presentations provided me with helpful information and strategies, which I used while researching digital preservation policy planning within cultural heritage organizations.
- While I spent the majority of my time researching and writing, I also had an opportunity to produce a video based on Tess Webre’s blog post, Snow Byte and The Seven Formats: A Digital Preservation Fairytale. I helped with script development and designed a storyboard for the video, which outlined specific instructions for the visual, music and audio/narration transitions. The video is currently going through a final stage of edits, but should debut shortly.
- During my final weeks, I participated in two events – the Junior Fellow Display and the NDIIPP annual meeting, Digital Preservation 2013, where I presented a poster outlining the research I conducted while working with NDIIPP.
I am very proud of the work I’ve done, but also thankful for all of the help and advice I have received along the way from fellow interns, Junior Fellow coordinators and OSI staff. The employees I’ve met, and collaborated with, have been so welcoming, thoughtful and encouraging; I value each and every experience shared with them.
While I am sad to leave the Library of Congress, I know that my time spent here has been such a fulfilling professional opportunity. I feel so honored and fortunate to have served my country in this way, and hope the Library continues to offer such a valuable program.
Update, 8/9 – corrected URLs
The human rights organization witness.org, which gave a presentation at Digital Preservation 2013, just published The Activists’ Guide to Archiving Video. Though the guide is intended for human rights activists, it covers all aspects of digital video archiving so thoroughly that it is of value to everyone, from individuals archiving their personal videos to organizations developing digital video archives.
Witness’s staff of professional archivists and video technologists structured the guide in a sequential workflow under the headings Create, Transfer, Acquire, Organize, Store, Catalog, Preserve and Share. Each step in the workflow includes an example scenario and graphics; details the advantages and disadvantages of certain practices; and provides tips with basic and advanced levels of technical information. The website is displayed in a clean, easily readable layout and each section is filled with links to tools and resources.
Starting with the video-creation process itself, the guide explains metadata, emphasizes its importance and details what metadata to capture, how to capture it (by either embedding it into the video file or describing it on camera visually and verbally) and how to display it. The guide even alerts readers to technological snags, such as the possibility of metadata getting stripped out when transcoding files.
They examine the process of transferring files – offloading from a camera, over the network or off a storage device – and stress how crucial it is to constantly verify the integrity of the files by means of checksums/hashes, virus checking and spot checking. Acquisition steps include content evaluation and deciding what to keep. Organizing emphasizes the need for a logical system of organization, which is equally important for individuals and organizations; links in this section include tools for media management.
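As an aside, the fixity check the guide recommends is easy to picture in code. Here is a minimal sketch, my own illustration rather than anything taken from the guide, that compares SHA-256 checksums of a source file and its copy; the file paths are hypothetical.

```python
import hashlib

def checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the file as offloaded from the camera card,
# and the copy that landed on archival storage.
source = checksum("camera_card/clip_0001.mp4")
copy = checksum("archive_storage/clip_0001.mp4")
print("fixity OK" if source == copy else "MISMATCH: re-copy and re-verify")
```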
The section on storage media and hardware examines storage strategies and compares hardware devices, from simple hard drives to network storage to RAID arrays. Cataloging covers indexing, types of metadata and inventory tools.
Preserve gets to the heart of digital preservation: ensuring long-term accessibility. Since the guide is aimed primarily at human rights organizations, the Preserve section leans toward partnering with institutional archiving. If that is not an option, there is detailed information for building your own professional archive.
Finally, Share looks at issues in providing access to your videos: posting videos online, creating finding aids, controlling levels of access, copyrights and — given the dangerous climate that human rights organizations may operate in — security and identity.
Witness may have written the guide for human rights organizations, but the digital preservation information it contains has been gleaned from digital archive, library and video technology practices. The authors have managed to extract the nucleus of each issue in digital preservation, combine and organize them all in a logical flow, and explain them so directly and clearly that the guide can be easily understood by most people.
The August 2013 Digital Preservation Newsletter is now available!
In this issue:
- Overview of our annual meeting: Digital Preservation 2013
- Announcing Release of National Agenda for Digital Stewardship
- You Say You Want a Resolution…which is best for photos?
- 3 Things Needed for Personal Digital Archiving
- Twisty Little Passages to a Career in Digital Preservation
- Two posts each on email archiving, museum related digital archiving, and the NDSA levels of digital preservation
- Interviews with experts on Viewshare, digital offerings at the National Park Service, and others
- Announcements of upcoming events, training courses, and more
A ten year-old recently asked what I do for a living.
The response mostly involved explaining that the Library of Congress has digital collections and that I lead a team of people who take care of digital things, including writing software.
I have often been asked by family, friends and complete strangers to explain what I do. Here’s an attempt.
Research. It seems that every day I see notice of a new report, software, tool, or group that relates to some aspect of digital preservation, or that could have an impact on digital preservation. I cannot possibly read everything, but I certainly do download and skim a lot of reading material. Sometimes I have the time to really delve into some publications.
Follow Social Media. I now follow several hundred accounts on Twitter. My time on Twitter is never wasted, as I find so many announcements there. And I don’t just follow digital preservation-specific accounts, but also news organizations, cultural heritage organizations, scholars, technologists, librarians, archivists, curators, and art and technology journalists. One never knows where some relevant tidbit will appear. And as old school as it may sound, I also still subscribe to some email listservs.
Attend Meetings and Conferences. I spend a lot of time in meetings about digital preservation. In some cases I am participating in or facilitating discussions that introduce people to digital preservation, or that consider technologies, tools and technical feasibility. And I participate in task forces and panels at other organizations or federal agencies. But other times I am just sitting in a room listening, which is just as valuable, if not more so at times. One of my favorite events every year, where I get to talk for hours at a time, is the National Book Festival. I volunteer at the NDIIPP booth and get to talk to hundreds of people over the course of two days about their personal digital archiving and preservation needs. This informs a lot of my thinking about the tools we might need and the guidance we need to develop.
Present. I am extremely honored to be invited to speak at many events every year. I give talks at the Library on initiatives we’re working on. I lecture to library and information school classes. I talk at conferences. And I get questions, which helps me refine my message and better understand how what I am doing might be useful or usable for other organizations.
Write. I draft preservation plans. I write statements of work for contracts for preservation tool development. I write papers and articles on digital preservation and technology topics. I write blog posts…
Build Relationships. One of the primary mandates in digital preservation is collaboration, as no one organization could or should work alone. I spend some of my time every day reaching out to people I know at other organizations, finding out what they’re working on; responding to messages from colleagues asking if the Library is doing anything that might be of use to them; and meeting new people, sometimes online, sometimes at meetings.
Work With Collections. Sadly, at this point the thing I do very little is interact with collections. I started out my career on a curatorial/collections management track, and I miss working directly with things. Sometimes I get to roll up my sleeves and do what is needed to make sure we have what is needed to process a collection. Or make sure that files are where they should be. Or audit and report on the status of a collection. Or, on rare occasions, create some metadata.
Write Code. The thing I actually do the least now is write code. As in not at all. But I get to work every hour of every working day with an amazing group of programmers who are writing code that is vital for the ingest, management, preservation of and access to the Library’s digital collections, and that we release as open source for the international preservation community. And I often get to sit in a room with them and talk about priorities for tool development based on what I’ve read/heard/learned at meetings with Library staff and people in the community. In many ways I get to see the fruits of all my efforts incorporated into the tools that we build. And that makes me extraordinarily happy.
This is a guest post by Ingrid Jernudd, a volunteer with NDIIPP.
I am a senior at Stanford University who is pursuing a degree in psychology. In the past I have worked for a public relations firm, worked on planning events and with community outreach for Stanford Dining, and been a research assistant in psychology labs at Stanford. These experiences, combined with an international upbringing, have contributed to my interest in effective methods of communication. In addition to these occupational experiences, I studied abroad at Oxford University for the second half of my junior year. This unique academic experience was eye-opening for me, as I discovered a passion for using digital sources for research and was provided with an irreplaceable opportunity to improve upon my writing.
My appreciation for access to digital resources, combined with my strong interest in effective communication, brought me to the Library of Congress this summer. I want to gain some practical experience in helping raise popular awareness about the value of digital preservation in our lives.
With the advent of globalization, and an increase in the role of technology, the need for the effective and rapid dissemination of information is apparent. Digital information has provided a solution to this in a number of ways. For instance, communication tools like social media and email enable people across the world to contact each other and share information in a matter of seconds, and digital 3D models of historical artifacts ensure global access to sources of cultural heritage.
I, personally, have definitely benefitted from this transition from physical to digital information, although my experience with digital preservation is limited. For instance, various social media websites provide a simple way for me to keep in touch with friends in other countries. As for academics, my professors at Stanford post their lectures and homework on course websites, and I have accessed countless research articles online when I have not been able to find a physical copy.
While I take these resources for granted, not everyone is aware of their availability. Moreover, these resources will only continue to be available if they are properly archived and maintained by ensuring that the archived digital information keeps up with new software as technology advances.
In order for this to occur, people not only need to know how to access digitally archived information, but also how to digitally archive and preserve information themselves. This is what I will be helping with at NDIIPP. While my own knowledge of the intricacies of digital preservation is limited, I do have communication and outreach experience. During my time with the NDIIPP, I will be working on creating tutorial videos on various methods for creating and archiving digital information, writing blog posts for The Signal, and continuing work on a previous NDIIPP outreach project that involved working with libraries to increase awareness about digital preservation. Along the way, I hope to learn more about digital stewardship, and the future trajectory of digital preservation. I am excited about volunteering with the Library of Congress, and look forward to working on my communication and outreach projects.
How do people outside of our community think about digital preservation?
In her opening, Hilary Mason, chief scientist at bit.ly and the first speaker at Digital Preservation 2013, posed this question, framing her talk from the perspective of computer engineers and those working in start-up businesses. She went on to talk about the evolution of bit.ly and data archiving, noting that preservation without access is useless. Her thoughtful keynote set an excellent stage for the next few days of presentations and discussions.
Hilary was one of two dozen speakers, including Lisa Green of Common Crawl, Emily Gore of DPLA and Rodrigo Davies of the MIT Center for Civic Media Labs, invited to share their views and work during our annual summer meeting. Some of the speakers were not directly involved in the preservation of or long-term access to cultural heritage, scholarly or scientific digital materials. But we like to invite speakers who expose our audience to the perspectives of organizations creating, consuming and accessing digital information. Why?
One of the goals of our annual meeting is to support the development of expertise in digital preservation through education and training of working professionals and students. By hearing from a community of data producers and researchers as well as practitioners and stewards of digital information, we can better understand together the current challenges, and potential collaborative solutions, of stewarding digital materials for future use and research value. People getting together in person to discuss issues, share ideas and work on solving shared problems is an activity we find invaluable and a core benefit of our work. We hope those of you who were able to attend gained new insights to help you in your practice and had a meaningful experience.
For a full run-down of the first two days of the meeting (July 23-24), I’d encourage you to read Mat Kelly’s trip report on Old Dominion University’s Web Science and Digital Libraries Research Group blog. It’s an excellent and comprehensive recap of the meeting, chock full of great quotes from speakers and lightning talks. Not only that, he captured videos of the speakers, which are a great resource! (We videotaped the presentations too, but our post-production process is not as fast as Mat’s.)
We were also thrilled to see that a couple of meeting attendees posted their talks on their own blogs. Sarah Werner, of the Folger Shakespeare Library, posted the text of her keynote talk, as did David Rosenthal, of Stanford University, for his talk on the “Green Bytes: Sustainable Approaches to Digital Stewardship” (PDF) panel. Barbie E. Keiser wrote a nice article for Information Today, Preserving Our Digital World, about her impressions of the first two days of the meeting.
Aside from the great presentations, a couple of highlights for us at the meeting included:
The release of the 2014 National Agenda for Digital Stewardship. Micah Altman, Director of Research at the MIT Libraries, rolled out the agenda, noting that the document integrates the perspectives of dozens of experts and hundreds of institutions to provide funders and other executive decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development.
Presentation of the NDSA Innovation Awards. The award winners were officially recognized during the meeting. Each of the winners talked briefly about their work and projects; we find it’s a nice way to mark achievements by organizations and individuals in the field.
As we did last year, we co-hosted a CURATEcamp alongside the main meeting on July 25. This camp’s focus was broadly the idea of exhibition: everything from faceted browsing, visualizations and displaying audio-visual materials to digital storytelling, social media as exhibition and interpreting digital objects. Sharon Leon, of the Roy Rosenzweig Center for History and New Media, Michael Edson, of the Smithsonian Institution, and Trevor Owens, of the Library of Congress, facilitated the day. Many of the session notes were captured in Google docs, which are available on the wiki. You can get a deeper sense of the topics and issues discussed just by reading the notes.
Presentations are now available on the NDIIPP website and videos will be added as they become available, starting later in August.
For those in the Washington, DC area who missed the meeting, there will be an NDIIPP briefing, Review of Digital Preservation 2013 Meeting, on Tuesday, August 6, at 11:00 am in the Pickford Theater, on the third floor of the James Madison Building at the Library of Congress. The presentation is free and open to the public.
In this installment of the Content Matters interview series of the National Digital Stewardship Alliance Content Working Group we’re featuring an interview with Ben Blackwell, the “psychedelic stooge” at Third Man Records.
Third Man’s owner, musician Jack White, has a deep and abiding interest in musical anthropology in all its forms, while being strongly forward-thinking at the same time. His recent announcement of support for the National Recording Preservation Foundation reflects this interest along with his generosity.
Ben was a big part of our Citizen Archivists and Cultural Memory panel at the South By Southwest 2013 conference and has been with Third Man from before the beginning.
Butch: Tell us briefly about what Third Man Records does and the philosophy behind it.
Ben: Third Man was started initially as an insurance policy to prevent the White Stripes from getting ripped off when they started signing to major labels. It only existed on paper. Come 2009 we started in earnest as an actual label pressing records. While primarily known as a machine that handles all the projects that come out of Jack White’s head, we’ve been branching out more and more into things without his fingerprints on them, whether they be new artists, reissues, or classic Detroit recordings I can possibly sneak past Jack while he’s not looking.
Butch: Tell us briefly about your background and how you ended up at Third Man.
Ben: It’s the family business…Jack is my uncle. I started out carrying amps into bars at the earliest White Stripes gigs. I was 15 years old. Once they had a 7″ out I ran the merchandise table. By the time they’d graduated to an actual website I was in charge of all the information on there and the mailing list. Come their first cross-country tour in the summer of 2000, I’d just graduated from high school and turned 18 so I jumped in the van with Jack and Meg and had the absolute best learning experience I could ever ask for.
In the record business…I began as an unpaid intern at Italy Records, the super-small but super-important Detroit indie label that released the first two White Stripes singles. Mainly filling mail orders. Occasionally being tasked with calling distributors. This was 1999, I was 17 years old. Italy turned dormant by mid-2002 and come January 2003 I was starting my own label, literally in Italy’s image, called Cass Records. I ran it out of my bedroom for five and a half years before Jack called me with the idea of Third Man in Nashville. He said, “You’ve spent the past few years learning the vinyl process and everything involved. You know the White Stripes catalog better than anyone else. I can’t do this label without you.” Luckily, Detroit in 2009 wasn’t offering me any salaried record label positions so the timing was opportune.
Butch: What does the current workflow look like for how recordings come to Third Man? Are most Third Man masters created using analog recording technologies or digital?
Ben: If it’s a recording that’s generated brand-new from Jack’s studio or our live room, it’s pretty much 100% analog. Sometimes when we go back to release older, archival things they may be on a format that lends itself to digital transfer or clean-up…I’ve dealt with far more DATs and ADATs than someone my age should reasonably expect. And if we’re working with a licensed master, it’s very seldom analog. Things like the Public Nuisance LP and Loretta Lynn’s “Van Lear Rose” were cut from original analog mixdowns, but those are the exceptions to the rule.
Butch: No matter the recording workflow, you’re still faced with the challenges of preserving analog and digital materials. It seems that Third Man has been more thoughtful than most independent labels in recognizing the value of long-term preservation. What led you to think more deeply about the long-term stewardship of your own materials?
Ben: We’ve been very lucky in that while Jack’s lawyers have always been very shrewd in making sure that the legal rights to his masters will always revert back to him, Jack has personally made sure that POSSESSION of his masters never gets too far out of his reach. That being said, once you have everything (and we do, pretty much, have everything) the question of what to do with it and how to do it becomes that much more serious. We’re lucky in that we own our own building and were able to put in a custom master tape storage vault.
Butch: What have been some of the biggest technical challenges you’ve faced in preserving your own audio materials? Describe one of your most interesting preservation challenges.
Ben: As of right now, it’s space. Two inch tape carries a big footprint! We had a machine try and eat an ADAT just last week. Fortunately Nashville is the kind of town with folks who still know how to deal with archaic technology. Thankfully I don’t have to be terribly hands-on in a situation like that.
Butch: How widespread is an awareness of digital stewardship and preservation issues in the music industry?
Ben: With folks I know and deal with, it’s non-existent. Folks don’t think to back-up a hard drive they recorded on or even save the layouts for their artwork. It’s hard to think of that stuff as an asset (or even a future asset) when you’re struggling just to get it out. The bigger the artist the more likely they are to care…but I’ve yet to be friends with someone that didn’t have to work their way up a ladder to becoming a “big” artist. With that in mind, who’s keeping track of all the early stuff?
Butch: Libraries, archives and museums have come to rely on “citizen archivists” like you to take the lead in capturing, preserving and making accessible overlooked corners of our cultural heritage. Do you have any thoughts on what the role of LAMs should be in relation to the work that you do? Should LAMs take a more aggressive role in the early capture and preservation of pop cultural materials or should they continue to rely on collectors and the marketplace for early capture and preservation?
Ben: LAMs should be making their holdings available to as wide an audience as possible. The problem is, things are donated to these institutions all the time, but the processing of material is absolutely glacial in its pace. If I were to just GIVE all my pertinent Detroit/Michigan records to a university, they will just sit there for A WHILE before they’re properly cataloged and/or made accessible to the public. Meanwhile, [name] can donate his papers and give ‘em $2 million while he’s at it and that assures his work will be dealt with and handled properly and promptly.
In my circle of friends, it’s often said “don’t give your records to libraries/museums…they will just sit on a shelf.” Which I hate to say and even think, but it’s pretty true. In my dreams, all these institutions would be able to scan and transfer the entirety of their holdings and make them available on an easily-navigable website. While I enjoy holding actual original copies of things more than anybody, the unwashed masses of “the public” don’t need to be manhandling one-of-a-kind records. But if they had easy access to them via the web…I see that as properly serving the populace.
Butch: We were first made aware of the label’s interest in preservation in a New York Times article that described the room in the Third Man offices called “the Vault.” Describe how the idea for the Vault came about and what’s interesting about it technically. How easy would it be for other independent labels to create their own “Vault,” and should they?
Ben: Master tapes had been sitting in Jack’s closet at home for nigh-on ten years by the time the Third Man building was being retrofitted. It just made sense to clean out the closet. Technically it’s climate-controlled with a door that is fireproof and door insulation that is smoke-reactive. It has blocked off air-vents so no Tom Cruise, Mission Impossible break-ins. Poured concrete cinderblock walls. Lasers…don’t get me started on the lasers. I don’t think it would be too easy for other labels to implement a Vault on a similar scale, but not everyone needs what we have. To be honest, a closet works fine for most indie labels.
On a crisp, clear January day in Santa Fe, New Mexico, Lucinda Marker and her husband, John Tull, stepped inside an Airstream trailer that StoryCorps converted into a mobile recording studio. Marker and Tull were there to interview each other for an audio memento, to reminisce and talk about significant moments in their shared lives, especially about the time they were gravely ill with the bubonic plague during a vacation in New York City.
A StoryCorps facilitator greeted them, explained how the recording process worked, sat them down on either side of the kitchenette table, adjusted their microphones and — when the couple was comfortable and ready — began the recording session. As Tull and Marker chatted over the next forty minutes, the facilitator, sitting off to the side, jotted down notes and keywords correlated to time-code points in the recording.
The couple talked about how in 2002, in post-9/11 New York, their illness was suspected as a possible act of bio-terrorism and about how Tull slipped from flu-like symptoms into a coma that lasted almost 90 days. In his rich baritone, tinged with a southwestern accent, Tull sounded John-Wayne-tough as he described possible reasons why his spirit hung onto life. As they wrapped up their conversation, the tone of their voices grew tender as they talked about the healing power of their love for each other and how it helped them endure their ordeal.
After the session ended, the facilitator took photos of the couple and gave them CD copies of their interview. Marker and Tull signed release forms, left the trailer and went on their way. Then the digital preservation began.
Staff processed the audio recording and the documents related to the interview and then temporarily stored the entire package (what StoryCorps calls the interview record) with other interview records. Marker and Tull’s story began a curated journey that would take it from the trailer, cross-country over the Internet to StoryCorps headquarters in Brooklyn, NY, for processing and archiving, and later by hard drive to the American Folklife Center at the Library of Congress and into the Library’s digital repository.
Dave Isay, award-winning radio documentary producer and MacArthur Fellow, founded StoryCorps in 2003 to give the general public an opportunity to tell and record their personal stories. One of the guiding principles of StoryCorps, a principle embodied in all of Isay’s works, is that each person has a story to tell and every voice matters.
Some stories archived by StoryCorps are poignant, some are infuriating. Some are touching and some are horrifying. Some are mundane and some are thrilling. But all are intimate and real. For most participants, the interview is a rare opportunity to ask questions of loved ones, to talk about their thoughts and feelings, and especially to be heard; one of StoryCorps’ books is titled, Listening is an Act of Love and one of their poster taglines is, “Ask now. Listen forever.”
Facilitators are present during the interviews to help, but they don’t actually conduct the interviews unless people ask them to. Mostly the interviews consist of friends or loved ones conversing with each other. As for what participants talk about, they often come in with some topics or ideas in mind, or they improvise. StoryCorps also offers a list of suggested questions.
The volume of StoryCorps interviews is much too large for staff to play back and review in detail, so during the recording of each interview, facilitators play a crucial front-line role in summarizing the interview content and adding keywords in real time. Staff are also vigilant for audio gems, listening for recordings that are especially interesting; if an interview stands out, staff will alert StoryCorps headquarters that the interview might be a good fit for a broadcast, either nationally on NPR or on a local station. The end result for broadcast is a three-minute extract from the forty-minute interview.
Most people grant permission for StoryCorps to archive the interviews and to share them for radio broadcasts, animations, research and with partner organizations. Some interviews are so colorful that StoryCorps enhances them with animation.
StoryCorps currently has stationary, soundproof recording centers in Atlanta, San Francisco and Chicago. They also have a traveling mobile recording unit, the one Marker and Tull used, and a door-to-door service. StoryCorps is staffed with professional archivists, librarians, engineers and information technologists to ensure that the interviews are properly conducted, recorded and archived in accordance with the best technological and institutional practices.
Isay had the foresight from the beginning to only record the interviews digitally and, realizing the unique value of the content, he arranged for the Library of Congress’s American Folklife Center to archive the ongoing collection. Bert Lyons, Digital Asset Manager at the Library of Congress, said that StoryCorps’ collection represented a significant milestone because it is one of the Library’s earliest 100% born-digital collections.
To achieve high sound quality, StoryCorps staff digitally record interviews as uncompressed PCM audio streams in WAV files at a resolution of 96,000 samples per second (96 kHz) and 24 bits per sample. This is archival quality, a much higher resolution than the average commercial CD.
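For a rough sense of what that resolution means in storage terms, here is a quick back-of-the-envelope calculation in Python. The stereo channel count and forty-minute length are my own assumptions for illustration, not StoryCorps specifications; for comparison, a standard audio CD is sampled at 44.1 kHz with 16 bits per sample.

```python
# Back-of-the-envelope data rate for uncompressed PCM audio.
# Channel count and duration are assumptions for illustration,
# not StoryCorps specifications.
SAMPLE_RATE_HZ = 96_000       # samples per second
BIT_DEPTH = 24                # bits per sample
CHANNELS = 2                  # assumed stereo
DURATION_SECONDS = 40 * 60    # assumed 40-minute interview

bytes_per_second = SAMPLE_RATE_HZ * (BIT_DEPTH // 8) * CHANNELS
total_bytes = bytes_per_second * DURATION_SECONDS

print(f"{bytes_per_second / 1_000_000:.2f} MB per second of audio")
print(f"{total_bytes / 1_000_000_000:.2f} GB for one interview")
# Roughly 0.58 MB per second and about 1.4 GB per interview, which helps
# explain the terabyte-scale hard drives shipped to the Library.
```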
They take individual and group portraits of the participants and then they scan the facilitators’ notes and the interviewees’ release forms to PDF files, assign identifiers to all the related files and bundle everything into one interview record.
All interview records go to StoryCorps headquarters for processing and archiving. At remote StoryCorps’ recording stations and mobile recording units, staff regularly upload batches of interview records to headquarters’ servers. At the time of each transfer to the Library of Congress, they run checksums to document the integrity of the files — fixity checking — and package the interview records in BagIt format.
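For readers curious what that packaging step looks like in practice, here is a minimal sketch using bagit-python, the Library of Congress’s open source BagIt library. The directory path and metadata are hypothetical placeholders, and this is an illustration rather than StoryCorps’ actual transfer code.

```python
# A minimal sketch of packaging an interview record as a BagIt bag.
# Requires the bagit-python library (pip install bagit). The directory path
# and metadata below are hypothetical examples, not StoryCorps' own values.
import bagit

# make_bag restructures the directory in place, moving the payload into data/
# and writing manifest files with one checksum per payload file.
bag = bagit.make_bag(
    "interview_records/2013-07-001",                      # hypothetical directory
    bag_info={"Source-Organization": "Example Recording Site"},
    checksums=["sha256"],
)

# Validation recomputes the checksums and compares them against the manifest,
# which is the fixity check described above.
print("bag is valid" if bag.is_valid() else "fixity problem detected")
```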
Dean Haddock, StoryCorps’ senior manager of information technology, said, “We ensure the fidelity of the file every step of the way. It’s checksummed everywhere.”
StoryCorps headquarters staff process interviews around the clock and convert copies of the WAV audio files to MP3s for online access. They store their data in customized databases and replicate everything on backup servers. Then every three or four months, StoryCorps loads a hard drive with about 1TB of interview records and ships it to the Library of Congress.
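Deriving MP3 access copies from WAV masters is the kind of step commonly handled with a command-line tool such as ffmpeg. The call below is a generic sketch of that conversion, with placeholder file names and an assumed bitrate; it is not a description of StoryCorps’ actual pipeline.

```python
# Generic sketch of deriving an MP3 access copy from a WAV master with ffmpeg.
# File names and bitrate are placeholders; this is not StoryCorps' actual pipeline.
import subprocess

subprocess.run(
    [
        "ffmpeg",
        "-i", "interview_master.wav",   # preservation master (placeholder name)
        "-codec:a", "libmp3lame",       # MP3 encoder
        "-b:a", "192k",                 # access-copy bitrate (an assumption)
        "interview_access.mp3",
    ],
    check=True,
)
```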
At the Library, AFC staff members transfer the interview records off the drive and onto a server, and then conduct an inventory of the contents using the Library’s content management system, which they call the Content Transfer System. The CTS checks the StoryCorps files for viruses, runs checksums against the BagIt file manifests to verify that the content is intact (kind of like reviewing an invoice to confirm that what you received is the same as what the invoice says was sent) and then moves the files to their final destination on a tape archive. The Library keeps backup copies of all its digital collections onsite and another copy replicated at a remote geographic location.
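The manifest check is easy to picture with a few lines of standard-library Python. This is a simplified illustration of reading and verifying a BagIt sha256 manifest, not the Library’s actual Content Transfer System code; the bag directory is a placeholder.

```python
# Simplified illustration of checking files against a BagIt manifest
# (not the Library's actual Content Transfer System code).
# A manifest-sha256.txt lists one "<checksum>  <relative path>" pair per line.
import hashlib
from pathlib import Path

bag_dir = Path("interview_records/2013-07-001")   # hypothetical bag directory

for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
    if not line.strip():
        continue
    expected, relpath = line.split(maxsplit=1)
    actual = hashlib.sha256((bag_dir / relpath).read_bytes()).hexdigest()
    status = "OK" if actual == expected else "MISMATCH"
    print(status, relpath)
```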
The Library maintains a certain amount of its own metadata about the StoryCorps files it archives but the descriptive metadata associated with the interview records comes from StoryCorps’ own databases. The Library also archives copies of those databases as well as StoryCorps’ software code.
“We want to document their efforts,” said Lyons. “But if something were to happen to StoryCorps, if they had some cataclysmic event, we would be able to reboot a repository for them.”
The Library of Congress and StoryCorps work closely together and hold regular meetings in which they explore collaborative ways to improve the process and refine the archive. It is a tight, efficient archival partnership.
“We talk about things like how to improve controlling subject headings,” said Lyons. “Even things like how to improve tagging within their systems and how to improve description and so forth. Their adoption of the BagIt standard has been really useful. It moves the generation of fixity checksums that much closer to the moment of creation. It’s also taken some of the computing responsibility for inventory off of us and puts it on their computers. That makes life a lot easier.” Lyons said that one of their future collaborative goals is to try transferring the collection shipments over the Internet.
StoryCorps continually explores ways to reach out, to make its services available and digitally preserve new voices for posterity. They have a range of initiatives, such as the Military Voices Initiative for post-9/11 veterans, active-duty service members and their families; StoryCorps Legacy for people with serious illnesses; StoryCorps Historias for Latinos; the Griot Initiative for African Americans and the September 11th Initiative, which records the stories of people affected by the World Trade Center attacks of September 11, 2001 and February 26, 1993.
Over the past year, with funding from the IMLS and in collaboration with the American Library Association, StoryCorps provided grants to ten pilot libraries across the U.S. to help them develop and implement a StoryCorps model of oral-history gathering for their communities. The digital interviews gathered as part of this program — called “StoryCorps @ your library” — will end up at the Library of Congress.
The essence, the “it,” at the heart of StoryCorps — the digital recordings of almost 100,000 people telling personal stories — is such a new type of cultural treasure that, with the help of other visionaries, StoryCorps has been exploring some possible uses for its vast audio library and its associated metadata.
On April 30, 2012, StoryCorps held an advisory summit, funded by the Alfred P. Sloan Foundation, titled “Re-Imagining the Archive: New Approaches to Data-Driven Public Programming and Research.” Participants included data scientists, statisticians, archivists and librarians, oral historians and linguists.
Virginia Millington, StoryCorps’ recording and archive manager, said, “We brainstormed about ways this collection could be used in innovative ways, ways that benefit communities and ways that could serve as guides to other similar collections.” Long after the event, StoryCorps is still testing possible applications and solutions generated by the summit.
A model use for their collection is their collaboration with linguistic researchers from Oregon Health Sciences University and MIT.
“Our partners at MIT’s Lincoln Laboratory studied African-American Vernacular English,” said Millington. “Researchers took a representative sample of StoryCorps interviews and subjected them to computer analysis, specifically focusing on speech and dialect patterns.” One of the by-products of the research is that the Lincoln Laboratory created transcriptions of the interviews that they used, which, in turn, StoryCorps added back into their own interview records. This is significant because StoryCorps does not currently have the resources to transcribe all of the interviews in their collection.
Any profile of StoryCorps always has to come back to the precious intimacy and authenticity of each recording: people opening up about how they feel and what they think.
Dean Haddock said, “Part of what makes StoryCorps so special is the sacred space that we create for people to have this conversation and to listen to each other. The act of listening is hugely powerful. It strengthens and heals relationships and gives people an opportunity to talk about things they might not otherwise talk about.”
The value of each interview increases immeasurably when it becomes part of an archive of voices. Each story becomes an element of history, representative of a particular time, place, person and event, to be appreciated on its own and as part of a larger cultural mosaic.
StoryCorps is a new species of cultural institution that could only have emerged in the digital age. The staff of archivists, technologists and visionaries treat each interview record not only as a brief moment in someone’s life and a valuable cultural artifact, but also as a digital object that can be made searchable and relational, and therefore of potentially greater significance to future generations of researchers, in ways we cannot yet imagine.
Bringing Hidden Collections to Light with Viewshare: An Interview with Julie Miller, Historian at the Library of Congress
The following is a guest post from Camille Salas of the Library of Congress.
On June 11th, I had the opportunity to conduct a Viewshare presentation at the Library of Congress. In addition to a demonstration of how Viewshare works, I shared a few examples of how staff members at the Library of Congress are using the platform. One of the views I highlighted was created by Julie Miller, a historian in our Manuscript Division. Julie first heard about Viewshare in December and contacted me to help her set up a view. She ended up creating a truly unique view of maritime documents originating from the eighteenth and nineteenth centuries. Her view has the potential to reach many different audiences interested in a wide range of subjects. The following is an interview with Julie about the collection and her experience with using Viewshare.
Camille: During our first meeting, I really enjoyed the opportunity to see some of the documents you wanted to showcase through Viewshare. Please tell us about the collection.
Julie: As American ships traveled through foreign ports in the late eighteenth and early nineteenth centuries their captains had to negotiate a thicket of laws governing empire, trade, peace and war and disease. The result was that every ship accumulated piles of documents: clearances, bills of health, receipts of payments of customs and lighthouse duties, bills of lading, ship passports and more. The Manuscript Division has hundreds of these documents. Now that more than two hundred years have passed since most of them have fulfilled their original function, they have acquired layers of historical meaning. Stories about war and empire, diplomacy, slavery, epidemics, revolutions, privateers and pirates, and the careers of ship captains and colonial officials can be read in these documents.
Camille: What prompted you to think about this collection as a possible fit for using Viewshare?
Julie: The Manuscript Division has a goldmine of these documents in its Miscellaneous Manuscripts Collection, which consists of tiny collections of historical manuscripts, most of them small enough to fit in a single folder. Because they are arranged alphabetically, the ship papers, which consist of one folder per ship, each folder containing one to five documents, are scattered through the three hundred boxes of the Miscellaneous Manuscripts. I had been looking for a way to identify the ship papers with some kind of guide, so when I saw your Viewshare demonstration at the Women’s History meeting, Camille, I thought, Aha! that’s the way to bring this physically separated group of like items digitally together. And I thought Viewshare would make the information in the documents visible and quantifiable.
There was also something else – as I looked at more and more of these ship papers, I began to suspect that they were not especially miscellaneous. While their dates ranged across a century, from approximately the 1780s to the 1880s, and while many American and foreign ports were represented, most of the papers appeared to be clustered in a short time period, from about 1800 to 1812. Furthermore, many appeared to document trade between Baltimore and the Caribbean islands, mostly in the slave-grown crops of sugar, coffee, and cocoa.
To see whether many of the ship papers actually constituted a coherent collection of documents about Baltimore and the West Indian trade in the early nineteenth century, I chose a sample, made an Excel spreadsheet, and uploaded it to Viewshare. My sample consisted of a single accession: a group of approximately 127 documents representing about eighty-eight ships that the library bought from a rare book dealer in 1903 for $33. When I built a map and created tag clouds and lists, the dominance of Baltimore became obvious, as did its trade with Caribbean ports, especially the French colony of Saint Domingue, today, Haiti. Viewshare made it possible to see that these documents were not miscellaneous at all, but instead constitute a rich and meaningful collection.
Camille: Please walk us through the process of organizing the content and data for your view. For example, what kinds of decisions did you make with respect to the data you wanted to include?
Julie: My sample presented many challenges. The first was deciding what data to include. To ensure comparability, I ultimately decided on information that was common to most of these documents: ship name and type, ports, dates, captain, cargoes, and languages. While many are in English, the many foreign languages — Spanish, French, Portuguese, Dutch, Swedish, German and even Latin — were also challenging. I was lucky to have help from Library of Congress staff members and interns with specialized language skills. Taru Spiegel of the European Division, for example, translated an 1809 Swedish customs declaration of the brig Cyrus, which passed through the port of Gustavia in the Swedish Caribbean colony of Saint Barthelemy carrying sugar, coffee, anti-malarial “Peruvian bark,” and “old copper.” (When the King and Queen of Sweden came to the Library in April we showed them this customs declaration among other items documenting the long history of trade and friendship between Sweden and the United States.) This summer, Manuscript Division intern Crosby Enright translated a group of Spanish documents. My French, meanwhile, has come in handy.
Changes in place names have also proved challenging. During the French Revolution, for example, France and its colonial possessions jettisoned old royalist place names and replaced them with revolutionary ones. Once France emerged from its revolution, many of the old place names were restored. The Haitian city known today as Port au Prince, for example, was known as Port Republicain during the revolutionary years. While Viewshare recognizes and can assign map coordinates to Port au Prince, it does not recognize Port Republicain. The answer was to find the modern name for each old name. To preserve the old names I created a separate field for them. While I designated the old names as text fields, I designated the modern ones as location fields so that Viewshare could find the map coordinates for them.
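As a rough illustration of that approach, the sketch below keeps the historical port name in one column and writes the modern, geocodable name into another before exporting a CSV for upload. The field names and the single sample record are invented for the example; they are not Julie’s actual spreadsheet.

```python
# Illustrative sketch: keep a historical place name as text while recording
# the modern name in a separate column that a mapping tool can geocode.
# The field names and the sample record are invented, not the actual spreadsheet.
import csv

MODERN_NAME = {
    "Port Republicain": "Port au Prince",   # revolutionary-era name -> modern name
    "Saint Domingue": "Haiti",
}

records = [
    {"ship": "Example Brig", "port_historical": "Port Republicain",
     "cargo": "sugar, coffee", "year": 1805},
]

with open("ship_papers_sample.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["ship", "port_historical", "port_modern", "cargo", "year"])
    writer.writeheader()
    for row in records:
        row["port_modern"] = MODERN_NAME.get(row["port_historical"],
                                             row["port_historical"])
        writer.writerow(row)
```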
I am currently investigating ways to move beyond my sample to enter all of the ship papers into Viewshare, and eventually to add images to the database.
Camille: While creating your view, did you learn anything new about the documents or about how you wanted to represent them online?
Julie: The main thing I discovered, and that I hope to convey to researchers, is that these documents, which look deceptively mundane and were created for strictly bureaucratic reasons, are in fact rich with fascinating and detailed information. Some of this information is apparent in individual documents, while some only becomes visible when the documents are viewed as a group.
I also learned that these documents come alive when used together with other resources at the Library. Steve Davenport, the maritime specialist in the main reading room, guided me to several valuable reference sources. I was also able to learn more about some of these ships by digitally searching the shipping columns of eighteenth and nineteenth-century newspapers. These are available at http://eresources.loc.gov/ under “Historical News.”
Camille: What audiences would most benefit from the view and how do you envision them using it?
Julie: Historians will appreciate the way that Viewshare makes it possible to manipulate data in these documents. The trade in sugar, coffee, and cocoa documented in the ship papers opens many avenues for research for historians of slavery, the economy, consumption, and more. Epidemiologists might be interested in the way that governments attempted to control contagious disease by making ships present bills of health. Conservators at the Library have already expressed interest in the dye woods and other pigments that some of these ships carried. Genealogists will find the list of ship captains’ names useful. The beautiful engravings on many of the documents should be of interest to art historians and might also contain useful information about the ships, port cities, and lighthouses they depict.
Teachers should find the Viewshare table an unexpected way to help students understand history. I hope that Baltimore teachers will see the ship papers as a way to understand the history of their city. Caribbean teachers should find that the ship papers provide material for the study of their colonial past. Finally, I hope the fragmentary but evocative information the ship papers contain about past people, places, and events will spark the imaginations of creators of novels, children’s books, artworks, films, websites, and more.
The following is a guest post from Jane Mandelbaum, co-chair of the National Digital Stewardship Alliance Innovation Working Group and IT Project Manager at the Library of Congress.
In this installment of the NDSA Innovation Working Group’s ongoing series of innovation interviews, I talk with Thea Lindquist, an associate professor and history librarian at the University of Colorado Boulder. Through a project to digitize and enhance access to a collection of World War I materials, she became interested in the potential to increase interoperability and discovery across digital historical collections. In 2011 she spent the fall term at Aalto University in Finland working with the Semantic Computing Research Group on a World War I linked open data project, which is ongoing. The work was submitted to the 2013 Linked Open Data in Libraries, Archives, and Museums Summit as a challenge entry. The entry video is available online, as is a demonstration of the project.
She is particularly interested in the geospatial-temporal aspect of linking data, springing in large part from her previous work as a geospatial information librarian at the University of Michigan.
Jane: Can you tell us how you got interested in the project you worked on as a Fulbright scholar?
Thea: It started with a related project a colleague and I were working on to create a user-centered digital tool for work with online primary sources. As a part of this work, we conducted a user needs assessment with humanities students and faculty at CU to pinpoint what would make it easier and more interesting for them to engage with these sources. The big takeaways were not entirely unexpected: improved findability of documents – and the data within them – on specific people, places, topics and timeframes, as well as more historical and archival context for the documents and data. In brainstorming ideas for the tool, I learned about linked data and how it could help ameliorate many of the problems associated with working with online primary sources. A big one is the findability of the sources – users often find the metadata inadequate to expose individual sources, and especially sections within them, with the desired granularity. Also, since similar concepts are expressed in different ways across texts, keyword searching is haphazard. Online primary sources are even more susceptible to decontextualization, since keyword searching encourages users to look for snippets of a document in which a given term is mentioned and then skip forward to the next occurrence, rather than reading the document in its entirety. Search engines and collections of links to online sources can contribute to this problem by disaggregating individual documents from their archive of origin. Another issue is lack of context, which is necessary for many users, especially students and non-experts, to engage with the substance of the material. This context can include displaying the relationships between individual documents as well as resources that help explain how each document, and the information within it, fits into its historical context. Even with relevant sources and adequate context, users may struggle with further challenges inherent to primary-source research: foreign languages, document bias, historical usage, orthography, grammar, paleography/typography, etc. Once the utility of linked data for the purpose of addressing these problems – at least in an ideal implementation – was apparent, I needed time to learn more and find partners with specialized expertise on the technical side, so I decided to write up a Fulbright project to do just that.
Jane: How did you start working with the Semantic Computing Research Group at Aalto University in Helsinki?
Thea: At the time I wrote the grant, SeCo was one of the few groups that had published research on a Linked Data approach with digital cultural heritage materials, and particularly with digitized primary sources. I was fortunate that the director, Eero Hyvönen, and his group were as interested in testing an innovative approach as I was: namely, going beyond the metadata to deep-link into online primary sources and demonstrate to what extent we could improve access to and context in the sources in CU’s World War I Collection Online, testing both manual and automated methods of semantic annotation. SeCo developed several of the tools we have been using in this process, particularly the SAHA browser-based semantic annotator.
Jane: How do you describe to people what semantic computing might do for them?
Thea: Usually I say that it associates related concepts, increases findability, context and interoperability, enables semantically rich services (like faceted searching, content recommendations, and visualizations) and allows re-use, re-mixing and re-presenting of data. If they look puzzled, I start by comparing the current web of documents to the web of data. When you search for a term on the web of documents, the computer looks for the string of characters you entered, and it has no idea what meaning is associated with those characters. When it finds matches, it returns the documents in which they are found, and it is up to you to slog through those and figure out if any of the matches are indeed relevant. If you look for “buck”, you could get documents about a male, antlered animal, a dollar, throwing (a rider) by bucking, giving someone a ride on your bike (this usage may be limited to Minnesota)…you get the picture. On the web of data, supported by ontological structures and intelligent applications, the computer can understand that the word “buck” might have different meanings and what those might be, and it will ask you “are you interested in the monetary unit?” (among other things). If you say yes, it will direct you to the relevant data residing within documents rather than the entire document, whether the character string says “buck”, “dollar” or “single”.
In the historical context, it can help users find information in a variety of languages, for example, about places with alternate names or whose spellings have changed over time (Bratislava/Prešporok/Pressburg [formerly Preßburg]/Pozsony) and geographies that have merged with, split from and been subsumed by other entities with which they are associated (Bohemia/Czechoslovakia/Czech Republic). It also allows searches across all Linked Data and surfaces it to the top level where users are searching. From the perspective of digitized cultural heritage collections, which are often hived off in databases under institutional web sites, this is hugely useful. There are also some good resources out there to point people to for examples, like the sig.ma semantic search engine, Europeana’s “Linked Open Data – what is it?” video, and SeCo’s CultureSampo semantic portal.
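To make the idea concrete, here is a tiny, self-contained sketch using the Python rdflib library (my own illustration, not one of SeCo’s or the WW1LOD project’s tools). A single place resource carries several alternate labels, so a search on any historical name resolves to the same entity.

```python
# Minimal linked-data sketch with rdflib (an illustration, not a SeCo or WW1LOD tool).
# One place resource carries several alternate labels, so a query on any
# historical name resolves to the same entity.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/place/")   # hypothetical namespace
g = Graph()

bratislava = EX["bratislava"]
g.add((bratislava, SKOS.prefLabel, Literal("Bratislava")))
for alt in ("Pressburg", "Pozsony", "Prešporok"):
    g.add((bratislava, SKOS.altLabel, Literal(alt)))

query = """
SELECT ?place WHERE {
  ?place (skos:prefLabel|skos:altLabel) ?name .
  FILTER(STR(?name) = "Pressburg")
}
"""
for row in g.query(query, initNs={"skos": SKOS}):
    print(row.place)   # -> http://example.org/place/bratislava
```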
Jane: How did you get interested in using the UC Boulder collection of World War I primary materials?
Thea: The collection was a surprise discovery while I was doing my first review of the history collection for offsite storage. In one of the many ranges of compact shelving in the basement, I came across 56 bound volumes with the title “World War pamphlets”. The material in them was amazing, and the only point of access was a skeletal record in the catalog with the same title and one subject heading, “World War, 1914-1918”, i.e., next to no access. On top of that, the paper was terrible, and the materials were deteriorating rapidly. I realize now that one of the reasons they had not yet turned to corn flakes was that they had resided so long undisturbed in the basement. There is a lot of interest in World War I on the CU campus. For preservation and access purposes, not just for history classes at CU but for the wider world, I proposed the collection be digitized and made keyword searchable. We funded the digitization project through several grants. The WWI Collection Online is currently the CU Libraries’ largest digital collection.
Jane: How do you think institutions with primary materials collections (like the WWI collection) can take advantage of linked data to improve the access and use of their collections?
Thea: Institutions can start with the low-hanging fruit – their metadata. There are low-barrier tools available now that will allow them to make their digital collections more discoverable using linked data principles, like Viewshare. As you know, Viewshare allows institutions to easily generate and customize visualizations like timelines, interactive maps, and tag clouds – things that we did the hard way not too long ago using a variety of tools! Users really appreciate having a variety of ways to explore the content, and institutions don’t have to have a programmer on staff to do it.
Jane: Where do you see the intersection of historians and librarians in working with digital collections of primary materials?
Thea: Historians bring the specialized knowledge necessary in their areas of expertise to projects drawing on digital collections, as well as ideas about how they and their students might best use these collections. Librarians are often the ones who digitize, organize, and make collections of value to historians accessible now and in the long term. Subject specialists, particularly ones who are technologically fluent, understand the needs of their aggregate user group (broad, as compared to the historians’ deep) and may serve as the common point of contact in multi-disciplinary groups working on projects to make these collections more accessible. Someone who is both a librarian and an historian might be able to take things a bit further in each area than they might have otherwise, but the input of experts – in the case of WW1LOD, WWI historians specializing in Belgium and France, metadata specialists, digital initiatives librarians, and of course computer scientists – is absolutely critical.
Jane: What do you think users find the most appealing about digital collections of primary materials?
Thea: Having the look of the original documents paired with the power of discovering and viewing the content in interactive ways, from keyword searching across a large corpus to visualizations of selected data points, at any time of day and from anywhere they have an internet connection.
Jane: How do you think visualization tools like maps and timelines benefit from linked data implementations?
Thea: In much the same way that all applications do. Linked Data fosters interoperability and the representation of instances – people, places, events, etc. – in different ways, e.g., by linking alternate name forms that were valid during certain time frames. It allows the applications consuming it to query and connect to Linked Data on the web and make inferences by drawing upon the ontologies underlying it. A map visualization could then display boundaries for the Roman Empire in the 2nd century AD and multilingual mapping of places (Wien/Vienna/Vienne/Vindobona), that is, if the necessary elements are there in terms of data and structure. One of the greater challenges in this scenario is the availability of historical boundaries so an accurate map can be generated on which to display the point data, and going that far back in time, current point data is also likely to be incomplete and less accurate. Another is access to geospatial ontologies with relevant historical coverage. I believe this will come, but it will take time and resources. A timeline application fueled by Linked Data could give a more nuanced display of events because alternate timeframes can be shown, again given the necessary elements. For instance, each of the major belligerents that fought on the Western Front in WWI (UK, France, Belgium, US and Germany) produced an official list giving the names and dates of engagements in which their troops took part, with inevitable discrepancies between them. For the Germans the Autumn Battle in Champagne ended on November 3, 1915; but for the French and Belgians, the 2nd Battle of Champagne ended on November 6, 1915. The timeline could show and compare these differing viewpoints.
Jane: You have described building a specialized vocabulary for describing the civilian experience in one country, Belgium, during the war, and building semantic frameworks for military events. How did these efforts get started and how can they be used?
Thea: We started with the civilian experience in occupied Belgium in WWI since the documents were richer there, but the vocabulary has since been extended to cover occupied France as well. This topic was selected for more intensive semantic linking not only because it was well-represented in the WWI Collection Online, but also because the impact of “total war” on civilian populations is an area of current scholarly interest. Most of the publications in the collection falling into this category deal with the hardships civilians suffered during the German invasion and occupation of Belgium and northern France, particularly atrocity incidents such as killings and worker deportations and the impact of military rule on day-to-day life. The general, event-based framework for WWI was planned from the outset as a contribution that could be of value to many cultural heritage institutions seeking to expose their WWI-related digital collections, particularly in the run-up to the centenary. It includes key military, political and social events, the basis of which was timeline data shared by the Imperial War Museum’s First World War Centenary Partnership Programme. It is meant to be shared widely, thus providing the “semantic glue” that binds separate datasets relating to WWI together and allows searching and browsing in the broader corpus. The specialized vocabulary, event-based framework and other structures we have created for this project will be made freely available for reuse via a data dump and SPARQL endpoint.
Jane: You worked previously as a geospatial information librarian. Can you talk about how that is different than being a maps librarian? Can you describe what you learned about preserving geospatial information?
Thea: The job title reflected the fact that I developed and helped users access not only analog resources but also digital ones. Much of my job was helping users in the humanities and social sciences find geographically referenced information and then use GIS to analyze and visualize it in ways that were meaningful to their research. The data really didn’t become a map until it had reached the visualization stage. It was a lot of fun to help one user find the town their grandparents came from in present-day Poland using historical gazetteers and then turn around and help another mash up data on how racial and social factors relate to unemployment in Flint, Michigan. At the time – over ten years ago – there weren’t many conversations about how we would preserve and provide longer-term access to our digital assets other than backing them up on hard drives and servers. The print maps were another story. They were housed flat in special map cabinets in an environmentally controlled area and received conservation and preservation treatment from a dedicated lab.
Since its founding in December 2010, the National Digital Stewardship Alliance has worked to establish, maintain, and advance the capacity to preserve our nation’s digital resources for the benefit of present and future generations.
In late 2012 the NDSA Coordinating Committee, in partnership with NDSA working group chairs, began brainstorming ways to leverage the NDSA’s national membership and broad expertise to raise the profile of digital stewardship issues with legislators, funders and other decision-makers. The National Agenda for Digital Stewardship became the vehicle for highlighting for decision-makers, on an ongoing, annual basis, the key issues that most affect digital stewardship practice.
The NDSA is excited to announce the release of the inaugural Agenda today in conjunction with the Digital Preservation 2013 meeting.
“The Agenda identifies our most pressing digital preservation challenges as a nation and gives us the direction to deal with them collaboratively,” said Andrea Goethals, the Digital Preservation and Repository Services Manager at the Harvard University Library and one of the Agenda’s authors.
Effective digital stewardship is vital to maintaining the public records necessary for understanding and evaluating government actions; the scientific evidence base for replicating experiments and building on prior knowledge; and the nation’s cultural heritage. But in the current resource-challenged climate, digital stewardship issues often get lost in the shuffle.
Still, there is broad recognition that ensuring today’s valuable digital content remains accessible, useful and comprehensible in the future is a worthwhile effort, one that supports a thriving economy, a robust democracy and a rich cultural heritage.
The 2014 National Agenda integrates the perspective of dozens of experts and hundreds of institutions to provide funders and other executive decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity, and key areas for development.
The Agenda informs individual organizational efforts, planning, goals, and opinions with the aim of offering inspiration and guidance and suggesting potential directions and key areas of inquiry for research and future work in digital stewardship.
The Agenda is designed to generate comment and conversation over the coming months in order to impact future activities, policies, strategies and actions that ensure that digital content of vital importance to the nation is acquired, managed, organized, preserved and accessible for as long as necessary.
In addition to the discussions during the Digital Preservation 2013 meeting, a series of webinars will be scheduled over the next few months to provide further opportunities for the digital stewardship community to learn more about the agenda and explore opportunities to put it into practice.
The release of the inaugural Agenda is an important milestone in digital stewardship practice. For more information follow the activity on Twitter (hashtag: #nationalagenda or @NDSA2) and read more about the NDSA and the Agenda on the Signal.
We’d love to hear your thoughts on the Agenda in the comments.
The following is a guest post from Michael Mastrangelo, a Program Support Assistant in the Office of Strategic Initiatives at the Library of Congress.
The Midwest doubles down on its commitment to digital preservation with its second digital preservation Train-the-Trainer event in two years. Hosted July 9 – 12, 2013, by the Consortium of Academic and Research Libraries in Illinois (CARLI) in partnership with the Library of Congress’s Digital Preservation Outreach and Education (DPOE) program, the workshop expanded the Midwestern training network started last year in Indiana.
George Coulbourne, Executive Program Officer, notes, “Now that the Midwest has the largest population of trained digital preservation practitioners, they are raising the standards of practice in the region and adding historical and economic value to Midwestern digital collections.” Held in Urbana-Champaign, home of the University of Illinois Graduate School of Library and Information Science, this training builds on Illinois’s investment in the information sector.
DPOE, started in 2010, seeks to foster national outreach and education about digital preservation, using a Train-the-Trainer model. Starting with a three-and-a-half-day training of a small group of dedicated practitioners, DPOE plants the seeds of regional networks that train and advocate for digital preservation. Those who complete the Library of Congress’s workshop, called Topical Trainers, build their own teaching tools and go out into their home organizations to spread the training. There are currently 63 topical trainers across 33 states who have trained over one thousand practitioners in their homespun workshops and webinars.
After learning of the success of DPOE’s Indiana Train-the-Trainer event in August of 2012, David Levinson, member of CARLI’s Digital Collections User Groups, reached out to the Library to set up a trainer network in Illinois. Acting as DPOE’s partner, CARLI secured training funds from the Institute of Museum and Library Services (IMLS), whose generosity has funded prior DPOE and National Digital Stewardship Residency efforts. CARLI ran a competitive application process, secured the venue and handled key logistical arrangements while still managing their state-wide library resources and training events.
“They (CARLI) went above and beyond our expectations,” Coulbourne noted, “by having the attendees sign contracts pledging to do their trainings within a year. CARLI has been a great partner and they are utilizing the DPOE training network resources to their fullest.”
DPOE’s current anchor instructors are Robin Dale of LYRASIS, Mary Molinaro from the University of Kentucky, and Jacob Nadal of the Brooklyn Historical Society. This team represents some of the nation’s top digital preservation experts. Both Molinaro and Dale have been involved with the National Digital Information Infrastructure and Preservation Program and the National Digital Stewardship Residency. Their generosity in offering their service without any fees, along with the commitment from their organizations, makes the trainings affordable to smaller organizations like CARLI.
The real beneficiaries of the DPOE training are the trainees’ home organizations, which will be infused with basic digital preservation training. Illinois Institute of Technology, Lake Forest College, Eastern Illinois University, Newberry Library and many others in Illinois now have staff ready to train and practice digital preservation.
Coulbourne said that “One of DPOE’s most valuable attributes is its cost-effectiveness. The cultural heritage community needs quality training at a low cost. Digital preservation is a critical skill set but training current staff is often too expensive for smaller institutions. We don’t compete with the I-schools and professional organizations but work with them to fill in the gaps.”
I have had two conversations recently — one with an intern and one with a friend outside our community — about my career path, and career paths in general around digital preservation.
Paraphrasing, well, everyone (who may not know they are quoting the game Colossal Cave Adventure from 1976), my path was a maze of twisty little passages, though they were NOT all alike.
My original career goal, decided upon when I was 13 years old, was to be an archaeologist and be a curator in a museum. Yes, really, I decided that when I was 13.
I was distracted, though, by a desire to be in what I perceived as a more creative field, and I actually started college as a studio art major. It took four quarters for me to decide that I really did want to follow the dream I identified as a young teenager, and I switched to anthropology. I took archaeology courses and museum studies courses, which allowed me to do hands-on work in museum registration and collection management. I started my Ph.D. in archaeology, volunteering at the same museum. I knew I wanted to collect, preserve, and research cultural objects.
I was surveying the museum’s human skeletal remains to report our collection holdings as required by the newly announced but not yet enacted Native American Graves Protection and Repatriation Act. One fateful day in 1986 the collection manager came down to our storage room and asked me, “How would you like to move from the sub-basement to the basement?” Since there was actually natural light in the basement, I said yes. And I found myself working on a major records recon project: we were entering the entire museum accession history into a collection management system that ran on a Pick mini-mainframe as part of the project to inventory and pack the entire collection to move into a new building.
I was hooked.
I discovered concepts that were so new to me, from database schema design to data normalization to controlled vocabularies. Since this was an ethnographic museum, we were working with objects ranging across thousands of years and representing every culture and geographic area, and had vocabulary in probably 100 languages. And we were digitizing, imaging a large archaeological collection and linking those images to the database. I suddenly became aware of the power of entering and normalizing the records and digitizing the collections to improve scholarly and public access to these extensive and rarely seen objects. Creating metadata and digital surrogates was going to change archaeological research.
And suddenly I was no longer working with physical objects, but with records and digital surrogates. And that’s when things started to get twisty. In the years following that I worked in museum IT units and registrar offices, coordinating systems and digitization. I built databases and web sites. I worked in instructional technology, working with faculty to create online teaching resources. I worked in archives and libraries. And along the way I became increasingly aware of the fragility of what we produced: missing backups for digitized items, the lack of versioning of web sites and online courses, and, in some cases, policies to preserve storage space by intentionally overwriting or deleting courses or online exhibits. I recognized the need for the preservation of electronic records, the digitized and the born digital.
So how was this a career path? I learned the following skills and concepts that are now vital to me in digital preservation:
- Familiarity with IT infrastructure, to better understand what is feasible. This includes hardware, software and web development.
- The methodology of digitization across multiple genres of items, from text to images to audio and video.
- Familiarity with a wide range of file formats.
- Key metadata standards used in the community to describe physical and digital items.
- An understanding of the acquisition and processing workflows for collection building in cultural heritage organizations.
- Knowledge of intellectual property law. Everything we work with has rights associated with it.
No two people will have the same career path. Mine took me to museums, archives and libraries with a heavy emphasis on IT infrastructure, digitization and software/web development. Someone else may start on a more traditional library or archives path, while others will come out of software coding. The commonalities are a passion for collection building, a passion for preservation and a passion for learning new things. If digital preservation is anything, it is constantly changing, and requires constantly learning about new technologies and formats and possibilities. My job is never the same any two weeks in a row. And that’s the way I like it.
In early July I wrote about the “what” of email archiving. That is, “what” are we trying to preserve when we say we’re “preserving email.” It was admittedly a cursory look at the issue, but hopefully it’s a start for more thorough discussions down the road.
This time I’ll dig in a little deeper and highlight some of the “how” of email archiving: projects and approaches that are attempting to practically address email archiving issues.
What solution you choose depends, in the first instance, on whether you’re an individual or an institution. NDIIPP offers some high-level guidance for email archiving tailored to individuals (and smaller organizations) as part of our personal archiving tips, but this represents only one possible approach to an email archiving methodology. There are solutions available to individuals (including free ones), though some require more active management and resource allocation (that is, $$$) than others.
The Mobisocial lab at Stanford University has an interesting tool that runs on an individual’s computer called Muse. While not a preservation solution, exactly, Muse enables users to access and browse their personal email archives in a variety of creative ways.
Tools like Muse make it easier for end-users to access large collections of email without the collections being subject to significant upfront organizing, sorting or appraisal. Muse (and tools like it) enable a “bypass” approach that may be heretical to advocates of traditional appraisal, but its simplicity, ease-of-use and effectiveness make it valuable to individuals and small organizations that have pulled their email out of an email system but want to continue to access the files.
[A discussion between differing archival approaches (let’s call them “heavy appraisal” vs. “save everything” just to be reductive) may be too incendiary to get into at this point, but an illuminating take on the subject can be found in a 2011 blog post from the New York Digital Archivists Working Group.]
Recall the four main technical preservation strategies for email from the last post:
- Migrate email to a new version of the software or an open standard
- Wrap email in XML formats
- Emulate the email environment
- Retain the messages within the existing e-mail system
Muse falls largely under the first strategy. In a tipsheet they note that the tool can access a variety of data formats for email, but they prefer that the archived data be migrated to the mbox format, if not already in that form. The tool can also fetch email from one or more online email accounts, suggesting another migration process hidden under the hood.
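Because mbox is a plain-text, open format, even the Python standard library can read it directly. The few lines below show that kind of basic access to an archived account; the file name is a placeholder and this is not how Muse itself is implemented.

```python
# Minimal example of reading messages out of an mbox file using only the
# Python standard library. The file name is a placeholder; this is not Muse's code.
import mailbox

mbox = mailbox.mbox("archived_account.mbox")
for message in mbox:
    print(message.get("Date"), "|", message.get("From"), "|", message.get("Subject"))
```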
The Bodleian Library at the University of Oxford in the UK also undertook an email migration effort in 2011 and contributed to the “Preserving Email: Directions and Perspectives” conference that year.
The Collaborative Electronic Records Project is one of the most significant efforts to explore leveraging XML wrappers in email preservation (though XML conversion was not their only preservation approach). The project worked with the North Carolina Department of Cultural Resources EMCAP project to develop a parser that converts e-mail messages, associated metadata and attachments from mbox into a single preservation XML file that includes the e-mail account’s organizational structure. They also published an XML Schema for a Single E-Mail Account.
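As a greatly simplified illustration of the XML-wrapper idea, the sketch below folds the messages of an mbox file into a single XML document. The element names are my own invention and do not follow the CERP/EMCAP schema, and attachments are ignored.

```python
# Greatly simplified illustration of wrapping an email account in XML.
# The element names are invented and do NOT follow the CERP/EMCAP schema;
# attachments and multipart bodies are ignored in this sketch.
import mailbox
import xml.etree.ElementTree as ET

account = ET.Element("account", name="example-account")    # placeholder account name
for message in mailbox.mbox("archived_account.mbox"):      # placeholder file name
    msg_el = ET.SubElement(account, "message")
    for header in ("Date", "From", "To", "Subject"):
        ET.SubElement(msg_el, header.lower()).text = message.get(header, "")
    if not message.is_multipart():
        ET.SubElement(msg_el, "body").text = message.get_payload()

ET.ElementTree(account).write("example-account.xml", encoding="utf-8",
                              xml_declaration=True)
```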
The NDIIPP-supported Persistent Digital Archives and Library System project also released an open source software tool that extracts email, attachments and other objects from Microsoft Outlook Personal Folders (.pst) files, converting the messages into XML.
Why XML? As the Library of Congress Sustainability of Digital Formats page notes, XML satisfies most, if not all, of the listed sustainability factors, making it highly suitable as a target format for normalization.
As for the first two strategies, Chris Prom pulls them together under what he calls the “whole account approach.” This approach, he says, “reflects the traditional archival model of capturing records at the end of a lifecycle, then taking archival custody over them.”
He contrasts this with the “whole system approach,” which covers the third and fourth strategies above. This approach implements email archiving software to capture an entire email ecosystem, or a portion of that ecosystem, to an external storage environment.
Once captured it may take other tools to provide access. If you’re going to emulate your email environment you may just want to emulate the entire operating system. While not specifically about email, we took a long look at emulation as a service in an interview with Dirk von Suchodoletz of the University of Freiburg back in late 2012.
As for retaining the messages in the existing e-mail system, in some ways this runs counter to traditional archival practice. A 2008 Government Accountability Office report, looking at four federal government agencies, noted that “e-mail messages, including records, were generally being retained in e-mail systems that lacked recordkeeping capabilities, which is contrary to regulation.”
This “strategy” of benign neglect has a lot to do with the recordkeeping challenges posed by email, though efforts like the new “Capstone” approach from the U.S. National Archives are looking to streamline the process.
All of this is to say that there’s plenty of room for applied research in email archiving and preservation and the projects above suggest a variety of potential starting points. Now go to it!
Every year we’re thrilled to host a meeting with our partners and interested individuals in the digital preservation community. This year’s meeting, Digital Preservation 2013, features a number of speakers and presentations around exploring innovative ideas across the digital information landscape. Coming together to share stories and practices of collecting, delivering and preserving our digital materials is an effective way to address various obstacles to our collective and individual work.
Next week, July 23-25, over 200 attendees will gather together to hear from noted individuals, like Hilary Mason of bit.ly, Jason Scott of the Archive Team and Aaron Straup Cope of the Cooper-Hewitt Museum Labs, recognize the 2013 NDSA Innovation Award Winners, share current digital stewardship work in a lightning talks session (PDF), and attend smaller breakout sessions featuring tools and services, and discussions of education and professional development in the field. The last day of the meeting will feature CURATEcamp Exhibition, where participants will discuss ideas about the exhibition of digital collections dealing with narratives, storytelling and context.
We are particularly excited about our plenary panels this year. One panel that I wanted to highlight before the meeting is the “Green Bytes: Sustainable Approaches to Digital Stewardship” panel with David Rosenthal of Stanford University, Kris Carpenter of the Internet Archive, and Krishna Kant of George Mason University and the National Science Foundation. Joshua Sternfeld, Senior Program Officer from the National Endowment for the Humanities, organized this panel to explore green sustainability in digital preservation for cultural heritage institutions. While there has been some research and discussion in the technology, scientific and commercial fields on the topic of green data centers, there has been relatively little from the cultural heritage sector about what this means for the digital preservation community. The panel will outline the basic challenges and current efforts to find practical solutions. This abstract (PDF) is meant to provide a little more context for the session and encourage conversation and action beyond this meeting.
Registration for the meeting is full. But you can follow the event on Twitter through #digpres13 and @ndiipp will be live tweeting over the course of the meeting. The plenary speakers will be videotaped and presentations will be posted on our website later in August. We’ll announce those on this blog so please check back in with us! We’re interested in sharing the insights and conversations from the meeting over the next few months.
Preserving digital stuff for the future is a weighty responsibility. With digital photos, for instance, would it be possible someday to generate perfectly sharp, high-density, high-resolution photos from blurry or low-resolution digital originals? Probably not, but who knows? The technological future is unpredictable.
The possibility invites the question: shouldn’t we save our digital photos at the highest resolution possible just in case?
In our Library of Congress digital preservation resources we recommend 300 dpi/ppi for 4×8, 5×7 and 8×10 photos, but why not 1,000 dpi/ppi? 2,000 dpi/ppi? 10,000 dpi/ppi? Is there a threshold beyond which the pixel density is of little or no additional value to us? Isn’t “more” better?
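To put rough numbers on that question, here is a quick back-of-the-envelope sketch (ours alone, not an official recommendation) of what higher sampling rates mean for a single 8×10 photo. Because pixel count grows with the square of the sampling rate, doubling the ppi quadruples both the pixels and the uncompressed storage.

```python
# Rough, illustrative calculation (not official guidance): pixel counts and
# uncompressed 24-bit file sizes for one 8x10 inch photo at several sampling rates.

WIDTH_IN, HEIGHT_IN = 8, 10   # print dimensions in inches
BYTES_PER_PIXEL = 3           # 24-bit RGB, uncompressed

for ppi in (300, 600, 1200, 10_000):
    px_w, px_h = WIDTH_IN * ppi, HEIGHT_IN * ppi
    megapixels = px_w * px_h / 1e6
    megabytes = px_w * px_h * BYTES_PER_PIXEL / 2**20
    print(f"{ppi:>6} ppi: {px_w:>7} x {px_h:>7} px "
          f"(~{megapixels:,.0f} MP, ~{megabytes:,.0f} MB uncompressed)")
```

At 300 ppi the file is roughly 20 MB; at 10,000 ppi the same photo balloons to tens of gigabytes, which is part of why “more” has a real cost.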
Recently we received a comment at the Signal in response to a blog post in which the commenter expressed concerns about our ppi/dpi resolution recommendation. The commenter raised some intriguing issues and I asked two digital photo experts to respond to his concerns.
Barry Wheeler, one of the experts who responded, is a photographer, staff member of the Library of Congress and one of the digital photograph preservation researchers for the Federal Agencies Digitization Guidelines Initiative. Wheeler has also written several blog posts for the Signal about scanning and photo digitization.
David Riecks, the other expert, is a photographer, co-founder of Controlled Vocabulary and PhotoMetadata.org. Riecks has written several blog posts for the Signal about photometadata and about processing digital photos.
Below are the comments from all three people. Please read them through and decide for yourself what the best digital photo resolution for archiving is.
Mark S. Middleton wrote:
I am concerned that advising people to save at 300 dpi will result in lots of regrets for future generations. The quality of printing, computer monitors and televisions will continue to improve (and thus the ability to see details in higher quality imagery). Also, a person may want to zoom in and view just a portion of a scan or even cut out a piece (just their grandmother from a school group photo) all of which will suffer from 300 dpi.
I believe that 600 dpi is a better recommended minimum size. It’s better to build the quality into the original scan (saving as a TIFF), then save JPEGs from that for sharing with relatives or posting online (for smaller file sizes). I recommend looking at the “use cases” of scanned photography as well as better future-proofing recommendations. 600 dpi does cause larger files, but with hard drive prices coming down I believe the value is worth it.
David Riecks responded:
I think the answer really revolves around what you are scanning. For “photos” (i.e. a photographic gelatin silver print, or chromogenic dye print like RA4 process), you can scan at a higher resolution. However, in most cases, all you will see are the defects.
If the original you have to work with is a 4 x 6 inch print, and you scan it at 600 or 1200 pixels per inch, you could then make the equivalent of an 8 x 12 inch print, but it’s not likely to give you better quality. It will…take up much more space on your hard drive.
If you have a high-quality 8 x 10 inch glossy print, in which the image is sharp (no motion blur from the camera moving), it might be worth going to a higher sampling setting. But I would recommend that you do some tests first to make sure it’s worth it.
In my experience, higher scanning resolutions usually just give me more dust to spot out later and the enlarged images never look as good as the small original.
If you are scanning a b&w or color negative or a color slide, then you certainly want to scan at higher resolutions. Which is best has much to do with your intentions (now and in the future), the quality of the original and the type of hardware you are using to make the scan.
Many scanners advertise an interpolated sampling rate in their “marketing speak,” though you will often get better results scanning at the maximum “optical resolution” of the scanner.
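Before turning to Wheeler’s response, here is a small sketch of the enlargement arithmetic in Riecks’s 4 x 6 print example (our illustration, with assumed sizes, not his calculation): a higher sampling rate on a small print mostly buys you a bigger print, not more detail than the print actually holds.

```python
# Sketch of the enlargement arithmetic (illustrative numbers only).

def max_print_size(orig_w_in, orig_h_in, scan_ppi, print_ppi=300):
    """Largest print (in inches) a scan supports at the given print resolution."""
    scale = scan_ppi / print_ppi
    return orig_w_in * scale, orig_h_in * scale

for ppi in (300, 600, 1200):
    w, h = max_print_size(4, 6, ppi)
    print(f"4x6 print scanned at {ppi} ppi -> up to {w:g} x {h:g} inches at 300 ppi")
```

A 4 x 6 print scanned at 600 ppi holds the same pixel count as an 8 x 12 print at 300 ppi, but the extra pixels can only record what was in the small print to begin with.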
Barry Wheeler responded:
First, begin with how much detail is actually in the original. This amount of detail varies widely. A halftone screen for an old newspaper may result in less than 200 dpi actual. A modern lens on a quality black & white emulsion may be 2800 dpi.
In the old days (the 1990s), when scanning became widely available, 300 dpi was a good starting point because many, many books and documents did not contain more detail than that, and even today, 300 dpi is a good starting point.
For example, at the Library of Congress we currently print our digital photographs using high quality pigment printers that may claim a resolution of 1200 or 2400 or much, much more. But those are microdots of different color merged to produce the variety of shades of gray or color. Usually the printer driver produces a finished resolution between 240 dpi and 360 dpi.
Second, we need to sort out the term “resolution.” Scanners and cameras contain pixels and “sample” the image at a “sampling rate” depending on the distance between the camera and the image. So when people talk about “resolution” using 300 ppi or 600 ppi or 3000 ppi they are actually using the “sampling rate” of the device. But few devices are 100% efficient.
Common scanners may be only 50% efficient; cameras may be 80-95% efficient. Thus the actual resolution achieved at 300 ppi may only be about 200 ppi – higher ppi rates are the result of image processing which may give the appearance of sharper lines but which does not produce additional detail. Many scanners will claim 1200 ppi and produce less than 600 ppi true optical resolution. Federal Agencies Digitization Guidelines Initiative standards (http://www.digitizationguidelines.gov/) currently call for 80% efficiency for a 2-star outcome, 90% for a 3-star, and 95% for a 4-star. Many of our projects for prints and photographs and rare books are 400 ppi at 3-star levels, although some are much higher.
Third, many people want to enlarge an image. We often try to scan film – particularly 35mm film – at a resolution necessary to provide a final print at 300 dpi. So if you want a common 4″ x 6″ print you need a true resolution of 1200 ppi. Specialized film scanners and high quality camera setups can achieve this. Commonly available consumer flatbed scanners cannot. (If you read the fine print specifications, they will often say something like “true 2400 ISO sampling rate” not ISO “resolution.”)
But once you reach the limits of the device resolution and the detail in the original, then additional enlargement doesn’t help. I think I have a couple of illustrations of this in my most recent blog article about enlargement (http://go.usa.gov/j2q4). I don’t believe you can magnify a newspaper image and find additional detail in a scan with a true resolution above 300 ppi.
Finally, Apple claims that human vision is only capable of resolving 326 ppi (search online for their “Retina display” marketing materials). There is a lot of quibbling about that number, but most estimates still put the limit at no more than 450 ppi.
In the end, I doubt that you will see any significant improvement in an image of reflective materials beyond an ISO standard resolution of 400 ppi. I doubt you will find any improved image quality beyond an ISO standard resolution of 1200 ppi on consumer scanners, unless you scan 35mm film in a specialized, high-quality film scanner.
Two final notes. I believe the costs of higher resolution are vastly underestimated. Scan time will increase significantly with increased resolution. Transfer times increase, processing times increase. The expertise needed to get better quality increases. Storage and multiple backups increase. Consumer hard disk drives are not archival devices. Your children and grandchildren may not be able to retrieve images from a hard disk even 15 years from now. Increased image size means greatly increased cost.
And I believe 300 ppi / 400 ppi is future-proof. At least for reflective materials, I don’t believe we will see greater detail in a 1200 ppi scan no matter how much future equipment improves.
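To make the arithmetic behind Wheeler’s figures concrete, here is a short sketch (our illustration; the efficiency value and frame size below are assumptions chosen for the example, not measurements). The advertised sampling rate multiplied by the device’s optical efficiency gives a rough “true” resolution, and enlarging a small original multiplies the resolution you need from it.

```python
# Illustrative sketch of true resolution and enlargement requirements.

def true_resolution(sampling_ppi, efficiency):
    """Approximate achieved optical resolution for a device of given efficiency."""
    return sampling_ppi * efficiency

def required_scan_ppi(orig_long_in, print_long_in, print_ppi=300):
    """Sampling rate needed on the original to print at print_ppi after enlargement."""
    return print_ppi * (print_long_in / orig_long_in)

# A consumer flatbed advertising 1200 ppi at roughly 50% efficiency:
print(true_resolution(1200, 0.5))    # -> 600.0 ppi of real detail, at best

# A 35mm frame (about 1.5 inches on its long side) printed as a 4x6:
print(required_scan_ppi(1.5, 6))     # -> 1200.0 ppi true resolution needed
```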
The following is a guest post from Megan Phillips, NARA’s Electronic Records Lifecycle Coordinator and an elected member of the NDSA Coordinating Committee, and Andrea Goethals, Harvard Library’s Manager of Digital Preservation and Repository Services and co-chair of the NDSA Standards and Practices Working Group.
As part of the effort to publicize the NDSA Levels of Digital Preservation, and as a way to continue to invite community comment on it, several members of the Levels group wrote a paper about it for the IS&T Archiving 2013 conference. The paper, The NDSA Levels of Digital Preservation: Explanation and Uses, is available online.
At the conference we got interesting comments and one significant suggestion for improving the paper from Christoph Becker, Senior Scientist at the Department of Software Technology and Interactive Systems, Vienna University of Technology. We want to present his suggestion here and ask all of you for help in resolving it.
Christoph wrote that the major aspect of the Levels he would adjust is the label for the last function, “file formats.” You can see the table here. He pointed out that file formats are just one aspect of a larger preservation challenge related to how data (the bitstream) and computation (the software) work together to create the “performances” that we really care about. New content is often not even file-based. Format is just one element out of many that could be significant in preservation, and in some cases the format itself is almost meaningless. Often the real issues are related to specific features or feature sets (e.g. encryption), invalidities and sizes. (Petar Petrov tried to capture part of this problem in his blog post about content profiling.) If you consider research data, for example, the format could be known to be XML-based but have no schema available. The real preservation challenge might be that the data requires a certain analysis module (found here) running on a certain platform, which in turn depends on distributed resources, a certain metadata schema (found there) and a certain understanding of the semantics (found over here).
Christoph’s suggestion is that the overly specific label “file formats” in the Levels puts forward too narrow a view of the problem in question. The label could obscure the real challenge since it excludes part of the problem (and part of the potential community). He suggested possible replacements for the “file formats” label: “Diagnosis and action”? “Issue detection and preservation actions”? “Understandability”? For him, in fact, this is the heart of preservation, and if you look at the SHAMAN/SCAPE capability model that Christoph works on, the preservation capability really is all about the last two rows (operations including metadata), assuming that the bitstream is securely stored and managed.
We (Andrea and Meg) think that Christoph has a valid point, but we’re still not sure of the best label to capture the suite of interrelated issues that need to be addressed in the last row of the Levels chart. Christoph’s suggestions make sense in isolation, but they would overlap with activities in other rows of the chart, and don’t quite convey the concept we originally intended.
- Do you think “file formats” is clear enough as shorthand for these kinds of issues, given where most of us are in our practical digital preservation efforts, or does this need to be changed?
- What label would you use for the last row of the chart? (Content characteristics? Usability? Just plain “formats,” without “file”?)
- Are there other changes you think we should make to improve that row?
- Any changes you’d recommend to other parts of the chart?
In the Archiving 2013 paper, we said that any comments received by August 31, 2013 would influence the next version of the Levels of Digital Preservation, so please suggest improvements! We may come back to you again over the summer to help resolve other issues.