The Signal: Digital Preservation
Many of our readers may remember a unique blog post written by our former intern, Tess Webre. Tess took a very creative, educational approach to the subject of digital preservation and created Snow Byte and the Seven Formats, A Digital Preservation Fairy Tale.
This post turned out to be so popular (see the many comments), and had such visual appeal, that we were inspired to turn it into a video. So, here it is – Snow Byte and the Seven Formats, A Digital Preservation Fairy Tale, the video!
Snow Byte may have a tongue-in-cheek children’s book style, but the idea behind it is to illustrate the overall importance of digital preservation. Hopefully, this technology-oriented “fairy tale” will appeal to young people as an entertaining way to learn about this topic. As Tess mentioned in her earlier post, children are learning about digital material at a younger and younger age. So this story idea came about as an answer to the question, “What’s a good way to teach them about this topic?”
In the NDIIPP program, we are faced with the same question all the time, but mainly for adults – how can we get more people to pay attention to the increasing problem of digital loss? Since “Snow Byte” also includes such concepts as “metadata schema” and “proprietary file formats”, and highlights the issue of file backup, it could also appeal to library professionals or anyone else looking for a gentle introduction to digital preservation. In other words, Snow Byte is a fairy tale for all ages.
In case you were wondering who does the voices in the video, no, we didn’t happen to have a bona fide theater troupe available. However, this project brought out some otherwise hidden theatrical talent among the staff, who were brought together for this video as the “NDIIPP Radio Players”.
Does Snow Byte manage to evade the evil queen, and retrieve her magic spell? How does Snow Byte avoid digital disaster in the end? And who is this fellow, “Dublin”? Watch the video and find out. And, enjoy!
See “Snow Byte” and our other videos on digital preservation related topics on our video page.
I was at a recent meeting of the Federal Geographic Data Committee’s Coordination Group and Anne Castle, the Assistant Secretary for Water and Science in the Department of the Interior and the co-chair of the FGDC Steering Committee, was discussing the challenges of finding resources to support geospatial activity. The federal geospatial community is working with a reduced budget (for example, the FGDC recently announced the cancellation of their long-running CAP grant program for FY 2013 and 2014), but a chief concern of the participants was not just shrinking resources for geospatial activity, but the challenge of structuring funding in a way that facilitates and encourages cross-agency collaboration and long-term thinking.
We in the stewardship community are no strangers to this problem, but there’s much we can learn from the experiences of the geospatial community. The geospatial community not only provides governance models for how we might go about our business, but they also are building tools we can tap into to help us tackle stewardship issues across organizations and generations.
The National Digital Stewardship Alliance Geospatial Content subgroup is exploring ways to engage with the wider geospatial community, concentrating recent efforts on opportunities to engage with the federal government’s Geospatial Platform:
“The Geospatial Platform will offer access to a suite of geospatial assets including data, services, applications, and infrastructure that will be known as the geospatial Platform offering…The Geospatial Platform will support an operational environment, www.GeoPlatform.gov, where customers can discover, access, and use shared data, services, applications, and when appropriate, infrastructure assets.” – from Modernization Roadmap for the Geospatial Platform (PDF), pg. 10.
The Platform builds upon existing federal interagency geospatial initiatives to share data, develop collaborative programs and establish standard national datasets. It is a significant effort that has become a chief focus of energy and attention across the federal geospatial data community.
The NDSA Geospatial Content subgroup has held recent discussions with Platform planners, who are eager to help get us engaged. The Platform is still in early development, but several features have been implemented: the ability to explore featured maps; build your own maps based on available data; and create “communities” around common interest areas to share information and maps. The stewardship community might find the third feature particularly interesting, along with these other potential benefits:
- Make historic digital geospatial collections more immediately accessible in a forum with high visibility and a potentially significant user base;
- An infrastructure to house a clearinghouse of information on the stewardship of digital geospatial data;
- Access to advanced tools to create maps and make them accessible, with other technical services (preservation?) coming in the future;
- Access to advanced tools and services without a significant investment in technical infrastructure on the part of any individual organization;
- Redundant storage for some portion of the community’s digital maps;
- Engagement with the broad community of geospatial data creators and users, providing collaborative opportunities;
- A possible central point for digital geospatial data for stewardship capture purposes;
- A venue to explore the role of stewarding organizations in the management of digital geospatial information of long-term value across the entire lifecycle;
- Early adoption provides participants with more dedicated technical support resources and reputational benefits.
Of course, these potential benefits are offset by issues that the digital stewardship community must address before moving forward with any kind of engagement.
First off, our community will need to clarify its purpose(s) for engaging with the Platform or similar activities (the academic-centric OpenGeoportal project has some similarities to the Platform and may offer another outlet for digital stewardship community participation). Do we see the Platform as a clearinghouse for information on geospatial preservation and stewardship, like geopreservation.org but embedded in another community? Or is it most useful as an access point to collections of historical digital geospatial data? Or both?
Who will manage a “historical geodata community” on the Platform? Are there enough interested NDSA members to take on the management of a Platform community, or is it necessary to build a wider coalition of willing participants? Do activities like the Platform provide enough benefit to make the effort to utilize them as a central distribution point for historic digital data?
With all of this in mind, what are your thoughts on engaging with the Platform and activities like it? How can we most effectively marshal our community resources, both within the NDSA and across the wider community, to take advantage of opportunities like this as they arise?
I was staring at a blank screen when my colleague David came into my office. I semi-jokingly asked him for a blog topic.
“I have one for you,” he replied. “Content Archaeology. Discuss.” And with that he left my office.
People know that I trained as an archaeologist and did fieldwork in multiple locations. I still think of myself as a social scientist. This phrase resonates with me, and is a concept that I have discussed with others, more often under the rubric of “digital archaeology.” There is also the practice of using digital tools in archaeology, but that’s for another post.
In researching this, I did a bit of content archaeology myself, and in the writing it morphed into a bit of a “Before You Were Born” post as well. This is a VERY truncated list of what one might consider digital archaeology.
- There was a very interesting article on digital archaeology in Wired in 1993. Yes, that’s really 1993.
- I read a very interesting article in the journal Social Semiotics by Gordon Fletcher and Anita Greenhill from 1996 entitled The Social Construction of Electronic Space that explicitly calls out digital archaeology as a methodology for research into virtual communities.
- There’s a UKOLN report titled Digital Archaeology: Rescuing Neglected and Damaged Data Resources by Seamus Ross and Ann Gow from 1999.
- I found a very illuminating paper from 2003 on what it took to reconstruct a set of UK education datasets known as The Schools Census.
- The digital archaeology story that is perhaps the most well-known to the public is the story from 2011 of the recovery of the Domesday Project, and its rebirth online.
- There is the Digital Archaeology project, aiming to recover disruptive moments in design and interactivity on the web. We interviewed Jim Boulton of Story Worldwide on The Signal in 2011.
- Mick Morrison at Flinders University posted an outline for a hands-on workshop on Digital Archaeology in 2011.
- Doug Reside of the New York Public Library wrote on Digital Archaeology: Recovering Your Digital History in 2012.
- I found a great 2013 case study from the University of Pennsylvania Museum of Archaeology and Anthropology in a blog post entitled Digital Archaeology — Uncovering a Website.
- In 2013 the New Museum launched a great experiment called XFR STN to help artists recover and migrate their digital art.
There is some “holy grail” content that the greater community would love to see found so that digital archaeology and preservation actions could be taken, such as the full set of Apollo 11 moon landing tapes or the lost Doctor Who episodes.
How do you define “Content Archaeology” or “Digital Archaeology”? What lost content would you like to see recovered?
When Sam Brylawski was a teenager he had to write a paper for his high school American history class about Gershwin’s “Rhapsody in Blue,” so he did something that was ambitious for a high school student: he traveled to the Library of Congress to examine the composition’s original manuscript in the Gershwin collection.
Brylawski found himself sitting at a table in front of the original manuscript, studying Gershwin’s music-notation “handwriting” – the often-stubby stems on the half notes, the squiggly rests, the hastily sketched but perfectly aligned syncopation and harmony almost bursting off the page. Wayne Shirley, who is a legend in the Library’s Music Division for his scholarship and encyclopedic knowledge, assisted Brylawski and pointed out some especially interesting sections.
“To actually examine a real Gershwin manuscript with Wayne Shirley’s amazing help was a great thrill,” said Brylawski. “Those things worked to get me hooked on the Library of Congress and on libraries in general.”
Hooked enough to work in the Library’s Recorded Sound Section every summer during college. Hooked enough to get a job there after graduating college, to immerse himself so deeply and thoroughly in his work that he would one day become the head of Recorded Sound. And hooked enough to crusade — in the 21st century — for unified action among public and private institutions to preserve and make accessible all recorded sound.
Brylawski, a recognized authority on the history and preservation of recorded sound, learned almost everything on the job, working side-by-side with scholars, talented engineers and recorded-sound savants, experts who get the best possible sound off of every recording medium.
Brylawski started out at the Library as a preservation technician, transferring recordings from disk to tape. Eventually he decided that he didn’t have the “ears” or the technical expertise to do the job the way it needed to be done, so he took a clerical job in the Library’s Recorded Sound Section and Recording Laboratory.
“It was a fabulous education,” said Brylawski. “It was sort of like being an apprentice in a reading room. I would help users look for things that they wanted to copy from the collections and I learned from Library professionals how to serve the public and the fundamentals of library work, as well as where everything was.”
Brylawski became a reference librarian in 1980 and a curator in the early 1990s. In addition to helping people find things, he worked with other staff to make things findable. They indexed unpublished recordings, primarily gift collections held by the Library, using information from the recordings’ engineering notes. This resulted in the Sound Online Inventory and Catalog, a database of over 200,000 recordings.
When James Billington became Librarian of Congress in 1987, one of his first major initiatives was to acquire Congressional funding to help the Library deal with its backlog of unprocessed materials. As a result, Recorded Sound staff and resources increased significantly. A symbol of that commitment is the Library’s National Audio-Visual Conservation Center in Culpeper, VA. Brylawski was on the executive team that planned the Center.
In 1996, Brylawski was chosen to head the Recorded Sound section of the Motion Picture, Broadcasting, and Recorded Sound Division. He said that in the years after his appointment, he observed two major changes.
“One was an increased emphasis on the importance of access,” he said. “And the other was a transition to digital collections and digital preservation.”
The American Memory project gave the public access to thousands of recordings from the stacks. It included some of the first online recorded sound collections from a major cultural institution.
Today the showpiece of online access to the Library’s Recorded Sound collections is the National Jukebox, one of the projects Brylawski devoted his time to after he retired from the Library in 2004. The Library created the Jukebox with Sony Music Entertainment in response to the National Recording Preservation Act of 2000 (which Brylawski contributed to), which states that “The Librarian [of Congress] shall…provide for reasonable access to the sound recordings and other materials in such collection for scholarly and research purposes.”
As for the Library’s transition to digital collections and digital preservation, that has been decades in the making. Digital recording has been around since the 1970s and commercial CDs have been available since the early 1980s. By the 1990s, Recorded Sound preferred CDs as the most reliable playback medium, mainly because CDs do not get worn down by playback the way a phonograph needle wears down a record groove or a magnetic tape deteriorates.
Still, CDs are unreliable for long-term storage. Discs can be easily damaged by handling or by the environment and CD players will become obsolete, just as all media players eventually become obsolete. Besides, CDs are merely containers; the data is what is important.
Audio files are now transferred over the web in different formats and streamed in a variety of ways, and most of the time they are missing crucial metadata. And the Library is challenged to gather and preserve them.
In 2002 Brylawski published a comprehensive report, “The Preservation of Digitally Recorded Sound,” that articulated the complicated, multifaceted challenges involved with preserving recorded sound in the digital age.
He wrote about preserving streaming music and subscription-based music; about the proliferation of CD reissues of old vinyl and tape recordings, which vary in quality; about the explosion of native-born MP3s and their lack of metadata. And he wrote about how, more than ever, copyright can be an obstacle to preservation.
Brylawski is not against copyright. Quite the contrary. His family includes two very prominent copyright attorneys, one who began working with the Library of Congress more than 100 years ago. He appreciates that recorded sound has been a commercial business since its birth in the 19th century.
In the report, he observed that, “Record companies today feel bruised by the rampant swapping of music files…” He wrote about the copyright laws that do not realistically apply to digital preservation and how, in his opinion, those laws may impede the work of cultural institutions in preserving at-risk recorded sound.
Brylawski said, “Regarding copyright, this is an interesting and very sensitive time. The music business has been very hard hit in this century. Record sales are way down from 20 years ago. Many in the business blame file-sharing for much of the decline. At the same time, it is my personal belief that property holders overplayed their hand when they fought to extend copyright terms in the late 1990s, and one result has been a decline in public respect for copyright laws. Librarians need to work with the industry to build collaborations and preserve our audio heritage.”
Given his decades of work with recordings, Brylawski is also painfully aware of the unclaimed orphaned recordings that were copyrighted but not in print and not available for anyone to hear. He wrote about the recordings on decaying media that would be lost forever if action wasn’t taken soon and he said that it is imperative for everyone with an interest and a stake in recorded sound to collaborate on mutually beneficial solutions.
In 2010, Brylawski was a member of one of the six task forces that contributed to the comprehensive report, “The State of Recorded Sound Preservation in the United States: A National Legacy at Risk in the Digital Age,” which was sponsored by the National Recording Preservation Board.
Brylawski said, “The task forces met many times to debate and discuss and share concerns and possible solutions to various aspects of what might go into a national plan of action.”
The report examined the problems in exhaustive detail. Two years later the Library published a national plan of action, “The Library of Congress National Recording Preservation Plan.”
The plan is clear and tightly focused, organized into four main topic areas:
- Building the National Sound Recording Preservation Infrastructure
- Blueprint for Implementing Preservation Strategies
- Promoting Broad Public Access for Educational Purposes
- Long-Term National Strategies
Each topic area breaks down into a few sub-topics and within those are specific, practical recommendations for action. One recommendation is the call for education in digital audio preservation.
“There are few courses taught in audio preservation or preservation courses that touch on audio,” said Brylawski. “But there is no degree program in Preservation Management of Audio. And we hope that there will be. Also, the sands are shifting, so continuing education is necessary for preservation administrators and engineers.
“There is also the challenge of debriefing the classic preservation engineers who have techniques they have developed. We can tap and preserve their knowledge. There is a great deal of legacy knowledge that we are very concerned about losing as people leave the profession. Or worse, die. The National Recording Preservation Board is funding the Association for Recorded Sound Collections in doing some video oral histories of great engineers.”
Brylawski is concerned that his report’s recommendations may not be reaching out far enough to the local level, to smaller institutions, community orchestras, private collectors and others in the music business who might not be aware of the long-term threat to their collections or may not have the resources to archive their collections properly. He suspects there may be a vast quantity of recorded sound collections at large and at risk, and he is helping develop methods of outreach and making resources easily accessible online.
Brylawski never slowed his pace after retirement. After he left the Library in 2004, he was appointed Editor and Co-Director of the Encyclopedic Discography of Victor Recordings, by the University of California, Santa Barbara. He is also chair of the Library’s National Recording Preservation Board.
“I have had a long interest in discography,” said Brylawski. “Comprehensive discographies are needed to study and fully understand recorded music and spoken word history. In addition, a discography can assist in cataloging and preservation planning — the latter by reducing redundancy. ”
Brylawski is obviously fervent about and committed to what he does, and he is reverent about recordings. When he described to me the early acoustic recordings — where musicians played all together into a single acoustic cone that cut the recording directly onto a disc — his voice sounded awed as he referred to them as “snapshots of time.”
And in this new century, several long decades after Brylawski’s transformative experience at the Library of Congress researching the Gershwin manuscript, he had a hand in making accessible online — with the consent of all the stakeholders, for anyone and everyone to enjoy — a recording of Gershwin performing his “Rhapsody in Blue”.
The September 2013 Library of Congress Digital Preservation Newsletter (PDF) is now available.
In this issue:
- The Truth and Reconciliation Commission of Canada using the Levels of Digital Preservation
- Find out about the George Sanger Collection at UT Austin Videogame Archive
- Read an Analysis of Current Digital Preservation Policies
- What Is It That We Actually DO (at the Library of Congress)?
- Recent Interviews with: Matthew G. Kirschenbaum from the University of Maryland and Jason Scott from the Archive Team
- New and recently updated resources: The Digital Preservation Business Case Toolkit; The Activists’ Guide to Archiving Video; Digital Preservation Videos for the Classroom; Digital Preservation in a Box; Rich Online Resources Documenting the 1963 March on Washington
- Other news: Help Pick Panels for the 2014 South By Southwest Conference; Xporting Digital Format Sustainability Descriptions as XML; Format Migration and More Launching Points for Applied Research
- Upcoming Events: National Book Festival, Sept. 21-22, Washington, DC; Cultural Heritage Archives: Networks, Innovation & Collaboration Symposium, Sept. 26-27, Washington, DC; 2013 DLF Forum, Nov. 3-6, Austin, TX; Best Practices Exchange, Nov. 13-15, Salt Lake City, UT; Aligning National Approaches to Digital Preservation: An Action Assembly, Nov. 18-20, Barcelona, Spain
Last October, I wrote about The Atlas of Digital Damages on Flickr. The idea was that it would be instructive to showcase digital content that suffered from problems roughly equivalent to physical content that was deteriorating due to mistakes or neglect.
Since I last wrote about it, the atlas has acquired more examples reflecting all kinds of problems, from corrupted bit streams, to programs that didn’t work correctly, to media failure. Some examples are below. If you have any of your own, please add them to the site!
Engaging Communities to Preserve: The History Harvest as a Collaboration Model for Digital Preservation
This is a guest post by Meghan Vance, a Public History graduate student at the University of Central Florida.
As a Public History graduate student at the University of Central Florida, I had the unique opportunity to participate in an internship with E-Z Photo Scan, a member of the NDSA Outreach Group. This internship evolved from a business-university partnership in local digitization events. In the spring of 2013, UCF conducted a History Harvest, a community-based event to digitize personal artifacts for placement in the UCF digital archive, the RICHES Mosaic Interface. The History Harvest began at the University of Nebraska-Lincoln as a series of events to bring communities together not only to learn about their own history, but also to digitize personal materials for greater knowledge and access. As UCF began its own version of this event, E-Z Photo Scan kindly offered scanning and digital processing services to the community. The college, local businesses and community members came together for one afternoon to digitize history. It was a great success and everyone walked away with new experiences in digital history.
But the History Harvest left a lingering question in my mind: How can all historical organizations (from a house museum to a volunteer-based historical society) digitize their artifacts and archives for public use and preservation? Thus began my internship with E-Z Photo Scan. They took me under their wing and taught me a plethora of information about digital preservation.
I used this new learning to begin a research project to address my looming question. Not attempting to reinvent the wheel, I looked for others who have discussed this topic. Mike Kastellec outlines in his article, Practical Limits to the Scope of Digital Preservation, that these organizations face four fundamental issues: Technology, Access, Selection and Finances.
Using this concept, I began exploring the many facets of digital preservation through various blogs, such as The Signal, and digitally published materials. Essentially, each of these four topics was merged with more specific digital preservation information to gain a better understanding of the challenges that small organizations will face when attempting to digitize archives.
The History Harvest became a collaborative platform to overcome the challenges of file formatting, data storage, open access and a myriad of other digital preservation concerns. The solution was simpler than I thought; collaboration and community partnerships were the keys not only to digitization processes but also to long-term digital preservation.
Often, the two biggest challenges that small organizations face are lack of manpower and money. Through collaboration with private businesses, universities, libraries and other data management companies, small organizations can conduct local events to crowd-source the digitization and digital preservation efforts.
The conclusion of my internship produced a draft guide for small institutions, Growing Community Engagement and Digital Preservation: Planning and Practice. This document serves as a tool from which organizations, large and small, can understand the many components of digital preservation and learn how community-based events, such as the History Harvest, can alleviate some of the stresses of going digital.
By no means is this document an end project. I hope to expand the research and begin to work personally with organizations to begin the processes of digitization of archives and artifacts for long-term digital preservation. But with the partnerships and collaboration of multiple groups, hopefully this will become a first step to bring the physical past into the digital future.
A single photograph in a personal collection or archive might be represented by any number of derivative files of varying sizes, in varying formats, all with different sets of embedded metadata. At the bit level, all of the variations of the photograph are unique. However, in practice, a particular individual or organization might just be interested in holding on to one copy of the image. You can get a sense of the kinds of permutations and variations of digital files we create in Cathy Marshall’s 2010 keynote for the Code4lib conference, “people, their digital stuff, and time” (slides).
An organization can easily have 15 PDFs of the same article, each with a different cover page, but all of which are substantively identical. Again, at the bit level you have 15 unique articles, but if you had a trusted way to identify those 15 PDFs as equivalent, you could keep a single copy and store the article at one-fifteenth the cost over the long haul.
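To make that concrete, here is a minimal sketch (not a production deduplication tool) of how one might distinguish byte-level identity from "substantive" identity for such PDFs. It assumes the third-party pypdf library, and the file names are hypothetical; a byte-level hash treats every copy as unique, while a hash of the text after the cover page groups the substantively identical ones together.

```python
# Sketch only: byte-level fingerprint vs. "body" fingerprint of a PDF.
# Assumes the third-party pypdf library (pip install pypdf); file names below
# are hypothetical.
import hashlib
from pypdf import PdfReader

def byte_fingerprint(path):
    """Hash of the raw file: any cover-page difference changes it."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def body_fingerprint(path):
    """Hash of the extracted text of every page after the first (cover) page."""
    reader = PdfReader(path)
    body_text = "".join(
        (page.extract_text() or "") for i, page in enumerate(reader.pages) if i > 0
    )
    return hashlib.sha256(body_text.encode("utf-8")).hexdigest()

for path in ["article_copy01.pdf", "article_copy02.pdf"]:  # hypothetical files
    print(path, byte_fingerprint(path)[:12], body_fingerprint(path)[:12])
```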
How do we go about making these kinds of community-dependent calls on what constitutes equivalent digital objects? How can we better operationalize our ideas about what counts as a significant difference between digital objects for different potential user communities? This set of issues is one of my favorites among those identified in the recently released NDSA National Agenda for Digital Stewardship. I thought I would take this opportunity to talk through what I find particularly intriguing about the future of understanding digital equivalence and significance, and to mention some approaches that seem promising for scaling up the process of making judgment calls on equivalence.
Here is a bit from the equivalence and significance section of the report that I’m referencing.
Preservation research needs to map out the networks of similarity and equivalence across different instantiations of objects so that they can make better decisions on how to manage content, bearing in mind what properties of a given set of digital objects are significant to their particular community of use. Research is also required in order to characterize quality and fidelity dimensions and create methods for computing format-independent fingerprints of content, so that the fidelity of digital objects can be effectively managed over time.
The report goes on to identify two particularly interesting potential modes for developing ways to identify information equivalence that I thought some readers might like explained in a bit more depth.
Fuzzy Hashing and Degrees of Bit-level Similarity
You may be familiar with the concept of checksums and cryptographic hashes. They are a way to create something like a digital fingerprint for a file or bitstream. With the hashing methods most people use, even very minor differences between two files result in totally different hash values. For instance, two otherwise identical photographs, one cropped a single pixel smaller, will generate completely different hash values. As a result, these hashes are great for telling us whether two digital objects are exactly the same, but useless for telling us how similar two digital objects are.
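A quick illustration of that point, using only Python's standard hashlib and an invented pair of byte strings: changing a single byte produces an entirely unrelated digest.

```python
# Cryptographic hashes confirm exact duplicates but say nothing about
# similarity: a one-byte change yields an unrelated digest.
# Standard library only; the sample data is invented.
import hashlib

original = b"TIFF image data ... thousands of identical bytes ..."
cropped  = b"TIFF image data ... thousands of identical bytes .!."  # one byte differs

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(cropped).hexdigest())
# The two digests share nothing recognizable, so they can tell us the files
# are not identical but not how close they are.
```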
In contrast, there are techniques for fuzzy hashing that attempt to identify the percentage similarity between the bitstreams of two files. There is considerable potential for applying some of the work on fuzzy hashing to help digital content stewards make decisions about which minor differences between files do and don’t matter. An interview about the National Software Reference Library from last year discussed some of the work going on there on similarity digests that fits into this same area of research. In short, there are already algorithms out there we could be using to better understand, at the bit level, how similar or different a set of files are, and there is considerable potential to apply these (and future algorithms) in curatorial workflows.
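As a rough sketch of the underlying idea (and emphatically not the algorithm that ssdeep, sdhash or the NSRL's similarity digests actually use), one can hash fixed-size chunks of each bitstream and report the overlap between the two chunk sets:

```python
# Deliberately naive illustration of fuzzy-hashing-style similarity:
# hash fixed-size chunks of each bitstream and compute the Jaccard overlap
# of the chunk sets. Real similarity digests are far more sophisticated.
# Standard library only; the sample data is invented.
import hashlib

def chunk_hashes(data, chunk_size=64):
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }

def similarity(data_a, data_b):
    a, b = chunk_hashes(data_a), chunk_hashes(data_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

file_a = b"header" + b"shared payload " * 200 + b"trailer A"
file_b = b"header" + b"shared payload " * 200 + b"trailer B"
print(f"similarity: {similarity(file_a, file_b):.2f}")  # close to 1.0
```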
Comparing Rendered Content Algorithmically
Along with looking at bit level patterns, there are a range of promising approaches to analyzing and interpreting rendered content. For example, some image search systems will now give you the option to view similar or related images based on the qualities of the rendered photo. Beyond image comparison, the same approaches have the potential to identify similarity across audio and video and text files. Tools that could identify similar digital objects in these ways would be invaluable for both selection and for creating metadata about the relationships and connections between objects. All of this work on similarity has the potential to generate that kind of descriptive metadata and power visual interfaces for exploring relationships and connections between digital objects.
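For a sense of what comparing rendered content can mean in practice, here is a hedged sketch of a simple "average hash" over rendered pixels. It assumes the Pillow imaging library and hypothetical file names; production image-similarity systems use far more robust perceptual features than this.

```python
# Sketch: reduce each image to a 64-bit fingerprint of its rendered pixels,
# then use Hamming distance between fingerprints as a rough visual-similarity
# measure. Assumes the Pillow library (pip install Pillow); file names are
# hypothetical.
from PIL import Image

def average_hash(path, size=8):
    """1 where a downscaled grayscale pixel is brighter than the mean, else 0."""
    pixels = list(Image.open(path).convert("L").resize((size, size)).getdata())
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(bits_a, bits_b):
    return sum(a != b for a, b in zip(bits_a, bits_b))

# Hypothetical files: a master TIFF and a recompressed JPEG derivative.
distance = hamming(average_hash("master.tif"), average_hash("derivative.jpg"))
print("differing bits:", distance)  # a small distance suggests the same rendered image
```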
Both approaches to identifying bit level similarity and similarity in rendered digital content offer considerable potential value to stewards of digital content. Beyond continued basic research in these areas there is a need to begin translating existing work into tools and workflows for stewards of digital collections. In this respect, there is considerable potential for work exploring how to apply these different approaches to similarity in particular collection use cases. Applying these ideas of similarity in different situations will ultimately help us unpack the relationship between content similarity and the significant properties of particular sets of objects in particular stewardship and use contexts.
Over the summer, we were pleased to participate in a number of “Open House” programs that our colleagues in Education Outreach hosted for their Summer Teacher Institutes. Each summer, over 100 K-12 educators take part in week-long immersion programs to learn strategies for the classroom use of Library of Congress digitized primary sources. The use of primary sources is growing and the Library is helping teachers across the country meet that demand.
This year, the Education Outreach staff added an open house component so that participants could gain a greater understanding of all the Library has to offer K-12 educators and their students. During the Open Houses, representatives from the Library’s various curatorial divisions and programs shared materials and information teachers would be interested in learning about. Staff were encouraged to show off primary sources from the collections that are also available online, along with ideas for how the materials could be used in a K-12 classroom or library.
Unlike other Library divisions, NDIIPP doesn’t steward collections or materials; we’re about collaborative partnerships to help preserve important digital content, build new tools and develop best practices. In support of this, we’ve developed a rich collection of information and resources on our website, digitalpreservation.gov, which we use, among other things, to develop programs and presentations about digital preservation for various audiences.
In the past, we’ve put on a couple of programs geared toward students and shared information about the K-12 Web Archiving Program, so we’ve had some experience providing information to the K-12 educator audience. For our contribution to the Open House programs, we took a closer look at the videos we produced for our Digital Preservation Video Series and we created a list of videos (PDF) teachers would find most relevant.
In the “back to school” spirit, I thought it would be useful to share those videos with our readers (and educators) here.
- Digital Natives Explore Digital Preservation: This video looks at the views teenagers have about the permanence of digital information.
- Why Digital Preservation is Important for You: A good “explainer” video that offers some basic digital preservation strategies. We also offer this video with Spanish captions.
- Adding Descriptions to Digital Photos: Your Gift to the Future: This video explains the value of adding descriptions and tags to digital photos in order to make it easier to organize and search your collection.
- America’s Young Archivists: The K-12 Web Archiving Program: This video profiles a group of eighth-grade students who participated in the program.
- K-12 Web Archiving: Preserving the Present: This video is an interview with Paul Bogush, teacher at the Moran Middle School in Wallingford, CT, about his class’s participation in the program.
- Bridging Physical and Digital Preservation: This video compares the physical and digital preservation of the Waldseemüller Map.
South By Southwest is a great music conference that has morphed into an equally excellent technology conference. The growth of the technology portion has increasingly attracted information professionals in libraries, archives and museums who take the opportunity to talk about their current projects and connect with technology professionals over shared interests in open access, copyright and digital stewardship.
SXSW has become so popular that the jockeying for presentation slots now commences months before the conference. The process of choosing panels includes a crowd-sourced aspect called the Panelpicker, which engages the public to vote on the panels they’d be most interested in attending at the conference.
The panel voting for SXSW 2014 is now underway and extends through September 6, 2013. To vote for a session you need to visit the Panelpicker site and create an account. After that it’s just a matter of browsing through the 4,122 entries and picking the ones you like.
That’s a lot of panels! We didn’t submit any panel proposals this year after doing a couple last year (though we’re looking into participating in other ways), but there are a surprisingly large number of panels that are either organized by LAMs or that touch on areas of interest to our community. Here are some examples:
Description: “What is the future of the past in terms of new user interface, user generated content, and digital preservation?” So begins the proposal from panelists including National Digital Stewardship Alliance members Historypin and the Council on Library and Information Resources. They plan to “explore some of the diverse efforts to bring stories and memory to life in new ways, while also fostering open data and preservation, and the pros and cons at the intersection of public domain and private enterprise.”
Description: Does open licensing open doors for content creators or does it close off potential revenue streams? Panelists from Creative Commons and the Free Music Archive discuss how Creative Commons licensing has changed how artists think about copyright and intellectual property.
Description: A team of content experts including folks from the Recording Academy dive into the value proposition around archival and contextual information (metadata, that is) that allows for the long-term management and monetization of music content.
Description: Billed as a successor to last year’s successful “Libraries: The Ultimate Playground,” this is a jump into a “no-skills-required design sprint” on what the future of libraries might look like.
Description: A look at efforts to implement culture change at the Harvard University Libraries, through their exploratory Harvard Library Lab.
Description: A German effort led by the Westdeutscher Rundfunk public broadcasting service to crowdsource and digitize historic artifacts and collect the stories around them.
Description: There are a number of proposals describing opportunities to transform public libraries through innovative technologies and this is one of the best. Not the “gig” you musicians might be thinking of, this focuses on the fine work being done at the Chattanooga Downtown Public Library.
Description: Mapping presents one of the best opportunities to express the cultural heritage utility of “big data.” Amongst other things, the panelists will explore the kinds of patterns and techniques currently available for analyzing big data sets.
Description: A wide-open presentation by a panelist from NDSA member the New York Public Library Labs that looks to stoke conversation through provocative questions such as “Ok – we get it, you’re not just about books anymore, you’re also about data. But how are you going to get data from all that old stuff you have?”
Description: And finally, something completely different: “If we can archive and store our personal data, media, DNA and brain patterns, the question of whether we can bring back the dead is almost redundant. The right question is should we?”
This barely scratches the surface of what’s being proposed for the conference. Check out the full range of offerings and support the ones that appeal to you.
Digital Preservation in a Box, for those who may be unfamiliar, is a compilation of resources from many different organizations, all available in one virtual place. It’s been around for a little over a year. To give a brief background, the Box was produced by the National Digital Stewardship Alliance, specifically the Outreach working group, so this was, and is, very much a collaborative effort with our partners in this group. The initial aim was to consolidate many resources into one convenient place, providing access to basic information geared towards library professionals and educators.
So, what are some specific uses for Digital Preservation in a Box? Here are four possibilities:
- You are taking a digital curation course in library school and you need to find lots of resources to help with your project or research. (The Digital Preservation 101 section is a good place to start).
- You work in a library, museum or other cultural institution and you have been tasked with starting a digital preservation effort. (Again, see Digital Preservation 101)
- You are teaching a college course which includes digital preservation as a component and you are looking for classroom resources. (See the Resources for Educators, and the story below about Dr. Jane Zhang of Catholic University)
- You are preparing a presentation on the value of digital preservation. According to Butch Lazorchak, co-chair of the NDSA outreach working group, “The Box provides a ready-to-use set of resources that help make it easy to talk about digital stewardship.”
These are just a few ways in which Digital Preservation in a Box could be of use.
At this year’s Digital Preservation 2013 conference, I had a chance to present our NDSA poster on this resource. A lively poster session it was, too – there was a steady stream of attendees who came by to talk about it. During these conversations, I was pleased to hear one of two things: either people had used it over the past year and liked it, or they didn’t know about it but now want to use it! If any of you have just started using this resource, we’d love to hear from you about your experience.
The Box content is organized into some basic sections, each containing links to a variety of resources:
Digital Preservation 101 – contains a wide range of information, with links to tutorials, videos and blogs, to help provide some basic context. It also includes an overall definition of digital preservation: “the series of managed activities necessary to ensure meaningful continued access, for as long as it is required, to digital objects and materials.”
Preservation by Format – includes links to suggested approaches for preserving photographs, audio, video, email, documents, and websites. Much of this is focused on smaller, personal collections.
Digital Storage, Cloud computing and Personal Backup – includes links to basic information on cloud storage and other backup options, as well as a timeline history of digital storage.
Resources for Educators – provides curriculum guidance, lesson plans and teaching materials relating to digital preservation and use of the Box materials. The class syllabi alone provide a good overview of the digital preservation process, complete with useful reading lists.
As an example of using the Box as an educational tool, Dr. Jane Zhang, Assistant Professor in the Department of Library and Information Science at Catholic University of America (and member of the NDSA outreach working group), created a project for her Digital Curation course that included work with the Box. The students utilized the box materials, suggested additional resources, and presented a public workshop. The experience was described first hand in a previous blog post by one of the participating students.
In addition to the above, the Box also includes other sections for glossaries, lists of tools, marketing and outreach, event guidance and basic digitization.
We have noticed some recent mentions of the Box – it’s showing up on some good resource lists, such as this one and this one, and its use was also mentioned in a recent blog post. Feel free to add the Box to any of your own digital preservation or library resource lists.
And this is no static “Box” – we are adding new resources as they become available, and the aim is to continue adding and updating the information to be as current and useful as possible. Have a suggestion for any additions? Want to tell us about how you have used this resource? Let us know in the comment section below.
A recent post from the Library of Congress’s main blog outlined some of the riches at the Library of Congress in connection with the 1963 March on Washington. Picture This, the blog for our Prints and Photographs Division also recently highlighted some recently digitized photographs from the march.
In the spirit of the 50th anniversary, I thought I’d mention some other digital resources that relate to the event. The World Digital Library has an item entitled “Civil Rights March on Washington, D.C.: Dr. Martin Luther King, Jr., President of the Southern Christian Leadership Conference, and Mathew Ahmann, Executive Director of the National Catholic Conference for Interracial Justice, in a Crowd” (reproduced here).
Historypin has a very fine digital “tour” available for the march that includes mapped items and a list of collections from the U.S. National Archives and Records Administration. Quite a few photos and documents are featured.
WEDU, Florida West Coast Public Broadcasting Inc., has an online collection of local video stories and memories about the march, and the PBS NewsHour has a rich resource, “10 Resources for Teaching the 50th Anniversary of the March on Washington.”
Appropriately, the DC Public Library Washingtoniana Division has a number of evocative images online. The Amistad Digital Resource for Teaching African American History also has an online collection of photographs and other resources documenting the event. The Williams College Digital Collection has an assortment of buttons, posters and other ephemera that are important for documenting the full context of the march.
If you know of other online resources that are of broad interest, please let us know in a comment.
This is a guest post by Paul Wheatley of the SPRUCE Project, which is “aiming to foster a vibrant and self-supporting community of digital preservation practitioners and developers via a mixture of online interaction and face to face events.” For more on SPRUCE, see an earlier interview with Paul.
A significant proportion of the project I’m currently running, the Jisc funded SPRUCE, has been about hands-on digital preservation work: learning by doing. Changing attitudes from the bottom up. Doing digital preservation and sharing the outcomes, good or bad, for others to learn from. Supporting those who are actually managing digital data, and attempting to build a stronger community of peer support. I’ve been getting my teeth into a lot of this work throughout the project, and it has been great fun. A little bit on the back burner, however, has been our aim to support the same audience in quite a different way: we wanted to make their practical digital preservation work more sustainable by helping these practitioners pull in the funding they need.
So we’ve been building our expertise and experience in various aspects of writing digital preservation focused business cases. As well as funding some case studies on the subject, we’ve been working with loads of practitioners at our mashup events on various business case themed exercises, and collating the results. This was all useful foundational work, but still to be properly realised into project outputs. In the last month however, I’ve finally been working full time on turning these foundations into the new Digital Preservation Business Case Toolkit, an online guide to help you make the case for funding your digital preservation work.
Except rather than sitting down and writing this work up in the time-honored fashion (i.e. on my own) and then soliciting feedback from our project team, we went for a slightly more experimental approach. At the beginning of August we hosted a three-day book sprint. We invited, along with the project team, some of our favorite practitioners from our mashup events and a couple of external experts, and set them to work on collaboratively writing the toolkit.
We didn’t have a particularly strong view on how we wanted the end result to look, so we took things from the top and began by brainstorming what the contents page should look like. Immediately this gave us several interesting angles from which to tackle the rather nebulous problem of business case writing. Business cases themselves are defined very much by the organisation within which they sit, the stakeholders that are involved and the focus of the work they are making the case for. So what we needed to do was guide a user through the key thought processes without simply prescribing a particular outcome. Ask the right questions and the user should be able to come up with the right answers that are appropriate for them.
With the various approaches in place, we broke up into small groups and first brainstormed a bullet point summary for each toolkit section and then the detail to fill out those bullets. Working in short iterations, we peer reviewed the text as it appeared, enabling us to really hone each section using the selection of skills and experience that our book sprinters brought to the event.
Of course it wasn’t quite as straightforward as that (and I’ve blogged here in more detail about the challenges and benefits of book sprinting) but it was very effective in letting us build a key project output in a way which maximised the contributions, experience and buy in of our project team and invited experts.
We’ve just released the first version of the Digital Preservation Business Case Toolkit and we’ll be refining it during the remaining few months of the SPRUCE Project. We’ve funded a further couple of case studies which will be putting the toolkit through its paces and we’re hoping to solicit feedback from other users that will help us address any shortcomings in this first release.
A final word should go to Tom Woolley, our book sprint illustrator. Tom produced some fabulous drawings to bring the toolkit alive, and we’ve made all of them available under a CC license for those who would like to make use of them elsewhere.
In 2007, George Sanger and three other videogame industry leaders collaborated with the University of Texas at Austin to create the UT Videogame Archive at the Briscoe Center for American History. Sanger, who is best known by his persona The Fat Man, is an award-winning, ground-breaking composer and sound designer who has created audio for over 250 games. He was ready to simplify his life by getting rid of a lot of his stuff and the Briscoe Center welcomed his collection. But they did not expect the complexity of the project they were about to take on.
Sanger’s career spans three decades and several generations of technology. When his archives arrived at the Briscoe Center, the caravan of sixty blue plastic bins was packed with at least nine types of storage media, most of which required special hardware to access their contents. Some of the storage media were obsolete and some of the files they contained were in obsolete formats created by obsolete programs.
Fortunately for the archivists at the Briscoe Center, Sanger is organized and efficient, and he had methodically labeled and organized his files, disks, tapes and drives. The real challenge that the Briscoe Center faced with the Sanger archives was not so much about cataloging it as it was about safely getting the files off of Sanger’s defunct media and into UT’s repository.
The career path that brought Sanger to the Videogame Archive was long and meandering. He grew up in Coronado, California, near San Diego, and studied music at Occidental College in Los Angeles. After graduation in 1979, he went to USC film school for a semester, where movie soundtracks made an impression on him that would later influence his work. But of all of his artistic influences, arcade and video games resonated the strongest and Sanger felt a calling.
In 1983, he landed a project creating music for a computer game. The process was low-tech; he composed the music on audiotape and sent it and the music notation to Dave Warhol, the game’s producer. Around that time the home gaming industry was in a slump, so Sanger did other work, creating background music for commercial films, demos for songwriters and custom karaoke tunes for vocal teachers.
In 1988, Warhol again contacted Sanger. By then, Nintendo’s Mario Brothers had jump-started the stalled gaming industry, the demand for games was growing and Warhol was at the forefront of American development tools for Japanese games. Sanger was in the right place at the right time and he was ready to take on anything because by then he had mastered MIDI, the essential technology for electronic game audio.
Musical Instrument Digital Interface, MIDI, is a technical standard for communication among electronic instruments, audio software and hardware. It enables a user to manipulate musical notes in the same way that a word-processing file enables a user to manipulate text. Example 1 shows a simple MIDI workstation setup.
When a user plays a note on the keyboard, say a middle C, the keyboard communicates via MIDI with a sequencer (an environment in which a user can modify the MIDI code), telling it which note was played, how loudly it was played and how long it was held.
A MIDI composer uses a sound module to hear the composition created in the sequencer. The work done in the sequencer can also be saved as a MIDI file, which can be played back by different hardware and software combinations. The sound is determined by the software and hardware.
So in the early days of Sanger’s career, he would send a MIDI file off and hope for the best. The MIDI file may specify that certain notes are to be played by “trumpet” or “clarinet,” but the actual tone of the instrument is no more part of the MIDI file than it is of sheet music. MIDI represents the instructions for producing sound, not the sound itself.
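For readers curious what those instructions look like on the wire, here is a small sketch built directly from the MIDI 1.0 specification (standard library only; the note and velocity values are just examples): a "note on" is three bytes, and no audio is carried at all.

```python
# A MIDI "note on" message is three bytes: a status byte (0x90 | channel),
# the note number (60 = middle C) and the velocity (how hard the key was
# struck). The receiving sound module decides what the note actually
# sounds like.
def note_on(note=60, velocity=100, channel=0):
    """Raw bytes of a MIDI note-on message."""
    return bytes([0x90 | (channel & 0x0F), note & 0x7F, velocity & 0x7F])

def note_off(note=60, channel=0):
    """Raw bytes of a MIDI note-off message (velocity conventionally 0)."""
    return bytes([0x80 | (channel & 0x0F), note & 0x7F, 0])

print(note_on().hex(" "))   # "90 3c 64" -> play middle C at velocity 100
print(note_off().hex(" "))  # "80 3c 00" -> release middle C
```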
For console games, Sanger embedded instructions for programmers in the MIDI files. The consoles would play back consistently because the audio platform was consistent and uniform. PC games, however, depended on whatever sound cards, sound modules and, eventually, music-playback software the user happened to have. Each of these platforms had different capabilities, dependabilities, tones and sound qualities. This led to ugly artistic problems for Sanger and other game composers. MIDI files can trigger sound from modules on your computer. If you click the MIDI file below, a web page will open and the piano music will play in your browser, though the piano sound actually resides in your computer’s hardware. If you download the MIDI file (right-click > save as) and click on it, an audio player — such as Windows Media or Quicktime — will open and play the same piano music from the same source. “Rondo alla Turca” by Wolfgang Amadeus Mozart, MIDI file by Bernd Krueger.
Sanger’s 1988 MIDI workstation included a Mac Plus and a Roland MT 32 sound module, with which he created sound effects, atmospheres and soundtracks, not unlike the movie soundtracks he studied at USC. But he had the additional challenge of making the music “interactive.” Instead of creating music to play in a steady linear state, as on recordings and in movies, Sanger had to figure out, game-by-game, how the music might change in response to an action from the game player, as well as make sense to the ear.
“Music is a time-based art and game-music composers have no control over the time,” said Sanger. So he had to musically anticipate the options from “here” to “there” and make it all segue smoothly and logically.
During this period of his career, a number of new sound cards and sound modules of widely different capabilities hit the market. To make game music tolerable for all players, Sanger had to create a version of each MIDI file specifically for each playback platform. These two examples demonstrate the technical issues that plagued game audio before General MIDI. Here is an MP3 recording of what the music sounds like. NBA by George Sanger (MP3) Here is the MIDI file for that song. NBA by George Sanger (MIDI) In the MIDI file, you only hear one instrument — and a quicker tempo — because the file is coded in the pre-General MIDI format and your computer expects to receive General MIDI instructions. Because it does not get them, it defaults to playing everything on one instrument.
To make matters worse, the earliest MIDI specifications did not require that the playback platform actually play any particular instrument where it was specified in the MIDI file by the composer. Sanger said, “It got to the point where my melodies were playing back as some ‘buzz click’ thing.”
Sanger’s MIDI options improved as a result of industry changes that Sanger himself helped instigate. The General MIDI standard was established in 1991 and it at least fixed the “particular instrument” problem, promising to make MIDI-based audio more consistent and predictable across different MIDI-enabled devices. So, for the innovative CD-ROM-based game The Seventh Guest, Sanger created the first General MIDI soundtrack.
“Only one device that would play back General MIDI existed at that time –- the Roland Sound Canvas,” said Sanger. “So, to make this thing work, we wrote special sound banks for all the major sound cards that weren’t General MIDI yet. Those sound banks we created were bought by Yamaha and Microsoft and I believe they are still in use today, tucked here and there into obscure corners of systems.”
The establishment of General MIDI did not instantly make everything right, though: while instruments could now be reliably specified, many aspects of the sound still differed from card to card. Playback of The 7th Guest on the new General MIDI cards was rough. Some instruments were unbearably loud, others could not be heard at all.
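As an aside on what General MIDI actually standardized: it fixed the mapping between program numbers and instruments, so a compliant device renders the instruments the composer specified. The sketch below is a minimal, hypothetical illustration using the third-party Python library mido (not a tool from Sanger’s workflow) to list the program-change messages, and thus the intended General MIDI instruments, in a MIDI file.

```python
# A minimal sketch (assuming "pip install mido") that lists the instrument
# assignments in a MIDI file -- the very information that, before General MIDI,
# a playback device was free to reinterpret or ignore.
import mido

# A few General MIDI program numbers (0-indexed), for illustration only.
GM_NAMES = {0: "Acoustic Grand Piano", 24: "Nylon-String Guitar", 40: "Violin", 56: "Trumpet"}

def list_instruments(path):
    """Print every program-change message in the file, per track and channel."""
    midi = mido.MidiFile(path)
    for i, track in enumerate(midi.tracks):
        for msg in track:
            if msg.type == "program_change":
                name = GM_NAMES.get(msg.program, f"GM program {msg.program + 1}")
                print(f"track {i}, channel {msg.channel}: {name}")

if __name__ == "__main__":
    list_instruments("song.mid")  # hypothetical file name
```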
Sound card companies knew of Sanger’s expertise, so when they came out with new sound cards that claimed to be General MIDI, they sent the cards to Sanger for appraisal. So great was the demand that Sanger developed a side business called Fat Labs, which tested and appraised cards. If a card passed, the company earned the prestigious “Fat Labs Seal of Approval” sticker for the product box.
By the middle of the 2000s, audio technology had evolved to the point where WAV music files gradually displaced MIDI from games. WAV files contain actual audio recordings; they don’t rely on sound cards to generate sound. Today MIDI makes up a small percentage of game audio.
Most of the digital content in the Sanger archives consists of MIDI files and sequencer project files. As part of the ingest process, the Briscoe Center set out to create a disk image of each digital item, a sector-by-sector replication of the structure and contents of each storage device, and deposit the disk images into their repository. However, the Briscoe Center has at times been frustrated by Sanger’s storage media.
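The Center has not published its imaging workflow, but the general pattern is straightforward to sketch. The following is a minimal, hypothetical example, not the Briscoe Center’s actual process: it reads a device block by block, writes the image file and records a checksum for later fixity checks. The device path, file names and block size are all assumptions.

```python
# Minimal sketch of sector-by-sector disk imaging with a fixity checksum.
# Paths are hypothetical; real ingest workflows typically use dedicated imaging
# tools and hardware write blockers rather than an ad hoc script like this.
import hashlib

def image_device(device_path, image_path, block_size=512):
    """Copy a raw device to an image file and return the image's SHA-256 checksum."""
    digest = hashlib.sha256()
    with open(device_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            digest.update(block)
    return digest.hexdigest()

if __name__ == "__main__":
    checksum = image_device("/dev/sdb", "sanger_floppy_001.img")  # hypothetical paths
    print("SHA-256:", checksum)
```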
The disks in the Sanger collection consist of:
- 3.5″ floppies (single density/double density and high density)
- zip disks
- internal hard drives (IDE, SCSI, SATA)
- compact discs
The tapes consist of:
- QIC-80 cartridges
- TR-1 Travan cartridges.
The double-density floppy disks were by far the most difficult to gain access to. They require a special kind of floppy drive but the Center’s archivists did not know that at first. For years, they were thwarted by input/output (I/O) error messages from the double-density floppies and they were convinced that the disks were corrupted. Eventually, through lengthy testing, they concluded that those disks required a specific floppy drive, while the high-density floppies did not.
The archivists’ persistence in trying to access something that would not open was admirable. They did not simply assume that all the double-density disks were corrupt, and they never considered throwing the disks away.
“Just because something is not readable now doesn’t mean that it won’t be readable in the future,” said digital archivist Zach Vowell.
The DAT and ADAT tapes required special hardware to play them, and it took the archivists a while to acquire the players. Sanger eventually donated his own ADAT player to the Center. There was a further wrinkle: ADATs could be synchronized across multiple machines, for up to 32 discrete tracks of audio spread across four tapes. To keep these tapes synchronized to their original time code, digitization project archivist Justin Kovar used a Windows 98 workstation running the manufacturer’s discontinued ADAT/Edit program to migrate the PCM audio on the tapes to uncompressed WAV files.
The archivists could read the SyQuest tapes with the drive that Sanger provided as part of his collection, but the drive has a 25-pin SCSI interface. To date, the Center does not have a workstation equipped with a suitable SCSI interface, so they have not been able to migrate the content off the SyQuest tapes.
Some files are still locked on backup disks because the backup software Sanger used at the time relied on a proprietary compression algorithm. The files can only be restored to their uncompressed state with that same backup software, which the archive has yet to acquire.
Sanger is still going strong, composing for games and also for slot machines. He has his own YouTube channel and he has published a book titled The Fat Man on Game Audio: Tasty Morsels of Sonic Goodness. The George Sanger collection includes not only digital material but also paper records, photographs, clippings, artifacts and analog audio recordings. Every few months Sanger drops off a few more things at the archives.
Because of Sanger’s foresight in properly archiving his materials, he was invited to speak at Personal Digital Archiving 2013.
Sanger still uses MIDI to compose and create sound, but the end result is a music file, not a MIDI file. Since he is not limited to MIDI, he can record a range of exotic instruments and Foley effects. “I have bins of sound effects things,” said Sanger. “They’re sort of divided into clunkers, squeakers, bangers, ringers…that kind of thing.”
Many of the difficult digital-preservation challenges that the Briscoe Center faced with the Sanger archives had to do with his files from the golden age of MIDI, from about 1988 to around 2000. Fortunately for the Center, Sanger donated almost all of the hardware they need to run the software. The Briscoe Center has even assembled a vintage workstation, where the operating system, platform and hardware meet the requirements of vintage software and games.
The Sanger archives demonstrate that digital preservation often encompasses more than digital files. In Sanger’s case, preservation must include both the software and the hardware necessary to play the audio and to recreate the process of developing that audio.
The Briscoe Center appreciates the unique archival relationship they have with Sanger, who has provided an enormous amount of resources and guidance above and beyond what an average collection donor would provide.
Sanger said, “I’ve done what I can and I leave the sorting out of it in the capable hands of the archivists and future researchers. The archivists bring very different skills and temperaments to their work than us ‘digital artists.’ It’s a very complementary situation. My career requires me to look forward and move forward, quickly and relentlessly; it’s all I can do to keep from falling into organizational chaos. I often have to decide that, no, I will not look back, and because of that, I may or may not label this bin or that backup drive, and I’m not leaving much of a trail behind. If the stuff I’m leaving to the archive ends up having any value in the future, it’s only because of the skill and patience and care that the archivists have for this collection and collections like it.”
I am frequently asked about the difference between “traditional” preservation and digital preservation. My honest answer is that there are very few distinguishable differences.
Preservation activities are never traditional – there is constant innovation in preservation techniques. Digital preservation is in many ways still developing its tools and techniques, but physical preservation is also evolving.
All preservation activities are about actions and documentation of actions taken on collections.
Material science plays a huge role in preservation. I was once in a meeting with representatives of a preservation initiative where I heard the following declaration: “There is no such thing as digital preservation, only the preservation of digital media.” My (very brief) initial response was one of great surprise, followed by immediate recognition that, of course, my surprise was wrongheaded. Research into the qualities of physical materials, be they paper or magnetic storage media, is vital for preservation. New treatments and actions are being developed, and understanding of the archival nature of all media types is always being expanded.
People often say that born-digital collections are more at risk than physical collections when planning for preservation needs. It can certainly be the case that there is only one instance of a born-digital file on a single piece of media, and the fragility of the media may mean there is only one chance to read it and copy the file into a managed environment. But it is certainly also the case that there are countless physical items in library, archives and museum collections where handling for research use could damage an item beyond recovery, and there is only one shot at preservation. And disasters can strike just as suddenly for any type of collection.
All collections need ongoing management and assessment. All collections require inventorying. Digital is no different.
I also hear that the skill sets are different. This is in part true. Additional expertise is needed in file formats, in the potential risks of storage infrastructures, in forensic analysis of files, in auditing of storage and in the use of tools to migrate files (and file formats) as appropriate. But, at the core, the skill set is one of being able to identify risks, analyze collections for risks, make decisions about needed preservation actions and take them. There is some specialization in the handling of digital media and files, but that level of specialization is not uncommon in preservation.
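To make the overlap concrete, consider the auditing mentioned above. The sketch below is a minimal, hypothetical fixity audit, comparing current checksums against a stored manifest, offered only as an illustration of routine collection assessment; the manifest format and file names are assumptions, and real repositories typically use dedicated fixity tools.

```python
# Minimal sketch of a fixity audit: recompute checksums and compare them against
# a stored manifest. The manifest format (path<TAB>sha256) is an assumption.
import hashlib
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit(manifest_path):
    """Report any file whose current checksum no longer matches the manifest."""
    for line in Path(manifest_path).read_text().splitlines():
        rel_path, expected = line.split("\t")
        status = "OK" if sha256_of(rel_path) == expected else "MISMATCH"
        print(f"{status}\t{rel_path}")

if __name__ == "__main__":
    audit("collection_manifest.tsv")  # hypothetical manifest file
```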
I often say that there is no such thing as a “digital library” — it’s just the library. Now I am wondering if I should also be saying that there is no such thing as “digital preservation” — it’s all just preservation.
The following is a guest post from Raegan Swanson, Archivist with the Cree Cultural Institute. Raegan contacted members of the NDSA group working on the Levels of Digital Preservation with her thoughts and comments, and we were excited to offer her the opportunity to share her comments on the utility of the levels with a broader audience here on the blog.
In this post I briefly describe the work of the Truth and Reconciliation Commission of Canada and the value of the NDSA Levels of Digital Preservation in helping me think through how best to prioritize our work to ensure long-term access to the records of the commission. The following represents my own personal views and experience working with the NDSA levels, not those of the TRC itself.
Truth and Reconciliation Commission of Canada
Based out of Winnipeg, Manitoba, the Truth and Reconciliation Commission of Canada was established in 2008 with a 5-year mandate to gather information relating to Indian Residential Schools in Canada. These government and church-run schools date back to the 1870s, and over 150,000 First Nations, Inuit and Métis children were placed in over 130 schools across the country, with the last school closing its doors in 1996. Part of the TRC’s mandate is to collect testimonies from Residential School Survivors. Unlike other large Truth and Reconciliation Commissions, such as South Africa’s, the oral statements are collected exclusively as digital audio and video recordings. Some of the statements at larger events are streamed live online, while others are recorded in a private setting.
Statements are given by survivors, the former students, as well as former teachers, staff and families of survivors. Statement gathering has taken place in every province in the country and statement gatherers visit both small Inuit villages as well as large cities. The accessioning process for archivists is quick, as recording equipment must be promptly returned to the field. Some events could net over 1TB of video a day, plus back-ups. The Canadian TRC quickly became one of the largest collections of digital material in the province of Manitoba, and there were few resources for a “living” digital endeavour this large.
NDSA Levels of Digital Preservation
In November 2012, we at the TRC saw the post about the NDSA Levels of Digital Preservation on this blog. We spent the next three months analysing our digital holdings using the levels. Our holdings were complex and heterogeneous, as we held material that we ourselves had created (statements from survivors) as well as material from the Government of Canada and several different Church entities. For each type of holding, we were able to determine the level into which the collection fit. Since we didn’t hold the original material for many of the records, we had to take into consideration the holdings of other institutions as well as our copies. It gave us a great opportunity to review and clarify where all our projects stood. For us, the levels document was usable far beyond thinking about preservation.
Usability of the Levels
We were impressed with the simplicity and usability of the levels. However, we struggled with items that were not born-digital and with cases where physical copies still remained. We knew that the levels were created with digital preservation in mind, but the originals still played a part in our collection. At this point, we were digitizing for researcher access, not for preservation. Because the commission is coming to an end in 2014, the project provided a great opportunity for our archivists to prepare a document to be handed over with a full account of semi-active records ready to be moved into archival storage at a National Research Centre. The levels provided a great “here is everything you need to know” type of document.
Aanischaaukamikw Cree Cultural Institute
Since working with the TRC, I have moved to the town of Oujé-Bougoumou in Northern Quebec, to a small community-run archive within the Aanischaaukamikw Cree Cultural Institute. I plan to continue using the levels document as we start up a digitization program for the archives, as well as for our corporate records. We have neither the means nor the need to fulfill all the levels, but I plan to use them as a guideline for our needs.
The following is a guest post by Jefferson Bailey, Strategic Initiatives Manager at Metropolitan New York Library Council, National Digital Stewardship Alliance Innovation Working Group co-chair and a former Fellow in the Library of Congress’s Office of Strategic Initiatives.
Jason Scott will no doubt be familiar to many readers of this blog having been interviewed previously by Leslie Johnston in her post Jason Scott, Rogue Archivist. In the intervening year, however, Jason has undertaken a number of new projects and initiatives dedicated to preserving digital information and the history of digital technologies, as well as continuing his work with the Internet Archive. In this interview, Jason talks about some of his recent work preserving digital culture.
Jefferson: Thanks for talking with us again. Since your previous interview here described your work on textfiles.com and with the Archive Team, I wanted to focus on some of your more recent projects. First, you declared November 2012 to be Just Solve the File Format Problem month. Tell us about how that project came about, how it went, and its future plans.
Jason: Like any industry involving a lot of public money and a lot of complicated projects, the library and archives worlds have a number of supposedly insurmountable projects that show very little return on investment beyond a general sense of good feeling, with a general improvement down the line as the project becomes more mature and complete. In the process of speaking to groups and individuals in the library and archives world, I found a common theme: great difficulty in discerning the characteristics of digital files and data in order to understand what format they were in. Several projects have been mounted to approach this problem, but they tended to work along certain families of data, or were locked in a semi-proprietary situation and database that would not easily be shared. To work that hard and then give away the results of that work would be insane. And since open-source projects are in many ways insane, I thought it might be a good project to tackle.
Understanding file formats, collecting documentation about them and providing code related to them is one of the greatest and most difficult problems, simply because of the large wealth of sources available and the fleeting nature of so much technical documentation, especially once the underlying technology becomes obsolete. Again, people were working on this, but they were running into hard limits left and right.
What I proposed with this project was to create a general common space not under the purview of any specific group, and to allow many people, both within industry and outside of it, to track down and classify file formats. In this way, the information could be absorbed into the other registries, and anyone with expertise could expand an entry with very little friction. In other words, it has many of the properties of a wiki.
Taking this further, I thought that this file format problem represented one of many other “insurmountable” problems that exist in the world, desperately in need of focused volunteer effort and a rule that the results be made available to all. I also knew that a volunteer project can really slow down and lose energy if it has no specific time frame.
Combining the two, I decided a really great idea would be a “Just Solve The Problem” Month, giving 30 days of the focused collaboration that energizes any wiki and providing a good foundation for continued improvement.
The first Just Solve The Problem Month was in November 2012. I am still deciding if there’ll be another one. The File Formats wiki can be found here.
Jefferson: What were some of the challenges, successes, and unexpected surprises of launching a user-driven, almost freeform approach to documenting file format types?
Jason: Creating the first example of an ongoing series is always a challenge, because people don’t understand what the series is or how the first example fits into it. It’s kind of like a television pilot: we might not understand that there’s a large story arc, or that the pilot merely exists to introduce everyone and that the second and later episodes might not work the same way. I found people who thought we would attack the same problem every year, and others who understood we were to attack a different problem each year, leaving a bunch of projects in its wake. Looking back, I certainly would’ve called this a one-off event, and then had endless sequels.
One major surprise, which truly caught me off-guard: I had considered this project a freeform, open opportunity for archivists, librarians and others not to be inhibited by the structure of the organizations they normally collaborate with or work for. And from the complaints, demands for clarification and distaste for the open, freeform approach, I discovered that a percentage of volunteers obviously enjoyed and depended on the structure of their positions. Not everyone was like that, of course. But enough were that it really, really surprised me. Some folks walked away from the project immediately upon discovering that there wasn’t a standards body or reference document, for instance. Others were majorly turned off by some of the ad hoc classifications that had been added, saying that they were a diversion of energy and not needed. Many, of course, were excited to break new ground, and did amazing work.
I intentionally put crazy file formats in the collection, including DNA, piano rolls, human language, and looms. I wanted people to understand that we weren’t simply doing one type of file, and that putting an expansive definition on what a file meant and what a format meant would give us more leeway for contributions from around the world. That said, I also knew that we would be dedicated to better and better classification, so that somebody who did not care about mechanical formats or organic formats could go immediately to the computer-based or application-based formats and the information they needed.
As in many wiki projects, a number of people stepped forward who brought major energy during the month. I don’t want to call their names out, in case I miss one or classify one wrong. But the editing history of the wiki shows the handful of folks who tirelessly added formats and links to information, some of whom continue to add new information every single day. These people are angels, and the world is better for them.
Ultimately, I consider the project a success, and its continued growth and modification will make it a classic reference body.
Jefferson: While The Signal focuses largely on the preservation of digital materials, we are often reminded how much of the history of digital technologies and digital culture exists in print form. I’ve been impressed by your efforts to preserve both the populist, consumer-level cultural materials around computers and much of, as you say, the “manuals, notes, booklets, ephemera” related to hardware and software. Tell us about The Computer Magazine Collection and The Bitsavers Collection.
Jason: Computer hardware and software is nothing without its documentation. You might be able to stumble around, get some things running, and make good guesswork or even quality guesswork as to how it functions, but without the documentation you will always suffer from not knowing how everything works or why. You certainly won’t know how the original creators intended the machine to be used, or intended the software to be run, and the urge to just give up on a project because you don’t have information on it will always increase. By making manuals and documentation available, the world wins, even if we don’t have the original hardware or software at the moment.
Similarly, a lot of other technical and historical information is buried in the pages of computer magazines, newsletters, and flyers written around the time of the computer hardware’s glory days. Besides articles, there were also advertisements, type-in programs, and reviews that gave critical information to understanding what role these computers and software played. While it’s possible to get by with just the hardware, software or documentation, nothing beats finding out what writers of the time considered to be the important points and what was driving the industry.
One of the outside groups that has been working tirelessly for over a decade to digitize documentation, pamphlets and other written materials is Bitsavers. The group is at bitsavers.org, and besides documentation, they have also captured the original bits off of magnetic tapes and disks. I wrote a mechanism that automatically mirrors their contents on archive.org for easier reading and sorting. But the credit definitely goes to that group for their tireless efforts in bringing once-lost material online.
Jefferson: Preserving that documentary evidence of how computers worked their way into our homes and our lives seems vitally important to understanding our social attitudes towards how we create, interact with, and ultimately preserve digital artifacts. In working with these collections, what novel insights into, or new understanding of, our relationship with digital technologies did you gain?
Jason: Technology industries are often quicker to adapt to changing needs or requests than other industries. In those pages, you can sometimes see reactions to strange new features that later became absolute requirements, or that found analogues in the mobile world. Certainly in the 1970s, magazines and journals approached technology from all sorts of perspectives, treating it as one general idea: puzzle articles sat next to electronics columns and software overviews, all considered part of owning a computer. Through the 1980s and later, these general-purpose computing magazines split off into highly specialized periodicals, offering much more in-depth coverage of their subjects, but losing the sense that human beings just thought of computers as computers. We lost something in that, but we have gained a lot of other things.
Jefferson: Much of this work is part of your self-described Charge of the Scan Brigade and your ongoing work with the Internet Archive. But, rogue archivist that you are, you have other non-IA projects focused on documenting digital history. Tell us about those.
Jason: I often take possession of computer artifacts, such as magazines, machines and software, some of which I arrange to have transported to full professional archives elsewhere. So being a clearinghouse definitely takes some of my time.
Professional speaking on the subject happens occasionally, and is always a fun time.
Jefferson: In your previous interview, you ended with some advice for individuals facing data loss and for institutions looking to collaborate with projects like yours. I was hoping this time you could give advice specifically to archivists, curators, historians, and the preservation-minded that are hoping to preserve and make accessible both digital content and the physical collections of computer history. How would you advise these sorts of professionals to be more rogue?
Jason: Computer and technology history still appears to be a strangely fringe subject for many archives, yet much of the correspondence and other information related to all subjects is moving to computers. Getting archives and libraries to realize that situation is an important first step. Along with that realization will hopefully come funding for efforts around preserving computer data and keeping multiple digital backups available across organizations. I would like to think of a world where various libraries mirror each other’s data in return for having a secondary off-site backup. As for the preservation projects themselves, the fact is that it will not be as easy as walking into a space and pulling away a pile of books or artwork or letters. It’ll be a case of being handed a laptop, or being given access to a series of Internet services like Gmail or Dropbox.
Much like how it was with home computers in the early 1980s, I would like to hope that various archivists are taking the initiative within their groups, and becoming knowledgeable enough to pass on what they’ve learned to the others.
The fact is that web-based materials are becoming the dominant form of many types of archives in libraries, and getting ahead of that curve, or at least catching up with it, should be the top priority.
The following is a guest post by Carl Fleischhauer, a Digital Initiatives Project Manager in NDIIPP.
In 2003, we began drafting descriptions of digital formats, intended to support the Library’s preservation planning. Knowing that our descriptions would be of general interest, and wishing to work cooperatively with emerging format registries (e.g., the Unified Digital Format Registry), we soon began posting our descriptions online. Today the offering includes descriptions of 308 formats and subformats.
In our nine years of operation, we have been gratified by the interest shown by other organizations and individuals. We get mentioned as a source here and there, for example in the format preservation page at the Binghamton University Libraries Web site, in various pages in the Archivematica Wiki and, more recently, in another Wiki from the energetic Let’s Solve the File Format Problem! project. From time to time, various writers have cited our analytic framework, for example in this article by the professional photographer Jeff Schewe.
At first, we created our descriptions in HyperText Markup Language. In 2007, we began to move toward eXtensible Markup Language as the drafting format. We planned to treat the XML versions as master copies and to produce the online HTML files via an XSLT transformation (Extensible Stylesheet Language Transformation, a kind of script that reformats marked-up text or data). We started by converting our existing HTML into XML in a semi-automated process that, nevertheless, required a lot of hand editing. I blogged about the conversion process in March 2012: Formatting the Formats Pages.
Why bother with XML? As I wrote in 2012, XML markup can describe the different pieces of information using “tags” that convey the meaning of each chunk of text. Thus XML files can support a broader set of uses for the underlying data than HTML pages. For example, an interested organization could take the XML versions of our documents and apply an XSLT that recognizes the tags that are meaningful to the organization and use them to extract selected segments. A format registry like the UDFR could extract particular elements from our dataset to supplement their own format-specific data.
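To make that reuse scenario concrete, a downstream registry could load one of the XML descriptions and pull out only the fields it cares about. The sketch below does this with Python’s standard ElementTree library; the element names and file name shown are placeholders rather than the actual tags, which are defined in the published schemas.

```python
# Minimal sketch of extracting selected fields from an XML format description.
# The element names ("identifier", "fullName") and the file name are placeholders;
# the real tag names and namespaces are defined in the published XSD schemas.
import xml.etree.ElementTree as ET

def extract_fields(xml_path, wanted=("identifier", "fullName")):
    """Return the text of the first occurrence of each wanted element."""
    root = ET.parse(xml_path).getroot()
    result = {}
    for tag in wanted:
        element = root.find(f".//{tag}")
        if element is not None:
            result[tag] = (element.text or "").strip()
    return result

if __name__ == "__main__":
    print(extract_fields("format_description.xml"))  # hypothetical file name
```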
By the end of 2012, we had converted all the old descriptions to XML and had begun creating new ones in that format. With the help of our expert consultant Ignacio Garcia del Campo, we also refined the pair of XML Schema Definition files (.xsd extension) that we use. The refined versions carry the version number 1.0. There is a primary schema that uses an xsd:include declaration to reference a subsidiary schema that handles HTML styling within the longer text fields.
Now, in 2013, we are pleased to announce the availability of the XML versions as well as public access to the pair of XML schemas. We have added an introductory page to the site that provides links to the various resources, including a pointer to the ZIP file that contains the full set of XML versions of the Format Description Documents. There is also an instruction for those who seek a single instance and not the whole set.
We hope these XML versions will be useful to others. We are always eager to receive comments from our users. Send a note to help us correct errors or to suggest formats that we should describe. Although this activity is not a full-time job for any of us, we will do what we can to respond.
This is a guest post by Ingrid Jernudd, a volunteer with NDIIPP.
For the past week, I’ve been working on creating video tutorials for personal digital archiving and I must say – creating these videos is quite fun!
With the various types of video editing software, I find the process to be relatively intuitive. I’m focusing now on a tutorial for archiving emails, and doing my best to make the information easy to follow, while also working in entertaining aspects where I can.
The reason I have been doing all this, is because I am creating tutorial videos for you! As I mentioned in my previous blog post, part of what I am working on during my time volunteering with NDIIPP at the Library of Congress is creating tutorial videos on how to carry out certain types of digitizing and archiving of information. By the end of the summer I aim to get two videos done as part of a new series on personal digital archiving.
The video I’m currently working on is a basic tutorial on how to archive emails. The second video will be on how to scan documents for personal archiving purposes. These will eventually be available on the NDIIPP personal archiving pages. There is already basic information available there in the form of written instructions on how to do these two things. The email guidance is available via this document. See this one if you want to scan a significant or meaningful document, or the most recent picture that’s found a home on your refrigerator door, drawn by your kid or by you (if you are a kid, that is). The aim in creating the videos is to further clarify the process, and tutorial videos can be easier to follow than written instructions.
So, for those of you who do prefer to learn via video, I am almost done with the email-archiving tutorial. However, it does have to go through some rounds of editing for both the video and the audio, and making sure my (lovely, if I do say so) voice-over syncs with the visual component of the tutorial is surprisingly time intensive. The screen-recording aspect of the video means that, as opposed to following written instructions, you can see directly what you have to do on your laptop or other device in order to archive your email. Essentially, it adds a new dimension to step-by-step written instructions.
Both of these videos will be coming out shortly, so stay tuned for their release!
In June I did a post highlighting segments of the digital stewardship universe that could use applied research attention. I looked at the “what” of email archiving here and the “how” of email archiving here and now I turn my attention to format migration.
The need to migrate file formats arises out of concerns about format obsolescence. As I mentioned in my original post, there are ongoing discussions about how acute the format obsolescence problem might be, but for the purpose of this post we’re going to assume that migration is a possible solution to digital stewardship challenges and concentrate on useful resources that support the activity.
In my original post I proposed a series of largely technical questions that a researcher might ask regarding format migration, mostly about what happens to files and the information they contain in a migration process. This time around we’ll look at the infrastructure needed to do format migration and in a future post we’ll look at the results of a few migration experiments.
The first pieces of the infrastructure are the format registries. Format registries, such as PRONOM, developed by the UK National Archives, and the Unified Digital Format Registry, developed by the University of California, provide detailed documentation about data file formats and their supporting software products. The format registries are important because we need to know as much as possible about the documented state of a format before we can understand what changes take place in a transformation.
[And while it's not a format registry, the Library's Sustainability of Digital Formats site has a lot of useful information in this area.]
The next pieces are tools that draw on the registry information to support automated identification of file formats. Some interesting tools include FIDO, the Format Identification for Digital Objects Python command-line tool; the DROID Digital Record Object Identification tool; and JHOVE and JHOVE2. Each of these tools supports file format identification, validation and characterization to varying degrees, though I’m not qualified to discuss their significant differences (I’ll let the developers point them out in the comments!).
They’re all similarly interesting for our purposes in that they allow the “identification” process to be incorporated into automated workflows along with a suite of other identification/characterization/migration/evaluation tools.
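Each of those tools carries its own signature database and matching logic. As a rough illustration only, and not code drawn from any of those projects, the toy sketch below shows the basic idea they automate: matching a file’s leading “magic” bytes against known signatures.

```python
# Toy illustration of signature-based format identification -- the core idea
# behind tools like DROID and FIDO, not their actual code. The signatures here
# are a tiny, hand-picked subset of well-known magic numbers.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xff\xd8\xff": "JPEG image",
    b"PK\x03\x04": "ZIP container (also DOCX, EPUB, ...)",
}

def identify(path):
    """Return a rough format guess based on the file's first bytes."""
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, label in SIGNATURES.items():
        if header.startswith(magic):
            return label
    return "unknown (would need deeper characterization)"

if __name__ == "__main__":
    for name in ("report.pdf", "scan.png"):  # hypothetical files
        print(name, "->", identify(name))
```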
The next thing you need is files to migrate. I’m sure you’ve got plenty of your own, but if you’re working at scale you may want to access large corpora of data such as that provided by BioMed Central. The Planets testbed was a very effective research environment hosted by the European Planets project to facilitate practical experimentation in digital stewardship and to enable users to repeat experiments in order to validate the results, but I’m still trying to clarify its current status. The successor to Planets, the Open Planets Foundation, does maintain a Formats Corpus.
On a side note, the National Software Reference Library has a research computing environment containing some 18,000,000 unique original files, along with a database containing metadata about the files. They do allow researchers to run an algorithm against the file collection by submitting a job (in code form) to the NSRL, which runs it on the researcher’s behalf.
Last but not least you need software tools to do the migrations. Here is where it starts to get complicated. A great place to start is the work being done by SCAPE, the SCAlable Preservation Environments project funded by the European Union and coordinated by the Austrian Institute of Technology. They’ve authored a report that looks at what they call “preservation action tools” developed under the Planets, CRiB and RODA projects. The paper introduces models for assessing the appropriateness of any particular piece of software for preservation migration purposes.
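Pulling those pieces together, a single migration step might look like the minimal sketch below: convert a file, then record what was done. It uses the third-party Pillow library for an image conversion purely as a stand-in; this is not drawn from SCAPE, CRiB or RODA, and a production workflow would add validation of the output and checks on significant properties.

```python
# Minimal sketch of one migration step plus a simple record of the action taken.
# Pillow ("pip install Pillow") stands in for a real preservation action tool;
# file paths are hypothetical and the output is not validated here.
import json
from datetime import datetime, timezone
from PIL import Image

def migrate_tiff_to_png(source, target):
    """Convert a TIFF to PNG and return a small provenance record."""
    with Image.open(source) as img:
        img.save(target, format="PNG")
    return {
        "source": source,
        "target": target,
        "action": "migration TIFF -> PNG (Pillow)",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

if __name__ == "__main__":
    record = migrate_tiff_to_png("master_0001.tif", "master_0001.png")  # hypothetical
    print(json.dumps(record, indent=2))
```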
Another useful site is the Conversion Software Registry maintained by the Image and Spatial Data Analysis Group at the University of Illinois at Urbana-Champaign National Center for Supercomputing Applications. The registry is a repository of information about software packages that are capable of file format conversions, particularly tools to help identify conversion paths between formats.
There are proprietary tools already used in some domains (such as the geospatial community) that support the mass transformation of data across multiple formats, but they’re designed more to support the movement of data between databases and applications. It’s not clear to what degree (if any) they’ve considered preservation as a significant use for their tools, but it’s an area for future exploration.
In a future post we’ll take a closer look at the outputs from some migration efforts. Feel free to identify experiments or other migration tools and services in the comments.