Most of the conversations I end up in about digital preservation are about the digital versions of analog things. Discussions of documents, still and moving images and audio recordings are important, but as difficult as the problems surrounding these kinds of digital objects are, there is a harder problem: preserving executable content, aka software. Software isn’t simply what we use to render content – it’s an important form of creative expression, a cultural artifact, an important commodity and an entity that is increasingly enmeshed in our economic, political and social systems.
I thought I would start a quick list here of what I think are some nice reads on preserving software. Some of these are posts from our blog, but most are papers and reports that I think do a nice job getting into some of the issues those interested in preserving software face and some of the ways folks are going about preserving it.
Please consider adding to and reacting to these:
The Life-Saving Software Reference Library: This interview I did with Doug White from NIST goes into considerable detail on the structure and design of NIST’s software library, which he describes as a library of software, a database of metadata, a NIST publication and a research environment. Here is a bit of how Doug explained it: “The research environment allows NSRL to collaborate with researchers who wish to access the contents of the virtual library. Researchers may perform tasks on the NSRL isolated network that involve access to the copies of media, to individual files, or to “snapshots” of software installations. In addition to the media copies, NSRL has compiled a corpus of the 25,000,000 unique files found on the media, and examples of software installation and execution in virtual machines.”
The Geeks Who Saved Prince of Persia’s Source Code From Digital Death: This is the most fun story of any of those in this list. Be sure to follow the dramatic events as the original source code for the Apple II version of Prince of Persia makes its way off its original media and up onto GitHub.
Toward a Library of Virtual Machines: Insights interview with Vasanth Bala and Mahadev Satyanarayanan: This interview goes into some depth on the design of the Olive Library project. One quote is particularly salient on the potential importance of software preservation: “as all fields of scientific investigations rely on complex simulation and visualization software, the ability to archive these software artifacts in executable form becomes essential for reproducibility of scientific results. Software preservation also enables long term data preservation. Today’s data formats may become obsolete tomorrow, unless the software applications that process those formats are also preserved”
Emulation: From Digital Artefact to Remotely Rendered Environments: Dirk von Suchodoletz, Jeffrey van der Hoeven 2009; While not directly focused on software preservation, a section of the paper focuses on some of the needs for, and various problems in, constituting software archives. Here is a valuable quote: “the original software also needs to be preserved if digital objects are to be kept alive via emulation. Guidelines similar to those created for digital objects themselves must be brought to bear in order to safeguard emulators, operating systems, applications and utilities. That is, software should be stored under the same conditions as other digital objects by preserving them in a OAIS-based (ISO 14721:2003) digital archive.”
Preserving Virtual Worlds Final Report: McDonough, J., Olendorf, R., Kirschenbaum, M., Kraus, K., Reside, D., Donahue, R., Phelps, A., Egert, C., Lowood, H., & Rojo, S. (2010). At 187 pages, this is more of a book than an essay, but it’s full of valuable exploration and discussion of the various issues, problems, and opportunities around preserving video games.
The Attic & the Parlor: Notes from a Workshop on Software Collection, Preservation & Access: The Computer History Museum’s Software Preservation Group hosted what looked to be a fascinating workshop in 2006. You can find the proceedings and presentations online, and their wiki also includes a rather extensive directory of software collections. The Attic & Parlor notion in the title focuses on a distinction between highly curated collections and sprawling “gather it all up” collections. This, like the Preserving Virtual Worlds report, focuses on the value of collecting source code.
What should we collect to preserve the history of software? Shustek, L. (2006). IEEE Annals of the History of Computing, 28(4), 112 – 111. Another strong argument for preserving source code. “I argue that unless we collect, preserve, and interpret the software code in addition to the related artifacts, we have discarded the software’s intellectual essence. Emphasizing collateral materials puts the focus on the history of products and downplays the development of the scientific and engineering accomplishments that underlie them.”
Preserving Software: Why and How, John G. Zabolitzky, Iterations: An Interdisciplinary Journal of Software History 1 (September 13, 2002): 1-8. Zabolitzky makes an impassioned argument for urgent action on software preservation and, like Shustek, appeals for the preservation of original source code. “the evolution of software methods, techniques, styles, etc., is described in many books and articles. However, all of that is essentially hearsay: what actually has been done (and what may be different from what the active players in this area may report since they might have wished to do something different) can only be discerned and proven by examining the source code. The source code of any piece of software is the only original, the only artifact containing the full information. Everything else is an inferior copy.”
What essential readings would you add to a list like this? Please consider taking a moment to add them in the comments. Also, feel free to use this comment thread as a place to discuss the various ideas and approaches advocated for in these readings.
So, how far along are we with cloning? Because I could have really used a clone or two in order to cover the many (sometimes concurrent) interesting sessions at this year’s Museum Computer Network conference in Seattle. Since this was my first MCN, I’m probably looking at this with more of a beginner’s “gee whiz” outlook, but the presentations were not only interesting and relevant, they were well presented. And, it was nice to see a community that was so enthusiastic and supportive of the presenters.
I can’t do the conference justice in this short blog post (and, I’m on a deadline!) so I’ll just point out some of the highlights from the sessions I did attend. Thanks to all the tweets coming out of #mcn2012, which help to fill in the gaps and provide context. In general, that’s my favorite use of twitter anyway – a good way to create some group conference notes. (For reference, I’ve provided links to the relevant hashtags for the sessions below.)
Ignite Talks (#IgniteMCN)
This was something a bit different than the usual “lightning talks” from other conferences I’ve seen. There were nine presenters who used 20 slides in 5 minutes to make their case – certainly, “lightning” enough. This was a good introduction to the kind of creativity that was to be on display all week. All these talks (covering topics such as open authority, museum education, etc, but with interesting philosophic angles) were stimulating and thought provoking. And, it was held at the EMP Museum – how cool is that?
The most unusual talk, and a first for me at any conference, was maybe not so much a talk as a performance from the Smithsonian’s dynamic Michael Edson entitled “Jack the Museum”, done in poetry slam style. Yes, you read that right. Here’s an excerpt:
“Network action makes old school broadcast reaction a distraction to this powerful new faction: 6 billion people, connected, on the web.” (That’s just a taste – for more, here’s the whole brilliant thing). As @museum_mash noted on twitter, “Instant classic.”
Keynote Talk (#mcn2012key)
You think YOU’VE got big data? Microsoft’s Curtis Wong got the main conference off to a great start with his Keynote presentation, “Breaking out of the Box – Interactive Video and the Transformation of Storytelling”. Wong says he always wanted to be a museum person, and to bring storytelling and interactivity together – and to illustrate, he demonstrated some amazing tools in which he did just that. His project, the World Wide Telescope, enables nothing less than tours of the universe using high resolution images. He also demonstrated Chronozoom, an interactive timeline of history, going all the way back to the big bang!
Wong also described the stages of what he calls the “information architecture of learning” – first, engagement, then build a mental model, and then validate that model. He said his ultimate goal was to make things easier to use, and, reusable. A sample tweet, from @simontanner: “I think Curtis Wong’s keynote shows benefits of high bandwidth access just when I thought mobile access wld dominate.”
Tales from the Blog (#mcn2012tale)
Summed up, the reports of blogging’s demise have been greatly exaggerated. In this lively session, panelists gave first person accounts of their rationale for starting or contributing to museum community blogs. Panelist Ed Rodley, noting a common reluctance to put it all out there, summed up his philosophy – “feel the fear and do it anyway”. (Thanks, glad to know I’m not the only one!) Another panelist, Mike Murawski, says blogs should really be less “look how great I am” and more about testing ideas and theories. (For a list of blogs from this community, many are noted at #mcn2012tale on twitter.)
All those 1’s and 0’s (#mcn2012dams)
This panel session was focused on standards for large digital media files, and started out with a bit of nostalgia – revisiting floppies, zip drives and other media from days of old. This was of course a way to illustrate the growth of data files to what we have today, and what we may end up with in the future. Mainly, it raised the relevant questions to frame a discussion, such as:
Other problems, such as low bandwidth and limited staff to tackle all this, only add to the challenges. This kind of discussion indicates the museum community is indeed thinking about digital preservation, and starting a good dialog to help further these solutions.
Google Art Project on Trial (#mcn2012goog)
Again, the MCN folks came up with an interesting way to have a discussion. For this session, the Google Art Project was the focus of a mock trial – with panelists volunteering to serve as either defendants or prosecutors, with Michael Edson as moderator and the audience as jury. And to keep things interesting, Piotr Adamczyk of the Google Art Project and Google Cultural Institute was also there (winning the Good Sport award for the conference!). Many issues were raised in this lively discussion; here’s a sampling of the issues presented on both sides:
And on it goes (and will go). This was a valuable discussion to help frame this project within the museum community.
This is just a sampling of the conference content. Other great sessions I attended included Preserving Digital Art; Value, Sustainability and Disruptive Technology; Preservation of Email; and the closing plenary, which was a nice event wrap-up, with highlights presented by way of small group discussions. Of course, since I’m involved in the digital preservation program here at the Library of Congress, it was great to see that subject represented, not only as the focus of some sessions, but occasionally in general conversations. It was indeed a topic of interest at the conference. So, for all those in the museum community eager for more information on this, in addition to the above website, the National Digital Stewardship Alliance is another resource – membership is free, and provides easy access to discussion with others who are grappling with the same digital preservation challenges.
One of the great things about MCN 2012 was that they provided live webcasts of selected sessions for those who couldn’t attend. In addition, all the sessions were filmed for viewing later on, and will be available on the MCN Youtube channel in the near future. I will need this, too, because there was so much, and now it’s all a blur.
In the meantime, any other thoughts from MCN goers???
Wouldn’t it be great to have a single technical solution that solves all your long-term digital archiving, stewardship and preservation needs? Perhaps a file format with millions of users, widespread adoption across different computing platforms, free viewers and open documentation?
A lot of hopes and dreams have been poured into the idea of “one preservation tool to rule them all,” and many people, both inside and outside of the preservation community, have come to think of the “archival” version of the widely used Portable Document Format as this single solution.
However, a close examination of the tool shows that while it’s useful and valuable for many things, it’s not the only answer for long-term archiving and preservation. This can’t be stated often enough, especially as awareness grows around the October 2012 release of the latest version of the PDF/A specification.
The specification, which goes by the cumbersome name of Document management — Electronic document file format for long-term preservation — Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3) (or ISO 19005-3:2012 for short), defines a file format based on PDF which provides a mechanism for representing electronic documents in a manner that preserves their static visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files. “Static visual appearance” ultimately means that conforming PDF/A files are complete in themselves and use no external references or non-PDF data.
But the scope of the PDF format has significantly expanded since a variety of organizations first met in October 2002 to begin work on the archival version of the specification. In 2011 PDF/A-2 brought the specification into concordance with the international standardization of PDF itself, and PDF/A-3 now addresses expanding business concerns in addition to the specification’s original strict preservation orientation defined largely by the cultural heritage community.
PDF/A-3 makes only a single, fairly monumental change. In the PDF/A-2 specification users were allowed to embed files, but only PDF/A files. PDF/A-3 now allows the embedding of any arbitrary file format, including XML, CSV, CAD, images and any others.
At first glance this sounds like a gigantic betrayal of everything that the format has stood for. Why define a subset of PDF attributes to ensure the long-term comprehension of the file if you’re going to turn around and allow the kitchen sink to be embedded within it? (You can follow some of the original discussion of this change here.)
The answer is that a wider business community, beyond the traditional archiving and cultural heritage sectors, pushed hard for it. The good news is that the addition of this feature to the specification will open up new application areas without seriously threatening the scope and intent of previous versions.
In the United States the corporate interest in PDF is led by the pharmaceutical, banking and financial sectors. As these industries already use PDF heavily, it makes sense for them to try and extend the PDF/A specification and leverage it for their own purposes.
The pharmaceutical sector, for example, has the challenge of managing a multitude of documents over long timeframes in the process of submitting their work to the FDA for approval. For their legal protection they also need to retain and archive these documents for the long-term, a natural benefit of PDF/A. Why not create a new version of the specification that would allow the multitude of documents to travel together in a single package?
In theory, this creates external dependency challenges in the newly created PDF/A-3 documents. But the specification makes the PDF/A-3 document a “dumb” container that prohibits “actionable” access to the embedded files. The embedded files should not be required in any way to comprehend the information in the PDF/A-3 document and are supplied merely as support to the information already in the document.
The significant language is in section 6.8, “Embedded files”:
Although embedded files that do not comply with any part of this International Standard should not be rendered by a conforming reader, a conforming interactive reader should enable the extraction of any embedded file. The conforming interactive reader should also require an explicit user action to initiate the process
For example, you might embed a word processing document that is the converted source of the PDF/A-3 document, or a spreadsheet file that is represented in the PDF/A-3 document by an image or a form. A PDF/A “conforming reader” (a software tool that renders a PDF/A-3 document reliably according to the rules of the specification) should not activate the embedded files but should enable the files to be extracted to another location for viewing, if the user has the proper tool to engage with that type of file.
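To make the embedding mechanism a bit more concrete, here is a minimal sketch using the open source pypdf library (the file names are placeholders I made up). Note that this only demonstrates attaching a file to an ordinary PDF; producing a truly conformant PDF/A-3 document also requires the XMP metadata, embedded fonts, output intent and file relationship entries that the specification defines, which this sketch does not attempt.

```python
# Minimal sketch with pypdf: attach the source spreadsheet to an existing PDF.
# This illustrates the embedded-file mechanism only; it does NOT by itself
# produce a conformant PDF/A-3 file.
from pypdf import PdfReader, PdfWriter

reader = PdfReader("report.pdf")              # placeholder file names
writer = PdfWriter()
writer.append_pages_from_reader(reader)       # copy the pages over

with open("source-data.csv", "rb") as src:
    writer.add_attachment("source-data.csv", src.read())   # embed the source file

with open("report-with-attachment.pdf", "wb") as out:
    writer.write(out)
```

A conforming PDF/A-3 reader would then offer to extract source-data.csv to another location rather than act on it in place.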
Of course, a big assumption behind this change to the specification is that PDF documents are suitable universal package formats for all kinds of data. While this does fit into the established workflows of many communities, the idea has been met with skepticism in the preservation community.
PDF/A as an archival format isn’t broken with the introduction of PDF/A-3. The allowance of embedded files doesn’t make the preserving institution responsible for keeping the embedded files comprehensible over time, and their inclusion shouldn’t affect the informational content of the document in any way.
As we all get smarter and technology improves, the acute concerns about format obsolescence may diminish and we will likely welcome the fact that source materials have been stored in PDF/A-3 documents. This change is significant, but before we discount the format altogether let’s explore what it means in practice and see how we can use this change to the advantage of the long-term stewardship community.
In honor of this week’s Museum Computer Network conference, I want to talk a bit about the early history of museum computing.
Most people are not aware that MCN was born out of a cooperative computing project in the New York City area in 1967, under the direction of Dr. Jack Heller. Fifteen New York-area museums joined forces to explore ways that an electronic index of the Metropolitan Museum’s collections could be used beyond the Met. With funding from the New York Council of the Arts and the Old Dominion Foundation, the consortium formed the Museum Computer Network to create a prototype system for a shared museum “data-bank.” Dr. Heller’s work resulted in a system called GRIPHOS (General Retrieval and Information Processor for Humanities Oriented Studies), which was based on a data dictionary that could accommodate the diverse institutions participating in the project: a tagged record format that allowed for the description of individual objects with separate, linked records for artist biographical information and for reference citations.
The first MCN conference was in 1968, and the group – now an international membership organization – is going strong today, some 40 years after its incorporation in 1972 under David Vance, its first president. I attended my first MCN meeting in 1988, and the organization was my professional home for many years. I served on its board from 1992 to 1999.
Around the same time, in 1965, the Smithsonian Institution’s National Museum of Natural History was developing the SELGEM (Self Generating Master) system, which it shared with UC Berkeley, the Lowe Museum at the University of Florida, and the Oklahoma Inventory of Ethnological Collections.
The museum community has long been an innovator in the use of technology. In 1978, Robert Chenhall published his Nomenclature for Museum Cataloging, which was geared toward the sort of authority control needed for electronic resource metadata and discovery.
In 1979, the Detroit Institute of Arts developed DARIS (Detroit Art Registration System). By 1982, 12 organizations across Michigan were using DARIS.
1979 was the same year that the Museum of Fine Arts, Boston, distributed its first videodisc of 2,000 collection images. In 1986, I was working on a videodisc project at the Fowler Museum at UCLA, where we captured images of ethnographic collections via video camera onto videotape, and mastered them onto videodiscs that could be pulled up through an interface using what was then Questor Systems’ Argus collection management system. We even had an early digital image printer that created prints not unlike Polaroid prints (am I dating myself by thinking of them that way?).
In 1982, Canada launched its National Inventory Programme through CHIN (the Canadian Heritage Information Network), to evaluate and provide museum computing expertise to museums across Canada, emphasizing shared efforts.
In 1989, MCN launched the Computer Interchange of Museum Information (CIMI) project, which published its CIMI Standards Framework for interchange standards that should be used by different museum applications to transfer data independent of their hardware, software or network vendor.
Museums were some of the first organizations on the web. In 1995, at the annual AAM meeting, a group of MCN members created the first MCN web site while working at the MCN booth — we wrote the HTML by hand and used images from an early consumer digital camera. I was one of those original site creators, and I helped maintain the MCN directory of museums online until 2000. Versions of the MCN site dating back as far as January 1998 are available through the Internet Archive Wayback Machine.
Collaboration and the use of shared standards and technologies are not new to our community. We may not have been as focused on preservation in those early days, but we were highly focused on data sharing and interoperability, something which has not changed today. Let’s not forget that our focus on collaboration has borne fruit over many decades of effort.
I was fortunate to have the opportunity to talk (via email) with Paul Wheatley, of the SPRUCE Project, about an assortment of activities, issues and ideas relating to digital preservation. Leeds University Library is leading the Sustainable PReservation Using Community Engagement project, collaborating with the British Library, the Digital Preservation Coalition, the London School of Economics and the Open Planets Foundation. Our conversation is below.
Bill: The SPRUCE Project is quite an innovative undertaking that is tackling a number of big issues. Can you give us a quick overview of the project, timeline and objectives?
Paul: We’re working to support digital preservation in the UK from the ground up. So we’re aiming to support organizations in taking some initial steps in practical preservation of their data and then finding a way of making it sustainable. We’re applying a strongly community focused approach. SPRUCE is there primarily to encourage and shape the interactions. Most of the experience and expertise is already out there, it’s just a little isolated.
Our Mashup events are a key part of the project. We get practitioners to bring along samples of their digital collections, work with them to identify the digital preservation problems and then team them up with technical experts who can work with them to solve the challenges. In the process we exchange that existing expertise that we all have, and build the connections we need to keep those exchanges going. You can see the results here. We’re also making small funding awards available to help sustain and embed the outputs of the events. We’ve funded five projects so far, and there will be more in 2013. The final element of SPRUCE is focused on developing a business plan for digital preservation. All this has been made possible with generous funding from JISC.
Bill: I am really impressed with the SPRUCE work on business plans. Do you see this as tied to the need to “articulate a compelling value proposition” as called for in the Blue Ribbon Task Force on Sustainable Digital Preservation and Access? Are there other considerations at play?
Paul: That’s certainly a big part of it. I think we’re all familiar with the core problems. Digital preservation is a long term thing in a world obsessed with the short term. It doesn’t sound particularly exciting and it therefore doesn’t pull in the resources that we know it deserves. The practitioners we work with in our Mashups are very clear about the difficulty they have in making the case to their institutions to fund their work adequately. So we’re aiming to develop a resource that will help the preservationistas on the ground get the money they need to do their job well.
Bill: Have you had the opportunity to draw any preliminary conclusions from your work with business plans? To what extent do you think academic and cultural heritage institutions are prepared to undertake this approach?
Paul: We’re still collecting the raw materials for this work, in part through our Mashup events where we take our practitioners through some key business plan building exercises and capture what they come up with. The end result is still taking shape, but we’re aiming for a toolkit that provides the approaches, justification, raw material and detailed examples for building a business case and delivering it effectively. I’m hopeful we’ll also have an array of complementary bits and pieces that can help build the message. The Atlas of Digital Damages that we’ve been putting together (that’s “we” meaning the community – so nice to see!) is a great example. Thanks go to Barbara Sierman for the idea (and the great name) which seems to have really struck a chord.
Bill: How is the Crowd sourced Representation Information for Supporting Preservation effort going? I see you are seeking “information about file formats, data structures or relevant standards” and “information about tools that render or interpret digital objects.” All of this is clearly important, but is it possible to single out an element as especially critical?
Paul: It’s disappointing, but we’ve not seen the kind of initial uptake we were hoping for. We wanted to demonstrate that contributing to a community driven project could be really quick and really simple. You can chip in to cRIsp by submitting a URL in under 30 seconds. Even if it’s only a single symbolic submission, I think it’s a real statement for people in this field to put their hand up and say “yes I do want to help fix these challenges”.
We all know that we need to take on the file format registry problem if we’re to make any kind of real difference to the digital preservation challenges we’re facing. But in the last ten years of tackling this problem we’ve made very little progress. The registries we have created are virtually empty, and that remains a big digital preservation fail for this community. We will keep plugging away however, and Open Planets Foundation should have some interesting stuff to reveal for the Archive Team’s File Format Month! We’re also contributing to a 24 hour file format identification hackathon with our colleagues at Archivematica and CurateCamp. From the starting point of a little conversation on Twitter, this one has seen loads of interest from around the globe. We’re hoping it will be a great success!
Bill: SPRUCE “mashes up” a broad range of partners. What would you say are the major benefits from collaboration?
Paul: Getting the kind of expertise we have on SPRUCE from just one organization would be very difficult. When you bring together the right combination of partners and individuals there’s definitely a greater than the sum of the parts element to it. SPRUCE feels like a bit of a dream team with the British Library, Open Planets Foundation, Digital Preservation Coalition and our two academic partners: LSE and University of Leeds! The real stars however are our Mashup participants. They do all the real preservation work on SPRUCE! We have a number of regulars who keep coming back for more. For example, Maurice de Rooij from the National Archives of the Netherlands has been fantastic, and finally won the participants’ award for best developer at our last Mashup.
Collaboration can add a new energy to proceedings, and a shared ownership of a problem. I talked extensively at iPRES on the duplication and poor communication that is sometimes prevalent in this field. We really need to operate more effectively as a community if we’re to make best use of the limited resources we have. I spend quite a bit of my time promoting these collaborative initiatives and encouraging a more open way of working.
Bill: Collaboration isn’t always easy; sometimes it’s referred to as “collabatition” (or worse). Can you describe any challenges or barriers that you have experienced?
Paul: Great question! Some of the best examples are probably a little too spicy to describe in a public forum, but suffice to say that collabatition isn’t wide of the mark. I’ve certainly used worse terms! The key to collaboration is trust. You’ve got to build a sound relationship with potential partners first, before progressing to more formal ties. The horror stories I’ve encountered previously tend to occur when organizations come together without the key individuals who will be doing the work together having built up any kind of relationship. It does sometimes surprise me how eager people can be to string a set of partners together for a funding bid, with no real idea of what those partners will be like to work with. Any collaboration comes with an overhead of communication and coordination, so you’ve got to give yourself a strong chance of forming a successful consortium otherwise it’s simply not worth the risk.
The LIFE-SHARE Project did some nice work in drawing together the lessons learned from cross institutional provision of a digital repository service, and is well worth a read for anyone looking to collaborate more widely.
Bill: SPRUCE does a fine job with outreach, communication and engagement. What do you think has been your most effective means of engaging the community? Have you gotten feedback that’s helped you target your work?
Paul: I’ll take that as a serious compliment from an initiative shortlisted for the DPC comms award (congrats on that)! I’m on the panel this year, but I’m afraid I’m sworn to secrecy otherwise I’d be tempted to pass on some insider info! The best way to do outreach is to have the job pretty much done before you start, and our event participants tend to be sold on the idea of what we’re trying to do before we let them out the door! Having those guys communicate the SPRUCE message makes a lot of difference. Otherwise it’s a case of hitting all the usual channels and trying to strike a chord with your audience.
Feedback is important and we push hard to get our event participants to blog and tweet their views, as well as tell us what could be better via an anonymous feedback survey we run. There are plenty of suggestions we’ve fed back into our event structure in order to perfect it as best we can.
Years on from first signing up, I’ve been really surprised at how important a communication tool Twitter has become for me. Whether it’s publicizing what I’m working on, or keeping in touch with developments elsewhere, I really depend on it. There can be a lot of noise, so my advice for new tweeters is to have a dedicated account for work and try and keep it all on message [editor's note: Paul's handle is @prwheatley].
Bill: I noticed a much-retweeted comment from iPRES 2012 that declared “Digital Preservation has seen two stages. The first stage was panic.” That seems a bit hyperbolic to me, but it does lead to reflection on trends over the last decade or so. Given your central perspective, how would you characterize the evolution of digital preservation/curation up to this point? What do you think are priorities for moving forward?
Paul: Steve Knight’s keynote at iPRES was not an overwhelmingly positive analysis of the last decade of DP, but I’m afraid I had to agree largely with what he said. As a community we’ve failed to tackle a lot of the most pressing problems and have frittered away development effort on unsupportable tools that solve the wrong challenges. Having been party to a few of those digital preservation crimes myself makes me all the more keen to learn the lessons and do a better job in moving forward. For me, that means strongly user led developments, re-use of existing technology wherever possible and an emphasis on an evidence based approach to our understanding of digital preservation. Our Mashup Manifesto captures some of those thoughts.
Bill: What comes next after SPRUCE?
Paul: I’ve worked in the DP field for the best part of a decade and a half and I suspect I’ll never be that far away from it, but our funding for SPRUCE runs out in a year’s time so I’ll be looking for a new gig then. I’d be very keen to continue working to support community initiatives in DP if an opportunity presents itself.
Learning by doing and benefiting from a community of practitioners are key aspects of our approach to meeting the challenge of digital preservation. The International Internet Preservation Consortium is an organization that must also focus on practical solutions and quick action. The web is a huge distributed resource and is changing constantly so it takes an active and global effort to preserve it. If you are from a library, archive, museum or other cultural heritage organization that is planning for or collecting web archives, consider joining the IIPC in 2013.
In 2003, eleven national libraries and the Internet Archive established the IIPC to develop common tools and standards for web archiving and to encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation. Today the consortium includes over 40 members, all willing to share best practices, develop tools and resources for the global cultural heritage community, and preserve web collections. Learn more about the IIPC members and why they archive the web.
The IIPC is unique both in terms of its focus and its membership. The IIPC targets the preservation of web sites as a specific domain of digital content. This intense attention to a very specific and pervasive distributor for digital content has created a greater capacity to successfully address challenges. This focus has also contributed to greater interoperability of collections through the development and adoption of common tools and standards.
The IIPC has fostered the establishment of a world-wide web archiving community. It is the primary resource for organizations that are just starting web archiving programs, and also a venue for organizations with mature web archiving programs that want to advance the field.
On a global scale, members of the IIPC are committed to saving contemporary knowledge, history and culture for the next generation. If your organization would like to participate on a local scale, review the IIPC membership benefits and obligations and apply by November 30, 2012.
Quick quiz: Is the employment outlook for librarians growing or shrinking? The answer depends on what you call a “library job.”
According to the Bureau of Labor Statistics, the job outlook for librarians is “slower than average,” with a projected rate of change in employment this decade of 7%, slower than the 14% average growth rate for all occupations.
This sounds bad! Who wants to join a profession where you need a Master’s degree and the projected rate of employment growth is half of the national average?
But dig a little further into the BLS description of a librarian and a picture starts to emerge. For example, some of the BLS librarian duties include:
BLS partners with a site called O*Net OnLine that provides a more detailed report on librarianship, including the tools and technology used in the occupation. According to O*Net, some of the tools of the library trade include cash registers, microfilm readers, photocopiers and public address systems and technologies such as email, spreadsheets and desktop publishing software.
Then take a look at their list of the top four tasks of librarians:
Are you getting the picture? The BLS description propounds a somewhat parochial view of what it means to be a librarian these days, and the sad truth is that the “traditional” library they describe is becoming rapidly endangered as government budgets come under intense scrutiny.
The problem is, the BLS view doesn’t describe too many of the librarians, archivists and museum professionals I know. Just for kicks, let’s compare the BLS librarian description to the job area of Computer and Information Systems Managers, which O*Net describes as having a “bright outlook” (projected to grow at a rate of 29% or more this decade):
Funny…that list looks a lot more like the job descriptions of the librarians I know!
Never was this worldview disconnect more apparent than when my colleague Erin Engle and I spoke at the Fedlink Fall Expo (PDF) back in October. We spoke at the “Forging a Digital Roadmap: The Preservation, Curation, and Stewardship Nexus” event, which was sponsored by the NewFeds and Preservation Working Groups.
In my keynote presentation (PDF) I proposed some possible areas for new federal librarians to pursue if they had an interest in technology (big data, digital humanities), assuming the necessity of pointing out these interesting opportunities in librarianship.
Little did I realize that the NewFeds panel of early-career government information professionals that followed would be full of people talking not just about possible opportunities but demonstrating the incredible technology-based work they are already doing.
The panelists included Robin Butterhof, a digital conversion specialist in the Serial And Government Publications Division of the Library of Congress who is working on the National Digital Newspaper Program; Bianca Crowley, a collections coordinator from the Biodiversity Heritage Library who described the challenges of making their content available across an international taxonomic community; Wanda Davila, who described a signal management research tool being developed by the Center for Devices and Radiological Health at the Food and Drug Administration (a tool to identify potentially dangerous food and drug issues out of massive amounts of unstructured data); and Piper Mullins, the program coordinator of the Pan-Smithsonian Cryo-Initiative (did you know that the Smithsonian collects frozen things?)
(A webcast of the entire event is available.)
Even though the panelists all self-identify as librarians, the type of work they do is somehow missing from the BLS librarian job descriptions. There are efforts happening all over the place to define what it means to be a librarian, but I still don’t see terms like “digital archivist” or “repository librarian” or “library digital infrastructure and technology coordinator” showing up in general descriptions of librarianship, even in well-meaning ones like the American Library Association’s (I don’t think the word “puppets” should appear in any librarian’s job description ever again).
Librarianship is an increasingly technology-focused profession and that’s only going to become more true in the future. There are still all kinds of stereotypes (or worse) that have to be dealt with, but if we don’t act quickly to define the new face of the profession, others will do it for us, and it won’t necessarily be in our favor.
So what are we going to do about it?
PBS Off Book has a nice short video on The Art of Glitch. It’s a fun story about a born-digital art phenomenon, but aside from that, I think it’s useful in helping us better understand the nature of digital objects. In the video, artist Scott Fitzgerald gives the following concise argument for the value of glitching, or breaking copies of digital files on purpose.
“Part of the process is empowering people to understand the tools and underlying structures, you know what is going on in the computer. As soon as you understand the system enough to know why you’re breaking it then you have a better understanding of what the tool was built for.”
I think we would all do well to develop a more visceral sense of what files exactly are, and I think some of his tactics for glitching can help with that.
A different way to read an MP3
Digital objects are encoded information. They are bits encoded on some sort of medium. We use various kinds of software to interact with and understand those bits. In the simplest terms software reads those bits and renders them. You can get a sense of how different software reads different objects by changing their file extensions and opening them with the wrong application.
For example, you can listen to this performance of the West Virginia Rag from Fiddle Tunes of the Old Frontier: The Henry Reed Collection. From that page you can download a .mp3 and .wav copy of the recording. Once you’ve done that, instead of opening and playing the files with a media player, try changing the file extension to .txt and then open the file up in your text editor of choice.
Below you can see an example of the kind of mess you can create by changing a file extension. My text editor has no idea what to do with a lot of the information in this mp3. The text editor software is attempting to read the bits in the file as alphabetical characters and it isn’t having a lot of success.
While it is a big mess, notice that you can read some text in there. Notice where it says “ID3” at the top, and where you can see some text about the object and information about the collection. What you are reading is embedded metadata, a bit of text that is written into the file. It is part of the file’s ID3 tags. We can read it in a text editor because the text editor can make sense of those particular arrangements of information as text.
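If you would rather not rename files by hand, a few lines of Python can show the same thing. Here is a minimal sketch, assuming the recording was saved as “westvirginia.mp3” (a placeholder name): it checks the first bytes of the file for the ID3v2 marker and then prints a chunk of the raw bytes roughly the way a text editor would try to render them.

```python
# Minimal sketch: peek at the raw bytes of an MP3 and look for the ID3v2 tag.
# "westvirginia.mp3" is a placeholder; substitute whatever file you downloaded.
with open("westvirginia.mp3", "rb") as f:
    header = f.read(10)          # an ID3v2 header is exactly 10 bytes long

if header[:3] == b"ID3":
    major, revision = header[3], header[4]
    print(f"Found an ID3v2.{major}.{revision} tag: embedded metadata at the start of the file")
else:
    print("No ID3v2 header found (the file may use ID3v1 tags at the end, or none at all)")

# Show the opening bytes the way a text editor might render them.
with open("westvirginia.mp3", "rb") as f:
    print(f.read(200).decode("latin-1", errors="replace"))
```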
Another way to view an MP3
Now, if you go back and change the extension again, you can get something that looks a bit more interesting. This time, change it from .txt to .raw and open it in some image editing software. Here is what I saw when I did that with both a .mp3 version of the file and a .wav version. The black and white pixelated images below are screenshots of my image editing program attempting to read the MP3 as a RAW file. These are visual interpretations of the information in those audio files.
Look at the difference between the .mp3 on the left and the .wav on the right. What I like about this comparison is that you can see the massive difference in the size of the files visualized in how they are read as images. Notice how much smaller the .mp3’s block of black and white squares is. It’s also neat to see a visual representation of the different structure of these two kinds of files. You get a feel for the patterns in their data.
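If you don’t have an image editor that will open renamed files as RAW data, a short script can produce a similar visualization. Here is a minimal sketch using the Pillow library: it treats every byte of a file as one grayscale pixel and writes the result out as a PNG. The file names are placeholders and the width is an arbitrary choice.

```python
# Minimal sketch with Pillow: render the raw bytes of any file as a grayscale image,
# similar to opening a renamed .raw file in an image editor.
from PIL import Image

def bytes_as_image(path, width=512):
    data = open(path, "rb").read()
    height = len(data) // width                    # one byte per pixel
    return Image.frombytes("L", (width, height), data[: width * height])

bytes_as_image("westvirginia.mp3").save("mp3_as_image.png")
bytes_as_image("westvirginia.wav").save("wav_as_image.png")
```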
Beyond just incorrectly reading these kinds of files, we can use the same sort of tactics to start to incorrectly edit them and further expose the logic of how they are encoded.
Edit an Image with a Text editor
A similar approach works with digital images. For example, start with this image, “Sod house, Grassy Butte, North Dakota, on Catherine Zakopayko farm.” If you download the .jpg version of the image and change its file extension to .txt, you can open it up in a text editor. It will look like gibberish. In this case, because of the way that compression works on .jpg files, you can delete chunks of the file in the text editor, save the file, change the extension back to .jpg and see what happens if that particular chunk of the file is lost.
You can see comparisons between the original image and two levels of degradation I created by cutting out chunks of the data in the file and copying and pasting parts of it into itself.
In the second image, notice how the removal of a block of information has degraded the image. The entirety of the image is still there; it’s just that a rectangular region is magenta and two slices across the image are grey. The compression algorithms used to create jpg files mean that removing a chunk of the file doesn’t necessarily remove a chunk of the image; it removes some of the information that is layered into the image. In the further degraded image you can see how additional removal can result in big stripes of grey and similar kinds of color problems.
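For anyone who would rather script the damage than hand-edit bytes, here is a minimal sketch of a variant of this experiment. Instead of deleting and pasting chunks in a text editor, it overwrites a handful of bytes in the compressed data with random values; skipping the first couple of kilobytes is a rough way of leaving the header alone so the file still opens. The file names are placeholders.

```python
# Minimal sketch: "glitch" a JPEG by overwriting bytes in its compressed image data.
import random

with open("sodhouse.jpg", "rb") as f:            # placeholder file name
    data = bytearray(f.read())

random.seed(42)                                  # make the glitch reproducible
for _ in range(25):
    position = random.randrange(2000, len(data)) # stay clear of the header
    data[position] = random.randrange(256)

with open("sodhouse_glitched.jpg", "wb") as f:
    f.write(bytes(data))
```

As with the text editor approach, the results are unpredictable: some corrupted files won’t open at all, while others open with the kinds of color shifts and stripes described above.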
What was that about Screen Essentialism?
New media and digital humanities scholars have coined the phrase “screen essentialism” to refer to a problem in many scholarly approaches to studying digital objects. The heart of the critique is that digital objects aren’t just what they appear to be when they are rendered by a particular piece of software in a particular configuration. They are, at their core, bits of encoded information on media. While that encoded information may have one particular intended kind of software to read or present it, we can learn about the encoded information in the object by ignoring how we are supposed to read it. We can change a file extension and read against the intended way of viewing the object.
This might seem like a rather academic point; however, I think it suggests the value of understanding the integrity of digital objects not simply as “looking right” in one particular reading out to the screen. In many cases, the integrity of the objects is something that can be expressed through a range of software-enabled readings of them.
I’m curious to hear what folks have to say about these glitched files. What other things can they tell us about how these files work? Are there other ways to glitch files that you know of that you think can facilitate the same kinds of understanding? Lastly, what do you make of screen essentialism?
I like lists. I particularly like ordered lists. I’ve even read a book about checklists. Which is one of the reasons I wanted to point out a recent OCLC report, You’ve Got to Walk Before You Can Run: First Steps for Managing Born-Digital Content Received on Physical Media (PDF).
The report focuses on practical approaches institutions can apply to managing born-digital collections acquired on digital media, such as CD-ROMs, external drives and floppy disks. These approaches are laid out in a series of steps – and could be viewed by some as a checklist.
Here at the Library of Congress, the Tangible Media Project is developing generic workflows to get its digital collections off soon-to-be obsolete or at-risk physical media into storage systems for management and preservation. For those smaller to mid-sized institutions whose collections are equally at-risk, this report offers practical methods to take action now, get the boxes of physical media “off the floor” and the digital content into a stable environment (for the time being).
Not everything in the report may be relevant to your institution’s born-digital collections, but it could be a starting point. Referencing a step-by-step approach like this one could help identify gaps in an organization’s ability to manage born-digital materials. For example, if you’re unable to perform or complete any of these steps, you gain knowledge about which actions may require resources, training and other support for your institution’s digital preservation activities.
OCLC also published a follow-up report, Swatting the Long Tail of Digital Media: A Call for Collaboration (PDF) intended for decision-makers to help them understand the time, money and resources that may need to be allocated to the preservation of born-digital collections. You can read more about the project supporting both reports here.
One of the topics of conversations on this blog revolves around the challenges and solutions smaller institutions face with collecting and preserving born-digital materials. How do institutions get started with digital preservation projects? What are the best practices, workflows and tools available for managing and archiving digital content? What are other institutions doing? Or, what should they be doing? There are no easy answers to these questions, but we hope to explore them here with our readers. Please let us know if there are other topics of interest we can all discuss.
In anticipation of the Museum Computer Network conference next week in Seattle, I’ve been giving some extra thought lately to museum community involvement in digital preservation.
We (the National Digital Information Infrastructure and Preservation Program, that is) work with many partners from a range of industries, and in the last couple of years this has taken place mainly through the National Digital Stewardship Alliance.
The NDSA is a collaborative effort, leveraging the knowledge and expertise of our many digital preservation partners, to help preserve access to our national digital resources for the benefit of present and future generations. The NDSA currently has over 130 organizational members, and growing. With just minimal involvement, all members have access to a vast network of experience, and can participate in one of the many ongoing projects to help research and/or spread the good word about the importance of digital preservation.
Why is this helpful for museums? As we say time and again, all digital material is fragile, and needs maintenance to survive over the long term. So in addition to preserving digital art (or digital surrogates), there are also other preservation needs that will be increasingly important for museums, such as for electronic records, digitized publications (catalogs, for instance), and preservation of online exhibition websites.
And within the NDSA, museums are definitely stepping up to the plate. So far, the museum related membership includes ARTstor, The Hagley Museum, IMLS, Rhizome, Smithsonian Institution, and the U.S. Holocaust Memorial Museum. These organizations now have easy access to learn from other museum colleagues, and many other organizations, as they approach solutions for digital preservation.
As an example, see this previous blog post interview with Ben Fino-Radin, Digital Conservator for the Rhizome ArtBase online archive of digital art, who talks about what’s involved in maintaining and preserving this collection. Ben was also a participant this summer in our Digital Preservation 2012 conference, through our panel session on digital cultural heritage. This panel featured some great presentations and discussion on the range of issues involved in digital cultural projects, both visual and performing arts, throughout the country.
All NDSA members are also involved in one or more of the five specialized working groups. So far, museums and related organizations are mostly involved through the Content Working Group, with ongoing projects that focus on the selection, discovery and preservation of digital content in many topic areas. Within this working group, there is one team that is specifically working on arts and humanities content, which is a great way for arts groups to find out what others in the field are doing. If your museum is not already a member of the NDSA, think about joining – it’s free, easy to join, and there are many advantages (see the membership page for information on how to join.)
In addition to the resources of the NDSA, museums are also discovering the benefits of Viewshare, an open source tool developed by the Library of Congress that enables viewing, organizing and enhancing of digital collections. The data can be used to discover trends within collections, and there is a “gallery” option for displaying images, which makes it perfect for museum collections. Viewshare is freely available to any organization, all it takes is signing up for an account, which right away gives you access to the many options for this tool (and there’s a helpful online guide available, too). There are several previous blog posts in The Signal discussing the uses of Viewshare, including this one, and this one.
As an example of museum use, The National Gallery of Art has made good use of Viewshare to create an online view of the Kress Collection, which, in addition to the images, includes information about collection origins, sellers, locations of items, and purchase dates. The Rhizome ArtBase also has a collection view showing 400 born-digital artworks and associated information; this also allows for such discovery as tracing the development of emergent technologies. And I hear through the grapevine that there’s another blog post coming soon with more information about both of these Viewshare projects.
For museums and other cultural organizations located in the Washington, DC area, we have recently started a monthly Digital Cultural Heritage Meetup Group – informal gatherings open to all who are interested in the preservation of digital culture.
So, there are many avenues available for museum involvement and learning in the digital preservation community. Meanwhile, I’m looking forward to the Museum Computer Network conference, and learning more myself about all the latest technology projects in museums.
Kent Anderson offers a provocative post in The Mirage of Fixity — Selling an Idea Before Understanding the Concept. Anderson takes Nicholas Carr to task for an article in the Wall Street Journal bemoaning the death of textual fixity. Here’s a quote from Carr:
Once digitized, a page of words loses its fixity. It can change every time it’s refreshed on a screen. A book page turns into something like a Web page, able to be revised endlessly after its initial uploading… [Beforehand] “typographical fixity” served as a cultural preservative. It helped to protect original documents from corruption, providing a more solid foundation for the writing of history. It established a reliable record of knowledge, aiding the spread of science.
To my mind, Anderson does a good job demonstrating that not only is “file fluidity” a modern benefit of the digital age, it has long existed in the form of revised texts, different editions and even different interpretations of canonical works, including the Bible. Getting to the root of textual fixity, according to Anderson, means getting extremely specific–”almost to the level of the individual artifact and its reproductions.”
In the world of digital stewardship, file fixity is a very serious matter. It’s regarded as critical to ensure that digital files are what they purport to be, principally through using checksum algorithms to verify that the exact digital structure of a file remains unchanged as it comes into and remains in preservation custody. The technology behind file fixity is discussed in an earlier post on this blog; a good description of current preservation fixity practices is outlined in another post.
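The basic pattern is straightforward. Here is a minimal sketch using SHA-256 from Python’s standard library: compute a checksum when a file comes into custody, record it, and recompute and compare it later to confirm the bit stream has not changed. The file path is a placeholder; in practice repositories record these values in manifests or inventories rather than in a script.

```python
# Minimal sketch of a fixity check: record a SHA-256 checksum at ingest,
# then recompute and compare it later.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

recorded = sha256_of("masters/item-001.tif")     # stored alongside the file at ingest
# ... later, run against the preservation copy:
if sha256_of("masters/item-001.tif") != recorded:
    print("Fixity check failed: the file is not what it purports to be")
```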
It is well and good to strive for file fixity in this context, and it is indeed “to the level of the individual artifact and its reproductions.” The question arises about the degree of fidelity that needs to be maintained with respect to the original look, feel and experience of a digital file or body of interrelated files. Viewing a particular set of files is dependent on a particular stack of hardware, software and contextual information, all of which will change over time. Ensuring access to preserved files is generally assumed to eventually require: 1) migrating to another format, which means changing the file in some way by keeping some of its properties and discarding others, or 2) emulating the original computing environment.
Each has advantages and disadvantages, but the main issue comes down to the importance placed on the integrity of the original files. Euan Cochran, in a comment on an earlier post on this blog, noted that “I think it is important to differentiate between preventable and non-preventable change. I believe that the vast majority of change in the digital world is preventable (e.g. by using emulation strategies instead of migration strategies).” He noted that the presumed higher cost of emulation works against it, even though we currently lack reliable economic models for preservation.
I wonder, however, if the larger issue is that culturally we are still struggling with the philosophical concepts of fixity and fluidity. Do we aim for the kind of substantive finality that Carr celebrates or do we embrace and accept an expanded degree of derivation–ideally documented as such–in our digital information? Kari Kraus, in a comment on a blog post last week, put the question a different way:
[Significant properties] are designed to help us adopt preservation strategies that will ensure the longevity of some properties and not others. But if we concede that all properties are potentially significant within some contexts, at some time, for some audiences, then we are forced into a preservation stance that brooks no loss. What to do?
Ultimately I think wider social convention will determine the matter. Until then it makes good sense to continue to explore all the options open to us for digital preservation.
The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group at the Library of Congress.
Prompted by questions from Library of Congress staff on how to more effectively use web archives to answer research questions, I recently gave a presentation on “Using Wayback Machine for Research” (PDF). I thought that readers of The Signal might be interested in this topic as well. This post covers the outline of the presentation.
The Wayback Machine that many people are familiar with is the Internet Archive Wayback Machine. The Internet Archive is an NDIIPP partner and a Founding Member of the International Internet Preservation Consortium. Their mission includes creating an archive of the entire public web; the Wayback Machine is the interface for accessing it.
While the Internet Archive has been primarily responsible for the development of Wayback Machine, it is an open source project. Internet Archive also devised the name “Wayback Machine;” it is a reference to The Rocky & Bullwinkle Show’s homophonous “WABAC” Machine, a time machine itself named in the convention of mid-century mainframe computers (e.g., ENIAC, UNIVAC, MANIAC, etc.). The contemporary Wayback Machine thus appropriately evokes both the idea of traveling back in time and powerful computing technology (necessary for web archiving).
Internet Archive’s Wayback Machine is just one among many, however; over half of the web archiving initiatives listed on Wikipedia provide access via Wayback Machine. It is the most common software used to “replay” the contents of ISO-standard Web ARChive (WARC) file containers.
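For readers who want to peek inside one of those containers, the sketch below lists the captured URLs in a WARC file using the open source warcio library; the library choice and the file name are my own illustrative assumptions, not anything prescribed by the Wayback Machine project:

    from warcio.archiveiterator import ArchiveIterator  # third-party: pip install warcio

    # "example.warc.gz" is a placeholder for any WARC capture file.
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Date"),
                      record.rec_headers.get_header("WARC-Target-URI"))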
Understanding the basic mechanics of Wayback Machine makes it easier to navigate around within a web archive. For example, the URL can be modified to request particular resources, show the time coverage for particular resources in the archive, or show all archived resources from a particular domain. Since Wayback Machine can only replay specifically-requested URLs, it is difficult to access past versions of a webpage if that webpage changed URLs at some point and there was no redirect in place.
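As a hedged illustration of those URL mechanics, the patterns below reflect the conventions the Internet Archive’s Wayback Machine uses (other Wayback installations follow the same general scheme under a different host name); example.com and the timestamp are placeholders:

    # Common Wayback Machine URL patterns, expressed as Python strings.
    base = "https://web.archive.org/web"
    target = "http://example.com/about.html"

    snapshot = f"{base}/20120601000000/{target}"  # capture at or nearest to YYYYMMDDhhmmss
    calendar = f"{base}/*/{target}"               # time coverage for one specific URL
    domain_listing = f"{base}/*/example.com/*"    # all archived resources from a domain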
The presentation offers a couple of examples of how these basic techniques could be used to find specific information in a web archive. The first example explores a strategy for finding a webpage whose historical URL is unknown by navigating to another webpage in the archive that is likely to link to it. The second example demonstrates that the conceptual organization of websites persists longer than their precise URL structure. This trend can be used to access content that was previously publicly available but has since been moved to a private section of a website.
Of course, it may not even be necessary to consult web archives in the first place. Recent research (PDF) suggests that ostensibly missing resources on the live web have more often been moved than removed. The Synchronicity Firefox add-on, based on technology from the NDIIPP-funded Memento project, leverages web archives to help locate the resource’s new location. If that fails, the MementoFox Firefox add-on can help to find the web archive with the best coverage for the desired resource and time range.
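For the technically curious, Memento-based tools like these work by asking a “TimeGate” for the capture of a resource closest to a desired moment, using the HTTP Accept-Datetime header defined by the Memento protocol. The sketch below shows the idea with the requests library; the TimeGate address is the Internet Archive’s and the date is arbitrary, so treat both as illustrative assumptions:

    import requests

    # Ask a Memento TimeGate for the capture of example.com nearest this datetime.
    timegate = "https://web.archive.org/web/http://example.com/"
    resp = requests.get(
        timegate,
        headers={"Accept-Datetime": "Thu, 01 Nov 2012 00:00:00 GMT"},
        allow_redirects=True,
    )
    print(resp.url)  # URL of the archived snapshot ("memento") that was served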
Fixity is a key concept for digital preservation, a cornerstone even. As we’ve explained before, digital objects have a somewhat curious nature. Because they are encoded in bits, you need to check to make sure that a given digital object is actually the same thing you started with. Thankfully, we have the ability to compute checksums, or cryptographic hashes. This involves using algorithms to generate strings of characters that serve as identifiers for digital objects. Under normal, non-tampered-with conditions, these hash values more uniquely identify files than DNA uniquely identifies individuals. When we generate these hashes for digital objects to audit digital content, we want to know whether an object is the same as it was before. Is it still bit-for-bit the exact same thing? It is important to note that the “is” in that last sentence is only one way of saying that something is still the same.
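As a minimal sketch of what that checking looks like in practice (the file names here are invented, and this is an illustration rather than any particular repository’s workflow):

    import hashlib

    def fingerprint(path):
        # Hash a file to produce its checksum identifier.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    # Two copies are "the same" in this sense only if their digests match.
    print(fingerprint("report_copy_a.pdf") == fingerprint("report_copy_b.pdf"))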
An analog corollary to this kind of fixity checking is helpful in unpacking the different ways we can say “this is the same thing.” To ensure the authenticity of copies of texts, scribes would count their way through the text to check and make sure that new copies had the same middle paragraph, the same middle word and the same middle letter. It’s an analog fixity check; a technique to check if the encoded content of the text is identical to the encoded content of the copy (functionally, it is a rather poor fixity check, but a fixity check nonetheless). In this case, much the same as in computing, the two scrolls would have the same text on them but they are actually two physically different objects, potentially created by different scribes and expressing unique characteristics, for example, each scribe’s handwriting. If you had two copies of the same ancient text and you told a manuscripts specialist they were identical, they might scoff at you. Clearly they are two different artifacts; they are two distinct material objects that have their own physical properties. If we looked into the chemical properties of the papyri that each was encoded on we might be able to date them and find out which one is older, or we might find that the materials of one came from one place and the materials of the other came from another. While the encoded text of the two objects could be identical, there is an infinite amount of contextual information that could exist in the materiality of the objects they are encoded on.
The is of the Autograph and the is of the Allograph
Is means different things in different statements. This is Mary Shelley’s Frankenstein and this is the Mona Lisa. (Ok, so those are links to Frankenstein and the Mona Lisa). However, the link to the Mona Lisa isn’t really a link to the Mona Lisa. The Mona Lisa is on the wall in the Louvre. That link just points to an image of the Mona Lisa. If you load up the link to Frankenstein and the image of the Mona Lisa you can think through two of the different ways that something can be the same as something else. Most would agree that the former is Frankenstein, but that the latter is a copy of the Mona Lisa. Something is Frankenstein when it has the same text in it. In the art world, these kinds of works are referred to as allographic. You are actually looking at the piece of art when you see something that has the same spelling, that has the same encoded information. It is the same thing when it has the same encoded information in it. In the case of the Mona Lisa, we demand a different kind of is, the autographic is. There is only one Mona Lisa; it’s on the wall in the Louvre.
These conceptions of something being the same as something else have corollaries in how Matt Kirschenbaum defines assertions that digital things are the same. In his vocabulary there is a formal sense, in which one object has the same bits as another (the same ones and zeros), and a forensic sense, in which we think about how those bits are physically encoded and inscribed on an individual artifact. All the bits we care about are inscribed on storage media. Interestingly, in the forensic sense, all digital objects are also analog objects. While we read bits off disks, each of those individual bits is on some level its own little unique snowflake. Each bit could conceptually be analyzed at the electron-microscope level as having a signature, as having a length and a width on the medium on which it is encoded. That said, there really aren’t many cases in which we care about the physical, material sense of the forensic bit. Sure, it is possible to use forensic techniques to recover traces of previous writes on a hard drive, but even in that case, what we care about is reading back layers of the encoded information, not examining the qualities of the actual bits themselves.
The Mutual Exclusivity of These Senses of Sameness
I find it interesting that these two different senses of sameness, the allographic and the autographic, are fundamentally mutually exclusive properties. Try this little thought experiment. Imagine someone came up with a way to compute a fixity check on people. It might look like a CT scanner or something. It would scan you and then generate a string of characters that more or less uniquely identified you. If you came back the next day, climbed up in the machine again, and got your next reading, your numbers wouldn’t match. Our bodies are always changing: today I had a lot of coffee, so I have more caffeine in me; tonight I might go to spin class, and as a result tomorrow I would have burned some calories. This isn’t just the case for living things. Entropy (and its step-cousin in conservation science, inherent vice) tells us that all objects are in flux, slowly deteriorating toward the ultimate heat death of the universe.
Imagine if we stuck some fantastic rare book in this device that checks the fixity of physical objects, how about the Library of Congress copy of Sidereus Nuncius (not these digital images but the actual physical book). Even here, if we came back the next day we would get a different string of characters. While conservators do their best, from day to day there are changes in things like the water content in pages or other minor fluctuations in the chemical composition of any artifact. I suppose if the device wasn’t particularly sensitive it wouldn’t detect the difference, but even if it did say they were the same thing, we would know that it was a lie; it just wasn’t sensitive enough to pick up the subtle changes in the artifact. This is a key distinction between analog and digital objects. Digital objects are always encoded things; in this sense they (like the text of Frankenstein or the text transcribed by scribes) are allographic. Their essence is actually more allographic than those analog corollaries, as the encoding is much richer and leaves much less interesting information residing in the artifact itself. The medium on which a text is inscribed and the autographic components of an individual scribe or printer’s work actually carry a lot of interesting information. In contrast, a forensic disk image of a hard drive contains considerable information about the size and nature of the medium (the drive), and the additional information beyond the bits on a drive is actually older bits (computer forensics folks can recover previous writes of a disk by looking at the parts where the write bands overlap).
What is wild about digital objects is that there are extensive forensic, or artifactual, traces of the media they were stored on encoded inside a formal digital object like a disk image. That is, the formal object of a disk image records some of the forensic, the artifactual, the thingyness of the original disk media the object was stored on. The forensic disk image is allographic but retains autographic traces of the artifact.
Here at the Library of Congress, there are many projects underway to digitize and make available vast amounts of historic, archival material. One such project is the National Digital Newspaper Program, providing access to millions of pages from historic newspapers (a previous blog post provides an introduction). Deb Thomas, NDNP program coordinator here at the Library, answers some questions about this amazing project – what’s been accomplished over the past year, as well as goals for the future.
Susan: Could you give us a quick summary of this project?
Deb: The National Digital Newspaper Program supports digitization and enhanced access to millions of historic newspaper pages selected and produced by state libraries, universities and historical societies across the United States. Jointly supported by the National Endowment for the Humanities and the Library of Congress, the program is a collaboration of national proportions, currently providing access to 5.2 million pages from more than 800 newspaper titles, published in 25 states between 1836 and 1922 (and growing – we add additional newspapers regularly). The collection is made freely available through the Chronicling America website hosted by the Library of Congress. These newspapers provide the first glimpse into historic events, as well as the diverse voices and differing perspectives on the life, times, and activities of nineteenth and early-twentieth century America.
Susan: Tell us about some highlights from your recent partners’ meeting.
Deb: Each year, awardees in the program gather in Washington, DC, to share their experiences and activities throughout the program. In late September, fifty-eight representatives of 27 active projects and 2 “alumni” projects joined in 2 days of meetings, including presentations from LC and the NEH on progress in the program, new developments and outreach activities. Throughout the meeting, state participants presented on their own activities, sharing their experiences with the challenges of mass digitization and their own plans for promoting the program, collections and Chronicling America.
Highlights of the agenda this year included discussions on the addition of non-English ethnic newspapers to the Chronicling America site by awardees in New Mexico, Arizona and Louisiana and presentations by several outside scholars on their use of the site’s open access protocols to develop new research approaches through data mining. In addition, a half-day of workshops helped awardees brainstorm ideas for connecting with general users and the educational community, as well as get technical guidance in working with the LC Newspaper Viewer, the core software application that supports the Chronicling America site, published by LC as open-source software.
Susan: Describe some of the ways you can search this collection.
Deb: We have full-text search for newspapers from all across the country covering almost a hundred years – you can find first-hand reporting on the battles of the Civil War, diverse voices during the years of Reconstruction, life events throughout families going back generations, and the scandals and crimes that riveted the reading public during these decades. You can explore land disputes, crop reports, society news in both cities and small communities, American perspectives on events across the world, fact and fiction in technological advances, poetry, serialized literature by such classic writers as Charles Dickens and Arthur Conan Doyle and much, much more. All digitized papers in the collection can be searched by date, location and full-text options in both a simple keyword search and a more advanced approach which allows users to zero in on specific times and places with combinations of words and phrases.
In addition to the digitized newspapers made available through the program (limited to those published between 1836 and 1922), the site also includes a separate searchable directory of US newspaper records, describing more than 150,000 titles published from 1690 to the present and listing libraries that have physical copies in microfilm or original print. This directory, derived from data collected under the US Newspaper Program (1982-2011, a precursor to NDNP supported by NEH and LC, providing funding and guidelines to state institutions to inventory, describe and selectively microfilm their states’ newspaper collections), helps users identify what’s available and where to go to find newspapers beyond those digitized for the Chronicling America site.
Susan: What can you tell us about any updates and/or milestones during the last year?
Deb: 2012 has been a very exciting year for NDNP. We have experienced record use of the Chronicling America site over the year, as well as crossed the milestone of having more than 5 million pages available. (Actually, at this time, it’s 5.2 million and ever-growing with monthly contributions from awardees.) This year we added 4 new participants to the program representing Iowa, Maryland, Michigan, and North Carolina, and incorporated the technical ability to search non-English language newspapers in French, German, Italian and Spanish. (Currently we have text for French and Spanish only, but other titles are in the works.)
Susan: What are plans for future updates, both short term and long term?
Deb: More, more, more! Over the long term, we want more states to participate and add more content (NEH runs an annual award competition for interested applicants). All in all, over the next decade or so, we expect the collection to increase by tens of millions of pages gathered from all 54 states and territories.
Susan: What are some of your favorite items in the collection?
Deb: Some of my personal favorites are associated with the imaginative speculation by turn-of-the-century journalists. In an age when technology began to change lives dramatically year by year, the possibilities seemed endless. For example, an article in the Saint Paul Globe in 1904 told the story of the author jumping ten years ahead via a time machine to learn how his own works had stood up to time, only to find libraries dramatically changed. Instead of being places where people came to read books, libraries had evolved (in only 10 years!) to be places where books were read to people via phonograph or transferred as sound via wires. Another favorite is a feature article from the Washington Times in 1907, describing recent seismic disruptions around the world and presenting popular theories about the cause.
Susan: What are some uses of this material that you’ve heard about from researchers?
Deb: First, there are the genealogists who find these materials a treasure trove of unexpected details about family histories and events long past. A significant part of our user base self-identifies as a “family historian.” Such folks can find articles on marriages, deaths, funerals, celebrations, scandals, day-to-day living and more. Teachers use the collection to teach narrative analysis and how perspective influenced (and still does) the news (e.g., articles during the Civil War from both Confederacy- and Union-aligned papers). Another type of use we see is downloading large parts of the collection through the Application Programming Interface.
The data in Chronicling America is all in the public domain and available for open access. The software supporting the Web site has been designed to encourage reuse and holistic analysis of the data as a “big data” resource. This makes the site available to a new kind of research, the “digital humanities,” in which historians join with technologists to analyze large amounts of historic data in new ways. We know about research on geographic and chronological visualizations of newspaper publishing changes in the US, epidemiological studies incorporating how the “news” about disease spread and influenced societal behavior, and even linguistic analysis in different regions. We’d love to hear about other studies going on as well.
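As a hedged sketch of what that kind of API-driven use can look like, the snippet below runs a full-text search against the Chronicling America site and prints a few results from the JSON response; the endpoint and parameter names follow the site’s publicly documented search URL pattern, but treat the details as illustrative rather than authoritative:

    import requests

    # Illustrative full-text search of digitized newspaper pages, returned as JSON.
    resp = requests.get(
        "https://chroniclingamerica.loc.gov/search/pages/results/",
        params={"andtext": "time machine", "format": "json", "rows": 5},
    )
    for item in resp.json().get("items", []):
        print(item.get("date"), item.get("title"), item.get("id"))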
Susan: Are there any preservation capabilities built into this collection?
Deb: Access to these newspapers is key, and is intended to be sustained over time, so preservation capabilities are built into nearly every facet of the program. The digital objects LC prescribes for the program are based on the premise of a one-time opportunity to capture this diverse material scattered around the country so we need to get the most we can out of this data. The specifications are intended to capture as much information from the original microfilm as practical (using print master duplicate negatives for the best copy) and for the newspapers to be described by the people who know them best, the original selectors and curators.
In addition to the specifications themselves, the data transfer and handling procedures used by the digitization vendors, awardees, and the Library promote having a collection of uniform and consistent self-describing digital objects, with verifiable data values that can be checked over time to ensure enduring access. At LC the data lifecycle is supported, as much as possible, with an automated workflow system of repository services that help us manage the validity and consistency of the data over time, as well as a detailed inventory of what, how much and where the data is in our systems.
Susan: Could you describe the importance of collaboration in this project?
Deb: Collaboration is one of the key elements supporting success for NDNP. Within LC itself, the program is a successful partnership between the curatorial newspaper collection managers who know the ins and outs of working with historic newspapers and the repository development technologists who design ever-more sophisticated and efficient ways to approach the management of this growing digital collection. It’s also an important collaboration between federal agencies to provide the resources supporting the development of a national-level collection of historical newspapers. Most significantly, it’s a collaboration between institutions representing their state and local history and the sponsors of the program to build the best national resource possible from this important primary source material.
On October 17, I had the extreme pleasure of hearing Cory Doctorow at the Library for a talk entitled “A Digital Shift: Libraries, Ebooks and Beyond.” Not surprisingly, the room was packed with attentive listeners.
The talk covered a wide range of topics–his love of books as physical objects and his background working in libraries and as a bookseller; his opinion on Fair Use under U.S. Copyright law; and his oft-discussed release of his own works as free ebooks under a Creative Common license in conjunction with their physical publication.
But the focus of his passion was the prevalent publishing and ownership model for ebooks.
When you buy a physical book, said Doctorow, you own that book. You can lend it to friends, give it away, or even sell it. But when you buy an ebook, you license it. Depending upon the source you purchased an ebook from, you may only have the right/ability to read it on a single device or type of device. It often comes with Digital Rights Management attached, he noted, so you cannot make any changes that will allow you to read your ebook on other devices or loan it or transfer it to someone else. You can’t even save it and open it independently of its original intended environment.
“If you can’t open it, you don’t own it,” he declared.
He posed a number of rhetorical questions. “What if a bookseller told you that if you buy a book from them, you could only read it while sitting in a specific chair using a specific light and you could not read it anywhere else? How would you feel about that?”
Not very good, in his estimation. “And yet, that’s where the marketplace is with most ebook sales now.”
He noted the story can take an additional turn for libraries, with contractual limits placed on the lending of ebooks to patrons, with some even having technical limits in place where the ebooks delete themselves after a certain number of circulations.
His point resonated with at least one questioner, who asked about leaving our ebooks to our children when we pass on. Doctorow replied that doing so is currently difficult for most types of ebooks because of the licensing model, which he said could potentially lock up an ebook legacy worth thousands of dollars.
This strikes a personal chord for me, as I am very interested in personal digital archiving–how people can best manage their own personal digital collections and legacy. Licensing is yet another issue to worry about in keeping personal collections accessible over time.
There is a lot of public awareness about music files; less so about ebooks. There’s no easy answer to all these questions right now, but it behooves all of us to talk more about this topic and raise the visibility of the issue.
Today’s guest post is by Carlos Martinez III, a Hispanic Association of Colleges and Universities intern in the Library of Congress’s Office of Strategic Initiatives.
The National Information Standards Organization provides standards to help libraries, developers and publishers work together. Their report, A Framework of Guidance for Building Good Digital Collections, is still as helpful to organizations today as when it was published in 2007.
The report identifies, organizes and applies knowledge of resources that support the development of sound local practices and focuses on creation and management of good digital collections. If you haven’t read it, you should. And if you have, it’s worth a second look.
Although Library and Information Science professionals must address many different elements associated with digital collection development, the framework guidance identifies four core elements: collections, objects, metadata and initiatives.
The first core element, collections, is the point at which collection development policies and procedures are established and adjustments made. The information in this section suggests that collection-policy creation remains an iterative process, allowing for rules to be amended according to the needs of the user community but also initially sound enough to provide a good structure for the collection to be built upon.
The second core element addresses the organization of digital information – or objects –within a collection. The framework guidance report looks at how to address these challenges and notes that digital-collection developers must be aware of the increasing number of objects that are “born-digital.” The report offers guidance on handling not only born-digital items but also digitized items.
Perhaps one of the most challenging aspects of digital collection development is making the collection searchable, sustainable and usable. The metadata section of the report provides insightful suggestions for LIS professionals looking for a schema that is flexible, interoperable and extensible enough to meet the needs of the collection and the user, and be of use by other institutions.
While it was common for digital-collection developers in the past to focus efforts on meeting the needs of a specific user community, the framework recommends that LIS professionals create collections that can be repurposed and reused by other institutions in order to be part of a larger digital collection development effort.
Finally, LIS professionals need to consider the framework guidance as a valuable resource for developing management initiatives, allowing for people, policies and tools to work together to ensure the overall value of the digital collection. For this reason, NISO states that digital-collection building efforts have become a core part of many organizations’ missions and thus are the key component for ensuring overall success of their digital collections.
In light of previous discussions on this blog concerning digital preservation initiatives, the framework guidance is still a significant resource for information professionals to consider. The identification of the four core elements associated with building a digital collection proves that the NISO framework guidance is as relevant and helpful today as when it was first published.
Now I had three copies of my digital content on three different devices. If something happens to one of those media, I’ve got two others that have all my files saved (and safe). Great, right?
The hard drive on my old computer was failing, which is why I got a new one. My digital files on that drive aren’t exactly safe, even though they are saved for the time being. Copies of my files on my new computer and my external hard drive are relatively safe (crossing fingers that those media don’t fail). Not to mention, I now had two copies of ALL of my digital files, of which I had only a general idea what they were. But I certainly didn’t care about all of them.
This wasn’t a great start to creating my own digital archive. I know exactly the advice and steps to follow, so why wasn’t I doing it?
Because I was being lazy. I am by no means as vigilant about archiving and backing up my personal files as some of my colleagues here at the Library. I’ve never experienced personal data loss (knock on wood) and I’ve never had enough digital files I truly cared about saving long-term.
I started taking digital photos when I bought a camera about seven years ago. I first purchased digital music files around the same time. I wrote tons of papers for grad school. I completed my Federal and State taxes electronically. I created or saved all of these digital files on my old computer.
My digital photos and documents hold personal value, and my music files, most of which I’ve purchased, are quite literally valuable. I truly care about saving and preserving these materials. Time to create that archive.
Comparing what I’ve done so far with NDIIPP personal digital archiving guidance, I’m not in bad shape. I’ll admit I’m just barely performing the steps of this guidance. I could (and probably should) be more cognizant of the file formats and software programs used to access my files, for example. But this is the level of effort I choose to devote to save my digital information of personal value. Hopefully it’s enough for right now and for a few years down the road.
Given I don’t have a large amount of digital content as compared to other people I know, I should be able to finish my archive this weekend.
So, let’s see where I stand against the guidance.
A question popped up in the blogosphere recently. “Where is our Atlas of Digital Damages?” asked Barbara Sierman of the National Library of the Netherlands.
She pointed out the amazement that would greet evidence of physical books, safely stored, with spontaneous and glaring changes in their content or appearance. “Panic would be huge if this would happen in our libraries and archives.” That statement is certainly correct. Nearly everyone expects libraries and archives to have the basic resources to keep physical documentation stable, intact and fixed.
Sierman bravely points out, however, that digital items are very much at risk of loss and corruption–even when libraries and archives manage the material. Digitization sometimes yields mistakes, storage systems fail, older files rendered in newer computing environments behave oddly.
While digital preservation practitioners are well aware of this risk, Sierman called for some visual evidence to prove the point. Such evidence would, she reasoned, help make the case for a robust preservation infrastructure as well as help drive discussion about acceptable degrees of loss and the significant characteristics of digital objects.
“Because there lies a real risk for the digital collections, but making it visible with examples, it will be more convincing than all the conference papers that we have written about the digital preservation challenge.”
Her argument spurred action, and The Atlas of Digital Damages now is up and running on Flickr. This is a crowdsourced effort, and anyone can upload pictorial evidence of bits gone bad. There are currently a few dozen images available, but it is easy to imagine building quite a large collection of compelling images.
I have a few candidate images myself. Up until now, I kept them out of something like a perverse artistic appreciation, thinking perhaps they conveyed some fanciful insight into what machines see (or don’t see). Computers may not lie, but they surely can get confused.
The picture to the left, for example, suffered corruption during transfer to Flickr. A special irony is that the picture was taken of a recent exhibit of old computer hardware and media to demonstrate the tentative status of digital information. An unintended outcome, to be sure, but indicative (and evocative) evidence of the challenge of maintaining fidelity for digital objects over time.
Do you have any graphical evidence of digital damage? If so, please consider sharing it so that we can help people understand what is at risk for our digital heritage.
As digital preservation and stewardship professionals, we approach digital objects from a unique perspective. We evaluate the long-term value of any particular digital object and work to develop a technical and social infrastructure that will enable us to successfully preserve the objects over time.
Preserving and providing appropriate access are our primary functions, but no matter how you look at it we’re still managing digital assets; we just do it at a particular stage in the digital lifecycle.
So how is what we do different from what digital asset managers do?
Well, surprisingly, there are more similarities than differences, as I discovered when I traveled in late September to New York to participate in the Digital Asset Management conference. DAMNY brought together media management professionals from advertising, broadcasting, entertainment, publishing, libraries and archives and retail to discuss ways to maximize the value of digital materials under their control and also discuss challenges around the storage, security and preservation of those materials.
“Digital Asset Management” is a tough term to define, seeing as it’s a marketing term as much as anything else. Designers of software tools are always working to differentiate their products in the marketplace and to be able to claim “best of breed” status for whatever vertical they define. DAM as a category is no different, but beyond the surface confusion there are real differences in what DAMs do versus other categories such as Enterprise Content Management, Document Management or Content Management Systems.
DAM focuses to a greater degree on complex content and on maximizing the ability to access and reuse it. “Complex content” generally means multimedia (which the advertising industry often calls “rich media”), including images, video, audio and materials with a dynamic complexity. DAM tools also concentrate to a greater degree on integration with creative authoring tools (such as layout, design and video and audio editing applications) to allow asset managers ready access to their content storage infrastructure for re-use purposes.
Multimedia complexity challenges DAMs in their “day jobs,” but it’s even more challenging as they start to address longer-term preservation and stewardship issues.
Which is where my presentation (PDF) came in. I focused on the incentives to preserve digital content that are shared between DAMs and preservation professionals (libraries, archives and museums, or “LAMs” for short).
The differences come down to the type of data being preserved (proprietary for DAMs, largely open for LAMs); the purpose of preservation (monetization for DAMs, knowledge-sharing for LAMs); and the time horizon (shorter for DAMs and longer for LAMs), but the technology, standards and infrastructures for “doing” digital preservation are largely the same. LAMs have a lot of preservation knowledge we can share with DAMs, but we have to convince them that they’ve got incentives to preserve for the long term.
With that in mind, I posited 5 shared incentives to preserve for DAMs and LAMs:
The surprising thing is that DAMs are already considering many of these issues in their work. There was quite a bit of discussion throughout the conference on the challenges of big data and on storage infrastructures for both short- and long-term preservation, including an excellent presentation (PPT) from NDSA member WGBH on their open source digital asset management system for media preservation.
One key initiative in the DAM community that was much-discussed was the Publishing Requirements for Industry Standard Metadata (PRISM), a specification that defines a set of XML metadata vocabularies for syndicating, aggregating and multi-purposing publishing content (magazine, news, newsletter, marketing collateral, catalog, mainstream journal content, online content and feeds). PRISM was initiated in 1999 and is maintained by IDEAlliance, a not-for-profit membership organization that “advances core technology to develop standards and best practices to enhance efficiency and speed information across the end-to-end digital media supply chain.” These are efforts that the digital stewardship community should monitor.
There are strong similarities in the work that DAMs do and the work that LAMs do, with numerous opportunities for collaboration. Let’s find more opportunities to meet and learn from each other.