This is my first, long-overdue blog post since starting my new role as Software Configuration Manager for the OPF at the start of the year. Truth be told, between the SCAPE end-of-year review, a week's holiday, and working out what to do, it doesn't feel like four months since I started. I'm the OPF's first full-time technical team member and will be dividing my time between:
offering guidance and assistance to developers working on OPF software, with best practices and use of online tools.
helping to improve user documentation of software so that it’s easier to find and use.
showing members how to help shape future development of tools they use, by helping to convert requirements into developer tasks and automated tests.
engaging with the developers of open source digital preservation projects in order to share ideas and software.
providing technical expertise and meeting members at OPF Hackathons and other events.
contributing to external projects the OPF is involved in, e.g. SCAPE & SPRUCE.
The OPF GitHub page currently lists 50 public projects. To put that in perspective, I could afford a week of effort a year per project, if I did nothing else and took no holidays. In reality it would be no more than 2 days a year per project. The projects are in varying states of activity, are written in different programming languages (e.g. Java, Ruby, Python, PHP), and some aren't software projects at all. Between other tasks I've started to update the OPF's current development guidelines and added some guidance on the OPF's GitHub policy. This includes standard practices that should be adopted by all OPF GitHub projects. The main concerns for new projects are:
create a descriptive (preferably GitHub markdown) README file.
clearly state the license terms of the project in a LICENSE file.
create a small YAML file listing some basic project metadata (a sketch of such a file follows below).
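To make the metadata item concrete, here is a sketch of what such a YAML file might contain. The file name and field names are hypothetical, for illustration only, not an official OPF schema:

```yaml
# .opf.yml (hypothetical name): basic project metadata
name: example-tool
description: One-line summary of what the project does and who it is for.
license: Apache-2.0        # should match the LICENSE file
homepage: https://github.com/openpreserve/example-tool
contact:
  name: A. Maintainer
  email: maintainer@example.org
```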
Adding this information makes it easy for somebody to find out what the project does, if they have permission to use it, and contact somebody if they have problems.
I’ve also written a little code that uses the GitHub API to create a web page that gives an overview of the OPF’s GitHub projects, providing warnings where projects don’t follow the OPF’s GitHub policy. The generated page can be found here and is currently updated once a day.
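For the curious, here is a rough Python sketch of the approach, using the public GitHub API with the requests library. The policy checks shown are simplified illustrations, not the actual OPF code:

```python
import requests

API = "https://api.github.com"
ORG = "openpreserve"  # the OPF organisation on GitHub

def has_file(repo, path):
    """Return True if a file exists at the top level of a repository."""
    url = "%s/repos/%s/%s/contents/%s" % (API, ORG, repo, path)
    return requests.get(url).status_code == 200

# Page through the organisation's public repositories.
repos, page = [], 1
while True:
    r = requests.get("%s/orgs/%s/repos" % (API, ORG),
                     params={"page": page, "per_page": 100})
    if r.status_code != 200:
        break
    batch = r.json()
    if not batch:
        break
    repos.extend(batch)
    page += 1

# Flag projects that don't follow the policy.
for repo in repos:
    for wanted in ("README.md", "LICENSE"):
        if not has_file(repo["name"], wanted):
            print("WARNING: %s is missing %s" % (repo["name"], wanted))
```

A real version would also read the project metadata file and render the results as HTML rather than printing warnings.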
I’m now working on guidelines for using Travis-CI, the online continuous integration service, and hosting binary packages on BinTray. As I complete new sections I’ll also create a blog post giving a few more details. A recent OPF webinar tries to give the full picture, the slides are available on the Wiki.
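To give a flavour of the Travis-CI piece before those guidelines land: for a typical Java project, the setup can be as small as a .travis.yml file in the repository root. A minimal sketch, assuming a Maven build (the JDK and script lines are illustrative and should be adjusted per project):

```yaml
# .travis.yml: minimal Travis-CI configuration for a Maven-based Java project
language: java
jdk:
  - openjdk7
script: mvn clean verify   # override Travis's default Maven build if needed
```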
I'll wrap up this post by saying that I'm happy to take further suggestions, and answer questions: just drop me an email or IM me. I'm happy to provide direct assistance to members who require it. It's also nice to meet members in person, so I'll be attending OPF events where possible, starting with the Hackathon in Copenhagen next week. Oh, and I promise to blog a little more often.
In this installment of the NDSA innovation working group’s ongoing series of innovation interviews I talk with Alison Langmead and Brian Beaton about the approach they are taking to teaching Digital Preservation at the University of Pittsburgh. Alison holds a joint appointment in the Department of the History of Art and Architecture and the School of Information Sciences. Brian holds an appointment in the School of Information Sciences. In this interview we explore how they approach teaching digital preservation. You can read the syllabus for the course here.
Trevor: Could you give us a quick overview of your digital preservation graduate course?
Alison: Sure. Brian and I were interested in reframing the contemporary practice of digital preservation as an imperfect and ongoing response to the history of digital culture. For example, decisions made in the 1940s, 1950s and 1960s about computing architecture still affect our work today, and we thought it was crucial for our students to not only understand today's tools, but also to engage critically with the complex, layered legacy of information technologies.
Brian: We were also interested in teaching people to tack between past and present while making decisions about the objects in their stewardship. Building on Alison’s point, we wanted to situate digital preservation problems as outcomes and effects of choices, activities, and interactions over time that involved a tremendous range of human and non-human actors (although, I should add, we focused on the U.S. due to the typical career trajectory of our students at the University of Pittsburgh).
Alison: Indeed. To this end, we organized the 15-week course into two parts. In the first part, we focused our attention on primary source documents that captured the messy and contingent nature of emergent digital culture and its preservation. We began with texts from the 1940s and 1950s, working towards the present by decades, but as we approached the 1990s, we began examining ever-smaller increments of time. Each week, we would read documents produced only during the time period in question, concentrating on the ways in which human actors in the past understood digital technologies. The second part of the course was devoted to lab work and student presentations.
Brian: I would describe our approach as Media Archaeology meets Historical Epistemology. We tracked ideas, knowledge, machines, platforms, practices, and actors as they mutated over time— eventually congealing into something now commonly called digital culture, which presents a host of unique complications and challenges when it comes to its preservation.
Trevor: What do you see as the advantages of taking this approach to teaching digital preservation?
Brian: One key effect of this course design was that students were introduced to the computerization of American life as a continually unfolding interplay between technological obduracy and obsolescence. In the labs, we then encouraged students to apply that knowledge to contemporary information management problems. We also tried to model an outlook and sensibility that we believe is necessary for anyone interested in the preservation of digital culture; we instructed our students to conceptualize themselves as existing and operating in a moment that will likewise be rendered obsolete, perhaps soon. As information professionals interested in digital culture, they will have to constantly toggle between now-time, then-time, and future-time. To work in this area requires not just an understanding of data and files, but a whole set of physical and cognitive routines, aptitudes, and maneuvers. Our approach, I hope, captured some of the complexities around digital preservation and the tricky positioning of anyone working in this area.
Trevor: Alison, you have a background in Art History and Brian has a background in Science and Technology Studies. To what extent do you see each of those backgrounds structuring or changing how you approach digital preservation?
Alison: Brian and I both hold a firm belief in the importance of the historical contextualization of current-day information practices. We tried to present the history of digital culture in the United States as a critical piece of knowledge that preservationists can bring to bear on the effective stewardship of digital objects over the long term. In terms of my own background, my training in the concrete and abstract issues surrounding material culture often leads me to emphasize visual knowledge and the impact that materiality can bring to a problem. Discussions about digital preservation concern the material manifestations of decades' worth of decision-making.
Brian: My background often leads me to emphasize the social production of knowledge and the cross-traffic between “experts” and society. In terms of structuring my approach to digital preservation, I wanted students to leave the course as emergent experts in digital preservation and stewardship but also as deeply aware of the gaps and limitations in their own knowledge, and aware of the need for continuous re-training and re-tooling as they come to manage digital things in their everyday work lives. We also presented the professional conversations around digital preservation and stewardship as far from singular, unified, or coherent. Presenting the field as perpetually unsettled seemed more faithful to reality and more likely to position our students as critical, self-aware practitioners.
Trevor: How did you decide on how to periodize the history of computing in your course design?
Brian: In some ways the choice was arbitrary, structured by the limits of an academic term. We wanted the last few classes before the labs to focus on the most current research in this area, and then we worked back from there.
Alison: Also, in some ways the choice was tactical and meant to disrupt common periodizations of computing history. We wanted our students to think of this history as contested and open to re-periodization. For example, we investigated how the computerization of occupational and personal realms occurred at different rates and times, and spawned equally uneven conversations about digital preservation that continue into the present day.
Brian: In fact, the issue of uneven technology diffusion and uneven response on the part of the information professions became a major theme of the course.
Trevor: It strikes me that there are two related but different values in historicizing digital preservation education. On the one hand, the artifacts now making their way into libraries, archives and museums come from different historical periods, and as such an internalist understanding of different digital technologies and their features and affordances is valuable. More broadly, though, there is a value in understanding that computing has a social and cultural history. That is, a significant part of understanding (or for that matter, preserving, describing, and interpreting) a digital object involves entering into the past as a foreign country and coming to see it as someone in a different historical circumstance saw it. I am curious whether you see a similar tension between these two values for historicizing, and whether, in designing your syllabus, there was any tension between focusing on the internalist story of devices and technologies changing over time and the externalist story of what those devices and technologies mean to different people in different historical contexts.
Alison: This tension is critical to our course design. In many ways, our entire course was predicated on this same observation. It is important to know both an insider’s history of computing as well as the social and subject effects of IT infrastructure.
Brian: This tension, I would add, is what makes digital preservation really interesting as an area of research, teaching, and practice. There are so many possible entry points into these uneven and overlapping conversations about the preservation of digital culture that emerged in the wake of computerization. There are also so many different zones of comfort and discomfort in any classroom. Some students might want to talk about data remanence or reconstructing hard drives or building the perfect emulator. Other students might want to talk about the work itself: project management, blurrings between consumption and production, or staffing and labor issues. Many students also arrive at the topic with a broad interest in the social and cultural history of technology. To address the second part of your question about coverage within the course itself, our effort to navigate between the internal and external, I think one of the more interesting and generative moves that we made involved reading outside the usual digital preservation literature. In preparing the course, we searched through field-specific journals in areas like nursing, banking, schooling, government, urban planning, and social science. Almost every nameable field has some version of a “Computers! What are they for?” article from the 1960s, 1970s, or 1980s. Reading these types of articles allowed us, as a class, to excavate the story of how specific machines and devices entered specific occupational realms. As instructors, we tried to call attention to subtle differences across domains that are often left un-named and lumped together.
Trevor: I'm curious about the extent to which some related notions like Media Archaeology can play into this historical approach to thinking about digital preservation. I interviewed Lori Emerson about her work on the Media Archaeology Lab, and I would be curious to hear what you see as the similarities and differences between the approach of your lab and the Media Archaeology perspective Lori describes as informing hers.
Brian: Your interview with Lori Emerson provides a wonderful distillation of Media Archaeology’s scattered intellectual origins and impulses. Our approach to teaching digital preservation shares a close affinity with Lori’s work at Boulder. Although we organized our course in “real time,” moving students experientially from the 1940s to the present, the only reason we moved chronologically was to capture and reveal subtle shifts in self-understanding and knowledge by the various human actors who were thinking, making, and doing with digital technologies. As I mentioned above, I would describe our approach as Media Archaeology meets Historical Epistemology. Thinkers like Ian Hacking and Lorraine Daston were just as influential on our course design as the various writers and thinkers named by Lori (e.g. Foucault, Kittler, et al.)
Alison: Perhaps one difference between our respective approaches, if I had to name one, is that our course focuses equally on historical components as well as on present-day electronic record-generating activities and the practice of digital preservation. Part of my own training is in the field of active information management, and I bring this training to the classroom with examples of current-day practices, policies and decisions. Digital preservation professionals continue their “training” every day by participating in their own digital cultural context. Policy decisions, the selection of particular hardware and software for the workplace and the home—all of these things are a part of the larger context shaping the ongoing conversations around digital preservation. Some key questions I raised as part of our class: How does the way we use information technology now impact how we treat historical objects? How does what we know about the past impact the way we, say, file our emails for future use? Does it make us think differently about using a site like Tumblr for our own personal purposes? How might the digital preservation profession play a part in actively and consciously constructing digital culture now and into the future? After all, this profession can (and has) made a profound impact on the ways in which people visualize their relationship to technology—an awesome responsibility.
Trevor: I know you are also working on a Digital Humanities Research network at the University of Pittsburgh. How do you see the relationship between your approach to digital preservation and your approach to digital humanities? Are these two parts of the same thing? Are they at odds with each other? Further, I saw some work from new media studies scholars, like Lev Manovich, on your syllabus. So, how do you see new media studies fitting together with digital humanities and digital preservation?
Alison: Yes, Brian and I are both involved with a group called the DHRX: Digital Humanities at Pitt. We are trying to create a strong but informal network of faculty who actively use digital technologies in their work, whether that be digital production or the use of digital methods to facilitate humanities-based research. Digital preservation strategies are always in the forefront of my mind when using technology in my research, and I am often in a place of being able to provide advice and collaborative support to my colleagues. If we do not consider how DH work will persist into the future (or even if we want our work to persist into the future), we are not, in my opinion, doing complete justice to our efforts.
Brian: I would describe new media studies, digital preservation, and digital humanities as organizational artifacts of our sociotechnical moment and as effects or symptoms of something larger happening at the intersection of people, information, and technology. In terms of our course design, we especially wanted to prepare our students to support new media and DH projects as they age, corrode, and ossify. I’ve written elsewhere about the “adaptive reuse” of “other people’s digital tools,” something I partly framed as a sustainability practice. In fact, because the principal contacts for the DHRX group at Pitt (Alison and myself) are also the people teaching digital preservation, the preservation side of DH is something I would really like to develop further as an area of research, teaching, and practice. If we take seriously past patterns and future predictions about obsolescence, then something like Preserving DH is already a long overdue anthology.
Alison: I agree very much with Brian that these academic fields seem like artifacts or affordances of something larger, not yet quite recognizable. Recent academic trends towards technology-oriented transdisciplinarity have demonstrated the benefits and the disadvantages of different scholarly communities coming together to work as groups. That might explain some of the simultaneity in terms of new media studies, digital preservation, and digital humanities. We have seen that some groups protect their identities so strongly that collaboration becomes impossible, while others have such a loosely-defined structure that they do not come to the table with any solidity, again making collaboration difficult. One possibility is to embrace co-existence and avoid worrying too much about academic fields, boundaries, and borders. Another possibility is to ask how we might bring these intellectual and methodological streams together productively without homogenizing the mixture and without just being strange bedfellows. In thinking about that very question, I am currently working with colleagues and graduate students on envisioning a course focused on digital materials and methods that will focus on this convergence and non-convergence of solid and ephemeral groups of actors grappling with digital culture in distinct but sometimes similar ways— some of whom study the digital, some of whom create in the digital, some of whom coopt the digital, some of whom reject the digital, and some of whom do all of these things and more, of course. We are playing around with the notion that to do this, we might best remove the human actors from the spotlight, and replace them with the technologies themselves. We often think of digital culture in terms of people and their material coagulations of mobile devices, desktop machines or pervasive sensor technologies, but what might the landscape of user interactivity as seen from the point of view of an embedded sensor teach us? What would a digital humanities/digital studies/digital preservation course look like from the point of view of the interface itself?
Brian: It sounds to me like this new course that you’re developing is Media Archaeology meets Historical Epistemology meets Actor-Network Theory meets Thing Studies…and the goal of the course is to investigate, as a set of interlinked symptoms or effects, the work happening in New Media Studies, Digital Preservation, and Digital Humanities. That’s pretty thick, elegant, and interesting. In closing, perhaps one further observation that can be made regarding our attempts to historicize “the digital” in digital preservation is that it seems to require a whole lot of aggregation: the combining of methods, terms, ideas, techniques, and theoretical tools from a wide range of literatures—which is only possible due to recent advances in search engines, databases, journal digitization projects, et cetera. Our course on the preservation of digital culture was designed and implemented by leveraging a good deal of present-day digital culture to dream up the structure and aggregate the content. That means our course, like the tools and technologies that we used to build it, may soon become obsolete. To me, that’s the best part of teaching digital preservation. It demands constant innovation.
The perfect digital preservation system does not exist. It may someday, but I don’t expect to live to see it.
Instead, people and organizations are working on iterations of systems, and system components, that are gradually improving how we steward digital content over time. This concept of perpetual beta has been around for a while; Tim O'Reilly explained it lucidly in "What Is Web 2.0" back in 2005.
I gave a presentation recently in which I was expressing hope that prospective infrastructure developments for stewarding big data would bring benefits to the work of libraries, archives and museums to preserve digital content.
My intent was to convey that change should be iterative along a path to radical. In the spirit of avoiding bulleted presentation slides wherever possible, I searched for graphics that might help tell the story.
The one I ended up using was a picture from the Norfolk Record Office (UK) that showed delivery of a computer system some years ago. In its day, the Elliot computer was an advanced machine that cost the modern equivalent of nearly a million dollars. It read paper tape at 500 characters per second and had a CPU that was stored in a "cabinet about 66 inches long, 16 inches deep and 56 inches high."
The picture got a good response from the audience and I wondered if perhaps I should have used others, perhaps from a later era, such as this one from Bell Labs in the late 1960s. This IBM mainframe was many iterations ahead of the Elliot, but any computer big enough to hide in surely needed to be delivered by truck as well.
These pictures are useful in illustrating a point that Clay Shirky and others made some time ago: the system should never be optimized. In other words, iteration and change should be embraced as a design principle. Any system surely can be improved, often radically, in the future. And, as time passes and successful migrations occur (our intent), the way we used to do things will inevitably seem quaint in retrospect.
Yes, it's only tres de mayo, but Sunday is a lousy day to hold a sale. Besides, today is International Day Against DRM. From today through the 5th, you can get Files that Last on Smashwords (DRM-free, of course) for the super-low price of $2.99 instead of the usual $7.99. Enter the coupon code TT58Q when buying the book to get this price. If you already have it, why not buy a copy for a friend or colleague?
This applies only to copies bought on Smashwords, not on other sites. Sorry if you prefer to buy on the iTunes store, but I’m not able to issue coupons for other sites.
We produce occasional short videos related to digital preservation. These videos address such topics as personal digital archiving, adding descriptions to digital photographs and the K-12 Web Archiving program, to name a few.
Our newest video profiles one of the Library of Congress’s most magnificent treasures: the Packard Campus for Audio Visual Conservation, located in the foothills of the Blue Ridge Mountains in Culpeper, VA.
This state-of-the-art facility resides inside Mt. Pony, a high-security facility formerly occupied by the Federal Reserve Bank. The facility was completely rebuilt and optimized for the preservation of material and digital audio and visual items. David Packard’s Packard Humanities Institute funded the renovation, in large part.
The Packard Campus opened in 2007. It houses the Library's vast collection of motion pictures, audio recordings, television and radio broadcasts, videos and video games, nearly 5 million items in all; many are in obsolete formats. The material items in the collections date from the late 19th century onward.
Our new video showcases the Packard Campus as a world leader in the preservation of born digital and digitized collections. It shows how the Packard Campus gathers born-digital collections shipped on drives, ripped from CDs and DVDs, transferred over networked cable and captured from live broadcasts.
The video also shows how the Packard Campus digitizes material collections. For example, SAMMA robots digitize videotape in batches around the clock, specially designed machines digitize rare old paper film, and the IRENE system uses lasers to map the grooves of fragile recordings without risking further damage to the grooves through contact with metal record-player styluses.
In the last step of the digital-file journey, high-capacity servers pull in the digital collections and transfer them to backup drives and tapes for storage. The repository is designed to anticipate large-scale expansion of the digital collections, as well as power and cooling needs of the server hardware.
The end result is not just long-term digital preservation; it’s remote access as well. The Packard Campus serves some digital items from its repository over the network to researchers at A/V stations 70 miles away at the Library of Congress’s Audio Visual reading rooms in Washington, DC.
Files that Last is the first e-book on digital preservation directed at “everygeek.” In case your layout doesn’t show you the page links (e.g., on a mobile device), you can read what the book’s about and how to get it here.
This is a guest post by Jose “Ricky” Padilla, a HACU intern working with NDIIPP.
More and more cultural heritage organizations are inviting their users to tag collection items to help aggregate, sort and filter them. If we can better understand how and why users tag, and what they're tagging, we can better understand how to invite their participation. For this installment of the Insights series I interview Jennifer Golbeck, an assistant professor at the University of Maryland, Director of the Human-Computer Interaction Lab and a research fellow at the Web Science Research Initiative, about her ongoing studies of how users tag art objects.
Ricky: Could you tell us about your work and research on tagging behaviors?
Jennifer: I have studied tagging in a few ways. With respect to images of artworks, we have run two major studies. One looks at the types of tags people use. The other compares and contrasts tags generated by people in different cultures.
In the project on tag types, we used a variation of the categorization matrix developed by Panofsky and Shatford. This groups tags by whether they are about things (including people), events, or places and also by whether they are general (like “dog”), specific (like “Rin Tin Tin”), or abstract (like “happiness”). We also included a category for tags about visual features like color and shape. We found that people tended to use general terms to describe people and things most commonly. However, when they are tagging abstract works of art, they are much more likely to use tags about visual elements.
My PhD student Irene Eleta led our other study. She asked American native English speakers and native Spanish speakers from Spain to tag the same images. She found differences in the tags they assigned which were often culture specific. For example, on Winslow Homer’s “The Cotton Pickers”, Americans used tags like “Civil War” and “South” which Spanish taggers didn’t. This illustrates how translating tags can open up new types of access to people who use different languages and come from different cultures.
Ricky: Is there any of your research that you find would be particularly beneficial to those interested in digital stewardship?
Jennifer: Irene Eleta’s work on culture and language is very interesting. I think this is a relatively unexplored area, and there is so much that can be done by combining computational linguistics, other computing tools and metadata like tags to improve access.
Ricky: In your talk for the Digital Dialogues series at the Maryland Institute for Technology in the Humanities you presented three research projects using tags on art. Could you give us some background on research that was helpful in informing your work in this area?
Jennifer: I come from a computer science background, so I am far from an expert in this area. I read up a lot on metadata and some existing tools and standards like the Art & Architecture Thesaurus. We also worked with museum partners who brought the art and museum professional perspective, which was very helpful.
Ricky: You explained in the talk that understanding what people are tagging, and why, can help us design better tagging systems. Could you elaborate on this idea?
Jennifer: Tags have been shown to provide a lot of new data beyond what a cataloger or museum professional will usually provide. However, to maximize the benefit of tags, it helps to understand how they will improve people’s access to the images. Worthless tags do not help access. Our work is designed to understand what kinds of tags people are applying. This can help in a few ways. First, we can compare this to the terms people are searching for. If search terms match tags, it definitely reveals that tags are useful. Second, we can see if tags are applied more to one type of image than another. For example, I mentioned that people use a lot of color and shape tags for abstract images. This means if someone searches for a color term, the results may be heavily biased toward abstract images. This has implications for tagging system design. We might build an interface that encourages people to use visual element tags on all images or we might use some computer vision techniques to extract color and shape data. At the core, by understanding what people tag, we can think about how to encourage or change the tagging they are doing in order to improve access.
Ricky: Has your research uncovered any ways to encourage tagging? If so what are some of the factors which encourage and discourage tagging?
Jennifer: We haven’t made it to that point yet. We have uncovered a number of results that suggest how we can begin to design tagging systems and what we might want to encourage, but how to do this is still an open question.
Ricky: In a study you compared tags from native English speakers from the USA and native Spanish speakers from Spain. Could you tell us a little about the findings of this investigation and how cultural heritage institutions could benefit from this research?
Jennifer: (I described this work a bit above). Cultural heritage institutions can benefit from this in a couple ways. If they have groups who use different languages, they can provide bridges between these languages to allow monolingual speakers to benefit from the cultural insights shared in another language. This can be done by translating tags on the back end of the system. It also suggests that in order to open up their collections to other cultures, language tools will be important.
Ricky: You mentioned automatic translations could help in improving the accessibility of digital collections but it was more complex than that. What are some of the pros and cons of automatic translation which you came across in your research?
Jennifer: I discussed some of the pros above. However, automated translation is a hard problem, especially when working with single words. For example, disambiguation is a classic problem. If you see the tag “blues”, does it refer to the colors or to the music? When there is surrounding text, a tool can rely on context, but that is much harder with tags. If we want to rely on translation, we will have to do more work in this area.
Ricky: Is there any other work you would like to do with data from these studies, like the recordings of the eye-tracking sessions?
Jennifer: We have eye tracking data for people tagging images and looking at images. We also have it for people who spent time looking at an image for a while before tagging it and for people who began tagging immediately. It would be interesting to compare those to see how people look at art when they are given a task compared to when they are simply asked to look at it. Also, we can compare how people tag when they are familiar with an image vs. when they are seeing the image for the first time.
The Harvard Library developed FITS, the File Information Tool Set, as part of the ingest processing of its Digital Repository Service (DRS). This was mostly Spencer McEwen's work. It's a "Swiss army knife," running a number of different tools to identify formats and provide metadata information about files. It was put up on Google Code as open source, and a number of other institutions have started using it.
Harvard hasn't had the time to develop it into a more broadly useful project, but thanks to a SPRUCE Award, I've been spending April making various updates and fixes to it, with the results currently available on GitHub. That repository is a temporary way station; these changes will be merged into an institutionally maintained repository, though just where hasn't been determined yet.
The first task I undertook was adding Apache Tika as a new tool. The work on this started at the OPF Hackathon in Leeds. The advantage of Tika is that not only does it already cover a lot of formats, but it's actively maintained, so we can expect support for more formats in future releases. FITS is a Java application, and Tika is a reasonably well-documented Java library, so getting it to work wasn't very hard. The main complication was that Tika's output vocabulary is sprawling and undocumented, so there's no good way to tell what properties it might report in previously untested cases. This makes it more difficult to translate Tika terms into standard FITS output.
Several of the tools FITS uses were out of date. JHOVE hadn't been brought up to its latest version because attempts to do so produced less metadata than version 1.5 did. This turned out to be because JHOVE had updated to the current MIX 2.0 schema, and FITS was still trying to interpret it as MIX 0.2. Once the problem was found, the fix was obvious.
DROID was a more difficult case. FITS was using DROID 3, and DROID 6 was vastly changed, to the point that FITS got numerous compilation errors after dropping in DROID 6. DROID has no public API documentation, making things difficult. Matt Palmer, who has worked on DROID development, provided vital help in figuring out how to call the current version.
Some efficiency issues turned up. DROID uses an XML signature file to identify files. It's big, and parsing it took over 13 seconds on my computer. If FITS is run on a large directory, that time cost is spread out over a lot of files, but it's a problem if FITS is run on one file or a small directory. Hopefully future versions of DROID will add optimizations, perhaps a persistent serialized cache.
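To sketch that idea, in Python rather than DROID's actual Java, and with a stand-in for the real signature parsing, a persistent serialized cache might look like this:

```python
import os
import pickle
import xml.etree.ElementTree as ET

def parse_signature_xml(path):
    """Stand-in for the expensive parse: reduce the XML to plain,
    picklable data. (DROID's real parser builds far richer structures.)"""
    tree = ET.parse(path)
    return [(el.tag, dict(el.attrib)) for el in tree.iter()]

def load_signatures(xml_path, cache_path="signatures.cache"):
    """Return signature data, reusing a serialized cache when it is fresh."""
    if (os.path.exists(cache_path) and
            os.path.getmtime(cache_path) >= os.path.getmtime(xml_path)):
        with open(cache_path, "rb") as f:
            return pickle.load(f)    # fast path: skip the slow XML parse
    signatures = parse_signature_xml(xml_path)   # the ~13-second step
    with open(cache_path, "wb") as f:
        pickle.dump(signatures, f)
    return signatures
```

The cache is rebuilt automatically whenever the signature file is newer than the cached copy.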
The National Library of New Zealand's metadata tool was more problematic. An attempt to bring it up from version 3.4GA to 3.5GA ran into problems similar to the ones with DROID, with classes having been changed. Apparently this tool isn't being actively maintained, and I wasn't able to get the information needed to do the update. It's staying at 3.4GA in FITS.
Another task was improving the metadata vocabulary for video. FITS output isn't much more than a flat set of properties, so it wasn't possible to adopt any other schema full-blown, but ideas were used from a number of sources, including MediaInfo, Archivematica, and PBCore. Exiftool is currently the best of the tools for reporting video properties, so the output was shaped by what it can produce. Hopefully other tools, such as Tika, will produce more information on video files in future versions.
Documentation is an important part of any open source project, but one that often gets low priority. I did some work on the Javadoc and added documentation in the wiki pages of the GitHub repository. In particular, there are instructions on how to add a new tool to FITS.
Hopefully this work will make FITS a more useful tool, both for Harvard and for its other users.
The following is a guest post by Nicholas Taylor, Data Specialist for the National Digital Information Infrastructure and Preservation Program.
A previous post on this blog explored why it’s so hard to come up with a reliable measurement of the average lifespan of a webpage. In essence, the argument came down to this: links and the websites they represent tend to become decoupled over time. Without a broad understanding of how that process takes place, it’s hard to make definitive claims about the persistence of websites when available automated tools can only capably check for the persistence of links.
In an ideal web, webmasters would adhere to Tim Berners-Lee’s notion of “cool URIs” – links that have been purposely maintained so as to remain stable. Stable links are more useful to users, and it is technically feasible to maintain any particular link for at least the lifespan of the resource it points to. However, given both the popular perception of and the abundance of scholarly literature on link decay, it’s probably safe to say that Tim Berners-Lee’s vision for a cool URI-enabled web hasn’t yet been realized.
The good news is that websites are more durable than links. This is supported by multiple studies and makes intuitive sense as well. The bad news is that most contemporary web archiving tools are actually link archiving tools; they are designed to agnostically capture and replay the content represented by links, not the intellectual objects (i.e., the websites) of interest per se. For the Library of Congress thematic web archives, we can only ensure that the links we're capturing continue to correspond to the websites we care about preserving by manually inspecting them on an ongoing basis.
To better understand the discrepancy between link and website persistence as well as the disposition of websites that we previously archived, intern Heidi Hernandez and I revisited 1,071 links archived as part of the U.S. Election 2002 web archive collection. We excluded over 1,000 links corresponding to electoral candidate websites, as they were especially short-lived. The remaining links corresponded to state government, political party, advocacy group, newspaper, and political blog sites.
We followed a two-part methodology. First, following the approach of many other link persistence studies, we ran the entire list of links through a link checker and recorded the HTTP response codes (e.g. 404, 200, 301, 500). Second, we visited each of the links and noted whether the corresponding website was the same as the one that was archived. If the website was different, or if the link didn't work, we tried to discover the new location of the website using search engines.
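As an illustration of the first step, here is a minimal Python sketch of that kind of link checking, using the requests library (a sketch of the method, not the tool we actually used):

```python
import csv
import requests

def check_links(urls, outfile="responses.csv"):
    """Record the HTTP response code for each archived link."""
    with open(outfile, "w") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        for url in urls:
            try:
                # GET with redirects disabled, so a 301 is recorded
                # as a 301 rather than being silently followed.
                r = requests.get(url, timeout=30, allow_redirects=False)
                status = r.status_code       # e.g. 200, 301, 404, 500
            except requests.RequestException as exc:
                status = type(exc).__name__  # DNS failure, timeout, etc.
            writer.writerow([url, status])

check_links(["http://www.whitehouse.gov/"])
```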
There were a few noteworthy findings:
This last point should most certainly not be interpreted as a sign of the superfluity of web archiving; recall that over 1,000 links to now-disappeared websites were excluded from the analysis. Also consider, for example, that just because the White House website still exists eleven years after we archived it as part of the U.S. Election 2002 web archive collection doesn’t mean that any of the resources that made up the White House website of eleven years ago are still accessible now.
All in all, though, the results suggest a more complicated picture of the ephemeral web than the popular conception, which tends to conflate the disappearance of links with that of websites.
We have moved so far so fast with personal computing that older machines are acquiring a cultural patina. Everyone, seemingly, has a memory of "old computers," even if some people think having a hard drive under 100 gigabytes fits the definition.
There are perhaps two ways to think about obsolete computers. One is as trash or e-waste, which is a serious environmental problem. The issue has been building for years as computers and related peripherals age out after a few short years and are replaced by equipment that itself will be tossed in the near future. Even if they work just fine, older machines often are perceived to be too slow, too clunky or too uncool to keep around. Recycling is possible, but it doesn't always happen the way it should, resulting in exposures to dangerous chemicals and other materials.
Ironically, some older machines that escaped being dumped have a second life that far exceeds their original intended purpose. All you have to do is glance at the vintage computing section of an online auction website to see how valuable certain kinds of equipment have become. And, if you are lucky, you can even find good stuff for free: I liberated a fully functional Osborne 1 portable computer from a trash heap a few years ago, for example.
The rarest personal computers are the original models dating back to the 1970s. I found a great picture on Wikimedia that shows some of the earliest models, now exhibited at the Computer History Museum in Mountain View, California.
All this goes to say that if you know about a stash of old computer equipment it might be worth checking to see if it has secondary value. Older machines can live on for functional purposes, such as reading old software. Or they might simply have aesthetic value as reminders of the early days of computing. Either possibility beats adding to the e-waste problem.
Looking for a way to get the word out about digital preservation? I’ve added a new page on reviewing FTL to this site. All publicity (well, nearly all) is good!
While the digital preservation challenge is caused by technology, it is not solved by technology. Many research projects started out with the ambition to devise a technology solution (migration, emulation, encapsulation, etc.), and many memory institutions thought it would suffice to apply the R&D results: the methods and associated tools. However, it has become clear that such all-encompassing solutions do not exist. In addition, many tools and approaches have not survived the R&D stage. So, while R&D remains important for conducting research in specific, well-defined problem areas, it is not the main driving force behind digital preservation.
Although OPF originates from a research project and continues to foster R&D, its philosophy of digital preservation concentrates less on technology as a solution and more on growing digital competence as a long-term approach to digital preservation. In previous blogs I gave some background on this philosophy, which aims to
1) foster learning by doing as a means to develop skills and expertise in an area where best practices and standards have not yet matured and where research plays an important supportive role;
2) cultivate a community of experts and skilled people who embrace the values of active learning and professional sharing, values which assume a certain degree of organisational readiness on the part of memory institutions.
In this blog I will explain how the OPF hackathons are supporting these aims and why preservation managers should send their staff to them.
What are OPF hackathons?
Our hackathons are 3-day events organised around a specific digital preservation topic or challenge, and they bring together curators (those who understand the content and value of their collections) and software engineers (those who understand the underlying digital nature of these collections). In OPF-speak, we bring together the "practitioners" and the "developers", which is a practical way to distinguish between 2 different roles: 1) the role of the practitioners, who collect digital materials and can come with real examples and real, day-to-day problems they encounter when managing these materials; and 2) the role of the developers, which is the equivalent of that of the "conservators" in the analogue domain: they examine the digital materials (the files and the bit streams underlying the digital objects); suggest methods for storing, displaying, treating and processing them; research new techniques; etc.
In bringing these 2 roles together we are creating fruitful synergies, which not only result in practical solutions but more importantly, in cultivating a community of experts who share and develop professional practices together. The concept is simple: practitioners bring troubled data and developers “hack” with existing tools and develop practical approaches. Usually the problems and solutions are very much hands-on. They are neither about state of the art R&D nor about building future frameworks or digital sustainability platforms. They are not about risk assessment or risk management. We talk about the day-to-day operations and the use of tools such as Apache Tika and DROID, in real practice. We talk about integration of tools in workflows and compare practices. In this way we are building a shared practice, based on learning by doing.
Why is it important for memory institutions to send their people to OPF hackathons?
Institutions with a mission to preserve society's digital heritage need to develop competence and confidence in digital preservation. It is OPF's conviction that the best way to do so is by investing in staff development. OPF hackathons are a good substitute for, and cheaper than, formal training programmes. They help your staff to develop the knowledge, skills and abilities needed to perform their daily tasks. Through participation they can rely on peer support from the OPF community and, in turn, derive job satisfaction from contributing to the community.
Just in case you don’t follow the other channels in which I’ve been talking it up, Files that Last, my new e-book on digital preservation for “everygeek,” is now out. It covers issues of backup, archiving, file formats, and long-term planning. Right now it’s available from Smashwords, Kobo, and the iTunes Store. It hasn’t shown up on Amazon yet, but I expect it will soon.
I’m not exactly impartial on this, but I think you’ll find it a valuable resource for preservation planning on the personal level and for large and small organizations.
I’ve spent the last few months looking at the JISC data management planning projects. It’s been very interesting. Data management planning for research is still comparatively immature, and so are the tools that are available to support it. The research community needs more and better tools at a number of levels. Here are my thoughts… what do you think?
At group or institution level, we need better “maturity assessment” tools, by which I mean tools like CARDIO and DAF.
Some of the existing tools seem rather ad hoc, as if they had emerged and developed from somewhat casual beginnings (perhaps not well put; maybe from beginnings unrelated to the scale of the tasks now facing researchers and institutions). It is perhaps now time for a tool assessment process involving some of the stakeholders, to help map the landscape of potential tools and use this to plot the development (or replacement) of existing tools.
For example, CARDIO and DAF, I'm told, are really tools aimed at people acting in the role of consultants, helping to support a group or institutional assessment process. Perhaps if they could be adjusted to be more self-assessment-oriented, it might be helpful. The DAF resource really needs to be brought up to date and made internally consistent in its terminology.
Perhaps the greatest lack here is a group-oriented research data risk-assessment tool. This could be as simple as a guide-book and a set of spreadsheets. But going through a risk assessment process is a great way to start focusing on the real problems, the issues that could really hurt your data and potentially kill your research, or those that could really help your research and your group’s reputation.
We also need better DMP-writing tools, i.e. better versions of DMPonline or DMP Tool. The DCC recognises that DMPonline needs enhancement, and has written in outline about what they want to do, all of which sounds admirable. My only slight concern is that the current approach, with templates for funders, disciplines and institutions in order to reflect all the different nuances, requirements and advice, sounds like a combinatorial explosion (I may have misunderstood this). It is possible that the DMP Tool approach might reduce this combinatorial explosion, or at least parcel elements of it out to the institutions, making it more manageable.
The other key thing about these tools is that they need better support. This means more resources for development and maintenance. That might mean more money, or it might mean building a better Open Source partnership arrangement. DMPonline does get some codebase contributions already, but the impression is that the DMP Tool partnership model has greater potential to be sustainable in the absence of external funding, which must eventually be the situation for these tools.
It is worth emphasising that this is nevertheless a pretty powerful set of tools, and potentially very valuable both to researchers planning their projects and to institutions, departments and the like trying to establish the necessary infrastructure.
Image scanning of one sort or another has been in common usage in some industries since the 1920s.
Yes, really, the 1920s.
The news wire services used telephotography — where images are captured using photo cells and transmitted over phone lines — well into the 1990s. Scanners and digital cameras like those we are familiar with came out of development in the 1960s and 1970s, and were already hitting the commercial market by the 1980s.
I have vivid memories of my first digitization project, because that project changed the course of my career.
In 1986 I was in graduate school and volunteering for the Fowler Museum of Cultural History at UCLA. One day the Collections Manager came down to the archaeology collections in the sub-basement (where I was surveying the human skeletal remains in the collections for our NAGPRA records) and said to me: "How would you like to move from the sub-basement to the basement?" How could anyone say no to that?
The project was to do a recon on all the paper records and enter them into the brand new Argus system running on a mini-mainframe. I am pretty certain that we were Questor’s second customer, after the Southwest Museum. While the recon project taught me the basics about what became the focus of my career — collection records management, digitization, system administration, being a DBA, working with authority control and creating multilingual controlled vocabularies — what was particularly exciting about the system was that it had the capacity to link to digital images.
So we started digitizing. We had acquired a particularly exciting and important archaeological collection, and I had the opportunity to work on the digitization. The objects were set on a stand and the image was captured via a video camera and written to tape, with a video titler used to embed the accession number into the image. The tapes were then mastered onto laser disks.
Now, this was very cutting edge – one entered an address for an image on a laser disk into a field in the object record, and the system could address the file on the laser disk and display it on a dedicated terminal. We had an early Sony Mavica camera, which used 3.5″ floppy disks as its storage media. And we had a printer, which printed color photos the size of old school Polaroids. It was heady stuff.
In 1988 I attended my first Museum Computer Network conference, another event that shaped my career. The 1989 MCN meeting was the pivotal one. We had our first meeting of a Visual Information SIG, where at least a dozen organizations shared their experiments, successes, and failures with digital imaging. I still have my write-up from that meeting, which appeared as a column in Spectra. I chaired that group for many years, and that group helped build a community around imaging practice that still exists.
Of course there were many early leaders and innovators in digital imaging. The American Museum of Natural History. The Fine Arts Museums of San Francisco Thinker imagebase. The Library of Congress American Memory project. Harvard University’s libraries and museums. Numerous Smithsonian projects. And too many others to name.
What other imaging projects were people involved in during the 1980s? If you are interested in the history of digital imaging, I suggest the Digital Imaging page at CoOL, which includes a great historical bibliography. Not all the links work, but it's a great jumping-off point for a history of the discipline.
Which way should you go? I’ll say first of all, just buy the book and I’ll be happy. Buying through Smashwords will give me a bigger cut than the other channels, but a sale’s a sale. If you’re planning to read it on an iPhone, iPod, or iPad, getting it from Apple is the easiest way to get it on there. I don’t really know anything about Kobo.
There should be more ways to buy FTL within the next week or so.
Last Friday's CURATEcamp AVpres was a collaboration between several physical sites, using Google Hangout and IRC. I'd been asked if I could give a lightning presentation online about my work on FITS, but I had a commitment on the 19th, so Andrea Goethals at the Harvard Library said she'd do one.
That, unfortunately, was the day the Tsarnaev brothers went on their spree in Cambridge, and Harvard was closed for the day. Paul Wheatley picked up the job on short notice and did a presentation; the slide show is online. Paul suggested people should look at the work I'm putting on the GitHub repository after I'm finished at the end of April, but I wouldn't mind if people tried it out now, while I'm still devoting my time to the project.
A simple but useful tool that's part of FITS's collection is FFident, written by Marco Schmidt. He apparently is no longer maintaining it, and its page disappeared from the Web, but it was retained on the Internet Archive. It seemed like a good idea to make it more readily available, so I've put it, under its LGPL license, into a GitHub repository.
FITS uses its own copy of the source code, so this really isn’t tested at all in its own right, but it’s there for people to play with. I added a build.xml file and organized the code the way Eclipse likes it. I don’t have any plans to support it, but if anyone wants to play with it, it’s there.
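A build.xml for a small library like this need not be elaborate. A minimal sketch of the sort of thing involved (illustrative only, not the actual file in the repository):

```xml
<!-- Minimal Ant build sketch: compile the sources and package a jar. -->
<project name="ffident" default="jar">
  <target name="compile">
    <mkdir dir="build/classes"/>
    <javac srcdir="src" destdir="build/classes" includeantruntime="false"/>
  </target>
  <target name="jar" depends="compile">
    <jar destfile="build/ffident.jar" basedir="build/classes"/>
  </target>
</project>
```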
The following is a guest post by Christie Moffatt, Manager, Digital Manuscripts Program, History of Medicine Division, National Library of Medicine
In this installment of the “Content Matters” series of the National Digital Stewardship Alliance Content Working Group, I interview Dr. Sylvia Chou, PhD, MPH, Program Director of the National Cancer Institute’s Health Communication and Informatics Research Branch. Sylvia recently spoke at the National Library of Medicine on “Health Communication in the Digital World: Innovation and the Role of Social Media,” sharing some of her research on health-related uses of social media. In this interview, I asked Sylvia to describe a bit more about her research and her perspective on the value of preserving Web 2.0 communications (as well as scientific communications about those communications) over time.
Christie: Could you share a bit about your interest in social media and health? What is the focus of your current research?
Sylvia: I came to the National Cancer Institute as a Cancer Prevention Fellow interested in how people communicate about health and the impact of communication on attitudes, perceptions, and behavior related to health. One of my first publications, “Social Media Use in the United States: Implications for Health Communication,” based on data from the Health Information National Trends Survey (HINTS), received attention, particularly on the digital divide and how public health and clinical researchers may leverage Web 2.0 communication in their work.
As I engaged in analyzing national surveys on this topic, I also engaged in qualitative research on social media in an effort to begin understanding motivations and the nature of use. As an example, I began studying cancer survivors' narratives posted on YouTube to better understand how and why people are sharing personal stories through what was later termed "user-generated content," characteristic of social media. Subsequently, in a literature review, we saw an abundance of commentaries about social media, but generally not as much empirical work to date testing the utility of social media for health promotion. We felt that the time was ripe for more rigorous research on the topic.
Christie: What have you been able to learn through this research about the users of social media for health communication? How are people using social media to communicate about health?
Sylvia: We’ve learned that social media reactions to health messages, including health campaigns and advertisements, are proving to be an authentic representation of how people feel about those messages, and that they can serve as interesting “data” for social scientists. Relatedly, we’ve learned that active users of social media show different levels of self-disclosure. In some venues (e.g. Facebook), people seem to post everyday thoughts and observations with little editing or shielding of privacy.
We’ve also learned more about the influence of user-generated health content and its potential for disseminating public health information. For example, there are many YouTube videos and blog posts in which people share personal stories about a specific health care procedure, like colonoscopy, and report that “it wasn’t so bad” or “it saved my life.” These narratives can be more persuasive than traditional public health guidelines or communication efforts. Stories like these are not coached and not perfect, but perhaps their authenticity makes them more effective in health promotion efforts. On the other hand, the use of personal narratives in social media can have negative health impacts: individuals’ experiences are not necessarily evidence-based and can contain health myths, which can spread quickly on social networks.
Christie: Where do you expect this research will make its greatest impact?
Sylvia: Empirical research on social media will be helpful to those developing strategies for health campaigns. It may also help clinicians, as they become more aware of the impact that social media conversations (e.g. negative stereotypes of individuals suffering from health issues) can have on the patients they are helping. For instance, in our project examining social media discourse about obesity, we found rampant weight stigmatization of individuals struggling with weight issues. Documentation of such online behavior can help clinicians and public health practitioners better understand the experiences of their clients and the barriers they face in improving their health.
Christie: What is your approach to finding and analyzing social media for use in your research? What types of communications are you studying?
Sylvia: We have used commercial data-mining companies, whose services are often geared toward marketing purposes, to run keyword searches on blogs, forums, YouTube, Facebook, and Twitter. My colleagues and other investigators have also relied on Internet panels (e.g. focus groups and survey questionnaires) and on the data that Google and Twitter make available to researchers.
Christie: How do you gather and collect the data for your research? Do you keep an archive of this data for other researchers?
Sylvia: We have stored data in the cloud (such as on an Amazon server or in Dropbox); increasingly, we are seeing large datasets stored in such ways. My qualitative research (e.g. on YouTube posts) is smaller in scope, so the data are a bit easier to store: we save the URL, transcribe the video content, and also use low-tech screen captures.
Christie: Have you considered working with data in web archives?
Sylvia: I would love to work with web data in an archive. A main concern would be about selection bias. When I gather my own data I know my own selection criteria. A web archive would need to be clear on how content was selected for inclusion. It is also important to be able to date/time-stamp captured content, to be able to say “as of this date this is what the content was.”
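An aside on the time-stamping Sylvia mentions: the Internet Archive’s Wayback Machine exposes a CDX API that reports each capture of a URL together with a 14-digit timestamp (yyyyMMddhhmmss). A minimal Java sketch along these lines (the target URL and the fields requested are illustrative choices, not anything from the interview; Java 11+ assumed) lists a few captures:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class WaybackCaptures {
        public static void main(String[] args) throws Exception {
            // Ask the CDX API for captures of a page; fl=timestamp,original
            // returns just the capture time and the URL that was captured.
            String target = "cancer.gov";  // illustrative target for this sketch
            String api = "http://web.archive.org/cdx/search/cdx?url=" + target
                    + "&fl=timestamp,original&limit=5";
            HttpRequest request = HttpRequest.newBuilder(URI.create(api)).build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());  // one "timestamp URL" pair per line
        }
    }

Each returned line pairs a capture date with the captured URL: exactly the “as of this date, this is what the content was” record an archive can provide.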
Christie: What are your observations on how the communication of research findings has changed with Web 2.0 technologies? Do you have a blog or online notebook?
Sylvia: Many individuals and agencies feel pressure to host a blog and Tweet (our own NIH Director, Francis Collins, is one example), but I have not done so yet and continue to publish my research through the formal peer review process. At the end of the day, having my publications in PubMed provides credibility and a searchable presence; as a scientist, I rely on peer review to validate my work. I have been thinking, though, of becoming a little more active on Twitter, as I often have to rely on my colleagues (particularly postdocs) to notify me of relevant Tweets about my presentations and research, and I probably should become more engaged on these channels.
Christie: How can the digital preservation community reach out to researchers on issues of digital preservation?
Sylvia: The digital preservation community could aim to capture what is equally important but not recorded in traditional scientific publishing. Some academics, for example, may be substituting engagement on blogs for traditional publishing venues, and such work is not reflected in the scientific databases.
Perhaps institutions could partner with researchers to develop preservation strategies that support research questions (i.e. preserve the specific social media or other digital content used in research) and manage that data in a shared scientific space.
Christie: One of the struggles the digital preservation community faces is the fact that not everything can (or should) be preserved. What is your perspective on the most valuable content to preserve in support of your own research?
Sylvia: The digital preservation community should play a role in selectively preserving and documenting the diachronic evolution of material, content, and ways of sharing value, with an aim to filter out some of the noise.
On the other hand, I can see that the more content is captured, the more research can be done. Some kind of sampling of social media could work, but the process for doing so would need to be well framed, and the amount selected would depend on the level of resources available.
From a social science or epidemiological perspective, there are some interesting research questions about the history of health and health communication. An archive of health-related social media communications could serve as a rich resource for studying how people talk about health and science in a given time or place, and how these types of communications are changing over time. We can study shifts in the way that information is communicated.
What kind of content matters to you? This is but one case for preserving valuable content for long-term access. If you or your institution would like to share your own story of the use and long-term value of access to a particular type of born-digital resource, please send us a note at email@example.com and mark the subject line to the attention of the Content Working Group. We would love to hear from you!
Libraries can buy Files that Last through Axis360 and Cloud Library, or will be able to at some point in the future. Since libraries are clearly key customers, both as users and as lenders, I’ve made the book available to them at a permanent discount of $6.99. In addition to those aggregators, libraries can buy it through Smashwords’ Library Direct.
Librarians, please let me know if you have good or bad experiences buying the book this way, or if you’ve had past experience with these channels.