Planet DigiPres

Netnography and Digital Records: An Interview with Robert Kozinets

The Signal: Digital Preservation - 13 August 2014 - 1:19pm

Robert V. Kozinets, professor of marketing at York University in Toronto

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Online communities, and their digital records, can be a rich source of information, invaluable to academic researchers and to market researchers. In this installment of the Insights Interviews series, I’m delighted to talk with Robert V. Kozinets, professor of marketing at York University in Toronto and the originator of “netnography.”

Julia: In your book “Netnography: Doing Ethnographic Research Online,” you define “netnography” as a “qualitative method devised specifically to investigate the consumer behavior of cultures and communities present on the Internet.”  Can you expand a bit on that definition for us? What is it about online communities that warrants minting a new word for doing ethnographic work online? Further, how would you compare and contrast your approach to other terms like “virtual ethnography”?

Robert: It’s a great question, and one that is difficult to do justice to in a short interview. For readers who are aware of the anthropological technique of ethnography, or participant-observation, it may be fairly easy to grasp that ethnographic work can also be performed in online or social media environments. However, doing ethnographic work on the combination of digital archives and semi-real-time conversations, and much more, that is the Internet is a bit different from, say, traveling to Outer Mongolia to learn about how people live there. The online environment is technologically mediated, it is instantly archived, it is widely accessible, and it is corporately controlled and monitored in ways that face-to-face behavior is not. Netnography is simply a way to approach ethnography online, and it could just as easily be called “virtual,” “digital,” “web,” “mobile” or other kinds of ethnography. The difference, I suppose, is that netnography has been associated with particular research practices in a way that these other terms are not.

Julia: You began implementing netnography as a research method in 1995. The web has changed a good bit since you started doing this work nearly twenty years ago. How has the continued development of web applications and software changed or altered the nature of doing netnographic research? In particular, has the increased popularity of social media (Facebook, Twitter) changed work in studying online communities?

Networking, from user jalroagua on Flickr

Robert: This is a little like asking an experimental researcher if the experiments they run are different if they are running them on children or old people, or if they are experimenting on prisoners in a prison, or students at a party. It is a tactical and operational issue. The guiding principles of netnography are exactly the same whether it is a bulletin board, a blog or Facebook. Fundamental questions of focus, data collection, immersion and participation, analysis, and research presentation are identical.

Julia: How do you suggest finding communities online outside of the relatively basic search operations offered by Google and Yahoo? What are some signs that a particular online community will be a good source for netnographic research?

Robert: There are many search tools that are available, but there is no particular need to go beyond Google or Yahoo. The two keys to netnography are finding particularly interesting and relevant data amongst the load of existing data, and paying particular attention to one’s own role and consciousness as participant in the research process. Whatever tools one chooses to work with, this is time-consuming, painstaking and rewarding work. One thing I would love search engines to be able to do is to include and tag visual, audio and audiovisual material. It would be wonderful to have a search engine that spat out results to a search and gave me, along with website, blog and forum links, a full list of links to Instagram photos, YouTube videos and iTunes podcasts.

Julia: Throughout the book, you reinforce the point that the key to generating insight in netnography is building trust. Can you unpack that a bit? What are some ethical concerns researchers should keep in mind when conducting ethnographic research?

Robert: A range of ethical concerns have been raised about the use of Internet data, many of which have proven over the years to be non-starters. Notions of informed consent can be difficult online, and ethical imperatives can be difficult in environments where the line between public and private is so unclear. However, disclosure of the researcher or the research is not always necessary–it depends always upon the context. As with any research ethics question, it is generally a question of weighing potential benefits against potential risks.

Julia: From your perspective as an ethnographer and market researcher, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve so that researchers of the future can study this moment in history? Collecting and preserving content isn’t your area, but I’d be interested to hear whether you think there are particular subcultures, movements or content that aren’t getting enough attention.

Robert: I have used the Wayback Machine from time to time to look at snapshots of the Internet of the past. I also recall a recent research project in which we studied bloggers, and in which some interesting blog material was removed shortly after it was posted. It survived only in our fieldnotes, but we had not archived it. Of course, it would be nice to be able to instantly retrieve “the data that got away.” However, in my research, it is the immediate experience of the Internet which matters most.

Given the rapid spread of social media, I believe that the present holds far more information and insight than any other time in the past. There are so many archives of so many particular groups already, and those archives are, in themselves, rather revealing cultural artifacts. The ones I find the most fascinating to study are the archives that groups make of their own activities. So, to answer your last question as a library science question: I would be more interested to see the archives that library science people construct about library science, and how they represent themselves to themselves and to wider audiences of assumed “others,” than in how library science people represent any other group.

Julia: Aside from what to collect, I would be curious to learn a bit more about what kinds of access you think researchers studying digital culture are going to want to have to these collections. How much of this do you think will be focused on close reading of what individual pages and sites looked like and how much on bulk analysis of materials as data?

Rob: I think researchers are hungry for everything. If you ask typical researchers what data they want, they will say everything. That is because, without a specific focus or research question, you want to keep all of your options open. Then the problem becomes what they do with all this data, and they end up with all sorts of big data methods that try to fit as much data as possible into models. My approach is a bit different, in that I am searching for individual experiences online that generate insight. This could come from masses of data, or from one page, one site, even one photograph or one video clip. I think the question of access is tied up with questions of categorizing, interpretation and ownership, and these are all interesting and complex matters that lend themselves to a lot more thought and debate. In the short- to medium-term, what is currently available on the Internet is certainly more than enough for me to work with.

Categories: Planet DigiPres

Coming to "Preserving PDF - identify, validate, repair" in Hamburg?

Open Planets Foundation Blogs - 12 August 2014 - 10:01am

The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A, and open source tools that can help. This is a quick list of things you can do to prepare for the event if you're attending and looking to get the most out of it.


The Wikipedia entry on PDF provides a readable overview of the format's history with some technical details. Adobe provide a brief PDF 101 post that avoids technical detail.

Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:

This MacTech article is still a reasonable introduction to PDF for developers. Finally, if you really want a detailed look you could try the Adobe specification page, but it's heavyweight reading.


Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them have user-friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment to allow you to experiment in a friendly, throwaway environment. See the Software section a little further down.


JHOVE is an open source tool that performs format-specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF actually conforms to the PDF/A specification.
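As a rough illustration of what the identification step involves (this is not JHOVE's code, just a minimal sketch in Python of the kind of signature check such tools start from):

```python
def identify_pdf(data: bytes):
    """Crude format identification by magic number: a PDF file begins
    with the signature '%PDF-' followed by a version such as 1.4.
    Tools like JHOVE go far beyond this, parsing the cross-reference
    table, trailer and object structure to validate the file."""
    if not data.startswith(b"%PDF-"):
        return None  # not a PDF at all
    # The version digits immediately follow the signature, e.g. '%PDF-1.4'
    return data[5:8].decode("ascii", errors="replace")

# A real tool would read from disk; raw bytes keep the sketch self-contained.
print(identify_pdf(b"%PDF-1.4\n%binary-comment"))  # prints 1.4
print(identify_pdf(b"GIF89a..."))                  # prints None
```

Signature checking only answers "does this look like a PDF?"; validation against the specification is a much deeper job, which is exactly what makes tools like JHOVE worth the install effort.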

Apache Tika

The Apache Foundation's Tika project is an application / toolkit that can be used to identify, parse, extract metadata, and extract content from many file formats.  

Apache PDFBox

Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module that verifies PDF/A-1 documents that has a command line utility.

These libraries are of particular interest to Java developers, who can incorporate them into their own programs; Apache Tika uses the PDFBox libraries for PDF parsing.
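To make the claiming-versus-conforming distinction from the JHOVE notes above concrete: a PDF declares PDF/A conformance through an entry in its XMP metadata (the pdfaid:part property). Merely finding that claim is cheap; checking the actual PDF/A rules is the much harder job a validator like PDFBox's preflight module performs. A hedged sketch of the cheap half, assuming nothing more than a raw scan of the file's bytes:

```python
import re

def claimed_pdfa_part(data: bytes):
    """Scan a PDF's raw bytes for the XMP entry that *declares* PDF/A
    conformance, e.g. <pdfaid:part>1</pdfaid:part>. Finding the claim
    says nothing about whether the file actually satisfies PDF/A;
    that requires a real validator."""
    m = re.search(rb"pdfaid:part>\s*(\d)", data) or \
        re.search(rb'pdfaid:part="(\d)"', data)
    return int(m.group(1)) if m else None

xmp = b"<x:xmpmeta>...<pdfaid:part>1</pdfaid:part>...</x:xmpmeta>"
print(claimed_pdfa_part(xmp))          # prints 1
print(claimed_pdfa_part(b"%PDF-1.4"))  # prints None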

Test Data

These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:

PDFs from GovDocs selected dataset

The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. The corpus was reduced in size by David Tarrant, who removed similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.

Isartor PDF/A test suite

The Isartor test suite is published by the PDF Association's PDF/A competency centre. In their own words:

This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.

More information about the suite can be found on the PDF Association's website along with a download link.

PDFs from OPF format corpus

The OPF has a GitHub repository where members can upload files that represent preservation risks or problems. It has a couple of sub-collections of PDFs: these show problem PDFs from the GovDocs corpus, and this is a collection of PDFs with features that are "undesirable" in an archive setting.


If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations and exercises, we'll be providing light virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part, but it does offer the best way to try things for yourself, particularly if you're not a techie. These are available for Windows, Mac and Linux and should run on most people's laptops; download links are shown below.

Vagrant downloads page:

Oracle VirtualBox downloads page:

Be sure to install the VirtualBox extensions as well; it's the same download for all platforms.

What next?

I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.

Categories: Planet DigiPres

Networked Youth Culture Beyond Digital Natives: An Interview With danah boyd

The Signal: Digital Preservation - 11 August 2014 - 6:00pm
danah boyd, principal researcher, Microsoft Research, research assistant professor in media, culture and communication at New York University, and fellow with Harvard’s Berkman Center for Internet & Society.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects that led up to CurateCamp Digital Culture in July. This is part of an ongoing series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

How do teens use the internet? For researchers, reporters and concerned parents alike, that question has never been more relevant. Many adults can only guess, or extrapolate based on news reports or their own social media habits. But researcher danah boyd took an old-fashioned but effective approach: she asked them.

I’m delighted to continue our ongoing Insights Interview series today with danah, a principal researcher at Microsoft Research, a research assistant professor in media, culture and communication at New York University, and a fellow at Harvard’s Berkman Center for Internet & Society. For her new book It’s Complicated: The Social Lives of Networked Teens, she spent about eight years studying how teens interact both on- and off-line.

Julia: The preface to your latest book ends by assuring readers that “by and large, the kids are all right.” What do you mean by that?

danah: To be honest, I really struggle with prescriptives and generalizations, but I had to figure out how to navigate those while writing this book.  But this sentence is a classic example of me trying to add nuance to a calming message.  What I really mean by this – and what becomes much clearer throughout the book – is that the majority of youth are as fine as they ever were.  They struggle with stress and relationships.  They get into trouble for teenage things and aren’t always the best equipped for handling certain situations.  But youth aren’t more at-risk than they ever were.  At the same time, there are some youth who are seriously not OK.  Keep in mind that I spend time with youth who are sexually abused and trafficked for a different project.  I don’t want us to forget that there are youth out there that desperately need our attention. Much to my frustration, we tend to focus our attention on privileged youth, rather than the at-risk youth who are far more visible today because of the internet than ever before.

Photograph from “pie and box supper,” Quicksand school, Breathitt County, Kentucky, September 1940. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints & Photographs Division.

Julia: In a recent article you stated that “social media mirror, magnify, and complicate countless aspects of everyday life, bringing into question practices that are presumed stable and shedding light on contested social phenomena.” Can you expand a bit on this?

danah: When people see things happening online that feel culturally unfamiliar to them, they often think it’s the internet that causes it. Or when they see things that they don’t like – like bullying or racism – they think that the internet has made it worse.  What I found in my research is that the internet offers a mirror to society, good, bad and ugly.  But because that mirror is so publicly visible and because the dynamics cross geographic and cultural boundaries, things start to get contorted in funny ways.  And so it’s important to look at what’s happening underneath the aspect that is made visible through the internet.

Julia: In a recent interview you expressed frustration with how, in the moral panic surrounding social media, “we get so obsessed with focusing on relatively healthy, relatively fine middle- and upper-class youth, we distract ourselves in ways that don’t allow us to address the problems when people actually are in trouble.” What’s at stake when adults and the media misunderstand or misrepresent teen social media use?

danah: We live in a society and as much as we Americans might not like it, we depend on others.  If we want a functioning democracy, we need to make sure that the fabric of our society is strong and healthy.  All too often, in a country obsessed with individualism, we lose track of this.  But it becomes really clear when we look at youth.  Those youth who are most at-risk online are most at-risk offline.  They often come from poverty or experience abuse at home. They struggle with mental health issues or have family members who do.  These youth are falling apart at the seams and we can see it online.  But we get so obsessed with protecting our own children that we have stopped looking out for those in our communities that are really struggling, those who don’t have parents to support them.  The urban theorist Jane Jacobs used to argue that neighborhoods aren’t safe because you have law enforcement policing them; they are safe because everyone in the community is respectfully looking out for one another.  She talked about “eyes on the street,” not as a mechanism of surveillance but as an act of caring.  We need a lot more of that.

Southington, Connecticut. Young people watching a game. 1942 May 23-30. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

Julia: You conduct research on teen behaviors both on and offline. How are physical environments important to understanding mediated practices? What are the limitations to studying online communities solely by engaging with them online?

danah: We’ve spent the last decade telling teenagers that strangers are dangerous, that anyone who approaches them online is a potential predator.  I can’t just reach out to teens online and expect them to respond to me; they think I’m creepy.  Thus, I long ago learned that I need to start within networks of trust. I meet youth through people in their lives, working networks to get to them so that they will trust me and talk about their lives with me. In the process, I learned that I get a better sense of their digital activities by seeing their physical worlds first.  At the same time, I do a lot of online observation and a huge part of my research has been about piecing together what I see online with what I see offline.

Julia: Researchers interested in young people’s social media use today can directly engage with research participants and a wealth of documentation over the web. When researchers look back on this period, what do you think are going to be the most critical source material for understanding the role of social media in youth culture? In that vein, what are some websites/data sets and other kinds of digital material that you think would be invaluable for future researchers to have access to for studying teen culture of today 50 years from now?

El Centro (vicinity), California. Young people at the Imperial County Fair. 1942 Feb.-Mar. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

danah: Actually, to be honest, I think that anyone who looks purely at the traces left behind will be missing the majority of the story.  A lot has changed in the decade in which I’ve been studying youth, but one of the most significant changes has to do with privacy.  When I started this project, American youth were pretty forward about their lives online. By the end, even though I could read what they tweeted or posted on Instagram, I couldn’t understand it.  Teens started encoding content. In a world where they can’t restrict access to content, they restrict access to meaning.  Certain questions can certainly be asked of online traces, but meaning requires going beyond traces.

Julia: Alongside your work studying networked youth culture, you have also played a role in ongoing discussions of the implications of “big data.” Recognizing that researchers now and in the future are likely going to want to approach documentation and records as data sets, what do you think are some of the most relevant issues from your writing on big data for cultural heritage institutions to consider about collecting, preserving and providing access to social media, and other kinds of cultural data?

teenagers and their smartphones visiting a museum by user vilseskogen on Flickr.

danah: One of the biggest challenges that archivists always have is interpretation. Just because they can access something doesn’t mean they have the full context.  They work hard to piece things together to the best that they can, but they’re always missing huge chunks of the puzzle.  I’m always amazed when I sit behind the Twitter firehose to see the stream of tweets that make absolutely no sense.  I think that anyone who is analyzing this data knows just how dirty and confusing it can be.  My hope is that it will force us to think about who is doing the interpreting and how.  And needless to say, there are huge ethical components to that.  This is at the crux of what archivists and cultural heritage folks do.

Julia: You’ve stated that “for all of the attention paid to ‘digital natives’ it’s important to realize that most teens are engaging with social media without any deep understanding of the underlying dynamics or structure.” What role can cultural heritage organizations play in facilitating digital literacy learning?

danah: What I love about cultural heritage organizations is that they are good at asking hard questions, challenging assumptions, questioning interpretations.  That honed skill is at the very center of what youth need to develop.  My hope is that cultural heritage organizations can go beyond giving youth the fruits of their labor and inviting them to develop these skills.  These lessons don’t need to be internet-specific. In many ways, they’re a part of what it means to be critically literate period.

Categories: Planet DigiPres

August Library of Congress Digital Preservation Newsletter is Now Available

The Signal: Digital Preservation - 8 August 2014 - 3:02pm

The August Library of Congress Digital Preservation Newsletter is now available:

Included in this issue:

  • Digital Preservation 2014: It’s a Thing
  • Preserving Born Digital News
  • LOLCats and Libraries with Amanda Brennan
  • Digital Preservation Questions and Answers
  • End-of-Life Care for Aging, Fragile CDs
  • Education Program updates
  • Interviews with Henry Jenkins and Trevor Blank
  • More on Digital Preservation 2014
  • NDSA News, and more
Categories: Planet DigiPres

Cookbooks vs. learning books

File Formats Blog - 8 August 2014 - 12:30pm

A contract lead got me to try learning more about a technology the client needs. Working from an e-book I already had, I was soon thinking that it’s a really confused, ad hoc library. But then I remembered having this feeling before, when the problem was really the book. I looked for websites on the software and found one that explained it much better. The e-book had a lot of errors, using JavaScript terminology incorrectly and its own terminology inconsistently.

A feeling came over me, the same horrified realization the translator of To Serve Man had: “It’s a cookbook!” It wasn’t designed to let you learn how the software works, but to get you turning out code as quickly as possible. There are too many of these books, designed for developers who think that understanding the concepts is a waste of time. Or maybe the fault belongs less to the developers than to managers who want results immediately.

A book that introduces a programming language or API needs to start with the lay of the land. What are its basic concepts? How is it different from other approaches? It has to get the terminology straight. If it has functions, objects, classes, properties, and attributes, make it clear what each one is. There should be examples from the start, so you aren’t teaching arid theory, but you need to follow up with an explanation.

If you’re writing an introduction to Java, your “Hello world” example probably has a class, a main() method, and some code to write to System.out. You should at least introduce the concepts of classes, methods and importing. That’s not the place to give all the details; the best way to teach a new idea is to give a simple version at first, then come back in more depth later. But if all you say is “Compile and run this code, and look, you’ve got output!” then you aren’t doing your job. You need to present the basic ideas simply and clearly, promise more information later, and keep the promise.

Don’t jump into complicated boilerplate before you’ve covered the elements it’s made of. The point of the examples should be to teach the reader how to use the technology, not to provide recipes for specific problems. The problem the developer has to solve is rarely going to be the one in the book. They can tinker with the examples until they fit their own problem, not really understanding them, but that usually results in complicated, inefficient, unmaintainable code.

Expert developers “steal” code too, but we know how it works, so we can take it apart and put it back together in a way that really suits the problem. The books we can learn from are the ones that put the “how it works” first. Cookbooks are useful too, but we need them after we’ve learned the tech, not when we’re trying to figure it out.

Tagged: books, writing
Categories: Planet DigiPres

Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress

The Signal: Digital Preservation - 6 August 2014 - 2:18pm

The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection.

The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the passions the profession shares with the gaming community. Video game enthusiasts have long been fascinated by unreleased games and “lost levels,” gameplay levels which are partially developed but left out of the final release of the game. Discovery is, of course, a key component to gameplay. Players revel in the thrill of unlocking the secret door or uncovering Easter eggs hidden in the game by developers. In many ways, the fascination with obtaining access to unreleased games or levels brings this thrill of discovery into the real world. In a recent article written for The Atlantic, Heidi Kemps discusses the joy in obtaining online access to playable lost levels from the 1992 Sega Genesis game, Sonic The Hedgehog 2, reveling in the fact that access to these levels gave her a glimpse into how this beloved game was made.

Original source disc as it was received by the Library of Congress.

Original source disc as it was received by the Library of Congress.

Since 2006, the Moving Image section of the Library of Congress has served as the custodial unit for video games. In this capacity, we receive roughly 400 video games per year through the Copyright registration process, about 99% of which are physically published console games. In addition to the games themselves we sometimes receive ancillary materials, such as printed descriptions of the game, DVDs or VHS cassettes featuring excerpts of gameplay, or the occasional printed source code excerpt. These materials are useful, primarily for their contextual value, in helping to tell the story of video game development in this country and are retained along with the games in the collection.

Several months ago, while performing an inventory of recently acquired video games, I happened upon a DVD-R labeled Duke Nukem: Critical Mass (PSP). My first assumption was that the disc, like so many others we have received, was a DVD-R of gameplay. However, a line of text on the Copyright database record for the item intrigued me. It reads: Authorship: Entire video game; computer code; artwork; and music. I placed the disc into my computer’s DVD drive to discover that the DVD-R did not contain video, but instead a file directory, including every asset used to make up the game in a wide variety of proprietary formats. Upon further research, I discovered that the PlayStation Portable version of Duke Nukem: Critical Mass was never actually released commercially and was in fact a very different beast than the Nintendo DS version of the game which did see release. I realized then that in my computer was the source disc used to author the UMD for an unreleased PlayStation Portable game. I could feel the lump in my throat. I felt as though I had solved the wizard’s riddle and unlocked the secret door.

Excerpt of code from boot.bin including game text.

Excerpt of code from boot.bin including game text.

The first challenge involved finding a way to access the proprietary Sony file formats contained within the disc, including, but not limited to, graphics files in .gim format and audio files in .AT3 format. I enlisted the aid of Packard Campus Software Developer Matt Derby and we were able to pull the files off of the disc and get a clearer sense of the file structure contained within. Through some research on various PSP homebrew sites we discovered Noesis, a program that would allow us to access the .gim and .gmo files which contain the 3D models and textures used to create the game’s characters and 3D environments. With this program we were able to view a complete 3D view of Duke Nukem himself, soaring through the air on his jetpack and a pre-composite 3D model of one of the game’s nemeses, the Pig Cops. Additionally, we employed Mediacoder and VLC in order to convert the Sony .AT3 (ATRAC3) audio files to MP3 in order to have access to the game’s many music cues.


3D model for Duke Nukem equipped with jetpack. View an animated gif of the model here.

Perhaps the most exciting discovery came when we used a hex editor to access the ASCII text held in the boot.bin file in the disc’s system directory. Here we located the full text and credit information for the game along with a large chunk of un-obfuscated software code. However, much of what is contained in this file was presented as compiled binaries. It is my hope that access to both the compiled binaries and ASCII code will allow us to explore future preservation options for video games. Such information becomes even more vital in the case of games such as this Duke Nukem title which were never released for public consumption. In many ways, this source disc can serve as an exemplary case as we work to define preferred format requirements for software received by the Library of Congress. Ultimately, I feel that access to the game assets and source code will prove to be invaluable both to researchers who are interested in game design and mechanics and to any preservation efforts the Library may undertake.

Providing access to the disc’s content to researchers will, unfortunately, remain a challenge. As mentioned above, it was difficult enough for Library of Congress staff to view the proprietary formats found on the disc before seeking help from the homebrew community. The legal and logistical hurdles related to providing access to licensed software will continue to present themselves as we move forward but I hope that increased focus on the tremendous research value of such digital assets will allow for these items to be more accessible in the future. For now the assets and code will be stored in our digital archive at the Packard Campus in Culpeper and the physical disc will be stored in temperature-controlled vaults.

The source disc for the PSP version of Duke Nukem: Critical Mass stands out in the video game collection of the Library of Congress as a true digital rarity. In Doug Reside’s recent article “File Not Found: Rarity in the Age of Digital Plenty” (pdf), he explores the notion of source code as manuscript and the concept of digital palimpsests that are created through the various layers that make up a Photoshop document or which are present in the various saved “layers” of a Microsoft Word document. The ability to view the pre-compiled assets for this unreleased game provides a similar opportunity to view the game as a work-in-progress, or at the very least to see the inner workings and multiple layers of a work of software beyond what is presented to us in the final, published version. In my mind, receiving the source disc for an unreleased game directly from the developer is analogous to receiving the original camera negative for an unreleased film, along with all of the separate production elements used to make the film. The disc is a valuable evidentiary artifact and I hope we will see more of its kind as we continue to define and develop our software preservation efforts.

The staff of the Moving Image section would love the opportunity to work with more source materials for games and I hope that game developers who are interested in preserving their legacy will be willing to submit these kinds of materials to us in the future. Though source discs are not currently a requirement for copyright, they are absolutely invaluable in contributing to our efforts towards stewardship and long term access to the documentation of these creative works.

Special thanks to Matt Derby for his assistance with this project and input for this post.

Categories: Planet DigiPres

National Geospatial Advisory Committee: The Shape of Geo to Come

The Signal: Digital Preservation - 5 August 2014 - 1:24pm

World Map 1689 — No. 1 from user caveman_92223 on Flickr.

Back in late June I attended the National Geospatial Advisory Committee (NGAC) meeting here in DC. NGAC is a Federal Advisory Committee sponsored by the Department of the Interior under the Federal Advisory Committee Act. The committee is composed of (mostly) non-federal representatives from all sectors of the geospatial community and features very high profile participants. For example, ESRI founder Jack Dangermond, the 222nd richest American, has been a member since the committee was first chartered in 2008 (his term has since expired). Current committee members include the creator of Google Earth (Michael Jones) and the founder of OpenStreetMap (Steve Coast).

So what is the committee interested in, and how does it coincide with what the digital stewardship community is interested in? There are a number of noteworthy points of intersection:

  • In late March of this year the FGDC released the “National Geospatial Data Asset Management Plan – a Portfolio Management Implementation Plan for the OMB Circular A–16” (pdf). The plan “lays out a framework and processes for managing Federal NGDAs [National Geospatial Data Assets] as a single Federal Geospatial Portfolio in accordance with OMB policy and Administration direction. In addition, the Plan describes the actions to be taken to enable and fulfill the supporting management, reporting, and priority-setting requirements in order to maximize the investments in, and reliability and use of, Federal geospatial assets.”
  • Driven by the release of the NGDA Management Plan, a baseline assessment of the “maturity” of various federal geospatial data assets is currently under way. This includes identifying dataset managers, identifying the sources of data (fed only/fed-state partnerships/consortium/etc.) and determining the maturity level of the datasets across a variety of criteria. With that in mind, several “maturity models” and reports were identified that might prove useful for future work in this area. For example, the state of Utah AGRC has developed a one-page GIS Data Maturity Assessment; the American Geophysical Union has a maturity model for assessing the completeness of climate data records (behind a paywall, unfortunately); the National States Geographic Information Council has a Geospatial Maturity Assessment; and the FGDC has an “NGDA Dataset Maturity Annual Assessment Survey and Tool” that is being developed as part of their baseline assessment. These maturity models have a lot in common with the NDSA Levels of Preservation work.
  • There was much discussion of a pair of reports on big data and geolocation privacy. The first, Big Data – Seizing Opportunities, Preserving Values Report from the Executive Office of the President, acknowledges the benefits of data but also notes that “big data technologies also raise challenging questions about how best to protect privacy and other values in a world where data collection will be increasingly ubiquitous, multidimensional, and permanent.” The second, the PCast report on Big Data and Privacy (PCAST is the “President’s Council of Advisors on Science and Technology” and the report is officially called “Big Data: A Technology Perspective”) “begins by exploring the changing nature of privacy as computing technology has advanced and big data has come to the forefront.  It proceeds by identifying the sources of these data, the utility of these data — including new data analytics enabled by data mining and data fusion — and the privacy challenges big data poses in a world where technologies for re-identification often outpace privacy-preserving de-identification capabilities, and where it is increasingly hard to identify privacy-sensitive information at the time of its collection.” The importance of both of these reports to future library and archive collection and access policies regarding data cannot be overstated.
  • The Spatial Data Transfer Standard is being voted on for withdrawal as an FGDC-endorsed standard. FGDC maintenance authority agencies were asked to review the relevance of the SDTS, and they responded that the SDTS is no longer used by their agencies. There’s a Federal Register link to the proposal. The Geography Markup Language (GML), which the FGDC has endorsed, now satisfies the encoding requirements that SDTS once provided. NARA revised their transfer guidance for geospatial information in April 2014 to make SDTS files “acceptable for imminent transfer formats” but it’s clear that they’ve already moved away from them.  As a side note, GeoRSS is coming up for a vote soon to become an FGDC-endorsed standard.
  • The Office of Management and Budget is reevaluating the geospatial professional classification. The geospatial community has an issue similar to that being faced by the library and archives community, in that the jobs are increasingly information technology jobs but are not necessarily classified as such. This coincides with efforts to reevaluate the federal government library position description.
  • The Federal Geographic Data Committee is working with federal partners to make previously-classified datasets available to the public.  These datasets have been prepared as part of the “HSIP Gold” program. HSIP Gold is a compilation of over 450 geospatial datasets of U.S. domestic infrastructure features that have been assembled from a variety of Federal agencies and commercial sources. The work of assembling HSIP Gold has been tasked to the Homeland Infrastructure Foundation-Level Data (HIFLD) Working Group (say it as “high field”). Not all of the data in HSIP Gold is classified, so they are working to make some of the unclassified portions available to the public.

The next meeting of the NGAC is scheduled for September 23 and 24 in Shepherdstown, WV. The meetings are public.

Categories: Planet DigiPres

Making Scanned Content Accessible Using Full-text Search and OCR

The Signal: Digital Preservation - 4 August 2014 - 12:48pm

This following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image, but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text for search while displaying the original image to the human reader. This means our search hit rate will be lower than it would be with perfect text, but since the content in question is otherwise completely unsearchable, anything better than no results is a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:


This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.
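
As a sketch of what each worker runs, the command below is the standard Tesseract invocation for hOCR output (the `hocr` config file ships with Tesseract); the file names and language code are placeholders:

```python
import subprocess

def tesseract_hocr_cmd(image_path, output_base, lang="eng"):
    """Build the Tesseract command that writes <output_base>.hocr."""
    # "hocr" is a stock Tesseract config that switches output to hOCR markup.
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]

cmd = tesseract_hocr_cmd("page_0001.tif", "page_0001", lang="fra")
# A worker process with idle capacity would then run, e.g.:
# subprocess.run(cmd, check=True)
```
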

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

  • A plain text version for the search engine, which does not understand HTML markup
  • A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
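
As a rough sketch of that post-processing step, the following parses word text and pixel coordinates out of an hOCR document using only the standard library. The sample markup is illustrative, though the `ocrx_word` class and the `title="bbox …"` convention come from the hOCR format itself:

```python
import json
from html.parser import HTMLParser

# Illustrative hOCR fragment; real files come straight from Tesseract.
sample = '''
<div class="ocr_page">
 <p class="ocr_par">
  <span class="ocrx_word" title="bbox 10 12 58 30; x_wconf 91">VILLAGE</span>
  <span class="ocrx_word" title="bbox 64 12 110 30; x_wconf 88">FOULA</span>
 </p>
</div>
'''

class HocrWords(HTMLParser):
    """Collect (word, bbox) pairs from hOCR ocrx_word spans."""
    def __init__(self):
        super().__init__()
        self.words = []     # [{"text": ..., "bbox": [x0, y0, x1, y1]}, ...]
        self._bbox = None   # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            # title looks like "bbox 10 12 58 30; x_wconf 91"
            for part in a.get("title", "").split(";"):
                fields = part.split()
                if fields and fields[0] == "bbox":
                    self._bbox = [int(v) for v in fields[1:5]]

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append({"text": data.strip(), "bbox": self._bbox})
            self._bbox = None

parser = HocrWords()
parser.feed(sample)
plain_text = " ".join(w["text"] for w in parser.words)   # for the search engine
coords_json = json.dumps(parser.words)                   # for the browser
```
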
Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents — i.e. complete records — which are used to build an inverted index — essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TF-IDF) works).
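
A toy version of the inverted index and TF-IDF scoring just described (with no stemming, so unlike a real engine this sketch would not match “whaling” to “whale”):

```python
import math
from collections import Counter, defaultdict

# Toy page-level corpus: document id -> text.
docs = {
    "item1-p1": "the whale surfaced near the whaling ship",
    "item1-p2": "storms battered the sea",
    "item2-p1": "the ship sailed on",
}

# Inverted index: term -> {document id: term frequency}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.split()).items():
        index[term][doc_id] = tf

def search(term):
    """Rank the documents containing `term` by a bare-bones TF-IDF score."""
    postings = index.get(term, {})
    if not postings:
        return []
    # Rarer terms get a higher inverse-document-frequency weight.
    idf = math.log(len(docs) / len(postings))
    return sorted(((d, tf * idf) for d, tf in postings.items()),
                  key=lambda pair: -pair[1])
```
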

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
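
A sketch of what such a grouped query could look like. `group` and `group.field` are Solr’s standard field-collapsing parameters; the core name (`wdl`) and field names (`item_id`, `ocr_text`, `metadata_text`) are hypothetical stand-ins, not WDL’s actual schema:

```python
from urllib.parse import urlencode

params = {
    "q": "ocr_text:whaling OR metadata_text:whaling",
    "group": "true",            # enable field collapsing
    "group.field": "item_id",   # rank page hits under their parent item
    "group.limit": 3,           # keep a few best-matching pages per item
    "wt": "json",
}
query = "/solr/wdl/select?" + urlencode(params)
```
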

(The django-haystack Solr grouped search backend with Field Collapsing support used here has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:


The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:


2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page.

Once the image has been loaded, the original text is replaced with the image:


4. Finally, we add a partially transparent overlay over each highlighted word:


  • The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
  • You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load this design improves performance because a given image segment can be reused for multiple results on the same page (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
  • This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book highlighting for the current page will be immediately updated without having to wait for the server to respond.
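
The slice-selection rule from step 3 can be sketched as a small function; the gap threshold and the example coordinates below are illustrative values, not WDL’s actual parameters:

```python
def slice_region(match_boxes, page_height, max_gap=150):
    """Vertical extent of the page image to display for a set of hits.

    Start at the top-most matched word and extend downward to include
    later matches until there is a sizable gap or the slice would grow
    past the first third of the page.
    """
    boxes = sorted(match_boxes, key=lambda b: b[1])  # sort by top edge y0
    top, bottom = boxes[0][1], boxes[0][3]
    for x0, y0, x1, y1 in boxes[1:]:
        if y0 - bottom > max_gap or y1 - top > page_height / 3:
            break
        bottom = max(bottom, y1)
    return top, bottom

# Three hits; the third is far down the page, so the slice stops before it.
region = slice_region([[10, 40, 90, 60], [10, 120, 90, 140], [10, 900, 90, 920]],
                      page_height=1200)
```
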


Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

  • The process described above leaves plenty of room to improve OCR results through technical improvements such as more sophisticated image processing, OCR engine training, and workflow systems incorporating human review and correction.
  • For collections such as WDL’s, which include older items, OCR accuracy is reduced by the condition of the materials and by typographic conventions like the long s (ſ) or ligatures which are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
  • Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
  • This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.
Categories: Planet DigiPres

Computational Linguistics & Social Media Data: An Interview with Bryan Routledge

The Signal: Digital Preservation - 1 August 2014 - 1:15pm

Bryan Routledge, Associate Professor of Finance, Tepper School of Business, Carnegie Mellon University.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that utilize a computational linguistic lens to analyze vast quantities of social media data. You can read the article on word choice in online restaurant reviews here; the article on using Twitter as a predictive tool, compared with traditional public opinion polls, is here (PDF).

Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.”  Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?

Bryan: Noah Smith, my colleague in the school of computer science at CMU, runs that lab.  He is kind enough to let me hang out over there.  The research we are working on looks at the connection between text and social science (e.g., economics, finance).  The idea is that looking at text through the lens of a forecasting problem — the statistical model between text and some social-science measured variable — gives insight into both the language and social parts.  Online and easily accessed text brings new data to old questions in economics.  More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text.  What words in corporate 10K annual reports correlate with stock volatility and how that changes over time is cool.


Different metaphors for expensive and inexpensive restaurants in Yelp reviews. From: Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19:4.

Julia: Your work with social media—Yelp and Twitter—are notable for their large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?

Bryan: That is a good question.  What makes it very hard for archivists is that collecting the right data without knowing the research questions is hard.  The usual answer of “keep everything!” is impractical.  Google’s n-gram project is a good illustration.  They summarized a huge volume of books with word counts (two word pairs, …) by time.  This is great for some research.  But not for the more recent statistical models that use sentences and paragraph information.

Julia: Your background and most of your work are in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people. How do you see financial research being influenced by social media and other born-digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?

In Yelp reviews of Manhattan restaurants with “steak” in the menu (an example). Predict the (log) menu item price using the words used to describe the item by location. For example: in most locations, the word “baby” is neutral — it suggests neither high nor low price. Except in the Wall Street area of lower Manhattan where it is associated with higher priced steak.

Bryan:  Finance (and economics) is about the collective behavior of large number of people in markets.  To make research possible you need simple models of individuals.  Getting the right mix of simplicity and realism is age-old and ongoing research in the area.  More data helps.  Macroeconomic data like GDP and stock returns is informative about the aggregate.  Data on, say, individual portfolio choices in 401K plans lets you refine models.  Social media data is this sort of disaggregated data.  We can get a signal, very noisy, about what is behind an individual decision.  Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.

More generally, working across disciplines is interesting and fun.  It is not always “additive.”  The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items).  But if we can combine, for example, decision theory finance with generative text models, we get some cool insights into purposefully drafted documents.

Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.

Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data!  Given that much of the data is valuable to the companies (and privacy, of course), I understand that is not possible.  But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest.  Twitter has encouraged a whole ecosystem that has helped them grow.  Many companies have an API for that purpose that happens to work nicely for academic research.  In general, open access is preferred in academic settings so that all researchers have access to the same data.  Interesting papers based on proprietary access to Facebook data are less helpful than those based on Twitter.

Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?

Bryan: This is not my strong suit.  But at a high-level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.”  The “get” varies with the data source (an API).  The “clean” step is just a matter of being careful with special characters and making sure data are lining up into fields right.  If the API is sensible, the “clean” is easy.  We usually store things in a JSON format that is flexible.  This is usually a good format to share data.  The “extract” and “experiment” steps depend on what you are interested in.  Word counts? Phrase counts? Other?  The key is not to jump from “get” to “extract” — storing the data in as raw form as possible makes things flexible.
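
Those steps can be sketched end to end in a few lines; the record shape, cleaning rules, and tokenizer here are illustrative stand-ins, not the team’s actual pipeline:

```python
import json
import re
from collections import Counter

# "get": in practice this arrives from an API; here, a stand-in record.
raw_records = [{"id": 1, "text": "Great  steak!\u2019 Loved it."}]

def clean(record):
    """Normalize tricky characters and whitespace so fields line up."""
    text = record["text"].replace("\u2019", "'")   # curly apostrophe
    return {"id": record["id"], "text": re.sub(r"\s+", " ", text).strip()}

# "store": one JSON object per line keeps the data raw and flexible.
stored = [json.dumps(clean(r)) for r in raw_records]

# "extract"/"experiment": derive word counts from the stored raw form.
counts = Counter()
for line in stored:
    counts.update(re.findall(r"[a-z]+", json.loads(line)["text"].lower()))
```
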

Julia:  What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be able to be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?

Sentiment on Twitter as compared to Gallup Poll. Appeared in From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Bryan: This is also a great question and also one for which I do not have a great answer.  I do not know a lot about the research in “digital humanities,” but that would be a good place to look.  People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask.  Similarly, economic data might provide some hints.  Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP).  The hard part for libraries is figuring out which parts to keep.  Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.

Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?

Bryan: There are many limitations, of course.  Twitter and Yelp are not just providing a window into things, they are changing the way the world works.  “Big data” is not just about larger sample sizes of draws from a fixed distribution.  Things are non-stationary.  (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter).  Similarly, the data we see in social media is not a representative sample of society cross section.  But both of these are the sort of things good modeling – statistical, economic – should, and do, aim to capture.  The possibilities of all this new data are exciting.  Language is a rich source of data with challenging models needed to turn it into useful information.  More generally, social media is an integral part of many economic and social transactions.  Capturing that in a tractable model makes for an interesting research agenda.

Categories: Planet DigiPres

Digital Preservation 2014: It’s a Thing

The Signal: Digital Preservation - 30 July 2014 - 12:56pm

“Digital preservation makes headlines now, seemingly routinely. And the work performed by the community gathered here is the bedrock underlying such high profile endeavors.” – Matt Kirschenbaum

The registration table at Digital Preservation 2014. Photo credit: Erin Engle.

The annual Digital Preservation meeting, held each summer in Washington, DC, brings together experts in academia, government and the private and non-profit sectors to celebrate key work and share the latest developments, guidelines, best practices and standards in digital preservation.

Digital Preservation 2014, held July 22-24,  marked the 13th major meeting hosted by NDIIPP in support of the broad community of digital preservation practitioners (NDIIPP held two meetings a year from 2005-2007), and it was certainly the largest, if not the best. Starting with the first combined NDIIPP/National Digital Stewardship Alliance meeting in 2011, the annual meeting has rapidly evolved to welcome an ever-expanding group of practitioners, ranging from students to policy-makers to computer scientists to academic researchers. Over 300 people attended this year’s meeting.

“People don’t need drills; they need holes,” stated NDSA Coordinating Committee chairman Micah Altman, Director of Research at the Massachusetts Institute of Technology Libraries, drawing an analogy to digital preservation in his opening talk. As he went on to explain, no one needs digital preservation for its own sake, but it is essential to support the rule of law, a cumulative evidence base, national heritage, a strategic information reserve, and communication with future generations. These are the challenges facing the current generation of digital stewardship practitioners, many of which are addressed in the 2015 National Agenda for Digital Stewardship, which Altman previewed during his talk (and which will appear later this fall).


A breakout session at Digital Preservation 2014. Photo credit: Erin Engle.

One of those challenges is the preservation of the software record, which was eloquently illuminated by Matt Kirschenbaum, the Associate Director of the Maryland Institute for Technology in the Humanities, during his stellar talk, “Software, It’s a Thing.” Kirschenbaum ranged widely across computer history, art, archeology and pop culture, offering a number of essential insights. One of the more piquant was his sorting of software into different categories of “things” (software as asset, package, shrinkwrap, notation/score, object, craft, epigraphy, clickwrap, hardware, social media, background, paper trail, service, big data), each with its own characteristics. As Kirschenbaum noted, software is many different “things,” and we’ll need to adjust our future approaches to preservation accordingly.

Shannon Mattern, Associate Professor at The New School, took yet another refreshing approach, discussing the aesthetics of creative destruction and the challenges of preserving ephemeral digital art. As she noted, “by pushing certain protocols to their extreme, or highlighting snafus and ‘limit cases’ these artists’ work often brings into stark relief the conventions of preservation practice, and poses potential creative new directions for that work.”


Stephen Abrams, Martin Klein, Jimmy Lin and Michael Nelson during the “Web Archiving” panel. Photo credit: Erin Engle.

These three presentations on the morning of the first day provided a thoughtful intellectual substrate upon which a huge variety of digital preservation tools, services, practices and approaches were elaborated over the following days. As befits a meeting that convenes disparate organizations and interests, collaboration and community were big topics of discussion.

A Tuesday afternoon panel on “Community Approaches to Digital Stewardship” brought together a quartet of practitioners who are working collaboratively to advance digital preservation practice across a range of organizations and structures, including small institutions (the POWRR project); data stewards (the Research Data Alliance); academia (the Academic Preservation Trust); and institutional consortia (the Five College Consortium).

Later, on the second day, a well-received panel on the “Future of Web Archiving” showcased a number of clever collaborative approaches to capturing digital materials from the web, including updates on the Memento project and Warcbase, an open-source platform for managing web archives.


CurateCamp: Digital Culture. Photo credit: Erin Engle.

In between there were plenary sessions on stewarding space and research data, and over three dozen lightning talks, posters and breakout sessions covering everything from digital repositories for museum collections to a Brazilian digital preservation network to the debut of a new digital preservation question-and-answer tool. Additionally, a CurateCamp unconference on the topic of “Digital Culture” was held on a third day at Catholic University, thanks to the support of the CUA Department of Library and Information Science.

The main meeting closed with a thought-provoking presentation from artist and digital conservator Dragan Espenschied. Espenschied utilized emulation and other novel tools to demonstrate some of the challenges related to presenting works authentically, in particular works from the early web and those dependent on a range of web services. Espenschied, also the Digital Conservator at Rhizome, has an ongoing project, One Terabyte of Kilobyte Age, that explores the material captured in the Geocities special collection. Associated with that project is a Tumblr he created that automatically generates a new screenshot from the Geocities archive collection every 20 minutes.

Web history, data stewardship, digital repositories: for digital preservation practitioners it was nerd heaven. Digital Preservation 2014, it’s a thing. Now on to 2015!

Categories: Planet DigiPres

Art is Long, Life is Short: the XFR Collective Helps Artists Preserve Magnetic and Digital Works

The Signal: Digital Preservation - 29 July 2014 - 2:44pm

XFR STN (“Transfer Station”) is a grass-roots digitization and digital-preservation project that arose as a response from the New York arts community to rescue creative works from aging or obsolete audiovisual formats and media. The digital files are stored by the Internet Archive, an NDIIPP partner of the Library of Congress, and are accessible for free online. At the recent Digital Preservation 2014 conference, the NDSA gave XFR STN its Innovation Award. Last month, members of the XFR Collective — Rebecca Fraimow, Kristin MacDonough, Andrea Callard and Julia Kim — answered a few questions for the Signal.


“VHS 1,” courtesy of Walter Forsberg.

Mike: Can you describe the challenges the XFR Collective faced in its formation?

XFR: Last summer, the New Museum hosted a groundbreaking exhibit called XFR STN. Initiated by the artist collective Colab and the resulting MWF Video Club, the exhibit was a major success. By the end of the exhibition, over 700 videos had been digitized, with many available online through the Internet Archive.

It was clear to all of us involved that there was a real demand for these services, and that many under-served artists were having difficulty preserving and accessing their own media. Many of the people involved with the exhibit became passionate about continuing the service of preserving obsolete magnetic and digital media for artists. We wanted to offer a long-term, non-commercial, grassroots solution.

Using the experience of working on XFR STN as a jumping-off point, we began developing XFR Collective as a separate nonprofit initiative to serve the need that we saw.  Over the course of our development, we’ve definitely faced — and are still facing — a number of challenges in order to make ourselves effective and sustainable.


“VHS 2,” courtesy of Walter Forsberg.

Perhaps the biggest challenge has simply been deciding what form XFR Collective was going to take.  We started out with a bunch of borrowed equipment and a lot of enthusiasm, so the one thing we knew we could do was digitize, but we had to sit down and really think about things like organizational structure, sustainable pricing for our services, and the convoluted process of becoming a non-profit.

Eventually, we settled on a membership-based structure in order to be able to keep our costs as low as possible.  A lot of how we’re operating is still very experimental — this summer wraps up our six-month test period, during which we limited ourselves to working with only a small number of partners to allow us to figure out what our capacity was and how we could design our projects in the future.

We’ve got a number of challenges still ahead of us — finding a permanent home is a big one — and we still feel like we’re only just getting started, in terms of what we can do for the community of artists who use our services.  It’s going to be interesting for all of us to see how we develop.  We’ve started thinking of ourselves as kind of a grassroots preservation test kitchen. We’ll try almost any kind of project once to see if it works!

Mike: Where are the digital files stored? Who maintains them?

XFR: Our digital files will be stored with the member organizations and uploaded to the Internet Archive for access and long-term, open preservation. This is an important distinction that may confuse some people: XFR Collective is not an archive.

While we advocate and educate about best practices, we will not hold any of the digital files ourselves; we just don’t have the resources to maintain long-term archival storage.  We encourage material to go onto the Internet Archive because long-term accessibility is part of our mission and because the Internet Archive has the server space to store uncompressed and lossless files as well as access files.  That way if something happens to the storage that our partners are using for their own files, they can always re-download them.  But we can’t take responsibility for those files ourselves. We’re a service point, not a storage repository.
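To make the re-download path concrete: files uploaded to an Internet Archive item can be retrieved through the Archive's documented download URL pattern, `https://archive.org/download/<identifier>/<filename>`. The sketch below (in Python; the item identifier and filename are purely hypothetical, not taken from the XFR Collective's actual uploads) simply builds such a URL:

```python
from urllib.parse import quote

def ia_download_url(identifier: str, filename: str) -> str:
    """Build a direct download URL for a file in an Internet Archive item.

    Follows the documented pattern:
    https://archive.org/download/<identifier>/<filename>
    """
    return "https://archive.org/download/{}/{}".format(
        quote(identifier, safe=""), quote(filename, safe="")
    )

# Hypothetical item and file names, for illustration only.
url = ia_download_url("xfr-collective-demo-item", "tape01_uncompressed.mov")
```

A partner who had lost local copies could fetch their preserved masters again from URLs of this form, which is what makes the Archive a safety net rather than merely an access point.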


“VHS 3,” courtesy of Walter Forsberg.

Mike: Regarding public access as a means of long-term preservation and sustainability, how do you address copyrighted works?

XFR: This is a great question that confounds a lot of our collaborators initially.  Access-as-preservation creates a lot of intellectual property concerns.  Still, we’re a very small organization, so we can afford to take more risks than a more high-profile institution.  We don’t delve too deeply into the area of copyright; our concern is with the survival of the material.  If someone has a complaint, the Internet Archive will give us a warning in time to re-download the content and then remove it. But so far we haven’t had any complaints.

Mike: What open access tools and resources do you use?

XFR: The Internet Archive itself is something of an open access resource and we’re seeing it used more and more frequently as a kind of accessory to preservation, which is fantastic.  Obviously it’s not the only solution, and you wouldn’t want to rely on that alone any more than you would any kind of cloud storage, but it’s great to have a non-commercial option for streaming and storage that has its own archival mission and that’s open to literally anyone and anything.

Mike:  If anyone is considering a potential collaboration to digitally preserve audiovisual artwork, what can they learn from the experiences of the XFR Collective?

XFR: Don’t be afraid to experiment!  A lot of what we’ve accomplished is just by saying to ourselves that we have to start doing something, and then jumping in and doing it.  We’ve had to be very flexible. A lot of the time we’ll decide something as a set proposition and then find ourselves changing it as soon as we’ve actually talked with our partners and understood their needs.  We’re evolving all the time but that’s part of what makes the work we do so exciting.

We’ve also had a lot of help and we couldn’t have done any of what we’ve accomplished without support and advice from a wide network of individuals, ranging from the amazing team at XFR STN to video archivists across New York City.  None of these collaborations happen in a vacuum, so make friendships, make partnerships, and don’t be nervous about asking for advice.  There are a lot of people out there who care about video preservation and would love to see more initiatives out there working to make it happen.

Categories: Planet DigiPres

The MH17 Crash and Selective Web Archiving

The Signal: Digital Preservation - 28 July 2014 - 4:34pm

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.


Screenshot of a 17 July 2014 15:57 UTC archive snapshot of the deleted VKontakte Strelkov blog post regarding the downed aircraft, on the Internet Archive Wayback Machine.

The Internet Archive Wayback Machine has been mentioned in several news articles within the last week (see here, here and here) for having archived a since-deleted blog post from a Ukrainian separatist leader touting the downing of a military transport plane that may actually have been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.
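For the technically curious: the Save Page Now function is reachable at `https://web.archive.org/save/<url>`, and the Wayback Machine also exposes a public availability API (`https://archive.org/wayback/available?url=<url>`) that reports the closest archived snapshot as JSON. A minimal Python sketch of parsing that response follows; the sample payload is constructed here to mimic the API's documented shape, not captured from a live request:

```python
from typing import Optional

# Endpoints of the two public Wayback Machine services discussed above.
SAVE_ENDPOINT = "https://web.archive.org/save/"          # Save Page Now
AVAILABILITY_API = "https://archive.org/wayback/available?url="

def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the URL of the closest archived snapshot reported by the
    availability API, or None if nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None

# Sample payload in the shape the availability API returns.
sample = {
    "url": "example.com",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20140717155700",
            "url": "http://web.archive.org/web/20140717155700/http://example.com/",
        }
    },
}
snapshot_url = closest_snapshot(sample)
```

The same one-request simplicity is what let an anonymous individual preserve the Strelkov post within minutes of spotting it.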

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

Categories: Planet DigiPres

Song identification on GitHub

File Formats Blog - 24 July 2014 - 11:42am

The code for my song identification “nichesourcing” web application is now available on GitHub. It’s currently aimed at one project, as I’d mentioned in my earlier post, but has potential for wide use. It allows the following:

  • Users can register as editors or contributors. Only registered users have access.
  • Editors can post recording clips with short descriptions.
  • Contributors can view the list of clips and enter reports on them.
  • Reports specify type of sound, participants, song titles, and instruments. Contributors can enter as much or as little information as they’re able to.
  • Editors can modify clip metadata, delete clips, and delete reports.
  • Contributors and editors can view reports.
  • More features are planned, including an administrator role.
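The two-role model those bullet points describe could be sketched as a simple permission table. This is a hedged illustration in Python rather than the project's PHP, and the action names are invented for this example, not taken from the actual code:

```python
# Illustrative permission table for the two roles the post describes.
PERMISSIONS = {
    "editor": {"post_clip", "edit_clip_metadata", "delete_clip",
               "delete_report", "view_clips", "view_reports"},
    "contributor": {"view_clips", "enter_report", "view_reports"},
}

def can(role: str, action: str) -> bool:
    """Return True if the given role is allowed to perform the action.

    Unknown roles (e.g. unregistered visitors) get no permissions,
    matching the rule that only registered users have access.
    """
    return action in PERMISSIONS.get(role, set())
```

A planned administrator role would slot in as one more entry in the table.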

This is my first PHP coding project of any substance, so I’m willing to accept comments about my overall coding approach. It’s inevitable that, to some degree, I’m writing PHP as if it’s Java. If there are any standard practices or patterns I’m overlooking, let me know.

Tagged: music, software, songid
Categories: Planet DigiPres

Understanding the Participatory Culture of the Web: An Interview with Henry Jenkins

The Signal: Digital Preservation - 24 July 2014 - 10:51am
Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, a joint professorship at the USC Annenberg School for Communication and the USC School of Cinematic Arts.

Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, with USC Annenberg School for Communication and the USC School of Cinematic Arts.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects related to CurateCamp Digital Culture. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Anyone who has ever liked a TV show’s page on Facebook or proudly sported a Quidditch t-shirt knows that being a fan goes beyond the screen or page. With the growth of countless blogs, tweets, Tumblr gifsets, YouTube videos, Instagram hashtags, fanart sites and fanfiction sites, accessing fan culture online has never been easier. Whether understood as a vernacular web or as the blossoming of a participatory culture, individuals across the world are using the web to respond to and communicate their own stories.

As part of the NDSA Insights interview series, I’m delighted to interview Henry Jenkins, professor at the USC Annenberg School for Communication and self-proclaimed Aca-Fan. He is the author of one of the foundational works exploring fan cultures, “Textual Poachers: Television Fans and Participatory Culture,”  as well as a range of other books, including “Convergence Culture: Where Old and New Media Collide,” and most recently the co-author (with Sam Ford and Joshua Green) “Spreadable Media: Creating Value and Meaning in a Networked Culture.” He blogs at Confessions of an Aca-Fan.

Julia: You state on your website that your time at MIT, “studying culture within one of the world’s leading technical institutions” gave you “some distinctive insights into the ways that culture and technology are reshaping before our very eyes.”  How so? What are some of the changes you’ve observed, from a technical perspective and/or a cultural one?

Henry: MIT was one of the earliest hubs in the Internet. When I arrived there in 1989, Project Athena was in its prime; the MIT Media Lab was in its first half decade and I was part of a now legendary Narrative Intelligence Reading Group (PDF) which brought together some of the smartest of their graduate students and a range of people interested in new media from across Cambridge; many of the key thinkers of early network culture were regular speakers at MIT; and my students were hatching ideas that would become the basis for a range of Silicon Valley start-ups. And it quickly became clear to me that I had a ringside seat for some of the biggest transformations in the media landscape in the past century, all the more so because, through my classes, the students were helping me to make connections between my work on fandom as a participatory culture and a wide array of emerging digital practices (from texting to game mods).

Kresge Auditorium, MIT, Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey, Library of Congress Prints and Photographs Division.


Studying games made sense at MIT because “Spacewar,” one of the first known uses of computers for gaming, had been created by the MIT Model Railroad club in the early 1960s. I found myself helping to program a series that the MIT Women’s Studies Program was running on gender and cyberspace, from which the materials for my book, “From Barbie to Mortal Kombat” emerged. Later, I would spend more than a decade as the housemaster of an MIT dorm, Senior House, which is known to be one of the most culturally creative at the Institute.

Through this, I was among the first outside of Harvard to get a Facebook account; I watched students experimenting with podcasting, video-sharing and file-sharing. Having MIT after my name opened doors at all of the major digital companies and so I was able to go behind the scenes as some of these new technologies were developing, and also see how they were being used by my students in their everyday lives.

So, through the years, my job was to place these developments in their historical and cultural contexts — often literally as Media Lab students would come to me for advice on their dissertation projects, but also more broadly as I wrote about these developments through Technology Review, the publication for MIT’s alumni network. It was there where many of the ideas that would form “Convergence Culture” were first shared with my readers. And the students that came through the Comparative Media Studies graduate program have been at ground zero for some of the key developments in the creative industries in recent years — from the Veronica Mars Kickstarter campaign to the community building practices of Etsy, from key developments in the games and advertising industry to cutting edge experiments in transmedia storytelling. The irony is that I had been really reluctant about accepting the MIT job because I suffer from fairly serious math phobia. :-)

Today, I enjoy another extraordinary vantage point as a faculty member at USC, who is embedded in both the Annenberg School of Communication and Journalism and the Cinema School, and thus positioned to watch how Hollywood and American journalism are responding to the changes that networked communication have forced upon them. I am able to work with future filmmakers who are trying to grasp a shift from a focus on individual stories to an emphasis on world-building, journalists who are trying to imagine new relationships with their publics, and activists who are seeking to make change by any media necessary.

Julia: Much of your work has focused on reframing the media audience as active and creative participants in creating media, rather than passive consumers.  You’ve critiqued use of the terms “viral” and “memes” to describe  internet phenomena as “stripping aside the concept of human agency,” and that the biological language “confuses the actual power relations between producers, properties, brands and consumers.” Can you unpack some of your critiques for us? What is at stake?

Henry: At the core of “Spreadable Media” is a shift in how media travels across the culture. On the one hand, there is distribution as we have traditionally understood it in the era of mass media where content flows in patterns regulated by decisions made by major corporations who control what we see, when we see it and under what conditions. On the other hand, there is circulation, a hybrid system, still shaped top-down by corporate players, but also bottom-up by networks of everyday people, who are seeking to move media that is meaningful to them across their social networks, and will take media where they want it when they want it through means both legal and illegal. The shift towards a circulation-based model for media access is disrupting and transforming many of our media-related practices, and it is not explained well by a model which relies so heavily on metaphors of infection and assumptions of irrationality.

The idea of viral media is a way that broadcasters hold onto the illusion of their power to set the media agenda at a time when that power is undergoing a crisis. They are the ones who make rational calculations, able to design a killer virus which infects the masses, so they construct making something go viral either as arcane knowledge that can be bought at a price from those in the know or as something that nobody understands: “It just went viral!” But, in fact, we are seeing people, collectively and individually, make conscious decisions about what media to pass to which networks, for what purposes, with what messages attached, through which media channels, and we are seeing activist groups, religious groups, indie media producers, educators and fans make savvy decisions about how to get their messages out through networked communications.

Julia: Cases like the Harry Potter Alliance suggest the range of ways that fan cultures on the web function as a significant cultural and political force. Given the significance of fandom, what kinds of records of their online communities do you think will be necessary in the future for us to understand their impact? Said differently, what kinds of records do you think cultural heritage organizations should be collecting to support the study of these communities now and into the future?

Henry: This is a really interesting question. My colleague, Abigail De Kosnik at UC-Berkeley, is finishing up a book right now which traces the history of the fan community’s efforts to archive their own creative output over this period, which has been especially precarious, since we’ve seen some of the major corporations which fans have used to spread their cultural output to each other go out of business and take their archives away without warning or change their user policies in ways that forced massive numbers of people to take down their content.


Image of Paper Print Films in Library of Congress collection. Jenkins notes this collection of prints likely makes it easier to write the history of the first decade of American cinema than to write the history of the first decade of the web.

The reality is that it is probably already easier to write the history of the first decade of American cinema, because of the paper print collection at the Library of Congress, than it is to write the history of the first decade of the web. For that reason, there has been surprisingly little historical research into fandom — even though some of the communication practices that fans use today go back to the publication practices of the Amateur Press Association in the mid-19th century. And even recently, major collections of fan-produced materials have been shunted from library to archive with few in your realm recognizing the value of what these collections contain.

Put simply, many of the roots of today’s more participatory culture can be traced back to fan practices over the last century. Fans have been amongst the leading innovators in terms of the cultural uses of new media. But collecting this material is going to be difficult: fandom is a dispersed but networked community which does not work through traditional organizations; there are no gatekeepers (and few recordkeepers) in fandom, and the scale of fan production — hundreds of thousands if not millions of new works every year — dwarfs that of commercial publishing. And that’s to focus only on fan fiction; it does not even touch the new kinds of fan activism that we are documenting for my forthcoming book, “By Any Media Necessary.” So, there is an urgent need to archive some of these materials, but the mechanisms for gathering and appraising them are far from clear.

Julia: Your New Media Literacy project aims in part to “provide adults and youth with the opportunity to develop the skills, knowledge, ethical framework and self-confidence needed to be full participants in the cultural changes which are taking place in response to the influx of new media technologies, and to explore the transformations and possibilities afforded by these technologies to reshape education.” In one of your pilot programs, for instance, students studied “Moby-Dick” by updating the novel’s Wikipedia page. Can you tell us a little more about this project? What are some of your goals? Further, what opportunities do you think libraries have to enable this kind of learning?

Henry: We documented this project through our book, “Reading in a Participatory Culture,” and through a free online project, Flows of Reading. It was inspired by the work of Ricardo Pitts-Wiley, the head of the Mixed Magic Theater in Rhode Island, who was spending time going into prisons to get young people to read “Moby-Dick” by getting them to rewrite it, imagining who these characters would be and what issues they would be confronting if they were part of the cocaine trade in the 21st century as opposed to the whaling trade in the 19th century. This resonated with the work I have been doing on fan rewriting and fan remixing practices, as well as what we know about, for example, the ways hip hop artists sample and build on each other’s work.

So, we developed a curriculum which brought together Melville’s own writing and reading practices (as the master mash-up artist of his time) with Pitts-Wiley’s process in developing a stage play that was inspired by his work with the incarcerated youth and with a focus on the place of remix in contemporary culture. We wanted to give young people tools to think ethically and meaningfully about how culture is actually produced and to give teachers a language to connect the study of literature with contemporary cultural practices. Above all, we wanted to help students learn to engage with literary texts creatively as well as critically.

We think libraries can be valuable partners in such a venture, all the more so as regimes of standardized testing make it hard for teachers to bring complex 19th century novels like “Moby-Dick” into their classes or focus student attention on the process and cultural context of reading and writing as literacy practices. Doing so requires librarians to think of themselves not only as curators of physical collections but as mentors and coaches who help students confront the larger resources and practices opened up to them through networked communication. I’ve found librarians and library organizations to be vital partners in this work through the years.

Julia: Your latest book is on the topic of “spreadable media,” arguing that “if it doesn’t spread, it’s dead.”  In a nutshell, how would you define the term “spreadable media”?

Henry:  I talked about this a little above, but let me elaborate. We are proposing spreadable media as an alternative to viral media in order to explain how media content travels across a culture in an age of Facebook, Twitter, YouTube, Reddit, Tumblr, etc. The term emphasizes the act of spreading and the choices which get made as people appraise media content and decide what is worth sharing with the people they know. It places these acts of circulation in a cultural context rather than a purely technological one. At the same time, the word is intended to contrast with older models of “stickiness,” which work on the assumption that value is created by locking down the flow of content and forcing everyone who wants your media to come to your carefully regulated site. This assumes a kind of scarcity where we know what we want and we are willing to deal with content monopolies in order to get it.

But, the reality is that we have more media available to us today than we can process: we count on trusted curators — primarily others in our social networks but also potentially those in your profession — to call media to our attention, and the media needs to be able to move where the conversations are taking place or remain permanently hidden from view. That’s the spirit of “If it doesn’t spread, it’s dead!” If we don’t know about the media, if we don’t know where to find it, if it’s locked down where we can’t easily get to it, it becomes irrelevant to the conversations in which we are participating. Spreading increases the value of content.

Julia: What does spreadable media mean to the conversations libraries, archives and museums could have with their patrons? How can archives be more inclusive of participatory culture?

Henry:  Throughout the book, we use the term “appraisal” to refer to the choices everyday people make, collectively and personally, about what media to pass along to the people they know. Others are calling this process “curating.” But either way, the language takes us immediately to the practices which used to be the domain of “libraries, archives, and museums.” You were the people who decided what culture mattered, what media to save from the endless flow, what media to present to your patrons. But that responsibility is increasingly being shared with grassroots communities, who might “like” something or “vote something up or down” through their social media platforms, or simply decide to intensify the flow of the content through tweeting about it.

We are seeing certain videos reach incredible levels of circulation without ever passing through traditional gatekeepers. Consider “Kony 2012,” which reached more than 100 million viewers in its first week of circulation, totally swamping the highest-grossing film at the box office that week (“Hunger Games”) and the highest-viewed series on American television (“Modern Family”), without ever being broadcast in a traditional sense. Minimally, that means that archivists may be confronting new brokers of content, museums will be confronting new criteria for artistic merit, and libraries may need to work hand in hand with their patrons as they identify the long-term information needs of their communities. It doesn’t mean letting go of their professional judgement, but it does mean examining their prejudices about what forms of culture might matter and it does mean creating mechanisms, such as those around crowd-sourcing and perhaps even crowd-funding, which help to ensure greater responsiveness to public interests.

Julia: You wrote in 2006 that there is a lack of fan involvement with works of high culture because “we are taught to think about high culture as untouchable,” which in turn has to do with “the contexts within which we are introduced to these texts and the stained glass attitudes which often surround them.” Further, you argue that this lack of a fan culture makes it difficult to engage with a work, either intellectually or emotionally. Can you expand on this a bit? Do you still believe this to be the case, or has this changed with time? Does the existence of transformative works like “The Lizzie Bennet Diaries” on YouTube or vibrant Austen fan communities on Tumblr reveal a shift in attitudes? Finally, how can libraries, museums, and other institutions help foster a higher level of emotional and intellectual engagement?

Henry:  Years ago, I wrote “Science Fiction Audiences” with the British scholar John Tulloch in which we explored the broad range of ways that fans read and engaged with “Star Trek” and “Doctor Who.” Tulloch then went on to interview audiences at the plays of Anton Chekhov and discovered a much narrower range of interpretations and meanings — they repeated back what they had been taught to think about the Russian playwright rather than making more creative uses of their experience at the theater. This was probably the opposite of the way many culture brokers think about the high arts — as the place where we are encouraged to think and explore — and the popular arts — as works that are dumbed down for mass consumption. This is what I meant when I suggested that the ways we treat these works cut them off from popular engagement.

At the same time, I am inspired by recent experiments which merge the high and the low. I’ve already talked about Mixed Magic’s work with “Moby-Dick,” but “The Lizzie Bennet Diaries” is another spectacular example. It’s an inspired translation of Jane Austen’s world into the mechanisms of social media: gossip and scandal play such a central role in her works; she’s so attentive to what people say about each other and how information travels through various social communities. And the playful appropriation and remixing of “Pride and Prejudice” there has opened up Austen’s work to a whole new generation of readers who might otherwise have known it entirely through Sparknotes and plodding classroom instruction. There are certainly other examples of classical creators — from Gilbert and Sullivan to Charles Dickens and Arthur Conan Doyle — who inspire this kind of fannish devotion from their followers, but by and large, this is not the spirit with which these works get presented to the public by leading cultural institutions.

I would love to see libraries and museums encourage audiences to rewrite and remix these works, to imagine new ways of presenting them, which make them a living part of our culture again. Lawrence Levine’s “Highbrow/Lowbrow” contrasts the way people dealt with Shakespeare in the 19th century — as part of the popular culture of the era — with the ways we have assumed across the 20th century that an appreciation of the Bard is something which must be taught because it requires specific kinds of cultural knowledge and specific reading practices. Perhaps we need to reverse the tides of history in this way and bring back a popular engagement with such works.

Julia: You’re a self-described academic and fan, so I’d be interested in what you think are some particularly vibrant fan communities online that scholars should be paying more attention to.


Screenshot of the VlogBrothers, Hank and John Green, as they display a symbol of their channel in a video titled “How To Be a Nerdfighter: A Vlogbrothers FAQ”

Henry: The first thing I would say is that librarians, as individuals, have long been an active presence in the kinds of fan communities I study; many of them write and read fan fiction, for example, or go to fan conventions because they know these as spaces where people care passionately about texts, engage in active debates around their interpretation, and often have deep commitments to their preservation. So, many of your readers will not need me to point out the spaces where fandoms are thriving right now; they will know that fans have been a central part of the growth of the Young Adult novel as a literary category which attracts a large number of adult readers, so they will be attentive to “Harry Potter,” “Hunger Games,” or the Nerdfighters (who are followers of the YA novels of John Green); they will know that fans are being drawn right now to programs like “Sleepy Hollow” which have helped to promote more diverse casting on American television; and they will know that now, as always, science fiction remains a central tool which incites the imagination and creative participation of its readers. The term Aca-Fan has been a rallying point for a generation of young academics who became engaged with their research topics in part through their involvement within fandom. Whatever you call them, there needs to be a similar movement to help librarians, archivists and curators come out of the closet, identify as fans, and deploy what they have learned within fandom more openly through their work.

Categories: Planet DigiPres

Future Steward on Stewardship’s Future: An Interview with Emily Reynolds

The Signal: Digital Preservation - 23 July 2014 - 10:44am
Emily Reynolds, Winner of 2014 Future Steward NDSA Innovation Award.


Each year, the NDSA Innovation Working Group reviews nominations from members and non-members alike for the Innovation Awards. Most of those awards are focused on recognizing individuals, projects and organizations that are at the top of their game.

The Future Steward award is a little different. It’s focused on emerging leaders, and while the recipients of the Future Steward award have all made significant accomplishments and achievements, they have done so as students, learners and professionals in the early stages of their careers. Mat Kelly’s work on WARCreate, Martin Gengebach’s work on forensic workflows and now Emily Reynolds’ work in a range of organizations on digital preservation exemplify how some of the most vital work in digital preservation is being taken on and accomplished by some of the newest members of our workforce.

I’m thrilled to be able to talk with Emily, who picked up this year’s Future Steward award yesterday during the Digital Preservation 2014 meeting, about the range of her work and her thoughts on the future of the field. Emily was recognized for the quality of her work in a range of internships and student positions with the Interuniversity Consortium for Political and Social Research, the University of Michigan Libraries, the Library of Congress, Brooklyn Historical Society, StoryCorps, and, in particular, her recent work on the World Bank’s eArchives project.

Screenshot of the Arab American National Museum's web archive collections.


Trevor: You have a bit of experience working with web archives at different institutions; scoping web archive projects with the Arab American National Museum, putting together use cases for the Library of Congress and in your coursework at the University of Michigan. Across these experiences, what are your reflections and thoughts on the state of web archiving for cultural heritage organizations?

Emily: It seems to me that many cultural heritage organizations are still uncertain as to where their web archive collections fit within the broader collections of their organization. Maureen McCormick Harlow, a fellow National Digital Stewardship Resident, often spoke about this dynamic; the collections that she created have been included in the National Library of Medicine’s general catalog. But for many organizations, web collections are still a novelty or a fringe part of the collections, and aren’t as discoverable. Because we’re not sure how the collections will be used, it’s difficult to provide access in a way that will make them useful.

I also think that there’s a bit of a skills gap, in terms of the challenges that web archiving can present, as compared to the in-house technical skills at many small organizations. Tools like Archive-It definitely lower the barrier to entry, but still require a certain amount of expertise for troubleshooting and understanding how the tool works. Even as the tools get stronger, the web becomes more and more complex and difficult to capture, so I can’t imagine that it will ever be a totally painless process.

Trevor: You have worked on some very different born-digital collections, processing born-digital materials for StoryCorps in New York and on a TRAC self-audit at ICPSR, one of the most significant holders of social science data sets. While these are very different kinds of materials, I imagine there are some similarities there too. Could you tell us a bit about what you did and what you learned working for each of these institutions? Further, I would be curious to hear what kinds of parallels or similarities you can draw from the work.


Image of a StoryCorps exhibit at the New Museum in which Emily participated.

Emily: At StoryCorps, I did a lot of hands-on work with incoming interviews and data, so I saw first-hand the amount of effort that goes into making such complex collections discoverable. Their full interviews are not currently available online, but need to be accessible to internal staff. At ICPSR, I was more on the policy side of things, getting an overview of their preservation activities and documenting compliance with the TRAC standard.

StoryCorps and ICPSR are an interesting pair of organizations to compare because there are some striking similarities in the challenges they face in terms of access. The complexity and variety of research data held by ICPSR requires specialized tools and standards for curation, discovery and reuse. Similarly, oral history interviews can be difficult to discover and use without extensive metadata (including, ideally, full transcripts). They’re specialized types of content, and both organizations have to be innovative in figuring out how to preserve and provide access to their collections.

ICPSR has a strong infrastructure and systems for normalizing and documenting the data they ingest, but this work still requires a great deal of human input and quality control. Similarly, metadata for StoryCorps interviews is input manually by staff. I think both organizations have done great work towards finding solutions that work for their individual context, although the tools for providing access to research data seem to have developed faster than those for oral history. I’m hopeful that with tools like Pop Up Archive that will change.

Trevor: Most recently, you’ve played a leadership role in the development of the World Bank’s eArchives project. Could you tell us about this project a little and suggest some of the biggest things you learned from working on it?

Julia Blase and Emily Reynolds present on “Developing Sustainable Digital Archive Systems” at the ALA 2013 Midwinter Meeting. Photo by Jaime McCurry.

Emily: The eArchives program is an effort to digitize the holdings of the World Bank Group Archives that are of greatest interest to researchers. We don’t view our digitization as a preservation action (only insofar as it reduces physical wear and tear on the records), and are primarily interested in providing broader access to the records for our international user base. We’ve scanned around 1500 folders of records at this point, prioritizing records that have been requested by researchers and cleared for public disclosure through the World Bank’s Access to Information Policy.

The project has also included a component of improving the accessibility of digitized records and archival finding aids. We are in the process of launching a public online finding aid portal, using the open-source Access to Memory (AtoM) platform, which will contain the archives’ ISAD(G) finding aids as well as links to the digitized materials. Previously, the finding aids were contained in static HTML pages that needed to be updated manually; soon, the AtoM database will sync regularly with our internal description database. This is going to be a huge upgrade for the archivists, in terms of reducing duplication of work and making their efforts more visible to the public.

It’s been really interesting to collaborate with the archives staff throughout the process of launching our AtoM instance. I’ve been thinking a lot about how compliance with archival standards can actually make records less accessible to the public, since the practices and language involved in finding aids can be esoteric and confusing to an outsider. It has been an interesting balance to ensure that the archivists are happy with the way the descriptions are presented, while also making the site as user-friendly as possible. Anne-Marie Viola, of Dumbarton Oaks, has written a couple of blog posts about the process of conducting usability testing on their AtoM instance, which have been a great resource for me.

Trevor: As I understand it, you are starting out a new position as a program specialist with the Institute of Museum and Library Services. I realize you haven’t started yet, but could you tell us a bit about what you are going to be doing? Along with that, I would be curious to hear you talk a bit about how you see your experience thus far fitting into working for the federal agency that funds libraries and museums.

Emily: As a Program Specialist, I’ll be working in IMLS’s Library Discretionary Programs division, which includes grant programs like the Laura Bush 21st Century Librarian Program and the National Leadership Grants for Libraries. Among other things, I will be supporting the grant review process, communicating with grant applicants, and coordinating grant documentation. I’ll also have the opportunity to participate in some of the outreach that IMLS does with potential and existing grant applicants.

Even though I haven’t been in the profession for a very long time, I’ve had the opportunity to work in a lot of different areas, and as a result feel that I have a good understanding of the broad issues impacting all kinds of libraries today. I’m excited that I’ll be able to be involved in a variety of initiatives and areas, and to increase my involvement in the professional community. I’ve also been spoiled by the National Digital Stewardship Residency’s focus on professional development, and am excited to be moving on to a workplace where I can continue to attend conferences and stay up-to-date with the field.

Trevor: Staffing is a big concern for the future of access to digital information. The NDSA staffing survey gets into a lot of these issues. Based on your experience, what words of advice would you offer to others interested in getting into this field? How important do you think particular technical capabilities are? What made some of your internships better or more useful than others? What kinds of courses do you think were particularly useful? At this point you’ve graduated among a whole cohort of students in your program. What kinds of things do you think made the difference for those who had an easier time getting started in their careers?

Emily: I believe that it is not the exact technical skills that are so important, but the ability to feel comfortable learning new ones, and the ability to adapt what one knows to a particular situation. I wouldn’t expect every LIS graduate to be adept at programming, but they should have a basic level of technical literacy. I took classes in GIS, PHP and MySQL, Drupal and Python, and while I would not consider myself an expert in any of these topics, they gave me a solid understanding of the basics, and the ability to understand how these tools can be applied.

I think it’s also important for recent graduates to be flexible about what types of jobs they apply for, rather than only applying for positions with “Librarian” or “Archivist” in the title. The work we do is applicable in so many roles and types of organizations, and I know that recent grads who were more flexible about their search were generally able to find work more quickly. I enjoyed your recent blog post on the subject of digital archivists as strategists and leaders, rather than just people who work with floppy discs instead of manuscripts. Of course this is easy for me to say, as I move to my first job outside of archives – but I think I’ll still be able to support and participate in the field in a meaningful way.

Categories: Planet DigiPres

EaaS: Image and Object Archive — Requirements, Implementation and Example Use-Cases

Open Planets Foundation Blogs - 23 July 2014 - 10:33am
bwFLA's Emulation-as-a-Service makes emulation widely available to non-experts and could establish emulation as a valuable tool in digital preservation workflows. Providing these emulation services to access preserved and archived digital objects poses further challenges to data management. Digital artifacts are usually stored and maintained in dedicated repositories, and object owners want to – or are required to – stay in control of their intellectual property. This article discusses the problem of managing virtual images, i.e. virtual hard disks bootable by an emulator, and derivatives thereof, but the solution proposed can be applied to any digital artifact.

Requirements

Once a digital object is stored in an archive and an appropriate computing environment has been created for access, this environment should be immutable and should not be modified except explicitly through an administrative interface. This guarantees that a memory institution's digital assets are unaltered by the EaaS service and remain available in the future. Immutability, however, is not easy to handle for most emulated environments. Just booting the operating system may change an environment in unpredictable ways. When the emulated software writes parts of this data and reads it again, it expects the read data to reflect the modifications. Furthermore, users who want to interact with the environment should be able to change or customize it. Therefore, data connectors have to provide write access for the emulation service while they cannot write the data back to the archive.

The distributed nature of the EaaS approach requires efficient network transport of data to allow for immediate data access and usability. However, digital objects stored in archives can be quite large. When representing a hard disk image, the installed operating system together with installed software can easily grow to several GBs in size.
Even with today's network bandwidths, copying these digital objects in full to the EaaS service may take minutes and affects the user experience. While the archived amount of data is usually large, the data that is actually accessed frequently can be very small. In a typical emulator scenario, read access to virtual hard disk images is block-aligned and only very few blocks are actually read by the emulated system. Transferring only these blocks instead of the whole disk image file is typically more efficient, especially for larger files. Therefore, the network transport protocol has to support random seeks and sparse reads without the need to copy the whole data file. While direct filesystem access provides these features if a digital object is locally available to the EaaS service, such access is not available in the general case of separate emulation and archive servers connected via the internet.

Implementation

The Network Block Device (NBD) protocol provides a simple client/server architecture that allows direct access to single digital objects as well as random access to the data stream within these objects. Furthermore, it can be completely implemented in userspace and does not require a complex software infrastructure to be deployed to the archives. In order to access digital objects, the emulation environment needs to reference these objects. Individual objects are identified in the NBD server by unique export names. While the NBD URL schema directly identifies the digital object and the archive where it can be found, the data references are bound to the actual network location. In a long-term preservation scenario, where emulation environments, once curated, should last longer than a single computer system acting as the NBD server, this approach has obvious drawbacks.
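The payoff of block-aligned sparse reads can be sketched in a few lines (a toy illustration using plain file I/O in place of the NBD transport; the block size and choice of blocks are made up for the example):

```python
import os
import tempfile

BLOCK = 512  # assume block-aligned access, as an emulator would issue

def read_blocks(path, block_numbers, block_size=BLOCK):
    """Fetch only the requested blocks via random seeks,
    instead of copying the whole image first."""
    out = {}
    with open(path, "rb") as f:
        for n in sorted(block_numbers):
            f.seek(n * block_size)
            out[n] = f.read(block_size)
    return out

# build a 1 MiB dummy "disk image"
fd, path = tempfile.mkstemp()
os.write(fd, os.urandom(1024 * 1024))
os.close(fd)

wanted = {0, 7, 100}               # the few blocks the emulated system touches
blocks = read_blocks(path, wanted)
transferred = len(wanted) * BLOCK  # 1536 bytes moved instead of 1048576
image_size = os.path.getsize(path)
os.remove(path)
```

The same access pattern over NBD means only a tiny fraction of a multi-GB image ever crosses the network.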
Furthermore, the cloud structure of EaaS allows for interchanging any component that participates in the preservation effort, thus allowing for load balancing and fail-safety. This advantage of distributed systems is offset by static, hostname-bound references.

Handle It!

To detach the references from the object's network location, the Handle System is used as a persistent object identifier throughout our reference implementation. The Handle System provides a complete technological framework to deal with these identifiers (or "Handles" (HDL) in the Handle System) and constitutes a federated infrastructure that allows the resolution of individual Handles using decentralized Handle Services. Each institution that wants to participate in the Handle System is assigned a prefix and can host a Handle Service. Handles are then resolved by a central resolver that forwards requests to these services according to the Handle's prefix. As the Handle System, as a sole technological provider, does not pose any strict requirements on the data associated with Handles, it was chosen as the PI technology.

Persistent User Sessions and Derivatives

As digital objects (in this case the virtual disk image) are not to be modified directly in the archive by the EaaS service, a mechanism to store modifications locally while reading unchanged data from the archive has to be implemented. Such a transparent write mechanism can be achieved using a copy-on-write access strategy. While NBD allows arbitrary parts of the data to be read on request, not requiring any data to be held locally, data that is written through the data connector is tracked and stored in a local data structure. If a read operation requests a part of the data that is already in this data structure, the previously changed version of the data is returned to the emulation component. Similarly, parts of the data that are not in this data structure were never modified and must be read from the original archive server.
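A minimal model of this transparent copy-on-write layer might look as follows (an illustrative sketch, not the bwFLA implementation; the class name and block size are invented):

```python
class CopyOnWriteOverlay:
    """Toy copy-on-write layer: writes land in a local structure,
    reads fall back to the (immutable) archived base image."""

    def __init__(self, base: bytes, block_size: int = 512):
        self.base = base          # stands in for the remote archive
        self.block_size = block_size
        self.local = {}           # block number -> locally modified block

    def write_block(self, n: int, data: bytes) -> None:
        assert len(data) == self.block_size
        self.local[n] = data      # never touches the base image

    def read_block(self, n: int) -> bytes:
        if n in self.local:       # modified during this session
            return self.local[n]
        off = n * self.block_size # unmodified: read from the archive
        return self.base[off:off + self.block_size]

base = bytes(4096)                # pristine archived image, 8 zero blocks
session = CopyOnWriteOverlay(base)
session.write_block(2, b"\xff" * 512)
changed = session.read_block(2)   # the session sees its own change
pristine = session.read_block(3)  # untouched blocks come from the base
```

After the session, `session.local` holds exactly the derived data that needs to be stored, while `base` is byte-for-byte unaltered.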
Over time, a running user session builds its own local version of the data, but only those parts that were written are actually copied. We used the qcow2 container format from the QEMU project to keep track of local changes to the digital object. Besides supporting copy-on-write, it features open documentation as well as a widely used and tested reference implementation with a comprehensive API, the QEMU Block Driver. The qcow2 format stores all changed data blocks and the respective metadata for tracking these changes in a single file. To define where the original blocks (before copy-on-write) can be found, a backing file definition is used. The Block Driver API provides a continuous view of this qcow2 container, transparently choosing either the backing file or the copy-on-write data structures as source. This mechanism allows modifications to be stored separately and independently from the original digital object during an EaaS user session, keeping every digital object in its original state as it was preserved. Once the session has finished, these changes can be retrieved from the emulation component and used to create a new, derived data object.

As any Block Driver format is allowed in the backing file of a qcow2 container, the backing file can itself be another qcow2 container. This allows "chaining" a series of modifications as copy-on-write files that only contain the actually modified data. This greatly facilitates efficient storage of derived environments, as a single qcow2 container can directly be used in a binding without having to combine the original data and the modifications into a consolidated stream of data. However, this makes such bindings rely not only on the availability of the qcow2 container with the modifications, but also on the original data the qcow2 container refers to. Therefore, consolidation is still possible and directly supported by the tools that QEMU provides to handle qcow2 files.
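The chaining and consolidation of copy-on-write layers can be modelled in a few lines (a simplified stand-in for what qcow2 backing files and the QEMU tools provide, not the actual on-disk format):

```python
def resolve(chain, n, block_size=512):
    """Read block n through a chain of copy-on-write layers.
    chain[0] is the newest derivative, chain[-1] is the base image;
    each layer is a dict of locally modified blocks, the base is bytes."""
    for layer in chain[:-1]:
        if n in layer:                    # newest modification wins
            return layer[n]
    base = chain[-1]                      # fell through the whole chain
    return base[n * block_size:(n + 1) * block_size]

def consolidate(chain, nblocks, block_size=512):
    """Fold a whole backing chain into one standalone image,
    analogous to consolidating a qcow2 chain with QEMU's tools."""
    return b"".join(resolve(chain, n, block_size) for n in range(nblocks))

base = bytes(2048)                        # 4 pristine blocks
layer1 = {1: b"\x01" * 512}               # first session's changes
layer2 = {2: b"\x02" * 512}               # a derivative of that session
chain = [layer2, layer1, base]

img = consolidate(chain, 4)               # standalone, chain-independent copy
```

Each layer stores only its own modified blocks, yet the consolidated image no longer depends on the availability of its backing chain.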
Once the data modifications and the changed emulation environment are retrieved after a session, both can be stored again in an archive to make this derivative environment available. Only those chunks of data that were actually changed by the user have to be retrieved. These, however, reference and remain dependent on the original, unmodified digital object. The derivative can then be accessed like any other archived environment. Since all derivative environments contain (stable) references to their backing files, modifications can be stored in a different image archive, as long as the backing file is available. Therefore, each object owner is in charge of providing storage for individualized system environments but is also able to protect its modifications without losing the benefits of shared base images.

Examples and Use-Cases

To provide a better understanding of the image archive implementation, the following three use-cases demonstrate how the current implementation works. Firstly, a so-called derivative is created, a tailored system environment suitable to render a specific object. In a second scenario, a container object (CD-ROM) is injected into the environment, which is then modified for object access, i.e. installation of a viewer application and adding the object to the autostart folder. Finally, an existing hard disk image (e.g. an image copy of a real machine) is ingested into the system. This last case requires, besides the technical configuration of the hardware environment, private files to be removed before public access.

Derivatives – Tailored Runtime Environments

Typically, an EaaS provider offers a set of so-called base images. These images contain a basic OS installation which has been configured to run on a certain emulated platform. Depending on the user's requirements, additional software and/or configuration may be required, e.g. the installation of certain software frameworks or text processing or image manipulation software.
This can be done by uploading or making available a software installation package. On our current demo instance this is done either by uploading individual files or a CD ISO image. Once the software is installed, the modified environment can be saved and made accessible for object rendering or similar purposes.

Object-Specific Customization

In the case of complex CD-ROM objects with rich multimedia content from the 90s and early 2000s, e.g. encyclopedias and teaching software, typically a custom viewer application has to be installed to render their content. For these objects, an already prepared environment (installed software, autostart of the application) improves the user experience during access, as "implicit" knowledge about using an outdated environment is no longer required to make use of the object. Since the number of archived media is large, duplicating, for instance, a Microsoft Windows environment for every one of them would add a few GBs of data to each object. Usually, neither the object's information content nor the current or expected user demand justifies these extra costs. Using derivatives of base images, however, only a few MBs are required for each customized environment, since only changed parts of the virtual image have to be stored for each object. In the case of the aforementioned collection of multimedia CD-ROMs, the derivative size varies between 348KB and 54MB.

Authentic Archiving and Restricted Access to Existing Computers

Sometimes it makes sense to preserve a complete user system like the personal computer of Vilém Flusser in the Vilém Flusser Archive. Such complete system environments can usually be achieved by creating a hard disk image of the existing computer and using this image as the virtual hard disk for EaaS. Such hard disk images can, however, contain personal data of the computer's owner.
While EaaS aims at providing interactive access to complete software environments, it is impossible to restrict this "interactiveness", e.g. to forbid access to a certain directory directly from the user interface. Instead, our approach to this problem is to create a derivative work with all the personal data stripped from the system. This allows users with sufficient access permissions (e.g. family or close friends) to access the original system including personal data, while the general public sees only a computer with all the personal data removed.

Conclusion

With our distributed architecture and an efficient network transport protocol, we are able to provide Emulation as a Service quite efficiently while at the same time allowing owners of digital objects to remain in complete control over their intellectual property. Using copy-on-write technology it is possible to create a multitude of different configurations and flavors of the same system with only minimal storage requirements. Derivatives and their respective "parent" system can be handled completely independently of each other, and withdrawing access permissions for a parent will automatically invalidate all existing derivatives. This allows for a very efficient and flexible handling of curation processes that involve the installation of (licensed) software, personal information and user customizations.

Open Planets members can test the aforementioned features using the bwFLA demo instance. Get the password here:

Preservation Topics: Emulation
Categories: Planet DigiPres

Archiving video

File Formats Blog - 19 July 2014 - 10:59am

Suppose you see a cop beating someone up for jaywalking, or you’re stopped at one of the Border Patrol’s internal checkpoints. You’ve got your camera, phone, or tablet, so you make a video record of the incident. What do you do next? The Activists’ Guide to Archiving Video has some solid advice. Its purpose is to help you “make sure that the video documentation you have created or collected can be used for advocacy, as evidence, for education or historical memory – not just now but into the future.” Most of the advice applies to any video recording that has long-term importance; in essence, it’s the same advice you’d get from Files that Last or from the Library of Congress. It also covers considerations that especially apply to sensitive video, such as encryption and information that might put people at risk, making it a valuable addition to anyone’s digital preservation library.

There’s a PDF version of the guide for people who don’t like hopping around web pages. Versions in Spanish and Arabic are also provided.

Tagged: metadata, preservation, video
Categories: Planet DigiPres


A Vagrant environment for C3PO

Open Planets Foundation Blogs - 17 July 2014 - 2:36pm

We have just set up a vagrant environment for C3PO. It starts a headless vm where the C3PO related functionalities (Mongodb, Play, a downloadable command-line jar) are manageable from the host's browser. Furthermore, the vm itself has all relevant processes configured at start-up independently of vagrant, so once created it can be downloaded and used as a stand-alone C3PO vm. We think this could be a scenario applicable to other SCAPE projects as well. The following is a summary of the ideas we've had and the experience we've gained.

The Result

The Vagrantfile and a directory containing all vagrant-relevant files live directly in the root directory of the C3PO repository. So after installing Vagrant and cloning the repository, a simple 'vagrant up' should do all the work, such as downloading the base box, installing the necessary software and booting the new vm.

After a few minutes one should have a running vm that is accessible from the host's browser at localhost:8000. This opens a central welcome page that contains information about the vm-specific aspects and links to the Play framework's url (localhost:9000) and the Mongodb admin interface (localhost:28017). It also provides a download link for the command-line jar, which has to be used in order to import data. The jar can be used from outside the vm, as the Mongodb port is mapped as well. So I can import and analyse data with C3PO without having to fiddle through the setup challenges myself, and, believe me, that road can be long and stony.

The created image is self-contained in the sense that, if I put it on a server, anyone who has Virtualbox installed can download and use it, without having to rely on vagrant working on their machine.

General Setup

The provisioning script has a number of tasks:

  • it downloads all required dependencies for building the C3PO environment
  • it installs a fresh C3PO (from /vagrant, which is the shared folder connection between the git repository and the vm) and assembles the command-line app
  • it installs and runs a Mongodb server
  • it installs and runs the Playframework
  • it creates a port-forwarded static welcome page with links to all the functionalities above
  • it adds all of the above to the native ubuntu startup (using /etc/rc.local, if necessary), so that an image of the vm can theoretically run independently of the vagrant environment
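A Vagrantfile matching the setup described above might look roughly like this (a sketch only: the box name and provisioning script path are illustrative, not the actual ones from the C3PO repository; the port numbers are those mentioned in the post):

```ruby
# Hypothetical Vagrantfile sketch for the C3PO vm.
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"   # illustrative base box

  # Forwarded ports: welcome page, Play, Mongodb and its admin UI
  config.vm.network "forwarded_port", guest: 8000,  host: 8000
  config.vm.network "forwarded_port", guest: 9000,  host: 9000
  config.vm.network "forwarded_port", guest: 27017, host: 27017
  config.vm.network "forwarded_port", guest: 28017, host: 28017

  # /vagrant is the shared folder onto the cloned git repository,
  # from which the provisioning script builds and installs C3PO.
  config.vm.provision "shell", path: "vagrant/provision.sh"
end
```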

These are all trivial steps, but it can make a difference not having to manually implement all of them.

Getting rid of proxy issues

In case you're behind one of those very common NTLM company proxies, you'll really like that the only thing you have to provide is a config script with some details about your proxy. If the setup script detects this file, it will download the necessary software and configure maven to use it. Doing it this way was actually the first time I got maven running smoothly on a linux VM behind our proxy.
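For reference, configuring maven to use a proxy comes down to a proxy block in its settings.xml; a sketch of the kind of block such a provisioning step would generate (host and port values are placeholders, not anything from the C3PO setup):

```xml
<!-- ~/.m2/settings.xml -- illustrative proxy configuration -->
<settings>
  <proxies>
    <proxy>
      <id>company-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>3128</port>
      <nonProxyHosts>localhost|127.0.0.1</nonProxyHosts>
    </proxy>
  </proxies>
</settings>
```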

Ideas for possible next steps

There is loads left to do, here are a few ideas:

  • provide interesting initial test-data that ships with the box, so that people can play around with C3PO without having to install or import anything at all.
  • why not have a vm for more SCAPE projects? We could quickly create a repository for something like a SCAPE base vm configuration that is usable as a base for other vms. The central welcome page could be pre-configured (SCAPE branded), as well as all the proxy- and development-environment-related stuff mentioned above.
  • I'm not sure about the sustainability of shell provisioning scripts as the complexity of the bootstrap process increases. Grouping the shell commands in functions is certainly an improvement, but it might be worth checking out other, more dynamic provisioners. One I find particularly interesting is Ansible.
  • currently there's no way of testing that the vm works with the current development trunk; a test environment that runs the vm and tests all the relevant connection bits would be handy


Preservation Topics: SCAPE
Categories: Planet DigiPres

CSV Validator version 1.0 release

Open Planets Foundation Blogs - 15 July 2014 - 12:10pm

Following on from my previous brief post announcing the beta release of the CSV Validator, today we've made the formal version 1.0 release of the CSV Validator and the associated CSV Schema Language. I've described this in more detail on The National Archives' blog.

Preservation Topics: Tools
Categories: Planet DigiPres

Crowdsourcing song identification

File Formats Blog - 14 July 2014 - 10:04am

Some friends of mine are pulling together a project for crowdsourcing identification of a large collection of music clips. At least a couple of us are professional software developers, but I’m the one with the most free time right now, and it fits with my library background, so I’ve become lead developer. In talking about it, we’ve realized it can be useful to librarians, archivists, and researchers, so we’re looking into making it a crowdfunded open source project.

A little background: “Filk music” is songs created and sung by science fiction and fantasy fans, mostly at conventions and in homes. I’ve offered a definition of filk on my website. There are some shoestring filk publishers; technically they’re in business, but it’s a labor of love rather than a source of income. Some of them have a large backlog of recordings from past conventions. Just identifying the songs and who’s singing them is a big task.

This project is, initially, for one of these filk publishers, who has the biggest backlog of anyone. The approach we’re looking at is making short clips available to registered crowdsource contributors, and letting them identify as much as they can of the song, the author, the performer(s), the original tune (many of these songs are parodies), etc. Reports would be delivered to editors for evaluation. There could be multiple reports on the same clip; editors would use their judgment on how to combine them. I’ve started on a prototype, using PHP and MySQL.
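As a rough sketch of the data model described above (using Python's built-in sqlite3 rather than the PHP/MySQL prototype the post mentions; table names, columns and sample rows are all illustrative): each clip can accumulate any number of crowdsourced reports, which an editor can then review side by side.

```python
# Hypothetical schema sketch: clips plus multiple reports per clip.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE clips (
    id        INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL
);
CREATE TABLE reports (
    id         INTEGER PRIMARY KEY,
    clip_id    INTEGER NOT NULL REFERENCES clips(id),
    reporter   TEXT NOT NULL,
    song_title TEXT,
    author     TEXT,
    performer  TEXT,
    tune       TEXT    -- original tune, for parodies
);
""")

# Made-up sample data: two contributors identify the same clip.
db.execute("INSERT INTO clips (id, filename) VALUES (1, 'con1992_042.mp3')")
db.execute("INSERT INTO reports (clip_id, reporter, song_title) "
           "VALUES (1, 'alice', 'Banned from Argo')")
db.execute("INSERT INTO reports (clip_id, reporter, song_title) "
           "VALUES (1, 'bob', 'Banned from Argo')")

# An editor sees all reports for a clip together and combines them:
rows = db.execute(
    "SELECT reporter, song_title FROM reports WHERE clip_id = 1"
).fetchall()
assert len(rows) == 2
```

Nothing in this model is specific to filk, which is what makes the tool plausible for other archives of unidentified recordings.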

There’s a huge amount of enthusiasm among the people already involved, which makes me confident that at least the niche project will happen. The question is whether there may be broader interest. I can see this as a very useful tool for professionals dealing with archives of unidentified recordings: folk music, old jazz, transcribed wax cylinder collections, whatever. There’s very little in the current design that’s specific to one corner of the musical world.

The first question: Has anyone already done it? Please let me know if something like this already exists.

If not, how interesting does it sound? Would you like it to happen? What features would you like to see in it?

Update: On the Code4lib mailing list, Jodi Schneider pointed out that nichesourcing is a more precise word for what this project is about.

Tagged: archiving, crowdsourcing, filk, music
Categories: Planet DigiPres