Planet DigiPres

Curating Extragalactic Distances: An interview with Karl Nilsen & Robin Dasler

The Signal: Digital Preservation - 18 August 2014 - 4:54pm

Screenshot of Extragalactic Distance Database Homepage.

While a fair amount of digital preservation focuses on objects that have clear counterparts in our analog world (still and moving images and documents, for example), there is a range of forms that are basically natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.

Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?


Representation of the Extragalactic distance ladder from Wikimedia Commons.

Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.

The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.


Extragalactic Distance Database Result table.

The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.

Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.
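One way to check that a replica matches its source after a transfer is to compare checksum manifests of the two file trees. The following is a minimal sketch in Python, not a description of the EDD team's actual workflow; the function names `sha256_of`, `build_manifest` and `compare_manifests` are hypothetical and introduced here only for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source: dict, replica: dict) -> dict:
    """Report files missing from the replica, extra in it, or differing."""
    shared = source.keys() & replica.keys()
    return {
        "missing": sorted(set(source) - set(replica)),
        "extra": sorted(set(replica) - set(source)),
        "changed": sorted(k for k in shared if source[k] != replica[k]),
    }
```

In this sketch, each site would run `build_manifest` on its own copy and only the two small manifests would need to be exchanged and compared. An empty report suggests the copies agree at the byte level, which is a narrower claim than saying the replica is usable.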

Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?

Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort led to an inquiry from the faculty member.

We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.

Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.

Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:

Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
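To make the trade-off concrete, here is a rough sketch of what transforming table content into XML can look like. It is an illustration only: it does not follow the SIARD format or reproduce any real tool's output, it uses Python's built-in SQLite module as a stand-in for MySQL, and the `distances` table, its columns and its values are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> ET.Element:
    """Serialize one table's column names and row values as XML elements."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            cell = ET.SubElement(row_el, "column", name=name)
            if value is None:
                # Mark NULL explicitly so it is not confused with empty text.
                cell.set("null", "true")
            else:
                cell.text = str(value)
    return root


# Hypothetical table and placeholder values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distances (galaxy TEXT, method TEXT, modulus REAL)")
conn.executemany(
    "INSERT INTO distances VALUES (?, ?, ?)",
    [("GalaxyA", "methodX", 24.5), ("GalaxyB", "methodY", None)],
)
xml_text = ET.tostring(table_to_xml(conn, "distances"), encoding="unicode")
```

The resulting XML carries the values but not the column types, indexes, or the queries that the application runs over the tables. That loss is one form of the functionality that may not be easily reconstructed after such a transformation.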

Trevor: Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.

Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
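One possible way to limit proliferating versions is to name each archived snapshot by the hash of its content, so that an unchanged database does not produce a new copy. The sketch below is an assumption-laden illustration rather than a policy the EDD project has adopted; `store_snapshot` and the archive layout are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def store_snapshot(dump_file: Path, archive_dir: Path):
    """Copy dump_file into archive_dir under a name derived from its SHA-256.

    Returns (archived_path, True) if a new copy was written, or
    (archived_path, False) if an identical snapshot was already stored.
    """
    digest = hashlib.sha256()
    with dump_file.open("rb") as handle:
        # Hash in chunks so that large dumps do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    target = archive_dir / f"{digest.hexdigest()}{dump_file.suffix}"
    if target.exists():
        return target, False
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(dump_file, target)
    return target, True
```

This only avoids storing byte-identical snapshots. A dump that embeds a timestamp would hash differently on every run even when the data is unchanged, and the approach says nothing about which of the distinct versions are worth keeping, which remains a policy decision.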

Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?

Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application–joins, conditions, grouping–as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.

Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?

Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:

Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.

Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?

Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.

Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries–made up of subject specialists and digital curation specialists as needed–and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.

We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.


Research is Magic: An Interview with Ethnographers Jason Nguyen & Kurt Baer

The Signal: Digital Preservation - 15 August 2014 - 7:54pm

Jason Nguyen and Kurt Baer, PhD students in the Department of Folklore and Ethnomusicology at Indiana University, drawn in the style of “My Little Pony Friendship is Magic”

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

When Hasbro decided to reboot their 1980s “My Little Pony” franchise, who would have guessed that they would give rise to one of the most surprising and interesting fan subcultures on the web? The 2010 animated television series “My Little Pony: Friendship is Magic” has garnered an extremely loyal–and as a 2012 documentary put it, “extremely unexpected”–viewership among adult fans. Known colloquially as “bronies” (a portmanteau of “bro” and “ponies”), these fans are largely treated with fascination and confusion by the mainstream media. All of this interest has resulted in a range of scholars in different fields working to understand this cultural phenomenon.

In this installment of the NDSA Insights Interview series, I talk with Jason Nguyen and Kurt Baer. Both PhD students at Indiana University in the Department of Folklore and Ethnomusicology, Jason and Kurt decided to study this unique subculture. Their website is where they both conduct their field research, blog about their findings and invite feedback from the community.

Julia: Can you tell me a little bit more about bronies (and pegasisters)? How do they define themselves? How long have these movements been occurring and where are they communicating online? Do you have any sense of how large these communities are?

Jason: An important starting premise for us is that bronies attach a wide variety of different values and identity markers to the label of brony, imagining and experiencing their relationships to one another in multiple ways–sometimes even conflicting ones. Nonetheless, there are some shared histories that nearly all bronies will describe as specific to this community. Specifically, bronies as a concept unique from My Little Pony fandom arose out of the relaunch/reboot of the Hasbro franchise as My Little Pony: Friendship is Magic in fall 2010. Lauren Faust, particularly known to this group for her work with her husband Craig McCracken on Powerpuff Girls and Foster’s Home for Imaginary Friends, developed the idea and wrote for the show through its first two seasons, and her gender politics has a lot to do with the complex and often non-normative characterization of the ponies. Because of that, bronies will generally start with the content of the show as reason enough for being a fandom: it is smartly written and portrays a positive, socially-oriented world view. Some bronies will portray this oppositionally to other, more negative media, but at the same time, many are involved in multiple fandoms and are often fans of “darker” work as well.

In any case, the label of “brony” has a pretty specific starting point, arising out of the show’s popularity in 2010 on 4chan, which was to some extent ironic, i.e. “Haha, we’re grown men watching a little girls’ show,” though I think the irony of that moment is always overstated (since irony is a useful footing to allow a grown man to watch a little girls’ show if he so desires). Over the following year, the bronies started to overtake 4chan and were kicked out; 4chan eventually opened /mlp/ for them, but the conflict lasted for a few months and was an impetus to organize elsewhere on the web.

At this point, things get more complicated, because people who like FiM search for other fans online, but the cross-demographic appeal means that reasons for being a fan and even ways of being a fan are not necessarily shared in the way you might expect of a more homogenous group. For example, fans coming from other “geek” fandoms are used to the convention scene and fandom as a sort of genre (keeping in touch with friends online, then getting together a few times a year at a convention), but for many bronies, this is the first time they have participated in this kind of mass-mediated imagined community.

Kurt: As far as numbers go, it is really hard to tell how large the brony community is. This is partly due to the varying definitions of what makes a “brony.” However, the brony community (or communities) is quite large and very active both online and off. For instance, Bronycon, the largest brony convention, brought in over 8,000 people last year, Coder Brony’s 2014 herd census received over 18,000 responses from all around the world, and Equestria Daily is, as of now, rapidly approaching 500 million hits on their website. There are brony communities all over Facebook and Reddit (which even has multiple subreddits devoted to sorting out all of the MLP subreddits). There are very active 4chan, Twitter, SoundCloud and DeviantArt communities; brony groups on other online games ranging from Team Fortress to Minecraft to Clash of Clans; over a dozen 24-hour streaming radio stations for Brony music; and major news sites such as Equestria Daily and Everfree that link bronies to relevant information from all over the web. What’s more is that these “communities” are not discrete from one another. People bounce between platforms all of the time, sometimes between different online personas, making coming up with specific numbers very difficult.

Julia: How is your approach to studying bronies similar or different from approaches to studying other fan cultures, and for that matter, any number of other modes of participatory culture?

Jason: In a lot of ways, I don’t think the work we are doing is all that different than many ethnographic studies insofar as the basic process of participant observation is concerned. As for the field of fan/fandom studies, we have thus far not cast our work in that light, though not because of any strong feelings either way. Fandom studies has a strong thread of reception and media studies coming from a more literary and cultural studies perspective that we enjoy but it’s not our theoretical foundation (I’m thinking of Henry Jenkins’ early work, for example).

That emphasis on broad cultural production that I think is heavily influenced by the legacy of the Frankfurt School is perhaps one difference, since we are strongly ethnographic and thus more granular in our approach. That said, many scholars we might read in a fandom studies class have used ethnographic and anthropological methods as well, such as Bonnie Nardi in her great “My Life as a Night Elf Priest” about the “World of Warcraft” fandom.

Kurt: Ultimately, while we might be among the few people researching people and brightly colored ponies on the internet at the moment (that number is always growing), the questions that we are looking to understand and the ways that we are trying to understand them are quite similar to research coming from a long line of ethnographers dating (in the anthropological imagination, at least) all the way back to Bronislaw Malinowski. Perhaps one relatively substantial difference that we have at least been trying for, however, lies in the fact that we are trying to use the blog format to allow for more back-and-forth interaction between us and the people who we are studying/studying with than the traditional ethnographic monograph allows. While many ethnographers (such as Steven Feld in his ethnography “Sound and Sentiment”) are able to get feedback from the people they study with and incorporate that into the writing process (or at least their second editions), we have been trying to find ways to speed up that process of garnering feedback, learning from it, and using that knowledge as a means for further theorization.


Screenshot of the Research is Magic blog, which serves as a space for dialog with research participants.

Julia: You’ve stated that your blog “represents an attempt at participant-observation that collapses the boundaries between academic and interlocutor.” Can you expand on this? What are some of your goals with this blog? Why start your own blog as opposed to gathering data and engaging with bronies on their own virtual “turf,” like websites like Equestria Daily?

Kurt: One important bit of background information that I feel is important to bring up here is that Jason and I both come from fields that focus primarily upon ethnographic research, and in fact, the blog itself was started as part of a course in creative ethnography taught by Dr. Susan Lepselter that Jason and I took at Indiana University. In approaching this research ethnographically, we wanted to be able to ask questions and elicit observations from bronies themselves in addition to analyzing the various other types of “texts” such as the show itself, other websites, and pre-existing conversations. We also wanted to be clear and open about the fact that we are researchers conducting research. We figured that starting our own blog would give us the space that we needed to be able to ask questions and make observations while still being clear about our research and research objectives. Through our interactions with people on social media sites and on places such as Equestria Daily, it has been our hope that the blog becomes a space that is part of different bronies’ “turfs,” where they can go to interact with us and each other and discuss different aspects of being a brony.

As far as our attempts to collapse the boundaries between academic and interlocutor go, one of the things that drew us to the brony community in the first place is that they are already very involved in theorization about themselves and about the show. They talk about what it means to be a brony, provide deep textual analyses of the show and its themes, and grapple with the social implications of liking a show that some people think that they shouldn’t. Rather than us going into the “field,” collecting data about bronies, and then returning to write that information up in an article to be published in an academic journal, we hoped to create a space where we can theorize together and where all of the observations and ideas would be available in the same space to serve as material for more conversation and theorization.

Jason: Another way to think about this is that there is nothing more brony-like than to start a space of your own online. As Kurt has recounted above, bronies have been quite prolific in their production of cyberspaces for communal interaction, and not all of them are big like Equestria Daily. Of course there are always the YouTube stars and Twitter celebrities of any mass-media fandom, but the more mundane spaces are equally important, and the process of making a website, maintaining a Twitter profile, etc.–in short, creating a presentation of self as brony researchers amongst other people similarly engaged in a presentation of self as bronies–has been invaluable in our experience of the “participant” part of participant-observation. We both have web presences, as most bronies do before they join the fandom, but many choose to create fandom-specific identities, and that means anchoring those identities somewhere; we’ve in part chosen to anchor our brony-related identities on the website.


Photoshop of the MLP:FiM villain Discord with the intellectual hero Michel Foucault by Jason

With all that said, we do spend a lot of time investigating bronies in other spaces and in less explicitly theoretical ways. We live-tweet (tweeting comments about something as it occurs) new episodes from time to time, which is a really fun experience that lets us interact with both fans and show staff alike. I have drawn fan art and Kurt has made fan music that we have shared via Twitter, Reddit and our site.

So we like to think that we are doing both things at the same time. Of course it is important for anyone doing anthropologically informed ethnography to meet people where they are and explore their lives as they lead them, but at the same time, many fans have shown an interest in a space where they can read about and join in conversations that marry explicit theorization with personal observations of their fandom, and the “Research Is Magic” blog produces a hybrid narrative framing that we found did not previously exist in either academic or brony fandom spaces.

Julia: One of the reasons bronies as a group are so interesting is because they appear to subvert both gender and age norms. But you argue that “an analytical orientation that positions bronies as resisters trivializes their rich social interactions and effaces complicated power dynamics within and peripheral to the fandom.” That’s some dense language! Can you unpack this a bit for us?

Kurt: Essentially, our argument here is one against the tendency to find resistance and subversion and then get carried away insisting on interpreting everything about the group in that light. There is certainly some very interesting subversion of age and gender norms going on in the fandom, but bronies are not only, or even (I would argue) primarily, resisting. Most bronies that we have talked to don’t think of themselves as being oppositional, but instead as simply liking a show that they like. While it is both productive and interesting to look at the ways that bronies are resisting gender norms, it is also very easy for academics to fall into the trap of casting everything in that light, limiting the rich and complex social interactions of bronies to a romanticized narrative about bronies rising up together and resisting the gender stereotypes of larger society.

Jason: Resistance as a concept works because of a binary opposition: X resists Y. However, multiple competing discourses may be at work and are probably not all aligned to one another. For example, earlier this year, a North Carolina school kept a nine-year-old boy from bringing his Rainbow Dash backpack to school because it was getting him bullied by other students. On one level, the reasoning on all sides is obvious. To the other boys, a boy wearing “girly” paraphernalia is ripe to be bullied. The school counselor wanted to ensure the boy’s safety, so removed what was believed to be the problem. Some parents were concerned that the boy was being punished for simply expressing himself, and that the bullies should have been punished instead. …

So, while each person appears to act in resistance according to a particular discourse of meaning, and each person may have a particular narrative, the entire scenario is complicated by these competing ideas of masculinity that intersect with ideologies of personal freedom and liberty. Rainbow Dash (the character on the backpack), for example, is clearly written as a “tomboy” character–good at sports, adventurous, daring and 20 percent cooler than you. If a boy was going to pick a character to identify with that does not break existing standards of masculinity, she would be the one; thus, insofar as male fans identify with her, they’re also identifying with characteristics that don’t challenge their heteronormativity. But she is also the one covered in rainbows, and that has a particular valence as a form of non-heteronormative imagery (e.g. LGBT rights symbolism). In short, there is a density of meaning attached to Rainbow Dash that complicates people’s responses, though I would argue that it’s that complexity and density of meaning that allows different groups to be drawn to MLP in the first place.

Kurt: The ways in which people are using the show in relation to gender norms further complicate things. While in many ways bronies are challenging gender norms through their liking the show and re-defining ideas about masculinity, in other ways many bronies are super heteronormative. While they like a show that some people think is for girls, their argument is less about the fact that gender norms need dismantling than it is about the fact that the show is written in a way that is appealing to heteronormative men and that men can still be manly while liking MLP. The World’s Manliest Brony, for instance, while going against gender norms in some ways by embracing MLP and re-enforcing the manliness of giving charitably, also reinforces them in others–leaving many ideas of masculinity intact but drawing MLP into the list of things that can be manly.

Julia: Psychologist Marsha Redden, one of the conductors of The Brony Study, stated in an interview that the fandom is a normal response to the anxiety of life in a conflict-driven time, saying “they’re tired of being afraid, tired of angst and animosity. They want to go somewhere a lot more pleasant.” Likewise, a lot of what you talk about on your blog has to do with the positivity of the actual show, how each episode has a positive message and emphasizes the importance of friendship and other values. It feels very rare that we hear something positive about bronies from the mainstream media. Can you talk a bit about this? What draws adults to the show, and to the community? What do you make of the moral panic surrounding Bronies in the mainstream media?

Jason: At the risk of sounding a little persnickety, I’d like to suggest that we invert the way we think about such causal explanations. Explanations similar to Dr. Redden’s–basically, some version of the idea that the world is a rough and cynical place and that MLP presents an alternative space, no matter how delimited or constrained, that is more trusting and open–are pretty common within the fandom as part of people’s personal narratives for why and how they became bronies (obviously, this is not true for everyone, but it’s clearly a fandom trope). In anthropology itself, scholars like Victor Turner and Max Gluckman have suggested that certain carnivalesque (to borrow Bakhtin’s term) rituals act as a kind of “safety valve” for a society to release its pent up frustrations and conflicts without destroying the order of things, and some version of that idea is laden in Redden’s theory and that of many bronies. There are many bronies who see involvement in fandom and watching the show as that safety valve.

But there are many others who narrate their experience as simply watching a show that they like–just like any other show–and, to their surprise finding outside resistance. Indeed, we don’t expect people to explain their affinity for most elements of popular culture. You need not justify why you watch “Breaking Bad” or “Game of Thrones.”

The fact that causal explanations that answer why you are a brony are central to the narratives of many bronies does not really indicate too much about their truth value, but they are a useful indicator of where society draws its lines and how people who find themselves on the wrong sides of social lines create meaning based on their situations. Here, I’m drawing heavily on Lila Abu-Lughod’s ideas about resistance as a “diagnostic of power” that points us to the methods and configurations of power (“The Romance of Resistance: Tracing Transformations of Power Through Bedouin Women,” 1990). In this case, bronies (and researchers) find themselves having to produce narratives that can explain why they have crossed norms of gender and age appropriateness, even if they don’t live by those norms themselves. Jacob Clifton in “Geek Love: On the Matter of Bronies” does a great job arguing that, being the first generation raised by feminists, of course these young men don’t see any difference between Twilight Sparkle or Han Solo being their idols.

Kurt: Ultimately the fact that bronies have to justify why they like the show is in many ways coming from the fact that they get such negative press and draw such negative stereotypes. We haven’t done too much to tease out what actually draws people to the show, although we’ve seen many people give many different reasons as we’ve gone about our research–the good writing and production, the positive themes, the large and thriving fan community, having friends and relatives that like the show, that they just somehow liked it, etc. I’m not sure that there is necessarily one, or even a few, things inherent in the show or the fandom that draw people to it any more than there being something inherent in basketball that makes people want to watch it. There are a lot of really complex personal, psychological and socio-cultural things at work in personal preference and the reasons people give usually seem to explain less about why they like something (I couldn’t tell you why I like Carly Rae Jepsen or George Clinton) than they give culturally-determined reasons why it might be okay for them to like it.

Julia: Right now you have the benefit of both directly looking for source material on the open web, and having it come to you (through participation on your blog). Given your perspective, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for anthropologists of the future to study this moment in history?

Kurt: That’s a tough one, as even with our research on bronies I feel like everywhere I look, I see someone joining the Brony research herd with a new and different focus. Although we try to do a lot of our work by talking and collaborating directly with bronies, we’ve dealt with Twitter exchanges, media reports about MLP, message board archives, brony music collections, the show itself and just about anything that we can find where people are exchanging their ideas about the fandom. Others have dealt with collection of fanfics, sites dedicated to discussing MLP and religion, fan art, material culture and cosplay, and just about anything else you can think of. I’m always finding people who focus upon and draw insight from archives (both in the sense of actual archives and in the super-general sense of “stuff people use as the basis of their research”) that I would never have thought to use.

This being said, as someone that primarily studies expressive culture (my degree is from the department of Folklore and Ethnomusicology), I tend to place a lot of importance on it. The amount and quality of the music, art, videos, memes, stories, etc. floating around within the fandom has never ceased to astound me and was one of the primary reasons that I became attracted to the fandom in the first place. I feel like these bodies of creative works–from “My Little Dashie,” “Ponies: The Anthology,” and “Love me Cheerilee” to the Twilicane memes and crude saxophone covers of show tunes –are very important to the fandom and to those that want to understand it as scholars.

Jason: Broadly speaking, anthropologists have taken two approaches to describing the lives of others to their audience. The first is like a wide-angle lens, allowing someone to get a sense of the full scope of a social phenomenon, but it has trouble with the details and the charming little moments of creativity and agency–like fan-created fluffy ponies dancing on rainbows or background ponies portrayed as anthropologists studying humankind. Archival work needs that little-bit-of-everything for context, but it also needs a macro lens that can capture more of those particular and special moments. In anthropology, it might be akin to the difference between Malinowski’s epic “Argonauts of the Western Pacific”–a sprawling work that tried to introduce the entirety of a culture to us–and something like Anthony Seeger’s “Why Suyá Sing,” which performed the humbler, but no less impressive, task of letting us experience the nuances of a single ritual.

Since we can’t archive every little thing to that level of detail … we have to make choices, and that’s where bronies themselves are the best guides. What moments mattered to them, and “where” in cyberspace did they experience those moments? For a concrete example, the moment Twilight Sparkle gained her wings and became an alicorn princess (she was previously just a unicorn…thanks M.A. Larson) was particularly salient in the community, suggesting for some fans Hasbro’s stern hand manipulating the franchise. While there are some other similar instances, the unique expressions through Twitter, Reddit, YouTube, Tumblr, etc. during and immediately following the Season 3 episode “Magical Mystery Cure” (when that transformation occurs) provide a really important look into what holds meaning for this fandom.

On a technical level, I think that means being able to follow links surrounding particular events to multiple levels of depth across multiple media modalities.

Julia: If librarians, archivists and curators wanted to learn more about approaches like yours what examples of other scholars’ work would you suggest? It would be great if you could mention a few other scholars’ work and explain what you think is particularly interesting about their approaches.

Jason: One place to start is to consider what the cultural artifact is and what it is we are analyzing, interpreting, preserving, archiving, etc., because it is not, ethnographically speaking, simply media that we are studying. As Mary Gray has insisted, we should “de-center media as the object of analysis,” instead looking at what that media means and how it is contextualized. For the archivist or curator, I think that means figuring out how people come to understand media and how they attach particular ideologies to it. Ilana Gershon’s “The Breakup 2.0” and her work on “media ideology” broadly are great examples of shifting our attention so that we can hold both the “text” and “context” in view simultaneously.

Another example is danah boyd’s recent study of young people and their social media use, “It’s Complicated,” in which she inverts older people’s assumptions that teenagers’ social media use is crippling their ability to socialize, instead arguing that the constant texting and messaging indicates a desire to connect with one another that is born out of frustration with the previous generation’s (over-)protectiveness: truancy and loitering laws, curfews, school busing, constant organized activity, etc. She arrives at that conclusion not only by studying teens’ messages, but by analyzing the historical conditions that produce the very different concerns of teens and their parents.

Kurt: As far as our approach goes, we’ve also been influenced by scholars working creatively with ethnography as a form or working just outside of its purview. We’ve brought up Kathleen Stewart’s “Ordinary Affects” in our blog and academic papers several times because it has been extremely influential upon both of us through its attempt to understand and express the ordinary moments in people’s lives that, while not unusual, per se, seem to have a weight to them that moves them somewhere in some direction–the little moments that are both ordinary and extraordinary, nondescript and meaningful. Susan M. Schultz’ “Dementia Blog” also comes to mind. While it isn’t necessarily an ethnography, per se, Schultz utilized blogging and its unique structural features (namely, that newer posts come first so that reading the blog in order is actually going backwards in time) as a means of looking into the poetics and tragic beauty of dementia while also expressing and understanding her own feelings as her mother’s mental illness progressed.

Jason: We are not too familiar with scholars who are interacting with fans in precisely the way that we are (or whether there are any), though it is important to be aware of the term “aca-fan” (academic fan) in fandom studies and some of the works being produced under that rubric. Henry Jenkins titles his website “Confessions of an Aca-Fan,” for example, and writes for an audience that includes both scholars and people interested in fandoms in general. The online journal Flow is another example that is somewhat more closely related to our blog, expressly attempting to link scholars with members of the public interested in talking about television. I’m also personally influenced by the work of Michael Wesch and Kembrew McLeod, both scholars who attempt to engage their students and the public in novel ways using media and technology.

Categories: Planet DigiPres

Canvas fingerprinting, the technical stuff

File Formats Blog - 15 August 2014 - 10:57am

The ability of websites to bypass privacy settings with “canvas fingerprinting” has caused quite a bit of concern, and it’s become a hot topic on the Code4lib mailing list. Let’s take a quick look at it from a technical standpoint. It is genuinely disturbing, but it’s not the unstoppable form of scrutiny some people are hyping it as.

The best article to learn about it from is “Pixel Perfect: Fingerprinting Canvas in HTML5,” by Keaton Mowery and Hovav Shacham at UCSD. It describes the basic technique and some implementation details.

Canvas fingerprinting is based on the <canvas> HTML element. It’s been around for a decade but was standardized for HTML5. In itself, <canvas> does nothing but define a blank drawing area with a specified width and height. It isn’t even like the <div> element, which you can put interesting stuff inside; if all you use is unscripted HTML, all you get is some blank space. To draw anything on it, you have to use JavaScript. There are two APIs available for this: the DOM Canvas API and the WebGL API. The DOM API is part of the HTML5 specification; WebGL relies on hardware acceleration and is less widely supported.

Either API lets you draw objects, not just pixels, to a browser. These include geometric shapes, color gradients, and text. The details of drawing are left to the client, so they will be drawn slightly differently depending on the browser, operating system, and hardware. This wouldn’t be too exciting, except that the API can read the pixels back. The getImageData method of the 2D context returns an ImageData object, which is a pixel map. This can be serialized (e.g., as a PNG image) and sent back to the server from which the page originated. For a given set of drawing commands and hardware and software configuration, the pixels are consistent.

Drawing text is one way to use a canvas fingerprint. Modern browsers use a programmatic description of a font rather than a bitmap, so that characters will scale nicely. The fine details of how edges are smoothed and pixels interpolated will vary, perhaps not enough for any user to notice, but enough so that reading back the pixels will show a difference.
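To make the readback step concrete, here is a minimal sketch in Python of what a server might do with the pixel data once it arrives. It is an illustration under stated assumptions, not the method from the Mowery and Shacham paper: the `fingerprint` function, the choice of SHA-256, and the two tiny pixel buffers are all hypothetical. In a browser, the bytes would come from the `data` field of the ImageData object returned by getImageData.

```python
import hashlib


def fingerprint(rgba: bytes, width: int, height: int) -> str:
    """Reduce a raw RGBA pixel buffer to a short hex identifier (hypothetical helper)."""
    if len(rgba) != width * height * 4:
        raise ValueError("expected 4 bytes (R, G, B, A) per pixel")
    # Prefix the dimensions so buffers of different shapes hash differently.
    header = f"{width}x{height}:".encode("ascii")
    return hashlib.sha256(header + rgba).hexdigest()[:16]


# Made-up 2x1 renderings of the same glyph edge on two machines:
# the drawing commands are identical, but one anti-aliased pixel
# differs by a single intensity level.
machine_a = bytes([0, 0, 0, 255, 128, 128, 128, 255])
machine_b = bytes([0, 0, 0, 255, 129, 128, 128, 255])

# The same configuration reproduces the same identifier...
same_config = fingerprint(machine_a, 2, 1) == fingerprint(bytes(machine_a), 2, 1)
# ...while a one-level difference in one pixel yields a different one.
different_config = fingerprint(machine_a, 2, 1) != fingerprint(machine_b, 2, 1)
```

The point of hashing is only to turn a large pixel map into something short enough to store and compare; the distinguishing information is entirely in the rendered pixels.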

However, the technique isn’t as frightening as the worst hype suggests. First, it doesn’t uniquely identify a computer. Two machines that have the same model and come from the same shipment, if their preinstalled software hasn’t been modified, should have the same fingerprint. It has to be used together with other identifying markers to narrow down to one machine. Second, there are several ways for software to stop it, including blocking JavaScript from offending domains and disabling part or all of the Canvas API. What gets people upset is that neither blocking cookies nor using a proxy will stop it.

Was including getImageData in the spec a mistake? This can be argued both ways. Its obvious use is to draw a complex canvas once and then rubber-stamp it if you want it to appear multiple times; this can be faster than repeatedly drawing from scratch. It’s unlikely, though, that the designers of the spec thought about its privacy implications.

Tagged: DOM, HTML, html5, JavaScript, W3C
Categories: Planet DigiPres

Netnography and Digital Records: An Interview with Robert Kozinets

The Signal: Digital Preservation - 13 August 2014 - 1:19pm

Robert V. Kozinets, professor of marketing at York University in Toronto

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Online communities, and their digital records, can be a rich source of information, invaluable to academic researchers and to market researchers. In this installment of the Insights Interviews series, I’m delighted to talk with Robert V. Kozinets, professor of marketing at York University in Toronto and the originator of “netnography.”

Julia: In your book “Netnography: Doing Ethnographic Research Online,” you define “netnography” as a “qualitative method devised specifically to investigate the consumer behavior of cultures and communities present on the Internet.”  Can you expand a bit on that definition for us? What is it about online communities that warrants minting a new word for doing ethnographic work online? Further, how would you compare and contrast your approach to other terms like “virtual ethnography”?

Robert: It’s a great question, and one that is difficult to do justice to in a short interview. For readers who are aware of the anthropological technique of ethnography, or participant-observation, it may be fairly easy to grasp that ethnographic work can also be performed in online or social media environments. However, doing ethnographic work on the combination of digital archives and semi-real-time conversations, and much more, that is the Internet is a bit different from, say, traveling to Outer Mongolia to learn about how people live there. The online environment is technologically mediated, it is instantly archived, it is widely accessible, and it is corporately controlled and monitored in ways that face-to-face behavior is not. Netnography is simply a way to approach ethnography online, and it could just as easily be called “virtual,” “digital,” “web,” “mobile” or other kinds of ethnography. The difference, I suppose, is that netnography has been associated with particular research practices in a way that these other terms are not.

Julia: You began implementing netnography as a research method in 1995. The web has changed a good bit since you started doing this work nearly twenty years ago. How has the continued development of web applications and software changed or altered the nature of doing netnographic research? In particular, has the increased popularity of social media (Facebook, Twitter) changed work in studying online communities?

Networking, from user jalroagua on Flickr

Robert: This is a little like asking an experimental researcher if the experiments they run are different if they are running them on children or old people, or if they are experimenting on prisoners in a prison, or students at a party. It is a tactical and operational issue. The guiding principles of netnography are exactly the same whether it is a bulletin board, a blog or Facebook. Fundamental questions of focus, data collection, immersion and participation, analysis, and research presentation are identical.

Julia: How do you suggest finding communities online outside of the relatively basic search operations offered by Google and Yahoo? What are some signs that a particular online community will be a good source for netnographic research?

Robert: There are many search tools that are available, but there is no particular need to go beyond Google or Yahoo. The two keys to netnography are finding particularly interesting and relevant data amongst the load of existing data, and paying particular attention to one’s own role and consciousness as participant in the research process. Whatever tools one chooses to work with, this is time-consuming, painstaking and rewarding work. One thing I would love search engines to be able to do is to include and tag visual, audio and audiovisual material. It would be wonderful to have a search engine that spat out results to a search and gave me, along with website, blog and forum links, a full list of links to Instagram photos, YouTube videos and iTunes podcasts.

Julia: Throughout the book, you reinforce the point that the key to generating insight in netnography is building trust. Can you unpack that a bit? What are some ethical concerns researchers should keep in mind when conducting ethnographic research?

Robert: A range of ethical concerns have been raised about the use of Internet data, many of which have proven over the years to be non-starters. Notions of informed consent can be difficult online, and ethical imperatives can be difficult in environments where the line between public and private is so unclear. However, disclosure of the researcher or the research is not always necessary–it depends always upon the context. As with any research ethics question, it is generally a question of weighing potential benefits against potential risks.

Julia: From your perspective as an ethnographer and market researcher, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for researchers of the future to study this moment in history? Collecting and preserving content isn’t your area, but I’d be interested to hear whether you think there are particular subcultures, movements or content that aren’t getting enough attention.

Robert: I have used the Wayback Machine from time to time to look at snapshots of the Internet of the past. I also recall a recent research project in which we studied bloggers, and in which some interesting blog material was removed shortly after it was posted. It survived only in our fieldnotes, but we had not archived it. Of course, it would be nice to be able to instantly retrieve “the data that got away.” However, in my research, it is the immediate experience of the Internet which matters most.

Given the rapid spread of social media, I believe that the present holds far more information and insight than any other time in the past. There are so many archives of so many particular groups already, and those archives are, in themselves, rather revealing cultural artifacts. The ones I find the most fascinating to study are the archives that groups make of their own activities. So, to answer your last question from a library science perspective, I suppose I would be more interested to see the archives that library science people construct about library science, and how they represent themselves to themselves and to wider audiences of assumed “others,” than I would be in how library science people represent any other group.

Julia: Aside from what to collect, I would be curious to learn a bit more about what kinds of access you think researchers studying digital culture are going to want to have to these collections. How much of this do you think will be focused on close reading of what individual pages and sites looked like and how much on bulk analysis of materials as data?

Robert: I think researchers are hungry for everything. If you ask typical researchers what data they want, they will say everything. That is because, without a specific focus or research question, you want to keep all of your options open. Then the problem becomes what they do with all this data, and they end up with all sorts of big data methods that try to fit as much data as possible into models. My approach is a bit different, in that I am searching for individual experiences online that generate insight. This could come from masses of data, or from one page, one site, even one photograph or one video clip. I think the question of access is tied up with questions of categorizing, interpretation and ownership, and these are all interesting and complex matters that lend themselves to a lot more thought and debate. In the short- to medium-term, what is currently available on the Internet is certainly more than enough for me to work with.

Categories: Planet DigiPres

Coming to "Preserving PDF - identify, validate, repair" in Hamburg?

Open Planets Foundation Blogs - 12 August 2014 - 10:01am

The OPF is holding a PDF event in Hamburg on 1st-2nd September 2014 where we'll be taking an in-depth look at the PDF format, its sub-flavours like PDF/A and open source tools that can help. This is a quick post listing things you can do to prepare for the event if you're attending and looking to get the most out of it.


The Wikipedia entry on PDF provides a readable overview of the format's history with some technical details. Adobe provide a brief PDF 101 post that avoids technical detail.

Johan van der Knijff's OPF blog has a few interesting posts on PDF preservation risks:

This MacTech article is still a reasonable introduction to PDF for developers. Finally, if you really want a detailed look you could try the Adobe specification page, but it's heavyweight reading.


Below are brief details of the main open source tools we'll be working with. It's not essential that you download and install these tools. They all require Java and none of them have user-friendly install procedures. We'll be looking at ways to improve that at the event. We'll also be providing a pre-configured virtual environment to allow you to experiment in a friendly, throwaway environment. See the Software section a little further down.


JHOVE is an open source tool that performs format specific identification, characterisation and validation of digital objects. JHOVE can identify and validate PDF files against the PDF specification while extracting technical and descriptive metadata. JHOVE recognises PDFs that state that they conform to the PDF/A profile, but it can't then validate that a PDF conforms to the PDF/A specification.
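As a rough illustration of what "identification" means at its simplest, the Python sketch below checks a byte string for the `%PDF-` header and a trailing `%%EOF` marker. This is not how JHOVE works internally and is far short of validation; the function name `sniff_pdf`, the 1024-byte tail window, and the sample bytes are assumptions made for the example.

```python
import re

# A PDF normally starts with a header such as "%PDF-1.4".
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf(data: bytes):
    """Return (has_pdf_header, version or None, has_eof_marker). Hypothetical helper."""
    match = _HEADER.match(data)
    version = None
    if match:
        version = f"{match.group(1).decode()}.{match.group(2).decode()}"
    # The end-of-file marker should appear near the end of the file.
    has_eof = b"%%EOF" in data[-1024:]
    return (match is not None, version, has_eof)


# Made-up byte strings, just enough structure to exercise the checks.
sample = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"
truncated = sample[:20]  # header survives, end-of-file marker is lost
```

A file that passes both checks can still be badly broken, which is why dedicated validators exist; a file that fails them is a good candidate for closer inspection.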

Apache Tika

The Apache Foundation's Tika project is an application / toolkit that can be used to identify, parse, extract metadata, and extract content from many file formats.  

Apache PDFBox

Written in Java, Apache PDFBox is an open source library for working with PDF documents. It's primarily aimed at developers but has some basic command line apps. PDFBox also contains a module, with its own command line utility, that verifies PDF/A-1 documents.

These libraries are of particular interest to Java developers, who can incorporate them into their own programs. Apache Tika uses the PDFBox libraries for PDF parsing.

Test Data

These test data sets were chosen because they're freely available. Again it's not necessary to download them before attending but they're good starting points for testing some of the tools or your code:

PDFs from GovDocs selected dataset

The original GovDocs corpus is a test set of nearly 1 million files and is nearly half a terabyte in size. David Tarrant reduced the corpus in size by removing similar items, as described in this post. The remaining data set is still large at around 17GB and can be downloaded here.

Isartor PDF/A test suite

The Isartor test suite is published by the PDF Association's PDF/A competency centre. In their own words:

This test suite comprises a set of files which can be used to check the conformance of software regarding the PDF/A-1 standard. More precisely, the Isartor test suite can be used to “validate the validators”: It deliberately violates the requirements of PDF/A-1 in a systematic way in order to check whether PDF/A-1 validation software actually finds the violations.

More information about the suite can be found on the PDF Association's website along with a download link.

PDFs from OPF format corpus

The OPF has a GitHub repository where members can upload files that represent preservation risks / problems. This has a couple of sub-collections of PDFs: one shows problem PDFs from the GovDocs corpus and the other is a collection of PDFs with features that are "undesirable" in an archive setting.


If you'd like the chance to get hands-on with the software tools at the event and try some interactive demonstrations / exercises, we'll be providing light virtualised demonstration environments using VirtualBox and Vagrant. It's not essential that you install the software to take part but it does offer the best way to try things for yourself, particularly if you're not a techie. These are available for Windows, Mac, and Linux and should run on most people's laptops; download links are shown below.

Vagrant downloads page:

Oracle VirtualBox downloads page:

Be sure to install the VirtualBox extensions as well; it's the same download for all platforms.

What next?

I'll be writing another post for Monday 18th August that will take a look at using some of the tools and test data together with a brief analysis of the results. This will be accompanied by a demonstration virtual environment that you can use to repeat the tests and experiment yourself.

Categories: Planet DigiPres

Networked Youth Culture Beyond Digital Natives: An Interview With danah boyd

The Signal: Digital Preservation - 11 August 2014 - 6:00pm

danah boyd, principal researcher, Microsoft Research, research assistant professor in media, culture and communication at New York University, and fellow with Harvard’s Berkman Center for Internet & Society.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects that led up to CurateCamp Digital Culture in July. This is part of an ongoing series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

How do teens use the internet? For researchers, reporters and concerned parents alike, that question has never been more relevant. Many adults can only guess, or extrapolate based on news reports or their own social media habits. But researcher danah boyd took an old-fashioned but effective approach: she asked them.

I’m delighted to continue our ongoing Insights Interview series today with danah, a principal researcher at Microsoft Research, a research assistant professor in media, culture and communication at New York University, and a fellow at Harvard’s Berkman Center for Internet & Society. For her new book It’s Complicated: The Social Lives of Networked Teens, she spent about eight years studying how teens interact both on- and off-line.

Julia: The preface to your latest book ends by assuring readers that “by and large, the kids are all right.” What do you mean by that?

danah: To be honest, I really struggle with prescriptives and generalizations, but I had to figure out how to navigate those while writing this book.  But this sentence is a classic example of me trying to add nuance to a calming message.  What I really mean by this – and what becomes much clearer throughout the book – is that the majority of youth are as fine as they ever were.  They struggle with stress and relationships.  They get into trouble for teenage things and aren’t always the best equipped for handling certain situations.  But youth aren’t more at-risk than they ever were.  At the same time, there are some youth who are seriously not OK.  Keep in mind that I spend time with youth who are sexually abused and trafficked for a different project.  I don’t want us to forget that there are youth out there that desperately need our attention. Much to my frustration, we tend to focus our attention on privileged youth, rather than the at-risk youth who are far more visible today because of the internet than ever before.

Many parents and young people from the school and nearby communities attend the pie and box supper, given by the school to raise money for additional repairs and supplies. Each box or pie is auctioned off to the highest bidder, sometimes bringing a good deal, since the girl's "boyfriend" usually wins and has the privilege of eating it with her afterwards. Quicksand school, Breathitt County, Kentucky. 1940 Sept.  Farm Security Administration - Office of War Information Photograph Collection, Prints and Photographs.

Photograph from “pie and box supper,” Quicksand school, Breathitt County, Kentucky, September 1940. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints & Photographs Division.

Julia: In a recent article you stated that “social media mirror, magnify, and complicate countless aspects of everyday life, bringing into question practices that are presumed stable and shedding light on contested social phenomena.” Can you expand a bit on this?

danah: When people see things happening online that feel culturally unfamiliar to them, they often think it’s the internet that causes it. Or when they see things that they don’t like – like bullying or racism – they think that the internet has made it worse.  What I found in my research is that the internet offers a mirror to society, good, bad and ugly.  But because that mirror is so publicly visible and because the dynamics cross geographic and cultural boundaries, things start to get contorted in funny ways.  And so it’s important to look at what’s happening underneath the aspect that is made visible through the internet.

Julia: In a recent interview you expressed frustration with how, in the moral panic surrounding social media, “we get so obsessed with focusing on relatively healthy, relatively fine middle- and upper-class youth, we distract ourselves in ways that don’t allow us to address the problems when people actually are in trouble.” What’s at stake when adults and the media misunderstand or misrepresent teen social media use?

danah: We live in a society and as much as we Americans might not like it, we depend on others.  If we want a functioning democracy, we need to make sure that the fabric of our society is strong and healthy.  All too often, in a country obsessed with individualism, we lose track of this.  But it becomes really clear when we look at youth.  Those youth who are most at-risk online are most at-risk offline.  They often come from poverty or experience abuse at home. They struggle with mental health issues or have family members who do.  These youth are falling apart at the seams and we can see it online.  But we get so obsessed with protecting our own children that we have stopped looking out for those in our communities that are really struggling, those who don’t have parents to support them.  The urban theorist Jane Jacobs used to argue that neighborhoods aren’t safe because you have law enforcement policing them; they are safe because everyone in the community is respectfully looking out for one another.  She talked about “eyes on the street,” not as a mechanism of surveillance but as an act of caring.  We need a lot more of that.

Southington, Connecticut. Young people watching a game. 1942 May 23-30. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

Julia: You conduct research on teen behaviors both on and offline. How are physical environments important to understanding mediated practices? What are the limitations to studying online communities solely by engaging with them online?

danah: We’ve spent the last decade telling teenagers that strangers are dangerous, that anyone who approaches them online is a potential predator.  I can’t just reach out to teens online and expect them to respond to me; they think I’m creepy.  Thus, I long ago learned that I need to start within networks of trust. I meet youth through people in their lives, working networks to get to them so that they will trust me and talk about their lives with me. In the process, I learned that I get a better sense of their digital activities by seeing their physical worlds first.  At the same time, I do a lot of online observation and a huge part of my research has been about piecing together what I see online with what I see offline.

Julia: Researchers interested in young people’s social media use today can directly engage with research participants and a wealth of documentation over the web. When researchers look back on this period, what do you think are going to be the most critical source material for understanding the role of social media in youth culture? In that vein, what are some websites/data sets and other kinds of digital material that you think would be invaluable for future researchers to have access to for studying teen culture of today 50 years from now?

El Centro (vicinity), California. Young people at the Imperial County Fair. 1942 Feb.-Mar. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

danah: Actually, to be honest, I think that anyone who looks purely at the traces left behind will be missing the majority of the story.  A lot has changed in the decade in which I’ve been studying youth, but one of the most significant changes has to do with privacy.  When I started this project, American youth were pretty forward about their lives online. By the end, even though I could read what they tweeted or posted on Instagram, I couldn’t understand it.  Teens started encoding content. In a world where they can’t restrict access to content, they restrict access to meaning.  Certain questions can certainly be asked of online traces, but meaning requires going beyond traces.

Julia: Alongside your work studying networked youth culture, you have also played a role in ongoing discussions of the implications of “big data.” Recognizing that researchers now and in the future are likely going to want to approach documentation and records as data sets, what do you think are some of the most relevant issues from your writing on big data for cultural heritage institutions to consider about collecting, preserving and providing access to social media, and other kinds of cultural data?

teenagers and their smartphones visiting a museum by user vilseskogen on Flickr.

danah: One of the biggest challenges that archivists always have is interpretation. Just because they can access something doesn’t mean they have the full context.  They work hard to piece things together to the best that they can, but they’re always missing huge chunks of the puzzle.  I’m always amazed when I sit behind the Twitter firehose to see the stream of tweets that make absolutely no sense.  I think that anyone who is analyzing this data knows just how dirty and confusing it can be.  My hope is that it will force us to think about who is doing the interpreting and how.  And needless to say, there are huge ethical components to that.  This is at the crux of what archivists and cultural heritage folks do.

Julia: You’ve stated that “for all of the attention paid to ‘digital natives’ it’s important to realize that most teens are engaging with social media without any deep understanding of the underlying dynamics or structure.” What role can cultural heritage organizations play in facilitating digital literacy learning?

danah: What I love about cultural heritage organizations is that they are good at asking hard questions, challenging assumptions, questioning interpretations.  That honed skill is at the very center of what youth need to develop.  My hope is that cultural heritage organizations can go beyond giving youth the fruits of their labor and invite them to develop these skills.  These lessons don’t need to be internet-specific. In many ways, they’re a part of what it means to be critically literate, period.

Categories: Planet DigiPres

August Library of Congress Digital Preservation Newsletter is Now Available

The Signal: Digital Preservation - 8 August 2014 - 3:02pm

The August Library of Congress Digital Preservation Newsletter is now available:

Included in this issue:

  • Digital Preservation 2014: It’s a Thing
  • Preserving Born Digital News
  • LOLCats and Libraries with Amanda Brennan
  • Digital Preservation Questions and Answers
  • End-of-Life Care for Aging, Fragile CDs
  • Education Program updates
  • Interviews with Henry Jenkins and Trevor Blank
  • More on Digital Preservation 2014
  • NDSA News, and more
Categories: Planet DigiPres

Cookbooks vs. learning books

File Formats Blog - 8 August 2014 - 12:30pm

A contract lead got me to try learning more about a technology which the client needs. Working from an e-book which I already had, I soon was thinking that it’s a really confused, ad hoc library. But then I remembered having this feeling before, when the problem was really the book. I looked for websites on the software and found one that explained it much better. The e-book had a lot of errors, using JavaScript terminology incorrectly and its own terminology inconsistently.

A feeling came over me, the same horrified realization the translator of To Serve Man had: “It’s a cookbook!” It wasn’t designed to let you learn how the software works, but to get you turning out code as quickly as possible. There are too many of these books, designed for developers who think that understanding the concepts is a waste of time. Or maybe the fault belongs less to the developers than to managers who want results immediately.

A book that introduces a programming language or API needs to start with the lay of the land. What are its basic concepts? How is it different from other approaches? It has to get the terminology straight. If it has functions, objects, classes, properties, and attributes, make it clear what each one is. There should be examples from the start, so you aren’t teaching arid theory, but you need to follow up with an explanation.

If you’re writing an introduction to Java, your “Hello world” example probably has a class, a main() function, and some code to write to System.out. You should at least introduce the concepts of classes, functions, and importing. That’s not the place to give all the details; the best way to teach a new idea is to give a simple version at first, then come back in more depth later. But if all you say is “Compile and run this code, and look, you’ve got output!” then you aren’t doing your job. You need to present the basic ideas simply and clearly, promise more information later, and keep the promise.

Don’t jump into complicated boilerplate before you’ve covered the elements it’s made of. The point of the examples should be to teach the reader how to use the technology, not to provide recipes for specific problems. The problem the developer has to solve is rarely going to be the one in the book. They can tinker with the examples until they fit their own problem, not really understanding them, but that usually results in complicated, inefficient, unmaintainable code.

Expert developers “steal” code too, but we know how it works, so we can take it apart and put it back together in a way that really suits the problem. The books we can learn from are the ones that put the “how it works” first. Cookbooks are useful too, but we need them after we’ve learned the tech, not when we’re trying to figure it out.

Tagged: books, writing
Categories: Planet DigiPres

Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress

The Signal: Digital Preservation - 6 August 2014 - 2:18pm

The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection.

The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the passions the profession shares with the gaming community. Video game enthusiasts have long been fascinated by unreleased games and “lost levels,” gameplay levels which are partially developed but left out of the final release of the game. Discovery is, of course, a key component to gameplay. Players revel in the thrill of unlocking the secret door or uncovering Easter eggs hidden in the game by developers. In many ways, the fascination with obtaining access to unreleased games or levels brings this thrill of discovery into the real world. In a recent article written for The Atlantic, Heidi Kemps discusses the joy in obtaining online access to playable lost levels from the 1992 Sega Genesis game, Sonic The Hedgehog 2, reveling in the fact that access to these levels gave her a glimpse into how this beloved game was made.

Original source disc as it was received by the Library of Congress.

Since 2006, the Moving Image section of the Library of Congress has served as the custodial unit for video games. In this capacity, we receive roughly 400 video games per year through the Copyright registration process, about 99% of which are physically published console games. In addition to the games themselves we sometimes receive ancillary materials, such as printed descriptions of the game, DVDs or VHS cassettes featuring excerpts of gameplay, or the occasional printed source code excerpt. These materials are useful, primarily for their contextual value, in helping to tell the story of video game development in this country and are retained along with the games in the collection.

Several months ago, while performing an inventory of recently acquired video games, I happened upon a DVD-R labeled Duke Nukem: Critical Mass (PSP). My first assumption was that the disc, like so many others we have received, was a DVD-R of gameplay. However, a line of text on the Copyright database record for the item intrigued me. It reads: Authorship: Entire video game; computer code; artwork; and music. I placed the disc into my computer’s DVD drive to discover that the DVD-R did not contain video, but instead a file directory, including every asset used to make up the game in a wide variety of proprietary formats. Upon further research, I discovered that the PlayStation Portable version of Duke Nukem: Critical Mass was never actually released commercially and was in fact a very different beast than the Nintendo DS version of the game, which did see release. I realized then that in my computer was the source disc used to author the UMD for an unreleased PlayStation Portable game. I could feel the lump in my throat. I felt as though I had solved the wizard’s riddle and unlocked the secret door.

Excerpt of code from boot.bin including game text.

The first challenge involved finding a way to access the proprietary Sony file formats contained within the disc, including, but not limited to, graphics files in .gim format and audio files in .AT3 format. I enlisted the aid of Packard Campus Software Developer Matt Derby and we were able to pull the files off of the disc and get a clearer sense of the file structure contained within. Through some research on various PSP homebrew sites we discovered Noesis, a program that would allow us to access the .gim and .gmo files which contain the 3D models and textures used to create the game’s characters and 3D environments. With this program we were able to view a complete 3D view of Duke Nukem himself, soaring through the air on his jetpack and a pre-composite 3D model of one of the game’s nemeses, the Pig Cops. Additionally, we employed Mediacoder and VLC in order to convert the Sony .AT3 (ATRAC3) audio files to MP3 in order to have access to the game’s many music cues.


3D model for Duke Nukem equipped with jetpack. View an animated gif of the model here.

Perhaps the most exciting discovery came when we used a hex editor to access the ASCII text held in the boot.bin folder in the disc’s system directory. Here we located the full text and credit information for the game along with a large chunk of un-obfuscated software code. However, much of what is contained in this folder was presented as compiled binaries. It is my hope that access to both the compiled binaries and ASCII code will allow us to explore future preservation options for video games. Such information becomes even more vital in the case of games such as this Duke Nukem title which were never released for public consumption. In many ways, this source disc can serve as an exemplary case as we work to define preferred format requirements for software received by the Library of Congress. Ultimately, I feel that access to the game assets and source code will prove to be invaluable both to researchers who are interested in game design and mechanics and to any preservation efforts the Library may undertake.
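The kind of text extraction described above can be approximated in a few lines of Python. This is only a sketch of the general technique (pulling runs of printable ASCII out of binary data), not the workflow used at the Library; the sample bytes are invented for illustration and do not come from the disc.

```python
import re


def ascii_strings(data, min_length=4):
    """Return every run of printable ASCII at least min_length bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Invented sample bytes standing in for the contents of a binary file.
SAMPLE = b"\x00\x01GAME TEXT\x00\xff\x10ab\x00Credits\x00"

for text in ascii_strings(SAMPLE):
    print(text)
```

For a real file, the bytes would be read with open(path, "rb").read(), where path is wherever the file has been copied. Short runs such as "ab" are skipped because they fall below the minimum length, which filters out most accidental matches in compiled code.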

Providing access to the disc’s content to researchers will, unfortunately, remain a challenge. As mentioned above, it was difficult enough for Library of Congress staff to view the proprietary formats found on the disc before seeking help from the homebrew community. The legal and logistical hurdles related to providing access to licensed software will continue to present themselves as we move forward but I hope that increased focus on the tremendous research value of such digital assets will allow for these items to be more accessible in the future. For now the assets and code will be stored in our digital archive at the Packard Campus in Culpeper and the physical disc will be stored in temperature-controlled vaults.

The source disc for the PSP version of Duke Nukem: Critical Mass stands out in the video game collection of the Library of Congress as a true digital rarity. In Doug Reside’s recent article “File Not Found: Rarity in the Age of Digital Plenty” (pdf), he explores the notion of source code as manuscript and the concept of digital palimpsests that are created through the various layers that make up a Photoshop document or which are present in the various saved “layers” of a Microsoft Word document. The ability to view the pre-compiled assets for this unreleased game provides a similar opportunity to view the game as a work-in-progress, or at the very least to see the inner workings and multiple layers of a work of software beyond what is presented to us in the final, published version. In my mind, receiving the source disc for an unreleased game directly from the developer is analogous to receiving the original camera negative for an unreleased film, along with all of the separate production elements used to make the film. The disc is a valuable evidentiary artifact and I hope we will see more of its kind as we continue to define and develop our software preservation efforts.

The staff of the Moving Image section would love the opportunity to work with more source materials for games and I hope that game developers who are interested in preserving their legacy will be willing to submit these kinds of materials to us in the future. Though source discs are not currently a requirement for copyright, they are absolutely invaluable in contributing to our efforts towards stewardship and long term access to the documentation of these creative works.

Special thanks to Matt Derby for his assistance with this project and input for this post.

Categories: Planet DigiPres

National Geospatial Advisory Committee: The Shape of Geo to Come

The Signal: Digital Preservation - 5 August 2014 - 1:24pm

World Map 1689 — No. 1 from user caveman_92223 on Flickr.

Back in late June I attended the National Geospatial Advisory Committee (NGAC) meeting here in DC. NGAC is a Federal Advisory Committee sponsored by the Department of the Interior under the Federal Advisory Committee Act. The committee is composed of (mostly) non-federal representatives from all sectors of the geospatial community and features very high profile participants. For example, ESRI founder Jack Dangermond, the 222nd richest American, has been a member since the committee was first chartered in 2008 (his term has since expired). Current committee members include the creator of Google Earth (Michael Jones) and the founder of OpenStreetMap (Steve Coast).

So what is the committee interested in, and how does it coincide with what the digital stewardship community is interested in? There are a number of noteworthy points of intersection:

  • In late March of this year the FGDC released the “National Geospatial Data Asset Management Plan – a Portfolio Management Implementation Plan for the OMB Circular A–16” (pdf). The plan “lays out a framework and processes for managing Federal NGDAs [National Geospatial Data Assets] as a single Federal Geospatial Portfolio in accordance with OMB policy and Administration direction. In addition, the Plan describes the actions to be taken to enable and fulfill the supporting management, reporting, and priority-setting requirements in order to maximize the investments in, and reliability and use of, Federal geospatial assets.”
  • Driven by the release of the NGDA Management Plan, a baseline assessment of the “maturity” of various federal geospatial data assets is currently under way. This includes identifying dataset managers, identifying the sources of data (fed only/fed-state partnerships/consortium/etc.) and determining the maturity level of the datasets across a variety of criteria. With that in mind, several “maturity models” and reports were identified that might prove useful for future work in this area. For example, the state of Utah AGRC has developed a one-page GIS Data Maturity Assessment; the American Geophysical Union has a maturity model for assessing the completeness of climate data records (behind a paywall, unfortunately); the National States Geographic Information Council has a Geospatial Maturity Assessment; and the FGDC has an “NGDA Dataset Maturity Annual Assessment Survey and Tool” that is being developed as part of their baseline assessment. These maturity models have a lot in common with the NDSA Levels of Preservation work.
  • Lots of discussion on a pair of reports on big data and geolocation privacy. The first, Big Data – Seizing Opportunities, Preserving Values Report from the Executive Office of the President, acknowledges the benefits of data but also notes that “big data technologies also raise challenging questions about how best to protect privacy and other values in a world where data collection will be increasingly ubiquitous, multidimensional, and permanent.” The second, the PCAST report on Big Data and Privacy (PCAST is the “President’s Council of Advisors on Science and Technology” and the report is officially called “Big Data: A Technology Perspective”) “begins by exploring the changing nature of privacy as computing technology has advanced and big data has come to the forefront.  It proceeds by identifying the sources of these data, the utility of these data — including new data analytics enabled by data mining and data fusion — and the privacy challenges big data poses in a world where technologies for re-identification often outpace privacy-preserving de-identification capabilities, and where it is increasingly hard to identify privacy-sensitive information at the time of its collection.” The importance of both of these reports to future library and archive collection and access policies regarding data cannot be overstated.
  • The Spatial Data Transfer Standard is being voted on for withdrawal as an FGDC-endorsed standard. FGDC maintenance authority agencies were asked to review the relevance of the SDTS, and they responded that the SDTS is no longer used by their agencies. There’s a Federal Register link to the proposal. The Geography Markup Language (GML), which the FGDC has endorsed, now satisfies the encoding requirements that SDTS once provided. NARA revised their transfer guidance for geospatial information in April 2014 to make SDTS files “acceptable for imminent transfer formats” but it’s clear that they’ve already moved away from them.  As a side note, GeoRSS is coming up for a vote soon to become an FGDC-endorsed standard.
  • The Office of Management and Budget is reevaluating the geospatial professional classification. The geospatial community has an issue similar to that being faced by the library and archives community, in that the jobs are increasingly information technology jobs but are not necessarily classified as such. This coincides with efforts to reevaluate the federal government library position description.
  • The Federal Geographic Data Committee is working with federal partners to make previously-classified datasets available to the public.  These datasets have been prepared as part of the “HSIP Gold” program. HSIP Gold is a compilation of over 450 geospatial datasets of U.S. domestic infrastructure features that have been assembled from a variety of Federal agencies and commercial sources. The work of assembling HSIP Gold has been tasked to the Homeland Infrastructure Foundation-Level Data (HIFLD) Working Group (say it as “high field”). Not all of the data in HSIP Gold is classified, so they are working to make some of the unclassified portions available to the public.

The next meeting of the NGAC is scheduled for September 23 and 24 in Shepherdstown, WV. The meetings are public.

Categories: Planet DigiPres

Making Scanned Content Accessible Using Full-text Search and OCR

The Signal: Digital Preservation - 4 August 2014 - 12:48pm

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would with perfect text but since the content in question is otherwise completely unsearchable anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:


This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future usage but our system also generates two alternative formats for the rest of our system to use:

  • A plain text version for the search engine, which does not understand HTML markup
  • A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
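As a rough illustration of that conversion, the sketch below reads a small hOCR fragment and derives both outputs using only the Python standard library. It is not the WDL code: the sample markup is hand-written, and the JSON layout (a list of objects with "text" and "bbox" keys) is an assumption made for this example. The sketch relies on the hOCR convention that word elements carry the class ocrx_word and a title attribute containing "bbox x0 y0 x1 y1".

```python
import json
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect the text and bounding box of every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished words: {"text": ..., "bbox": [x0, y0, x1, y1]}
        self._bbox = None    # bounding box of the word being read, if any
        self._chunks = []    # text fragments of the word being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = [int(n) for n in match.groups()]
                self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            text = "".join(self._chunks).strip()
            if text:
                self.words.append({"text": text, "bbox": self._bbox})
            self._bbox = None


def hocr_to_outputs(hocr):
    """Return (plain text for the search engine, JSON word coordinates)."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    plain_text = " ".join(word["text"] for word in parser.words)
    return plain_text, json.dumps(parser.words)


# Hand-written hOCR fragment for illustration only.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
 <p class='ocr_par'>
  <span class='ocr_line' title='bbox 100 200 620 240'>
   <span class='ocrx_word' title='bbox 100 200 310 240; x_wconf 91'>VILLAGE</span>
   <span class='ocrx_word' title='bbox 330 200 480 240; x_wconf 88'>FOULA</span>
  </span>
 </p>
</div>
"""

plain_text, coords_json = hocr_to_outputs(SAMPLE_HOCR)
print(plain_text)
print(coords_json)
```

Keeping the archived hOCR as the source of record means both derived files can be regenerated at any time if the formats need to change.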
Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents (i.e. complete records), which are used to build an inverted index: essentially a list of words and the documents which contain them. When you search for “whaling” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TF-IDF) works).
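To make the idea concrete, here is a toy inverted index with a simplified TF-IDF score. It is a teaching sketch rather than Lucene's actual formula: there is no stemming, the tokenizer only lowercases and splits on whitespace, and the three sample documents are invented.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real engines also stem and strip punctuation."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: number of times the term occurs in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, n_docs, query):
    """Rank documents by the sum of term frequency times inverse document frequency."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # Terms found in fewer documents carry more weight.
        idf = 1 + math.log(n_docs / len(postings))
        for doc_id, term_frequency in postings.items():
            scores[doc_id] += term_frequency * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented sample documents.
DOCS = {
    "a": "the whale ship left the harbor",
    "b": "a whale and another whale",
    "c": "the harbor at dawn",
}

index = build_index(DOCS)
results = search(index, len(DOCS), "whale")
print(results)
```

Document "b" ranks above "a" because it uses the term twice, and "c" does not appear at all because the index has no posting for it.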

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
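A grouped query of this kind might be built as follows. The group, group.field and group.limit parameters come from Solr's Result Grouping feature; the host, core name and the item_id field name are placeholders chosen for this sketch, not the WDL configuration.

```python
from urllib.parse import urlencode


def grouped_query_url(base_url, text, group_field="item_id", pages_per_item=3):
    """Build a Solr select URL that groups page-level documents by item."""
    params = {
        "q": text,
        "group": "true",               # turn on result grouping
        "group.field": group_field,    # field shared by all pages of one item
        "group.limit": pages_per_item, # matching pages to return per item
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


# Placeholder Solr endpoint for illustration.
url = grouped_query_url("http://localhost:8983/solr/example/select", "village foula")
print(url)
```

Each group in the response then corresponds to one item, with up to pages_per_item matching page documents listed inside it.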

(The django-haystack Solr grouped search backend with Field Collapsing support used on the World Digital Library site has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:


The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:


2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. Since we often find words in multiple places on the same page and we want to display a large, easily readable section of the page rather than just the word, our image slice will always be the full width of the page starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height is greater than the first third of the page.

Once the image has been loaded, the original text is replaced with the image:


4. Finally, we add a partially transparent overlay over each highlighted word:


  • The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
  • You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this design improves performance because a given image segment can be reused for multiple results on the same page (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
  • This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book, highlighting for the current page will be immediately updated without having to wait for the server to respond.
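The client-side logic in steps 2 and 3, together with the coordinate rounding mentioned above, can be sketched roughly as follows. The real implementation runs as JavaScript in the browser; this Python version only illustrates the logic. The <em> highlight markup, the word list layout, the gap, padding and rounding values, and the reading of "first third of the page" as a cap on slice height are all assumptions made for the example.

```python
import re


def find_word_boxes(snippet, word_coords):
    """Exact-match the highlighted words against the page's word coordinate list."""
    wanted = set(re.findall(r"<em>(.*?)</em>", snippet))
    return [word["bbox"] for word in word_coords if word["text"] in wanted]


def slice_bounds(boxes, page_height, max_gap=100, padding=20):
    """Return (top, bottom) pixel rows of a full-width slice covering nearby hits."""
    if not boxes:
        return None
    ordered = sorted(boxes, key=lambda box: box[1])  # top-most hit first
    max_height = page_height / 3
    top = max(0, ordered[0][1] - padding)
    bottom = ordered[0][3]
    for _x0, y0, _x1, y1 in ordered[1:]:
        if y0 - bottom > max_gap:    # sizable gap before the next hit: stop
            break
        if y1 - top > max_height:    # slice would grow past a third of the page
            break
        bottom = max(bottom, y1)
    return top, min(page_height, bottom + padding)


def round_slice(top, bottom, step=50):
    """Snap top down and bottom up to multiples of step so nearby slices coincide."""
    return (top // step) * step, -(-bottom // step) * step


# Hypothetical word coordinates for one page that is 1500 pixels tall.
WORDS = [
    {"text": "THE", "bbox": [100, 200, 180, 240]},
    {"text": "VILLAGE", "bbox": [200, 200, 410, 240]},
    {"text": "FOULA", "bbox": [430, 200, 580, 240]},
    {"text": "FOULA", "bbox": [120, 300, 260, 340]},
    {"text": "FOULA", "bbox": [300, 1200, 420, 1240]},
]
SNIPPET = "THE <em>VILLAGE</em> <em>FOULA</em>"

boxes = find_word_boxes(SNIPPET, WORDS)
bounds = slice_bounds(boxes, page_height=1500)
print(bounds, round_slice(*bounds))
```

With rounding in place, two searches whose hits fall a few pixels apart request the same image slice, so the second request can be served from a cache instead of the origin server.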


Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

  • The process described above allows the OCR process to be improved considerably. This provides plenty of room to improve results with technical improvements such as more sophisticated image processing, OCR engine training, and workflow systems incorporating human review and correction.
  • For collections such as WDL’s, which include older items, OCR accuracy is reduced by the condition of the materials and by typographic conventions like the long s (ſ) or ligatures which are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
  • Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
  • This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.
Categories: Planet DigiPres

Computational Linguistics & Social Media Data: An Interview with Bryan Routledge

The Signal: Digital Preservation - 1 August 2014 - 1:15pm

Bryan Routledge, Associate Professor of Finance, Tepper School of Business, Carnegie Mellon University.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that utilize a computational linguistic lens to analyze vast quantities of social media data. You can read the article on word choice used in online restaurant reviews here. The article about using Twitter as a predictive tool, as compared with traditional public opinion polls, is here (PDF).

Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.”  Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?

Bryan: Noah Smith, my colleague in the school of computer science at CMU, runs that lab.  He is kind enough to let me hang out over there.  The research we are working on looks at the connection between text and social science (e.g., economics, finance).  The idea is that looking at text through the lens of a forecasting problem — the statistical model between text and some social-science measured variable — gives insight into both the language and social parts.  Online and easily accessed text brings new data to old questions in economics.  More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text.  What words in corporate 10K annual reports correlate with stock volatility and how that changes over time is cool.


Different metaphors for expensive and inexpensive restaurants in Yelp reviews. From: Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19:4.

Julia: Your work with social media (Yelp and Twitter) is notable for its large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?

Bryan: That is a good question.  What makes it very hard for archivists is that collecting the right data without knowing the research questions is hard.  The usual answer of “keep everything!” is impractical.  Google’s n-gram project is a good illustration.  They summarized a huge volume of books with word counts (two word pairs, …) by time.  This is great for some research.  But not for the more recent statistical models that use sentences and paragraph information.
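
To make the trade-off concrete, here is a small sketch of the kind of summary described: counts of two-word pairs grouped by year. The corpus is made up and the function is only an illustration, not how Google built its data set. Once only the counts are kept, the original sentences cannot be reconstructed.

```python
from collections import Counter, defaultdict


def bigram_counts_by_year(documents):
    """Count adjacent word pairs per year; sentence and paragraph structure is discarded."""
    counts = defaultdict(Counter)
    for year, text in documents:
        words = text.lower().split()
        counts[year].update(zip(words, words[1:]))
    return counts


# Made-up corpus of (year, text) pairs
corpus = [
    (1900, "the market fell sharply"),
    (1900, "the market rose"),
]
summary = bigram_counts_by_year(corpus)
print(summary[1900][("the", "market")])
```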

Julia:  Your background and most of your work is in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people. How do you see financial research being influenced by social media and other born digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?

An example from Yelp reviews of Manhattan restaurants with “steak” on the menu: the (log) menu item price is predicted from the words used to describe the item, by location. For instance, in most locations the word “baby” is neutral: it suggests neither a high nor a low price. The exception is the Wall Street area of lower Manhattan, where it is associated with higher-priced steak.

Bryan:  Finance (and economics) is about the collective behavior of large numbers of people in markets.  To make research possible you need simple models of individuals.  Getting the right mix of simplicity and realism is age-old and ongoing research in the area.  More data helps.  Macroeconomic data like GDP and stock returns is informative about the aggregate.  Data on, say, individual portfolio choices in 401K plans lets you refine models.  Social media data is this sort of disaggregated data.  We can get a signal, very noisy, about what is behind an individual decision.  Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.

More generally, working across disciplines is interesting and fun.  It is not always “additive.”  The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items).  But if we can combine, for example, decision theory finance with generative text models, we get some cool insights into purposefully drafted documents.

Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.

Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data!  Given that much of the data is valuable to the companies (and privacy, of course), I understand that is not possible.  But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest.  Twitter has encouraged a whole ecosystem that has helped them grow.  Many companies have an API for that purpose that happens to work nicely for academic research.  In general, open access is most preferred in academic settings so that all researchers have access to the same data.  Interesting papers that rely on proprietary access to Facebook data are less helpful than papers using Twitter data.

Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?

Bryan: This is not my strong suit.  But at a high-level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.”  The “get” varies with the data source (an API).  The “clean” step is just a matter of being careful with special characters and making sure data are lining up into fields right.  If the API is sensible, the “clean” is easy.  We usually store things in a JSON format that is flexible.  This is usually a good format to share data.  The “extract” and “experiment” steps depend on what you are interested in.  Word counts? Phrase counts? Other?  The key is not to jump from “get” to “extract”: storing the data in as raw a form as possible makes things flexible.
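
The steps described can be sketched roughly as follows. This is a minimal illustration with made-up records standing in for an API response; the field names `id` and `text` and the helper names are assumptions, not the research group's actual code.

```python
import json
import re
from collections import Counter


def clean(record):
    """Drop control characters and normalize whitespace so fields line up."""
    text = record.get("text", "")
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return {"id": record["id"], "text": text}


def store(records):
    """Serialize records as JSON Lines: one record per line."""
    return "\n".join(json.dumps(r) for r in records)


def extract_word_counts(jsonl):
    """One possible 'extract' step; others can be run later on the same store."""
    counts = Counter()
    for line in jsonl.splitlines():
        record = json.loads(line)
        counts.update(re.findall(r"[a-z']+", record["text"].lower()))
    return counts


# "get": stand-in for an API response (made-up data)
fetched = [
    {"id": 1, "text": "Great  steak,\tgreat service"},
    {"id": 2, "text": "The steak was cold\n"},
]

stored = store(clean(r) for r in fetched)
counts = extract_word_counts(stored)
print(counts.most_common(2))
```

Keeping the stored records close to the raw text means a later experiment (phrase counts, say) only needs a new extract function, not a new collection run.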

Julia:  What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be able to be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?

Sentiment on Twitter as compared to Gallup Poll. Appeared in From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Bryan: This is also a great question and also one for which I do not have a great answer.  I do not know a lot about the research in “digital humanities,” but that would be a good place to look.  People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask.  Similarly, economic data might provide some hints.  Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP).  The hard part for libraries is figuring out which parts to keep.  Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.

Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?

Bryan: There are many limitations, of course.  Twitter and Yelp are not just providing a window into things, they are changing the way the world works.  “Big data” is not just about larger sample sizes of draws from a fixed distribution.  Things are non-stationary.  (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter.)  Similarly, the data we see in social media is not a representative cross section of society.  But both of these are the sort of things good modeling (statistical, economic) should, and does, aim to capture.  The possibilities of all this new data are exciting.  Language is a rich source of data with challenging models needed to turn it into useful information.  More generally, social media is an integral part of many economic and social transactions.  Capturing that in a tractable model makes for an interesting research agenda.

Categories: Planet DigiPres

Digital Preservation 2014: It’s a Thing

The Signal: Digital Preservation - 30 July 2014 - 12:56pm

“Digital preservation makes headlines now, seemingly routinely. And the work performed by the community gathered here is the bedrock underlying such high profile endeavors.” – Matt Kirschenbaum

The registration table at Digital Preservation 2014. Photo credit: Erin Engle.

The annual Digital Preservation meeting, held each summer in Washington, DC, brings together experts in academia, government and the private and non-profit sectors to celebrate key work and share the latest developments, guidelines, best practices and standards in digital preservation.

Digital Preservation 2014, held July 22-24,  marked the 13th major meeting hosted by NDIIPP in support of the broad community of digital preservation practitioners (NDIIPP held two meetings a year from 2005-2007), and it was certainly the largest, if not the best. Starting with the first combined NDIIPP/National Digital Stewardship Alliance meeting in 2011, the annual meeting has rapidly evolved to welcome an ever-expanding group of practitioners, ranging from students to policy-makers to computer scientists to academic researchers. Over 300 people attended this year’s meeting.

“People don’t need drills; they need holes,” stated NDSA Coordinating Committee chairman Micah Altman, the Director of Research at the Massachusetts Institute of Technology Libraries,  in an analogy to digital preservation in his opening talk. As he went on to explain, no one needs digital preservation for its own sake, but it’s essential to support the rule of law, a cumulative evidence base, national heritage, a strategic information reserve, and to communicate to future generations. It’s these challenges that face the current generation of digital stewardship practitioners, many of which are addressed in the 2015 National Agenda for Digital Stewardship, which Altman previewed during his talk (and which will appear later this fall).

A breakout session at Digital Preservation 2014. Photo credit: Erin Engle.

One of those challenges is the preservation of the software record, which was eloquently illuminated by Matt Kirschenbaum, the Associate Director of the Maryland Institute for Technology in the Humanities, during his stellar talk, “Software, It’s a Thing.” Kirschenbaum ranged widely across computer history, art, archeology and pop culture with a number of essential insights. One of the more piquant was his sorting of software into different categories of “things” (software as asset, package, shrinkwrap, notation/score, object, craft, epigraphy, clickwrap, hardware, social media, background, paper trail, service, big data), each with its own characteristics. As Kirschenbaum noted, software is many different “things,” and we’ll need to adjust our future approaches to preservation accordingly.

Associate Professor at the New School Shannon Mattern took yet another refreshing approach, discussing the aesthetics of creative destruction and the challenges of preserving ephemeral digital art. As she noted, “by pushing certain protocols to their extreme, or highlighting snafus and ‘limit cases’ these artists’ work often brings into stark relief the conventions of preservation practice, and poses potential creative new directions for that work.”

Stephen Abrams, Martin Klein, Jimmy Lin and Michael Nelson during the “Web Archiving” panel. Photo credit: Erin Engle.

These three presentations on the morning of the first day provided a thoughtful intellectual substrate upon which a huge variety of digital preservation tools, services, practices and approaches were elaborated over the following days. As befits a meeting that convenes disparate organizations and interests, collaboration and community were big topics of discussion.

A Tuesday afternoon panel on “Community Approaches to Digital Stewardship” brought together a quartet of practitioners who are working collaboratively to advance digital preservation practice across a range of organizations and structures, including small institutions (the POWRR project); data stewards (the Research Data Alliance); academia (the Academic Preservation Trust); and institutional consortiums (the Five College Consortium).

Later, on the second day, a well-received panel on the “Future of Web Archiving” displayed a number of clever collaborative approaches to capturing the digital materials from the web, including updates on the Memento project and Warcbase, an open-source platform for managing web archives.

CurateCamp: Digital Culture. Photo credit: Erin Engle.

In between there were plenary sessions on stewarding space and research data, and over three dozen lightning talks, posters and breakout sessions covering everything from digital repositories for museum collections to a Brazilian digital preservation network to the debut of a new digital preservation questions and answers tool. Additionally, a CurateCamp unconference on the topic of “Digital Culture” was held on a third day at Catholic University, thanks to the support of the CUA Department of Library and Information Science.

The main meeting closed with a thought-provoking presentation from artist and digital conservator Dragan Espenschied. Espenschied utilized emulation and other novel tools to demonstrate some of the challenges related to presenting works authentically, in particular works from the early web and those dependent on a range of web services. Espenschied, also the Digital Conservator at Rhizome, has an ongoing project, One Terabyte of Kilobyte Age, that explores the material captured in the Geocities special collection. Associated with that project is a Tumblr he created that automatically generates a new screenshot from the Geocities archive collection every 20 minutes.

Web history, data stewardship, digital repositories; for digital preservation practitioners it was nerd heaven. Digital preservation 2014, it’s a thing. Now on to 2015!

Categories: Planet DigiPres

Art is Long, Life is Short: the XFR Collective Helps Artists Preserve Magnetic and Digital Works

The Signal: Digital Preservation - 29 July 2014 - 2:44pm

XFR STN (“Transfer Station”) is a grass-roots digitization and digital-preservation project that arose as a response from the New York arts community to the need to rescue creative works from aging or obsolete audiovisual formats and media. The digital files are stored by the Library of Congress’s NDIIPP partner the Internet Archive and are accessible for free online. At the recent Digital Preservation 2014 conference, the NDSA gave XFR STN the NDSA Innovation Award. Last month, members of the XFR collective (Rebecca Fraimow, Kristin MacDonough, Andrea Callard and Julia Kim) answered a few questions for the Signal.

“VHS 1,” courtesy of Walter Forsberg.

Mike: Can you describe the challenges the XFR Collective faced in its formation?

XFR: Last summer, the New Museum hosted a groundbreaking exhibit called XFR STN.  Initiated by the artist collective Colab and the resulting MWF Video Club, the exhibit was a major success. By the end of the exhibition over 700 videos had been digitized with many available online through the Internet Archive.

It was clear to all of us involved that there was real demand for these services and that many under-served artists were having difficulty preserving and accessing their own media. Many of the people involved with the exhibit became passionate about continuing the service of preserving obsolete magnetic and digital media for artists.  We wanted to offer a long-term, non-commercial, grassroots solution.

Using the experience of working on XFR STN as a jumping-off point, we began developing XFR Collective as a separate nonprofit initiative to serve the need that we saw.  Over the course of our development, we’ve definitely faced — and are still facing — a number of challenges in order to make ourselves effective and sustainable.

“VHS 2,” courtesy of Walter Forsberg.

Perhaps the biggest challenge has simply been deciding what form XFR Collective was going to take.  We started out with a bunch of borrowed equipment and a lot of enthusiasm, so the one thing we knew we could do was digitize, but we had to sit down and really think about things like organizational structure, sustainable pricing for our services, and the convoluted process of becoming a non-profit.

Eventually, we settled on a membership-based structure in order to be able to keep our costs as low as possible.  A lot of how we’re operating is still very experimental — this summer wraps up our six-month test period, during which we limited ourselves to working with only a small number of partners to allow us to figure out what our capacity was and how we could design our projects in the future.

We’ve got a number of challenges still ahead of us — finding a permanent home is a big one — and we still feel like we’re only just getting started, in terms of what we can do for the community of artists who use our services.  It’s going to be interesting for all of us to see how we develop.  We’ve started thinking of ourselves as kind of a grassroots preservation test kitchen. We’ll try almost any kind of project once to see if it works!

Mike: Where are the digital files stored? Who maintains them?

XFR: Our digital files will be stored with the membership organizations and uploaded to the Internet Archive for access and for long-term open-source preservation.  This is an important distinction that may confuse some people: XFR Collective is not an archive.

While we advocate and educate about best practices, we will not hold any of the digital files ourselves; we just don’t have the resources to maintain long-term archival storage.  We encourage material to go onto the Internet Archive because long-term accessibility is part of our mission and because the Internet Archive has the server space to store uncompressed and lossless files as well as access files.  That way if something happens to the storage that our partners are using for their own files, they can always re-download them.  But we can’t take responsibility for those files ourselves. We’re a service point, not a storage repository.

“VHS 3,” courtesy of Walter Forsberg.

Mike: Regarding public access as a means of long-term preservation and sustainability, how do you address copyrighted works?

XFR: This is a great question that confounds a lot of our collaborators initially.  Access-as-preservation creates a lot of intellectual property concerns.  Still, we’re a very small organization, so we can afford to take more risks than a more high-profile institution.  We don’t delve too deeply into the area of copyright; our concern is with the survival of the material.  If someone has a complaint, the Internet Archive will give us a warning in time to re-download the content and then remove it. But so far we haven’t had any complaints.

Mike: What open access tools and resources do you use?

XFR: The Internet Archive itself is something of an open access resource and we’re seeing it used more and more frequently as a kind of accessory to preservation, which is fantastic.  Obviously it’s not the only solution, and you wouldn’t want to rely on that alone any more than you would any kind of cloud storage, but it’s great to have a non-commercial option for streaming and storage that has its own archival mission and that’s open to literally anyone and anything.

Mike:  If anyone is considering a potential collaboration to digitally preserve audiovisual artwork, what can they learn from the experiences of the XFR Collective?

XFR: Don’t be afraid to experiment!  A lot of what we’ve accomplished is just by saying to ourselves that we have to start doing something, and then jumping in and doing it.  We’ve had to be very flexible. A lot of the time we’ll decide something as a set proposition and then find ourselves changing it as soon as we’ve actually talked with our partners and understood their needs.  We’re evolving all the time but that’s part of what makes the work we do so exciting.

We’ve also had a lot of help and we couldn’t have done any of what we’ve accomplished without support and advice from a wide network of individuals, ranging from the amazing team at XFR STN to video archivists across New York City.  None of these collaborations happen in a vacuum, so make friendships, make partnerships, and don’t be nervous about asking for advice.  There are a lot of people out there who care about video preservation and would love to see more initiatives out there working to make it happen.

Categories: Planet DigiPres

The MH17 Crash and Selective Web Archiving

The Signal: Digital Preservation - 28 July 2014 - 4:34pm

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Screenshot of 17 July 2014 15:57 UTC archive snapshot of deleted VKontakte Strelkov blog post regarding downed aircraft, on Internet Archive Wayback Machine.

The Internet Archive Wayback Machine has been mentioned in several news articles within the last week  (see here, here and here) for having archived a since-deleted blog post from a Ukrainian separatist leader touting his shooting down a military transport plane which may have actually been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

Categories: Planet DigiPres

Song identification on GitHub

File Formats Blog - 24 July 2014 - 11:42am

The code for my song identification “nichesourcing” web application is now available on GitHub. It’s currently aimed at one project, as I’d mentioned in my earlier post, but has potential for wide use. It allows the following:

  • Users can register as editors or contributors. Only registered users have access.
  • Editors can post recording clips with short descriptions.
  • Contributors can view the list of clips and enter reports on them.
  • Reports specify type of sound, participants, song titles, and instruments. Contributors can enter as much or as little information as they’re able to.
  • Editors can modify clip metadata, delete clips, and delete reports.
  • Contributors and editors can view reports.
  • More features are planned, including an administrator role.

This is my first PHP coding project of any substance, so I’m willing to accept comments about my overall coding approach. It’s inevitable that, to some degree, I’m writing PHP as if it’s Java. If there are any standard practices or patterns I’m overlooking, let me know.

Tagged: music, software, songid
Categories: Planet DigiPres

Understanding the Participatory Culture of the Web: An Interview with Henry Jenkins

The Signal: Digital Preservation - 24 July 2014 - 10:51am

Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, with USC Annenberg School for Communication and the USC School of Cinematic Arts.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects related to CurateCamp Digital Culture. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Anyone who has ever liked a TV show’s page on Facebook or proudly sported a Quidditch t-shirt knows that being a fan goes beyond the screen or page.  With the growth of countless blogs, tweets, Tumblr gifsets, YouTube videos, Instagram hashtags, fanart sites and fanfiction sites, accessing fan culture online has never been easier. Whether understood as a vernacular web or as the blossoming of a participatory culture, individuals across the world are using the web to respond to and communicate their own stories.

As part of the NDSA Insights interview series, I’m delighted to interview Henry Jenkins, professor at the USC Annenberg School for Communication and self-proclaimed Aca-Fan. He is the author of one of the foundational works exploring fan cultures, “Textual Poachers: Television Fans and Participatory Culture,”  as well as a range of other books, including “Convergence Culture: Where Old and New Media Collide,” and most recently the co-author (with Sam Ford and Joshua Green) of “Spreadable Media: Creating Value and Meaning in a Networked Culture.” He blogs at Confessions of an Aca-Fan.

Julia: You state on your website that your time at MIT, “studying culture within one of the world’s leading technical institutions” gave you “some distinctive insights into the ways that culture and technology are reshaping before our very eyes.”  How so? What are some of the changes you’ve observed, from a technical perspective and/or a cultural one?

Henry: MIT was one of the earliest hubs in the Internet. When I arrived there in 1989, Project Athena was in its prime; the MIT Media Lab was in its first half decade and I was part of a now legendary Narrative Intelligence Reading Group (PDF) which brought together some of the smartest of their graduate students and a range of people interested in new media from across Cambridge; many of the key thinkers of early network culture were regular speakers at MIT; and my students were hatching ideas that would become the basis for a range of Silicon Valley start ups. And it quickly became clear to me that I had a ringside seat for some of the biggest transformations in the media landscape in the past century, all the more so because through my classes, the students were helping me to make connections between my work on fandom as a participatory culture and a wide array of emerging digital practices (from texting to game mods).

Kresge Auditorium, MIT, Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey, Library of Congress Prints and Photographs Division,

Studying games made sense at MIT because “Spacewar,” one of the first known uses of computers for gaming, had been created by the MIT Model Railroad club in the early 1960s. I found myself helping to program a series that the MIT Women’s Studies Program was running on gender and cyberspace, from which the materials for my book, “From Barbie to Mortal Kombat” emerged. Later, I would spend more than a decade as the housemaster of an MIT dorm, Senior House, which is known to be one of the most culturally creative at the Institute.

Through this, I was among the first outside of Harvard to get a Facebook account; I watched students experimenting with podcasting, video-sharing and file-sharing. Having MIT after my name opened doors at all of the major digital companies and so I was able to go behind the scenes as some of these new technologies were developing, and also see how they were being used by my students in their everyday lives.

So, through the years, my job was to place these developments in their historical and cultural contexts — often literally as Media Lab students would come to me for advice on their dissertation projects, but also more broadly as I wrote about these developments through Technology Review, the publication for MIT’s alumni network. It was there where many of the ideas that would form “Convergence Culture” were first shared with my readers. And the students that came through the Comparative Media Studies graduate program have been at ground zero for some of the key developments in the creative industries in recent years — from the Veronica Mars Kickstarter campaign to the community building practices of Etsy, from key developments in the games and advertising industry to cutting edge experiments in transmedia storytelling. The irony is that I had been really reluctant about accepting the MIT job because I suffer from fairly serious math phobia. :-)

Today, I enjoy another extraordinary vantage point as a faculty member at USC, who is embedded in both the Annenberg School of Communication and Journalism and the Cinema School, and thus positioned to watch how Hollywood and American journalism are responding to the changes that networked communication has forced upon them. I am able to work with future filmmakers who are trying to grasp a shift from a focus on individual stories to an emphasis on world-building, journalists who are trying to imagine new relationships with their publics, and activists who are seeking to make change by any media necessary.

Julia: Much of your work has focused on reframing the media audience as active and creative participants in creating media, rather than passive consumers. You’ve critiqued use of the terms “viral” and “memes” to describe internet phenomena as “stripping aside the concept of human agency,” and argued that the biological language “confuses the actual power relations between producers, properties, brands and consumers.” Can you unpack some of your critiques for us? What is at stake?

Henry: At the core of “Spreadable Media” is a shift in how media travels across the culture. On the one hand, there is distribution as we have traditionally understood it in the era of mass media where content flows in patterns regulated by decisions made by major corporations who control what we see, when we see it and under what conditions. On the other hand, there is circulation, a hybrid system, still shaped top-down by corporate players, but also bottom-up by networks of everyday people, who are seeking to move media that is meaningful to them across their social networks, and will take media where they want it when they want it through means both legal and illegal. The shift towards a circulation-based model for media access is disrupting and transforming many of our media-related practices, and it is not explained well by a model which relies so heavily on metaphors of infection and assumptions of irrationality.

The idea of viral media is a way that the broadcasters hold onto the illusion of their power to set the media agenda at a time when that power is undergoing a crisis. They are the ones who make rational calculations, able to design a killer virus which infects the masses, so they construct making something go viral as either arcane knowledge that can be sold at a price by those in the know or as something that nobody understands, “It just went viral!” But, in fact, we are seeing people, collectively and individually, make conscious decisions about what media to pass to which networks for what purposes with what messages attached through which media channels, and we are seeing activist groups, religious groups, indie media producers, educators and fans make savvy decisions about how to get their messages out through networked communications.

Julia: Cases like the Harry Potter Alliance suggest the range of ways that fan cultures on the web function as a significant cultural and political force. Given the significance of fandom, what kinds of records of their online communities do you think will be necessary in the future for us to understand their impact? Said differently, what kinds of records do you think cultural heritage organizations should be collecting to support the study of these communities now and into the future?

Henry: This is a really interesting question. My colleague, Abigail De Kosnik at UC-Berkeley, is finishing up a book right now which traces the history of the fan community’s efforts to archive their own creative output over this period, which has been especially precarious, since we’ve seen some of the major corporations which fans have used to spread their cultural output to each other go out of business and take their archives away without warning or change their user policies in ways that forced massive numbers of people to take down their content.

Image of Paper Print Films in Library of Congress collection. Jenkins notes this collection of prints likely makes it easier to write the history of the first decade of American cinema than to write the history of the first decade of the web.

The reality is that it is probably already easier to write the history of the first decade of American cinema, because of the paper print collection at the Library of Congress, than it is to write the history of the first decade of the web. For that reason, there has been surprisingly little historical research into fandom — even though some of the communication practices that fans use today go back to the publication practices of the Amateur Press Association in the mid-19th century. And even recently, major collections of fan-produced materials have been shunted from library to archive with few in your realm recognizing the value of what these collections contain.

Put simply, many of the roots of today’s more participatory culture can be traced back to fan practices over the last century. Fans have been amongst the leading innovators in terms of the cultural uses of new media. But collecting this material is going to be difficult: fandom is a dispersed but networked community which does not work through traditional organizations; there are no gatekeepers (and few recordkeepers) in fandom, and the scale of fan production (hundreds of thousands if not millions of new works every year) dwarfs that of commercial publishing. And that’s to focus only on fan fiction; it does not even touch the new kinds of fan activism that we are documenting for my forthcoming book, By Any Media Necessary. So, there is an urgent need to archive some of these materials, but the mechanisms for gathering and appraising them are far from clear.

Julia: Your New Media Literacy project aims in part to “provide adults and youth with the opportunity to develop the skills, knowledge, ethical framework and self-confidence needed to be full participants in the cultural changes which are taking place in response to the influx of new media technologies, and to explore the transformations and possibilities afforded by these technologies to reshape education.” In one of your pilot programs, for instance, students studied “Moby-Dick” by updating the novel’s Wikipedia page. Can you tell us a little more about this project? What are some of your goals? Further, what opportunities do you think libraries have to enable this kind of learning?

Henry: We documented this project through our book, “Reading in a Participatory Culture,” and through a free online project, Flows of Reading. It was inspired by the work of Ricardo Pitts-Wiley, the head of the Mixed Magic Theater in Rhode Island, who was spending time going into prisons to get young people to read “Moby-Dick” by getting them to rewrite it, imagining who these characters would be and what issues they would be confronting if they were part of the cocaine trade in the 21st century as opposed to the whaling trade in the 19th century. This resonated with the work I have been doing on fan rewriting and fan remixing practices, as well as what we know about, for example, the ways hip hop artists sample and build on each other’s work.

So, we developed a curriculum which brought together Melville’s own writing and reading practices (as the master mash-up artist of his time) with Pitts-Wiley’s process in developing a stage play that was inspired by his work with the incarcerated youth and with a focus on the place of remix in contemporary culture. We wanted to give young people tools to think ethically and meaningfully about how culture is actually produced and to give teachers a language to connect the study of literature with contemporary cultural practices. Above all, we wanted to help students learn to engage with literary texts creatively as well as critically.

We think libraries can be valuable partners in such a venture, all the more so as regimes of standardized testing make it hard for teachers to bring complex 19th century novels like “Moby-Dick” into their classes or focus student attention on the process and cultural context of reading and writing as literacy practices. Doing so requires librarians to think of themselves not only as curators of physical collections but as mentors and coaches who help students confront the larger resources and practices opened up to them through networked communication. I’ve found librarians and library organizations to be vital partners in this work through the years.

Julia: Your latest book is on the topic of “spreadable media,” arguing that “if it doesn’t spread, it’s dead.”  In a nutshell, how would you define the term “spreadable media”?

Henry:  I talked about this a little above, but let me elaborate. We are proposing spreadable media as an alternative to viral media in order to explain how media content travels across a culture in an age of Facebook, Twitter, YouTube, Reddit, Tumblr, etc. The term emphasizes the act of spreading and the choices which get made as people appraise media content and decide what is worth sharing with the people they know. It places these acts of circulation in a cultural context rather than a purely technological one. At the same time, the word is intended to contrast with older models of “stickiness,” which work on the assumption that value is created by locking down the flow of content and forcing everyone who wants your media to come to your carefully regulated site. This assumes a kind of scarcity where we know what we want and we are willing to deal with content monopolies in order to get it.

But, the reality is that we have more media available to us today than we can process: we count on trusted curators (primarily others in our social networks but also potentially those in your profession) to call media to our attention, and the media needs to be able to move where the conversations are taking place or remain permanently hidden from view. That’s the spirit of “If it doesn’t spread, it’s dead!” If we don’t know about the media, if we don’t know where to find it, if it’s locked down where we can’t easily get to it, it becomes irrelevant to the conversations in which we are participating. Spreading increases the value of content.

Julia: What does spreadable media mean to the conversations libraries, archives and museums could  have with their patrons? How can archives be more inclusive of participatory culture?

Henry:  Throughout the book, we use the term “appraisal” to refer to the choices everyday people make, collectively and personally, about what media to pass along to the people they know. Others are calling this process “curating.” But either way, the language takes us immediately to the practices which used to be the domain of “libraries, archives, and museums.” You were the people who decided what culture mattered, what media to save from the endless flow, what media to present to your patrons. But that responsibility is increasingly being shared with grassroots communities, who might “like” something or “vote something up or down” through their social media platforms, or simply decide to intensify the flow of the content through tweeting about it.

We are seeing certain videos reach incredible levels of circulation without ever passing through traditional gatekeepers. Consider “Kony 2012,” which reached more than 100 million viewers in its first week of circulation, totally swamping the highest grossing film at the box office that week (“Hunger Games”) and the highest viewed series on American television (“Modern Family”), without ever being broadcast in a traditional sense. Minimally, that means that archivists may be confronting new brokers of content, museums will be confronting new criteria for artistic merit, and libraries may be needing to work hand in hand with their patrons as they identify the long-term information needs of their communities. It doesn’t mean letting go of their professional judgement, but it does mean examining their prejudices about what forms of culture might matter and it does mean creating mechanisms, such as those around crowd-sourcing and perhaps even crowd-funding, which help to ensure greater responsiveness to public interests.

Julia: You wrote in 2006 that there is a lack of fan involvement with works of high culture because “we are taught to think about high culture as untouchable,” which in turn has to do with “the contexts within which we are introduced to these texts and the stained glass attitudes which often surround them.” Further, you argue that this lack of a fan culture makes it difficult to engage with a work, either intellectually or emotionally. Can you expand on this a bit? Do you still believe this to be the case, or has this changed with time? Does the existence of transformative works like “The Lizzie Bennet Diaries” on YouTube or vibrant Austen fan communities on Tumblr reveal a shift in attitudes? Finally, how can libraries, museums, and other institutions help foster a higher level of emotional and intellectual engagement?

Henry: Years ago, I wrote “Science Fiction Audiences” with the British scholar John Tulloch in which we explored the broad range of ways that fans read and engaged with “Star Trek” and “Doctor Who.” Tulloch then went on to interview audiences at the plays of Anton Chekhov and discovered a much narrower range of interpretations and meanings: they repeated back what they had been taught to think about the Russian playwright rather than making more creative uses of their experience at the theater. This was probably the opposite of the way many culture brokers think about the high arts (as the place where we are encouraged to think and explore) and popular arts (as works that are dumbed down for mass consumption). This is what I meant when I suggested that the ways we treat these works cut them off from popular engagement.

At the same time, I am inspired by recent experiments which merge the high and the low. I’ve already talked about Mixed Magic’s work with “Moby-Dick,” but “The Lizzie Bennet Diaries” is another spectacular example. It’s inspired to translate Jane Austen’s world through the mechanisms of social media: gossip and scandal play such a central role in her works; she’s so attentive to what people say about each other and how information travels through various social communities. And the playful appropriation and remixing of “Pride and Prejudice” there has opened up Austen’s work to a whole new generation of readers who might otherwise have known it entirely through Sparknotes and plodding classroom instruction. There are certainly other examples of classical creators, from Gilbert and Sullivan to Charles Dickens and Arthur Conan Doyle, who inspire this kind of fannish devotion from their followers, but by and large, this is not the spirit with which these works get presented to the public by leading cultural institutions.

I would love to see libraries and museums encourage audiences to rewrite and remix these works, to imagine new ways of presenting them, which make them a living part of our culture again. Lawrence Levine’s “Highbrow/Lowbrow” contrasts the way people dealt with Shakespeare in the 19th century — as part of the popular culture of the era — with the ways we have assumed across the 20th century that an appreciation of the Bard is something which must be taught because it requires specific kinds of cultural knowledge and specific reading practices. Perhaps we need to reverse the tides of history in this way and bring back a popular engagement with such works.

Julia: You’re a self-described academic and fan, so I’d be interested in what you think are some particularly vibrant fan communities online that scholars should be paying more attention to.

Screenshot of the VlogBrothers, Hank and John Green, as they display a symbol of their channel in a video titled “How To Be a Nerdfighter: A Vlogbrothers FAQ”

Henry: The first thing I would say is that librarians, as individuals, have long been an active presence in the kinds of fan communities I study; many of them write and read fan fiction, for example, or go to fan conventions because they know these as spaces where people care passionately about texts, engage in active debates around their interpretation, and often have deep commitments to their preservation. So, many of your readers will not need me to point out the spaces where fandom is thriving right now; they will know that fans have been a central part of the growth of the Young Adult Novel as a literary category which attracts a large number of adult readers so they will be attentive to “Harry Potter,” “Hunger Games,” or the Nerdfighters (who are followers of the YA novels of John Green); they will know that fans are being drawn right now to programs like “Sleepy Hollow” which have helped to promote more diverse casting on American television; and they will know that now as always science fiction remains a central tool which incites the imagination and creative participation of its readers. The term, Aca-Fan, has been a rallying point for a generation of young academics who became engaged with their research topics in part through their involvement within fandom. Whatever you call them, there needs to be a similar movement to help librarians, archivists and curators come out of the closet, identify as fans, and deploy what they have learned within fandom more openly through their work.

Categories: Planet DigiPres

Future Steward on Stewardship’s Future: An Interview with Emily Reynolds

The Signal: Digital Preservation - 23 July 2014 - 10:44am
Emily Reynolds, Winner of 2014 Future Steward NDSA Innovation Award.

Each year, the NDSA Innovation Working Group reviews nominations from members and non-members alike for the Innovation Awards. Most of those awards are focused on recognizing individuals, projects and organizations that are at the top of their game.

The Future Steward award is a little different. It’s focused on emerging leaders, and while the recipients of the Future Steward award have all made significant accomplishments and achievements, they have done so as students, learners and professionals in the early stages of their careers. Mat Kelly’s work on WARCreate, Martin Gengebach’s work on forensic workflows and now Emily Reynolds’s work in a range of organizations on digital preservation exemplify how some of the most vital work in digital preservation is being taken on and accomplished by some of the newest members of our workforce.

I’m thrilled to be able to talk with Emily, who picked up this year’s Future Steward award yesterday during the Digital Preservation 2014 meeting, about the range of her work and her thoughts on the future of the field. Emily was recognized for the quality of her work in a range of internships and student positions with the Interuniversity Consortium for Political and Social Research, the University of Michigan Libraries, the Library of Congress, Brooklyn Historical Society, StoryCorps, and, in particular, her recent work on the World Bank’s eArchives project.

Screenshot of the Arab American National Museum's web archive collections.

Trevor: You have a bit of experience working with web archives at different institutions: scoping web archive projects with the Arab American National Museum, putting together use cases for the Library of Congress and in your coursework at the University of Michigan. Across these experiences, what are your reflections and thoughts on the state of web archiving for cultural heritage organizations?

Emily: It seems to me that many cultural heritage organizations are still uncertain as to where their web archive collections fit within the broader collections of their organization. Maureen McCormick Harlow, a fellow National Digital Stewardship Resident, often spoke about this dynamic; the collections that she created have been included in the National Library of Medicine’s general catalog. But for many organizations, web collections are still a novelty or a fringe part of the collections, and aren’t as discoverable. Because we’re not sure how the collections will be used, it’s difficult to provide access in a way that will make them useful.

I also think that there’s a bit of a skills gap, in terms of the challenges that web archiving can present, as compared to the in-house technical skills at many small organizations. Tools like Archive-It definitely lower the barrier to entry, but still require a certain amount of expertise for troubleshooting and understanding how the tool works. Even as the tools get stronger, the web becomes more and more complex and difficult to capture, so I can’t imagine that it will ever be a totally painless process.

Trevor: You have worked on some very different born-digital collections, processing born-digital materials for StoryCorps in New York and on a TRAC self-audit at ICPSR, one of the most significant holders of social science data sets. While very different kinds of materials, I imagine there are some similarities there too. Could you tell us a bit about what you did and what you learned working for each of these institutions? Further, I would be curious to hear what kinds of parallels or similarities you can draw from the work.


Image of a StoryCorps exhibit at the New Museum which Emily participated in.

Emily: At StoryCorps, I did a lot of hands-on work with incoming interviews and data, so I saw first-hand the amount of effort that goes into making such complex collections discoverable. Their full interviews are not currently available online, but need to be accessible to internal staff. At ICPSR, I was more on the policy side of things, getting an overview of their preservation activities and documenting compliance with the TRAC standard.

StoryCorps and ICPSR are an interesting pair of organizations to compare because there are some striking similarities in the challenges they face in terms of access. The complexity and variety of research data held by ICPSR requires specialized tools and standards for curation, discovery and reuse. Similarly, oral history interviews can be difficult to discover and use without extensive metadata (including, ideally, full transcripts). They’re specialized types of content, and both organizations have to be innovative in figuring out how to preserve and provide access to their collections.

ICPSR has a strong infrastructure and systems for normalizing and documenting the data they ingest, but this work still requires a great deal of human input and quality control. Similarly, metadata for StoryCorps interviews is input manually by staff. I think both organizations have done great work towards finding solutions that work for their individual context, although the tools for providing access to research data seem to have developed faster than those for oral history. I’m hopeful that with tools like Pop Up Archive that will change.

Trevor: Most recently, you’ve played a leadership role in the development of the World Bank’s eArchives project. Could you tell us about this project a little and suggest some of the biggest things you learned from working on it?

Julia Blase and Emily Reynolds present on “Developing Sustainable Digital Archive Systems” at ALA 2013 Midwinter Meeting. Photo by Jaime McCurry.

Emily: The eArchives program is an effort to digitize the holdings of the World Bank Group Archives that are of greatest interest to researchers. We don’t view our digitization as a preservation action (except insofar as it reduces physical wear and tear on the records), and are primarily interested in providing broader access to the records for our international user base. We’ve scanned around 1500 folders of records at this point, prioritizing records that have been requested by researchers and cleared for public disclosure through the World Bank’s Access to Information Policy.

The project has also included a component of improving the accessibility of digitized records and archival finding aids. We are in the process of launching a public online finding aid portal, using the open-source Access to Memory (AtoM) platform, which will contain the archives’ ISAD(G) finding aids as well as links to the digitized materials. Previously, the finding aids were contained in static HTML pages that needed to be updated manually; soon, the AtoM database will sync regularly with our internal description database. This is going to be a huge upgrade for the archivists, in terms of reducing duplication of work and making their efforts more visible to the public.

It’s been really interesting to collaborate with the archives staff throughout the process of launching our AtoM instance. I’ve been thinking a lot about how compliance with archival standards can actually make records less accessible to the public, since the practices and language involved in finding aids can be esoteric and confusing to an outsider. It has been an interesting balance to ensure that the archivists are happy with the way the descriptions are presented, while also making the site as user-friendly as possible. Anne-Marie Viola, of Dumbarton Oaks, has written a couple of blog posts about the process of conducting usability testing on their AtoM instance, which have been a great resource for me.

Trevor: As I understand it, you are starting out a new position as a program specialist with the Institute of Museum and Library Services. I realize you haven’t started yet, but could you tell us a bit about what you are going to be doing? Along with that, I would be curious to hear you talk a bit about how you see your experience thus far fitting into work on federal funding for libraries and museums?

Emily: As a Program Specialist, I’ll be working in IMLS’s Library Discretionary Programs division, which includes grant programs like the Laura Bush 21st Century Librarian Program and the National Leadership Grants for Libraries. Among other things, I will be supporting the grant review process, communicating with grant applicants, and coordinating grant documentation. I’ll also have the opportunity to participate in some of the outreach that IMLS does with potential and existing grant applicants.

Even though I haven’t been in the profession for a very long time, I’ve had the opportunity to work in a lot of different areas, and as a result feel that I have a good understanding of the broad issues impacting all kinds of libraries today. I’m excited that I’ll be able to be involved in a variety of initiatives and areas, and to increase my involvement in the professional community. I’ve also been spoiled by the National Digital Stewardship Residency’s focus on professional development, and am excited to be moving on to a workplace where I can continue to attend conferences and stay up-to-date with the field.

Trevor: Staffing is a big concern for the future of access to digital information. The NDSA staffing survey gets into a lot of these issues. Based on your experience, what words of advice would you offer to others interested in getting into this field? How important do you think particular technical capabilities are? What made some of your internships better or more useful than others? What kinds of courses do you think were particularly useful? At this point you’ve graduated among a whole cohort of students in your program. What kinds of things do you think made the difference for those who had an easier time getting started in their careers?

Emily: I believe that it is not the exact technical skills that are so important, but the ability to feel comfortable learning new ones, and the ability to adapt what one knows to a particular situation. I wouldn’t expect every LIS graduate to be adept at programming, but they should have a basic level of technical literacy. I took classes in GIS, PHP and MySQL, Drupal and Python, and while I would not consider myself an expert in any of these topics, they gave me a solid understanding of the basics, and the ability to understand how these tools can be applied.

I think it’s also important for recent graduates to be flexible about what types of jobs they apply for, rather than only applying for positions with “Librarian” or “Archivist” in the title. The work we do is applicable in so many roles and types of organizations, and I know that recent grads who were more flexible about their search were generally able to find work more quickly. I enjoyed your recent blog post on the subject of digital archivists as strategists and leaders, rather than just people who work with floppy discs instead of manuscripts. Of course this is easy for me to say, as I move to my first job outside of archives – but I think I’ll still be able to support and participate in the field in a meaningful way.


EaaS: Image and Object Archive — Requirements, Implementation and Example Use-Cases

Open Planets Foundation Blogs - 23 July 2014 - 10:33am
bwFLA's Emulation-as-a-Service (EaaS) makes emulation widely available to non-experts and could establish emulation as a valuable tool in digital preservation workflows. Providing these emulation services to access preserved and archived digital objects poses further challenges for data management. Digital artifacts are usually stored and maintained in dedicated repositories, and object owners want to (or are required to) stay in control of their intellectual property. This article discusses the problem of managing virtual images, i.e. virtual hard disks bootable by an emulator, and derivatives thereof, but the proposed solution can be applied to any digital artifact.

Requirements

Once a digital object is stored in an archive and an appropriate computing environment has been created for access, this environment should be immutable and should not be modified except explicitly through an administrative interface. This guarantees that a memory institution's digital assets are unaltered by the EaaS service and remain available in the future. Immutability, however, is not easy to handle for most emulated environments. Just booting the operating system may change an environment in unpredictable ways. When the emulated software writes parts of this data and reads it again, it probably expects the read data to reflect the modifications. Furthermore, users who want to interact with the environment should be able to change or customize it. Therefore, data connectors have to provide write access for the emulation service without writing the data back to the archive.

The distributed nature of the EaaS approach requires efficient network transport of data to allow for immediate data access and usability. However, digital objects stored in archives can be quite large. For a hard disk image, the installed operating system together with installed software can easily grow to several GB.
Even with today's network bandwidths, copying these digital objects in full to the EaaS service may take minutes and degrades the user experience. While the amount of archived data is usually large, the data that is actually accessed frequently can be very small. In a typical emulator scenario, read access to virtual hard disk images is block-aligned and only very few blocks are actually read by the emulated system. Transferring only these blocks instead of the whole disk image file is typically more efficient, especially for larger files. Therefore, the network transport protocol has to support random seeks and sparse reads without the need to copy the whole data file. While direct filesystem access provides these features if a digital object is locally available to the EaaS service, such access is not available in the general case of separate emulation and archive servers that are connected via the internet.

Implementation

The Network Block Device (NBD) protocol provides a simple client/server architecture that allows direct access to single digital objects as well as random access to the data stream within these objects. Furthermore, it can be completely implemented in userspace and does not require a complex software infrastructure to be deployed to the archives.

In order to access digital objects, the emulation environment needs to reference these objects. Individual objects are identified in the NBD server by unique export names. While the NBD URL schema directly identifies the digital object and the archive where it can be found, the data references are bound to the actual network location. In a long-term preservation scenario, where emulation environments, once curated, should last longer than a single computer system that acts as the NBD server, this approach has obvious drawbacks.
Furthermore, the cloud structure of EaaS allows for interchanging any component that participates in the preservation effort, thus allowing for load balancing and fail-safety. This advantage of distributed systems is offset by static, hostname-bound references.

Handle It!

To detach the references from the object's network location, the Handle System is used as the persistent object identifier throughout our reference implementation. The Handle System provides a complete technological framework to deal with these identifiers (or "Handles" (HDL) in the Handle System) and constitutes a federated infrastructure that allows the resolution of individual Handles using decentralized Handle Services. Each institution that wants to participate in the Handle System is assigned a prefix and can host a Handle Service. Handles are then resolved by a central resolver that forwards requests to these services according to the Handle's prefix. As the Handle System, as a sole technological provider, does not pose any strict requirements on the data associated with Handles, this system was used as the persistent identifier (PI) technology.

Persistent User Sessions and Derivatives

As digital objects (in this case the virtual disk image) are not to be modified directly in the archive by the EaaS service, a mechanism to store modifications locally while reading unchanged data from the archive has to be implemented. Such a transparent write mechanism can be achieved using a copy-on-write access strategy. While NBD allows arbitrary parts of the data to be read upon request, without requiring any data to be provided locally, data that is written through the data connector is tracked and stored in a local data structure. If a read operation requests a part of the data that is already in this data structure, the previously changed version of the data should be returned to the emulation component. Conversely, parts of the data that are not in this data structure were never modified and must be read from the original archive server.
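The copy-on-write read and write rules described above can be illustrated with a minimal Python sketch. It is only a model under stated assumptions: the class names `ArchiveImage` and `CopyOnWriteSession`, the 512-byte block size and the in-memory dictionary are hypothetical and are not part of bwFLA, NBD or QEMU.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class ArchiveImage:
    """Hypothetical read-only stand-in for a disk image held by a remote archive."""

    def __init__(self, data: bytes):
        self._data = data
        self.blocks_fetched = 0  # counts how many blocks were actually requested

    def read_block(self, index: int) -> bytes:
        self.blocks_fetched += 1
        start = index * BLOCK_SIZE
        return self._data[start:start + BLOCK_SIZE]


class CopyOnWriteSession:
    """Keeps written blocks locally; unmodified blocks are read from the archive."""

    def __init__(self, archive: ArchiveImage):
        self._archive = archive
        self._changed = {}  # block index -> locally written block

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("writes in this sketch must be exactly one block")
        self._changed[index] = bytes(data)  # the archive itself is never touched

    def read_block(self, index: int) -> bytes:
        if index in self._changed:
            return self._changed[index]  # previously modified: serve local copy
        return self._archive.read_block(index)  # never modified: ask the archive

    def changed_blocks(self) -> dict:
        """Return the local modifications, e.g. to build a derivative later."""
        return dict(self._changed)


# Example: a four-block image of zeros, with one block modified in the session.
archive = ArchiveImage(bytes(BLOCK_SIZE * 4))
session = CopyOnWriteSession(archive)
session.write_block(2, b"\x01" * BLOCK_SIZE)
modified = session.read_block(2)   # served from the local data structure
untouched = session.read_block(0)  # fetched from the archive on demand
```

In this model the archive sees only the blocks that were read and never modified, which corresponds to the sparse-read requirement, and the archived image stays byte-for-byte unchanged for the whole session.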
Over time, a running user session has its own local version of the data, but only those parts of the data that were written are actually copied. We used the qcow2 container format from the QEMU project to keep track of local changes to the digital object. Besides supporting copy-on-write, it features open documentation as well as a widely used and tested reference implementation with a comprehensive API, the QEMU Block Driver. The qcow2 format stores all changed data blocks and the respective metadata for tracking these changes in a single file. To define where the original blocks (before copy-on-write) can be found, a backing file definition is used. The Block Driver API provides a continuous view of this qcow2 container, transparently choosing either the backing file or the copy-on-write data structures as source.

This mechanism allows modifications of data to be stored separately and independently from the original digital object during an EaaS user session, so that every digital object is kept in the original state in which it was preserved. Once the session has finished, these changes can be retrieved from the emulation component and used to create a new, derived data object.

As any Block Driver format is allowed in the backing file of a qcow2 container, the backing file can itself be a qcow2 container. This allows "chaining" a series of modifications as copy-on-write files that only contain the actually modified data. This greatly facilitates efficient storage of derived environments, as a single qcow2 container can be used directly in a binding without having to combine the original data and the modifications into a consolidated stream of data. However, this makes such bindings rely not only on the availability of the qcow2 container with the modifications, but also on the original data the qcow2 container refers to. Where this dependency is unwanted, consolidation is still possible and directly supported by the tools that QEMU provides to handle qcow2 files.
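How a chain of backing files resolves a read, and what consolidation does to that chain, can be modelled in a few lines of Python. This is a conceptual sketch only: the `Layer` class and `consolidate` function are hypothetical names, the block contents are made up for illustration, and nothing here reads or writes the real qcow2 format, for which the QEMU tools would be used.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Layer:
    """One image in a chain: its own changed blocks plus an optional backing layer."""
    blocks: Dict[int, bytes] = field(default_factory=dict)
    backing: Optional["Layer"] = None

    def read(self, index: int) -> Optional[bytes]:
        # Walk from the newest layer down to the base image and return the
        # first copy of the block that is found.
        layer = self
        while layer is not None:
            if index in layer.blocks:
                return layer.blocks[index]
            layer = layer.backing
        return None  # block was never written anywhere in the chain


def consolidate(top: Layer) -> Layer:
    """Merge a whole chain into one standalone layer without a backing reference."""
    chain = []
    layer = top
    while layer is not None:
        chain.append(layer)
        layer = layer.backing
    merged: Dict[int, bytes] = {}
    for layer in reversed(chain):  # base first, so newer layers overwrite older ones
        merged.update(layer.blocks)
    return Layer(blocks=merged, backing=None)


# Example chain: base image -> derivative with a viewer -> object-specific tweak.
base = Layer({0: b"boot", 1: b"os"})
with_viewer = Layer({1: b"os+viewer"}, backing=base)
customized = Layer({2: b"autostart"}, backing=with_viewer)

standalone = consolidate(customized)
```

Each derivative stores only its own changed blocks, which is why it stays small, but a read may have to travel down the whole chain. The consolidated layer trades that storage saving for independence from the original data.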
Once the data modifications and the changed emulation environment are retrieved after a session, both can be stored again in an archive to make this derivative environment available. Only those chunks of data that were actually changed by the user have to be retrieved. These, however, reference and remain dependent on the original, unmodified digital object. The derivative can then be accessed like any other archived environment. Since all derivative environments contain (stable) references to their backing files, modifications can be stored in a different image archive, as long as the backing file is available. Therefore, each object owner is in charge of providing storage for individualized system environments but is also able to protect its modifications without losing the benefits of shared base images.

Examples and Use-Cases

To provide a better understanding of the image archive implementation, the following three use-cases demonstrate how the current implementation works. First, a so-called derivative is created, a tailored system environment suitable for rendering a specific object. In a second scenario, a container object (CD-ROM) is injected into the environment, which is then modified for object access, i.e. by installing a viewer application and adding the object to the autostart folder. Finally, an existing hard disk image (e.g. an image copy of a real machine) is ingested into the system. This last case requires, besides the technical configuration of the hardware environment, private files to be removed before public access.

Derivatives – Tailored Runtime Environments

Typically, an EaaS provider provides a set of so-called base images. These images contain a basic OS installation which has been configured to run on a certain emulated platform. Depending on the user's requirements, additional software and/or configuration may be required, e.g. the installation of certain software frameworks or text processing or image manipulation software.
This can be done by uploading or otherwise making available a software installation package. On our current demo instance this is done either by uploading individual files or a CD ISO image. Once the software is installed, the modified environment can be saved and made accessible for object rendering or similar purposes.

Object Specific Customization

In the case of complex CD-ROM objects with rich multimedia content from the 90s and early 2000s, e.g. encyclopedias and teaching software, a custom viewer application typically has to be installed to render the content. For these objects, an already prepared environment (installed software, autostart of the application) would be useful and would surely improve the user experience during access, as "implicit" knowledge of how to use an outdated environment is no longer required to make use of the object. Since the number of archived media is large, duplicating for instance a Microsoft Windows environment for every one of them would add a few GB of data to each object. Usually, neither the object’s information content nor the current or expected user demand justifies these extra costs. Using derivatives of base images, however, only a few MB are required for each customized environment, since only the changed parts of the virtual image have to be stored for each object. In the case of the aforementioned collection of multimedia CD-ROMs, the derivative size varies between 348 KB and 54 MB.

Authentic Archiving and Restricted Access to Existing Computers

Sometimes it makes sense to preserve a complete user system, like the personal computer of Vilém Flusser in the Vilém Flusser Archive. Such complete system environments can usually be obtained by creating a hard disk image of the existing computer and using this image as the virtual hard disk for EaaS. Such hard disk images can, however, contain personal data of the computer's owner.
While EaaS aims at providing interactive access to complete software environments, it is impossible to restrict this "interactiveness", e.g. to forbid access to a certain directory directly from the user interface. Instead, our approach to this problem is to create a derivative work with all the personal data stripped from the system. This allows users with sufficient access permissions (e.g. family or close friends) to access the original system including personal data, while the general public only sees a computer with all the personal data removed.

Conclusion

With our distributed architecture and an efficient network transport protocol, we are able to provide Emulation as a Service quite efficiently while at the same time allowing owners of digital objects to remain in complete control of their intellectual property. Using copy-on-write technology, it is possible to create a multitude of different configurations and flavors of the same system with only minimal storage requirements. Derivatives and their respective "parent" system can be handled completely independently of each other, and withdrawing access permissions for a parent will automatically invalidate all existing derivatives. This allows for very efficient and flexible handling of curation processes that involve the installation of (licensed) software, personal information and user customizations.

Open Planets members can test the aforementioned features using the bwFLA demo instance. Get the password here:

Taxonomy upgrade extras: EaaS
Preservation Topics: Emulation

Archiving video

File Formats Blog - 19 July 2014 - 10:59am

Suppose you see a cop beating someone up for jaywalking, or you’re stopped at one of the Border Patrol’s internal checkpoints. You’ve got your camera, phone, or tablet, so you make a video record of the incident. What do you do next? The Activists’ Guide to Archiving Video has some solid advice. Its purpose is to help you “make sure that the video documentation you have created or collected can be used for advocacy, as evidence, for education or historical memory – not just now but into the future.” Most of the advice applies to any video recording that has long-term importance. In essence, it’s the same advice you’d get from Files that Last or from the Library of Congress. It includes considerations that apply especially to sensitive video, such as encryption and information that might put people at risk, but it’s still a valuable addition to anyone’s digital preservation library.

There’s a PDF version of the guide for people who don’t like hopping around web pages. Versions in Spanish and Arabic are also provided.

Tagged: metadata, preservation, video