The Signal: Digital Preservation


Stewarding Early Space Data: An Interview with Emily Frieda Shaw

Emily Frieda Shaw, Head of Preservation and Reformatting at Ohio State University

Preserving and managing research data is a significant concern for scientists and staff at research libraries. With that noted, many likely don’t realize how long valuable scientific data has been accruing on a range of media in research settings. That is, data management often needs to be both backward- and forward-looking, considering a range of legacy media and formats as well as contemporary practice. To that end, I am excited to interview Emily Frieda Shaw, Head of Preservation and Reformatting at Ohio State University (prior to August 2014 she was the Digital Preservation Librarian at the University of Iowa Libraries). Emily talked about her work on James Van Allen’s data from the Explorer satellites launched in the 1950s at the Digital Preservation 2014 conference, and I am excited to explore some of the issues that work raises.

Trevor: Could you tell us a bit about the context of the data you are working with? Who created it, how was it created, what kind of media is it on?

Emily: The data we’re working with was captured on reel-to-reel audio tapes at receiving stations around the globe as Explorer 1 passed overhead in orbit around Earth in the early months of 1958. Explorer predated the founding of NASA and was sent into orbit by a research team led by Dr. James Van Allen, then a Professor of Physics at the University of Iowa, to observe cosmic radiation. Each reel-to-reel Ampex tape contains up to 15 minutes of data on 7 tracks, including time stamps, station identifications and weather reports from station operators, and the “payload” data consisting of clicks, beeps and squeals generated by on-board instrumentation measuring radiation, temperature and micrometeorite impacts.

Once each tape was recorded, it was mailed to Iowa for analysis by a group of graduate students. A curious anomaly quickly emerged: at certain altitudes, the radiation data disappeared. More sensitive instruments sent into orbit by Dr. Van Allen’s team soon after Explorer 1 confirmed what this anomaly suggested: the Earth is surrounded by belts of intense radiation, soon thereafter dubbed the Van Allen Radiation Belts. When the Geiger counter on board Explorer 1 registered no radiation at all, it was in fact overwhelmed by extremely high radiation.

We believe these tapes represent the first data set ever transmitted from outside Earth’s atmosphere. Thanks to the hard work and ingenuity of our friends at The MediaPreserve, and some generous funding from the Carver Foundation, we now have about 2 TB of .wav files converted from the Explorer 1 tapes, as well as digitized lab notebooks and personal journals of Drs. Van Allen and Ludwig, along with graphs, correspondence, photos, films and audio recordings.

In our work with this collection, the biggest discovery was a 700-page report from Goddard composed almost entirely of data tables that represent the orbital ephemeris data set from Explorer 1. This 1959 report was digitized a few years back from the collections at the University of Illinois at Urbana-Champaign as part of the Google Books project and is being preserved in HathiTrust. This data set holds the key to interpreting the signals we hear on the tapes. There are some fascinating interplays between analog and digital, past and present, near and far in this project, and I feel very lucky to have landed in Iowa when I did.

Trevor: What challenges does this data represent for getting it off of its original media and into a format that is usable?

Emily: When my colleagues were first made aware of the Explorer mission tapes in 2009, they had been sitting in the basement of a building on the University of Iowa’s campus for decades. There was significant mold growth on the boxes and the tapes themselves, and my colleagues secured an emergency grant from the state to clean, move and temporarily rehouse the tapes. Three tapes were then sent to The MediaPreserve to see if they could figure out how to digitize the audio signals. Bob Strauss and Heath Condiotte hunted down a huge, of-the-era machine that could play back all of the discrete tracks on these tapes. As I understand it, Heath had to basically disassemble the entire thing and replace all of the transistors before he got it to work properly. Fortunately, we were able to play some of the digitized audio tracks from these test reels for Dr. George Ludwig, one of the key researchers on Dr. Van Allen’s team, before he passed away in 2012. Dr. Ludwig confirmed that they sounded — at least to his naked ear — as they should, so we felt confident proceeding with the digitization.

Explorer I data tape

So, soon after I was hired in 2012, we secured funding from a private foundation to digitize the Explorer 1 tapes and proceeded to courier all 700 tapes to The MediaPreserve for thorough cleaning, rehousing and digital conversion. The grant is also funding the development and design of a web interface to the data and accompanying archival materials, which we [Iowa] hope to launch (pun definitely intended) some time this fall.

Trevor: What stakeholders are involved in the project? Specifically, I would be interested to hear how you are working with scientists to identify what the significant properties of these particular tapes are.

Emily: No one on the project team we assembled within the Libraries has any particular background in near-Earth physics. So we reached out to our colleagues in the University of Iowa Department of Physics, and they have been tremendously helpful and enthusiastic. After all, this data represents the legacy of their profession in a big picture sense, but also, more intimately, the history of their own department (their offices are in Van Allen Hall). Our colleagues in Physics have helped us understand how the audio signals were converted into usable data, what metadata might be needed in order to analyze the data set using contemporary tools and methods, how to package the data for such analysis, and how to deliver it to scientists where they will actually find and be able to use it.

We’re also working with a journalism professor from Northwestern University, who was Dr. Van Allen’s biographer, to weave an engaging (and historically accurate) narrative to tell the Explorer story to the general public.

Trevor: How are you imagining use and access to the resulting data set?

Emily: Unlike the digitized photos, books, manuscripts, music recordings and films we in libraries and archives have become accustomed to working with, we’re not sure how contemporary scientists (or non-scientists) might use a historic data set like this. Our colleagues in Physics have assured us that once we get this data (and accompanying metadata) packaged into the Common Data Format and archived with the National Space Science Data Center, analysis of the data set will be pretty trivial. They’re excited about this and grateful for the work we’re doing to preserve and provide access to early space data, and believe that almost as quickly as we are able to prepare the data set to be shared with the physics community, someone will pick it up and analyze it.

We know that, as the earliest known orbital data set, this holds great historical significance. But the more we learn about Explorer 1, the less confident we are that the data from this first mission is, or was, scientifically significant. The Explorer 1 data — or rather, the points in its orbit during which the instruments recorded no data at all — hinted at a big scientific discovery. But it was really Explorer III, sent into orbit in the summer of 1958 with more sophisticated instrumentation, that produced the data that led to the big “ah-hah” moment. So, we’re hoping to secure funding to digitize the tapes from that mission, which are currently in storage.

I also think there might be some interesting, as-yet-unimagined artistic applications for this data. Some of the audio is really pretty eerie and cool space noise.

Trevor: More broadly, how will this research data fit into the context of managing research data at the university? Is data management something that the libraries are getting significantly involved in? If so could you tell us a bit about your approach.

Emily: The University of Iowa, like all of our peers, is thinking and talking a lot about research data management. The Libraries are certainly involved in these discussions, but as far as I can tell, the focus is, understandably, on active research and is motivated primarily by the need to comply with funding agency requirements. In libraries, archives and museums, many of us are motivated by a moral imperative to preserve historically significant information. However, this ethos does not typically pervade in the realm of active, data-intensive research. Once the big discovery has been made and the papers have been published, archiving the data set is often an afterthought, if not a burden. The fate of the Explorer tapes, left to languish in a damp basement for decades, is a case in point. Time will not be so kind to digital data sets, so we have to keep up the hard work of advocating, educating and partnering with our research colleagues, and building up the infrastructure and services they need to lower the barriers to data archiving and sharing.

Trevor: Backing up out of this particular project, I don’t think I have spoken with many folks with the title “Digital Preservation Librarian.” Other than this, what kinds of projects are you working on and what sort of background did you have to be able to do this sort of work? Could you tell us a bit about what that role means in your case? Is it something you are seeing crop up in many research libraries?

Emily: My professional focus is on the preservation of collections, whether they are manifest in physical or digital form, or both. I’ve always been particularly interested in the overlaps, intersections, and interdependencies of physical/analog and digital information, and motivated to play an active role in the sociotechnical systems that support its creation, use and preservation. In graduate school at the University of Illinois, I worked both as a research assistant with an NSF-funded interdisciplinary research group focused on information technology infrastructure, and in the Library’s Conservation Lab, making enclosures, repairing broken books, and learning the ins and outs of a robust research library preservation program. After completing my MLIS, I pursued a Certificate of Advanced Study in Digital Libraries while working full-time in Preservation & Conservation, managing multi-stream workflows in support of UIUC’s scanning partnership with Google Books.

I came to Iowa at the beginning of 2012 into the newly-created position of Digital Preservation Librarian. My role here has shifted with the needs and readiness of the organization, and has included the creation and management of preservation-minded workflows for digitizing collections of all sorts, the day-to-day administration of digital content in our redundant storage servers, researching and implementing tools and processes for improved curation of digital content, piloting workflows for born-digital archiving, and advocating for ever-more resources to store and manage all of this digital stuff. Also, outreach and inreach have both been essential components of my work. As a profession, we’ve made good progress toward raising awareness of digital stewardship, and many of us have begun making progress toward actually doing something about it, but we still have a long way to go.

And actually, I will be leaving my current position at Iowa at the end of this month to take on a new role as the Head of Preservation and Reformatting for The Ohio State University Libraries. My experience as a hybrid preservationist with understanding and appreciation of both the physical and digital collections will give me a broad lens through which to view the challenges and opportunities for long-term preservation and access to research collections. So, there may be a vacancy for a digital preservationist at Iowa in the near future :)


Upgrading Image Thumbnails… Or How to Fill a Large Display Without Your Content Team Quitting

29 August 2014 - 5:56pm

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

Preservation is usually about maintaining as much information as possible for the future but access requires us to balance factors like image quality against file size and design requirements. These decisions often require revisiting as technology improves and what previously seemed like a reasonable compromise now feels constricting.

I recently ran into an example of this while working on the next version of the World Digital Library website, which still has substantially the same look and feel as it did when the site launched in April of 2009. The web has changed considerably since then, with a huge increase in users on mobile phones and tablets, so the new site uses responsive design techniques to adjust the display for a wide range of screen sizes. Because high-resolution displays are becoming common, this has also involved serving images at larger sizes than in the past — perfectly in keeping with our goal of keeping the focus on the wonderful content provided by WDL partners.

When viewing the actual scanned items, this is a simple technical change to serve larger versions of each but one area posed a significant challenge: the thumbnail or reference image used on the main item page. These images are cropped from a hand-selected master image to provide consistently sized, interesting images which represent the nature of the item – a goal which could not easily be met by an automatic process. Unfortunately the content guidelines used in the past specified a thumbnail size of only 308 by 255 pixels, which increasingly feels cramped as popular web sites feature much larger images and modern operating systems display icons as large as 256×256 or even 512×512 pixels. A “Retina” icon is significantly larger than the thumbnail below:

Icon Sizes

Going back to the source

All new items being processed for WDL now include a reference image at the maximum possible resolution, which the web servers can resize as necessary. This left around 10,000 images which had been processed before the policy changed and nobody wanted to take time away from expanding the collection to reprocess old items. The new site design allows flexible image sizes but we wanted to find an automated solution to avoid a second-class presentation for the older items.

Our original master images are much higher resolution and we had a record of the source image for each thumbnail, but not the crop or rotation settings which had been used to create the original thumbnail. Researching the options for reconstructing those settings led me to OpenCV, a popular open-source computer vision toolkit.

At first glance, the OpenCV template matching tutorial appears to be perfect for the job: give it a source image and a template image and it will attempt to locate the latter in the former. Unfortunately, the way it works is by sliding the template image around the source image one pixel at a time until it finds a close match, a common approach but one which fails when the images differ in size or have been rotated or enhanced.
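Roughly, that sliding-window comparison looks like the following minimal Python sketch (the file names are hypothetical and this is an illustration of the general technique, not the WDL code). Because the template is compared pixel-for-pixel at each position, it only succeeds when the thumbnail and the master share the same scale and orientation:

    import cv2

    # Hypothetical inputs: the old 308x255 thumbnail and its full-resolution master.
    master = cv2.imread("master.jpg", cv2.IMREAD_GRAYSCALE)
    thumb = cv2.imread("thumbnail.jpg", cv2.IMREAD_GRAYSCALE)

    # Slide the thumbnail across the master one position at a time and score each spot.
    scores = cv2.matchTemplate(master, thumb, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, top_left = cv2.minMaxLoc(scores)

    h, w = thumb.shape
    print("best match score:", best_score)
    print("matched region:", top_left, "to", (top_left[0] + w, top_left[1] + h))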

Fortunately, there are far more advanced techniques available for what is known as scale and rotation invariant feature detection and OpenCV has an extensive feature detection suite. Encouragingly, the first example in the documentation shows a much harder variant of our problem: locating a significantly distorted image within a photograph – fortunately we don’t have to worry about matching the 3D distortion of a printed image!

Finding the image

The locate-thumbnail program works in four steps (a rough sketch of the matching steps in code follows the list):

  1. Locate distinctive features in each image, where features are simply mathematically interesting points which will hopefully be relatively consistent across different versions of the image – resizing, rotation, lighting changes, etc.
  2. Compare the features found in each image and attempt to identify the points in common
  3. If a significant number of matches were found, replicate any rotation which was applied to the original image
  4. Generate a new thumbnail at full resolution and save the matched coordinates and rotation as a separate data file in case future reprocessing is required
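A minimal Python sketch of these steps, using OpenCV’s ORB detector, a ratio test and a RANSAC homography; the file names and parameters are hypothetical, and this illustrates the general technique rather than reproducing the locate-thumbnail program itself:

    import cv2
    import numpy as np

    thumb = cv2.imread("thumbnail.jpg", cv2.IMREAD_GRAYSCALE)
    master = cv2.imread("master.jpg", cv2.IMREAD_GRAYSCALE)

    # Step 1: locate distinctive keypoints and descriptors in each image.
    orb = cv2.ORB_create(nfeatures=2000)
    kp_t, des_t = orb.detectAndCompute(thumb, None)
    kp_m, des_m = orb.detectAndCompute(master, None)

    # Step 2: match descriptors, keeping only matches clearly better than the runner-up.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    good = []
    for pair in matcher.knnMatch(des_t, des_m, k=2):
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])

    # Steps 3 and 4: with enough matches, estimate the transform (crop, scale, rotation)
    # and project the thumbnail's corners into the master to recover the crop region.
    if len(good) >= 10:
        src = np.float32([kp_t[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([kp_m[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = thumb.shape
        corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
        print(cv2.perspectiveTransform(corners, H))
    else:
        print("not enough matches; flag this item for manual review")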

You can see this process in the sample visualizations below which have lines connecting each matched point in the thumbnail and full-sized master image:

An Actor in the Role of Sato Norikiyo who Becomes Saigyo: An Actor in the Role of Yoshinaka.

Maps of Ezo, Sakhalin, and Kuril Islands – note the rotation.

The technique even works surprisingly well with relatively low-contrast images such as this 1862 photograph from the Thereza Christina Maria Collection courtesy of the National Library of Brazil where the original thumbnail crop included a great deal of relatively uniform sky or water with few unique points:

“Gloria Neighborhood”

Scaling up

After successful test runs on a small number of images, locate-thumbnail was ready to try against the entire collection. We added a thumbnail reconstruction job to our existing task queue system and over the next week each item was processed using idle time on our cloud servers. Based on the results, some items were reprocessed with different parameters to better handle some of the more unusual images in our collection, such as this example where the algorithm matched only a few points in the drawing, producing an interesting but rather different result:

Vietnam Veterans Memorial, Competition Drawing.

Reviewing the results

Automated comparison

For the first pass of review, we wanted a fast way to compare images which should be very close to identical. For this work, we turned to libphash, which attempts to calculate the perceptual difference between two images so we could find gross failures rather than cases where the original thumbnail had been slightly adjusted or was shifted by an insignificant amount. This approach is commonly used to detect copyright violations but it also works well as a way to quickly and automatically compare images or even cluster a large number of images based on similarity.
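The WDL pass used libphash, but the underlying idea is easy to sketch. In the minimal Python example below (using Pillow, with hypothetical file names), each image is reduced to a tiny grayscale “difference hash,” and the Hamming distance between hashes approximates perceptual similarity, so near-identical pairs score close to zero while gross failures stand out:

    from PIL import Image

    def dhash(path, size=8):
        # Shrink to (size+1) x size grayscale, then record whether each pixel is
        # brighter than its right-hand neighbour: a compact 64-bit fingerprint.
        img = Image.open(path).convert("L").resize((size + 1, size), Image.LANCZOS)
        px = list(img.getdata())
        return [
            1 if px[row * (size + 1) + col] > px[row * (size + 1) + col + 1] else 0
            for row in range(size)
            for col in range(size)
        ]

    def hamming(a, b):
        # Number of differing bits; small distances mean perceptually similar images.
        return sum(x != y for x, y in zip(a, b))

    # Hypothetical pair: an original thumbnail and its reconstruction.
    distance = hamming(dhash("original.jpg"), dhash("reconstructed.jpg"))
    print("flag for human review" if distance > 10 else "close enough")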

A simple Python program was created and run across all of the reconstructed images, reporting the similarity of each pair for human review. The gross failures were used to correct bugs in the reconstruction routine, and the comparison also surfaced a few interesting cases where the thumbnail had been significantly altered, such as this cover page where a stamp added by a previous owner had been digitally removed:

Item 7778: original and reconstructed cover page thumbnails.
http://www.wdl.org/en/item/7778/ now shows that this was corrected to follow the policy of fidelity to the physical item.

Human review

The entire process until this point has been automated but human review was essential before we could use the results. A simple webpage was created which offered fast keyboard navigation and the ability to view sets of images at either the original or larger sizes:

Screenshot of the review interface.

This was used to review items which phash had flagged as matching below a particular threshold, and to randomly sample items to confirm that the phash algorithm wasn’t masking differences which a human would notice.

In some cases where the source image had interacted poorly with the older down-sampling, the results were dramatic – the reviewers reported numerous eye-catching improvements such as this example of an illustration in an Argentinian newspaper:

Illustration from “El Mosquito, March 2, 1879” (original).

Illustration from “El Mosquito, March 2, 1879” (reconstructed).

Conclusion

This project was completed toward the end of this spring and I hope you will enjoy the results when the new version of WDL.org launches soon. On a wider scale, I also look forward to finding other ways to use computer-vision technology to process large image collections – many groups are used to sophisticated bulk text processing, but many of the same approaches are now feasible for image-based collections, and there are a number of interesting possibilities such as suggesting items which are visually similar to the one currently being viewed or using clustering or face detection to review incoming archival batches.

Most of the tools referenced above have been released as open source and are freely available.


Perpetual Access and Digital Preservation at #SAA14

28 August 2014 - 5:50pm
A panel discussion at the SAA 2014 conference. Photo credit: Trevor Owens.

I had the distinct pleasure of moderating the opening plenary session of the Joint Annual Meeting of COSA, NAGARA and SAA in Washington D.C. in early August. The panel was on the “state of access,” and I shared the dais with David Cuillier, an Associate Professor and Director of the University of Arizona School of Journalism, as well as the president of the Society of Professional Journalists; and Miriam Nisbet, the Director of the Office of Government Information Services at the National Archives and Records Administration.

The panel was a great opportunity to tease out the spaces between the politics of “open government” and the technologies of “open data” but our time was much too short and we had to end just when the panelists were beginning to get to the juicy stuff.

There were so many more places we could have taken the conversation:

  • Is our government “transparent enough”? Do we get the “open government” we deserve as (sometimes ill-informed) citizens?
  • What is the role of outside organizations in providing enhanced access to government data?
  • What are the potential benefits of reducing the federal government role in making data available?
  • Is there the right balance between voluntary information openness and the need for the Freedom of Information Act?
  • What are the job opportunities for archivists and records managers in the new “open information” environment?
  • Have you seen positive moves towards addressing digital preservation and stewardship issues regarding government information?

I must admit that when I think of “access” and “open information” I’m thinking almost exclusively about digital data because that’s the sandbox I play in. At past SAA conferences I’ve had the feeling that the discussion of digital preservation and stewardship issues was something that happened in the margins. At this year’s meeting those issues definitely moved to the center of the conversation.

Just look at this list of sessions running concurrently during a single hour on Thursday August 14, merely the tip of the iceberg:

There were also a large number of web archiving-related presentations and panels including the SAA Web Archiving Roundtable meeting (with highlights of the upcoming NDSA Web Archiving Survey report), the Archive-IT meetup and very full panels Friday and Saturday.

I was also pleased to see that the work of NDIIPP and the National Digital Stewardship Alliance was getting recognized and used by many of the presenters. There were numerous references to the 2014 National Agenda for Digital Stewardship and the Levels of Preservation work and many NDSA members presenting and in the audience. You’ll find lots more on the digital happenings at SAA on the #SAA14 twitter stream.

We even got the chance to celebrate our own Trevor Owens as the winner of the SAA Archival Innovator award!

The increased focus on digital is great news for the archival profession. Digital stewardship is an issue where our expertise can really be put to good use and where we can have a profound impact. Younger practitioners have recognized this for years and it’s great that the profession itself is finally getting around to it.


Untangling the Knot of CAD Preservation

27 August 2014 - 3:42pm
T-FLEX-CAD-12-Rus from Wikimedia Commons.

At the 2014 Society of American Archivists meeting, the CAD/BIM Taskforce held a session titled “Frameworks for the Discussion of Architectural Digital Data” to consider the daunting matter of archiving computer-aided design and Building Information Modelling files. This was the latest evidence that — despite some progress in standards and file exchange — archivists and the international digital preservation community at large are trying to get a firm grasp on the slippery topic of preserving CAD files.

CAD is a suite of design tools, software for 3-D modelling, simulation and testing. It is used in architecture, geographic information systems, archaeology, survey data, geophysics, 3-D printing, engineering, gaming, animation and just about any situation that requires a 3-D virtual model. It comprises geometry, intricate calculations, vector graphics and text.

The data in CAD files resides in structurally complex, inter-related layers that are capable of much more than displaying models. For example, engineers can calculate stress and load, volume and weight for specific materials, and the center of gravity, and can visualize cause and effect. Individual CAD files often relate and link to other CAD files to form a greater whole, such as parts of a machine or components in a building. Revisions are quick in CAD’s virtual environment, compared to paper-based designs, so CAD has eclipsed paper as the tool of choice for 3-D modelling.

CAD files — particularly as used by scientists, engineers and architects — can contain vital information. Still, CAD files are subject to the same risk that threatens all digital files, major and minor: failure of accessibility — being stuck on obsolete storage media or dependent on a specific program, in a specific version, on a specific operating system. In particular, the complexity and range of specifications and formats for CAD files make them even more challenging than many other kinds of born-digital materials.

Skylab from NASA.

As for CAD software, commerce thrives on rapid technological change, new versions of software and newer and more innovative software companies. This is the natural evolution of commercial technology. But each new version and type of CAD software increases the risk of software incompatibility and inaccessibility for CAD files created in older versions of software. Vendors, of course, do not have to care about that; the business of business is business — though, in fairness, businesses may continually surpass customer needs and expectations by creating newer and better features. That said, many CAD customers have long realized that it is important — and may someday be crucial — to be able to archive and access older CAD files.

Design for a Flying Machine by Leonardo da Vinci

Building Information Modelling files and Project Lifecycle Management files also require a digital-preservation solution. BIM and PLM integrate all the information related to a major project, not only the CAD files but also the financial, legal, email and other ancillary files.

Part of a digital preservation workflow is compatibility and portability between systems. So one of the most significant standards for the exchange of product manufacturing information of CAD files is ISO 10303, known as the “Standard for the Exchange of Product model data” or STEP. Michael J. Pratt, of the National Institute of Standards and Technology, wrote in 2001 (pdf), “the development of STEP has been one of the largest efforts ever undertaken by ISO.”

The types of systems that use STEP are CAD, computer-aided engineering and computer-aided manufacturing.

CAD rendering of Sialk ziggurat based on archeological evidence from Wikimedia Commons.

Some simple preservation advice comes up repeatedly: save the original CAD file in its original format. Save the hardware, software and system that runs it too, if you can. Save any metadata or documentation, and document a one-to-one relationship between each CAD file and its plotted sheet.

The usual digital-preservation practice applies, which is to organize the files, back up the files to a few different storage devices and put one in a geographically remote location in case of disaster, and every seven years or so migrate to a current storage medium to keep the files accessible. Given the complexity of these files, and recognizing that at its heart digital preservation is an attempt to hedge our bets about mitigating a range of potential risks, it is also advisable to try to generate a range of derivative files which are likely to be more viable in the future. That is, keep the originals, and try to also export to other formats that may lose some functionality and properties but which are far more likely to be able to be opened in the future. The final report from the FACADE project makes this recommendation: “For 3-D CAD models we identified the need for four versions with distinct formats to insure long-term preservation. These are:

1. Original (the originally submitted version of the CAD model)
2. Display (an easily viewable format to present to users, normally 3D PDF)
3. Standard (full representation in preservable standard format, normally IFC or STEP)
4. Dessicated (simple geometry in a preservable standard format, normally IGES)”

CAD files now join paper files — such as drawings, plans, elevations, blueprints, images, correspondence and project records — in institutional archives and firms’ libraries. In addition to the ongoing international work on standards and preservation, there needs to be a dialog with the design-software industry to work toward creating archival CAD files in an open-preservation format. Finally, trained professionals need to make sense of the CAD files to better archive them and possibly get them up and running again for production, academic, legal or other professional purposes. That requires knowledge of CAD software, file construction and digital preservation methods.

Either CAD users need better digital curatorial skills to manage their CAD archives or digital archivists need better CAD skills to curate the archives of CAD users. Or both.


What Do You Do With 100 Million Photos? David A. Shamma and the Flickr Photos Dataset

25 August 2014 - 3:20pm
David Ayman Shamma, a scientist and senior research manager with Yahoo Labs and Flickr. Photo from xeeliz on Flickr.

Every day, people from around the world upload photos to share on a range of social media sites and web applications. The results are astounding; collections of billions of digital photographs are now stored and managed by several companies and organizations. In this context, Yahoo Labs recently announced that they were making a data set of 100 million Creative Commons photos from Flickr available to researchers. As part of our ongoing series of Insights Interviews, I am excited to discuss potential uses and implications for collecting and providing access to digital materials with David Ayman Shamma, a scientist and senior research manager with Yahoo Labs and Flickr.

Trevor: Could you give us a sense of the scope and range of this corpus of photos? What date ranges do they span? The kinds of devices they were taken on? Where they were taken? What kinds of information and metadata they come with? Really, anything you can offer for us to better get our heads around what exactly the dataset entails.

Ayman: There’s a lot to answer in that question. Starting at the beginning, Flickr was an early supporter of the Creative Commons and since 2004 devices have come and gone, photographic volume has increased, and interests have changed.  When creating the large-scale dataset, we wanted to cast as wide a representative net as possible.  So the dataset is a fair random sample across the entire corpus of public CC images.  The photos were uploaded from 2004 to early 2014 and were taken by over 27,000 devices, including everything from camera phones to DSLRs. The dataset is a list of photo IDs with a URL to download a JPEG or video plus some corresponding metadata like tags and camera type and location coordinates.  All of this data is public and can generally be accessed from an unauthenticated API call; what we’re providing is a consistent list of photos in a large, rolled-up format.  We’ve rolled up some but not all of the data that is there.  For example, about 48% of the dataset has longitude and latitude data which is included in the rollup, but comments on the photos have not been included, though they can be queried through the API if someone wants to supplement their research with it.
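To make the shape of such a rollup concrete, here is a purely illustrative Python sketch; the file name, column order and tab-separated layout are assumptions made for the example rather than the dataset’s documented format:

    import csv

    geotagged = total = 0
    # Assumed layout: photo ID, download URL, tags, camera model, longitude, latitude.
    with open("flickr_cc_rollup.tsv", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            total += 1
            longitude, latitude = row[4], row[5]
            if longitude and latitude:
                geotagged += 1

    if total:
        print(f"{geotagged} of {total} records ({geotagged / total:.0%}) have coordinates")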

Data, data, data… A glimpse of a small piece of the dataset. Image shared by aymanshamma on Flickr.

Trevor: In the announcement about the dataset you mention that there is a 12 GB data set, which seems to have some basic metadata about the images and a 50 TB data set containing the entirety of the collection of images. Could you tell us a bit about the value of each of these separately, the kinds of research both enable and a bit about the kinds of infrastructure required to provide access to and process these data sets?

Ayman: Broadly speaking, research on Flickr can be categorized into two non-exclusive topic areas: social computing and computer vision. In the latter, one has to compute what are called ‘features,’ or pixel details about luminosity, texture, cluster and relations to other pixels. The same is true for audio in the videos. In effect, it’s a mathematical fingerprint of the media. Computing these fingerprints can take quite a bit of computational power and time, especially at the scale of 100 million items. While the core dataset of metadata is only 12 GB, a large collection of features reaches into the terabytes. Since these are all CC media files, we thought to also share these computed features. Our friends at the International Computer Science Institute and Lawrence Livermore National Labs were more than happy to compute and host a standard set of open features for the world to use. What’s nice is this expands the dataset’s utility. If you’re from an institution (academic or otherwise), computing the features could require a costly amount of compute time.

A 1 million photo sample of the 48 million geotagged photos from the dataset plotted around the globe. Image shared by aymanshamma on Flickr.

Trevor: The dataset page notes that the dataset has been reviewed to meet “data protection standards, including strict controls on privacy.” Could you tell us a bit about what that means for a dataset like this?

Ayman: The images are all under one of six Creative Commons licenses implemented by Flickr. However, there were additional protections that we put into place. For example, you could upload an image with the license CC Attribution-NoDerivatives and mark it as private. Technically, the image is in the public CC; however, Flickr’s agreement with its users supersedes the CC distribution rights. With that, we only sampled from Flickr’s public collection. There are also some edge cases.  Some photos are public and in the CC but the owner set the geo-metadata to private. Again, while the geo-data might be embedded in the original JPEG and is technically under CC license, we didn’t include it in the rollup.

Trevor: Looking at the Creative Commons page for Flickr, it would seem that this isn’t the full set of Creative Commons images. By my count, there are more than 300 million creative commons licensed photos there. How were the 100 million selected, and what factors went into deciding to release a subset rather than the full corpus?

Ayman: We wanted to create a solid dataset given the potential public dataset size; 100 million seemed like a fair sample size that could bring in close to 50% geo-tagged data and about 800 thousand videos. We envision researchers from all over the world accessing this data, so we did want to account for the overall footprint and feature sizes.  We’ve chatted about the possibility of ‘expansion packs’ down the road, both to increase the size of the dataset and to include things like comments or group memberships on the photos.

Trevor: These images are all already licensed for these kinds of uses, but I imagine that it would have simply been impractical for someone to collect this kind of data via the API. How does this data set extend what researchers could already do with these images based on their licenses? Researchers have already been using Flickr photos as data, what does bundling these up as a dataset do for enabling further or better research?

Ayman: Well, what’s been happening in the past is people have been harvesting the API or crawling the site.  However, there are a few problems with these one-off research collections; the foremost is replication.  By having a large and flexible corpus, we aim to set a baseline reference dataset for others to see if they can replicate or improve upon new methods and techniques.  A few academic and industry players have created targeted datasets for research, such as ImageNet from Stanford or Yelp’s release of its Phoenix-area reviews. Yahoo Labs itself has released a few small targeted Flickr datasets in the past as well.  But in today’s research world, the new paradigm and new research methods require large and diverse datasets, and this is a new dataset to meet the research demands.

Trevor: What kinds of research are you and your colleagues imagining folks will do with these photographs? I imagine a lot of computer science and social network research could make use of them. Are there other areas you imagine these being used in? It would be great if you could mention some examples of existing work that folks have done with Flickr photos to illustrate their potential use.

Ayman: Well, part of the exciting bit is finding new research questions.  In one recent example, we began to examine the shape and structure of events through photos.  Here, we needed to temporally align geo-referenced photos to see when and where a photo was taken. As it turns out, the time the photo was taken and the time reported by the GPS are off by as much as 10 minutes in 40% of the photos.  So, in work that will be published later this year, we designed a method for correcting timestamps that are in disagreement with the GPS time.  It’s not something we would have thought we’d encounter, but it’s an example of what makes a good research question.  With a large corpus available to the research world at-large, we look forward to others also finding new challenges, both immediate and far-reaching.

Trevor: Based on this, and similar webscope data sets, I would be curious for any thoughts and reflections you might offer for libraries, archives and museums looking at making large scale data sets like this available to researchers. Are there any lessons learned you can share with our community?

Ayman: There’s a fair bit of care and precaution that goes into making collections like this; rarely is it ever just a scrape of public data, and ownership and copyright do play a role. These datasets are large collections that reflect people’s practices, behavior and engagement with media like photos, tweets or reviews. So, coming to understand what these datasets mean with regard to culture is something to set our sights on. This applies to the libraries and archives that set out to preserve collections and to researchers and scientists, social and computational alike, who aim to understand them.


Emulation as a Service (EaaS) at Yale University Library

20 August 2014 - 1:35pm

The following is a guest post from Euan Cochrane, ‎Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms.

Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards of these materials we need to have a well-formulated approach to how we can make these legacy works of scholarship accessible.

While there have been significant concerns about the practicality of emulation as a mode of access to legacy software, my personal experience (demonstrated via one of my first websites, about Amiga emulation) has always been contrary to that view. It is with great pleasure that I can now illustrate the practical utility of Emulation as a Service via three recent case studies from my work at Yale University Library. Consideration of interactive artwork from 1997, interactive Hebrew texts from a 2004 CD-ROM and finance data from 1998 illustrates that it’s no longer really a question of if emulation is a viable option for access and preservation, but of how we can go about scaling up these efforts and removing any remaining obstacles to their successful implementation.

At Yale University Library we are conducting a research pilot of the bwFLA Emulation as a Service software framework. This framework greatly simplifies the use of emulators and virtualization tools in a wide range of contexts by abstracting all of the emulator configuration (and its associated issues) away from the end-user. It also simplifies access to emulated environments by making it possible to interact with them from right within your web browser, something that we could only dream of just a few years ago.

At Yale University Library we are evaluating the software against a number of criteria including:

  1. In what use-cases might it be used?
  2. How might it fit in with digital content workflows?
  3. What challenges does it present?

The EaaS software framework shows great promise as a tool for use in many digital content management workflows such as appraisal/selection, preservation and access, but also presents a few unique and particularly challenging issues that we are working to overcome.  The issues are mostly related to copyright and software licensing.  At the bottom of this post I will discuss what these issues are and what we are doing to resolve them, but before I do that let me put this in context by discussing some real-life use-cases for EaaS that have occurred here recently.

It has taken a few months (I started in my position at the Library in September 2013) but recently people throughout the Library system have begun to forward queries to me if they involve anything digital preservation-related. Over the past month or so we have had three requests for access to digital content from the general collections that couldn’t be interacted with using contemporary software.  These requests are all great candidates for resolving using EaaS but, unfortunately (as you will see) we couldn’t do that.

Screenshot of Puppet Motel running in the emulation service using the Basilisk II emulator.

Interactive Artwork, Circa 1997: Use Case One

An Arts PhD student wanted to access an interactive CD-ROM-based artwork (Laurie Anderson’s “Puppet Motel”) from the general collections. The artwork can only be interacted with on old versions of the Apple Mac “classic” operating system.

Fortunately the Digital Humanities Librarian (Peter Leonard) has a collection of old technology and was willing to bring a laptop into the library from his personal collection for the PhD student to use to access it on. This was not an ideal or sustainable solution (what would have happened if Peter’s collection wasn’t available? What happens when that hardware degrades past usability?).

Since responding to this request we have managed to get the Puppet Motel running in the emulation service using the Basilisk II emulator (for research purposes).

This would be a great candidate for accessing via the emulation service. The sound and interaction aspects all work well and it is otherwise very challenging for researchers to access the content.

Screenshot of the virtual machine used to access a CD-ROM that wouldn’t play in the current OS.

Hebrew Texts, Circa 2004: Use Case Two

One of the Judaica librarians needed to access data for a patron, and the data was on a Windows XP CD-ROM (Trope Trainer) from the general collections. The software on the CD would not run on the current Windows 7 operating system that is installed on the desktop PCs here in the library.

The solution we came up with was to create a Windows XP virtual machine for the librarian to have on her desktop. This is a good solution for her as it enables her to print the sections she wants to print and export pdfs for printing elsewhere as needed.

We have since ingested this content into the emulation service for testing purposes. In the EaaS it can run on either VirtualBox, the virtualization software from Oracle (which doesn’t provide full emulation), or QEMU, an emulation and virtualization tool.

It is another great candidate for the service as this version of the content can no longer be accessed on contemporary operating systems and the emulated version enables users to play through the texts and hear them read just as though they were using the CD on their local machine. The ability to easily export content from the emulation service will be added in a future update and will enable this content to become even more useful.

Accessing legacy finance data through a Windows 98 Virtual Machine.

Finance Data, Circa 1998/2003: Use Case Three

A Finance PhD student needed access to data (inter-corporate ownership data) trapped within software within a CD-ROM from the general collection. Unfortunately the software was designed for Windows 98: “As part of my current project I need to use StatCan data saved using some sort of proprietary software on a CD. Unfortunately this software seemed not to be compatible with my version of Windows.” He had been able to get the data out of the disc but couldn’t make any real sense of it without the software: “it was all just random numbers.”

We have recently been developing a collection of old hardware at the Library to support long-term preservation of digital content. Coincidentally, and fortunately, the previous day someone had donated a Windows 98 laptop. Using that laptop we were able to ascertain that the CD hadn’t degraded and the software still worked.  A Windows 98 virtual machine was then created for the student to use to extract the data. Exporting the data to the host system was a challenge. The simplest solution turned out to be having the researcher email the data to himself from within the virtual machine via Gmail using an old web browser (Firefox 2.x).

We were also able to ingest the virtual machine into the emulation service where it can run on either VirtualBox or QEMU.

This is another great candidate for the emulation service. The data is clearly of value but cannot be properly accessed without using the original custom software which only runs on older versions of the Microsoft Windows operating system.

Other uses of the service

In exploring these predictable use-cases for the service, we have also discovered some less-expected scenarios in which the service offers some interesting potential applications. For example, the EaaS framework makes it trivially easy to set up custom environments for patrons. These custom environments take up little space as they are stored as a difference from a base-environment, and they have a unique identifier that can persist over time (or not, as needed).  Such custom environments may be a great way for providing access to sets of restricted data that we are unable to allow patrons to download to their own computers. Being able to quickly configure a Windows 7 virtual machine with some restricted content included in it (and appropriate software for interacting with that content, e.g., an MS Outlook PST archive file with MS Outlook), and provide access to it in this restricted online context, opens entirely new workflows for our archival and special collections staff.
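The bwFLA framework manages these derived environments itself, but the copy-on-write idea behind “stored as a difference from a base-environment” can be sketched with QEMU’s qcow2 overlay images. This is a minimal illustration with hypothetical image names, not the EaaS implementation:

    import subprocess
    import uuid

    def derive_environment(base_image):
        # Create a qcow2 overlay that records only the blocks that differ from the
        # shared base environment; the overlay's name is a persistent unique identifier.
        overlay = f"patron-{uuid.uuid4()}.qcow2"
        subprocess.run(
            ["qemu-img", "create", "-f", "qcow2", "-F", "qcow2", "-b", base_image, overlay],
            check=True,
        )
        return overlay

    # Hypothetical base environment, e.g. Windows 7 with MS Outlook pre-installed.
    print(derive_environment("win7-outlook-base.qcow2"))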

Why we couldn’t use bwFLA’s EaaS

In all three of the use-cases outlined above EaaS was not used as the solution for the end-user. There were two main reasons for this:

  1. We are only in possession of a limited number of physical operating system and application licenses for these older systems. While there is some capacity to use downgrade rights within the University’s volume licensing agreement with Microsoft, with Apple operating systems the situation is much less clear. As a result we are being conservative in our use of the service until we can resolve these issues.
  2. It is not always clear in the license of old software whether this use-case is allowed. Virtualization is rarely (if ever) mentioned in the license agreements. This is likely because it wasn’t very common during the period when much of the software we are dealing with was created. We are working to clarify this point with the General Counsel at Yale and will be discussing it with the software vendors.

Addressing the software licensing challenges

As things stand we are limited in our ability to provide access to EaaS due to licensing agreements (and other legal restrictions) that still apply to the content-supporting operating system and productivity software dependencies. A lot of these dependencies that are necessary for providing access to valuable historic digital content do not have a high economic value themselves.  While this will likely change over time as the value of these dependencies becomes more recognized and the software more rare, it does make for a frustrating situation.  To address this we are beginning to explore options with the software vendors and will be continuing to do this over the following months and years.

We are very interested in the opportunities EaaS offers for opening access to otherwise inaccessible digital assets.  There are many use-cases in which emulation is the only viable approach for preserving access to this content over the long term. Because of this, anything that prevents the use of such services will ultimately lead to the loss of access to valuable and historic digital content, which will effectively mean the loss of that content. Without engagement from software vendors and licensing bodies it may require law change to ensure that this content is not lost forever.

It is our hope that the software vendors will be willing to work with us to save our valuable historic digital assets from becoming permanently inaccessible and lost to future generations. There are definitely good reasons to believe that they will, and so far, those we have contacted have been more than willing to work with us.


Curating Extragalactic Distances: An interview with Karl Nilsen & Robin Dasler

18 August 2014 - 4:54pm
Screenshot of Extragalactic Distance Database Homepage.

While a fair amount of digital preservation focuses on objects that have clear corollaries to objects from our analog world (still and moving images and documents for example), there are a range of forms that are basically natively digital. Completely native digital forms, like database-driven web applications, introduce a variety of challenges for long-term preservation and access. I’m thrilled to discuss just such a form with Karl Nilsen and Robin Dasler from the University of Maryland, College Park. Karl is the Research Data Librarian, and Robin is the Engineering/Research Data Librarian. Karl and Robin spoke on their work to ensure long-term access to the Extragalactic Distance Database at the Digital Preservation 2014 conference.

Trevor: Could you tell us a bit about the Extragalactic Distance Database? What is it? How does it work? Who does it matter to today and who might make use of it in the long term?

Representation of the Extragalactic distance ladder from Wikimedia Commons.

Karl and Robin: The Extragalactic Distance Database contains information that can be used to determine distances between galaxies. For a limited number of nearby galaxies, the distances can be measured directly with a few measurements, but for galaxies beyond these, astronomers have to correlate and calibrate data points obtained from multiple measurements. The procedure is called a distance ladder. From a data curation perspective, the basic task is to collect and organize measurements in such a way that researchers can rapidly collate data points that are relevant to the galaxy or galaxies of interest.
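As a concrete example of one rung of such a ladder (a standard astronomical relation, not something specific to the EDD): for a “standard candle” whose absolute magnitude M has been calibrated against nearer objects, the observed apparent magnitude m yields the distance d through the distance modulus

    m - M = 5 \log_{10}\left(\frac{d}{10\,\mathrm{pc}}\right)

and each calibrated rung extends measurable distances out to the next, more remote set of objects.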

The EDD was constructed by a group of astronomers at various institutions over a period of about a decade and is currently deployed on a server at the Institute for Astronomy at the University of Hawaii. It’s a continuously (though irregularly) updated, actively used database. The technology stack is Linux, Apache, MySQL and PHP. It also has an associated file system that contains FITS files and miscellaneous data and image files. The total system is approximately 500GB.

Extragalactic Distance Database Result table.

The literature mentioning extragalactic or cosmic distance runs to thousands of papers in Google Scholar, and over one hundred papers have appeared with 2014 publication dates. Explicit references to the EDD appear in twelve papers with 2014 publication dates and a little more than seventy papers published before 2014. We understand that some astronomers use the EDD for research that is not directly related to distances simply because of the variety of data compiled into the database. Future use is difficult to predict, but we view the EDD as a useful reference resource in an active field. That being said, some of the data in the EDD will likely become obsolete as new instruments and techniques facilitate more accurate distances, so a curation strategy could include a reappraisal and retirement plan.

Our agreement with the astronomers has two parts. In the first part, we’ll create a replica of the EDD at our institution that can serve as a geographically distinct backup for the system in Hawaii. We’re using rsync for transfer. Our copy will also serve as a test case for digital curation and preservation research. In this period, the copy in Hawaii will continue to be the database-of-record. In the second part, our copy may become the database-of-record, with responsibility for long-term stewardship passing more fully to the University of Maryland Libraries. In general, this project gives us an opportunity to develop and fine-tune curation processes, procedures, policies and skills with the goal of expanding the Libraries’ capacity to support complex digital curation and preservation projects.

Trevor: How did you get involved with the database? Did the astronomers come to you or did you all go to them?

Karl and Robin: One of the leaders of the EDD project is a faculty member at the University of Maryland and he contacted us. We’re librarians on the Research Data Services team and we assist faculty and graduate students with all aspects of data management, curation, publishing and preservation. As a new program in the University Libraries, we actively seek and cultivate opportunities to carry out research and development projects that will let us explore different data curation strategies and practices. In early 2013 we included a brief overview of our interests and capabilities in a newsletter for faculty, and that outreach effort led to an inquiry from the faculty member.

We occasionally hear from other faculty members who have developed or would like to develop databases and web applications as a part of their research, so we expect to encounter similar projects in the future. For that reason, we felt that it was important to initiate a project that involves a database. The opportunities and challenges that arise in the course of this project will inform the development of our services and infrastructure, and ultimately, shape how we support faculty and students on our campus.

Trevor: When you started in on this, were there any other particularly important database preservation projects, reports or papers that you looked at to inform your approach? If so, I’d appreciate hearing what you think the takeaways are from related work in the field and how you see your approach fitting into the existing body of work.

Karl and Robin: Yes, we have been looking at work on database preservation as well as work on curating and preserving complex objects. We’re fortunate that there has been a considerable amount of research and development on database preservation and there is a body of literature available. As a starting point, readers may wish to review:

Some of the database preservation efforts have produced software for digital preservation. For example, readers may wish to look at SIARD (Software Independent Archiving of Relational Databases) or the Database Preservation Toolkit. In general, these tools transform the database content into a non-proprietary format such as XML. However, there are quite a few complexities and trade-offs involved. For example, database management systems provide a wide range of functionality and a high level of performance that may be lost or not easily reconstructed after such transformations. Moreover, these preservation tools may involve dependencies that seem trivial now but could introduce significant challenges in the future. We’re interested in these kinds of tools and we hope to experiment with them, but we recognize that heavily transforming a system for the sake of preservation may not be optimal. So we’re open to experimenting with other strategies for longevity, such as emulation or simply migrating the system to state-of-the-art databases and applications.
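
As a rough illustration of the general strategy these tools embody (exporting database content to a plain, self-describing format), the sketch below serializes a single MySQL table to XML with MySQL Connector/Python. It is not how SIARD or the Database Preservation Toolkit work internally, and the connection details and table name are hypothetical.

```python
"""Sketch: dump one MySQL table to self-describing XML.

Illustrates the "export to a non-proprietary format" idea only; it is not
the SIARD or Database Preservation Toolkit implementation. Connection
parameters and the table name are placeholders."""
import xml.etree.ElementTree as ET
import mysql.connector  # MySQL Connector/Python, a third-party package

def export_table_to_xml(table: str, outfile: str) -> None:
    conn = mysql.connector.connect(
        host="localhost", user="edd_reader", password="secret", database="edd"
    )
    cur = conn.cursor()
    cur.execute(f"SELECT * FROM `{table}`")  # table name taken from a trusted list
    columns = [col[0] for col in cur.description]

    root = ET.Element("table", name=table)
    for row in cur:
        row_el = ET.SubElement(root, "row")
        for name, value in zip(columns, row):
            cell = ET.SubElement(row_el, "column", name=name)
            cell.text = "" if value is None else str(value)

    ET.ElementTree(root).write(outfile, encoding="utf-8", xml_declaration=True)
    cur.close()
    conn.close()

# Hypothetical usage:
# export_table_to_xml("distances", "distances.xml")
```

Note what even a faithful export like this leaves behind: indexes, views, stored routines and the query performance of the running system, which is exactly the trade-off described above.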

Trevor:  Having a fixed thing to preserve makes things a lot easier to manage, but the database you are working with is being continuously updated. How are you approaching that challenge? Are you taking snapshots of it? Managing some kind of version control system? Or something else entirely? I would also be interested in hearing a bit about what options you considered in this area and how you made your decision on your approach.

Karl and Robin: We haven’t made a decision about versioning or version control, but it’s obviously an important policy matter. At this stage, the file system is not a major concern because we expect incremental additions that don’t modify existing files. The MySQL database is another story. If we preserve copies of the database as binary objects, we face the challenge of proliferating versions. That being said, it may not be necessary to preserve a complete history of versions. Readers may be interested to know that we investigated Git for transfer and version control, but discovered that it’s not recommended for large binary files.
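
One possible way to bound the proliferation of versions, offered purely as a sketch of an option the team has not committed to, is to keep a rolling window of dated logical dumps with recorded fixity values rather than a complete version history. Paths, naming and the retention window below are assumptions.

```python
"""Sketch: dated database snapshots with a checksum manifest and pruning.

One possible versioning approach, not a decision the EDD project has made;
paths and the retention policy are placeholders."""
import hashlib
import pathlib

SNAPSHOT_DIR = pathlib.Path("/data/edd-replica/dumps")  # hypothetical
KEEP_LAST = 12  # e.g., retain roughly a year of monthly snapshots

def sha256(path: pathlib.Path) -> str:
    # Stream the file in 1 MB chunks so large dumps don't exhaust memory.
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def update_manifest_and_prune() -> None:
    dumps = sorted(SNAPSHOT_DIR.glob("edd-*.sql.gz"))
    # Record a fixity value for every snapshot inside the retention window.
    manifest = SNAPSHOT_DIR / "manifest.sha256"
    with manifest.open("w") as m:
        for dump in dumps[-KEEP_LAST:]:
            m.write(f"{sha256(dump)}  {dump.name}\n")
    # Retire the oldest snapshots beyond the window.
    for old in dumps[:-KEEP_LAST]:
        old.unlink()

if __name__ == "__main__":
    update_manifest_and_prune()
```

Tools such as Git LFS or git-annex take a related pointer-plus-object-store approach for large binaries, which is part of why plain Git is not recommended for files of this size.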

Trevor: How has your idea of database preservation changed and evolved by working through this project? Are there any assumptions you had upfront that have been challenged?

Karl and Robin: Working with the EDD has forced us to think more about the relationship between preservation and use. The intellectual value of a data collection such as the EDD is as much in the application–joins, conditions, grouping–as in the discrete tables. Our curation and preservation strategy will have to take this fact into account. We expect that data curators, librarians and archivists will increasingly face the difficult task of preservation planning, policy development and workflow design in cases where sustaining the value of data and the viability of knowledge production depends on sustaining access to data, code and other materials as a system. We’re interested to hear from other librarians, archivists and information scientists who are thinking about this problem.

Trevor: Based on this experience, is there a checklist or key questions for librarians or archivists to think through in devising approaches to ensuring long term access to databases?

Karl and Robin: At the outset, the questions that have to be addressed in database preservation are identical to the questions that have to be addressed in any digital preservation project. These have to do with data value, future uses, project goals, sustainability, ownership and intellectual property, ethical issues, documentation and metadata, data quality, technology issues and so on. A couple of helpful resources to consult are:

Databases may complicate these questions or introduce unexpected issues. For example, if the database was constructed from multiple data sources by multiple researchers, which is not unusual, the relevant documentation and metadata may be difficult to compile and the intellectual property issues may be somewhat complicated.

Trevor: Why are the libraries at UMD the place to do this kind of curation and preservation? In many cases scientists have their own data managers, and I imagine there are contributions to this project from researchers at other universities. So what is it that makes UMD the place to do it and how does doing this kind of activity fit into the mission of the university and the libraries in particular?

Karl and Robin: While there are well-funded research projects that employ data managers or dedicated IT specialists, there are far more scientists and scholars who have little or no data management support. The cost of employing a data manager, even part-time, is too great for most researchers and often too great for most collaborations. In addition, while the IT departments at universities provide data storage services and web servers, they are not usually in the business of providing curatorial expertise, publishing infrastructure and long-term preservation and access. Further, while individual researchers recognize the importance of data management to their productivity and impact, surveys show that they have relatively little time available for data curation and preservation. There is also a deficit of expertise in general, though some researchers possess sophisticated data management skills.

Like many academic libraries, the UMD Libraries recognize the importance of data management and curation to the progress of knowledge production, the growth of open science and the success of our faculty and students. We also believe that library and archival science provide foundational principles and sets of practices that can be applied to support these activities. The Research Data Services program is a strategic priority for the University of Maryland Libraries and is highly aligned with the Libraries’ mission to accelerate and support research, scholarship and creativity. We have a cross-functional, interdisciplinary team in the Libraries–made up of subject specialists and digital curation specialists as needed–and partners across the campus, so we can bring a range of perspectives and skills to bear on a particular data curation project. This diversity is, in our view, essential to solving complex data curation and preservation problems.

We have to acknowledge that our work on the EDD involves a number of people in the Libraries. In particular, Jennie Levine Knies, Trevor Muñoz and Ben Wallberg, as well as University of Maryland iSchool students Marlin Olivier and, formerly, Sarah Hovde, have made important contributions to this project.


Research is Magic: An Interview with Ethnographers Jason Nguyen & Kurt Baer

15 August 2014 - 7:54pm

Jason Nguyen and Kurt Baer, PhD students in the Department of Folklore and Ethnomusicology at Indiana University, drawn in the style of “My Little Pony Friendship is Magic”

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

When Hasbro decided to reboot their 1980s “My Little Pony” franchise, who would have guessed that they would give rise to one of the most surprising and interesting fan subcultures on the web? The 2010 animated television series “My Little Pony: Friendship is Magic” has garnered an extremely loyal–and as a 2012 documentary put it, “extremely unexpected”–viewership among adult fans. Known colloquially as “bronies” (a portmanteau of “bro” and “ponies”), these fans are largely treated with fascination and confusion by the mainstream media. All of this interest has resulted in a range of scholars in different fields working to understand this cultural phenomenon.

In this installment of the NDSA Insights Interview series, I talk with Jason Nguyen and Kurt Baer. Both PhD students at Indiana University in the Department of Folklore and Ethnomusicology, Jason and Kurt decided to study this unique subculture. Their website, Research is Magic, is where they conduct their field research, blog about their findings and invite feedback from the community.

Julia: Can you tell me a little bit more about bronies (and pegasisters)? How do they define themselves? How long have these movements been occurring and where are they communicating online? Do you have any sense of how large these communities are?

Jason: An important starting premise for us is that bronies attach a wide variety of different values and identity markers to the label of brony, imagining and experiencing their relationships to one another in multiple ways–sometimes even conflicting ones. Nonetheless, there are some shared histories that nearly all bronies will describe as specific to this community. Specifically, bronies as a concept distinct from My Little Pony fandom arose out of the relaunch/reboot of the Hasbro franchise as My Little Pony: Friendship is Magic in fall 2010. Lauren Faust, particularly known to this group for her work with her husband Craig McCracken on Powerpuff Girls and Foster’s Home for Imaginary Friends, developed the idea and wrote for the show through its first two seasons, and her gender politics have a lot to do with the complex and often non-normative characterization of the ponies. Because of that, bronies will generally start with the content of the show as reason enough for being a fandom: it is smartly written and portrays a positive, socially-oriented world view. Some bronies will portray this oppositionally to other, more negative media, but at the same time, many are involved in multiple fandoms and are often fans of “darker” work as well.

In any case, the label of “brony” has a pretty specific starting point, arising out of the show’s popularity in 2010 on 4chan, which was to some extent ironic, i.e. “Haha, we’re grown men watching a little girls’ show,” though I think the irony of that moment is always overstated (since irony is a useful footing to allow a grown man to watch a little girls’ show if he so desires). Over the following year, the bronies started to overtake 4chan and were kicked out; 4chan eventually opened /mlp/ for them, but the conflict lasted for a few months and was an impetus to organize elsewhere on the web.

At this point, things get more complicated, because people who like FiM search for other fans online, but the cross-demographic appeal means that reasons for being a fan and even ways of being a fan are not necessarily shared in the way you might expect of a more homogenous group. For example, fans coming from other “geek” fandoms are used to the convention scene and fandom as a sort of genre (keeping in touch with friends online, then getting together a few times a year at a convention), but for many bronies, this is the first time they have participated in this kind of mass-mediated imagined community.

Kurt: As far as numbers go, it is really hard to tell how large the brony community is. This is partly due to the varying definitions of what makes a “brony.” However, the brony community (or communities) is quite large and very active both online and off. For instance, Bronycon, the largest brony convention, brought in over 8,000 people last year, Coder Brony’s 2014 herd census received over 18,000 responses from all around the world, and Equestria Daily is, as of now, rapidly approaching 500 million hits on their website. There are brony communities all over Facebook and Reddit (which even has multiple subreddits devoted to sorting out all of the MLP subreddits). There are very active 4chan, Twitter, SoundCloud and DeviantArt communities; brony groups on other online games ranging from Team Fortress to Minecraft to Clash of Clans; over a dozen 24-hour streaming radio stations for Brony music; and major news sites such as Equestria Daily and Everfree that link bronies to relevant information from all over the web. What’s more is that these “communities” are not discrete from one another. People bounce between platforms all of the time, sometimes between different online personas, making coming up with specific numbers very difficult.

Julia: How is your approach to studying bronies similar or different from approaches to studying other fan cultures, and for that matter, any number of other modes of participatory culture?

Jason: In a lot of ways, I don’t think the work we are doing is all that different than many ethnographic studies insofar as the basic process of participant observation is concerned. As for the field of fan/fandom studies, we have thus far not cast our work in that light, though not because of any strong feelings either way. Fandom studies has a strong thread of reception and media studies coming from a more literary and cultural studies perspective that we enjoy but it’s not our theoretical foundation (I’m thinking of Henry Jenkins’ early work, for example).

That emphasis on broad cultural production that I think is heavily influenced by the legacy of the Frankfurt School is perhaps one difference, since we are strongly ethnographic and thus more granular in our approach. That said, many scholars we might read in a fandom studies class have used ethnographic and anthropological methods as well, such as Bonnie Nardi in her great “My Life as a Night Elf Priest” about the “World of Warcraft” fandom.

Kurt: Ultimately, while we might be among the few people researching people and brightly colored ponies on the internet at the moment (that number is always growing), the questions that we are looking to understand and the ways that we are trying to understand them are quite similar to research coming from a long line of ethnographers dating (in the anthropological imagination, at least) all the way back to Bronislaw Malinowski. Perhaps one relatively substantial difference that we have at least been trying for, however, lies in the fact that we are trying to use the blog format to allow for more back-and-forth interaction between us and the people who we are studying/studying with than the traditional ethnographic monograph allows. While many ethnographers (such as Steven Feld in his ethnography “Sound and Sentiment”) are able to get feedback from the people they study with and incorporate that into the writing process (or at least their second editions), we have been trying to find ways to speed up that process of garnering feedback, learning from it, and using that knowledge as a means for further theorization.


Screenshot of the Research is Magic blog, which serves as a space for dialog with research participants.

Julia: You’ve stated that your blog “represents an attempt at participant-observation that collapses the boundaries between academic and interlocutor.” Can you expand on this? What are some of your goals with this blog? Why start your own blog as opposed to gathering data and engaging with bronies on their own virtual “turf,” like websites like Equestria Daily?

Kurt: One important bit of background information that I feel is important to bring up here is that Jason and I both come from fields that focus primarily upon ethnographic research, and in fact, the blog itself was started as part of a course in creative ethnography taught by Dr. Susan Lepselter that Jason and I took at Indiana University. In approaching this research ethnographically, we wanted to be able to ask questions and elicit observations from bronies themselves in addition to analyzing the various other types of “texts” such as the show itself, other websites, and pre-existing conversations. We also wanted to be clear and open about the fact that we are researchers conducting research. We figured that starting our own blog would give us the space that we needed to be able to ask questions and make observations while still being clear about our research and research objectives. Through our interactions with people on social media sites and on places such as Equestria Daily, it has been our hope that the blog becomes a space that is part of different bronies’ “turfs,” where they can go to interact with us and each other and discuss different aspects of being a brony.

As far as our attempts to collapse the boundaries between academic and interlocutor go, one of the things that drew us to the brony community in the first place is that they are already very involved in theorization about themselves and about the show. They talk about what it means to be a brony, provide deep textual analyses of the show and its themes, and grapple with the social implications of liking a show that some people think that they shouldn’t. Rather than us going into the “field,” collecting data about bronies, and then returning to write that information up in an article to be published in an academic journal, we hoped to create a space where we can theorize together and where all of the observations and ideas would be available in the same space to serve as material for more conversation and theorization.

Jason: Another way to think about this is that there is nothing more brony-like than to start a space of your own online. As Kurt has recounted above, bronies have been quite prolific in their production of cyberspaces for communal interaction, and not all of them are big like Equestria Daily. Of course there are always the YouTube stars and Twitter celebrities of any mass-media fandom, but the more mundane spaces are equally important, and the process of making a website, maintaining a Twitter profile, etc.–in short, creating a presentation of self as brony researchers amongst other people similarly engaged in a presentation of self as bronies–has been invaluable in our experience of the “participant” part of participant-observation. We both have web presences, as most bronies do before they join the fandom, but many choose to create fandom-specific identities, and that means anchoring those identities somewhere; we’ve in part chosen to anchor our brony-related identities on the website.


Photoshop of the MLP:FiM villain Discord with the intellectual hero Michel Foucault by Jason

With all that said, we do spend a lot of time investigating bronies in other spaces and in less explicitly theoretical ways. We live-tweet (tweeting comments about something as it occurs) new episodes from time to time, which is a really fun experience that lets us interact with both fans and show staff alike. I have drawn fan art and Kurt has made fan music that we have shared via Twitter, Reddit and our site.

So we like to think that we are doing both things at the same time. Of course it is important for anyone doing anthropologically informed ethnography to meet people where they are and explore their lives as they lead them, but at the same time, many fans have shown an interest in a space where they can read about and join in conversations that marry explicit theorization with personal observations of their fandom, and the “Research Is Magic” blog produces a hybrid narrative framing that, as far as we could find, did not previously exist in either academic or brony fandom spaces.

Julia: One of the reasons bronies as a group are so interesting is because they appear to subvert both gender and age norms. But you argue that “an analytical orientation that positions bronies as resisters trivializes their rich social interactions and effaces complicated power dynamics within and peripheral to the fandom.” That’s some dense language! Can you unpack this a bit for us?

Kurt: Essentially, our argument here is one against the tendency to find resistance and subversion and then get carried away insisting on interpreting everything about the group in that light. There is certainly some very interesting subversion of age and gender norms going on in the fandom, but bronies are not only, or even (I would argue) primarily, resisting. Most bronies that we have talked to don’t think of themselves as being oppositional, but instead as simply liking a show that they like. While it is both productive and interesting to look at the ways that bronies are resisting gender norms, it is also very easy for academics to fall into the trap of casting everything in that light, limiting the rich and complex social interactions of bronies to a romanticized narrative about bronies rising up together and resisting the gender stereotypes of larger society.

Jason: Resistance as a concept works because of a binary opposition: X resists Y. However, multiple competing discourses may be at work and are probably not all aligned to one another. For example, earlier this year, a North Carolina school kept a nine-year-old boy from bringing his Rainbow Dash backpack to school because it was getting him bullied by other students. On one level, the reasoning on all sides is obvious. To the other boys, a boy wearing “girly” paraphernalia is ripe to be bullied. The school counselor wanted to ensure the boy’s safety, so they removed what was believed to be the problem. Some parents were concerned that the boy was being punished for simply expressing himself, and that the bullies should have been punished instead. …

So, while each person appears to act in resistance according to a particular discourse of meaning, and each person may have a particular narrative, the entire scenario is complicated by these competing ideas of masculinity that intersect with ideologies of personal freedom and liberty. Rainbow Dash (the character on the backpack), for example, is clearly written as a “tomboy” character–good at sports, adventurous, daring and 20 percent cooler than you. If a boy was going to pick a character to identify with that does not break existing standards of masculinity, she would be the one; thus, insofar as male fans identify with her, they’re also identifying with characteristics that don’t challenge their heteronormativity. But she is also the one covered in rainbows, and that has a particular valence as a form of non-heteronormative imagery (e.g. LGBT rights symbolism). In short, there is a density of meaning attached to Rainbow Dash that complicates people’s responses, though I would argue that it’s that complexity and density of meaning that allows different groups to be drawn to MLP in the first place.

Kurt: The ways in which people are using the show in relation to gender norms further complicate things. While in many ways bronies are challenging gender norms through their liking the show and re-defining ideas about masculinity, in other ways many bronies are super heteronormative. While they like a show that some people think is for girls, their argument is less about the fact that gender norms need dismantling than it is about the fact that the show is written in a way that is appealing to heteronormative men and that men can still be manly while liking MLP. The World’s Manliest Brony, for instance, while going against gender norms in some ways by embracing MLP and re-enforcing the manliness of giving charitably, also reinforces them in others–leaving many ideas of masculinity intact but drawing MLP into the list of things that can be manly.

Julia: Psychologist Marsha Redden, one of the researchers conducting The Brony Study, stated in an interview that the fandom is a normal response to the anxiety of life in a conflict-driven time, saying “they’re tired of being afraid, tired of angst and animosity. They want to go somewhere a lot more pleasant.” Likewise, a lot of what you talk about on your blog has to do with the positivity of the actual show, how each episode has a positive message and emphasizes the importance of friendship and other values. We rarely hear anything positive about bronies from the mainstream media. Can you talk a bit about this? What draws adults to the show, and to the community? What do you make of the moral panic surrounding bronies in the mainstream media?

Jason: At the risk of sounding a little persnickety, I’d like to suggest that we invert the way we think about such causal explanations. Explanations similar to Dr. Redden’s–basically, some version of the idea that the world is a rough and cynical place and that MLP presents an alternative space, no matter how delimited or constrained, that is more trusting and open–are pretty common within the fandom as part of people’s personal narratives for why and how they became bronies (obviously, this is not true for everyone, but it’s clearly a fandom trope). In anthropology itself, scholars like Victor Turner and Max Gluckman have suggested that certain carnivalesque (to borrow Bakhtin’s term) rituals act as a kind of “safety valve” for a society to release its pent up frustrations and conflicts without destroying the order of things, and some version of that idea is laden in Redden’s theory and that of many bronies. There are many bronies who see involvement in fandom and watching the show as that safety valve.

But there are many others who narrate their experience as simply watching a show that they like–just like any other show–and, to their surprise finding outside resistance. Indeed, we don’t expect people to explain their affinity for most elements of popular culture. You need not justify why you watch “Breaking Bad” or “Game of Thrones.”

The fact that causal explanations that answer why you are a brony are central to the narratives of many bronies does not really indicate too much about their truth value, but they are a useful indicator of where society draws its lines and how people who find themselves on the wrong sides of social lines create meaning based on their situations. Here, I’m drawing heavily on Lila Abu-Lughod’s ideas about resistance as a “diagnostic of power” that points us to the methods and configurations of power (“The Romance of Resistance: Tracing Transformations of Power Through Bedouin Women,” 1990). In this case, bronies (and researchers) find themselves having to produce narratives that can explain why they have crossed norms of gender and age appropriateness, even if they don’t live by those norms themselves. Jacob Clifton in “Geek Love: On the Matter of Bronies” does a great job arguing that, as the first generation raised by feminists, of course these young men don’t see any difference between Twilight Sparkle and Han Solo being their idols.

Kurt: Ultimately the fact that bronies have to justify why they like the show stems in many ways from the fact that they get such negative press and draw such negative stereotypes. We haven’t done too much to tease out what actually draws people to the show, although we’ve seen many people give many different reasons as we’ve gone about our research–the good writing and production, the positive themes, the large and thriving fan community, having friends and relatives that like the show, that they just somehow liked it, etc. I’m not sure that there is necessarily one, or even a few, things inherent in the show or the fandom that draw people to it any more than there is something inherent in basketball that makes people want to watch it. There are a lot of really complex personal, psychological and socio-cultural things at work in personal preference, and the reasons people give usually seem to say less about why they like something (I couldn’t tell you why I like Carly Rae Jepsen or George Clinton) than about the culturally determined reasons why it might be okay for them to like it.

Julia: Right now you have the benefit of both directly looking for source material on the open web, and having it come to you (through participation on your blog). Given your perspective, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for anthropologists of the future to study this moment in history?

Kurt: That’s a tough one, as even with our research on bronies I feel like everywhere I look, I see someone joining the Brony research herd with a new and different focus. Although we try to do a lot of our work by talking and collaborating directly with bronies, we’ve dealt with Twitter exchanges, media reports about MLP, message board archives, brony music collections, the show itself and just about anything that we can find where people are exchanging their ideas about the fandom. Others have dealt with collections of fanfics, sites dedicated to discussing MLP and religion, fan art, material culture and cosplay, and just about anything else you can think of. I’m always finding people who focus upon and draw insight from archives (both in the sense of actual archives and in the super-general sense of “stuff people use as the basis of their research”) that I would never have thought to use.

This being said, as someone who primarily studies expressive culture (my degree is from the Department of Folklore and Ethnomusicology), I tend to place a lot of importance on it. The amount and quality of the music, art, videos, memes, stories, etc. floating around within the fandom have never ceased to astound me and were among the primary reasons that I became attracted to the fandom in the first place. I feel like these bodies of creative works–from “My Little Dashie,” “Ponies: The Anthology,” and “Love me Cheerilee” to the Twilicane memes and crude saxophone covers of show tunes–are very important to the fandom and to those that want to understand it as scholars.

Jason: Broadly speaking, anthropologists have taken two approaches to describing the lives of others to their audience. The first is like a wide-angle lens, allowing someone to get a sense of the full scope of a social phenomenon, but it has trouble with the details and the charming little moments of creativity and agency–like fan-created fluffy ponies dancing on rainbows or background ponies portrayed as anthropologists studying humankind. Archival work needs that little-bit-of-everything for context, but it also needs a macro lens that can capture more of those particular and special moments. In anthropology, it might be akin to the difference between Malinowski’s epic “Argonauts of the Western Pacific”–a sprawling work that tried to introduce the entirety of a culture to us–and something like Anthony Seeger’s “Why Suyá Sing,” which performed the humbler, but no less impressive, task of letting us experience the nuances of a single ritual.

Since we can’t archive every little thing to that level of detail … we have to make choices, and that’s where bronies themselves are the best guides. What moments mattered to them, and “where” in cyberspace did they experience those moments? For a concrete example, the moment Twilight Sparkle gained her wings and became an alicorn princess (she was previously just a unicorn…thanks M.A. Larson) was particularly salient in the community, suggesting for some fans Hasbro’s stern hand manipulating the franchise. While there are some other similar instances, the unique expressions through Twitter, Reddit, YouTube, Tumblr, etc. during and immediately following the Season 3 episode “Magical Mystery Cure” (when that transformation occurs) provide a really important look into what holds meaning for this fandom.

On a technical level, I think that means being able to follow links surrounding particular events to multiple levels of depth across multiple media modalities.
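
As a toy illustration of that point, the sketch below follows outbound links from a seed page to a fixed depth. A production harvest would rely on a dedicated crawler such as Heritrix and would also capture embedded media; the seed URL, depth limit and bare-bones HTML parsing here are assumptions for illustration only.

```python
"""Toy sketch of depth-limited link following around a single event.

Purely illustrative; real web archiving would respect robots.txt, rate
limits and media capture, none of which are handled here."""
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests  # third-party HTTP library

class LinkExtractor(HTMLParser):
    """Collect href values from anchor tags in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, max_depth: int = 2):
    """Yield (url, depth) pairs reachable from the seed within max_depth hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        yield url, depth
        if depth >= max_depth:
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))

# Hypothetical usage:
# for url, depth in crawl("https://example.org/magical-mystery-cure-reactions"):
#     print(depth, url)
```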

Julia: If librarians, archivists and curators wanted to learn more about approaches like yours what examples of other scholars’ work would you suggest? It would be great if you could mention a few other scholars’ work and explain what you think is particularly interesting about their approaches.

Jason: One place to start is to consider what the cultural artifact is and what it is we are analyzing, interpreting, preserving, archiving, etc., because it is not, ethnographically speaking, simply media that we are studying. As Mary Gray has insisted, we should “de-center media as the object of analysis,” instead looking at what that media means and how it is contextualized. For the archivist or curator, I think that means figuring out how people come to understand media and how they attach particular ideologies to it. Ilana Gershon’s “The Breakup 2.0” and her work on “media ideology” broadly are great examples of shifting our attention so that we can hold both the “text” and “context” in view simultaneously.

Another example is danah boyd’s recent study of young people and their social media use, “It’s Complicated,” in which she inverts older people’s assumptions that teenagers’ social media use is crippling their ability to socialize, instead arguing that the constant texting and messaging indicates a desire to connect with one another that is born out of frustration with the previous generation’s (over-)protectiveness: truancy and loitering laws, curfews, school busing, constant organized activity, etc. She arrives at that conclusion not only by studying teens’ messages, but by analyzing the historical conditions that produce the very different concerns of teens and their parents.

Kurt: As far as our approach goes, we’ve also been influenced by scholars working creatively with ethnography as a form or working just outside of its purview. We’ve brought up Kathleen Stewart’s “Ordinary Affects” in our blog and academic papers several times because it has been extremely influential upon both of us through its attempt to understand and express the ordinary moments in people’s lives that, while not unusual, per se, seem to have a weight to them that moves them somewhere in some direction–the little moments that are both ordinary and extraordinary, nondescript and meaningful. Susan M. Schultz’s “Dementia Blog” also comes to mind. While it isn’t necessarily an ethnography, Schultz utilized blogging and its unique structural features (namely, that newer posts come first so that reading the blog in order is actually going backwards in time) as a means of looking into the poetics and tragic beauty of dementia while also expressing and understanding her own feelings as her mother’s mental illness progressed.

Jason: We are not too familiar with scholars who are interacting with fans in precisely the way that we are (or whether there are any), though it is important to be aware of the term “aca-fan” (academic fan) in fandom studies and some of the works being produced under that rubric. Henry Jenkins titles his website “Confessions of an Aca-Fan,” for example, and writes for an audience that includes both scholars and people interested in fandoms in general. The online journal Flow is another example that is somewhat more closely related to our blog, expressly attempting to link scholars with members of the public interested in talking about television. I’m also personally influenced by the work of Michael Wesch and Kembrew McLeod, both scholars who attempt to engage their students and the public in novel ways using media and technology.


Netnography and Digital Records: An Interview with Robert Kozinets

13 August 2014 - 1:19pm

Robert V. Kozinets, professor of marketing at York University in Toronto

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Online communities, and their digital records, can be a rich source of information, invaluable to academic researchers and to market researchers. In this installment of the Insights Interviews series, I’m delighted to talk with Robert V. Kozinets, professor of marketing at York University in Toronto and the originator of “netnography.”

Julia: In your book “Netnography: Doing Ethnographic Research Online,” you define “netnography” as a “qualitative method devised specifically to investigate the consumer behavior of cultures and communities present on the Internet.”  Can you expand a bit on that definition for us? What is it about online communities that warrants minting a new word for doing ethnographic work online? Further, how would you compare and contrast your approach to other terms like “virtual ethnography”?

Robert: It’s a great question, and one that is difficult to do justice to in a short interview. For readers who are aware of the anthropological technique of ethnography, or participant-observation, it may be fairly easy to grasp that ethnographic work can also be performed in online or social media environments. However, doing ethnographic work on the combination of digital archives and semi-real-time conversations, and much more, that is the Internet is a bit different from, say, traveling to Outer Mongolia to learn about how people live there. The online environment is technologically mediated, it is instantly archived, it is widely accessible, and it is corporately controlled and monitored in ways that face-to-face behavior is not. Netnography is simply a way to approach ethnography online, and it could just as easily be called “virtual,” “digital,” “web,” “mobile” or other kinds of ethnography. The difference, I suppose, is that netnography has been associated with particular research practices in a way that these other terms are not.

Julia: You began implementing netnography as a research method in 1995. The web has changed a good bit since you started doing this work nearly twenty years ago. How has the continued development of web applications and software changed or altered the nature of doing netnographic research? In particular, has the increased popularity of social media (Facebook, Twitter) changed work in studying online communities?


Networking, from user jalroagua on Flickr

Robert: This is a little like asking an experimental researcher if the experiments they run are different if they are running them on children or old people, or if they are experimenting on prisoners in a prison, or students at a party. It is a tactical and operational issue. The guiding principles of netnography are exactly the same whether it is a bulletin board, a blog or Facebook. Fundamental questions of focus, data collection, immersion and participation, analysis, and research presentation are identical.

Julia: How do you suggest finding communities online outside of the relatively basic search operations offered by Google and Yahoo? What are some signs that a particular online community will be a good source for netnographic research?

Robert: There are many search tools that are available, but there is no particular need to go beyond Google or Yahoo. The two keys to netnography are finding particularly interesting and relevant data amongst the load of existing data, and paying particular attention to one’s own role and consciousness as participant in the research process. Whatever tools one chooses to work with, this is time-consuming, painstaking and rewarding work. One thing I would love search engines to be able to do is to include and tag visual, audio and audiovisual material. It would be wonderful to have a search engine that spat out results to a search and gave me, along with website, blog and forum links, a full list of links to Instagram photos, YouTube videos and iTunes podcasts.

Julia: Throughout the book, you reinforce the point that the key to generating insight in netnography is building trust. Can you unpack that a bit? What are some ethical concerns researchers should keep in mind when conducting ethnographic research?

Robert: A range of ethical concerns have been raised about the use of Internet data, many of which have proven over the years to be non-starters. Notions of informed consent can be difficult online, and ethical imperatives can be difficult in environments where the line between public and private is so unclear. However, disclosure of the researcher or the research is not always necessary–it depends always upon the context. As with any research ethics question, it is generally a question of weighing potential benefits against potential risks.

Julia: From your perspective as an ethnographer and market researcher, what kinds of online content do you think are the most critical for cultural heritage organizations to preserve for researchers of the future to study this moment in history? Collecting and preserving content isn’t your area, but I’d be interested to hear whether you think there are particular subcultures, movements or content that aren’t getting enough attention.

Robert: I have used the Wayback Machine from time to time to look at snapshots of the Internet of the past. I also recall a recent research project in which we studied bloggers, and in which some interesting blog material was removed shortly after it was posted. It survived only in our fieldnotes, but we had not archived it. Of course, it would be nice to be able to instantly retrieve “the data that got away.” However, in my research, it is the immediate experience of the Internet which matters most.

Given the rapid spread of social media, I believe that the present holds far more information and insight than any other time in the past. There are so many archives of so many particular groups already, and those archives are, in themselves, rather revealing cultural artifacts. The ones I find the most fascinating to study are the archives that groups make of their own activities. So, to answer your last question with a library science example, I would be more interested to see the archives that library science people construct about library science, and how they represent themselves to themselves and to wider audiences of assumed “others,” than I would be in how library science people represent any other group.

Julia: Aside from what to collect, I would be curious to learn a bit more about what kinds of access you think researchers studying digital culture are going to want to have to these collections. How much of this do you think will be focused on close reading of what individual pages and sites looked like and how much on bulk analysis of materials as data?

Robert: I think researchers are hungry for everything. If you ask typical researchers what data they want, they will say everything. That is because, without a specific focus or research question, you want to keep all of your options open. Then the problem becomes what they do with all this data, and they end up with all sorts of big data methods that try to fit as much data as possible into models. My approach is a bit different, in that I am searching for individual experiences online that generate insight. This could come from masses of data, or from one page, one site, even one photograph or one video clip. I think the question of access is tied up with questions of categorizing, interpretation and ownership, and these are all interesting and complex matters that lend themselves to a lot more thought and debate. In the short- to medium-term, what is currently available on the Internet is certainly more than enough for me to work with.


Networked Youth Culture Beyond Digital Natives: An Interview With danah boyd

11 August 2014 - 6:00pm

danah boyd, principal researcher, Microsoft Research, research assistant professor in media, culture and communication at New York University, and fellow with Harvard’s Berkman Center for Internet & Society.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects that led up to CurateCamp Digital Culture in July. This is part of an ongoing series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

How do teens use the internet? For researchers, reporters and concerned parents alike, that question has never been more relevant. Many adults can only guess, or extrapolate based on news reports or their own social media habits. But researcher danah boyd took an old-fashioned but effective approach: she asked them.

I’m delighted to continue our ongoing Insights Interview series today with danah, a principal researcher at Microsoft Research, a research assistant professor in media, culture and communication at New York University, and a fellow at Harvard’s Berkman Center for Internet & Society. For her new book It’s Complicated: The Social Lives of Networked Teens, she spent about eight years studying how teens interact both on- and off-line.

Julia: The preface to your latest book ends by assuring readers that “by and large, the kids are all right.” What do you mean by that?

danah: To be honest, I really struggle with prescriptives and generalizations, but I had to figure out how to navigate those while writing this book.  But this sentence is a classic example of me trying to add nuance to a calming message.  What I really mean by this – and what becomes much clearer throughout the book – is that the majority of youth are as fine as they ever were.  They struggle with stress and relationships.  They get into trouble for teenage things and aren’t always the best equipped for handling certain situations.  But youth aren’t more at-risk than they ever were.  At the same time, there are some youth who are seriously not OK.  Keep in mind that I spend time with youth who are sexually abused and trafficked for a different project.  I don’t want us to forget that there are youth out there that desperately need our attention. Much to my frustration, we tend to focus our attention on privileged youth, rather than the at-risk youth who are far more visible today because of the internet than ever before.


Photograph from “pie and box supper,” Quicksand school, Breathitt County, Kentucky, September 1940. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints & Photographs Division.

Julia: In a recent article you stated that “social media mirror, magnify, and complicate countless aspects of everyday life, bringing into question practices that are presumed stable and shedding light on contested social phenomena.” Can you expand a bit on this?

danah: When people see things happening online that feel culturally unfamiliar to them, they often think it’s the internet that causes it. Or when they see things that they don’t like – like bullying or racism – they think that the internet has made it worse.  What I found in my research is that the internet offers a mirror to society, good, bad and ugly.  But because that mirror is so publicly visible and because the dynamics cross geographic and cultural boundaries, things start to get contorted in funny ways.  And so it’s important to look at what’s happening underneath the aspect that is made visible through the internet.

Julia: In a recent interview you expressed frustration with how, in the moral panic surrounding social media, “we get so obsessed with focusing on relatively healthy, relatively fine middle- and upper-class youth, we distract ourselves in ways that don’t allow us to address the problems when people actually are in trouble.” What’s at stake when adults and the media misunderstand or misrepresent teen social media use?

danah: We live in a society and as much as we Americans might not like it, we depend on others.  If we want a functioning democracy, we need to make sure that the fabric of our society is strong and healthy.  All too often, in a country obsessed with individualism, we lose track of this.  But it becomes really clear when we look at youth.  Those youth who are most at-risk online are most at-risk offline.  They often come from poverty or experience abuse at home. They struggle with mental health issues or have family members who do.  These youth are falling apart at the seams and we can see it online.  But we get so obsessed with protecting our own children that we have stopped looking out for those in our communities that are really struggling, those who don’t have parents to support them.  The urban theorist Jane Jacobs used to argue that neighborhoods aren’t safe because you have law enforcement policing them; they are safe because everyone in the community is respectfully looking out for one another.  She talked about “eyes on the street,” not as a mechanism of surveillance but as an act of caring.  We need a lot more of that.


Southington, Connecticut. Young people watching a game. 1942 May 23-30. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

Julia: You conduct research on teen behaviors both on and offline. How are physical environments important to understanding mediated practices? What are the limitations to studying online communities solely by engaging with them online?

danah: We’ve spent the last decade telling teenagers that strangers are dangerous, that anyone who approaches them online is a potential predator.  I can’t just reach out to teens online and expect them to respond to me; they think I’m creepy.  Thus, I long ago learned that I need to start within networks of trust. I meet youth through people in their lives, working networks to get to them so that they will trust me and talk about their lives with me. In the process, I learned that I get a better sense of their digital activities by seeing their physical worlds first.  At the same time, I do a lot of online observation and a huge part of my research has been about piecing together what I see online with what I see offline.

Julia: Researchers interested in young people’s social media use today can directly engage with research participants and a wealth of documentation over the web. When researchers look back on this period, what do you think are going to be the most critical source material for understanding the role of social media in youth culture? In that vein, what are some websites/data sets and other kinds of digital material that you think would be invaluable for future researchers to have access to for studying teen culture of today 50 years from now?


El Centro (vicinity), California. Young people at the Imperial County Fair. 1942 Feb.-Mar. Farm Security Administration – Office of War Information Photograph Collection. Photo courtesy of the Library of Congress Prints and Photographs Division.

danah: Actually, to be honest, I think that anyone who looks purely at the traces left behind will be missing the majority of the story.  A lot has changed in the decade in which I’ve been studying youth, but one of the most significant changes has to do with privacy.  When I started this project, American youth were pretty forward about their lives online. By the end, even though I could read what they tweeted or posted on Instagram, I couldn’t understand it.  Teens started encoding content. In a world where they can’t restrict access to content, they restrict access to meaning.  Certain questions can certainly be asked of online traces, but meaning requires going beyond traces.

Julia: Alongside your work studying networked youth culture, you have also played a role in ongoing discussions of the implications of “big data.” Recognizing that researchers now and in the future are likely going to want to approach documentation and records as data sets, what do you think are some of the most relevant issues from your writing on big data for cultural heritage institutions to consider about collecting, preserving and providing access to social media, and other kinds of cultural data?


teenagers and their smartphones visiting a museum by user vilseskogen on Flickr.

danah: One of the biggest challenges that archivists always have is interpretation. Just because they can access something doesn’t mean they have the full context.  They work hard to piece things together to the best that they can, but they’re always missing huge chunks of the puzzle.  I’m always amazed when I sit behind the Twitter firehose to see the stream of tweets that make absolutely no sense.  I think that anyone who is analyzing this data knows just how dirty and confusing it can be.  My hope is that it will force us to think about who is doing the interpreting and how.  And needless to say, there are huge ethical components to that.  This is at the crux of what archivists and cultural heritage folks do.

Julia: You’ve stated that “for all of the attention paid to ‘digital natives’ it’s important to realize that most teens are engaging with social media without any deep understanding of the underlying dynamics or structure.” What role can cultural heritage organizations play in facilitating digital literacy learning?

danah: What I love about cultural heritage organizations is that they are good at asking hard questions, challenging assumptions, questioning interpretations.  That honed skill is at the very center of what youth need to develop.  My hope is that cultural heritage organizations can go beyond giving youth the fruits of their labor and instead invite them to develop these skills.  These lessons don’t need to be internet-specific. In many ways, they’re a part of what it means to be critically literate, period.


August Library of Congress Digital Preservation Newsletter is Now Available

8 August 2014 - 3:02pm

The August Library of Congress Digital Preservation Newsletter is now available:

Included in this issue:

  • Digital Preservation 2014: It’s a Thing
  • Preserving Born Digital News
  • LOLCats and Libraries with Amanda Brennan
  • Digital Preservation Questions and Answers
  • End-of-Life Care for Aging, Fragile CDs
  • Education Program updates
  • Interviews with Henry Jenkins and Trevor Blank
  • More on Digital Preservation 2014
  • NDSA News, and more
Categories: Planet DigiPres

Duke’s Legacy: Video Game Source Disc Preservation at the Library of Congress

6 August 2014 - 2:18pm

The following is a guest post from David Gibson, a moving image technician in the Library of Congress. He was previously interviewed about the Library of Congress video games collection.

The discovery of that which has been lost or previously unattainable is one of the driving forces behind the archival profession and one of the passions the profession shares with the gaming community. Video game enthusiasts have long been fascinated by unreleased games and “lost levels,” gameplay levels which are partially developed but left out of the final release of the game. Discovery is, of course, a key component to gameplay. Players revel in the thrill of unlocking the secret door or uncovering Easter eggs hidden in the game by developers. In many ways, the fascination with obtaining access to unreleased games or levels brings this thrill of discovery into the real world. In a recent article written for The Atlantic, Heidi Kemps discusses the joy in obtaining online access to playable lost levels from the 1992 Sega Genesis game, Sonic The Hedgehog 2, reveling in the fact that access to these levels gave her a glimpse into how this beloved game was made.

Original source disc as it was received by the Library of Congress.

Since 2006, the Moving Image section of the Library of Congress has served as the custodial unit for video games. In this capacity, we receive roughly 400 video games per year through the Copyright registration process, about 99% of which are physically published console games. In addition to the games themselves we sometimes receive ancillary materials, such as printed descriptions of the game, DVDs or VHS cassettes featuring excerpts of gameplay, or the occasional printed source code excerpt. These materials are useful, primarily for their contextual value, in helping to tell the story of video game development in this country and are retained along with the games in the collection.

Several months ago, while performing an inventory of recently acquired video games, I happened upon a DVD-R labeled Duke Nukem: Critical Mass (PSP). My first assumption was that the disc, like so many others we have received, was a DVD-R of gameplay. However, a line of text on the Copyright database record for the item intrigued me. It reads: Authorship: Entire video game; computer code; artwork; and music. I placed the disc into my computer’s DVD drive to discover that the DVD-R did not contain video, but instead a file directory, including every asset used to make up the game in a wide variety of proprietary formats. Upon further research, I discovered that the Playstation Portable version of Duke Nukem: Critical Mass was never actually released commercially and was in fact a very different beast than the Nintendo DS version of the game which did see release. I realized then that in my computer was the source disc used to author the UMD for an unreleased PlayStation Portable game. I could feel the lump in my throat. I felt as though I had solved the wizard’s riddle and unlocked the secret door.

Excerpt of code from boot.bin including game text.

The first challenge involved finding a way to access the proprietary Sony file formats contained within the disc, including, but not limited to, graphics files in .gim format and audio files in .AT3 format. I enlisted the aid of Packard Campus Software Developer Matt Derby and we were able to pull the files off of the disc and get a clearer sense of the file structure contained within. Through some research on various PSP homebrew sites we discovered Noesis, a program that would allow us to access the .gim and .gmo files which contain the 3D models and textures used to create the game’s characters and 3D environments. With this program we were able to view a complete 3D model of Duke Nukem himself, soaring through the air on his jetpack, and a pre-composite 3D model of one of the game’s nemeses, the Pig Cops. Additionally, we employed Mediacoder and VLC to convert the Sony .AT3 (ATRAC3) audio files to MP3 so we could access the game’s many music cues.

duke_jp_gmo

3D model for Duke Nukem equipped with jetpack. View an animated gif of the model here.

Perhaps the most exciting discovery came when we used a hex editor to access the ASCII text held in the boot.bin file in the disc’s system directory. Here we located the full text and credit information for the game along with a large chunk of un-obfuscated software code. However, much of what is contained in this file was presented as compiled binaries. It is my hope that access to both the compiled binaries and ASCII code will allow us to explore future preservation options for video games. Such information becomes even more vital in the case of games, such as this Duke Nukem title, that were never released for public consumption. In many ways, this source disc can serve as an exemplary case as we work to define preferred format requirements for software received by the Library of Congress. Ultimately, I feel that access to the game assets and source code will prove to be invaluable both to researchers who are interested in game design and mechanics and to any preservation efforts the Library may undertake.
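
The kind of ASCII text described above can also be surfaced with a short script rather than a hex editor. The following is a minimal sketch (written in Python purely as an illustration; it is not part of the Library's actual workflow, and the boot.bin path is just an example) that mimics the Unix strings utility by printing runs of printable ASCII bytes found in a binary file:

    import re
    import sys

    def extract_ascii_strings(path, min_length=4):
        """Yield runs of printable ASCII characters from a binary file,
        similar to the Unix `strings` utility."""
        with open(path, "rb") as f:
            data = f.read()
        # Printable ASCII bytes (space through tilde), at least min_length long
        for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_length, data):
            yield match.group().decode("ascii")

    if __name__ == "__main__":
        # Hypothetical example path; pass the real file on the command line
        target = sys.argv[1] if len(sys.argv) > 1 else "boot.bin"
        for s in extract_ascii_strings(target):
            print(s)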

Providing access to the disc’s content to researchers will, unfortunately, remain a challenge. As mentioned above, it was difficult enough for Library of Congress staff to view the proprietary formats found on the disc before seeking help from the homebrew community. The legal and logistical hurdles related to providing access to licensed software will continue to present themselves as we move forward but I hope that increased focus on the tremendous research value of such digital assets will allow for these items to be more accessible in the future. For now the assets and code will be stored in our digital archive at the Packard Campus in Culpeper and the physical disc will be stored in temperature-controlled vaults.

The source disc for the PSP version of Duke Nukem: Critical Mass stands out in the video game collection of the Library of Congress as a true digital rarity. In Doug Reside’s recent article “File Not Found: Rarity in the Age of Digital Plenty” (pdf), he explores the notion of source code as manuscript and the concept of digital palimpsests that are created through the various layers that make up a Photoshop document or which are present in the various saved “layers” of a Microsoft Word document. The ability to view the pre-compiled assets for this unreleased game provides a similar opportunity to view the game as a work-in-progress, or at the very least to see the inner workings and multiple layers of a work of software beyond what is presented to us in the final, published version. In my mind, receiving the source disc for an unreleased game directly from the developer is analogous to receiving the original camera negative for an unreleased film, along with all of the separate production elements used to make the film. The disc is a valuable evidentiary artifact and I hope we will see more of its kind as we continue to define and develop our software preservation efforts.

The staff of the Moving Image section would love the opportunity to work with more source materials for games and I hope that game developers who are interested in preserving their legacy will be willing to submit these kinds of materials to us in the future. Though source discs are not currently a requirement for copyright, they are absolutely invaluable in contributing to our efforts towards stewardship and long term access to the documentation of these creative works.

Special thanks to Matt Derby for his assistance with this project and input for this post.

Categories: Planet DigiPres

National Geospatial Advisory Committee: The Shape of Geo to Come

5 August 2014 - 1:24pm

World Map 1689 — No. 1 from user caveman_92223 on Flickr.

Back in late June I attended the National Geospatial Advisory Committee (NGAC) meeting here in DC. NGAC is a Federal Advisory Committee sponsored by the Department of the Interior under the Federal Advisory Committee Act. The committee is composed of (mostly) non-federal representatives from all sectors of the geospatial community and features very high profile participants. For example, ESRI founder Jack Dangermond, the 222nd richest American, has been a member since the committee was first chartered in 2008 (his term has since expired). Current committee members include the creator of Google Earth (Michael Jones) and the founder of OpenStreetMap (Steve Coast).

So what is the committee interested in, and how does it coincide with what the digital stewardship community is interested in? There are a number of noteworthy points of intersection:

  • In late March of this year the FGDC released the “National Geospatial Data Asset Management Plan – a Portfolio Management Implementation Plan for the OMB Circular A–16” (pdf). The plan “lays out a framework and processes for managing Federal NGDAs [National Geospatial Data Assets] as a single Federal Geospatial Portfolio in accordance with OMB policy and Administration direction. In addition, the Plan describes the actions to be taken to enable and fulfill the supporting management, reporting, and priority-setting requirements in order to maximize the investments in, and reliability and use of, Federal geospatial assets.”
  • Driven by the release of the NGDA Management Plan, a baseline assessment of the “maturity” of various federal geospatial data assets is currently under way. This includes identifying dataset managers, identifying the sources of data (fed only/fed-state partnerships/consortium/etc.) and determining the maturity level of the datasets across a variety of criteria. With that in mind, several “maturity models” and reports were identified that might prove useful for future work in this area. For example, the state of Utah AGRC has developed a one-page GIS Data Maturity Assessment; the American Geophysical Union has a maturity model for assessing the completeness of climate data records (behind a paywall, unfortunately); the National States Geographic Information Council has a Geospatial Maturity Assessment; and the FGDC has an “NGDA Dataset Maturity Annual Assessment Survey and Tool” that is being developed as part of their baseline assessment. These maturity models have a lot in common with the NDSA Levels of Preservation work.
  • There was much discussion of a pair of reports on big data and geolocation privacy. The first, the Big Data – Seizing Opportunities, Preserving Values report from the Executive Office of the President, acknowledges the benefits of data but also notes that “big data technologies also raise challenging questions about how best to protect privacy and other values in a world where data collection will be increasingly ubiquitous, multidimensional, and permanent.” The second, the PCAST report on Big Data and Privacy (PCAST is the “President’s Council of Advisors on Science and Technology” and the report is officially called “Big Data: A Technology Perspective”) “begins by exploring the changing nature of privacy as computing technology has advanced and big data has come to the forefront.  It proceeds by identifying the sources of these data, the utility of these data — including new data analytics enabled by data mining and data fusion — and the privacy challenges big data poses in a world where technologies for re-identification often outpace privacy-preserving de-identification capabilities, and where it is increasingly hard to identify privacy-sensitive information at the time of its collection.” The importance of both of these reports to future library and archive collection and access policies regarding data cannot be overstated.
  • The Spatial Data Transfer Standard is being voted on for withdrawal as an FGDC-endorsed standard. FGDC maintenance authority agencies were asked to review the relevance of the SDTS, and they responded that the SDTS is no longer used by their agencies. There’s a Federal Register link to the proposal. The Geography Markup Language (GML), which the FGDC has endorsed, now satisfies the encoding requirements that SDTS once provided. NARA revised their transfer guidance for geospatial information in April 2014 to make SDTS files “acceptable for imminent transfer formats” but it’s clear that they’ve already moved away from them.  As a side note, GeoRSS is coming up for a vote soon to become an FGDC-endorsed standard.
  • The Office of Management and Budget is reevaluating the geospatial professional classification. The geospatial community has an issue similar to that being faced by the library and archives community, in that the jobs are increasingly information technology jobs but are not necessarily classified as such. This coincides with efforts to reevaluate the federal government library position description.
  • The Federal Geographic Data Committee is working with federal partners to make previously-classified datasets available to the public.  These datasets have been prepared as part of the “HSIP Gold” program. HSIP Gold is a compilation of over 450 geospatial datasets of U.S. domestic infrastructure features that have been assembled from a variety of Federal agencies and commercial sources. The work of assembling HSIP Gold has been tasked to the Homeland Infrastructure Foundation-Level Data (HIFLD) Working Group (say it as “high field”). Not all of the data in HSIP Gold is classified, so they are working to make some of the unclassified portions available to the public.

The next meeting of the NGAC is scheduled for September 23 and 24 in Shepherdstown, WV. The meetings are public.

Categories: Planet DigiPres

Making Scanned Content Accessible Using Full-text Search and OCR

4 August 2014 - 12:48pm

The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.

We live in an age of cheap bits: scanning objects en masse has never been easier, storage has never been cheaper and large-scale digitization has become routine for many organizations. This poses an interesting challenge: our capacity to generate scanned images has greatly outstripped our ability to generate the metadata needed to make those items discoverable. Most people use search engines to find the information they need but our terabytes of carefully produced and diligently preserved TIFF files are effectively invisible for text-based search.

The traditional approach to this problem has been to invest in cataloging and transcription but those services are expensive, particularly as flat budgets are devoted to the race to digitize faster than physical media degrades. This is obviously the right call from a preservation perspective but it still leaves us looking for less expensive alternatives.

OCR is the obvious solution for extracting machine-searchable text from an image but the quality rates usually aren’t high enough to offer the text as an alternative to the original item. Fortunately, we can hide OCR errors by using the text to search but displaying the original image to the human reader. This means our search hit rate will be lower than it would with perfect text but since the content in question is otherwise completely unsearchable anything better than no results will be a significant improvement.

Since November 2013, the World Digital Library has offered combined search results similar to what you can see in the screenshot below:

adams080414image1

This system is entirely automated, uses only open-source software and existing server capacity, and provides an easy process to improve results for items as resources allow.

How it Works: From Scan to Web Page

Generating OCR Text

As we receive new items, any item which matches our criteria (currently books, journals and newspapers created after 1800) will automatically be placed in a task queue for processing. Each of our existing servers has a worker process which uses idle capacity to perform OCR and other background tasks. We use the Tesseract OCR engine with the generic training data for each of our supported languages to generate an HTML document using hOCR markup.
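
As a rough illustration of this step (the file names, language code and wrapper function below are hypothetical, not WDL's production code), invoking Tesseract from a worker script to produce hOCR might look something like this:

    import subprocess
    from pathlib import Path

    def ocr_to_hocr(image_path, output_base, language="eng"):
        """Run Tesseract on a page image and request hOCR output.
        Recent Tesseract versions write `<output_base>.hocr` when the
        `hocr` config is given on the command line."""
        subprocess.run(
            ["tesseract", str(image_path), str(output_base), "-l", language, "hocr"],
            check=True,
        )
        return Path(f"{output_base}.hocr")

    # Example (hypothetical paths):
    # hocr_file = ocr_to_hocr("page_0001.tif", "page_0001", language="eng")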

The hOCR document has HTML markup identifying each detected word and paragraph and its pixel coordinates within the image. We archive this file for future use, but our system also generates two alternative formats for the rest of our system to use (a rough parsing sketch follows the list below):

  • A plain text version for the search engine, which does not understand HTML markup
  • A JSON file with word coordinates which will be used by a browser to display or highlight parts of an image on our search results page and item viewer
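
The following is a minimal sketch of that conversion, assuming hOCR output in which each recognized word is an ocrx_word span carrying a bbox in its title attribute. The exact JSON layout WDL uses is not described here, so the structure below is only illustrative:

    import json
    import re
    from html.parser import HTMLParser

    class HocrWords(HTMLParser):
        """Collect (text, bbox) pairs from hOCR `ocrx_word` spans."""
        def __init__(self):
            super().__init__()
            self.words = []
            self._bbox = None

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if "ocrx_word" in (attrs.get("class") or ""):
                m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
                self._bbox = [int(n) for n in m.groups()] if m else None

        def handle_data(self, data):
            if self._bbox is not None and data.strip():
                self.words.append({"text": data.strip(), "bbox": self._bbox})
                self._bbox = None

    def hocr_to_outputs(hocr_path, text_path, coords_path):
        """Write a plain-text file for the search engine and a JSON file
        of word coordinates for the browser-side highlighter."""
        parser = HocrWords()
        with open(hocr_path, encoding="utf-8") as f:
            parser.feed(f.read())
        with open(text_path, "w", encoding="utf-8") as f:
            f.write(" ".join(w["text"] for w in parser.words))
        with open(coords_path, "w", encoding="utf-8") as f:
            json.dump(parser.words, f)
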
Indexing the Text for Search

Search has become a commodity service with a number of stable, feature-packed open-source offerings such as Apache Solr, ElasticSearch or Xapian. Conceptually, these work with documents — i.e. complete records — which are used to build an inverted index — essentially a list of words and the documents which contain them. When you search for “whaling,” the search engine performs stemming to reduce your term to a base form (e.g. “whale”) so it will match closely-related words, finds the term in the index, and retrieves the list of matching documents. The results are typically sorted by calculating a score for each document based on how frequently the terms are used in that document relative to the entire corpus (see the Lucene scoring guide for the exact details about how term frequency-inverse document frequency (TF-IDF) works).
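
For reference, the textbook form of that weighting is shown below; Lucene's practical scoring function layers length normalization and boosts on top of it:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}

Here tf(t, d) is how often term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents in the corpus.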

This approach makes traditional metadata-driven search easy: each item has a single document containing all of the available metadata and each search result links to an item-level display. Unfortunately, we need to handle both very large items and page-level results so we can send users directly to the page containing the text they searched for rather than page 1 of a large book. Storing each page as a separate document provides the necessary granularity and avoids document size limits but it breaks the ability to calculate relevancy for the entire item: the score for each page would be calculated separately and it would be impossible to search for multiple words which fall on different pages.

The solution for this final problem is a technique which Solr calls Field Collapsing (the ElasticSearch team has recently completed a similar feature referred to as “aggregation”). This allows us to make a query and specify a field which will be used to group documents before determining relevancy. If we tell Solr to group our results by the item ID the search ranking will be calculated across all of the available pages and the results will contain both the item’s metadata record and any matching OCR pages.
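
As an illustration of what such a grouped query looks like (the Solr core, endpoint and field names such as item_id and ocr_text below are hypothetical, not WDL's actual schema), here is a minimal sketch using Solr's standard result-grouping parameters over HTTP:

    import requests  # third-party HTTP client

    def grouped_search(solr_url, query, group_field="item_id", rows=10):
        """Query Solr with Result Grouping (a.k.a. Field Collapsing) so that
        page-level documents are collapsed under their parent item."""
        params = {
            "q": query,
            "wt": "json",
            "rows": rows,
            "group": "true",             # enable result grouping
            "group.field": group_field,  # collapse documents sharing this field
            "group.limit": 3,            # matching pages to return per item
        }
        response = requests.get(f"{solr_url}/select", params=params)
        response.raise_for_status()
        return response.json()["grouped"][group_field]["groups"]

    # Example (hypothetical endpoint and field):
    # groups = grouped_search("http://localhost:8983/solr/wdl", "ocr_text:whaling")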

(The django-haystack Solr grouped search backend with Field Collapsing support used on wdl.org has been released into the public domain.)

Highlighting Results

At this point, we can perform a search and display a nice list of results with a single entry for each item and direct links to specific pages. Unfortunately, the raw OCR text is a simple unstructured stream of text and any OCR glitches will be displayed, as can be seen in this example where the first occurrence of “VILLAGE FOULA” was recognized incorrectly:

adams080414image2

The next step is replacing that messy OCR text with a section of the original image. Our search results list includes all of the information we need except for the locations for each word on the page. We can use our list of word coordinates but this is complicated because the search engine’s language analysis and synonym handling mean that we cannot assume that the word on the page is the same word that was typed into the search box (e.g. a search for “runners” might return a page which mentions “running”).

Here’s what the entire process looks like:

1. The server returns an HTML results page containing all of the text returned by Solr with embedded microdata indicating the item, volume and page numbers for results and the highlighted OCR text:

adams080414image3

2. JavaScript uses the embedded microdata to determine which search results include page-level hits and an AJAX request is made to retrieve the word coordinate lists for every matching page. The word coordinate list is used to build a list of pixel coordinates for every place where one of our search words occurs on the page:

adams080414image7

Now we can find each word highlighted by Solr and locate it in the word coordinates list. Since Solr returned the original word and our word coordinates were generated from the same OCR text which was indexed in Solr, the highlighting code doesn’t need to handle word tenses, capitalization, etc.

3. We often find words in multiple places on the same page, and we want to display a large, easily readable section of the page rather than just the word. Our image slice is therefore always the full width of the page, starting at the top-most result and extending down to include subsequent matches until there is either a sizable gap or the total height exceeds the first third of the page (a rough sketch of this heuristic follows step 4 below).

Once the image has been loaded, the original text is replaced with the image:

adams080414image4

4. Finally, we add a partially transparent overlay over each highlighted word:

adams080414image5
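
As promised in step 3, here is a rough sketch of the slice-selection heuristic. The production logic runs in the browser in JavaScript; this Python version, with a made-up gap threshold, is only meant to make the rule concrete:

    def select_slice(match_boxes, page_height, max_gap=150):
        """Pick a vertical slice of the page image covering the top-most
        match and any nearby matches below it.  `match_boxes` is a list of
        (top, bottom) pixel coordinates for highlighted words; `max_gap`
        is an illustrative threshold for what counts as a 'sizable gap'."""
        boxes = sorted(match_boxes)
        if not boxes:
            return None
        top, bottom = boxes[0]
        for box_top, box_bottom in boxes[1:]:
            # Stop extending if the next match is far below, or if the slice
            # would grow beyond the first third of the page.
            if box_top - bottom > max_gap or (box_bottom - top) > page_height / 3:
                break
            bottom = max(bottom, box_bottom)
        return top, bottom  # the slice is always the full page width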

Notes
  • The WDL management software records the OCR source and review status for each item. This makes it safe to automatically reprocess items when new versions of our software are released without the chance of inadvertently overwriting OCR text which was provided by a partner or which has been hand-corrected.
  • You might be wondering why the highlighting work is performed on the client side rather than having the server return highlighted images. In addition to reducing server load, this design improves performance because a given image segment can be reused for multiple results on the same page (rounding the coordinates improves the cache hit ratio significantly) and both the image and word coordinates can be cached independently by CDN edge servers rather than requiring a full round-trip back to the server each time.
  • This benefit is most obvious when you open an item and start reading it: the same word coordinates used on the search results page can be reused by the viewer, and since the page images don’t have to be customized with search highlighting, they’re likely to be cached on the CDN. If you change your search text while viewing the book, highlighting for the current page will be updated immediately without having to wait for the server to respond.

adams080414image6

Challenges & Future Directions

This approach works relatively well but there are a number of areas for improvement:

  • The process described above leaves considerable room for the OCR itself to be improved, whether through more sophisticated image processing, OCR engine training, or workflow systems incorporating human review and correction.
  • For collections such as WDL’s, which include older items, OCR accuracy is reduced by the condition of the materials and by typographic conventions like the long s (ſ) or ligatures that are no longer in common usage. The Early Modern OCR Project is working on this problem and will hopefully provide a solution for many needs.
  • Finally, there’s considerable appeal to crowd-sourcing corrections as demonstrated by the National Library of Australia’s wonderful Trove project and various experimental projects such as the UMD MITH ActiveOCR project.
  • This research area is of benefit to any organization with large digitized collections, particularly projects with an eye towards generic reuse. Ed Summers and I have casually discussed the idea for a simple web application which would display images with the corresponding hOCR with full version control, allowing the review and correction process to be a generic workflow step for many different projects.
Categories: Planet DigiPres

Computational Linguistics & Social Media Data: An Interview with Bryan Routledge

1 August 2014 - 1:15pm

Bryan Routledge, Associate Professor of Finance, Tepper School of Business, Carnegie Mellon University.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture last week. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

What can a Yelp review or a single tweet reveal about society? How about hundreds of thousands of them? In this installment of the Insights Interviews series, I’m thrilled to talk with researcher Bryan Routledge about two of his projects that utilize a computational linguistic lens to analyze vast quantities of social media data. You can read the article on word choice used in online restaurant reviews here. The article about using Twitter as a predictive tool as compared with traditional public opinion polls is here (PDF).

Julia: The research group Noah’s ARK at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University aims in part to “analyze the textual content of social media, including Twitter and blogs, as data that reveal political, linguistic, and economic phenomena in society.”  Can you unpack this a bit for us? What kind of information can social media provide that other kinds of data can’t?

Bryan: Noah Smith, my colleague in the school of computer science at CMU, runs that lab.  He is kind enough to let me hang out over there.  The research we are working on looks at the connection between text and social science (e.g., economics, finance).  The idea is that looking at text through the lens of a forecasting problem — the statistical model between text and some social-science measured variable — gives insight into both the language and social parts.  Online and easily accessed text brings new data to old questions in economics.  More interesting, at least to me, is that grounding the text/language with quantitative external measures (volatility, citations, etc.) gives insight into the text.  What words in corporate 10K annual reports correlate with stock volatility and how that changes over time is cool.

Different metaphors for expensive and inexpensive restaurants in Yelp reviews. From: Dan Jurafsky, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday 19:4.

Julia: Your work with social media—Yelp and Twitter—are notable for their large sample sizes and emphasis on quantitative methods, using over 900,000 Yelp reviews and 1 billion tweets. How might archivists of social media better serve social science research that depends on these sorts of data sets and methods?

Bryan: That is a good question.  What makes it very hard for archivists is that collecting the right data without knowing the research questions is hard.  The usual answer of “keep everything!” is impractical.  Google’s n-gram project is a good illustration.  They summarized a huge volume of books with word counts (two word pairs, …) by time.  This is great for some research.  But not for the more recent statistical models that use sentences and paragraph information.

Julia: Your background and most of your work are in the field of finance, which you have characterized as being fundamentally about predicting the behavior of people. How do you see financial research being influenced by social media and other born-digital content? Could you tell us a bit about what it means to have a financial background doing this kind of research? What can the fields of finance and archives learn from each other?

An example from Yelp reviews of Manhattan restaurants with “steak” on the menu: predicting the (log) price of a menu item from the words used to describe it, by location. In most locations, the word “baby” is neutral — it suggests neither high nor low price. In the Wall Street area of lower Manhattan, however, it is associated with higher-priced steak.

Bryan:  Finance (and economics) is about the collective behavior of large number of people in markets.  To make research possible you need simple models of individuals.  Getting the right mix of simplicity and realism is age-old and ongoing research in the area.  More data helps.  Macroeconomic data like GDP and stock returns is informative about the aggregate.  Data on, say, individual portfolio choices in 401K plans lets you refine models.  Social media data is this sort of disaggregated data.  We can get a signal, very noisy, about what is behind an individual decision.  Whether that is ultimately helpful for guiding financial or economic policy is an open, but exciting, question.

More generally, working across disciplines is interesting and fun.  It is not always “additive.”  The research we have done on menus has nothing to do with finance (other than my observation that in NY restaurants near Wall Street, the word “baby” is associated with expensive menu items).  But if we can combine, for example, decision theory finance with generative text models, we get some cool insights into purposefully drafted documents.

Julia: The data your team collected from Yelp was gathered from the site. Your data from Twitter was collected using Twitter’s Streaming API and “Gardenhose,” which deliver a random sampling of tweets in real-time. I’d be curious to hear what role you think content holders like Yelp or Twitter can or could play in providing access to this kind of raw data.

Bryan: As a researcher with only the interests of science at heart, it would be best if they just gave me access to all their data!  Given that much of the data is valuable to the companies (and privacy, of course), I understand that is not possible.  But it is interesting that academic research, and data-sharing more generally, is in a company’s self-interest.  Twitter has encouraged a whole ecosystem that has helped them grow.  Many companies have an API for that purpose that happens to work nicely for academic research.  In general, open access is most preferred in academic settings so that all researchers have access to the same data.  Interesting papers using proprietary access to Facebook are less helpful than Twitter.

Julia: Could you tell us a bit about how you processed and organized the data for analysis and how you are working to manage it for the future? Given that reproducibility is such an important concept for science, what ways are you approaching ensuring that your data will be available in the future?

Bryan: This is not my strong suit.  But at a high level, the steps are (roughly) “get,” “clean,” “store,” “extract,” “experiment.”  The “get” varies with the data source (an API).  The “clean” step is just a matter of being careful with special characters and making sure data are lining up into fields right.  If the API is sensible, the “clean” is easy.  We usually store things in a JSON format that is flexible.  This is usually a good format to share data.  The “extract” and “experiment” steps depend on what you are interested in.  Word counts? Phrase counts? Other?  The key is not to jump from “get” to “extract” — storing the data in as raw form as possible makes things flexible.
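
As a toy illustration of the “keep the raw JSON, extract later” pattern Bryan describes (the file name and the text field below are hypothetical, not his group's actual pipeline), a word-count extraction over a file with one JSON record per line might look like:

    import json
    from collections import Counter

    def extract_word_counts(jsonl_path, text_field="text"):
        """Walk a file of raw JSON records (one per line) and tally word
        frequencies, leaving the raw records untouched so that other
        extractions can be run over the same data later."""
        counts = Counter()
        with open(jsonl_path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                record = json.loads(line)
                counts.update(record.get(text_field, "").lower().split())
        return counts

    # Example (hypothetical file of tweets or reviews stored as JSON lines):
    # top_words = extract_word_counts("reviews.jsonl").most_common(20)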

Julia:  What role, or potential role, do you see for the future of libraries, archives and museums in working with the kinds of data you collect? That is, while your data is valuable for other researchers now, things like 700,000 Yelp reviews of restaurants will be invaluable to all kinds of folks studying culture, economics and society 10, 20, 50 and 100 years from now. So, what kind of role do you think cultural heritage institutions could play in the long-term stewardship of this cultural data? Further, what kinds of relationships do you think might be able to be arranged between researchers and libraries, archives, and museums? For instance, would it make sense for a library to collect, preserve, and provide access to something like the Yelp review data you worked with? Or do you think they should be collecting in other ways?

Sentiment on Twitter as compared to Gallup Poll. Appeared in From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series. Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge and Noah A. Smith. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM 2010), pages 122–129, Washington, DC, May 2010

Bryan: This is also a great question and also one for which I do not have a great answer.  I do not know a lot about the research in “digital humanities,” but that would be a good place to look.  People doing digital text-based research on a long-horizon panel of data should provide some insight into what sorts of questions people ask.  Similarly, economic data might provide some hints.  Finance, for example, has a strong empirical component that comes from having easy-to-access stock data (the CRSP).  The hard part for libraries is figuring out which parts to keep.  Sampling Twitter, for example, gets a nice time-series of data but loses the ability to track a group of users or Twitter conversations.

Julia: Talking about the paper you co-authored that analyzed Yelp reviews, Dan Jurafsky said “when you write a review on the web you’re providing a window into your own psyche – and the vast amount of text on the web means that researchers have millions of pieces of data about people’s mindsets.” What do you think are some of the possibilities and limitations for analyzing social media content?

Bryan: There are many limitations, of course.  Twitter and Yelp are not just providing a window into things, they are changing the way the world works.  “Big data” is not just about larger sample sizes of draws from a fixed distribution.  Things are non-stationary.  (In an early paper using Twitter data, we could see the “Oprah” effect as the number of users jumped in the day following her show about Twitter).  Similarly, the data we see in social media is not a representative sample of society cross section.  But both of these are the sort of things good modeling – statistical, economic – should, and do, aim to capture.  The possibilities of all this new data are exciting.  Language is a rich source of data with challenging models needed to turn it into useful information.  More generally, social media is an integral part of many economic and social transactions.  Capturing that in a tractable model makes for an interesting research agenda.

Categories: Planet DigiPres

Digital Preservation 2014: It’s a Thing

30 July 2014 - 12:56pm

“Digital preservation makes headlines now, seemingly routinely. And the work performed by the community gathered here is the bedrock underlying such high profile endeavors.” – Matt Kirschenbaum

The registration table at Digital Preservation 2014. Photo credit: Erin Engle.

The annual Digital Preservation meeting, held each summer in Washington, DC, brings together experts in academia, government and the private and non-profit sectors to celebrate key work and share the latest developments, guidelines, best practices and standards in digital preservation.

Digital Preservation 2014, held July 22-24,  marked the 13th major meeting hosted by NDIIPP in support of the broad community of digital preservation practitioners (NDIIPP held two meetings a year from 2005-2007), and it was certainly the largest, if not the best. Starting with the first combined NDIIPP/National Digital Stewardship Alliance meeting in 2011, the annual meeting has rapidly evolved to welcome an ever-expanding group of practitioners, ranging from students to policy-makers to computer scientists to academic researchers. Over 300 people attended this year’s meeting.

“People don’t need drills; they need holes,” stated NDSA Coordinating Committee chairman Micah Altman, the Director of Research at the Massachusetts Institute of Technology Libraries,  in an analogy to digital preservation in his opening talk. As he went on to explain, no one needs digital preservation for its own sake, but it’s essential to support the rule of law, a cumulative evidence base, national heritage, a strategic information reserve, and to communicate to future generations. It’s these challenges that face the current generation of digital stewardship practitioners, many of which are addressed in the 2015 National Agenda for Digital Stewardship, which Altman previewed during his talk (and which will appear later this fall).

A breakout session at Digital Preservation 2014. Photo credit: Erin Engle.

One of those challenges is the preservation of the software record, which was eloquently illuminated by Matt Kirschenbaum, the Associate Director of the Maryland Institute for Technology in the Humanities, during his stellar talk, “Software, It’s a Thing.” Kirschenbaum ranged widely across computer history, art, archeology and pop culture with a number of essential insights. One of the more piquant was his sorting of software into different categories of “things” (software as asset, package, shrinkwrap, notation/score, object, craft, epigraphy, clickwrap, hardware, social media, background, paper trail, service, big data), each of which with its own characteristics. As Kirschenbaum eloquently noted, software is many different “things,” and we’ll need to adjust our future approaches to preservation accordingly.

Associate Professor at the New School Shannon Mattern took yet another refreshing approach, discussing the aesthetics of creative destruction and the challenges of preserving ephemeral digital art. As she noted, “by pushing certain protocols to their extreme, or highlighting snafus and ‘limit cases’ these artists’ work often brings into stark relief the conventions of preservation practice, and poses potential creative new directions for that work.”

Stephen Abrams, Martin Klein, Jimmy Lin and Michael Nelson during the “Web Archiving” panel. Photo credit: Erin Engle.

These three presentations on the morning of the first day provided a thoughtful intellectual substrate upon which a huge variety of digital preservation tools, services, practices and approaches were elaborated over the following days. As befits a meeting that convenes disparate organizations and interests, collaboration and community were big topics of discussion.

A Tuesday afternoon panel on “Community Approaches to Digital Stewardship” brought together a quartet of practitioners who are working collaboratively to advance digital preservation practice across a range of organizations and structures, including small institutions (the POWRR project); data stewards (the Research Data Alliance); academia (the Academic Preservation Trust); and institutional consortiums (the Five College Consortium).

Later, on the second day, a well-received panel on the “Future of Web Archiving” displayed a number of clever collaborative approaches to capturing the digital materials from the web, including updates on the Memento project and Warcbase, an open-source platform for managing web archives.

CurateCamp: Digital Culture. Photo credit: Erin Engle.

In between there were plenary sessions on stewarding space and research data, and over three dozen lightning talks, posters and breakout sessions covering everything from digital repositories for museum collections to a Brazilian digital preservation network to the debut of a new digital preservation questions and answers tool. Additionally, a CurateCamp unconference on the topic of “Digital Culture” was held on a third day at Catholic University, thanks to the support of the CUA Department of Library and Information Science.

The main meeting closed with a thought-provoking presentation from artist and digital conservator Dragan Espenschied. Espenschied utilized emulation and other novel tools to demonstrate some of the challenges related to presenting works authentically, in particular works from the early web and those dependent on a range of web services. Espenschied, also the Digital Conservator at Rhizome, has an ongoing project, One Terabyte of Kilobyte Age, that explores the material captured in the Geocities special collection. Associated with that project is a Tumblr he created that automatically generates a new screenshot from the Geocities archive collection every 20 minutes.

Web history, data stewardship, digital repositories; for digital preservation practitioners it was nerd heaven. Digital Preservation 2014, it’s a thing. Now on to 2015!

Categories: Planet DigiPres

Art is Long, Life is Short: the XFR Collective Helps Artists Preserve Magnetic and Digital Works

29 July 2014 - 2:44pm

XFR STN (“Transfer Station”) is a grass-roots digitization and digital-preservation project that arose as a response from the New York arts community to rescue creative works off of aging or obsolete audiovisual formats and media. The digital files are stored by the Library of Congress’s NDIIPP partner the Internet Archive and accessible for free online. At the recent Digital Preservation 2014 conference, the NDSA gave XFR STN the NDSA Innovation Award. Last month, members of the XFR collective — Rebecca Fraimow, Kristin MacDonough, Andrea Callard and Julia Kim — answered a few questions for the Signal.

"VHS 1" from XFR Collective.

“VHS 1,” courtesy of Walter Forsberg.

Mike: Can you describe the challenges the XFR Collective faced in its formation?

XFR: Last summer, the New Museum hosted a groundbreaking exhibit called XFR STN.  Initiated by the artist collective Colab and the resulting MWF Video Club, the exhibit was a major success. By the end of the exhibition, over 700 videos had been digitized, with many available online through the Internet Archive.

It was clear to all of us involved that there was a real demand for these services and that there were many under-served artists having difficulty preserving and accessing their own media. Many of the people involved with the exhibit became passionate about continuing the service of preserving obsolete magnetic and digital media for artists.  We wanted to offer a long-term, non-commercial, grassroots solution.

Using the experience of working on XFR STN as a jumping-off point, we began developing XFR Collective as a separate nonprofit initiative to serve the need that we saw.  Over the course of our development, we’ve definitely faced — and are still facing — a number of challenges in order to make ourselves effective and sustainable.

"VHS 3" by XFR Collective.

“VHS 2,” courtesy of Walter Forsberg.

Perhaps the biggest challenge has simply been deciding what form XFR Collective was going to take.  We started out with a bunch of borrowed equipment and a lot of enthusiasm, so the one thing we knew we could do was digitize, but we had to sit down and really think about things like organizational structure, sustainable pricing for our services, and the convoluted process of becoming a non-profit.

Eventually, we settled on a membership-based structure in order to be able to keep our costs as low as possible.  A lot of how we’re operating is still very experimental — this summer wraps up our six-month test period, during which we limited ourselves to working with only a small number of partners to allow us to figure out what our capacity was and how we could design our projects in the future.

We’ve got a number of challenges still ahead of us — finding a permanent home is a big one — and we still feel like we’re only just getting started, in terms of what we can do for the community of artists who use our services.  It’s going to be interesting for all of us to see how we develop.  We’ve started thinking of ourselves as kind of a grassroots preservation test kitchen. We’ll try almost any kind of project once to see if it works!

Mike: Where are the digital files stored? Who maintains them?

XFR: Our digital files will be stored with the membership organizations and uploaded to the Internet Archive for access and for long-term open-source preservation.  This is an important distinction that may confuse some people: XFR Collective is not an archive.

While we advocate and educate about best practices, we will not hold any of the digital files ourselves; we just don’t have the resources to maintain long-term archival storage.  We encourage material to go onto the Internet Archive because long-term accessibility is part of our mission and because the Internet Archive has the server space to store uncompressed and lossless files as well as access files.  That way if something happens to the storage that our partners are using for their own files, they can always re-download them.  But we can’t take responsibility for those files ourselves. We’re a service point, not a storage repository.

"VHS 2" by XFR Collective

“VHS 3,” courtesy of Walter Forsberg.

Mike: Regarding public access as a means of long-term preservation and sustainability, how do you address copyrighted works?

XFR: This is a great question that confounds a lot of our collaborators initially.  Access-as-preservation creates a lot of intellectual property concerns.  Still, we’re a very small organization, so we can afford to take more risks than a more high-profile institution.  We don’t delve too deeply into the area of copyright; our concern is with the survival of the material.  If someone has a complaint, the Internet Archive will give us a warning in time to re-download the content and then remove it. But so far we haven’t had any complaints.

Mike: What open access tools and resources do you use?

XFR: The Internet Archive itself is something of an open access resource and we’re seeing it used more and more frequently as a kind of accessory to preservation, which is fantastic.  Obviously it’s not the only solution, and you wouldn’t want to rely on that alone any more than you would any kind of cloud storage, but it’s great to have a non-commercial option for streaming and storage that has its own archival mission and that’s open to literally anyone and anything.

Mike:  If anyone is considering a potential collaboration to digitally preserve audiovisual artwork, what can they learn from the experiences of the XFR Collective?

XFR: Don’t be afraid to experiment!  A lot of what we’ve accomplished is just by saying to ourselves that we have to start doing something, and then jumping in and doing it.  We’ve had to be very flexible. A lot of the time we’ll decide something as a set proposition and then find ourselves changing it as soon as we’ve actually talked with our partners and understood their needs.  We’re evolving all the time but that’s part of what makes the work we do so exciting.

We’ve also had a lot of help and we couldn’t have done any of what we’ve accomplished without support and advice from a wide network of individuals, ranging from the amazing team at XFR STN to video archivists across New York City.  None of these collaborations happen in a vacuum, so make friendships, make partnerships, and don’t be nervous about asking for advice.  There are a lot of people out there who care about video preservation and would love to see more initiatives out there working to make it happen.

Categories: Planet DigiPres

The MH17 Crash and Selective Web Archiving

28 July 2014 - 4:34pm

The following is a guest post by Nicholas Taylor, Web Archiving Service Manager for Stanford University Libraries.

Screenshot of 17 July 2014 15:57 UTC archive snapshot of deleted VKontakte Strelkov blog post regarding downed aircraft, on Internet Archive Wayback Machine.

The Internet Archive Wayback Machine has been mentioned in several news articles within the last week (see here, here and here) for having archived a since-deleted blog post from a Ukrainian separatist leader touting his shooting down of a military transport plane, which may have actually been Malaysia Airlines Flight 17. At this early stage in the crash investigation, the significance of the ephemeral post is still unclear, but it could prove to be a pivotal piece of evidence.

An important dimension of the smaller web archiving story is that the blog post didn’t make it into the Wayback Machine by the serendipity of Internet Archive’s web-wide crawlers; an unknown but apparently well-informed individual identified it as important and explicitly designated it for archiving.

Internet Archive crawls the Web every few months, tends to seed those crawls from online directories or compiled lists of top websites that favor popular content, archives more broadly across websites than it does deeply on any given website, and embargoes archived content from public access for at least six months. These parameters make the Internet Archive Wayback Machine an incredible resource for the broadest possible swath of web history in one place, but they don’t dispose it toward ensuring the archiving and immediate re-presentation of a blog post with a three-hour lifespan on a blog that was largely unknown until recently.

Recognizing the value of selective web archiving for such cases, many memory organizations engage in more targeted collecting. Internet Archive itself facilitates this approach through its subscription Archive-It service, which makes web archiving approachable for curators and many organizations. A side benefit is that content archived through Archive-It propagates with minimal delay to the Internet Archive Wayback Machine’s more comprehensive index. Internet Archive also provides a function to save a specified resource into the Wayback Machine, where it immediately becomes available.
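
As a rough sketch of how simple that save function is to drive programmatically (this is an unofficial example, not an Internet Archive-documented client, and the Content-Location behavior reflects how the endpoint has commonly responded rather than a guaranteed contract):

    import requests  # third-party HTTP client

    def save_page_now(url):
        """Ask the Wayback Machine's public 'Save Page Now' endpoint to
        archive a URL immediately.  Returns the snapshot URL when the
        service reports one in the Content-Location header."""
        response = requests.get(f"https://web.archive.org/save/{url}", timeout=60)
        response.raise_for_status()
        location = response.headers.get("Content-Location", "")
        return f"https://web.archive.org{location}" if location else None

    # Example (hypothetical URL):
    # snapshot = save_page_now("http://example.com/some-ephemeral-post")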

Considering the six-month access embargo, it’s safe to say that the provenance of everything that has so far been archived and re-presented in the Wayback Machine relating to the five-month-old Ukraine conflict is either the Archive-It collaborative Ukraine Conflict collection or the Wayback Machine Save Page Now function. In other words, all of the content preserved and made accessible to date, including the key blog post, reflects deliberate curatorial decisions on the part of individuals and institutions.

A curator at the Hoover Institution Library and Archives with a specific concern for the VKontakte Strelkov blog actually added it to the Archive-It collection with a twice-daily capture frequency at the beginning of July. Though the key blog post was ultimately recorded through the Save Page Now feature, what’s clear is that subject area experts play a vital role in focusing web archiving efforts and, in this case, facilitated the preservation of a vital document that would not otherwise have been archived.

At the same time, selective web archiving is limited in scope and can never fully anticipate what resources the future will have wanted us to save, underscoring the value of large-scale archiving across the Web. It’s a tragic incident but an instructive example of how selective web archiving complements broader web archiving efforts.

Categories: Planet DigiPres

Understanding the Participatory Culture of the Web: An Interview with Henry Jenkins

24 July 2014 - 10:51am

Henry Jenkins, Provost Professor of Communication, Journalism, and Cinematic Arts, with USC Annenberg School for Communication and the USC School of Cinematic Arts.

The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and is working on a range of projects related to CurateCamp Digital Culture. This is part of an ongoing series of interviews Julia is conducting to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.

Anyone who has ever liked a TV show’s page on Facebook or proudly sported a Quidditch t-shirt knows that being a fan goes beyond the screen or page.  With the growth of countless blogs, tweets, Tumblr gifsets, YouTube videos, Instagram hashtags, fanart sites and fanfiction sites, accessing fan culture online has never been easier. Whether understood as a vernacular web or as the blossoming of a participatory culture, individuals across the world are using the web to respond to and communicate their own stories.

As part of the NDSA Insights interview series, I’m delighted to interview Henry Jenkins, professor at the USC Annenberg School for Communication and self-proclaimed Aca-Fan. He is the author of one of the foundational works exploring fan cultures, “Textual Poachers: Television Fans and Participatory Culture,” as well as a range of other books, including “Convergence Culture: Where Old and New Media Collide.” Most recently, he co-authored (with Sam Ford and Joshua Green) “Spreadable Media: Creating Value and Meaning in a Networked Culture.” He blogs at Confessions of an Aca-Fan.

Julia: You state on your website that your time at MIT, “studying culture within one of the world’s leading technical institutions” gave you “some distinctive insights into the ways that culture and technology are reshaping before our very eyes.”  How so? What are some of the changes you’ve observed, from a technical perspective and/or a cultural one?

Henry: MIT was one of the earliest hubs in the Internet. When I arrived there in 1989, Project Athena was in its prime; the MIT Media Lab was in its first half decade and I was part of a now legendary Narrative Intelligence Reading Group (PDF) which brought together some of the smartest of their graduate students and a range of people interested in new media from across Cambridge; many of the key thinkers of early network culture were regular speakers at MIT; and my students were hatching ideas that would become the basis for a range of Silicon Valley start-ups. And it quickly became clear to me that I had a ringside seat for some of the biggest transformations in the media landscape in the past century, all the more so because through my classes, the students were helping me to make connections between my work on fandom as a participatory culture and a wide array of emerging digital practices (from texting to game mods).

Kresge Auditorium, MIT, Historic American Buildings Survey/Historic American Engineering Record/Historic American Landscapes Survey, Library of Congress Prints and Photographs Division, http://hdl.loc.gov/loc.pnp/hhh.ma1361/photos.080151

Studying games made sense at MIT because “Spacewar,” one of the first known uses of computers for gaming, had been created by the MIT Model Railroad club in the early 1960s. I found myself helping to program a series that the MIT Women’s Studies Program was running on gender and cyberspace, from which the materials for my book, “From Barbie to Mortal Kombat” emerged. Later, I would spend more than a decade as the housemaster of an MIT dorm, Senior House, which is known to be one of the most culturally creative at the Institute.

Through this, I was among the first outside of Harvard to get a Facebook account; I watched students experimenting with podcasting, video-sharing and file-sharing. Having MIT after my name opened doors at all of the major digital companies and so I was able to go behind the scenes as some of these new technologies were developing, and also see how they were being used by my students in their everyday lives.

So, through the years, my job was to place these developments in their historical and cultural contexts — often literally as Media Lab students would come to me for advice on their dissertation projects, but also more broadly as I wrote about these developments for Technology Review, the publication for MIT’s alumni network. It was there that many of the ideas that would form “Convergence Culture” were first shared with my readers. And the students who came through the Comparative Media Studies graduate program have been at ground zero for some of the key developments in the creative industries in recent years — from the Veronica Mars Kickstarter campaign to the community-building practices of Etsy, from key developments in the games and advertising industries to cutting-edge experiments in transmedia storytelling. The irony is that I had been really reluctant about accepting the MIT job because I suffer from fairly serious math phobia. :-)

Today, I enjoy another extraordinary vantage point as a faculty member at USC, embedded in both the Annenberg School for Communication and Journalism and the Cinema School, and thus positioned to watch how Hollywood and American journalism are responding to the changes that networked communication has forced upon them. I am able to work with future filmmakers who are trying to grasp a shift from a focus on individual stories to an emphasis on world-building, journalists who are trying to imagine new relationships with their publics, and activists who are seeking to make change by any media necessary.

Julia: Much of your work has focused on reframing the media audience as active and creative participants in creating media, rather than passive consumers. You’ve critiqued the use of the terms “viral” and “memes” to describe internet phenomena as “stripping aside the concept of human agency,” and argued that the biological language “confuses the actual power relations between producers, properties, brands and consumers.” Can you unpack some of your critiques for us? What is at stake?

Henry: At the core of “Spreadable Media” is a shift in how media travels across the culture. On the one hand, there is distribution as we have traditionally understood it in the era of mass media where content flows in patterns regulated by decisions made by major corporations who control what we see, when we see it and under what conditions. On the other hand, there is circulation, a hybrid system, still shaped top-down by corporate players, but also bottom-up by networks of everyday people, who are seeking to move media that is meaningful to them across their social networks, and will take media where they want it when they want it through means both legal and illegal. The shift towards a circulation-based model for media access is disrupting and transforming many of our media-related practices, and it is not explained well by a model which relies so heavily on metaphors of infection and assumptions of irrationality.

The idea of viral media is a way that the broadcasters hold onto the illusion of their power to set the media agenda at a time when that power is undergoing a crisis. They are the ones who make rational calculations, able to design a killer virus which infects the masses, so they construct making something go viral as either arcane knowledge that can be sold at a price by those in the know or as something that nobody understands, “It just went viral!” But, in fact, we are seeing people, collectively and individually, make conscious decisions about what media to pass to which networks for what purposes with what messages attached through which media channels, and we are seeing activist groups, religious groups, indie media producers, educators and fans make savvy decisions about how to get their messages out through networked communications.

Julia: Cases like the Harry Potter Alliance suggest the range of ways that fan cultures on the web function as a significant cultural and political force. Given the significance of fandom, what kinds of records of their online communities do you think will be necessary in the future for us to understand their impact? Said differently, what kinds of records do you think cultural heritage organizations should be collecting to support the study of these communities now and into the future?

Henry: This is a really interesting question. My colleague, Abigail De Kosnik at UC-Berkeley, is finishing up a book right now which traces the history of the fan community’s efforts to archive their own creative output over this period, which has been especially precarious, since we’ve seen some of the major corporations which fans have used to spread their cultural output to each other go out of business and take their archives away without warning or change their user policies in ways that forced massive numbers of people to take down their content.

Image of Paper Print Films in Library of Congress collection. Jenkins notes this collection of prints likely makes it easier to write the history of the first decade of American cinema than to write the history of the first decade of the web.

The reality is that it is probably already easier to write the history of the first decade of American cinema, because of the paper print collection at the Library of Congress, than it is to write the history of the first decade of the web. For that reason, there has been surprisingly little historical research into fandom — even though some of the communication practices that fans use today go back to the publication practices of the Amateur Press Association in the mid-19th century. And even recently, major collections of fan-produced materials have been shunted from library to archive with few in your realm recognizing the value of what these collections contain.

Put simply, many of the roots of today’s more participatory culture can be traced back to fan practices over the last century. Fans have been amongst the leading innovators in terms of the cultural uses of new media. But collecting this material is going to be difficult: fandom is a dispersed but networked community which does not work through traditional organizations; there are no gatekeepers (and few recordkeepers) in fandom, and the scale of fan production — hundreds of thousands if not millions of new works every year — dwarfs that of commercial publishing. And that’s to focus only on fan fiction; it does not even touch the new kinds of fan activism that we are documenting for my forthcoming book, By Any Media Necessary. So, there is an urgent need to archive some of these materials, but the mechanisms for gathering and appraising them are far from clear.

Julia: Your New Media Literacy project aims in part to “provide adults and youth with the opportunity to develop the skills, knowledge, ethical framework and self-confidence needed to be full participants in the cultural changes which are taking place in response to the influx of new media technologies, and to explore the transformations and possibilities afforded by these technologies to reshape education.” In one of your pilot programs, for instance, students studied “Moby-Dick” by updating the novel’s Wikipedia page. Can you tell us a little more about this project? What are some of your goals? Further, what opportunities do you think libraries have to enable this kind of learning?

Henry: We documented this project through our book, “Reading in a Participatory Culture,” and through a free online project, Flows of Reading. It was inspired by the work of Ricardo Pitts-Wiley, the head of the Mixed Magic Theater in Rhode Island, who was spending time going into prisons to get young people to read “Moby-Dick” by getting them to rewrite it, imagining who these characters would be and what issues they would be confronting if they were part of the cocaine trade in the 21st century as opposed to the whaling trade in the 19th century. This resonated with the work I have been doing on fan rewriting and fan remixing practices, as well as what we know about, for example, the ways hip hop artists sample and build on each other’s work.

So, we developed a curriculum which brought together Melville’s own writing and reading practices (as the master mash-up artist of his time) with Pitts-Wiley’s process in developing a stage play that was inspired by his work with the incarcerated youth and with a focus on the place of remix in contemporary culture. We wanted to give young people tools to think ethically and meaningfully about how culture is actually produced and to give teachers a language to connect the study of literature with contemporary cultural practices. Above all, we wanted to help students learn to engage with literary texts creatively as well as critically.

We think libraries can be valuable partners in such a venture, all the more so as regimes of standardized testing make it hard for teachers to bring complex 19th century novels like “Moby-Dick” into their classes or focus student attention on the process and cultural context of reading and writing as literacy practices. Doing so requires librarians to think of themselves not only as curators of physical collections but as mentors and coaches who help students confront the larger resources and practices opened up to them through networked communication. I’ve found librarians and library organizations to be vital partners in this work through the years.

Julia: Your latest book is on the topic of “spreadable media,” arguing that “if it doesn’t spread, it’s dead.”  In a nutshell, how would you define the term “spreadable media”?

Henry:  I talked about this a little above, but let me elaborate. We are proposing spreadable media as an alternative to viral media in order to explain how media content travels across a culture in an age of Facebook, Twitter, YouTube, Reddit, Tumblr, etc. The term emphasizes the act of spreading and the choices which get made as people appraise media content and decide what is worth sharing with the people they know. It places these acts of circulation in a cultural context rather than a purely technological one. At the same time, the word is intended to contrast with older models of “stickiness,” which work on the assumption that value is created by locking down the flow of content and forcing everyone who wants your media to come to your carefully regulated site. This assumes a kind of scarcity where we know what we want and we are willing to deal with content monopolies in order to get it.

But, the reality is that we have more media available to us today than we can process: we count on trusted curators — primarily others in our social networks but also potentially those in your profession — to call media to our attention, and the media needs to be able to move to where the conversations are taking place or remain permanently hidden from view. That’s the spirit of “If it doesn’t spread, it’s dead!” If we don’t know about the media, if we don’t know where to find it, if it’s locked down where we can’t easily get to it, it becomes irrelevant to the conversations in which we are participating. Spreading increases the value of content.

Julia: What does spreadable media mean to the conversations libraries, archives and museums could  have with their patrons? How can archives be more inclusive of participatory culture?

Henry:  Throughout the book, we use the term “appraisal” to refer to the choices everyday people make, collectively and personally, about what media to pass along to the people they know. Others are calling this process “curating.” But either way, the language takes us immediately to the practices which used to be the domain of “libraries, archives, and museums.” You were the people who decided what culture mattered, what media to save from the endless flow, what media to present to your patrons. But that responsibility is increasingly being shared with grassroots communities, who might “like” something or “vote something up or down” through their social media platforms, or simply decide to intensify the flow of the content through tweeting about it.

We are seeing certain videos reach incredible levels of circulation without ever passing through traditional gatekeepers. Consider “Kony 2012,” which reached more than 100 million viewers in its first week of circulation, totally swamping the highest-grossing film at the box office that week (“Hunger Games”) and the most-watched series on American television (“Modern Family”), without ever being broadcast in a traditional sense. Minimally, that means that archivists may be confronting new brokers of content, museums will be confronting new criteria for artistic merit, and libraries may need to work hand in hand with their patrons as they identify the long-term information needs of their communities. It doesn’t mean letting go of their professional judgment, but it does mean examining their prejudices about what forms of culture might matter, and it does mean creating mechanisms, such as those around crowd-sourcing and perhaps even crowd-funding, which help to ensure greater responsiveness to public interests.

Julia: You wrote in 2006 that there is a lack of fan involvement with works of high culture because “we are taught to think about high culture as untouchable,” which in turn has to do with “the contexts within which we are introduced to these texts and the stained glass attitudes which often surround them.” Further, you argue that this lack of a fan culture makes it difficult to engage with a work, either intellectually or emotionally. Can you expand on this a bit? Do you still believe this to be the case, or has this changed with time? Does the existence of transformative works like “The Lizzie Bennet Diaries” on Youtube or vibrant Austen fan communities on Tumblr reveal a shift in attitudes? Finally, how can libraries, museums, and other institutions help foster a higher level of emotional and intellectual engagement?

Henry: Years ago, I wrote “Science Fiction Audiences” with the British scholar John Tulloch, in which we explored the broad range of ways that fans read and engaged with “Star Trek” and “Doctor Who.” Tulloch then went on to interview audiences at the plays of Anton Chekhov and discovered a much narrower range of interpretations and meanings — they repeated back what they had been taught to think about the Russian playwright rather than making more creative uses of their experience at the theater. This was probably the opposite of the way many culture brokers think about the high arts — as the place where we are encouraged to think and explore — and the popular arts — as works that are dumbed down for mass consumption. This is what I meant when I suggested that the ways we treat these works cut them off from popular engagement.

At the same time, I am inspired by recent experiments which merge the high and the low. I’ve already talked about Mixed Magic’s work with “Moby-Dick,” but “The Lizzie Bennet Diaries” is another spectacular example. It’s an inspired translation of Jane Austen’s world into the mechanisms of social media: gossip and scandal play such a central role in her works; she’s so attentive to what people say about each other and how information travels through various social communities. And the playful appropriation and remixing of “Pride and Prejudice” there has opened up Austen’s work to a whole new generation of readers who might otherwise have known it entirely through Sparknotes and plodding classroom instruction. There are certainly other examples of classical creators — from Gilbert and Sullivan to Charles Dickens and Arthur Conan Doyle — who inspire this kind of fannish devotion from their followers, but by and large, this is not the spirit with which these works get presented to the public by leading cultural institutions.

I would love to see libraries and museums encourage audiences to rewrite and remix these works, to imagine new ways of presenting them, which make them a living part of our culture again. Lawrence Levine’s “Highbrow/Lowbrow” contrasts the way people dealt with Shakespeare in the 19th century — as part of the popular culture of the era — with the ways we have assumed across the 20th century that an appreciation of the Bard is something which must be taught because it requires specific kinds of cultural knowledge and specific reading practices. Perhaps we need to reverse the tides of history in this way and bring back a popular engagement with such works.

Julia: You’re a self-described academic and fan, so I’d be interested in what you think are some particularly vibrant fan communities online that scholars should be paying more attention to.

Screenshot of the VlogBrothers, Hank and John Green, as they display a symbol of their channel in a video titled “How To Be a Nerdfighter: A Vlogbrothers FAQ”

Henry: The first thing I would say is that librarians, as individuals, have long been an active presence in the kinds of fan communities I study; many of them write and read fan fiction, for example, or go to fan conventions because they know these as spaces where people care passionately about texts, engage in active debates around their interpretation, and often have deep commitments to their preservation. So, many of your readers will not need me to point out the spaces where fandom is thriving right now; they will know that fans have been a central part of the growth of the Young Adult novel as a literary category that attracts a large number of adult readers, so they will be attentive to “Harry Potter,” “Hunger Games,” or the Nerdfighters (who are followers of the YA novels of John Green); they will know that fans are being drawn right now to programs like “Sleepy Hollow” which have helped to promote more diverse casting on American television; and they will know that now, as always, science fiction remains a central tool which incites the imagination and creative participation of its readers. The term Aca-Fan has been a rallying point for a generation of young academics who became engaged with their research topics in part through their involvement within fandom. Whatever you call them, there needs to be a similar movement to help librarians, archivists and curators come out of the closet, identify as fans, and deploy what they have learned within fandom more openly through their work.

Future Steward on Stewardship’s Future: An Interview with Emily Reynolds

23 July 2014 - 10:44am
Emily Reynolds, Winner of 2014 Future Steward NDSA Innovation Award.

Each year, the NDSA Innovation Working Group reviews nominations from members and non-members alike for the Innovation Awards. Most of those awards are focused on recognizing individuals, projects and organizations that are at the top of their game.

The Future Steward award is a little different. It’s focused on emerging leaders, and while the recipients of the Future Steward award have all made significant accomplishments, they have done so as students, learners and professionals in the early stages of their careers. Mat Kelly’s work on WARCreate, Martin Gengenbach’s work on forensic workflows and now Emily Reynolds’s work on digital preservation in a range of organizations exemplify how some of the most vital work in digital preservation is being taken on by some of the newest members of our workforce.

I’m thrilled to be able to talk with Emily, who picked up this year’s Future Steward award yesterday during the Digital Preservation 2014 meeting, about the range of her work and her thoughts on the future of the field. Emily was recognized for the quality of her work in a range of internships and student positions with the Interuniversity Consortium for Political and Social Research, the University of Michigan Libraries, the Library of Congress, Brooklyn Historical Society, StoryCorps, and, in particular, her recent work on the World Bank’s eArchives project.

Screenshot of the Arab American National Museum’s web archive collections.

Trevor: You have a bit of experience working with web archives at different institutions: scoping web archive projects with the Arab American National Museum, putting together use cases for the Library of Congress and in your coursework at the University of Michigan. Across these experiences, what are your reflections and thoughts on the state of web archiving for cultural heritage organizations?

Emily: It seems to me that many cultural heritage organizations are still uncertain as to where their web archive collections fit within the broader collections of their organization. Maureen McCormick Harlow, a fellow National Digital Stewardship Resident, often spoke about this dynamic; the collections that she created have been included in the National Library of Medicine’s general catalog. But for many organizations, web collections are still a novelty or a fringe part of the collections, and aren’t as discoverable. Because we’re not sure how the collections will be used, it’s difficult to provide access in a way that will make them useful.

I also think that there’s a bit of a skills gap, in terms of the challenges that web archiving can present, as compared to the in-house technical skills at many small organizations. Tools like Archive-It definitely lower the barrier to entry, but still require a certain amount of expertise for troubleshooting and understanding how the tool works. Even as the tools get stronger, the web becomes more and more complex and difficult to capture, so I can’t imagine that it will ever be a totally painless process.

Trevor: You have worked on some very different born-digital collections, processing born-digital materials for StoryCorps in New York and working on a TRAC self-audit at ICPSR, one of the most significant holders of social science data sets. While these are very different kinds of materials, I imagine there are some similarities too. Could you tell us a bit about what you did and what you learned working for each of these institutions? Further, I would be curious to hear what kinds of parallels or similarities you can draw from the work.

Image of a StoryCorps exhibit at the New Museum in which Emily participated.

Emily: At StoryCorps, I did a lot of hands-on work with incoming interviews and data, so I saw first-hand the amount of effort that goes into making such complex collections discoverable. Their full interviews are not currently available online, but need to be accessible to internal staff. At ICPSR, I was more on the policy side of things, getting an overview of their preservation activities and documenting compliance with the TRAC standard.

StoryCorps and ICPSR are an interesting pair of organizations to compare because there are some striking similarities in the challenges they face in terms of access. The complexity and variety of research data held by ICPSR requires specialized tools and standards for curation, discovery and reuse. Similarly, oral history interviews can be difficult to discover and use without extensive metadata (including, ideally, full transcripts). They’re specialized types of content, and both organizations have to be innovative in figuring out how to preserve and provide access to their collections.

ICPSR has a strong infrastructure and systems for normalizing and documenting the data they ingest, but this work still requires a great deal of human input and quality control. Similarly, metadata for StoryCorps interviews is input manually by staff. I think both organizations have done great work towards finding solutions that work for their individual context, although the tools for providing access to research data seem to have developed faster than those for oral history. I’m hopeful that with tools like Pop Up Archive that will change.

Trevor: Most recently, you’ve played a leadership role in the development of the World Bank’s eArchives project. Could you tell us a little about this project and some of the biggest things you learned from working on it?

Julia Blase and Emily Reynolds present on “Developing Sustainable Digital Archive Systems” at the ALA 2013 Midwinter Meeting. Photo by Jaime McCurry.

Emily: The eArchives program is an effort to digitize the holdings of the World Bank Group Archives that are of greatest interest to researchers. We don’t view our digitization as a preservation action (only insofar as it reduces physical wear and tear on the records), and are primarily interested in providing broader access to the records for our international user base. We’ve scanned around 1500 folders of records at this point, prioritizing records that have been requested by researchers and cleared for public disclosure through the World Bank’s Access to Information Policy.

The project has also included a component of improving the accessibility of digitized records and archival finding aids. We are in the process of launching a public online finding aid portal, using the open-source Access to Memory (AtoM) platform, which will contain the archives’ ISAD(G) finding aids as well as links to the digitized materials. Previously, the finding aids were contained in static HTML pages that needed to be updated manually; soon, the AtoM database will sync regularly with our internal description database. This is going to be a huge upgrade for the archivists, in terms of reducing duplication of work and making their efforts more visible to the public.
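
For readers curious what this kind of sync can look like in practice, here is a minimal sketch in Python. It is illustrative only, not the World Bank’s actual implementation (the interview does not describe one): the internal “descriptions” table, its field names and the field-to-column mapping are all hypothetical, and the CSV headers only follow the general shape of AtoM’s ISAD(G) description import template, so they should be checked against the template shipped with the AtoM version in use before importing.

# export_descriptions.py -- illustrative sketch only.
# Pulls archival descriptions from a hypothetical internal database and
# writes a CSV shaped roughly like AtoM's ISAD(G) description import
# template. The table name, column names and field mapping are assumptions,
# not the World Bank Group Archives' actual schema.

import csv
import sqlite3  # stand-in for whatever the internal description database runs on

# Mapping from internal field names (hypothetical) to AtoM-style CSV headers.
# Verify these headers against the CSV import template for your AtoM release.
FIELD_MAP = {
    "legacy_id": "legacyId",
    "parent_id": "parentId",
    "reference_code": "identifier",
    "title": "title",
    "level": "levelOfDescription",
    "dates": "eventDates",
    "extent": "extentAndMedium",
    "scope_and_content": "scopeAndContent",
    "digitized_object_url": "digitalObjectURI",  # link to the digitized folder, if any
}


def export_descriptions(db_path, csv_path):
    """Write every row of the internal descriptions table to an AtoM-style CSV.

    Returns the number of rows exported.
    """
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT {} FROM descriptions ORDER BY legacy_id".format(", ".join(FIELD_MAP))
    ).fetchall()

    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(FIELD_MAP.values()))
        writer.writeheader()
        for row in rows:
            writer.writerow({FIELD_MAP[field]: row[field] or "" for field in FIELD_MAP})

    conn.close()
    return len(rows)


if __name__ == "__main__":
    count = export_descriptions("descriptions.db", "atom_import.csv")
    print("Exported {} descriptions to atom_import.csv".format(count))
    # The CSV would then be loaded on the AtoM server, typically with AtoM's
    # command-line CSV import task run from the application root, e.g.
    #   php symfony csv:import atom_import.csv
    # and the export/import pair scheduled (with cron or similar) so the public
    # portal stays in step with the internal description database.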

It’s been really interesting to collaborate with the archives staff throughout the process of launching our AtoM instance. I’ve been thinking a lot about how compliance with archival standards can actually make records less accessible to the public, since the practices and language involved in finding aids can be esoteric and confusing to an outsider. It has been an interesting balance to ensure that the archivists are happy with the way the descriptions are presented, while also making the site as user-friendly as possible. Anne-Marie Viola, of Dumbarton Oaks, has written a couple of blog posts about the process of conducting usability testing on their AtoM instance, which have been a great resource for me.

Trevor: As I understand it, you are starting a new position as a program specialist with the Institute of Museum and Library Services. I realize you haven’t started yet, but could you tell us a bit about what you are going to be doing? Along with that, I would be curious to hear how you see your experience thus far fitting into work on federal funding for libraries and museums.

Emily: As a Program Specialist, I’ll be working in IMLS’s Library Discretionary Programs division, which includes grant programs like the Laura Bush 21st Century Librarian Program and the National Leadership Grants for Libraries. Among other things, I will be supporting the grant review process, communicating with grant applicants, and coordinating grant documentation. I’ll also have the opportunity to participate in some of the outreach that IMLS does with potential and existing grant applicants.

Even though I haven’t been in the profession for a very long time, I’ve had the opportunity to work in a lot of different areas, and as a result feel that I have a good understanding of the broad issues impacting all kinds of libraries today. I’m excited that I’ll be able to be involved in a variety of initiatives and areas, and to increase my involvement in the professional community. I’ve also been spoiled by the National Digital Stewardship Residency’s focus on professional development, and am excited to be moving on to a workplace where I can continue to attend conferences and stay up-to-date with the field.

Trevor: Staffing is a big concern for the future of access to digital information. The NDSA staffing survey gets into a lot of these issues. Based on your experience, what words of advice would you offer to others interested in getting into this field? How important do you think particular technical capabilities are? What made some of your internships better or more useful than others? What kinds of courses do you think were particularly useful? At this point, you’ve graduated alongside a whole cohort of students in your program. What kinds of things do you think made the difference for those who had an easier time getting started in their careers?

Emily: I believe that it is not the exact technical skills that are so important, but the ability to feel comfortable learning new ones, and the ability to adapt what one knows to a particular situation. I wouldn’t expect every LIS graduate to be adept at programming, but they should have a basic level of technical literacy. I took classes in GIS, PHP and MySQL, Drupal and Python, and while I would not consider myself an expert in any of these topics, they gave me a solid understanding of the basics, and the ability to understand how these tools can be applied.

I think it’s also important for recent graduates to be flexible about what types of jobs they apply for, rather than only applying for positions with “Librarian” or “Archivist” in the title. The work we do is applicable in so many roles and types of organizations, and I know that recent grads who were more flexible about their search were generally able to find work more quickly. I enjoyed your recent blog post on the subject of digital archivists as strategists and leaders, rather than just people who work with floppy discs instead of manuscripts. Of course this is easy for me to say, as I move to my first job outside of archives – but I think I’ll still be able to support and participate in the field in a meaningful way.
