In a previous blog post, the NDSA Standards and Practices Working Group announced the opening of a survey to rank issues in preserving video collections. The survey closed on August 2, 2014, and while there’s work ahead to analyze the results and develop action plans, we can share some preliminary findings.
We purposely cast a wide net in advertising the survey so that respondents represented a range of institutions, experience and collections. About 54% of the respondents who started the survey answered all the required questions.
The blog post on The Signal was the most popular means to get the word out (27%) followed by the Association of Moving Image Archivists list (13%) and the NDSA-ALL list (11%). A significant number of respondents (25%) were directed to the survey through other tools including Twitter, Facebook, PrestoCentre Newsletter and the survey bookmarks distributed at the Digital Preservation 2014 meeting.
The vast majority of respondents who identified their affiliation were from the United States; other countries represented include Germany, Austria, England, South Africa, Australia, Canada, Denmark and Chile.
The survey identified the top three stumbling blocks in preserving video as:
- Getting funding and other resources to start preserving video (18%)
- Supporting appropriate digital storage to accommodate large and complex video files (14%)
- Locating trustworthy technical guidance on video file formats including standards and best practices (11%)
Respondents report that analog/physical media is the most challenging type of video (73%) followed by born digital (42%) and digital on physical media (34%).
Clearly, this high-level data doesn’t tell the whole story and we have work ahead to analyze the results. One topic we’d like to pursue is using the source of the survey invitation to better understand the context of the communities that answered the survey. Some respondents, such as those alerted to the survey through the announcement on the AMIA list, are expected to have more experience with preserving video than respondents directed to the survey from more general sources like Facebook or Twitter.
How do the responses from more mature programs compare with emerging programs? What can we learn from those who reported certain issues as “solved” within their institution? Might these solutions be applicable to other institutions? What about the institutions reporting that analog video is more challenging than born digital video? Are their video preservation programs just starting out? Do they have much born-digital video yet?
After we better understand the data, the NDSA Standards and Practices Working Group will start to consider what actions might be useful to help lower these stumbling blocks. This may include following up with additional survey questions to define the formats and scopes of current and expected video collections. Stay tuned for a more detailed report about the survey results and next steps!
22 participants from 8 countries (the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic), not to forget the umpteen thousand defective or otherwise interesting PDF files brought to the event.
Not only is this my first blog entry on the OPF website, it is also about my first hackathon. I guess it was Michelle's idea in the first place to organise a hackathon with the Open Planets Foundation on the topic of PDF and to hold the event in our library in Hamburg. I am based in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better (and besides, it is Hamburg that has the big airport).
The preparation for the event was pretty intense for me. Not only did the organisation in Hamburg (food, rooms, water, coffee, dinner event) have to be done; even more intense was the preparation for the hacking itself.
I am a library and information scientist, not a programmer. Sometimes I would rather be a programmer, considering my daily best-of problems, but you should dress for the body you have, not for the body you'd like to have.
Having learned the little I know about writing code within the last 8 months, and most of it just since this July, I am still brand new to it. As there is always a so-called "summer break" (which means that everybody else is on holiday and I actually have time to work on difficult stuff), I had some very intense Skype calls with Carl from the OPF, who enabled me to put all my work-in-progress PDF tools on GitHub. I learned about Maven and Travis and had not quite recovered when the hackathon actually started this Monday and we all had to install a virtual Ubuntu machine to be able to try out some best-of tools like DROID, Tika and Fido and run them over our own PDF files.
We had Olaf Drümmer from the PDF Association as our keynote speaker for both days. On the first day, he gave us insights into PDF and PDF/A, and when I say insights, I really mean that: he talked about the building blocks of a PDF, the basic object types and the encoding possibilities. This was much better than trying to understand the 756-page PDF 1.7 specification all by myself in the office, with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".
We learned about the many different kinds of page content, about the page being the most important structural unit of a PDF file, and about the fact that a PDF page can have any size you can think of, although Acrobat 7.0 officially only supports page dimensions up to 381 km. On the second day, we learned about PDF(/A) validation and what would theoretically be needed to build the perfect validator. Considering the PDF and PDF/A specifications and all the specifications quoted and referenced by them, I am under the impression that it would take months to read them all - and this much is clear: somebody would have to read and understand them all. The complexity of the PDF file, the flexibility of the viewers and the plethora of users and users' needs will always ensure a heterogeneous PDF reality with all the strangeness and brokenness possible. As far as I remember, his guess is that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Given these perfectly comprehensible points, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.
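The 381 km figure can be checked with a little arithmetic. The sketch below assumes two limits recalled from the implementation-limits appendix of the PDF 1.7 reference (a maximum page dimension of 14,400 default user space units and a maximum UserUnit value of 75,000 in Acrobat 7.0); neither number comes from the talk, so treat both as assumptions to verify against the specification.

```python
# Assumed limits, recalled from the implementation-limits appendix of the
# PDF 1.7 reference; verify them against the specification before relying on them.
MAX_USER_SPACE_UNITS = 14_400   # largest page dimension in default user space units
MAX_USER_UNIT = 75_000          # largest UserUnit scale factor (Acrobat 7.0)
UNITS_PER_INCH = 72             # one default user space unit is 1/72 inch
METRES_PER_INCH = 0.0254

# Largest page dimension in inches, then converted to kilometres.
max_inches = MAX_USER_SPACE_UNITS * MAX_USER_UNIT / UNITS_PER_INCH
max_km = max_inches * METRES_PER_INCH / 1000

print(f"{max_inches:,.0f} inches is about {max_km:.0f} km")
```

Under these assumptions the result is 15,000,000 inches, which matches the 381 km quoted in the talk.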
As PDF viewers tend to conceal problems and to display problematic PDF files in a decent way, they are usually no big help in terms of PDF validation or ensuring long-term availability; quite the contrary.
Some errors can have a big impact on the long-term availability of PDF files, especially content that is only referred to and not embedded within the file and might simply be lost over time. On the other hand, the "invalid page tree node" which, for example, JHOVE likes to put its finger on is not an error, but just a hint that the page tree is not balanced and a page cannot be found in the most efficient way. Even if all the pages were just saved as an array and you had to iterate through the whole array to get to a certain page, this would only slow down the loading, but would not prevent anybody from accessing the page they want to read, especially if the affected PDF document only has a couple of dozen pages.
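The difference between a balanced page tree and a flat array of pages can be sketched in a few lines. This is a toy model, not a PDF parser: the `Page` and `Pages` classes and the `find_page` function are hypothetical names, and `count` merely mimics the role of the /Count entry that lets a reader skip whole subtrees.

```python
class Page:
    """Leaf node: a single page."""
    def __init__(self, number):
        self.number = number


class Pages:
    """Intermediate node; `count` plays the role of the PDF /Count entry."""
    def __init__(self, kids):
        self.kids = kids
        # A leaf counts as one page; a subtree contributes its stored count.
        self.count = sum(getattr(kid, "count", 1) for kid in kids)


def find_page(node, index):
    """Return (page, nodes_visited) for a zero-based page index."""
    visited = 0
    while isinstance(node, Pages):
        for kid in node.kids:
            visited += 1
            size = getattr(kid, "count", 1)
            if index < size:
                node = kid          # descend into the subtree holding the page
                break
            index -= size           # skip the whole subtree in one step
        else:
            raise IndexError("page index out of range")
    return node, visited


pages = [Page(n) for n in range(64)]
flat = Pages(pages)                                             # one long array
balanced = Pages([Pages(pages[i:i + 8]) for i in range(0, 64, 8)])

print(find_page(flat, 63)[1], find_page(balanced, 63)[1])
```

Both layouts return the same page; the flat one just inspects more nodes on the way, which is why an unbalanced tree costs time rather than access.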
During the afternoon of the first day, we collected the specific problems everybody has and formed working groups, each tackling a different problem. One working group (around Olaf) started to go through JHOVE error messages, trying to figure out which ones really pose a risk and what they mean in the first place. Some of the error messages definitely describe real errors where a rule or specification is violated, but which will practically never cause any problems displaying the file. Is this really an error then? Or just bureaucracy? Should a good validator even display this as an error - which formally would be the right thing to do - or not disturb the user unnecessarily?
Another group wanted to create a small Java tool with CSV output that looks into a PDF file and reports which software created the file and which validation errors it contains, starting with PDFBox, as this was easy to implement in Java. We got as far as getting the tool working, but as we had brought especially broken PDF files to the event, it is not yet able to cope with all of them; we still have to make it more fault-tolerant.
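The fault-tolerance problem described above can be illustrated with a minimal sketch. This is not the group's Java/PDFBox tool: it is a Python stand-in in which `inspect_pdf` is a hypothetical, deliberately naive placeholder for a real parser (it only finds an uncompressed, unencrypted literal-string /Producer entry). The point is the pattern of turning each failure into a CSV row instead of letting one broken file stop the batch.

```python
import csv
import io
import re

# Naive pattern: only matches an uncompressed literal-string /Producer entry.
PRODUCER = re.compile(rb"/Producer\s*\(([^)]*)\)")


def inspect_pdf(data):
    """Hypothetical stand-in for a real PDF parser or validator."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("missing %PDF- header")
    match = PRODUCER.search(data)
    return match.group(1).decode("latin-1") if match else ""


def survey(files, out):
    """Write one CSV row per file; a failure becomes a row, not a crash."""
    writer = csv.writer(out)
    writer.writerow(["file", "producer", "error"])
    for name, data in files:
        try:
            writer.writerow([name, inspect_pdf(data), ""])
        except Exception as exc:  # broken input must not stop the whole run
            writer.writerow([name, "", f"{type(exc).__name__}: {exc}"])


# In-memory samples (made-up content) so the sketch runs without any files.
samples = [
    ("good.pdf", b"%PDF-1.4\n1 0 obj << /Producer (ExampleWriter 1.0) >> endobj"),
    ("broken.pdf", b"not a pdf at all"),
]
buffer = io.StringIO()
survey(samples, buffer)
print(buffer.getvalue())
```

Catching the exception per file keeps the error message in the output, so the broken files end up as findings to analyse rather than as reasons the run aborted.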
By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world as I do. When I told them I could not wait to see our new tool's output and was anxious to analyse the findings, the answer was just "And neither can I". Usually, I just get frowns and "I do not get why you are interested in something so boring" faces.
A third working group went to another room and tested the already existing tools in the virtual Ubuntu environment, using the PDF samples people had brought.
There were more ideas. Some of them seemed too difficult, or outright impossible, to solve in such a short time, but some of us are determined to have a follow-up event soon.
For example, Olaf stated that sometimes the text extraction from a PDF file does not work, and the participant who sat next to me suggested that we could start to check the output against dictionaries to see if it still makes sense. "But there are so many languages," I told him, thinking about my library's content. "Well, start with one," he answered, following the idea that a big problem can often be split into several small ones.
Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but others doubted that this information could still be retrieved.
When the event was over on Tuesday around 5 pm, we were all tired, but happy, with clear ideas or new interesting problems in our heads.
And just because I was already asked this today, as I might still look slightly tired: we did sleep during the night. We did not hack all the way through, nor did we sleep on mattresses in our library. Some of us had quite a few pitchers of beer during the evening, but I am quite sure everybody made it to his or her hotel room.
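The "start with one language" idea can be sketched as a simple hit-rate check. The function name `dictionary_hit_rate` and the tiny `ENGLISH` word list are made up for illustration; a real check would load a full dictionary and would need a sensible threshold, neither of which was discussed at the event.

```python
import re


def dictionary_hit_rate(text, vocabulary):
    """Fraction of alphabetic tokens in `text` that appear in `vocabulary`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


# Tiny illustrative word list; a real check would load a full dictionary.
ENGLISH = {"the", "page", "tree", "is", "not", "balanced"}

good = dictionary_hit_rate("The page tree is not balanced.", ENGLISH)
bad = dictionary_hit_rate("Tbc qbhf usff jt opu cbmbodfe.", ENGLISH)
print(good, bad)
```

A low hit rate would only flag a file for a closer look: it cannot tell broken extraction apart from text in a language the dictionary does not cover.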
Twitter hashtag: #OPDFPDFPreservation
Preserving and managing research data is a significant concern for scientists and staff at research libraries. With that noted, many likely don’t realize the length of time in which valuable scientific data has accrued on a range of media in research settings. That is, data management often needs to be both backward- and forward-looking, considering a range of legacy media and formats as well as contemporary practice. To that end, I am excited to interview Emily Frieda Shaw, Head of Preservation and Reformatting at Ohio State University (prior to August 2014 she was the Digital Preservation Librarian at the University of Iowa Libraries). Emily talked about her work on James Van Allen’s data from the Explorer satellites launched in the 1950s at the Digital Preservation 2014 conference and I am excited to explore some of the issues that work raises.
Trevor: Could you tell us a bit about the context of the data you are working with? Who created it, how was it created, what kind of media is it on?
Emily: The data we’re working with was captured on reel-to-reel audio tapes at receiving stations around the globe as Explorer 1 passed overhead in orbit around Earth in the early months of 1958. Explorer predated the founding of NASA and was sent into orbit by a research team led by Dr. James Van Allen, then a Professor of Physics at the University of Iowa, to observe cosmic radiation. Each reel-to-reel Ampex tape contains up to 15 minutes of data on 7 tracks, including time stamps, station identifications and weather reports from station operators, and the “payload” data consisting of clicks, beeps and squeals generated by on-board instrumentation measuring radiation, temperature and micrometeorite impacts.
Once each tape was recorded, it was mailed to Iowa for analysis by a group of graduate students. A curious anomaly quickly emerged: At certain altitudes, the radiation data disappeared. More sensitive instruments sent into orbit by Dr. Van Allen’s team soon after Explorer 1 confirmed what this anomaly suggested: the Earth is surrounded by belts of intense radiation, dubbed soon thereafter the Van Allen Radiation Belts. When the Geiger counter on board Explorer 1 registered no radiation at all, it was, in fact, overwhelmed by extremely high radiation.
We believe these tapes represent the first data set ever transmitted from outside Earth’s atmosphere. Thanks to the hard work and ingenuity of our friends at The MediaPreserve, and some generous funding from the Carver Foundation, we now have about 2 TB of .wav files converted from the Explorer 1 tapes, as well as digitized lab notebooks and personal journals of Drs. Van Allen and Ludwig, along with graphs, correspondence, photos, films and audio recordings.
In our work with this collection, the biggest discovery was a 700-page report from Goddard composed almost entirely of data tables that represent the orbital ephemeris data set from Explorer 1. This 1959 report was digitized a few years back from the collections at the University of Illinois at Urbana-Champaign as part of the Google Books project and is being preserved in the HathiTrust. This data set holds the key to interpreting the signals we hear on the tapes. There are some fascinating interplays between analog and digital, past and present, near and far in this project, and I feel very lucky to have landed in Iowa when I did.
Trevor: What challenges does this data present for getting it off of its original media and into a format that is usable?
Emily: When my colleagues were first made aware of the Explorer mission tapes in 2009, they had been sitting in the basement of a building on the University of Iowa’s campus for decades. There was significant mold growth on the boxes and the tapes themselves, and my colleagues secured an emergency grant from the state to clean, move and temporarily rehouse the tapes. Three tapes were then sent to The MediaPreserve to see if they could figure out how to digitize the audio signals. Bob Strauss and Heath Condiotte hunted down a huge, of-the-era machine that could play back all of the discrete tracks on these tapes. As I understand it, Heath had to basically disassemble the entire thing and replace all of the transistors before he got it to work properly. Fortunately, we were able to play some of the digitized audio tracks from these test reels for Dr. George Ludwig, one of the key researchers on Dr. Van Allen’s team, before he passed away in 2012. Dr. Ludwig confirmed that they sounded — at least to his naked ear — as they should, so we felt confident proceeding with the digitization.
So, soon after I was hired in 2012, we secured funding from a private foundation to digitize the Explorer 1 tapes and proceeded to courier all 700 tapes to The MediaPreserve for thorough cleaning, rehousing and digital conversion. The grant is also funding the development and design of a web interface to the data and accompanying archival materials, which we [Iowa] hope to launch (pun definitely intended) some time this fall.
Trevor: What stakeholders are involved in the project? Specifically, I would be interested to hear how you are working with scientists to identify what the significant properties of these particular tapes are.
Emily: No one on the project team we assembled within the Libraries has any particular background in near-Earth physics. So we reached out to our colleagues in the University of Iowa Department of Physics, and they have been tremendously helpful and enthusiastic. After all, this data represents the legacy of their profession in a big picture sense, but also, more intimately, the history of their own department (their offices are in Van Allen Hall). Our colleagues in Physics have helped us understand how the audio signals were converted into usable data, what metadata might be needed in order to analyze the data set using contemporary tools and methods, how to package the data for such analysis, and how to deliver it to scientists where they will actually find and be able to use it.
We’re also working with a journalism professor from Northwestern University, who was Dr. Van Allen’s biographer, to weave an engaging (and historically accurate) narrative to tell the Explorer story to the general public.
Trevor: How are you imagining use and access to the resulting data set?
Emily: Unlike the digitized photos, books, manuscripts, music recordings and films we in libraries and archives have become accustomed to working with, we’re not sure how contemporary scientists (or non-scientists) might use a historic data set like this. Our colleagues in Physics have assured us that once we get this data (and accompanying metadata) packaged into the Common Data Format and archived with the National Space Science Data Center, analysis of the data set will be pretty trivial. They’re excited about this and grateful for the work we’re doing to preserve and provide access to early space data, and believe that almost as quickly as we are able to prepare the data set to be shared with the physics community, someone will pick it up and analyze it.
As the earliest known orbital data set, this holds great historical significance. But the more we learn about Explorer 1, the less confident we are that the data from this first mission is/was scientifically significant. The Explorer 1 data (or rather, the points in its orbit during which the instruments recorded no data at all) hinted at a big scientific discovery. But it was really Explorer III, sent into orbit in the summer of 1958 with more sophisticated instrumentation, that produced the data that led to the big “ah-hah” moment. So, we’re hoping to secure funding to digitize the tapes from that mission, which are currently in storage.
I also think there might be some interesting, as-yet-unimagined artistic applications for this data. Some of the audio is really pretty eerie and cool space noise.
Trevor: More broadly, how will this research data fit into the context of managing research data at the university? Is data management something that the libraries are getting significantly involved in? If so could you tell us a bit about your approach.
Emily: The University of Iowa, like all of our peers, is thinking and talking a lot about research data management. The Libraries are certainly involved in these discussions, but as far as I can tell, the focus is, understandably, on active research and is motivated primarily by the need to comply with funding agency requirements. In libraries, archives and museums, many of us are motivated by a moral imperative to preserve historically significant information. However, this ethos does not typically pervade the realm of active, data-intensive research. Once the big discovery has been made and the papers have been published, archiving the data set is often an afterthought, if not a burden. The fate of the Explorer tapes, left to languish in a damp basement for decades, is a case in point. Time will not be so kind to digital data sets, so we have to keep up the hard work of advocating, educating and partnering with our research colleagues, and building up the infrastructure and services they need to lower the barriers to data archiving and sharing.
Trevor: Backing up out of this particular project, I don’t think I have spoken with many folks with the title “Digital Preservation Librarian.” Other than this, what kinds of projects are you working on and what sort of background did you have to be able to do this sort of work? Could you tell us a bit about what that role means in your case? Is it something you are seeing crop up in many research libraries?
Emily: My professional focus is on the preservation of collections, whether they are manifest in physical or digital form, or both. I’ve always been particularly interested in the overlaps, intersections, and interdependencies of physical/analog and digital information, and motivated to play an active role in the sociotechnical systems that support its creation, use and preservation. In graduate school at the University of Illinois, I worked both as a research assistant with an NSF-funded interdisciplinary research group focused on information technology infrastructure, and in the Library’s Conservation Lab, making enclosures, repairing broken books, and learning the ins and outs of a robust research library preservation program. After completing my MLIS, I pursued a Certificate of Advanced Study in Digital Libraries while working full-time in Preservation & Conservation, managing multi-stream workflows in support of UIUC’s scanning partnership with Google Books.
I came to Iowa at the beginning of 2012 into the newly-created position of Digital Preservation Librarian. My role here has shifted with the needs and readiness of the organization, and has included the creation and management of preservation-minded workflows for digitizing collections of all sorts, the day-to-day administration of digital content in our redundant storage servers, researching and implementing tools and processes for improved curation of digital content, piloting workflows for born-digital archiving, and advocating for ever-more resources to store and manage all of this digital stuff. Also, outreach and inreach have both been essential components of my work. As a profession, we’ve made good progress toward raising awareness of digital stewardship, and many of us have begun making progress toward actually doing something about it, but we still have a long way to go.
And actually, I will be leaving my current position at Iowa at the end of this month to take on a new role as the Head of Preservation and Reformatting for The Ohio State University Libraries. My experience as a hybrid preservationist with understanding and appreciation of both the physical and digital collections will give me a broad lens through which to view the challenges and opportunities for long-term preservation and access to research collections. So, there may be a vacancy for a digital preservationist at Iowa in the near future.