- Jpylyzer by the KB (Royal Library of the Netherlands) and partners
- The SPRUCE Project by The University of Leeds and partners
- bwFLA Functional Long Term Archiving and Access by the University of Freiburg and partners
- Practical Digital Preservation: a how to guide for organizations of any size by Adrian Brown
- Skilling the Information Professional by Aberystwyth University
- Introduction to Digital Curation: An open online UCLeXtend Course by University College London
- Voices from a Disused Quarry by Kerry Evans, Ann McDonald and Sarah Vaughan, University of Aberystwyth
- Game Preservation in the UK by Alasdair Bachell, University of Glasgow
- Emulation v Format Conversion by Victoria Sloyan, University College London
The DPC Award for Safeguarding the Digital Legacy, which celebrates the practical application of preservation tools to protect at-risk digital objects.
- Conservation and Re-enactment of Digital Art Ready-Made, by the University of Freiburg and Partners
- Carcanet Press Email Archive, University of Manchester
- Inspiring Ireland, Digital Repository of Ireland and Partners
- The Cloud and the Cow, Archives and Records Council of Wales
The following is a guest post by Nicholas Woodward, an Information Technology Specialist and the newest member of the Library’s Web Archiving team.
The path that led me to the Library of Congress was long and circuitous, and it includes everything from a tiny web startup to teaching economics in Nicaragua to rediscovering a passion for developing software in Austin, Texas. Like many folks who develop software in the academic and library world, I have a deep interest in the social sciences and humanities, in addition to technology.
But unlike others who began in these fields and subsequently developed technological knowledge and skills to do new and exciting things, I did the opposite. I spent years in the technology industry only to find that it had little value for me without serious contemplation of what effect it has on other people’s lives. Only later did I discover that software development in the library and academic environments allows one to incorporate such considerations as the practical applications for research or how different forces in society influence technological development and vice versa into the process of writing code.
But I’m jumping ahead. Let’s get the events out of the way. In 2003 I graduated from the University of Nebraska-Lincoln with a BS in computer science and started working full-time at a very small web development company. After deciding there must be more to life than making websites for a salary, I joined the Peace Corps in 2005 and worked as a high school teacher in Nicaragua for roughly 2.5 years. After a brief stint observing elections in Guatemala, I returned to the U.S. in hopes of going back to school to study the social sciences with a focus on Latin America. My dream scenario took shape when I was accepted to an MA program in the Teresa Lozano Long Institute of Latin American Studies at the University of Texas at Austin. I earned my MA in 2011 and subsequently earned an MS in Library and Information Science in 2013, also at UT.
It was while an MA student that a graduate research assistantship would change my career path for good. As a dual research assistant for the Latin American Network Information Center and the Texas Advanced Computing Center I had the incredible opportunity to conduct research on a large web archive in a high-performance computing environment. In the process I learned about things such as the Hadoop architecture and natural language processing and Bayesian classifiers and distributed computing and…
But the real value, as far as I was concerned, was that I could see directly how software development could be more than just putting together code to do “cool stuff.” I realized that developing software to facilitate research and discovery of massive amounts of data in an open and collaborative fashion not only increases the opportunities for alternative types of knowledge production but also influences how it gets created in a very profound way. And being a part of this process, however small, was the ideal place for me.
Which brings us to today. I am thrilled to be starting my new role as an Information Technology Specialist with the web archiving team of the Library’s Office of Strategic Initiatives. It is an incredible opportunity to learn new skills, incorporate knowledge I’ve acquired in the past and contribute in whatever ways I can to an outstanding team that is at the forefront of Internet archiving.
As the newest member of the web archiving team, my focus will be to continue the ongoing development of Digiboard 4.0 (pdf), the next version of our web application for managing the web archiving process at the Library of Congress. Digiboard 4.0 will build on previous software that enables Library staff to create collections of web-archived content, nominate new websites and review crawls of the Internet for quality assurance, while also making the process more efficient and expanding opportunities for cataloging archived websites. Additionally, part of my time will include exploratory efforts to expand the infrastructure and capacity of the web archiving team for in-house Internet crawling.
I look forward to the challenges and opportunities that lie ahead as we contribute to the greater web archiving community through establishing best practices, improving organizational workflows for curation, quality review and presentation of web-archived content and generally expanding the boundaries of preserving the Internet for current and future generations.
Every year, The Small Press Expo in Bethesda, Md., brings together a community of alternative comic creators and independent publishers. Given the Library’s significant history of collecting comics, it made sense for the Library of Congress’ Serial and Government Publications Division and the Prints & Photographs Division to partner with SPX to build a collection documenting alternative comics and comics culture. Over the last three years, this collection has been developing and growing.
While the collection itself is quite fun (what’s not to like about comics), it is also a compelling example of the way that web archiving can complement and fit into work developing a special collection. To that end, I am excited to talk with Megan Halsband, Reference Librarian with the Library of Congress Serial and Government Publications Division and one of the key staff working on this collection as part of our Content Matters interview series.
Trevor: First off, when people think Library of Congress I doubt “comics” is one of the first things that comes to mind. Could you tell us a bit about the history of the Library’s comics collection, the extent of the collections and what parts of the Library of Congress are involved in working with comics?
Megan: I think you’re right – the comics collection is not necessarily one of the things that people associate with the Library of Congress – but hopefully we’re working on changing that! The Library’s primary comics collections are two-fold – first there are the published comics held by the Serial & Government Publications Division, which appeared in newspapers/periodicals and later in comic books, as well as the original art, which is held by the Prints & Photographs Division.
The Comic Book Collection here in Serials is probably the largest publicly available collection in the country, with over 7,000 titles and more than 125,000 issues. People wonder why our section at the Library is responsible for the comic books – and it’s because most comic books are published serially. Housing the comic collection in Serials also makes sense, as we are also responsible for the newspaper collections (which include comics). The majority of our comic books come through the US Copyright Office via copyright deposit, and we’ve been receiving comic books this way since the 1930s/1940s.
The Library tries to have complete sets of all the issues of major comic titles but we don’t necessarily have every issue of every comic ever published (I know what you’re thinking and no, we don’t have an original Action Comics No. 1 – maybe someday someone will donate it to us!). The other main section of the Library that works with comic materials is Prints & Photographs – though Rare Book & Special Collections and the area studies reading rooms probably also have materials that would be considered ‘comics.’
Trevor: How did the idea for the SPX collection come about? What was important about going out to this event as a place to build out part of the collection? Further, in scoping the project, what about it suggested that it would also be useful/necessary to use web archiving to complement the collection?
Megan: The executive director of SPX, Warren Bernard, has been working in the Prints & Photographs Division as a volunteer for a long time, and the collection was established in 2011 after a Memorandum of Understanding was signed between the Library and SPX. I think Warren really was a major driving force behind this agreement, but the curators in both Serials and Prints & Photographs realized that our collections didn’t include materials from this particular community of creators and publishers in the way that they should.
Given that SPX is a local event with an international reputation and awards program (SPX awards the Ignatz) and the fact that we know staff at SPX, I think it made sense for the Library to have an ‘official’ agreement that serves as an acquisition tool for material that we wouldn’t probably otherwise obtain. Actually going to SPX every year gives us the opportunity to meet with the artists, see what they’re working on and pick up material that is often only available at the show – in particular mini-comics or other free things.
Something important to note is that the SPX Collection – the published works, the original art, everything – is all donated to the Library. This is huge for us – we wouldn’t be able to collect the depth and breadth of material (or possibly any material at all) from SPX otherwise. As far as including online content for the collection, the Library’s Comics and Cartoons Collection Policy Statement (PDF) specifically states that the Library will collect online/webcomics, as well as award-winning comics. The SPX Collection, with its web archiving component, specifically supports both of these goals.
Trevor: What kinds of sites were selected for the web archive portion of the collection? In this case, I would be interested in hearing a bit about the criteria in general and also about some specific examples. What is it about these sites that is significant? What kinds of documentation might we lose if we didn’t have these materials in the collection?
Megan: Initially the SPX webarchive (as I refer to it – though its official name is Small Press Expo and Comic Art Collection) was extremely selective – only the SPX website itself and the annual winner of the Ignatz Award for Outstanding Online Comic were captured. The staff wanted to see how hard it would be to capture websites with lots of image files (of various types). Turns out it works just fine (if no paywall or subscriber login credentials are required) – so we expanded the collection to include all the Ignatz nominees in the Outstanding Online Comic category as well.
Some of these sites, such as Perry Bible Fellowship and American Elf, are long-running online comics whose creators have been awarded Eisner, Harvey and Ignatz awards. There’s a great deal of content on these websites that isn’t published or available elsewhere – and I think that this is one of the major reasons for collecting this type of material. Sometimes the website might have initial drafts or ideas that later are published, sometimes the online content is not directly related to published materials, but for in-depth research on an artist or publication, often this type of related content is extremely useful.
Trevor: You have been working with SPX to build this collection for a few years now. Could you give us an overview of what the collection consists of at this point? Further, I would be curious to know a bit about how the idea of the collection is playing out in practice. Are you getting the kinds of materials you expected? Are there any valuable lessons learned along the way that you could share? If anyone wants access to the collection how would they go about that?
Megan: At this moment in time, the SPX Collection materials that are here in Serials include acquisitions from 2011-2013, plus two special collections that were donated to us, the Dean Haspiel Mini-Comics Collection and the Heidi MacDonald Mini-Comics Collection. I would say that the collection has close to 2,000 items (we don’t have an exact count since we’re still cataloging everything) as well as twelve websites in the web archive. We have a wonderful volunteer who has been working on cataloging items from the collection, and so far there are over 550 records available in the Library’s online catalog.
Personally, I didn’t have any real expectations of what kinds of materials we would be getting – I think that definitely we are getting a good selection of mini-comics, but it seems like there are more graphic novels than I anticipated. One of the fun things about this collection is the new and exciting things that you end up finding at the show – like an unexpected tiny comic that comes with its own magnifying glass or an oversize newsprint series.
The process of collecting has definitely gotten easier over the years. For example, the Head of the Newspaper Section, Georgia Higley, and I just received the items that were submitted in consideration for the 2014 Ignatz Awards. We’ll be able to prep permission forms/paperwork in advance of the show for the items we’re keeping from this material, and it will help us cut down on potential duplication. This is definitely a valuable lesson learned! We’ve also come up with a strategy for visiting the tables at the show – there are 287 tables this year – so we divide up the ballroom among four of us (Georgia and me, as well as two curators from Prints & Photographs – Sara Duke and Martha Kennedy) to make it manageable.
We also try to identify items that we know we want to ask for in advance of the show – such as ongoing serial titles or debut items listed on the SPX website – to maximize our time when we’re actually there. Someone wanting to access the collection would come to the Newspaper & Current Periodical Reading Room to request the comic books and mini-comics. Any original art or posters from the show would be served in the Prints & Photographs Reading Room. As I mentioned – there is still a portion of this collection that is unprocessed – and may not be immediately accessible.
Trevor: Stepping back from the specifics of the collection, what about this do you think stands as a general example of how web archiving can complement the development of special collections?
Megan: One of the true strengths of the Library of Congress is that our collections often include not only the published version, but also the ephemeral material related to the published item/creator, all in one place. From my point of view, collecting webcomics gives the Library the opportunity to collect some of this ‘ephemera’ related to comics collections and only serves to enhance what we are preserving for future research. And as I mentioned earlier, some of the content on the websites provides context, as well as material for comparison, to the physical collection materials that we have, which is ideal from a research perspective.
Trevor: Is there anything else with web archiving and comics on the horizon for your team? Given that web comics are such a significant part of digital culture I’m curious to know if this is something you are exploring. If so, is there anything you can tell us about that?
Megan: We recently began another web archive collection to collect additional webcomics beyond those nominated for Ignatz Awards – think Dinosaur Comics and XKCD. It’s very new (and obviously not available for research use yet) – but I am really excited about adding materials to this collection. There are a lot of webcomics out there – and I’m glad that the Library will now be able to say we have a selection of this type of content in our collection! I’m also thinking about proposing another archive to capture comics literature and criticism on the web – stay tuned!
The following is a guest post from Julia Fernandez, this year’s NDIIPP Junior Fellow. Julia has a background in American studies and working with folklife institutions and worked on a range of projects leading up to CurateCamp Digital Culture in July. This is part of a series of interviews Julia conducted to better understand the kinds of born-digital primary sources folklorists, and others interested in studying digital culture, are making use of for their scholarship.
The numbers around user-generated video are staggering. YouTube, one of the largest user-generated video platforms, has more than 100 hours of video content uploaded to it every minute. What does this content mean for us and our society? What of it should we aspire to ensure long-term access to?
As part of the NDSA Insights interview series, I’m delighted to interview Alexandra Juhasz, professor of Media Studies at Pitzer College. Dr. Juhasz has written multiple articles on digital media and produced the feature films “The Owls” and “The Watermelon Woman.” Her innovative “video-book” “Learning from YouTube” was published by MIT Press, but partly enabled through YouTube itself, and is available for free here. In this regard, her work is relevant to those working in digital preservation both in better understanding the significance of user-generated video platforms like YouTube and in understanding new hybrid forms of digital scholarly publishing.
Julia: In the intro to your online video-book “Learning From YouTube” you say “YouTube is the Problem, and YouTube is the solution.” Can you expand on that a bit for us?
Alex: I mean “problem” in two ways. The first is more neutral: YouTube is my project’s problematic, its subject or concern. But I also mean it more critically as well: YouTube’s problems are multiple–as are its advantages–but our culture has focused much more uncritically on how it chooses to sell itself: as a democratic space for user-made production and interaction. The “video-book” understands this as a problem because it’s not exactly true. I discuss how YouTube isn’t democratic in the least; how censorship dominates its logic (as does distraction, the popular and capital).
YouTube is also a problem in relation to the name and goals of the course that the publication was built around (my undergraduate Media Studies course also called “Learning from YouTube” held about, and also on, the site over three semesters, starting in 2007). As far as pedagogy in the digital age is concerned, the course suggests there’s a problem if we do all or most or even a great deal of our learning on corporate-owned platforms that we have been given for free, and this for many reasons that my students and I elaborate, but only one of which I will mention here as it will be most near and dear to your readers’ hearts: it needs a good archivist and a reasonable archiving system if it’s to be of any real use for learners, teachers or scholars. Oh, and also some system to evaluate content.
YouTube is the solution because I hunkered down there, with my students, and used the site to both answer the problem, and name the problems I have enumerated briefly above.
Julia: What can you tell us about how you approached the challenges of teaching a course about YouTube? What methods of analysis did you apply to its content? How did you select which materials to examine given the vast scope and diversity of YouTube’s content?
Alex: I have taught the course three times (2007, 2008, 2010). In each case the course was taught on and about YouTube. This is to say, we recorded class sessions (the first year only), so the course could be seen on YouTube; all the class assignments needed to take the form of YouTube “writing” and needed to be posted on YouTube (as videos or comments); and the first time I taught it, the students could only do their research on YouTube (thereby quickly learning the huge limits of its vast holdings). You can read more about my lessons learned teaching the course here and here.
The structure of the course mirrors many of the false promises of YouTube (and web 2.0 more generally), thereby allowing students other ways to see its “problems.” It was anarchic, user-led (students chose what we would study, although of course I graded them: there’s always a force of control underlying these “free” systems), public, and sort of silly (but not really).
As the course developed in its later incarnations, I developed several kinds of assignments (or methods of analysis as you put it), including traditional research looking at the results of published scholars, ethnographic research engaging with YouTubers, close-textual analysis (of videos and YouTube’s architecture), and what I call YouTours, where students link together a set of YouTube videos to make an argument inside of and about and with its holdings. I also have them author their own “Texteo” as their final (the building blocks, or pages, of my video-book; texteo=the dynamic linking of text and video), where they make a concise argument about some facet of YouTube in their own words and the words of videos they make or find (of course, this assignment allows them to actually author a “page” of my “book,” thereby putting into practice web 2.0’s promise of the decline of expertise and the rise of crowd-sourced knowledge production).
Students choose the videos and themes we study on YouTube. I like this structure (giving them this “control”) because they both enjoy and know things I would never look at, and they give me a much more accurate reading of mainstream YouTube than I would ever find on my own. My own use of the site tends to take me into what I call NicheTube (the second, parallel structure of YouTube, underlying the first where a few videos are seen by many many people, and these are wholly predictable in their points of view and concerns). On YouTube it’s easy to find popular videos. On NicheTube content is rarely seen, hard to find and easy to lose; everything might be there, but very few people will ever see it.
Now that YouTube Studies has developed, I also assign several of the book-length studies written about it from a variety of disciplines (I list these below). When I first taught the class in 2007, my students and I were generating the primary research and texts of YouTube Studies: producing work that was analytical and critical about the site, in its vernaculars, and on its pages.
Julia: What were some of the challenges of publishing an academic work in digital form? A large part of the work depends on linking to YouTube videos that you did not create and/or are no longer available. What implications are there for long-term access to your work?
Alex: I discuss this at greater length in the video-book because another one of its self-reflexive structures, mirroring those of YouTube, is self-reflexivity: an interest in its own processes, forms, structures and histories.
While MIT Press was extremely interested and supportive, they had never “published” anything like this before. The problems were many, and varied, and we worked through them together. I’ve detailed answers to your question in greater detail within the video-book, but here’s one of the lists of differences I generated:
- Delivery of the Work
- Author’s Warranty
- Previous Publication
- Size of the Work
- Materials Created by Other Persons
- Author’s Alterations
Many of these differences are legal and respond directly to the original terms in the contract they gave me that made no sense at all with a born-digital, digital-only object, and in particular about writing a book composed of many things I did not “own,” about “selling” a book for free, making a book that was already-made, or moving a book that never needed to be shipped.
One solution is that the video-book points to videos, but they remain “owned” by YouTube (I backed up some of the most important and put them on Critical Commons knowing that they might go away). But, in the long run, I do not mind that many of the videos fade away, or that the book itself will probably become quickly unreadable (because the systems it is written on will become obsolete). It is another myth of the Internet that everything there is lasting, permanent, forever. In fact, by definition, much of what is housed or written there is unstable, transitory, difficult to find, or difficult to access as platforms, software and hardware change.
In “On Publishing My YouTube “Book” Online (September 24, 2009)” I mention these changes as well:
- Audience. When you go online your readers (can) include nonacademics.
- Commitment. Harder to command amid the distractions.
- Design. Matters more; and it has meaning.
- Finitude. The page(s) need never close.
- Interactivity. Should your readers, who may or may not be experts, author too?
- Linearity. Goes out the window, unless you force it.
- Multimodality. Much can be expressed outside the confines of the word.
- Network. How things link is within or outside the author’s control.
- Single author. Why hold out the rest of the Internet?
- Temporality. People read faster online. Watching video can be slow. A book is long.
Now, when I discuss the project with other academics, I suggest there are many reasons to write and publish digitally: access, speed, multi-modality, etc. (see here), but if you want your work to be seen in the future, better to publish a book!
Julia: At this point you have been studying video production since the mid 90s. I would be curious to hear a bit about how your approach and perspective have developed over time.
Alex: My research (and production) interests have stayed consistent: how might everyday people’s access to media production and distribution contribute to people’s and movements’ empowerment? How can regular citizens have a voice within media and therefore culture more broadly, so that our interests, concerns and criticisms become part of this powerful force?
Every time I “study” the video of political people (AIDS activists, feminists, YouTubers), I make video myself. I theorize from my practice, and I call this “Media Praxis” (see more about that here). But what has changed during the years I’ve been doing this and thinking about it is that far more people really do have access to both media production and distribution than when I first began my studies (and waxed enthusiastic about how camcorders were going to foster a revolution). Oddly, this access can be said to have produced many revolutions (for instance the use of people-made media in the Arab Spring) and to have quieted just as many (we are more deeply entrenched in both capitalism’s pull and self-obsessions than at any time in human history, it seems to me!). I think a lot about that in the YouTube video-book and in projects since (like this special issue on feminist queer digital media praxis that I just edited for the online journal Ada).
Julia: You end up being rather critical of how popularity works on YouTube. You argue that “YouTube is not democratic. Its architecture supports the popular. Critical and original expression is easily lost to or censored by its busy users, who not only make YouTube’s content, but sift and rate it, all the while generating its business.” You also point to the existence of what you call “NicheTube,” the vast sea of little-seen YouTube videos that are hard to find given YouTube’s architecture of ranking and user-generated tags. Could you tell us a bit more about your take on the role of filtering and sorting in its system?
Alex: YouTube is corporate owned, just as is Facebook, and Google, and the many other systems we use to find, speak, navigate and define our worlds, words, friends, interests and lives. Filtering occurs in all these places in ways that benefit their bottom lines (I suggest in “Learning From YouTube” that a distracted logic of attractions keeps our eyeballs on the screen, which is connected to their ad-based business plan). In the process, we get more access to more and more immediate information, people, places and ideas than humans ever have, but it’s filtered through the imperatives of capitalism rather than say those of a University Library (that has its own systems to be sure, of great interest to think through, and imbued by power like anything else, but not the power of making a few people a lot of money).
The fact that YouTube’s “archive” is unorganized, user-tagged, chaotic and uncurated is their filtering system.
Julia: If librarians, archivists and curators wanted to learn more about approaches like yours to understanding the significance and role of online video what examples of other scholars’ work would you suggest? It would be great if you could mention a few other scholars’ work and explain what you think is particularly interesting about their approaches.
Alex: I assign these books in “Learning from YouTube”: Patrick Vonderau, “The YouTube Reader”; Burgess and Green, “YouTube” and Michael Strangelove, “Watching YouTube.” I also really like the work of Michael Wesch and Patricia Lange who are anthropologists whose work focuses on the site and its users.
Outside of YouTube itself, many of us are calling this kind of work “platform studies,” where we look critically and carefully at the underlying structures of Internet culture. Some great people working here are Caitlin Benson-Allott, danah boyd, Wendy Chun, Laine Nooney, Tara McPherson, Siva Vaidhyanathan and Michelle White.
I also think that as a piece of academic writing, Learning from YouTube (which I understand to be a plea for the longform written in tweets, or a plea for the classroom written online) is in conversation with scholarly work that is thinking about the changing nature of academic writing and publishing (and all writing and publishing, really). Here I like the work of Kathleen Fitzpatrick or Elizabeth Losh, as just two examples.
Julia: I would also be interested in what ways of thinking about the web you see this as being compatible or incompatible with other approaches to theorizing the web. How is your approach to studying video production online similar or different from other approaches in new media studies, internet research, anthropology, sociology or the digital humanities?
Alex: “Learning from YouTube” is new media studies, critical Internet studies, and DH, for sure. As you say above, my whole career has looked at video; since video moved online, I did too. I think of myself as an artist and a humanist (and an activist) and do not think of myself as using social science methods although I do learn a great deal from research done within these disciplines.
After “Learning from YouTube” I have done two further web-based projects: a website that tries to think about and produce alternatives to corporate-made and owned Internet experiences (rather than just critique this situation), www.feministonlinespaces.com; and a collaborative criticism of the MOOC (Massive Open Online Course), what we call a DOCC (Distributed Open Collaborative Course): http://femtechnet.newschool.edu.
In all three cases I think that “theorizing the web” is about making and using the web we want and not the version that corporations have given to us for free. I do this using the structures, histories, theories, norms and practices of feminism, but any ethical system will do!
For many organizations that are just starting to tackle digital preservation, it can be a daunting challenge – and particularly difficult to figure out the first steps to take. Education and training may be the best starting point, creating and expanding the expertise available to handle this kind of challenge. The Digital Preservation Outreach and Education program here at the Library aims to do just that, by providing the materials as well as the hands-on instruction to help build the expertise needed for current and future professionals working on digital preservation.
Recently, the Library was host to a meeting of the DPOE Working Group, consisting of a core group of experts and educators in the field of digital preservation. The Working Group participants were Robin Dale (Institute of Museum and Library Services), Sam Meister (University of Montana-Missoula), Mary Molinaro (University of Kentucky), and Jacob “Jake” Nadal (Princeton University). The meeting was chaired by George Coulbourne of the Library of Congress, and Library staffers Barrie Howard and Kris Nelson also participated.
The main goal of the meeting was to update the existing DPOE Curriculum, which is used as the basis for the Program’s training workshops and then subsequently, by the trainees themselves. A survey is being conducted to gather even more information, and will help inform this curriculum as well (see a related blog post). The Working Group reviewed and edited all of the six substantive modules which are based on terms from the OAIS Reference Model framework:
- Identify (What digital content do you have?)
- Select (What portion of your digital content will be preserved?)
- Store (What issues are there for long-term storage?)
- Protect (What steps are needed to protect your digital content?)
- Manage (What provisions are needed for long-term management?)
- Provide (What considerations are there for long-term access?)
The group also discussed adding a seventh module on implementation. Each of these existing modules contains a description, goals, concepts and resources designed to be used by current and/or aspiring digital preservation practitioners.
Mary Molinaro, Director, Research Data Center at the University of Kentucky Libraries, noted that “as we worked through the various modules it became apparent how flexible this curriculum is for a wide range of institutions. It can be adapted for small, one-person cultural heritage institutions and still be relevant for large archives and libraries.”
Mary also spoke to the advantages of having a focused, group effort to work through these changes: “Digital preservation has some core principles, but it’s also a discipline subject to rapid technological change. Focusing on the curriculum together as an instructor group allowed us to emphasize those things that have not changed while at the same time enhancing the materials to reflect the current technologies and thinking.”
These curriculum modules are currently in the process of further refinement and revision, including an updated list of resources. The updated version of the curriculum will be available later this month. The Working Group also recommended some strategies for extending the curriculum to address executive audiences, and how to manage the process of updating the curriculum going forward.
In a previous blog post, the NDSA Standards and Practices Working Group announced the opening of a survey to rank issues in preserving video collections. The survey closed on August 2, 2014 and while there’s work ahead to analyze the results and develop action plans, we can share some preliminary findings.
We purposely cast a wide net in advertising the survey so that respondents represented a range of institutions, experience and collections. About 54% of the respondents who started the survey answered all the required questions.
The blog post on The Signal was the most popular means to get the word out (27%) followed by the Association of Moving Image Archivists list (13%) and the NDSA-ALL list (11%). A significant number of respondents (25%) were directed to the survey through other tools including Twitter, Facebook, PrestoCentre Newsletter and the survey bookmarks distributed at the Digital Preservation 2014 meeting.
The vast majority of respondents who identified their affiliation were from the United States; other countries represented include Germany, Austria, England, South Africa, Australia, Canada, Denmark and Chile.
The survey identified the top three stumbling blocks in preserving video as:
- Getting funding and other resources to start preserving video (18%)
- Supporting appropriate digital storage to accommodate large and complex video files (14%)
- Locating trustworthy technical guidance on video file formats including standards and best practices (11%)
Respondents report that analog/physical media is the most challenging type of video (73%) followed by born digital (42%) and digital on physical media (34%).
Clearly, this high level data doesn’t tell the whole story and we have work ahead to analyze the results. Some topics we’d like to pursue include using the source of the survey invitation to better understand the context of the communities that answered the survey. Some respondents, such as those alerted to the survey through the announcement on the AMIA list, are expected to have more experience with preserving video than respondents directed to the survey from more general sources like Facebook or Twitter.
How do the responses from more mature programs compare with emerging programs? What can we learn from those who reported certain issues as “solved” within their institution? Might these solutions be applicable to other institutions? What about the institutions reporting that analog video is more challenging than born digital video? Are their video preservation programs just starting out? Do they have much born-digital video yet?
After we better understand the data, the NDSA Standards and Practices Working Group will start to consider what actions might be useful to help lower these stumbling blocks. This may include following up with additional survey questions to define the formats and scopes of current and expected video collections. Stay tuned for a more detailed report about the survey results and next steps!
22 participants from 8 countries - the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic - not to forget umpteen thousand defective or otherwise interesting PDF files brought to the event.
Not only is this my first blog entry on the OPF website, it is also about my first Hackathon. I guess it was Michelle's idea in the first place to organise a Hackathon with the Open Planets Foundation on the PDF topic and to have the event in our library in Hamburg. I am located in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better. (Furthermore, it's Hamburg which has the big airport.)
The preparation for the event was pretty intense for me. Not only did the organisation in Hamburg (food, rooms, water, coffee, dinner event) have to be done; much more intense was the preparation in terms of the hacking itself.
I am a library and information scientist, not a programmer. Sometimes I would rather be a programmer considering my daily best-of-problems, but you should dress for the body you have, not for the body you'd like to have.
Having learned the little I know about writing code within the last 8 months, and most of it just since this July, I am still brand-new to it. As there always is a so-called "summer break" (which means that everybody else is on holiday and I actually have time to work on difficult stuff), I had some very intense Skype calls with Carl from the OPF, who enabled me to put all my work-in-progress PDF tools on GitHub. I learned about Maven and Travis and had not quite recovered when the Hackathon actually started this Monday and we all had to install a virtual Ubuntu machine to be able to try out some best-of tools like DROID, Tika and Fido and run them over our own PDF files.
We had Olaf Drümmer from the PDF Association as our Keynote Speaker for both days. On the first day, he gave us insights about PDF and PDF/A, and when I say insights, I really mean that. Talking about the building blocks of a PDF, the basic object types and encoding possibilities. This was much better than trying to understand the PDF 1.7 specification of 756 pages just by myself alone in the office with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".
We learned about the many different kinds of page content, the page being the most important structural unit of a PDF file, and about the fact that a PDF page can have every size you can think of, but Acrobat 7.0 officially only supports a page dimension up to 381 km. On the second day, we learned about PDF(/A) validation and what would theoretically be needed to have the perfect validator. Talking about the PDF and PDF/A specifications and all the specifications quoted and referenced by these, I am under the impression that it would take some months to read them all - and so much is clear, somebody would have to read and understand them all. The complexity of the PDF file, the flexibility of the viewers and the plethora of users and users' needs will always ensure a heterogeneous PDF reality with all the strangeness and brokenness possible. As far as I remember, it is his guess that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Given these perfectly comprehensible points, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.
As PDF viewers tend to conceal problems and tend to display problematic PDF files in a decent way, they are usually no big help in terms of PDF validation or ensuring long-term-availability, quite the contrary.
Some errors can have a big impact on the long-term availability of PDF files, especially content that is only referred to and not embedded within the file and might just be lost over time. On the other hand, the "invalid page tree node" which e.g. JHOVE likes to put its finger on is not an error, but just a hint that the page tree is not balanced and the page cannot be found in the most efficient way. Even if all the pages were just saved as an array and you had to iterate through the whole array to go to a certain page, this would only slow down the loading, but would not prevent anybody from accessing the page he wants to read, especially if the affected PDF document only has a couple of dozen pages.
During the afternoon of the first day, we collected specific problems everybody has and formed working groups, each engaging with a different problem. One working group (around Olaf) started to go through JHOVE error messages, trying to figure out which ones really bear a risk and what they mean in the first place. Some of the error messages definitely describe real errors where a rule or specification is violated, but which will practically never cause any problems displaying the file. Is this really an error then? Or just bureaucracy? Should a good validator even display this as an error - which formally would be the right thing to do - or not disturb the user unnecessarily?
Another group wanted to create a small Java tool with CSV output that looks into a PDF file and reports which software created the PDF file and which validation errors it contains, starting with PDFBox, as this was easy to implement in Java. We got as far as getting the tool working, but as we brought especially broken PDF files to the event, it is not yet able to cope with all of them; we still have to make it more error-tolerant.
By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world as I do. When I told them I could not wait to see our new tool's output and was anxious to analyse the findings, the answer was just "And neither can I". Usually, I just get furrowed brows and "I do not get why you are interested in something so boring" faces.
A third working group went to another room and tested the already existing tools with the PDF samples people had brought, in the virtual Ubuntu environment.
There were more ideas; some of them seemed too difficult, or even impossible, to solve in such a short time, but some of us are determined to have a follow-up event soon.
For example, Olaf stated that sometimes the text extraction from a PDF file does not work, and the participant who sat next to me suggested that we could start to check the output against dictionaries to see if the output still makes sense. "But there are so many languages," I told him, thinking about my library's content. "Well, start with one," he answered, following the idea that a big problem can often be split into several small ones.
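As a rough illustration of the "start with one" idea, the sketch below scores extracted text by the fraction of its words found in a word list for a single language. The function name `dictionary_hit_rate` and the tiny word list are made up for this example; a real check would load a full dictionary and would have to cope with hyphenation, names and numbers.

```python
import re


def dictionary_hit_rate(text, known_words):
    """Return the fraction of alphabetic tokens in `text` found in `known_words`."""
    # Split into lower-cased runs of letters; digits and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in known_words)
    return hits / len(tokens)


# Hypothetical, deliberately tiny English word list (an assumption for the demo).
ENGLISH = {"the", "page", "tree", "is", "not", "balanced"}

good = dictionary_hit_rate("The page tree is not balanced.", ENGLISH)
bad = dictionary_hit_rate("Tjg pxge trww xs nwt bqlancfd.", ENGLISH)
```

A low score would only be a hint that the extraction may be broken (or that the text is in another language), so it would flag files for a closer look rather than prove anything.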
Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but some others doubted this information could still be retrieved.
When the event was over on Tuesday around 5 pm, we were all tired, but happy, with clear ideas or new interesting problems in our heads.
And just because I was already asked this today, as I might still look slightly tired: we did sleep during the night. We did not hack all the way through or sleep on mattresses in our library. Some of us had quite a few pitchers of beer during the evening, but I am quite sure everybody made it to his or her hotel room.
Twitter Hashtag #OPDFPDF
Preservation Topics: Open Planets Foundation
Preserving and managing research data is a significant concern for scientists and staff at research libraries. With that noted, many likely don’t realize the length of time in which valuable scientific data has accrued on a range of media in research settings. That is, data management often needs to be both backward- and forward-looking, considering a range of legacy media and formats as well as contemporary practice. To that end, I am excited to interview Emily Frieda Shaw, Head of Preservation and Reformatting at Ohio State University (prior to August 2014 she was the Digital Preservation Librarian at the University of Iowa Libraries). Emily talked about her work on James Van Allen’s data from the Explorer satellites launched in the 1950s at the Digital Preservation 2014 conference and I am excited to explore some of the issues that work raises.
Trevor: Could you tell us a bit about the context of the data you are working with? Who created it, how was it created, what kind of media is it on?
Emily: The data we’re working with was captured on reel-to-reel audio tapes at receiving stations around the globe as Explorer 1 passed overhead in orbit around Earth in the early months of 1958. Explorer predated the founding of NASA and was sent into orbit by a research team led by Dr. James Van Allen, then a Professor of Physics at the University of Iowa, to observe cosmic radiation. Each reel-to-reel Ampex tape contains up to 15 minutes of data on 7 tracks, including time stamps, station identifications and weather reports from station operators, and the “payload” data consisting of clicks, beeps and squeals generated by on-board instrumentation measuring radiation, temperature and micrometeorite impacts.
Once each tape was recorded, it was mailed to Iowa for analysis by a group of graduate students. A curious anomaly quickly emerged: At certain altitudes, the radiation data disappeared. More sensitive instruments sent into orbit by Dr. Van Allen’s team soon after Explorer 1 confirmed what this anomaly suggested: the Earth is surrounded by belts of intense radiation, dubbed soon thereafter as the Van Allen Radiation Belts. When the Geiger counter on board Explorer 1 registered no radiation at all, it was, in fact, actually overwhelmed by extremely high radiation.
We believe these tapes represent the first data set ever transmitted from outside Earth’s atmosphere. Thanks to the hard work and ingenuity of our friends at The MediaPreserve, and some generous funding from the Carver Foundation, we now have about 2 TB of .wav files converted from the Explorer 1 tapes, as well as digitized lab notebooks and personal journals of Drs. Van Allen and Ludwig, along with graphs, correspondence, photos, films and audio recordings.
In our work with this collection, the biggest discovery was a 700-page report from Goddard comprised almost entirely of data tables that represent the orbital ephemeris data set from Explorer 1. This 1959 report was digitized a few years back from the collections at the University of Illinois at Urbana-Champaign as part of the Google Books project and is being preserved in the Hathi Trust. This data set holds the key to interpreting the signals we hear on the tapes. There are some fascinating interplays between analog and digital, past and present, near and far in this project, and I feel very lucky to have landed in Iowa when I did.
Trevor: What challenges does this data represent for getting it off of its original media and into a format that is usable?
Emily: When my colleagues were first made aware of the Explorer mission tapes in 2009, they had been sitting in the basement of a building on the University of Iowa’s campus for decades. There was significant mold growth on the boxes and the tapes themselves, and my colleagues secured an emergency grant from the state to clean, move and temporarily rehouse the tapes. Three tapes were then sent to The MediaPreserve to see if they could figure out how to digitize the audio signals. Bob Strauss and Heath Condiotte hunted down a huge, of-the-era machine that could play back all of the discrete tracks on these tapes. As I understand it, Heath had to basically disassemble the entire thing and replace all of the transistors before he got it to work properly. Fortunately, we were able to play some of the digitized audio tracks from these test reels for Dr. George Ludwig, one of the key researchers on Dr. Van Allen’s team, before he passed away in 2012. Dr. Ludwig confirmed that they sounded — at least to his naked ear — as they should, so we felt confident proceeding with the digitization.
So, soon after I was hired in 2012, we secured funding from a private foundation to digitize the Explorer 1 tapes and proceeded to courier all 700 tapes to The MediaPreserve for thorough cleaning, rehousing and digital conversion. The grant is also funding the development and design of a web interface to the data and accompanying archival materials, which we [Iowa] hope to launch (pun definitely intended) some time this fall.
Trevor: What stakeholders are involved in the project? Specifically, I would be interested to hear how you are working with scientists to identify what the significant properties of these particular tapes are.
Emily: No one on the project team we assembled within the Libraries has any particular background in near-Earth physics. So we reached out to our colleagues in the University of Iowa Department of Physics, and they have been tremendously helpful and enthusiastic. After all, this data represents the legacy of their profession in a big picture sense, but also, more intimately, the history of their own department (their offices are in Van Allen Hall). Our colleagues in Physics have helped us understand how the audio signals were converted into usable data, what metadata might be needed in order to analyze the data set using contemporary tools and methods, how to package the data for such analysis, and how to deliver it to scientists where they will actually find and be able to use it.
We’re also working with a journalism professor from Northwestern University, who was Dr. Van Allen’s biographer, to weave an engaging (and historically accurate) narrative to tell the Explorer story to the general public.
Trevor: How are you imagining use and access to the resulting data set?
Emily: Unlike the digitized photos, books, manuscripts, music recordings and films we in libraries and archives have become accustomed to working with, we’re not sure how contemporary scientists (or non-scientists) might use a historic data set like this. Our colleagues in Physics have assured us that once we get this data (and accompanying metadata) packaged into the Common Data Format and archived with the National Space Science Data Center, analysis of the data set will be pretty trivial. They’re excited about this and grateful for the work we’re doing to preserve and provide access to early space data, and believe that almost as quickly as we are able to prepare the data set to be shared with the physics community, someone will pick it up and analyze it.
As the earliest known orbital data set, we know that this holds great historical significance. But the more we learn about Explorer 1, the less confident we are that the data from this first mission is/was scientifically significant. The Explorer I data (or rather, the points in its orbit during which the instruments recorded no data at all) hinted at a big scientific discovery. But it was really Explorer III, sent into orbit in the summer of 1958 with more sophisticated instrumentation, that produced the data that led to the big “ah-hah” moment. So, we’re hoping to secure funding to digitize the tapes from that mission, which are currently in storage.
I also think there might be some interesting, as-yet-unimagined artistic applications for this data. Some of the audio is really pretty eerie and cool space noise.
Trevor: More broadly, how will this research data fit into the context of managing research data at the university? Is data management something that the libraries are getting significantly involved in? If so could you tell us a bit about your approach.
Emily: The University of Iowa, like all of our peers, is thinking and talking a lot about research data management. The Libraries are certainly involved in these discussions, but as far as I can tell, the focus is, understandably, on active research and is motivated primarily by the need to comply with funding agency requirements. In libraries, archives and museums, many of us are motivated by a moral imperative to preserve historically significant information. However, this ethos does not typically pervade in the realm of active, data-intensive research. Once the big discovery has been made and the papers have been published, archiving the data set is often an afterthought, if not a burden. The fate of the Explorer tapes, left to languish in a damp basement for decades, is a case in point. Time will not be so kind to digital data sets, so we have to keep up the hard work of advocating, educating and partnering with our research colleagues, and building up the infrastructure and services they need to lower the barriers to data archiving and sharing.
Trevor: Backing up out of this particular project, I don’t think I have spoken with many folks with the title “Digital Preservation Librarian.” Other than this, what kinds of projects are you working on and what sort of background did you have to be able to do this sort of work? Could you tell us a bit about what that role means in your case? Is it something you are seeing crop up in many research libraries?
Emily: My professional focus is on the preservation of collections, whether they are manifest in physical or digital form, or both. I’ve always been particularly interested in the overlaps, intersections, and interdependencies of physical/analog and digital information, and motivated to play an active role in the sociotechnical systems that support its creation, use and preservation. In graduate school at the University of Illinois, I worked both as a research assistant with an NSF-funded interdisciplinary research group focused on information technology infrastructure, and in the Library’s Conservation Lab, making enclosures, repairing broken books, and learning the ins and outs of a robust research library preservation program. After completing my MLIS, I pursued a Certificate of Advanced Study in Digital Libraries while working full-time in Preservation & Conservation, managing multi-stream workflows in support of UIUC’s scanning partnership with Google Books.
I came to Iowa at the beginning of 2012 into the newly-created position of Digital Preservation Librarian. My role here has shifted with the needs and readiness of the organization, and has included the creation and management of preservation-minded workflows for digitizing collections of all sorts, the day-to-day administration of digital content in our redundant storage servers, researching and implementing tools and processes for improved curation of digital content, piloting workflows for born-digital archiving, and advocating for ever-more resources to store and manage all of this digital stuff. Also, outreach and inreach have both been essential components of my work. As a profession, we’ve made good progress toward raising awareness of digital stewardship, and many of us have begun making progress toward actually doing something about it, but we still have a long way to go.
And actually, I will be leaving my current position at Iowa at the end of this month to take on a new role as the Head of Preservation and Reformatting for The Ohio State University Libraries. My experience as a hybrid preservationist with understanding and appreciation of both the physical and digital collections will give me a broad lens through which to view the challenges and opportunities for long-term preservation and access to research collections. So, there may be a vacancy for a digital preservationist at Iowa in the near future.
In the early days of HTML, the most hated tag was the <blink> tag, which made text under it blink. There were hardly any sensible uses for it, and a lot of browsers now disable it. I just tested it in this post, and WordPress actually deleted the tag from my draft when I tried to save it. (I approve!)
Today, though, the <blink> tag isn’t annoying enough. Now we have the animated GIF. It’s been around since the eighties, but for some reason it’s become much more prevalent recently. It’s the equivalent of waving a picture in your face while you’re trying to read something.
I can halfway understand it when it’s done in ads. Advertisers want to get your attention away from the page you’re reading and click on the link to theirs. What I don’t understand is why people use it in their own pages and user icons. It must be a desire to yell “Look how clever I am!!!” over and over again as the animation cycles.
If you think that your web page is boring and adding some animated GIFs is just what’s needed to bring back the excitement — Don’t. Just don’t.
Update: I just discovered that a page that was driving me crazy because even disabling animated GIFs wouldn’t stop it was actually using the <marquee> tag. I believe that tag is banned by the Geneva Convention.
Tagged: GIF, HTML, rant
The following is a guest post by Chris Adams from the Repository Development Center at the Library of Congress, the technical lead for the World Digital Library.
Preservation is usually about maintaining as much information as possible for the future but access requires us to balance factors like image quality against file size and design requirements. These decisions often require revisiting as technology improves and what previously seemed like a reasonable compromise now feels constricting.
I recently ran into an example of this while working on the next version of the World Digital Library website, which still has substantially the same look and feel as it did when the site launched in April of 2009. The web has changed considerably since then with a huge increase in users on mobile phones or tablets and so the new site uses responsive design techniques to adjust the display for a wide range of screen sizes. Because high-resolution displays are becoming common, this has also involved serving images at larger sizes than in the past — perfectly in keeping with our goal of keeping the focus on the wonderful content provided by WDL partners.
When viewing the actual scanned items, this is a simple technical change to serve larger versions of each but one area posed a significant challenge: the thumbnail or reference image used on the main item page. These images are cropped from a hand-selected master image to provide consistently sized, interesting images which represent the nature of the item – a goal which could not easily be met by an automatic process. Unfortunately the content guidelines used in the past specified a thumbnail size of only 308 by 255 pixels, which increasingly feels cramped as popular web sites feature much larger images and modern operating systems display icons as large as 256×256 or even 512×512 pixels. A “Retina” icon is significantly larger than the thumbnail below:
Going back to the source
All new items being processed for WDL now include a reference image at the maximum possible resolution, which the web servers can resize as necessary. This left around 10,000 images which had been processed before the policy changed and nobody wanted to take time away from expanding the collection to reprocess old items. The new site design allows flexible image sizes but we wanted to find an automated solution to avoid a second-class presentation for the older items.
Our original master images are much higher resolution and we had a record of the source image for each thumbnail but not the crop or rotation settings which had been used to create the original thumbnail. Researching the options for reconstructing those settings led me to OpenCV, a popular open-source computer vision toolkit.
At first glance, the OpenCV template matching tutorial appears to be perfect for the job: give it a source image and a template image and it will attempt to locate the latter in the former. Unfortunately, the way it works is by sliding the template image around the source image one pixel at a time until it finds a close match, a common approach but one which fails when the images differ in size or have been rotated or enhanced.
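To make the limitation concrete, here is a minimal pure-Python sketch of the sliding-window idea, using the sum of squared differences as the score. It is an illustration only, not the code from the OpenCV tutorial; images are assumed to be small grids of grayscale values, and `match_template` is a name chosen for this example.

```python
def match_template(source, template):
    """Slide `template` over `source` and return ((row, col), score) of the best fit.

    Both images are lists of rows of grayscale values. The score is the sum of
    squared differences, so 0 means a pixel-perfect match.
    """
    src_h, src_w = len(source), len(source[0])
    tpl_h, tpl_w = len(template), len(template[0])
    best_pos, best_score = None, None
    # Try every position where the template fits entirely inside the source.
    for row in range(src_h - tpl_h + 1):
        for col in range(src_w - tpl_w + 1):
            score = sum(
                (source[row + i][col + j] - template[i][j]) ** 2
                for i in range(tpl_h)
                for j in range(tpl_w)
            )
            if best_score is None or score < best_score:
                best_pos, best_score = (row, col), score
    return best_pos, best_score


# Toy data: the template is an exact crop of the source at row 1, column 2.
source = [
    [0, 0, 0, 0],
    [0, 0, 9, 8],
    [0, 0, 7, 6],
    [0, 0, 0, 0],
]
template = [[9, 8], [7, 6]]
position, score = match_template(source, template)
```

Because every comparison is pixel against pixel at the template's own size and orientation, a thumbnail that has been downscaled or rotated relative to the master never lines up with any window, which is the failure described above.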
Fortunately, there are far more advanced techniques available for what is known as scale and rotation invariant feature detection and OpenCV has an extensive feature detection suite. Encouragingly, the first example in the documentation shows a much harder variant of our problem: locating a significantly distorted image within a photograph – fortunately we don’t have to worry about matching the 3D distortion of a printed image!
Finding the image
The locate-thumbnail program works in four steps:
- Locate distinctive features in each image, where features are simply mathematically interesting points which will hopefully be relatively consistent across different versions of the image – resizing, rotation, lighting changes, etc.
- Compare the features found in each image and attempt to identify the points in common
- If a significant number of matches were found, replicate any rotation which was applied to the original image
- Generate a new thumbnail at full resolution and save the matched coordinates and rotation as a separate data file in case future reprocessing is required
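The third step can be sketched without any imaging library: once matched point pairs are available, a least-squares fit recovers the scale, rotation and offset that map thumbnail coordinates onto the master image. This is a simplified stand-in which assumes the matches are already clean and that the crop was only scaled, rotated and shifted; it is not the locate-thumbnail code itself, and the function name and sample points are invented for illustration.

```python
import cmath
import math


def estimate_similarity(thumb_points, master_points):
    """Fit master = a * thumb + b, treating each (x, y) point as a complex number.

    Returns (scale, rotation_in_degrees, (offset_x, offset_y)).
    """
    if len(thumb_points) != len(master_points) or len(thumb_points) < 2:
        raise ValueError("need at least two matched point pairs")
    thumb = [complex(x, y) for x, y in thumb_points]
    master = [complex(x, y) for x, y in master_points]
    thumb_mean = sum(thumb) / len(thumb)
    master_mean = sum(master) / len(master)
    # Least-squares solution for the complex factor a = scale * e^(i * angle).
    numerator = sum(
        (m - master_mean) * (t - thumb_mean).conjugate()
        for t, m in zip(thumb, master)
    )
    denominator = sum(abs(t - thumb_mean) ** 2 for t in thumb)
    if denominator == 0:
        raise ValueError("thumbnail points must not all coincide")
    a = numerator / denominator
    b = master_mean - a * thumb_mean
    return abs(a), math.degrees(cmath.phase(a)), (b.real, b.imag)


# Made-up matches: the master is 4x larger, turned 90 degrees, shifted by (1000, 200).
thumb_pts = [(0, 0), (100, 0), (0, 50)]
master_pts = [(1000, 200), (1000, 600), (800, 200)]
scale, angle, offset = estimate_similarity(thumb_pts, master_pts)
```

In image coordinates, where y grows downward, a positive angle appears clockwise on screen. Real matches contain outliers, so a production version would normally reject bad pairs (for example with a RANSAC-style fit) before trusting the estimate.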
You can see this process in the sample visualizations below which have lines connecting each matched point in the thumbnail and full-sized master image:
The technique even works surprisingly well with relatively low-contrast images such as this 1862 photograph from the Thereza Christina Maria Collection courtesy of the National Library of Brazil where the original thumbnail crop included a great deal of relatively uniform sky or water with few unique points:
Scaling up
After successful test runs on a small number of images, locate-thumbnail was ready to try against the entire collection. We added a thumbnail reconstruction job to our existing task queue system and over the next week each item was processed using idle time on our cloud servers. Based on the results, some items were reprocessed with different parameters to better handle some of the more unusual images in our collection, such as this example where the algorithm matched only a few points in the drawing, producing an interesting but rather different result:
Reviewing the results
Automated comparison
For the first pass of review, we wanted a fast way to compare images which should be very close to identical. For this work, we turned to libphash which attempts to calculate the perceptual difference between two images so we could find gross failures rather than cases where the original thumbnail had been slightly adjusted or was shifted by an insignificant amount. This approach is commonly used to detect copyright violations but it also works well as a way to quickly and automatically compare images or even cluster a large number of images based on similarity.
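To give a feel for how a perceptual comparison works, here is a small sketch using a much simpler technique than the one libphash implements: an "average hash", where an image is reduced to a grid of cells and each cell becomes one bit depending on whether it is brighter than the overall mean. The function names and the synthetic test images are hypothetical, and the image is assumed to be a plain list of rows of grayscale values rather than a decoded file.

```python
def average_hash(pixels, hash_size=8):
    """Return hash_size * hash_size bits describing a grayscale image.

    `pixels` is a list of rows of brightness values. Both dimensions are
    assumed to be at least `hash_size`.
    """
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    # Shrink the image to a hash_size x hash_size grid by block averaging.
    cells = []
    for i in range(hash_size):
        top, bottom = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            left, right = j * width // hash_size, (j + 1) * width // hash_size
            block = [
                pixels[y][x]
                for y in range(top, bottom)
                for x in range(left, right)
            ]
            cells.append(sum(block) / len(block))

    # One bit per cell: is it brighter than the average cell?
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming_distance(hash_a, hash_b):
    """Count differing bits; 0 means the hashes are identical."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


# Synthetic 16x16 images: a left-to-right gradient and two variants of it.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[value + 5 for value in row] for row in gradient]
inverted = [[255 - value for value in row] for row in gradient]

small_change = hamming_distance(average_hash(gradient), average_hash(brighter))
gross_change = hamming_distance(average_hash(gradient), average_hash(inverted))
```

A uniform brightness shift leaves the hash unchanged, while inverting the image flips every bit, which is the kind of separation between slight adjustments and gross failures described above.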
A simple Python program was created and run across all of the reconstructed images, reporting the similarity of each pair for human review. The gross failures were used to correct bugs in the reconstruction routine, and the review also turned up a few interesting cases where the thumbnail had been significantly altered, such as this cover page where a stamp added by a previous owner had been digitally removed:
http://www.wdl.org/en/item/7778/ now shows that this was corrected to follow the policy of fidelity to the physical item.
Human review
The entire process until this point has been automated but human review was essential before we could use the results. A simple webpage was created which offered fast keyboard navigation and the ability to view sets of images at either the original or larger sizes:
This was used to review items which had been flagged by phash as matching below a particular threshold and to randomly sample items to confirm that the phash algorithm wasn’t masking differences which a human would notice.
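The selection logic described here, everything below the threshold plus a random spot check of what passed, could look roughly like the sketch below. It is an illustration under assumptions: similarity is taken to be a score between 0 and 1 where higher means a closer match, and `build_review_queue`, the item names and the threshold value are all hypothetical.

```python
import random


def build_review_queue(similarity_by_item, threshold, sample_size, seed=0):
    """Split items into those flagged for review and a random spot check.

    Items scoring below `threshold` are always reviewed. From the rest,
    `sample_size` items are drawn at random so a human can confirm the
    automated score is not hiding visible differences.
    """
    flagged = sorted(
        item for item, score in similarity_by_item.items() if score < threshold
    )
    passed = sorted(
        item for item, score in similarity_by_item.items() if score >= threshold
    )
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    spot_checks = rng.sample(passed, min(sample_size, len(passed)))
    return flagged, spot_checks


# Hypothetical similarity scores for five reconstructed thumbnails.
scores = {
    "item-1": 0.99,
    "item-2": 0.42,
    "item-3": 0.97,
    "item-4": 0.61,
    "item-5": 0.95,
}
flagged, spot_checks = build_review_queue(scores, threshold=0.9, sample_size=2)
```

Seeding the random generator means the same spot-check sample can be regenerated later if reviewers need to revisit it.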
In some cases where the source image had interacted poorly with the older down-sampling, the results are dramatic – the reviewers reported numerous eye-catching improvements such as this example of an illustration in an Argentinian newspaper:
This project was completed towards the end of this spring and I hope you will enjoy the results when the new version of WDL.org launches soon. On a wider scale, I also look forward to finding other ways to use computer-vision technology to process large image collections – many groups are used to sophisticated bulk text processing but many of the same approaches are now feasible for image-based collections and there are a number of interesting possibilities such as suggesting items which are visually similar to the one currently being viewed or using clustering or face detection to review incoming archival batches.
Most of the tools referenced above have been released as open-source and are freely available:
We recently posted an article on the UK Web Archive blog that may be of interest here, User-Driven Digital Preservation, where we summarise our work with the SCAPE Project on a little prototype application that explores how we might integrate user feedback and preservation actions into our usual discovery and access processes. The idea is that we need to gather better information about which resources are difficult for users to use, and which formats they would prefer, so that we can use this data to drive our preservation work.
The prototype also provides a convenient way to run Apache Tika and DROID on any URL, and exposes the contents of its internal 'format registry' as a set of web pages that you can browse through (e.g. here's what it knows about text/plain). It only supports a few preservation actions right now, but it does illustrate what might be possible if we can find a way to build a more comprehensive and sustainable system.
Preservation Topics: Preservation Actions, Web Archiving, SCAPE
I had the distinct pleasure of moderating the opening plenary session of the Joint Annual Meeting of COSA, NAGARA and SAA in Washington D.C. in early August. The panel was on the “state of access,” and I shared the dais with David Cuillier, an Associate Professor and Director of the University of Arizona School of Journalism, as well as the president of the Society of Professional Journalists; and Miriam Nisbet, the Director of the Office of Government Information Services at the National Archives and Records Administration.
The panel was a great opportunity to tease out the spaces between the politics of “open government” and the technologies of “open data” but our time was much too short and we had to end just when the panelists were beginning to get to the juicy stuff.
There were so many more places we could have taken the conversation:
- Is our government “transparent enough”? Do we get the “open government” we deserve as (sometimes ill-informed) citizens?
- What is the role of outside organizations in providing enhanced access to government data?
- What are the potential benefits of reducing the federal government role in making data available?
- Is there the right balance between voluntary information openness and the need for the Freedom of Information Act?
- What are the job opportunities for archivists and records managers in the new “open information” environment?
- Have you seen positive moves towards addressing digital preservation and stewardship issues regarding government information?
I must admit that when I think of “access” and “open information” I’m thinking almost exclusively about digital data because that’s the sandbox I play in. At past SAA conferences I’ve had the feeling that the discussion of digital preservation and stewardship issues was something that happened in the margins. At this year’s meeting those issues definitely moved to the center of the conversation.
Just look at this list of sessions running concurrently during a single hour on Thursday August 14, merely the tip of the iceberg:
- Getting Things Done with Born-Digital Collections
- Spreading the Word: Access to Oral History Collections in the Digital Age
- Editathon: You Have One Hour to Increase Access to Archival Science Info on Wikipedia…Go!
- Ethics, Provenance, Metadata: Trust and Recordkeeping in the Cloud?
There were also a large number of web archiving-related presentations and panels including the SAA Web Archiving Roundtable meeting (with highlights of the upcoming NDSA Web Archiving Survey report), the Archive-It meetup and very full panels Friday and Saturday.
I was also pleased to see that the work of NDIIPP and the National Digital Stewardship Alliance was getting recognized and used by many of the presenters. There were numerous references to the 2014 National Agenda for Digital Stewardship and the Levels of Preservation work and many NDSA members presenting and in the audience. You’ll find lots more on the digital happenings at SAA on the #SAA14 twitter stream.
The increased focus on digital is great news for the archival profession. Digital stewardship is an issue where our expertise can really be put to good use and where we can have a profound impact. Younger practitioners have recognized this for years and it’s great that the profession itself is finally getting around to it.
It is well-known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can also be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on this.
PDF/A is a profile
First, it's important to stress that each of the PDF/A standards (A-1, A-2 and A-3) is really just a profile within the PDF format. More specifically, PDF/A-1 offers a subset of PDF 1.4, whereas PDF/A-2 and PDF/A-3 are based on the ISO 32000 version of PDF 1.7. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in 'regular' PDF. Also, they narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. Keeping this in mind, it's easy to see that migrating an arbitrary PDF to PDF/A can easily result in problems.
Loss, alteration during migration
Suppose, as an example, that we have a PDF that contains a movie. This is prohibited in PDF/A, so migrating to PDF/A will simply result in the loss of the multimedia content. Another example is fonts: all fonts in a PDF/A document must be embedded. But what happens if the source PDF uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently use some alternative, perhaps similar font? And how do you check for this?
Complexity and effect of errors
Also, migrations like these typically involve a complete re-processing of the PDF's internal structure. The format's complexity implies that there's a lot of potential for things to go wrong in this process. This is particularly true if the source PDF contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don't really have any tools for detecting such errors (i.e. a sufficiently reliable PDF validator), these cases can be difficult to deal with. Some further considerations can be found here (the context there is slightly different, but the risks are similar).
Digitised vs born-digital
The origin of the source PDFs may be another thing to take into account. If PDFs were originally created as part of a digitisation project (e.g. scanned books), the PDF is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such PDFs to PDF/A is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in PDF/A. At the same time, this also means that the benefits of migrating such files to PDF/A are pretty limited, since the source PDFs weren't problematic to begin with!
The potential benefits of PDF/A may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous section, the migration is more complex, and there's just a lot more that can go wrong (see also here for some additional considerations).
Conclusions
Although migrating PDF documents to PDF/A may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to be successful for source PDFs that weren't problematic to begin with, which defeats the very purpose of migrating to PDF/A. For specific cases, migration to PDF/A may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of PDFs (both source and destination!), it would also seem prudent to always keep the originals.
Preservation Topics: Preservation Actions, Migration, Tools
At the 2014 Society of American Archivists meeting, the CAD/BIM Taskforce held a session titled “Frameworks for the Discussion of Architectural Digital Data” to consider the daunting matter of archiving computer-aided design and Building Information Modelling files. This was the latest evidence that — despite some progress in standards and file exchange — archivists and the international digital preservation community at large are trying to get a firm grasp on the slippery topic of preserving CAD files.
CAD is a suite of design tools, software for 3-D modelling, simulation and testing. It is used in architecture, geographic information systems, archaeology, survey data, geophysics, 3-D printing, engineering, gaming, animation and just about any situation that requires a 3-D virtual model. It comprises geometry, intricate calculations, vector graphics and text.
The data in CAD files resides in structurally complex inter-related layers that are capable of much more than displaying models. For example, engineers can calculate stress and load, volume and weight for specific materials, the center of gravity and visualize cause-and-effect. Individual CAD files often relate and link to other CAD files to form a greater whole, such as parts of a machine or components in a building. Revisions are quick in CAD’s virtual environment, compared to paper-based designs, so CAD has eclipsed paper as the tool of choice for 3-D modelling.
CAD files — particularly as used by scientists, engineers and architects — can contain vital information. Still, CAD files are subject to the same risk that threatens all digital files, major and minor: failure of accessibility — being stuck on obsolete storage media or dependent on a specific program, in a specific version, on a specific operating system. In particular, the complexity and range of specifications and formats for CAD files make them even more challenging than many other kinds of born-digital materials.
As for CAD software, commerce thrives on rapid technological change, new versions of software and newer and more innovative software companies. This is the natural evolution of commercial technology. But each new version and type of CAD software increases the risk of software incompatibility and inaccessibility for CAD files created in older versions of software. Vendors, of course, do not have to care about that; the business of business is business — though, in fairness, businesses may continually surpass customer needs and expectations by creating newer and better features. That said, many CAD customers have long realized that it is important — and may someday be crucial — to be able to archive and access older CAD files.
Building Information Modelling files and Product Lifecycle Management files also require a digital-preservation solution. BIM and PLM integrate all the information related to a major project, not only the CAD files but also the financial, legal, email and other ancillary files.
Part of a digital preservation workflow is compatibility and portability between systems. So one of the most significant standards for the exchange of product manufacturing information of CAD files is ISO 10303, known as the “Standard for the Exchange of Product model data” or STEP. Michael J. Pratt, of the National Institute of Standards and Technology, wrote in 2001 (pdf), “the development of STEP has been one of the largest efforts ever undertaken by ISO.”
Here are some other CAD preservation resources, many of which refer to STEP:
- The United States National CAD Standard encompasses The American Institute of Architects’ CAD Layer Guidelines, the Construction Specifications Institute’s Uniform Drawing System and the National Institute of Building Sciences Plotting Guidelines.
- MIT conducted a two-year project (which included digital preservation pioneer Stephen Abrams on their advisory board) called “Future-proofing Architectural Computer-Aided DEsign,” where they analyzed CAD data from three renowned architects and their projects. The FACADE project’s final report (pdf) details recommendations and best practices.
- The National Archives’ “Revised Format Guidance for the Transfer of Permanent Electronic Records” lists “Preferred” and “Acceptable” formats for CAD.
- The Art Institute of Chicago Department of Architecture published “Collecting, Archiving and Exhibiting Digital Design Data” (pdf).
- In July, 2013, the Digital Preservation Coalition held a conference titled “Preserving Computer Aided Design.” In 2013, the DPC also released their Technology Watch report, authored by Alex Ball, “Preserving Computer-Aided Design.”
- ISO 13567 is a CAD layer standard.
- CAD standards on Wikipedia.
- ISO 16739 for BIM data.
- The “CAD: A Guide to Good Practice” is a collaborative effort from the UK’s Archaeology Data Service and the US’s Digital Antiquity.
- The list of CAD file formats is stunning.
Some simple preservation information that comes up repeatedly is to save the original CAD file in its original format. Save the hardware, software and system that runs it too, if you can. Save any metadata or documentation and document a one-to-one relationship with each CAD file’s plotted sheet.
The usual digital-preservation practice applies, which is to organize the files, back up the files to a few different storage devices and put one in a geographically remote location in case of disaster, and every seven years or so migrate to a current storage medium to keep the files accessible. Given the complexity of these files, and recognizing that at its heart digital preservation is an attempt to hedge our bets about mitigating a range of potential risks, it is also advisable to try to generate a range of derivative files which are likely to be more viable in the future. That is, keep the originals, and try to also export to other formats that may lose some functionality and properties but which are far more likely to be able to be opened in the future. The final report from the FACADE project makes this recommendation: “For 3-D CAD models we identified the need for four versions with distinct formats to insure long-term preservation. These are:
1. Original (the originally submitted version of the CAD model)
2. Display (an easily viewable format to present to users, normally 3D PDF)
3. Standard (full representation in preservable standard format, normally IFC or STEP)
4. Dessicated (simple geometry in a preservable standard format, normally IGES)”
CAD files now join paper files — such as drawings, plans, elevations, blueprints, images, correspondence and project records — in institutional archives and firms’ libraries. In addition to the ongoing international work on standards and preservation, there needs to be a dialog with the design-software industry to work toward creating archival CAD files in an open-preservation format. Finally, trained professionals need to make sense of the CAD files to better archive them and possibly get them up and running again for production, academic, legal or other professional purposes. That requires knowledge of CAD software, file construction and digital preservation methods.
Either CAD users need better digital curatorial skills to manage their CAD archives or digital archivists need better CAD skills to curate the archives of CAD users. Or both.
The first part of the workshop will be a panel session at which David Giaretta (APARSEN), Ross King (SCAPE), and Ed Fay (OPF) will be discussing digital preservation.
After this a range of digital preservation projects will be presented at different stalls. This part will begin with an elevator pitch session at which each project will have exactly one minute to present their project.
Everybody is invited to visit all stalls and learn more about the different projects, their results and thoughts on sustainability. At the same time there will be a special ‘clinic’ stall at which different experts will be ready to answer any questions you have on their specific topic – for instance PREMIS metadata or audit processes.
The workshop takes place at City University London, 8 September 2014, 1pm to 5pm.
Looking forward to meeting you!
Register for the workshop (please note: registration for this workshop should not be done via the DL registration page)
Oh, did I forget? We also have a small competition going on… Read more.
Preservation Topics: SCAPE
I had occasion today to look up the “Rendering Matters” report I wrote while at Archives New Zealand (I was looking for this list of questions/object attributes that were tested for and included as an appendix in the report) and got distracted re-reading the findings in the report.
Summary findings from “Rendering Matters”:
- The choice of rendering environment (software) used to open or “render” an office file invariably has an impact on the information presented through that rendering. When files are rendered in environments that differ from the original then they will often present altered information to the user. In some cases the information presented can differ from the original in ways that may be considered significant.
- The emulated environments, with minimal testing or quality assurance, provided significantly better rendering functionality than the modern office suites. 60-100% of the files rendered using the modern office suites displayed at least one change compared to 22-35% of the files rendered using the emulated hardware and original software.
- In general, the Microsoft Office 2007 suite functioned significantly better as a rendering tool for older office files than either the open source LibreOffice suite or Corel’s WordPerfect Office X5 suite.
- Given the effectiveness of modern office applications to open the office files, many files may not need to have content migrated from them at this stage as current applications can render much of the content effectively (and the content’s accessibility will not be improved by performing this migration as the same proportion of the content can currently be accessed).
- Users do not often include a lot of problematic attributes in their files but often include at least one. This in turn indicates a level of unpredictability and inconsistency in the occurrence of rendering issues which may make it difficult to test the results of migration actions on files like these.
There were more detailed findings towards the end of the report:
"The [findings] show quantitatively that the choice of rendering environment (software) used to open or “render” an office file invariably has an impact on the information presented through that rendering. When files are rendered in environments that differ from the original they will often present altered information to the user. In some cases the information presented can differ from the original in ways that may be considered significant. This result is useful as it gives a set of ground-truth data to refer to when discussing the impact of rendering on issues of authenticity, completeness and the evidential value of digital office files.
The results give an indication of the efficacy of modern office suites as rendering tools for older office files. Risk analysis of digital objects in current digital repositories could be informed by this research. Digital preservation risk analysts could use this research to evaluate whether having access to these modern office suites means that files that can be “opened” by them are not at risk.
The results highlight the difficulty and expense in testing migration approaches by showing how long it took to test only ~100 files comprehensively (at least 13.5 hours). Scaling this to 0.5% of 1,000,000 files would give 675 hours or nearly 17 weeks at 40 hours per week. This level of testing may be considered excessive depending on the context, but similarly comprehensive testing of only 100 files per 1,000,000 of each format (.01%) would take at least 13.5 hours per format, per tool. More information on how long testing would take for a variety of different sample sizes and percentages of objects (e.g. 1% of 100,000 objects would take 150 hours) is available in Appendix 3.
The results also show the promise of running original software on emulated hardware to authenticate the rendering of files to ensure that all the content has been preserved. Although emulated environment renderings were not shown to be 100% accurate in this research, they were shown to have a far greater degree of accuracy in their renderings than current office suites (which are the tools currently used for migrating office files). Additionally, some of the changes introduced in the emulated environments may have been due to poor environment configuration.
The results give an indication of how prevalent certain attributes are in office files. With a greater sample size this research could help to show whether or not it is true that “most users only use the same 10% of functionality in office applications” (the data from this small sample indicates that in fact they only use about 10% of the functionality/attributes each, but often it is a different 10%).”
Findings specific to the prevalence of rendering “errors”
Personally, I found the findings related to the prevalence of problematic attributes in the files tested to be most enlightening. The relevant findings from the report are included below:
- "The likelihood that any single file has a particular attribute that does not render properly in a particular rendering environment is low,
- The likelihood that the same file will have at least one attribute that doesn’t render properly in a particular environment is quite high (~60% and above).
In other words, the results indicate that users do not often include a lot of attributes in their files that caused rendering issues when rendered in modern environments but often include at least one. This in turn indicates a level of unpredictability and inconsistency in the occurrence of rendering issues.
A significant challenge for digital preservation practitioners is evaluating the effectiveness of digital preservation approaches. When faced with a large and ever increasing volume of digital files to be preserved, practitioners are forced to consider approaches that can be automated. The results in this report indicate that the occurrence of problematic attributes is inconsistent and they therefore may be difficult to automatically identify. Without identifying such attributes pre-migration it will not be possible to test whether the attributes exist post-migration and so the effectiveness of the migration will not be able to be evaluated. Without automatically identifying such attributes pre-migration then it is unlikely that any effective evaluation will be able to be made cost-effectively. The cost to manually identify these attributes for every object would likely be prohibitively large for most organisations given reasonably sized collections.”
Time to manually validate object rendering
Also included in the appendices was a table estimating the time it would take to manually validate a set % of objects for a given collection size. This was based on the average of 9 minutes it took to undertake the tests as part of the Rendering Matters research. I’ve included this table below; it is sobering.
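The arithmetic behind estimates of this kind is simple enough to sketch. The function below is an illustration, not taken from the report: `validation_hours` is a hypothetical name, and the per-object time is a parameter because the quoted figures use two slightly different rates (13.5 hours for roughly 100 files works out to 8.1 minutes per file, while the appendix estimates use the 9 minute average).

```python
def validation_hours(collection_size, sample_percent, minutes_per_object=9.0):
    """Estimate hours needed to manually validate a percentage of a collection."""
    sampled_objects = collection_size * sample_percent / 100
    return sampled_objects * minutes_per_object / 60


# 1% of 100,000 objects at the 9 minute average.
hours_one_percent = validation_hours(100_000, 1)

# 0.5% of 1,000,000 files at 8.1 minutes per file (13.5 hours / 100 files).
hours_half_percent = validation_hours(1_000_000, 0.5, minutes_per_object=8.1)
weeks_half_percent = hours_half_percent / 40  # at 40 hours per week
```

The first case reproduces the 150 hours quoted from Appendix 3, and the second comes out at about 675 hours, or just under 17 working weeks, matching the figures in the findings above.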
Also included as an appendix in the report, and included in a separate web page, are some examples of the types of rendering issues that were identified, including screenshots e.g.:
Replicating the results
It has now been three and a half years since the publication of this report and as far as I am aware nobody has attempted to replicate the approach or the findings. Personally I found the process as enlightening as the results, and would welcome (and where possible, help) the replication of this research by others.
Every day, people from around the world upload photos to share on a range of social media sites and web applications. The results are astounding; collections of billions of digital photographs are now stored and managed by several companies and organizations. In this context, Yahoo Labs recently announced that they were making a data set of 100 million Creative Commons photos from Flickr available to researchers. As part of our ongoing series of Insights Interviews, I am excited to discuss potential uses and implications for collecting and providing access to digital materials with David Ayman Shamma, a scientist and senior research manager with Yahoo Labs and Flickr.
Trevor: Could you give us a sense of the scope and range of this corpus of photos? What date ranges do they span? The kinds of devices they were taken on? Where they were taken? What kinds of information and metadata they come with? Really, anything you can offer for us to better get our heads around what exactly the dataset entails.
Ayman: There’s a lot to answer in that question. Starting at the beginning, Flickr was an early supporter of the Creative Commons and since 2004 devices have come and gone, photographic volume has increased, and interests have changed. When creating the large-scale dataset, we wanted to cast as wide a representative net as possible. So the dataset is a fair random sample across the entire corpus of public CC images. The photos were uploaded from 2004 to early 2014 and were taken by over 27,000 devices, including everything from camera phones to DSLRs. The dataset is a list of photo IDs with a URL to download a JPEG or video plus some corresponding metadata like tags and camera type and location coordinates. All of this data is public and can generally be accessed from an unauthenticated API call; what we’re providing is a consistent list of photos in a large, rolled-up format. We’ve rolled up some but not all of the data that is there. For example, about 48% of the dataset has longitude and latitude data which is included in the rollup, but comments on the photos have not been included, though they can be queried through the API if someone wants to supplement their research with it.
Trevor: In the announcement about the dataset you mention that there is a 12 GB data set, which seems to have some basic metadata about the images and a 50 TB data set containing the entirety of the collection of images. Could you tell us a bit about the value of each of these separately, the kinds of research both enable and a bit about the kinds of infrastructure required to provide access to and process these data sets?
Ayman: Broadly speaking, research on Flickr can be categorized into two non-exclusive topic areas: social computing and computer vision. In the latter, one has to compute what are called ‘features’ or pixel details about luminosity, texture, cluster and relations to other pixels. The same is true for audio in the videos. In effect, it’s a mathematical fingerprint of the media. Computing these fingerprints can take quite a bit of computational power and time, especially at the scale of 100 million items. While the core dataset of metadata is only 12 GB, a large collection of features reach into the terabytes. Since these are all CC media files, we thought to also share these computed features. Our friends at the International Computer Science Institute and Lawrence Livermore National Labs were more than happy to compute and host a standard set of open features for the world to use. What’s nice is this expands the dataset’s utility. If you’re from an institution (academic or otherwise), computing the features could be a costly set of compute time.
Trevor: The dataset page notes that the dataset has been reviewed to meet “data protection standards, including strict controls on privacy.” Could you tell us a bit about what that means for a dataset like this?
Ayman: The images are all under one of six Creative Commons licenses implemented by Flickr. However, there were additional protections that we put into place. For example, you could upload an image with the license CC Attribution-NoDerivatives and mark it as private. Technically, the image is in the public CC; however, Flickr’s agreement with its users supersedes the CC distribution rights. With that, we only sampled from Flickr’s public collection. There are also some edge cases. Some photos are public and in the CC but the owner set the geo-metadata to private. Again, while the geo-data might be embedded in the original JPEG and is technically under CC license, we didn’t include it in the rollup.
Trevor: Looking at the Creative Commons page for Flickr, it would seem that this isn’t the full set of Creative Commons images. By my count, there are more than 300 million Creative Commons-licensed photos there. How were the 100 million selected, and what factors went into deciding to release a subset rather than the full corpus?
Ayman: We wanted to create a solid dataset given the potential public dataset size; 100 million seemed like a fair sample size that could bring in close to 50% geo-tagged data and about 800 thousand videos. We envision researchers from all over the world accessing this data, so we did want to account for the overall footprint and feature sizes. We’ve chatted about the possibility of ‘expansion packs’ down the road, both to increase the size of the dataset and to include things like comments or group memberships on the photos.
Trevor: These images are all already licensed for these kinds of uses, but I imagine that it would have simply been impractical for someone to collect this kind of data via the API. How does this data set extend what researchers could already do with these images based on their licenses? Researchers have already been using Flickr photos as data, what does bundling these up as a dataset do for enabling further or better research?
Ayman: Well, what’s been happening in the past is people have been harvesting the API or crawling the site. However, there are a few problems with these one-off research collections; the foremost is replication. By having a large and flexible corpus, we aim to set a baseline reference dataset for others to see if they can replicate or improve upon new methods and techniques. A few academic and industry players have created targeted datasets for research, such as ImageNet from Stanford or Yelp’s release of its Phoenix-area reviews. Yahoo Labs itself has released a few small targeted Flickr datasets in the past as well. But in today’s research world, the new paradigm and new research methods require large and diverse datasets, and this is a new dataset to meet the research demands.
Trevor: What kinds of research are you and your colleagues imagining folks will do with these photographs? I imagine a lot of computer science and social network research could make use of them. Are there other areas you imagine these being used in? It would be great if you could mention some examples of existing work that folks have done with Flickr photos to illustrate their potential use.
Ayman: Well, part of the exciting bit is finding new research questions. In one recent example, we began to examine the shape and structure of events through photos. Here, we needed to temporally align geo-referenced photos to see when and where a photo was taken. As it turns out, the time the photo was taken and the time reported by the GPS are off by as much as 10 minutes in 40% of the photos. So, in work that will be published later this year, we designed a method for correcting timestamps that are in disagreement with the GPS time. It’s not something we would have thought we’d encounter, but it’s an example of what makes a good research question. With a large corpus available to the research world at-large, we look forward to others also finding new challenges, both immediate and far-reaching.
Trevor: Based on this, and similar webscope data sets, I would be curious for any thoughts and reflections you might offer for libraries, archives and museums looking at making large scale data sets like this available to researchers. Are there any lessons learned you can share with our community?
Ayman: There’s a fair bit of care and precaution that goes into making collections like this - rarely is it ever just a scrape of public data; ownership and copyright do play a role. These datasets are large collections that reflect people’s practices, behavior and engagement with media like photos, tweets or reviews. So, coming to understand what these datasets mean with regard to culture is something to set our sights on. This applies to the libraries and archives that set out to preserve collections and to researchers and scientists, social and computational alike, who aim to understand them.
In this post I'll be taking a look at format identification of PDF files and highlighting a difference in opinion between format identification tools. Some of the details are a little dry but I'll restrict myself to a single issue and be as light on technical details as possible. I hope I'll show that once the technical details are clear it really boils down to policy and requirements for PDF processing.
Assumptions
I'm considering format identification in its simplest role as first contact with a file that little, if anything, is known about. In these circumstances the aim is to identify the format as quickly and accurately as possible then pass the file to format specific tools for deeper analysis.
I'll also restrict the approach to magic number identification rather than trusting the file extension; more on this a little later.
Software and data
- the fine free file utility (also known simply as file),
- FIDO, and
- Apache Tika.
I used the most up-to-date versions available but will spare the details until I publish the results in full.
So is this a PDF?
There was plenty of disagreement between the results from the different tools; I'll be showing these in more detail at our upcoming PDF Event. For now I'll focus on a single issue: there is a set of files that FIDO and DROID don't identify as PDFs but that file and Tika do. I've attached one example to this post. Google Chrome won't open it but my Ubuntu-based document viewer does. It's a three-page PDF about Rumen Microbiology, and this was obviously the intention of the creator. I've not systematically tested multiple readers yet, but LibreOffice won't open it while Ubuntu's print preview will. Feel free to try the reader of your choice and comment.
What's happening here?
It appears we have a malformed PDF, and this is indeed the case. The issue is caused by a difference in the way that the tools go about identifying PDFs in the first place. This is where it gets a little dull but bear with me. All of these tools use "magic" or "signature" based identification. This means that they look for (hopefully) unique strings of characters in specific positions in the file to work out the format. Here's the Tika 1.5 signature for PDF:
<match value="%PDF-" type="string" offset="0"/>
What this says is look for the string %PDF- (the value) at the start of the file (offset="0") and if it's there identify this as a PDF. The attached file indeed starts:
%PDF-1.2
meaning it's a PDF version 1.2. Now we can have a look at the DROID signature (version 77) for PDF 1.2:
<InternalSignature ID="125" Specificity="Specific">
  <ByteSequence Reference="BOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="0" SubSeqMinOffset="0">
      <Sequence>255044462D312E32</Sequence>
      <DefaultShift>9</DefaultShift>
      <Shift Byte="25">8</Shift>
      <Shift Byte="2D">4</Shift>
      <Shift Byte="2E">2</Shift>
      <Shift Byte="31">3</Shift>
      <Shift Byte="32">1</Shift>
      <Shift Byte="44">6</Shift>
      <Shift Byte="46">5</Shift>
      <Shift Byte="50">7</Shift>
    </SubSequence>
  </ByteSequence>
  <ByteSequence Reference="EOFoffset">
    <SubSequence MinFragLength="0" Position="1" SubSeqMaxOffset="1024" SubSeqMinOffset="0">
      <Sequence>2525454F46</Sequence>
      <DefaultShift>-6</DefaultShift>
      <Shift Byte="25">-1</Shift>
      <Shift Byte="45">-3</Shift>
      <Shift Byte="46">-5</Shift>
      <Shift Byte="4F">-4</Shift>
    </SubSequence>
  </ByteSequence>
</InternalSignature>
This is a little more complex than Tika's signature, but what it says is that a matching file should start with the string %PDF-1.2, which our sample does. This is in the first <ByteSequence Reference="BOFoffset"> section, a beginning-of-file offset. Crucially, this signature adds another condition: that the file contains the string %%EOF within 1024 bytes of the end of the file.
There are two things that are different here. The change in the start condition, i.e. Tika's "%PDF-" vs. DROID's "%PDF-1.2", is there to support DROID's capability to identify versions of formats. Tika simply detects that a file looks like a PDF and returns the application/pdf mime type, and has a single signature for the job. DROID can distinguish between versions and so has 29 different signatures for PDF. This difference is NOT the cause of the problem. The disagreement between the results is caused by DROID's requirement for a valid end-of-file marker, %%EOF. A hex search of our PDF confirms that it doesn't contain an %%EOF marker.
So who's right?
An interesting question. The PDF 1.3 Reference states:
The last line of the file contains only the end-of-file marker, %%EOF. (See implementation note 15 in Appendix H.)
The referenced implementation note reads:
3.4.4, “File Trailer”
15. Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
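As a rough illustration only, the two identification rules described above can be sketched in Python. The function and constant names here are hypothetical and are not taken from Tika or DROID; the byte sequences are the hex values from the DROID signature quoted earlier, and the end-of-file window is a simplification of how DROID actually evaluates its offsets.

```python
# Hypothetical sketch: these names are illustrative, not Tika or DROID APIs.

# Byte sequences decoded from the hex values in the quoted DROID signature.
PDF_1_2_HEADER = bytes.fromhex("255044462D312E32")  # b"%PDF-1.2"
EOF_MARKER = bytes.fromhex("2525454F46")            # b"%%EOF"
EOF_WINDOW = 1024  # assumed size of the region searched at the end of the file


def looks_like_pdf(data: bytes) -> bool:
    """Tika-style rule: the file starts with %PDF- at offset 0."""
    return data.startswith(b"%PDF-")


def matches_pdf_1_2_strict(data: bytes) -> bool:
    """DROID-style rule: %PDF-1.2 at the start AND %%EOF near the end."""
    tail = data[-EOF_WINDOW:]  # the whole file if it is shorter than the window
    return data.startswith(PDF_1_2_HEADER) and EOF_MARKER in tail


# A file with a valid header but no end-of-file marker, like the attached sample,
# passes the first rule and fails the second.
truncated = b"%PDF-1.2\n1 0 obj\n<< >>\nendobj\n"
complete = truncated + b"trailer\n<< >>\n%%EOF\n"
```

Under these assumptions a file missing its %%EOF marker satisfies the loose rule but not the strict one, which is the disagreement seen between the tools.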
So DROID's signature is indeed to the letter of the law plus amendments. It's really a matter of context when using the tools. Does DROID's signature introduce an element of format validation to the identification process? In a way yes, but understanding what's happening and making an informed decision is what really matters.
What's next?
I'll be putting some more detailed results onto GitHub along with a VM demonstrator. I'll tweet and add a short post when this is finished, though it may have to wait until next week.
Preservation Topics: Identification
Attachment: It looks like a PDF to me.... (44.06 KB)
On September 8 the SCAPE/APARSEN workshop Digital Preservation Sustainability on the EU Level will be held at London City University in connection with the DL2014 conference.
The main objective of the workshop is to provide an overview of solutions to challenges within Digital Preservation Sustainability developed by current and past Digital Preservation research projects. The event brings together various EU projects/initiatives to present their solutions and approaches, and to find synergies between them.
In connection with the workshop Digital Preservation Sustainability on the EU Level, SCAPE and APARSEN are launching a competition:
Which message do YOU want to send to the EU for the future of Digital Preservation projects?
You can join the competition on Twitter. Only tweets including the hashtag #DP2EU are contending in the competition. You are allowed to include a link to a text OR one picture with your message. Messages which contain more than 300 characters in total are excluded from the competition, though.
The competition will close September 8th at 16:30 UK time. The workshop panel will then choose one of the tweets as a winner. The winner will receive an e-book reader as a prize.
There are only a few places left for the workshop. Registration for the workshop is FREE and must be completed by filling out the form here - http://bit.ly/DPSustainability. Please don’t register for this workshop on the DL2014 registration page, since this workshop is free of charge!
The following is a guest post from Euan Cochrane, Digital Preservation Manager at Yale University Library. This piece continues and extends exploration of the potential of emulation as a service and virtualization platforms.
Increasingly, the intellectual productivity of scholars involves the creation and development of software and software-dependent content. For universities to act as responsible stewards of these materials we need to have a well-formulated approach to how we can make these legacy works of scholarship accessible.
While there have been significant concerns with the practicality of emulation as a mode of access to legacy software, my personal experience (demonstrated via one of my first websites about Amiga emulation) has always been contrary to that view. It is with great pleasure that I can now illustrate the practical utility of Emulation as a Service via three recent case studies from my work at Yale University Library. Consideration of interactive artwork from 1997, interactive Hebrew texts from a 2004 CD-ROM and finance data from 1998 illustrate that it’s no longer really a question of if emulation is a viable option for access and preservation, but of how we can go about scaling up these efforts and removing any remaining obstacles to their successful implementation.
At Yale University Library we are conducting a research pilot of the bwFLA Emulation as a Service software framework. This framework greatly simplifies the use of emulators and virtualization tools in a wide range of contexts by abstracting all of the emulator configuration (and its associated issues) away from the end-user. As well as simplifying use of emulators it also simplifies access to emulated environments by providing the ability to access and interact with emulated environments from right within your web browser, something that we could only dream of just a few years ago.
At Yale University Library we are evaluating the software against a number of criteria including:
- In what use-cases might it be used?
- How might it fit in with digital content workflows?
- What challenges does it present?
The EaaS software framework shows great promise as a tool for use in many digital content management workflows such as appraisal/selection, preservation and access, but also presents a few unique and particularly challenging issues that we are working to overcome. The issues are mostly related to copyright and software licensing. At the bottom of this post I will discuss what these issues are and what we are doing to resolve them, but before I do that let me put this in context by discussing some real-life use-cases for EaaS that have occurred here recently.
It has taken a few months (I started in my position at the Library in September 2013) but recently people throughout the Library system have begun to forward queries to me if they involve anything digital preservation-related. Over the past month or so we have had three requests for access to digital content from the general collections that couldn’t be interacted with using contemporary software. These requests are all great candidates for resolving using EaaS but, unfortunately (as you will see) we couldn’t do that.
Interactive Artwork, Circa 1997: Use Case One
An Arts PhD student wanted to access an interactive CD-ROM-based artwork (Laurie Anderson’s “Puppet Motel”) from the general collections. The artwork can only be interacted with on old versions of the Apple Mac “classic” operating system.
Fortunately the Digital Humanities Librarian (Peter Leonard) has a collection of old technology and was willing to bring a laptop from his personal collection into the library so that the PhD student could use it to access the artwork. This was not an ideal or sustainable solution (what would have happened if Peter’s collection wasn’t available? What happens when that hardware degrades past usability?).
Since responding to this request we have managed to get the Puppet Motel running in the emulation service using the Basilisk II emulator (for research purposes).
This would be a great candidate for accessing via the emulation service. The sound and interaction aspects all work well and it is otherwise very challenging for researchers to access the content.
Hebrew Texts, Circa 2004: Use Case Two
One of the Judaica librarians needed to access data for a patron and the data was in a Windows XP CD-ROM (Trope Trainer) from the general collections. The software on the CD would not run on the current Windows 7 operating system that is installed on the desktop PCs here in the library.
The solution we came up with was to create a Windows XP virtual machine for the librarian to have on her desktop. This is a good solution for her as it enables her to print the sections she wants to print and export PDFs for printing elsewhere as needed.
We have since ingested this content into the emulation service for testing purposes. In the EaaS it can run on either VirtualBox, the virtualization software from Oracle (which doesn’t provide full emulation), or QEMU, an emulation and virtualization tool.
It is another great candidate for the service as this version of the content can no longer be accessed on contemporary operating systems and the emulated version enables users to play through the texts and hear them read just as though they were using the CD on their local machine. The ability to easily export content from the emulation service will be added in a future update and will enable this content to become even more useful.
Finance Data, Circa 1998/2003: Use Case Three
A Finance PhD student needed access to data (inter-corporate ownership data) trapped within software within a CD-ROM from the general collection. Unfortunately the software was designed for Windows 98: “As part of my current project I need to use StatCan data saved using some sort of proprietary software on a CD. Unfortunately this software seemed not to be compatible with my version of Windows.” He had been able to get the data out of the disc but couldn’t make any real sense of it without the software: “it was all just random numbers.”
We have recently been developing a collection of old hardware at the Library to support long-term preservation of digital content. Coincidentally, and fortunately, the previous day someone had donated a Windows 98 laptop. Using that laptop we were able to ascertain that the CD hadn’t degraded and the software still worked. A Windows 98 virtual machine was then created for the student to use to extract the data. Exporting the data to the host system was a challenge. The simplest solution turned out to be having the researcher email the data to himself from within the virtual machine via Gmail using an old web browser (Firefox 2.x).
We were also able to ingest the virtual machine into the emulation service where it can run on either VirtualBox or QEMU.
This is another great candidate for the emulation service. The data is clearly of value but cannot be properly accessed without using the original custom software which only runs on older versions of the Microsoft Windows operating system.
Other uses of the service
In exploring these predictable use-cases for the service, we have also discovered some less-expected scenarios in which the service offers some interesting potential applications. For example, the EaaS framework makes it trivially easy to set up custom environments for patrons. These custom environments take up little space as they are stored as a difference from a base-environment, and they have a unique identifier that can persist over time (or not, as needed). Such custom environments may be a great way for providing access to sets of restricted data that we are unable to allow patrons to download to their own computers. Being able to quickly configure a Windows 7 virtual machine with some restricted content included in it (and appropriate software for interacting with that content, e.g., an MS Outlook PST archive file with MS Outlook), and provide access to it in this restricted online context, opens entirely new workflows for our archival and special collections staff.
Why we couldn’t use bwFLA’s EaaS
In all three of the use-cases outlined above EaaS was not used as the solution for the end-user. There were two main reasons for this:
- We are only in possession of a limited number of physical operating system and application licenses for these older systems. While there is some capacity to use downgrade rights within the University’s volume licensing agreement with Microsoft, with Apple operating systems the situation is much less clear. As a result we are being conservative in our use of the service until we can resolve these issues.
- It is not always clear in the license of old software whether this use-case is allowed. Virtualization is rarely (if ever) mentioned in the license agreements. This is likely because it wasn’t very common during the period when much of the software we are dealing with was created. We are working to clarify this point with the General Counsel at Yale and will be discussing it with the software vendors.
Addressing the software licensing challenges
As things stand we are limited in our ability to provide access to EaaS due to licensing agreements (and other legal restrictions) that still apply to the content-supporting operating system and productivity software dependencies. A lot of these dependencies that are necessary for providing access to valuable historic digital content do not have a high economic value themselves. While this will likely change over time as the value of these dependencies becomes more recognized and the software more rare, it does make for a frustrating situation. To address this we are beginning to explore options with the software vendors and will be continuing to do this over the following months and years.
We are very interested in the opportunities EaaS offers for opening access to otherwise inaccessible digital assets. There are many use-cases in which emulation is the only viable approach for preserving access to this content over the long term. Because of this, anything that prevents the use of such services will ultimately lead to the loss of access to valuable and historic digital content, which will effectively mean the loss of that content. Without engagement from software vendors and licensing bodies, it may require a change in the law to ensure that this content is not lost forever.
It is our hope that the software vendors will be willing to work with us to save our valuable historic digital assets from becoming permanently inaccessible and lost to future generations. There are definitely good reasons to believe that they will, and so far, those we have contacted have been more than willing to work with us.