For many organizations that are just starting to tackle digital preservation, it can be a daunting challenge – and particularly difficult to figure out the first steps to take. Education and training may be the best starting point, creating and expanding the expertise available to handle this kind of challenge. The Digital Preservation Outreach and Education (DPOE) program here at the Library aims to do just that, by providing the materials as well as the hands-on instruction to help build the expertise needed for current and future professionals working on digital preservation.
Recently, the Library was host to a meeting of the DPOE Working Group, consisting of a core group of experts and educators in the field of digital preservation. The Working Group participants were Robin Dale (Institute of Museum and Library Services), Sam Meister (University of Montana-Missoula), Mary Molinaro (University of Kentucky), and Jacob “Jake” Nadal (Princeton University). The meeting was chaired by George Coulbourne of the Library of Congress, and Library staffers Barrie Howard and Kris Nelson also participated.
The main goal of the meeting was to update the existing DPOE Curriculum, which serves as the basis for the Program's training workshops and is subsequently used by the trainees themselves. A survey is being conducted to gather even more information, which will also help inform the curriculum (see a related blog post). The Working Group reviewed and edited all six substantive modules, which are based on terms from the OAIS Reference Model framework:
- Identify (What digital content do you have?)
- Select (What portion of your digital content will be preserved?)
- Store (What issues are there for long-term storage?)
- Protect (What steps are needed to protect your digital content?)
- Manage (What provisions are needed for long-term management?)
- Provide (What considerations are there for long-term access?)
The group also discussed adding a seventh module on implementation. Each of these existing modules contains a description, goals, concepts and resources designed to be used by current and/or aspiring digital preservation practitioners.
Mary Molinaro, Director, Research Data Center at the University of Kentucky Libraries, noted that "as we worked through the various modules it became apparent how flexible this curriculum is for a wide range of institutions. It can be adapted for small, one-person cultural heritage institutions and still be relevant for large archives and libraries."
Mary also spoke to the advantages of having a focused, group effort to work through these changes: “Digital preservation has some core principles, but it’s also a discipline subject to rapid technological change. Focusing on the curriculum together as an instructor group allowed us to emphasize those things that have not changed while at the same time enhancing the materials to reflect the current technologies and thinking.”
These curriculum modules are currently in the process of further refinement and revision, including an updated list of resources. The updated version of the curriculum will be available later this month. The Working Group also recommended some strategies for extending the curriculum to address executive audiences, and how to manage the process of updating the curriculum going forward.
In a previous blog post, the NDSA Standards and Practices Working Group announced the opening of a survey to rank issues in preserving video collections. The survey closed on August 2, 2014 and while there’s work ahead to analyze the results and develop action plans, we can share some preliminary findings.
We purposely cast a wide net in advertising the survey so that respondents represented a range of institutions, experience and collections. About 54% of the respondents who started the survey answered all the required questions.
The blog post on The Signal was the most popular means to get the word out (27%) followed by the Association of Moving Image Archivists list (13%) and the NDSA-ALL list (11%). A significant number of respondents (25%) were directed to the survey through other tools including Twitter, Facebook, PrestoCentre Newsletter and the survey bookmarks distributed at the Digital Preservation 2014 meeting.
The vast majority of respondents who identified their affiliation were from the United States; other countries represented include Germany, Austria, England, South Africa, Australia, Canada, Denmark and Chile.
The survey identified the top three stumbling blocks in preserving video as:
- Getting funding and other resources to start preserving video (18%)
- Supporting appropriate digital storage to accommodate large and complex video files (14%)
- Locating trustworthy technical guidance on video file formats including standards and best practices (11%)
Respondents report that analog/physical media is the most challenging type of video (73%) followed by born digital (42%) and digital on physical media (34%).
Clearly, this high-level data doesn't tell the whole story, and we have work ahead to analyze the results. One topic we'd like to pursue is using the source of the survey invitation to better understand the context of the communities that answered the survey. Some respondents, such as those alerted to the survey through the announcement on the AMIA list, are expected to have more experience with preserving video than respondents directed to the survey from more general sources like Facebook or Twitter.
How do the responses from more mature programs compare with emerging programs? What can we learn from those who reported certain issues as “solved” within their institution? Might these solutions be applicable to other institutions? What about the institutions reporting that analog video is more challenging than born digital video? Are their video preservation programs just starting out? Do they have much born-digital video yet?
After we better understand the data, the NDSA Standards and Practices Working Group will start to consider what actions might be useful to help lower these stumbling blocks. This may include following up with additional survey questions to define the formats and scopes of current and expected video collections. Stay tuned for a more detailed report about the survey results and next steps!
22 participants from 8 countries - the UK, Germany, Denmark, the Netherlands, Switzerland, France, Sweden and the Czech Republic - not to forget the umpteen thousand defective or otherwise interesting PDF files brought to the event.
Not only is this my first blog entry on the OPF website, it is also about my first hackathon. I guess it was Michelle's idea in the first place to organise a hackathon with the Open Planets Foundation on the PDF topic and to hold the event in our library in Hamburg. I am based in Kiel, but as we are renewing our parquet floor in Kiel at the moment, the room situation in Hamburg is much better (besides, it is Hamburg that has the big airport).
The preparation for the event was pretty intense for me. Not only did the organisation in Hamburg (food, rooms, water, coffee, dinner event) have to be done; even more intense was the preparation for the hacking itself.
I am a library and information scientist, not a programmer. Sometimes I would rather be a programmer, considering my daily best-of problems, but you should dress for the body you have, not for the body you'd like to have.
Having learned the little I know about writing code within the last 8 months, and most of it just since this July, I am still brand-new to it. As there is always a so-called "summer break" (which means that everybody else is on holiday and I actually have time to work on difficult stuff), I had some very intense Skype calls with Carl from the OPF, who enabled me to push all my work-in-progress PDF tools to GitHub. I learned about Maven and Travis, and was not quite recovered when the hackathon actually started this Monday and we all had to install a virtual Ubuntu machine to be able to try out some best-of tools like DROID, Tika and Fido and run them over our own PDF files.
We had Olaf Drümmer from the PDF Association as our keynote speaker for both days. On the first day, he gave us insights into PDF and PDF/A, and when I say insights, I really mean that: he talked about the building blocks of a PDF, the basic object types and the encoding possibilities. This was much better than trying to understand the 756 pages of the PDF 1.7 specification by myself alone in the office, with sentences like "a single object of type null, denoted by the keyword null, and having a type and value that are unequal to those of any other object".
We learned about the many different kinds of page content, the page being the most important structural unit of a PDF file, and about the fact that a PDF page can have any size you can think of, although Acrobat 7.0 officially only supports page dimensions up to 381 km. On the second day, we learned about PDF(/A) validation and what would theoretically be needed to build the perfect validator. Considering the PDF and PDF/A specifications and all the specifications quoted and referenced by them, I am under the impression that it would take some months to read them all, and so much is clear: somebody would have to read and understand them all. The complexity of the PDF format, the flexibility of the viewers and the plethora of users and users' needs will always ensure a heterogeneous PDF reality with all the strangeness and brokenness possible. As far as I remember, his guess was that about 10 years of manpower would be needed to build a perfect validator, if it could be done at all. Struck by this perfectly comprehensible estimate, it is probably not surprising that some of the participants had more questions at the end of the two days than they had at the beginning.
As PDF viewers tend to conceal problems and to display even problematic PDF files in a decent way, they are usually not much help in terms of PDF validation or ensuring long-term availability; quite the contrary.
Some errors can have a big impact on the long-term availability of PDF files, especially content that is only referred to, not embedded within the file, and might simply be lost over time. On the other hand, the "invalid page tree node" that JHOVE, for example, likes to put its finger on is not an error, but just a hint that the page tree is not balanced and a page cannot be found in the most efficient way. Even if all the pages were just saved as an array and you had to iterate through the whole array to reach a certain page, this would only slow down loading; it would not prevent anybody from accessing the page they want to read, especially if the affected PDF document only has a couple of dozen pages.
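The array thought experiment can be sketched in a few lines of Java. This is a hypothetical illustration, not real PDF parsing code: the point is that a degenerate, unbalanced "page tree" only changes how many steps a lookup takes, not whether the page can be found.

```java
import java.util.ArrayList;
import java.util.List;

public class PageLookup {
    // A degenerate "page tree": just a flat list of page labels.
    // Linear scan stands in for walking an unbalanced tree:
    // every page is still reachable, it just takes O(n) steps
    // instead of the O(log n) a balanced tree would allow.
    static String findPage(List<String> pages, int index) {
        for (int i = 0; i < pages.size(); i++) {
            if (i == index) {
                return pages.get(i);
            }
        }
        return null; // index out of range
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        for (int i = 1; i <= 48; i++) {
            pages.add("Page " + i);
        }
        // Even in the worst case the requested page is found,
        // just after scanning past every page before it.
        System.out.println(findPage(pages, 47));
    }
}
```

For a document of a couple of dozen pages, the difference between the two lookups is not something a reader would ever notice.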
During the afternoon of the first day, we collected specific problems everybody has and formed working groups, each engaging with a different problem. One working group (around Olaf) started to go through JHOVE error messages, trying to figure out which ones really bear a risk, and what they mean in the first place, anyway. Some of the error messages definitely describe real, existing errors where a rule of the specification is violated, but ones that will practically never cause any problems displaying the file. Is this really an error then? Or just bureaucracy? Should a good validator even report this as an error, which formally would be the right thing to do, or not disturb the user unnecessarily?
Another group wanted to create a small Java tool with CSV output that looks into a PDF file and reports which software created it and which validation errors it contains, starting with PDFBox, as this was easy to implement in Java. We got as far as getting the tool working, but as we brought especially broken PDF files to the event, it is not yet able to cope with all of them; we still have to make it more robust.
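For illustration, the CSV-reporting side of such a tool might look roughly like the sketch below. The class and method names are made up for this example, and the per-file values (creating software, validation error codes) are passed in as plain strings so the sketch stays self-contained; in the real tool they would come from a PDF library such as Apache PDFBox.

```java
public class PdfReportCsv {
    // Quote a field if it contains a comma, quote, or newline,
    // following the usual CSV convention (RFC 4180).
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // One CSV row per PDF file: file name, creating software,
    // and a semicolon-separated list of validation error codes.
    static String row(String fileName, String producer, String... errorCodes) {
        StringBuilder sb = new StringBuilder();
        sb.append(escape(fileName)).append(',').append(escape(producer));
        sb.append(',').append(escape(String.join(";", errorCodes)));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("file,producer,errors");
        // Hypothetical values, just to show the output format.
        System.out.println(row("thesis.pdf", "LaTeX with hyperref", "1.2.1", "3.1.5"));
    }
}
```

Keeping the CSV writing separate from the parsing also makes it easier to swap in other validators later and compare their findings file by file.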
By the way, it is really nice to be surrounded by people who obviously live in the same nerdy world as I do. When I told them I could not wait to see our new tool's output and was eager to analyse the findings, the answer was just "And neither can I". Usually, I just get frowning faces and "I do not get why you are interested in something so boring" looks.
A third working group went to another room and tested the existing tools on the PDF samples people had brought along, in the virtual Ubuntu environment.
There were more ideas; some of them seemed too difficult or too ambitious for a solution to be created in such a short time, but some of us are determined to have a follow-up event soon.
For example, Olaf mentioned that sometimes text extraction from a PDF file does not work, and the participant sitting next to me suggested that we could start to check the extracted output against dictionaries to see if it still makes sense. "But there are so many languages," I told him, thinking about my library's content. "Well, start with one," he answered, following the idea that a big problem can often be split into several small ones.
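As a rough sketch of that suggestion, one could score extracted text by the fraction of tokens found in a word list. The tiny hard-coded word list here is a stand-in for a real dictionary of whichever language you start with.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExtractionCheck {
    // Fraction of tokens in the extracted text that appear in the
    // dictionary. A low ratio suggests the extraction produced
    // garbage, e.g. because of a broken font encoding.
    static double knownWordRatio(String extractedText, Set<String> dictionary) {
        String[] tokens = extractedText.toLowerCase().split("[^\\p{L}]+");
        int total = 0, known = 0;
        for (String t : tokens) {
            if (t.isEmpty()) continue;
            total++;
            if (dictionary.contains(t)) known++;
        }
        return total == 0 ? 0.0 : (double) known / total;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "the", "text", "extraction", "seems", "to", "work"));
        // High ratio: extraction probably produced real words.
        System.out.println(knownWordRatio("The text extraction seems to work", dict));
        // Low ratio: likely garbage, worth flagging for review.
        System.out.println(knownWordRatio("Th3 t#xt extr4ct10n", dict));
    }
}
```

In practice one would pick a threshold below which a file gets flagged for manual inspection, rather than treating any single unknown word as a failure.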
Another participant would like to know more about the quality and compression of the JPEGs embedded within his PDF files, but others doubted that this information could still be retrieved.
When the event was over on Tuesday around 5 pm, we were all tired but happy, with clear ideas or interesting new problems in our heads.
And just because I was asked this today, as I might still look slightly tired: we did sleep during the night. We did not hack all the way through or sleep on mattresses in our library. Some of us had quite a few pitchers full of beer during the evening, but I am quite sure everybody made it to his or her hotel room.
Twitter hashtag: #OPDF
Preservation Topics: Open Planets Foundation