Planet DigiPres

The Library of Congress Wants You (and Your File Format Ideas)!

The Signal: Digital Preservation - 3 October 2014 - 4:24pm

“Uncle Sam Needs You” painted by James Montgomery Flagg

In June of this year, the Library of Congress announced a list of formats it would prefer for digital collections. This list of recommended formats is an ongoing work; the Library will be reviewing the list and making revisions for an updated version in June 2015. Though the team behind this work continues to put a great deal of thought and research into listing the formats, there is still one more important component needed for the project: the Library of Congress needs suggestions from you.

This request is not half-hearted. As the Library increasingly relies on the list to identify preferred formats for acquisition of digital collections, no doubt other institutions will adopt the same list. It is important, therefore, that as the Library undertakes this revision of the recommended formats, it conducts a public dialog about them in order to reach an informed consensus.

This public dialog includes librarians, library students, teachers, vendors, publishers, information technologists — anyone and everyone with an opinion on the matter and a stake in preserving digital files. Collaboration is essential for digital preservation. No single institution can know everything and do everything alone. This is a shared challenge.

Librarians, what formats would you prefer to receive your digital collections in? What file formats are easiest for you to process and access? Publishers and vendors, what format do you think you should create your digital publications in if you want your stuff to last and be accessible into the future? The time may come when you want to re-monetize a digital publication, so you want to ensure that it is accessible.

Those are general questions, of course. Let’s look at the specific file formats the Library has selected so far. The preferred formats are categorized by:

  • Textual Works and Musical Compositions
  • Still Image Works
  • Audio Works
  • Moving Image Works
  • Software and Electronic Gaming and Learning
  • Datasets/Databases

Take, for example, digital photographs. Here is the list of formats the Library would most prefer to receive for digital preservation:

  • TIFF (uncompressed)
  • JPEG2000 (lossless) (*.jp2)
  • PNG (*.png)
  • JPEG/JFIF (*.jpg)
  • Digital Negative DNG (*.dng)
  • JPEG2000 (lossy) (*.jp2)
  • TIFF (compressed)
  • BMP (*.bmp)
  • GIF (*.gif)

Is there anything you think should be changed in that list? If so, why? Or anything added to this list? There’s a section on metadata on that page. Does it say enough? Or too little? Is it clear enough? Should the Library add some description about adding photo metadata into the photo files themselves?

Please look over the file categories that interest you and tell us what you think. Help us shape a policy that will affect future digital collections, large and small. Be as specific as you can.

Email your questions and comments to the digital preservation experts below. Your emails will be confidential; they will not be published on this blog post. So don’t be shy. We welcome all questions and comments, great and small.

Send general email about preferred formats to Theron Westervelt (thwe at loc.gov). Send email about specific categories to:

  • Ardie Bausenbach (abau at loc.gov) for Textual Works and Musical Compositions
  • Phil Michel (pmic at loc.gov) for Still Image Works
  • Gene DeAnna (edea at loc.gov) for Audio Works
  • Mike Mashon (mima at loc.gov) for Moving Image Works
  • Trevor Owens (trow at loc.gov) for Software and Electronic Gaming and Learning
  • Donna Scanlon (dscanlon at loc.gov) for Datasets/Databases

They are all very nice people who are up to their eyeballs in digital-preservation work and would appreciate hearing your fresh perspective on the subject.

One last thing. The recommended formats are just that: recommended. They are not a fixed set of standards. And the Library of Congress will not reject any digital collection of value simply because the file formats in the collection might not conform to the recommended formats.

Categories: Planet DigiPres

Residency Program Success Stories, Part One

The Signal: Digital Preservation - 2 October 2014 - 1:34pm

The following is a guest post by Julio Díaz Laabes, HACU intern and Program Management Assistant at the Library of Congress.

Coming on the heels of a successful start for the Boston and New York cohorts, the National Digital Stewardship Residency Program is becoming a model for digital stewardship residencies on a national scale. This residency program, funded by the Institute of Museum and Library Services, offers recent master’s and doctoral program graduates in specialized fields (library science, information science, museum studies, archival studies and others) the opportunity to gain professional experience in the field of digital preservation.

Clockwise from top left: Lee Nilsson, Maureen McCormick Harlow, Erica Titkemeyer and Heidi Elaine Dowding.

The inaugural year of the NDSR program was completed in May of 2014. During this year, ten residents were placed in various organizations in the Washington, DC area. Since completing the program, all ten residents are now working in positions related to the field of digital preservation! Here are some accounts of how the program has impacted the residents’ lives and where they are now in their careers.

Lee Nilsson is employed in a contract position as a junior analyst at the Department of State, Bureau of International Information Programs. Specifically, he is working in the analytics office on foreign audience research. On how the residency helped him, Lee said, “The residency got me to D.C. and introduced me to some great people. Without NDSR I would not have made it this far.” Furthermore, Lee commented that the most interesting aspect of his job is “the opportunity to work with some very talented people on some truly global campaigns.”

Following the residency, Maureen McCormick Harlow accepted a permanent position as the new Digital Librarian at PBS (Public Broadcasting Service). She works in the Media Library and her tasks include consulting on the development of the metadata schema for an enterprise-wide digital asset management system, fulfilling archival requests for legacy materials and working with copyright holders to facilitate the next phase of a digitization project (which builds on the NDSR project of Lauren Work). Maureen said that NDSR helped her foster and cultivate a network of digital preservationists and practitioners in the DC area over the nine months she participated in the program. An interesting aspect of her job is working with the history of PBS and learning about PBS programming to see how it has changed over the years.

On an international scale, Heidi Elaine Dowding is currently in a three-year PhD Research Fellow position at the Royal Netherlands Academy of Arts and Sciences’ Huygens ING Institute. This position is funded through the European Commission. “My research involves the long-term publication and dissemination of digital scholarly editions, so aspects of digital preservation will be key,” said Heidi. On the best part of her position, Heidi said, “I am lucky enough to be fully funded, which allows me to focus on my studies. This gives me the opportunity to research things that I am interested in every day.”

Erica Titkemeyer is currently employed at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. This position was created as part of a grant-funded initiative to research and analyze workflows for the mass reformatting and preservation of legacy audiovisual materials. “NDSR allotted me the opportunity to participate in research and projects related to the implementation of digital preservation standards. It provided me access to a number of networking events and meetings related to digital stewardship.” In her position, she hopes to help see improved access to the collections, while also having the opportunity to learn more about the rich cultural content they contain.

Given these success stories, the National Digital Stewardship Residency has proven to be an invaluable program, providing opportunities for real-world practical experience in the field of digital preservation. Also, the diversity of host institutions and locations across major U.S. cities gives residents the opportunity to build up an extensive network of colleagues, practitioners and potential employers in diverse fields. Stay tuned for part two of this blog post, which will showcase the remaining residents of the 2013-2014 Washington, D.C. cohort.

Categories: Planet DigiPres

Announcing the Release of the 2015 National Agenda For Digital Stewardship

The Signal: Digital Preservation - 1 October 2014 - 8:38pm

The National Digital Stewardship Alliance is pleased to announce the release today of the “2015 National Agenda for Digital Stewardship.”  The Agenda provides funders, decision‐makers and practitioners with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for research and development to support the work needed to ensure that today’s valuable digital content remains accessible, useful, and comprehensible in the future, supporting a thriving economy, a robust democracy and a rich cultural heritage.

The 2015 National Agenda is the result of many months of individual effort and dedicated institutional support from across the NDSA community and it integrates the perspective of leading government, academic, nonprofit and private-sector organizations with digital stewardship responsibilities.

This year’s Agenda builds on the foundations of the 2014 Agenda (PDF) and outlines the challenges and opportunities related to digital preservation activities in four broad areas: Key Issues in Digital Collection Building; Organizational Policies and Practices; Technical Infrastructure Development; and Research Priorities. Each section articulates priorities and then offers a set of actionable recommendations to address the challenges.

A theme running through the Agenda is that while there is more content being created than ever, there’s also increasing recognition by businesses, research institutions, policymakers and funders that legacy digital content contributes to positive job creation and international competitive advantage. At the same time, digital stewardship processes are reaching a critical mass of maturity and uptake, and more work is being done to steward digital content than ever before.

The Agenda addresses both of these trends and attempts to make sense of the changing landscape and articulate the priority actions that will have the most impact on community and practice.

Key Issues in Building Digital Content Collections

Much of the investment and effort in the field of digital preservation has been focused on developing technical infrastructure, networks of partnerships, education and training, and establishing standards and practices. Little has been invested in understanding how the stewardship community will coordinate the acquisition and management of born‐digital materials in a systematic and public way.

A key issue in building digital content collections is that a gap is starting to emerge between the types of materials that are being created and used in our society and the types of materials that make their way into libraries and archives. The stewardship community must recognize this gap and explore ways to address it. Other core digital content recommendations include:

  • Build the evidence base for evaluating at‐risk, large‐scale digital content for acquisition. Develop contextual knowledge about born‐digital content areas that characterizes the risks and efforts to ensure durable access to them.
  • Understand the technical implications of acquiring large‐scale digital content. Extend systematic surveys and environmental scans of organizational capacity and preservation storage practices to help guide selection decisions.
  • Share information about what content is being collected and what level of access is provided. Communicate and coordinate collection priority statements at national, regional and institutional levels.
  • Support partnerships, donations and agreements with creators, owners and stewards of digital content. Connect with digital content creation communities across commercial, nonprofit, private and public sectors to leverage their incentives to preserve.

Organizational Policies and Practices

The digital preservation community is struggling with ways to advocate for resources and adequate staffing while articulating the shared responsibility for stewardship. The Agenda identifies efforts in the area of organizational roles and policies for digital stewardship that focus on actions that support the development of an environment where the mandate and need for digital preservation are matched with the resources, staffing and professional community prepared to meet those mandates and needs. These include:

  • Advocate for resources. Share strategies and develop unified messages to advocate for funding and resources; share cost information and models; and develop tools and strategies that inform the evaluation and management of digital collection value and usage.
  • Enhance staffing and training. Explore and expand models of support that provide interdisciplinary and practical experiences for emerging professionals and apply those models to programs for established professionals. Evaluate and articulate both the broad mix of roles and the specialized set of skills in which digital stewardship professionals are involved.
  • Foster multi‐institutional collaboration. Foster collaboration through open source software development; information sharing on staffing and resources; coordination on content selection and engagement with the development of standards and practices; and identify, understand and connect with stakeholders outside of the cultural heritage sector.

Technical Infrastructure Development

The 2015 Agenda continues a focus on technical infrastructure development, defined as “the set of interconnected technical elements that provide a framework for supporting an entire structure of design, development, deployment and documentation in service of applications, systems and tools for digital preservation,” including hardware, software and systems. The key technical infrastructure recommendations include:

  • Coordinate and sustain an ecosystem of shared services. Better identify and implement processes to maintain key software platforms, tools and services; identify technologies which integrate well to form a sustainable digital workflow; and identify better models to support long‐term sustainability for common goods.
  • Foster best practice development. Give priority to the development of standards and best practices, especially in the areas of format migration and long‐term data integrity.

Research Priorities

Finally, the Agenda recognizes that research is critical to the advancement of both basic understanding and the effective practice of digital preservation. Generally speaking, research in digital preservation is under‐resourced, in part because the payoff from long-term preservation arrives in the distant future and is shared across multiple communities. Still, investments in core research will yield large impacts. Core research recommendations include:

  • Build the evidence base for digital preservation. Give priority to programs that systematically contribute to the overall cumulative evidence base for digital preservation practice and resulting outcomes, including supporting test beds for systematic comparison of preservation practices.
  • Better integrate research and practice. Give priority to programs that rigorously integrate research and practice or that increase the scalability of digital stewardship.

The 2015 Agenda is designed as a catalyst to action for legislators, funders, decision-makers and practitioners. The NDSA will support the release of the Agenda with outreach and education events across the country over the course of the next year, while diving deeper into questions posed by the Agenda with research papers to address particular issues, such as file fixity.

Download the full report and the executive summary at http://www.digitalpreservation.gov/ndsa/nationalagenda/index.html.

Founded in 2010, the NDSA is a consortium of over 150 member institutions committed to the long-term preservation and stewardship of digital information. NDSA member institutions come from all sectors, and include universities, consortia, professional societies, commercial businesses, professional associations, and government agencies at the federal, state, and local level. Further information about the NDSA can be found at http://NDSA.org.

Categories: Planet DigiPres

QCTools: Open Source Toolset to Bring Quality Control for Video within Reach

The Signal: Digital Preservation - 30 September 2014 - 12:01pm

In this interview, part of the Insights Interview series, FADGI talks with Dave Rice and Devon Landes about the QCTools project.

In a previous blog post, I interviewed Hannah Frost and Jenny Brice about the AV Artifact Atlas, one of the components of Quality Control Tools for Video Preservation, an NEH-funded project which seeks to design and make available community-oriented products to reduce the time and effort it takes to perform high-quality video preservation. The less “eyes on” time routine QC work requires, the more time can be redirected toward quality control and assessment of the digitized content most deserving of attention.


QCTools’ Devon Landes

In this blog post, I interview archivists and software developers Dave Rice and Devon Landes about the latest release version of the QCTools, an open source software toolset to facilitate accurate and efficient assessment of media integrity throughout the archival digitization process.

Kate:  How did the QCTools project come about?

Devon:  There was a recognized need for accessible & affordable tools out there to help archivists, curators, preservationists, etc. in this space. As you mention above, manual quality control work is extremely labor and resource intensive but a necessary part of the preservation process. While there are tools out there, they tend to be geared toward (and priced for) the broadcast television industry, making them out of reach for most non-profit organizations. Additionally, quality control work requires a certain skill set and expertise. Our aim was twofold: to build a tool that was free/open source, but also one that could be used by specialists and non-specialists alike.


QCTools’ Dave Rice

Dave:  Over the last few years a lot of building blocks for this project were falling into place. Bay Area Video Coalition had been researching and gathering samples of digitization issues through the A/V Artifact Atlas project and meanwhile FFmpeg had made substantial developments in their audiovisual filtering library. Additionally, open source technology for archival and preservation applications has been finding more development, application, and funding. Lastly, the urgency related to the obsolescence issues surrounding analog video and lower costs for digital video management meant that more organizations were starting their own preservation projects for analog video and creating a greater need for an open source response to quality control issues. In 2013, the National Endowment for the Humanities awarded BAVC with a Preservation and Access Research and Development grant to develop QCTools.

Kate: Tell us what’s new in this release. Are you pretty much sticking to the plan or have you made adjustments based on user feedback that you didn’t foresee? How has the pilot testing influenced the products?


QCTools provides many playback filters. Here the left window shows a frame with the two fields presented separately (revealing the lack of chroma data in field 2). The right window here shows the V plane of the video per field to show what data the deck is providing.

Devon:  The users’ perspective is really important to us and being responsive to their feedback is something we’ve tried to prioritize. We’ve had several user-focused training sessions and workshops which have helped guide and inform our development process. Certain processing filters were added or removed in response to user feedback; obviously UI and navigability issues were informed by our testers. We’ve also established a GitHub issue tracker to capture user feedback which has been pretty active since the latest release and has been really illuminating in terms of what people are finding useful or problematic, etc.

The newest release has quite a few optimizations to improve speed and responsiveness, some additional playback & viewing options, better documentation and support for the creation of an XML-format report.

Dave:  The most substantial example of going ‘off plan’ was the incorporation of video playback. Initially the grant application focused on QCTools as a purely analytical tool which would assess and present quantifications of video metrics via graphs and data visualization. Initial work delved deeply into identifying a methodology for picking out the right metrics to find what could be unnatural in digitized analog video (such as pixels too dissimilar from their temporal neighbors, or the near-exact repetition of pixel rows, or discrepancies in the rate of change over time between the two video fields). When presenting the earliest prototypes of QCTools to users, a recurring question was “How can I see the video?” We redesigned the project so that QCTools would present the video alongside the metrics, along with various scopes, meters and visual tools, so that now it has both a visual and an analytic side.

Kate:   I love that the Project Scope for QCTools quotes both the Library of Congress’s Sustainability of Digital Formats and the Federal Agencies Digitization Guidelines Initiative as influential resources which encourage best practices and standards in audiovisual digitization of analog material for users. I might be more than a little biased but I agree completely. Tell me about some of the other resources and communities that you and the rest of the project team are looking at.


Here the QCTools vectorscope shows a burst of illegal color values. With the QCTools display of plotted graphs this corresponds to a spike in the maximum saturation (SATMAX).

Devon: Bay Area Video Coalition connected us with a group of testers from various backgrounds and professional environments so we’ve been able to tap into a pretty varied community in that sense. Their A/V Artifact Atlas has also been an important resource for us and was really the starting point from which QCTools was born.

Dave:  This project would not at all be feasible without the existing work of FFmpeg. QCTools utilizes FFmpeg for all decoding, playback, metadata expression and visual analytics. The QCTools data format is an expression of FFmpeg’s ffprobe schema, which appeared to be one of the only audiovisual file format standards that could efficiently store masses of frame-based metadata.
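
As a rough illustration of the kind of frame-based metadata involved, FFmpeg’s signalstats filter can be run through ffprobe to dump per-frame metrics (SATMAX among them); a minimal sketch, assuming an FFmpeg build that includes signalstats and using a placeholder file name (QCTools’ own report pipeline may differ):

# run the signalstats filter over a (placeholder) video file and write per-frame metrics as XML
ffprobe -f lavfi "movie=mytape.mov,signalstats" -show_frames -of xml > mytape_stats.xml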

Kate:   What are the plans for training and documentation on how to use the product(s)?

Devon:  We want the documentation to speak to a wide range of backgrounds and expertise, but it is a challenge to do that and as such it is an ongoing process. We had a really helpful session during one of our tester retreats where users directly and collaboratively made comments and suggestions to the documentation; because of the breadth of their experience it really helped to illuminate gaps and areas for improvement on our end. We hope to continue that kind of engagement with users and also offer them a place to interact more directly with each other via a discussion page or wiki. We’ve also talked about the possibility of recording some training videos and hope to better incorporate the A/V Artifact Atlas as a source of reference in the next release.

Kate:   What’s next for QCTools?

Dave:   We’re presenting the next release of QCTools at the Association of Moving Image Archivists Annual Meeting on October 9th, for which we anticipate supporting better summarization of digitization issues per file in a comparative manner. After AMIA, we’ll focus on audio and the incorporation of audio metrics via FFmpeg’s EBUr128 filter. QCTools has been integrated into workflows at BAVC, Dance Heritage Coalition, MoMA, Anthology Film Archives and Die Österreichische Mediathek, so the QCTools issue tracker has been filling up with suggestions which we’ll be tackling in the upcoming months.

Categories: Planet DigiPres

Scape Demonstration: Migration of audio using xcorrSound

Open Planets Foundation Blogs - 30 September 2014 - 10:32am

As part of the SCAPE project, we did a large-scale experiment and evaluation of audio migration using the xcorrSound tool waveform-compare for content comparison in the quality assurance step.

I presented the results at the demonstration day at the State and University Library; see the SCAPE Demo Day at Statsbiblioteket blog post by Jette G. Junge.

And now I present the screencast of this demonstration:

 SCAPE demonstration of audio migration using xcorrSound in QA

The brief summary is:

  • New tool: using xcorrSound waveform-compare, we can automate audio file content comparison for quality assurance (see the sketch just after this list)
  • Scalability: using Hadoop we can migrate our 20TB radio broadcast mp3 collection to the wav file format in a month (on the current SB Hadoop cluster set-up) rather than in years :)
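
A minimal sketch of one migration-plus-QA cycle at the command line, assuming ffmpeg for the mp3-to-wav migration, mpg123 as one possible independent decoder for the reference copy, and that waveform-compare takes the two wav files as positional arguments (file names are placeholders; the actual workflow wraps steps like these in Hadoop jobs):

# migrate: decode the broadcast mp3 to wav
ffmpeg -i broadcast.mp3 broadcast_migrated.wav
# QA: decode the original with an independent decoder...
mpg123 -w broadcast_reference.wav broadcast.mp3
# ...and let waveform-compare check that both waveforms carry the same content
waveform-compare broadcast_migrated.wav broadcast_reference.wav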

And just a few notes:

  • the large-scale experiment did not include property extraction and comparison, but we are confident (based on an earlier experiment) that we can do this effectively using FFprobe
  • the large-scale experiment also did not include file format validation. We made an early decision not to use JHOVE2 based on performance. The open question is whether we are satisfied with the "pseudo validation" that the FFprobe property extraction and the xcorrSound waveform-compare cross-correlation algorithm were both able to read the file...

Oh, and the slides are also on Slideshare: Migration of audio files using Hadoop.

 

Preservation Topics: SCAPE
Categories: Planet DigiPres

Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014

The Signal: Digital Preservation - 29 September 2014 - 5:39pm

The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress.

The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT professionals, digital collections and strategic planning staff and digital preservation practitioners to discuss the challenges of digital storage and to help inform decision-making in the future. Participants come from a variety of government agencies, cultural heritage institutions and academic and research organizations.


The DSA Meeting. Photo credit: Peter Krogh/DAM Useful Publishing.

Throughout the two days of the meeting the speakers took the participants back in time and then forward again. The meeting kicked-off with a review of the origins of the DSA meeting. It started ten years ago with a gathering of Library of Congress and external experts who discussed requirements for digital storage architectures for the Library’s Packard Campus of the National Audio-Visual Conservation Center. Now, ten years later, the speakers included representatives from Facebook and Amazon Web Services, both of which manage significant amounts of content and neither of which existed in 2004 when the DSA meeting started.

The theme of time passing continued with presentations by strategic technical experts from the storage industry who began with an overview of the capacity and cost trends in storage media over the past years. Two of the storage media being tracked weren’t on anyone’s radar in 2004, but loom large for the future – flash memory and Blu-ray disks. Moving from the past quickly to the future, the experts then offered predictions, with the caveat that predictions beyond a few years are predictably unpredictable in the storage world.

Another facet of time – “back to the future” – came up in a series of discussions on the emergence of object storage in up-and-coming hardware and software products.  With object storage, hardware and software can deal with data objects (like files), rather than physical blocks of data.  This is a concept familiar to those in the digital curation world, and it turns out that it was also familiar to long-time experts in the computer architecture world, because the original design for this was done ten years ago. Several of the key meeting presentations addressed object storage; copies of the presentations are available on the meeting page linked at the end of this post.

Several speakers talked about the impact of the passage of time on existing digital storage collections in their institutions and the need to perform migrations of content from one set of hardware or software to another as time passes.  The lessons of this were made particularly vivid by one speaker’s analogy, which compared the process to the travails of someone trying to manage the physical contents of a car over one’s lifetime.

Even more vivid was the “Cost of Inaction” calculator, which provides black-and-white evidence of the costs of not preserving analog media over time, starting from the undeniable fact that there is an actual “doomsday” date in the future when all your analog media will be unreadable.


The DSA Meeting. Photo Credit: Trevor Owens

Several persistent time-related themes engaged the participants in lively interactive discussions during the meeting.  One topic was the practical methods for checking the data integrity of content in digital collections.  This concept, called fixity, has been a common topic of interest in the digital preservation community. Similarly, a thread of discussion on predicting and dealing with failure and data loss over time touched on a number of interesting concepts, including “anti-entropy,” a type of computer “gossip” protocol designed to query, detect and correct damaged distributed digital files. Participants agreed it would be useful to find a practical approach to identifying and quantifying types of failures.  Are the failures relatively regular but small enough that the content can be reconstructed? Or are the data failures highly irregular but catastrophic in nature?

Another common theme that arose is how to test and predict the lifetime of storage media.  For example, how would one test the lifetime of media projected to last 1000 years without having a time-travel machine available?  Participants agreed to continue the discussions of these themes over the next year with the goal of developing practical requirements for communication with storage and service providers.

The meeting closed with presentations from vendors working on the cutting edge of new archival media technologies.  One speaker dealt with questions about the lifetime of media by serenading the group with accompaniment from a 32-year-old audio CD copy of Pink Floyd’s “Dark Side of the Moon.” The song “Us and Them” underscored how the DSA meeting strives to bridge the boundaries placed between IT conceptions of storage systems and architectures and the practices, perspectives and values of storage and preservation in the cultural heritage sector. The song playing back from three decade old media on a contemporary device was a fitting symbol of the objectives of the meeting.

Background reading (PDF) was circulated prior to the meeting and the meeting agenda and copies of the presentations are available at http://www.digitalpreservation.gov/meetings/storage14.html.

Categories: Planet DigiPres

Siegfried - a PRONOM-based, file format identification tool

Open Planets Foundation Blogs - 27 September 2014 - 7:52am

Ok. I know what you're thinking. Do we really need another PRONOM-based, file format identification tool?

A year or so ago I might have said "no" myself. In DROID and FIDO, we are already blessed with two brilliant tools. In my workplace, we're very happy users of DROID. We trust it as the reference implementation of PRONOM, it is fast, and it has a rich GUI with useful filtering and reporting options. I know that FIDO has many satisfied users too: it is also fast, great for use at the command line, and, as a Python program, is easy to integrate with digital preservation workflows (such as Archivematica). The reason I wrote Siegfried wasn't to displace either of these tools, it was simply to scratch an itch: when I read the blog posts announcing FIDO a few years ago, I was intrigued at the different matching strategies used (FIDO's regular expressions and DROID's Boyer-Moore-Horspool string searching) and wondered what other approaches might be possible. I started Siegfried simply as a hobby project to explore whether a multiple-string search algorithm, Aho-Corasick, could perform well at matching signatures.

Having dived down the file format identification rabbit hole, my feeling now is that, the more PRONOM-based, file format identification tools we have, the better. Multiple implementations of PRONOM make PRONOM itself stronger. For one thing, having different algorithms implement the same signatures is a great way of validating those signatures. Siegfried is tested using Ross Spencer's skeleton suite (a fantastic resource that made developing Siegfried much, much easier). During development of Siegfried, Ross and I were in touch about a number of issues thrown up during that testing, and these issues led to a small number of updates to PRONOM. I imagine the same thing happened for FIDO. Secondly, although many institutions use PRONOM, we all have different needs, and different tools suit different use cases differently. For example, for a really large set of records, with performance the key consideration, your best bet would probably be Nanite (a Hadoop implementation of DROID). For smaller projects, you might favour DROID for its GUI or FIDO for its Archivematica integration. I hope that Siegfried might find a niche too, and it has a few interesting features that I think commend it.

Simple command-line interface

I've tried to design Siegfried to be the least intimidating command-line tool possible. You run it with:

sf FILE
sf DIR

There are only two other commands, `-version` and `-update` (to update your signatures). There aren't any options: directory recursion is automatic, there is no default size limit on search buffers, and output is YAML only. Why YAML? It is a structured format, so you can do interesting things with it, and it has a clean syntax that doesn't look horrible in a terminal.
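
A quick usage sketch (assuming the YAML report is written to standard output, so it can simply be redirected; the directory path is a placeholder):

# refresh the signature file, then identify everything under a directory (recursion is automatic)
sf -update
sf /path/to/accession > report.yaml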

YAML Output

Good performance, without buffer limits

I'm one of those DROID users that always sets the buffer size to -1, just in case I miss any matches. The trade-off is that this can make matching a bit slower. I understand the use of buffer limits (options to limit the bytes scanned in a file) in DROID and FIDO - the great majority of signatures are found close to the beginning or end of the file and IO has a big impact on performance - but you need to be careful with them. Buffer limits can confuse users ("I can see a PRONOM signature for PDF/A, why isn't it matching?"). The use of buffer limits also needs to be documented if you want to accurately record how puids were assigned. This is because you are effectively changing the PRONOM signatures by overriding any variable offsets. In other words, you can't just say, "matched 'fmt/111' with DROID signatures v 77", but now need to say, "matched 'fmt/111' with DROID signatures v 77 and with a maximum BOF offset of 32000 and EOF offset of 16000".

Siegfried is designed so that it doesn't need buffer limits for good performance. Instead, Siegfried searches as much, or as little, of a file as it needs to in order to satisfy itself that it has obtained the best possible match. Because Siegfried matches signatures concurrently, it can apply PRONOM's priority rules during the matching process, rather than at the end. The downside of this approach is that while average performance is good, there is variability: Siegfried slows down for files (like PDFs) where it can't be sure what the best match is until much, or all, of the file has been read.

Detailed basis information

As well as telling you what it matched, Siegfried will also report why it matched. Where byte signatures are defined, this "basis" information includes the offset and length of byte matches. While many digital archivists won't need this level of justification, this information can be useful. It can be a great debugging tool if you are creating new signatures and want to test how they are matching. It might also be useful for going back and re-testing files after PRONOM signature updates: if signatures change and you have an enormous quantity of files that need to have their puids re-validated, then you could use this offset information to just test the relevant parts of files. Finally, by aggregating this information over time, it may also be possible to use it to refine PRONOM signatures: for example, are all PDF/A's matching within a set distance from the EOF? Could that variable offset be changed to a fixed one?

Where can I get my hands on it?

You can download Siegfried here. You can also try Siegfried, without downloading it, by dragging files onto the picture of Wagner's Siegfried on that page. The source is hosted on Github if you prefer to compile it yourself (you just need Go installed). Please report any bugs or feature requests there. It is still in beta (v 0.5.0) and probably won't get a version one release until early next year. I wouldn't recommend using it as your only form of file format identification until then (unless you are brave!). But please try it and send feedback.

Finally, I'd like to say thanks very much to the TNA for PRONOM and DROID and to Ross Spencer for his skeleton suite(s).

Preservation Topics: Identification
Categories: Planet DigiPres

Library to Launch 2015 Class of NDSR

The Signal: Digital Preservation - 26 September 2014 - 7:05pm

Last year’s class of Residents, along with LC staff, at the ALA Mid-winter conference

The Library of Congress Office of Strategic Initiatives, in partnership with the Institute of Museum and Library Services, has recently announced the 2015 National Digital Stewardship Residency program, which will be held in the Washington, DC area starting in June 2015.

As you may know (NDSR was well represented on the blog last year), this program is designed for recent graduates with an advanced degree who are interested in the field of digital stewardship.  This will be the fourth class of residents for this program overall – the first, in 2013, was held in Washington, DC, and the second and third classes, starting in September 2014, are being held concurrently in New York and Boston.

The five 2015 residents will each be paired with an affiliated host institution for a 12-month program that will provide them with an opportunity to develop, apply and advance their digital stewardship knowledge and skills in real-world settings. The participating hosts and projects for the 2015 cohort will be announced in early December and the applications will be available shortly after.  News and updates will be posted to the NDSR webpage, and here on The Signal.

In addition to providing great career benefits for the residents, the successful NDSR program also provides benefits to the institutions involved as well as the library and archives field in general. For an example of what the residents have accomplished in the past, see this previous blog post about a symposium held last spring, organized entirely by last year’s residents.

Another recent success for the program – all of the former residents now have substantive jobs or fellowships in a related field.  Erica Titkemeyer, a former resident who worked at the Smithsonian Institution Archives, now has a position at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. Erica said the Residency provided the opportunity to utilize skills gained through her graduate education and put them to practical use in an on-the-job setting.  In this case, she was involved in research and planning for preservation of time-based media art at the Smithsonian.

Erica notes some other associated benefits. “I had a number of chances to network within the D.C. area through the Library of Congress, external digital heritage groups and professional conferences as well,” she said. “I have to say, I am most extremely grateful for having had a supportive group of fellow residents. The cohort was, and still remains, a valuable resource for knowledge and guidance.”

This residency experience no doubt helped Erica land her new job, one that includes a lot of responsibility for digital library projects. “Currently we are researching options and planning for mass-digitization of the collection, which contains thousands of recordings on legacy formats pertaining to the music and culture of the American South,” she said.

George Coulbourne, Executive Program Officer at the Library of Congress, remarked on the early success of this program: “We are excited with the success of our first class of residents, and look forward to continuing this success with our upcoming program in Washington, DC. The experience gained by the residents along with the tangible benefits for the host institution will help set the stage for a national residency model in digital preservation that can be replicated in various public and private sector environments.”

So, this is a heads-up to graduate students and all interested institutions – start thinking about how you might want to participate in the 2015 NDSR.  Keep checking our website and blog for updated information, applications, dates, etc. We will post this information as it becomes available.

(See the Library’s official press release.)

Categories: Planet DigiPres

In defence of migration

Open Planets Foundation Blogs - 26 September 2014 - 3:38pm

There is a trend in digital preservation circles to question the need for migration.  The argument varies a little from proponent to proponent but in essence, it states that software exists (and will continue to exist) that will read (and perform requisite functions, e.g., render) old formats.  Hence, proponents conclude, there is no need for migration.  I had thought it was a view held by a minority but at a recent workshop it became apparent that it has been accepted by many.

However, I’ve never thought this is a very strong argument.  I’ve always seen a piece of software that can deal with not only new formats but also old formats as really just a piece of software that can deal with new formats with a migration tool seamlessly bolted onto the front of it.  In essence, it is like saying I don’t need a migration tool and a separate rendering tool because I have a combined migration and rendering tool.  Clearly that’s OK, but it does not mean you’re not performing a migration.

As I see it, whenever a piece of software is used to interpret a non-native format it will need to perform some form of transformation from the information model inherent in the format to the information model used in the software.  It can then perform a number of subsequent operations, e.g., render to the screen or maybe even save to a native format of that software.  (If the latter happens this would, of course, be a migration.)

Clearly the way software behaves is infinitely variable but it seems fair to say that there will normally be a greater risk of information loss in the first operation (the transformation between information models) than in subsequent operations that are likely to utilise the information model inherent in the software (be it rendering or saving in the native format).  Hence, if we are concerned with whether or not we are seeing a faithful representation of the original, it is the transformation step that should be verified.

This is where using a separate migration tool comes into its own (at least in principle).  The point is that it allows an independent check to be made of the quality of the transformation to take place (by comparing the significant properties of the files before and after).  Subsequent use of the migrated file (e.g., by a rendering tool) is assumed to be lossless (or at least less lossy) since you can choose the migrated format so that it is the native format of the tool you intend to use subsequently (meaning when the file is read no transformation of information model is required). 

However, I would concede that there are some pragmatic things to consider...

First of all, migration either has a cost (if it requires the migrated file to be stored) or is slow (if it is done on demand).  Hence, there are probably cases where simply using a combined migration and rendering tool is a more convenient solution and might be good enough.

Secondly, is migration validation worth the effort?  Certainly it is worth simply testing, say, a rendering tool with some example files before deciding to use it, and most of the time that should be sufficient to determine that the tool works without detailed validation.  However, we have seen cases where uncommon issues were detected in common migration libraries, so migration validation does catch issues that would go unnoticed if the same libraries were used in a combined migration and rendering tool.

Thirdly, is migration validation comprehensive enough?  The answer to this depends on the formats, but for some (even common) formats it is clear that more comprehensive tools would do a better job.  Of course the hope is that this will continually improve over time.

So, to conclude, I do see migration as a valid technique (and in fact a technique that almost everyone uses even if they don’t realise it).  I think one of the aims of the digital preservation community should be to provide an intellectually sound view of what constitutes a high-quality migration (e.g., through a comprehensive view of significant properties across a wide range of object types).  It might be that real-life tools provide some pragmatic approximation to this idealistic vision (potentially using short cuts like using a combined migration and rendering tool) but we should at least understand and be able to express what these short cuts are.

I hope this post helps to generate some useful debate.

Rob
 
Categories: Planet DigiPres

Six ways to decode a lossy JP2

Open Planets Foundation Blogs - 26 September 2014 - 1:06pm

Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting paper that investigated three different JPEG 2000 codecs, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs encode (compress) an image, but also in the decoding phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.

A limitation of the paper's methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped in the analysis. Thus, it's not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the decode results of different codecs really are.

An experiment

To find out, I ran a simple experiment:

  1. Encode a TIFF image to JP2.
  2. Decode the JP2 back to TIFF using different decoders.
  3. Compare the decode results using some similarity measure.

Codecs used

I used the following codecs:

  • OpenJPEG (opj_decompress)
  • Kakadu (kdu_expand)
  • IrfanView (with the LuraTech JPEG 2000 plugin)
  • ImageMagick (convert)
  • GraphicsMagick (gm convert)

Note that GraphicsMagick still uses the JasPer library for JPEG 2000. ImageMagick now uses OpenJPEG (older versions used JasPer). IrfanView's JPEG 2000 plugin is made by LuraTech.

Creating the JP2

First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:

opj_compress -i krant.tif -o krant_oj_4.jp2 -r 4 -I -p RPCL -n 7 -c [256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] -b 64,64

Decoding the JP2

Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:

  • opj20: opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif
  • kakadu: kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif
  • kakadu-precise: kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise
  • irfan: used the GUI
  • im: convert krant_oj_4.jp2 krant_oj_4_im.tif
  • gm: gm convert krant_oj_4.jp2 krant_oj_4_gm.tif

This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and also with the -precise switch, which "forces the use of 32-bit representations".

Overall image quality

As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:

  • opj20: 48.08
  • kakadu: 48.01
  • kakadu-precise: 48.08
  • irfan: 48.08
  • im: 48.08
  • gm: 48.07

So relative to the source image these results are only marginally different.

Similarity of decoded images

But let's have a closer look at how similar the different decoded images are. I did this by computing PSNR values of all possible decoder pairs. This produced the following matrix:

Decoder          opj20   kakadu  kakadu-precise  irfan   im      gm
opj20            -       57.52   78.53           79.17   96.35   64.43
kakadu           57.52   -       57.51           57.52   57.52   57.23
kakadu-precise   78.53   57.51   -               79.00   78.53   64.52
irfan            79.17   57.52   79.00           -       79.18   64.44
im               96.35   57.52   78.53           79.18   -       64.43
gm               64.43   57.23   64.52           64.44   64.43   -

Note that, unlike the table in the previous section, these PSNR values are only a measure of the similarity between the different decoder results. They don't directly say anything about quality (since we're not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:

  • Group A: all combinations of OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode, all with a PSNR of > 78 dB.
  • Group B: all remaining decoder combinations, with a PSNR of < 64 dB.

What this means is that OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group A, and about 12 % in group B. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.
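
For anyone who wants to reproduce this kind of pairwise comparison, a minimal sketch using the ImageMagick and GraphicsMagick compare tools (the metric names below are the ones those tools use; the figures above were not necessarily produced this way, and file names follow the decode list):

# ImageMagick: overall PSNR and peak absolute error between two decodes
compare -metric PSNR krant_oj_4_oj.tif krant_oj_4_kdu.tif null:
compare -metric PAE krant_oj_4_oj.tif krant_oj_4_kdu.tif null:
# GraphicsMagick's PSNR for the same pair (results may differ, as noted in the conclusions)
gm compare -metric PSNR krant_oj_4_oj.tif krant_oj_4_kdu.tif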

I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups A and B were even more pronounced, with PAE values in group B reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.

What does this say about decoding quality?

It would be tempting to conclude from this that the codecs that make up group A provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values relative to the source TIFF (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in precise mode lowered the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).

I'm still somewhat surprised that even in group A the decoding results aren't identical, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.

Conclusions

OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in precise mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (migrate lossy JP2 to something else) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating PSNR for a JP2 with ImageMagick and GraphicsMagick results in two different outcomes!

For losslessly compressed JP2s, the decode results for all tested codecs are 100% identical1.

This tentative analysis does not support any conclusions on which decoders are 'better'. That would need additional tests with more images. I don't have time for that myself, but I'd be happy to see others have a go at this!

Link

William Palmer, Peter May and Peter Cliff: An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration (Proceedings, iPres 2013)

  1. Identical in terms of pixel values; for this analysis I didn't look at things such as embedded ICC profiles, which not all encoders/decoders handle well

Taxonomy upgrade extras: JPEG2000, JP2
Preservation Topics: Migration, Tools, SCAPE
Categories: Planet DigiPres

We’re All Digital Archivists Now: An Interview with Sibyl Schaefer

The Signal: Digital Preservation - 24 September 2014 - 3:44pm

Sibyl Schaefer, Head of Digital Programs at the Rockefeller Archive

Digital was everywhere at this year’s Society of American Archivists annual meeting. What is particularly exciting is that many of these sessions were practical and pragmatic. That is, many sessions focused on exactly how archivists are meeting the challenge of born-digital records.

In one such session, Sibyl Schaefer, Head of Digital Programs at the Rockefeller Archive Center, offered just such practical advice. I am excited to discuss some of the themes from her talk, “We’re All Digital Archivists: Digital Forensic Techniques in Everyday Practice,” here as part of the ongoing Insights Interview series.

Trevor: Could you unpack the title of your talk a bit for us? Why exactly is it time for all archivists to be digital archivists? What does that mean to you in practice?

Sibyl: We don’t all need to be digital archivists, but we do need to be archivists who work with digital materials. It’s not scalable to have one person, or one team, focus on the “digital stuff.” When I was first considering how to structure the Digital Team (or D-Team) at the RAC, it crossed my mind to mirror the structure of my organization, which is based on the main functions of an archive: collection development, accessioning, preservation, description, and access. I quickly realized that integrating digital practices into existing functions was essential.

The archivists at my institution take great pride in their knowledge of the collections, and not tapping into that knowledge would disadvantage the digital collections. We also don’t have many purely digital collections; the vast majority are hybrid. It wouldn’t make sense for one person to arrange and describe analog materials and another the digital materials. The principles of arrangement and description don’t change due to the format of the materials. Our archivists just need guidance in how to be effective in handling digital records, they need experience using tools so they feel comfortable with them, and they need someone available to ask if they have questions. So the digital archivists on my team are figuring out which software and tools to adopt, which workflows are the most efficient, and how to best educate the rest of the staff so they can do the actual archival work. The digital archivists aren’t actually doing traditional archival work and in that sense, “digital archivist” is a misnomer.

Trevor: If an archivist wants to get up to speed on the state and role of digital forensics in his or her work, what would you suggest they read or review? Further, what about these works do you see as particularly important?

Sibyl: The CLIR report, “Digital Forensics and Born-Digital Content in Cultural Heritage Collections,” is an excellent place to start. It clearly outlines what is gained by using forensics techniques in archival practice: namely the ability to capture digital archival materials in a secure manner that preserves more of their context and original order. These techniques also allow archivists to search through and review those materials without worrying about inadvertently altering them and affecting their authenticity.

I was ecstatic when I first saw Peter Chan’s YouTube video on processing born-digital materials using the Forensic Toolkit (FTK) software. It was the first time I saw how functionality in FTK could be mapped to traditional processing activities: weeding duplicates, identifying Personally Identifiable Information and restricted records, arranging materials hierarchically, etc. It really answers the question of “So you have a disk image, now what do you do with it?” It also conveyed that the program could be picked up fairly easily by processing archivists.

The “From Bitstreams to Heritage: Putting Digital Forensics into Practice in Collecting Institutions” report (pdf) provides a really good overview of the recent activities in this area and a practical analysis of some of the capabilities and limitations of the forensics tools available.

Trevor: Could you tell us a bit about how the digital team works at the Rockefeller Archive Center? What kinds of roles do people take in the staff? How does the team fit into the structure of the Archive? How do you define the services you provide?

Sibyl: My team takes a user-centered approach in fulfilling our mission of leveraging technology to support all our program areas. We generally start by identifying a need for new technology, whether it be to place our finding aids online, create digital exhibits for our donors, preserve the context and authenticity of materials as they move from one physical medium to another, or increase our efficiency in managing researcher requests. We then try to involve users — both internal and external — as much as possible throughout the process. This involvement is crucial given that we usually aren’t the primary users of the software we implement.

One archivist focuses on delivery and access, which includes managing our online finding aid delivery system, as well as working very closely with our reference staff to develop and integrate tools that will help increase the efficiency of their work. Another team member focuses on digitization and metadata projects, which include scanning, outsourced digitization and migrating from the Archivists’ Toolkit to ArchivesSpace. We just hired a new digital archivist to really delve into the digital forensics work I discussed in my presentation at SAA. She will be disk imaging and teaching our processing archivists to use FTK for description. In addition to overseeing the work of all the team members, I interface with our donor institutions, create policies and procedures, set team priorities and oversee our digital preservation system.

As I mentioned before, the RAC is divided into five archival functional areas: donor services, collections management, processing, reference and the digital team. Certain services, like digital preservation and digital duplication for special projects, are within our realm of responsibility, while for others we take a more advisory role. For example, we’re in the midst of implementing Aeon, a special collections management tool, and although we won’t be internally hosting the server, we are helping our reference staff articulate and revise their workflows to take advantage of the efficiencies that system enables.

Our services are quite loosely defined; one of our program goals is to “leverage technology in an innovative way in support of all RAC program areas.” This gives us a lot of leeway in what we choose to do. I prioritize our preservation work based on risk and our systems work based on an evaluation of institutional priorities. For example, over the last year the RAC has been trying to increase the efficiency of our reference services, so we evaluated their workflows, replaced an unscalable method for organizing reference interactions with a user-friendly ticketing system, and are now aiding with the Aeon implementation.

Trevor: Could you tell us a bit about the workflow you have put in place to implement digital forensics in processing digital records? What roles do members of your team play and what roles do others play in that workflow?

Sibyl: My team takes care of inventorying removable media, creating disk images, running virus checks on those images, and providing them to the processing staff for analysis and description. Processing staff then identifies duplicates, restricted materials, and materials that contain PII. They arrange and describe materials within FTK. When they have finished, they notify the D-Team and we add the description to the Archivists’ Toolkit (or ArchivesSpace — we’re preparing to transition over soon) and ingest those files and related metadata into Archivematica.

There are a lot of details we still need to add that will greatly increase the complexity of the process, and some of them will require actual policy decisions to be made. For example, the question of redaction comes up every time I review this process with our archivists. Redaction can be pretty straightforward with certain file formats, but definitely not with all. Also, how do we relay to our researchers that information has been redacted? We need a policy that clearly outlines when we redact information (for materials going online? for use in the reading room?), what types of information we redact, and what types of files can securely be redacted.

Diagram of the digital records processing workflow at RAC.

Trevor: As your process is established and refined, what do you see as the future role and place of the digital team within the archive? That is, what things are on the horizon for you and your team?

Sibyl: In the years since I joined the RAC we’ve placed our finding aids and digital objects online in an access system, architected a system for digital preservation, and configured forensics workflows. Now that we’ve got that foundation for managing and accessing our digital materials, I want to start embodying our goals to be innovative and leaders in the field. One area I think we can contribute to is integrating systems. For example, we’re launching a new project with Artefactual, the developers of Archivematica, to create a self-submission mechanism for donors to transfer records to us. Part of the project includes integrating ArchivesSpace with Archivematica. How cool would it be to have an accession record automatically created in ArchivesSpace when a donor transfers materials to our Archivematica instance?

Likewise, I’ve been talking with a few people about using data in FTK to create interactive interfaces for researchers. We could use directory data captured during imaging or created during analysis (like labeling materials “restricted”) to recreate (but not necessarily emulate) the way files were originally organized, including listing deleted and duplicate files and then linking that directly to their final, archival organization. The researcher would be able to see how the files were originally organized by the donor and what is missing (or restricted) from what is presented as the final archival organization. I get giddy when I think of how we can use technology to increase the transparency of what happens during archival processing. I’m also excited about the prospect of working EAC-CPF records into our discovery interface to bolster our description.

We also have a great many less innovative but very necessary tasks ahead of us. We need to implement a DAMS to help corral the digitized materials that are created on request and also to provide more granular permissions to materials than we currently have. We need to create and implement policies to fill in gaps in our policy framework and inch towards TRAC compliance. And lastly, we need to systematize our preservation planning. We have a lot of work to keep us busy! That said, it’s a really great time to be in the archival field. Digital materials may present new and complex challenges, but we also have a chance to be creative and innovative with systems design and applying traditional archival practices to new workflows.

Categories: Planet DigiPres

Tool highlight: SCAPE Online Demos

Open Planets Foundation Blogs - 23 September 2014 - 2:12pm

Now that we are entering the final days of the SCAPE project, we would like to highlight some SCAPE Quality Assurance tools that have an online demonstrator.

 

See http://scape.demos.opf-labs.org/ for the following tools:

  • Pagelyzer (compares web pages): monitor your web content.
  • Jpylyzer (validates images): JP2K validator and properties extractor.
  • Xcorr-sound (compares audio): improve your digital audio recordings.
  • Flint (validates different files and formats): validate PDF/EPUB files against an institutional policy.
  • Matchbox (compares documents; following soon): duplicate image detection tool.

For more info on these and other tools and the SCAPE project, see http://scape.usb.opf-labs.org for the content of our SCAPE info USB stick.

 

Preservation Topics: SCAPE
Categories: Planet DigiPres

Interview with a SCAPEr - Ed Fay

Open Planets Foundation Blogs - 23 September 2014 - 12:21pm
Ed Fay

Who are you?

My name is Ed Fay, I’m the Executive Director of the Open Planets Foundation.

Tell us a bit about your role in SCAPE and what SCAPE work you are involved in right now?

OPF has been involved in technical and take-up work all the way through the project, but right now we’re focused on sustainability – what happens to all the great results that have been produced after the end of the project.

Why is your organisation involved in SCAPE?

OPF has been responsible for leading the sustainability work and will provide a long-term home for the outputs, preserving the software and providing an ongoing collaboration of project partners and others on best practices and other learning. OPF members include many institutions who have not been part of SCAPE but who have an interest in continuing to develop the products, and through the work that has been done - for example on software maturity and training materials - OPF can help to lower barriers to adoption by these institutions and others.

What are the biggest challenges in SCAPE as you see it?

The biggest challenge in sustainability is identifying a collaboration model that can persist outside of project funding. As cultural heritage budgets are squeezed around the world and institutions adapt to a rapidly changing digital environment, the community needs to make the best use of the massive investment in R&D made by bodies such as the EC in projects such as SCAPE. OPF is a sustainable membership organisation which is helping to answer these challenges for its members and provide effective and efficient routes to implementing the necessary changes to working practices and infrastructure. In 20 years we won’t be asking how to sustain work such as this – it will be business as usual for memory institutions everywhere – but right now the digital future is far from evenly distributed.

But from the SCAPE perspective we have a robust plan which encompasses many different routes to adoption, which is of course the ultimate route to sustainability – production use of the outputs by the community for which they were intended. The fact that many outputs are already in active use – as open-source tools and embedded into commercial systems – shows that SCAPE has produced not only great research but mature products which are ready to be put to work in real-world situations.

What do you think will be the most valuable outcome of SCAPE?

This is very difficult for me to answer! Right now OPF has the privileged perspective of transferring everything that has matured during the project into our stewardship - from initial research, through development, and now into mature products which are ready for the community. So my expectation is that there are lots of valuable outputs which are relevant not only in the context of SCAPE but also as independent components. One particular product has already been shortlisted for the Digital Preservation Awards 2014, which OPF is co-sponsoring this year, while others have won awards at DL2014. These might be the most visible in receiving accolades, but there are many other tools and services which provide the opportunity to enhance digital preservation practice within a broad range of institutions. I think the fact that SCAPE is truly cross-domain is very exciting - working with scientific data, cultural heritage, web harvesting - it shows that digital preservation is truly maturing as a discipline.

If there could be one thing to come out of this, it would be an understanding of how to continue the outstanding collaboration that SCAPE has enabled, in order to sustain cost-effective digital preservation solutions that can be adopted by institutions of all sizes and diversity.

Contact information

ed@openplanetsfoundation.org

twitter.com/digitalfay

Preservation Topics: SCAPE
Categories: Planet DigiPres

Weirder than old: The CP/M File System and Legacy Disk Extracts for New Zealand’s Department of Conservation

Open Planets Foundation Blogs - 23 September 2014 - 8:14am

We’ve been doing legacy disk extracts at Archives New Zealand for a number of years, with much of the groundwork for this capability laid by colleague Mick Crouch and former Archives New Zealand colleague Euan Cochran. Earlier this year we received some disks from New Zealand’s Department of Conservation (DoC), which we successfully imaged before extracting what the department needed. While it was a pretty straightforward exercise, enough about it was interesting to make this blog an opportunity to document another facet of the digital preservation work we’re doing, especially in the spirit of adding another war story that others in the community can refer to. We conclude with a few thoughts about where we still relied on a little luck, which we’ll have to keep in mind moving forward.

We received 32 180 KB 5.25-inch disks from DoC: Maxell MD1-D, single-sided, double-density, containing what we expected to be survey data from circa 1984/1985.

Our goal with these disks, as with any that we are finding outside of a managed records system, is to transfer the data to a more stable medium, as disk images, and then extract the objects on the imaged file system to enable further appraisal. From there a decision will be made about how much more effort should be put into preserving the content and making suitable access copies of whatever we have found – a triage.

For agencies with 3.5-inch floppy disks, we normally help develop a workflow within the organisation that enables them to manage this work themselves using more ubiquitous 3.5-inch USB disk drives. With 5.25-inch disks it is more difficult to find suitable floppy disk drive controllers, so we try our best at Archives to do this work on behalf of colleagues, using equipment we’ve set up around the KryoFlux Universal USB floppy disk controller. The device enables write-blocked reading and imaging of legacy disk formats at a forensic level, using modern PC equipment.

We create disk images of the floppies using the KryoFlux and continue to use those images as a master copy for further triage. A rough outline of the process we tend to follow, plus some of its rationale is documented by Euan Cochran in his Open Planets Foundation blog: Bulk disk imaging and disk-format identification with KryoFlux.

Through a small amount of trial and error we discovered that the image format with which we could read the most sectors without error was MFM (Modified Frequency Modulation), with the following settings:

  • Image Type: MFM Sector Image
  • Start Track: At least 0
  • End Track: At most 83
  • Side Mode: Side 0
  • Sector Size: 256 Bytes
  • Sector Count: Any
  • Track Distance: 40 Tracks
  • Target RPM: By Image type
  • Flippy Mode: Off

We didn’t experiment to see if these settings could be further optimised as we found a good result. The non-default settings in the case of these disks were side mode zero, sector size 256 bytes, track distance at 40, and flippy mode was turned off.

With the data taken off volatile and unstable media, we now have binary objects that we can attach fixity to and treat using more common digital preservation workflows. We managed to read 30 of the 32 disks.
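As a minimal sketch of what attaching fixity can look like in practice (assuming the images sit in a single directory, GNU coreutils are available, and the file names are illustrative):

$ sha256sum *.img > disk-images.sha256    # record a checksum for every disk image
$ sha256sum -c disk-images.sha256         # re-verify the images at any later date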

Exploding the Disk Images

With the disk images in hand we found ourselves facing our biggest challenge. The images, although clearly well-formed (i.e. not corrupt), would not mount with Virtual Floppy Disk in Windows or in Linux.

Successful imaging alone doesn’t guarantee ease of mounting. We still needed to understand the underlying file system.

The images that we’ve seen before have been FAT12 and mount with ease in MS-DOS or Linux. These disks did not share the same identifying signatures at the beginning of the bitstream. We needed a little help identifying them; fortunately, through forensic investigation and a little experience demonstrated by a colleague, it became quite clear the disks were CP/M formatted, with the following ASCII text appearing as-is in the bitstream:

 

*************************
*     MIC-501 V1.6      *
*   62K CP/M VERS 2.2   *
*************************
COPYRIGHT 1983, MULTITECH
BIOS VERS 1.6

 

CP/M (Control Program for Microcomputers) is a 1970s and early 1980s operating system for early Intel microcomputers. The makers of the operating system were approached by IBM about licensing CP/M for its Personal Computer product, but talks failed and IBM went with MS-DOS from Microsoft; the rest is ancient history…

With the knowledge that we were looking at a CP/M file system, we were able to source a mechanism to mount the disks in Windows. Cpmtools is a privately maintained suite of utilities for interacting with CP/M file systems. It was developed for working with CP/M in emulated environments, but works equally well with floppy disks and disk images. The tools are available for Windows and POSIX-compliant systems.

Commands for the different utilities look like the following.

Creating a directory listing:

C:> cpmls -f bw12 disk-images\disk-one.img

This will list the user number (a CP/M concept) and the file objects belonging to that user.

E.g.:

0: File1.txt File2.txt

Extracting objects based on user number:

C:> cpmcp -f bw12 -p -t disk-images\disk-one.img 0:* output-dir

This will extract all objects collected logically under user 0: and put them into an output directory.

Finding the right commands was a little tricky at first, but once the correct set of arguments was found, it was straightforward to keep repeating them for each of the disks.

One of the less intuitive values supplied to the command line was the ‘bw12’ disk definition. This refers to a definition file, detailing the layout of the disk. The definition that worked best for our disks was the following:

# Bondwell 12 and 14 disk images in IMD raw binary format
diskdef bw12
  seclen 256
  tracks 40
  sectrk 18
  blocksize 2048
  maxdir 64
  skew 1
  boottrk 2
  os 2.2
end
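cpmtools picks up definitions like this from its diskdefs file, so an entry like the one above just needs to be appended to that file before it can be selected by name with -f. A sketch, assuming the Windows build keeps its diskdefs file alongside the executables and that the definition has been saved as bw12-definition.txt (both names illustrative):

C:> type bw12-definition.txt >> diskdefs
C:> cpmls -f bw12 disk-images\disk-one.img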

The majority of the disks extracted well. A small, on-image modification we made was the conversion of filenames containing forward slashes. The forward slashes did not play well with Windows, and so I took the decision to change the slashes to hashes in hex to ensure the objects were safely extracted into the output directory.

WordStar and other bits ‘n’ pieces

Content on the disks was primarily WordStar, CP/M’s flavour of word processor. Despite MS-DOS versions of WordStar, the program eventually lost market share to WordPerfect in the 1980s, almost in parallel with the demise of CP/M. It took a little searching to source a converter to turn the WordStar content into something more useful, but we did find something fairly quickly. The biggest issue with viewing WordStar content as-is in a standard text editor is the format’s use of the high-order bits within individual bytes to define word boundaries, as well as to make other denotations.

Example text, read verbatim might look like:

thå  southerî coasô = the southern coast

At first, I was sure this was a sign of bit-flipping on less stable media. Again, the experience colleagues had with older formats was useful here, and a consultation with Google soon helped me to understand what we were seeing.

Looking for various readers or migration tools led me to a number of dead websites, but the Internet Archive came to the rescue and allowed us to see them: WordStar to other format solutions.

The tool we ended up using was the HABit WordStar Converter, with more information available on Softpedia.com. It does bulk conversion of WordStar to plain text or HTML. We didn’t have to worry too much about how faithful the representation would be; as this was just a triage, we were more interested in the intellectual value of the content, or data. Rudimentary preservation of layout would be enough. We were very happy with plain-text output, with the option of HTML output too.
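For a quick, throwaway look at WordStar text without a dedicated converter, the high-order bits can also be cleared with standard Unix tools. A rough sketch (not the converter we used), assuming GNU tr and an illustrative file name:

$ LC_ALL=C tr '\200-\377' '\000-\177' < REPORT.WS > report.txt    # clear the high bit of every byte

This is only good enough for eyeballing content: WordStar’s formatting bytes come through as control characters, which is one reason a purpose-built converter gives a more faithful result.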

Unfortunately, when we approached Henry Bartlett, the developer of the tool, about a small bug in the bulk conversion (the tool neutralises file format extensions on output of the text file, causing naming collisions), we were informed by his wife that he had sadly passed away. I hope it is some reassurance to her to know that, at the very least, his work is still of great use to a good number of people doing format research, and to those who will eventually consume the objects that we’re working on.

Conversion was still a little more manual than we’d like had there been larger numbers of files, but everything ran smoothly. The deliverables were collected and taken back to the parent department on a USB stick, along with the original 5.25-inch disks.

We await further news from DoC about what they’re planning on doing with the extracts next.

Conclusions

The research to complete this work took a couple of weeks overall. With more dedicated time it might have taken a week.

Since completing this work and delivering it to the Department of Conservation, we’ve run through the same process on another batch of disks. That took a fraction of the time, possibly an afternoon, and the process can be refined with each further iteration.

The next step is to understand the value in what was extracted. This might mean using the extract to locate printed copies of the content, in which case we could dispose of these disks and their content. An even better result might be discovering that there are no other copies of the material, so that these digital objects can become records in their own right with potential for long-term retention. At the very least those conversations can now begin. In the latter instance, we’ll need to understand which of the various deliverables (the disk images, the extracted objects or the migrated objects) will be considered the record.

Demonstrable value acts as a weight on the scales of digital preservation, where we try to balance effort with value; especially so in this instance, where the purpose of the digital material is as yet unknown. This case study is born of an air gap in the recordkeeping process, with the parent department attempting to understand the information in its possession in lieu of other recordkeeping metadata.

Aside from the value in what was extracted, there is still a benefit to us as an archive, and as a team in working with old technology, and equipment. Knowledge gained here will likely prove useful somewhere else down the line. 

Identifying the file system could have been a little easier, and so we’d echo the call from Euan in the aforementioned blog post to have identification mechanisms for image formats in DROID-like tools.

Forensic analysis of the disk images and comparing that data to that extracted by CP/M Tools showed a certain amount of data remanence, that is, data that only exists forensically on the disk. It was extremely tempting to do more work with this, but we settled for notifying our contact at DoC, and thus far, we haven’t been called on to extract it.

We required a number of tools to perform this work. How we maintain the knowledge of this work, and maintain the tools used are two important questions. I haven’t an answer for the latter, while this blog serves in some way as documentation of the former.

What we received from DoC was old, but its age wasn’t the problem. The right tools enabled this work to be done fairly easily – and that goes for any organisation willing to put modest tools such as the KryoFlux, and other legacy equipment, in the hands of their analysts and researchers. The disks were in good shape too. The curveball in this instance was that some of the pieces of the puzzle we were interacting with were weirder than expected: a slightly different file system, and a word processing format that encoded data in an unexpected way, making 1:1 extraction and use a little more difficult. We got around it, though. And indeed, as it stands, this wasn’t a preservation exercise; it was a low-cost and pragmatic exercise to support appraisal, continuity, and potential future preservation. The files have been delivered to DoC in their various forms: disk images, extracted objects and migrated objects. We’ll await a further nod from them to understand where we go next.

Preservation Topics: Preservation Actions, Identification, Migration, Preservation Risks, Tools
Categories: Planet DigiPres

18 Years of Kairos Webtexts: An interview with Douglas Eyman & Cheryl E. Ball

The Signal: Digital Preservation - 22 September 2014 - 2:05pm

Cheryl E. Ball, associate professor of digital publishing studies at West Virginia University, is editor of Kairos

Since 1996 the electronic journal Kairos has published a diverse range of webtexts: scholarly pieces made up of various media and hypermedia. The 18 years of digital journal texts are interesting both in their own right and as a collection of complex works of digital scholarship that illustrate a range of sophisticated issues for ensuring long-term access to new modes of publication. Douglas Eyman, Associate Professor of Writing and Rhetoric at George Mason University, is senior editor and publisher of Kairos. Cheryl E. Ball, associate professor of digital publishing studies at West Virginia University, is editor of Kairos. In this Insights Interview, I am excited to learn about the kinds of issues that this body of work exposes for considering long-term access to born-digital modes of scholarship. [There was also a presentation on Kairos at the Digital Preservation 2014 meeting.]

Trevor: Could you describe Kairos a bit for folks who aren’t familiar with it? In particular, could you tell us a bit about what webtexts are and how the journal functions and operates?

Doug: Webtexts are texts that are designed to take advantage of the web-as-concept, web-as-medium, and web-as-platform. Webtexts should engage a range of media and modes and the design choices made by the webtext author or authors should be an integral part of the overall argument being presented. One of our goals (that we’ve met with some success I think) is to publish works that can’t be printed out — that is, we don’t accept traditional print-oriented articles and we don’t post PDFs. We publish scholarly webtexts that address theoretical, methodological or pedagogical issues which surface at the intersections of rhetoric and technology, with a strong interest in the teaching of writing and rhetoric in digital venues.


Douglas Eyman, Associate Professor of Writing and Rhetoric at George Mason University, is senior editor and publisher of Kairos

(As an aside, there was a debate in 1997–98 about whether or not we were publishing hypertexts, which at the time tended to be available in proprietary formats and platforms rather than freely on the WWW; founding editor Mick Doherty argued that we were publishing much more than only hypertexts, so we moved from calling what we published ‘hypertexts’ to ‘webtexts’. Mick tells that story in the 3.1 loggingon column.)

Cheryl: WDS (What Doug said ;) One of the ways I explain webtexts to potential authors and administrators is that the design of a webtext should, ideally, enact authors’ scholarly arguments, so that the form and content of the work are inseparable.

Doug: The journal was started by an intrepid group of graduate students, and we’ve kept a fairly DIY approach since that first issue appeared on New Year’s day in 1996. All of our staff contribute their time and talents and help us to publish innovative work in return for professional/field recognition, so we are able to sustain a complex venture with a fairly unique economic model where the journal neither takes in nor spends any funds. We also don’t belong to any parent organization or institution, and this allows us to be flexible in terms of how the editors choose to shape what the journal is and what it does.

Cheryl: We are lucky to have a dedicated staff who are scattered across (mostly) the US: teacher-scholars who want to volunteer their time to work on the journal, and who implement the best practices of pedagogical models for writing studies into their editorial work. At any given time, we have about 25 people on staff (not counting the editorial board).

Doug: Operationally, the journal functions much like any other peer-reviewed scholarly journal: we accept submissions, review them editorially, pass on the ones that are ready for review to our editorial board, engage the authors in a revision process (depending on the results of the peer-review) and then put each submission through an extensive and rigorous copy-, design-, and code-editing process before final publication. Unlike most other journals, our focus on the importance of design and our interest in publishing a stable and sustainable archive mean that we have to add those extra layers of support for design-editing and code review: our published webtexts need to be accessible, usable and conform to web standards.

Trevor: Could you point us to a few particularly exemplary works the journal has published over time, to help readers wrap their heads around what these pieces look like? They could be pieces you think are particularly novel, interesting or challenging, or that exemplify trends in the journal. Ideally, you could link to each one, describe it and give us a sentence or two about what you find particularly significant about it.

Cheryl: Sure! We sponsor an award every year for Best Webtext, and that’s usually where we send people to find exemplars, such as the ones Doug lists below.

Doug: From our peer-reviewed sections, we point readers to the following webtexts (the first two are especially useful for their focus on the process of webtext authoring and editing):

Cheryl: From our editorially (internally) reviewed sections, here are a few other examples:

Trevor: Given the diverse range of kinds of things people might publish in a webtext, could you tell us a bit about the kinds of requirements you have enforced upfront to try and ensure that the works the journal publishes are likely to persist into the future? For instance, any issues that might come up from embedding material from other sites, or running various kinds of database-driven works or things that might depend on external connections to APIs and such.

Doug: We tend to discourage work that is in proprietary formats (although we have published our fair share of Flash-based webtexts) and we ask our authors to conform to web standards (XHTML or HTML5 now). We think it is critical to be able to archive any and all elements of a given webtext on our server, so even in cases where we’re embedding, for instance, a YouTube video, we have our own copy of that video and its associated transcript.

One of the issues we are wrestling with at the moment is how to improve our archival processes so we don’t rely on third-party sites. We don’t have a streaming video server, so we use YouTube now, but we are looking at other options because YouTube allows large corporations to apply bogus copyright-holder notices to any video they like, regardless of whether there is any infringing content (as an example, an interview with a senior scholar in our field was flagged and taken down by a record company; there wasn’t even any background audio that could account for the notice. And since there’s a presumption of guilt, we have to go through an arduous process to get our videos reinstated.) What’s worse is when the video *isn’t* taken down, but the claimant instead throws ads on top of our authors’ works. That’s actually copyright infringement against us that is supported by YouTube itself.

Another issue is that many of the external links in works we’ve published (particularly in older webtexts) tend to migrate or disappear. We used to replace these where we could with links to archive.org (aka The Wayback Machine), but we’ve discovered that their archive is corrupted because they allow anyone to remove content from their archive without reason or notice.[1] So, despite its good intentions, it has become completely unstable as a reliable archive. But we don’t, alas, have the resources to host copies of everything that is linked to in our own archives.

Cheryl: Kairos holds the honor within rhetoric and composition of being the longest-running, and most stable, online journal, and our archival and technical policies are a major reason for that. (It should be noted that many potential authors have told us how scary those guidelines look. We are currently rewriting the guidelines to make them more approachable while balancing the need to educate authors on their necessity for scholarly knowledge-making and -preservation on the Web.)

Of course, being that this field is grounded in digital technology, not being able to use some of that technology in a webtext can be a rather large constraint. But our authors are ingenious and industrious. For example, Deborah Balzhiser et al created an HTML-based interface to their webtext that mimicked Facebook’s interface for their 2011 webtext, “The Facebook Papers.” Their self-made interface allowed them to do some rhetorical work in the webtext that Facebook itself wouldn’t have allowed. Plus, it meant we could archive the whole thing on the Kairos server in perpetuity.

Trevor: Could you give us a sense of the scope of the files that make up the issues? For instance, the total number of files, the range of file types you have, the total size of the data, and or a breakdown of the various kinds of file types (image, moving image, recorded sound, text, etc.) that exist in the run of the journal thus far?

Doug: The whole journal is currently around 20 GB; newer issues are larger in terms of data size because there has been an increase in the use of audio and video (luckily, HTML and CSS files don’t take up a whole lot of room, even with a lot of content in them). At last count, there are 50,636 files residing in 4,545 directories (this count includes things like all the system files for WordPress installs and so on). A quick summary of primary file types:

  • HTML: 12247
  • CSS: 1234
  • JPG: 5581
  • PNG: 3470
  • GIF: 7475
  • MP2/3/4: 295
  • MOV: 237
  • PDF: 191

Cheryl: In fact, our presentation at Digital Preservation 2014 this year [was] partly about the various file types we have. A few years ago, we embarked on a metadata-mining project for the back issues of Kairos. Some of the fields we mined for included Dublin Core standards such as MIMEtype and DCMIType. DCMIType, for the most part, didn’t reveal too much of interest from our perspective (although I am sure librarians will see it differently!! :) but the MIMEtype search revealed both the range of filetypes we had published and how that range has changed over the journal’s 20-year history. Every webtext has at least one HTML file. Early webtexts (from 1996-2000ish) that have images generally have GIFs and, less prominent, JPEGs. But since PNGs rose to prominence (becoming an international standard in 2003), we began to see more and more of them. The same with CSS files around 2006, after web-standards groups starting enforcing their use elsewhere on the Web. As we have all this rich data about the history of webtextual design, and too many research questions to cover in our lifetimes, we’ve released the data in Dropbox (until we get our field-specific data repository, rhetoric.io, completed).
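For anyone wanting to produce a similar breakdown for their own collection, a shell one-liner over a local copy of the file tree gets most of the way there (a sketch only; files without an extension will show up under their full names):

$ find . -type f | sed 's/.*\.//' | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn    # count files per extension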

Trevor: In the 18 years that have transpired since the first issue of Kairos a lot has changed in terms of web standards and functionality. I would be curious to know if you have found any issues with how earlier works render in contemporary web browsers. If so, what is your approach to dealing with that kind of degradation over time?

Cheryl: If we find something broken, we try to fix it as soon as we can. There are lots of 404s to external links that we will never have the time or human resources to fix (anyone want to volunteer??), but if an author or reader notifies us about a problem, we will work with them to correct the glitch. One of the things we seem to fix often is repeating backgrounds. lol. “Back in the days…” when desktop monitors were tiny and resolutions were tinier, it was inconceivable that a background set to repeat at 1200 pixels would ever actually repeat. Now? Ugh.

But we do not change designs for the sake of newer aesthetics. In that respect, the design of a white-text-on-black-background from 1998 is as important a rhetorical point as the author’s words in 1998. And, just as the ideas in our scholarship grow and mature as we do, so do our designs, which have to be read in the historical context of the surrounding scholarship.

Of course, with the bettering of technology also comes our own human degradation in the form of aging and poorer eyesight. We used to mandate webtexts not be designed over 600 pixels wide, to accommodate our old branding system that ran as a 60-pixel frame down the left-hand side of all the webtexts. That would also allow for a little margin around the webtext. Now, designing for specific widths — especially ones that small — seems ludicrous (and too prescriptive), but I often find myself going into authors’ webtexts during the design-editing stage and increasing their typeface size in the CSS so that I can even read it on my laptop. There’s a balance I face, as editor, of retaining the authors’ “voice” through their design and making the webtext accessible to as many readers as possible. Honestly, I don’t think the authors even notice this change.

Trevor: I understand you recently migrated the journal from a custom platform to the Open Journal System platform. Could you tell us a bit about what motivated that move and issues that occurred in that migration?

Doug: Actually, we didn’t do that.

Cheryl: Yeah, I know it sounds like we did from our Digital Preservation 2014 abstract, and we started to migrate, but ended up not following through for technical reasons. We were hoping we could create plug-ins for OJS that would allow us to incorporate our multimedia content into its editorial workflow. But it didn’t work. (Or, at least, wasn’t possible with the $50,000 NEH Digital Humanities Start-Up Grant we had to work with.) We wanted to use OJS to help streamline and automate our editorial workflow–you know, the parts about assigning reviewers and copy-editors, etc., — and as a way to archive those processes.

I should step back here and say that Kairos has never used a CMS; everything we do, we do by hand — manually SFTPing files to the server, manually making copies of webtext folders in our kludgy way of version control, using YahooGroups (because it was the only thing going in 1998 when we needed a mail system to archive all of our collaborative editorial board discussions) for all staff and reviewer conversations, etc.–not because we like being old school, but because there were always too many significant shortcomings with any out-of-the-box systems given our outside-the-box journal. So the idea of automating, and archiving, some of these processes in a centralized database such as OJS was incredibly appealing. The problem is that OJS simply can’t handle the kinds of multimedia content we publish. And rewriting the code-base to accommodate any plug-ins that might support this work was not in the budget. (We’ve written about this failed experiment in a white paper for NEH.)

[1] Archive.org will obey robots.txt files if they ask not to be indexed. So, for instance, early versions of Kairos itself are no longer available on archive.org because such a file is on the Texas Tech server where the journal lived until 2004. We put that file there because we want Google to point to the current home of the journal, but we actually would like that history to be in the Internet Archive. You can think of this as just a glitch, but here’s the more pressing issue: if I find someone has posted a critical blog post about my work, and I ever get hold of the domain it was originally posted on, I can take it down there *and* retroactively make it unavailable on archive.org, even if it used to show up there. Even without such nefarious purposes, the constant trade in domains and site locations means that no researcher can trust that archive when using it for history or any kind of digital scholarship.

Categories: Planet DigiPres

How trustworthy is the SCAPE Preservation Environment?

Open Planets Foundation Blogs - 19 September 2014 - 1:51pm

Over the last three and a half years, the SCAPE project worked in several directions in order to propose new solutions for digital preservation, as well as improving existing ones. One of the results of this work is the SCAPE preservation environment (SPE). It is a loosely coupled system, which enables extending existing digital repository systems (e.g. RODA) with several components that cover collection profiling (i.e. C3PO), preservation monitoring (i.e. SCOUT) and preservation planning (i.e. Plato). Those components address key functionalities defined in the Open Archival Information System (OAIS) functional model.

Establishing trustworthiness of digital repositories is a major concern of the digital preservation community, as it makes the threats and risks within a digital repository understandable. Several approaches to addressing trust in digital repositories have been developed over recent years. The most notable is Trustworthy Repositories Audit and Certification (TRAC), which was later promoted to an international standard (ISO 16363, released in 2012). The standard comprises three pillars: organizational infrastructure; digital object management; and infrastructure and security management. For each of these it provides a set of requirements and the evidence expected for compliance.

A recently published white paper reports on the work done to validate the SCAPE Preservation Environment against ISO 16363, the framework for Audit and Certification of Trustworthy Digital Repositories. The work aims to demonstrate that a preservation ecosystem composed of building blocks such as the ones developed in SCAPE is able to comply with most of the system-related requirements of ISO 16363.

Of the 108 metrics included in the assessment, the SPE fully supports 69. A further 31 metrics were considered “out of scope”, as they refer to organisational issues that cannot be solved by technology alone, nor analysed outside the framework of a breathing organisation; that leaves 2 metrics considered “partially supported” and 6 considered “not supported”. This gives an overall compliance level of roughly 90% (69 of the 77 metrics that remain once the organisation-oriented ones are set aside).

This work also enabled us to identify the main weak points of the SCAPE Preservation Environment that should be addressed in the near future. In summary the gaps found were:

  • The ability to manage and maintain contracts or deposit agreements through the repository user interfaces;
  • Support for tracking intellectual property rights;
  • Improve technical documentation, especially on the conversion of Submission Information Packages (SIP) into Archival Information Packages (AIP);
  • The ability to aid the repository manager to perform better risk management.

Our goal is to ensure that the SCAPE Preservation Environment fully supports the system-related metrics of the ISO 16363. In order to close the gaps encountered, additional features have been added to the roadmap of the SPE.

To get your hands on the full report, please go to http://www.scape-project.eu/wp-content/uploads/2014/09/SCAPE_MS63_KEEPS-V1.0.pdf

 

Preservation Topics: Preservation Strategies, Preservation Risks, SCAPE
Categories: Planet DigiPres

Emerging Collaborations for Accessing and Preserving Email

The Signal: Digital Preservation - 19 September 2014 - 1:02pm

The following is a guest post by Chris Prom, Assistant University Archivist and Professor, University of Illinois at Urbana-Champaign.

I’ll never forget one lesson from my historical methods class at Marquette University. Ronald Zupko, famous for his lecture about the bubonic plague and a natural showman, was expounding on what it means to interrogate primary sources: to cast a skeptical eye on every source, to see each one as a mere thread of evidence in a larger story, and to remember that every event can, and must, tell many different stories.

He asked us to name a few documentary genres, along with our opinions as to their relative value.  We shot back: “Photographs, diaries, reports, scrapbooks, newspaper articles,” along with the type of ill-informed comments graduate students are prone to make.  As our class rattled off responses, we gradually came to realize that each document reflected the particular viewpoint of its creator–and that the information a source conveyed was constrained by documentary conventions and other social factors inherent to the medium underlying the expression. Settling into the comfortable role of skeptics, we noted the biases each format reflected.  Finally, one student said: “What about correspondence?”  Dr Zupko erupted: “There is the real meat of history!  But, you need to be careful!”


Dangerous Inbox by Recrea HQ. Photo courtesy of Flickr through a CC BY-NC-SA 2.0 license.

Letters, memos, telegrams, postcards: such items have long been the stock-in-trade for archives.  Historians and researchers of all types, while mindful of the challenges in using correspondence, value it as a source for the insider perspective it provides on real-time events.   For this reason, the library and archives community must find effective ways to identify, preserve and provide access to email and other forms of electronic correspondence.

After I researched and wrote a guide to email preservation (pdf) for the Digital Preservation Coalition’s Technology Watch Report series, I concluded that the challenges are mostly cultural and administrative.

I have no doubt that with the right tools, archivists could do what we do best: build the relationships that underlie every successful archival acquisition.  Engaging records creators and donors in their digital spaces, we can help them preserve access to the records that are so sorely needed for those who will write histories.  But we need the tools, and a plan for how to use them.  Otherwise, our promises are mere words.

For this reason, I’m so pleased to report on the results of a recent online meeting organized by the National Digital Stewardship Alliance’s Standards and Practices Working Group.  On August 25, a group of fifty-plus experts from more than a dozen institutions informally shared the work they are doing to preserve email.

For me, the best part of the meeting was that it represented the diverse range of institutions (in terms of size and focus) that are interested in this critical work. Email preservation is not of interest only to large government archives or small collecting repositories, but to every repository in between. That said, the representatives displayed a surprisingly similar vision for how email preservation can be made effective.

Robert Spangler, Lisa Haralampus, Ken Hawkins and Kevin DeVorsey described challenges that the National Archives and Records Administration has faced in controlling and providing access to large bodies of email. Concluding that traditional records management practices are not sufficient to the task, NARA has developed the Capstone approach, which seeks to identify particular accounts that must be preserved as a record series, and is currently revising its transfer guidance. Later in the meeting, Mark Conrad described the particular challenge of preserving email from the Executive Office of the President, highlighting the point that “scale matters,” a theme that resonated across the board.

The whole-account approach that NARA advocates meshes well with activities described by other presenters. For example, Kelly Eubank from the North Carolina State Archives and the EMCAP project discussed the need for software tools to ingest and process email records, while Linda Reib from the Arizona State Library noted that the PeDALS Project is seeking to continue their work, focusing on account-level preservation of key state government accounts.

Functional comparison of selected email archives tools/services. Courtesy Wendy Gogel.

Ricc Ferrante and Lynda Schmitz Fuhrig from the Smithsonian Institution Archives discussed the CERP project which produced, in conjunction with the EMCAP project, an XML schema for email objects among its deliverables. Kate Murray from the Library of Congress reviewed the new email and related calendaring formats on the Sustainability of Digital Formats website.

Harvard University was up next. Andrea Goethals and Wendy Gogel shared information about Harvard’s Electronic Archiving Service (EAS). EAS includes tools for normalizing email from an account into the EML format (conforming to Internet Engineering Task Force RFC 2822), then packaging it for deposit into Harvard’s digital repository.

One of the most exciting presentations was provided by Peter Chan and Glynn Edwards from Stanford University. With generous funding from the National Historical Publications and Records Commission, as well as some internal support, the ePADD Project (“Email: Process, Appraise, Discover, Deliver”) is using natural language processing and entity extraction tools to build an application that will allow archivists and records creators to review email, then process it for search, display and retrieval. Best of all, the web-based application will include a built-in discovery interface, and users will be able to define a lexicon and provide visual representations of the results. Many participants in the meeting commented that the ePADD tools may provide a meaningful focus for additional collaborations. A beta version is due out next spring.

In the discussion that followed the informal presentations, several presenters congratulated the Harvard team on a slide Wendy Gogel shared, comparing the functions provided by various tools and services (reproduced above).

As is apparent from even a cursory glance at the chart, repositories are doing wonderful work—and much yet remains.

Collaboration is the way forward. At the end of the discussion, participants agreed to take three specific steps to drive email preservation initiatives to the next level: (1) providing tool demo sessions; (2) developing use cases; and (3) working together.

The bottom line: I’m more hopeful about the ability of the digital preservation community to develop an effective approach toward email preservation than I have been in years.  Stay tuned for future developments!

Categories: Planet DigiPres

The return of music DRM?

File Formats Blog - 18 September 2014 - 12:58pm

U2, already the most hated band in the world thanks to its invading millions of iOS devices with unsolicited files, isn’t stopping. An article on Time‘s website tells us, in vague terms, that

Bono, Edge, Adam Clayton and Larry Mullen Jr believe so strongly that artists should be compensated for their work that they have embarked on a secret project with Apple to try to make that happen, no easy task when free-to-access music is everywhere (no) thanks to piracy and legitimate websites such as YouTube. Bono tells TIME he hopes that a new digital music format in the works will prove so irresistibly exciting to music fans that it will tempt them again into buying music—whole albums as well as individual tracks.

It’s hard to read this as anything but an attempt to bring digital rights management (DRM) back to online music distribution. Users emphatically rejected it years ago, and Apple was among the first to drop it. You haven’t really “bought” anything with DRM on it; you’ve merely leased it for as long as the vendor chooses to support it. People will continue to break DRM, if only to avoid the risk of loss. The illegal copies will offer greater value than legal ones.

It would be nice to think that what U2 and Apple really mean is just that the new format will offer so much better quality that people will gladly pay for it, but that’s unlikely. Higher-quality formats such as AAC have been around for a long time, and they haven’t pushed the old standby MP3 out of the picture. Existing levels of quality are good enough for most buyers, and vendors know it.

Time implies that YouTube doesn’t compensate artists for their work. This is false. They often don’t bother with small independent musicians, though they will if they’re reminded hard enough (as Heather Dale found out), but it’s hard to believe that groups with powerful lawyers, such as U2, aren’t being compensated for every view.

DRM and force-feeding of albums are two sides of the same coin of vendor control over our choices. This new move shouldn’t be a surprise.


Tagged: Apple, audio, DRM
Categories: Planet DigiPres

SCAPE Project Ends on the 30th of September

Open Planets Foundation Blogs - 18 September 2014 - 12:11pm

It is difficult to write that headline. After nearly four years of hard work, worry, setbacks, triumphs, weariness, and exultation, the SCAPE project is finally coming to an end.

I am convinced that I will look back at this period as one of the highlights of my career. I hope that many of my SCAPE colleagues will feel the same way.

I believe SCAPE was an outstanding example of a successful European project, characterised by

  • an impressive level of trouble-free international cooperation;
  • sustained effort and dedication from all project partners;
  • high quality deliverables and excellent review ratings;
  • a large number of amazing results, including more software tools than we can demonstrate in one day!

I also believe SCAPE has made and will continue to make a significant impact on the community and practice of digital preservation. We have achieved this impact through

I would like to thank all the people who contributed to the SCAPE project, who are far too numerous to name here. In particular I would like to thank our General Assembly members, our Executive Board/Sub-project leads, the Work Package leads, and the SCAPE Office, all of whom have contributed to the joy and success of SCAPE.

Finally, I would like to thank the OPF for ensuring that the SCAPE legacy will continue to live and even grow long after the project itself is just a fond memory.

It's been a pleasure folks. Well done!

Preservation Topics: SCAPE
Categories: Planet DigiPres

Digital Preservation Sustainability on the EU Policy Level - a workshop report

Open Planets Foundation Blogs - 18 September 2014 - 5:52am

On Monday 8 September 2014 APARSEN and SCAPE jointly hosted a workshop called ‘Digital Preservation Sustainability on the EU Policy Level’. The workshop was held in connection with the Digital Libraries 2014 conference in London.

The room for the workshop was ‘The Great Hall’ at City University London – a lovely, old, large room with a stage at one end and plenty of space for the 12 stalls featuring the invited projects and the 85 attendees.

The first half of the workshop was dedicated to a panel session. The three panellists each had 10-15 minutes to present their views on both the achievements and the future of digital preservation, followed by a discussion moderated by Hildelies Balk from the Royal Library of the Netherlands, with real-time visualisations made by Elco van Staveren.

‘As a community we have failed’

With these words David Giaretta, Director of APARSEN (see presentation and visualisation), pinpointed the fact that there will be no EU funding for digital preservation research in the future and that the EU expects to see some results from the €100 million already distributed. The EU sees data as the new gold, and we should start mining it! One big difference between gold and data, though, is that gold does not perish, whereas data does.

The important thing to do is to create some results – ‘a rising tide floats all boats’ – if we can at least show something that can be used, that will help fund the rest of the preservation work.

Let’s climb the wall!

David Giaretta was followed by Ross King, Project Coordinator of SCAPE (see presentation and visualisation), who started his presentation with a comparison of the two EU projects Planets and SCAPE - the latter being a follow-up to the first. Many issues already addressed in Planets were further explored and developed in SCAPE; the biggest difference, and the focal point of SCAPE, was scalability – handling large volumes, scaling up planning processes, more automation and so on.

For Ross King there were three lessons learned from working on Planets and SCAPE:

  • there is still a wall between Production on one side and Research & Development on the other,
  • the time issue – although libraries, archives etc. work with long-term horizons, most businesses have a planning horizon of five years or less,
  • format migration may not be as important as we thought it was.

Who will pay?

Ed Fay, Director of the Open Planets Foundation (see presentation and visualisation), opened with the message that, by working in digital preservation, we have a great responsibility to help define the future of information management. With no future EU-funded projects, community collaboration at all levels is needed more than ever. Shared services and infrastructure are essential.

The Open Planets Foundation was founded after the Planets project to help sustain that project's results. Together with SCAPE and other projects, OPF is now trying to mature tools so they can be widely adopted and sustained (see the SCAPE Final Sustainability Plan).

There is a lot of initiative and momentum, from DPC, NDIIPP and JISC to OPF and APA - but what will the future look like? And how do we ensure that these initiatives are aligned all the way up to the policy level?

Sustainability is about working out who pays – and when…

If digital preservation were delivering business objectives we wouldn't be here talking about sustainability - it would simply be embedded in how organisations work. We are not there yet!

A diverse landscape with many facets

The panellists’ presentations were followed by questions from the audience, mostly concerned with the approach to risk. During the discussion it was noted that although the three presenters see the digital landscape from different viewpoints, they all agree on its importance. People do need to preserve digital material and to get value from it. The DP initiatives and organisations are the shop window; their members have lots of skills that the market could benefit from.

The audience was asked whether they find it important to have a DP community - apparently nobody disagreed! And it seemed that almost everyone was a member of OPF, APARSEN or another similar initiative.

There are not many H2020 digital preservation bids. In earlier days everybody had several proposals running in these rounds, but this is not catastrophic – good research has been done, and now we want its products to be consolidated. We would like to reach a point where digital preservation is an infrastructure service as obvious as your email. But we are not there yet!

Appraisal and ingest are still not solved - we need to choose which data to preserve, especially when we are talking about petabytes!

The discussion was wrapped up with a walk-through of the visualisation made by Elco van Staveren.

An overall comment was that even though there is no money directed specifically towards digital preservation, there is still lots of money for problems that digital preservation can solve. It is important that the digital preservation community thinks of itself NOT as the problem but as part of the solution. And although the visualisation is mostly about sustainability, risk still plays an important part. If you cannot explain the risk of doing nothing, you cannot persuade anyone to pay!

Clinic with experts

After the panel and the one-minute project elevator pitches there was a clinic session, at which the different projects could present themselves and their results at their stalls. A special clinic table was staffed in turn by experts from different areas of digital preservation.

This was the time to meet a lot of different people from the digital preservation field, to catch up and to build new relationships. For a photo impression of the workshop see: http://bit.ly/1u7Lmnq.

Preservation Topics: SCAPE
Categories: Planet DigiPres