Feed aggregator

WCT2-EX1 Comparing newly archived Web sites against a verified copy (single node)

SCAPE Wiki Activity Feed - 1 October 2014 - 11:54am

Page edited by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T11:54:11Z
Categories: SCAPE

Comparison of web snapshots -IM

OPF Wiki Activity Feed - 1 October 2014 - 10:03am

Page added by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T10:03:49Z

Experiment Overview

OPF Wiki Activity Feed - 1 October 2014 - 9:42am

Page edited by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T09:42:12Z

EVAL 3 - Web Pages comparison - Pagelyzer

OPF Wiki Activity Feed - 1 October 2014 - 9:42am

Page edited by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T09:42:12Z

EVAL 2 - Web Pages comparison - Pagelyzer

OPF Wiki Activity Feed - 1 October 2014 - 9:41am

Page edited by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T09:41:17Z

EVAL 1 - Web Pages comparison - Pagelyzer

OPF Wiki Activity Feed - 1 October 2014 - 9:39am

Page edited by Leïla Medjkoune

Leïla Medjkoune, 2014-10-01T09:39:24Z

QCTools: Open Source Toolset to Bring Quality Control for Video within Reach

The Signal: Digital Preservation - 30 September 2014 - 12:01pm

In this interview, part of the Insights Interview series, FADGI talks with Dave Rice and Devon Landes about the QCTools project.

In a previous blog post, I interviewed Hannah Frost and Jenny Brice about the AV Artifact Atlas, one of the components of Quality Control Tools for Video Preservation, an NEH-funded project which seeks to design and make available community-oriented products to reduce the time and effort it takes to perform high-quality video preservation. The less “eyes on” time QC work takes, the more time can be redirected towards quality control and assessment of the digitized content most deserving of attention.

QCTools’ Devon Landes

In this blog post, I interview archivists and software developers Dave Rice and Devon Landes about the latest release version of the QCTools, an open source software toolset to facilitate accurate and efficient assessment of media integrity throughout the archival digitization process.

Kate:  How did the QCTools project come about?

Devon:  There was a recognized need for accessible & affordable tools out there to help archivists, curators, preservationists, etc. in this space. As you mention above, manual quality control work is extremely labor and resource intensive but a necessary part of the preservation process. While there are tools out there, they tend to be geared toward (and priced for) the broadcast television industry, making them out of reach for most non-profit organizations. Additionally, quality control work requires a certain skill set and expertise. Our aim was twofold: to build a tool that was free/open source, but also one that could be used by specialists and non-specialists alike.

QCTools’ Dave Rice

Dave:  Over the last few years a lot of building blocks for this project were falling into place. Bay Area Video Coalition had been researching and gathering samples of digitization issues through the A/V Artifact Atlas project and meanwhile FFmpeg had made substantial developments in their audiovisual filtering library. Additionally, open source technology for archival and preservation applications has been finding more development, application, and funding. Lastly, the urgency related to the obsolescence issues surrounding analog video and lower costs for digital video management meant that more organizations were starting their own preservation projects for analog video, creating a greater need for an open source response to quality control issues. In 2013, the National Endowment for the Humanities awarded BAVC a Preservation and Access Research and Development grant to develop QCTools.

Kate: Tell us what’s new in this release. Are you pretty much sticking to the plan or have you made adjustments based on user feedback that you didn’t foresee? How has the pilot testing influenced the products?

QCTools provides many playback filters. Here the left window shows a frame with the two fields presented separately (revealing the lack of chroma data in field 2). The right window here shows the V plane of the video per field to show what data the deck is providing.

Devon:  The users’ perspective is really important to us and being responsive to their feedback is something we’ve tried to prioritize. We’ve had several user-focused training sessions and workshops which have helped guide and inform our development process. Certain processing filters were added or removed in response to user feedback; obviously UI and navigability issues were informed by our testers. We’ve also established a GitHub issue tracker to capture user feedback which has been pretty active since the latest release and has been really illuminating in terms of what people are finding useful or problematic, etc.

The newest release has quite a few optimizations to improve speed and responsiveness, some additional playback & viewing options, better documentation and support for the creation of an xml-format report.

Dave:  The most substantial example of going ‘off plan’ was the incorporation of video playback. Initially the grant application focused on QCTools as a purely analytical tool which would assess and present quantifications of video metrics via graphs and data visualization. Initial work delved deeply into identifying a methodology for picking out the right metrics to find what could be unnatural in digitized analog video (such as pixels too dissimilar from their temporal neighbors, or the near-exact repetition of pixel rows, or discrepancies in the rate of change over time between the two video fields). When presenting the earliest prototypes of QCTools to users, a recurring question was “How can I see the video?” We redesigned the project so that QCTools would present the video alongside the metrics, along with various scopes, meters and visual tools, so that now it has a visual and an analytic side.

Kate:   I love that the Project Scope for QCTools quotes both the Library of Congress’s Sustainability of Digital Formats and the Federal Agencies Digitization Guidelines Initiative as influential resources which encourage best practices and standards in audiovisual digitization of analog material for users. I might be more than a little biased but I agree completely. Tell me about some of the other resources and communities that you and the rest of the project team are looking at.

Here the QCTools vectorscope shows a burst of illegal color values. With the QCTools display of plotted graphs this corresponds to a spike in the maximum saturation (SATMAX).

Devon: Bay Area Video Coalition connected us with a group of testers from various backgrounds and professional environments, so we’ve been able to tap into a pretty varied community in that sense. Their A/V Artifact Atlas has also been an important resource for us and was really the starting point from which QCTools was born.

Dave:  This project would not at all be feasible without the existing work of FFmpeg. QCTools utilizes FFmpeg for all decoding, playback, metadata expression and visual analytics. The QCTools data format is an expression of FFmpeg’s ffprobe schema, which appeared to be one of the only audiovisual file format standards that could efficiently store masses of frame-based metadata.
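
For readers who want to experiment, the kind of per-frame statistics QCTools reports can be pulled directly with FFmpeg’s ffprobe and its signalstats filter. A minimal sketch, with the input file name as a placeholder and only two of the many available metrics selected:

# Sketch: per-frame saturation statistics via ffprobe and the signalstats filter
# (-of xml requests an XML report; input.mov is a placeholder)
ffprobe -f lavfi -i "movie=input.mov,signalstats" \
        -show_frames \
        -show_entries frame_tags=lavfi.signalstats.SATMAX,lavfi.signalstats.SATAVG \
        -of xml > report.xml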

Kate:   What are the plans for training and documentation on how to use the product(s)?

Devon:  We want the documentation to speak to a wide range of backgrounds and expertise, but it is a challenge to do that and as such it is an ongoing process. We had a really helpful session during one of our tester retreats where users directly and collaboratively made comments and suggestions to the documentation; because of the breadth of their experience it really helped to illuminate gaps and areas for improvement on our end. We hope to continue that kind of engagement with users and also offer them a place to interact more directly with each other via a discussion page or wiki. We’ve also talked about the possibility of recording some training videos and hope to better incorporate the A/V Artifact Atlas as a source of reference in the next release.

Kate:   What’s next for QCTools?

Dave:   We’re presenting the next release of QCTools at the Association of Moving Image Archivists Annual Meeting on October 9th, for which we anticipate supporting better summarization of digitization issues per file in a comparative manner. After AMIA, we’ll focus on audio and the incorporation of audio metrics via FFmpeg’s EBU R128 filter. QCTools has been integrated into workflows at BAVC, Dance Heritage Coalition, MoMA, Anthology Film Archives and Die Österreichische Mediathek, so the QCTools issue tracker has been filling up with suggestions which we’ll be tackling in the upcoming months.
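
The audio metrics mentioned here can already be explored on their own with FFmpeg’s ebur128 filter; a rough sketch, with the file name as a placeholder:

# Sketch: EBU R128 loudness measurement with FFmpeg's ebur128 filter
# (the loudness summary is printed to stderr; no output file is written)
ffmpeg -i input.wav -af ebur128 -f null -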

Categories: Planet DigiPres

Scape Demonstration: Migration of audio using xcorrSound

SCAPE Blog Posts - 30 September 2014 - 10:32am

As part of the SCAPE project, we did a large-scale experiment and evaluation of audio migration, using the xcorrSound tool waveform-compare for content comparison in the quality assurance step.

I presented the results at the demonstration day at the State and University Library; see the SCAPE Demo Day at Statsbiblioteket blog post by Jette G. Junge.

And now I present the screencast of this demonstration:

SCAPE demonstration of audio migration using xcorrSound in QA

The brief summary is:

  • New tool: using xcorrSound waveform-compare, we can automate audio file content comparison for quality assurance (a per-file command-line sketch follows below)
  • Scalability: using Hadoop we can migrate our 20 TB radio broadcast mp3 collection to the wav file format in a month (on the current SB Hadoop cluster set-up) rather than in years :)
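
For a single file, the heart of that workflow looks roughly like the sketch below. This is only an illustration: the exact waveform-compare invocation and the use of mpg123 as a second, independent decoder are assumptions, not a description of the production SCAPE Hadoop job.

# Sketch of a per-file migration plus QA step (file names are placeholders)
ffmpeg -i broadcast.mp3 broadcast.wav          # the migration itself: mp3 -> wav
mpg123 -w reference.wav broadcast.mp3          # decode the original with an independent decoder
waveform-compare broadcast.wav reference.wav   # xcorrSound: cross-correlate the two waveforms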

And just a few notes:

  • the large-scale experiment did not include property extraction and comparison, but we are confident (based on an earlier experiment) that we can do this effectively using ffprobe
  • the large-scale experiment also did not include file format validation. We made an early decision not to use JHOVE2, based on performance. The open question is whether we are satisfied with the "pseudo validation" that the ffprobe property extraction and the xcorrSound waveform-compare cross-correlation algorithm were both able to read the file...

Oh, and the slides are also on Slideshare: Migration of audio files using Hadoop.

Preservation Topics: SCAPE
Categories: SCAPE

Beyond Us and Them: Designing Storage Architectures for Digital Collections 2014

The Signal: Digital Preservation - 29 September 2014 - 5:39pm

The following post was authored by Erin Engle, Michelle Gallinger, Butch Lazorchak, Jane Mandelbaum and Trevor Owens from the Library of Congress.

The Library of Congress held the 10th annual Designing Storage Architectures for Digital Collections meeting September 22-23, 2014. This meeting is an annual opportunity for invited technical industry experts, IT  professionals, digital collections and strategic planning staff and digital preservation practitioners to discuss the challenges of digital storage and to help inform decision-making in the future. Participants come from a variety of government agencies, cultural heritage institutions and academic and research organizations.

The DSA Meeting. Photo credit: Peter Krogh/DAM Useful Publishing.

Throughout the two days of the meeting the speakers took the participants back in time and then forward again. The meeting kicked off with a review of the origins of the DSA meeting. It started ten years ago with a gathering of Library of Congress and external experts who discussed requirements for digital storage architectures for the Library’s Packard Campus of the National Audio-Visual Conservation Center. Now, ten years later, the speakers included representatives from Facebook and Amazon Web Services, both of which manage significant amounts of content and neither of which existed in 2004 when the DSA meeting started.

The theme of time passing continued with presentations by strategic technical experts from the storage industry who began with an overview of the capacity and cost trends in storage media over the past years. Two of the storage media being tracked weren’t on anyone’s radar in 2004, but loom large for the future – flash memory and Blu-ray disks. Moving from the past quickly to the future, the experts then offered predictions, with the caveat that predictions beyond a few years are predictably unpredictable in the storage world.

Another facet of time – “back to the future” – came up in a series of discussions on the emergence of object storage in up-and-coming hardware and software products. With object storage, hardware and software can deal with data objects (like files), rather than physical blocks of data. This is a concept familiar to those in the digital curation world, and it turns out that it was also familiar to long-time experts in the computer architecture world, because the original design for this was done ten years ago. Several of the key meeting presentations addressed object storage; copies are available from the meeting page linked at the end of this post.

Several speakers talked about the impact of the passage of time on existing digital storage collections in their institutions and the need to perform migrations of content from one set of hardware or software to another as time passes.  The lessons of this were made particularly vivid by one speaker’s analogy, which compared the process to the travails of someone trying to manage the physical contents of a car over one’s lifetime.

Even more vivid was the “Cost of Inaction” calculator, which provides black-and-white evidence of the costs of not preserving analog media over time, starting from the undeniable fact that you have to pick an actual date in the future as the “doomsday” when all your analog media will be unreadable.

The DSA Meeting. Photo Credit: Trevor Owens

Several persistent time-related themes engaged the participants in lively interactive discussions during the meeting. One topic was practical methods for checking the data integrity of content in digital collections. This concept, called fixity, has been a common topic of interest in the digital preservation community. Similarly, a thread of discussion on predicting and dealing with failure and data loss over time touched on a number of interesting concepts, including “anti-entropy,” a type of computer “gossip” protocol designed to query, detect and correct damaged distributed digital files. Participants agreed it would be useful to find a practical approach to identifying and quantifying types of failures. Are the failures relatively regular but small enough that the content can be reconstructed? Or are the data failures highly irregular but catastrophic in nature?
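
At its simplest, a fixity check is nothing more than recording checksums and re-verifying them later; a minimal sketch with standard command-line tools, using placeholder paths:

# Record checksums at ingest...
sha256sum collection/*.tif > manifest.sha256
# ...and re-verify later; any line reported as FAILED signals a fixity problem
sha256sum -c manifest.sha256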

Another common theme was how to test and predict the lifetime of storage media. For example, how would one test the lifetime of media projected to last 1000 years without having a time-travel machine available? Participants agreed to continue the discussions of these themes over the next year with the goal of developing practical requirements for communication with storage and service providers.

The meeting closed with presentations from vendors working on the cutting edge of new archival media technologies. One speaker dealt with questions about the lifetime of media by serenading the group with accompaniment from a 32-year-old audio CD copy of Pink Floyd’s “Dark Side of the Moon.” The song “Us and Them” underscored how the DSA meeting strives to bridge the boundaries placed between IT conceptions of storage systems and architectures and the practices, perspectives and values of storage and preservation in the cultural heritage sector. The song playing back from three-decade-old media on a contemporary device was a fitting symbol of the objectives of the meeting.

Background reading (PDF) was circulated prior to the meeting and the meeting agenda and copies of the presentations are available at http://www.digitalpreservation.gov/meetings/storage14.html.

Categories: Planet DigiPres

Siegfried - a PRONOM-based, file format identification tool

Open Planets Foundation Blogs - 27 September 2014 - 7:52am

Ok. I know what you're thinking. Do we really need another PRONOM-based, file format identification tool?

A year or so ago I might have said "no" myself. In DROID and FIDO, we are already blessed with two brilliant tools. In my workplace, we're very happy users of DROID. We trust it as the reference implementation of PRONOM, it is fast, and it has a rich GUI with useful filtering and reporting options. I know that FIDO has many satisfied users too: it is also fast, great for use at the command line, and, as a Python program, is easy to integrate with digital preservation workflows (such as Archivematica). The reason I wrote Siegfried wasn't to displace either of these tools; it was simply to scratch an itch: when I read the blog posts announcing FIDO a few years ago, I was intrigued by the different matching strategies used (FIDO's regular expressions and DROID's Boyer-Moore-Horspool string searching) and wondered what other approaches might be possible. I started Siegfried simply as a hobby project to explore whether a multiple-string search algorithm, Aho-Corasick, could perform well at matching signatures.

Having dived down the file format identification rabbit hole, my feeling now is that the more PRONOM-based, file format identification tools we have, the better. Multiple implementations of PRONOM make PRONOM itself stronger. For one thing, having different algorithms implement the same signatures is a great way of validating those signatures. Siegfried is tested using Ross Spencer's skeleton suite (a fantastic resource that made developing Siegfried much, much easier). During development of Siegfried, Ross and I were in touch about a number of issues thrown up during that testing, and these issues led to a small number of updates to PRONOM. I imagine the same thing happened for FIDO. Secondly, although many institutions use PRONOM, we all have different needs, and different tools suit different use cases differently. For example, for a really large set of records, with performance the key consideration, your best bet would probably be Nanite (a Hadoop implementation of DROID). For smaller projects, you might favour DROID for its GUI or FIDO for its Archivematica integration. I hope that Siegfried might find a niche too, and it has a few interesting features that I think commend it.

Simple command-line interface

I've tried to design Siegfried to be the least intimidating command-line tool possible. You run it with:

sf FILE
sf DIR

There are only two other commands, `-version` and `-update` (to update your signatures). There aren't any options: directory recursion is automatic, there are no size limits on search buffers, and output is YAML only. Why YAML? It is a structured format, so you can do interesting things with it, and it has a clean syntax that doesn't look horrible in a terminal.
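
A typical first run might look like this (a sketch; the directory path is a placeholder):

sf -update                 # refresh the signature file
sf ~/scans > scans.yaml    # recursion is automatic; redirect the YAML report to a file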

Good performance, without buffer limits

I'm one of those DROID users who always sets the buffer size to -1, just in case I miss any matches. The trade-off is that this can make matching a bit slower. I understand the use of buffer limits (options to limit the bytes scanned in a file) in DROID and FIDO - the great majority of signatures are found close to the beginning or end of the file, and IO has a big impact on performance - but you need to be careful with them. Buffer limits can confuse users ("I can see a PRONOM signature for PDF/A, why isn't it matching?"). The use of buffer limits also needs to be documented if you want to accurately record how puids were assigned. This is because you are effectively changing the PRONOM signatures by overriding any variable offsets. In other words, you can't just say, "matched 'fmt/111' with DROID signatures v 77", but now need to say, "matched 'fmt/111' with DROID signatures v 77 and with a maximum BOF offset of 32000 and EOF offset of 16000".

Siegfried is designed so that it doesn't need buffer limits for good performance. Instead, Siegfried searches as much, or as little, of a file as it needs to in order to satisfy itself that it has obtained the best possible match. Because Siegfried matches signatures concurrently, it can apply PRONOM's priority rules during the matching process, rather than at the end. The downside of this approach is that while average performance is good, there is variability: Siegfried slows down for files (like PDFs) where it can't be sure what the best match is until much, or all, of the file has been read.

Detailed basis information

As well as telling you what it matched, Siegfried will also report why it matched. Where byte signatures are defined, this "basis" information includes the offset and length of byte matches. While many digital archivists won't need this level of justification, this information can be useful. It can be a great debugging tool if you are creating new signatures and want to test how they are matching. It might also be useful for going back and re-testing files after PRONOM signature updates: if signatures change and you have an enormous quantity of files that need to have their puids re-validated, then you could use this offset information to just test the relevant parts of files. Finally, by aggregating this information over time, it may also be possible to use it to refine PRONOM signatures: for example, are all PDF/A's matching within a set distance from the EOF? Could that variable offset be changed to a fixed one?

Where can I get my hands on it?

You can download Siegfried here. You can also try Siegfried, without downloading it, by dragging files onto the picture of Wagner's Siegfried on that page. The source is hosted on Github if you prefer to compile it yourself (you just need Go installed). Please report any bugs or feature requests there. It is still in beta (v 0.5.0) and probably won't get a version one release until early next year. I wouldn't recommend using it as your only form of file format identification until then (unless you are brave!). But please try it and send feedback.

Finally, I'd like to say thanks very much to the TNA for PRONOM and DROID and to Ross Spencer for his skeleton suite(s).

Preservation Topics: Identification
Categories: Planet DigiPres

Library to Launch 2015 Class of NDSR

The Signal: Digital Preservation - 26 September 2014 - 7:05pm

Last year’s class of Residents, along with LC staff, at the ALA Mid-winter conference

The Library of Congress Office of Strategic Initiatives, in partnership with the Institute of Museum and Library Services, has recently announced the 2015 National Digital Stewardship Residency program, which will be held in the Washington, DC area starting in June 2015.

As you may know (NDSR was well represented on the blog last year), this program is designed for recent graduates with an advanced degree who are interested in the field of digital stewardship. This will be the fourth class of residents for this program overall – the first, in 2013, was held in Washington, DC, and the second and third classes, starting in September 2014, are being held concurrently in New York and Boston.

The five 2015 residents will each be paired with an affiliated host institution for a 12-month program that will provide them with an opportunity to develop, apply and advance their digital stewardship knowledge and skills in real-world settings. The participating hosts and projects for the 2015 cohort will be announced in early December and the applications will be available  shortly after.  News and updates will be posted to the NDSR webpage, and here on The Signal.

In addition to providing great career benefits for the residents, the successful NDSR program also provides benefits to the institutions involved as well as the library and archives field in general. For an example of what the residents have accomplished in the past, see this previous blog post about a symposium held last spring, organized entirely by last year’s residents.

Another recent success for the program – all of the former residents now have substantive jobs or fellowships in a related field.  Erica Titkemeyer, a former resident who worked at the Smithsonian Institution Archives, now has a position at the University of North Carolina at Chapel Hill as the Project Director and AV Conservator for the Southern Folklife Collection. Erica said the Residency provided the opportunity to utilize skills gained through her graduate education and put them to practical use in an on-the-job setting.  In this case, she was involved in research and planning for preservation of time-based media art at the Smithsonian.

Erica notes some other associated benefits. “I had a number of chances to network within the D.C. area through the Library of Congress, external digital heritage groups and professional conferences as well,” she said. “I have to say, I am most extremely grateful for having had a supportive group of fellow residents. The cohort was, and still remains, a valuable resource for knowledge and guidance.”

This residency experience no doubt helped Erica land her new job, one that includes a lot of responsibility for digital library projects. “Currently we are researching options and planning for mass-digitization of the collection, which contains thousands of recordings on legacy formats pertaining to the music and culture of the American South,” she said.

George Coulbourne, Executive Program Officer at the Library of Congress, remarked on the early success of this program: “We are excited with the success of our first class of residents, and look forward to continuing this success with our upcoming program in Washington, DC. The experience gained by the residents along with the tangible benefits for the host institution will help set the stage for a national residency model in digital preservation that can be replicated in various public and private sector environments.”

So, this is a heads-up to graduate students and all interested institutions – start thinking about how you might want to participate in the 2015 NDSR.  Keep checking our website and blog for updated information, applications, dates, etc. We will post this information as it becomes available.

(See the Library’s official press release.)

Categories: Planet DigiPres

In defence of migration

Open Planets Foundation Blogs - 26 September 2014 - 3:38pm

There is a trend in digital preservation circles to question the need for migration. The argument varies a little from proponent to proponent, but in essence it states that software exists (and will continue to exist) that will read (and perform requisite functions, e.g., render) old formats. Hence, proponents conclude, there is no need for migration. I had thought it was a view held by a minority, but at a recent workshop it became apparent that it has been accepted by many.

However, I've never thought this was a very strong argument. I've always seen a piece of software that can deal with not only new formats but also old formats as really just a piece of software that can deal with new formats with a migration tool seamlessly bolted onto the front of it. In essence, it is like saying I don't need a migration tool and a separate rendering tool because I have a combined migration and rendering tool. Clearly that's OK, but it does not mean you're not performing a migration.

As I see it, whenever a piece of software is used to interpret a non-native format it will need to perform some form of transformation from the information model inherent in the format to the information model used in the software.  It can then perform a number of subsequent operations, e.g., render to the screen or maybe even save to a native format of that software.  (If the latter happens this would, of course, be a migration.) 

Clearly the way software behaves is infinitely variable but it seems to me that it is fair to say that there will normally be a greater risk of information loss in the first operation (the transformation between information models) than in subsequent operations that are likely to utilise the information model inherent in the software (be it rendering or saving in the native format).  Hence, if we are concerned with whether or not we are seeing a faithful representation of the original it is the transformation step that should be verified. 

This is where using a separate migration tool comes into its own (at least in principle).  The point is that it allows an independent check to be made of the quality of the transformation to take place (by comparing the significant properties of the files before and after).  Subsequent use of the migrated file (e.g., by a rendering tool) is assumed to be lossless (or at least less lossy) since you can choose the migrated format so that it is the native format of the tool you intend to use subsequently (meaning when the file is read no transformation of information model is required). 
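
As a toy illustration of such an independent check (deliberately trivial, and not a real QA workflow), one could compare a couple of properties of a file before and after migration:

# Sketch: TIFF to PNG migration followed by a naive significant-property comparison
# (ImageMagick's convert and identify; a real workflow would compare a much richer property set)
convert page.tif page.png
identify -format "%w %h %[colorspace]\n" page.tif > before.txt
identify -format "%w %h %[colorspace]\n" page.png > after.txt
diff before.txt after.txt && echo "basic properties preserved"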

However, I would concede that there are some pragmatic things to consider...

First of all, migration either has a cost (if it requires the migrated file to be stored) or is slow (if it is done on demand).  Hence, there are probably cases where simply using a combined migration and rendering tool is a more convenient solution and might be good enough.

Secondly, is migration validation worth the effort?  Certainly it is worth simply testing, say, a rendering tool with some example files before deciding to use it, and most of the time that should be sufficient to determine that the tool works without detailed validation. However, we have seen cases where we detected uncommon issues in common migration libraries, so migration validation does detect issues that would go unnoticed if the same libraries were used in a combined migration and rendering tool.

Thirdly, is migration validation comprehensive enough?  The answer depends on the format, but for some (even common) formats it is clear that more comprehensive tools would do a better job. Of course the hope is that this will continually improve over time.

So, to conclude, I do see migration as a valid technique (and in fact a technique that almost everyone uses, even if they don't realise it). I think one of the aims of the digital preservation community should be to provide an intellectually sound view of what constitutes a high-quality migration (e.g., through a comprehensive view of significant properties across a wide range of object types). It might be that real-life tools provide some pragmatic approximation to this idealistic vision (potentially using short cuts like a combined migration and rendering tool), but we should at least understand and be able to express what these short cuts are.

I hope this post helps to generate some useful debate.

Rob

Categories: Planet DigiPres

Six ways to decode a lossy JP2

Open Planets Foundation Blogs - 26 September 2014 - 1:06pm

Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting paper that investigated three different JPEG 2000 codecs, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs encode (compress) an image, but also in the decoding phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.

A limitation of the paper's methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped in the analysis. Thus, it's not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the decode results of different codecs really are.

An experiment

To find out, I ran a simple experiment:

  1. Encode a TIFF image to JP2.
  2. Decode the JP2 back to TIFF using different decoders.
  3. Compare the decode results using some similarity measure.
Codecs used

I used the following codecs:

  • OpenJPEG (opj_decompress)
  • Kakadu (kdu_expand), in both default and precise mode
  • IrfanView (with the LuraTech JPEG 2000 plugin)
  • ImageMagick
  • GraphicsMagick

Note that GraphicsMagick still uses the JasPer library for JPEG 2000. ImageMagick now uses OpenJPEG (older versions used JasPer). IrfanView's JPEG 2000 plugin is made by LuraTech.

Creating the JP2

First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:

opj_compress -i krant.tif -o krant_oj_4.jp2 -r 4 -I -p RPCL -n 7 -c [256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] -b 64,64

Decoding the JP2

Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:

Codec            Command line
opj20            opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif
kakadu           kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif
kakadu-precise   kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise
irfan            Used GUI
im               convert krant_oj_4.jp2 krant_oj_4_im.tif
gm               gm convert krant_oj_4.jp2 krant_oj_4_gm.tif

This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and also with the -precise switch, which "forces the use of 32-bit representations".

Overall image quality

As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:

Decoder          PSNR
opj20            48.08
kakadu           48.01
kakadu-precise   48.08
irfan            48.08
im               48.08
gm               48.07

So relative to the source image these results are only marginally different.

Similarity of decoded images

But let's have a closer look at how similar the different decoded images are. I did this by computing PSNR values of all possible decoder pairs. This produced the following matrix:

Decoder          opj20   kakadu   kakadu-precise   irfan   im      gm
opj20            -       57.52    78.53            79.17   96.35   64.43
kakadu           57.52   -        57.51            57.52   57.52   57.23
kakadu-precise   78.53   57.51    -                79.00   78.53   64.52
irfan            79.17   57.52    79.00            -       79.18   64.44
im               96.35   57.52    78.53            79.18   -       64.43
gm               64.43   57.23    64.52            64.44   64.43   -

Note that, unlike the table in the previous section, these PSNR values are only a measure of the similarity between the different decoder results. They don't directly say anything about quality (since we're not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:

  • Group A: all combinations of OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode, all with a PSNR of > 78 dB.
  • Group B: all remaining decoder combinations, with a PSNR of < 64 dB.

What this means is that OpenJPEG, Irfanview, ImageMagick and Kakadu in precise mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group A, and about 12 % in group B. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.
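
This kind of pairwise comparison can be reproduced with, for example, ImageMagick's compare tool (a sketch, reusing the file names from the table above; not necessarily the tool used for the figures in this post):

# Pairwise similarity of two decoded TIFFs; compare writes the metric to stderr
# and "null:" discards the difference image
compare -metric PSNR krant_oj_4_oj.tif krant_oj_4_im.tif null:
compare -metric PAE  krant_oj_4_oj.tif krant_oj_4_gm.tif null: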

I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups A and B were even more pronounced, with PAE values in group B reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.

What does this say about decoding quality?

It would be tempting to conclude from this that the codecs that make up group A provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values relative to the source TIFF (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in precise mode lowered the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).

I'm still somewhat surprised that even in group A the decoding results aren't identical, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.

Conclusions

OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in precise mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (migrate lossy JP2 to something else) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating PSNR for a JP2 with ImageMagick and GraphicsMagick results in two different outcomes!
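
For example (a sketch reusing the file names from above), these two commands read the same JP2 with different decoders under the hood and can therefore report slightly different PSNR values against the source TIFF:

compare -metric PSNR krant.tif krant_oj_4.jp2 null:   # ImageMagick (OpenJPEG decoder in current versions)
gm compare -metric PSNR krant.tif krant_oj_4.jp2      # GraphicsMagick (JasPer decoder)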

For losslessly compressed JP2s, the decode results for all tested codecs are 100% identical1.

This tentative analysis does not support any conclusions on which decoders are 'better'. That would need additional tests with more images. I don't have time for that myself, but I'd be happy to see others have a go at this!

Link

William Palmer, Peter May and Peter Cliff: An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration (Proceedings, iPres 2013)

  1. Identical in terms of pixel values; for this analysis I didn't look at things such as embedded ICC profiles, which not all encoders/decoders handle well

Taxonomy upgrade extras: JPEG2000, JP2
Preservation Topics: Migration, Tools, SCAPE
Categories: Planet DigiPres
