Paper on JPEG 2000 for preservation

The JPEG 2000 compression standard is steadily gaining popularity in the archival community. Several large (national) libraries now use the JP2 format (which corresponds to Part 1 of the standard) as the master format in mass digitisation projects. However, some aspects of the JP2 file format are defined in ways that are open to multiple interpretations. This applies to the embedding of ICC profiles (which are used to define colour space information) and to the definition of grid resolution. This situation has led to a number of interoperability issues that pose potential risks for long-term preservation.


I recently addressed this in a paper that has just been published in D-Lib Magazine. An earlier version of the paper was used as a 'defect report' by the JPEG committee. The paper gives a detailed description of the problems, and shows to what extent the most widely-used JPEG 2000 encoders are affected by these issues.

The paper also suggests some possible solutions. Importantly, none of the problems found requires any changes to the actual file format; rather, some features simply need to be defined slightly differently. In the case of the ICC profile issue, this boils down to allowing a widely used class of ICC profiles that are currently prohibited in JPEG 2000. The resolution issue could be fixed by a more specific definition of the existing resolution fields.
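
By way of illustration (this sketch is not part of the paper), the two header fields at stake can be inspected with a few lines of Python. The box layout follows Part 1 of the standard; the file name and the minimal error handling are purely illustrative.

    import struct

    def iter_boxes(data, offset=0, end=None):
        # Yield (box type, payload) for each JP2 box in data[offset:end].
        end = len(data) if end is None else end
        while offset < end:
            length, btype = struct.unpack('>I4s', data[offset:offset + 8])
            header = 8
            if length == 1:                    # extended 64-bit box length
                length = struct.unpack('>Q', data[offset + 8:offset + 16])[0]
                header = 16
            elif length == 0:                  # box runs to the end of the file
                length = end - offset
            yield btype, data[offset + header:offset + length]
            offset += length

    def inspect_jp2(path):
        with open(path, 'rb') as f:
            data = f.read()
        for btype, payload in iter_boxes(data):
            if btype != b'jp2h':               # JP2 Header superbox
                continue
            for sub, body in iter_boxes(payload):
                if sub == b'colr':
                    meth = body[0]             # 1 = enumerated, 2 = restricted ICC
                    print('colr METH:', meth)
                    if meth == 2:
                        print('embedded ICC profile:', len(body) - 3, 'bytes')
                elif sub == b'res ':           # Resolution superbox
                    for res, rbody in iter_boxes(body):
                        if res == b'resc':     # Capture Resolution box
                            vn, vd, hn, hd, ve, he = struct.unpack('>HHHHbb', rbody[:10])
                            print('vertical resolution  :', vn / vd * 10 ** ve, 'pixels/metre')
                            print('horizontal resolution:', hn / hd * 10 ** he, 'pixels/metre')

    inspect_jp2('example.jp2')                 # hypothetical file name

A METH value of 2 marks an embedded ("restricted") ICC profile, which is the class whose definition the amendment would widen; the 'resc' (and 'resd') fields are the resolution fields whose meaning would be defined more specifically.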


Both issues will be addressed in an amendment to the standard. Rob Buckley provides more details on this (along with some interesting background information on colour space support in JP2) in a recent blog entry on the Wellcome Library's JPEG 2000 blog. As Rob puts it:

"The final outcome of all this will be a JP2 file format standard that aligns with current practice; supports RGB spaces such as Adobe RGB 1998, ProPhoto RGB and eci RGB v2; and provides a smooth migration path from TIFF masters as JP2 increasingly becomes used as an image preservation format."

So, some relatively small adjustments to the standard could result in a significant improvement of the suitability of JP2 for preservation purposes.



Since various institutions are already using JPEG 2000, the paper also provides some practical recommendations that may help mitigate the risks for existing collections.

 

Link to paper: JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format

 

Johan van der Knijff

KB / National Library of the Netherlands

Comments

Andy Jackson

I think it's worth noting that, since your original presentation, the activity on OpenJPEG has picked up a bit (possibly driven by the GIS community). The last release was 1.4 in January of this year, and the mailing list has been ticking over nicely. See http://www.openjpeg.org/ for details.

I'm not sure we want to expend that much effort on the encoder side, as most partners are willing to pay for a high-performance encoder during digitisation. However, we may want to consider investing in open source decoders (both functionality and performance), so that we know we can maintain access to our images.

Johan van der Knijff

I agree with putting our effort on the decoding side. Back when I did my presentation at the Wellcome Library seminar I was still seeing the lack of open-source encoding options as a problem. Since then I've come to realise that this is really pretty much irrelevant: several good commercial solutions are available, and the costs involved aren't really an issue relative to typical budgets for digitisation projects. In addition, the quality of commercial encoders has improved as well (e.g. the Luratech encoder now enables embedding of ICC profiles in JP2, which wasn't possible a year ago).

For long-term access, I completely agree that decoding is the main problem. If, say, in 30 years' time we needed to migrate our scanned newspaper masters (which are huge) to some other format using only the open-source libraries available now (assuming for a moment a worst-case scenario in which JPEG 2000 is obsolete and unsupported by any commercial vendor by that time), we would be in pretty serious trouble. Investing in high-performance open decoding options now would provide an important safeguard against preservation-related risks in the future.
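
As a minimal sketch of what such an open-source decode-and-migrate step might look like, assuming a Pillow build with OpenJPEG support (file names are hypothetical):

    from PIL import Image

    # Decode a JP2 master with the open-source OpenJPEG codec (via Pillow)
    # and write it out as a TIFF (uncompressed by Pillow's default).
    with Image.open('newspaper_master.jp2') as im:
        im.save('newspaper_master.tif')

Whether an OpenJPEG-based toolchain could process very large newspaper masters at acceptable speed is exactly the kind of question that investment in open decoders would need to answer.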

Euan Cochrane

Hi Johan, 

Interesting post!

I used to have a very intimate knowledge of the Data Documentation Initiative (DDI) metadata standard. This standard defines a bunch of pieces of information that should be collected about statistical data files and preserved along with the data files to ensure that they can be understood in the future. 

When I was working with DDI files I found that different people regularly interpreted what was meant by field definitions in different ways. 

Similarly, when a developer writes a program to create files that adhere to the JPEG-2000 "format", they have to interpret what that format specification actually means/requires. This often means that different applications create files that should be identical but actually differ in various ways, while all still adhering to the "format standard".

I would suggest that there are always going to be differences in interpretation amongst developers and that this problem will never go away. Standards are never going to be free of interpretation, and trying to make formats completely unambiguous is an impossible task (in my opinion). Language is too ambiguous and digital concepts still too fluid for a common understanding to be assumed.

The same problem holds true for the developers who have to interpret the standards to write applications that will render files formatted according to the standard. They, too, can introduce differences that cause the information in the files to be rendered differently, or cause the end result of the rendering to convey different information to the user.

Just look at the way Microsoft Excel and OpenOffice differ in how they open the same ODS-formatted files.

This is one of the reasons why I am an advocate of emulation as a long-term digital preservation strategy. Using emulation, you can serve up to users the same information that the original users saw.

You can do this without having to migrate the content with an application that may have been developed based on a different interpretation of the layout or "formatting" of the files involved, and without having to render or present the information with a different application that may have similar problems.

 

My guess is that even with the very intelligent and well-thought-out changes you suggest, there will still be differences in interpretation of the standard. For a taster of the problem, try asking three different people what a paragraph is and see if you get three identical responses.

 

Regards,

 

Euan

Johan van der Knijff

Hi Euan,



Thanks for your comments. I'm completely aware of the difficulties in overcoming ambiguities in standards, and I partially agree, but I'd like to add two considerations.


First of all, you state that there are always going to be differences in interpretation amongst developers, and that this problem will never go away. You then illustrate this using two examples: the DDI standard and interoperability issues with ODS files. Starting with ODS: this is a format that is much more complex than a simple bitmap image format such as JP2. I'm not familiar with DDI, but after taking a quick glance at the information on the DDI Alliance website, my first impression is that this standard, too, is quite a bit more complex than the header fields in JP2.


The thing about bitmap image formats is that at a fundamental level they're all pretty simple. In the case of JP2, the real complexity is in the image codestream (something I know very little about myself, actually). Everything else - header fields, representation information, embedded ICC profiles, and so on - is similar to any other image format such as TIFF, JPEG, PNG or BMP. Many of these formats have been around for decades, and even though the terminology used for describing their header fields may differ between formats, mapping them back to e.g. NISO/MIX is usually pretty straightforward. So the point is that I'm not sure to what extent your observations on ODS and DDI apply to these much simpler image formats.



Second, regarding your comments on emulation: in the case of formats such as ODS emulation allows a user to view a file in its original creator application. Most JP2s are created using command-line tools and libraries that are deployed in automated workflows (e.g. the Luratech, Aware or Kakadu command-line encoders). Although some of these tools have associated viewer applications, these are not typically available to an end user (Kakadu may be an exception here). So this situation is quite a bit different from the case of ODS. Also, if viewer A cannot find ICC profiles in images that were created by application B, then this may also affect the way images are displayed or printed by viewer A within an emulated environment. The same applies to the resolution issue.


That doesn't mean I'm advocating migration as a (future) preservation strategy for JP2. It's just to illustrate that the issues that I described in the paper may also affect emulation.


The common thread in both considerations above is that observations and preservation strategies from one particular digital object class (e.g. spreadsheets) may not necessarily apply to another one (e.g. image files). Generalisations may have little value without accounting for factors such as format complexity, the way files were created, and the way in which end users will be using the files.

OK, off to lunch now ...

Cheers,

Johan

Andy Jackson

I have to agree with Johan, in that this depends very much on the format in question. There have been many successful standards that have allowed otherwise platform/context-dependent code and data to be maintained and interpreted effectively. A classic example is the IEEE 754 standard for floating-point arithmetic. It has helped ensure that scientific codes give the most accurate and consistent results possible, whether on a Cray supercomputer or a humble laptop. For images, the content is so simple and, once uncompressed, so close to the performance that there is relatively little room for confusion.
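
As a tiny sketch of the kind of consistency IEEE 754 guarantees (the values shown are the well-known double-precision results): the same operation yields the same 64-bit pattern on any conforming platform, even when that result is not the "intuitive" decimal value.

    import struct

    x = 0.1 + 0.2
    print(x == 0.3)                    # False on every IEEE 754 platform
    print(struct.pack('>d', x).hex())  # '3fd3333333333334' everywhere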

Of course, any specification that depends exclusively on prose to convey its meaning will tend to fall short of describing a computational process unambiguously. But this is why we invented formal grammars and notations, and why many standards come with a reference implementation. Source code is just another document we can use to describe a standard.

Furthermore, I'm wary of seeing emulation as a silver bullet. The rendering of an Office document may be more authentic under emulation, but only as long as the right fonts are available and (worse still) the right printer is present, as without those a Word document will not be formatted or paginated correctly. Similarly, unless you have a precise clone of the original technical environment, you have to worry about whether an item depends on parts of the environment you have not captured (video drivers, sound card, etc., and for older platforms, undocumented chip features and other oddities). In many cases, these uncomfortable variations and complexities are precisely what formats and standards are trying to protect us from.

Euan Cochrane

Hi Johan and Andy,

 

I agree with you both that some formats are worse than others in this respect, and that emulation on its own is not a silver bullet.

I do think that even with notation techniques like the Backus-Naur Form there will always be ambiguity in language, though. Such notation schemes always assume some common understanding and fall prey to the same issues as format standards do. It's really a problem of linguistics and the philosophy of language.

As we agree, there are some areas where we will be more successful in conveying the meaning of a standard between developers and across time, such as in the definition of "textual content" or other such "simple" terms, and for these things endeavours such as this JPEG-2000 work are certainly the best approach. However, as soon as a degree of complexity is introduced (it's unclear exactly how much), there will inevitably be problems with ensuring a common understanding of the standards, such that people developing applications that create (encode) files adhering to them, or that interpret and present (decode) such files, produce the same results with their applications as with the "originals".

I would also add that precise emulated clones of technical environments are rarely needed. In most cases it ought to be ok to just have a representative rendering environment from the time, i.e. one that is representative of the type of environment normally used to render the object. In many cases this will just be a vanilla install of the OS with the particular rendering application installed. This should be ok, as for many things it would not have been expected (at the time) that the users/viewers of the objects would have had the exact same environment as the creator, and so we shouldn't be expected to replicate that now.

Great conversation!

Euan

Andy Jackson

To come back to this old thread, I always meant to say that I don't think it is fair to imply that all languages are as ambiguous as each other. I do not believe it is reasonable to lump boolean logic, regular languages, Turing-complete languages, the entirety of mathematics, and all natural languages together into one big ambiguous void.

Normal prose buries its ambiguity in almost every word: even a simple word like 'blue' will evoke a slightly different shade in every mind that reads it.

The whole point with formal languages, including software itself, is that they push the ambiguity to the edges. A computer program will execute precisely the same thing each time it is run, as long as the technical environment can be constructed correctly. The ambiguity is only in the context, and the challenge lies in understanding how much context we really need to maintain.

I hope a 'representative environment' can be found. However, I fear that the combinatoric possibilities of computer installations (which OS and version with which language packs and which Office version and which fonts and which JVM and which DB connections and and and...) means that no single environment will cover a majority of formats. Nevertheless, it will be interesting to find out!

Euan Cochrane

I agree that not all languages are as ambiguous as each other. However, I'd add that at some level they all have a minimum level of ambiguity related to how words mean things (theories of meaning try to address this).

By "representative environment" I did not mean to say that a single representative environment would be enough for all objects from an era, rather that every object might be able to be associated with a representative environment from that era. We may need many representative environments from each era/architecture type.

Euan Cochrane

One more note:

The benefit of the representative environment approach is that creating/capturing the environments would be a one-off process done at the point of ingest (and would be a real action we can take now to safeguard our digital objects). In the future, all that would be needed on an ongoing basis would be to migrate the emulation/virtual machine environment to new architectures. The relative complexity and cost of that migration (compared to the number of files it would apply to) may well be (I suggest: would likely be) much less than that of multiple file migrations for multiple formats, with validation of each combination of source and result file/format. Each emulator/VM tool would provide preservation functionality for potentially millions or billions of files which would otherwise each have to be migrated every x years.

 

There are drawbacks too, of course, one being that information can get trapped in environments and made difficult to use. However, there are ways to solve that if the will is there (you can already copy text and files easily out of emulated/virtualised machines and print from them).
