Open Planets Foundation Blogs
I found it both truthful and inspiring...
Truthful, because the chaotic path of discovery involved in understanding mysterious digital media reflected my own experiences on similar digital preservation adventures, both for the library and for the AQuA and SPRUCE projects.
Inspiring, because it brought new light to my old concerns about format/software/hardware registry systems. I've long been worried that they have not been designed with their users in mind. Specifically, the users that know all of this information and are willing to spend time sharing it. Why would they do it? What incentive would they need? What form of knowledge sharing would they choose?
Upon reading Ben's article, things became clearer. As I tweeted at the time:
- Now, go through and read it one more time, and think about how such a registry could actually have helped. What would it need to include?
- Could it really replace the expertise of those five (or so) people? Or should its purpose be to capture and link what they have achieved?
- Is the answer really in building registries? Or is it better to run more XFR STNs and help document and preserve what they do?
- Maybe we don't know what information we need? Maybe we don't even know who or what we are building registries for? Are we trying to replace imagination and expertise with an encyclopedia? Is it wrong to focus on the information, and ignore the people? Do we need a registry if we have a community of expertise to rely on? Should that community come first, and then be allowed to build whatever it needs? Maybe running and documenting more events like XFR STN and AQuA/SPRUCE is the only way to find out?
- Preservation Policy Levels in SCAPE. Barbara Sierman, Catherine Jones, Sean Bechhofer and Gry Elstrøm, iPRES 2013, 10th International Conference on Preservation of Digital Objects.
- Open Preservation Data: Controlled vocabularies and ontologies for preservation ecosystems. Hannes Kulovits, Michael Kraxner, Markus Plangg, Christoph Becker and Sean Bechhofer, iPRES 2013, 10th International Conference on Preservation of Digital Objects.
- SCAPE Project Site http://www.scape-project.eu/
Last Friday I ran a workshop at the BL trying to identify what I guess we might call significant properties of ebooks. This is to inform requirements for ebook characterisation tools developed as part of SCAPE and also to help inform BL staff involved in ebook ingest projects. To this end I wasn't just interested in the theoretically interesting features that anyone can get excited about - and there are plenty of interesting things about ebooks - but rather in what properties of ebooks were really important for the BL's core business. (In DP speak, the significant properties of ebooks as defined by the designated community of the British Library!)
For this workshop we only invited Collections staff from non-technical backgrounds. It was suggested that we re-run it for other communities within the BL - collection content groups and developers perhaps. Certainly I think we would get different discussions with more technical folk who, for example, cared more about ebook representation information and internal structures of the ebook format.
While I had planned two sessions in the workshop and a fairly structured agenda, only 4 non-DPT staff attended, so it made sense to let the agenda develop on its own. I still think the two sessions in groups could work with a bigger group, but the materials remain untested. The plan was as follows:
- Using a number of ebook devices explore the books and create a list of properties of interest.
- Prioritise that list into business requirements for the BL using our old friend the user story.
We managed the first part pretty well and participants were very interested in exploring a set of ebooks (a selection of fiction and non-fiction, including enhanced books with embedded audio) on a number of devices - an iPad, an iPad mini, a Kindle Fire, a Nook HD, a Kindle Keyboard, a Sony PRS-505 (clearly the Kindle's ancestor!) and an Elonex Ebook. The latter two are early examples of ebook readers and were, sadly, largely unusable. The Elonex was clearly underpowered and the Sony PRS-505's battery could no longer hold its charge. I included these in the selection deliberately to get people thinking about device preservation. Keeping the Sony PRS-505 going would be costly, and things like iPads do not have user-serviceable batteries (though, unlike the Sony, they at least remain operational while plugged into a USB charger!). I also provided a couple of (physical) books (of the same items) for comparison.
What struck me during this session was how much the device featured in the discussions. When I raised this with the group they said this was mainly because they were unfamiliar with the operation of the devices and this was a barrier to the content. It was stated quite categorically that the people at the workshop felt it important to separate content from reading device in BL systems. To do this we really need to be storing content in open formats that are not tied to devices or accounts, but I think we knew that already.
I then gave a brief presentation on what we in DPT consider to be important properties of ebooks, and used this as the start of a brainstorming session on significant properties. We came up with this list, in no particular order:
- Interactivity - where book meets computer game - e.g. interactive fiction ebooks, apps that are books, etc.
- Searching - whilst reading instead of consulting indexes, etc. and also full-text search via the catalogue
- Versions - who published this edition, when, etc. Ebooks can be remotely updated/removed.
- Authenticity - ebooks are easy to change and re-publish. There are plenty of cheap editions on the book stores from unknown "publishers".
- Accessibility - text to speech support, manipulation of font size, colours, etc.
- Skills - the skills required to make use of an ebook - probably lacking for some of our researchers at present
- Social Context - reviews, ratings, recommendations, tweets, comments, tags, annotations, etc. that are associated with a book (typically part of a content seller's system).
- Language of the content and on-the-fly translation support
- Linked Resources - references, bibliography, extra content (such as PDFs of knitting patterns, print at home origami, additional appendices)
- Embedded Resources - images, audio, video, fonts, software, etc.
- Layout - where the words, images, etc. appear on the page.
- Structure - where chapters start and finish, what is a heading, what is the Table of Contents, etc.
- Citation - how do you cite when there are no page numbers?
- Metadata - embedded metadata at many levels (author by chapter for example) and the ability to embed further BL enhanced metadata
- Devices - the reader hardware and software
- Digital Rights & Restrictions - the BL has a policy on what it will and will not accept so we could quickly skim this one, but it is important to know if a document can be printed, cut and pasted, accessed on only one device at a time, etc. All of these restrictions seriously hamper preservation activivies (imagine doing conservation on the cover of a book that you could not touch!)
- Usage Statistics and Recording - it is reasonable to assume the ebook readers record and perhaps report statistics. The latest incarnation of the Kindle reading software for example will tell you how fast you read and how long you have to the end of the chapter. Handy if your train is near the station I guess.
- Content - related to searching, but preserving the words themselves.
That is a long list! We expanded on a few of them:
While it was felt that the devices were interesting and had cultural and historic value - something the BL may be interested in as part of the history of the book - keeping these devices was not considered a priority. Indeed, as previously mentioned, it was felt the devices got in the way of the content. I wondered what would happen if I'd provided Calibre on a laptop instead of or as well as the devices.
I showed a slide with a scan of E. E. Cummings' The Cubist Break-Up (not available as an ebook) and we discussed The Waste Land and poetry in general. It is easy, using the font size, typeface, line-spacing and screen size, to alter the formatting of a poem. One of the participants noted how a stanza that should've been on a single page was split across two. Another noted how the text of The Hobbit was not flowing correctly around an image no matter what settings were used. At the same time participants seemed happy they could alter the text as they saw fit. This suggests a need to be able to preserve text layout - perhaps in the form of hints - but have the ability to turn this off when necessary.
This one provoked a lot of discussion. My colleague Will Palmer noted that we do not go to great lengths to ensure every book referenced in every bibliography is also available to readers at the BL. Given that, why should we want to preserve links found in ebooks? Some argued that it was a question of expectation. A reader of a physical book expects to have to do some leg work to find references, but an ebook user expects that any links will work or at least resolve to something useful. (This raises a bigger issue of the use of ebooks in reading rooms on restricted networks, but that isn't a preservation problem.) Further, some content is probably more important to obtain than others. Bibliographic links can perhaps be left, but what about additional content omitted from the book and thus the ebook and only downloadable? The BL separates CD-ROMs from books and holds these separately. Should we do the same for downloadable content? Do we need to define an ebook as an aggregation and preserve that rather than just the book itself? How does all this hook into the Web Archive?
Having created our list we spent the last half hour or so identifying those we felt mattered most to the BL and came up with this subset:
- Linked Resources
- Digital Rights and Restrictions
- Structure - internal dictionaries, table of contents, etc.
- Layout and layout hints
I would have liked to have explored these further including working them into user stories but we were out of time. Hopefully the workshop will be run again and we can find out more and if you want to repeat the whole thing at your institution and add to the debate that would make me very happy!
Help! Digital Repositories
As part of the SPRUCE Project Awards, Northumberland Estates are currently assessing digital repository solutions which will result in the creation of an associated business case justifying investment in a recommended solution. The business case will aim to implement a sustainable digital repository for the long term management of Northumberland Estates digital content. With a particular focus on small to medium organisations this project aims to address the lack of knowledge in the digital preservation community on preservation as a service (PaaS) providers.
What is a digital repository?
When you think of a traditional repository, you imagine a structure for the preservation and safety of paper archives and collections. Digital Repositories are harder to define due to their intangible nature, but in essence they provide a system for managing and preserving digital content.
There are a number of high-level requirements which the adopted solution must meet:
- Incorporates methodology of the OAIS Reference Model
- Sustainable and supported
- Secure storage environment which provides bit level preservation and fixity checks (see the sketch after this list)
- Investment appropriate for a small organisation with a single dedicated member of staff
- Provides internal access to preserved content
- Handles a broad range of formats including CAD, GSI, and forensic disk images
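To illustrate what the fixity-check requirement above involves in practice, here is a minimal Java sketch that computes a SHA-256 checksum for a file and compares it against a value recorded at ingest. The class name and argument handling are purely illustrative; a real repository would run such checks on a schedule across its whole store.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class FixityCheck {

    // Compute the SHA-256 checksum of a file as a lower-case hex string.
    static String sha256(Path file) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]);   // file to check
        String expected = args[1];        // checksum recorded at ingest
        // A mismatch means the stored bitstream has changed since ingest.
        System.out.println(sha256(file).equals(expected) ? "fixity OK" : "FIXITY FAILURE");
    }
}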
I have identified three potential options based on these requirements:
- Open Source: Many Higher Education institutions already have mature repository instances through the use of open source software such as DSpace, EPrints, and Fedora. These repositories support research, learning, and administrative processes.
- Out of the Box: The emergence of PaaS providers such as Tessella Preservica and Ex Libris Rosetta has enabled digital preservation functionality to remain at the system's core. Often based on OAIS, they provide active preservation and curation of digital assets. Preservica uses AWS to provide bit level preservation in the cloud while Rosetta implements an in-house solution with storage provided by the organisation.
- Hybrid: The combination of commercial services with an in-house/open source route is feasible. For instance, Arkivum provides bit level preservation while open source OAIS digital preservation systems such as Archivematica can provide the extra level of preservation required for the creation of SIPs, AIPs, and DIPs.
A Call for Help!
By outlining these options I hope to spark some debate on suitable repository options for small to medium organisations. Two questions spring to mind:
- Am I missing any feasible repository options based on these requirements?
- Are there any further requirements which need to be taken account of?
Please feel free to comment below.
Like many other organisations that are using JPEG 2000, the KB produces two representations of most of its digitised content (newspapers, books, periodicals):
- a high-quality, losslessly compressed JP2 that is the archival master;
- a lesser-quality, lossily compressed JP2 that is used as an access image (this is used for e.g. our newspapers website).
The majority of our digitisation work is contracted out to external suppliers, and both master and access images are typically derived from a parent (TIFF) image, which is converted to JP2 using the settings for master and access images, respectively. This means that we're not currently using the archival masters for producing derived images. However, there may be a need for this at some point in the future. For instance, we may need higher quality access images, or access images that give better performance in our access environment. Because of this, I was asked to take a further look into ways to derive access JP2s directly from our archival masters.
In this blog post I'll be sharing some preliminary findings of this work, which may be of interest to other JPEG 2000 practitioners as well. All images and test results that I'll be showing along the way are available from this Github repository, so you can have a go at these data yourself, if you're so inclined.
Masters vs access images
To better understand the remainder of this blog post it is helpful to outline the differences between our masters and our access images. The lists below give the encoding-related specifications of both.
Specifications master:
- File format: JP2 (JPEG 2000 Part 1)
- Compression type: Lossless (reversible 5-3 wavelet filter)
- Colour transform: Yes (only for colour images)
- Number of decomposition levels: 5
- Progression order: RPCL
- Tile size: 1024 x 1024
- Code block size: 64 x 64 (2^6 x 2^6)
- Number of quality layers: 1
- Error resilience: start-of-packet headers; end-of-packet headers; segmentation symbols
Specifications access:
- File format: JP2 (JPEG 2000 Part 1)
- Compression type: Lossy (irreversible 9-7 wavelet filter)
- Colour transform: Yes (only for colour images)
- Number of decomposition levels: 5
- Progression order: RPCL
- Tile size: 1024 x 1024
- Code block size: 64 x 64 (2^6 x 2^6)
- Precinct size: 256 x 256 (2^8) for the 2 highest resolution levels; 128 x 128 (2^7) for the remaining resolution levels
- Number of quality layers: 8
- Target compression ratio of layers: 2560:1; 1280:1; 640:1; 320:1; 160:1; 80:1; 40:1; 20:1
- Error resilience: start-of-packet headers; end-of-packet headers; segmentation symbols
The main differences between the two are:
- access images are compressed lossily (reduced file size), whereas lossless compression is used for the masters;
- access images contain quality layers (enables progressive decoding), whereas the masters don't;
- access images use precincts (optimises performance while panning across zoomed-in regions), which aren't used in the masters either.
So, the central question here is: if we have an image that was encoded according to the master specifications, how can we derive an image from this that conforms to our access specifications? To find out, I did a number of tests on the image balloon_master.jp2, which was created according to the KB's master specifications. It looks like this (surprise, surprise!):
I tried to derive an access image from this master using two popular JPEG 2000 software toolkits: Kakadu and the Aware JPEG 2000 SDK.
For both software packages I limited myself to using only the pre-compiled binaries (i.e. the kdu_.. demo tools for Kakadu, and the j2kdriver command-line tool for Aware).
Kakadu
Kakadu's kdu_compress tool doesn't accept any of the JPEG 2000 formats as input; however, it does include a kdu_transcode tool which is capable of a wide array of reformatting operations. I should add here that kdu_transcode is primarily intended as a demo tool that showcases Kakadu's codestream reformatting capabilities, and it doesn't produce output in the JP2 format (for a detailed explanation by Kakadu's author look here) [1].
However, kdu_transcode is capable of wrapping output in a JPX container (which can be made JP2-compatible), so this is what I used for these tests. To keep things simple, I started out by instructing the tool to create an output image with a 20:1 compression ratio (ignoring any of the layer / precinct requirements). For an RGB image with 8 bits/component this corresponds to an equivalent bitrate of 1.2, so I ended up with the following command line:
kdu_transcode -i balloon_master.jp2 -o balloon_access_kdu.jpf -jpx_layers sRGB,0,1,2 Sprofile=PROFILE2 -rate 1.2
The resulting output image did have the expected size, but opening it in an image viewer revealed a problem:
Compared to the source image, most of the colour information has gone, resulting in a representation that is largely grayscale. The reason behind this seemingly unexpected result is fairly simple: when kdu_transcode creates the derived (lower quality) image, it does so by discarding some of the information that makes up the source image. In other words, it doesn't decode and recompress the image, but instead re-arranges the compressed image data (which is a very fast process). For a source image with multiple quality layers, the result would be largely equivalent to discarding some of the highest quality layers. However, our source image only has one single quality layer, so this isn't possible here. Instead, we end up with a result in which most of the colour information is missing (my guess is that the exact behaviour in such cases also depends on the progression order that was used for encoding the source image). Importantly, this is not a flaw of the tool, but simply a consequence of the way the source image was formatted upon its creation.
Aware
Aware's j2kdriver tool supports encoding, decoding and reformatting of JP2 images. I used the following command line in an attempt to create a lossy access image (note that the -w switch sets the transformation to irreversible 9-7 wavelet, and the -R switch sets the target compression ratio to 20:1):
j2kdriver -i balloon_master.jp2 -R 20 -w I97 -t JP2 -o balloon_access_aw.jp2
This produced an output image that has the same size as the master! Similar to Kakadu's kdu_transcode tool, j2kdriver makes no attempt at decoding and recompressing the source image in this case. However, the Aware tool does have a number of reformatting options, including one that allows you to discard quality layers. Needless to say, as the source image contains only one quality layer, this isn't of much use in this case.
Optimising the archival masters for access generation
In order to produce access images from our current archival masters, we would need to fully decode the source images and then recompress them. Even though this is perfectly possible (e.g. we could simply convert each JP2 to TIFF and then compress that back to lossy JP2), this is both awkward and computationally expensive. A more elegant approach would be to take advantage of JPEG 2000's ability to include multiple quality layers. We're already using quality layers in our existing access images, but this is mainly to optimise performance for access. However, we can also define quality layers in the preservation masters, and we can do this in such a way that a subset of all the quality layers in the master become equivalent to the access image. Access images can then be generated by simply discarding one or more quality layers in the preservation master, without any need for re-compressing the whole image. Visually, this results in the following situation:
This is also the approach that Rob Buckley suggested in this 2009 report for the Wellcome Library. In this case we have a losslessly compressed master with 11 quality layers. Access images at a 20:1 compression ratio can then be derived by simply discarding the highest 3 quality layers.
Making it work
To make this all work I first optimised the specifications of the preservation masters by incorporating the quality layer definitions from our access specifications, adding 3 further quality layers to accommodate the higher quality that is produced by lossless compression. I also added precinct definitions, since we're using those for access as well. This resulted in the following profile:
- File format: JP2 (JPEG 2000 Part 1)
- Compression type: Lossless (reversible 5-3 wavelet filter)
- Colour transform: Yes (only for colour images)
- Number of decomposition levels: 5
- Progression order: RPCL
- Tile size: 1024 x 1024
- Code block size: 64 x 64 (2^6 x 2^6)
- Precinct size: 256 x 256 (2^8) for the 2 highest resolution levels; 128 x 128 (2^7) for the remaining resolution levels
- Number of quality layers: 11
- Target compression ratio of layers: 2560:1; 1280:1; 640:1; 320:1; 160:1; 80:1; 40:1; 20:1; 10:1; 5:1; 2.5:1 *
- Error resilience: start-of-packet headers; end-of-packet headers; segmentation symbols
Then I went back to that dreaded balloon image TIFF, and created a new lossless master that follows the optimised specifications (balloon_master_layers_precincts.jp2).
Generating the access image
Kakadu
Kakadu's kdu_transcode doesn't allow you to explicitly discard quality layers, but the -rate switch can be used to select an output bitrate, which has pretty much the same effect. So we can simply set all parameters to identical values as in our earlier example (remember that the 1.2 bitrate is equivalent to a compression ratio of 20:1 for an RGB image):
kdu_transcode -i balloon_master_layers_precincts.jp2 -o balloon_access_precincts_kdu.jpf -jpx_layers sRGB,0,1,2 Sprofile=PROFILE2 -rate 1.2
In contrast to our earlier test, the resulting image has a very good quality. Note that using kdu_transcode in this way produces output images that have the same number of quality layers as the source image (here: 11). However, in this case 3 are actually empty (i.e. the 4 highest quality layers are effectively identical). This is not a problem at all, it just means that progressive decoding of the image will result in an improved quality up to (and including) layer 8, with layers 9, 10 and 11 not adding anything on top of that.
Aware
Aware works differently in that it allows you to define explicitly which quality layers must be included in the output image. To get an access image with a 20:1 compression ratio, we need to include the 4th best quality layer and anything below it (i.e. discard the 3 highest quality layers), which is done with the following command:
j2kdriver -i balloon_master_layers_precincts.jp2 -ql 4 -t JP2 -o balloon_access_layers_precincts_aw.jp2
Note that instead of decoding images by quality layer, it is also possible to do this by resolution level. This can be useful if derived images at a lower resolution are needed. Both Kakadu's kdu_transcode and Aware's j2kdriver application are capable of this, provided that the master images contain a sufficient number of resolution levels (which is controlled by the number of decomposition levels at the encoding stage).
Conclusions
Careful selection of how JP2 preservation masters are generated can greatly facilitate the derivation of access images at a later stage. Tests with images that follow the KB's current master specifications showed that lossy access images could only be derived by fully decoding and re-compressing them. Though not necessarily a problem, a more efficient approach would be to make better use of quality layers. This allows access images to be derived by simply extracting a subset of the master, without the need to decode or re-compress the source data. Tests with two widely used JPEG 2000 software toolkits (Kakadu and Aware) show that using this approach the process of deriving access images is both simple and efficient.
Acknowledgements
Thanks go out to David Taubman, whose reply to some of my questions on Kakadu's transcode tool was largely the impetus for this blog post.
Useful links
- Dataset with all test images (Github link)
- Buckley & Tanner (2009): JPEG 2000 as a Preservation and Access Format for the Wellcome Trust Digital Library
[1] For most operational uses you would need to create a custom application using the full SDK.
Over the last hour or two of the sprint I was focused on reviewing as many of the pages we'd created as possible, looking for gaps and picking up on jobs we still had to do. I was again struck by just how good the content was. 13 minds were definitely a lot better than one. After the sprint had finished I would spend another week back in the office doing some thorough editing, removing inconsistent formatting, plugging gaps and adjusting some of the language. But this would mainly be a polish job not a rewrite. This said a lot about the hard work of our sprinters during the event. Despite all being rather tired at the end of the 3 days I was quite surprised to see how excited everyone was about what we'd created and (I think most strikingly) about the process we used to get there. We spent the last 40 minutes of the event chatting about how the 3 days had gone, what we could learn from the experience for next time, and where we should go next. There were lots of positive comments from the group about the book sprint process, summed up nicely by William Kilbride of the DPC when he contrasted how things would have gone had we not taken the book sprint route. Suffice to say, the end result would have been of significantly lower quality, and we wouldn't have had the buy in and ownership of the entire team in the way that the book sprint naturally ensured we did. So the book sprint experience was quite a fascinating one. We didn't actually write a book. And we certainly didn't do any sprinting (perhaps the odd swift walk to the pub after a long day of writing). But the end result has already had ten times more impact on twitter than any of our other SPRUCE work, with 10000 hits on the wiki. I'm hoping the toolkit will have a significant impact and make it easier for our digital preservationists to get at least some of the cash they need to make their preservation work happen. Thanks to all the sprinters and everyone who contributed to the toolkit!
It has been quite some time since the last update of the bwFLA demo instance. Since then we have significantly improved usability and added a lot of new and hopefully useful features. With a first complete implementation of ingest and access workflows it is time to release a new version of the bwFLA framework.
- Please keep in mind that all the bits and pieces are of beta quality, and thus things may break. However, we are eager to hear about your experience. Please leave a comment if you encounter any problems or have suggestions for improvements, use cases, questions, etc.
- We have restricted access to OPF members because the demo setup suffers from resource limitations for now. At most 12 parallel sessions can be handled in a performant way. Thus, if you experience slow or unresponsive emulators please try again later and/or contact us. OPF members can get the password here: http://wiki.opf-labs.org/display/PT/bwFLA+test+demo+instance
New and noteworthy features
- Complete Ingest / Evaluation workflow examples for digital art objects
- Example: Digital CD-ROM Art Collection
- Complete Access workflow examples
- Access to preserved complete computer systems
- Example: Apple Macintosh of Vilem Flusser
- 100% browser solution - no plugins etc. required. Just a current Firefox or Chrome browser
- Sound support for some emulators (Virtualbox) (experimental)
- Sound for other emulators under development
- Tablet / Smartphone support (iOS/Android) (experimental)
Workflows: Ingest and Access
For this demo we have implemented an example use case for curating digital art objects. A number of example objects can be chosen from a list to be rendered and evaluated in an emulated legacy environment. To simplify usage, only two rendering environments (platforms) are offered. In the generic ingest workflow any currently available platform can be chosen.
In the next step both technical metadata (i.e. the technical description of the rendering platform and user configuration) and domain-specific metadata describing performance aspects are generated.
The access workflow presents the available and working digital objects. The previously generated technical metadata is used to re-enact the chosen environment, while the performance evaluation may be used to guide users.
Archive: Base Images
The base image workflow provides an overview of currently available system environments. In the next versions these images can be forked and modified by users.
The Baden-Württemberg Functional Long-Term Archiving and Access (bwFLA) project is a two-year state-funded project transporting the results of past and ongoing digital preservation research into practitioners' communities. Primarily, bwFLA creates tools and workflows to ensure long-term access to digital cultural and scientific assets held by the state's university libraries and archives. The project consortium brings together partners across the state, involving people from university libraries, computing centers and archives, providing a broad range of backgrounds and insights into the digital preservation landscape.
The ENSURE (Enabling kNowledge Sustainability, Usability and Recovery for Economic value) project focuses on the challenges associated with the long-term preservation of data produced by organizations in different sectors, namely health care, clinical trials and finance. A cloud based digital preservation system is in the final stages of development. The following figure shows the architecture of the ENSURE system.
The configuration layer components create the preservation plan used by the preservation solution in the first place and update it whenever the environmental or business needs change. They are run before the initial deployment of the preservation solution and they are re-run periodically in case of any environmental changes.
One major component of the configuration layer is the cost engine. The cost engine is used to predict the ‘whole life-cycle cost’ of long-term digital preservation in the cloud. The cost model is developed following the activity based costing methodology. It may be relevant not just in the three ENSURE use cases but in many other industries. Experts from the digital preservation community helped in qualitatively validating the cost model. There are plans to validate the cost engine quantitatively with real cost values to ensure its generalizability, applicability and validity.
The second layer is the System Runtime. The system runtime provides many services including data management, ingest, archival storage and access. It is the SOA infrastructure for executing the plug-ins selected by the Configuration layer. Different components of the system runtime include Preservation Digital Asset Lifecycle Management, Information Preparation, Ontology Framework and Content-aware Long-Term Data Protection.
Please visit the project website for more details.
In Policy Representation we have been looking at the different levels of policy that an organisation should consider.
We have identified three levels:
- Guidance policy: Very high level statements which apply to the whole organisation
- Preservation Procedure policy: Natural language human readable policy which may encompass the whole organisation or may be focused on a particular collection or material type depending on the needs of the particular organisation
- Control level policy: These are statements derived from the Preservation Level, which are in both a human readable and machine-readable form and relate to a specific collection or material type.
The first two levels are written by humans to be read by humans; the third level will be available in both human-readable and machine-readable forms. If one intends to use the SCAPE watch and planning tools, such as SCOUT and PLATO, then machine-readable policy statements will be needed to inform the operation of the tools, and they can be used in both tools without further modification.
Getting from policy which is aimed at humans to policy which a machine can evaluate is not a straightforward process. We are developing some guidance and trying the process out with policy from SCAPE partners.
The control policy model will be described in further blog posts, but to summarise: for a given content set and user community, there will be a preservation case which has a series of measurable objectives, which together define the machine-understandable policy that applies in this case. Examples of objectives might include permissible file formats or the presence of documentation.
Steps in the process
Stage 1: Whole policy activities
1. Identify the content set the policy addresses
2. Identify the user communities/roles required by the policy
3. Map policy statements to high level concepts.
This stage has activities which apply equally to all parts of the written policy. As a result of the steps in this stage the content, users and topics addressed will be identified.
Stage 2: Policy statements within the whole policy
1. Clarification of implicit meaning
2. Identification of the control policy preservation case
3. Identification of objectives
4. Generate control statements
These steps are applied taking each policy statement in turn. They are designed to ensure that all the information a machine will need is explicitly stated and that the parameters to be measured are chosen.
Stage 3: Review the Preservation Cases and identify any rationalisation required
Finally, this stage is an opportunity to review the complete set of control policies and to make any adjustments required.
Points to note
This is work in progress, but from the work we have done so far, we have reached these conclusions:
- Having explicit policy in natural language is important
- Expressing policy in machine testable ways is more complex but can bring benefit through use of tools
- Natural language preservation procedure policy defines acceptable states in statements but control level defines measurable attributes in questions
- Written policy is at a fairly abstract level and practicalities may be addressed in implementation plan/job procedure document or one-off project plan
- Implicit information understood by human audience will need explicitly expressing for computers
The browser-shots tool is developed by Internet Memory in the context of the SCAPE project, as part of the Planning and Watch (PW) sub-project. The goal of this tool is to perform automatic visual comparisons, in order to detect rendering issues in archived Web pages.
From the tools developed in the scope of the project (in the preservation components sub-project), we selected the MarcAlizer tool, developed by UPMC, that performs the visual comparison between two web pages. In a second phase, the renderability analysis will also include the structural comparison of the pages, which is implemented by the new Pagelyser tool.
Since the core analysis for renderability is thus performed by an external tool, the overall performance of the browser-shots tool will be tied to this external dependency. We will keep integrating the latest releases from the MarcAlizer development, as well as updates to the tool resulting from more specific training.
The detection of rendering issues is done in the following three steps:
1. Screenshots of the Web pages are automatically taken using the Selenium framework, for different browser versions.
2. Visual comparisons between pairs of screenshots are performed using the MarcAlizer tool (recently replaced by the PageAlizer tool, to also include structural comparison).
3. Rendering issues in the Web pages are automatically detected, based on the comparison results.
The browser-shots tool is developed as a wrapper application, to orchestrate the main building blocks (Selenium instances and MarcAlizer comparators) and to perform large scale experiments on archived Web content.
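To make the screenshot step concrete, here is a minimal sketch using Selenium's Java bindings (the actual tool consists of Python scripts, as noted below, and the commented-out comparator call is a hypothetical placeholder for MarcAlizer rather than its real API):

import org.openqa.selenium.OutputType;
import org.openqa.selenium.TakesScreenshot;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class ScreenshotPair {

    // Capture a PNG screenshot (as a byte array) of a URL with the given browser driver.
    static byte[] capture(WebDriver driver, String url) {
        driver.get(url);
        return ((TakesScreenshot) driver).getScreenshotAs(OutputType.BYTES);
    }

    public static void main(String[] args) {
        WebDriver firefox = new FirefoxDriver();
        try {
            byte[] shotA = capture(firefox, args[0]);
            byte[] shotB = capture(firefox, args[1]);
            // Hypothetical comparator call standing in for MarcAlizer/PageAlizer,
            // which would return a (dis)similarity score for the two renderings:
            // double score = MarcAlizer.compare(shotA, shotB);
        } finally {
            firefox.quit();
        }
    }
}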
The browser versions currently exercised and tested are: Firefox (all available releases), Chrome (only the latest version), Opera (the official versions 11 and 12) and Internet Explorer (still to be fixed).
The initial, sequential implementation of the tool consists of several Python scripts, running on a Debian Squeeze (64-bit) platform. This version of the tool was released on GitHub and we received some valuable feedback from the sub-project partners.
For the preliminary rounds of tests, we deployed the browser-shots tool on three nodes of IM's cluster and performed automated comparisons for around 440 pairs of URLs. The average processing time was about 16 seconds per pair of Web pages. These results showed that the existing solution is suitable for small-scale analysis only. Most of the time in the process is actually spent on IO operations and disk access to the binary files for the snapshots. Taking the screenshots proved to be very time consuming, and therefore if this solution is to be deployed at a large scale it needs to be further optimised and parallelised.
These results also showed that a serious bottleneck for the performance of the tool is the passing of intermediate data between the modules. More precisely, the materialisation of the screenshots as binary files on disk is a very time consuming operation, especially when considering large scale experiments on a large number of Web pages.
We therefore have to move to a different implementation of the tool, which will use an optimised version of MarcAlizer. The Web page screenshots taken with Selenium will be passed directly to the MarcAlizer comparator using streams, and the new implementation of the browser-shots tool will be a MapReduce job running on a Hadoop cluster. Based on this framework, the current rounds of tests can be extended to a much higher number of pairs of URLs.
In the second round the browser shot comparison tool is implemented as a MapReduce job to parallelise the processing of the input. The input in this case is a list of URLs together with a list of browser versions that are used to render the screenshots - note the difference from the former version, where the input was pairs of URLs that were rendered using one common browser version and then compared.
Optimizations
In order to achieve acceptable running times, a newer version of the Marcalizer comparison tool was integrated. The major improvement is the possibility of feeding the tool with in-memory objects instead of pointers to files on disk. This improvement and the elimination of unnecessary IO operations lead to the following average times for the particular steps in the shot comparison:
1) browser shot acquisition - 2s
2) Marcalizer comparison - 2s
Note that the time to render the screenshot using a browser mainly depends on the size of the rendered page; for instance, capturing a wsj.com page takes about 15s on the IM machine, where the resulting PNG image is several MBs.
MapReduce
As you can see, the operations on the screenshots are very expensive (remember that the list of tested browsers can be very long, and for each one we need to perform a browser screenshot operation). Therefore we need to parallelise the tool across several machines working on the input list of URLs. To facilitate this, we have employed Hadoop MapReduce, which is part of the SCAPE platform.
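As a rough illustration of how such a job might be shaped (an assumed structure for the sake of example, not the actual SCAPE implementation), a mapper could take one URL per input record, render it with each configured browser and emit a comparison result per browser:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: one input record = one URL. The browser list is hard-coded
// here for illustration; it could equally be passed in via the job configuration.
public class BrowserShotMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final List<String> BROWSERS = Arrays.asList("firefox", "opera");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String url = value.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        for (String browser : BROWSERS) {
            // renderAndCompare() is a placeholder for: take the screenshot with
            // Selenium, stream it to the comparator and return an XML fragment
            // describing the comparison result.
            String resultXml = renderAndCompare(url, browser);
            context.write(new Text(url + "\t" + browser), new Text(resultXml));
        }
    }

    private String renderAndCompare(String url, String browser) {
        return "<comparison url='" + url + "' browser='" + browser + "'/>";
    }
}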
The result of the comparisons is then materialised in a set of XML files, where each file represents one pair of browser shot comparisons. In order to alleviate the problem of having large numbers of small files, these files are automatically bundled together into one ZIP file. A C3PO adapter has been implemented by TU Wien so the result can be processed and passed further to Scout.
Tests
At the moment, we have run preliminary tests on the currently supported browser versions - Firefox and Opera. The list of URLs to test is about 13,000 entries long. We are using the IM central instance for these tests, currently with two worker nodes (thus we can cut the processing time in half with parallel execution).
Last winter I started a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This Wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that "Preflight is thorough and unforgiving (as it should be)". But what evidence do we have to support such claims? The only evidence that I'm aware of, are the results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist 'in the wild' are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight's ability to detect 'risky' features before relying on this tool in any operational setting.
Adobe Acrobat Engineering test files
Shortly after I completed my initial tests, Adobe released the Acrobat Engineering website, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as Multimedia & 3D Tests and Font tests. This makes these files particularly useful for additional tests on Preflight.
Methodology
The general methodology I used to analyse these files is identical to what I did in my 2012 report: first, each PDF was validated using Apache Preflight. As a control I also validated the PDFs with the Preflight component of Adobe Acrobat, using the PDF/A-1b profile. The table below lists the software versions used:
- Apache Preflight: 2.0.0
- Adobe Acrobat: 10.14
- Acrobat Preflight: 10.1.3 (090)
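As an aside, the error codes and details reported by Apache Preflight can also be collected programmatically. The following is a minimal sketch assuming the standard PDFBox Preflight Java API; it is illustrative only and not the exact harness used for these tests:

import java.io.File;
import org.apache.pdfbox.preflight.PreflightDocument;
import org.apache.pdfbox.preflight.ValidationResult;
import org.apache.pdfbox.preflight.ValidationResult.ValidationError;
import org.apache.pdfbox.preflight.parser.PreflightParser;

public class PreflightCheck {
    public static void main(String[] args) throws Exception {
        // Parse the PDF and run PDF/A-1b validation on it.
        PreflightParser parser = new PreflightParser(new File(args[0]));
        parser.parse();
        PreflightDocument document = parser.getPreflightDocument();
        document.validate();
        ValidationResult result = document.getResult();
        if (result.isValid()) {
            System.out.println("Valid PDF/A-1b");
        } else {
            for (ValidationError error : result.getErrorsList()) {
                // e.g. "3.1.3: Invalid Font definition, FontFile entry is missing ..."
                System.out.println(error.getErrorCode() + ": " + error.getDetails());
            }
        }
        document.close();
    }
}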
Re-analysis of PDF Cabinet of Horrors corpus
Because the current analysis is based on a more recent version of Apache Preflight than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the PDF Cabinet of Horrors corpus. The main results are reproduced here. The main differences with respect to that earlier version are:
- Apache Preflight now has an option to produce output in XML format (as suggested by William Palmer following the Leeds SPRUCE hackathon)
- Better reporting of non-embedded fonts (see also this issue)
- Unlike the earlier version, Preflight 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report here.
Since the Acrobat Engineering site hosts a lot of PDFs, I only focused on a limited subset for the current analysis:
- all files in the General section of the Font Testing category;
- all files in the Classic Multimedia section of the Multimedia & 3D Tests category.
The results are summarized in two tables (see next sections). For each analysed PDF, the table lists:
- the error(s) reported by Adobe Acrobat Preflight;
- the error code(s) reported by Apache Preflight (see Preflight's source code for a listing of all possible error codes);
- the error description(s) reported by Apache Preflight in the details output element.
The list below summarizes the results for the PDFs in the Font Testing category. For each file, the Acrobat Preflight error(s), the Apache Preflight error code(s) and the Apache Preflight details are given:
- EmbeddedCmap.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font. Apache Preflight code(s): 3.1.3. Details: Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5
- TEXT.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font. Apache Preflight code(s): 3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4. Details: Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic (repeated for other fonts); Font damaged, The CharProcs references an element which can't be read
- Type3_WWW-HTML.PDF. Acrobat Preflight: none. Apache Preflight code(s): 3.1.6. Details: Invalid Font definition, The character with CID"58" should have a width equals to 15.56599 (repeated for other fonts)
- embedded_fonts.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing. Apache Preflight code(s): 3.1.9; 3.1.11. Details: Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset
- embedded_pm65.pdf. Acrobat Preflight: none. Apache Preflight code(s): 3.1.6. Details: Invalid Font definition, Width of the character "110" in the font program "HKPLIB+AdobeCorpID-MyriadRg" is inconsistent with the width in the PDF dictionary (repeated for other font)
- notembedded_pm65.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font. Apache Preflight code(s): 3.1.3. Details: Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman (repeated for other fonts)
- printtestfont_nonopt.pdf*. Acrobat Preflight: ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type. Apache Preflight code(s): none. Details: Preflight throws exception (exceptionThrown), exits with message 'Invalid ICC Profile Data'
- printtestfont_opt.pdf*. Acrobat Preflight: ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type. Apache Preflight code(s): none. Details: Preflight throws exception (exceptionThrown), exits with message 'Invalid ICC Profile Data'
- substitution_fonts.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3). Apache Preflight code(s): 3.1.1; 3.1.2; 3.1.3. Details: Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light (repeated for other fonts)
- text_images_pdf1.2.pdf. Acrobat Preflight: Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent. Apache Preflight code(s): 3.1.1; 3.1.2. Details: Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor
* As this document doesn't appear to have any font-related issues, it's unclear why it is in the Font Testing category. Errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.
General observations
An intercomparison between the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight's output may vary in case of non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the PDF Cabinet of Horrors dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 "character width" error is something that was also encountered during the London SPRUCE Hackathon, and according to the information here this is most likely the result of the PDF/A specification not being particularly clear. So, this looks like a non-serious error that can be safely ignored in most cases.
To accomplish effective digital preservation, environments with a preservation concern such as repositories need scalable and context-aware preservation planning and monitoring capabilities to ensure continued accessibility of content over time. These should create a continuous cycle that allows the system to detect opportunities and risks and act accordingly.
So far so good. The problem is that so far, this lifecycle is not well supported and thus hardly implemented in practice. We identified a number of gaps in the state of the art that we set out to address:
- Preservation environments are lacking business intelligence mechanisms and tools and the scalable, feature-rich content profiling required to really understand what is in the preservation collections and what risks exist.
- Knowledge sharing and discovery is not practiced at scale, since it is not supported well enough by current mechanisms.
- Decision making efficiency needs to be improved (Plato is trustworthy, but requires considerable manual effort).
- Policies need to be better understood and modelled, in particular preservation policies in the sense of "business policies" which guide and constrain preservation activities and provide the context of preservation planning, monitoring, and operations.
Lots of challenges! Five key goals are what we set out to achieve:
- Provide a scalable mechanism to create and monitor large content profiles
- Enable monitoring of operational preservation compliance, risks and opportunities
- Improve efficiency of trustworthy preservation planning
- Make the systems aware of their context
- Design for open, loosely-coupled and robust preservation ecosystems that can grow over time
Our SCAPE Planning and Watch suite makes preservation planning and monitoring context-aware through a semantic representation of key organizational factors, and it collects and reasons on preservation-relevant information. Integration with repositories and external information sources provides powerful preservation capabilities that can be freely integrated with virtually any repository. Many of you already know the names of the components of that solution:
C3PO provides scalable profiling of feature-rich content collections. It takes the output of FITS or Tika and calculates the statistical distribution of features. It also selects representative sample objects from the collection to enable systematic experiments of more manageable size, which is important for planning, and it exports these statistics and the samples into a content profile that is understood by its partnering tools, Plato and Scout. Finally, it has an intuitive, neat user interface for visualising properties dynamically. It does not (yet?) support real-time analytics on petabytes of data. (So far. It would be great to make that happen too....)
Scout is the business intelligence component that draws content profiles and many other sources together to monitor what is going on in the repository - and outside! - and check whether the two are a good fit or not. C3PO and Scout together can already provide very useful insights into collections right now, if you use them properly. (There are still a few spaces left for the tutorial at iPRES.)
Plato, which has been around for a while, has been learning a lot about its context recently. Endowed with an understanding of the C3PO content profile and a semantic model of preservation "control policies", it is increasingly able to support the preservation planning process more efficiently. While the big improvements will be coming out in the next year, some things are already much less work than they used to be - provided you have created a "policy model" beforehand and used C3PO to make a content profile. You can also discover Taverna workflows on myExperiment inside Plato and run them from there. That discovery function is going to get a lot more powerful in the near future, by the way... The latest release of Plato is 4.2, with more to come soon, and of course it is online as a service, as usual since 2007.
The policy model is one of the things we will be presenting in more detail at IPRES this year (together with a demonstration of the tool suite and a tutorial on content profiling and monitoring with C3PO and SCOUT). The model represents an organisation's objectives and key contextual knowledge in a way that both Plato and SCOUT can use to provide better support for preservation planning and monitoring. A set of permanent vocabularies is out at PURL.org to provide the core elements used by the control policy model and others:
- http://purl.org/DP/preservation-case contains the basic elements that link a preservation case together. This is in some cases quite closely related to preservation intent: it defines what is being preserved for whom, providing the rationale for checking whether the current state of preservation is fine or whether a plan for actions is needed.
- http://purl.org/DP/quality describes the attributes used to describe aspects of preservation quality;
- http://purl.org/DP/quality/measures contains the elements used for annotating, describing and discovering measures for quality;
- http://purl.org/DP/control-policy, finally, defines the classes of objectives relevant for a preservation case, so that goals and objectives can be defined for each case.
We expect this set of vocabularies to grow over time, naturally, both in terms of classes and their instances. It is used by the tools in different ways, providing a glue that enables them to converse with each other.
Where are we now and what is left to do?
Without trying to be complete here, some of the key things you will want to know, in my opinion, include the following:
- Prototypes of all tools are out.
- The APIs between these are partially published; the rest will follow soon.
- We have started to measure how long it takes to create a plan and how much we can improve on that.
- Scout already knows a number of sources to get information from (such as content profiles, PRONOM, and the policy model). Every additional source that is added makes every other source more valuable, since Scout can link between them. We will be developing more adaptors for additional sources, but you are very much encouraged to create adaptors too!
- Documentation about the vocabulary will be out soon, and so will be further thoughts on how you can specify your policies more effectively.
Some of the upcoming things that the team in Planning and Watch is working on include
- Specifying Service Level Agreements for preservation actions as part of the executable preservation plan, based on the criteria and measures created as part of planning, so that execution of the preservation plan can be monitored continuously for compliance to the expectations (after all, the choice and configuration of the action was based on experiments and measures of quality - we don’t want surprises there when we run it on lots of content!)
- Sophisticated integration of Plato and Scout with myExperiment to discover components according to what they can do and what they measure, provided they are properly annotated.
- Tool support for control policy editing, so you don’t need to model your policies in RDF!
- A simulation engine that can be used to calculate predictions about the future state of a preservation environment, based on a current state and a set of assumptions. The neat thing here is that the entire set of assumptions is explicitly declared and documented, since this environment is built using model-driven engineering. That means that a model of the simulation, the cause-effect relationships, can be built using a domain-specific language, and that model is documented together with the simulation run, can be shared and extended, and the simulation is hence documented fully.
- ... and quite a few other things that you will hear about soon!
Final APIs will be openly published to enable anybody to integrate (with) these tools.
I am very much looking forward to seeing the outcomes of the final SCAPE year!
Preservation Topics: Preservation Actions, Identification, Characterisation, Emulation, Migration, Preservation Strategies, Normalisation, Preservation Risks, Format Registry, Representation Information, Corpora, Tools, Planets, SCAPE
An important part of image file format migration is quality assurance. Various tools can be used, such as ImageMagick or Matchbox, but they either provide only a single metric or target different use cases. I therefore wanted to investigate implementing image comparison algorithms myself.
I created a prototype tool/library for image quality analysis, called Dissimilar. I had previously prototyped a tool that used the OpenCV libraries in Java to perform image comparisons. Those experiments showed that, while possible, it was not ideal; a large native-code shared object needed to be packaged with the tool and some inline memory management was required.
For Dissimilar I subsequently implemented the PSNR and SSIM algorithms from scratch in Java, making use of the Apache Commons Imaging and Math3 libraries. The result is about 600 lines of commented, pure-Java code for performing image quality analysis.
The SSIM is calculated for an image by splitting it into 8-pixel-by-8-pixel "windows", calculating SSIM for each window, and then taking the mean of the per-window results. In addition to the (mean) SSIM value, Dissimilar reports the minimum SSIM value alongside the variance in SSIM values. It may be useful to use a combination of the mean, minimum and variance to set a better threshold for image format migration. For example, setting a minimum value would ensure that the quality of all 8x8 windows stayed above a certain threshold, while using the variance would enable identification of images with large differences between individual SSIM windows whose values might still produce a mean that is assessed as acceptable.
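To make the idea of combining these statistics concrete, here is a minimal, illustrative Java sketch (not Dissimilar's actual code) that aggregates a set of per-window SSIM scores into mean, minimum and variance and applies example thresholds; the threshold values are placeholders rather than recommendations.

```java
import java.util.Arrays;

/**
 * Illustrative only: aggregates per-window SSIM scores into mean, minimum
 * and variance, and applies example acceptance thresholds. This is not
 * Dissimilar's actual code; the data and thresholds are made up.
 */
public class SsimAggregateExample {

    public static void main(String[] args) {
        // Per-window SSIM scores as produced by some SSIM implementation.
        double[] windowSsim = {0.99, 0.97, 0.995, 0.62, 0.98};

        double mean = Arrays.stream(windowSsim).average().orElse(0.0);
        double min = Arrays.stream(windowSsim).min().orElse(0.0);
        double variance = Arrays.stream(windowSsim)
                .map(v -> (v - mean) * (v - mean))
                .average().orElse(0.0);

        // Example acceptance rule combining the three statistics:
        // the low-quality window drags down min even if the mean looks fine.
        boolean acceptable = mean > 0.95 && min > 0.80 && variance < 0.01;

        System.out.printf("mean=%.4f min=%.4f variance=%.6f acceptable=%b%n",
                mean, min, variance, acceptable);
    }
}
```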
Testing was performed using our Hadoop cluster to enable comparison of results from ImageMagick (PSNR) and Dissimilar (PSNR/SSIM). A tiff was migrated to lossy jp2 and then back to tiff. The original tiff and second tiff were then compared using each tool, each tool therefore having identical inputs.
It is worth noting that there is no built-in support for JPEG2000 files in Apache-Commons Imaging, and it is worth using a known decoder to decompress to tiff for comparison. For more about that see our iPres paper in September.
Results on a homogeneous dataset of 1000 greyscale image files showed that ImageMagick took about half the execution time of Dissimilar. This is an encouraging result, as the Dissimilar code is currently unoptimised. The execution time of Dissimilar also includes starting a new JRE, an SSIM calculation and saving an SSIM "heatmap" image to identify the low values, so some execution speed savings are expected. It is also possible to call the code as a library - this could be done as part of a Java workflow, thus removing the overhead of a new JRE. Some information about the difference between using a Java library and executing a new JRE has been blogged about before.
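For the library route, a sketch along the following lines is conceivable. The class and method names for the comparison calls are hypothetical, since Dissimilar's real API may differ, and reading TIFFs via ImageIO assumes a TIFF-capable plugin (built in since Java 9).

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

/**
 * Hypothetical sketch of calling an image-comparison library from a Java
 * workflow instead of spawning a new JRE per comparison. The commented-out
 * calls use illustrative names; Dissimilar's actual API may differ.
 */
public class LibraryCallExample {

    public static void main(String[] args) throws Exception {
        // Assumes a TIFF-capable ImageIO plugin is available on the classpath.
        BufferedImage original = ImageIO.read(new File("original.tif"));
        BufferedImage migrated = ImageIO.read(new File("roundtripped.tif"));

        // Hypothetical library calls; no process start-up cost is incurred
        // because everything runs inside the already-running JVM.
        // double psnr = Dissimilar.calculatePSNR(original, migrated);
        // double ssim = Dissimilar.calculateSSIM(original, migrated);
        // System.out.printf("PSNR=%.2f SSIM=%.4f%n", psnr, ssim);
    }
}
```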
The PSNR results were identical to that of ImageMagick. The SSIM results were not the same as Matchbox’s but I think it and Dissimilar calculate SSIM in different ways. I couldn’t find another readily available and tested tool to calculate SSIM to verify the results – suggestions are welcome!
Next steps include testing more files, producing more unit tests, optimisation, and identifying suitable threshold values for the SSIM mean, minimum and variance. I am also going to investigate adding more types of image quality assessment metrics.
Preservation Topics: Preservation Actions, Migration, Preservation Risks, Tools, SCAPE, Software
Emulation-as-a-Service (EaaS) can simplify the provisioning of emulation for end users without requiring them to install and configure complex toolchains on their devices. Besides single machine instances that the user interacts with directly, more complex setups are required for networked services, as found in preserved business processes or research environments. This involves computer networking between nodes and machines that are not accessible directly but over network links. Depending on the users' requirements, the networking is either purely internal and unlinked to the rest of the world, or access may be required from within the original, emulated environment to (parts of) the outside world, or vice versa.
Computer Networking
To illustrate the different requirements for EaaS networking, a couple of use cases help to understand the problem set. Since the early days of computing, many services, especially in business environments, have been networked. For databases, for instance, there is a (single) server running the Database Management System (DBMS) and clients connecting to it over a network protocol. With the rising popularity of TCP/IP, networking spread to applications like email, file transfer protocols, network drives, or the world wide web. Modern WWW sites may provide additional features which use a database or other services in the back-end, as found in content management systems or electronic shops. Other use cases include computer games, played either in a LAN setup or over the internet. Early games depended on IPX/SPX networking; all newer ones switched to TCP/IP. Later on, new services emerged, like Voice over IP. The requirements of the various services regarding bandwidth and latency may differ significantly.
Different higher layer protocols depend on certain physical networks. Nevertheless, the underlying physical transport medium is usually only relevant in so far as different computer platforms might provide different options, like virtual Ethernet connections of various speeds or serial line interfaces (for modem connections). Operating system drivers might be needed for the hardware layer and for the higher layer protocols, as they were not provided with all operating systems. While the situation is comparably easy with Unix and Linux systems dating from 20 years ago, Microsoft and Apple operating systems added networking later on. The hardware drivers were usually provided by the manufacturers of the devices. Popular TCP/IP stacks for DOS and Windows 3.11 were NCSA Telnet and Trumpet Winsock. Later on, Microsoft provided a fully functional TCP/IP stack for Windows 3.11, including a fully operational DHCP client. Another challenge is network configuration: not all TCP/IP stacks implement BOOTP or DHCP for dynamic IP configuration. A future problem emerges with the switch from IPv4 to IPv6.
Networking in Emulators
Emulators implement hardware components in software which mimic the behavior of their physical counterparts. Modern, modular emulators like QEMU provide a couple of popular Ethernet cards, like NE2000, NE2K and PCnet featuring 10Mbps, and E100 and E1000 (i82551, i82557b, i82559er) for faster connections up to Gigabit. Some aspects of network access require privileged operations executed by the emulator, as it might need to modify the network routing table or bridging functionality of the host system. X86 virtualization software usually links quite directly to the host system via bridged, NATed or host-only networking, often providing more than one network interface to a guest. Some emulators offer just user mode networking, where the emulator acts as a normal network client within the host environment, which might limit the available protocols and functionality. In this case the emulator usually provides routing and NATing functionality on the IP layer. Often, the emulator or virtual machine provides higher layer network configuration like BOOTP/DHCP built directly into QEMU or VirtualBox. This prevents the use of non-IP higher layer protocols like IPX/SPX or AppleTalk.
Original environments with networking can be connected to each other either on the virtual hardware layer, by bridging network interfaces to each other, or by routed networking on a higher layer protocol like TCP/IP. When directly bridged to each other on the same host system, the virtual machines usually provide the software network stack. In cases where host networking is accessed via protocol routing (e.g. for NAT or host-only networking), special virtual network interfaces are required. These are either provided by the virtual machine frameworks, such as VMware's, or they use available technology like the TUN/TAP interfaces deployed by QEMU.
The fundamental building blocks of an EaaS architecture are abstract emulation components (ECs), used to standardize deployment and to hide individual system complexity. An EC encapsulates various emulators, available either as open source or commercial products, into an abstract component with a unified set of software interfaces (API). Besides control interfaces for standard emulator features, in combination with node and user management, networking control is needed. The task of configuring and providing networking in EaaS is somewhat more challenging. Different objectives need to be met, depending on the actual task. A general software network layer should be provided, which should be able to fulfill the following requirements:
- Bridge distributed running systems to each other (on a virtual Data Link Layer)
- Allow lower layer packet exchange (Data Link Layer, e.g. for Ethernet, TokenRing, ... frames) to provide service for different higher layer protocols like NetBEUI, IPX, AppleTalk, IPv4, IPv6, ...
- Connect to different system emulators in a standardized way
- Allow for network tunneling between EaaS instances and to end users
- Provide (controlled) break-out into today's networks (via NAT, proxy, ...)
Ideally, no privileged access to host resources should be required, to prevent security issues and simplify configuration.
The basis for such a software network layer can be provided, for example, by Virtual Distributed Ethernet (VDE). VDE, part of the virtualsquare research project, is an Ethernet-compliant virtual network/softswitch that can span a distributed set of physical machines connected through the Internet. The actual functionality needed depends on the use case.
A typical use case that depends on the availability of network connections is databases. Usually databases are deployed on a dedicated server and accessed over the network from (numerous) clients. For preservation and access through EaaS, several scenarios are possible. Some setups may allow server and clients to be merged into one machine, avoiding the need to create a network connection at all. This approach was used for the LINZ database preservation described in Euan's post: the server machine already provided all necessary components (web browser) to access the database. For large scale databases and complex setups, such as different operating systems on the server and clients, at least two machines need to be preserved. Then an appropriate network connection has to be provided and brought up when a user wants to interact with the client. The bandwidth requirements depend on the actual database application but should usually be low. As the database server usually runs permanently, the client expects it to be running; thus it needs to be started or resumed before the client is started. In EaaS the user would be presented with the client only, so functions need to be implemented which automatically enable the server when a user requests access to the preserved client. Depending on the user's requirements for data re-use, the data might need to be extracted into today's user environments. For many databases, client implementations exist which still run in today's working environments. For these scenarios just the database machine (running the DBMS) needs to be started and a direct network connection to the end user provided. Optimally, the connection is tunneled to the end user, avoiding open data connections between both sides: the client on the user's side connects directly to the local end of the network connection, which is provided together with the EaaS access. Such a setup avoids having to create a separate network tunnel between the EaaS provider and the user.
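As a purely conceptual sketch of that last idea, the following minimal Java relay listens on a local port for a present-day database client and forwards the traffic to a remote endpoint exposed for the emulated server. The host name and ports are placeholders, and this is not part of any actual EaaS framework.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

/**
 * Minimal, illustrative TCP relay: a present-day database client connects to
 * a local port, and the relay forwards the traffic to the (remote) endpoint
 * exposed for the emulated server. Host name and ports are placeholders.
 */
public class LocalTunnelSketch {

    public static void main(String[] args) throws Exception {
        int localPort = 15432;                    // local end the client talks to
        String remoteHost = "eaas.example.org";   // placeholder EaaS endpoint
        int remotePort = 5432;                    // emulated DBMS port (placeholder)

        try (ServerSocket server = new ServerSocket(localPort)) {
            while (true) {
                Socket client = server.accept();
                Socket remote = new Socket(remoteHost, remotePort);
                pipe(client, remote);   // client -> emulated server
                pipe(remote, client);   // emulated server -> client
            }
        }
    }

    // Copies bytes from one socket to the other on a background thread.
    private static void pipe(Socket from, Socket to) {
        new Thread(() -> {
            try (InputStream in = from.getInputStream();
                 OutputStream out = to.getOutputStream()) {
                in.transferTo(out);
            } catch (Exception ignored) {
                // Connection closed; nothing further to relay.
            }
        }).start();
    }
}
```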
Computer Games
Networked computer games usually require at least two user instances running. Depending on the type of game, one of the clients or a dedicated machine will provide the server functionality. While early network games were happy with dial-up link speeds, newer games expect the higher bandwidth and lower latencies provided by Ethernet links. Early games used IPX/SPX over LAN connections; later on, IPv4 took over. Different higher layer protocols (expecting broadcast functionality) need to be supported by the virtual network link. The latencies provided by today's WAN connections should be low enough to suffice for early network games. Network games support different numbers of simultaneous players. Beside minimal differences in network and user configuration, all clients are pretty much the same. Thus, optimally, every client should be cloned and configured from a single instance to avoid keeping numerous client instances (full machine images and configuration) just for that purpose. The EaaS needs to provide the means to start and configure the server if required and to fire up the needed number of clients upon user request. The users should then somehow be able to join the same game.
Business processes
Another use case involving networking between multiple machines is business processes or scientific research environments. Such configurations might have external dependencies like third party services accessed over the network. Either these services can be internalized to keep the original environment functional, or one can try to emulate the necessary features of the service. This could be very complex, and thus impossible if, for example, cryptographically secured connections are used. For simpler services it might be feasible to monitor and record the data exchange over the network for certain situations and play back the answers of the service later on.
Preservation Topics: Preservation Actions, Emulation
It's been more than two years now since I wrote my D-Lib paper JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format. From time to time people ask me about the status of the issues that are mentioned in that paper, so here's a long overdue update.
Issues addressed in the 2011 paper
- The specification was overly restrictive on the embedding of ICC profiles. By only allowing input profiles, this ruled out the use of display profiles. In practice this meant that widely-used working colour spaces such as Adobe RGB and eciRGB 2 could not be used in JP2 without violating the standard.
- JP2 makes a distinction between capture resolution and default display resolution, which are stored in two designated sets of header fields. However, the specification was not clear in which case either set of fields should be used.
This led to a situation where not all software products were interpreting the specification in the same way. For instance, some encoders would (silently) produce files in JPX format whenever they encountered an input image with an embedded display ICC profile. Other encoders would embed the profile, changing the profile class in the process, whereas yet others would ignore the limitation altogether and embed the profile without complaining. Similarly, some software products would only write (and read) the capture resolution fields (while ignoring any default display ones), whereas the opposite was true for other products. This in turn raised various interoperability issues, many of which are potential risks in a long-term preservation context.
The 2011 paper concluded that "[t]hese issues could be remedied by some small adjustments of JP2's format specification, which would create minimal backward compatibility problems, if any at all".
Amendment to the standard
So, enter the amendment "Updated ICC profile support and resolution clarification", which was published by ISO earlier this year. This amendment remedies the above issues by applying the following changes to the existing JP2 format specification:
- The Restricted ICC profile method now permits the use of display profiles (previously only input profiles were allowed). The other restrictions (e.g. that ICC profiles should be of either the Monochrome or the Three-Component Matrix-Based type) remain unchanged.
- It is more specific about the intended uses of the capture and the default display resolution boxes. Of particular interest here is that capture resolution now reflects the resolution at which the image samples were "captured or created". Previously the word "digitized" was used, which ruled out the case of born-digital materials. The use of the default display resolution box is also further clarified. In practice this means that the capture resolution is pretty much equivalent to the XResolution/YResolution fields in TIFF.
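As a small, hedged illustration of that equivalence, the sketch below reads the TIFF resolution metadata with the Apache Commons Imaging library; a JP2 encoder could carry these values over into the capture resolution box. The file name is a placeholder, and this is not taken from any existing migration workflow.

```java
import java.io.File;
import org.apache.commons.imaging.ImageInfo;
import org.apache.commons.imaging.Imaging;

/**
 * Illustrative sketch: reading the TIFF resolution metadata that, after the
 * amendment, maps naturally onto JP2's capture resolution fields.
 */
public class TiffResolutionExample {

    public static void main(String[] args) throws Exception {
        ImageInfo info = Imaging.getImageInfo(new File("master.tif"));

        // Equivalent in spirit to TIFF XResolution/YResolution (in DPI);
        // a JP2 encoder could carry these over into the capture resolution box.
        System.out.println("X resolution (dpi): " + info.getPhysicalWidthDpi());
        System.out.println("Y resolution (dpi): " + info.getPhysicalHeightDpi());
    }
}
```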
Note that the full amendment text is only available after purchase at ISO (previously an earlier draft was available for free, but apparently it was taken down recently).
Implementation of changes
Since a standard isn't worth much unless it is used, let's have a quick look at the three most popular JPEG 2000 implementations (a more elaborate overview is available here and here at the OPF File Format Risk Registry).
Recent versions of Kakadu's kdu_compress are now able to correctly handle ICC profiles. Compressing a TIFF that contains an ICC profile that meets the (updated) restricted ICC profile definition now produces a JP2 with the profile correctly embedded. Moreover, kdu_compress now uses the capture resolution fields to store the image's resolution (as derived from the TIFF). Previously, the Kakadu demo applications were using the default display fields instead, which resulted in various interoperability issues, because most other decoders/encoders were using the capture fields. This is all solved in the latest version.
By default, Aware and Luratech already used the capture resolution fields back in 2011, and this behaviour is now consistent with the updated standard. As for ICC profiles, Aware accepted display profiles without complaining in my 2011 tests, and with the amendment in effect, these images are now also valid JP2. Luratech used to handle display profiles by changing the profile class field to input. That was in 2011, and I don't know if anything has changed since then, but then again this behaviour never caused any problems in the first place.
Round-up and conclusions
The amendment to JP2 fixes the previous shortcomings that were mentioned in my 2011 D-Lib paper. Moreover, the behaviour of the three most popular (commercial) JPEG 2000 implementations now closely follows the updated specification, which should minimise any interoperability problems related to ICC profiles and resolution.
Preservation Topics: Preservation Actions, Characterisation, Emulation, Migration, Normalisation, Representation Information, Tools, SCAPE, jpylyzer
Jisc would like to invite you to take part in a survey on how you discover and use software in a work-related context. Findings from this survey will help us to better understand problems that the higher and further education sectors face in this area and what Jisc and its partner organisations might do to help.
In particular we would like to understand how you search for new software, what criteria are important for you in deciding whether software appears relevant and trustworthy enough to trial, and what common problems you encounter in using software. This survey is aimed both at individual users of software (such as researchers, teaching staff, archivists or students) and at those who may procure software or advise others (IT support, management). Our focus is on the UK but we are also interested in hearing from international colleagues. The survey has been developed in partnership with OSS Watch and the Software Sustainability Institute.
Survey results won't be made attributable to individuals or institutions, and all information will be anonymised for analysis purposes. This survey will remain open until 7th July (17:00 BST) at
If you have any question about the survey please contact Torsten Reimer at firstname.lastname@example.org
We would like to thank you for your interest in the survey - we really do value your feedback.
Preservation Topics: Software
Preserving Cultural Heritage
National libraries have the responsible task of building a bridge between preserving the rich cultural heritage of our society and providing public access to it. Digitization is a means of addressing this complex and contradictory task: digital copies of books provide access to their content while preserving the original artifacts in case of loss or destruction. While this seems to solve one problem, it raises another one: long term preservation of digital objects.
What is Matchbox?
The Matchbox Toolset is an open source toolset that provides decision-making support for various quality assurance tasks of digital libraries. It can be used to assess quality properties of image collections, compare different versions or find duplicates within collections. Matchbox is based on state-of-the-art image processing technologies and does not rely on Optical Character Recognition (OCR) which makes it more flexible than previous approaches.
How does it work?
The solution provided is based on interest point detection, a technique that has spread into many fields of visual computing. Based on the contrast properties of an image, perceptually salient points are detected and statistically described. The intrinsic properties of these points - scale, illumination and rotation invariance - make this approach a perfect choice for analyzing inhomogeneous document collections. From these interest points a visual vocabulary and document fingerprints are computed, an approach derived from classical document retrieval and now applied to image retrieval. Using machine learning techniques, interest points common to all images of a collection are identified and a visual vocabulary is calculated. By counting these visual words, a histogram-based fingerprint can be created for each image. This highly condensed representation of an image can be used for fast indexing and search operations. Based on efficient machine learning algorithms, these fingerprints are used to identify matching images within one book collection or between different collections. Once matching pairs have been identified, a geometrical transformation is calculated from their corresponding interest points to scale, rotate and align the images accurately. After this registration procedure they can be reliably compared and a similarity estimate can be calculated.
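To make the fingerprint idea tangible, here is a minimal, illustrative Java sketch that compares two bag-of-visual-words histograms by normalised histogram intersection. It mirrors the principle described above but is not Matchbox's actual implementation, and the histograms are toy data.

```java
/**
 * Illustrative sketch: comparing two bag-of-visual-words fingerprints by
 * normalised histogram intersection. Not Matchbox's actual code.
 */
public class FingerprintCompareExample {

    // Returns a similarity in [0, 1]: 1.0 for identical normalised histograms.
    static double histogramIntersection(double[] a, double[] b) {
        double sumA = 0, sumB = 0, intersection = 0;
        for (int i = 0; i < a.length; i++) {
            sumA += a[i];
            sumB += b[i];
        }
        for (int i = 0; i < a.length; i++) {
            // Normalise each histogram to sum to 1 before intersecting.
            intersection += Math.min(a[i] / sumA, b[i] / sumB);
        }
        return intersection;
    }

    public static void main(String[] args) {
        // Visual-word counts for two page images (toy data, 5-word vocabulary).
        double[] pageA = {12, 0, 7, 3, 1};
        double[] pageB = {10, 1, 8, 2, 0};
        System.out.printf("similarity = %.3f%n", histogramIntersection(pageA, pageB));
    }
}
```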
The method described has been implemented as a set of small tools. Instead of a monolithic program solving only a specific task this approach provides flexibility for various current problems - detecting duplicated pages within a book, estimating quality differences between two different digital versions of a book, assembly of a collection from different partial versions - as well as many yet unknown problems.
The provided video demonstrates and visualizes the fundamental principles and technologies of the Matchbox Toolset.
My name is Krešimir Đuretec. I work as a project assistant at the Department of Software Technology and Interactive Systems, Vienna University of Technology, with a focus on the SCAPE and BenchmarkDP projects. I am also pursuing a PhD at the same department.
Your role in SCAPE?
My primary focus in the SCAPE project is on the Planning and Watch sub-project. There, I have been involved in the development of Scout, the preservation watch component. Furthermore, I am responsible for the development of the simulation environment.
Recently, I took over the sub-project lead, so my responsibilities are now shifting from development to coordination. My task will be to make sure the Planning and Watch sub-project products (Scout, Plato, C3PO, policies) integrate nicely with each other and with products from the rest of the SCAPE project (repositories, component catalogue, web archives, ...).
Why is your organisation involved in SCAPE?
With our knowledge in the digital preservation field, and SCAPE being the follow-up to the Planets project (in which we also participated), it was a logical choice for Vienna University of Technology to be part of the SCAPE project. The biggest benefit for us, as a research institution, from participating in this project is the contact with potential users of our research products. This enables us to get immediate feedback on our results and also allows us to drive our research towards real users' needs.
What are the biggest challenges in SCAPE as you see it?
SCAPE as a big project definitely has a lot of challenges. The two that I would like to point out are large scale content processing and providing the full life cycle of preservation.
With content easily exceeding terabytes, and in some cases petabytes, today's organisations need tools capable of processing such amounts of data in a reasonable amount of time with reasonable resources. We can't rely just on scaling out, because scaling out can be quite expensive. Instead, we need tools that will use the available resources optimally.
Even when we solve the first challenge we are only half way through. To be fully scalable we need tools to automatically monitor our content and tools that will notify us when there is a potential risk with our content. Also we need tools which will enable us to make systematic decisions on what to do with our content. Once we have those tools working together then we will be able to say we have a scalable preservation environment.
SCAPE is addressing both of these challenges and has already shown significant progress, so I am sure that by the end of the project we will be able to say we have a real scalable preservation environment.
What do you think will be the most valuable outcome of SCAPE?
I see Scout as a tool currently offering us huge potential. The possibility to automatically monitor digital content in a repository and to notify us when there are potential problems with it, such as policy violations, could significantly improve the quality of the preservation process. Furthermore, sharing content information (format distribution, size, …) with a wider community could reveal some hidden risks (am I the only institution holding this format?) but also potential opportunities, such as two institutions working together on solving a problem with a specific format they both have. Once this happens, digital preservation will be on a different (better) level.
Following the community response to our workshop last year, we want to invite you again to contribute your future preservation challenge!
Digital Preservation has emerged as a key challenge for information systems in almost any domain from eCommerce and eGovernment to finance, health, and personal life. The field is increasingly recognized and has taken major strides in the last decade. However, key areas of research are often limited to applying solutions to existing problems rather than proactively investigating the challenges ahead and probing for innovative break-through approaches that would radically advance the domain.
Open Research Challenges @ IPRES 2013
To provide a forum for eliciting, discussing and refining future research challenges in digital longevity, we are organizing a half-day workshop at IPRES 2013 in Lisbon, Portugal. The goal of this interactive workshop is to elicit and discuss DP research challenges to be tackled in the next decade. It brings together researchers in order to step beyond the limitations of solutions that are applicable now, and develop concepts, models and solutions for upcoming challenges. This will cover diverse areas such as Information Systems, Databases, Information Retrieval, Library and Archival Science, Content Management, Modeling, Simulation, Human-Computer Interaction, Scholarly Communication, Systems Engineering, Cloud computing, Security and others.
The workshop builds on a highly successful first workshop last year, but focuses the discussion on specific topics and encourages participants to arrive at specific research challenges that can be tackled in concrete research designs. The output will be published as a report.
The workshop is an interactive event focusing on discussions and inspirational exchange between participants on challenging new research questions. We do not wish to prescribe topics for challenges, but encourage proposers to think outside the box and put forward challenges that lie ahead rather than research problems that are currently being tackled.
Please submit your description of a research challenge in which you
- argue for the concrete motivation behind this challenge,
- set it in relation to existing work,
- provide a context that enables understanding and further investigation and
- outline, if possible, concrete research designs to tackle the challenge.
This includes questions such as the following
- Why is this topic relevant?
- Why has it not been addressed yet?
- What are the problems involved?
- What is the potential impact if this is successfully addressed?
- What are possible ways of evaluating if it has been addressed?
Format and mode of submissions
Papers must contain the sections Motivation; Current state-of-the-art; Research contributions and benefits; Outlook. Papers must be submitted in PDF format and should be 2-4 pages. They should conform to the ACM SIG template. Three to five keywords characterizing the paper should be indicated at the end of the abstract. It is expected that at least one author of each accepted challenge will register for and attend the workshop. Submissions should be sent to email@example.com.
Challenge submission deadline: July 12, 2013
Notification of acceptance: July 31, 2013
Workshop: September 6, 2013
Up-to-date information can be found at the workshop website: http://digitalpreservationchallenges.wordpress.com