Feed aggregator

Incorporating Emulation into a 'business as usual' digital preservation workflow.

Digital Continuity Blog - 3 April 2012 - 8:53am

This post is intended to be speculative and may well be full of errors, both in the writing (spelling/grammar/typos) and in the content (I could be way off-mark). I am putting it out here as a thought piece to stimulate commentary and ideas. Some of this came out of recent discussions at the Future Perfect 2012 conference with many people including Jeff Rothenberg and Dirk von Suchodoletz.

 

What would it mean to take emulation seriously as a digital preservation strategy?

 

Most major digital preservation systems are currently based around having migration as the main long term preservation strategy. Some may argue that they are all in fact based on a strategy of hedging bets by way of retaining the original files and implementing migration, and this may be so; however none that I am aware of are based around using only emulation as a digital preservation strategy. I believe there is merit in some institutions using only emulation as a digital preservation strategy. They may wish to also use migration for providing access derivatives, much as we use a photocopier for providing access derivatives of paper records today. However there are some interesting and potentially cost-saving differences when implementing an emulation based digital preservation strategy instead of a migration based strategy.

 

This post is an attempt to highlight some of the differences in implementing a purely emulation based approach.

 

What would a business as usual digital preservation workflow look like?

 

At the point of transfer, or earlier, digital preservation practitioners (DPPs) would try to ascertain the necessary rendering environment or environments for each digital object. This might be as simple as knowing that the object was a PDF file from a certain era, and so was intended to be rendered in one of x versions of Acrobat Reader, or that it was a Microsoft Word document from a certain era created with OpenOffice, and therefore intended to be rendered with either OpenOffice or one of the versions of Microsoft Word available at the time. Or it may be far more complex. The decision on how accurate the rendering environment has to be will depend on the context in which the object was normally used. If it was normally used by many users on many different systems then one or more representative rendering environments may be appropriate. If it was normally used by multiple users via a specialised environment, then a copy of that environment may need to be made and transferred with the object.

 

Any necessary environments or environment components would be checked off against the preservation institution’s inventory (e.g. Microsoft Word xx, Java version xx, environment xx). Any components that had to be transferred from the agency would be packaged for transfer. Where full environments had to be transferred, disk images would be made or virtual appliances would be transferred.

 

Files would go into the repository with some (digital) preservation metadata consisting of their age, rendering environment ID(s), date of last modification and any relevant fixity information (other metadata would be transferred for access restriction and discovery purposes etc). The date of last modification would be used when configuring the rendering environment to ensure active date fields were contemporaneous with the file (i.e. the emulated environment would have the system date set to the date the file was last modified).
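As a concrete illustration of that metadata, a minimal per-object record might look something like the sketch below. The field names and values are illustrative only, not a proposed schema.

# A minimal sketch of the preservation metadata described above.
# Field names and values are illustrative, not a formal schema.
preservation_record = {
    "object_id": "obj-000123",
    "files": ["report.doc"],
    "rendering_environment_ids": ["env-win95-word97"],
    "last_modified": "1998-03-14",  # used to set the emulated system clock
    "fixity": {"md5": "3f2b9c0d8e7a6b5c4d3e2f1a0b9c8d7e"},  # checksum recorded at transfer
}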

The files would then have bit-preservation routines applied to them as per usual (copies made, checksums checked, media refreshment and replacement, etc).

If an appropriate rendering environment was not available in the inventory of the transferring agency, one would either have to be configured or selected from a provider. Testing of the environment could be done in conjunction with the transferring organisation or individual, or could be done automatically using standard software installation testing routines. That one environment could then be used to render any object associated with it in the future. An average DPP (archivist, librarian) with basic IT skills should be able to be trained to configure most environments. In many cases it will only require knowledge of how to install applications on a base operating system image.

 

When a user requested access to the original object there would be a number of options available:

1.  They could be provided access to the object automatically rendered in the associated rendering environment within a controlled environment, e.g. in a reading room.

2.  They could be provided access to the object automatically rendered in the associated rendering environment remotely, either through a custom application or through a web-browser.

3.  They could be provided with the files that make up the object and information about the rendering environment, e.g. a unique ID for the environment or a list of its components. The environment itself could then be provided by the user (e.g. the transferring agency may still have the environment running) or by an external service provider.

4.  They could be provided with an access derivative created as part of a non-preservation, value-add process to facilitate greater reuse.

 

Throughout all of these options (aside from 4) the user could be given a number of ways to interact with the object and move content from it to a more modern environment (these may depend on confidentiality or commercial constraints):

 

a)  They could be given the option of printing objects to a file or printer.

b)  They could be given the option of selecting and copying content to paste into the modern host environment.

c)  They could be given the option of saving the object in a different format and moving the result to the modern host environment.

 

How does this process differ from standard, migration-based, approaches?

 

 

1.  There is no validating of files against format standards (JHOVE would be unnecessary). Format validation only matters if you want to be able to consistently apply migration tools across a large set of files. Intra-format variance generally results from different creating applications writing files differently while intending to adhere to the same formatting standard. If you are employing an emulation strategy this variance is not a problem: it is useful for identifying the rendering application, even though it is a problem for validation tools.

 

2.  Format analysis becomes less important. Strictly speaking, format identification is unnecessary when implementing an emulation strategy. The only format-like information that is necessary is an identifier for the rendering environment(s) to be used to render the object. File format identification tools could be used to infer the rendering environment(s) for the files. For example, tools like DROID could be repurposed to identify patterns relating to creating applications, and from there the intended rendering environment(s) could be inferred (see the sketch after this list).

 

3.  Identifying the rendering environment would be much more important, and testing that environment at the point of transfer would also matter more. Doing this at the point of transfer would make any issues apparent immediately rather than putting them off to a later date. In theory it would make it easier to consult the original content owners to confirm decisions made (something that is harder to do each time a migration is conducted).

 

4.  Preservation planning would involve tracking systems architecture etc., not software “obsolescence”. That is, preservation planning would require ensuring that your emulation tools ran on your current host environment(s).

 

5.  Preservation actions would involve writing new emulation hosts to host the old virtual hardware or writing new emulators to run the old environment images. This could be a significant process but would be relatively rare and would only need to be done once per emulator (which might emulate many different architectures & hundreds or thousands of environments).

 

6.  Decisions about the content presented to users (e.g. as a result of migration or emulation) are made early in the preservation process (at point of transfer) as opposed to when a migration action is deemed necessary. 

7. Access to the digital original could be more complicated for the average user and various mechanisms may have to be put in place to overcome this. Providing basic instructions for interacting with each environment would be an initial step. Old software documentation could be digitised and made available. Old software manuals often assumed no knowledge of computers and could be repurposed for future users.  Interactive walk-through overlays could be added to the software (thanks to Jeff Rothenberg for suggesting this) leading users through the main steps necessary to interact with the objects (e.g. when mice no longer exist). Access to derivative versions may also be provided if required.
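To illustrate point 2 above, here is a minimal sketch of how format identifications (e.g. PRONOM PUIDs reported by DROID or FIDO) might be mapped onto rendering-environment identifiers held in the institution's inventory. The PUID-to-format pairings are illustrative and the environment IDs are entirely hypothetical.

# Hypothetical mapping from PRONOM format identifiers (PUIDs) to
# rendering-environment identifiers in the institution's inventory.
PUID_TO_ENVIRONMENTS = {
    "fmt/17": ["env-win98-acrobat4"],                      # PDF 1.3
    "fmt/40": ["env-win95-word97", "env-win98-word2000"],  # Word 97-2003 document
}

def candidate_environments(puid):
    """Infer candidate rendering environments from a format identification."""
    return PUID_TO_ENVIRONMENTS.get(puid, [])

# candidate_environments("fmt/40") -> ["env-win95-word97", "env-win98-word2000"]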


 

In general the steps involved in implementing a digital preservation strategy involving only emulation are quite different from those involved in implementing a migration strategy. Without solid examples of the practice of each, and metrics on costs and results, it is hard to say which would be more efficient.

 

I welcome comments and am very aware of the many gaps in this quite hurriedly written post. I chose to post this here rather than on the OPF or elsewhere because of its very raw nature, its speculative content and because I do not want it in any way associated with my kind employers.

 

EDIT:

I forgot an important point

The digital preservation institution would not necessarily have to hold copies of any or every environment. They would only need to have access to them, or to ensure that users could access them. Initially this may be possible with no work whatsoever. For example, the environment for a PDF file may be limited to any current version of Acrobat Reader that a user would likely have at home, running on any OS that supported it. In the future, if external emulation services were available, the preservation institution may only have to check that the particular environment was available, or request that it be configured and made available by the service provider. After that they may not need to actively do a lot beyond tracking the health of the service providers (and the usual bit-preservation routines).

Categories: Planet DigiPres

FIDO version 1.0.0 released

Fido Blog Feed - 27 February 2012 - 11:24pm

FIDO (Format Identification for Digital Objects) is a Python command line tool to identify the file formats of digital objects. A lot of improvements to the code and functionality have been made.

What's new?
  • Scanning of container objects
  • Signature update script
Scanning of container objects

FIDO now uses the PRONOM container signature file to determine the container (compound) type if the file is of container type "OLE2" (fmt/111) or "ZIP" (fmt/189 or x-fmt/263). If so, FIDO will perform a deep (complete) scan of the object.

A side effect of this deep scan is that FIDO sometimes detects objects which are embedded in the scanned object. So, if you cut'n'paste a PowerPoint slide into a Word document, FIDO might detect this. Tests revealed that this depends on the way the creating application (in this case MS Office 2003) structures the file while saving; this is not always straightforward and seems to differ on a per-file basis with only slight variations. Nice food for thought, I suppose!

Another side effect is that deep scanning can cause FIDO to slow down significantly with big container-type files. You can disable deep scanning by invoking FIDO with the '-nocontainer' argument. While disabling deep scan of containers speeds up identification, it may reduce accuracy.
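In a workflow script you might toggle deep scanning per file along the lines of the sketch below. It assumes FIDO is invoked as 'python fido.py'; adjust the command to match your installation.

import subprocess

def identify(path, deep_container_scan=True):
    """Run FIDO on a single file and return its CSV output.
    Assumes FIDO is invoked as 'python fido.py'; adjust to your installation."""
    cmd = ["python", "fido.py"]
    if not deep_container_scan:
        cmd.append("-nocontainer")  # skip the slow deep scan of OLE2/ZIP containers
    cmd.append(path)
    return subprocess.check_output(cmd, universal_newlines=True)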

Signature update script

FIDO now comes with an interactive CLI script to update the PRONOM signatures to the latest version. The script queries the PRONOM SOAP webservice for the latest version available and, if a newer version exists, downloads the latest signature file.

Please note that the script WILL NOT update the container signature file. The reason for this is that the PRONOM container signature file contains special types of sequences which need to be tested before FIDO can use them. If there is an update available for the PRONOM container signature file, it will be tested before it shows up in the next commit. However, if you are up for a challenge, you can always try and update the file yourself.

Other changes in this version

The functionality to convert and check the quality of PRONOM signatures has been taken out in this new version. These functions are now available through the prepare script which comes with the signature update script.

The future

FIDO should only do what it was intended for in the first place: Format Identification for Digital Objects. As an example, don't expect FIDO to ever validate JPEG2000 files; we have JPYLYZER for that. The OPF philosophy is to create small applications for separate kinds of digital preservation tasks. This way every application can be used standalone or, and here comes the best part, built into a workflow. In fact, at NANETH we already use FIDO in one of our ingest tools.

FIDO version 1.0 being published does not mean development stops. A next release will add support for running FIDO multithreaded for even better performance. A next step in container identification could be to unpack compound files and have foreign objects scanned separately. There has also been a request to provide an interface to the values of the inner functions, so that FIDO can be incorporated into larger Python frameworks more easily. By the way: it is already possible to invoke FIDO from your own scripts; read more here.

Try it and report!

If you already use DROID or other format identification tools on a regular basis, give FIDO a try and report back to us. We'd like to know what features are missing, if you use FIDO in a workflow or if you have found a bug. Even report back if you are just happy using FIDO.

DOWNLOAD NOW!

http://www.openplanetsfoundation.org/software/fido/

Preservation Topics: Preservation Actions, Identification, Tools, Software, Fido

Identification tools, an evaluation

Fido Blog Feed - 23 February 2012 - 9:09am
The Scape Characterisation Tool Testing Suite

This information has also been published in SCAPE Deliverable D9.1.

We have created a testing framework based on the Govdocs1 corpus from Digital Corpora (http://digitalcorpora.org/corpora/files), and are using the characterisation results from Forensic Innovations, Inc. (http://www.forensicinnovations.com/) as ground truths.

The framework we used for this evaluation can be found at https://github.com/openplanetsfoundation/Scape-Tool-Tester

All the tested tools use different identification schemes for file formats. As a common denominator, we have decided to use mimetypes. Mimetypes are not detailed enough to contain all the relevant information about a file format, but all the tested tools are capable of reducing their more complete results to mimetypes, which ensures a level playing field.

The ground truths and the corpus

The Govdocs1 corpus is a set of about 1 million files, freely available (http://digitalcorpora.org/corpora/files). Forensic Innovations, Inc. (http://www.forensicinnovations.com/) has kindly provided the ground truths for this testing framework, in the form of http://digitalcorpora.org/corp/files/govdocs1/groundtruth-fitools.zip. Unfortunately, they do not list a mimetype for each file, but rather a numeric ID, which seems to be vendor specific. They do, however, provide a mapping, http://www.forensicinnovations.com/formats-mime.html, which allows us to match IDs to mimetypes. The list is not complete, as they have not provided mimetypes for certain formats (which they claim do not have mimetypes). For the testing suite, we have chosen to disregard files that Forensic Innovations, Inc. does not provide mimetypes for, as they make up a very small part of the collection. The remaining files number 977,885.
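A rough sketch of deriving the per-file ground-truth mimetypes is shown below. It assumes, purely for illustration, that the ground truth and the ID-to-mimetype mapping have been flattened into CSV files with the column names used here.

import csv

def load_id_to_mime(mapping_csv):
    """Load the vendor's format-ID-to-mimetype mapping; formats without a
    mimetype are dropped, mirroring the exclusion described above."""
    id_to_mime = {}
    with open(mapping_csv) as f:
        for row in csv.DictReader(f):
            if row.get("mimetype"):
                id_to_mime[row["format_id"]] = row["mimetype"]
    return id_to_mime

def ground_truth_mimes(truth_csv, id_to_mime):
    """Map each corpus file to its ground-truth mimetype, skipping files
    whose format ID has no mimetype in the mapping."""
    truth = {}
    with open(truth_csv) as f:
        for row in csv.DictReader(f):
            mime = id_to_mime.get(row["format_id"])
            if mime:
                truth[row["filename"]] = mime
    return truth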

The reduced govdocs1 corpus contains files of 87 different formats. These are not evenly distributed, however. Some formats are only represented by a single file, while others make up close to 25% of the corpus.

To display the results, we have chosen to focus on the 20 most common file formats in the corpus, and to list the remaining 67 as the long tail, as these only make up 0.56% of the total number of files in the corpus.

[Figure: Format distribution in Govdocs]

One interesting characteristic of the ID-to-mime table from Forensic Innovations, Inc. is that each format has only one mimetype. In the real world, this is patently untrue. Many formats have several mimetypes, the best known example probably being text/xml and application/xml. To solve this problem, we have introduced the mimetype-equivalent list, which amends the ground truths with additional mimetypes for certain formats. It should be noted that this list has been constructed by hand, simply by looking at the results of the characterisation tools. Any result that does not match the ground truth is recorded as an error, but later inspection of the logs has allowed us to pick up the results that should not have been errors, but rather alias results.
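In scoring terms, the mimetype-equivalent list works roughly as sketched below; the alias table contains only the example from the text and would be extended by hand as further aliases are spotted in the logs.

# Hand-maintained alias table; text/xml vs application/xml is the example above.
MIME_EQUIVALENTS = {
    "application/xml": {"text/xml"},
    "text/xml": {"application/xml"},
}

def matches_ground_truth(reported, truth):
    """A tool's answer counts as correct if it equals the ground-truth mimetype
    or one of its registered equivalents."""
    return reported == truth or reported in MIME_EQUIVALENTS.get(truth, set())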

The test iterator

We have endeavoured to use the tools in a production-like way for benchmarking purposes. This means that we have attempted to use the tools’ own built-in recursion features, to avoid redundant program startups (most relevant for the Java-based tools). Likewise, we have, where possible, disabled those parts of the tools that are not needed for format identification (most relevant for Tika). We have hidden the filenames from the tools (by simply renaming the data files), in order to test their format identification capabilities without recourse to file extensions.
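Hiding the filenames amounts to nothing more than renaming every file to an opaque, extension-free name before the run, along these lines:

import os
import uuid

def hide_filenames(root):
    """Rename every file under 'root' to an opaque name with no extension, so
    the tools cannot fall back on filenames; returns a new-to-old name map."""
    mapping = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            new_name = uuid.uuid4().hex
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, new_name)
            os.rename(old_path, new_path)
            mapping[new_path] = old_path
    return mapping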

Versions

Tika: 1.0 release

Droid: 6.0 release, Signature version 45

Fido: 0.9.6 release

Tika – a special note

For this test, Tika has been used as a Java library, wrapped in a specialised Java program (https://github.com/blekinge/Tika-identification-Wrapper). This way we can ensure that only the relevant parts of Tika are invoked (i.e. identification) and not the considerably slower metadata extraction parts. By letting Java, rather than the test framework, handle the iteration over the files in the archive, we have also been able to measure performance in a realistic mass-processing situation, rather than incurring the large overhead of starting the JVM for each file.

Results

We have tested how precisely the tools have been able to produce results to match the ground truths. As stated, we have focused on the 20 most common formats in the corpus, and bundled the remainder into a bar called the Long Tail.

Precision

As can be seen from this graph, Tika generally performs best for the 20 most common formats. Especially for text files (text/plain), it is the only tested tool that correctly identifies the files. For office files, especially Excel and PowerPoint, Droid seems to be more precise. Tika is almost as precise, but Fido loses greatly here. Given that Fido is based on the Droid signatures, it may be surprising that it seems to outperform Droid for certain formats, but it clearly does so for PDF, PostScript and Rich Text Format. The authors will not speculate on why this is so.

Comma/tab separated files are fairly common in the corpus. Tika cannot detect this feature of the files and recognizes them as text/plain. Fido and Droid fail to identify these files, just as they do for text/plain files.

The dBase files, a somewhat important part of the corpus, are not detected by any of the tools.

Only Tika identifies any files as rfc2822, and even then it misses a lot. All three tools are equally bad at identifying sgml files.

Interestingly, Droid and Fido seem to work much better than Tika on the long tail of formats.

The Long tail

We feel that the long tail of formats is worth looking more closely at.

 

[Figure: The long tail]

In this table, we have removed any format where none of the tools managed to identify any files. So this table shows the differing coverage of the tools. We see that it is not just different levels of precision that matter, but also which formats are supported by which tools.

Droid and Fido support the FITS image format; Tika does not. Tika, however, supports the OpenXML document format, which Fido and Droid do not.

The application/pdf and application/xml entries here cover some rather odd files (otherwise the ground truths would have marked them as normal PDFs or XMLs). Here Tika is worse than the other tools. Tika, however, is able to recognize RDF, as shown by the application/rdf+xml format.

It is clear that while the overall precision in the long tail is almost equivalent for the three tools, the coverage differs greatly. If Tika, for example, gained support for the FITS image format, it would outperform Droid and Fido on the long tail. Droid and Fido, however, would score much higher if they gained Tika's support for Microsoft OpenXML documents.

The speed of the tools

For production use of these tools, not just the precision but also the performance of the tools is critical. For each tool we timed the execution, to show the absolute time in which the tool is able to parse the archive. Of course, getting precise numbers here is difficult, as keeping an execution totally free of delays is almost impossible on modern computer systems.

We ran each of the tools on a Dell PowerEdge M160 blade server with two Intel Xeon X5670 CPUs @ 2.93 GHz. The server had 70 GB of RAM, in the form of 1333 MHz dual-ranked LV RDIMMs.

The corpus was stored on a file server and accessed through a mounted Network File System (NFS) over a Gigabit network interface.

Each of the tools was allowed to run as the only significant process on the given machine, but we could not ensure that no delays were caused by the network, as it was shared with other processes in the organisation.

[Figure: Speed test]

To establish baselines, we have added two additional "tools": the Unix file tool and the md5 tool.

The Unix file tool checks file headers against a database of signatures. Any tool significantly faster than file must have been able to identify files without reading their contents, which would probably mean relying on filenames. Tika seems to be faster, but such small differences are covered by the uncertainties in the system.

md5 is not a characterisation tool; rather, it is a checksumming tool. To checksum a file, the tool needs to read the entire file. For the system in question the actual checksum calculation is negligible, so md5 gives a baseline for reading the entire archive.
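That baseline can be reproduced with a few lines of Python; since the hashing itself is negligible, the elapsed time essentially measures how long it takes to read the whole corpus.

import hashlib
import os
import time

def md5_walk(root):
    """Checksum every file under 'root' and return the elapsed time in seconds."""
    start = time.time()
    for dirpath, _, names in os.walk(root):
        for name in names:
            digest = hashlib.md5()
            with open(os.path.join(dirpath, name), "rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
    return time.time() - start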

As can be seen, Tika is the fastest of the tools, and Fido is the slowest. That the showdown would be between Tika and Droid was expected: Python does not have a just-in-time compiler and cannot compete with Java for such long-running processes. That Fido was even slower than md5 came as a surprise, but again, md5 is written in very optimised C, and Fido is still Python.

Preservation Topics: Identification, Web Archiving, Corpora, SCAPE, Fido. Attachments: Format distribution in Govdocs (17.48 KB), Tool precision (19.12 KB), The long tail (17.43 KB), Tool speed (7.17 KB)

New FIDO version: 0.9.6

Fido Blog Feed - 4 October 2011 - 1:50pm

The new FIDO (Format Identification for Digital Objects) is here, version 0.9.6.

Improvements:

  • reports if match is based on signature, extension or no match (fail)
  • reports if file is empty (to stderr)
  • reporting of mime-types fixed (special thanks to Derek Higgins)
  • shows help upon invocation without arguments
  • PDF signatures updated from PRONOM files (previously FIDO failed to recognize some PDF versions)
  • extra information available in output via matchprintf: file format version, alias, Apple UTI, group index and group size (in case of multiple -tentative- hits) and current file count


Changes:

  • extension switch removed, this is a builtin default now
  • mime-types added to standard match output
  • match type added to standard match output
  • STDOUT/STDERR printing is now backward/forward compatible with old and future Python versions
  • Windows installer and site-package installer removed due to incompatibility problems


Additionally there is a new script 'to_xml.py' which converts FIDO's csv output to XML. This script also reports the FIDO version and PRONOM signature version. You can pipe FIDO's output to this script while it runs or use it afterwards to convert the CSV output file. More information on how to invoke this converter can be found in the script. Please note that the XML template in this script is only compatible with the default matchprintf output, but you are free to change this template yourself if needed.


Next tasks on the list are cleaning up the code, creating a Pythonic easy installer, having FIDO recognize compound documents better and improving the Prepare script (to generate FIDO-compatible signatures). Please consult the FIDO JIRA for more information on these subjects.


You can pull the new version via https://github.com/openplanets/fido or download the zip directly: https://github.com/openplanets/fido/zipball/master


If you find any bugs or have any questions or requests, please submit them to the FIDO JIRA:

http://jira.opf-labs.org/browse/FIDO

Preservation Topics: Identification, Tools, Open Planets Foundation, Fido

PDF, File Formatting and Creating Applications

Digital Continuity Blog - 12 July 2011 - 1:00am
I often go on about how software applications diverge from, or uniquely implement, the documented standards for formatting the files they create. Some software vendors are aware of, and document, these deviations. Adobe is a good example of this, in particular with its implementation of extensions when writing files that adhere to the PDF version 1.7 formatting standard. Below is an extract from the Wikipedia entry on PDF, with my emphasis in italics at the bottom:

"Adobe’s PDF specifications

Adobe changed the PDF specification several times and continues to develop new specifications with new versions of Adobe Acrobat. There have been nine versions of PDF with corresponding Acrobat releases:[10]

  • (1993) – PDF 1.0 / Acrobat 1.0
  • (1994) – PDF 1.1 / Acrobat 2.0
  • (1996) – PDF 1.2 / Acrobat 3.0
  • (1999) – PDF 1.3 / Acrobat 4.0
  • (2001) – PDF 1.4 / Acrobat 5.0
  • (2003) – PDF 1.5 / Acrobat 6.0
  • (2005) – PDF 1.6 / Acrobat 7.0
  • (2006) – PDF 1.7 / Acrobat 8.0
  • (2008) – PDF 1.7, Adobe Extension Level 3 / Acrobat 9.0
  • (2009) – PDF 1.7, Adobe Extension Level 5 / Acrobat 9.1

The ISO standard ISO 32000-1:2008 is equivalent to Adobe’s PDF 1.7. Adobe declared that it is not producing a PDF 1.8 Reference. The future versions of the PDF Specification will be produced by ISO technical committees. However, Adobe published documents specifying what extended features for PDF, beyond ISO 32000-1 (PDF 1.7), are supported in its newly released products. This makes use of the extensibility features of PDF as documented in ISO 32000-1 in Annex E. Adobe declared all extended features in Adobe Extension Level 3 and 5 have been accepted for a new proposal of ISO 32000-2 (a.k.a. PDF 2.0).[11]

The specifications for PDF are backward inclusive. The PDF 1.7 specification includes all of the functionality previously documented in the Adobe PDF Specifications for versions 1.0 through 1.6. Where Adobe removed certain features of PDF from their standard, they too are not contained in ISO 32000-1.[1]

PDF documents conforming to ISO 32000-1 carry the PDF version number 1.7. Documents containing Adobe extended features still carry the PDF base version number 1.7 but also contain an indication of which extension was followed during document creation.

I added the emphasis to make a point. For understanding how files are internally structured, it is not always enough just to know the formatting standard adhered to when the files were created (e.g. PDF version 1.7). Sometimes we need more information about how the particular application chose to interpret or, as in the example above, implement the standard. In many cases this information could be captured simply by knowing what the creating application was. This is implicitly acknowledged in the Wikipedia extract by the pairing of an Acrobat release with each PDF version in the list above.
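As a crude illustration of why the creating application matters here: a PDF's header only gives the base version ('%PDF-1.7'), while any Adobe extension level is declared separately in the document's developer-extensions dictionary. The sketch below just scans the raw bytes for those two markers; a real inspector would parse the catalog properly, since a raw scan can miss markers stored inside compressed object streams.

import re

def pdf_version_and_extension_level(path):
    """Return (header_version, extension_level) for a PDF file, where the
    extension level is the first /ExtensionLevel value found, if any."""
    with open(path, "rb") as f:
        data = f.read()
    header = re.match(rb"%PDF-(\d\.\d)", data)
    ext = re.search(rb"/ExtensionLevel\s+(\d+)", data)
    return (header.group(1).decode("ascii") if header else None,
            int(ext.group(1)) if ext else None)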


Categories: Planet DigiPres

Clarifying Migration vs Emulation (+ some conjectures)

Digital Continuity Blog - 11 July 2011 - 2:18am

Disclaimer:

This post and all others on this blog are my personal thoughts and opinions and are not necessarily those of any organisation I work for or have worked for.

Now to the post.

Firstly, the clarification:

If we assume that “the aim of digital preservation is to maintain our (the preserving organisation’s) ability to render digital objects over time”.

Then this means that digital objects become at risk when there is potential for them not to be rendered by us at a point in the future, and digital objects become issues when they can’t be rendered by us.

Maintaining the ability to render digital objects means maintaining access to a software environment that can render the objects. In other words this means we have to have at least one copy of the software and dependencies that are needed to render the objects. 

In order to mitigate the risk that objects won’t be renderable we have at least two options:

1. migrate content from files that make up the objects to other files that can be rendered in environments that we currently support. 

2. maintain access to environments indefinitely using emulation/virtualization.

So there is the clarification. Now some conjectures regarding it:

  1. For any reasonably sized volume of digital objects that require the same rendering environment, it may be simpler and cheaper to just continue to maintain access to one environment by emulating or virtualizing it. All this takes is the ability for somebody to install the required software in a virtual/emulated machine and for that machine image to continue to be renderable by emulation/virtualisation software in the future.
  2. Maintaining one copy of a compatible environment suffices for preservation purposes as it enables us to say we have preserved the objects, but it is probably not good enough for access. There are reasons why we should provide viewers for digital objects, and also reasons why we should try to make sure users can access objects using their own modern/contemporary software. For these reasons we may also have to perform migration where it is cheap/fundable, and provide access to the preservation master through reading rooms (either physical or virtual) in which we can restrict the number of concurrent users to as many as we have licenses for the emulated environments.
Categories: Planet DigiPres

Emulation Workbench for Digital Object Format Analysis

Digital Continuity Blog - 7 July 2011 - 2:52am

As part of on-going research I have recently been working a lot with emulated desktop environments. 

One of the somewhat surprising things to come out of this work has been the realisation that having a set of emulated desktops with various old applications installed on them (an emulation workbench) is a really valuable tool for digital preservation practitioners.

When faced with a digital object of unknown format that DROID, JHOVE, etc. cannot identify, one of the most useful approaches I have found for discovering the format of the object is to try opening it in a number of applications of roughly the same era. Often applications will suggest an open parameter to use when opening a file, e.g.:

[screenshot: an application suggesting an open parameter for an unidentified file]

Or they may obviously produce errors when opening a file e.g:

[screenshot: an application reporting errors when opening an unidentified file]

Both of which can be useful for understanding the types of objects you are dealing with. 

Some applications specify explicitly that they are converting an object from one format to another, implying that the application decided that the object was of the first format. 

Admittedly this approach can be time-consuming. But if you have a set of files that you think are the same type, it may be worthwhile spending the time attempting to open the files in different applications. Also, with some research it may be possible to automate this process so that an object can be automatically opened in a range of applications from its era and the results automatically analysed, either to see which gave the fewest errors or to check whether the conversion messages from all the applications agree on the original format. Jay Gattuso has discussed something similar here.
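A skeleton of what that automation might look like is sketched below. The launcher commands are placeholders for whatever mechanism actually injects a file into an emulated desktop and opens it in a given application, which is the genuinely hard part.

import subprocess

# Placeholder launcher commands; each would really have to start an emulated
# desktop, mount the file and open it in the named application.
ERA_APPLICATIONS = {
    "WordPerfect 5.2 (Win 3.11)": ["open-in-emulated-env", "wp52", "{file}"],
    "MS Word 6.0c (Win 3.11)": ["open-in-emulated-env", "word6", "{file}"],
}

def try_open_everywhere(path):
    """Try to open one unidentified file in each era application and record
    the exit code; fewer errors hint at the intended format."""
    results = {}
    for app, template in ERA_APPLICATIONS.items():
        cmd = [part.format(file=path) for part in template]
        results[app] = subprocess.call(cmd)
    return results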

Given the obsolescence of hardware, and the difficulty of setting up old hardware, this use case highlights the need for a set of emulated desktops for digital preservation practitioners to add to their tool-set. Such a tool-set or “workbench” would be extremely helpful for adding to format databases such as PRONOM and UDFR.

Comments appreciated via @euanc on twitter

Categories: Planet DigiPres

Mining Application Documentation for File Format Intelligence

Digital Continuity Blog - 7 July 2011 - 1:53am

I’ve been working on an application and installed environment database. 

As part of this I have been documenting the save-as, open, export and import parameters (options) for many business applications. 

For example, the following are the open parameters available for Lotus 1-2-3 97 edition installed on Windows 95:

ANSI Metafile (CGM)
Bitmap (BMP)
dBase (DBF)
Excel (XLS;XLT;XLW)
Lotus 1-2-3 PIC (PIC)
Lotus 1-2-3 SmartMaster Template (12M)
Lotus 1-2-3 Workbook (123;WK*)
Paradox (DB)
Quattro Pro (WQ1;WB1;WB2)
Text (TXT;PRN;CXV;DAT;OUT;ASC)
Windows Metafile (WMF)

Recently I realised that this might be a good source for intelligence about file formats. Let me explain what I mean.

Different applications differentiate in different ways between versions of file formats in their open and save-as parameters. The logic behind this differentiation can perhaps be analysed to discover when format variants are significant and when they are not.

For example, Microsoft Word Version 6.0c (running on Windows 3.11) has the following open parameters for Word for MS-DOS files:

Word for MS-DOS 3.x - 5.x
Word for MS-DOS 6.0

In contrast to this WordPerfect 5.2 for Windows (running on Windows 3.11) has these open parameters:

MS Word 4.0; 5.0 or 5.5
MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b

Of which the first may be referring to MS-DOS versions.

Lotus Word Pro 96 Edition for Windows (running on Windows 3.11) has the following open parameter for Word for MS-DOS files:

MS Word for DOS 3;4;5;6 (*.doc)

And Corel WordPerfect Version 6.1 for Windows (running on Windows 3.11) has these open parameters:

MS Word for Windows 1.0; 1.1 or 1.1a
MS Word for Windows 2.0; 2.0a; 2.0b; 2.0c
MS Word for Windows 6.0

None of which refer to any MS-DOS variants.

This pattern continues through more recent variants of each office suite.

The interesting finding from this is that the Microsoft suites differentiate between versions 3, 4 and 5 (as a group) and version 6, but not between versions 3, 4 and 5 themselves, while the other suites (when they have a relevant parameter) do not differentiate between any of 3, 4, 5 or 6.

If every office suite differentiated between the variants in the same way, this would indicate that there were significant differences between them. However, as they don't, the evidence is inconclusive in this case.

As Microsoft wrote the standards in this example, their suites ought to have the most reliable information, and it may therefore be sensible to conclude that version 6 is significantly different from versions 3, 4 and 5.

This pattern also holds for save-as parameters. The Microsoft suites differentiate between version 6 and the group of versions 3, 4 and 5, whereas the other suites don't differentiate this way.

As the database becomes more populated, more analysis will be possible. Where there is general agreement in both open and save-as parameters across multiple applications, this will give digital preservation practitioners very good reason to believe that there are significant differences between the formats in question.
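The kind of agreement check I have in mind might look like the sketch below, using the Word for MS-DOS groupings listed above; the groupings are transcribed from the open parameters and the code is only illustrative.

# Version groupings implied by each suite's open parameters for
# Word for MS-DOS files, transcribed from the lists above.
OPEN_GROUPINGS = {
    "MS Word 6.0c": [{"3", "4", "5"}, {"6"}],
    "Lotus Word Pro 96": [{"3", "4", "5", "6"}],
    "Corel WordPerfect 6.1": [],  # offers no Word for MS-DOS parameter at all
}

def suites_that_split(a, b):
    """Return the suites whose parameters place versions a and b in different
    groups, i.e. treat them as significantly different variants."""
    return [suite for suite, groups in OPEN_GROUPINGS.items()
            if any(a in g and b not in g for g in groups)
            and any(b in g and a not in g for g in groups)]

# suites_that_split("5", "6") -> ["MS Word 6.0c"]: only Microsoft draws that line.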

I am carefully suggesting that these findings only give us reason to believe that there are differences. There may not actually be differences. Just because particular applications allow users to differentiate between these parameters/file formatting options does not mean that the applications themselves actually do. It may, for example, be a marketing tool to enable the vendor of the product to state that the tool is “compatible with many formats”, even though it may use the same code to open them all.

Hopefully, finding similar differences across many vendors' tools will enable us to mitigate this issue, but it should be noted that this approach does not provide definitive results.

Comments would be appreciated via twitter @euanc

Categories: Planet DigiPres

Permanent (Digital) Preservation

Digital Continuity Blog - 20 June 2011 - 3:47am

Digital preservation practitioners often talk about digital preservation actions that they are planning or thinking about doing (rarely do they talk about ones they have actually conducted, but that's another post).

Unfortunately I have found that, when questioned about potential issues with their approaches, digital preservation practitioners often fall back on saying either:

"well, we are keeping the originals as well",

or

"well, we are also doing ‘x’",

both of which are really unsatisfying replies.

It has led me to conclude that we need a new term (or a newly-redefined way of using an old term): “Permanent Preservation”.

Permanent preservation means actions that are intended to be the actual, applied solution for digital preservation and which have the trust and approval of the organisation involved. Permanent preservation actions are those whose outcome the organisation trusts and whose authenticity it is willing to defend.

  • Permanent preservation using migration

Any migration action that an organisation is not willing to defend to the extent that they will dispose of the original files should not be considered a permanent preservation action.

  • Permanent preservation using emulation

Any emulation solution that an organisation is not willing to defend to the extent that they will not perform any other (non-bit-stream) preservation actions on objects that rely on the solution should not be considered a permanent preservation action.

Under this understanding of permanent preservation, migration for access is not a permanent preservation action as it is not intended to be a digital preservation solution and will generally involve retaining the original. 

If we use this term in the way outlined above, then when practitioners talk about digital preservation approaches they can differentiate between those that are permanent and those that are not (yet) permanent and not (yet) worthy of our trust.

For Archives in particular (keepers of evidential records), trust and authenticity are key to their very business, so all preservation solutions should have the potential to be permanent.

Of course for now we may not have any possibly permanent preservation actions. But we should also use the above definition to distinguish between those with the potential for permanence and those without. 

Some preservation actions will never be able to be trusted without extensive and costly ongoing/long-term manual checking. Others may be able to be trusted with minimal (and therefore inexpensive) ongoing/long-term manual checking. Given that all digital preservation actions currently involve a degree of up-front cost, those that may be able to be trusted at some point in the future are arguably worth more upfront investment than those that never will be able to be trusted or those that won’t be able to be trusted without significant ongoing or long-term cost.


Categories: Planet DigiPres

Fido in the jar

Fido Blog Feed - 3 March 2011 - 4:37pm

Open Planets Foundation is proud to present: Fido.jar, a Java port of the Python version of Fido (Format Identification for Digital Objects). This first version runs on all platforms with Java 6 update 23 or later installed.

We would like you to give this first Fido in a jar a try. If you encounter any bugs, please submit them to the OPF Labs Jira. Installation and usage instructions are included in the zipfile.

Download Fido.jar @ Github:

https://github.com/downloads/openplanets/fido/fido_jar-0.9.5.zip

Preservation Topics: Identification, Tools, Fido

Fido – a high performance format identifier for digital objects

Fido Blog Feed - 3 November 2010 - 7:57am

Fido is a simple format identification tool for digital objects that uses Pronom signatures. It converts signatures into regular expressions and applies them directly. Fido is free, Apache 2.0 licensed, easy to install, and runs on Windows and Linux.  Most importantly, Fido is very fast.
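To give a flavour of the approach, here is a much-simplified sketch of turning a hex byte sequence into a compiled regular expression and applying it to the start of a file. Real PRONOM sequences also carry offsets, ranges and alternatives, which Fido handles and this sketch does not.

import re

def hex_signature_to_regex(sig):
    """Compile a simplified PRONOM-style hex sequence into a bytes regex.
    Only literal hex pairs and the '??' single-byte wildcard are handled."""
    parts = []
    for i in range(0, len(sig), 2):
        pair = sig[i:i + 2]
        parts.append(b"." if pair == "??" else re.escape(bytes([int(pair, 16)])))
    return re.compile(b"".join(parts), re.DOTALL)

# 25504446 is hex for '%PDF', the magic bytes at the start of a PDF file.
PDF_SIGNATURE = hex_signature_to_regex("25504446")

def looks_like_pdf(path, window=1024):
    with open(path, "rb") as f:
        return PDF_SIGNATURE.search(f.read(window)) is not None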

In a subsequent post, I’ll describe the implementation in more detail.  For the moment, I would just like to highlight that the implementation was done by a rusty programmer in the evenings during October.  The core is a couple of hundred lines of code in three files.  It is shorter than these blog posts!

I was stunned by Fido’s performance.  Its memory usage is very small.  Under XP, it consumes less than 5MB whether it identifies 5 files or 5000 files.

I have benchmarked Fido 0.7.1 under Python 2.6 on a Dell D630 laptop with a 2 GHz Intel Core Duo processor under Windows XP. In this configuration, Fido chews through a mixed collection of about 5000 files on an external USB drive at the rate of 60 files per second.

As a point of comparison, I also benchmarked the file (cygwin 5.0.4 implementation) command in the same environment against the same set of 5000 files.  File does a job similar to Droid or Fido – it identifies types of files, but more from the perspective of the Unix system administrator than a preservation expert (e.g., it is very good about compiled programmes, but not so good about types of Office documents).  I invoked file as follows:

       time find . -type f | file -k -i -f - > file.out

This reports 1m24s or 84 seconds.  I compared this against:

       time python -m fido.run -q -r . > fido.csv

This reports 1m18s or 78 seconds.

In my benchmark environment, Fido 0.7.1 is about the same speed as file.  This is an absolute shock.  Neither Fido nor the Pronom signature patterns have been optimised, whereas file is a mature and well established tool.  Memory usage is rock solid and tiny for both Fido and file.

Meanwhile, Maurice de Rooij at the National Archives of the Netherlands has done his own benchmarking of Fido 0.7.1 in a setting that is more reflective of a production environment (Machine: Ubuntu 10.10 Server running on Oracle VirtualBox; CPU: Intel Core Duo CPU E7500 @ 2.93 GHz (1 of 2 CPU's used in virtual setup);  RAM: 1 GB).  He observed Fido devour a collection of about 34000 files at a rate of 230 files per second.

Fido’s speed comes from the mature and highly optimised libraries for regular expression matching and file I/O – not clever coding.

For me, performance in this range is a surprise, a relief, and an important step forward.  It means that we can include precise file format identification into automated workflows that deal with large-scale digital collections.  A rate of 200 files per second is equivalent to 17.28 million files in a day – on a single processor. Fido 0.7 is already fast enough for most current collections.

Good quality format identification along with a registry of standard format identifiers is an important element for any digital archive.  Now that we have the overall performance that we need, I believe that the next step is to correct, optimise, and extend the Pronom format information.

Fido is available under the Apache 2.0 Open Source License and is hosted by GitHub at http://github.com/openplanets/fido. It is easy to install and runs on Windows and Linux.  It is still beta code – we welcome your comments, feedback, ideas,  bug reports - and contributions!

Preservation Topics: Identification, Fido

The cost of providing the Google box

Digital Continuity Blog - 28 October 2010 - 11:56pm

After a discussion about the cost of digital preservation the other day I thought I would try to do a quick and dirty estimate of the cost of providing the “simple”  Google search box:   

(source: http://investor.google.com/pdf/2010Q3_google_earnings_slides.pdf)      

Google took in US$7,286,000,000 in revenues in the 3rd quarter of this year. Of that, 65% was Costs and Expenses, including (in millions of US$):

  • Cost of Revenues: $2,552 (35% of revenues). This is the amount it cost to provide the services that gained the revenues (I think).
  • Research & Development: $994 (14% of revenues). Cost of ongoing R&D.
  • Sales & Marketing: $661 (9% of revenues). Many people think we need more of this, and we can't really not count it as a cost.
  • General & Administrative: $532 (7% of revenues). Unclear what this means, but it seems reasonable.
  • Total Costs & Expenses: $4,739 (65% of revenues), i.e. US$4.739 billion!

US$4,833 million in revenues came from Google's search service/advertising. That is 66% of its revenues. Assuming it spends 66% of its costs to earn those revenues, we can multiply the total costs by 66% and get a vague notion of how much it spends on its search:

 

US$4,739 million x 66% ≈ US$3,130 million
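The same back-of-the-envelope sum in a few lines of Python, using the figures quoted above (all in millions of US dollars):

# Figures from Google's Q3 2010 earnings slides, in millions of US dollars.
revenues = 7286
total_costs = 4739
search_revenues = 4833

search_share = search_revenues / revenues           # roughly 0.66
search_cost_estimate = total_costs * search_share   # about 3,140 million
print("%.0f%% of revenues -> about $%.0f million per quarter"
      % (search_share * 100, search_cost_estimate))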

So next time someone asks for a Google-like solution, ask them for roughly 3.1 billion US dollars per quarter.

(Admittedly this analysis has gaping holes, but it's kinda fun to think about. We probably don't need to index all of the internet's information for any particular solution, and this figure probably includes the cost of the advertising infrastructure that gathers the revenues for Google, but I suspect the start-up cost to get something close to Google's power would still be astronomical.)

Categories: Planet DigiPres

Preserving digital calendars

Digital Continuity Blog - 17 June 2010 - 1:25am

How do you preserve an outlook calendar? It would be quite a resource for future researchers.

Categories: Planet DigiPres

Memory Forever

Digital Continuity Blog - 18 March 2010 - 1:22am

Gizmodo are running a great series of posts on digital continuity issues:

http://gizmodo.com/tag/memoryforever

enjoy!

Categories: Planet DigiPres

In the beginning

Digital Continuity Blog - 15 September 2009 - 10:14pm

Hi and welcome to anyone who has stumbled upon this blog!

My name is Euan Cochrane and I’m a digital preservation professional based in Wellington, New Zealand. I intend to use this blog to talk about issues related to digital continuity generally, including issues and news around digital preservation, metadata and related technologies such as XML and RDF.

It may take me a little while to get this started so please bear with me. I don’t want to officially launch the blog until I have prepared it more fully and am ready to regularly post.

I look forward to starting a dialogue with you all.

Euan

Categories: Planet DigiPres