Project to Identify files with linked dependencies

Many office suites and other applications allow the embedding of information in them via a link to another file. The use of linked spreadsheets is common amonst data intensive agencies and large documents are often managed through linking multiple office documents to form a single final product.

Currently we have only anecdotal evidence as to the prevalence of linked files in the digital universe. It would be really useful to be able to understand the scale of the issue and identify the prevalence of linked files in the material that we ingest. Archives New Zealand and Victoria University have recently intiated a project that we hope will go some way towards achieving this.

A student from the School of Engineering and Computer Science at Victoria University recently started work on a new summer project at Archives New Zealand. The student, Niklas Rehfeld, is funded through a summer scholarship jointly provided by Archives New Zealand and Victoria university.

Over the next 10 weeks Niklas will be working on a project to investigate linked files and build a tool to identify them. Specifically, the aim of this project is to develop a prototype tool to identify when computer files formatted in the Microsoft Office 1997-2003 formats link to other computer files and which files they link to (in order to identify the component files that make up the complex digital object).

The technical work will involve the following:

Analysis of the Microsoft specifications to determine how document linking and other metadata that maybe of use for preservation purposes is implemented for Word, Excel and Powerpoint documents for the period 1997-2003.
Review of existing frameworks and related tools such as the open source “format identification, validation, and characterization” tool JHOVE.
Writing a specification for a modular tool for identifying linked documents given a root Microsoft Office document. As part of the specificion will be an evaluation of the feasibility of extending an existing tool versus creating a standalone implementation from scratch.
Implementation of a prototype tool for at least one document format. Time permitting, the tool will be extended either to handle a wider range of document formats or a wider range of preservation metadata.
Testing of the tool against a selection of files supplied by National Archives.

This project is a research project first and foremost. There is no guarantee that a working tool will be produced from it. However if a useful tool is produced the intention is to release it as an open source product that anyone can incorporate into their preservation workflows.

Niklas’s first steps will include looking at the various tools that are out there that may be able to be extended to perform the function outlined above. These include JHOVE, JHOVE 2, DROID and the National Library of New Zealand metadata extraction tool. In addition he will be investigating the available java libraries that he may be able to use for this purpose.

If anyone in the OPF community has any advice on best places to start with this project or any other advice they would like to offer we would greatly appreciate it. We have a number of good leads already but would appreciate any help the community could offer.

Preservation Topics:

Format Registry

Identification

Characterisation

Representation Information

Corpora

Tools

Preservation Risks

Submitted by Euan Cochrane on 21 November 2011 – 4:23am

Comments

apache poi

In Droid we used the Apache poi library to process office documents. We actually only used the poifs component (the part which reads ole2 files as a file system) which works very well. There are also more detailed apis to deal with more advanced concepts in office files.

Submitted by Matt Palmer on 21 November 2011 – 9:29am Permalink

Apache poi Library

Thanks for this suggestion Matt.

One of Niklas’s first steps was to look into the libraries used by tools such as DROID. Your message confirmed that the Apache poi library was one of the best options for use in this project.

Submitted by Euan Cochrane on 22 November 2011 – 3:27am Permalink

Related AQuA code

As part of the AQuA project, I started writing a basic Office binary-format analyser using Apache POI. The aim there was to identify the version of Office/Word/whatever that created a file, but I had difficulty determining whether the supplied version identifier denoted the authoring version or the compatability version. e.g. if you Save As 2003-2007 or whatever from Office 2010, can you tell both the source app. version (2010) and the target compatability version (2003-2007) from the file alone?

I then started exploring the POI API to extract further properties, e.g. lists of fonts (declared rather than used, I think). Feel free to fork/copy the code if it helps.

See https://github.com/openplanets/AQuA/tree/master/office-analyser

Submitted by Andy Jackson on 22 November 2011 – 12:32pm Permalink

Implications of Complex Objects on View Paths

The findings of this investigation might directly influence the concept of view paths. View paths or pathways, depending on which literature you are referring too, are a formalization of the sets of actual technical meta data to compute the requirements for accessing, rendering or running a specific digital object of a certain type. For simple objects like self containing document files the view path usually seen as a linear vector from the object of interest into the actual working environment (software, hardware combination) of the user. The object type determines the rendering application (more or less). Then a number of additional software (secondary) objects like operating systems, helper programs and drivers are to be taken into account.

If the objects get more complex the view path might have to reflect this as there could be additional applications or components be required, e.g. to render a movie that is linked into a presentation document like Power Point. In such cases the traditional view path model needs to be more versatile to deal with such complex objects as the assumed linearity is not longer sufficient. Additional applications or other components have to be considered to render all components of the object. This might require a much more complex setup of a suitable environment.

The problem is not limited to traditional linked documents but manifests in other scenarios like content management systems (CMS) too. CMSes consist of several well identifyable sub-objects like the database and the webserver components plus other helper tools. To preserve such a system outside its original environment, if e.g. direct system imaging should be avoided for some reason, a number of different components are required.

Submitted by Dirk von Suchodoletz on 21 November 2011 – 8:42pm Permalink

View paths

Thanks for the comments Dirk.

Identifying “complex objects” (for want of a better term), that include multiple files that each require different rendering applications, does seem like it will lead to more complex view paths. However it should be feasible, in most cases, to use the current mechanisms to identify the right applications to render each individual file in a complex object and then bundle/provide them all within the same emulated environment.

Problems may arise when there are multiple applications that render one dependent object but only one of those is compatible with the rendering application of the primary file within the complex object. For example If a .doc file has a .xls spread sheet embedded in it via a link and it is opened with Microsoft Word on a system that has OpenOffice.org Calc installed on it and not Microsoft Excel then this may mean that the embedded spread sheet is unable to be automatically opened within the Microsoft Word document interface as it was originally able to be. This may mean that the user is unable to know which data from the spread sheet was presented to the user through the document interface.

I suspect this example does not actually exist as in that particular case there is often a “screenshot” added to the .doc file showing the last known view of the data from the spread sheet as it was when the document was saved. However there may be other examples where such complexity causes problems.

There is certainly complexity to be found here but that doesn’t make it complicated. The first step to solving these problems will be to identify the scope of the issue. I hope that this project will result in a mechanism to help us take this first step.

Submitted by Euan Cochrane on 22 November 2011 – 3:46am Permalink

Tracing dependencies…

As we discussed at iPres, it would be possible in principle to reveal these dependencies using my kernel tracing technique (click here for pre-print), but unfortunately, there is no good system call tracing tool for Windows, as far as I can tell.

However, the STraceNT codebase is very close to what we need, and if someone wants to take a look at improving it I’d be glad to help steer things along. We would need to understand the Windows system call API well enough to know which calls correspond to opening and reading files or spawning child processes, and then add code to report the file names involved.

Any Windows hackers out there want to give it a go?

Submitted by Andy Jackson on 22 November 2011 – 11:27am Permalink

Re:tracing dependencies

This is a very interesting approach. However, in our case we would have to create some sort of webservice to analyse “Windows-only” files since almost all our stuff is running on Linux. Here is an interesting article I have found on this subject: http://www.codeproject.com/KB/system/hooksys.aspx

Submitted by Maurice de Rooij on 23 November 2011 – 3:48pm Permalink

Status/Integration

Hi all,

I’m Nik, who’s doing this project.

I thought I’d give a little bit of a status update, and ask for some input from people who might end up using this tool.

So far we’ve got something that will take an Office binary file (i.e. 1997-2003 formats), and will discover any linked OLE objects. It is implemented in java, using the Apache POI libraries, at the POIFS level. This means that it is fairly format-agnostic, as it works at the level that is common to all of the MS Office formats of that family. I have so far only tested it on Excel and Word files, but don’t see any reason why it wouldn’t work on PowerPoint or Publisher etc. files as well.

I am now waiting to get hold of a test data set to look at performance and reliability.

While we are waiting for that, we have been discussing trying to integrate this tool into some of the existing frameworks/tool such as the NLNZ Metadata Extraction Tool or FITS etc.

Where do people think would be the best fit for a tool like this?

Nik

Submitted by Niklas Rehfeld on 28 November 2011 – 12:47am Permalink

Consider Maven and perhaps Apache Tika

Hi Nik,

It’s not really clear where best to integrate, so I’d suggest first of all that you get the source code up somewhere and make it available as a simple command-line executable jar so folks can play with it. Also, if you’ve willing, I’d strongly encourage you to use Maven and publish the code onto the Maven central repository. Carl and I would be able to help with this kind of thing (it’s actually pretty easy), and this makes the code much more sustainable, as the Apache foundation will host the artefacts for the foreseeable future.

As for full integration, there’s also JHOVE2 and Apache Tika to consider. In SCAPE we are considering these two as well as FITS, but not NZME, as I had kind of assumed that that code base was no longer maintained. I’d be happy to be proved wrong about that. However, it would emphasise the fact that our community is trying to support far too many tools that do basically the same thing!

So, for now, if you make the core code available including examples of how to call it, others might be able to integrate it in their framework of choice.

Cheers

Andy

Submitted by Andy Jackson on 29 November 2011 – 11:32am Permalink

Update #2

Hi everyone again,

Another quick update on the status of the project.

As could be expected, immediately after posting the last update, It turned out that there are a whole lot of other ways of linking that had not been considered… So now implementations have been written for the following link types:

Formulas, Pivot Tables, Chart Source Ranges in XLS
normal linked documents in PPT, XLS and DOC files
Text and Shape links in PPT files.

There are a couple more that I have found out about, that hopefully will be implemented in the next couple of weeks.

I have also been working on splitting the code into two different parts, a user program and an API, so that extending and integrating it will be a bit easier.

I will try and get it up on the web some time soon, once I have checked the documentation and fixed a couple of bugs that I know about, so that if anyone is interested in trying/testing it out they can.

Nik

Submitted by Niklas Rehfeld on 17 January 2012 – 2:36am Permalink