FITS Blitz | Open Planets Foundation

FITS is a classic case of a great digital preservation tool that was developed with an initial injection of resource, after which its creator (Harvard University) has struggled to maintain it. But let me be very clear: Harvard deserves no blame for this situation. They’ve created a tool that many in our community have found particularly useful, but they have been left to maintain it largely on their own.

Wouldn’t it be great if different individuals and organisations in our community could all chip in to maintain and enhance the tool? Wrap new tools, upgrade outdated versions of existing tools, and so on? Well, many have started to do this, including some injections of effort from my own project, SPRUCE. What a lovely situation to be in, seeing the community come together to drive this tool forward…

Unfortunately we were perhaps a little naive about the effort and mechanics needed to make this happen as a genuine open source development. FITS is a complex beast, wrapping a good number of tools that extract a multitude of information about your files which is then normalised by FITS. What happens when you tweak one bit of code? Does the rest of the codebase still work as it should? Obviously you need to have confidence in a tool if it plays a critical role in your preservation infrastructure.

From the point of view of the SPRUCE Project, we’d like to see all the latest tweaks and enhancements to FITS brought together so that the practitioners we’re supporting get a more effective tool. But we also equally want future improvements to find their way into the codebase in a managed and dependable way, so that upgrading to a new FITS version doesn’t involve lots of testing for every organisation using it.

So in partnership with Harvard and the Open Planets Foundation (with support from Creative Pragmatics), SPRUCE is supporting a two week project to get the technical infrastructure in place to make FITS genuinely maintainable by the community. “FITS Blitz” will merge the existing code branches and establish a comprehensive testing setup so that further code developments only find their way in when there is confidence that other bits of functionality haven’t been damaged by the changes.

FITS Blitz commences next Monday. Please get in touch with me, or with Carl Wilson from the Open Planets Foundation, if you’d like to find out more.

Preservation Topics:

SPRUCE

Submitted by Paul Wheatley on 6 November 2013 – 11:31am

Comments

The fits-testing project

KEEP SOLUTIONS, for a private project that is sponsoring some developments in RODA, is going to develop some new features in FITS. These features include improved support in FITS for identification, feature extraction and validation of some defined file formats.

But before we could start, we wanted to assess how well FITS currently handles the file formats we want to deal with. To do so, we created a new open-source tool that takes a FITS installation and a curated corpus with well-defined ground truth, and outputs an XLS report detailing how well FITS behaved.
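As a rough illustration of what scoring against ground truth involves, here is a minimal Python sketch. It is not the fits-testing code: the function name, the file names and the idea of comparing only MIME types are assumptions made for the example, and the real tool reports more than this.

```python
from collections import defaultdict


def score_against_ground_truth(ground_truth, reported):
    """Count correct identifications per expected MIME type.

    ground_truth: dict mapping file name -> expected MIME type
    reported:     dict mapping file name -> MIME type the tool reported
                  (a missing entry counts as a failure)
    Returns a dict mapping expected MIME type -> (correct, total).
    """
    tally = defaultdict(lambda: [0, 0])
    for name, expected in ground_truth.items():
        tally[expected][1] += 1
        if reported.get(name) == expected:
            tally[expected][0] += 1
    return {mime: tuple(counts) for mime, counts in tally.items()}


# Hypothetical corpus: the file names and reported values are made up.
ground_truth = {
    "report.odt": "application/vnd.oasis.opendocument.text",
    "notes.odt": "application/vnd.oasis.opendocument.text",
    "scan.pdf": "application/pdf",
}
reported = {
    "report.odt": "application/vnd.oasis.opendocument.text",
    "notes.odt": "application/zip",  # container recognised, format missed
    "scan.pdf": "application/pdf",
}

scores = score_against_ground_truth(ground_truth, reported)
for mime, (correct, total) in sorted(scores.items()):
    print(f"{mime}: {correct}/{total} identified correctly")
```

Running the same scoring against each FITS build gives per-format figures that can be compared side by side.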

The tool and some preliminary results are available at:

https://github.com/keeps/fits-testing#results

We compared the harvard-lts official version with the openplanets master, openplanets gary version (which updated Droid to version 6 but seems to have some configuration problem), and a new KEEPS version that added FIDO and fine-tuned gary’s version with a better configuration and some bug fixes.

We are still going to add some new file formats we need to support, like shape files and AutoCAD. Over the next months, we will be investing in improving FITS for these file formats.

The conclusion for now is that FITS gives pretty bad results for our target corpus, but that in just a couple of weeks we managed to greatly improve them. I know that the corpus is still very small and too focused on our own problem, but I think that, by improving the test corpus, we could make this tool very proficient at testing new developments in FITS and verifying whether they are actually improvements or not.

Submitted by Luis Faria on 6 November 2013 – 5:56pm Permalink

Yes, really useful work!

Thanks for the details on your work to date on this, Luis! This gives us a really useful starting point that Carl has already been working with in his preparation for next week. Incidentally, we’ll be doing daily Skype calls to review where we’re up to, so let me know if you want to join in anytime.

Submitted by Paul Wheatley on 6 November 2013 – 6:03pm Permalink

Our FITS updates

I would surely like to be updated with the results of your efforts!

Also, it may be useful for you to check our FITS working branch updates:

Updated Droid to 6.1.3
Removed Java 7 lock from Droid tool
Added new Droid signature file with some new mimetypes (experimental)
Changed the way FITS consolidates results (so tools that only partially identify the file format can still be used to extract metadata)
Added FIDO to FITS
Created a new ODF validator and added it to FITS

We are also planning to do the following soon (next week):

Add corrupted Microsoft Office documents (doc, docx, ppt, pptx, xls, xlsx) to the fits-testing corpus
Add Apache POI to FITS to validate Microsoft Office documents (doc, docx, ppt, pptx, xls, xlsx)

Submitted by Luis Faria on 6 November 2013 – 6:32pm Permalink

Apache ODF Validator

Lovely to hear about all this work going ahead, and it’s really good to publicise it like this.

Just wanted to check you know that there’s already an Apache ODF Validator you could exploit.

Submitted by Andy Jackson on 7 November 2013 – 9:36am Permalink

Apache ODF Validator

We experimented with Apache ODF Validator and with Office-o-tron, and neither functioned correctly. Apache ODF Validator gives many false negatives and false positives, and Office-o-tron is terribly slow. But as the schemas are publicly available and the method to validate is well defined by OASIS, we quickly developed a new implementation that will be available soon at keeps-validator-odf.

Submitted by Luis Faria on 7 November 2013 – 10:40am Permalink

Yet another codebase?

That OASIS link says

last edited 2009-08-12

which I find a bit worrying. Are you sure it’s still that simple? Also, did you report your findings to either of the existing projects?

FWIW, I think we are strongest when we contribute our solid experience and our testing and development resources to higher-profile projects with larger user communities, like the Apache ones. In particular, I think the work Johan and Will have done with Apache PDFBox/Preflight is wonderful. I know there’s a higher collaboration overhead, but as well as gaining a larger audience for our work, we also gain a more maintainable infrastructure for the software over time.

Submitted by Andy Jackson on 7 November 2013 – 10:57am Permalink

Where does the truth lie?

I do agree I would not want to create yet another codebase. This is only an exploratory implementation, intended to let us evaluate its results against other existing implementations such as the Apache ODF Validator. What our implementation does, as suggested in the 2009 guideline, is check the XML files against the RELAX NG schemas, which are quite up to date.

But our current problem is: where does the truth lie? At first we thought that LibreOffice would always produce correct and valid ODF files, and that these should be seen as the ground truth for valid files. But Apache ODF Validator does not seem to agree with LibreOffice on what constitutes a valid ODF file. Not knowing whom to trust, we created our own validation tool and will try to ascertain for ourselves who is right. Today’s results actually seem to point to LibreOffice as the culprit, and we might choose Apache ODF Validator after all, but we still need more tests to be sure.

We also tried to compare the results with the output of online validation tools such as OpenDocument Fellowship and RHCloud ODF Validator, but the question of where the truth lies still remains.
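One way to make that question concrete is to list the files on which the validators disagree, since those are the ones that need a human decision. The following Python sketch assumes each validator's verdict has already been collected as a boolean; the validator and file names are placeholders, not real results.

```python
def find_disagreements(verdicts):
    """Return the files on which the validators do not all agree.

    verdicts: dict mapping file name -> dict of validator name -> bool,
              where True means that validator reported the file as valid.
    """
    return {
        name: by_validator
        for name, by_validator in verdicts.items()
        if len(set(by_validator.values())) > 1
    }


# Placeholder verdicts, for illustration only.
verdicts = {
    "a.odt": {"validator_a": True, "validator_b": True},
    "b.odt": {"validator_a": True, "validator_b": False},
    "c.odt": {"validator_a": False, "validator_b": False},
}

disputed = find_disagreements(verdicts)
for name, by_validator in disputed.items():
    print(name, by_validator)
```

Files where every validator agrees are the safer candidates for ground truth; the disputed ones are where the specification has to be read closely.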

This is why we think that testing FITS against a test corpus with well-defined ground truth is so important. But creating a good test corpus and ensuring the quality of the ground truth is not easy, as we can see from this ODF validation example. That is why I would like to bring some attention to this tool and call upon the community to help this project go further.

Submitted by Luis Faria on 7 November 2013 – 12:50pm Permalink

Contact LibreOffice direct?

I think there is some difference now between OpenOffice and LibreOffice. It might be worth speaking directly to Michael Meeks from LibreOffice (https://people.gnome.org/~michael/) about ODF validation and adherence to the ODF spec etc.

Submitted by William Palmer on 7 November 2013 – 1:09pm Permalink

Implementation Over Ground Truth

I guess I’m just a little surprised that making a new tool would tell you anything you could not learn by picking through the Apache one, which appears to use the same validation methodology. However, everyone has their own ways of approaching things, and now that I understand that this work reflects your process of understanding this format and its validation (rather than necessarily being a new tool intended to usurp existing implementations) it all makes much more sense.

I do agree that creating test corpora that explore these issues is really important and useful work. However, I’m a little skeptical about the implication that the formal specification represents ‘Ground Truth’. We have to deal with whatever the implementations create, and so the test corpus must include examples from the common tools, even if they break the formal specification. That specification may provide a useful baseline against which the variation between implementations might be compared and understood, but that does not make it ‘the truth’.

Submitted by Andy Jackson on 8 November 2013 – 9:56am Permalink

Corpora creation and the purpose of validation

Our process for creating the existing corpora in the fits-testing project is to take current well-known implementations, such as LibreOffice and Microsoft Word, create a new document with some content, and then save or export it in as many formats as the tool’s options allow. But whether to consider these files valid may be a question of terminology, or of point of view.

From one point of view, a file being valid means it follows the file format specification. Whether or not an archive accepts files that do not conform to the specification is a matter of policy, and I agree that many times they would have to accept whatever implementations provide. Nevertheless, the information about whether a file follows the formal specification should be there.

From another point of view, the whole idea of digital preservation is continued access by the community, and if the community uses the implementations (such as Microsoft Word and LibreOffice), then compatibility with these implementations is the most important objective, even if they deviate from formal standards.

Now, it may be that there is no such thing as a valid file (as there is no truth), but there are files that “follow formal specification” and files that “are compatible with implementation X”. For now, we are taking valid to mean the first, but if no specification is available we might have to resort to the second definition. In the end, it might mean we just need to define clearer terms and let policy decide what to accept or not.
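To illustrate keeping the two notions apart, here is a small Python sketch in which both facts are recorded separately and the accept/reject decision is left to a policy. The class, field and function names are hypothetical; nothing like this exists in FITS or fits-testing as far as this discussion goes.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ValidationRecord:
    # True/False if the file was checked against a formal specification,
    # None if no specification is available for the format.
    conforms_to_spec: Optional[bool] = None
    # Names of implementations known to open the file correctly.
    compatible_with: Set[str] = field(default_factory=set)


def accepted_by_policy(record, require_spec_conformance,
                       required_implementations=frozenset()):
    """Apply an archive's policy to the recorded facts."""
    if require_spec_conformance and record.conforms_to_spec is not True:
        return False
    # An empty requirement set is trivially satisfied.
    return set(required_implementations) <= record.compatible_with


# A file that breaks the specification but opens fine in LibreOffice.
record = ValidationRecord(conforms_to_spec=False,
                          compatible_with={"LibreOffice"})

strict = accepted_by_policy(record, require_spec_conformance=True)
pragmatic = accepted_by_policy(record, require_spec_conformance=False,
                               required_implementations={"LibreOffice"})
print(strict, pragmatic)
```

The same record yields different outcomes under the two policies, which is the point: the facts are stored once and the decision stays with the archive.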

Submitted by Luis Faria on 8 November 2013 – 11:18am Permalink

Implementation is A Ground Truth, so is a Specification

Interesting discussion. I just wanted to add that I think we need both things. Some use cases demand a reference spec for a format, other use cases ask for an exemplar implementation that can be poked and prodded.

Why not both? The Spec is a reference. The implementation(s) is a tangible example of a spec (you could also associate deviations from the spec here, e.g. a proprietary implementation of a format type that shares 95% of its structure with an “official” spec but includes some 5% proprietary novelty). Both are truths in their own right and should be used as such. The total knowledge of the format is then formed from both the specs and the implementations that a DP/format SME has decided to include….

Submitted by Jay Gattuso on 12 November 2013 – 1:54am Permalink

This is just for info:

This is just for info: incidentally, I created an ODT entry in the OPF File Format Risk registry today. If there are any validation issues, please feel free to report them, e.g. as a child page here:

http://wiki.opf-labs.org/display/TR/OpenDocument+Text

Submitted by Johan van der Knijff on 18 November 2013 – 5:42pm Permalink