Content is King. The key to a good file format registry is not software; it’s not user interface; it’s not governance. The key is content, content, content. We will all win if we have a registry whose content is usable, accurate, and comprehensive.
I have a challenge for developers in the digital preservation community: can we build a file format registry without building any new software systems at all?
Let’s take Pronom as an exemplar. Pronom consists of about 700 small XML documents with some cross references. There is a modest community of people who may look at them and point out errors (bugs) or omissions (new features). Some of these may even email in corrections. Of course, not every member of the community is trusted to make changes to the underlying data! There is a special subset of community members that validate these changes and actually commit them.
How often do these changes occur? I don’t think any of us know precisely, but I’ll suggest with confidence that it will be less than once a second. Actually, I bet it will average less than once a week.
How much format data is there? I don’t know precisely, but the entire current Pronom XML data fits into a single zip file of 680KB. I would be willing to bet that an active community will not grow this by more than a factor of 100 over the next few years. That would be about 68MB zipped, so I’ll estimate much less than 700MB. Today’s size and rate of change would be comfortably handled via email! I receive many larger documents each day. It certainly does not require a complex database.
So let’s consider this profile. There is a community that consists of a few committers, tens of active members, and perhaps hundreds of end users. They have a process to manage patches and releases. The community maintains a few thousand objects constituting a few megabytes.
Does this sound like a process and structure that we are familiar with? To me, this sounds like any of hundreds of modest open-source community-driven software development projects. Is this bad news? Do they have to invest hundreds of thousands of euros to set up a special infrastructure to manage all of this complicated stuff and distribute their data? Well, no! The infrastructure is already there. It is well established; there are many providers; and it is mostly free. Most importantly, it lets communities get on with their real work.
My second challenge to the developers in the digital preservation community is this: Suppose that you had two weeks to set up a functional file format registry, could you do it? Could you manage some directories full of XML and also generate some HTML files (or just use a style sheet)? Would you really need to write a single line of code? And the really big question: could you take the second week off?
Someone else can do the hard work while you are on vacation! Specifying good signatures for common file formats would be a great next step.
Let’s think radically. Let’s make our problems so easy that it’s almost embarrassing to solve them!
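For anyone who would rather generate HTML than rely on a style sheet, here is a minimal sketch of how little is involved. It assumes a made-up record layout (a `format` element with an `id` attribute and `name` and `extension` children), which is not the real Pronom schema, and the function names `summarise`, `render_index`, and `build_site` are purely illustrative.

```python
import html
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical record layout, NOT the real Pronom schema:
# <format id="fmt-1"><name>...</name><extension>...</extension></format>
EXAMPLE = (
    '<format id="fmt-1"><name>Example &amp; Co</name>'
    "<extension>exa</extension></format>"
)


def summarise(xml_text):
    """Return (id, name, extension) from one hypothetical format record."""
    root = ET.fromstring(xml_text)
    return (
        root.get("id", ""),
        root.findtext("name", default=""),
        root.findtext("extension", default=""),
    )


def render_index(records):
    """Render an HTML table from an iterable of (id, name, extension) tuples."""
    rows = [
        "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
            *(html.escape(field) for field in record)
        )
        for record in sorted(records)
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"


def build_site(source_dir, target_file):
    """Read every *.xml file in source_dir and write one HTML index page."""
    records = [
        summarise(path.read_text(encoding="utf-8"))
        for path in Path(source_dir).glob("*.xml")
    ]
    Path(target_file).write_text(render_index(records), encoding="utf-8")


if __name__ == "__main__":
    print(render_index([summarise(EXAMPLE)]))
```

Everything else in the workflow (history, patches, review, releases) would come from whatever hosting the community already uses, which is the point of the challenge.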
http://planets-suite.sourceforge.net/
How about relationships?
Content may be king but the hard bit of building a registry is dealing with the relationships between entities (it’s not all about formats).
Formats are only interesting if I know what to do with the information, e.g.: What validation tool can I run for this format? What property extraction tool can I run (and which properties should it extract)? What migration tools can I run (and into what formats)? What emulation tools can I run? And so on. This means I need information on software, properties, etc., and I need to maintain these relationships.
There is a case to answer that today’s registries (PRONOM, PCR etc.) contain too many entities and maintain too many relationships, but I think you need some. How would you do this in flat XML files? Even if you could, why would you want to?
Rob
Perhaps xlink:href would do.
An xlink:href from the format to the software XML resource (or vice versa) would be fine, I think. It would be compatible with any future linked-data approach and really easy to turn into an HTML hyperlink.
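As a rough illustration of the idea, the sketch below reads xlink:href attributes from a format record and turns them into HTML hyperlinks. The XLink namespace URI is the standard one; the record itself (the `tool` element, its `role` attribute, and the `../software/sw-7.xml` style paths) is invented for the example and is not taken from any existing registry.

```python
import html
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # the standard XLink namespace

# Hypothetical format record; element names and file paths are illustrative.
FORMAT_XML = """
<format xmlns:xlink="http://www.w3.org/1999/xlink" id="fmt-1">
  <name>Example format</name>
  <tool role="validation" xlink:href="../software/sw-7.xml">Example validator</tool>
  <tool role="migration" xlink:href="../software/sw-9.xml">Example converter</tool>
</format>
"""


def tool_links(xml_text):
    """Return (role, href, label) for every tool element carrying xlink:href."""
    root = ET.fromstring(xml_text)
    links = []
    for tool in root.iter("tool"):
        # ElementTree exposes namespaced attributes as "{namespace}localname".
        href = tool.get("{%s}href" % XLINK)
        if href is None:
            continue
        links.append((tool.get("role", ""), href, tool.text or ""))
    return links


def as_html(links):
    """Turn the extracted relationships into plain HTML anchors, one per line."""
    return "\n".join(
        '<a href="{}">{} ({})</a>'.format(
            html.escape(href, quote=True), html.escape(label), html.escape(role)
        )
        for role, href, label in links
    )


if __name__ == "__main__":
    print(as_html(tool_links(FORMAT_XML)))
```

The relationship lives in the format file as a plain link, so the software record can be edited independently and the "what can I run for this format?" question becomes a matter of following links.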
Content is still King!
Content is still king – and content includes relationships. I’ll make the radical suggestion, however, that format information could really mostly be about formats! All about a format’s history, specification, signatures, usage, and relationships to other formats, perhaps even relationships to the software tools or environments that were used with it. I would actually be very happy if we had a good repository with just this information!
In addition, we also need all of the great stuff that Rob mentions – information about tools for validation, characterisation, migration, and other types of manipulation.
But that is also about content – not about databases or software infrastructure. So my challenge stands: how can we focus our attention on the content and ways to use it, rather than the system that holds it?
Interesting approach but
Interesting approach, but it starts me wondering why this did not already happen ten years ago 🙂
I agree with Rob that it might become more complex if we want to add relations to software and support for complex digital objects such as websites and games.
I would say: let the market come up with tools while we enforce working on a standard that makes interoperability between registries possible.
The king is not deposed!
Jeffrey suggests that we “let the market come up with tools while we enforce working on a standard that makes interoperability between registries possible”.
I respectfully disagree with this position. And I find the points that Jeffrey highlights very telling. He suggests that we work on a registry standard that enables interoperability and let the market build tools. I note that this leaves out the part that gives the tools and the registries value. It leaves out the content! This is what we’ve been doing over the last few years, and the result is not encouraging. What is the value of interoperable registries where there is so little content for them to exchange? The answer is “not much”.
We need to get the digital preservation community to focus on the content of these ‘registries’ and to focus on content that makes a difference to us. I will have a longer blog post on this in the near future.