Building A Collaborative Format Registry Editor

By andy jackson on 14 January 2011 – 10:43am

After Ross’s post, I thought I’d better follow up on my format registry thoughts and show you all my response to Adam’s challenge. Using my weapon of choice, I was able to create and populate a web site for collaboratively editing PRONOM data in just over one week’s worth of my spare time (six days FTE).

About half the time was spent configuring the web interface. No coding was required in order to do this – I just installed off-the-shelf Drupal modules and configured them. Perhaps the most critical module is the Content Construction Kit, as this allows custom content types to be built on top of the basic Drupal ‘node’ type. For example, I needed a ‘file extension’ field for each record, and this was implemented by creating a taxonomy field called ‘file_extensions’ and adding it to my Format content type. This makes it easy to discover known extensions and to group format records by extension.

The rest of the time was spent creating the script to upload the data from the PRONOM files into the site over XML-RPC. It’s not that much code, but I’d never used XML-RPC before and don’t use Python all that often, so it took me a little while to make it work. The code is on GitHub if you want to have a look. It’s not production-ready, but I think that’s okay for a proof-of-concept prototype. As I say on the site, please use the register link if you want an account so you can have a look at the content editing interface. Of course, it may not be precisely the interface one might choose to design, but I think this is a point worth compromising on. In return, we get a lot of stuff for free:

A browsing and editing interface with no coding required. Note that the data schema can be edited through the web interface almost as easily as the content itself.
User sessions and account management, OpenID, authentication, authorisation, user roles etc.
Content management and workflow tools (drafts, editorial control, notification, etc.).
User comments (e.g. this one), user content rating.
Easy access to the latest additions (RSS) and the latest edits (RSS).
Faceted search and an index. I’m particularly pleased with the faceted search.
Content with nice URLs, tagging, versioning and version comparison.
Direct export as XML or as RDF.
A programmatic service interface to the content and some site features (e.g. search).
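To give a flavour of the upload step described earlier, here is a minimal sketch of how one record might be packaged as an XML-RPC call in Python. It is not the script from the repository. The method name `node.save`, the content type name `format`, the field name `field_puid` and the sample record are all assumptions made for illustration, and a real site may also expect a session ID or API key as extra arguments. The request is only serialised, not sent, so it can be inspected safely.

```python
import xmlrpc.client


def build_format_node(name, puid, extensions):
    """Map one format record onto a node-style dictionary."""
    return {
        "type": "format",      # assumed machine name of the Format content type
        "title": name,
        "field_puid": puid,    # hypothetical field name for the identifier
        # Normalise extensions: lower case, no leading dot, no duplicates.
        "file_extensions": sorted({e.lower().lstrip(".") for e in extensions}),
    }


def encode_save_request(node, method="node.save"):
    """Serialise the XML-RPC request body without sending it anywhere."""
    # 'node.save' is an assumption about what the site's service layer exposes.
    return xmlrpc.client.dumps((node,), methodname=method)


# A made-up record, not real PRONOM data.
node = build_format_node("Example Document Format", "fmt/0", [".EXD", ".exd"])
request_xml = encode_save_request(node)
```

Sending it for real would mean creating an `xmlrpc.client.ServerProxy` for the site’s XML-RPC endpoint and invoking the same method name on it with the same node dictionary.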

It’s not a complete, polished implementation, but I just wanted to show that web interfaces are not that much work if you use the right tool for the job. In my opinion, if you find yourself writing things like user session management or URL aliasing code, you are probably Doing It Wrong. This stuff has been coded thousands of times, and there are multiple implementations on every platform (and available under open licenses) that will help you get started. By relaxing some of our more superficial constraints, we can get a long way very quickly by standing on the shoulders of the giants of web content management.

Preservation Topics:

Format Registry

Representation Information

Comments

Revision permissions now fixed…

andy jackson – 21 January 2011 – 4:18pm

Oops, the node revisions were not public. I’ve fixed that now, and the ‘revisions’ link will work properly.

Group by file extensions?

Rob Zirnstein – 24 October 2011 – 2:14pm

Why do people group file formats by file extension?

I’ve seen 28 types of files that use the file extension .DOC. Many of those file types/formats are also seen with other file extensions, like .TXT, .DOT, .WMC, .MCW, etc. This list doesn’t even cover half of the different Document file formats that I’ve encountered, many of which never use the .DOC file extension.

So, what do you gain by looking up the file extension .DOC? It doesn’t include all of the file formats that could be classified as documents. It isn’t an accurate method to identify what your mysterious .DOC file might be.

Is it a method for indexing file formats so that we can collaborate on them and use a reference to ensure we are all talking about the same file format?

Can’t we create an indexing method that doesn’t have duplicate entries? How about an integer index value, a Bates number or something that looks like a MIME type? I realize that some databases are doing this, but their reference numbers are typically more human readable and not widely used. I prefer integer index values, in order to simplify software development. Then, we would need to encourage everyone to use the new index method to simplify coordination of all the independent databases. What is the chance of that happening?

Side question: How many file extensions does your database allow for each format, and how long can each file extension be? In my private database, I’ve found the need to support at least 8 file extensions (at a length of 15 characters each) for each file format.

Examples:

PC Stomp Data .DB1, .DSX, .DWR, .LGS, .SY0, .DST, .DSN, .AP0

Paint Shop Pro Line Style .PSPSTYLEDLINE, .PSP

Resolving ambiguities…

andy jackson – 24 October 2011 – 3:02pm

File extensions are important because they are the primary (and often only) mechanism by which format identity is established ‘in the wild’, i.e. on people’s desktops, by associating sets of files with particular applications. Extensions rarely overlap in a problematic way for end users, and can always be overridden manually (Open with…). The web, of course, uses MIME types instead, but often the file extension is used to define the MIME type (mod_mime etc.), and once downloaded the extension usually overrides the MIME type.

This is why grouping by file extension can be useful, as it enables us to list the applications that support a given file extension, and so provides a way to explore mysterious formats before the available identification tools support them. What better fall-back could we have for finding the right software for opening a file than the way operating systems do it? Having a list of 28 formats to go through is much better than having no clue at all.
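As a rough illustration of that lookup, here is a small Python sketch that inverts a format-to-extensions table into an extension-to-formats index. The two real entries come from the examples earlier in this thread. The two ‘Hypothetical Word Processor’ entries are invented purely to show a collision on .DOC, and none of this reflects how the site itself stores the data.

```python
from collections import defaultdict


def index_by_extension(formats):
    """Invert {format name: [extensions]} into {extension: [format names]}."""
    index = defaultdict(list)
    for name, extensions in formats.items():
        for ext in extensions:
            # Normalise so that '.DOC' and 'doc' land in the same bucket.
            index[ext.lower().lstrip(".")].append(name)
    return {ext: sorted(names) for ext, names in index.items()}


formats = {
    "PC Stomp Data": [".DB1", ".DSX", ".DWR", ".LGS", ".SY0", ".DST", ".DSN", ".AP0"],
    "Paint Shop Pro Line Style": [".PSPSTYLEDLINE", ".PSP"],
    "Hypothetical Word Processor A": [".DOC", ".TXT"],  # invented entry
    "Hypothetical Word Processor B": [".DOC"],          # invented entry
}

by_extension = index_by_extension(formats)
candidates = by_extension.get("doc", [])  # every format known to use .DOC
```

An ambiguous extension simply returns several candidates, which is the ‘list to go through’ rather than a definitive identification.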

As for indexing formats using integers, this is precisely what PRONOM does. Personally, I’d prefer a more immediately understandable fine-grained MIME scheme, e.g. ‘application/pdf; version=1.4’. I don’t really mind if integers are easier for computers – the computers are there to make my life easier, not the other way around! 🙂

Unfortunately, creating truly unambiguous identifiers for formats is proving difficult (as discussed over here).

File Extensions, MIME & Integer Values

Rob Zirnstein – 22 November 2011 – 4:15pm

I don’t agree so much with the use of file extensions, because they are too easy to change and I see too many name collisions. Take ‘.DOC’, for example: there are many word processors using proprietary file formats that utilize it. In my work, I do my best to avoid having to use a fallback. However, as an absolute last resort, I do use a file’s extension when all else fails.

I do agree that there should be a human-readable naming convention. I like MIME, so I would recommend a standardized integer and MIME pair to represent every type of file. The primary integer and MIME type could represent the primary file type (e.g. application/pdf and #258). The MIME parameters, and a secondary integer, can be used to represent the variation of the primary file format used (e.g. application/pdf; version=1.4 and #258-5). Then, we could talk about file types using the MIME values and use the integer values in our software. There could also be a standard cross-reference table available between the two values. While we’re at it, include cross-references to PRONOM, Oracle, Forensic Innovations, etc. values as well. Then we could gradually adopt a single referencing system to join MIME, without having shown any favoritism by selecting someone’s existing integer indexes. For the file types missing MIME values, we could create a standard system for automatically augmenting the MIME list with new non-MIME values.
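To make the pairing concrete, here is a minimal Python sketch of such a cross-reference. The values #258 and #258-5 are just the examples used above, and the names `FormatKey`, `parse_key`, `parse_mime` and `mime_for` are hypothetical. No existing registry works this way as far as this sketch assumes.

```python
from typing import NamedTuple, Optional


class FormatKey(NamedTuple):
    primary: int                    # primary file type, e.g. 258
    variant: Optional[int] = None   # variation of that type, e.g. 5


def parse_key(text):
    """Parse '#258' or '#258-5' into a FormatKey."""
    head, _, tail = text.lstrip("#").partition("-")
    return FormatKey(int(head), int(tail) if tail else None)


def parse_mime(text):
    """Split 'type/subtype; key=value' into (type, {key: value})."""
    base, *rest = [part.strip() for part in text.split(";")]
    params = {}
    for part in rest:
        key, _, value = part.partition("=")
        params[key.strip()] = value.strip()
    return base.lower(), params


# Cross-reference table between the integer keys and the MIME values.
CROSS_REFERENCE = {
    FormatKey(258): "application/pdf",
    FormatKey(258, 5): "application/pdf; version=1.4",
}


def mime_for(key_text):
    """Look up the MIME value, falling back to the primary type."""
    key = parse_key(key_text)
    return CROSS_REFERENCE.get(key) or CROSS_REFERENCE.get(FormatKey(key.primary))
```

With this table, `mime_for("#258-5")` gives the versioned value, while an unknown variation such as `#258-9` falls back to plain `application/pdf`.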
