After Ross’s post, I thought I’d better follow up on my format registry thoughts and show you all my response to Adam’s challenge. Using my weapon of choice, I was able to create and populate a web site for collaboratively editing PRONOM data in just over one week’s worth of my spare time (six days FTE).
About half the time was spent configuring the web interface. No coding was required in order to do this – I just installed off-the-shelf Drupal modules and configured them. Perhaps the most critical module is the Content Construction Kit, as this allows custom content types to be built on top of the basic Drupal ‘node’ type. For example, I needed a ‘file extension’ field for each record, and this was implemented by creating a taxonomy field called ‘file_extensions’ and adding it to my Format content type. This makes it easy to discover known extensions and to group format records by extension.
The rest of the time was spent creating the script to upload the data from the PRONOM files into the site over XML-RPC. It’s not that much code, but I’ve never used XML-RPC before and don’t use Python all that often, so it took me a little while to make it work. The code is on github if you want to have a look. It’s not production-ready, but I think that’s okay for a proof-of-concept prototype. As I say on the site, please use the register link if you want an account so you can have a look at the content editing interface. Of course, it may not be precisely the interface one might choose to design, but I think this is a point worth compromising on. In return, we get a lot of stuff for free:
It’s not a complete, polished implementation, but I just wanted to show that web interfaces are not that much work if you use the right tool for the job. In my opinion, if you find yourself writing things like user session management or URL aliasing code, you are probably Doing It Wrong. This stuff has been coded thousands of times, and there are multiple implementations on every platform (and available under open licenses) that will help you get started. By relaxing some of our more superficial constraints, we can get a long way very quickly by standing on the shoulders of the giants of web content management.
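For anyone curious what the upload step looks like in outline, here is a minimal sketch using Python's standard `xmlrpc.client` module. It is not the script from github: the `node.save` method name, the endpoint and the field names are all assumptions for illustration, and what your site actually exposes depends on how its XML-RPC services are configured.

```python
import xmlrpc.client


def build_format_node(name, extensions):
    """Build one format record as a plain dict (field names are hypothetical)."""
    return {
        "type": "format",
        "title": name,
        # Normalise extensions: lower case, no leading dot, sorted.
        "file_extensions": sorted(e.lower().lstrip(".") for e in extensions),
    }


def upload(endpoint, node):
    """Send the record to the site. 'node.save' is an assumed method name."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.node.save(node)


node = build_format_node("Portable Document Format", [".PDF"])

# dumps() serialises the call without contacting any server,
# which is handy for checking the payload before a real upload.
request_xml = xmlrpc.client.dumps((node,), methodname="node.save")
print(request_xml)
```

Calling `upload()` with a real endpoint URL would send the same payload over the wire; the sketch stops short of that so it can be run anywhere.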
Comments
Revision permissions now fixed…
Oops, the node revisions were not public. I’ve fixed that now, and the ‘revisions’ link will work properly.
Group by file extensions?
Why do people group file formats by file extension?
I’ve seen 28 types of files that use the file extension .DOC. Many of those file types/formats are also seen with other file extensions, like .TXT, .DOT, .WMC, .MCW, etc. This list doesn’t even cover half of the different document file formats that I’ve encountered, many of which never use the .DOC file extension.
So, what do you gain by looking up the file extension .DOC? It doesn’t include all of the file formats that could be classified as documents. It isn’t an accurate method to identify what your mysterious .DOC file might be.
Is it a method for indexing file formats so that we can collaborate on them and use a reference to ensure we are all talking about the same file format?
Can’t we create an indexing method that doesn’t have duplicate entries? How about an integer index value, a Bates number or something that looks like a MIME type? I realize that some databases are doing this, but their reference numbers are typically more human readable and not widely used. I prefer integer index values, in order to simplify software development. Then, we would need to encourage everyone to use the new index method to simplify coordination of all the independent databases. What is the chance of that happening?
Side question: How many file extensions does your database allow for each format, and how long can each file extension be? In my private database, I’ve found the need to support at least 8 file extensions (at a length of 15 characters each) for each file format.
Examples:
PC Stomp Data .DB1, .DSX, .DWR, .LGS, .SY0, .DST, .DSN, .AP0
Paint Shop Pro Line Style .PSPSTYLEDLINE, .PSP
Resolving ambiguities…
File extensions are important because they are the primary (and often only) mechanism by which format identity is established ‘in the wild’, i.e. on people’s desktops, by associating sets of files with particular applications. Extensions rarely overlap in a problematic way for end users, and can always be overridden manually (Open with…). The web, of course, uses MIME types instead, but often the file extension is used to define the MIME type (mod_mime etc.), and once downloaded the extension usually overrides the MIME type.
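To illustrate the extension-to-MIME lookup, here is a small sketch using Python's standard `mimetypes` module, which does a similar table lookup to the one web servers perform. The file names are made up, and the result for uncommon extensions depends on the mapping tables installed on your system.

```python
import mimetypes


def guess_mime(filename):
    """Return the MIME type implied by the file extension, or None if unknown."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


# A well-known extension maps straight to a MIME type.
print(guess_mime("report.pdf"))

# An obscure extension usually has no entry, so the lookup returns None
# (unless the local mapping tables happen to define it).
print(guess_mime("mystery.sy0"))
```

Note that nothing here looks inside the file: rename the file and the answer changes, which is exactly the weakness being discussed.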
This is why grouping by file extension can be useful, as it enables us to list the formats and applications associated with a given file extension, and so provides a way to explore mysterious formats before the available identification tools support them. What better fall-back could we have for finding the right software for opening a file than the way operating systems do it? Having a list of 28 formats to go through is much better than having no clue at all.
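As a rough sketch of what that grouping amounts to, here is an extension index built from a handful of records. The first two records reuse the examples given earlier in this thread; the third is an illustrative entry added to show a collision, and the function name is my own invention.

```python
from collections import defaultdict


def index_by_extension(formats):
    """Map each normalised extension to the set of format names that use it."""
    index = defaultdict(set)
    for name, extensions in formats.items():
        for ext in extensions:
            index[ext.lower().lstrip(".")].add(name)
    return index


formats = {
    "PC Stomp Data": [".DB1", ".DSX", ".DWR", ".LGS", ".SY0", ".DST", ".DSN", ".AP0"],
    "Paint Shop Pro Line Style": [".PSPSTYLEDLINE", ".PSP"],
    "Paint Shop Pro Image": [".PSP"],  # illustrative entry sharing an extension
}

index = index_by_extension(formats)

# Every candidate format for a mysterious .psp file:
print(sorted(index["psp"]))
```

A lookup that returns several candidates is not an identification, but it does narrow down which applications are worth trying.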
As for indexing formats using integers, this is precisely what PRONOM does. Personally, I’d prefer a more immediately understandable fine-grained MIME scheme, e.g. ‘application/pdf; version=1.4’. I don’t really mind if integers are easier for computers – the computers are there to make my life easier, not the other way around! 🙂
Unfortunately, creating truly unambiguous identifiers for formats is proving difficult (as discussed over here).
File Extensions, MIME & Integer Values
I don’t agree so much with the use of file extensions, because they are too easy to change and I see too many name collisions. Take ‘.DOC’, for example: there are many word processors using proprietary file formats that utilize ‘.DOC’. In my work, I do my best to avoid having to use a fall-back. However, as an absolute last resort, I do use a file’s extension when all else fails.
I do agree that there should be a human-readable naming convention. I like MIME, so I would recommend a standardized integer and MIME pair to represent every type of file. The primary integer and MIME value could represent the primary file type (ex: application/pdf and #258). The MIME parameters, and a secondary integer, could represent the variation of the primary file format used (ex: application/pdf; version=1.4 and #258-5). Then, we could talk about file types using the MIME values and use the integer values in our software. There could also be a standard cross-reference table available between the two values. While we’re at it, include cross references to PRONOM, Oracle, Forensic Innovations, etc. values as well. Then we could gradually adopt a single referencing system alongside MIME, without playing favorites by selecting someone’s existing integer indexes. For the file types missing MIME values, we could create a standard system for automatically augmenting the MIME list with new non-MIME values.
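A rough sketch of how such a pair might be represented in software follows. The class and field names are invented for illustration, and the numbers are just the example values above, not entries from any real registry.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FormatId:
    """A paired identifier: a MIME-style label plus an integer index."""
    mime: str                                  # primary type, e.g. "application/pdf"
    number: int                                # primary integer index
    params: Tuple[Tuple[str, str], ...] = ()   # MIME parameters naming the variant
    variant: Optional[int] = None              # secondary integer for the variant

    def mime_label(self) -> str:
        parts = [self.mime] + [f"{key}={value}" for key, value in self.params]
        return "; ".join(parts)

    def numeric_label(self) -> str:
        base = f"#{self.number}"
        return base if self.variant is None else f"{base}-{self.variant}"


pdf = FormatId("application/pdf", 258)
pdf_14 = FormatId("application/pdf", 258, (("version", "1.4"),), 5)

# Cross-reference table from the integer form to the human-readable form.
by_number = {f.numeric_label(): f.mime_label() for f in (pdf, pdf_14)}
print(by_number["#258-5"])  # application/pdf; version=1.4
```

Further columns for PRONOM or other registries' identifiers could be added as extra fields on the same record.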