What do we mean by format?

Bill’s earlier post and this one from Chris Rusbridge have spurred me to try to describe what I discovered about PRONOM format records during my editable registry experiment. Building that site required a close inspection of the PRONOM Format Record data model, during which I realised that we commonly conflate two quite different ways of defining formats. I suspect we should start to tease them apart.

The two definitions are:

  • Format, as it is specified. e.g. files that conform to the PDF 1.4 specification.
  • Format, as it is used. e.g. PDF files created by Adobe Illustrator CS2. 

Flicking through the existing PRONOM records, it is clear that the majority of the most complete records are defined by reference to a specification. Many of the emptiest records correspond to known software, but with poor documentation. In between, the records are mostly thin wrappers around simple names, known internal signatures and file extensions. These different flavours of record have no consistently overlapping data fields that can be considered ‘primary keys’, i.e. fields that uniquely define a format. In other words, we don’t know what a format is.

If we are not sure which fields define a format, then I fear that the PRONOM team’s primary focus on creating signatures rather than documenting formats is going to sting us in the long term. This is because the lack of clarity about what it is we are identifying means we risk, for example, accidentally conflating different formats, or making artificial distinctions between differently named versions of the same format. We are minting identifiers for ambiguous concepts, and so we must expect those identifiers to be retracted or replaced at some point in the future. What does it mean to mint a permanent identifier for a record when every single aspect of that record is permitted to change?

One alternative to the PRONOM model is the GDFR approach, which defines a format as “a serialized encoding of an abstract information model”, and provides a sophisticated four-level model of what that means:

  1. Abstract information model
    • …mapped via Format encoding model to the…
  2. Coded information set (semantic)
    • …mapped via the Format encoding form to the…
  3. Structural information set (syntactic)
    • …mapped via the Format encoding scheme (parser/encoder) to the…
  4. Serialized byte stream

The problem is that not all format specifications have these four levels. The levels were inspired by the Unicode character encoding model, but (as that document itself indicates) other specifications use different numbers of levels. RDF has three; HTML5 has three that define the markup semantics, but uses further levels to link the mark-up to the behaviours and other features of the interpretation/renderer. Furthermore, formats defined only by software have only the lowest rungs of this scheme (data and parser/encoder). Such formats have no abstract information model, just an in-memory representation and an interpretation/performance. Even this mapping conflates the formal specification of the parser/encoder with its implementation – if we are being perfectly strict, the only thing the two perspectives have in common is the bytestream itself.

Conflating these different ways of defining format makes it difficult to describe the cases where conflict arises. We have probably all come across files that are perfectly well handled by the software, but break the specification, or indeed formats that have no formal specification. We need to be able to describe these difficult cases. Perhaps we should be minting distinct identifiers for format specifications and format implementations instead? This could be done by deferring to the specification document instead of trying to model its contents, and would still allow us to distinguish between a bitstream that conforms to a given standard and a bitstream that is parseable using a particular implementation of that standard.
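To make that split concrete, here is a minimal sketch of what separate, linked identifiers might look like. All of the class, property and identifier names below are my own illustrations, not any real registry schema:

```python
from dataclasses import dataclass, field

@dataclass
class FormatSpecification:
    identifier: str         # a minted ID for e.g. "PDF 1.4 as specified"
    specification_url: str  # defer to the document; don't model its contents

@dataclass
class FormatImplementation:
    identifier: str         # a minted ID for e.g. "PDF as written by app X"
    software: str
    # Links to specifications (conformsTo), rather than merging them in.
    conforms_to: list[str] = field(default_factory=list)

# A bitstream can now be described as conformant to a spec, parseable by an
# implementation, both, or neither - so the conflict cases become sayable.
spec = FormatSpecification("x-spec/pdf-1.4", "https://example.org/pdf14-spec")
impl = FormatImplementation("x-impl/illustrator-cs2-pdf",
                            "Adobe Illustrator CS2",
                            conforms_to=[spec.identifier])
```

The point of the sketch is only that the link between the two concepts is explicit and optional, rather than baked into a single format record.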

I think PRONOM are aware of the limitations of their model, but are going to go ahead and get the data out first anyway. Simultaneously, it looks like UDFR are proceeding with their own ontology, presumably based on the GDFR model. In general, I think just pushing the data out there first (a.k.a. raw data now) is a reasonable approach, because we can always review and consolidate later, and doing it this way around helps ensure that the consolidation is based on real data. But I can’t shake the feeling that we are taking the long way round.

10 comments

Bill Roberts wrote 7 weeks 1 day ago

Is format a useful concept?

Hi Andy – nice post.

Obviously the main reason we are interested in formats in a preservation context is to help us answer the question of how we manage our digital objects.  I’ve got some object – have I got some software to view it, or should I migrate it to some other format, and if so what tool should I use – etc.

If we can assign a format to an object, then we can group together objects that we can treat in the same way. If we’re confident that an object follows a format specification and we know that software X can work with objects meeting that spec, then we can confidently connect the software and the object.

So I suppose what I am saying is, it doesn’t matter so much what format an object is, rather we want to know which software can reliably read it.

One (impractical) way to do that would obviously be to try to open each object in turn and see what happens.  But we want to automate that process so we can handle lots of objects.

In many ways, PRONOM identifiers are identifiers for the class of digital objects which respond positively to a particular digital signature via the chosen identification tool.  We want to design the identification tool and signature so that this class of objects is as close as possible to the class of objects that work correctly with software X.

Matching an object against a file format specification might be a good way to do this in some cases, but there might be other ways to do it too.  If it turns out that ‘file format’ isn’t such a useful concept in practice, I don’t see a problem in throwing that away and doing it differently, as long as we can find a way to efficiently and automatically connect digital objects to software we can use to render them.

andy jackson wrote 7 weeks 1 day ago

Exactly…

Exactly – format identifiers are essentially an optimisation. I recognise that we need a proxy for software compatibility, and accept that formal specification is the usual social structure that is used to encourage compatibility. But by baking the spec. inside the format concept, we can’t capture the conflicts.

Looking at formats from this perspective, it’s not clear we need identifiers that are either permanent or unique. File format extensions, despite their transience and ambiguity, are currently entirely sufficient for the purposes of identifying a manageable list of compatible software. This is because current operating systems identify formats and define compatibility primarily in terms of file extensions. Or in other words, what’s the business case for version-level format identifiers? I think there is one, but we need the need to lead the deed.

I think it may be possible to build a format concept that is based directly upon compatible implementations, and only links to formal specifications rather than deferring to them (conformsTo rather than hasFormat). I have a set of follow-up posts on this type of thing, which I’ll put up as soon as I can.

Angela wrote 7 weeks 1 day ago

Use Cases

We need different information to satisfy different use cases.

It is enough to know which file format specification a software version was trying to implement for a bitstream if we assume that almost all implementations of the specification mostly work and mostly manage to process each other’s outputs. In that case, a lot of (preservation) actions can be based on the attempted file format alone. E.g. if:

  • we want to select a renderer that handles all (most) versions that have been implemented, or
  • we want to migrate all TIFFs in a uniform way, but we don’t need to treat different implementations’ output differently for the purpose of migration,

then this is sufficient information.

Another use case would be to find all bitstreams that have been produced by a particular implementation because it is known to be faulty (e.g. metadata missing in the header) and we want to repair them, making it more likely that they can be processed together with bitstreams produced by other implementations of this file format in the future. In this case we want to know which software version created them. One could trawl through all bitstreams and identify whether they have the known problem, but where this is not possible, information about the creating software should be stored.

What makes me uneasy is that today’s software may process most bitstreams of a format, but that there will be less choice of supporting software in the future as a format falls out of use and borderline implementations will become inaccessible. Also, if we migrate bitstreams of varying provenance in a uniform way we don’t know how the differences in implementations add up.

As for modelling this: PREMIS captures file format information and creating application information. What else should be captured if we wanted to have a complete picture?

andy jackson wrote 7 weeks 1 day ago

Does PREMIS really capture this?

I know PREMIS provides spaces for this information, but while we have PRONOM to capture format, we don’t yet have any such scheme for application software. At the BL, we just use a text string for the application name and version. Is that sufficient? I’m not sure.

As for capturing the complete picture, perhaps we need more information about the technical environment. Unfortunately, the amount of information involved in completely documenting the technical environment would be rather large. Do we need to know that the PDF was created using Adobe Distiller 9 running on Windows Server 2003? Do we need to know the firmware version of the hardware encoding dongle that was used to encode a JPEG 2000 image? Does any of the other hardware matter? We know that Word document formatting can depend on the available printers, but do we really need to dig that deep?

Jay wrote 7 weeks 11 hours ago

Format Identification…

This is a question we have been mulling over for a while, and it strikes me that perhaps we are confusing ourselves by lumping a few concepts together under the banner of format identification. I have tried to avoid the ‘definition of a format’ question for now, and focused on format identification methods.

I want to warn the reader that there is no real conclusion. I wanted to unpick some of the comments already made, and chuck a whole bunch of commentary back in the mix.

Firstly, I need to ‘define’ what a format is to assist my description, so for the rest of this message, a ‘format’ is simply a label I apply to a discrete bitstream (file), or discrete collection of files, that allows me to (1) make an explicit association between a clump of 0’s and 1’s and some software that can make some sense of them in a meaningful way and (2) to gather together files that can be operated on in the same way, be that operation viewing, rendering, decoding, editing etc, (with a specific focus on migration as an expected operation). 

(Caveat: This is a semantic description of a format, not a technical one.)

Now I have described what a format is, the tools that we use to identify formats undertake a number of different strategies, and each of those strategies tells us something quite different.

Approach 1:

Pattern matching.

C.f. droid signature matching.

This approach looks to find a signature bit pattern in the file. This method does not ‘care’ what the format is; it has no semantic or interpretive function (aside from the regexp aspects of the signature search). If a pattern matches, file A is Format 1.

Assuming the signature is accurate, and unique enough, this is a perfectly adequate approach. However, the assumption is that the signature pattern is granular enough to be meaningful.

I tend to think that there will need to be some subsequent nesting regardless of the given label. For example, perhaps I can find with a high degree of accuracy all files that declare themselves as JPEG v1.02. This assertion tells me nothing about any other data contained inside the file (e.g. metadata sidecars), and so when it comes to any future processing I may need to drill into the group of files I have labelled as JPEG v1.02 to create subsets of those files that have XMP, or those that declare an embedded colour profile in a specific way (that may cause further processing to fail, etc.).

But that’s OK. I am making an assertion based on information that I trust, and understand that the pattern matched label is likely to be accurate, but of differing granularity.
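A minimal sketch of this kind of signature matching (the signature table here is purely illustrative, not taken from PRONOM or DROID):

```python
import re

# Hypothetical signatures: a regex over the file's leading bytes -> label.
SIGNATURES = {
    rb"^%PDF-1\.4": "PDF 1.4",
    rb"^\x89PNG\r\n\x1a\n": "PNG",
    rb"^\xff\xd8\xff\xe0..JFIF\x00\x01\x02": "JPEG (JFIF 1.02)",
}

def identify(data: bytes) -> list[str]:
    """Return every format label whose signature matches the leading bytes.

    The method has no semantic function at all: if a pattern matches,
    the label is asserted, and granularity is whatever the pattern gives us.
    """
    return [label for pattern, label in SIGNATURES.items()
            if re.match(pattern, data, re.DOTALL)]
```

Anything beyond this (XMP packets, embedded colour profiles) needs a further drill-down pass, exactly as described above.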

Approach 2:

Extension matching.

In this approach, I take the file extension, and I trust that it (1) says something meaningful about the 0’s and 1’s and (2) is correct.

This process is less accurate than the pattern matching method. For example, a text file has the extension txt. Assuming that the extension correctly reflects the 0’s and 1’s, the actual content of the file could be characters encoded in a number of different ways: UTF-8, UTF-16, 7-bit ASCII or 8-bit ASCII, to name a few.

The point here is that (again, labouring the point, but assuming that the extension is ‘correct’) I can lump these files into a set that has some descriptive value. But there is some clear ambiguity where there is a choice of discretely defined labels inside an agreed superset (e.g. MIME/text), or where the same extension is used by unrelated format types (e.g. .tmp or .data).

It follows that any assertions made purely via this method should be regarded as being of low granularity and medium accuracy. Known conflicts can be flagged, and where any further operations are required, this ambiguity can be addressed at the point of interaction with the object.
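The same logic as a sketch, with the ambiguity made explicit (the mapping is illustrative only):

```python
# One extension can admit several formats, or be a known conflict.
EXTENSION_MAP = {
    "txt": ["UTF-8 text", "UTF-16 text", "7-bit ASCII text"],
    "tif": ["TIFF"],
    "tmp": [],  # known conflict: used by many unrelated applications
}

def identify_by_extension(filename: str) -> tuple[list[str], str]:
    """Return (candidate formats, confidence note) from the extension alone."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    candidates = EXTENSION_MAP.get(ext, [])
    if len(candidates) == 1:
        return candidates, "single candidate (still only medium accuracy)"
    if candidates:
        return candidates, "ambiguous: resolve at the point of interaction"
    return candidates, "unknown or conflicting extension"
```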

Approach 3:

(Standards) Validation

Example – JHOVE file validation.

This approach takes a file, and either inherits the format identity assertion established by another process, or establishes one itself via an internal signature/magic number/extension harvest, etc.

Once a format assertion is made, the file is passed through a validator that compares the file against an agreed implementation of the technical standard.

As ever, there is some finesse in the application, the primary constraint being the accuracy/completeness of the validation statements (e.g. if the TIFF validator expects a specific tag value, but sees something it considers outside the standard, the file may fail validation, even though the tag value that caused the failure has been adopted by file creation applications as an acceptable loosening of the original standard).

This approach is the most accurate of those described, as (assuming the validator is correct and appropriate) it tests the well-formedness of the file, and can therefore be considered a deeper, more semantically meaningful form of pattern matching.

Of course, the downside is that it’s very constraining, it’s complex/costly to develop a validator, and not all formats suit this structured level of validation.
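As a toy illustration of the validation idea (this is my own sketch, not JHOVE's logic), checking just the 8-byte TIFF header against the rules the specification lays down:

```python
import struct

def validate_tiff_header(data: bytes) -> list[str]:
    """Return a list of validation errors for the 8-byte TIFF header only.

    A real validator goes much deeper (IFDs, tags, value types); this
    checks the byte-order mark, the magic number 42, and the first IFD
    offset, which the specification requires to be at least 8.
    """
    if len(data) < 8:
        return ["file too short to hold a TIFF header"]
    if data[:2] == b"II":
        endian = "<"   # little-endian ('Intel') byte order
    elif data[:2] == b"MM":
        endian = ">"   # big-endian ('Motorola') byte order
    else:
        return ["byte-order mark is neither 'II' nor 'MM'"]
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    errors = []
    if magic != 42:
        errors.append("magic number is not 42")
    if ifd_offset < 8:
        errors.append("first IFD offset points inside the header")
    return errors
```

Even at this depth the strictness question surfaces: each extra rule we enforce is a rule some real-world writer may have bent.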

Approach 4:

Magic Number

This is a simple version of pattern matching, but is used as a core OS level file format identification approach.

It’s not always true, but when it is, it is a very quick, generally accurate approach. I wonder if it’s a hangover from the good old days, or if we should still be using it? For example, I suggested a signature for MP3 that only looks at the opening part of the header, taking the declaration ‘ID3..’ or [0x 49 44 33 03 00] as a low accuracy, but high confidence signature that allows a superset of ‘MP3’ to be used, without having to drill into the specifics of the format. I can just hand the MP3 file over to an MP3 player (for example) and let it sort out whether it can play it or not.
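That suggested check takes only a few lines (my sketch; the byte values are the ones given above, i.e. ‘ID3’ followed by the ID3v2.3 version bytes 03 00):

```python
ID3V23_HEADER = b"\x49\x44\x33\x03\x00"  # 'ID3' + version bytes 03 00

def probably_mp3(data: bytes) -> bool:
    """Low-accuracy, high-confidence test: does the stream open with an
    ID3v2.3 tag header? Matching files go into an 'MP3' superset, and the
    player is left to sort out whether it can actually play them."""
    return data.startswith(ID3V23_HEADER)
```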

Approach 5:

‘It is, because I said it is’

This approach is counterintuitive, but must have a place in our worlds. It basically says ‘I don’t have a test that demonstrates that file A is format A, but I trust that file A is format A’.

I would reserve this approach for files that are arcane/obscure, and there are obviously some huge risks to the accuracy of this approach, but it has to be a valid method if there is no pattern/extension test that allows me to assert format identity with confidence.

I want to reserve the right to make an expert judgement about files, simply because there are occasions when it’s the best we can do. Arguably the label I might want to use is a collection-level identifier, not a format-level identifier, but for us format identity is such a core process in how we manage and operate on files that I need to use something, and fmt/unknown does not capture the knowledge I might have about the technical makeup (and commonality) of a group of files.

Approach 6:

Demonstration of format/application pairing

This approach is described by Bill in his first comment – ‘One (impractical) way to do that would obviously be to try to open each object in turn and see what happens’ – and actually I think one of our technical challenges is unpicking where a rendering/viewing/editing application is doing this anyway.

For example, if I pass filename.doc to MS Word, it doesn’t actually care whether it’s filename.doc, filename.rtf, filename.txt or filename.blah; it will take the file and, using the inbuilt converters and renderers, attempt to interpret the 0’s and 1’s and render the file in a way that makes sense.

So I have filename.blah and I know it can be viewed in MS Word v8. Do I actually need to know any more? As long as I maintain the ability to run MS Word v8, and I protect the 0’s and 1’s so there are no changes, that’s all I need to do.

This approach is highly accurate – I can demonstrate that I can reproduce the 0’s and 1’s in a way that is meaningful and desirable.

But this approach is also very intensive, and there is no tool that I am aware of that would allow broad validation at this level across multiple format types in an efficient way.

So, what’s my point? (and that’s a good question…)

We have a few floating definitions of what a format is, and we have a few different ways of identifying, or asserting a format against a file, or group of related files.

The primary bottleneck for me is the tools, and therefore the approach we use to make that assertion. We could have the best, most complete, most accurate format registry in the world, but if we don’t have a suitable method of applying that knowledge base to the objects in hand, there is no point to it. So registries and tools go hand in hand to deliver the capability of making format type assertions.

I think that if I know how my format assertion came about, I might act differently when I am dealing with the object. There is something here about understanding the accuracy of assertions, the repeatability of assertions (over time), the confidence of assertions, and the granularity of assertions. All of these factors are important when I try to understand a collection of objects, and to understand the risks to being able to interpret those files in a meaningful way.

We also have this notion that format assertions are a relatively permanent thing. PRONOM (I would argue) is the primary reference for format identity, and that’s a great thing. But we are also increasingly aware of issues arising from our collective reliance on this resource as our primary reference. It has gaps, it (more specifically, via DROID) mixes approach 1 and approach 2 in a very opaque way, and it is evolving.

None of these statements is meant as a criticism of PRONOM; rather, I think the success of the idea (which can only be a good thing – talking about success!) means that perhaps it’s time that, as a community of practitioners, we pause, reflect on where we are, and start to collectively agree some governance principles that allow us to make trusted assertions, with confidence that when we revisit them in the future they will make sense.

Suggestions:

1)     We agree that any file:format assertion is a reflection of the capability we have at the time, and is subject to change in a managed and agreed way.

2)     We seek a way of versioning our assertions, transparently within the assertion itself.

3)     We continue this (and related) conversation…

andy jackson wrote 6 weeks 6 days ago

Versioning assertions.

Thanks, that’s a good breakdown. I’m becoming less keen on trying to tease things apart, and instead want to embrace the ambiguity and encourage data to be shared, i.e. allow a format concept that includes all of the definitions you outlined. I think there are a few refinements to make up-front (e.g. distinguishing between the IP restrictions of the spec. and the format), but they need not be so radical. As you imply, the important thing is to find ways to trust and verify our assertions. Permanence and modelling can take a back seat.

On a related note, looking at the PRONOM change log, it’s not clear to me what kind of changes to a record prompt full deprecation of the record rather than an edit. The TIFF version records have all been deprecated and replaced with one record, but the precise reasoning is not clear. Perhaps only splits and merges prompt deprecation? Also, I’m not sure whether this decision is based on the fact that it is hard to determine which version of the format a TIFF file is supposed to be (there is no embedded version number), or whether it is still possible to use PRONOM to talk about TIFF 5 explicitly.

gmcgath wrote 3 weeks 6 days ago

TIFF

TIFF has an especially strong divergence between format as specified and format as used. Adobe owns the specification but hasn’t updated it in many years. Some doubtful data type requirements have been relaxed. The spec requires even alignment of data, but every modern implementation accepts odd alignment. This was a problem when implementing JHOVE: Should validity require that files conform to every detail of the specification, even when no one observes them any more?
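To make the strict-versus-practical gap concrete, a strict checker can flag the odd offsets that the specification forbids but that every modern reader accepts. This sketch (mine, not JHOVE's code) walks the first IFD of a little-endian TIFF and reports out-of-line values stored at odd offsets:

```python
import struct

def odd_value_offsets(data: bytes) -> list[int]:
    """Return the offsets of out-of-line IFD values that break the spec's
    even-alignment rule. Sketch only: assumes 'II' (little-endian) byte
    order and a structurally well-formed first IFD."""
    ifd_offset = struct.unpack_from("<I", data, 4)[0]
    (count,) = struct.unpack_from("<H", data, ifd_offset)
    # Size in bytes of each TIFF field type (type codes 1-12).
    type_sizes = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
                  7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}
    offenders = []
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        tag, ftype, n = struct.unpack_from("<HHI", data, entry)
        size = type_sizes.get(ftype, 1) * n
        if size > 4:  # value stored out of line, at an absolute offset
            (value_offset,) = struct.unpack_from("<I", data, entry + 8)
            if value_offset % 2:
                offenders.append(value_offset)
    return offenders
```

Whether a non-empty result should fail validation, when every modern renderer shrugs it off, is exactly the question above.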

Jay Gattuso wrote 3 weeks 4 days ago

Conformance to TIFF

Great question – what’s your opinion?

This is an issue we’ve had to broach a few times here in New Zealand. Not only with TIFF, but other formats too.

We use JHOVE and DROID as core processes in our initial file validation and characterisation workflow. And we, like others, have come across collections that fail JHOVE but are renderable in a standard image/tiff viewer.

Our view in general has been that if the file opens in a few standard viewers with no warnings or errors then we should ignore the JHOVE warning. If the image cannot be opened, or is opened with warnings/errors, then we need to look into the file further to understand what is going on, and to see what options are available.

We have seen this issue with PDF (quite a number of times), TIFF (occasionally) and JPEG (in a single case of a few hundred files that I plan to document in detail at some point).

For the PDFs we generally ignore the JHOVE warning once we have checked that the document opens in a few PDF viewers. (I actually had a PDF with embedded video come in a few days ago, and JHOVE wouldn’t assess the file at all – it seems the document caused JHOVE to completely fall over.)

For the TIFF we decided to ‘re-write’ the XMP chunk, as it was the XMP structural declaration in the header that caused the JHOVE failure, and subsequent warnings in GIMP & Photoshop.

For the JPEGs we are still assessing the options, but are likely not to call them fmt/44 (their ‘native’ format) but a new fmt that is a version of fmt/44, meaning we can ignore the JHOVE errors and keep the issue in hand for these objects.

I am happy to post on any of these cases in more detail if there is interest.

gmcgath wrote 3 weeks 2 days ago

Strict or practical conformance?

JHOVE is based on the premise that at some point in the future, a preservationist might have files in a given format and the format spec and will have to write software from scratch. Officially I should be defending that position, but I’m not sure how realistic it is. On the other hand, expecting people decades in the future to have today’s software isn’t necessarily realistic, and if you’re creating files for long-term preservation, strict conformance to the spec is safest. JHOVE is intended mostly to validate files intended for long-term preservation, and that affects its strategy.

The case of PDF isn’t quite the same. There are probably still some bugs in the PDF module, and it’s never been updated beyond PDF 1.6, so any files that use 1.7 (ISO 32000)-specific features may fail.

Jay Gattuso wrote 3 weeks 2 days ago

Strict or practical conformance? The Archivist dichotomy…

Interesting comments, thank you.

My view is that strict conformance is desirable, and something we are building into our thinking here. The logical end point of that thread is full normalisation, and I wonder where down that particular road it’s appropriate to stop.

There are a few levels of enforcement, ranging from stripping redundant/padding characters, through recasting documents in their original format via a known creation application (e.g. resaving a PDF v1.5 as a PDF v1.5 using Adobe Acrobat), to migrating all file types to an agreed single implementation (e.g. moving all PDFs to v1.6).

Mechanics aside, the very nature of any of these conformance processes changes the original object, raising a number of workflow and audit questions.

I guess the underpinning questions are along the lines of ‘should we account for every single byte of a file, especially in the technical/structural aspects? Can we ever ensure that any conformance steps do not cost us useful information?’

