Can we talk about fmt/42, fmt/43 and fmt/44?

In a relatively recent signature update, the fmt/44 signature was updated to allow some data after the stated EOF marker (ff d9).

In the case that started this off, a number of fmt/44 jpg files were found that had a couple of bytes after what DROID looks for as an absolute EOF.

I had a look into the specs for jpg, trying to unravel this story - were the extra bytes useful to someone? were we missing something by ignoring these bytes?

It transpired these bytes were added by a production workflow, and didn't really add to the informational aspect of the jpg (though one could argue they add some informational aspect to the digital object as an abstract entity...). It also transpired that the EOF marker used by the jpg signature is not described in the same way by the jpg standards. The standards describe an End Of Image (EOI) marker of [ff d9] and do not seem to make reference to any data held in the file after the EOI marker. The jpg standard doesn't care: the ‘jpg’ stops at the EOI marker.

If we take a close look at the fmt/42, 43 and 44 signatures, we can see there is an absolute (apart from the slight offset in the case of fmt/44) EOF marker expected. The EOF marker is the EOI marker, which can occur at an arbitrary point in the file. Of course it does usually occur at the EOF, that’s very typical and expected, but in the case of jpg files with an offset EOI marker DROID fails to match the version correctly (or at all) and offers all the PUIDS with jpg extensions as possible matches.

Over the last few months, I have seen perhaps 30 examples of jpg files (fmt/43 and fmt/44) that have a bunch of bytes after the EOI and therefore fail DROID signature matching. These files can be demonstrated to be valid fmt/43 or fmt/44 files by (1) being rendered in all the jpg viewers (none of which seem to care that there is data after the EOI marker) and (2) stripping the post-EOI bytes from the file and re-running in DROID.
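To make step (2) concrete, here is a minimal Python sketch of the "strip the post-EOI bytes" check. It is only an illustration: the function names are made up for this example, and it assumes the last [ff d9] pair in the file is the real EOI marker, which may not hold if the trailing data itself happens to contain those two bytes.

```python
from typing import Tuple

EOI = b"\xff\xd9"  # JPEG End Of Image marker, as quoted in the post


def ends_with_eoi(data: bytes) -> bool:
    """Strict check: the EOI marker must be the very last two bytes."""
    return data.endswith(EOI)


def strip_after_eoi(data: bytes) -> Tuple[bytes, bytes]:
    """Split at the LAST EOI marker and return (image_part, trailing_bytes).

    Assumption: the last ff d9 pair is the real EOI. If no EOI is found,
    the data is returned unchanged with empty trailing bytes.
    """
    idx = data.rfind(EOI)
    if idx == -1:
        return data, b""
    end = idx + len(EOI)
    return data[:end], data[end:]
```

The stripped `image_part` is what would then be re-run through DROID; the `trailing_bytes` are the extra data in question.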

I would like to propose that the fmt/42, 43 and 44 signatures get changed, to support the variable placement of the EOI marker as per the jpeg specs (and our experience of the files we are seeing).

This proposal has a few issues....

(1) What do we do with this extra data? Should we be scraping it somehow?

(2) Could one argue that a jpg with data post EOI is a different format, as there is clearly an informational aspect that is encapsulated in these extra bytes (although, of course if it’s not structured in a standardised way it’s of limited value to the community)

(3) Is there a need to make this change - it would impact one of the most common format types we have...

Comments

Since the format is agnostic on any data following the end of format marker, it would seem to be a good idea to make it a wildcard * search from the end. This would have almost no performance impact, as it would only be triggered if the rest of the signature had already matched, but it would improve accuracy. I wouldn't say it's a different format to jpg, since jpg allows this data to appear. Whether the rest of the data should be scraped or not is another matter. It might be interesting to treat the rest of the data as a new stream to be run through format identification. However, I expect (but don't know) that this data would be proprietary info recorded by some software. I doubt DROID would ever identify it... It might be interesting to flag that there was additional data, but that opens a can of worms to do it generically, not just for jpg...
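A rough Python sketch of that idea, for illustration only. It is not DROID's matching code and it does not use the real PRONOM signatures: it assumes a bare SOI marker (ff d8) as the BOF test, and `max_trailing` is a made-up parameter standing in for the allowed offset from EOF (None meaning a full wildcard).

```python
SOI = b"\xff\xd8"  # JPEG Start Of Image marker (simplified BOF test)
EOI = b"\xff\xd9"  # JPEG End Of Image marker


def identify_jpeg(data, max_trailing=None):
    """Return (is_jpeg, trailing_len).

    The end of the file is only searched once the BOF test has passed,
    which is why the wildcard costs little for non-matching files.
    max_trailing=0 behaves like an absolute EOF match; None is a wildcard.
    """
    if not data.startswith(SOI):
        return False, 0  # EOF side is never examined
    if max_trailing is None:
        start = 0
    else:
        start = max(0, len(data) - len(EOI) - max_trailing)
    idx = data.rfind(EOI, start)
    if idx == -1:
        return False, 0
    return True, len(data) - (idx + len(EOI))
```

The second value could be used to flag that additional data is present without changing the identification itself.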

I stripped out the EOF for fmt 42, 43 and 44. I then tested the 3000(ish) files that we have previously ID'ed as fmt 41, 42, 43 and 44 (500 x fmt/41, 110 x fmt/42, 500 x fmt/43 and 500 x fmt/44 - then all these files again with no file extension - making 1610 unique signature comparisons, and 3220 file comparisons). I left the fmt/41 files in as kind-of-ground truth.

The changes to the signature led me down a bit of a rabbit hole - there is an implication for x-fmt/80. This is problematic to resolve as the x-fmt/80 (http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?st...) sig is pretty weak - it is a short and not hugely specific string match at an absolute offset from the BOF. I have included a has-priority statement for the three IDs, to allow the jpg PUIDs to match preferentially (in this case the x-fmt/80 match is clearly a false positive). As I don't have any examples of x-fmt/80, I can't test how this works across the complete set of related files - fmt/42, 43 and 44 and x-fmt/80. The inclusion of the has-priority statement results in the offering of both fmt/44 and x-fmt/80 as hits, so I'm not sure how to weed out the x-fmt/80 hits without making the x-fmt/80 signature more specific.

 

Sig file with no <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID> clause: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF.xml

Sig file with <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID> clause: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF_v3.xml

I also ran the new signatures over the ~40 fmt/43 and fmt/44 files I have collected that have data after the EOF - and these ID'ed as the expected PUIDs.

In summary of my first basic tests - the removal of the EOF pattern for fmt 42, 43 and 44 leads to inaccurate matches for ~20 of my test files, which pick up an erroneous x-fmt/80 match. This could potentially be resolved by making the x-fmt/80 signature tighter - assuming this is possible. Otherwise, the remaining 1600 signature-based IDs gave the expected results.

I'm not saying we should dump EOF out of hand, but in this case my limited testing has not highlighted any issue with removing the EOF aspect of the fmt/42,43 and 44 signatures.

I would be very interested in anyone else’s experiences. Feel free to have a play and let us know how you get on.

Alternatively I will have a look at making a wildcarded EOF pattern for the same PUIDs. There is a whole bunch more testing needed before this was committed, but I'd like to hear from others before I jump in and push this any further...

 

You are using the wrong id in <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID>!

You should be using the file format id, not a signature id. "467" is a signature id of x-fmt/80, but its file format id is actually "122". If you change this, the erroneous matches should disappear.
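This kind of mix-up can be caught mechanically. The Python sketch below is a hypothetical check, not part of DROID. It assumes the signature file keeps `FileFormat` elements (with `ID` and `PUID` attributes) separately from `InternalSignature` elements, and the embedded sample is a cut-down illustration in which the fmt/44 format ID (9001) is made up. Only 467 and 122 come from this thread.

```python
import xml.etree.ElementTree as ET

# Cut-down, illustrative signature file. ID 9001 is invented for the example.
SAMPLE = """<FFSignatureFile xmlns="http://www.nationalarchives.gov.uk/pronom/SignatureFile" Version="55">
  <InternalSignatureCollection>
    <InternalSignature ID="467"/>
  </InternalSignatureCollection>
  <FileFormatCollection>
    <FileFormat ID="122" PUID="x-fmt/80">
      <InternalSignatureID>467</InternalSignatureID>
    </FileFormat>
    <FileFormat ID="9001" PUID="fmt/44">
      <HasPriorityOverFileFormatID>467</HasPriorityOverFileFormatID>
    </FileFormat>
  </FileFormatCollection>
</FFSignatureFile>"""


def local(tag):
    """Drop any '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def suspicious_priorities(xml_text):
    """List (puid, target_id, kind) for priority targets that are not format IDs."""
    root = ET.fromstring(xml_text)
    formats = [e for e in root.iter() if local(e.tag) == "FileFormat"]
    format_ids = {f.get("ID") for f in formats}
    signature_ids = {e.get("ID") for e in root.iter()
                     if local(e.tag) == "InternalSignature"}
    problems = []
    for fmt in formats:
        for child in fmt:
            if local(child.tag) != "HasPriorityOverFileFormatID":
                continue
            target = (child.text or "").strip()
            if target not in format_ids:
                kind = "signature ID" if target in signature_ids else "unknown ID"
                problems.append((fmt.get("PUID"), target, kind))
    return problems
```

Note the limit of the check: a wrong number that happens to be some other format's ID would pass unnoticed, so it only catches the case described here.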

Just to clarify my understanding of your results, are you saying that (discounting the erroneous x-fmt/80 matches), all jpgs were identified correctly without EOF markers?

Yupe - that was the issue - thanks - I had the wrong ID associated.

 

Here is the amended XML: http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55_no_jpeg_EOF_v5.xml

 

 

Are you saying that without EOF markers, you get x-fmt/80 matches, but with them you don't? Because that doesn't make any sense to me, given how the DROID algorithm works (or is supposed to work).

DROID checks each file against all the signatures, recording if any of them matched. Matches are only ever removed if a higher priority file format is also detected. Given there were originally no priority relationships with x-fmt/80, if x-fmt/80 could possibly match, it should already have appeared, regardless of whether the fmt/42, 43 and 44 signatures also matched, or indeed were present at all.
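As a toy model of that rule (a Python sketch of the behaviour as described in this comment, not DROID's actual code; the function and argument names are invented for the example):

```python
def resolve(matches, has_priority_over):
    """Apply priority rules to a set of matched PUIDs.

    matches: set of PUIDs whose signatures matched the file.
    has_priority_over: dict mapping a PUID to the set of PUIDs it outranks.
    A match is dropped only if a format that outranks it also matched.
    """
    outranked = set()
    for puid in matches:
        outranked |= has_priority_over.get(puid, set())
    return matches - outranked
```

With no priority rule, a file matching both signatures keeps both hits; adding the rule removes x-fmt/80 only when fmt/44 also matched. Changing the fmt/44 signature alone therefore cannot create or remove an x-fmt/80 match.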

What am I missing here?

Ok, further digging - what you are missing is that with the vanilla v55 sig file, these files get the dual ID..... These files come from a big pile of files that I pulled from our system based on their PUID, ignoring that we actually do some filtering POST DROID ID, so this issue would have been filtered out, and these objects transparently assigned the fmt/44 PUID.

The dual ID is expected for these files, and the x-fmt/80 ID occurs as a false positive hit for the x-fmt/80 pattern. I have 20 files that have this match (0x 11 01 @ offset 522).  All from the same producer, so it looks like the MD written by PS7 in this case is triggering this FP.

Better summary of my tests - adding the [HasPriorityOver] element beneficially refines the fmt/44 signature regardless of any EOF changes.

The removal of the EOF has no impact on the ID of my test jpg files (with previously asserted PUIDs) but does allow the accurate ID of jpgs with data after the EOI marker.

andy jackson's picture

We've seen some similar problems, both with PDFs and JPEG2000s. In the former case, there is some variation in how the PDFs are closed, but this variation has no effect on the interpretation of the item (as in your JPG case here) - I think the DROID signatures were modified to take the variation into account. In the latter case, we had JP2 files that had been accidentally damaged in such a way that most of the damaged files would fail to be identified as JP2 if the end-of-file marker was required.

In the PDF case, the EOF signature causes us to waste time dealing with exceptions that are completely harmless. In the JP2 case, we only want to know whether any given file is 'probably intended to be a JP2', so that we can run deeper analysis and validation upon it, and so the EOF signature gets in the way of this workflow.

Furthermore, as far as I can tell by manually inspecting the DROID signatures, there are no cases where the EOF signature tells us anything more than the BOF signature - i.e. ignoring the EOF signature when a BOF signature is present does not alter the result. Neither am I aware of any cases where a format can only be identified using EOF signatures (certainly, there are no such signatures in v55 of the DROID signature file). Finally, it is interesting to note that two of the most widely-used identification tools, file and Apache Tika, only allow BOF signatures (and in fact, use just an 8K chunk from the start of the file). As they don't seem to be required, and in fact cause a range of problems, I currently consider EOF signatures to be actively harmful, and would rather we simply stopped using them.

Unless, of course, there are cases where EOF signatures are really needed?

I don't much like the EOF components of signatures either. I suspect they don't add much accuracy, but do involve scanning at the end of a file or stream. In the case where DROID has to work from compressed files, it only has a stream to work with, so it must read the entire stream in order to get to the end, just to read these EOF markers! Well, there are also some annoying signatures which only involve a variable length scan (no BOF or EOF offset) - which potentially can force a scan of the entire stream too (but there used to be only about 2 of those, for fairly uncommon formats, so it would be nice to be able to disable those if required).

I suspect that very frequently, they do not add any identification accuracy - and as we have seen, sometimes decrease it!

It would be very interesting to strip out the <ByteSequence Reference="EOFoffset"> elements of the signatures from a signature file, and compare results with the original signature file, running over a fairly large corpus, of course.
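One way such a stripped file could be produced is sketched below in Python. It is a hypothetical helper, not an official tool: it simply removes every `ByteSequence` element whose `Reference` attribute is `EOFoffset`, wherever it sits in the tree, and it assumes no signature relies on EOF sequences alone.

```python
import xml.etree.ElementTree as ET


def strip_eof_sequences(xml_text):
    """Parse a signature file and remove all EOF-anchored byte sequences.

    Returns (root_element, number_removed). Writing the result back out,
    and changing the version number, is left to the caller.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Take a snapshot of the elements first so removal cannot upset iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            is_byte_sequence = child.tag.rsplit("}", 1)[-1] == "ByteSequence"
            if is_byte_sequence and child.get("Reference") == "EOFoffset":
                parent.remove(child)
                removed += 1
    return root, removed
```

The count of removed sequences is a useful sanity check before comparing results against the original file.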

If it turns out that the EOF byte sequences don't really affect identification accuracy, it would be an incredibly simple change for DROID to turn on or off running the EOF byte sequences (it already sorts the sequences to run all the BOF parts first, followed by the EOF parts for each signature).

This could be suggested for incorporation into DROID 7, by adding a new requirement to the wiki: http://droid7.wikispaces.com/

andy jackson's picture

I know some of the folks in the SCAPE project are currently running DROID over the govdocs1 corpus, so if we can construct a version of DROID that does not use EOF signatures then they'll probably be able to run a suitable test. I'll follow that up.

If that looks good, then adding a requirement to the DROID 7 wiki is an excellent idea. Thanks!

If you do edit a signature file to produce another version, and manually upload it into DROID, remember to also change the signature version in the file header, and in the file name, to a different number. It is probably best to pick a low number (e.g. version 2), so it won't conflict with any higher versions that may appear, and won't become the default highest version available to DROID. DROID may become confused about which file to run profiles with if this isn't done. And it becomes harder to remember which signature file was run over which set of files!

I can echo these comments... I've been bitten by this more than once. My current method is to completely close and restart DROID every time I reuse a filename - 'just to make sure'. I'll look at versioning via the XML internally and see if that fixes things for me.

Perhaps a D7 requirement is a 'flush' function that forces a re-parse of the signature source XML... but that may just be me being lazy and not using proper versioning....

I took out all the EOF patterns. I've not tested it, other than to validate the XML, run it up in D6 and fire a few known files at it. Seemed to work OK.

http://dl.dropbox.com/u/59534857/DROID_SignatureFile_V55%20-%20no%20EOF.xml

andy jackson's picture

Thanks for this, I'll pass it on to the SCAPE folks doing tool evaluation and see if they have time to try it.

All good points / questions Andy..

 I would be very interested in creating a version of the whole sigfile sans EOF markers and seeing what the difference is between the as is sigfile, and this amended one.

 

I will get round to it (unless someone beats me to it..) but I'm currently swamped in another set of tests that is looking at the longer term changes to sigs over time - all this data will be useful, and I am starting to wonder how we can best share (1) source data for testing and (2) results from these kinds of tests.

The minimalistic approach adopted by O/S based file ID methods is compelling, but I suspect somewhat skewed by the complex overhead of end point applications dealing with internal conversions transparently to the user / OS, which still potentially leaves us somewhat in the dark about the exact nature of the files we are looking at.

 

 

andy jackson's picture

Not sure what you mean about the OS tools. None of the ones I'm dealing with perform 'internal conversions'. That said, it is true that the definition of format in the OS tools is often more loose than for PRONOM (although even PRONOM's definition is still somewhat slippery). My currently preferred approach is to re-use OS identification algorithms/source code but to extend or replace the supplied signature file with one that matches up the PRONOM IDs etc. These signatures may end up in core Tika, but if they don't, we can still have a version of the tool with a new signature file that takes our stricter and more fine-grained format definitions into account.

Finally, of course, EOF markers may be useful for validation (if we capture all live variants), but I don't want to use them for identification because that is just a prelude to deeper validation. This means I would rather identification produced false positives that my production workflow can sift through than false negatives I have to override manually.

I don't think it's the OS tools being referred to here. It's the end user applications which open, for example, any kind of Word file when the OS only identifies it as a Word file. In other words, OS level file identification is normally too coarse-grained.

Interesting you are re-using OS identification algorithms and extending their signatures to give PUID-equivalent matching. Is there a particular reason or set of reasons driving you to doing this work? I completely get that DROID/PRONOM aren't suitable for all contexts or workflows, but I am interested in understanding the contexts where they don't fit and why.

andy jackson's picture

Yes, indeed, all operating systems tend to describe format in very coarse ways (usually file extensions associated with applications), as do many open source applications, but I still don't quite understand what this has to do with what I was saying originally. The file identification tools I am talking about (file, Tika) work at the 'common name' and MIME type levels respectively, but both have expressed some interest in more fine-grained identification.

My main drive here is to understand what features we really need in identification, and whether it is possible to have signature files that can be easily shared across the different tools. If BOF RegEx are sufficient, then we can generate signature files for all these tools from a single data source, and whoever needs to use the signatures can do so easily without having to switch tools or platforms. As you say, different tools suit different contexts, so I'd like to be able to get the same results across different contexts. If this approach works, PRONOM will have more users, and the more users we have, the more help we will have in growing the signature data to cover more formats.

In other words, this isn't primarily about which contexts the DROID/PRONOM tools do and don't fit into, but about sharing signature information and getting that valuable data embedded into tools that are more widely used and supported. However, it is true that if DROID was easy to deploy on Hadoop, then this would not be so pressing. Attempting to do so revealed a number of issues, e.g. usual DROID usage requiring a File while Hadoop only provides an InputStream, but they all boil down to DROID being optimised for desktop usage and Tika being optimised for batch execution (and indeed Map-Reduce tasks).

I completely agree that sharing signatures across platforms would be great. There is a lot of interest on the DROID 7 development wiki for better signature management / development tools. One of the proposals was to switch to Java regular expressions. I think I shot that down a bit (for mainly technical reasons) - but pointed out that DROID already uses very regular-expression like signatures - but the way they are delivered is very opaque in the DROID XML.

A tool to allow signatures to be specified in their original reg-ex like form would be very welcome here. It would also facilitate signature sharing between other platforms, which can only be a good thing for everyone. You can see the more advanced syntax already supported by DROID if you look at the container signatures in DROID 6.

Very interesting that DROID is hard to deploy on Hadoop. Maybe that should be a proposal for the DROID 7 development wiki? Or "make DROID more stream friendly"? Which brings us nicely back to not having to process EOF signatures!

I've actually been working away for the last year or so on some new byte pattern matching capabilities in the byteseek library, which is *much* more stream friendly. In particular, it doesn't need to know the length of the stream to match or search (unless you actually want to scan backwards from the end). I hope to get the 1.3 release out in the next couple of months (but have been saying that for the last couple of months!).

andy jackson's picture

I think I'll have to put some more effort into the DROID 7 requirements wiki. I'll try to spend some time doing that now. FWIW, I think your point about RegEx performance 'falling off a cliff' will not turn out to be the case. The Java RegEx engine appears to employ Boyer-Moore as appropriate, and my informal testing indicates that the speed difference between the two is probably negligible. We're trying to do some more formalised large-scale testing, and the results of that will be published as soon as we can.

Fascinating. I had no idea that Java regular expressions used Boyer Moore for searching internally. I note they use Boyer Moore which (while theoretically faster than the Horspool variant used in DROID) is usually slower due to its added complexity - but I'm nitpicking here.

The more serious objection to using native regular expressions is that they are forced to work on char[] buffers (or Strings, or other char sequences). Setting aside the conversion of byte[] to char[] in order to process byte-oriented streams as char arrays, the bigger issue is that these regexes cannot process expressions which would span more than one array. In practice, this means you have to pick a candidate buffer size (e.g. 64Kb), and then you can only identify signatures which fit into this buffer.

By contrast, DROID has already been engineered to process its (near) regular expressions across buffers if necessary, allowing signatures to match as long as they need (or as small as you would like them to be), in each case only loading enough to make it worth loading a bit more.

Hmmm... a further analysis shows it may be possible to read from as many buffers as necessary by implementing the CharSequence interface (ultimately backed by byte array buffers read from streams or files as necessary). This could work as flexibly as DROID currently does. Maybe my objections were premature. I'll look forward to any further work done on this.
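The buffer boundary problem is not specific to Java. As a rough analogue in Python (not the CharSequence approach itself, and not how DROID or byteseek do it), the sketch below finds a fixed-length literal in a stream by carrying the last len(pattern) - 1 bytes over from one chunk to the next. The function name and default chunk size are arbitrary choices for the example.

```python
import io


def find_in_stream(stream, pattern, chunk_size=65536):
    """Return the absolute offset of the first occurrence of a literal
    byte pattern in a binary stream, or -1 if it never occurs.

    A match that straddles two reads is still found, because the tail of
    each window is prepended to the next chunk.
    """
    tail = b""
    consumed = 0  # number of stream bytes that precede the start of `tail`
    keep = len(pattern) - 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        idx = window.find(pattern)
        if idx != -1:
            return consumed + idx
        new_tail = window[-keep:] if keep > 0 else b""
        consumed += len(window) - len(new_tail)
        tail = new_tail
```

This trick only covers patterns of known, bounded length. Expressions that can match an arbitrarily long span need something closer to the buffer-spanning view discussed above.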

I'm referring to the O/S ID mechs as a coarse format ID process, and the 'Endpoint' process being some consuming application (e.g. MS Word) that has its own opaque conversion/ID process.

' This means I would rather identification produced false positives that my production workflow can sift through than false negatives I have to override manually.'

Interesting point - I'm interested to hear how the two different classes of errors (FP and FN) are weeded out - it sounds like you are suggesting the FN have to be manually worked, but the FP can be systematically addressed?

 

I also think that we often (and dangerously) conflate 3 different processes - format classification (lumping things that are the same into a pile), format ID (giving a pile of things a label) and format validation (asserting that the things in the pile are a valid & formal set of things with the previously assigned label). Perhaps this is one of the things that gets muddied in this space - especially as sometimes a PUID can give a high confidence classification, ID and validation, and other PUIDs will only give a low confidence classification...

andy jackson's picture

We find most of our workflows are trying to assure that items are renderable, i.e. 'will this item display okay?'. We therefore want to pass our data to the right tool that we use to estimate whether rendering will work, which may be a fairly simple format validation or may be something more complex. For example, we want to pass our JP2 files to jpylyzer to look for damaged/truncated ones.

Therefore, if DROID gave us a false negative (as it would if it expects the JP2 EOI marker to be present), we would end up with a big set of files marked 'unknown format', mixed up with everything else marked 'unknown format', despite the fact that the first set of files are very nearly JP2s. This has to be picked apart manually at present.

However, if we use a more forgiving, inclusive identification step (which we do), we only risk false positives, and those would be revealed by the format-specific validator. Indeed, this is what happens, and the damaged/truncated files are marked as 'jp2 but invalid'. Lovely.

i.e. the reason FN are worse than FP is that we have format-based validation workflows, and therefore FP will be weeded out downstream. Furthermore, in my experience, most FP are usually either renderable-but-non-conforming or are malformed instances of the identified format.

Intrigued by the hints of Boyer Moore in the java regular expression classes, I did a little more digging to observe it in action. I was amazed that something so useful wasn't more widely known. To cut a long story short, it turns out that BM is only enabled for java "regular expressions" if you compile the expression as a literal case sensitive string match, using the Pattern.LITERAL compile flag. I'm afraid this means it's not a regular expression anymore, just a simple string search.

The implementation of Boyer Moore Horspool in the byteseek library (therefore, in DROID) is already considerably more advanced than this, in that it can handle character classes (sets of bytes) in positions of the string to be matched, and can handle case insensitivity (although no signatures currently use this I believe - but some could definitely benefit from this).

andy jackson's picture

Well, that's interesting. Perhaps we could knock up a patch for Open JDK and then everyone who uses Java 8 could benefit from this advanced implementation?

As for the additional functionality, that feels a bit like putting the cart before the horse to me. The case-sensitivity seems like a stretch, as RegEx can be declared case-sensitive or insensitive and so any DROID expression that needs mixed case sensitivity could at worst be matched using two or more RegEx.

I think we should aim to use the minimum functionality we need, and the big gaps in DROID matching ability seem to be centred around text formats. Are there any binary formats we can't match right now?

I'm not entirely convinced about the performance issue either, as I suspect most signatures are literal matches. I'll wait for the experimental data to lead the way there. Guessing performance makes me nervous.

Patching the regex implementation in java is a nice idea, but it would be a very big "patch"!

The use of BM would have to be enabled for regular expressions that were not just string literals (so you could specify the character classes in the first place!). This in turn would require the identification of candidate regular expressions that could be searched for in this way, involving classifying sub-components of the expression as "BM-able". This means changing core behaviour of quite a lot of the underlying engine.

In point of fact, this is exactly what is on the roadmap for byteseek. Even given that byteseek is being designed with this in mind, it's not quite as straightforward as the explanation above indicates. Maybe once I've worked out the kinks in marrying up automata-based regular expressions (deterministic and non-deterministic automata are also in byteseek, but not used by DROID!) with sub-linear sequence (and multi-sequence!) searching, I'll turn my attention to working it back into the core Java libraries! There would be additional complications making all of that work with unicode text, rather than just byte sequences.

Btw: if you enable case insensitivity for Java regexes, you won't get BM either - it's explicitly only for case sensitive searching. I guess *that* could be patched up - but that's a lot of work and effort for very little payoff.

There are very few binary formats it's not possible to match right now (other than ones based in containers, which have their own signatures in any case). However, there are a lot of signatures which amount to horrible hacks, to work around the limitations of earlier DROIDs. There are quite a lot of signatures which do more work than they need to, and are much less clear than they could be, as the DROID syntax can't currently handle quite standard features of normal regular expressions (for example, the absence of optionality). Bringing them closer to standard Java regular expressions would be one way to do this (if they are not replaced by them!).

You are right to say that the big gaps in DROID identification are indeed around text formats, not binary signatures. There's a whole proposal (including some detailed text heuristics I developed) on the DROID 7 wiki about this. In my (not so!) humble opinion, reg ex is simply not the way to go for text format identification - but that's a whole different discussion I'd be happy to have elsewhere, as I think we've drifted significantly from the point of this thread!

andy jackson's picture

Sorry, I didn't mean to imply that RegEx would be suitable for attacking the text identification problem. It is possible to make some headway like that, but regular languages cannot be used to parse HTML and other higher-order formal languages properly. I've noted some ideas on this in the DROID7 wiki.

I did a lot of signature analysis and profiling during DROID 5 and 6 development. Any DROID signature with more than one sub-sequence involves the equivalent of a .* expression for those sub-sequences. This is a very high proportion of them. Now, some of them find the next subsequence fairly quickly - but many of them actually do a lot of scanning.

Put it this way, when profiling DROID, the majority of its identification time is spent in the Boyer Moore Horspool algorithm scanning along byte streams. So for me, the performance issue is already a known factor - there is a lot of scanning to find the next matching subsequence of a signature.

I'm looking forward to any other experimental data that might appear.

andy jackson's picture

Okay, fair enough. I don't quite understand how B-M-H is helping here, given that you are having to scan the bytes anyway, which suggests I/O is the limiting factor, but I'm happy to be wrong about that.

I didn't really mean to get so distracted by the performance issues. My only point is really that I'm willing to accept a reasonably significant speed and even expressivity loss if we can permit the use of a more widely adopted signature language. Of course, you and the TNA are free to disagree with me!

Well, BMH helps precisely by *not* having to examine all the bytes - it's a sub-linear search algorithm which skips over bytes that can't match, without examining them at all. Of course, they still have to be read into a byte buffer, but you don't have to actually process all of them any further!
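To show the skipping in action, here is a small textbook Horspool search in Python that also counts how many bytes it actually compares. It is a plain illustration of the algorithm, not the byteseek or DROID implementation, and it has none of their extensions (no byte classes, no case insensitivity).

```python
def horspool_search(haystack: bytes, needle: bytes):
    """Return (offset, bytes_examined) for the first match of needle,
    or (-1, bytes_examined) if there is no match."""
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0, 0
    # Shift for each byte value: distance from its last occurrence in
    # needle[:-1] to the end of the needle. Bytes not in the table shift by m.
    shift = {b: m - 1 - i for i, b in enumerate(needle[:-1])}
    examined = 0
    pos = 0
    while pos + m <= n:
        j = m - 1
        while j >= 0:  # compare the window right to left
            examined += 1
            if haystack[pos + j] != needle[j]:
                break
            j -= 1
        if j < 0:
            return pos, examined
        # Skip ahead based on the byte aligned with the needle's last position.
        pos += shift.get(haystack[pos + m - 1], m)
    return -1, examined
```

When searching for an 8-byte pattern in data that rarely contains its bytes, most windows are rejected after a single comparison and the search jumps 8 bytes at a time, so only a fraction of the buffer is ever looked at.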

But I'm largely in agreement with you - I would also sacrifice performance if it meant making signatures easier to work with and more sustainable. I don't know what TNA thinks, as I don't work there anymore!

andy jackson's picture

Sorry about that - I didn't mean to imply that you work at TNA. Should have stuck to a more general 'YMMV'.