Identification of PDF preservation risks: the sequel

Last winter I made a first attempt at identifying preservation risks in PDF files using the Apache Preflight PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see this blog post by Peter Cliff) and London (described here). Much of this later work tacitly assumes that Apache Preflight is able to successfully identify features in PDF that are a potential risk for long-term access. This Wiki page on uses and abuses of Preflight (created as part of the final SPRUCE hackathon) even goes as far as stating that “Preflight is thorough and unforgiving (as it should be)”. But what evidence do we have to support such claims? The only evidence that I’m aware of consists of the results obtained from a small test corpus of custom-created PDFs. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. For these reasons, it is essential to obtain additional evidence of Preflight’s ability to detect ‘risky’ features before relying on this tool in any operational setting.

Adobe Acrobat Engineering test files

Shortly after I completed my initial tests, Adobe released the Acrobat Engineering website, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as Multimedia & 3D Tests and Font tests. This makes these files particularly useful for additional tests on Preflight.

Methodology

The general methodology I used to analyse these files is identical to what I did in my 2012 report: first, each PDF was validated using Apache Preflight. As a control I also validated the PDFs with the Preflight component of Adobe Acrobat, using the PDF/A-1b profile. The table below lists the software versions used:

Software	Version
Apache Preflight	2.0.0
Adobe Acrobat	10.14
Acrobat Preflight	10.1.3 (090)

Re-analysis of PDF Cabinet of Horrors corpus

Because the current analysis is based on a more recent version of Apache Preflight than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the PDF Cabinet of Horrors corpus. The main results are reproduced here. The main differences with respect to that earlier version are:

Apache Preflight now has an option to produce output in XML format (as suggested by William Palmer following the Leeds SPRUCE hackathon)
Better reporting of non-embedded fonts (see also this issue)
Unlike the earlier version, Preflight 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report here.

Analysis of Acrobat Engineering PDFs

Since the Acrobat Engineering site hosts a lot of PDFs, I only focused on a limited subset for the current analysis:

all files in the General section of the Font Testing category;
all files in the Classic Multimedia section of the Multimedia & 3D Tests category.

The results are summarized in two tables (see next sections). For each analysed PDF, the table lists:

the error(s) reported by Adobe Acrobat Preflight;
the error code(s) reported by Apache Preflight (see Preflight’s source code for a listing of all possible error codes);
the error description(s) reported by Apache Preflight in the details output element.

For the sake of readability, the tables only list those error messages/codes that are directly related to font problems, multimedia, encryption and JavaScript. The full output for all tested files can be found here.

Fonts

The table below summarizes the results of the PDFs in the Font Testing category:

Test file	Acrobat Preflight error(s)	Apache Preflight Error Code(s)	Apache Preflight Details
EmbeddedCmap.pdf	Font not embedded (and text rendering mode not 3) ; Glyphs missing in embedded font	3.1.3	Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5
TEXT.pdf	Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font ; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font	3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4	Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic (repeated for other fonts); Font damaged, The CharProcs references an element which can’t be read
Type3_WWW-HTML.PDF	–	3.1.6	Invalid Font definition, The character with CID"58" should have a width equals to 15.56599 (repeated for other fonts)
embedded_fonts.pdf	Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing	3.1.9; 3.1.11	Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset
embedded_pm65.pdf	–	3.1.6	Invalid Font definition, Width of the character “110” in the font program “HKPLIB+AdobeCorpID-MyriadRg”is inconsistent with the width in the PDF dictionary (repeated for other font)
notembedded_pm65.pdf	Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font	3.1.3	Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman (repeated for other fonts)
printtestfont_nonopt.pdf*	ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type	–	Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’
printtestfont_opt.pdf*	ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type	–	Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’
substitution_fonts.pdf	Font not embedded (and text rendering mode not 3)	3.1.1; 3.1.2; 3.1.3	Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light (repeated for other fonts)
text_images_pdf1.2.pdf	Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent	3.1.1; 3.1.2	Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor

* As these documents don’t appear to have any font-related issues, it’s unclear why they are in the Font Testing category. The errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.

General observations

A comparison of the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight’s output may vary for non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the PDF Cabinet of Horrors dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 “character width” error is something that was also encountered during the London SPRUCE Hackathon, and according to the information here this is most likely the result of the PDF/A specification not being particularly clear. So, this looks like a non-serious error that can be safely ignored in most cases.
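One way to apply these observations when screening Apache Preflight output for font problems is sketched below. It is only an illustration: the function names are hypothetical, and treating every code that starts with "3." as font-related is an assumption based on the codes that appear in the table above (3.1.x "Invalid Font definition", 3.2.4 "Font damaged"), not on a complete listing of Preflight's error codes.

```python
# Assumption: codes starting with "3." are font-related, judging from
# the table above (3.1.x, 3.2.4). Not a complete listing of Preflight codes.
FONT_PREFIX = "3."

# The "character width" error, which looks safe to ignore in most cases.
WIDTH_ERROR = "3.1.6"


def font_risk_codes(codes):
    """Return the font-related error codes that should be treated as a risk."""
    return [c for c in codes if c.startswith(FONT_PREFIX) and c != WIDTH_ERROR]


def has_font_risk(codes):
    """True if at least one font-related code other than 3.1.6 was reported."""
    return bool(font_risk_codes(codes))


# Codes reported for TEXT.pdf and embedded_pm65.pdf in the table above.
print(has_font_risk(["3.1.5", "3.1.1", "3.1.2", "3.1.3", "3.2.4"]))
print(has_font_risk(["3.1.6"]))
```

Keeping the width error as a separate constant makes it easy to switch that exception off again if it turns out to matter for a particular collection.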

Multimedia

The next table shows the results for the Multimedia & 3D Tests category:

Test file	Acrobat Preflight error(s)	Apache Preflight Error Code(s)	Apache Preflight Details
20020402_CALOS.pdf	–	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Disney-Flash.pdf	Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field does not have appearance dict; Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia-related errors; Preflight did report syntax and body syntax error
Jpeg_linked.pdf	Document is encrypted; Encrypt key present in file trailer; Named action with a value other than standard page navigation used; Incorrect annotation type used (not allowed in PDF/A); Font not embedded (and text rendering mode not 3)	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MultiMedia_Acro6.pdf	Document is encrypted; EmbeddedFiles entry in Names dictionary; Encrypt key present in file trailer; PDF contains EF (embedded file) entry; Incorrect annotation type used (not allowed in PDF/A)	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
MusicalScore.pdf	CIDset in subset font is incomplete; CIDset in subset font missing; Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry; Type 2 CID font: CIDToGIDMap invalid or missing	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
SVG-AnnotAnim.pdf	Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry	5.2.1; 1.2.9	Forbidden field in an annotation definition, The subtype isn’t authorized : SVG; Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary
SVG.pdf	Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
ScriptEvents.pdf	Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Service Form_media.pdf	Contains action of type JavaScript; Contains action of type ResetForm; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Incorrect annotation type used (not allowed in PDF/A); Named action with a value other than standard page navigation used; PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
Trophy.pdf	Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
VolvoS40V50-Full.pdf	Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file”	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
gXsummer2004-stream.pdf	File cannot be loaded in Acrobat (damaged file)	1.0; 1.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
phlmapbeta7.pdf	Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
us_population.pdf	Preflight exits with: “An error occurred while parsing a contents stream. Unable to analyze the PDF file”	1.0; 1.2.1	No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error
movie.pdf	Incorrect annotation type used (not allowed in PDF/A)	5.2.1	Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
movie_down1.pdf	Incorrect annotation type used (not allowed in PDF/A)	5.2.1	Forbidden field in an annotation definition, The subtype isn’t authorized : Movie
remotemovieurl.pdf	Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A)	5.2.1; 3.1.1; 3.1.2; 3.1.3	Forbidden field in an annotation definition, The subtype isn’t authorized : Movie; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial

General observations

The results from the Multimedia PDFs are interesting for several reasons. First of all, these files include a wide variety of ‘risky’ features, such as multimedia content, embedded files, JavaScript, non-embedded fonts and encryption. These were successfully identified by Acrobat Preflight in most cases. Apache Preflight, on the other hand, only reported non-specific and fairly uninformative errors (1.0 + 1.2.1) for 12 out of 17 files. Even though Preflight was correct in establishing that these files were not valid PDF/A-1b, it wasn’t able to drill down to the level of specific features for the majority of these files.
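The distinction between specific and non-specific results can be made explicit with a small check. The sketch below uses the Apache Preflight error codes from the table above; the function name and the idea of labelling such results "inconclusive" are my own illustration, not something Preflight provides.

```python
# Generic codes that Preflight reported when it could not drill down further.
GENERIC_CODES = {"1.0", "1.2.1"}


def is_inconclusive(codes):
    """True if Preflight reported nothing but the generic syntax errors."""
    reported = set(codes)
    return bool(reported) and reported <= GENERIC_CODES


generic = ["1.0", "1.2.1"]

# Apache Preflight error codes per file, copied from the table above.
results = {
    "20020402_CALOS.pdf": generic,
    "Disney-Flash.pdf": generic,
    "Jpeg_linked.pdf": generic,
    "MultiMedia_Acro6.pdf": generic,
    "MusicalScore.pdf": generic,
    "SVG-AnnotAnim.pdf": ["5.2.1", "1.2.9"],
    "SVG.pdf": generic,
    "ScriptEvents.pdf": generic,
    "Service Form_media.pdf": generic,
    "Trophy.pdf": generic,
    "VolvoS40V50-Full.pdf": generic,
    "gXsummer2004-stream.pdf": ["1.0", "1.1"],
    "phlmapbeta7.pdf": generic,
    "us_population.pdf": generic,
    "movie.pdf": ["5.2.1"],
    "movie_down1.pdf": ["5.2.1"],
    "remotemovieurl.pdf": ["5.2.1", "3.1.1", "3.1.2", "3.1.3"],
}

inconclusive = [name for name, codes in results.items() if is_inconclusive(codes)]
print(f"{len(inconclusive)} out of {len(results)} files")  # 12 out of 17 files
```

In a real workflow, files flagged this way would be candidates for a second look with another tool, since the absence of specific errors says nothing about the absence of risky features.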

Looking in more detail at those 1.0 and 1.2.1 errors, the detailed description of most of them is:

Syntax error, Expected pattern 'obj but missed at character 'o'

To me it looks like Preflight doesn’t correctly parse the binary structure of the PDF. Opening a few of the problematic PDFs revealed that the object identifiers in these files were followed immediately by the object contents, e.g.:

32 0 obj<</Kids[33 0 R]>>
endobj

whereas more commonly they are separated by a line terminator, like this:

32 0 obj
<</Kids[33 0 R]>>
endobj

As far as I’m aware neither the PDF specification nor PDF/A have anything to say about line endings in this case, so my best guess is that this is simply a bug that results in the file not being fully parsed. I submitted a bug report for this issue here.
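To check whether a given file uses this compact layout, a rough scan of the raw bytes can help. The sketch below is not part of Preflight and the function name is hypothetical. It is also only a heuristic: because it looks at raw bytes, it may produce false matches inside binary stream data.

```python
import re

# "N G obj" followed directly by a byte that is not PDF whitespace
# (NUL, tab, line feed, form feed, carriage return, space).
COMPACT_OBJ = re.compile(rb"\b\d+\s+\d+\s+obj(?=[^\x00\t\n\x0c\r ])")


def count_compact_headers(data):
    """Count object headers that run straight into the object contents."""
    return len(COMPACT_OBJ.findall(data))


# The two layouts shown above.
compact = b"32 0 obj<</Kids[33 0 R]>>\nendobj\n"
separated = b"32 0 obj\n<</Kids[33 0 R]>>\nendobj\n"

print(count_compact_headers(compact))    # 1
print(count_compact_headers(separated))  # 0
```

For a real file, the same function can be applied to the result of open(path, "rb").read(); a non-zero count suggests the file may trigger the parsing problem described above.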

Summary and conclusions

The re-analysis of the PDF Cabinet of Horrors corpus and the subsequent analysis of a sub-set of the Adobe Acrobat Engineering PDFs show a number of things. First, Apache Preflight 2.0.0 does not properly identify encryption and password-protection. This looks like a bug that is probably easily fixed. Second, the analysis of the Font Testing PDFs from the Acrobat Engineering site revealed that non-embedded fonts may result in a variety of error codes in Apache Preflight (assuming here that the Acrobat Preflight results are accurate). So, when using Apache Preflight to check font embedding, it’s probably a good idea to treat all font-related errors (perhaps with the exception of character width errors) as a potential risk. Third, the more complex PDFs in the Multimedia category proved to be quite challenging to Apache Preflight: for most files here, it was not able to identify specific features such as multimedia content, embedded files, JavaScript and non-embedded fonts. A cursory analysis of some of the failed files suggests that this is probably a bug that results in Apache Preflight not being able to parse the file structure correctly. Keeping in mind that the specificity of Preflight’s validation output has already improved considerably since version 1.8.0, a fix of both this issue and the encryption problem would probably result in another significant improvement. In the meantime, it’s important to keep the expectations about the tool’s capabilities realistic, in order to avoid some potential unintended misuses.

Submitted by Johan van der Knijff on 25 July 2013 – 12:57pm

Comments

The Isartor PDF/A test suite

The Isartor PDF/A test suite may also be of interest. Each file violates one specific point in the PDF/A spec.

http://www.pdfa.org/2011/08/download-isartor-test-suite/

Submitted by Andy Jackson on 25 July 2013 – 1:37pm

Isartor

Yes, actually I ran the whole of Isartor through Preflight ages ago, but never published the results. As far as I remember those results weren’t that interesting (not surprising, as it’s probably the very first thing any developer of a PDF/A validator will use as a test data set).

With the current analysis I mainly wanted to work with some more realistic examples.

Submitted by Johan van der Knijff on 26 July 2013 – 4:13pm

Yep – we never tested it this good! 🙂

Hi Johan,

Nice work!

You’re right – “Preflight is thorough and unforgiving” was a poor choice of words – I certainly never meant to imply “we tested it and can be sure it works”. Were I saying that sentence out loud I think you’d probably detect a hint of sarcasm and what I wanted to suggest was that Preflight generated a number of errors that may or may not have been preservation risks.

What became really apparent during those two SPRUCE events was that the PDF/A specification isn’t that useful. We were working towards a tool that would allow users to ignore certain errors, suggesting that in some cases PDF/A does not meet the use case of the content holder. Many of the validation errors thrown by Preflight are unhelpful or meaningless to the uninitiated; even more bewildering when, in spite of the errors, the PDF renders correctly in all viewers (and in the case of the SPRUCE PDFs were validating as PDF/A using PDFTron).

I got the impression that the DP community needs to take a good look at PDF/A and decide if it meets our requirements. Someone said it at the SPRUCE event – if it renders (and we continue to make sure it renders as time passes), does it matter?

Pete

Submitted by Peter Cliff on 26 July 2013 – 12:49pm

Some thoughts on this

Thanks for getting back to this. First of all a violation of the PDF/A profile doesn’t necessarily (and rarely will) mean a file won’t render. Even a file with non-embedded fonts usually renders (although not necessarily showing the original fonts). Also in my experience in most library/archive collections PDFs with serious preservation risks (e.g. open passwords, embedded multimedia content based on arcane video codecs) are pretty rare, but even if the numbers are small you’d still want to identify them. And I still think the PDF/A profile is a pretty good starting point for that. But yes, I agree that some of Preflight’s validation error messages aren’t very clear and this can be a source of confusion.

Also, have a look at the remarks I added earlier this afternoon about Preflight’s odd behavior with the Multimedia files. If I’m right, this may explain a lot about Preflight throwing obscure validation errors for PDFs that are perfectly fine: it looks like something goes wrong with the way Preflight parses the PDF structure in some cases. It would also mean that Preflight’s behavior could improve quite a lot once this gets fixed.

Submitted by Johan van der Knijff on 26 July 2013 – 4:37pm

Validating the render…

…if it renders (and we continue to make sure it renders as time passes), does it matter?

I had a look at some of those PDFs (from here, right?), and I wasn’t at all convinced that they were rendering correctly. When opened in Apple Preview there were various bits of text that appeared to be in the wrong place, and suspiciously blank parts that appeared to contain hidden detail that could be selected but not seen.

So, the question is, how can we tell it’s rendering correctly, now or in the future? These validation errors might be relevant, but unfortunately PDF viewers generally appear to swallow errors during processing (unlike HTML browsers which have ‘quirks mode’ and console errors to investigate). If the document rendering is precisely the same across many distinct PDF implementations, that would instil some confidence. For valuable documents, it might be worth generating some kind of thumbnail contact sheet to help detect changes in rendering behaviour.

But of course none of these approaches will help if the errors occurred when the PDF was initially created and the result wasn’t checked, unless the original source document is also available.

Submitted by Andy Jackson on 26 July 2013 – 9:11pm

Why/when do the "errors" in objects matter?

Your last paragraph reminds me of an important point to be made about/raised by the issue of rendering and “errors” in files: the “errors” discussed above are probably only (problematic/real) errors if they were introduced after the object was created and changed the way the object is rendered. It seems reasonable to assume that anything else is just part of the object.

Knowing that “errors” exist in files is not very useful if you don’t know whether they were part of the original object or not (as you don’t know whether they need to be preserved or should be “fixed”). And if they were part of the original object, then it is also not that useful to know about them, as in most situations the “errors” will likely need to be preserved as part of the object and the only option that seems practical for preserving such objects, emulation, is, for the most part, agnostic to those “errors”, i.e. the errors aren’t something that would cause an unusual problem for an emulation-based approach, so it’s not really important to even be aware of them.

The reason I suggest emulation is the only option that seems practical in such cases is that attempting to preserve the objects by migrating content from the file(s) to (a) new file(s) starts to seem awfully complicated or impossible when you consider the need to replicate “errors” in the new (presumably) industry standard files that (presumably) don’t support those errors. In other words, to preserve the “errors” using a migration/normalisation based approach you would need to have a format to move the content to that supported maintaining those “errors”, and that seems (almost by definition) rather unlikely.

So it seems (to me) that the only times the “errors” matter is if:

1. they were not part of the original object, have been introduced after it was “finalised” (for want of a better term), and need to be “fixed”
2. they were part of the original object and you are going to attempt to migrate content from the files in order to preserve the objects

(1.) is one of the reasons why this work is important but (2.) seems to be a use-case/scenario that will turn out to be very rare because of the difficulty and expense involved in attempting to take this approach.

Submitted by Euan Cochrane on 29 July 2013 – 2:18am

A third scenario

Hi Euan,

Actually apart from scenarios 1 and 2 there’s a third one: consider an “error” that was part of the original object due to some mistake in a publisher’s production workflow. This is something I’ve seen a couple of times myself. Publishers are often able to provide corrected versions of those files when asked. They do this by simply re-generating them from the original source documents, and there’s no need for any complex fixes or migrations whatsoever. Needless to say this option is only available while the original publisher is still around, and I imagine it may be more problematic for older publications. Which makes it all the more important to try detecting such issues in a timely fashion.

Cheers,

Johan

Submitted by Johan van der Knijff on 29 July 2013 – 10:24am

Get the originals then?

Seems to me this scenario (3) implies that the PDFs are access copies.

Deposit libraries then should really be angling for both the “original source” and a generated copy? However I am not convinced that’ll happen and I suspect there are many flavours of (standards-compliant or otherwise) source documents…

Maybe we can just get everyone to use EPUB3?

Submitted by Peter Cliff on 29 July 2013 – 12:06pm

Well …

Most of the PDFs in our own collection are scientific papers, and the “source” of each PDF is very much dependent on the specific production workflow of a publisher. So this wouldn’t be very useful unless libraries would get themselves a working mirror of each publisher’s entire production workflow, which is something you REALLY don’t want to get into. Also I wouldn’t be surprised if some of those source documents weren’t standards-compliant either, so there really wouldn’t be an end to this!

Submitted by Johan van der Knijff on 29 July 2013 – 12:53pm

Rendered page images should work…

Andy, if you will use niche products like Apple Preview what do you expect? 🙂

Yes, that set are a sub-set of those we worked on in the last SPRUCE event. All validate (using one validator) as PDF/A-1b and generate errors using Apache Preflight.

Knowing if something renders correctly or not using thumbnails sounds like a good approach and is something being worked on, albeit in a different context. Also, tools that detect common error artifacts (blank bits of the document, lots of small character size boxes where there should be characters, etc.) in PDFs shouldn’t be beyond the realms of possibility. One curious error I came across working at the Bodleian (with the Planets documentation no less!) was that the documents rendered fine on screen, but didn’t print correctly, so we might need to ask ourselves how far we go with this…

The question of errors in the original and what to do with them is an interesting one and I think the answers have to come from the communities in question. Ask the curators – do you want this error fixed? Only they know the answers because only they know why the collection is important.

If a set of PDFs are examples of just how easy it is to break the PDF spec then the errors probably matter. However, if the PDF format and its foibles are entirely irrelevant to the collection and it could just as easily be Word or EPUB, then validation errors in rendering pages are not important but the order of the words and the whitespace is…

I’ll stop now before I say ‘designated community defined significant properties’. Oh. Damn. 🙂

Submitted by Peter Cliff on 29 July 2013 – 12:22pm

publishers’ errors

Hi Johan,

Good point. The tool would definitely be useful for archives/libraries to use if they set standards for donors/transferring agencies to use when creating content to go into the archive, and publishers are a great example of where this might happen. Also, the tool would be good for publishers to use themselves to ensure they are making compliant files, if they want to do that.

Cheers,

Euan

Submitted by Euan Cochrane on 29 July 2013 – 10:32am