When (not) to migrate a PDF to PDF/A


It is well known that PDF documents can contain features that are preservation risks (e.g. see here and here). Migration of existing PDFs to PDF/A is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on it.

PDF/A is a profile

First, it's important to stress that each of the PDF/A standards (A-1, A-2 and A-3) is really just a profile within the PDF format. More specifically, PDF/A-1 is a subset of PDF 1.4, whereas PDF/A-2 and PDF/A-3 are based on the ISO 32000 version of PDF 1.7. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in 'regular' PDF. They also narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. This can be illustrated with the simple Venn diagram below, which shows the feature sets of the aforementioned PDF flavours:

PDF Venn diagram

Here we see how PDF/A-1 is a subset of PDF 1.4, which in turn is a subset of PDF 1.7. PDF/A-2 and PDF/A-3 (aggregated here as one entity for the sake of readability) are subsets of PDF 1.7, and include all the features of PDF/A-1.

Keeping this in mind, it's easy to see that migrating an arbitrary PDF to PDF/A can result in problems.
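
To make the subset argument concrete, here is a toy sketch in Python that models the flavours as sets of features. The feature names are illustrative labels taken from the examples in this post, not terms from the ISO standards, and the real conformance rules are far more detailed. The helper features_at_risk is hypothetical.

```python
# Illustrative feature labels only; real PDF/A conformance rules are far more detailed.
PDF_1_7 = frozenset({
    "text", "images", "embedded_fonts", "non_embedded_fonts",
    "multimedia", "encryption", "interactive_content",
})

# Features that the PDF/A profiles prohibit, as named in the text above.
PROHIBITED_IN_PDF_A = frozenset({
    "non_embedded_fonts", "multimedia", "encryption", "interactive_content",
})

# PDF/A is what remains of 'regular' PDF once the prohibited features are removed.
PDF_A = PDF_1_7 - PROHIBITED_IN_PDF_A


def features_at_risk(source_features):
    """Return the features of a source PDF that have no place in PDF/A."""
    return set(source_features) - PDF_A


if __name__ == "__main__":
    scanned_book = {"images", "text"}
    annual_report = {"text", "images", "non_embedded_fonts", "multimedia"}
    print(sorted(features_at_risk(scanned_book)))
    print(sorted(features_at_risk(annual_report)))
```

Anything returned by features_at_risk is something a migration tool has to drop, convert or substitute, which is where the problems start.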

Loss, alteration during migration

Suppose, as an example, that we have a PDF that contains a movie. This is prohibited in PDF/A, so migrating to PDF/A will simply result in the loss of the multimedia content. Another example is fonts: all fonts in a PDF/A document must be embedded. But what happens if the source PDF uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently use some alternative, perhaps similar font? And how do you check for this?
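
One partial answer to the font question is to list the non-embedded fonts in the source file before migrating. The Python sketch below wraps the pdffonts command-line utility from Poppler, which is assumed to be installed. It also assumes that the 'emb' column is the fifth whitespace-separated token from the right on each font line, as in the sample output in the code, so check this against the output of your own pdffonts version. The function names and the sample font listing are made up for illustration.

```python
import subprocess


def parse_pdffonts(output):
    """Return (font_name, is_embedded) pairs from pdffonts text output."""
    fonts = []
    for line in output.splitlines():
        tokens = line.split()
        # A font line has at least: name, type, encoding, emb, sub, uni, object, generation.
        if len(tokens) < 7:
            continue
        # Counting from the right: emb, sub, uni, object number, generation number.
        emb = tokens[-5]
        if emb not in ("yes", "no"):
            continue  # skips the header line and the dashed separator
        fonts.append((tokens[0], emb == "yes"))
    return fonts


def non_embedded_fonts(pdf_path):
    """Run pdffonts on a file and return the names of fonts that are not embedded."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [name for name, embedded in parse_pdffonts(result.stdout) if not embedded]


# Made-up sample in the assumed pdffonts layout, for illustration only.
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEF+Helvetica                     Type 1C           WinAnsi          yes yes no       8  0
Times-Roman                          Type 1            Standard         no  no  no      10  0
"""

if __name__ == "__main__":
    for name, embedded in parse_pdffonts(SAMPLE):
        print(name, "embedded" if embedded else "NOT embedded")
```

Note that this only flags the source files that need attention. It does not tell you which substitute font a particular migration tool will pick, so the migrated result would still need to be checked.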

Complexity and effect of errors

Also, migrations like these typically involve a complete re-processing of the PDF's internal structure. The format's complexity implies that there's a lot of potential for things to go wrong in this process. This is particularly true if the source PDF contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don't really have any tools for detecting such errors (i.e. a sufficiently reliable PDF validator), these cases can be difficult to deal with. Some further considerations can be found here (the context there is slightly different, but the risks are similar).

Digitised vs born-digital

The origin of the source PDFs may be another thing to take into account. If PDFs were originally created as part of a digitisation project (e.g. scanned books), the PDF is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such PDFs to PDF/A is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in PDF/A. At the same time, this also means that the benefits of migrating such files to PDF/A are pretty limited, since the source PDFs weren't problematic to begin with!

The potential benefits of PDF/A may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous sections, the migration is more complex, and there's just a lot more that can go wrong (see also here for some additional considerations).

Conclusions

Although migrating PDF documents to PDF/A may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to be successful for source PDFs that weren't problematic to begin with, which defeats the very purpose of migrating to PDF/A. For specific cases, migration to PDF/A may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of PDFs (both source and destination!), it would also seem prudent to always keep the originals.


9 Comments

  1. johan
    September 1, 2014 @ 2:22 pm CEST

    Hi Ross,

    As already pointed out by others, many of the issues I mention in my blog (e.g. multimedia, fonts) just follow directly from the ISO standards, so I don't see why this should be backed up by experimental evidence.

    You're right that the stuff under "complexity and effect of errors" could be backed up by evidence. But based on the simple "garbage in, garbage out" principle, I think the outcome I've sketched here is not unreasonable. If anyone's willing to do some actual testing, that would be great of course. I expect that the results of such an exercise would be highly dependent on the specific migration tool used, and the specific characteristics of the source PDFs.

    Personally I would approach this the other way around, i.e. before starting a migration make an inventory of the expected risks (like the ones mentioned in the blog), and then provide evidence that the migration will not result in unintentional data loss (e.g. by using a QA workflow that detects errors before it's too late). I think this should really be common practice for any format migration; it was also pretty much the idea behind the KB's Metamorfoze TIFF to JP2 migration.

    Interestingly, you state that:

    The way the blog is positioned speaks to fears, that without proof, and experimental evidence, remain the equivalent of tales around a camp fire about a bogey man that might come out at night to devour those working in digital preservation

    But doesn't this exactly describe the situation where institutions start migrating their PDFs to PDF/A, based on some vague and unfounded fears that this is the only way to keep them accessible over time?

    In a Twitter discussion about this subject, you mentioned that the "danger of migration is the users controlling the migration". I think you're hitting the nail right on the head there, as, in my experience, many users are simply unaware of the things that can go wrong. And this is exactly why I wrote that blog in the first place: to counter the myth of PDF/A being some miracle cure that magically solves all preservation-related problems. I see this myth being peddled by some software vendors that sell PDF/A-related products (can't blame them for trying!), but, worryingly, also by some factions of the archival community.

    To give you just one example, have a look at this article. It presents PDF/A as "an archiving solution to the preservation of the (sic) PDF". So, the authors seem to view PDF/A as a solution for preserving legacy PDFs (although to be honest they're quite vague about it, as with a lot of other things in this paper). They go to extreme lengths in describing the pros and cons of PDF/A for preservation (coming up with quite a few bizarre and far-fetched arguments in the process), but they don't even mention basic stuff such as the fundamental impossibility (as per the standard!) of using PDF/A for features such as movies. There's also no mention whatsoever of what happens in case of missing fonts in source files 1.

    Now this article was written by professional archivists, in a publication by an established standards organisation. If the degree of knowledge that is displayed there is even roughly representative of the larger archiving community, sparking a bit of fear isn't such a bad idea, I think!


    1. BTW there's a lot of other stuff wrong with this paper, but I won't get into that now!  

  2. FBIt
    September 12, 2014 @ 11:52 am CEST

    Hi,

    I'm not stating that there's no basis for a migration to PDF/A at all. Imho, PDF/A is useful in digitization processes: PDF/A documents can replace the paper 'originals'. OK, a property like smell gets lost, but the cases in which this is a significant property will be rather scarce.

    I'm more concerned when it comes to born-digital records. From our practical experience at the City Archives of Antwerp, we know that born-digital records are transferred in lots of different formats (not only PDF). We have preferred and accepted formats and we migrate documents ourselves to preservation formats. Where possible, we opt for XML-based file formats as preservation format (ODF, XML, SVG, etc.). We also learned that migrating existing PDF documents to PDF/A is often not successful (protected PDFs, proprietary or missing fonts, disallowed content in PDF documents, loss of quality, digitally signed documents, etc.). So, my point is mainly: the archival community needs solutions for the authentic rendering of formats other than PDF, as well as for PDF documents that do not comply with PDF/A. In other words, the added value of migrating existing PDF documents to PDF/A is minimal and it remains a 'moment of risk' as Johan stated.

    Regards,

    Filip

     

  3. thorsted
    September 5, 2014 @ 10:45 pm CEST

    As a digital archivist who has been tasked with preserving PDFs for long-term access, what is the solution other than PDF/A?

    Our standard currently is to migrate PDF to PDF/A with the hope that consistency will make any future migrations easier. In our institution, the content is more important than the provenance of the PDF. Meaning we care most about the information stored in the PDF, then metadata, etc. The majority of our PDFs come from outside sources where we cannot control if fonts are embedded or if they include movies or attachments in the file.

    I migrate hundreds of PDF files each week and have to take special care of a few of them to bring them up to standard. I can see the risk of migration and have seen first hand what can happen to a PDF if you do not pay attention while embedding fonts, flattening annotations, etc.

    Is the risk so great as to be worse than if we leave these PDFs alone and not attempt some form of normalization? Is there no middle ground to ensure we have done our best to make the content available into the future?

    Love the discussion!

    -Thorsted

  4. FBIt
    September 2, 2014 @ 8:38 am CEST

    Hi Johan,

    If it can be of any assurance, the view described in the NISO article isn't representative of the way the archival community and archivists look at PDF/A as a preservation format. Archivists are (or should be) well aware of the risks and issues involved with a migration to PDF/A. In summary, a migration of born-digital records to PDF/A may result in loss of integrity in essential respects. In particular, contents, functionality and metadata may change or get lost. This should not surprise us, as the basic unit of a PDF document is a sheet of paper. That's why I prefer to describe PDF as a Paper Document Format. As a consequence, we use PDF/A mainly for digitised documents and try to avoid it for born-digital records at the City Archives of Antwerp. Our view on PDF and PDF/A has been elaborated in our article on technical standards in 'Archiefbeheer in de praktijk' (Dutch only).

    The bottom line of this discussion is the question: what is a good preservation format for PDF documents? As the answer is not always PDF/A, another approach is necessary. Once this alternative is in place, one might wonder what added value PDF/A retains…

    Regards,

    Filip
