The PDF format contains various features that may make it difficult to access content that is stored in this format in the long term. Examples include (but are not limited to):
A more exhaustive overview is given here:
http://www.openplanetsfoundation.org/system/files/PDFInventoryPreservationRisks_0_2_0.pdf
and also here:
http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format
When creating a PDF, it is possible to minimise these risks by using one of the PDF/A standards, which define a number of PDF feature profiles that are unlikely to result in any long-term accessibility problems. However, most PDFs that exist today are not PDF/A.
For assessing risks in existing collections, it would be helpful to be able to screen or profile PDFs for specific ‘risky’ features, such as encryption or font embedding. Since PDF/A was specifically designed to eliminate these ‘risky’ features, one would expect that PDF/A validators (i.e. software tools that check the conformance of a PDF file against the PDF/A specification) would be able to provide some useful information on this.
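As a rough illustration of what such screening could look like, here is a minimal Python sketch that scans the raw bytes of a PDF for a few name tokens associated with encryption, JavaScript, embedded files and embedded font programs. This is not how Preflight works, and it is not the method used in the report. The token list, the feature names and the function `screen_pdf_bytes` are my own assumptions for the example. The approach is a crude heuristic: it will miss anything stored inside compressed object streams or written with escaped names, and it can give false positives when a token happens to occur inside ordinary content.

```python
import re
import sys
from pathlib import Path

# Name tokens that hint at a feature when they occur in the raw file bytes.
# This mapping is an illustrative assumption, not a complete risk inventory.
FEATURE_TOKENS = {
    "encryption": (b"/Encrypt",),
    "javascript": (b"/JavaScript", b"/JS"),
    "embedded_files": (b"/EmbeddedFile", b"/EmbeddedFiles"),
    # Presence of a font program; its absence may point to non-embedded fonts.
    "font_programs": (b"/FontFile", b"/FontFile2", b"/FontFile3"),
}


def _has_token(data: bytes, token: bytes) -> bool:
    # Require a non-alphanumeric character (or end of data) after the name,
    # so that b"/JS" does not match inside a longer name such as b"/JSFoo".
    return re.search(re.escape(token) + rb"(?![A-Za-z0-9])", data) is not None


def screen_pdf_bytes(data: bytes) -> dict:
    """Return {feature: True/False} based on a raw byte scan (heuristic only)."""
    return {
        feature: any(_has_token(data, token) for token in tokens)
        for feature, tokens in FEATURE_TOKENS.items()
    }


if __name__ == "__main__":
    # Usage: python screen_pdf.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        report = screen_pdf_bytes(Path(name).read_bytes())
        found = [feature for feature, present in report.items() if present]
        print(f"{name}: {', '.join(found) or 'none of the screened features'}")
```

A scan like this can only flag candidates for closer inspection. A validator that actually parses the file structure should give far more reliable results, which is why the validator route seemed worth testing.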
In a first attempt to test whether this approach is feasible at all, I did some tests with Apache Preflight, an open-source PDF/A-1 validator that is part of the Apache PDFBox library. The specific objectives of this work were:
The results can be found in the report ‘Identification of preservation risks in PDF with Apache Preflight: a first impression’.
The report’s findings are to a large extent based on a suite of small, simple test files that were created especially for this work. Each file contains one ‘risky’ feature, with a focus on the following feature classes:
The dataset can be found here:
http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/
Since the report was published, a number of improvements have been made to Apache Preflight which should fix some of the reported issues. I haven’t tested the latest version yet, but will try doing this some time soon.