The most important new feature of the recently released PDF/A-3 standard is that, unlike PDF/A-2 and PDF/A-1, it allows you to embed any file you like. Whether this is a good thing or not is the subject of some heated online discussions. But what do we actually mean by embedded files? As it turns out, the answer to this question isn’t as straightforward as you might think. One of the reasons for this is that in colloquial use we often talk about “embedded files” to describe the inclusion of any “non-text” element in a PDF (e.g. an image, a video or a file attachment). On the other hand, the term “embedded files” in the PDF standards (including PDF/A) refers to something much more specific, which is closely tied to PDF’s internal structure.
When the PDF standard mentions “embedded files”, what it really refers to is a specific data structure. PDF has a File Specification Dictionary object, which in its simplest form is a table that contains a reference to some external file. PDF 1.3 extended this, making it possible to embed the contents of referenced files directly within the body of the PDF using Embedded File Streams. They are described in detail in Section 7.11.4 of the PDF Specification (ISO 32000). A File Specification Dictionary that refers to an embedded file can be identified by the presence of an EF entry.
Here’s an example (source: ISO 32000). First, here’s a file specification dictionary:
31 0 obj
<</Type /Filespec /F (mysvg.svg) /EF <</F 32 0 R>> >>
endobj
Note the EF entry, which references another PDF object. This is the actual embedded file stream. Here it is:
32 0 obj
<</Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72>>
stream
…SVG Data…
endstream
endobj
Note that the part between the stream and endstream keywords holds the actual file data, here an SVG image, but this could really be anything!
So, in short, when the PDF standard mentions “embedded files”, this really means Embedded File Streams.
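To make this a bit more tangible, here is a minimal Python sketch that looks for File Specification Dictionaries with an EF entry in raw PDF source. It is only an illustration under some loud assumptions: it uses regular expressions rather than a real PDF parser, so it only sees objects that are stored uncompressed (not inside object streams), and the function name `find_embedded_file_specs` is my own invention.

```python
import re

# Matches "N G obj ... endobj" in uncompressed PDF source (assumption: no object streams).
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# Matches an /EF entry whose /F key points at an indirect object, e.g. "32 0 R".
EF_RE = re.compile(rb"/EF\s*<<.*?/F\s+(\d+)\s+(\d+)\s+R", re.DOTALL)


def find_embedded_file_specs(pdf_bytes):
    """Map each file specification object number to its embedded file stream object number."""
    result = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(3)
        if b"/Filespec" not in body:
            continue  # not a file specification dictionary
        ef = EF_RE.search(body)
        if ef:
            result[int(match.group(1))] = int(ef.group(1))
    return result


# The two example objects from ISO 32000 shown above.
sample = b"""31 0 obj
<</Type /Filespec /F (mysvg.svg) /EF <</F 32 0 R>> >>
endobj
32 0 obj
<</Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72>>
stream
...SVG Data...
endstream
endobj
"""

print(find_embedded_file_specs(sample))  # {31: 32}
```

For real-world files a proper PDF library would be the safer choice, since binary stream data can contain byte sequences that fool a regular expression.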
Here’s the first source of confusion: if a PDF contains images, we often colloquially call these “embedded”. However, internally they are not represented as Embedded File Streams, but as so-called Image XObjects. (In fact the PDF standard also includes yet another structure called inline images, but let’s forget about those just to avoid making things even more complicated.)
Here’s an example of an Image XObject (again taken from ISO 32000):
10 0 obj
<< /Type /XObject /Subtype /Image /Width 100 /Height 200 /ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 /Filter /DCTDecode >>
stream
…Image data…
endstream
endobj
Similar to embedded file streams, the part between the stream and endstream keywords holds the actual image data. The difference is that only a limited set of pre-defined formats is allowed. These are defined by the Filter entry (see Section 7.4 in ISO 32000). In the example above, the value of Filter is DCTDecode, which means we are dealing with JPEG encoded image data.
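As a rough illustration of the difference, the following Python sketch picks out Image XObjects from raw PDF source and reports the image format that the Filter entry implies. The same caveats apply as for any regex-based approach: it assumes uncompressed objects and a single filter name (filter arrays are not handled), and the names `find_image_xobjects` and `IMAGE_FORMATS` are hypothetical.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
FILTER_RE = re.compile(rb"/Filter\s*/(\w+)")

# Filters that correspond to a dedicated image encoding; other filters
# (FlateDecode, LZWDecode, ...) are general-purpose compression.
IMAGE_FORMATS = {
    "DCTDecode": "JPEG",
    "JPXDecode": "JPEG 2000",
    "CCITTFaxDecode": "CCITT fax",
    "JBIG2Decode": "JBIG2",
}


def find_image_xobjects(pdf_bytes):
    """Map object number -> (filter name, implied image format) for Image XObjects."""
    images = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        # Only inspect the stream dictionary, not the binary data after "stream".
        dictionary = match.group(3).split(b"stream", 1)[0]
        if not IMAGE_RE.search(dictionary):
            continue
        found = FILTER_RE.search(dictionary)
        name = found.group(1).decode("ascii") if found else None
        images[int(match.group(1))] = (name, IMAGE_FORMATS.get(name, "generic or no compression"))
    return images


image_sample = b"""10 0 obj
<< /Type /XObject /Subtype /Image /Width 100 /Height 200 /ColorSpace /DeviceGray /BitsPerComponent 8 /Length 2167 /Filter /DCTDecode >>
stream
...Image data...
endstream
endobj
"""

print(find_image_xobjects(image_sample))  # {10: ('DCTDecode', 'JPEG')}
```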
Going back to embedded file streams, you may now start wondering what they are used for. According to Section 7.11.4.1 of ISO 32000, they are primarily intended as a mechanism to ensure that external references in a PDF (i.e. references to other files) remain valid. It also states:
The embedded files are included purely for convenience and need not be directly processed by any conforming reader.
This suggests that the usage of embedded file streams is simply restricted to file attachments (through a File Attachment Annotation or an EmbeddedFiles entry in the document’s name dictionary).
Here’s a sample file (created in Adobe Acrobat 9) that illustrates this:
http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf
Looking at the underlying code we can see the File Specification Dictionary:
37 0 obj
<</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>
endobj
Note the /EF entry, which means the referenced file is embedded (the actual file data are in a separate stream object).
Further digging also reveals an EmbeddedFiles entry:
33 0 obj
<</EmbeddedFiles 34 0 R/JavaScript 35 0 R>>
endobj
However, careful inspection of ISO 32000 reveals that embedded file streams can also be used for multimedia! We’ll have a look at that in the next section…
Section 13.2.1 (Multimedia) of the PDF Specification (ISO 32000) describes how multimedia content is represented in PDF (emphases added by me):
Rendition actions (…) shall be used to begin the playing of multimedia content.
A rendition action associates a screen annotation (…) with a rendition (…)
- Renditions are of two varieties: media renditions (…) that define the characteristics of the media to be played, and selector renditions (…) that enables choosing which of a set of media renditions should be played.
- Media renditions contain entries that specify what should be played (…), how it should be played (…), and where it should be played (…)
The actual data for a media object are defined by Media Clip Objects, and more specifically by the media clip data dictionary. Its description (Section 13.2.4.2) contains a note, saying that this dictionary “may reference a URL to a streaming video presentation or a movie embedded in the PDF file”. The description of the media clip data dictionary (Table 274) also states that the actual media data are “either a full file specification or a form XObject”.
In plain English, this means that multimedia content in PDF (e.g. movies that are meant to be rendered by the viewer) may be represented internally as an embedded file stream.
The following sample file illustrates this:
http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf
This PDF 1.7 file was created in Acrobat 9, and if you open it you will see a short Quicktime movie that plays upon clicking on it.
Digging through the underlying PDF code reveals a Screen Annotation, a Rendition Action and a Media clip data dictionary. The latter looks like this:
41 0 obj
<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
endobj
It contains a reference to another object (42 0), which turns out to be a File Specification Dictionary:
42 0 obj
<</EF<</F 43 0 R>>/F(<embedded file>)/Type/Filespec/UF(<embedded file>)>>
endobj
What’s particularly interesting here is the /EF entry, which means we’re dealing with an embedded file stream. (The actual movie data are in a stream object (43 0) that is referenced by the file specification dictionary.)
So, the analysis of this sample file confirms that embedded file streams are actually used by Adobe Acrobat for multimedia content.
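The manual digging above can be approximated in a few lines of Python. The sketch below follows the D entry of each media clip data dictionary (recognisable by /S /MCD) and checks whether the referenced object is a file specification with an EF entry. It is a simplified illustration only: it assumes uncompressed objects and a D entry that is an indirect reference, and `media_clips_with_embedded_data` is a made-up name.

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
MCD_RE = re.compile(rb"/S\s*/MCD\b")               # media clip data dictionary
D_REF_RE = re.compile(rb"/D\s+(\d+)\s+\d+\s+R")    # /D entry as indirect reference


def media_clips_with_embedded_data(pdf_bytes):
    """Return (media clip object, file specification object) pairs that use an embedded file."""
    objects = {int(m.group(1)): m.group(3) for m in OBJ_RE.finditer(pdf_bytes)}
    hits = []
    for number, body in objects.items():
        if not MCD_RE.search(body):
            continue
        ref = D_REF_RE.search(body)
        if not ref:
            continue
        target = objects.get(int(ref.group(1)), b"")
        # An /EF entry in the referenced file specification means the data are embedded.
        if b"/Filespec" in target and b"/EF" in target:
            hits.append((number, int(ref.group(1))))
    return hits


# Objects 41 and 42 from the Quicktime sample file discussed above.
video_sample = b"""41 0 obj
<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
endobj
42 0 obj
<</EF<</F 43 0 R>>/F(<embedded file>)/Type/Filespec/UF(<embedded file>)>>
endobj
"""

print(media_clips_with_embedded_data(video_sample))  # [(41, 42)]
```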
In PDF/A-1, embedded file streams are not allowed at all:
A file specification dictionary (…) shall not contain the EF key. A file’s name dictionary shall not contain the EmbeddedFiles key
In PDF/A-2, embedded file streams are allowed, but only if the embedded file itself is PDF/A (1 or 2) as well:
A file specification dictionary, as defined in ISO 32000-1:2008, 7.11.3, may contain the EF key, provided that the embedded file is compliant with either ISO 19005-1 or this part of ISO 19005.
Finally, in PDF/A-3 this last limitation was dropped, which means that any file may be embedded (source: this unofficial newsletter item, as at this moment I don’t have access to the full specification of PDF/A-3).
Does this mean that PDF/A-3 supports multimedia? No, not at all! Even though nothing stops you from embedding multimedia content (e.g. a Quicktime movie), you wouldn’t be able to use it as a renderable object inside a PDF/A-3 document. The reason is that the annotations and actions that are needed for this (e.g. Screen annotations and Rendition actions, to name but a few) are not allowed in PDF/A-3. So effectively you are only able to use embedded file streams as attachments.
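The rules for the three PDF/A parts, as I understand them from the sources quoted above, can be summarised in a small Python function. This is just a restatement of the text, not an official decision table, and the function and parameter names are hypothetical. Note that it only says whether an embedded file stream is permitted at all, not whether it may be rendered as multimedia.

```python
def embedded_file_permitted(pdfa_part, embedded_is_pdfa_1_or_2=False):
    """Say whether a file specification may contain an EF entry in a given PDF/A part.

    pdfa_part: 1, 2 or 3 (the part of ISO 19005).
    embedded_is_pdfa_1_or_2: True if the embedded file itself conforms to PDF/A-1 or PDF/A-2.
    """
    if pdfa_part == 1:
        return False                    # the EF key is not allowed at all
    if pdfa_part == 2:
        return embedded_is_pdfa_1_or_2  # only PDF/A-1 or PDF/A-2 files may be embedded
    if pdfa_part == 3:
        return True                     # any file may be embedded
    raise ValueError("unknown PDF/A part: %r" % (pdfa_part,))


# A Quicktime movie is not PDF/A, so it can only be embedded in PDF/A-3.
for part in (1, 2, 3):
    print(part, embedded_file_permitted(part, embedded_is_pdfa_1_or_2=False))
```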
A few weeks ago the embedding issue came up again in a blog post by Gary McGath. One of the comments there is from Adobe’s Leonard Rosenthol (who is also the Project Leader for PDF/A). After correctly pointing out some mistakes in both the original blog post and in an earlier comment by me, he nevertheless added to the confusion by stating that objects that are rendered by the viewer (movies, etc.) all use Annotations, and that embedded files (which he apparently uses as a synonym for attachments) are handled in a completely different manner. This doesn’t appear to be completely accurate: at least one class of renderable objects (screen annotations/rendition actions) may be using embedded file streams. Also, embedded files that are used as attachments may be associated with a File Attachment Annotation, which means that “under the hood” both cases are actually more similar than first meets the eye (which is confirmed by the analysis of the two sample files in the preceding sections). Contributing to this confusion is also the fact that Section 7.11.4 of ISO 32000 erroneously states that embedded file streams are only used for non-renderable objects like file attachments, which is contradicted by their allowed use for multimedia content.
Some might argue that the above discussion is nothing but semantic nitpicking. However, details like these do matter if we want to do a proper assessment of preservation risks in PDF documents. As an example, in this previous blog post I demonstrated how a PDF/A validator tool can be used to profile PDFs for “risky” features. Such tools typically give you a list of features. It is then largely up to the user to further interpret this information.
Now suppose we have a pre-ingest workflow that is meant to accept PDFs with multimedia content, while at the same time rejecting file attachments. By only using the presence of an embedded file stream (reported by both Apache’s and Acrobat’s Preflight tools) as a rejection criterion, we could end up unjustly rejecting files with multimedia content as well. To avoid this, we also need to take into account what the embedded file stream is used for, and for this we need to look at what annotation types are used, and the presence of any EmbeddedFiles entry in the document’s name dictionary. However, if we don’t know precisely which features we are looking for, we may well arrive at the wrong conclusions!
This is made all the worse by the fact that preservation issues are often formulated in vague and non-specific ways. An example is this issue on the OPF Wiki on the detection of “embedded objects”. The issue’s description suggests that images and tables are the main concern (both of which aren’t strictly speaking embedded objects). The corresponding solution page subsequently complicates things further by also throwing file attachments in the mix. In order to solve issues like these, it is helpful to know that images are (mostly) represented as Image XObjects in PDF. The solution should then be a method for detecting Image XObjects. However, without some background knowledge of PDF’s internal data structure, solving issues like these becomes a daunting, if not impossible, task.
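To illustrate the pre-ingest scenario, here is a hedged Python sketch of what such a check might look like. It profiles a PDF for a handful of relevant features and only rejects the file when it finds signs of attachments (an EmbeddedFiles entry or a File Attachment annotation), not merely because an embedded file stream exists. The function names, the dictionary keys and the accept/reject policy are all assumptions of mine, and the naive byte scan will miss anything stored in compressed object streams.

```python
import re


def profile_embedding(pdf_bytes):
    """Report which embedding-related features occur in raw (uncompressed) PDF source."""
    def present(pattern):
        return re.search(pattern, pdf_bytes) is not None

    return {
        "embedded_file_stream": present(rb"/EF\s*<<"),
        "embedded_files_entry": present(rb"/EmbeddedFiles\b"),
        "file_attachment_annotation": present(rb"/Subtype\s*/FileAttachment\b"),
        "media_clip_data": present(rb"/S\s*/MCD\b"),
    }


def ingest_decision(profile):
    """Hypothetical policy: reject file attachments, accept multimedia."""
    if profile["embedded_files_entry"] or profile["file_attachment_annotation"]:
        return "reject"
    return "accept"


# Fragments of the two sample files discussed earlier.
attachment_sample = b"""37 0 obj
<</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>
endobj
33 0 obj
<</EmbeddedFiles 34 0 R/JavaScript 35 0 R>>
endobj
"""
multimedia_sample = b"""41 0 obj
<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
endobj
42 0 obj
<</EF<</F 43 0 R>>/F(<embedded file>)/Type/Filespec/UF(<embedded file>)>>
endobj
"""

print(ingest_decision(profile_embedding(attachment_sample)))  # reject
print(ingest_decision(profile_embedding(multimedia_sample)))  # accept
```

Both fragments contain an embedded file stream, so a rule based on that feature alone would have rejected the movie as well.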
In this blog post I have tried to shed some light on a number of common misconceptions about embedded content in PDF. I might have inadvertently created some new ones in the process, so feel free to contribute any corrections or additions using the comment fields below.
The PDF specification is vast and complex, and I have only addressed a limited number of its features here. For instance, one might argue that a discussion of embedding-related features should also include fonts, metadata, ICC profiles, and so on. The coverage of multimedia features here is also incomplete, as I didn’t include Movie Annotations or Sound Annotations (which preceded the Screen Annotations, which are now more commonly used). These things were all left out here because of time and space constraints. This also means that further surprises may well be lurking ahead!
Johan van der Knijff
KB / National Library of the Netherlands
Comments
How do we just solve the PDF problem?
Excellent post here Johan, thanks for taking the time to write this up after your recent investigations which I’ve been following on Twitter.
I’m amazed that we (i.e. digital preservation types) haven’t solved the PDF preservation problem yet. It’s a really common format, it scores in the top 3 popular formats in a lot of repositories out there. Yet our characterisation capabilities are still pretty weak. In fact, we’re doing so badly that the new PDF versions are making the problem get worse quicker than we can fix it.
I’ve been talking to a few different people about how we can just get on and solve the PDF preservation problem. For me, that means a solution akin to what you did with Jpylyzer. In other words, we need a comprehensive listing of PDF risks and what is behind them (i.e. how they are represented in actual PDF files). And then we need a focused tool that scans PDFs and reports on any risks that are found.
It seems to me that we have many of the pieces required to achieve this aim, but not all in one place. A whole host of people have done bits and pieces with code to deal with various PDF issues. But it feels like we need to bring this all together. How do we do this? Do we need an OPF special interest group to focus development, unite various bits of work already out there and help push things forward? Would an OPF hackathon on the subject help?
So Johan, 3 easy (ahem) questions for you. Have I characterised the problem correctly? Have I characterised the solution correctly? And how do we make the solution happen?
We have solved the PDF problem already
I don’t mean to be antagonistic but I do have to suggest that we have solved the PDF problem already.
It is quite straightforward to install a PDF application on an emulated desktop and use that to interact with PDF files that require that interaction-stack for the indefinite future.
All of this discussion about the details of PDF files might be superfluous. If donors/publishers etc. provide the archival institutions with checksums, interaction-stack details and the software itself, then the institution has all it needs (aside from an emulator) to preserve the content provided and make it accessible indefinitely. E.g.:
a) The checksums ensure the integrity of the files and ensure that any unusual rendering issues are maintained when the content is presented using an emulated interaction stack. This ensures that those issues are not considered integrity issues (i.e. it ensures the “bugs” are preserved).
b) The interaction-stack details provide the information necessary for the institution to set up the emulated interaction stack.
c) The software itself can also be used by the institution as part of the emulated interaction stack. This would be particularly relevant if the institution used a customised pdf-interaction application.
All the work undertaken to “validate” or characterise pdf files is unnecessary if using a strategy like this as validation to a standard doesn’t matter and characterisation is mostly unnecessary also. The content can be preserved without validating or characterising the files as they can be interacted with through the emulated environment indefinitely.
Not particularly practical, and doesn’t address all PDF risks
No need to worry about antagonism. It’s healthy to challenge and debate these issues!
I see a few problems with your strategy:
1) “If donors/publishers etc. provide the archival institutions with: (checksums, rendering stack details, software)” is a huge “if”. Most repositories I’m familiar with get content from donors/publishers and very little else. Changing this situation is near to impossible.
2) Donors providing software is likely to be illegal.
3) I suspect most users would not be happy having to fire up an emulator just to view a PDF, but having said that, the kind of work that you, Bram and Dirk have been showing recently is going a long way to negate these concerns (very impressive stuff).
4) A successful PDF rendering can depend on content external to the PDF (and likely not deposited in the repository), e.g. non-embedded fonts. Your solution does not catch this (or at least in the best case is very dependent on that huge “if”). Characterisation is required to pick this up, giving the repository a chance to swiftly go back to the depositor and ask for a better copy.
5) What about encrypted PDFs? Again these need to be identified on ingest so there is an opportunity to go back to the depositor straight away.
Just to pick up on your “validation to a standard” comment. I don’t think that approach does the whole job. What I’m advocating is characterising to identify specific PDF risks that we’re aware of. As has been eloquently demonstrated by Portico et al, just because a file adheres to a format spec, doesn’t mean it renders correctly (or vice versa).
Am I missing the point?
Thanks for the reply Paul. I’ve outlined some counter-points below.
1) I agree this is a challenge and for some types of institutions this will be more difficult than others. But it is definitely possible (and often simple). Publishers, for instance, would likely have a very good idea of the software required to interact with their products. Government and Corporate archives can set standards to a degree, and requiring documentation of the interaction-stack is not a particularly onerous standard to require.
2) The legality of donors providing software will depend on the jurisdiction and the license terms. In some jurisdictions I believe even OEM-licensed software can be sold/reused/donated (I am not a lawyer but I believe this is the case in Germany for instance). This is also not an essential step. The archival institutions can acquire the software themselves through legal means. While there are not currently many legal means of doing so, this is a solvable problem provided there is a will to do so.
3) I think you countered this point yourself. As Dirk, Bram and others are showing, opening and interacting with a pdf-based object via an emulator doesn’t have to be any more difficult than opening and interacting with one via Adobe Acrobat reader, i.e. click to open. This “issue”, if not already solved, will be soon.
4) I agree this is a challenge and may necessitate the use of characterisation tools. However if donors can provide rendering-stack details then those should include all dependencies and therefore characterisation would not be required (your point about the challenge of getting this information still stands though). Or alternatively they might provide snapshots of desktops with all necessary software & dependencies included on them, to be used for interacting with the objects via emulators in the future.
5) My points in answer to your 4) apply equally to encryption. However what I would also suggest here is that a) this is an edge case much more so than embedded fonts and therefore not a great criticism of the overall solution, and b) the transferring owner will likely be quite aware of the objects it has that require encryption keys, as they will have had to have a solution for managing those keys on an ongoing basis, and will therefore be quite easily able to provide those keys to the archival institution.
So in summary, the approach I suggest is not necessarily straightforward at the moment but it does offer a solution to the problems of preserving PDFs. I concede that characterisation can be useful, however under the approach I outlined it is not necessary if you are willing to do a little more up-front (and certainly possible) work with donors etc. The issues that are being noted in trying to build characterisation and validation tools for every format and format variant highlight the complexity of an approach that relies on them. When this complexity and its associated cost are combined with the current (and likely future) impossibility of validating migration on a large scale, for a cost that isn’t outrageous, then you are left with serious questions about the feasibility of an approach that requires all of this. More generally the objections you raise are all things that could be solved if there is a will (and therefore funding) to do so.
I may be sounding like a broken record here but the reasons I keep advocating emulation as a solution are that it works, can be proven to work (unlike large scale cost-feasible migration for anything but the simplest of objects), is mostly a just-in-time solution rather than a just-in-case solution, is likely to be cheaper than alternatives in the long-run as a result, and, if properly supported, can provide a much richer and more engaging digital history experience.
While there is increasing support for emulation solutions, I often find myself wondering why there aren’t more practitioners that support this approach in contrast to the number that support migration. I am beginning to be concerned about the possibility that, after 10-15+ years of investment in digital preservation solutions that institutions are now stuck with, there is a large amount of vested interest in migration-based approaches. If this is at all the case then I implore anyone reading to take some time to reconsider their approach and to think about the possible long-term benefits an emulation-based strategy may have to your institution and to the future of your digital assets and digital cultural heritage.
But perhaps I’m missing the point? I still don’t really understand what a “preservation risk” is meant to be. A risk to what? A risk to our ability to continue to interact with the objects in the future?
Yes, getting more than data from publishers/depositors IS hard
1) I strongly disagree with your thoughts here. Over the last couple of years I’ve worked with a lot of different practitioners (80+) from a number of institutions (60+) and in the vast majority of cases they don’t know enough about their data, never mind have checksums, rendering stack details or any other useful stuff. I should say, I know these details as I’ve just been putting them on a poster for IDCC next week, and have also previously blogged here about analysis of these results. “Definitely possible (and often simple)” could not be further from the truth. The exception is probably some archives where it’s possible to build up a relationship with the depositor and get more out of them. Based on the evidence collected from many organisations (including archives) that I’ve outlined above, these cases are few and far between however. I’d also pick up on this quote: “requiring documentation of the interaction-stack is not a particularly onerous standard to require”: it is onerous if you know nothing about these issues, which is the case with many depositors. I’d also quickly mention Andy’s “Formats over Time: Exploring UK Web History” paper. He found (for PDF) that “later years have seen an explosion in the number of distinct creator applications, with over 2100 different implementations of around 600 distinct software packages”. Obviously we need some more data and analysis here, but this suggests a very complex picture for PDF that I don’t think we fully understand yet.
2) Granted, it does depend on the location, but I suspect it will be illegal in the majority of countries. Again, I’d question the practicality here. Building a software archive is resource intensive, and these are resources that the majority of repositories simply don’t have.
3) Yes, I’m looking forward to you guys completely solving this one *:-)
4) So now you’re suggesting depositors will list external dependencies as well? This would of course be lovely, but is unfortunately nowhere near any reality I’m familiar with.
5) Edge case? Well maybe, but an all or nothing critical one! Again you’re assuming a lot about the depositor here. If you don’t pick up encryption very soon after deposit, you’re likely to be completely stuck. It’s important.
As you know, I’ve been an advocate of emulation for a long time and did quite a bit to change perceptions about emulation. But I really don’t think it offers a solution to the PDF problems we’re discussing here. That doesn’t mean I’m necessarily advocating migration though.
Your comment about investment in a migration approach is rather interesting. I’d not thought about it like that before. I’ll digest that a bit more before commenting further on it.
Re: preservation risks: yes – anything that’s going to get in the way of future access. I love Johan’s page on JP2, which says it all.
Great chatting with you Euan, as always!
Quick reply to Euan
Hi Euan,
Thanks for commenting on this, but I have to agree with Paul here. Checksums aside (which is something we always need), it’s simply not realistic to expect publishers/donors to be able to provide us with this information. Have a train to catch now, perhaps more later!
Here we go (reply to Paul’s first comment)…
Great comment. To answer your questions:
The problem
Yes, I think you’ve largely outlined the problem here, although I wouldn’t necessarily put it down to our characterisation possibilities being too weak. I think a lot of the work that has been done on (PDF) characterisation so far has been mostly about extracting as much information (from a PDF) as possible. This may have been partially driven by a focus on migration as a preservation strategy, where the main role of the characterisation tool is to record “significant properties” of source and target files. The aim here was mainly to track any unwanted changes in the migration process. At the same time a lot of work has been dedicated to format validation (e.g. PDF validation by JHOVE). The problem here is that even a “valid” PDF may still have preservation risks associated with it (e.g. encryption, non-embedded fonts), so format validation alone is not enough. At the same time, even though a tool like JHOVE can give you a lot of information on a PDF (in terms of reported properties), most of it isn’t directly linked to concrete risks. To be clear, this is not a fault of JHOVE, but simply the result of the fact that it was designed with a different purpose in mind.
The solution
As for the solution: you’re suggesting something similar to what I did with jpylyzer. The comparison makes me a bit nervous, because it seems to suggest the development of a new tool, which I think is something we should avoid at all costs! Personally I think what we need here is a 2 stage process:
1) Identify all features in a PDF that could pose a potential preservation risk.
2) Validate those features against an institutional profile.
Now this is actually pretty similar to what I did here with jpylyzer:
http://www.openplanetsfoundation.org/blogs/2012-09-04-automated-assessment-jp2-against-technical-profile
Now let’s assume that “all features that could pose a potential preservation risk” can be approximated by everything that is not allowed in PDF/A (e.g. PDF/A-1, which is the strictest profile). In that case, all we need for stage 1 is a software tool that is able to compare any given PDF against the PDF/A profile, and report back the deviations (i.e. all features encountered that violate PDF/A). These tools already exist, they’re PDF/A validators. Some first tests with an open-source one revealed a number of problems, but also showed a lot of promise. There are also several commercial tools; this 2009 study (which it seems has been largely overlooked by the digital preservation community) suggests that some of them are actually pretty good. So we can probably do this already!
Then the next step is to compare the output of those tools against our institutional profile. This is something we could do in a similar fashion to what I did with jpylyzer; if the PDF/A validator output is nicely formatted XML a bunch of (relatively simple) Schematron rules would probably be all that’s needed. The main difficulty here is that you need a pretty good understanding of how preservation risks, PDF features and the validator’s error codes are interlinked. This is somewhat complex, but definitely doable.
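Purely to illustrate the idea of this second stage, here is a small Python sketch that checks a validator report against an institutional profile. Everything about the report is hypothetical: the XML layout, the `error` element, the `code` attribute and the error codes themselves are invented for this example and do not correspond to the output of any actual PDF/A validator. In practice Schematron rules would play the role of the `assess` function.

```python
import xml.etree.ElementTree as ET

# Hypothetical validator output: one <error> element per PDF/A violation found.
REPORT = """<report file="example.pdf">
  <error code="FONT_NOT_EMBEDDED">Font is not embedded</error>
  <error code="EMBEDDED_FILE">File specification contains EF key</error>
  <error code="XMP_MISSING">No XMP metadata</error>
</report>"""

# Hypothetical institutional profile: the violations this institution treats as real risks.
BLOCKING_CODES = {"ENCRYPTION", "FONT_NOT_EMBEDDED"}


def assess(report_xml, blocking_codes):
    """Return the sorted list of reported error codes that the profile does not tolerate."""
    root = ET.fromstring(report_xml)
    reported = {error.get("code") for error in root.iter("error")}
    return sorted(reported & blocking_codes)


violations = assess(REPORT, BLOCKING_CODES)
print(violations)  # ['FONT_NOT_EMBEDDED']
print("fail" if violations else "pass")  # fail
```

The hard part is not this comparison but filling in the profile, i.e. knowing which error codes map to which preservation risks.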
How to make it happen
To me, the most important thing is not to start reinventing the wheel, and try to minimise any development efforts. If we restrict ourselves to open source tools, Apache Preflight looks like the most promising one, but apparently it’s still in its early stages, it needs further testing and the version that I tested had problems that would seriously limit its use in practical settings. However, the developers have promptly picked up on my bug reports and some of these problems should be fixed in the latest version (haven’t had any time for testing yet). So one possible avenue would be to, as a community, get more involved with the further development of this tool. Right now we’re considering some scalability tests on a very large dataset within SCAPE (actually Clemens Neudecker came up with this idea some days ago, we will discuss this further at the upcoming scenario workshop later this month). This may involve some development work as well (e.g. we would probably need an XML output handler). The second stage (validation against an institutional profile) would mainly require research effort, and probably not much development.
An alternative would be to go for commercially available tools. The advantage would be that it doesn’t involve any development in the first stage, as mature tools already exist (some testing would of course be needed). The main disadvantage would be that it makes a collaborative effort for stage 2 very difficult.
I would love to be involved in this myself, but I have to see how much time I can dedicate here. I will probably know more after the SCAPE scenario workshop (i.e. late January).
As for hackathons: not sure what they might contribute in this case. They may be fine for relatively small, ad-hoc issues, but what we need here is mainly a matter of sustained research and development by the right people. But that’s just my own opinion of course!
Phew, and I really tried to make my reply brief…
What he said
Yes, can’t find anything to disagree with there Johan. It’s a good job I’ve still got Euan to debate with!
Did not want to imply by mention of Jpylyzer that we should write a new tool. Or in fact that (just) you should write a new tool! We must exploit existing work where possible, especially when (as you say) so much work has already been done for us. However, as you said, the “Jpylyzer” approach of characterise->XML->Schematron->institutional policy feels spot on.
The Apache Preflight option sounds very interesting, especially hearing that the devs have picked up on your bugs quickly. A non-commercial solution would I think be very helpful on the adoption front. Potentially this could be very useful for a lot of repositories.
Let’s chat on this more next week!