While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.
This is especially apparent when trying to edit arbitrary PDF files, which is sometimes not so easy or even impossible. Just the definition of fonts and the text layout is already so complicated that this is the logical consequence.
But perhaps the format has simply grown and led to additional requirements such as PDF/A, PDF/X, PDF/E and now PDF 2.0, the next standard that makes everything even more complex... Will this ever stop?
PDF is an unusual format in that it set out to do one rather specific thing and then achieved that goal, so it could be considered "done", yet the product it was most associated with, Acrobat, still kept trying to expand.
PDF has the semantics of a digital print that is resolution-independent and supports copy-paste and search (mostly by mapping glyphs back to text).
Resolution independence is already something higher-level than strictly "digital print", and being able to capture transparency is another such higher-level feature.
From the above perspective, PDF peaked in 1.4 when it got transparency support. Supporting roughly the PDF 1.4 feature set was what allowed the Mac Preview app to be good enough for Mac users, so that Apple could stop bundling Acrobat Reader with Macs.
Since 1.4, PDF has gotten better compression algorithms that don't really change what the format is about. PDF/A and PDF/X fit the notion of PDF as "digital print" well.
But Adobe has been trying to leverage Acrobat/PDF to other areas that don't fit the notion of "digital print". These include pre-Macromedia acquisition attempts to make PDFs a more dynamic platform and later inclusion of 3D models in PDFs. Other PDF viewers still work for users most of the time without this stuff, which is a signal of what PDF really is to users ("digital print").
(Filling in paper-like forms, while not true to the notion that PDF is a final-form format, sort of makes sense from the point of view of digital paper, though.)
> While working with the PDF format I sometimes get the impression that this complexity is what Adobe wants. As a result, Adobe Reader is the only viewer that implements the entire spec and can handle all (or most) quirks.
While that certainly does play in Adobe's favor, the complexity of the spec is also what happens when, over time, new features, some never even envisioned by the original creators, are bolted on to keep the whole thing "relevant" and to keep it from becoming obsolete.
We can certainly argue whether each added feature was worth the increase in complexity, but simply taking an existing system and bolting on the latest "hotness" to pad the checklist of "why one should upgrade" features produces similar levels of complexity.
So some of the complexity increase is merely due to the fact that the PDF spec has evolved to do things it was likely never designed to do in the first place.
The Office formats are well specified. They are complex because that is the nature of the software, but it is a world away from something like PSD or even PDF.
PDF is actually quite well specified; there are not many holes in the specification itself.[0] As to what Adobe Reader will do when it encounters an out-of-spec file, that is a lot fuzzier.
On the other hand, the Office file formats (especially Word) have many un- or underspecified cases.
[0] The only one I know of is finding the end of compressed inline image data.
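To make that footnote concrete, here is a minimal sketch of the kind of guesswork a parser ends up doing. It assumes the common convention that the `EI` operator closing an inline image is delimited by whitespace; the function name and the sample byte strings are made up for illustration, and real parsers typically layer more checks on top of this.

```python
# Bytes that PDF treats as whitespace: NUL, tab, LF, form feed, CR, space.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "

def guess_inline_image_end(content: bytes, data_start: int) -> int:
    """Return the offset of the first plausible EI operator, or -1.

    Heuristic only: looks for "EI" preceded by whitespace and followed by
    whitespace or the end of the stream. Image data that happens to contain
    the same byte pattern will fool it.
    """
    pos = content.find(b"EI", data_start)
    while pos != -1:
        before_ok = pos > 0 and content[pos - 1] in PDF_WHITESPACE
        after = pos + 2
        after_ok = after == len(content) or content[after] in PDF_WHITESPACE
        if before_ok and after_ok:
            return pos
        pos = content.find(b"EI", pos + 1)
    return -1

# Hypothetical content stream fragments (not taken from a real file).
simple = b"BI /W 2 /H 2 /BPC 8 /CS /G ID \x01\x02\x03\x04 EI Q"
tricky = b"BI /W 4 /H 1 /BPC 8 /CS /G ID ab EI cd EI Q"

simple_end = guess_inline_image_end(simple, simple.index(b"ID ") + 3)
tricky_end = guess_inline_image_end(tricky, tricky.index(b"ID ") + 3)
```

For `simple` the guess lands on the real `EI`. For `tricky` it stops at the `EI` that sits inside the image data, which is exactly the ambiguity: without a length, the scanner cannot tell the two apart.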
I agree, the PDF spec is great, and very easy to understand (if slow to wade through). The hardest parts are when you have to duck out to read another spec for a contained format like TrueType.
Regarding Reader, I work with PDFs a lot, and the majority of issues follow a fairly common pattern. The supplier has created a PDF in a third-party tool, and the file is invalid in a subtle way (production printers in particular are very specific about what they want to accept).
But it works fine in Adobe Reader, since it was built to be very tolerant in what it accepts, so it's often hard to convince the non-technical users that the file has an issue. It's great for end users but has meant that a lot of tools out there just didn't have to try too hard to make PDFs that mostly work, so programming workflows can be an issue.
I found quite a few areas that were vague when I was working with it.
The advantage of the Office formats is that they are Zip files with a ton of XML, i.e. they are well defined. The application parts are another matter, of course.
The original criticism was that some parts are just binary blobs encoded in XML elements, which wouldn’t surprise me at all, with Microsoft being allowed to tick the ‘XML file format’ checkbox and still getting to keep the binary format advantages.
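As a rough illustration of the "Zip files with XML" point, the sketch below uses Python's standard `zipfile` module to list the parts of a package. The helper names are made up, and the toy package built here only borrows two part names that a .docx normally contains; it is not a valid Word document.

```python
import io
import zipfile

def list_package_parts(data: bytes) -> list[str]:
    """Return the names of the parts stored in a ZIP-based package."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.namelist()

def build_toy_package() -> bytes:
    """Build a tiny in-memory ZIP with placeholder XML parts (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()

parts = list_package_parts(build_toy_package())
print(parts)
```

Pointing `list_package_parts` at the bytes of a real .docx or .xlsx should work the same way, which is what makes the container layer easy to inspect even when the semantics of the XML inside are not.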
I see. I was mostly referring to semantic problems, of which I heard there are a lot (I haven't really worked with Office internals much), and also I was thinking of the pre-XML Office formats.
I remember reading in the past that Microsoft had corrupted the ISO standards body into publishing essentially fake standards that differed from what MS Office actually produced, so software like LibreOffice would output files that didn't work properly in Office or vice versa. Are you saying that this is no longer the case and the formats are fully specified? I sometimes tell people about this, so I want to make sure I have my facts straight.
Okay, apparently I totally didn't notice the release of PDF 2.0 a year ago, even though I was working a lot with PDFs at that time. Also, this new version is an ISO standard that costs 198 CHF to download, so I hereby predict that it is basically dead in the water, since few people will bother implementing it. The new features also don't seem very interesting, and from what I gather the spec is still backwards compatible despite the major version number increment.