Chapter 19
Accessible PDFs with PDF/UA

cpdf -print-struct-tree in.pdf

cpdf -extract-struct-tree in.pdf -o out.json

cpdf -replace-struct-tree in.json in.pdf -o out.pdf

cpdf -verify "PDF/UA-1(matterhorn)" [-json] in.pdf

cpdf -verify "PDF/UA-1(matterhorn)" -verify-single <test> [-json] in.pdf

cpdf -mark-as ["PDF/UA-1" | "PDF/UA-2"] in.pdf -o out.pdf

cpdf -remove-mark ["PDF/UA-1" | "PDF/UA-2"] in.pdf -o out.pdf

cpdf -create-pdf-ua-<1|2> <title> [-create-pdf-pages <n>]
      [-create-pdf-papersize <paper size>] -o out.pdf

PDF/UA (Universal Accessibility) is a PDF subformat whose rules combine machine-checkable requirements with requirements that only a human can check. Its aim is to make PDF documents accessible to all users, for example those using screen readers. Cpdf has some basic facilities for manipulating the extra PDF constructs which are used in (amongst others) PDF/UA, and a basic verifier for many of the machine-checkable requirements.

19.1 Structure trees

In a PDF document, the optional structure tree is a parallel construct which describes the logical structure of the document (as opposed to the information for rendering the document on the screen or printing it out, which every PDF of course contains).

We can print an abbreviated form of the structure tree to standard output:

cpdf -print-struct-tree in.pdf

This might yield:

StructTreeRoot
   Document
       Sect
          P (1)
             Span (1)
          Figure (1)
       Sect
          H1 (2)
          TOC
              TOCI
                 P
                     Link (2)
       ...

The numbers in parentheses are the page numbers for structure elements, where present. We can extract the full structure tree to JSON for inspection or manipulation:

cpdf -extract-struct-tree in.pdf -o out.json

Here is a typical fragment:

[
  [ 0, { "/CPDFJSONformatversion": 1, "/CPDFJSONpageobjnumbers": [ 52 ] } ],
  [
    102,
    {
      "/Type": { "N": "/StructElem" },
      "/S": { "N": "/TD" },
      "/P": 98,
      "/Pg": 52,
      "/K": { "I": 38 },
      "/T": { "U": "row #7, col #3" },
      "/A": {
        "/O": { "N": "/Layout" },
        "/Height": { "F": 18.28 },
        "/Width": { "F": 73.07689999999999 }
      }
    }
  ],
  [
    15,
    {
      "/Type": { "N": "/StructElem" },
      "/S": { "N": "/TD" },
      "/P": 59,
      "/Pg": 52,
      "/K": { "I": 20 },
      "/T": { "U": "row #3, col #5" },
      "/A": {
        "/O": { "N": "/Layout" },
        "/Height": { "F": 18.28 },
        "/Width": { "F": 73.07689999999999 }
      }
    }
  ],
...

This JSON file contains the structure tree objects from the file, using the format described in chapter 15. There is a special entry in object 0 which gives the key to the page object numbers. In this example, there is one page with object number 52.
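
As a rough illustration of how this format can be read, the following Python sketch lists each structure element in an extracted tree together with its structure type, page number and title. The function name summarize_struct_elems is hypothetical. The sketch also assumes that the entries of /CPDFJSONpageobjnumbers appear in page order, so that the first entry is page 1; the fragment above does not confirm this for multi-page files.

```python
import json

def summarize_struct_elems(tree):
    """Return (object number, structure type, page number, title) tuples.

    `tree` is the parsed JSON: a list of [object number, object] pairs.
    Assumption: /CPDFJSONpageobjnumbers lists page objects in page order.
    """
    objects = {num: obj for num, obj in tree}
    page_objnums = objects[0]["/CPDFJSONpageobjnumbers"]
    page_of = {objnum: i + 1 for i, objnum in enumerate(page_objnums)}
    rows = []
    for num, obj in tree:
        if num == 0 or not isinstance(obj, dict):
            continue
        stype = obj.get("/S")
        if not isinstance(stype, dict) or "N" not in stype:
            continue  # not a structure element with a named type
        pg = obj.get("/Pg")  # an indirect reference appears as a bare integer
        page = page_of.get(pg) if isinstance(pg, int) else None
        title = obj.get("/T")
        text = title.get("U") if isinstance(title, dict) else None
        rows.append((num, stype["N"], page, text))
    return rows

# A cut-down version of the fragment shown above.
sample = json.loads("""
[
  [ 0, { "/CPDFJSONformatversion": 1, "/CPDFJSONpageobjnumbers": [ 52 ] } ],
  [ 102, { "/Type": { "N": "/StructElem" }, "/S": { "N": "/TD" },
           "/P": 98, "/Pg": 52, "/K": { "I": 38 },
           "/T": { "U": "row #7, col #3" } } ]
]
""")

for row in summarize_struct_elems(sample):
    print(row)
```

For a real file, the tree would instead be loaded with json.load from the file written by -extract-struct-tree.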

This JSON file can be edited, for example to change text strings, and reapplied to the same file from which it was extracted:

cpdf -replace-struct-tree out.json in.pdf -o out.pdf

If extra objects are required, they should be introduced with negative object numbers: Cpdf will renumber them on import so as not to clash with any existing numbers.
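
Such an edit can also be scripted. The Python sketch below changes matching /T strings and picks an unused negative object number for a new object. The helper names retitle and fresh_negative_number are hypothetical, and only strings stored in the {"U": ...} form seen in the fragment above are handled.

```python
import json

def retitle(tree, old, new):
    """Replace every /T string equal to `old` with `new`.

    Edits `tree` in place and returns the number of strings changed.
    """
    changed = 0
    for _num, obj in tree:
        if not isinstance(obj, dict):
            continue
        title = obj.get("/T")
        if isinstance(title, dict) and title.get("U") == old:
            title["U"] = new
            changed += 1
    return changed

def fresh_negative_number(tree):
    """Return a negative object number not already used in `tree`."""
    used = [num for num, _obj in tree if num < 0]
    return min(used, default=0) - 1

# Typical use (file names as in the commands above):
#
#   with open("out.json") as f:
#       tree = json.load(f)
#   retitle(tree, "row #7, col #3", "Row 7, column 3")
#   with open("out.json", "w") as f:
#       json.dump(tree, f, indent=2)
```

The edited out.json can then be reapplied with -replace-struct-tree as shown above.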

To remove a structure tree from a PDF, we can use -remove-dict-entry, described in chapter 20:

cpdf -remove-dict-entry /StructTreeRoot in.pdf -o out.pdf

19.2 Verifying conformance to PDF/UA

Cpdf contains a new, experimental verifier for PDF/UA. It implements most of the machine-checkable subset of the Matterhorn Protocol, a list of checks based on the PDF/UA-1 specification. For example, we can run:

cpdf -verify "PDF/UA-1(matterhorn)" in.pdf

We see:

06-001 UA1:7.1-8 Document does not contain an XMP metadata stream
07-001 UA1:7.1-9 ViewerPreferences dictionary of the Catalog dictionary does
not contain a DisplayDocTitle entry
11-006 UA1:7.2-3 Natural language for document metadata cannot be determined.
("No top-level /Lang")
28-004 UA1:7.18.1-4 An annotation, other than of subtype Widget, does not
have a Contents entry and does not have an alternative description (in the
form of an Alt entry in the enclosing structure element).
28-008 UA1:7.18.3-1 A page containing an annotation does not contain a Tabs
entry
28-011 UA1:7.18.5-1 A link annotation is not nested within a <Link> tag.
28-012 UA1:7.18.5-2 A link annotation does not include an alternate
description in its Contents entry.

The first column is the Matterhorn Protocol checkpoint, the second is the reference in the PDF/UA-1 standard document, and the third is the textual description from the Matterhorn Protocol. An optional fourth item (in parentheses) gives any extra information available.

The same information is available in JSON format by adding -json to the command line:

[
  {
    "name": "06-001",
    "section": "UA1:7.1-8",
    "error": "Document does not contain an XMP metadata stream",
    "extra": null
  },
  {
    "name": "07-001",
    "section": "UA1:7.1-9",
    "error": "ViewerPreferences dictionary of the Catalog dictionary does not contain a DisplayDocTitle entry",
    "extra": null
  },
  {
    "name": "11-006",
    "section": "UA1:7.2-3",
    "error": "Natural language for document metadata cannot be determined.",
    "extra": "No top-level /Lang"
  },
  {
    "name": "28-004",
    "section": "UA1:7.18.1-4",
    "error": "An annotation, other than of subtype Widget, does not have a Contents entry and does not have an alternative description (in the form of an Alt entry in the enclosing structure element).",
    "extra": null
  },
  {
    "name": "28-008",
    "section": "UA1:7.18.3-1",
    "error": "A page containing an annotation does not contain a Tabs entry",
    "extra": null
  },
  {
    "name": "28-011",
    "section": "UA1:7.18.5-1",
    "error": "A link annotation is not nested within a <Link> tag.",
    "extra": null
  },
  {
    "name": "28-012",
    "section": "UA1:7.18.5-2",
    "error": "A link annotation does not include an alternate description in its Contents entry.",
    "extra": null
  }
]

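The JSON form is convenient for further processing. As a minimal sketch, assuming the output has been captured as a JSON array of objects with the name, section, error and extra fields shown above, the following Python counts failures per checkpoint and rebuilds the one-line text form. The function names are illustrative and are not part of Cpdf.

```python
import json
from collections import Counter

def failures_by_checkpoint(results):
    """Count how many reported failures carry each checkpoint name."""
    return Counter(r["name"] for r in results)

def format_failure(result):
    """Rebuild the text form: checkpoint, section, description, extra."""
    line = "{} {} {}".format(result["name"], result["section"], result["error"])
    if result.get("extra"):
        line += ' ("{}")'.format(result["extra"])
    return line

# Two entries taken from the output shown above.
results = json.loads("""
[
  { "name": "06-001", "section": "UA1:7.1-8",
    "error": "Document does not contain an XMP metadata stream",
    "extra": null },
  { "name": "11-006", "section": "UA1:7.2-3",
    "error": "Natural language for document metadata cannot be determined.",
    "extra": "No top-level /Lang" }
]
""")

for r in results:
    print(format_failure(r))
```
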
When checking many files for a single fault, we can run just one test by adding -verify-single <testname> to the command line. For example:

cpdf -verify "PDF/UA-1(matterhorn)" -verify-single "28-012" in.pdf
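
To run one test over a batch of files, the command can be driven from a script. The Python sketch below builds the command line from the options described in this chapter and collects the files that report a failure. It assumes that with -json the results are written to standard output as a JSON array, and that a file with no failures yields an empty array or no output; neither is stated above, so check this against your version of Cpdf. The function names are hypothetical.

```python
import json
import subprocess

def verify_single_command(pdf_path, test, cpdf="cpdf"):
    """Build the argument list for running one Matterhorn test with JSON output."""
    return [cpdf, "-verify", "PDF/UA-1(matterhorn)",
            "-verify-single", test, "-json", pdf_path]

def files_failing(pdf_paths, test, run=subprocess.run):
    """Return the paths whose verification output lists at least one failure.

    Assumption: results arrive on standard output as a JSON array.
    `run` can be swapped for a stand-in when testing without Cpdf installed.
    """
    failing = []
    for path in pdf_paths:
        proc = run(verify_single_command(path, test),
                   capture_output=True, text=True)
        output = proc.stdout.strip()
        results = json.loads(output) if output else []
        if results:
            failing.append(path)
    return failing

# Typical use:
#
#   import glob
#   print(files_failing(sorted(glob.glob("*.pdf")), "28-012"))
```

Passing the arguments as a list avoids shell quoting problems with the parentheses in "PDF/UA-1(matterhorn)".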

A list of Matterhorn tests and their implementation status forms Appendix C.

19.3 PDF/UA compliance markers

Once we are sure a file complies with PDF/UA, in terms of both machine and human checks, we can mark it as such:

cpdf -mark-as "PDF/UA-1" in.pdf -o out.pdf

Or, for the more recent PDF/UA-2 standard:

cpdf -mark-as "PDF/UA-2" in.pdf -o out.pdf

To remove such a marker we can use, for example:

cpdf -remove-mark "PDF/UA-1" in.pdf -o out.pdf

19.4 Merging and splitting PDF/UA files

The -process-struct-trees option should always be used in conjunction with any splitting or merging command to preserve PDF/UA compliance. Details are given in chapter 2.

19.5 Creating new PDF/UA files

To create a new PDF/UA-1 file, with A4 portrait paper, one page, and the title "My Book", we may write:

cpdf -create-pdf-ua-1 "My Book" -o out.pdf

Every PDF/UA document, even a blank one, needs a title to meet the standard. For PDF/UA-2, use -create-pdf-ua-2 instead. For a PDF/UA-2 file to be valid, you must also draw a top-level PDF/UA-2 Document tag, as described in section 19.6.

19.6 Drawing PDF/UA files

Cpdf can add PDF/UA structure data when drawing on new PDF/UA files. For example, the following produces a valid PDF/UA-1 file with structure information:

cpdf -create-pdf-ua-1 "Hello" AND
     -embed-std14 /path/to/fonts -draw-struct-tree
     -draw -bt -text "Hello, World" -et -o out.pdf

Note that we had to specify embedded fonts to make this a valid PDF/UA-1 file. To make a valid PDF/UA-2 file, we must also add a top-level Document structure tag with the appropriate namespace. Here is the PDF/UA-2 version of our file:

cpdf -create-pdf-ua-2 "Hello" AND 
     -embed-std14 /path/to/fonts -draw-struct-tree
     -draw -namespace PDF2 -stag Document -namespace PDF
     -bt -text "Hello, World" -et -end-stag -o out.pdf

See chapter 18 for more details about adding structure information when drawing.