19 Accessible PDFs with PDF/UA

PDF/UA (Universal Accessibility) is a PDF subformat whose rules are a set of requirements, some machine-checkable and some checkable only by a human, intended to make PDF documents accessible to all users, for example those using screen readers. Cpdf has some basic facilities for manipulating the extra PDF constructs which are used in (amongst others) PDF/UA, and a basic verifier for many of the machine-checkable requirements.

19.1 Structure trees

In a PDF document, the optional Structure Tree is a parallel construct which describes the logical structure of a document (as opposed to the information for rendering the document on the screen or printing it out, which every PDF of course contains).

When Cpdf prints a structure tree in human-readable form with -print-struct-tree, the numbers in parentheses are the page numbers for structure elements, where present. We can also extract the full structure tree to JSON for inspection or manipulation:
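
A minimal sketch of both operations as shell functions, assuming the -print-struct-tree operation for the human-readable form and the -extract-struct-tree operation for the JSON form. The function names and file names are illustrative only and are not part of Cpdf:

```shell
# Sketch only: the function names and file names are illustrative.

# show_struct_tree FILE
# Print the structure tree of FILE in human-readable form.
show_struct_tree() {
  cpdf -print-struct-tree "$1"
}

# extract_struct_tree FILE JSON
# Write the structure tree of FILE to JSON for inspection or editing.
extract_struct_tree() {
  cpdf -extract-struct-tree "$1" -o "$2"
}
```

For example, show_struct_tree in.pdf followed by extract_struct_tree in.pdf tree.json.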

This JSON file contains the structure tree objects from the file, using the format described in chapter 15. There is a special entry in object 0 which gives the key to the page object numbers. In this example, there is one page with object number 52.

This JSON file can be edited, for example to change text strings, and reapplied to the same file from which it was extracted:
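
A sketch of the reapplication step, assuming that -replace-struct-tree takes the edited JSON file as its argument, followed by the PDF from which the tree was extracted. The function name and file names are hypothetical:

```shell
# replace_struct_tree JSON FILE OUT
# Reapply the (edited) structure tree in JSON to FILE, writing OUT.
# FILE should be the same PDF the JSON was extracted from.
replace_struct_tree() {
  cpdf -replace-struct-tree "$1" "$2" -o "$3"
}
```

For example, replace_struct_tree tree.json in.pdf out.pdf.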

If extra objects are required, they should be introduced with negative object numbers: Cpdf will renumber them on import so as not to clash with any existing numbers.

Removing a structure tree with -remove-struct-tree removes the tree and all references to it, including from inside page content. In addition we can, afterward, use -mark-as-artifact:
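
A sketch of the two steps run in sequence, assuming the removal operation is named -remove-struct-tree and that -mark-as-artifact is then applied to the intermediate file. The function name and the three file names are placeholders:

```shell
# strip_struct_tree FILE TMP OUT
# Step 1: remove the structure tree from FILE, writing TMP.
# Step 2: apply -mark-as-artifact to TMP, writing OUT.
# The second step only runs if the first succeeds.
strip_struct_tree() {
  cpdf -remove-struct-tree "$1" -o "$2" &&
  cpdf -mark-as-artifact "$2" -o "$3"
}
```

For example, strip_struct_tree in.pdf tmp.pdf out.pdf.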

19.2 Verifying conformance to PDF/UA

Cpdf contains a new, experimental verifier for PDF/UA which implements most of the machine-checkable subset of the Matterhorn Protocol, a list of checks based on the PDF/UA-1 specification. For example, we can run:
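
A sketch of a verification run, assuming the -verify operation takes the name of the test set as its argument and that the Matterhorn test set is named "PDF/UA-1(matterhorn)". The function name is illustrative:

```shell
# verify_pdfua FILE
# Run the machine-checkable Matterhorn tests on FILE and print any
# failures on standard output.
verify_pdfua() {
  cpdf -verify "PDF/UA-1(matterhorn)" "$1"
}
```

For example, verify_pdfua in.pdf.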

The first column here is the Matterhorn Protocol checkpoint, the second the reference in the PDF/UA-1 standard document, the third the textual description from the Matterhorn Protocol, and an optional fourth (in parentheses) any extra information available.
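
Because the checkpoint is the first column, the output is easy to summarise with standard tools. The following sketch counts how many times each checkpoint was reported. It assumes only that the columns are separated by whitespace, and the function name is hypothetical:

```shell
# count_failures
# Read verifier output on standard input and print "COUNT CHECKPOINT"
# for each distinct checkpoint, most frequent first.
count_failures() {
  awk 'NF { count[$1]++ } END { for (c in count) print count[c], c }' |
    sort -k1,1nr -k2,2
}
```

It is intended to be used at the end of a pipeline, with the verifier's output piped into count_failures.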

The same information is available in JSON format by adding -json to the command line:
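
A sketch which saves the JSON report to a file, assuming the test set name "PDF/UA-1(matterhorn)" and that the JSON is written to standard output in the same way as the plain-text report. The function name and file names are illustrative:

```shell
# verify_pdfua_json FILE REPORT
# Run the verifier on FILE with -json and save the result in REPORT.
verify_pdfua_json() {
  cpdf -verify "PDF/UA-1(matterhorn)" -json "$1" > "$2"
}
```

For example, verify_pdfua_json in.pdf report.json.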

If verifying many files for a single fault, we may choose which test to run by adding -verify-single <testname> to the command line. For example:
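
A sketch of checking many files for one fault. It assumes the test set name "PDF/UA-1(matterhorn)", that the test name is a Matterhorn checkpoint number such as 06-003, and that the verifier prints nothing when the test passes. The function name is hypothetical:

```shell
# files_failing TEST FILE...
# Print the name of each FILE for which the single test TEST reports
# at least one failure. Assumes a passing test produces no output.
files_failing() {
  test_name=$1
  shift
  for f in "$@"; do
    if [ -n "$(cpdf -verify "PDF/UA-1(matterhorn)" -verify-single "$test_name" "$f")" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

For example, files_failing 06-003 *.pdf lists the files in the current directory which fail that one check.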

Presently, Matterhorn tests 31-001 to 31-016, 31-018 and 31-030 are unimplemented. Matterhorn tests 10-001, 11-001 to 11-005 and 31-027 are partially implemented. All others are implemented.

19.3 PDF/UA compliance markers

Once we are sure a file complies with PDF/UA, in terms of both machine and human checks, we can mark it as such:
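
A sketch which marks a file only when the machine checks report nothing. It assumes the test set name "PDF/UA-1(matterhorn)" and that a clean verification produces no output. The function name is illustrative. The human checks, and any Matterhorn tests Cpdf does not implement, must still be done separately before the mark means anything:

```shell
# mark_if_clean FILE OUT
# Mark FILE as PDF/UA-1, writing OUT, but only if the verifier reports
# no machine-checkable failures. Human checks are NOT covered here.
mark_if_clean() {
  if [ -z "$(cpdf -verify "PDF/UA-1(matterhorn)" "$1")" ]; then
    cpdf -mark-as "PDF/UA-1" "$1" -o "$2"
  else
    echo "machine checks failed; not marking $1" >&2
    return 1
  fi
}
```

For example, mark_if_clean in.pdf out.pdf.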

19.4 Merging and splitting PDF/UA files

The -process-struct-trees option should always be used in conjunction with any splitting or merging command to preserve PDF/UA compliance. Sometimes -subformat may be required too. Details are given in chapter 2.
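
A sketch of a merge which keeps the structure trees, assuming the -merge operation from chapter 2 and that -process-struct-trees may be given after the input files. The function name is hypothetical, and -subformat is left out because whether it is needed depends on the files:

```shell
# merge_tagged OUT FILE...
# Merge the given FILEs into OUT, processing their structure trees so
# that the result keeps its logical structure.
merge_tagged() {
  out=$1
  shift
  cpdf -merge "$@" -process-struct-trees -o "$out"
}
```

For example, merge_tagged out.pdf in1.pdf in2.pdf.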

19.5 Creating new PDF/UA files

To create a new PDF/UA-1 file, with A4 portrait paper, one page, and the title "My Book", we may write:
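
A sketch of such a command, assuming that -create-pdf-ua-1 takes the title as its argument and that A4 portrait and a single page are what Cpdf produces when no paper size or page count is given. The function name is illustrative:

```shell
# new_pdfua1 TITLE OUT
# Create a new PDF/UA-1 file OUT with the given TITLE, relying on the
# assumed defaults for paper size and page count.
new_pdfua1() {
  cpdf -create-pdf-ua-1 "$1" -o "$2"
}
```

For example, new_pdfua1 "My Book" out.pdf.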

A title is needed for every PDF/UA document (even a blank one) for it to meet the standard. For PDF/UA-2, use -create-pdf-ua-2 instead. To make it valid, you must also draw a top-level PDF/UA-2 Document tag as described below, i.e.:

19.6 Drawing PDF/UA files

Cpdf can add PDF/UA structure data when drawing on new PDF/UA files. For example, the following produces a valid PDF/UA-1 file with structure information:

Note we had to specify embedded fonts to make this a valid PDF/UA-1 file. To make a valid PDF/UA-2 file we must also add a top-level Document structure tag with the appropriate namespace. Here is the PDF/UA-2 version of our file:

See chapter 18 for more details about adding structure information when drawing.

19.7 Remediation of PDF/UA verification errors

Remediation of a file which claims to match PDF/UA but which does not (failing either human or mechanical tests) is a complex topic. In this section, we list possible remediations for a file which fails mechanical verification with Cpdf or another verification tool. Sometimes these will be clear and simple – for example where some piece of document metadata is missing – and sometimes they will be almost impossible. Of course, often the truth lies between those two extremes.

When all else fails, it may be possible to modify the basic structures of the PDF manually. This may be done by extracting the PDF to JSON using -output-json from chapter 15, modifying the file manually in a text editor or automatically with a JSON processing tool such as jq, and then converting back to a PDF with -j. If the remediation requires altering page content streams, the option -output-json-parse-content-streams may be used.
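
The round trip can be sketched as a shell function which takes the editing step as a parameter. The function name, the derived JSON file name and the way the editing command is passed are all illustrative choices, not part of Cpdf:

```shell
# remediate_via_json FILE OUT EDIT-COMMAND...
# 1. Extract FILE to JSON (named after OUT, with a .json extension).
# 2. Run EDIT-COMMAND with the JSON file name appended as last argument.
# 3. Convert the edited JSON back to a PDF, writing OUT.
# Add -output-json-parse-content-streams to step 1 if page content
# streams need to be altered.
remediate_via_json() {
  src=$1
  dst=$2
  shift 2
  json=${dst%.pdf}.json
  cpdf "$src" -output-json -o "$json" &&
  "$@" "$json" &&
  cpdf -j "$json" -o "$dst"
}
```

For example, remediate_via_json in.pdf out.pdf vi opens the extracted JSON in a text editor and rebuilds out.pdf once the editor exits successfully.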

19.7.1 Remediation List

The following table lists each mechanically-verifiable test from the Matterhorn Protocol. For each, we give the number, description from the Matterhorn Protocol, and the reference into the PDF/UA standard. Then we describe, if possible, how to use Cpdf to remediate the failure. Sometimes this is a definitive command, sometimes a last-ditch attempt to re-process the file (to embed missing fonts or correct font structures, for example) and sometimes simply a direction to try the manual remediation procedure described above.

Number	Description	Reference
01-003	Content marked as Artifact is present inside tagged content.	UA1:7.1-1
01-004	Tagged content is present inside content marked as Artifact.	UA1:7.1-1
01-005	Content is neither marked as Artifact nor tagged as real content.	UA1:7.1-2
File does not meet Tagged PDF standard - only manual remediation possible (see description above this table).
01-007	Suspects entry has a value of true.	UA1:7.1-11
If you are sure the file conforms to tagged PDF conventions, use cpdf -replace-obj /Root/MarkInfo/Suspects=false in.pdf -o out.pdf.
02-001	One or more non-standard tag’s mapping does not terminate with a standard type.	UA1:7.1-3
02-003	A circular mapping exists.	UA1:7.1-3
02-004	One or more standard types are remapped.	UA1:7.1-4
File does not meet PDF/UA tagging standard - only manual remediation possible (see description above this table).
06-001	Document does not contain an XMP metadata stream	UA1:7.1-8
Create XMP metadata from any existing old-style metadata in the file with cpdf -create-metadata in.pdf -o out.pdf. This may lead to further verification errors due to empty metadata entries.
06-002	The XMP metadata stream in the Catalog dictionary does not include the PDF/UA identifier.	UA1:5
Mark the file as PDF/UA using cpdf -mark-as ["PDF/UA-1" | "PDF/UA-2"] in.pdf -o out.pdf.
06-003	XMP metadata stream does not contain dc:title	UA1:7.1-8
Add a title using cpdf -set-title "My title" -also-set-xmp in.pdf -o out.pdf.
07-001	ViewerPreferences dictionary of the Catalog dictionary does not contain a DisplayDocTitle entry	UA1:7.1-9
Add the entry with cpdf -display-doc-title true in.pdf -o out.pdf.
07-002	ViewerPreferences dictionary of the Catalog dictionary contains a DisplayDocTitle entry with a value of false	UA1:7.1-9
Replace the entry with cpdf -display-doc-title true in.pdf -o out.pdf.
09-004	A table-related structure element is used in a way that does not conform to the syntax defined in ISO 32000-1, Table 337.	UA1:7.2-1
09-005	A list-related structure element is used in a way that does not conform to Table 336 in ISO 32000-1.	UA1:7.2-1
09-006	A TOC-related structure element is used in a way that does not conform to Table 333 in ISO 32000-1.	UA1:7.2-1
09-007	A Ruby-related structure element is used in a way that does not conform to Table 338 in ISO 32000-1.	UA1:7.2-1
09-008	A Warichu-related structure element is used in a way that does not conform to Table 338 in ISO 32000-1.	UA1:7.2-1
File does not meet PDF/UA tagging standard - only manual remediation possible (see description above this table).
10-001	Character code cannot be mapped to Unicode.	UA1:7.2-2
It is possible that reprocessing the file with gs using cpdf in.pdf -gs gs -gs-malformed-force -o out.pdf [-gs-quiet] will correct the fonts.
11-001	Natural language for text in page content cannot be determined.	UA1:7.2-3
11-002	Natural language for text in Alt, ActualText and E attributes cannot be determined.	UA1:7.2-3
11-003	Natural language in the Outline entries cannot be determined.	UA1:7.2-3
11-004	Natural language in the Contents entry for annotations cannot be determined.	UA1:7.2-3
11-005	Natural language in the TU entry for form fields cannot be determined.	UA1:7.2-3
11-006	Natural language for document metadata cannot be determined.	UA1:7.2-3
Assuming the document is all in a single language, set the top-level language with, for example, cpdf -set-language "en-US" in.pdf -o out.pdf. If the document contains multiple languages, only manual remediation is possible.
13-004	<Figure> tag alternative or replacement text missing.	UA1:7.3-3
14-002	Does use numbered headings, but the first heading tag is not <H1>.	UA1:7.4.2-1
14-003	Numbered heading levels in descending sequence are skipped (Example: <H3> follows directly after <H1>).	UA1:7.4-1
14-006	A node contains more than one <H> tag.	UA1:7.4.4-1
14-007	Document uses both <H> and <H#> tags.	UA1:7.4.4-3
15-003	In a table not organized with Headers attributes and IDs, a <TH> cell does not contain a Scope attribute.	UA1:7.5-2
17-002	<Formula> tag is missing an Alt attribute.	UA1:7.7-1
File does not meet PDF/UA standard - only manual remediation possible (see description above this table). Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
17-003	Unicode mapping requirements are not met.	UA1:7.7-2
It is possible that reprocessing the file with gs using cpdf in.pdf -gs gs -gs-malformed-force -o out.pdf [-gs-quiet] will correct the fonts.
19-003	ID entry of the <Note> tag is not present.	UA1:7.9-2
19-004	ID entry of the <Note> tag is non-unique.	UA1:7.9-2
20-001	Name entry is missing or has an empty string as its value in an Optional Content Configuration Dictionary in the Configs entry in the OCProperties entry in the Catalog dictionary.	UA1:7.10-1
20-002	Name entry is missing or has an empty string as its value in an Optional Content Configuration Dictionary that is the value of the D entry in the OCProperties entry in the Catalog dictionary.	UA1:7.10-1
File does not meet PDF/UA standard - only manual remediation possible (see description above this table). Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
20-003	An AS entry appears in an Optional Content Configuration Dictionary.	UA1:7.10-2
File does not meet PDF/UA standard - only manual remediation possible (see description above this table).
21-001	The file specification dictionary for an embedded file does not contain F and UF entries.	UA1:7.11-1
File does not meet PDF/UA standard - only manual remediation possible (see description above this table). Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
25-001	File contains the dynamicRender element with value “required”.	UA1:7.15-1
Not remediable, unless actually a wrong marker. This is an interactive PDF form which likely only works with Adobe Acrobat. If the marker is actually wrong, it may be edited manually inside the XML stream using the instructions above.
26-001	The file is encrypted but does not contain a P entry in its encryption dictionary.	UA1:7.16-1
26-002	The file is encrypted and does contain a P entry but the 10th bit position of the P entry is false.	UA1:7.16-1
Re-encrypt the file with Cpdf as described in Chapter 4.
28-002	An annotation, other than of subtype Widget, Link and PrinterMark, is not a direct child of an <Annot> structure element.	UA1:7.18.1-2
28-004	An annotation, other than of subtype Widget, does not have a Contents entry and does not have an alternative description (in the form of an Alt entry in the enclosing structure element).	UA1:7.18.1-4
28-005	A form field does not have a TU entry and does not have an alternative description (in the form of an Alt entry in the enclosing structure element).	UA1:7.18.1-4
File does not meet PDF/UA standard - only manual remediation possible (see description above this table). Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
28-006	An annotation with subtype undefined in ISO 32000 does not meet 7.18.1.	UA1:7.18.2-1
28-007	An annotation of subtype TrapNet exists.	UA1:7.18.2-2
28-008	A page containing an annotation does not contain a Tabs entry	UA1:7.18.3-1
28-009	A page containing an annotation has a Tabs entry with a value other than S.	UA1:7.18.3-1
If annotations are not required, they may be removed with cpdf -remove-annotations in.pdf -o out.pdf. Alternatively, use -output-annotations-json and -set-annotations-json as described in Chapter 10 to remove one or more specific annotations.
28-010	A widget annotation is not nested within a <Form> tag.	UA1:7.18.4-1
28-011	A link annotation is not nested within a <Link> tag.	UA1:7.18.5-1
If annotations are not required, they may be removed with cpdf -remove-annotations in.pdf -o out.pdf. Alternatively, use -output-annotations-json and -set-annotations-json as described in Chapter 10 to remove one or more specific annotations. Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
28-012	A link annotation does not include an alternate description in its Contents entry.	UA1:7.18.5-2
28-014	CT entry is missing from the media clip data dictionary.
28-015	Alt entry is missing from the media clip data dictionary.	UA1:7.18.6.2-1
28-016	File attachment annotations do not conform to 7.11.	UA1:7.18.7-1
If annotations are not required, they may be removed with cpdf -remove-annotations in.pdf -o out.pdf. Alternatively, use -output-annotations-json and -set-annotations-json as described in Chapter 10 to remove one or more specific annotations.
28-017	A PrinterMark annotation is included in the logical structure.	UA1:7.18.8-1
If annotations are not required, they may be removed with cpdf -remove-annotations in.pdf -o out.pdf. Alternatively, use -output-annotations-json and -set-annotations-json as described in Chapter 10 to remove one or more specific annotations. Alternatively, edit the tree manually using -extract-struct-tree and -replace-struct-tree from this chapter.
28-018	The appearance stream of a PrinterMark annotation is not marked as Artifact.	UA1:7.18.8-2
If annotations are not required, they may be removed with cpdf -remove-annotations in.pdf -o out.pdf. Alternatively, use -output-annotations-json and -set-annotations-json as described in Chapter 10 to remove one or more specific annotations.
30-001	A reference XObject is present.	UA1:7.2
A reference XObject references a page in another file. It may be cut out manually using the manual remediation instructions above.
30-002	Form XObject contains MCIDs and is referenced more than once.	UA1:7.21.3.1-1
Unlikely to be remediable: the only option is to manually remove them, but this would then result in a tag tree pointing to non-existent MCIDs, which would be another kind of invalidity. Any PDF producer creating Tagged PDF with MCIDs like this is simply broken.
31-001	A Type 0 font dictionary with encoding other than Identity-H and Identity-V has values for Registry in both CIDSystemInfo dictionaries that are not identical.	UA1:7.21.3-1
31-002	A Type 0 font dictionary with encoding other than Identity-H and Identity-V has values for Ordering in both CIDSystemInfo dictionaries that are not identical.	UA1:7.21.3.1-1
31-003	A Type 0 font dictionary with encoding other than Identity-H and Identity-V has a value for Supplement in the CIDSystemInfo dictionary of the CID font that is less than the value for Supplement in the CIDSystemInfo dictionary of the CMap.	UA1:7.21.3.1-1
31-004	A Type 2 CID font contains neither a stream nor the name Identity as the value of the CIDToGIDMap entry.	UA1:7.21.3.2-1
31-005	A Type 2 CID font does not contain a CIDToGIDMap entry.	UA1:7.21.3.2-1
31-006	A CMap is neither listed as described in ISO 32000-1:2008, 9.7.5.2, Table 118 nor is it embedded.	UA1:7.21.3.3-1
31-007	The WMode entry in a CMap dictionary is not identical to the WMode value in the CMap stream.	UA1:7.21.3.3-1
31-008	A CMap references another CMap which is not listed in ISO 32000-1:2008, 9.7.5.2, Table 118.	UA1:7.21.3.3-2
31-009	For a font used by text intended to be rendered the font program is not embedded.	UA1:7.21.4.1-1
31-011	For a font used by text the font program is embedded but it does not contain glyphs for all of the glyphs referenced by the text used for rendering.	UA1:7.21.4.1-3
31-012	The FontDescriptor dictionary of an embedded Type 1 font contains a CharSet string, but at least one of the glyphs present in the font program is not listed in the CharSet string.	UA1:7.21.4.2-1
31-013	The FontDescriptor dictionary of an embedded Type 1 font contains a CharSet string, but at least one of the glyphs listed in the CharSet string is not present in the font program.	UA1:7.21.4.2-2
31-014	The FontDescriptor dictionary of an embedded CID font contains a CIDSet string, but at least one of the glyphs present in the font program is not listed in the CIDSet string.	UA1:7.21.4.2-3
31-015	The FontDescriptor dictionary of an embedded CID font contains a CIDSet string, but at least one of the glyphs listed in the CIDSet string is not present in the font program.	UA1:7.21.4.2-4
31-016	For one or more glyphs, the glyph width information in the font dictionary and in the embedded font program differ by more than 1/1000 unit.	UA1:7.21.5-1
31-017	A non-symbolic TrueType font is used for rendering, but none of the cmap entries in the embedded font program is a non-symbolic cmap.	UA1:7.21.6-1
31-018	A non-symbolic TrueType font is used for rendering, but for at least one glyph to be rendered the glyph cannot be looked up by any of the non-symbolic cmap entries in the embedded font program.	UA1:7.21.6-2
31-019	The font dictionary for a non-symbolic TrueType font does not contain an Encoding entry.	UA1:7.21.6-3
31-020	The font dictionary for a non-symbolic TrueType font contains an Encoding dictionary which does not contain a BaseEncoding entry.	UA1:7.21.6-4
31-021	The value for either the Encoding entry or the BaseEncoding entry in the Encoding dictionary in a non-symbolic TrueType font dictionary is neither MacRomanEncoding nor WinAnsiEncoding.	UA1:7.21.6-5
31-022	The Differences array in the Encoding entry in a non-symbolic TrueType font dictionary contains one or more glyph names which are not listed in the Adobe Glyph List.	UA1:7.21.6-6
31-023	The Differences array is present in the Encoding entry in a non-symbolic TrueType font dictionary but the embedded font program does not contain a (3,1) Microsoft Unicode cmap.	UA1:7.21.6-7
31-024	The Encoding entry is present in the font dictionary for a symbolic TrueType font.	UA1:7.21.6-8
31-025	The embedded font program for a symbolic TrueType font contains no cmap.	UA1:7.21.6-9
31-026	The embedded font program for a symbolic TrueType font contains more than one cmap, but none of the cmap entries is a (3,0) Microsoft Symbol cmap.	UA1:7.21.6-10
31-027	A font dictionary does not contain the ToUnicode entry and none of the following is true: the font uses MacRomanEncoding, MacExpertEncoding or WinAnsiEncoding; the font is a Type 1 or Type 3 font and the glyph names of the glyphs referenced are all contained in the Adobe Glyph List or the set of named characters in the Symbol font, as defined in ISO 32000-1:2008, Annex D; the font is a Type 0 font, and its descendant CIDFont uses Adobe-GB1, Adobe-CNS1, Adobe-Japan1 or Adobe-Korea1 character collections; the font is a non-symbolic TrueType font.	UA1:7.21.7-1
31-028	One or more Unicode values specified in the ToUnicode CMap are zero (0).	UA1:7.21.7-2
31-029	One or more Unicode values specified in the ToUnicode CMap are equal to either U+FEFF or U+FFFE.	UA1:7.21.7-3
31-030	One or more characters used in text showing operators reference the .notdef glyph.	UA1:7.21.8-1
It is possible that reprocessing the file with gs using cpdf in.pdf -gs gs -gs-malformed-force -o out.pdf [-gs-quiet] will correct the fonts.
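
Several of the simpler remediations in the table can be chained. The following sketch applies the commands given above for 06-001, 06-003, 07-001 and 11-006 in turn. It is intended for a single-language file which has no XMP metadata yet. The function name and the intermediate file names are illustrative, and the result should be verified again afterward, since creating metadata may introduce further errors:

```shell
# fix_basic_metadata FILE OUT TITLE LANGUAGE
# Apply, in order: -create-metadata (06-001), -set-title with
# -also-set-xmp (06-003), -display-doc-title true (07-001) and
# -set-language (11-006). Intermediate files are named after OUT
# and removed at the end.
fix_basic_metadata() {
  src=$1
  dst=$2
  title=$3
  lang=$4
  base=${dst%.pdf}
  cpdf -create-metadata "$src" -o "$base-1.pdf" &&
  cpdf -set-title "$title" -also-set-xmp "$base-1.pdf" -o "$base-2.pdf" &&
  cpdf -display-doc-title true "$base-2.pdf" -o "$base-3.pdf" &&
  cpdf -set-language "$lang" "$base-3.pdf" -o "$dst"
  status=$?
  rm -f "$base-1.pdf" "$base-2.pdf" "$base-3.pdf"
  return $status
}
```

For example, fix_basic_metadata in.pdf out.pdf "My title" "en-US".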
