Chapter 15
PDF and JSON

cpdf in.pdf -output-json -o out.json
     [-output-json-parse-content-streams]
     [-output-json-no-stream-data]

Write the PDF as JSON data. Each object is written as a pair of its object number and its contents. Object number zero holds the trailer dictionary, and negative object numbers are reserved for future format expansion. A stream is written as a two-element array of its dictionary and its data, as object 2 shows in this example of the output for a small PDF:

[ [ 0,  
    { "/Size": 4, "/Root": 4,  
      "/ID": ["<elided>", "<elided>"] } ],  
  [ 3,  
    { "/Type": "/Page", "/Parent": 1,  
      "/Resources": { "/Font": { "/F0": { "/Type": "/Font",  
                                          "/Subtype": "/Type1",  
                                          "/BaseFont": "/Times-Italic" } } },  
      "/MediaBox": [ 0, 0, 595.275591, 841.889764 ], "/Rotate": 0,  
      "/Contents": [ 2 ] } ],  
  [ 4, { "/Type": "/Catalog", "/Pages": 1 } ],  
  [ 1, { "/Type": "/Pages", "/Kids": [ 3 ], "/Count": 1 } ],  
  [ 2,  
    [ { "/Length": 49 },  
      "1 0 0 1 50 770 cm BT/F0 36 Tf(Hello, World!)Tj ET" ] ] ]
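
As a minimal sketch of how this output might be consumed (this is not part of cpdf), the Python below loads the example with the standard json module and indexes each object by its number. The helper name index_objects is hypothetical, the sample is embedded as a string with the /ID values left elided as above, and the code assumes that references appear as bare object numbers, as they do in this example.

```python
import json

# The example output from above, embedded as a string for illustration.
SAMPLE = r'''
[ [ 0, { "/Size": 4, "/Root": 4, "/ID": ["<elided>", "<elided>"] } ],
  [ 3, { "/Type": "/Page", "/Parent": 1,
         "/Resources": { "/Font": { "/F0": { "/Type": "/Font",
                                             "/Subtype": "/Type1",
                                             "/BaseFont": "/Times-Italic" } } },
         "/MediaBox": [ 0, 0, 595.275591, 841.889764 ], "/Rotate": 0,
         "/Contents": [ 2 ] } ],
  [ 4, { "/Type": "/Catalog", "/Pages": 1 } ],
  [ 1, { "/Type": "/Pages", "/Kids": [ 3 ], "/Count": 1 } ],
  [ 2, [ { "/Length": 49 },
         "1 0 0 1 50 770 cm BT/F0 36 Tf(Hello, World!)Tj ET" ] ] ]
'''


def index_objects(data):
    """Map object number -> object from a list of [number, object] pairs.

    Hypothetical helper, not a cpdf or pycpdf function.
    """
    return {number: obj for number, obj in data}


objects = index_objects(json.loads(SAMPLE))

# Object zero is the trailer dictionary; follow it down to the first page.
trailer = objects[0]
catalog = objects[trailer["/Root"]]
pages = objects[catalog["/Pages"]]
first_page = objects[pages["/Kids"][0]]

# A stream is a two-element array: its dictionary, then its data.
stream_dict, stream_data = objects[first_page["/Contents"][0]]

print(first_page["/MediaBox"])
print(stream_data)
```

Indexing by object number first means the order in which objects appear in the file (0, 3, 4, 1, 2 in the example) does not matter to the lookup.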

The option -output-json-parse-content-streams also converts content streams to JSON, so the content stream in our example is expanded into a list of operations, each written as its operands followed by the operator name:

[ [ 1.000000, 0.000000, 0.000000, 1.000000, 50.000000, 770.000000,  
          "cm" ],  
        [ "BT" ], [ "/F0", 36.000000, "Tf" ], [ "Hello, World!", "Tj" ],  
        [ "ET" ] ] ] ] ]
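
The parsed form is easy to walk programmatically. The sketch below, which assumes every operation is laid out as in the example (operands first, operator name last), collects the strings shown by Tj operations. The helper name shown_text is hypothetical, and other text-showing operators are deliberately not handled.

```python
import json

# The parsed content stream from the example above: a list of operations,
# each a list of operands followed by the operator name.
PARSED = json.loads('''
[ [ 1.000000, 0.000000, 0.000000, 1.000000, 50.000000, 770.000000, "cm" ],
  [ "BT" ], [ "/F0", 36.000000, "Tf" ], [ "Hello, World!", "Tj" ],
  [ "ET" ] ]
''')


def shown_text(operations):
    """Return the string operand of each Tj operation.

    Hypothetical helper. It assumes a Tj operation has exactly one
    operand and ignores every other operator.
    """
    return [op[0] for op in operations if len(op) == 2 and op[-1] == "Tj"]


print(shown_text(PARSED))
```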

The option -output-json-no-stream-data instead omits the stream data, producing much smaller JSON files.
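
For scripted use, the command line above can be assembled from Python. This is only a sketch: the function names cpdf_json_args and run_cpdf_json are hypothetical, the flags are exactly those listed in the synopsis, and running it assumes a cpdf executable is available on the PATH.

```python
import subprocess


def cpdf_json_args(in_pdf, out_json, parse_content=False, no_stream_data=False):
    """Build the argument list for the synopsis at the top of this chapter.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    args = ["cpdf", in_pdf, "-output-json", "-o", out_json]
    if parse_content:
        args.append("-output-json-parse-content-streams")
    if no_stream_data:
        args.append("-output-json-no-stream-data")
    return args


def run_cpdf_json(in_pdf, out_json, parse_content=False, no_stream_data=False):
    """Run cpdf with those arguments, raising if it exits with an error."""
    subprocess.run(
        cpdf_json_args(in_pdf, out_json, parse_content, no_stream_data),
        check=True,
    )


# Show the command that would be run, without invoking cpdf.
print(" ".join(cpdf_json_args("in.pdf", "out.json", no_stream_data=True)))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with filenames that contain spaces.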

Python Interface

 
 
def outputJSON(filename, parse_content, no_stream_data, pdf): 
    """Output a PDF in JSON format to the given filename. If parse_content is 
    True, page content is parsed. If no_stream_data is True, all stream data is 
    suppressed entirely."""