Chapter 13
Working with Images

cpdf -extract-images in.pdf [<range>] [-im <path>] [-p2p <path>]
     [-dedup | -dedup-perpage] [-raw] -o <path>

cpdf -list-images[-json] in.pdf [<range>]

cpdf -image-resolution[-json] <minimum resolution> in.pdf [<range>]

cpdf -list-images-used[-json] in.pdf [<range>]

cpdf -process-images [-process-images-info] in.pdf [<range>]
     [-im <filename>] [-jbig2enc <filename>]
     [-lossless-resample[-dpi] <n> | -lossless-to-jpeg <n>]
     [-jpeg-to-jpeg <n>] [-1bpp-method <method>]
     [-jbig2-lossy-threshold <n>]
     [-pixel-threshold <n>] [-length-threshold <n>]
     [-percentage-threshold <n>] [-dpi-threshold <n>]
     -o out.pdf

13.1 Extracting images

Cpdf can extract the raster images in a PDF to a given location. JPEG, JPEG2000 and lossless JBIG2 images are extracted directly.

Lossy JBIG2 images are extracted likewise, but an extra __<n> is added, giving the number of the JBIG2Global stream for this image, which is extracted as <n>.jbig2global. You may reconstruct the individual images with, for example, jbig2dec.

Other images are written as PNGs, processed with either ImageMagick’s “magick” command, or NetPBM’s “pnmtopng” program, whichever is installed.

cpdf -extract-images in.pdf [<range>] [-im <path>] [-p2p <path>]
     [-dedup | -dedup-perpage] [-raw] -o <path>

The -im or -p2p option gives the path to the external tool, one of which must be installed. The exception is -raw, which writes only JPEG or plain .pnm files instead.

The output specifier, e.g. -o output/%%%, gives the number format for numbering the images. Output files are named serially from 0, and include the page number too. For example, output files might be called output/000-p1.jpg, output/001-p1.png, output/002-p3.jpg etc. Here is an example invocation:

cpdf -extract-images in.pdf -im magick -o output/%%%

The output directory must already exist. The -dedup option deduplicates images across the whole file; the -dedup-perpage option deduplicates only within each page.
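
As an illustration only, the following Python sketch assembles the command line described in this section and creates the output directory first, since cpdf expects it to exist. The names build_extract_command and ensure_output_dir are hypothetical helpers, not part of cpdf; only flags shown in this section are used.

```python
import os


def ensure_output_dir(out_spec):
    """Create the directory part of an output specifier such as 'output/%%%'.

    Returns the directory name, or '' when the specifier has none.
    """
    directory = os.path.dirname(out_spec)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return directory


def build_extract_command(pdf, out_spec, im=None, p2p=None, raw=False, dedup=None):
    """Return an argument list for cpdf -extract-images (hypothetical helper)."""
    # One external tool is needed unless -raw is used.
    if not (im or p2p or raw):
        raise ValueError("give a path for -im or -p2p, or use raw=True")
    if dedup not in (None, "-dedup", "-dedup-perpage"):
        raise ValueError("dedup must be '-dedup' or '-dedup-perpage'")
    cmd = ["cpdf", "-extract-images", pdf]
    if im:
        cmd += ["-im", im]
    if p2p:
        cmd += ["-p2p", p2p]
    if dedup:
        cmd.append(dedup)
    if raw:
        cmd.append("-raw")
    cmd += ["-o", out_spec]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) after calling ensure_output_dir on the same output specifier.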

13.2 Listing images

The -list-images operation lists all images in the file:

6, 1, /Z_Im0, 3300, 2550, 13432, 1, /DeviceGray, /CCITTFaxDecode
9, 2 13 14 15, /Z_Im0, 3376, 2649, 37972, 1, /DeviceGray, /CCITTFaxDecode
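
Because each line is a simple comma-separated record, it can be split mechanically. The Python sketch below does so for the two sample lines above, using the field order given in this section. The dictionary keys and the name parse_list_images_line are illustrative choices, and the sketch assumes that no field contains a comma, which holds for this sample but is not guaranteed for every PDF.

```python
LIST_FIELDS = ("object", "pages", "name", "width", "height",
               "bytes", "bits", "colourspace", "filter")


def parse_list_images_line(line):
    """Split one line of -list-images output into a dictionary."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != len(LIST_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(LIST_FIELDS), len(parts)))
    record = dict(zip(LIST_FIELDS, parts))
    # Page numbers are separated by spaces within the second field.
    record["pages"] = [int(p) for p in record["pages"].split()]
    for key in ("object", "width", "height", "bytes", "bits"):
        record[key] = int(record[key])
    return record


LIST_SAMPLE = """\
6, 1, /Z_Im0, 3300, 2550, 13432, 1, /DeviceGray, /CCITTFaxDecode
9, 2 13 14 15, /Z_Im0, 3376, 2649, 37972, 1, /DeviceGray, /CCITTFaxDecode
"""

list_records = [parse_list_images_line(l) for l in LIST_SAMPLE.splitlines() if l.strip()]
# Total size of the image data, in bytes, across all listed images.
total_image_bytes = sum(r["bytes"] for r in list_records)
```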

The fields are object number, page numbers, image name, width, height, size in bytes, bits per pixel, colour space, filter (compression method). With -list-images-json, the same information is available in JSON format:

[
  {
    "Object": 6,
    "Pages": [ 1 ],
    "Name": "/Z_Im0",
    "Width": 3300,
    "Height": 2550,
    "Bytes": 13432,
    "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray",
    "Filter": "/CCITTFaxDecode"
  },
  {
    "Object": 9,
    "Pages": [ 2, 13, 14, 15 ],
    "Name": "/Z_Im0",
    "Width": 3376,
    "Height": 2649,
    "Bytes": 37972,
    "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray",
    "Filter": "/CCITTFaxDecode"
  }
]

13.3 Listing images at point of use

To list all images in the given range of pages which fall below a given resolution (in dots-per-inch), use the -image-resolution function:

cpdf -image-resolution 300 in.pdf [<range>]

2, /Im5, 531, 684, 149.935297, 150.138267, 31
2, /Im6, 184, 164, 149.999988, 150.458710, 39
2, /Im7, 171, 156, 149.999996, 150.579145, 40
2, /Im9, 65, 91, 149.999986, 151.071856, 57
2, /Im10, 94, 60, 149.999990, 152.284285, 59
2, /Im15, 184, 139, 149.960011, 150.672060, 91
4, /Im29, 53, 48, 149.970749, 151.616446, 93

The format is page number, image name, x pixels, y pixels, x resolution, y resolution, object number. The resolutions refer to the image’s effective resolution at point of use (taking account of scaling, rotation etc).

The information is also available in JSON format:

[
  {
    "Object": 240,
    "Page": 79,
    "XObject": "/Z_Im0",
    "W": 3326,
    "H": 2584,
    "Xdpi": 300.0,
    "Ydpi": 300.0
  },
  {
    "Object": 243,
    "Page": 80,
    "XObject": "/Z_Im0",
    "W": 3300,
    "H": 2550,
    "Xdpi": 300.0,
    "Ydpi": 300.0
  }
]

To list all images regardless of resolution, use -list-images-used or -list-images-used-json instead.
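
The text output of -image-resolution can be post-processed in the same spirit. The Python sketch below parses the sample lines from this section, groups image names by page, and estimates the displayed size of an image by dividing pixels by resolution. The function names are hypothetical, and the size estimate assumes an unrotated image, so treat it as a rough guide only.

```python
from collections import defaultdict


def parse_resolution_line(line):
    """Split one line of -image-resolution output into a dictionary."""
    page, name, w, h, xdpi, ydpi, obj = [p.strip() for p in line.split(",")]
    return {"page": int(page), "name": name, "w": int(w), "h": int(h),
            "xdpi": float(xdpi), "ydpi": float(ydpi), "object": int(obj)}


def size_in_inches(record):
    """Displayed size implied by pixels / dpi (assumes no rotation)."""
    return record["w"] / record["xdpi"], record["h"] / record["ydpi"]


def names_by_page(records):
    """Map each page number to the image names reported on it."""
    pages = defaultdict(list)
    for r in records:
        pages[r["page"]].append(r["name"])
    return dict(pages)


RES_SAMPLE = """\
2, /Im5, 531, 684, 149.935297, 150.138267, 31
2, /Im6, 184, 164, 149.999988, 150.458710, 39
2, /Im7, 171, 156, 149.999996, 150.579145, 40
2, /Im9, 65, 91, 149.999986, 151.071856, 57
2, /Im10, 94, 60, 149.999990, 152.284285, 59
2, /Im15, 184, 139, 149.960011, 150.672060, 91
4, /Im29, 53, 48, 149.970749, 151.616446, 93
"""

res_records = [parse_resolution_line(l) for l in RES_SAMPLE.splitlines() if l.strip()]
```

For the first sample line, size_in_inches gives a width of roughly 3.54 inches, that is 531 pixels spread over that width at about 150 dots per inch.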

13.4 Removing an Image

To remove a particular image, find its name using -list-images then apply the -draft and -draft-remove-only operations from Section 19.1.

13.5 Processing Images

Cpdf can process images within a PDF, replacing the original with the processed version. It does this by saving out the image data, putting it through an external process, and then reading it back in and re-inserting it. This is typically used to reduce the size of image data, and thus the size of the PDF.

There are a number of options to deal with lossy (e.g. JPEG) and lossless images, one or more of which must be specified. For example, the -jpeg-to-jpeg option processes existing JPEG images to a given JPEG quality level:

cpdf -process-images -im magick -jpeg-to-jpeg 65 in.pdf -o out.pdf

ImageMagick is required; use -im to give the path to it. If we specify -process-images-info too, we can see the work being done:

cpdf -process-images -process-images-info -jpeg-to-jpeg 65
     -im magick in.pdf -o out.pdf

Here is sample output:

(20/344) Object 265 (JPEG)... JPEG to JPEG 40798 -> 33463 (82%)
(38/344) Object 278 (JPEG)... JPEG to JPEG 4382 -> 3482 (79%)
(87/344) Object 266 (JPEG)... JPEG to JPEG 37227 -> 30199 (81%)
(243/344) Object 209 (JPEG)... no size reduction
(246/344) Object 270 (JPEG)... JPEG to JPEG 202568 -> 191175 (94%)
(281/344) Object 280 (JPEG)... JPEG to JPEG 12255 -> 9825 (80%)
(312/344) Object 279 (JPEG)... JPEG to JPEG 4117 -> 3157 (76%)

Similar output appears for the other methods, when they are specified. Each line shows a counter of the work being done, and the result for the image chosen for processing.
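
To total the savings reported by -process-images-info, the lines can be parsed with a regular expression. The Python sketch below assumes the line layout of the sample output above and nothing more; the names parse_info_line and bytes_saved are hypothetical, and lines for other methods may need a different pattern.

```python
import re

# "(20/344) Object 265 (JPEG)... <outcome>"
INFO_LINE = re.compile(r"\((\d+)/(\d+)\) Object (\d+) \((\w+)\)\.\.\. (.*)")
# "40798 -> 33463 (82%)" inside the outcome text
INFO_RESULT = re.compile(r"(\d+) -> (\d+) \((\d+)%\)")


def parse_info_line(line):
    """Parse one progress line; return None if it does not match."""
    m = INFO_LINE.match(line.strip())
    if m is None:
        return None
    index, total, obj, kind, outcome = m.groups()
    record = {"index": int(index), "total": int(total), "object": int(obj),
              "kind": kind, "before": None, "after": None}
    r = INFO_RESULT.search(outcome)
    if r:  # absent for lines such as "no size reduction"
        record["before"] = int(r.group(1))
        record["after"] = int(r.group(2))
    return record


def bytes_saved(records):
    """Sum the size reductions over all images that were replaced."""
    return sum(r["before"] - r["after"] for r in records if r["before"] is not None)


INFO_SAMPLE = """\
(20/344) Object 265 (JPEG)... JPEG to JPEG 40798 -> 33463 (82%)
(38/344) Object 278 (JPEG)... JPEG to JPEG 4382 -> 3482 (79%)
(87/344) Object 266 (JPEG)... JPEG to JPEG 37227 -> 30199 (81%)
(243/344) Object 209 (JPEG)... no size reduction
(246/344) Object 270 (JPEG)... JPEG to JPEG 202568 -> 191175 (94%)
(281/344) Object 280 (JPEG)... JPEG to JPEG 12255 -> 9825 (80%)
(312/344) Object 279 (JPEG)... JPEG to JPEG 4117 -> 3157 (76%)
"""

info_records = [parse_info_line(l) for l in INFO_SAMPLE.splitlines() if l.strip()]
```

Applied to the sample, bytes_saved(info_records) sums the six reductions and skips object 209, which was left unchanged.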

The -lossless-to-jpeg option converts lossless images within PDFs to JPEG too, at the given quality level. It may be specified in addition to -jpeg-to-jpeg:

cpdf -process-images -jpeg-to-jpeg 65 -lossless-to-jpeg 80
     -im magick in.pdf -o out.pdf

Images are only processed if they meet certain thresholds. Changes to the default thresholds may be specified:

 Option                  Effect                                  Default
 -pixel-threshold        Images below this number of pixels      25
                         not processed
 -length-threshold       Images with less than this number of    100
                         bytes of data not processed
 -percentage-threshold   Results not below this percentage of    99
                         original size discarded
 -dpi-threshold          Only images above this threshold at     (no dpi check)
                         all use points processed
Instead of compressing lossless images with lossy JPEG compression, we can resample losslessly:

cpdf -process-images -im magick -lossless-resample 80 in.pdf -o out.pdf

This will resample losslessly-compressed images to contain 80 percent of the original pixels. By default, there will be no interpolation. To use interpolation, which may result in slightly larger data, add -resample-interpolate. To use a DPI target, use -lossless-resample-dpi instead:

cpdf -process-images -im magick -lossless-resample-dpi 300
     in.pdf -o out.pdf

The methods so far introduced do not operate on 1 bit per pixel data. Different compression mechanisms are typically in use, and we need a different approach. The -1bpp-method option specifies what to do with losslessly compressed 1 bit-per-pixel images.

 Method       Effect
 JBIG2Lossy   Lossy JBIG2, sharing JBIG2Globals data amongst all images.

These options require the jbig2enc program, whose location may be specified with -jbig2enc. For lossy JBIG2, the threshold for similarity of data may be set with -jbig2-lossy-threshold. For example:

cpdf -process-images -jbig2enc jbig2enc -1bpp-method JBIG2Lossy
     -jbig2-lossy-threshold 75 in.pdf -o out.pdf

It is not currently possible to reprocess lossless JBIG2 into lossy JBIG2, nor is it possible to recompress into CCITT.

NB: CMYK images will be converted to RGB or left untouched by some of these processes. A future version of cpdf will remove this limitation.