cpdf -extract-images in.pdf [<range>] [-im <path>] [-p2p <path>]
[-dedup | -dedup-perpage] [-raw] -o <path>
cpdf -list-images[-json] in.pdf [<range>]
cpdf -image-resolution[-json] <minimum resolution> in.pdf [<range>]
cpdf -list-images-used[-json] in.pdf [<range>]
cpdf -process-images [-process-images-info] in.pdf [<range>]
[-im <filename>] [-jbig2enc <filename>]
[-lossless-resample[-dpi] <n> | -lossless-to-jpeg <n>]
[-jpeg-to-jpeg <n>] [-jpeg-to-jpeg-scale <n>]
[-jpeg-to-jpeg-dpi <n>] [-1bpp-method <method>]
[-jbig2-lossy-threshold <n>] [-pixel-threshold <n>]
[-length-threshold <n>] [-percentage-threshold <n>]
[-dpi-threshold <n>] [-resample-interpolate]
-o out.pdf
cpdf -rasterize in.pdf <range> -o out.pdf
[-rasterize[-gray|-1bpp|-jpeg|-jpeggray]]
[-rasterize-res <n>] [-rasterize-jpeg-quality <n>]
[-rasterize-no-antialias | -rasterize-downsample]
[-rasterize-annots]
cpdf -output-image in.pdf <range> -o <format>
[-rasterize[-gray|-1bpp|-jpeg|-jpeggray]]
[-rasterize-res <n>] [-rasterize-jpeg-quality <n>]
[-rasterize-no-antialias | -rasterize-downsample]
[-rasterize-annots] [-tobox <BoxName>]
Cpdf can extract the raster images in a PDF to a given location. JPEG, JPEG2000 and lossless JBIG2 images are extracted directly.
Lossy JBIG2 images are extracted likewise, but an extra __<n> is added, giving the number of the JBIG2Global stream for this image, which is extracted as <n>.jbig2global. You may reconstruct the individual images with, for example, jbig2dec.
Other images are written as PNGs, processed with either ImageMagick’s “magick” command, or NetPBM’s “pnmtopng” program, whichever is installed.
cpdf -extract-images in.pdf [<range>] [-im <path>] [-p2p <path>]
[-dedup | -dedup-perpage] [-raw] -o <path>
The -im or -p2p option is used to give the path to the external tool, one of which must be installed (unless -raw is added, which outputs instead just JPEG or plain .pnm files).
The output specifier, e.g. -o output/%%%, gives the number format for numbering the images.
Output files are named serially from 0, and include the page number too. For example, output files might be called output/000-p1.jpg, output/001-p1.png, output/002-p3.jpg etc. Here is an example invocation:
cpdf -extract-images in.pdf -im magick -o output/%%%
The output directory must already exist. The -dedup option deduplicates images across the whole file; the -dedup-perpage option deduplicates only within each page.
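To post-process the extracted files, the naming scheme described above can be parsed back into a serial number and a page number. The following is a minimal sketch, assuming file names follow exactly the pattern of the examples (serial number, then -p and the page number, then an extension). The function names are hypothetical, and names that carry the extra __<n> part used for lossy JBIG2 images are simply skipped.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed pattern, taken from the example names: <serial>-p<page>.<extension>
NAME_RE = re.compile(r"^(\d+)-p(\d+)\.(\w+)$")

def parse_extracted_name(path):
    """Return (serial, page, extension), or None if the name does not match."""
    m = NAME_RE.match(PurePosixPath(path).name)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

def group_by_page(paths):
    """Map each page number to the list of extracted files found on it."""
    pages = defaultdict(list)
    for path in paths:
        parsed = parse_extracted_name(path)
        if parsed is not None:
            pages[parsed[1]].append(path)
    return dict(pages)

files = ["output/000-p1.jpg", "output/001-p1.png", "output/002-p3.jpg"]
print(group_by_page(files))
```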
The -list-images operation lists all images in the file:
6, 1, /Z_Im0, 3300, 2550, 13432, 1, /DeviceGray, /CCITTFaxDecode
9, 2 13 14 15, /Z_Im0, 3376, 2649, 37972, 1, /DeviceGray, /CCITTFaxDecode
The fields are object number, page numbers, image name, width, height, size in bytes, bits per pixel, colour space, filter (compression method). With -list-images-json, the same information is available in JSON format:
[
  {
    "Object": 6,
    "Pages": [ 1 ],
    "Name": "/Z_Im0",
    "Width": 3300,
    "Height": 2550,
    "Bytes": 13432,
    "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray",
    "Filter": "/CCITTFaxDecode"
  },
  {
    "Object": 9,
    "Pages": [ 2, 13, 14, 15 ],
    "Name": "/Z_Im0",
    "Width": 3376,
    "Height": 2649,
    "Bytes": 37972,
    "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray",
    "Filter": "/CCITTFaxDecode"
  }
]
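The JSON form is convenient for scripting. As a minimal sketch, the Python below totals the image bytes per filter and finds images used on more than one page. It uses only the keys shown in the sample above; the function names are hypothetical, and the sample data is embedded rather than read from a real cpdf run.

```python
import json
from collections import defaultdict

# Sample data copied from the -list-images-json output shown above.
SAMPLE = """
[ { "Object": 6, "Pages": [ 1 ], "Name": "/Z_Im0", "Width": 3300,
    "Height": 2550, "Bytes": 13432, "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray", "Filter": "/CCITTFaxDecode" },
  { "Object": 9, "Pages": [ 2, 13, 14, 15 ], "Name": "/Z_Im0", "Width": 3376,
    "Height": 2649, "Bytes": 37972, "BitsPerComponent": 1,
    "Colourspace": "/DeviceGray", "Filter": "/CCITTFaxDecode" } ]
"""

def bytes_by_filter(images):
    """Total the stored size of images for each compression filter."""
    totals = defaultdict(int)
    for image in images:
        totals[image["Filter"]] += image["Bytes"]
    return dict(totals)

def shared_images(images):
    """Object numbers of images that appear on more than one page."""
    return [image["Object"] for image in images if len(image["Pages"]) > 1]

images = json.loads(SAMPLE)
print(bytes_by_filter(images))
print(shared_images(images))
```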
To list all images in the given range of pages which fall below a given resolution (in dots-per-inch), use the -image-resolution operation:
cpdf -image-resolution 300 in.pdf [<range>]
Here is the result:
2, /Im5, 531, 684, 149.935297, 150.138267, 31
2, /Im6, 184, 164, 149.999988, 150.458710, 39
2, /Im7, 171, 156, 149.999996, 150.579145, 40
2, /Im9, 65, 91, 149.999986, 151.071856, 57
2, /Im10, 94, 60, 149.999990, 152.284285, 59
2, /Im15, 184, 139, 149.960011, 150.672060, 91
4, /Im29, 53, 48, 149.970749, 151.616446, 93
The format is page number, image name, x pixels, y pixels, x resolution, y resolution, object number. The resolutions refer to the image’s effective resolution at point of use (taking account of scaling, rotation etc).
The information is also available in JSON format:
[
  {
    "Object": 240,
    "Page": 79,
    "XObject": "/Z_Im0",
    "W": 3326,
    "H": 2584,
    "Xdpi": 300.0,
    "Ydpi": 300.0
  },
  {
    "Object": 243,
    "Page": 80,
    "XObject": "/Z_Im0",
    "W": 3300,
    "H": 2550,
    "Xdpi": 300.0,
    "Ydpi": 300.0
  }
]
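The effective resolution ties an image's pixel dimensions to the size at which it is drawn on the page. As a rough sketch, assuming the default PDF unit of 1/72 inch and an image drawn without rotation or skew, the displayed size can be recovered from the reported figures like this. The function names are illustrative only and are not part of Cpdf.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch

def displayed_size_points(pixels, dpi):
    """Size in points at which `pixels` pixels are drawn at `dpi`."""
    return pixels / dpi * POINTS_PER_INCH

def effective_dpi(pixels, size_points):
    """Inverse relation: resolution of `pixels` pixels drawn over `size_points`."""
    return pixels * POINTS_PER_INCH / size_points

# Record copied from the JSON sample above (object 243 on page 80).
record = {"Object": 243, "Page": 80, "XObject": "/Z_Im0",
          "W": 3300, "H": 2550, "Xdpi": 300.0, "Ydpi": 300.0}

width_pt = displayed_size_points(record["W"], record["Xdpi"])
height_pt = displayed_size_points(record["H"], record["Ydpi"])
print(width_pt, height_pt)
```

For this record the image covers 11 by 8.5 inches, which suggests a full-page scan.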
To list all images regardless of resolution, use -list-images-used or -list-images-used-json instead.
To remove a particular image, find its name using -list-images then apply the -draft and -draft-remove-only operations from Section 20.1.
Cpdf can process images within a PDF, replacing the original with the processed version. It does this by saving out the image data, putting it through an external process, and then reading it back in and re-inserting it. This is typically used to reduce the size of image data, and thus the size of the PDF.
There are a number of options to deal with lossy (e.g. JPEG) and lossless images, one or more of which must be specified. For example, the -jpeg-to-jpeg option reprocesses existing JPEG images at a given JPEG quality level:
cpdf -process-images -im magick -jpeg-to-jpeg 65 in.pdf -o out.pdf
ImageMagick is required. Use -im to supply it. If we specify -process-images-info too, we can see the work being done:
cpdf -process-images -process-images-info -jpeg-to-jpeg 65
-im magick in.pdf -o out.pdf
Here is sample output:
(20/344) Object 265 (JPEG)... JPEG to JPEG 40798 -> 33463 (82%)
(38/344) Object 278 (JPEG)... JPEG to JPEG 4382 -> 3482 (79%)
(87/344) Object 266 (JPEG)... JPEG to JPEG 37227 -> 30199 (81%)
(243/344) Object 209 (JPEG)... no size reduction
(246/344) Object 270 (JPEG)... JPEG to JPEG 202568 -> 191175 (94%)
(281/344) Object 280 (JPEG)... JPEG to JPEG 12255 -> 9825 (80%)
(312/344) Object 279 (JPEG)... JPEG to JPEG 4117 -> 3157 (76%)
Similar output appears for the other methods, when they are specified. You can see the counter of work being done, and the result for each image chosen for processing.
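To get an overall figure from this per-image report, the lines can be totalled with a short script. This is a sketch only: it assumes the line layout matches the sample output shown above, which may differ for other methods or other versions of Cpdf, and the function name is hypothetical.

```python
import re

# Sample lines copied from the -process-images-info output shown above.
SAMPLE = """\
(20/344) Object 265 (JPEG)... JPEG to JPEG 40798 -> 33463 (82%)
(38/344) Object 278 (JPEG)... JPEG to JPEG 4382 -> 3482 (79%)
(87/344) Object 266 (JPEG)... JPEG to JPEG 37227 -> 30199 (81%)
(243/344) Object 209 (JPEG)... no size reduction
(246/344) Object 270 (JPEG)... JPEG to JPEG 202568 -> 191175 (94%)
(281/344) Object 280 (JPEG)... JPEG to JPEG 12255 -> 9825 (80%)
(312/344) Object 279 (JPEG)... JPEG to JPEG 4117 -> 3157 (76%)
"""

# Assumed layout: "(i/n) Object <num> (<kind>)... <result>"
LINE_RE = re.compile(r"^\((\d+)/(\d+)\) Object (\d+) \(([^)]+)\)\.\.\. (.*)$")
SIZES_RE = re.compile(r"(\d+) -> (\d+) \((\d+)%\)$")

def summarize(lines):
    """Return (bytes before, bytes after, number of images left unchanged)."""
    before = after = unchanged = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a progress line
        sizes = SIZES_RE.search(m.group(5))
        if sizes is None:
            unchanged += 1  # e.g. "no size reduction"
            continue
        before += int(sizes.group(1))
        after += int(sizes.group(2))
    return before, after, unchanged

print(summarize(SAMPLE.splitlines()))
```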
The -lossless-to-jpeg option converts lossless images within PDFs to JPEG too, at the given quality level. It may be specified in addition to -jpeg-to-jpeg:
cpdf -process-images -jpeg-to-jpeg 65 -lossless-to-jpeg 80
-im magick in.pdf -o out.pdf
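When driving Cpdf from a script, it can help to build the argument list programmatically. The helper below is a hypothetical sketch, not part of Cpdf: it assembles an argument list from the options introduced so far and does not run anything itself.

```python
def process_images_argv(in_pdf, out_pdf, im="magick",
                        jpeg_quality=None, lossless_jpeg_quality=None,
                        info=False):
    """Build an argument list for cpdf -process-images (hypothetical helper)."""
    argv = ["cpdf", "-process-images"]
    if info:
        argv.append("-process-images-info")
    if jpeg_quality is not None:
        argv += ["-jpeg-to-jpeg", str(jpeg_quality)]
    if lossless_jpeg_quality is not None:
        argv += ["-lossless-to-jpeg", str(lossless_jpeg_quality)]
    # -im gives the path to ImageMagick, as in the examples above.
    argv += ["-im", im, in_pdf, "-o", out_pdf]
    return argv

argv = process_images_argv("in.pdf", "out.pdf",
                           jpeg_quality=65, lossless_jpeg_quality=80)
print(" ".join(argv))
```

The resulting list could be passed to subprocess.run(argv, check=True) on a system where cpdf and ImageMagick are installed.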
Images are only processed if they meet certain thresholds. Changes to the default thresholds may be specified:
Option | Effect | Default value |
-pixel-threshold | Images below this number of pixels not processed | 25 |
-length-threshold | Images with less than this number of bytes of data not processed | 100 |
-percentage-threshold | Results not below this percentage of original size discarded | 99 |
-dpi-threshold | Only images above this threshold at all use points processed | (no dpi check) |
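One possible reading of the threshold table, expressed as code, is sketched below. This is an interpretation for illustration and not Cpdf's actual implementation: in particular, treating the pixel count as width times height and the exact behaviour at the boundary values are assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Thresholds:
    pixel: int = 25              # -pixel-threshold default
    length: int = 100            # -length-threshold default
    percentage: float = 99.0     # -percentage-threshold default
    dpi: Optional[float] = None  # -dpi-threshold: no check by default

def is_candidate(t, width, height, length, min_dpi_at_use=None):
    """Decide whether an image would be picked for processing."""
    if width * height < t.pixel:   # assumed: pixel count is width * height
        return False
    if length < t.length:          # too few bytes of data
        return False
    if t.dpi is not None:
        # Must be above the threshold at every use point, so test the minimum.
        if min_dpi_at_use is None or min_dpi_at_use <= t.dpi:
            return False
    return True

def keep_result(t, original_length, new_length):
    """Keep a result only if it is below the percentage of the original size."""
    return new_length * 100.0 / original_length < t.percentage

defaults = Thresholds()
print(is_candidate(defaults, 3300, 2550, 13432))
print(keep_result(defaults, 40798, 33463))
```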
Instead of compressing lossless images with lossy JPEG compression, we can resample losslessly:
cpdf -process-images -im magick -lossless-resample 80 in.pdf -o out.pdf
This will resample losslessly-compressed images to 80 percent of the original width and height. By default, there will be no interpolation. To use interpolation, which may result in slightly larger data, add -resample-interpolate. To give a DPI target rather than a percentage, use -lossless-resample-dpi instead:
cpdf -process-images -im magick -lossless-resample-dpi 300
in.pdf -o out.pdf
We can also use resampling with -jpeg-to-jpeg, by specifying -jpeg-to-jpeg-scale:
cpdf -process-images -im magick -jpeg-to-jpeg 70 -jpeg-to-jpeg-scale 50
in.pdf -o out.pdf
We can alternatively use a DPI target:
cpdf -process-images -im magick -jpeg-to-jpeg 70 -jpeg-to-jpeg-dpi 150
in.pdf -o out.pdf
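The arithmetic behind percentage and DPI targets can be sketched as follows. This is an illustration under stated assumptions, not a description of Cpdf's internals: it assumes dimensions are rounded to the nearest pixel, and that a DPI target scales the image by the ratio of target to current effective resolution. The function names are hypothetical.

```python
def scaled_size(width, height, percent):
    """Pixel dimensions after scaling both axes to `percent` of the original."""
    factor = percent / 100.0
    return round(width * factor), round(height * factor)

def size_for_dpi(width, height, current_dpi, target_dpi):
    """Pixel dimensions needed to reach `target_dpi` at the same displayed size."""
    factor = target_dpi / current_dpi  # assumed: uniform scale by the DPI ratio
    return round(width * factor), round(height * factor)

# A 3300 x 2550 image at an effective 300 dpi:
print(scaled_size(3300, 2550, 50))
print(size_for_dpi(3300, 2550, 300.0, 150.0))
```

Under these assumptions, a scale of 50 percent and a DPI target of 150 give the same result for an image whose effective resolution is 300 dpi.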
The methods so far introduced do not operate on 1 bit per pixel data. Different compression mechanisms are typically in use, and we need a different approach. The -1bpp-method option specifies what to do with losslessly compressed 1 bit-per-pixel images.
Method | Effect |
JBIG2 | Lossless JBIG2 |
JBIG2Lossy | Lossy JBIG2, sharing JBIG2Globals data amongst all images. |
These options require the jbig2enc program, whose location may be specified with -jbig2enc. For lossy JBIG2, the threshold for similarity of data may be set with -jbig2-lossy-threshold. For example:
cpdf -process-images -jbig2enc jbig2enc -1bpp-method JBIG2Lossy
-jbig2-lossy-threshold 75 in.pdf -o out.pdf
It is not currently possible to reprocess lossless JBIG2 into lossy JBIG2, nor is it possible to recompress into CCITT.
NB: Some of these processes will convert CMYK images to RGB, or leave them untouched. A future version of Cpdf will remove this limitation.
Cpdf can send individual pages of a PDF out to gs to rasterize them. They are then read back in, replacing the original page content:
cpdf -gs gs -rasterize in.pdf -o out.pdf
Other metadata (for example, bookmarks) is preserved. By default, the resolution is 144dpi, and the raster data is losslessly compressed. It is the Crop Box which is rasterized, or the Media Box if absent. The following options may be added:
Option | Effect |
-rasterize-gray | Use grayscale instead of colour |
-rasterize-1bpp | Use monochrome instead of colour |
-rasterize-jpeg | Use JPEG instead of lossless compression |
-rasterize-jpeggray | Use grayscale JPEG instead of lossless compression |
-rasterize-jpeg-quality | Set JPEG image quality (0..100) |
-rasterize-res | Set the resolution |
-rasterize-annots | Rasterize annotations instead of retaining |
-rasterize-no-antialias | Turn off antialiasing |
-rasterize-downsample | Use better but slower antialiasing |
-gs-quiet | Don’t show gs output |
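The resolution determines how many pixels each rasterized page contains. As a rough guide, assuming the default PDF unit of 1/72 inch and rounding to the nearest pixel (the exact rounding used by gs may differ), the pixel dimensions can be estimated from the box size in points. The function name is hypothetical.

```python
POINTS_PER_INCH = 72.0  # default PDF user-space unit is 1/72 inch

def raster_size(width_pt, height_pt, res=144):
    """Approximate pixel dimensions of a box rasterized at `res` dpi."""
    return (round(width_pt / POINTS_PER_INCH * res),
            round(height_pt / POINTS_PER_INCH * res))

# US Letter (612 x 792 points) at the default 144 dpi and at 300 dpi:
print(raster_size(612, 792))
print(raster_size(612, 792, res=300))
```

Doubling the resolution quadruples the pixel count, so higher values of -rasterize-res can increase output size considerably.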
In addition to rasterization of pages, we can export them in PNG or JPEG format, again by the use of gs:
cpdf -gs gs -output-image in.pdf 10-end -o image%%%.png
This will extract pages 10 onwards to the files image000.png, image001.png and so on. All the options above apply, and in addition we can choose which box is rasterized:
Option | Effect |
-tobox | Choose rasterization box |
For example:
cpdf -gs gs -output-image -tobox /BleedBox -rasterize-jpeg in.pdf
-o image%%%.jpeg
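To predict the output file names that a number format such as image%%%.png produces, the expansion can be imitated in a few lines. This is a sketch covering only the case shown in this chapter, where a run of % characters is replaced by a zero-padded number; Cpdf may support other specifiers, which are not handled here, and the function name is hypothetical.

```python
import re

def expand_number_format(spec, n):
    """Replace the first run of % characters in `spec` with n, zero-padded."""
    def pad(match):
        return str(n).zfill(len(match.group(0)))
    return re.sub(r"%+", pad, spec, count=1)

for n in range(3):
    print(expand_number_format("image%%%.png", n))
```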