PDF (Portable Document Format) documents are a handy way to present text and images to others, knowing they’ll look the same no matter what word processor or operating system they use. Basically, they’re a snapshot of a document, so saving images from them can be a hassle, even if your viewer lets you right-click them and save them as files. There are a few programs around that can do this for you, but it’s actually much easier and faster to do it from the command line.

The pdfimages command is part of poppler-utils, which should already be installed on your system (if it isn’t, run sudo apt-get install poppler-utils in the terminal). To extract the images from a PDF, just open a terminal in the folder with the document, and run a command like the following:

pdfimages -j Cool-Pix-of-2011.pdf cool2011

Note that when extracting from files with spaces in the name, you will need to enclose the filename in single quotes. Eg:

pdfimages -j 'Cool Pix of 2011.pdf' cool2011

The text at the end of the command is the prefix that each extracted image’s filename will begin with, so the resulting files will be named cool2011-000.jpg onwards (note that numbering starts at 000, not 001). If you’d prefer to have spaces in the target names, for example to mirror the name of the original PDF, then enclose the prefix in single quotes too. In the example below, the prefix 'Cool Pix of 2011 ' ends with a space, just to provide a bit more separation between '2011' and the hyphen preceding the automatic numbering; this is of course optional. Eg:

pdfimages -j 'Cool Pix of 2011.pdf' 'Cool Pix of 2011 '

Your pictures will now be extracted into the current folder, with names running from Cool Pix of 2011 -000.jpg onwards.
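If you have a whole folder of PDFs, the same idea can be wrapped in a small loop. This is only a rough sketch: extract_all is a made-up helper name, and it assumes you want each PDF’s own name (minus the .pdf) as the prefix.

```shell
#!/bin/sh
# Sketch only: extract_all is a hypothetical helper, not part of poppler-utils.
# It runs pdfimages -j on every PDF in a folder, using each PDF's own name
# (without the .pdf extension) as the prefix for the extracted images.
extract_all() {
    dir=${1:-.}                      # folder to scan; defaults to the current one
    for pdf in "$dir"/*.pdf; do
        # If the folder has no PDFs, the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        prefix=${pdf%.pdf}           # strip the trailing .pdf
        pdfimages -j "$pdf" "$prefix"
    done
}
```

Running extract_all . in a folder containing 'Cool Pix of 2011.pdf' should give files named Cool Pix of 2011-000.jpg onwards. The quotes around "$pdf" and "$prefix" are what keep names with spaces in one piece.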

Also, the -j option saves the images in the .jpg format; without it they are saved in .ppm (Portable Pixmap) format, which is uncompressed, so each file can easily be over a megabyte. (Strictly speaking, -j only writes .jpg files for images that are stored in the PDF as JPEG data; any others are still written as .ppm, or .pbm for monochrome images.) This can mean, for example, that an 18MB document with 120 images can extract to 154MB of files, whereas exporting them as .jpg ends up with a total of 18MB, just like the original document. Of course, if you’d prefer to save them as .ppm images, simply leave out the -j option.
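If you’re curious how much difference -j makes for one of your own documents, something like the following should show it. It’s just a sketch: compare_sizes is a hypothetical helper name, and it assumes mktemp and du are available, as they are on a typical Ubuntu system.

```shell
#!/bin/sh
# Sketch only: compare_sizes is a hypothetical helper.
# It extracts the same PDF twice into a temporary folder, once without -j
# and once with it, then prints the total size of each result.
compare_sizes() {
    pdf=$1
    workdir=$(mktemp -d)
    mkdir "$workdir/ppm" "$workdir/jpg"
    pdfimages "$pdf" "$workdir/ppm/img"       # default output (.ppm/.pbm)
    pdfimages -j "$pdf" "$workdir/jpg/img"    # JPEG data saved as .jpg
    du -sh "$workdir/ppm" "$workdir/jpg"      # one summary line per folder
    rm -rf "$workdir"                         # tidy up the temporary copies
}
```

For example, compare_sizes 'Cool Pix of 2011.pdf' prints two lines, one total for each folder. The actual figures will depend entirely on the document.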

If you would like to include the page numbering in the file names, add the -p option. Eg:

pdfimages -j -p 'Cool Pix of 2011.pdf' 'Cool Pix of 2011 '

Lastly, don’t worry if you see the following in the terminal for each image being extracted:

Error (18468081): Missing ‘endstream’
Error: Unknown operator ‘endstream’
Error: Unknown operator ‘endobj’

You shouldn’t see that with every PDF you try to extract from, but even when you do you should find the target images have been created without issue.
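If those messages bother you, one option is to hide them and simply check that files were created. The sketch below assumes the messages are written to standard error (which is where programs normally send them), and extract_quietly is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: extract_quietly is a hypothetical helper.
# Assumption: pdfimages prints its error messages on standard error,
# so redirecting stderr to /dev/null hides them.
extract_quietly() {
    pdf=$1
    prefix=$2
    pdfimages -j "$pdf" "$prefix" 2>/dev/null
    # Collect whatever files now start with the prefix followed by a hyphen.
    set -- "$prefix"-*
    if [ -e "$1" ]; then
        echo "Extracted $# file(s)"
    else
        echo "No images were extracted" >&2
        return 1
    fi
}
```

Usage would be along the lines of extract_quietly 'Cool Pix of 2011.pdf' 'Cool Pix of 2011 '. Bear in mind this hides every error message, including ones you might actually want to see, and the count includes any older files in the folder that already match the prefix.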

Extra Notes:

For more options for this command, run pdfimages -? (or man pdfimages). For example, you can specify a first and last page with the -f and -l options, but personally I find it easier to just extract the whole document and delete any images I don’t want afterwards. If you need to specify a password, you will find the options for that there too (-opw for the owner password, -upw for the user password).
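As a quick illustration of the page options, here is a small sketch that limits extraction to a range of pages using -f (first page) and -l (last page). The extract_pages name is just a hypothetical wrapper for the example.

```shell
#!/bin/sh
# Sketch only: extract_pages is a hypothetical wrapper around pdfimages.
# -f sets the first page to scan and -l the last page.
extract_pages() {
    pdf=$1
    prefix=$2
    first=$3
    last=$4
    pdfimages -j -f "$first" -l "$last" "$pdf" "$prefix"
}
```

So extract_pages 'Cool Pix of 2011.pdf' 'Cool Pix of 2011 ' 3 5 would only pull images from pages 3 to 5. For a password-protected document, you would add -upw followed by the password to the pdfimages line.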


