PyPDF2: The New Fork of pyPdf

Today I learned that the pyPdf project is NOT dead, as I had originally thought. In fact, it's been forked into PyPDF2 (note the slightly different spelling). There's also a possibility that someone else has taken over the original pyPdf project and is actively working on it. You can follow all that over on reddit if you like. In the meantime, I decided to give PyPDF2 a whirl and see how it differs from the original. Feel free to follow along if you have a free moment or two.

Introducing PyPDF2

I originally wrote about pyPdf over two years ago, and just recently I have been delving deep into the various PDF-related Python libraries, so stumbling onto a new fork of pyPdf was pretty exciting. We're going to take some of my old examples, run them in the new PyPDF2 and see if they work the same way.

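# Merge two PDFs
from PyPDF2 import PdfFileReader, PdfFileWriter

output = PdfFileWriter()
# Placeholder input paths; point them at your own PDFs
pdfOne = PdfFileReader(open("path/to/first.pdf", "rb"))
pdfTwo = PdfFileReader(open("path/to/second.pdf", "rb"))
# Copy the first page of each document into the writer
output.addPage(pdfOne.getPage(0))
output.addPage(pdfTwo.getPage(0))

outputStream = open(r"output.pdf", "wb")
output.write(outputStream)
outputStream.close()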


That worked perfectly on my Windows 7 box. As you might have guessed, all that code does is create two PdfFileReader objects and read in the first page of each. Next it adds those two pages to our PdfFileWriter. Finally we open a new file and write out our PDF pages. That's it! You've just created a new document from two separate PDFs!

Now let's try the page rotation script from my other article:

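from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
# Placeholder input path; point it at your own PDF
input1 = PdfFileReader(open("path/to/document1.pdf", "rb"))
# Rotate a page (here, the first one) 90 degrees clockwise and add it
output.addPage(input1.getPage(0).rotateClockwise(90))

outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()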


That also worked on my machine. So far so good. My final test of parity is to see if it can extract the same data that the original pyPdf could. We'll try reading the metadata from the latest Reportlab user manual:

>>> from PyPDF2 import PdfFileReader
>>> p = r'C:\Users\mdriscoll\Documents\reportlab-userguide.pdf'
>>> pdf = PdfFileReader(open(p, 'rb'))
>>> pdf.documentInfo
{'/ModDate': u'D:20120629155504', '/CreationDate': u'D:20120629155504', '/Producer': u'GPL Ghostscript 8.15', '/Title': u'reportlab-userguide.pdf', '/Creator': u'Adobe Acrobat 10.1.3', '/Author': u'mdriscoll'}
>>> pdf.getNumPages()
120
>>> info = pdf.getDocumentInfo()
>>> info.author
u'mdriscoll'
>>> info.creator
u'Adobe Acrobat 10.1.3'
>>> info.producer
u'GPL Ghostscript 8.15'
>>> info.title
u'reportlab-userguide.pdf'


That all looks right too, except for the author bit. I'm certainly not the author of that document and I don't know why it thinks I am. Otherwise, it appears to work correctly. Now let's find out what's new!
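One aside on the metadata above: the /ModDate and /CreationDate values are PDF date strings of the form D:YYYYMMDDHHmmSS. Here is a minimal sketch for turning one into a Python datetime. The parse_pdf_date helper is my own hypothetical name, not something PyPDF2 provides, and it assumes there is no timezone suffix worth keeping.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Convert a PDF date string such as 'D:20120629155504' to a datetime.

    Hypothetical helper, not part of PyPDF2. Assumption: the string holds
    at least YYYYMMDDHHmmSS; any trailing timezone suffix is ignored.
    """
    if value.startswith("D:"):
        value = value[2:]
    # Keep only the first 14 characters: YYYYMMDDHHmmSS
    return datetime.strptime(value[:14], "%Y%m%d%H%M%S")


mod_date = parse_pdf_date("D:20120629155504")
print(mod_date.isoformat())  # 2012-06-29T15:55:04
```

With that in hand, the metadata dates can be compared or formatted like any other datetime.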

What's New in PyPDF2

One of the first things I noticed when looking through the source for PyPDF2 is that it has added a few new methods to PdfFileReader and PdfFileWriter. I also noticed that there's an entirely new module called merger.py, which contains the PdfFileMerger class. Let's take a look under the covers, since there is no real documentation at the time of this writing. The only new method added to the reader is getOutlines, which retrieves the document outline, if it exists. In the writer, there is support for adding bookmarks and named destinations. Not much, but beggars can't be choosers. I think the part I'm most excited about is the new PdfFileMerger class, which reminds me a bit of the dead Stapler project. PdfFileMerger allows the programmer to merge multiple PDFs into a single PDF via concatenation, slicing, inserting or any combination of the three.

Let's try that out with a few example scripts, shall we?

import PyPDF2

path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')

merger = PyPDF2.PdfFileMerger()

merger.merge(position=0, fileobj=path2)
merger.merge(position=2, fileobj=path)
merger.write(open("test_out.pdf", 'wb'))


What this does is merge two files together. The pages of another.pdf go in first, at position 0. Then hello.pdf is inserted at position 2, which means its pages start on page 3 of the output (positions are zero-based, hence the off-by-one), and the rest of another.pdf, if any, continues on after the insertion. This is a lot easier than iterating over the pages of both documents and putting them together. The merge command has the following signature and docstring, which sums it up pretty well:

>>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True)

Merges the pages from the source document specified by "file" into the output
file at the page number specified by "position".

Optionally, you may specify a bookmark to be applied at the beginning of the
included file by supplying the text of the bookmark in the "bookmark" parameter.

You may prevent the source document's bookmarks from being imported by
specifying "import_bookmarks" as False.

You may also use the "pages" parameter to merge only the specified range of
pages from the source document into the output document.
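To make the position semantics concrete, here is a rough sketch that models the merger's output as a plain Python list of page labels. It is not PyPDF2 code. The merge function below is a hypothetical stand-in, the page counts (four pages and two pages) are made up, and I'm assuming that "pages" can be given as a (start, stop) tuple and that positions are zero-based.

```python
def merge(pages, position, new_pages, page_range=None):
    """Insert new_pages into pages at a zero-based position.

    Hypothetical list-based model, not the PyPDF2 implementation.
    page_range is an optional (start, stop) tuple selecting a slice of
    new_pages, loosely mirroring the "pages" parameter in the docstring.
    """
    if page_range is not None:
        start, stop = page_range
        new_pages = new_pages[start:stop]
    # Slice assignment inserts without overwriting existing pages
    pages[position:position] = new_pages
    return pages


# Made-up page labels standing in for the two files in the script above
another = ["another-1", "another-2", "another-3", "another-4"]
hello = ["hello-1", "hello-2"]

output = []
merge(output, 0, another)  # like merger.merge(position=0, fileobj=path2)
merge(output, 2, hello)    # like merger.merge(position=2, fileobj=path)
print(output)
# ['another-1', 'another-2', 'hello-1', 'hello-2', 'another-3', 'another-4']

# Merging only a range of pages from the source
excerpt = merge([], 0, another, page_range=(1, 3))
print(excerpt)  # ['another-2', 'another-3']
```

In this model, position=2 lands the inserted pages on page 3 of the output, which matches the off-by-one noted earlier.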


There's also an append method which is identical to the merge command except that it assumes you want to append all the pages onto the end of the PDF. For completeness, here's an example script:

import PyPDF2

path = open('path/to/hello.pdf', 'rb')
path2 = open('path/to/another.pdf', 'rb')

merger = PyPDF2.PdfFileMerger()

merger.append(fileobj=path2)
merger.append(fileobj=path)
merger.write(open("test_out2.pdf", 'wb'))


That was pretty painless and very nice too!
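If it helps to see the resulting page order, here is a tiny sketch that treats each document as a list of page labels. Again, this is not PyPDF2 code: the append function is a hypothetical stand-in and the page counts are made up. The only assumption is the one stated above, that appending means inserting at the end.

```python
def append(pages, new_pages):
    """Add new_pages to the end of pages (hypothetical list-based model)."""
    # Appending is the same as inserting at position len(pages)
    pages[len(pages):len(pages)] = new_pages
    return pages


# Made-up page labels standing in for the two files in the script above
another = ["another-1", "another-2", "another-3"]
hello = ["hello-1", "hello-2"]

result = []
append(result, another)  # like merger.append(fileobj=path2)
append(result, hello)    # like merger.append(fileobj=path)
print(result)
# ['another-1', 'another-2', 'another-3', 'hello-1', 'hello-2']
```

So in the script above, test_out2.pdf should contain all of another.pdf followed by all of hello.pdf.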

Wrapping Up

I think I've found a good alternative for PDF hacking. I can combine and split PDFs with PyPDF2 more easily than I could with the original pyPdf. I am also hopeful that PyPDF2 will stick around since it has a sponsor paying people to work on it. According to the reddit thread, there is a chance that the original pyPdf will be revived and the two projects may end up working together. Regardless of what happens, I'm just happy that it's back under development again and will hopefully stay that way for a while. Let me know your thoughts on the topic too.