PyPDF2: The New Fork of pyPdf

Today I learned that the pyPdf project is NOT dead, as I had originally thought. In fact, it’s been forked into PyPDF2 (note the slightly different spelling). There’s also a possibility that someone else has taken over the original pyPdf project and is actively working on it. You can follow all that over on reddit if you like. In the meantime, I decided to give PyPDF2 a whirl and see how it differs from the original. Feel free to follow along if you have a free moment or two.

Introducing PyPDF2

I originally wrote about pyPdf over two years ago, and just recently I have been delving deep into the various Python PDF-related libraries, so stumbling onto a new fork of pyPdf was pretty exciting. We’re going to take some of my old examples, run them in the new PyPDF2 and see if they work the same way.

# Merge two PDFs
from PyPDF2 import PdfFileReader, PdfFileWriter

output = PdfFileWriter()
# Raw strings keep the backslashes in Windows paths from being read as escapes
pdfOne = PdfFileReader(open(r"some\path\to\a\PDf", "rb"))
pdfTwo = PdfFileReader(open(r"some\other\path\to\a\PDf", "rb"))

# Take the first page (index 0) of each document
output.addPage(pdfOne.getPage(0))
output.addPage(pdfTwo.getPage(0))

outputStream = open(r"output.pdf", "wb")
output.write(outputStream)
outputStream.close()

That worked perfectly on my Windows 7 box. As you might have guessed, all that code does is create two PdfFileReader objects and read in the first page of each. Next it adds those two pages to our PdfFileWriter. Finally we open a new file and write out our PDF pages. That’s it! You’ve just created a new document from two separate PDFs!
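The same pattern extends from the first page of each file to every page. Here is a minimal sketch, assuming a PyPDF2 release that still ships the camelCase API used in this post (getNumPages, getPage, addPage). The names copy_all_pages and merge_all_pages are hypothetical helpers of mine, not part of the library.

```python
def copy_all_pages(reader, writer):
    """Add every page of `reader` to `writer`, in order; return the page count."""
    count = reader.getNumPages()
    for index in range(count):  # page indexes are zero-based
        writer.addPage(reader.getPage(index))
    return count


def merge_all_pages(input_paths, output_path):
    """Write all pages of each input PDF, in the order given, to output_path."""
    # Imported here so copy_all_pages can be used without PyPDF2 installed.
    # Assumption: the installed PyPDF2 still provides these legacy class names.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    handles = [open(path, "rb") for path in input_paths]
    try:
        for handle in handles:
            copy_all_pages(PdfFileReader(handle), writer)
        # The inputs stay open until the output has been written,
        # in case page data is only read from them at write time.
        with open(output_path, "wb") as output_stream:
            writer.write(output_stream)
    finally:
        for handle in handles:
            handle.close()
```

A call such as merge_all_pages(["first.pdf", "second.pdf"], "output.pdf") (file names made up for illustration) would then produce one document containing both inputs back to back.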

Now let’s try the page rotation script from my other article:

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("document1.pdf", "rb"))
# getPage(1) is the second page, since page indexes start at 0
output.addPage(input1.getPage(1).rotateClockwise(90))
# output.addPage(input1.getPage(2).rotateCounterClockwise(90))

outputStream = open("output.pdf", "wb")
output.write(outputStream)
outputStream.close()

That also worked on my machine. So far so good. My final parity test is to see if it can extract the same data that the original pyPdf could. We’ll try reading the metadata from the latest ReportLab user manual:

>>> from PyPDF2 import PdfFileReader
>>> p = r'C:\Users\mdriscoll\Documents\reportlab-userguide.pdf'
>>> pdf = PdfFileReader(open(p, 'rb'))
>>> pdf.documentInfo
{'/ModDate': u'D:20120629155504', '/CreationDate': u'D:20120629155504', '/Producer': u'GPL Ghostscript 8.15', '/Title': u'reportlab-userguide.pdf', '/Creator': u'Adobe Acrobat 10.1.3', '/Author': u'mdriscoll'}
>>> pdf.getNumPages()
120
>>> info = pdf.getDocumentInfo()
>>> info.author
u'mdriscoll'
>>> info.creator
u'Adobe Acrobat 10.1.3'
>>> info.producer
u'GPL Ghostscript 8.15'
>>> info.title
u'reportlab-userguide.pdf'

That all looks right too, except for the author bit. I’m certainly not the author of that document and I don’t know why it thinks I am. Otherwise, it appears to work correctly. Now let’s find out what’s new!
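As a side note, the /ModDate and /CreationDate values in that output are PDF date strings: a D: prefix followed by year, month, day, hour, minute and second. Here is a minimal sketch for turning one into a datetime, assuming the full 14-digit form shown in the output. The name parse_pdf_date is a hypothetical helper of mine, and any trailing time-zone suffix is simply ignored in this sketch.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Parse a PDF date string such as 'D:20120629155504' into a naive datetime."""
    # Drop the optional 'D:' prefix.
    text = value[2:] if value.startswith("D:") else value
    # Keep only the leading YYYYMMDDHHMMSS digits; a time-zone suffix,
    # if present, is ignored (an assumption made to keep the sketch short).
    return datetime.strptime(text[:14], "%Y%m%d%H%M%S")


print(parse_pdf_date("D:20120629155504"))
# prints 2012-06-29 15:55:04
```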

What’s New in PyPDF2

One of the first things I noticed when looking through the source for PyPDF2 is that it adds a few new methods to PdfFileReader and PdfFileWriter. I also noticed an entirely new module called merger.py, which contains the PdfFileMerger class. Let’s take a look under the covers, since there is no real documentation at the time of this writing. The only new method added to the reader is getOutlines, which retrieves the document outline, if it exists. In the writer, there is support for adding bookmarks and named destinations. Not much, but beggars can’t be choosers.

I think the part I’m most excited about is the new PdfFileMerger class, which reminds me a bit of the dead Stapler project. The PdfFileMerger allows the programmer to merge multiple PDFs into a single PDF via concatenation, slicing, inserting or any combination of the three.

Let’s try that out with a few example scripts, shall we?

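import PyPDF2

merger = PyPDF2.PdfFileMerger()

# Keep both source files open until the merged PDF has been written
with open('path/to/hello.pdf', 'rb') as hello, \
        open('path/to/another.pdf', 'rb') as another:
    merger.merge(position=0, fileobj=another)  # another.pdf goes in first
    merger.merge(position=2, fileobj=hello)    # hello.pdf is inserted at index 2
    with open("test_out.pdf", 'wb') as output:
        merger.write(output)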

What this does is merge two files together. another.pdf is merged first, at position 0. hello.pdf is then inserted at position 2, which means it starts on page 3, because positions are counted from zero. The remaining pages of another.pdf continue on after the insertion. This is a lot easier than iterating over the pages of both documents and putting them together. The merge command has the following signature and docstring, which sums it up pretty well:

>>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True)
        
        Merges the pages from the source document specified by "file" into the output
        file at the page number specified by "position".
        
        Optionally, you may specify a bookmark to be applied at the beginning of the 
        included file by supplying the text of the bookmark in the "bookmark" parameter.
        
        You may prevent the source document's bookmarks from being imported by
        specifying "import_bookmarks" as False.
        
        You may also use the "pages" parameter to merge only the specified range of 
        pages from the source document into the output document.
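To make the position and pages parameters concrete, here is a plain-list model of the insertion. It is only an illustration under assumptions, not PyPDF2’s own code: merge_model is a hypothetical function, the pages are stand-in strings, and pages is assumed to be a (start, stop) tuple of zero-based indexes with stop excluded.

```python
def merge_model(output, position, source, pages=None):
    """Insert pages from `source` into the list `output` at index `position`."""
    if pages is not None:
        start, stop = pages  # assumed form: zero-based, stop excluded
        source = source[start:stop]
    # Slice assignment inserts the new pages without overwriting existing ones.
    output[position:position] = source
    return output


another = ["another-1", "another-2", "another-3"]
hello = ["hello-1", "hello-2"]

result = merge_model([], 0, another)    # like merge(position=0, ...)
result = merge_model(result, 2, hello)  # like merge(position=2, ...)
print(result)
# ['another-1', 'another-2', 'hello-1', 'hello-2', 'another-3']
```

In this model, hello.pdf lands on the third page of the output and the last page of another.pdf follows it.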

There’s also an append method which is identical to the merge command except that it assumes you want to append all the pages onto the end of the PDF. For completeness, here’s an example script:

import PyPDF2

merger = PyPDF2.PdfFileMerger()

# Keep both source files open until the merged PDF has been written
with open('path/to/hello.pdf', 'rb') as hello, \
        open('path/to/another.pdf', 'rb') as another:
    merger.append(fileobj=another)
    merger.append(fileobj=hello)
    with open("test_out2.pdf", 'wb') as output:
        merger.write(output)

That was pretty painless and very nice too!
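When there are more than two inputs, the same append call can run in a loop. Here is a minimal sketch, assuming the PdfFileMerger API shown above. The name concatenate_pdfs is a hypothetical helper of mine, and its optional merger argument exists only so the function can be exercised without real PDFs.

```python
def concatenate_pdfs(input_paths, output_path, merger=None):
    """Append every PDF in input_paths, in order, and write the result."""
    if merger is None:
        # Assumption: the installed PyPDF2 still provides PdfFileMerger.
        from PyPDF2 import PdfFileMerger
        merger = PdfFileMerger()

    handles = []
    try:
        for path in input_paths:
            handle = open(path, "rb")
            handles.append(handle)
            merger.append(fileobj=handle)
        with open(output_path, "wb") as output_stream:
            merger.write(output_stream)
    finally:
        # Close the inputs once writing has finished or failed.
        for handle in handles:
            handle.close()
```

Something like concatenate_pdfs(["a.pdf", "b.pdf", "c.pdf"], "combined.pdf"), with made-up file names, would then write the three documents end to end.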

Wrapping Up

I think I’ve found a good alternative for PDF hacking. I can combine and split PDFs with PyPDF2 more easily than I could with the original pyPdf. I am also hopeful that PyPDF2 will stick around since it has a sponsor paying people to work on it. According to the reddit thread, there is a chance that the original pyPdf will be revived and that the two projects will end up working together. Regardless of what happens, I’m just happy that it’s back under development again and will hopefully stay that way for a while. Let me know your thoughts on the topic too.

I want to create a tool which could merge PDFs and sort, delete and rotate pages. I think pyPdf (or PyPDF2) is quite good for this task, though I also need some tool to render page previews.
It is very important to place page numbers in the final document. How can I achieve this? A-PDF Number Pro and Foxit Phantom are cool but commercial.

Brian

May 23, 2013 at 5:11 am

Has anyone noticed that PyPDF2 is broken? The above merge and append code does not work using the latest master from github. It throws a ‘NameError: global name ‘file’ is not defined’. A great pity IMHO as this library was v. useful.

Marc B. Hankin

June 7, 2013 at 3:25 pm

Does anyone know if it’s possible to bates stamp pdf files using PyPDF2?

Gregala

August 18, 2013 at 6:00 am

Hello!

Thank you for this article. I tried your examples and those given with PyPDF2. Unfortunately, I get the following error:

(“Unable to find ‘endstream’ marker after stream at byte %s.” % utils.hexStr(stream.tell()))
PyPDF2.utils.PdfReadError: Unable to find ‘endstream’ marker after stream at byte 0xa36b5.

This error appears when this line is executed :

output.write(outputStream)

The previous ones don’t give any error.

Do you have an idea?

Thank you for your help!!!

jgmitzen

August 23, 2013 at 11:44 pm

That’s the same as broken IMHO, like new software that only runs on Windows 98.

MayakoLyyn

August 24, 2013 at 8:19 am

I agree, but it’s still functional on Py2.x (I think); it just needs some adjustments to be functional in Py3k.

Noah Huntington

February 28, 2014 at 4:16 pm

How do I install pyPDF2?

Mike Driscoll

February 28, 2014 at 4:34 pm

Download the source from GitHub and decompress it. Then open a terminal, change into the extracted folder, and run python setup.py install

eureka

March 18, 2014 at 6:28 am

index out of range: 1 (at line 39 in utils.py of PyPDF2 v1.15)

when i try this code

import clr
clr.AddReference('System.Drawing')
clr.AddReference('System.Windows.Forms')

from System.Drawing import *
from System.Windows.Forms import *
from PyPDF2 import PdfFileReader

class MyForm(Form):

    def __init__(self):
        # Create child controls and initialize form
        self.Text = "Test Project"
        self.Size = Size(600, 500)

        path = "F:/Download/RealPython.pdf"
        f = open(path)
        inputpdf = PdfFileReader(open(path, "rb"))
        display = inputpdf.getPage(8).extractText()

        display.mediaBox.upperRight = (
            display.mediaBox.getUpperRight_x() / 2,
            display.mediaBox.getUpperRight_y() / 2
        )

Application.EnableVisualStyles()
Application.SetCompatibleTextRenderingDefault(False)

form = MyForm()
Application.Run(form)

March 18, 2014 at 8:19 am

Are you trying to use PyPDF2 with IronPython? I haven’t tried that myself. You should ask the PyPDF team and possibly the IronPython mailing group for help.

Matt

March 18, 2014 at 4:22 pm

Several libraries haven’t been made compatible with Py 3.x; the larger the library, the more work it takes.

Anyway, PyPDF2 is fully compatible with Py 3.x now (at least 3.2 and 3.3) if that cheers anyone up.


bruno

November 16, 2014 at 7:55 am

Hello,
I read a lot of PDF texts. I don’t add popup annotations; I only highlight text in yellow. I want to extract this highlighted text with PyPDF2 so that I can index it with Whoosh for my studies.

I know that with PyPDF2 I can use the PDF keys “/Annot” and “/Subj” to get the “/Rect” that gives me the coordinates of my highlighted text. But I don’t know how to extract the text inside this Rect with PyPDF2: extractText() and getContents() do not accept coordinates, as in extractText(x, y, z, w).
Can somebody suggest a way to solve my little problem?
Thanks for your patience,
Bruno