Manipulating PDFs with Python and pyPdf

There’s a handy 3rd party module called pyPdf out there that you can use to merge PDFs documents together, rotate pages, split and crop pages, and decrypt/encrypt PDF documents. In this article, we’ll take a look at a few of these functions and then create a simple GUI with wxPython that will allow us to merge a couple of PDFs.

A pyPdf Tour

To get the most out of pyPdf, you need to learn its two major functions: PdfFileReader and PdfFileWriter. They are the ones that I use the most. Let’s take a look at what they can do:

# Merge two PDFs
from pyPdf import PdfFileReader, PdfFileWriter

output = PdfFileWriter()
pdfOne = PdfFileReader(file( "some\path\to\a\PDf", "rb"))
pdfTwo = PdfFileReader(file("some\other\path\to\a\PDf", "rb"))

output.addPage(pdfOne.getPage(0))
output.addPage(pdfTwo.getPage(0))

outputStream = file(r"output.pdf", "wb")
output.write(outputStream)
outputStream.close()

The code above will open two PDFs, take the first page from each and create a third PDF by joining those two pages. Note that the pages are zero-based, so zero is page one, one is page two, etc. I use this sort of script to extract one or more pages from a PDF or to join a series of PDFs together. For example, sometimes my users will receive a bunch of scanned documents where each page of the document ends up in a separate PDF. They need all the single pages joined into one PDF and pyPdf allows me to do this very simply.

There aren’t many applications for the rotating functions that pyPdf gives me in my day-to-day experience, but maybe you have lots of users that scan documents in landscape rather than portrait and you end up needing to do a lot of turning. Fortunately, this is pretty painless:

from pyPdf import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
output.addPage(input1.getPage(1).rotateClockwise(90))
# output.addPage(input1.getPage(2).rotateCounterClockwise(90))

outputStream = file("output.pdf", "wb")
output.write(outputStream)
outputStream.close()

The code above is just taken from the pyPdf documentation and shortened for this example. As you can see, the method to call is rotateClockwise or rotateCounterClockwise. Make sure you pass the method the number of degrees to turn the page as well.

Now let’s see what information we can pull from the PDF about itself:

>>> from pyPdf import PdfFileReader
>>> p = r'E:\My Documents\My Dropbox\ebooks\helloWorld book.pdf'
>>> pdf = PdfFileReader(file(p, 'rb'))
>>> pdf.documentInfo
{'/CreationDate': u'D:20090323080712Z', '/Author': u'Warren Sande', '/Producer': u'Acrobat Distiller 8.0.0 (Windows)', '/Creator': u'FrameMaker 8.0', '/ModDate': u"D:20090401124817-04'00'", '/Title': u'Hello World!'}
>>> pdf.getNumPages()
432
>>> info = pdf.getDocumentInfo()
>>> info.author
u'Warren Sande'
>>> info.creator
u'FrameMaker 8.0'
>>> info.producer
u'Acrobat Distiller 8.0.0 (Windows)'
>>> info.title
u'Hello World!'

As you can see, we can gather quite a bit of useful data using pyPdf. Now let’s create a simple GUI to make merging two PDFs easier!

Creating a wxPython PDF Merger Application

I came up with this script yesterday. It uses Tim Golden’s winshell module to give me easy access to the user’s desktop on Windows. If you’re on Linux or Mac, then you’ll want to change this part of the code for your platform. I wanted to use wxPython’s StandardPaths module, but it didn’t have a function for getting the desktop that I could see. Anyway, enough talk. Let’s look at some code!

import os
import pyPdf
import winshell
import wx

class MyFileDropTarget(wx.FileDropTarget):
    def __init__(self, window):
        wx.FileDropTarget.__init__(self)
        self.window = window

    def OnDropFiles(self, x, y, filenames):
        self.window.SetInsertionPointEnd()
        
        for file in filenames:
            self.window.WriteText(file)

########################################################################
class JoinerPanel(wx.Panel):
    """"""

    #----------------------------------------------------------------------
    def __init__(self, parent):
        """Constructor"""
        wx.Panel.__init__(self, parent=parent)
        
        self.currentPath = winshell.desktop()
        
        
        lblSize = (70,-1)
        pdfLblOne = wx.StaticText(self, label="PDF One:", size=lblSize)
        self.pdfOne = wx.TextCtrl(self)
        dt = MyFileDropTarget(self.pdfOne)
        self.pdfOne.SetDropTarget(dt)
        pdfOneBtn = wx.Button(self, label="Browse", name="pdfOneBtn")
        pdfOneBtn.Bind(wx.EVT_BUTTON, self.onBrowse)
        
        pdfLblTwo = wx.StaticText(self, label="PDF Two:", size=lblSize)
        self.pdfTwo = wx.TextCtrl(self)
        dt = MyFileDropTarget(self.pdfTwo)
        self.pdfTwo.SetDropTarget(dt)
        pdfTwoBtn = wx.Button(self, label="Browse", name="pdfTwoBtn")
        pdfTwoBtn.Bind(wx.EVT_BUTTON, self.onBrowse)
        
        outputLbl = wx.StaticText(self, label="Output name:", size=lblSize)
        self.outputPdf = wx.TextCtrl(self)
        widgets = [(pdfLblOne, self.pdfOne, pdfOneBtn),
                   (pdfLblTwo, self.pdfTwo, pdfTwoBtn),
                   (outputLbl, self.outputPdf)]
        
        joinBtn = wx.Button(self, label="Join PDFs")
        joinBtn.Bind(wx.EVT_BUTTON, self.onJoinPdfs)
        
        self.mainSizer = wx.BoxSizer(wx.VERTICAL)
        for widget in widgets:
            self.buildRows(widget)
        self.mainSizer.Add(joinBtn, 0, wx.ALL|wx.CENTER, 5)
        self.SetSizer(self.mainSizer)
        
    #----------------------------------------------------------------------
    def buildRows(self, widgets):
        """"""
        sizer = wx.BoxSizer(wx.HORIZONTAL)
        for widget in widgets:
            if isinstance(widget, wx.StaticText):
                sizer.Add(widget, 0, wx.ALL|wx.CENTER, 5)
            elif isinstance(widget, wx.TextCtrl):
                sizer.Add(widget, 1, wx.ALL|wx.EXPAND, 5)
            else:
                sizer.Add(widget, 0, wx.ALL, 5)
        self.mainSizer.Add(sizer, 0, wx.EXPAND)
        
    #----------------------------------------------------------------------
    def onBrowse(self, event):
        """
        Browse for PDFs
        """
        widget = event.GetEventObject()
        name = widget.GetName()
        
        wildcard = "PDF (*.pdf)|*.pdf"
        dlg = wx.FileDialog(
            self, message="Choose a file",
            defaultDir=self.currentPath, 
            defaultFile="",
            wildcard=wildcard,
            style=wx.OPEN | wx.CHANGE_DIR
            )
        if dlg.ShowModal() == wx.ID_OK:
            path = dlg.GetPath()
            if name == "pdfOneBtn":
                self.pdfOne.SetValue(path)
            else:
                self.pdfTwo.SetValue(path)
            self.currentPath = os.path.dirname(path)
        dlg.Destroy()
        
    #----------------------------------------------------------------------
    def onJoinPdfs(self, event):
        """
        Join the two PDFs together and save the result to the desktop
        """
        pdfOne = self.pdfOne.GetValue()
        pdfTwo = self.pdfTwo.GetValue()
        if not os.path.exists(pdfOne):
            msg = "The PDF at %s does not exist!" % pdfOne
            dlg = wx.MessageDialog(None, msg, 'Error', wx.OK|wx.ICON_EXCLAMATION)
            dlg.ShowModal()
            dlg.Destroy()
            return
        if not os.path.exists(pdfTwo):
            msg = "The PDF at %s does not exist!" % pdfTwo
            dlg = wx.MessageDialog(None, msg, 'Error', wx.OK|wx.ICON_EXCLAMATION)
            dlg.ShowModal()
            dlg.Destroy()
            return
        
        outputPath = os.path.join(winshell.desktop(), self.outputPdf.GetValue()) + ".pdf"
        output = pyPdf.PdfFileWriter()

        pdfOne = pyPdf.PdfFileReader(file(pdfOne, "rb"))
        for page in range(pdfOne.getNumPages()):
            output.addPage(pdfOne.getPage(page))
        pdfTwo = pyPdf.PdfFileReader(file(pdfTwo, "rb"))
        for page in range(pdfTwo.getNumPages()):
            output.addPage(pdfTwo.getPage(page))

        outputStream = file(outputPath, "wb")
        output.write(outputStream)
        outputStream.close()
        
        msg = "PDF was save to " + outputPath
        dlg = wx.MessageDialog(None, msg, 'PDF Created', wx.OK|wx.ICON_INFORMATION)
        dlg.ShowModal()
        dlg.Destroy()
        
        self.pdfOne.SetValue("")
        self.pdfTwo.SetValue("")
        self.outputPdf.SetValue("")
        
########################################################################
class JoinerFrame(wx.Frame):
 
    #----------------------------------------------------------------------
    def __init__(self):
        wx.Frame.__init__(self, None, wx.ID_ANY, 
                          "PDF Joiner", size=(550, 200))
        panel = JoinerPanel(self)
        
#----------------------------------------------------------------------
# Run the program
if __name__ == "__main__":
    app = wx.App(False)
    frame = JoinerFrame()
    frame.Show()
    app.MainLoop()

You’ll notice that the frame and panel have dopey names. Feel free to change them. This is Python after all! The MyFileDropTarget class is used to enable drag-and-drop functionality on the first two text controls. If the user wants, they may drag a PDF onto one of those text controls and the path will magically get inserted. Otherwise, they can use the Browse button(s) to find the PDF(s) of their choice. Once they have their PDFs chosen, they need to enter the name of the output PDF, which goes in the third text control. You don’t even need to append the “.pdf” extension since this application will do it for you.

The last step is to press the “Join PDFs” button. This will add the second PDF to the end of the first, then display a dialog if it was created successfully. For this to function correctly, we need to loop over all the pages of each PDF and add them to the output in order.

It would be fun to enhance this application with some kind of drag-and-drop interface where the user could see the PDF pages and rearrange them by dragging them around. Or just being able to specify a few pages from each PDF to add rather than joining them in their entirety. I’ll leave those as exercises for the reader though.

I hope you found this article interesting and that it will spark your creative juices. If you write something cool with these ideas, be sure to post a comment so I can see it!

Irmen de Jong

May 16, 2010 at 12:00 am

Hello Mike. This is interesting stuff. I've been doing PDF manipulation myself a couple of years ago, in Java. We were splitting and joining and adding overlays (you know, “DRAFT” / “TOP SECRET” that sort of stuff), and barcodes. If only we could have used that code above back then 🙂

monobot

May 17, 2010 at 8:32 am

Awesome article!!many thanks.
Just a Question is there any “pdfviewer” for python?
On the application im designing i make some pdf's with platypus that i want the user to take a look before printing … and i would like to show them from python and not making the system pdf viewer to pop up.

driscollis

May 17, 2010 at 11:48 am

To my knowledge, no. But it may just be that I haven't found one yet. I have been thinking about using some other program to convert the pages to images and then display those though.

– Mike

A pyPdf Tour

Creating a wxPython PDF Merger Application

3 thoughts on “Manipulating PDFs with Python and pyPdf”