A Quick Intro to pdfrw - Mouse Vs Python

I’m always on the lookout for Python PDF libraries and I happened to stumble across pdfrw the other day. It looks like a replacement to pyPDF in that it can read and write PDFs, join PDFs and can use Reportlab for concatenation and watermarking, among other things. The project also appears slightly dead in that its last update was in 2011, but then again, pyPDF’s last update was in 2010, so it’s a little fresher. In this article, we’ll take a little test drive of pdfrw and see if it’s useful or not. Come and join the fun!

A note on installation: Sadly there is no setup.py script, so you’ll have to check it out of Google Code and just copy the pdfrw folder to site-packages or your virtualenv.

Joining PDFs Together with pdfrw

Joining two PDF files together into one is actually very simple with pdfrw. See below:

from pdfrw import PdfReader, PdfWriter

pages = PdfReader(r'C:\Users\mdriscoll\Desktop\1.pdf', decompress=False).pages
other_pages = PdfReader(r'C:\Users\mdriscoll\Desktop\2.pdf', decompress=False).pages

writer = PdfWriter()
writer.addpages(pages)
writer.addpages(other_pages)
writer.write(r'C:\Users\mdriscoll\Desktop\out.pdf')

What I find interesting is that you can also metadata to the file by doing something like this before you write it out:

writer.trailer.Info = IndirectPdfDict(
    Title = 'My Awesome PDF',
    Author = 'Mike',
    Subject = 'Python Rules!',
    Creator = 'myscript.py',
)

There’s also an included example that shows how to combine PDFs using pdfrw and reportlab. I’ll just reproduce it here:

# http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
import sys
import os

from reportlab.pdfgen.canvas import Canvas

import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl


def go(inpfn, firstpage, lastpage):
    firstpage, lastpage = int(firstpage), int(lastpage)
    outfn = 'subset_%s_to_%s.%s' % (firstpage, lastpage, os.path.basename(inpfn))

    pages = PdfReader(inpfn, decompress=False).pages
    pages = [pagexobj(x) for x in pages[firstpage-1:lastpage]]
    canvas = Canvas(outfn)

    for page in pages:
        canvas.setPageSize(tuple(page.BBox[2:]))
        canvas.doForm(makerl(canvas, page))
        canvas.showPage()

    canvas.save()

if __name__ == '__main__':
    inpfn, firstpage, lastpage = sys.argv[1:]
    go(inpfn, firstpage, lastpage)

I just thought that was really cool. It gives you a couple of alternatives to pyPDF’s writer anyway. There are lots of other interesting examples included with the package, including

How to to use a pdf (page one) as the background for all other pages together with platypus.
How to add a watermark

I think the project has potential. Hopefully we can generate enough interest to kickstart this project again or maybe get something new off the ground.

Pingback: Visto nel Web – 35 « Ok, panico

brogers

August 1, 2012 at 11:15 pm

I’m glad to hear about this lib. I tried your first example and it seems the out.pdf is written to the console and not a file. Any ideas?

#!python

import sys
import os

from pdfrw import PdfReader, PdfWriter
Â
pages = PdfReader(r’one.pdf’, decompress=False).pages
other_pages = PdfReader(r’two.pdf’, decompress=False).pages
Â
writer = PdfWriter()
writer.addpages(pages)
writer.addpages(other_pages)
writer.write(r’three.pdf’)

driscollis

August 2, 2012 at 8:28 am

The only different I see with your code is that you’re not passing an absolute path to the write method. Try doing that and see if that works. I just retried my code and it still works on Windows 7 with Python 2.6.6