A Quick Intro to pdfrw

Posted by Mike on July 7th, 2012 filed in Python

I’m always on the lookout for Python PDF libraries, and I happened to stumble across pdfrw the other day. It looks like a replacement for pyPDF in that it can read, write and join PDFs, and it can work with Reportlab for concatenation and watermarking, among other things. The project also appears slightly dead in that its last update was in 2011, but then again, pyPDF’s last update was in 2010, so pdfrw is a little fresher. In this article, we’ll take a little test drive of pdfrw and see if it’s useful or not. Come and join the fun!

A note on installation: Sadly there is no setup.py script, so you’ll have to check it out of Google Code and just copy the pdfrw folder to site-packages or your virtualenv.

Joining PDFs Together with pdfrw

Joining two PDF files together into one is actually very simple with pdfrw. See below:

from pdfrw import PdfReader, PdfWriter
 
pages = PdfReader(r'C:\Users\mdriscoll\Desktop\1.pdf', decompress=False).pages
other_pages = PdfReader(r'C:\Users\mdriscoll\Desktop\2.pdf', decompress=False).pages
 
writer = PdfWriter()
writer.addpages(pages)
writer.addpages(other_pages)
writer.write(r'C:\Users\mdriscoll\Desktop\out.pdf')
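
The same pattern extends to any number of input files. Here is a minimal sketch of that idea. Note that merge_pdfs is a hypothetical helper name of my own, not something pdfrw ships, and it only relies on the PdfReader, PdfWriter, addpages and write calls shown above. The import sits inside the function so the module still loads on a machine where pdfrw isn't installed.

```python
def merge_pdfs(input_paths, output_path):
    """Concatenate the pages of every PDF in input_paths into output_path.

    merge_pdfs is a hypothetical helper for illustration; it is not part of pdfrw.
    """
    # Imported here so that merely loading this module does not require pdfrw.
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        # Same calls as the two-file example, repeated once per input file.
        writer.addpages(PdfReader(path, decompress=False).pages)
    writer.write(output_path)
```

Calling merge_pdfs(['1.pdf', '2.pdf', '3.pdf'], 'out.pdf') should then behave like the two-file example, just with one more input.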

What I find interesting is that you can also add metadata to the file by doing something like this before you write it out:

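# IndirectPdfDict has to be imported alongside PdfReader and PdfWriter
from pdfrw import IndirectPdfDict

# Set the document information dictionary before calling writer.write()
writer.trailer.Info = IndirectPdfDict(
    Title = 'My Awesome PDF',
    Author = 'Mike',
    Subject = 'Python Rules!',
    Creator = 'myscript.py',
)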
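
If some of those metadata values are optional in your script, you might want to leave out the empty ones instead of writing blank entries. The following is only a sketch: info_fields is a hypothetical helper of my own, not a pdfrw function, and it just builds a plain dictionary that can be unpacked into IndirectPdfDict the same way the keyword arguments are passed above.

```python
def info_fields(title=None, author=None, subject=None, creator=None):
    """Return a dict holding only the metadata fields that were supplied.

    info_fields is a hypothetical helper for illustration; it is not part of pdfrw.
    """
    candidates = {
        'Title': title,
        'Author': author,
        'Subject': subject,
        'Creator': creator,
    }
    # Drop anything that is None or an empty string.
    return dict((key, value) for key, value in candidates.items() if value)


# Assumed usage with the writer from the earlier example:
# writer.trailer.Info = IndirectPdfDict(**info_fields(title='My Awesome PDF', author='Mike'))
```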

There’s also an included example that shows how to combine PDFs using pdfrw and reportlab. I’ll just reproduce it here:

# http://code.google.com/p/pdfrw/source/browse/trunk/examples/rl1/subset.py
import sys
import os
 
from reportlab.pdfgen.canvas import Canvas
 
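# find_pdfrw is a helper module that ships alongside the pdfrw examples;
# it should not be needed if pdfrw is already importable on your system
import find_pdfrw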
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
 
 
def go(inpfn, firstpage, lastpage):
    firstpage, lastpage = int(firstpage), int(lastpage)
    outfn = 'subset_%s_to_%s.%s' % (firstpage, lastpage, os.path.basename(inpfn))
 
    pages = PdfReader(inpfn, decompress=False).pages
    pages = [pagexobj(x) for x in pages[firstpage-1:lastpage]]
    canvas = Canvas(outfn)
 
    for page in pages:
        canvas.setPageSize(tuple(page.BBox[2:]))
        canvas.doForm(makerl(canvas, page))
        canvas.showPage()
 
    canvas.save()
 
if __name__ == '__main__':
    inpfn, firstpage, lastpage = sys.argv[1:]
    go(inpfn, firstpage, lastpage)
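
One detail in that example worth spelling out is pages[firstpage-1:lastpage]. The script takes page numbers the way a person would give them (starting at 1, with the last page included), while Python slices start at 0 and exclude the end. Here is a small sketch of that conversion with some validation added. The name page_range_to_slice is hypothetical, and a plain list stands in for the pages that PdfReader would return.

```python
def page_range_to_slice(firstpage, lastpage, total_pages):
    """Convert a 1-based, inclusive page range into a Python slice.

    page_range_to_slice is a hypothetical helper for illustration only.
    """
    if not 1 <= firstpage <= lastpage <= total_pages:
        raise ValueError(
            'expected 1 <= firstpage <= lastpage <= %d' % total_pages)
    # Shift the start down by one; the end stays as-is because
    # slice ends are exclusive.
    return slice(firstpage - 1, lastpage)


# A plain list standing in for PdfReader(...).pages
pages = ['page %d' % n for n in range(1, 6)]
subset = pages[page_range_to_slice(2, 4, len(pages))]
print(subset)  # ['page 2', 'page 3', 'page 4']
```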

I just thought that was really cool. It gives you a couple of alternatives to pyPDF’s writer anyway. There are lots of other interesting examples included with the package, including:

  1. How to use a PDF (page one) as the background for all other pages together with platypus.
  2. How to add a watermark.

I think the project has potential. Hopefully we can generate enough interest to kickstart this project again or maybe get something new off the ground.


  • brogers

    I’m glad to hear about this lib. I tried your first example and it seems the out.pdf is written to the console and not a file. Any ideas?

    #!python

    import sys
    import os

    from pdfrw import PdfReader, PdfWriter
     
    pages = PdfReader(r'one.pdf', decompress=False).pages
    other_pages = PdfReader(r'two.pdf', decompress=False).pages
     
    writer = PdfWriter()
    writer.addpages(pages)
    writer.addpages(other_pages)
    writer.write(r'three.pdf')

  • driscollis

    The only difference I see with your code is that you’re not passing an absolute path to the write method. Try doing that and see if that works. I just retried my code and it still works on Windows 7 with Python 2.6.6.

  • Patrick Maupin

    Thanks for the nice article!

    I have added a setup script and uploaded the package to PyPI, so it should now be available from easy_install or pip.

  • driscollis

    That’s great! Thanks!