Tag Archives: Python PDF Series

Python: PDF Creation with pdfdocument

I do a lot of PDF report creation with Python using Reportlab. Occasionally I’ll throw pyPdf in as well. So I’m always on the lookout for other mature Python PDF tools. PDFDocument isn’t exactly mature, but it’s kind of interesting. The PDFDocument project is actually a wrapper for Reportlab. You can get it on GitHub. I found the project easy to use, but pretty limiting. Let’s take a few minutes to examine how it works.

PyPdf: How to Write a PDF to Memory

At my job, we sometimes need to write a PDF to memory instead of disk because we need to merge an overlay onto it. By writing to memory, we can speed up the process since we won’t have the extra step of writing the file to disk and then reading it back into memory again. Sadly, pyPdf’s PdfFileWriter() class doesn’t offer any support for extracting the binary string, so we have to use StringIO instead. Here’s an example where I merge two PDFs into memory:

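import pyPdf
from StringIO import StringIO

def mergePDFs(pdfOne, pdfTwo):
    """
    Merge PDFs
    """
    tmp = StringIO()
    output = pyPdf.PdfFileWriter()

    # Add every page of the first PDF to the writer
    pdfOne = pyPdf.PdfFileReader(file(pdfOne, "rb"))
    for page in range(pdfOne.getNumPages()):
        output.addPage(pdfOne.getPage(page))

    # Add every page of the second PDF to the writer
    pdfTwo = pyPdf.PdfFileReader(file(pdfTwo, "rb"))
    for page in range(pdfTwo.getNumPages()):
        output.addPage(pdfTwo.getPage(page))

    # Write the merged document to the in-memory buffer
    output.write(tmp)
    return tmp.getvalue()

if __name__ == "__main__":
    pdfOne = '/path/to/pdf/one'
    pdfTwo = '/path/to/pdf/two'
    pdfObj = mergePDFs(pdfOne, pdfTwo)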

As you can see, all you need to do is create a StringIO() object, add some pages to the PdfFileWriter() object and then write the data to your StringIO object. Then to extract the binary string, you have to call StringIO’s getvalue() method. Simple, right? Now you have the PDF in memory, and you can use it to add more pages, overlay OMR marks on it or whatever.
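For anyone on a newer Python where the StringIO module is not available, the same pattern can be sketched with io.BytesIO. This is only an illustration: FakeWriter and write_to_memory are hypothetical names made up for this example, standing in for any writer object that exposes a write(stream) method, and they are not part of pyPdf.

```python
import io


class FakeWriter(object):
    """Hypothetical stand-in for a PDF writer (NOT part of pyPdf).

    Anything that exposes write(stream) fits the same pattern.
    """

    def __init__(self):
        self.chunks = []

    def add(self, data):
        # Queue up some bytes, loosely like adding a page
        self.chunks.append(data)

    def write(self, stream):
        # Dump everything to whatever file-like object we were given
        for chunk in self.chunks:
            stream.write(chunk)


def write_to_memory(writer):
    """Write to an in-memory buffer and return the raw bytes."""
    buf = io.BytesIO()
    writer.write(buf)
    return buf.getvalue()


writer = FakeWriter()
writer.add(b"%PDF-1.4\n")
writer.add(b"%%EOF\n")
data = write_to_memory(writer)

# Wrap the bytes again when a file-like object is needed later
stream = io.BytesIO(data)
first_line = stream.readline()
```

The point to notice is that getvalue() hands back plain bytes, so if a later step expects something file-like, you wrap those bytes in a fresh BytesIO object.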

Python PDF Series – An Intro to metaPDF

While researching PDF libraries for Python, I stumbled across another little project called metaPDF. According to its website, metaPDF is a lightweight Python library optimized for metadata extraction and insertion, and it is a fast wrapper over the excellent pyPdf library. It works by quickly searching the last 2048 bytes of the PDF before parsing the xref table, offering a 50-60% performance increase over directly parsing the table line by line. I’m not really sure how useful that will be, but let’s try it out and see what metaPDF can do.
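To give a rough idea of what "searching the last 2048 bytes" can mean, here is a small sketch that only reads the tail of a file and looks for the startxref keyword that a PDF trailer uses to record where the xref table begins. This is not metaPDF's actual code; the function name find_startxref and the fake PDF bytes are assumptions made up for the illustration.

```python
import io


def find_startxref(stream, tail_size=2048):
    """Return the offset recorded after the last 'startxref' keyword
    found in the final tail_size bytes, or None if there isn't one.

    Hypothetical helper for illustration only (not metaPDF's API).
    """
    # Jump to the end to learn the size, then back up by tail_size
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()

    # Search backwards so we get the most recent trailer
    pos = tail.rfind(b"startxref")
    if pos == -1:
        return None

    # The offset is the first token after the keyword
    tokens = tail[pos + len(b"startxref"):].split()
    if not tokens or not tokens[0].isdigit():
        return None
    return int(tokens[0].decode("ascii"))


# A fake "PDF" that is much bigger than the tail we read
fake_pdf = b"%PDF-1.4\n" + b"0" * 5000 + b"\nstartxref\n1234\n%%EOF\n"
offset = find_startxref(io.BytesIO(fake_pdf))
```

Only the last couple of kilobytes are ever read here, no matter how big the file is, which is where the time savings would come from.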

PyPDF2: The New Fork of pyPdf

Today I learned that the pyPdf project is NOT dead, as I had originally thought. In fact, it’s been forked into PyPDF2 (note the slightly different spelling). There’s also a possibility that someone else has taken over the original pyPdf project and is actively working on it. You can follow all that over on reddit if you like. In the meantime, I decided to give PyPDF2 a whirl and see how it is different from the original. Feel free to follow along if you have a free moment or two.

A Quick Intro to pdfrw

I’m always on the lookout for Python PDF libraries and I happened to stumble across pdfrw the other day. It looks like a replacement for pyPdf in that it can read and write PDFs, join PDFs and can use Reportlab for concatenation and watermarking, among other things. The project also appears slightly dead in that its last update was in 2011, but then again, pyPdf’s last update was in 2010, so it’s a little fresher. In this article, we’ll take a little test drive of pdfrw and see if it’s useful or not. Come and join the fun!

Reportlab: Mixing Fixed Content and Flowables

Recently I needed the ability to use Reportlab’s flowables, but place them in fixed locations. Some of you are probably wondering why I would want to do that. The nice thing about flowables, like the Paragraph, is that they’re easily styled. If I could bold something or center something AND put it in a fixed location, then that would rock! It took a lot of Googling and trial and error, but I finally got a decent template put together that I could use for mailings. In this article, I’m going to show you how to do this too.

An Intro to rst2pdf – Changing Restructured Text into PDFs with Python

There are several cool ways to create PDFs with Python. In this article we will be focusing on a cool little tool called rst2pdf, which takes a text file that contains Restructured Text and converts it to a PDF. The rst2pdf package requires Reportlab to function. This won’t be a tutorial on Restructured Text, although we’ll have to discuss it to some degree just to understand what’s going on.

Reportlab Tables – Creating Tables in PDFs with Python

Back in March of this year, I wrote a simple tutorial on Reportlab, a handy 3rd party Python package that allows the developer to create PDFs programmatically. Recently, I received a request to cover how to do tables in Reportlab. Since my Reportlab article is so popular, I figured it was probably worth the trouble to figure out tables. In this article, I will attempt to show you the basics of inserting tables into Reportlab-generated PDFs.

Manipulating PDFs with Python and pyPdf

There’s a handy 3rd party module called pyPdf out there that you can use to merge PDF documents together, rotate pages, split and crop pages, and decrypt/encrypt PDF documents. In this article, we’ll take a look at a few of these functions and then create a simple GUI with wxPython that will allow us to merge a couple of PDFs.