Tag Archives: Python PDF Series

An Intro to PyPDF2

The PyPDF2 package is a pure-Python PDF library that you can use for splitting, merging, cropping and transforming pages in your PDFs. According to the PyPDF2 website, you can also use PyPDF2 to add data, viewing options and passwords to your PDFs. Finally, you can use PyPDF2 to extract text and metadata from your PDFs.

PyPDF2 is actually a fork of the original pyPdf, which was written by Mathieu Fenniak and released in 2005. However, the original pyPdf’s last release was in 2014. A company called Phaseit, Inc. spoke with Mathieu and ended up sponsoring PyPDF2 as a fork of pyPdf.

At the time of writing this book, the PyPDF2 package hasn’t had a release since 2016. However, it is still a solid and useful package that is worth your time to learn.

The following lists what we will be learning in this article:

  • Extracting metadata
  • Splitting documents
  • Merging 2 PDF files into 1
  • Rotating pages
  • Overlaying / Watermarking Pages
  • Encrypting / decrypting

Let’s start by learning how to install PyPDF2!


Installation

PyPDF2 is a pure Python package, so you can install it using pip (assuming pip is in your system’s path):

python -m pip install pypdf2

As usual, you should install 3rd party Python packages to a Python virtual environment to make sure that it works the way you want it to.

Creating and Manipulating PDFs with pdfrw

Patrick Maupin created a package he called pdfrw and released it back in 2012. The pdfrw package is a pure-Python library that you can use to read and write PDF files. At the time of writing, pdfrw was at version 0.4. With that version, it supports subsetting, merging, rotating and modifying data in PDFs. The pdfrw package has been used by the rst2pdf package (see chapter 18) since 2010 because pdfrw can “faithfully reproduce vector formats without rasterization”. You can also use pdfrw in conjunction with ReportLab to re-use portions of existing PDFs in new PDFs that you create with ReportLab.

In this article, we will learn how to do the following:

  • Extracting certain types of information from a PDF
  • Splitting PDFs
  • Merging / Concatenating PDFs
  • Rotating pages
  • Creating overlays or watermarks
  • Scaling pages
  • Combining the use of pdfrw and ReportLab

Let’s get started!

Creating PDFs with PyFPDF and Python

ReportLab is the primary toolkit that I use for generating PDFs from scratch. However, I have found that there is another one called PyFPDF or FPDF for Python. The PyFPDF package is actually a port of the “Free”-PDF package that was written in PHP. There hasn’t been a release of this project in a few years, but there have been commits to its GitHub repository, so there is still some work being done on the project. The PyFPDF package supports Python 2.7 and Python 3.4+.

This article will not be exhaustive in its coverage of the PyFPDF package. However, it will cover more than enough for you to get started using it effectively. Note that there is a short book on PyFPDF called “Python does PDF: pyFPDF” by Edwood Ocasio on Leanpub if you would like to learn more about the library than what is covered in this chapter or the package’s documentation.


Installation

Installing PyFPDF is easy since it was designed to work with pip. Here’s how:

python -m pip install fpdf

At the time of writing, this command installed version 1.7.2 on Python 3.6 with no problems whatsoever. You will notice when you are installing this package that it has no dependencies, which is nice.

Splitting and Merging PDFs with Python

The PyPDF2 package allows you to do a lot of useful operations on existing PDFs. In this article, we will learn how to split a single PDF into multiple smaller ones. We will also learn how to take a series of PDFs and join them back together into a single PDF.


Getting Started

PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

pip install pypdf2

Now that we have PyPDF2 installed, let’s learn how to split and merge PDFs!


Splitting PDFs

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones. You just need to tell it how many pages you want. For this example, we will download a W9 form from the IRS and loop over all six of its pages. We will split off each page and turn it into its own standalone PDF.

Let’s find out how:

# pdf_splitter.py
 
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
 
 
def pdf_splitter(path):
    fname = os.path.splitext(os.path.basename(path))[0]
 
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
 
        output_filename = '{}_page_{}.pdf'.format(
            fname, page+1)
 
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
 
        print('Created: {}'.format(output_filename))
 
if __name__ == '__main__':
    path = 'w9.pdf'
    pdf_splitter(path)
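
The merging half of this article is not shown in this excerpt, so here is a minimal sketch of how the split pages could be joined back together. It assumes the PdfFileMerger class from the same pre-3.0 PyPDF2 API used above. The function name pdf_merger, its optional merger parameter (there only so a stand-in object can be passed in) and the file names in main() are hypothetical and are not taken from the original article.

```python
# pdf_merger_sketch.py


def pdf_merger(paths, output_path, merger=None):
    """Append each PDF in `paths`, in order, and write one combined file."""
    if merger is None:
        # Imported lazily so a stand-in merger can be supplied instead.
        # PdfFileMerger is part of the pre-3.0 PyPDF2 API (an assumption
        # about which PyPDF2 version is installed).
        from PyPDF2 import PdfFileMerger
        merger = PdfFileMerger()

    for path in paths:
        merger.append(path)

    with open(output_path, 'wb') as out:
        merger.write(out)
    merger.close()
    return output_path


def main():
    # Hypothetical file names that follow pdf_splitter's naming pattern
    pages = ['w9_page_{}.pdf'.format(n) for n in range(1, 7)]
    print('Created: {}'.format(pdf_merger(pages, 'w9_merged.pdf')))
```

Calling main() after running pdf_splitter.py should produce a single PDF with the pages in the order given. Building the list of names explicitly, rather than sorting a directory listing, avoids page 10 sorting ahead of page 2 in longer documents.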


Extracting PDF Metadata and Text with Python

There are lots of PDF-related packages for Python. One of my favorites is PyPDF2. You can use it to extract metadata, rotate pages, split or merge PDFs and more. It’s kind of a Swiss-army knife for existing PDFs. In this article we will learn how to extract basic information about a PDF using PyPDF2.


Getting Started

PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

pip install pypdf2

Now that we have PyPDF2 installed, let’s learn how to get metadata from a PDF!
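
The rest of this article is not included in this excerpt, so what follows is only a rough sketch of what reading metadata can look like. It assumes the pre-3.0 PyPDF2 API, where PdfFileReader provides getDocumentInfo() and getNumPages(). The names summarize_pdf and main, and the example.pdf path, are hypothetical.

```python
# pdf_info_sketch.py

FIELDS = ('author', 'creator', 'producer', 'subject', 'title')


def summarize_pdf(reader):
    """Collect the page count and a few metadata fields from a reader."""
    # getDocumentInfo() can return None when the PDF has no info dictionary
    info = reader.getDocumentInfo()
    summary = {'pages': reader.getNumPages()}
    for field in FIELDS:
        # Fall back to None if the info object or the attribute is missing
        summary[field] = getattr(info, field, None)
    return summary


def main(path='example.pdf'):
    # Imported here so summarize_pdf() can be used with any reader-like
    # object; PdfFileReader is the pre-3.0 PyPDF2 class name.
    from PyPDF2 import PdfFileReader

    with open(path, 'rb') as f:
        reader = PdfFileReader(f)
        for key, value in summarize_pdf(reader).items():
            print('{}: {}'.format(key, value))
```

Any field the PDF does not define comes back as None, so the caller can decide how to display missing values.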

Python: PDF Creation with pdfdocument

I do a lot of PDF report creation with Python using ReportLab. Occasionally I’ll throw PyPDF in as well. So I’m always on the lookout for other mature Python PDF tools. PDFDocument isn’t exactly mature, but it’s kind of interesting. The PDFDocument project is actually a wrapper for ReportLab. You can get it on GitHub. I found the project easy to use, but pretty limiting. Let’s take a few minutes to examine how it works.

PyPdf: How to Write a PDF to Memory

At my job, we sometimes need to write a PDF to memory instead of disk because we need to merge an overlay onto it. By writing to memory, we can speed up the process since we won’t have the extra step of writing the file to disk and then reading it back into memory again. Sadly, pyPdf’s PdfFileWriter() class doesn’t offer any support for extracting the binary string, so we have to use StringIO instead. Here’s an example where I merge two PDFs into memory:

import pyPdf
from StringIO import StringIO
 
#----------------------------------------------------------------------
def mergePDFs(pdfOne, pdfTwo):
    """
    Merge PDFs
    """
    tmp = StringIO()
 
    output = pyPdf.PdfFileWriter()
 
    pdfOne = pyPdf.PdfFileReader(file(pdfOne, "rb"))
    for page in range(pdfOne.getNumPages()):
        output.addPage(pdfOne.getPage(page))
    pdfTwo = pyPdf.PdfFileReader(file(pdfTwo, "rb"))
    for page in range(pdfTwo.getNumPages()):
        output.addPage(pdfTwo.getPage(page))
 
    output.write(tmp)
    return tmp.getvalue()
 
 
if __name__ == "__main__":
    pdfOne = '/path/to/pdf/one'
    pdfTwo = '/path/to/pdf/two'
    pdfObj = mergePDFs(pdfOne, pdfTwo)

As you can see, all you need to do is create a StringIO() object, add some pages to the PdfFileWriter() object and then write the data to your StringIO object. Then, to extract the binary string, you call StringIO’s getvalue() method. Simple, right? Now you have the merged PDF in memory as a binary string, which you can wrap in a new StringIO object to add more pages, overlay OMR marks or whatever.
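
The listing above targets Python 2 (the StringIO module and the file() built-in). On Python 3, the same idea can be sketched with io.BytesIO. This sketch assumes only that the writer object has a write(stream) method, as PdfFileWriter does. The helper names write_to_memory and as_file are hypothetical.

```python
import io


def write_to_memory(writer):
    """Have `writer` write into an in-memory buffer and return the bytes."""
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


def as_file(pdf_bytes):
    """Wrap raw PDF bytes in a file-like object that a reader can open."""
    return io.BytesIO(pdf_bytes)
```

The bytes returned by write_to_memory() can be passed straight to as_file() when another step, such as merging an overlay, expects a file-like object instead of a path.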


Python PDF Series – An Intro to metaPDF

While researching PDF libraries for Python, I stumbled across another little project called metaPDF. According to its website, metaPDF is a lightweight Python library optimized for metadata extraction and insertion, and it is a fast wrapper over the excellent pyPdf library. It works by quickly searching the last 2048 bytes of the PDF before parsing the xref table, offering a 50-60% performance increase over directly parsing the table line by line. I’m not really sure how useful that will be, but let’s try it out and see what metaPDF can do.
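
To make the "last 2048 bytes" idea more concrete, here is a small sketch of scanning the tail of a PDF for the startxref keyword, which is followed by the byte offset of the cross-reference table. This is not metaPDF's actual code, just an illustration of the general technique. The function name find_startxref and the TAIL_SIZE constant are hypothetical.

```python
import io
import re

TAIL_SIZE = 2048  # how many trailing bytes to scan, per the description above


def find_startxref(stream, tail_size=TAIL_SIZE):
    """Return the offset recorded after the last 'startxref' keyword.

    Only the final `tail_size` bytes of the stream are read. Returns None
    if the keyword does not appear in that tail.
    """
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    stream.seek(max(0, size - tail_size))
    tail = stream.read()

    # A PDF normally ends with: startxref <offset> %%EOF
    matches = re.findall(rb'startxref\s+(\d+)', tail)
    return int(matches[-1]) if matches else None
```

Open the PDF in binary mode ('rb') and pass the file object in. Because only the tail is read, the cost of the lookup does not grow with the size of the document.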

PyPDF2: The New Fork of pyPdf

Today I learned that the pyPdf project is NOT dead, as I had originally thought. In fact, it’s been forked into PyPDF2 (note the slightly different spelling). There’s also a possibility that someone else has taken over the original pyPdf project and is actively working on it. You can follow all that over on reddit if you like. In the meantime, I decided to give PyPDF2 a whirl and see how it is different from the original. Feel free to follow along if you have a free moment or two.