Tag Archives: PyPDF

An Intro to PyPDF2

The PyPDF2 package is a pure-Python PDF library that you can use for splitting, merging, cropping and transforming pages in your PDFs. According to the PyPDF2 website, you can also use PyPDF2 to add data, viewing options and passwords to the PDFs too. Finally you can use PyPDF2 to extract text and metadata from your PDFs.

PyPDF2 is actually a fork of the original pyPdf which was written by Mathiew Fenniak and released in 2005. However, the original pyPdf’s last release was in 2014. A company called Phaseit, Inc spoke with Mathieu and ended up sponsoring PyPDF2 as a fork of pyPdf

At the time of writing this book, the PyPDF2 package hasn’t had a release since 2016. However it is still a solid and useful package that is worth your time to learn.

The following lists what we will be learning in this article:

  • Extracting metadata
  • Splitting documents
  • Merging 2 PDF files into 1
  • Rotating pages
  • Overlaying / Watermarking Pages
  • Encrypting / decrypting

Let’s start by learning how to install PyPDF2!


Installation

PyPDF2 is a pure Python package, so you can install it using pip (assuming pip is in your system’s path):

python -m pip install pypdf2

As usual, you should install 3rd party Python packages to a Python virtual environment to make sure that it works the way you want it to. Continue reading An Intro to PyPDF2

Splitting and Merging PDFs with Python

The PyPDF2 package allows you to do a lot of useful operations on existing PDFs. In this article, we will learn how to split a single PDF into multiple smaller ones. We will also learn how to take a series of PDFs and join them back together into a single PDF.


Getting Started

PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

pip install pypdf2

Now that we have PyPDF2 installed, let’s learn how to split and merge PDFs!


Splitting PDFs

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones. You just need to tell it how many pages you want. For this example, we will download a W9 form from the IRS and loop over all six of its pages. We will split off each page and turn it into its own standalone PDF.

Let’s find out how:

# pdf_splitter.py
 
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
 
 
def pdf_splitter(path):
    fname = os.path.splitext(os.path.basename(path))[0]
 
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))
 
        output_filename = '{}_page_{}.pdf'.format(
            fname, page+1)
 
        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)
 
        print('Created: {}'.format(output_filename))
 
if __name__ == '__main__':
    path = 'w9.pdf'
    pdf_splitter(path)

Continue reading Splitting and Merging PDFs with Python

Extracting PDF Metadata and Text with Python

There are lots of PDF related packages for Python. One of my favorite is PyPDF2. You can use it to extract metadata, rotate pages, split or merge PDFs and more. It’s kind of a Swiss-army knife for existing PDFs. In this article we will learn how to extract basic information about a PDF using PyPDF2


Getting Started

PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

pip install pypdf2

Now that we have PyPDF2 installed, let’s learn how to get metadata from a PDF! Continue reading Extracting PDF Metadata and Text with Python

PyPdf: How to Write a PDF to Memory

At my job, we sometimes need to write a PDF to memory instead of disk because we need to merge an overlay on to it. By writing to memory, we can speed up the process since we won’t have the extra step of writing the file to disk and than reading it back into memory again. Sadly, pyPdf’s PdfFileWriter() class doesn’t offer any support for extracting the binary string, so we have to StringIO instead. Here’s an example where I merge two PDFs into memory:

import pyPdf
from StringIO import StringIO
 
#----------------------------------------------------------------------
def mergePDFs(pdfOne, pdfTwo):
    """
    Merge PDFs
    """
    tmp = StringIO()
 
    output = pyPdf.PdfFileWriter()
 
    pdfOne = pyPdf.PdfFileReader(file(pdfOne, "rb"))
    for page in range(pdfOne.getNumPages()):
        output.addPage(pdfOne.getPage(page))
    pdfTwo = pyPdf.PdfFileReader(file(pdfTwo, "rb"))
    for page in range(pdfTwo.getNumPages()):
        output.addPage(pdfTwo.getPage(page))
 
    output.write(tmp)
    return tmp.getvalue()
 
 
if __name__ == "__main__":
    pdfOne = '/path/to/pdf/one'
    pdfTwo = '/path/to/pdf/two'
    pdfObj = mergePDFs(pdfOne, pdfTwo)

As you can see, all you need to do is create a StringIO() object, add some pages to the PdfFileWriter() object and then write the data to your StringIO object. Then to extract the binary string, you have to call StringIO’s getvalue() method. Simple, right? Now you have a file-like object in memory that you can use to add more pages to or overlay OMR mark on or whatever.

Related Artlcies

PyPDF2: The New Fork of pyPdf

Today I learned that the pyPDF project is NOT dead, as I had originally thought. In fact, it’s been forked into PyPDF2 (note the slightly different spelling). There’s also a possibility that someone else has taken over the original pyPDF project and is actively working on it. You can follow all that over on reddit if you like. In the mean time, I decided to give PyPDF2 a whirl and see how it is different from the original. Feel free to follow along if you have a free moment or two. Continue reading PyPDF2: The New Fork of pyPdf

Top Ten Articles of 2010

A lot of websites are doing year-end retrospectives this week, so I thought you might find it interesting to know which articles on this blog were the most popular this year. Below you will find links to each article along with the page view count I got from Google Analytics:

  1. A Simple Step-by-Step Reportlab Tutorial, 9,709 page views, posted 03/08/2010
  2. Another Step-by-Step SqlAlchemy Tutorial Part 1, 7,746 page views, posted 02/03/2010
  3. Another Step-by-Step SqlAlchemy Tutorial Part 2, 4,858 page views, posted 02/03/2010
  4. Manipulating PDFs with Python and pyPdf, 4,511 page views, posted 05/15/2010
  5. Python 101: Introspection, 4,473 page views, posted 10/14/2010
  6. wxPython: Grid Tips and Tricks, 3,476 page views, posted 04/04/2010
  7. wxPython: Creating a Simple MP3 Player, 3,401 page views, posted 04/20/2010
  8. Python and Microsoft Office – Using PyWin32, 3,323 page views, posted 07/16/2010
  9. wxPython and Threads, 3,183 page views, posted 05/22/2010

It would seem that SqlAlchemy and Reportlab are pretty popular topics. Are there any articles about either of these cool packages that you think I should write? As you can see, wxPython makes it into the top ten 3 times! What should I write about next regarding wxPython?

This upcoming year, I plan to write about some of the other GUI toolkits. Which one do you think I should do first? Tkinter, PySide, PyGUI or something else? What packages or standard libraries do you think I should cover? Feel free to let me know via the comments below or via my contact form (link at top). I’m looking forward to another year of Python tinkering and writing and I hope you are too! Thanks for your readership and encouragement this year!