There are many times when you will want to extract data from a PDF and export it in a different format using Python. Unfortunately, few Python packages do the extraction part very well. In this chapter, we will look at a variety of packages that you can use to extract text. We will also learn how to extract some images from PDFs. While there is no complete solution for these tasks in Python, you should be able to use the information here to get started. Once we have extracted the data we want, we will also look at how we can take that data and export it in a different format.
Let’s get started by learning how to extract text!
Extracting Text with PDFMiner
Probably the best-known text extraction package is PDFMiner. The PDFMiner package has been around since Python 2.4. Its primary purpose is to extract text from a PDF. In fact, PDFMiner can tell you the exact location of the text on the page as well as information about fonts. For Python 2.4 – 2.7, you can refer to the following websites for additional information on PDFMiner:
PDFMiner is not compatible with Python 3. Fortunately, there is a fork of PDFMiner called PDFMiner.six that works exactly the same. You can find it here: https://github.com/pdfminer/pdfminer.six
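To give a feel for the extract-then-export workflow, here is a minimal sketch. It assumes PDFMiner.six is installed (for example with pip install pdfminer.six) and relies on its high-level extract_text helper. The function name pdf_text_to_json, the extract parameter, and the shape of the JSON output are made up for illustration and are not part of PDFMiner.

```python
import json


def pdf_text_to_json(pdf_path, extract=None):
    """Return a JSON string holding the non-empty text lines of a PDF.

    `extract` is any callable that maps a file path to a string. It is a
    hypothetical hook added for this sketch; when it is omitted, the
    function falls back to pdfminer.six's `extract_text` helper.
    """
    if extract is None:
        # Imported lazily so the rest of the function can be tried out
        # even on a machine where pdfminer.six is not installed.
        from pdfminer.high_level import extract_text as extract

    text = extract(pdf_path)

    # Drop blank lines and surrounding whitespace. str.splitlines() also
    # treats form feed characters as line boundaries.
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    return json.dumps({"source": str(pdf_path), "lines": lines}, indent=2)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py path/to/file.pdf
    if len(sys.argv) > 1:
        print(pdf_text_to_json(sys.argv[1]))
```

Passing the extraction step in as a callable is just a convenience for this sketch: it keeps the export logic separate from PDFMiner, so you could swap in a different extraction package later without touching the JSON code.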