While researching PDF libraries for Python, I stumbled across another little project called metaPDF. According to its website, metaPDF is a lightweight Python library optimized for metadata extraction and insertion, and it is a fast wrapper over the excellent pyPdf library. It works by quickly searching the last 2048 bytes of the PDF before parsing the xref table, offering a 50-60% performance increase over directly parsing the table line by line. I’m not really sure how useful that will be, but let’s try it out and see what metaPDF can do.
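To make the "last 2048 bytes" idea concrete, here is a minimal sketch of what peeking at the tail of a PDF can look like. This is not metaPDF's actual code; the function name find_startxref and the tail_size parameter are my own, for illustration only. It relies on the fact that a PDF normally ends with a startxref keyword followed by the byte offset of the cross-reference table.

```python
import re


def find_startxref(path, tail_size=2048):
    """Return the byte offset that follows 'startxref' in the file's tail, or None.

    Hypothetical helper for illustration; not part of metaPDF or pyPdf.
    """
    with open(path, 'rb') as f:
        f.seek(0, 2)                       # jump to the end of the file
        size = f.tell()
        f.seek(max(0, size - tail_size))   # back up at most tail_size bytes
        tail = f.read()
    # A file may contain several startxref entries; the last one is current.
    matches = re.findall(br'startxref\s+(\d+)', tail)
    return int(matches[-1]) if matches else None
```

Only the tail of the file is read here, so the cost stays the same no matter how large the PDF is. Whether that matches what metaPDF does internally is an assumption on my part.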

Getting and Using metaPDF

Installing metaPDF is quite simple: just use easy_install or pip. Next we need to write a little script to see how it works. Here’s one that’s based on the example on metaPDF’s GitHub page:

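from metapdf import MetaPdfReader

# Path to the PDF whose metadata we want to read
pdf_path = r'C:\Users\mdriscoll\Documents\reportlab-userguide.pdf'
reader = MetaPdfReader()
# Open in binary mode; the with block closes the file afterwards
with open(pdf_path, 'rb') as pdf_file:
    metadata = reader.read_metadata(pdf_file)
print metadata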

Here I run it against the ReportLab user’s guide PDF. Note that the original example had a typo where it used something called “read” to open the file. That won’t work unless you’ve shadowed open, I suppose. Anyway, the output from this script is as follows:

{'/ModDate': u'D:20120629155504', '/CreationDate': u'D:20120629155504', '/Producer': u'GPL Ghostscript 8.15', '/Title': u'reportlab-userguide.pdf', '/Creator': u'Adobe Acrobat 10.1.3', '/Author': u'mdriscoll'}

I really don’t understand how the author got changed on this document, but I’m sure I’m not the author. I don’t really understand why there are forward slashes in the key fields either. Looking at the source code for this module, it would seem that reading metadata is all it can do. That’s a little disappointing. Maybe by drawing attention to this library we can get the developer to write some more functionality into it?
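For what it’s worth, the slashes appear to come from the PDF format itself: names in a PDF’s info dictionary are written with a leading “/”, and the keys come back in that raw form. If you’d rather have friendlier keys and real datetime objects, a small post-processing step can do it. The sketch below is my own addition, not something metaPDF provides; clean_metadata is a hypothetical helper, and it assumes the dates use the full 14-digit D:YYYYMMDDHHMMSS form seen in the output above.

```python
from datetime import datetime


def clean_metadata(raw):
    """Strip the leading '/' from each key and parse D:YYYYMMDDHHMMSS dates.

    Hypothetical helper for illustration; not part of metaPDF or pyPdf.
    """
    cleaned = {}
    for key, value in raw.items():
        name = key.lstrip('/')
        if name.endswith('Date') and value.startswith('D:'):
            # Only the 14 digits after 'D:' are parsed; any timezone
            # suffix that follows them is ignored.
            value = datetime.strptime(value[2:16], '%Y%m%d%H%M%S')
        cleaned[name] = value
    return cleaned


# Two entries copied from the output shown earlier
sample = {'/ModDate': u'D:20120629155504', '/Author': u'mdriscoll'}
info = clean_metadata(sample)
```

After this runs, info['Author'] holds the author string and info['ModDate'] holds a datetime object. Shorter date strings, which the PDF format also allows, would need extra handling.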
