Python PDF Series – An Intro to metaPDF

Posted by Mike on July 21st, 2012 filed in Python

While researching PDF libraries for Python, I stumbled across another little project called metaPDF. According to its website, metaPDF is a lightweight Python library optimized for metadata extraction and insertion, and it is a fast wrapper over the excellent pyPdf library. It works by quickly searching the last 2048 bytes of the PDF before parsing the xref table, offering a 50-60% performance increase over directly parsing the table line by line. I’m not really sure how useful that will be, but let’s try it out and see what metaPDF can do.
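To get a feel for what "searching the last 2048 bytes" might look like, here is a rough sketch of the idea. This is not metaPDF's actual code, just an illustration. A PDF normally ends with the keyword startxref, the byte offset of the xref table, and an %%EOF marker, so that offset can be found from the tail of the file alone. The function name find_startxref and the TAIL_SIZE constant are made up for this example.

```python
import io
import re

# Assumption: 2048 matches the figure quoted in metaPDF's description
TAIL_SIZE = 2048


def find_startxref(stream, tail_size=TAIL_SIZE):
    """Return the offset that follows the last 'startxref' keyword, or None."""
    stream.seek(0, io.SEEK_END)
    file_size = stream.tell()
    # Jump to the start of the tail (or to byte 0 if the file is short)
    stream.seek(max(0, file_size - tail_size))
    tail = stream.read()
    # The last match wins, since incrementally updated PDFs can have several
    matches = re.findall(br"startxref\s+(\d+)", tail)
    return int(matches[-1]) if matches else None
```

You would call it with a file opened in binary mode, for example find_startxref(open(path, 'rb')). Only the tail of the file is read, no matter how large the PDF is.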

Getting and Using metaPDF

The installation process for metaPDF is quite simple. Just use easy_install or pip to install it. Next we need to write a little script to see how it works. Here’s one that’s based on the example on metaPDF’s GitHub page:

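from metapdf import MetaPdfReader

# Path to the PDF whose metadata we want to read
pdfOne = r'C:\Users\mdriscoll\Documents\reportlab-userguide.pdf'
x = MetaPdfReader()
# Open in binary mode and close the file once the metadata is read
with open(pdfOne, 'rb') as pdf_file:
    metadata = x.read_metadata(pdf_file)
print(metadata)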

Here I run it against the ReportLab user’s guide PDF. Note that the original example had a typo where it used something called “read” to open the file. That won’t work unless you’ve defined read as an alias for open, I suppose. Anyway, the output from this script is as follows:

{'/ModDate': u'D:20120629155504', '/CreationDate': u'D:20120629155504', '/Producer': u'GPL Ghostscript 8.15', '/Title': u'reportlab-userguide.pdf', '/Creator': u'Adobe Acrobat 10.1.3', '/Author': u'mdriscoll'}

I really don’t understand how the author got changed on this document, but I’m sure I’m not the author. I don’t really understand why there are forward slashes in the key fields either. Looking at the source code for this module, it would seem that reading metadata is all it can do. That’s a little disappointing. Maybe by drawing attention to this library we can get the developer to write some more functionality into it?
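Whatever the reason for the slashes, they are easy to clean up after the fact. Here is a small sketch of a helper that strips them and turns the date strings into datetime objects. The name tidy_metadata is hypothetical and not part of metaPDF, and the date parsing assumes the full 14-digit D:YYYYMMDDHHmmSS form shown in the output above.

```python
from datetime import datetime


def tidy_metadata(metadata):
    """Return a copy with the leading '/' removed from keys and dates parsed."""
    tidy = {}
    for key, value in metadata.items():
        name = key.lstrip('/')
        # Assumption: date values carry a full D:YYYYMMDDHHmmSS timestamp;
        # any trailing timezone information is ignored
        if name.endswith('Date') and value.startswith('D:'):
            value = datetime.strptime(value[2:16], '%Y%m%d%H%M%S')
        tidy[name] = value
    return tidy
```

Passing the dictionary printed above through tidy_metadata would give keys such as Title and Author, with ModDate and CreationDate as datetime values.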

  • Tim Arnold

    I don’t know why this would be useful since pyPdf already contains the tools to get the info. But I imagine the metadata keys start with a forward slash because that’s the way the PDFinfo metadata keys exist inside the PDF. What kind of utility would be good for your purpose? I use pyPdf to extract metadata info in my everyday work.

  • driscollis

    I use pyPdf too. I was just doing a tour of the Python PDF libraries to see what they were all about.

  • http://eduardo.cereto.net/ Eduardo Cereto Carvalho

    Maybe if you RTFD you will understand. From the top of their github page.

    >    The metapdf library is a lightweight Python library optimized for metadata extraction and insertion, and it is a fast wrapper over the excellent pyPdf library. It works by quickly searching the last 2048 bytes of the PDF before parsing the xref table, offering a 50-60% performance increase over directly parsing the table line by line.

    The code is very small and clean, and it’s easy to see what they are doing. They are reducing the amount of the file that is read to extract the meta information. Maybe for you it doesn’t matter, and in that case you might want to stick with pyPdf, but if you are trying to extract metadata from thousands of files and want to do that in the fastest way possible, this library seems like a good start.
