Python: Parsing XML with lxml

Last time, we looked at one of Python’s built-in XML parsers. In this article, we will look at a fun third-party package: lxml, from codespeak. It implements the ElementTree API, among other things. The lxml package also supports XPath and XSLT, and includes a SAX API and a C-level API for compatibility with C/Pyrex modules. We’ll just do a few simple things with it, though.

For this article, we will reuse the examples from the minidom parsing article and see how to parse them with lxml. Here’s an XML example from a program that was written for keeping track of appointments:



    
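<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
    <appointment>
        <begin>1234360800</begin>
        <duration>1800</duration>
        <subject>Check MS Office website for updates</subject>
        <location></location>
        <uid>604f4792-eb89-478b-a14f-dd34d3cc6c21-1234360800</uid>
        <state>dismissed</state>
  </appointment>
</zAppointments>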

The XML above shows two appointments. The beginning time is in seconds since the epoch; the uid is generated from a hash of the beginning time and a key (I think); the alarm time is also in seconds since the epoch and should be earlier than the beginning time; and the state records whether the appointment has been snoozed, dismissed, or neither. The rest are pretty self-explanatory. Now let’s see how to parse it.
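First, a quick aside on those timestamps. Here is a minimal sketch of turning the epoch values into readable dates with the standard library. The helper name to_utc is my own, and treating the values as UTC is an assumption; the original program may well have worked in local time.

```python
from datetime import datetime, timedelta, timezone

def to_utc(epoch_seconds):
    """Convert seconds since the epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)

# Values taken from the first appointment in the sample above
begin = to_utc("1181251680")
end = begin + timedelta(seconds=1800)  # the duration field is in seconds

print(begin.isoformat())
print(end.isoformat())
```

The values arrive as text when read from XML, which is why the helper converts with int() first.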

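# io.BytesIO is the Python 3 counterpart of the old StringIO module
from io import BytesIO

from lxml import etree

#----------------------------------------------------------------------
def parseXML(xmlFile):
    """
    Parse the XML file and print each tag with its text
    """
    # Read the raw bytes and let lxml work out the encoding
    with open(xmlFile, "rb") as f:
        xml = f.read()

    # parse() builds the whole tree; it is not used below
    tree = etree.parse(BytesIO(xml))
    # iterparse() yields an (action, elem) pair for each element
    context = etree.iterparse(BytesIO(xml))
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)

if __name__ == "__main__":
    parseXML("example.xml")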

First off, we import the needed modules: etree from the lxml package and an in-memory file wrapper from the standard library (the StringIO module in Python 2, io.BytesIO in Python 3). Our parseXML function accepts one argument: the path to the XML file in question. We open the file, read it and close it. Now comes the fun part! We wrap the contents in that file-like object and hand it to etree’s parse function. The parse function expects a filename or a file-like object rather than the XML text itself; to parse XML you already hold in a variable, use etree.fromstring instead.
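To make the difference between the two entry points concrete, here is a small sketch. The inline XML snippet is made up for illustration, and the try/except import is only there so the example also runs where lxml is not installed (the standard library’s ElementTree offers the same two calls).

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which shares this API
    import xml.etree.ElementTree as etree

xml = b"<appointment><subject>Bring pizza home</subject></appointment>"

# parse() takes a filename or file-like object and returns a tree
tree = etree.parse(BytesIO(xml))
root_from_parse = tree.getroot()

# fromstring() takes the XML itself and returns the root element directly
root_from_string = etree.fromstring(xml)

print(root_from_parse.tag)
print(root_from_string.findtext("subject"))
```

Either way you end up with the same kind of element object, so the choice mostly depends on whether your XML lives in a file or in a variable.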

Anyway, next we iterate over the context (i.e. the lxml.etree.iterparse object) and print each element’s tag and text. The if statement replaces empty fields with the word “None” to make the output a little clearer. And that’s it.
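One detail worth seeing in action: by default, iterparse only reports “end” events, so each element shows up once its closing tag has been read and children arrive before their parents. Here is a small sketch that asks for “start” events as well. The two-field document is made up for illustration, and the fallback import is only an assumption to keep the example runnable without lxml.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's iterparse behaves the same here
    import xml.etree.ElementTree as etree

xml = b"<appointment><begin>1181251680</begin><state></state></appointment>"

# Record every (action, tag) pair in the order the parser reports them
events = []
for action, elem in etree.iterparse(BytesIO(xml), events=("start", "end")):
    events.append((action, elem.tag))
    print(action, elem.tag)

# With the default settings only the "end" events are reported
default_tags = [elem.tag for action, elem in etree.iterparse(BytesIO(xml))]
print(default_tags)
```

Note that the root element is the last one reported in the default mode, because its closing tag is the last thing in the file.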

Parsing the Book Example

Well, the result of that example was kind of lame. Most of the time, you want to save the data you extract and do something with it, not just print it out to stdout. So for our next example, we’ll create a data structure to contain the results. Our data structure for this example will be a list of dicts. We’ll use the MSDN book example here:



   
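<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in
      detail, with attention to XML DOM interfaces, XSLT processing,
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are
      integrated into a comprehensive development
      environment.</description>
   </book>
</catalog>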

Now let’s parse this and put it in our data structure!

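# io.BytesIO is the Python 3 counterpart of the old StringIO module
from io import BytesIO

from lxml import etree

#----------------------------------------------------------------------
def parseBookXML(xmlFile):
    """
    Parse the book XML into a list of dicts, one dict per book
    """
    # Read the raw bytes and let lxml work out the encoding
    with open(xmlFile, "rb") as f:
        xml = f.read()

    tree = etree.parse(BytesIO(xml))
    # Show the DOCTYPE declaration (empty if the document has none)
    print(tree.docinfo.doctype)
    context = etree.iterparse(BytesIO(xml))
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)
        # Note: this also stores a "book" key when the closing book tag arrives
        book_dict[elem.tag] = text
        if elem.tag == "book":
            # End of one book: save the dict and start a fresh one
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example2.xml")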

This example is pretty similar to our last one, so we’ll just focus on the differences. Right before we start iterating over the context, we create an empty dictionary and an empty list. Then inside the loop, we fill in the dictionary like this:

book_dict[elem.tag] = text

The text is either elem.text or “None”. Finally, if the tag happens to be “book”, then we’re at the end of a book section (iterparse reports each element when its closing tag is read, after all of its children), so we add the dict to our list and reset the dict for the next book. A more realistic example would put the extracted data into a Book class. I have done that with JSON feeds before.
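Here is a rough sketch of what that Book class idea might look like. Everything in it is illustrative: the Book dataclass, the parse_books function and the field names are my own choices, with the fields assumed to match the tag names used in the MSDN sample (author, title, genre, price, publish_date). The fallback import is only there so the sketch runs without lxml.

```python
from dataclasses import dataclass
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree shares the calls used here
    import xml.etree.ElementTree as etree

@dataclass
class Book:
    author: str
    title: str
    genre: str
    price: float
    publish_date: str

def parse_books(source):
    """Build one Book per book element found in source (a path or file object)."""
    books = []
    for action, elem in etree.iterparse(source):
        if elem.tag == "book":
            # The closing book tag arrives after all of its children
            books.append(Book(
                author=elem.findtext("author", default=""),
                title=elem.findtext("title", default=""),
                genre=elem.findtext("genre", default=""),
                price=float(elem.findtext("price") or 0),
                publish_date=elem.findtext("publish_date", default=""),
            ))
    return books

sample = b"""<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
   </book>
</catalog>"""

books = parse_books(BytesIO(sample))
print(books[0].title, books[0].price)
```

Compared with a plain dict, the class gives each field a name you can mistype only once and lets the price be stored as a number instead of text.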

Refactoring the Code

As pointed out by my vigilant readers, I wrote some pretty crappy code. So I have cleaned the code up a bit and hope this is a little better:

from lxml import etree

#----------------------------------------------------------------------
def parseBookXML(xmlFile):
    """
    Parse the book XML into a list of dicts, one dict per book
    """
    # iterparse accepts the file path directly
    context = etree.iterparse(xmlFile)
    book_dict = {}
    books = []
    for action, elem in context:
        if not elem.text:
            text = "None"
        else:
            text = elem.text
        print(elem.tag + " => " + text)
        book_dict[elem.tag] = text
        if elem.tag == "book":
            # End of one book: save the dict and start a fresh one
            books.append(book_dict)
            book_dict = {}
    return books

if __name__ == "__main__":
    parseBookXML("example2.xml")

As you can see, we dropped the manual file reading and the in-memory wrapper entirely and let lxml handle the file I/O: iterparse accepts a file path directly. The rest is the same. Cool, huh? As usual, Python rocks!
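One more possible tidy-up, offered as a sketch rather than as part of the refactoring above: since a “book” event only fires once all of that book’s children have been read, the dict can be built from the children directly. That avoids the shared book_dict and the stray “book” key it picks up. The function name parse_book_dicts and the inline sample are mine, and the fallback import is only there so the sketch runs without lxml.

```python
from io import BytesIO

try:
    from lxml import etree
except ImportError:
    # Assumption: the standard library's ElementTree shares the calls used here
    import xml.etree.ElementTree as etree

def parse_book_dicts(source):
    """Return one dict per book element, built from that book's children."""
    books = []
    for action, elem in etree.iterparse(source):
        if elem.tag == "book":
            # Children are complete by the time the closing book tag is seen
            books.append({child.tag: child.text or "None" for child in elem})
            elem.clear()  # drop the children we no longer need
    return books

sample = b"""<catalog>
   <book><author>Ralls, Kim</author><title>Midnight Rain</title></book>
   <book><author>Corets, Eva</author><title>Maeve Ascendant</title><description></description></book>
</catalog>"""

result = parse_book_dicts(BytesIO(sample))
for book in result:
    print(book)
```

Calling elem.clear() after each book keeps memory use flat on large files, which is the main reason to reach for iterparse in the first place.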

Wrapping Up

Did you learn anything in this article? I certainly hope so. Python has lots of cool parsing libraries both in its standard library and outside of it. Be sure to check them out and see which one fits your way of programming the best.

5 thoughts on “Python: Parsing XML with lxml”

  1. The next time you read the contents of a file into a variable only to turn around and put those contents back into a file like object, I’m going to strangle you! 🙂

    Either go with… etree.parse(open('file.xml')) … or if you’re really insistent on reading a file out to a variable, then just use etree.fromstring(myvar)

  2. Thanks to both you and brutimus, I was inspired to try to fix the code. I’ve added another section with some refactored code that hopefully won’t “offend” anyone else. Thanks a lot for the constructive feedback!

  3. I do show an example of the parse command…but not quite in the way you’re talking about. Thanks for the suggestion though. I’m still a little green when it comes to XML parsing, I guess.
