Python: Parsing XML with minidom

If you’re a longtime reader, you may remember that I started programming in Python in 2006. Within a year or so, my employer decided to move away from Microsoft Exchange to the open source Zimbra client. Zimbra is an alright client, but it was missing a good way to alert users that they had an appointment coming up, so I had to create a way to query Zimbra for that information and show a dialog. What does all this mumbo jumbo have to do with XML though? Well, I thought that using XML would be a great way to keep track of which appointments had been added, deleted, snoozed or whatever. It turned out that I was wrong, but that’s not the point of this story.

In this article, we’re going to look at my first foray into parsing XML with Python. If you do a little research on this topic, you’ll soon discover that Python ships with XML parsers in the xml package of its standard library. I ended up using the minidom sub-component of that package…at least at first. Eventually I switched to lxml, which offers an ElementTree-compatible API, but that’s outside the scope of this article. Let’s take a quick look at some ugly XML that I came up with:

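<?xml version="1.0" ?>
<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <alarmTime>1181572063</alarmTime>
        <state></state>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>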

Now we know what I needed to parse. Let’s take a look at the typical way of parsing something like this using minidom in Python.

import xml.dom.minidom
import urllib.request

class ApptParser(object):

    def __init__(self, url, flag='url'):
        self.list = []        # fields of the appointment currently being read
        self.appt_list = []   # one list of fields per appointment
        self.flag = flag      # 'file' also reads the state and alarmTime tags
        self.rem_value = 0    # value of the root element's reminder attribute
        xml = self.getXml(url)
        self.handleXml(xml)

    def getXml(self, url):
        # Try the argument as a URL first; if that fails, treat it as a file path
        try:
            f = urllib.request.urlopen(url)
        except (ValueError, IOError):
            f = url
        doc = xml.dom.minidom.parse(f)
        node = doc.documentElement
        if node.nodeType == xml.dom.Node.ELEMENT_NODE:
            print('Element name: %s' % node.nodeName)
            for (name, value) in node.attributes.items():
                if name == 'reminder':
                    self.rem_value = value

        return node

    def handleXml(self, xml):
        # xml is the root element returned by getXml
        appointments = xml.getElementsByTagName("appointment")
        self.handleAppts(appointments)

    def getElement(self, element):
        return self.getText(element.childNodes)

    def handleAppts(self, appts):
        for appt in appts:
            self.handleAppt(appt)
            # Start a fresh list for the next appointment
            self.list = []

    def handleAppt(self, appt):
        begin     = self.getElement(appt.getElementsByTagName("begin")[0])
        duration  = self.getElement(appt.getElementsByTagName("duration")[0])
        subject   = self.getElement(appt.getElementsByTagName("subject")[0])
        location  = self.getElement(appt.getElementsByTagName("location")[0])
        uid       = self.getElement(appt.getElementsByTagName("uid")[0])

        self.list.append(begin)
        self.list.append(duration)
        self.list.append(subject)
        self.list.append(location)
        self.list.append(uid)
        if self.flag == 'file':
            try:
                state     = self.getElement(appt.getElementsByTagName("state")[0])
                self.list.append(state)
                alarm     = self.getElement(appt.getElementsByTagName("alarmTime")[0])
                self.list.append(alarm)
            except IndexError as e:
                # One of the optional tags is missing from this appointment
                print(e)

        self.appt_list.append(self.list)

    def getText(self, nodelist):
        # Concatenate the data of every text node in nodelist
        rc = ""
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc + node.data
        return rc

If I recall correctly, this code was based on an example from the Python documentation (or maybe a chapter in Dive Into Python). I still don’t like this code. The url parameter you see in the ApptParser class can be either a URL or a file path. I had an XML feed from Zimbra that I would check periodically for changes and compare to the last copy of that XML that I had downloaded. If there was something new, I would add the changes to the downloaded copy. Anyway, let’s unpack this code a little.

In the getXml method, we use an exception handler to try to open the URL. If that raises an error, then we assume that the url argument is actually a file path. Next we use minidom’s parse function to parse the XML. Then we pull the root node (documentElement) out of the parsed document. We’ll ignore the conditional as it isn’t important to this discussion (it has to do with my program). Finally, we return the node object.
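Here is a minimal sketch of the parse-and-grab-the-root steps, assuming Python 3. To keep it runnable on its own, it parses an inline string with minidom's parseString function instead of opening a URL or file. The sample string and the read_reminder name are made up for illustration and are not part of my original program.

```python
import xml.dom.minidom as minidom

# Hypothetical sample data, shaped like the appointment XML above.
SAMPLE = """<zAppointments reminder="15">
    <appointment>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>"""


def read_reminder(xml_text):
    """Return the root tag name and its 'reminder' attribute."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement  # the outermost element of the document
    # getAttribute returns an empty string when the attribute is missing
    return root.nodeName, root.getAttribute("reminder")


if __name__ == "__main__":
    name, reminder = read_reminder(SAMPLE)
    print(name)
    print(reminder)
```

Using getAttribute here is a shortcut for the loop over node.attributes.items() in the class above; both read the same attribute.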

The node we get back is the root element of the XML, and we pass it on to the handleXml method. To grab all the appointment instances in the XML, we do this: xml.getElementsByTagName("appointment"). Then we pass that information to the handleAppts method. Yes, there is a lot of passing around various values here and there. It drove me crazy trying to follow this and debug it later on. Anyway, all the handleAppts method does is loop over each appointment and call the handleAppt method to pull some additional information out of it, add the data to a list and add that list to another list. The idea was to end up with a list of lists that held all the pertinent data regarding my appointments.

You will notice that the handleAppt method calls the getElement method which calls the getText method. I don’t know why the original author did it that way. I would have just called the getText method and skipped the getElement one. I guess that can be an exercise for you, dear reader.
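One way to approach that exercise is sketched below, assuming Python 3. It is not the original author's solution, just a compact variation: a single hypothetical helper, text_of, looks up a child tag and joins its text nodes, and parse_appointments builds the list of lists. The inline sample and both function names are invented for this illustration.

```python
import xml.dom.minidom as minidom

# Hypothetical sample with the same tag names the parser class looks for.
SAMPLE = """<zAppointments reminder="15">
    <appointment>
        <begin>1181251680</begin>
        <uid>040000008200E000</uid>
        <location></location>
        <duration>1800</duration>
        <subject>Bring pizza home</subject>
    </appointment>
</zAppointments>"""

# Same field order that handleAppt uses when it fills its list.
FIELDS = ["begin", "duration", "subject", "location", "uid"]


def text_of(parent, tag):
    """Join the text nodes of the first <tag> element under parent."""
    element = parent.getElementsByTagName(tag)[0]
    return "".join(node.data for node in element.childNodes
                   if node.nodeType == node.TEXT_NODE)


def parse_appointments(xml_text):
    """Return one list of field values per <appointment> element."""
    doc = minidom.parseString(xml_text)
    return [[text_of(appt, tag) for tag in FIELDS]
            for appt in doc.getElementsByTagName("appointment")]


if __name__ == "__main__":
    for row in parse_appointments(SAMPLE):
        print(row)
```

An empty element such as location has no child nodes, so text_of returns an empty string for it. A tag that is missing altogether would still raise an IndexError, just as it does in the class.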

Now you know the basics of parsing with minidom. Personally I never liked this method, so I decided to try to come up with a cleaner way of parsing XML with minidom.

Making minidom Easier to Follow

I’m not going to claim that my code is any good, but I will say that I think I came up with something much easier to follow. I’m sure some will argue that the code is not as flexible, but oh well. Here’s a new XML example that we will parse (found on MSDN):

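<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   </book>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-11-17</publish_date>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   </book>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-03-10</publish_date>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
      Ascendant.</description>
   </book>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2001-09-10</publish_date>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   </book>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-09-02</publish_date>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   </book>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <genre>Romance</genre>
      <price>4.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   </book>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <genre>Horror</genre>
      <price>4.95</price>
      <publish_date>2000-12-06</publish_date>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   </book>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <price>6.95</price>
      <publish_date>2000-11-02</publish_date>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   </book>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-09</publish_date>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
   </book>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>36.95</price>
      <publish_date>2000-12-01</publish_date>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
   </book>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <genre>Computer</genre>
      <price>49.95</price>
      <publish_date>2001-04-16</publish_date>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 
      environment.</description>
   </book>
</catalog>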

For this example, we’ll just parse the XML, extract the book titles and print them to stdout. Are you ready? Here we go!

import xml.dom.minidom as minidom

#----------------------------------------------------------------------
def getTitles(xml):
    """
    Print out all titles found in xml
    """
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")

    # Collect the first <title> element of every <book>
    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
        titles.append(titleObj)

    # Each title is an element object; its text lives in its child text nodes
    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
                print(node.data)

if __name__ == "__main__":
    document = 'example.xml'
    getTitles(document)

This code is just one short function that accepts one argument, the path to the XML file. We import the minidom module under the shorter name minidom to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as in the previous example. We use getElementsByTagName to grab the book elements, then iterate over the result and pull the first title element out of each one. That gives us title element objects rather than strings, so we need to iterate over those as well and pull out the plain text, which is what the second for loop, with its nested loop over the child nodes, is for.
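If you would rather get the titles back as a list than print them, a variation along these lines should work. It is only a sketch, assuming Python 3: get_title_list is a made-up name, and it parses a two-book inline string (a trimmed stand-in for the sample file) so that it can run without example.xml.

```python
import xml.dom.minidom as minidom

# Trimmed stand-in for the book catalog; only the tags used here are kept.
SAMPLE = """<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
   </book>
</catalog>"""


def get_title_list(xml_text):
    """Return the text of the first <title> in each <book>."""
    doc = minidom.parseString(xml_text)
    titles = []
    for book in doc.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        # Collect only the text nodes, as the nested loop above does
        text = "".join(node.data for node in title.childNodes
                       if node.nodeType == node.TEXT_NODE)
        titles.append(text)
    return titles


if __name__ == "__main__":
    print(get_title_list(SAMPLE))
```

Returning a list instead of printing makes the function easier to reuse and to test, at the cost of one extra line in the caller.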

That’s it. There is no more.

Wrapping Up

Well, I hope this rambling article taught you a thing or two about parsing XML with Python’s built-in XML parser. We will be looking at XML parsing some more in future articles. If you have a method or module that you like, feel free to point me to it and I’ll take a look.
