Python: Parsing XML with minidom

If you’re a long time reader, you may remember that I started programming Python in 2006. Within a year or so, my employer decided to move away from Microsoft Exchange to the open source Zimbra client. Zimbra is an alright client, but it was missing a good way to alert the user to the fact that they had an appointment coming up, so I had to create a way to query Zimbra for that information and show a dialog. What does all this mumbo jumbo have to do with XML though? Well, I thought that using XML would be a great way to keep track of which appointments had been added, deleted, snoozed or whatever. It turned out that I was wrong, but that’s not the point of this story.

In this article, we’re going to look at my first foray into parsing XML with Python. If you do a little research on this topic, you’ll soon discover that Python has an XML parser built into the language in its xml module. I ended up using the minidom sub-component of that module…at least at first. Eventually I switched to lxml, which uses ElementTree, but that’s outside the scope of this article. Let’s take a quick look at some ugly XML that I came up with:

<?xml version="1.0" ?>
<zAppointments reminder="15">
        <subject>Bring pizza home</subject>

Now we know what I needed to parse. Let’s take a look at the typical way of parsing something like this using minidom in Python.

import xml.dom.minidom
import urllib2
class ApptParser(object):
    def __init__(self, url, flag='url'):
        self.list = []
        self.appt_list = []        
        self.flag = flag
        self.rem_value = 0
        xml = self.getXml(url) 
        print "xml"
        print xml
    def getXml(self, url):
            print url
            f = urllib2.urlopen(url)
            f = url
        #print f
        doc = xml.dom.minidom.parse(f)
        node = doc.documentElement        
        if node.nodeType == xml.dom.Node.ELEMENT_NODE:
            print 'Element name: %s' % node.nodeName
            for (name, value) in node.attributes.items():
                #print '    Attr -- Name: %s  Value: %s' % (name, value)
                if name == 'reminder':
                    self.rem_value = value                    
        return node
    def handleXml(self, xml):
        rem = xml.getElementsByTagName('zAppointments')        
        appointments = xml.getElementsByTagName("appointment")
    def getElement(self, element):
        return self.getText(element.childNodes)
    def handleAppts(self, appts):
        for appt in appts:
            self.list = []
    def handleAppt(self, appt):
        begin     = self.getElement(appt.getElementsByTagName("begin")[0])
        duration  = self.getElement(appt.getElementsByTagName("duration")[0])
        subject   = self.getElement(appt.getElementsByTagName("subject")[0])
        location  = self.getElement(appt.getElementsByTagName("location")[0])
        uid       = self.getElement(appt.getElementsByTagName("uid")[0])
        if self.flag == 'file':
                state     = self.getElement(appt.getElementsByTagName("state")[0])
                alarm     = self.getElement(appt.getElementsByTagName("alarmTime")[0])
            except Exception, e:
                print e
    def getText(self, nodelist):
        rc = ""
        for node in nodelist:
            if node.nodeType == node.TEXT_NODE:
                rc = rc +
        return rc

If I recall correctly, this code was based on an example from the Python documentation (or maybe a chapter in Dive Into Python). I still don’t like this code. The url parameter you see in the ApptParser class can be either a url or a file. I had an XML feed from Zimbra that I would check periodically for changes and compare it to the last copy of that XML that I had downloaded. If there was something new, I would add the changes to the downloaded copy. Anyway, let’s unpack this code a little.

In the getXml, we use an exception handler to try and open the url. If it happens to raise an error, than we assume that the url is actually a file path. Next we use minidom’s parse method to parse the XML. Then we pull out a node from the XML. We’ll ignore the conditional as it isn’t important to this discussion (it has to do with my program). Finally, we return the node object.

Technically, the node is XML and we pass it on to the handleXml. To grab all the appointment instances in the XML, we do this: xml.getElementsByTagName(“appointment”). Then we pass that information to the handleAppts method. Yes, there is a lot of passing around various values here and there. It drove me crazy trying to follow this and debug it later on. Anyway, all the handleAppts method does is loop over each appointment and call the handleAppt method to pull some additional information out of it, add the data to a list and add that list to another list. The idea was to end up with a list of lists that held all the pertinent data regarding my appointments.

You will notice that the handleAppt method calls the getElement method which calls the getText method. I don’t know why the original author did it that way. I would have just called the getText method and skipped the getElement one. I guess that can be an exercise for you, dear reader.

Now you know the basics of parsing with minidom. Personally I never liked this method, so I decided to try to come up with a cleaner way of parsing XML with minidom.

Making minidom Easier to Follow

I’m not going to claim that my code is any good, but I will say that I think I came up with something much easier to follow. I’m sure some will argue that the code is not as flexible, but oh well. Here’s a new XML example that we will parse (found on MSDN):

<?xml version="1.0"?>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <description>An in-depth look at creating applications 
      with XML.</description>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <description>After the collapse of a nanotechnology 
      society in England, the young survivors lay the 
      foundation for a new society.</description>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <description>In post-apocalypse England, the mysterious 
      agent known only as Oberon helps to create a new life 
      for the inhabitants of London. Sequel to Maeve 
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <description>The two daughters of Maeve, half-sisters, 
      battle one another for control of England. Sequel to 
      Oberon's Legacy.</description>
   <book id="bk106">
      <author>Randall, Cynthia</author>
      <title>Lover Birds</title>
      <description>When Carla meets Paul at an ornithology 
      conference, tempers fly as feathers get ruffled.</description>
   <book id="bk107">
      <author>Thurman, Paula</author>
      <title>Splish Splash</title>
      <description>A deep sea diver finds true love twenty 
      thousand leagues beneath the sea.</description>
   <book id="bk108">
      <author>Knorr, Stefan</author>
      <title>Creepy Crawlies</title>
      <description>An anthology of horror stories about roaches,
      centipedes, scorpions  and other insects.</description>
   <book id="bk109">
      <author>Kress, Peter</author>
      <title>Paradox Lost</title>
      <genre>Science Fiction</genre>
      <description>After an inadvertant trip through a Heisenberg
      Uncertainty Device, James Salway discovers the problems 
      of being quantum.</description>
   <book id="bk110">
      <author>O'Brien, Tim</author>
      <title>Microsoft .NET: The Programming Bible</title>
      <description>Microsoft's .NET initiative is explored in 
      detail in this deep programmer's reference.</description>
   <book id="bk111">
      <author>O'Brien, Tim</author>
      <title>MSXML3: A Comprehensive Guide</title>
      <description>The Microsoft MSXML3 parser is covered in 
      detail, with attention to XML DOM interfaces, XSLT processing, 
      SAX and more.</description>
   <book id="bk112">
      <author>Galos, Mike</author>
      <title>Visual Studio 7: A Comprehensive Guide</title>
      <description>Microsoft Visual Studio 7 is explored in depth,
      looking at how Visual Basic, Visual C++, C#, and ASP+ are 
      integrated into a comprehensive development 

For this example, we’ll just parse the XML, extract the book titles and print them to stdout. Are you ready? Here we go!

import xml.dom.minidom as minidom
def getTitles(xml):
    Print out all titles found in xml
    doc = minidom.parse(xml)
    node = doc.documentElement
    books = doc.getElementsByTagName("book")
    titles = []
    for book in books:
        titleObj = book.getElementsByTagName("title")[0]
    for title in titles:
        nodes = title.childNodes
        for node in nodes:
            if node.nodeType == node.TEXT_NODE:
if __name__ == "__main__":
    document = 'example.xml'

This code is just one short function that accepts one argument, the XML file. We import the minidom module and give it the same name to make it easier to reference. Then we parse the XML. The first two lines in the function are pretty much the same as the previous example. We use getElementsByTagName to grab the parts of the XML that we want, then iterate over the result and extract the book titles from them. This actually extracts title objects, so we need to iterate over that as well and pull out the plain text, which is what the second nested for loop is for.

That’s it. There is no more.

Wrapping Up

Well, I hope this rambling article taught you a thing or two about parsing XML with Python’s builtin XML parser. We will be looking at XML parsing some more in future articles. If you have a method or module that you like, feel free to point me to it and I’ll take a look.

Additional Reading

Print Friendly, PDF & Email