2009-12-23

Parsing XML Streams with SAX

DOMs rule. You can load the entire XML into memory and traipse random-accessy all over the DOM tree.

But what if your entire DOM won't fit in memory? No, "Hit the pub" is not an acceptable answer in this case. (If it is, it's a coincidence.)

What you can do is use an "event-based" XML parser. This parser will read your XML file and execute callbacks at important places, like when a tag begins and when it ends. A couple of such parsers are Expat and SAX. I'll use SAX for this example because it's quite popular, though the mechanism is similar in Expat.

So far, so good. You have your file being read in, and you're getting callbacks when it gets to each of your tags. But now let's make it a little bit complex and say your XML contains something like this:

<company>
    <name>Jorgensen Industrial</name>
    <emplist>
        <employee>
            <name>Beej Jorgensen</name>
            ...

Hey, now we have two <name> elements in there, one for employee and one for company. How will we tell them apart when we get our callbacks?

One thing you could do is have a boolean type that tracks whether or not you're inside an <employee> element, and then you'll know to do the right thing when you get the <name>. But that's a lot of specialized work, and a more general solution could be useful.

To this end, what you can do is use a stack to keep track of tags you've seen so far. When you get a "tag start" callback, you push that tag onto the stack, and when you get a "tag end" callback you pop the stack.

Then if you need to know your parent element's name, you can just look at the top of the stack. If you want to know if you're the ancestor of some distant parent, you can search the stack for that parent. More complex matching is possible, as well, if you look at specific places back on the stack.

That's a lot of power you get for basically two lines of overhead code.

This works because the XML describes data that is in a tree (with the "root element" being the root of the tree), and as you read the XML data, you are effectively doing a pre-order traversal of that tree. And keeping a stack like this keeps a "route" back to the root of the tree.

Here's some sample code written in Python that does exactly this. At the beginning of each element, it prints out a variety of information about the elements encountered so far.

#!/usr/bin/python
# this code is hereby granted to the public domain

import sys
import xml.sax

# -------------------------------------------------------------
class MyHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)  # super constructor
        self.path = []                         # do some init

    def getParentElement(self):
        """Return the immediate parent element name."""
        if len(self.path) == 0:
            return None
        return self.path[-1]

    def descendentOf(self, name):
        """Return True if the current path contains the given name."""
        try:
            self.path.index(name)
            return True
        except:
            return False

    def getPathStr(self):
        """Return the path as a string."""
        return '/'.join(self.path)

    def startElement(self, name, attrs):
        """Handle the start of an element and do normal processing."""
        sys.stdout.write('<%s> ' % name)
        sys.stdout.write(' path=%s ' % self.getPathStr())
        sys.stdout.write(' parent=%s ' % self.getParentElement())
        sys.stdout.write(' descOf(emplist)=%s\n' % \
            self.descendentOf('emplist'))  # normal processing

        self.path.append(name)  # track the path

    def endElement(self, name):
        """Handle the end of the element."""
        self.path.pop()  # track the path

# MAIN --------------------------------------------------------

someXml = """<company>
    <name>Jorgensen Parts</name>
    <emplist>
        <employee>
            <name>Beej Jorgensen</name>
            <salary>B$3490</salary>
        </employee>
    </emplist>
</company>"""

handler = MyHandler()
xml.sax.parseString(someXml, handler)</pre>

Output:

<company>  path=  parent=None  descOf(emplist)=False
<name>  path=company  parent=company  descOf(emplist)=False
<emplist>  path=company  parent=company  descOf(emplist)=False
<employee>  path=company/emplist  parent=emplist  descOf(emplist)=True
<name>  path=company/emplist/employee  parent=employee  descOf(emplist)=True
<salary>  path=company/emplist/employee  parent=employee  descOf(emplist)=True</pre>

Beej's Bit Bucket

⚡ Tech and Programming Fun

Parsing XML Streams with SAX