2009-12-23
Parsing XML Streams with SAX
DOMs rule. You can load the entire XML into memory and traipse random-accessy all over the DOM tree.
But what if your entire DOM won't fit in memory? No, "Hit the pub" is not an acceptable answer in this case. (If it is, it's a coincidence.)
What you can do is use an "event-based" XML parser. This parser will read your XML file and execute callbacks at important places, like when a tag begins and when it ends. A couple of such parsers are Expat and SAX. I'll use SAX for this example because it's quite popular, though the mechanism is similar in Expat.
So far, so good. You have your file being read in, and you're getting callbacks when it gets to each of your tags. But now let's make it a little bit complex and say your XML contains something like this:
<company>
<name>Jorgensen Industrial</name>
<emplist>
<employee>
<name>Beej Jorgensen</name>
...
Hey, now we have two <name>
elements in there, one for employee
and
one for company
. How will we tell them apart when we get our
callbacks?
One thing you could do is have a boolean type that tracks whether or not
you're inside an <employee>
element, and then you'll know to do the
right thing when you get the <name>
. But that's a lot of specialized
work, and a more general solution could be useful.
To this end, what you can do is use a stack to keep track of tags you've seen so far. When you get a "tag start" callback, you push that tag onto the stack, and when you get a "tag end" callback you pop the stack.
Then if you need to know your parent element's name, you can just look at the top of the stack. If you want to know if you're the ancestor of some distant parent, you can search the stack for that parent. More complex matching is possible, as well, if you look at specific places back on the stack.
That's a lot of power you get for basically two lines of overhead code.
This works because the XML describes data that is in a tree (with the "root element" being the root of the tree), and as you read the XML data, you are effectively doing a pre-order traversal of that tree. And keeping a stack like this keeps a "route" back to the root of the tree.
Here's some sample code written in Python that does exactly this. At the beginning of each element, it prints out a variety of information about the elements encountered so far.
#!/usr/bin/python
# this code is hereby granted to the public domain
import sys
import xml.sax
# -------------------------------------------------------------
class MyHandler(xml.sax.ContentHandler):
def __init__(self):
xml.sax.ContentHandler.__init__(self) # super constructor
self.path = [] # do some init
def getParentElement(self):
"""Return the immediate parent element name."""
if len(self.path) == 0:
return None
return self.path[-1]
def descendentOf(self, name):
"""Return True if the current path contains the given name."""
try:
self.path.index(name)
return True
except:
return False
def getPathStr(self):
"""Return the path as a string."""
return '/'.join(self.path)
def startElement(self, name, attrs):
"""Handle the start of an element and do normal processing."""
sys.stdout.write('<%s> ' % name)
sys.stdout.write(' path=%s ' % self.getPathStr())
sys.stdout.write(' parent=%s ' % self.getParentElement())
sys.stdout.write(' descOf(emplist)=%s\n' % \
self.descendentOf('emplist')) # normal processing
self.path.append(name) # track the path
def endElement(self, name):
"""Handle the end of the element."""
self.path.pop() # track the path
# MAIN --------------------------------------------------------
someXml = """<company>
<name>Jorgensen Parts</name>
<emplist>
<employee>
<name>Beej Jorgensen</name>
<salary>B$3490</salary>
</employee>
</emplist>
</company>"""
handler = MyHandler()
xml.sax.parseString(someXml, handler)</pre>
Output:
<company> path= parent=None descOf(emplist)=False
<name> path=company parent=company descOf(emplist)=False
<emplist> path=company parent=company descOf(emplist)=False
<employee> path=company/emplist parent=emplist descOf(emplist)=True
<name> path=company/emplist/employee parent=employee descOf(emplist)=True
<salary> path=company/emplist/employee parent=employee descOf(emplist)=True</pre>