Wednesday, April 11, 2007

Effective XML processing with DOM and XPath in Java









Level: Intermediate

Parand Tony Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services

01 Dec 2001
Updated 01 May 2002

Based on an analysis of several large XML projects, this article examines how to make effective and efficient use of DOM in Java. The DOM offers a flexible and powerful means for creating, processing, and manipulating XML documents, but it can be awkward to use and can lead to brittle and buggy code. Author Parand Tony Darugar provides a set of Java usage patterns and a library of functions to make DOM robust and easy to use.

This article was updated with a correction to Figure 1: DOM Representation of an XML Document.

The Document Object Model (DOM) is a recognized W3C standard for platform- and language-neutral dynamic access and update of the content, structure, and style of XML documents. It defines a standard set of interfaces for representing documents, as well as a standard set of methods for accessing and manipulating them. The DOM enjoys significant support and popularity, and it is implemented in a wide variety of languages, including Java, Perl, C, C++, VB, Tcl, and Python.

As I'll demonstrate in this article, DOM is an excellent choice for XML handling when stream-based models (such as SAX) are not sufficient. Unfortunately, several aspects of the specification, such as its language-neutral interface and its use of the "everything-is-a-node" abstraction, make it difficult to use and prone to generating brittle code. This was particularly evident in a recent review of several large DOM projects that were created by a variety of developers over the past year. The common problems, and their remedies, are discussed below.

The Document Object Model

The DOM specification is designed to be usable with any programming language. Therefore, it attempts to use a common, core set of features that are available in all languages. The DOM specification also attempts to remain neutral in its interface definitions. This allows Java programmers to apply their DOM knowledge when working with Visual Basic or Perl, and vice versa.

The specification also treats every part of the document as a node, consisting of a type and a value. This provides an elegant conceptual framework for dealing with all aspects of the document. As an example, the following XML fragment

 <paragraph align="left">the <it>Italicized</it> portion.</paragraph>


is represented via the following DOM structure:


Figure 1: DOM Representation of an XML Document

Each of the Document, Element, Text, and Attr pieces of the tree is a DOM Node.
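The node structure of the fragment can also be inspected programmatically. The following is a minimal sketch that parses the fragment with the JAXP parser bundled with the JDK and prints one line per node; the class name DomTreeDump and its helper methods are illustrative assumptions, not part of the article's sample code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class DomTreeDump {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns one line per node: indentation, DOM interface, node name and, if present, node value. */
    static String dump(Node node) {
        StringBuilder out = new StringBuilder();
        dump(node, "", out);
        return out.toString();
    }

    private static void dump(Node node, String indent, StringBuilder out) {
        line(out, indent, node);
        // Attributes are not children of their element; they live in a separate map.
        NamedNodeMap attrs = node.getAttributes();
        if (attrs != null) {
            for (int i = 0; i < attrs.getLength(); i++) {
                line(out, indent + "  ", attrs.item(i));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            dump(child, indent + "  ", out);
        }
    }

    private static void line(StringBuilder out, String indent, Node node) {
        out.append(indent).append(typeName(node)).append(' ').append(node.getNodeName());
        if (node.getNodeValue() != null) {
            out.append(" = \"").append(node.getNodeValue()).append('"');
        }
        out.append('\n');
    }

    private static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:  return "Document";
            case Node.ELEMENT_NODE:   return "Element";
            case Node.ATTRIBUTE_NODE: return "Attr";
            case Node.TEXT_NODE:      return "Text";
            default:                  return "Node";
        }
    }

    public static void main(String[] args) {
        System.out.print(dump(parse(
                "<paragraph align=\"left\">the <it>Italicized</it> portion.</paragraph>")));
    }
}
```

Note that the text "the ", the it element, and the text " portion." appear as three sibling children of paragraph, which is why reading an element's text requires more than a single call.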

The elegant abstraction does come at a price. Consider the XML fragment <tagname>Value</tagname>. You may think the text value would be represented by a normal Java String object and be accessible via a simple getValue call. In fact, the text is treated as one or more child Nodes under the tagname node. Thus, in order to get the text value, you need to traverse the children of tagname, collating the value of each into a string. There is good reason for this: tagname may contain other embedded XML elements, in which case getting its text value makes less sense. In the real world, however, we have seen frequent coding errors caused by the lack of a convenient function to cover the 80% of cases where it does make sense.

Design issues

The downside of DOM's language neutrality is that the methodologies and patterns that are normally used in each programming language cannot be employed. For example, instead of being able to create new Elements using the familiar Java new construct, developers must use factory methods. Collections of Nodes are represented as NodeLists, instead of the normal List or Iterator objects. These minor inconveniences add up to unusual coding practices and increased lines of code, and they force the programmer to learn the DOM way of doing things in place of the intuitive way.

DOM employs an everything-is-a-node abstraction. This means that almost every piece of the XML document, such as Document, Element, and Attr, inherits from (extends) the Node interface. This is not only conceptually elegant, but also allows each different implementation of DOM to expose its own classes via the standard interfaces, without the performance loss of going through intermediate wrapper classes.

The everything-is-a-node abstraction loses some value because of the number of node types that exist, and because of the lack of uniformity present in their access methods. For example, the setData method is used to set the value of CharacterData nodes, while the value of Attr (attribute) nodes is set using the setValue method. By presenting different interfaces for the different nodes, the uniformity and elegance of the model is diminished, and the learning curve is increased.
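These design points are easier to see in code. The following is a minimal sketch using only the standard org.w3c.dom interfaces; the class name DomIdioms, its methods, and the list/item document it builds are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

// Hypothetical class, used only for illustration.
final class DomIdioms {

    /** Builds <list kind="short"><item>one</item><item>two</item></list> with DOM factory methods. */
    static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        // There is no "new Element(...)": every node is created by its owner document.
        Element list = doc.createElement("list");
        doc.appendChild(list);

        // Attr nodes take their value through setValue ...
        Attr kind = doc.createAttribute("kind");
        kind.setValue("short");
        list.setAttributeNode(kind);

        for (String label : new String[] {"one", "two"}) {
            Element item = doc.createElement("item");
            Text text = doc.createTextNode("");
            // ... while Text, a CharacterData node, uses setData.
            text.setData(label);
            item.appendChild(text);
            list.appendChild(item);
        }
        return doc;
    }

    /** NodeList is index-based; it is neither a java.util.List nor an Iterable. */
    static String joinItems(Document doc) {
        NodeList items = doc.getDocumentElement().getChildNodes();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.getLength(); i++) {
            if (i > 0) {
                out.append(',');
            }
            out.append(items.item(i).getFirstChild().getNodeValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinItems(build()));
    }
}
```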





JDOM

JDOM is an effort to adapt the DOM API for Java, providing a more natural and easy-to-use interface. Recognizing the cumbersome nature of the language-neutral DOM constructs, JDOM aims to use native Java representations and objects, and provide convenience functions for common tasks.

For example, JDOM directly tackles the everything-is-a-node abstraction and the use of DOM-specific constructs (for example, NodeList). JDOM defines the different node types (for example, Document, Element, and Attribute) as separate Java classes, which means developers can construct them using new, obviating the need for frequent type casts. JDOM represents strings as Java Strings, and collections of nodes via the normal List and Iterator classes. (JDOM substitutes its own classes for the DOM classes.)

JDOM does a very good job of providing a better interface. It has been accepted as a JSR (a formal Java Specification Request), and it may well be incorporated into the core Java platform in the future. However, because it is not yet a part of the core Java APIs, some are hesitant to use it. There have also been reports of performance issues related to the frequent creation of Iterators and Java objects (see Resources).

If you are comfortable with the acceptance and availability of JDOM, and if you don't have a direct need to move Java code or programmers to other languages, JDOM is a good option to explore. The companies whose projects we examined for this article were not yet comfortable with JDOM, and therefore used plain vanilla DOM. This article does the same.





Common coding problems

An analysis of several large XML projects revealed some common problems in working with the DOM. A few of these are presented below.

Code bloat

In all of the projects that we looked at in our review, an overarching problem presented itself: It took many lines of code to do simple things. In one example, 16 lines of code were used to check the value of an attribute. The same task, with improved robustness and error handling, can be accomplished in three lines of code. The low-level nature of the DOM API, incorrect application of methods and programming patterns, and lack of knowledge of the full API all contributed to the increase in the number of code lines. The following summary presents specific instances of these issues.

Traversing the DOM

In the code we examined, the most common task was to traverse, or search, the DOM. Listing 1 shows a condensed version of the code required to find a node called "header" under the config section of the document:
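Listing 1 is not reproduced in this copy. The sketch below is a reconstruction from the description that follows, not the original code: the class name FragileTraversal, the parse helper, the cast to Element, and the sample/config/header document shape are all assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reconstruction of the fragile traversal pattern; do not copy into real code.
final class FragileTraversal {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns true if a "header" element is found under the first child of the root. */
    static boolean hasHeader(Document doc) {
        Element root = doc.getDocumentElement();
        // Assumes the first child is the config element; fails with a
        // ClassCastException if it is a whitespace Text node instead.
        Element configNode = (Element) root.getFirstChild();
        // Fails with a NullPointerException if the root has no children at all.
        NodeList childNodes = configNode.getChildNodes();
        for (int i = 0; i < childNodes.getLength(); i++) {
            Node child = childNodes.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && ((Element) child).getTagName().equals("header")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config><header>h</header></config></sample>");
        System.out.println(hasHeader(doc));
    }
}
```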

In Listing 1, the document is traversed from the root by retrieving the top element, getting its first child (configNode), and finally by examining configNode's children individually. Unfortunately, this method is not only quite verbose, but it's also fraught with fragility and the potential for bugs.

As an example, the second line of the code gets the intermediate config node using the getFirstChild method. Already, a multitude of potential problems exist. The first child of the root node may not actually be the node the user is searching for. By blindly following the first child, I have ignored the actual name of the tag and will potentially be searching the incorrect part of the document. A frequent error in this scenario occurs when the source XML document contains whitespace or a carriage return after the root node; the first child of the root node is actually a Node.TEXT_NODE node, not the intended element node. You can experiment with this yourself by downloading the sample code from Resources and editing the sample.xml file to put a carriage return between the sample and config tags. The code immediately breaks with an exception. To correctly navigate to the intended node, I need to examine each of root's child nodes until I find one that is not a Text node and that has the name I'm looking for.

Listing 1 also ignores the possibility that the document may have a different structure from what I'm expecting. If the root doesn't have any child nodes, for example, configNode will be set to null, and the third line of the example will raise an error. Therefore, to navigate the document properly, not only do I have to examine each child node individually and check for the appropriate name, but at every step I also have to check to make sure each method call returned a valid value. Writing robust, error-free code that can handle arbitrary input requires both a great deal of attention to detail and many lines of code.
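The checks described above can be collected into one small helper. This is a minimal sketch, assuming a sample/config/header document shape; the class name SafeTraversal and the childElement method are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class SafeTraversal {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the first direct child element with the given name, or null if there is none. */
    static Element childElement(Node parent, String name) {
        if (parent == null) {
            return null; // lets calls be chained without a null check at every step
        }
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Whitespace text, comments, and processing instructions are skipped here.
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(name)) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Returns true only if the root has a config child that has a header child. */
    static boolean hasHeader(Document doc) {
        Element config = childElement(doc.getDocumentElement(), "config");
        return childElement(config, "header") != null;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample>\n  <config>\n    <header>h</header>\n  </config>\n</sample>");
        System.out.println(hasHeader(doc));
    }
}
```

Even in this compact form, every level of the document costs another call and another implicit null check, which is the overhead the rest of this article tries to avoid.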

Finally, all of the functionality of the example in Listing 1 could have been implemented with a simple call to the getElementsByTagName function, had the original developer known about it. This is discussed below.

Retrieving the text value within an element

In the analyzed projects, after DOM traversal, the second most common task was to retrieve the text value contained in an element. Consider the XML fragment <sometag>The Value</sometag>. Having navigated to the sometag node, how do you capture its text value (The Value)? An intuitive implementation may be:

sometagElement.getData();

As you may have guessed, the above code will not perform the desired action. You cannot call a getData or a similar function on the sometag element because the actual text is stored as one or more child nodes. A better approach would be:

 ((Text) sometag.getFirstChild()).getData();

The problem with the second try is that the value may not actually be contained in the first child; processing instructions or other embedded nodes may be found within sometag, or the text value may be contained in several child nodes instead of in just one. Recall that whitespace is frequently represented as a text node, so the call to sometag.getFirstChild() may get you only the carriage return between the tag and its value. In fact, you need to traverse all of the children, checking for nodes of type Node.TEXT_NODE, and collate their values until you have the complete value.
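A minimal sketch of that collation follows. The name getTextContents matches the helper discussed later in this article, but the body is an assumption rather than the article's code; treating CDATA sections as text is also an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class TextValue {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Concatenates the direct Text and CDATA children of a node; nested elements are ignored. */
    static String getTextContents(Node node) {
        StringBuilder text = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The comment splits the text into two separate Text nodes.
        Document doc = parse("<sometag>The <!-- note -->Value</sometag>");
        Node sometag = doc.getDocumentElement();
        System.out.println(sometag.getFirstChild().getNodeValue()); // first Text node only
        System.out.println(getTextContents(sometag));               // all Text children
    }
}
```

Current JDKs also provide Node.getTextContent(), which differs from this sketch in that it includes the text of nested elements.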

Note that JDOM has already solved this problem for us with the convenient function getText. DOM Level 3 will also have an answer with the planned getTextContent method. The lesson: It's good to use higher-level APIs where you can.

getElementsByTagName

The DOM Level 2 interface includes a method for finding elements with a given name. For example, the call:

   NodeList names = someElement.getElementsByTagName("name");

will return a NodeList of the elements named name that are contained within the someElement node. This is certainly more convenient than the traversal methods I discussed. It is also the cause of a common set of bugs.

The problem is that getElementsByTagName recursively traverses the document, returning all matching nodes. Suppose you have a document containing customer information, company information, and product information. All three of these items can potentially have a name tag within them. Your program would likely misbehave if you called getElementsByTagName to search for customer names, and retrieved the product and company names in addition to the customer names. Calling the function on a subtree of the document can diminish the risks, but XML's flexible nature makes it quite difficult to ensure that the subtree you are operating on has the structure you are expecting, and that it doesn't have spurious child nodes with the name you are searching on.
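The effect is easy to reproduce. The following minimal sketch assumes a small order document with customer, company, and product sections; the class name TagNameSearch and its helper methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class, used only for illustration.
final class TagNameSearch {

    static final String ORDER =
            "<order>"
          + "<customer><name>Ann</name><company><name>Acme</name></company></customer>"
          + "<product><name>Widget</name></product>"
          + "</order>";

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Counts matching elements at any depth below scope, as getElementsByTagName sees them. */
    static int countDescendants(Element scope, String tag) {
        return scope.getElementsByTagName(tag).getLength();
    }

    /** Counts only the matching elements that are direct children of scope. */
    static int countDirectChildren(Element scope, String tag) {
        int count = 0;
        for (Node n = scope.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().equals(tag)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Element order = parse(ORDER).getDocumentElement();
        Element customer = (Element) order.getElementsByTagName("customer").item(0);
        System.out.println(countDescendants(order, "name"));       // whole document
        System.out.println(countDescendants(customer, "name"));    // customer subtree
        System.out.println(countDirectChildren(customer, "name")); // customer's own name only
    }
}
```

Narrowing the search to the customer subtree still picks up the company name nested inside it, so restricting the scope reduces the problem without removing it.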





Effective use of the DOM

Given the limitations imposed by DOM's design, how can you use the specification effectively and efficiently? Below are a few basic principles and guidelines for DOM usage, and a library of functions to make life easier.

Basic principles

Your experience using DOM will be significantly improved if you follow a few basic principles:

  • Do not use DOM to traverse the document.
  • Whenever possible, use XPath to find nodes or traverse the document.
  • Use a library of higher-level functions to make DOM use easier.

These principles are derived directly from examination of common problems. DOM traversal, as discussed above, is a leading cause of errors. However, it is also one of the most commonly needed functionalities. How do you traverse the document without using the DOM?

XPath

XPath is a language for addressing, searching, and matching pieces of the document. It is a W3C Recommendation, and it is implemented in most languages and XML packages. Chances are your DOM package supports XPath either directly or via an add-on. The code samples for this article use the Xalan package for XPath support.

XPath uses a path notation, similar to that used in file systems and URLs, to specify and match pieces of the document. For example, the XPath /x/y/z searches the document for a root node x, under which resides the node y, under which resides the node z. This statement returns all nodes that match the specified path structure.

More complex matches are possible, both in terms of the structure of the document and in the values of the nodes and their attributes. The statement /x/y/* returns all elements directly under any node y whose parent is x. The statement /x/y[@name='a'] matches all nodes y that have a parent x and an attribute called name with the value a. Note that XPath handles the whole issue of sifting through the whitespace text nodes to get at the actual element nodes -- these expressions return only the element nodes.
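The expressions above can be tried with the javax.xml.xpath API that ships with current JDKs, used here in place of the Xalan XPathAPI class so that the sketch has no external dependencies. The class name XPathDemo, the select helper, and the sample document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class, used only for illustration.
final class XPathDemo {

    // Indented on purpose: the whitespace becomes Text nodes in the DOM.
    static final String SAMPLE =
            "<x>\n"
          + "  <y name='a'><z/><z/></y>\n"
          + "  <y name='b'><w/></y>\n"
          + "</x>";

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Evaluates an XPath expression against a context node and returns all matching nodes. */
    static NodeList select(Node context, String expression) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        for (String path : new String[] {"/x/y", "/x/y/z", "/x/y/*", "/x/y[@name='a']"}) {
            System.out.println(path + " matches " + select(doc, path).getLength() + " node(s)");
        }
    }
}
```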

A full examination of XPath and its usage is beyond the scope of this article. See Resources for links to some excellent tutorials. Take a little time to learn XPath, and you'll be rewarded with much easier handling of XML documents.





Library of functions

One finding that surprised us when we examined the DOM projects was the amount of copy-and-paste code that was present. Why would experienced developers who otherwise employ good programming practices engage in copy-and-paste methods instead of creating helper libraries? We believe this is because the complexity of DOM presents a steep learning curve and leads developers to grab the first piece of code that does what they need. It takes a long time to develop the expertise needed to produce the canonical functions that make up the helper libraries.

To save some of that ramp-up time, here are some basic helper functions that will get you started with your own library.

findValue

The most commonly performed action when working with XML documents is looking up the value of a given node. As discussed above, this can present difficulties both in traversing the document to find the desired node and in retrieving the value of the node. The traversal can be simplified using XPath, and the retrieval of the value can be coded once and then reused. We have implemented the findValue function with the help of two lower-level functions: XPathAPI.selectSingleNode, provided by the Xalan package, which finds and returns the first node that matches the given XPath expression; and getTextContents, which non-recursively returns the concatenated values of the text contained in a node. Note that the getText function from JDOM, or the proposed getTextContent method that will appear in DOM Level 3, could be used in place of getTextContents. Listing 2 contains a simplified listing; you can access the full functions by downloading the sample code (see Resources).

findValue is called by passing in both a node from which to start the search and an XPath statement that specifies the node you're searching for. The function finds the first node to match the given XPath and extracts its text value.
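Listing 2 is not reproduced in this copy. The following is a minimal sketch of a findValue helper along the lines described, not the article's code: it uses the JDK's javax.xml.xpath API in place of Xalan's XPathAPI.selectSingleNode, and the class name, the null return for a missing node, and the sample document are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class FindValueSketch {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Non-recursively concatenates the Text and CDATA children of a node. */
    static String getTextContents(Node node) {
        StringBuilder text = new StringBuilder();
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            short type = child.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                text.append(child.getNodeValue());
            }
        }
        return text.toString();
    }

    /** Returns the text of the first node matching the XPath, or null if nothing matches. */
    static String findValue(Node start, String xpath) {
        Node match;
        try {
            match = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
        return match == null ? null : getTextContents(match);
    }

    public static void main(String[] args) {
        Document doc = parse("<sample>\n  <config>\n    <header>Welcome</header>\n  </config>\n</sample>");
        System.out.println(findValue(doc, "/sample/config/header"));
    }
}
```

Returning null for a missing node is one possible design; a caller that prefers a default value could wrap the call accordingly.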

setValue

Another common action is to set the value of a node to a desired value, as shown in Listing 3. This function takes a starting node and an XPath statement -- just like findValue -- and a string to set the value of the matching node to. It finds the desired node, removes all of its children (thereby removing any text and other elements contained within it), and sets its text contents to the passed-in string.
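Listing 3 is not reproduced in this copy. A minimal sketch of such a setValue helper follows, again using javax.xml.xpath rather than Xalan; the class name, the boolean return value, and the restriction to element nodes are assumptions of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class SetValueSketch {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Replaces the content of the first element matching the XPath with the given text.
     * Returns false if nothing matches or the match is not an element.
     */
    static boolean setValue(Node start, String xpath, String value) {
        Node match;
        try {
            match = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
        if (!(match instanceof Element)) {
            return false;
        }
        // Remove existing text and any embedded elements.
        while (match.hasChildNodes()) {
            match.removeChild(match.getFirstChild());
        }
        match.appendChild(match.getOwnerDocument().createTextNode(value));
        return true;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config><header>Old <b>bold</b></header></config></sample>");
        setValue(doc, "/sample/config/header", "New");
        System.out.println(doc.getElementsByTagName("header").item(0).getTextContent());
    }
}
```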

appendNode

While some programs look up and modify the values contained in XML documents, others modify the structure of the document itself by adding and removing nodes. This helper function simplifies the addition of a node to the document, as shown in Listing 4.

The parameters to this function are the node from which to start the search, the name of the new node to add, and the XPath statement specifying the location to add it under (that is, what the parent node of the new node should be). The new node is appended to the document at the specified location.
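Listing 4 is not reproduced in this copy. A minimal sketch of such an appendNode helper follows, using javax.xml.xpath rather than Xalan; the class name, the parameter order, the returned Element, and the null return when no parent element matches are assumptions of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, used only for illustration.
final class AppendNodeSketch {

    /** Parses an XML string; checked exceptions are wrapped to keep callers short. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /**
     * Creates an element called name and appends it as the last child of the first
     * element matching the XPath. Returns the new element, or null if no element matches.
     */
    static Element appendNode(Node start, String name, String xpath) {
        Node parent;
        try {
            parent = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
        if (!(parent instanceof Element)) {
            return null;
        }
        // New nodes must be created by the document that will own them.
        Element created = parent.getOwnerDocument().createElement(name);
        parent.appendChild(created);
        return created;
    }

    public static void main(String[] args) {
        Document doc = parse("<sample><config><header>h</header></config></sample>");
        Element footer = appendNode(doc, "footer", "/sample/config");
        System.out.println(footer.getParentNode().getNodeName());
    }
}
```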





In the final analysis

The language-neutral design of the DOM has given it very wide applicability and brought about implementations on a large number of systems and platforms. This has come at the expense of making DOM more difficult and less intuitive than APIs designed specifically for each language.

DOM forms a very effective base on which easy-to-use systems can be built by following a few simple principles. Future versions of DOM are being designed with the combined wisdom and experience of a large group of users, and will likely present solutions to some of the problems discussed here. Projects such as JDOM are adapting the API for a more natural Java feel, and techniques such as those described in this article can help make XML manipulation easier, less verbose, and less prone to bugs. Leveraging these projects and following these usage patterns allows DOM to be an excellent platform for XML-based projects.





Resources





About the author

Parand Tony Darugar is the head of architecture for Yahoo! Search Marketing Services (formerly Overture). His interests include Web services and Service Oriented Architectures (SOA), XML, high-performance business systems, distributed architectures, and artificial intelligence. You can reach him at tdarugar@yahoo.com.
