Effective XML processing with DOM and XPath in Java
Level: Intermediate

Parand Tony Darugar (tdarugar@yahoo.com), Head of architecture, Yahoo! Search Marketing Services

01 Dec 2001

Based on an analysis of several large XML projects, this article examines how to make effective and efficient use of DOM in Java. The DOM offers a flexible and powerful means for creating, processing, and manipulating XML documents, but it can be awkward to use and can lead to brittle and buggy code. Author Parand Tony Darugar provides a set of Java usage patterns and a library of functions to make DOM robust and easy to use.

The Document Object Model (DOM) is a recognized W3C standard for platform- and language-neutral dynamic access and update of the content, structure, and style of XML documents. It defines a standard set of interfaces for representing documents, as well as a standard set of methods for accessing and manipulating them. The DOM enjoys significant support and popularity, and it is implemented in a wide variety of languages, including Java, Perl, C, C++, VB, Tcl, and Python.

As I'll demonstrate in this article, DOM is an excellent choice for XML handling when stream-based models (such as SAX) are not sufficient. Unfortunately, several aspects of the specification, such as its language-neutral interface and its use of the "everything-is-a-node" abstraction, make it difficult to use and prone to generating brittle code. This was particularly evident in a recent review of several large DOM projects that were created by a variety of developers over the past year. The common problems, and their remedies, are discussed below.

The DOM specification is designed to be usable with any programming language. Therefore, it attempts to use a common, core set of features that are available in all languages. The DOM specification also attempts to remain neutral in its interface definitions. This allows Java programmers to apply their DOM knowledge when working with Visual Basic or Perl, and vice versa.
The specification also treats every part of the document as a node, consisting of a type and a value. This provides an elegant conceptual framework for dealing with all aspects of the document. As an example, the following XML fragment
is represented via the following DOM structure:

Figure 1: DOM Representation of an XML Document

Each of the

The elegant abstraction does come at a price. Consider the XML fragment:

The downside of DOM's language neutrality is that the methodologies and patterns that are normally used in each programming language cannot be employed. For example, instead of being able to create new

DOM employs an everything-is-a-node abstraction. This means that almost every piece of the XML document, such as

The everything-is-a-node abstraction loses some value because of the number of node types that exist, and because of the lack of uniformity present in their access methods. For example, the
JDOM is an effort to adapt the DOM API for Java, providing a more natural and easy-to-use interface. Recognizing the cumbersome nature of the language-neutral DOM constructs, JDOM aims to use native Java representations and objects, and to provide convenience functions for common tasks. For example, JDOM tackles the everything-is-a-node and the use of DOM-specific constructs (for example,

JDOM does a very good job of providing a better interface. It has been accepted as a JSR (a formal Java Specification Request), and it may well be incorporated into the core Java platform in the future. However, because it is not yet a part of the core Java APIs, some are hesitant to use it. There have also been reports of performance issues related to the frequent creation of Iterators and Java objects (see Resources).

If you are comfortable with the acceptance and availability of JDOM, and if you don't have a direct need to move Java code or programmers to other languages, JDOM is a good option to explore. The companies whose projects we examined for this article were not yet comfortable with JDOM, and therefore used plain vanilla DOM. This article does the same.
An analysis of several large XML projects revealed some common problems in working with the DOM. A few of these are presented below.

In all of the projects that we looked at in our review, an overarching problem presented itself: It took many lines of code to do simple things. In one example, 16 lines of code were used to check the value of an attribute. The same task, with improved robustness and error handling, can be accomplished in three lines of code. The low-level nature of the DOM API, incorrect application of methods and programming patterns, and lack of knowledge of the full API all contributed to the increase in the number of code lines. The following summary presents specific instances of these issues.

In the code we examined, the most common task was to traverse, or search, the DOM. Listing 1 shows a condensed version of the code required to find a node called "header" under the

In Listing 1, the document is traversed from the root by retrieving the top element, getting its first child (

As an example, the second line of the code gets the intermediate

Listing 1 also ignores the possibility that the document may have a different structure from what I'm expecting. If the

Finally, all of the functionality of the example in Listing 1 could have been implemented with a simple call to the

Retrieving the text value within an element

In the analyzed projects, after DOM traversal, the second most common task was to retrieve the text value contained in an element. Consider the XML fragment
As you may have guessed, the above code will not perform the desired action. You cannot call a
The problem with the second try is that the value may not actually be contained in the first child; processing instructions or other embedded nodes may be found within

Note that JDOM has already solved this problem for us with the convenient function

The DOM Level 2 interface includes a method for finding child nodes with a given name. For example, the call:
will return a The problem is that
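One pitfall with name-based lookup that is worth seeing in code: in the standard org.w3c.dom API, getElementsByTagName matches descendants at any depth, not only immediate children. The sketch below is a minimal illustration under assumed names; the order/item document and the directChildren helper are hypothetical and do not come from the projects reviewed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names here are illustrative assumptions.
final class ChildLookupDemo {

    /** Parses an XML string into a DOM Document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns only the immediate child elements of parent that have the given tag name. */
    static List<Element> directChildren(Element parent, String name) {
        List<Element> result = new ArrayList<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                result.add((Element) n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One <item> is an immediate child of <order>; the other is nested inside <bundle>.
        Document doc = parse("<order><item>a</item><bundle><item>b</item></bundle></order>");
        Element order = doc.getDocumentElement();

        // Matches both items, because the search covers all descendants.
        System.out.println(order.getElementsByTagName("item").getLength()); // prints 2

        // Matches only the immediate child.
        System.out.println(directChildren(order, "item").size()); // prints 1
    }
}
```

If only immediate children are wanted, a sibling walk like directChildren, or an XPath expression relative to the parent, avoids picking up nested elements that happen to share a name.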
Given the limitations imposed by DOM's design, how can you use the specification effectively and efficiently? Below are a few basic principles and guidelines for DOM usage, and a library of functions to make life easier. Your experience using DOM will be significantly improved if you follow a few basic principles:
These principles are derived directly from examination of common problems. DOM traversal, as discussed above, is a leading cause of errors. However, it is also one of the most commonly needed functionalities. How do you traverse the document without using the DOM?

XPath is a language for addressing, searching, and matching pieces of the document. It is a W3C Recommendation, and it is implemented in most languages and XML packages. Chances are your DOM package supports XPath either directly or via an add-on. The code samples for this article use the Xalan package for XPath support.

XPath uses a path notation, similar to that used in file systems and URLs, to specify and match pieces of the document. For example, the XPath:

More complex matchings are possible both in terms of the structure of the document, and in the values of the nodes and their attributes. The statement

A full examination of XPath and its usage is beyond the scope of this article. See Resources for links to some excellent tutorials. Take a little time to learn XPath, and you'll be rewarded with much easier handling of XML documents.
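To make the idea concrete, here is a minimal sketch of XPath-based lookup. It uses the javax.xml.xpath API that ships with current JDKs rather than the Xalan calls used by the article's own samples, and the message/header/item document, the class name, and the selectNode and selectNodes helpers are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names here are illustrative assumptions.
final class XPathLookupDemo {

    /** Parses an XML string into a DOM Document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the first node matching the expression, or null when nothing matches. */
    static Node selectNode(Node start, String expr) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, start, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    /** Returns all nodes matching the expression; the list is empty when nothing matches. */
    static NodeList selectNodes(Node start, String expr) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, start, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<message><header><to>alice</to></header>"
                + "<body><item id='1'/><item id='2'/></body></message>");

        // A path match replaces a chain of getFirstChild/getNextSibling calls.
        Node header = selectNode(doc, "/message/header");
        System.out.println(header.getNodeName()); // prints header

        // A predicate matches on attribute values as well as structure.
        System.out.println(selectNodes(doc, "//item[@id='2']").getLength()); // prints 1

        // A missing node yields null instead of a failure deep inside a traversal.
        System.out.println(selectNode(doc, "/message/footer") == null); // prints true
    }
}
```

The single null check after selectNode takes the place of the per-step checks that hand-written traversal needs when the document does not have the expected structure.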
One finding that surprised us when we examined the DOM projects was the amount of copy-and-paste code that was present. Why would experienced developers who otherwise employ good programming practices engage in copy-and-paste methods instead of creating helper libraries? We believe this is because the complexity of DOM presents a steep learning curve and leads developers to grab the first piece of code that does what they need. It takes a long time to develop the expertise needed to produce the canonical functions that make up the helper libraries. To save some of that ramp-up time, here are some basic helper functions that will get you started with your own library.

The most commonly performed action when working with XML documents is looking up the value of a given node. As discussed above, this can present difficulties both in traversing the document to find the desired node and in retrieving the value of the node. The traversal can be simplified using XPath, and the retrieval of the value can be coded once and then reused. We have implemented the

Another common action is to set the value of a node to a desired value, as shown in Listing 3. This function takes a starting node and an XPath statement -- just like

While some programs look up and modify the values contained in XML documents, others modify the structure of the document itself by adding and removing nodes. This helper function simplifies the addition of a node to the document, as shown in Listing 4. The parameters to this function are the node to add the new node under, the name of the new node to add, and the XPath statement specifying the location to add it under (that is, what the parent node of the new node should be). The new node is appended to the document at the specified location.
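The three helpers described above might be sketched as follows. This is not a reconstruction of the article's listings: the class name DomHelpers, the method names getValue, setValue, and addNode, their signatures, and the config document are assumptions made for illustration, and XPath support comes from the JDK's javax.xml.xpath package rather than Xalan.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; names and signatures are illustrative assumptions.
final class DomHelpers {
    private DomHelpers() {}

    /** Parses an XML string into a DOM Document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Returns the first node matching the XPath, or null when nothing matches. */
    static Node find(Node start, String xpath) {
        try {
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, start, XPathConstants.NODE);
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad XPath: " + xpath, e);
        }
    }

    private static boolean isText(Node n) {
        return n.getNodeType() == Node.TEXT_NODE
                || n.getNodeType() == Node.CDATA_SECTION_NODE;
    }

    /** Concatenates every text and CDATA child of the matched node; null when no match. */
    static String getValue(Node start, String xpath) {
        Node node = find(start, xpath);
        if (node == null) return null;
        StringBuilder text = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isText(c)) text.append(c.getNodeValue());
        }
        return text.toString();
    }

    /** Replaces the text of the matched element; returns false when no element matches. */
    static boolean setValue(Node start, String xpath, String value) {
        Node node = find(start, xpath);
        if (!(node instanceof Element)) return false;
        Node c = node.getFirstChild();
        while (c != null) {
            Node next = c.getNextSibling(); // read before removal detaches c
            if (isText(c)) node.removeChild(c);
            c = next;
        }
        node.insertBefore(node.getOwnerDocument().createTextNode(value), node.getFirstChild());
        return true;
    }

    /** Appends a new element under the matched parent; returns null when no parent matches. */
    static Element addNode(Node start, String parentXPath, String name) {
        Node parent = find(start, parentXPath);
        if (!(parent instanceof Element)) return null;
        Element child = parent.getOwnerDocument().createElement(name);
        parent.appendChild(child);
        return child;
    }

    public static void main(String[] args) {
        // The comment splits the text of <name> into two separate text nodes.
        Document doc = parse("<config><name>old<!-- note --> value</name></config>");
        System.out.println(getValue(doc, "/config/name")); // prints old value

        setValue(doc, "/config/name", "new");
        System.out.println(getValue(doc, "/config/name")); // prints new

        addNode(doc, "/config", "timeout");
        setValue(doc, "/config/timeout", "30");
        System.out.println(getValue(doc, "/config/timeout")); // prints 30
    }
}
```

Each helper reports a missing node through its return value (null or false) instead of throwing, so callers handle an unexpected document structure in one place. A production library would likely cache the XPath object rather than create a factory on every call.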
The language-neutral design of the DOM has given it very wide applicability and brought about implementations on a large number of systems and platforms. This has come at the expense of making DOM more difficult and less intuitive than APIs designed specifically for each language. DOM forms a very effective base on which easy-to-use systems can be built by following a few simple principles. Future versions of DOM are being designed with the combined wisdom and experience of a large group of users, and will likely present solutions to some of the problems discussed here. Projects such as JDOM are adapting the API for a more natural Java feel, and techniques such as those described in this article can help make XML manipulation easier, less verbose, and less prone to bugs. Leveraging these projects and following these usage patterns allows DOM to be an excellent platform for XML-based projects.