Monday, May 31, 2010

Wifi|network These Wi-Fi software tools


A wide variety of Wi-Fi software tools are available. These tools for Wi-Fi perform functions such as:

  • Wireless network discovery
  • Wireless network mapping
  • Wireless network traffic analysis
  • Wireless network RF signal strength monitoring
  • Wireless network encryption cracking
  • Wireless network custom frame generation
  • Dictionary or brute force attacks against wireless networks
  • Denial of Service (DoS) attacks against wireless networks
These Wi-Fi software tools are available for a variety of platforms:


Wi-Fi Software Tools for Multiple Platforms

Aircrack-ng

Aircrack-ng is a WEP and WPA-PSK key cracking program for use on 802.11 networks. The primary purpose for the program is to recover a lost or unknown key once enough data is captured.

Aircrack-ng has the following advantages over the original Aircrack release:

  • Updated and better documentation
  • Updated drivers, including new drivers not originally supported in Aircrack
  • New and faster WEP attack algorithm PTW
  • Supports Unix, Windows, and Zaurus
  • Includes fragmentation in attacks
  • Better cracking performance
  • Dictionary support for WEP attacks
  • Use multiple cards to capture simultaneously
  • New tools including airtun-ng, packetforge-ng (improved arpforge), wesside-ng (still under development), and airserv-ng (still under development)
  • Code optimizations and bug fixes

 

Wi-Fi Software Tools for Windows

KNSGEM II

KNSGEM II is a program that takes the survey logs produced by NetStumbler, Kismet, or WiFiHopper and combines them with Google Earth data to provide colorized 3D coverage maps.

NetStumbler

NetStumbler is a Wi-Fi tool for Windows that allows you to detect Wireless Local Area Networks (WLANs) using 802.11b, 802.11a and 802.11g. It has many uses:

  • Verify that your network is set up the way you intended.
  • Find locations with poor coverage in your WLAN.
  • Detect other networks that may be causing interference on your network.
  • Detect unauthorized "rogue" access points in your workplace.
  • Help aim directional antennas for long-haul WLAN links.
  • Use it recreationally for WarDriving.

OmniPeek

OmniPeek is the next-generation commercial wireless analysis software from WildPackets, combining the legacy applications AiroPeek and EtherPeek.

Features of OmniPeek include the ability to:

  • Analyze any network interface, including 10Gigabit, Gigabit, and WAN adapters
  • Analyze media and data traffic simultaneously
  • View results in normal document formats such as PDF, HTML or just through email or IM clients
  • View high level details of traffic in a dashboard, or drill down into the individual packet payloads
  • View local, remote, or previously stored captures, including multiple active captures at once
  • View capture details by conversation pairs to quickly identify useful or problematic events
  • Change capture filters at will without restarting the capture sequence

StumbVerter

StumbVerter is a standalone application which allows you to import Network Stumbler's summary files into Microsoft's MapPoint 2004 maps. The logged WAPs will be shown with small icons, their colour and shape relating to WEP mode and signal strength.

As the AP icons are created as MapPoint pushpins, the balloons contain other information, such as MAC address, signal strength, mode, etc. This balloon can also be used to write down useful information about the AP.

Lucent/Orinoco Registry Encryption/Decryption

Lucent Orinoco Client Manager stores WEP keys in the Windows registry under a certain encryption/obfuscation. This wi-fi tool can be used to encrypt WEP keys into a registry value or to decrypt registry values into WEP keys.

WiFi Hopper

WiFi Hopper is a Windows network discovery and connection client. WiFi Hopper can assist auditors with site surveys, connection parameter testing, and network discovery. Filters allow you to easily limit the details displayed, as well as what kinds and configurations of equipment will be tested.

APTools

APTools is a utility that queries ARP Tables and Content-Addressable Memory (CAM) for MAC Address ranges associated with 802.11b Access Points. It will also utilize Cisco Discovery Protocol (CDP) if available. If an Access Point that is web managed is identified, the security configuration of the Access Point is audited via HTML parsing.

Wi-Fi Software Tools for Unix

Aircrack

Aircrack is a Unix static WEP and WPA-PSK key cracking utility. Aircrack is no longer under development and has been replaced by Aircrack-ng. Although it still works, you probably want Aircrack-ng unless you have a specific reason to use Aircrack.

Aircrack-ptw

Aircrack-ptw was a proof of concept software release showcasing the performance gains you can receive by implementing a new cracking algorithm. The focus of this toolset is on the WEP security algorithm. Aircrack-ptw is implemented in Aircrack-ng, which is a much more robust and complete package.

AirSnort

AirSnort is a wireless LAN (WLAN) tool which cracks encryption keys on 802.11b WEP networks. AirSnort operates by passively monitoring transmissions, computing the encryption key when enough packets have been gathered.

CoWPAtty

CoWPAtty is a program that uses lookup tables to speed up brute-force key cracking. The hash tables provided include 100,000 dictionary words and common keys combined with the 1,000 most common Wi-Fi SSIDs. The focus for cracking is on the WPA1 and WPA2 protocols. If you need to crack a WEP key, try Aircrack-ng.

Karma

Karma is a set of wireless client assessment tools compiled into a single package release. The intent of the package is to identify and take advantage of the methods operating systems use to connect to access points. Although no exploit code is provided with the release, the suite has been tested with multiple exploit releases.

Kismet

Kismet is an 802.11 Layer 2 wireless network detector, sniffer, and Intrusion Detection System. Kismet will work with any wireless card which supports raw monitoring (rfmon) mode, and can sniff 802.11b, 802.11a, and 802.11g traffic.

Kismet identifies networks by passively collecting packets and detecting standard named networks, detecting (and given time, decloaking) hidden networks, and inferring the presence of nonbeaconing networks via data traffic.

Wellenreiter

Wellenreiter, by Max Moser, is a GTK/Perl program that makes the discovery and auditing of 802.11b Wi-Fi wireless networks much easier. All three major wireless cards (Prism2, Lucent, and Cisco) are supported. It has an embedded statistics engine for the common parameters provided by wireless drivers. Its scanner window can be used to discover access points, networks, and ad-hoc cards. It detects both broadcasting and non-broadcasting SSIDs on every channel. The manufacturer and WEP status are detected automatically. A flexible sound event configuration lets you work in unattended environments. An Ethereal/tcpdump-compatible dump file can be created for the whole session. GPS is used to track the location of the discovered networks immediately. Automatic association is possible with randomly generated MAC addresses. Wellenreiter can run on low-resolution devices that support GTK/Perl and Linux/BSD (such as iPaqs). A unique ESSID brute-forcer is now included too.

Airsnarf

Airsnarf is a simple rogue wireless access point setup utility designed to demonstrate how a rogue AP can steal usernames and passwords from public Wi-Fi hotspots. Airsnarf was developed and released to demonstrate an inherent vulnerability of public 802.11b hotspots--snarfing usernames and passwords by confusing users with DNS and HTTP redirects from a competing AP.

Hotspotter

Hotspotter passively monitors Wi-Fi networks for probe request frames to identify the preferred networks of Windows XP clients, and will compare it to a supplied list of common hotspot network names. If the probed network name matches a common hotspot name, Hotspotter will act as an access point to allow the client to authenticate and associate. Once associated, Hotspotter can be configured to run a command, possibly a script to kick off a DHCP daemon and other scanning against the new victim.

BSD-Airtools

bsd-airtools is a package that provides a complete toolset for wireless 802.11b auditing. It currently contains a BSD-based WEP cracking application called dweputils (as well as kernel patches for NetBSD, OpenBSD, and FreeBSD). It also contains a curses-based AP detection application similar to NetStumbler (dstumbler) that can be used to detect wireless access points and connected nodes, view signal-to-noise graphs, and interactively scroll through scanned APs and view statistics for each. It also includes a couple of other tools that make use of all 14 of the Prism2 debug modes and perform basic analysis of the hardware-based link-layer protocols provided by Prism2's monitor debug mode.

WaveStumbler

WaveStumbler is a console-based 802.11 network mapper for Linux.

WEPCrack

WEPCrack is a tool that cracks 802.11 WEP encryption keys by exploiting the weaknesses of RC4 key scheduling.

AirFart

AirFart is a wireless tool created to detect Wi-Fi devices, calculate their signal strengths, and present them to the user in an easy-to-understand fashion. It is written in C/C++ with a GTK front end. Airfart supports all wireless network cards supported by the linux-wlan-ng Prism2 driver that provide hardware signal strength information in the "raw signal" format (ssi_type 3). Airfart implements a modular n-tier architecture with the data collection at the bottom tier and a graphical user interface at the top.

AirTraf

AirTraf is one of the first wireless 802.11(b) network analyzers. With the growth of interest in wireless networks, network administrators of today are faced with a challenge. The challenge is to effectively deploy numerous access points within their organization to provide wireless coverage for all users, and at the same time make sure that everyone who is granted access is able to operate in a fast, robust network environment.

AirTraf is a 100% passive packet sniffing tool for the wireless 802.11b networks. It captures and tracks all wireless activity in the coverage area, decodes packets, and maintains acquired information associated by access points, as well as detected individual wireless nodes. It dynamically detects any access points in the area, finds association between wireless clients and access points, and builds information table for each packet that is transmitted via the air. AirTraf is able to maintain packet count, byte information, related bandwidth, as well as signal strength of nodes.

Best of all, it is open source and distributed under the GPL. Comparable wireless network analysis products (such as Sniffer Wireless) are priced above $10,000 and are limited to single-copy licenses. AirTraf, by contrast, can be installed at any detection location you choose and run in Server Mode. A polling server can then periodically retrieve active wireless data from multiple stations at once, consolidating wireless information for your entire organization into a single point of access (a database) that can be administered through a web interface, so you can see your wireless network's performance at a glance, at no cost to you or your organization.

However, AirTraf is still a work in progress, meaning many of the planned features, such as injecting packets into the network to test access point security, are not available yet. But it is constantly being worked on, and it should soon prove to be a critical tool for managing healthy wireless networks.

AP Hunter

AP Hunter (Access Point Hunter) can find and automatically connect to whatever wireless network is within range. AP Hunter can be used for site surveys, writing the results in a file.

AP Radar

AP Radar (Access Point Radar) is a Linux/GTK+ based graphical netstumbler and wireless profile manager. This project makes use of the version 14 wireless extensions in linux 2.4.20 and 2.6 to provide access point scanning capabilities for most models of wireless cards. It is meant to replace the manual process of running iwconfig and dhclient. It makes reconfiguring for different wireless access points quick and easy.

Mognet

Mognet is a simple, lightweight 802.11b sniffer written in Java and available under the GPL. It features realtime capture output, support for all 802.11b generic and frame-specific headers, easy display of frame contents in hex or ascii, text mode capture for GUI-less devices, and loading/saving capture sessions in libpcap format.

PrismStumbler

Prismstumbler is a wireless LAN (WLAN) discovery tool which scans for beacon frames from access points. Prismstumbler operates by constantly switching channels and monitoring any frames received on the currently selected channel.

Prismstumbler is designed to be a flexible tool to find as much information about wireless LAN installations as possible. It comes with an easy-to-use GTK2 frontend and is small enough to fit on a small portable system. Because of its client-server architecture, the scanner engine may be used with different frontends. An example of this is gpe-aerial, a wireless LAN access tool for GPE.

The current GTK user interface is designed to work on large PC screens as well as on PDA displays. Prismstumbler uses an embedded SQL database to store network information. It is also able to create network lists in GpsDrive format and store captured packets in pcap dump files.

THC WarDrive

THC-WarDrive is a tool for mapping your city for wavelan networks with a GPS device while you are driving a car or walking through the streets. It is effective and flexible, a "must-download" for all wavelan nerds.

Wi-find

Wi-find is a wireless network detection tool written in C that aims for flexibility and clean, easy-to-understand code. Wi-find currently supports only Prism2-based cards using the wlan-ng driver.

Wifi-Scanner

Wifi-Scanner is a tool that has been designed to discover wireless nodes (i.e., access points and wireless clients). It is distributed under the GPL License.

WiFi-Scanner will work with Cisco cards and prism cards with the hostap driver or wlan-ng driver.

An IDS (Intrusion Detection System) is integrated into Wifi-Scanner to detect anomalies like MAC usurpation.

WaveMon

wavemon is an ncurses-based monitor for wireless devices. It allows you to watch the signal and noise levels, packet statistics, device configuration, and network parameters of your wireless network hardware.

WPM (Wireless Power Meter)

WPM (Wireless Power Meter) is intended to give you a nice signal strength meter for analyzing your wireless connection, and facilitate setting up point-to-point links.

asleap

asleap exploits weaknesses in Cisco's LEAP protocol. Specifically, asleap:

  • Recovers weak LEAP passwords.
  • Can read live from any wireless interface in RFMON mode.
  • Can monitor a single channel, or perform channel hopping to look for targets.
  • Will actively deauthenticate users on LEAP networks, forcing them to reauthenticate. This makes the capture of LEAP passwords very fast.
  • Will only deauth users who have not already been seen, and doesn't waste time on users who are not running LEAP.
  • Can read from stored libpcap files, or AiroPeek NX files (1.X or 2.X files).
  • Uses a dynamic database table and index to make lookups on large files very fast. Reduces the worst-case search time to .0015% as opposed to lookups in a flat file.
  • Can write *just* the LEAP exchange information to a libpcap file. This could be used to capture LEAP credentials with a device short on disk space (like an iPaq), and then process the LEAP credentials stored in the libpcap file on a system with more storage resources.

anwrap

anwrap.pl is a wrapper for ancontrol that serves as a dictionary attack tool against LEAP-enabled Cisco wireless networks. anwrap traverses a user list and password list, attempting authentication and logging the results to a file. anwrap really wreaks havoc on RADIUS calls to NT networks that have lockout policies in place; you have been warned. Tweak the timeouts: a lengthy LEAP timeout on the Cisco side could make for a very boring afternoon. anwrap was designed to audit authentication strengths before deploying LEAP in a production environment.

WepAttack

WepAttack is a WLAN open source Linux tool for breaking 802.11 WEP keys. This tool is based on an active dictionary attack that tests millions of words to find the right key. Only one packet is required to start an attack.

WEPWedgie

WEPWedgie is a toolkit for determining 802.11 WEP keystreams and injecting traffic with known keystreams. The toolkit also includes logic for firewall rule mapping, pingscanning, and portscanning via the injection channel and a cellular modem.

AirJack

AirJack is a device driver (or suite of device drivers) for 802.11(a/b/g) raw frame injection and reception. It is meant as a development tool for all manner of 802.11 applications that need to access the raw protocol.

Fake AP

Black Alchemy's Fake AP generates thousands of counterfeit 802.11b access points. Hide in plain sight amongst Fake AP's cacophony of beacon frames. As part of a honeypot or as an instrument of your site security plan, Fake AP confuses Wardrivers, NetStumblers, Script Kiddies, and other undesirables.

macfld

The macfld tool uses the Linux wireless extensions to generate and set random MAC addresses on a Cisco or patched Lucent (drivers) NIC, eventually filling up the association ID table on a wireless bridge. The IEEE 802.11 specification identifies a max value of 2007 concurrent associations to an IBSS access point, but does not discuss what to do when the AID table is full. I have found that ~250 concurrent associations will cause an access point to restart.

void11

void11 is a free implementation of basic 802.11 attacks:

  • deauth (Network DoS) (floods wireless networks with deauthentication packets and a spoofed BSSID; authenticated stations will drop their network connections)
  • auth (Access point DoS) (floods access points with authentication packets and random station addresses; some access points will deny any service after some flooding)
    • Apple Airport aka "UFO" died after ~60sec flooding for about 15 minutes
    • Lucent OR1000 survived with minor problems
    • OpenBSD 3.1/3.2 HostAP froze after some flooding
    • Linux HostAP driver survived ;-) (max. 1023 authenticated stations)

Wireless Access point Utilities for Unix

Wireless Access Point Utilities for Unix is a set of Wi-Fi utilities to configure and monitor wireless access points under Unix using the SNMP protocol. It compiles with GCC and the IBM C compiler and runs under Linux, FreeBSD, NetBSD, MacOS-X, AIX, QNX, and OpenBSD.

AP Hopper

AP Hopper is a program that automatically hops between access points of different wireless networks. It checks for DHCP and Internet Access on all the networks found. It logs successful and unsuccessful attempts.

APTools

APTools is a utility that queries ARP Tables and Content-Addressable Memory (CAM) for MAC Address ranges associated with 802.11b Access Points. It will also utilize Cisco Discovery Protocol (CDP) if available. If an Access Point that is web managed is identified, the security configuration of the Access Point is audited via HTML parsing.

gpsd

gpsd is a daemon that listens to a GPS or Loran receiver and translates the positional data into a simplified format that can be more easily used by other programs, like chart plotters. The package comes with a sample client that plots the location of the currently visible GPS satellites (if available) and a speedometer. It can also use DGPS/ip.

GpsDrive

GpsDrive is a car (bike, ship, plane) navigation system. GpsDrive displays your position, as provided by your NMEA-capable GPS receiver, on a zoomable map; the map file is selected automatically depending on your position and preferred scale. Speech output is supported if the "festival" software is running. The maps are auto-selected for best resolution depending on your position and can be downloaded from the Internet. All Garmin GPS receivers with a serial output should be usable, as should other GPS receivers that support the NMEA protocol.

airpwn

Airpwn is a tool for generic packet injection on an 802.11 network.

airpwn requires two 802.11b interfaces, one for listening, and another for injecting. It uses a config file with multiple config sections to respond to specific data packets with arbitrary content.

Wifitap

WifiTap allows users to connect to wifi networks using traffic injection. The concept is the same as most "man-in-the-middle" or "monkey-in-the-middle" attacks. For WifiTap to work, another system must have an association with an access point that the WifiTap system wants to pass traffic through.

Benefits of using WifiTap over normal Wifi clients:

  • The system running wifitap is not associated with any wireless access point
  • The system is not handled by any access point.

 

Wi-Fi Software Tools for Mac OS

MacStumbler

MacStumbler is a utility to display information about nearby 802.11b and 802.11g wireless access points. It is mainly designed to be a tool to help find access points while traveling, or to diagnose wireless network problems. Additionally, MacStumbler can be used for "wardriving", which involves co-ordinating with a GPS unit while traveling around to help produce a map of all access points in a given area.

KisMAC

KisMAC is a free stumbler application for Mac OS X that puts your card into monitor mode. Unlike most other applications for OS X, it is completely invisible and sends no probe requests. KisMAC supports third-party PCMCIA cards with Orinoco and PrismII chipsets, as well as Cisco Aironet cards.

Kismet

Kismet is an 802.11 Layer 2 wireless network detector, sniffer, and Intrusion Detection System. Kismet will work with any wireless card which supports raw monitoring (rfmon) mode, and can sniff 802.11b, 802.11a, and 802.11g traffic.

Kismet identifies networks by passively collecting packets and detecting standard named networks, detecting (and given time, decloaking) hidden networks, and inferring the presence of nonbeaconing networks via data traffic.

 

Windows tools useful when associated with Wi-Fi tools

MacIdChanger

MacIdChanger allows you to easily and temporarily change the MAC address of your Windows network adapter without much fuss. This is generally used to conceal the unique MAC ID that is on every network adapter. This software only operates on Windows XP/2003.

Technitium MAC Address Changer

A free, very verbose, and functional tool to change your network adapter's MAC address. The tool works regardless of which network adapter or driver is installed in your system. Supported platforms are Windows NT, Windows 2000, Windows XP and Windows Vista.

Wednesday, May 26, 2010

Mac|Macports through a proxy

Macports through a proxy

March 18

We have a proxy at work that prevents direct outbound connections. I found out about the awesome MacPorts program, which is a bit like apt for OS X. It pulls ports from a repository and installs them for you.

There's not a lot to the tool's installation if you live on the open web, but I needed to do some stuff to get it working with our squid proxy.

If you run sudo port selfupdate, and get an error that says 'port selfupdate failed: Couldn't sync the ports tree' or something like that, chances are your proxy is blocking rsync.

There are three steps. The prerequisites are the proxy address, admin access to your Mac, and a proxy that allows the rsync port (873/tcp).

You can test connectivity by opening http://rsync.macports.org:873 in a browser. You should get the following error:

@RSYNCD: 30.0
@ERROR: protocol startup error

Step 1

If that works, you need to configure sudo on OS X to pass the proxy environment variables through. First, edit your sudoers file with:
sudo visudo
Do not edit /etc/sudoers directly; visudo checks the syntax before saving.

You need to append these lines:

Defaults env_keep += "http_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY RSYNC_PROXY"
Defaults env_keep += "ALL_PROXY NO_PROXY"

Step 2

Now you need to set your HTTP proxy:
export http_proxy=http://proxy.example.com:8080
where 8080 is the port number of the proxy

Step 3

By default, port uses rsync to manage its updates. rsync can use a proxy environment setting (see man rsync for more):
export RSYNC_PROXY=proxy.example.com:8080
Note the rsync proxy capitalisation, and the fact that it does not need http://

That should do it. You can then run selfupdate to get port to the latest version.

The hard way

If that doesn't work, you can have a look at these instructions for replacing rsync with Subversion:
Syncing with SVN in Macports


Another way:
Edit /opt/local/etc/macports/macports.conf.
At the bottom there are options to set all the proxy variables just for MacPorts.
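
As a rough illustration, the proxy block in macports.conf looks something like the fragment below. The option names are the ones I recall from the stock file shipped with MacPorts, and proxy.example.com:8080 is just the placeholder used in the steps above, so check the comments in your own copy before relying on it.

```
# Let the values below take precedence over proxy variables in the environment
proxy_override_env  yes
proxy_http          proxy.example.com:8080
proxy_https         proxy.example.com:8080
proxy_ftp           proxy.example.com:8080
proxy_rsync         proxy.example.com:8080
```

Note that, as with RSYNC_PROXY, the values are host:port without an http:// prefix.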

How to sync your ports tree using Subversion (over HTTP)

  • Audience: end users who cannot use rsync (873/tcp) due to firewalls, proxies, policy, etc.
  • Requires: MacPorts
  • Requires: Subversion

Leopard comes with subversion already installed. If you are using Tiger, or some other system which does not provide a subversion client, you will need to install subversion yourself. If you have a copy of the ports tree already, just run:

sudo port install subversion

If you do not have a copy of the ports tree, you can download the daily tarball by following the tarball howto.

Introduction

Some people live and work behind a firewall or proxy that blocks or otherwise breaks rsync, which is the primary means of getting updated portfiles in MacPorts. The following steps will switch your tree over to using Subversion (over HTTP) for syncing.

Note: replace "$prefix" with the location of your MacPorts install, which defaults to /opt/local.

Installation

Step 1: Checkout Initial Copy

cd $prefix/var/macports/sources
mkdir -p svn.macports.org/trunk/dports
cd svn.macports.org/trunk/dports
svn co http://svn.macports.org/repository/macports/trunk/dports/ .

Configuration

Step 2: Configure MacPorts

Edit $prefix/etc/macports/sources.conf to comment out the rsync entry and add the "file" entry:

Note: don't forget to replace $prefix.

#rsync://rsync.macports.org/release/ports/ [default]
file://$prefix/var/macports/sources/svn.macports.org/trunk/dports/ [default]

Optional Parts

Step 3: Test Sync

Run sync in debug mode and watch for "svn update" instead of "rsync" being used:

port -d sync 


Tuesday, May 11, 2010

SSL|Java SSL No Subject Alternative Matched

Wednesday, December 10, 2008

Java SSL No Subject Alternative Matched

When you try to connect to a server with an untrusted SSL certificate, you might encounter one of the following exceptions:
java.security.cert.CertificateException: No subject alternative names matching IP address xxx.xxx.xxx found
or
java.security.cert.CertificateException: No subject alternative DNS name matching hostname.com found.
The reason is that the certificate's subject alternative name does not match the address you are connecting to. There are two possible solutions:
  • Change the certificate's subject alternative value
  • Create a customized HostnameVerifier
Change Certificate's Subject Alternative Value

If you connect to your host by IP address, the certificate's subject alternative value must contain that IP address. Likewise, if you connect using a DNS name, the subject alternative value must match that DNS name.

Create a Customized HostnameVerifier

Basically you just need to create your customized HostnameVerifier class like example below:

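// Requires: import javax.net.ssl.HostnameVerifier; import javax.net.ssl.SSLSession;
private static class CustomizedHostnameVerifier implements HostnameVerifier {
    // Accepts every hostname, so no hostname verification takes place.
    @Override
    public boolean verify(String hostname, SSLSession session) {
        return true;
    }
}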


and then apply this class to a single SSL connection:

HttpsURLConnection connection = (HttpsURLConnection) new URL("https://url").openConnection();
connection.setHostnameVerifier(new CustomizedHostnameVerifier());


or apply it to all SSL connections:

HttpsURLConnection.setDefaultHostnameVerifier(new CustomizedHostnameVerifier());


However, this method poses a security risk because the hostname is no longer verified at all. A server could present another website's certificate and the program would still accept it.
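
A narrower variant, sketched here under my own assumptions, accepts a mismatch only for hosts you list explicitly. As I understand the HostnameVerifier documentation, HttpsURLConnection calls the verifier only when the URL's hostname and the certificate do not match, so returning false for everything else leaves normal verification in place. The class name AllowListHostnameVerifier and the host 10.0.0.5 in the note below are made up for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLSession;

// Hypothetical helper: tolerates a certificate/hostname mismatch
// only for hosts that were listed explicitly.
final class AllowListHostnameVerifier implements HostnameVerifier {
    private final Set<String> allowedHosts = new HashSet<>();

    AllowListHostnameVerifier(Collection<String> hosts) {
        // Hostnames are case-insensitive, so store them lower-cased.
        for (String host : hosts) {
            allowedHosts.add(host.toLowerCase(Locale.ROOT));
        }
    }

    @Override
    public boolean verify(String hostname, SSLSession session) {
        // Reject anything that is not on the allow list.
        return hostname != null
                && allowedHosts.contains(hostname.toLowerCase(Locale.ROOT));
    }
}
```

It is applied the same way as the class above, for example connection.setHostnameVerifier(new AllowListHostnameVerifier(Arrays.asList("10.0.0.5"))). This still skips the name check for the listed hosts, so fixing the certificate remains the better option when you control the server.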

Monday, May 10, 2010

XSD|XML Schema: Understanding Structures

XML Schema: Understanding Structures
by Rahul Srivastava

Learn how to use XML Schema constructs to declare, extend, and restrict the structure of your XML.

Other articles in this series:
XML Schema: Understanding Namespaces
XML Schema: Understanding Datatypes

Downloads for this article:
Oracle XML Developer's Kit
Oracle JDeveloper 10g (includes visual XML Schema editor)

A grammar defines the structure and semantics of a language, enforces constraints, and ensures validity of the instance (the actual data). Just as the English (or any other) language has an associated grammar that defines the rules about how a particular sentence can be composed—and at the same time, given an English sentence, can be used to check the validity of that sentence—a grammar for an XML instance document defines as well as ensures the validity of the structure and content of that document.

The W3C XML Schema definition (WXS) represents the Abstract Data Model of W3C XML Schema in the XML language. By defining an Abstract Data Model of the schema, the W3C Schema becomes agnostic about the language used to represent that model. XML representation is the formal representation specified by WXS, but you are free to represent the Abstract Data Model any way you want and use it for validation. For example, you can directly create an in-memory schema using any data structure that adheres to the Abstract Data Model. This encourages the vendors that develop W3C Schema validators to provide an API that you can use to create an in-memory schema directly.

There are numerous grammars available for validating XML-instance documents. Some became obsolete immediately, while others—such as DTD, which is part of W3C XML 1.0 REC—have passed the test of time. Of the extant grammars, XML Schema is the most popular among XML developers because:

  1. It uses XML as the language to define the schema.
  2. It has more than 44 built-in datatypes, and each of these datatypes can be further refined for fine-grained validation of the character data in XML.
  3. The cardinality of the elements can be defined in a fine-grained manner using the minOccurs and maxOccurs attributes.
  4. It supports modularity and re-usability by extension, restriction, import, include, and redefine constructs.
  5. It supports identity constraint to ensure uniqueness of a value in an XML document, in the specified set.
  6. It has an Abstract Data Model and therefore is not bound to the XML representation only.

Here's an example of how you would validate an XML instance against an externally specified schema:

import java.io.FileInputStream;
import oracle.xml.parser.v2.XMLError;
import oracle.xml.parser.schema.XML Schema;
import oracle.xml.parser.schema.XSDBuilder;
import oracle.xml.schemavalidator.XSDValidator;
...
//load XML Schema
XSDBuilder schemaBuilder = new XSDBuilder();
XML Schema schema = schemaBuilder.build(new FileInputStream("myschema.xsd"), null);

//set the loaded XML Schema to the XSDValidator
XSDValidator validator = new XSDValidator();
validator.setSchema(schema);

//validate the XML-instance against the supplied XML Schema.
validator.validate(new FileInputStream("data.xml"));

//check for errors
XMLError error = validator.getError();
if (error.getNumMessages() > 0) {
System.out.println("XML-instance is invalid.");
error.flushErrors();
}
else {
System.out.println("XML-instance is valid.");
}
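For readers who do not have the Oracle XDK on the classpath, the same load-then-validate flow can be sketched with the standard JAXP javax.xml.validation API that ships with the JDK. This is only an illustration, not the Oracle API: the class name JaxpValidationSketch, the inline one-element schema, and the sample instances are assumptions made up for this sketch.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class JaxpValidationSketch {
    // Hypothetical stand-in for myschema.xsd: one element holding an integer.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='count' type='xsd:integer'/>"
      + "</xsd:schema>";

    // Returns true when the instance conforms to XSD, false when it is rejected.
    static boolean isValid(String xml) {
        try {
            // Load the XML Schema.
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // Validate the XML instance against the loaded schema.
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                // With no ErrorHandler set, the validator throws on the first error.
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<count>42</count>"));
        System.out.println(isValid("<count>forty-two</count>"));
    }
}
```

Unlike the XDK listing, this sketch does not collect error messages; a caller that needs them could register an org.xml.sax.ErrorHandler on the validator.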
Of course, XML Schema has limitations as well:
  1. It doesn't support rule-based validation. An example of rule-based validation would be: If the value of attribute "score" is greater than 80, then the element "distinction" must exist in the XML instance, otherwise not.
  2. The Unique Particle Attribution (UPA) constraint is too strict to allow a grammar to be defined for every kind of XML document. (See the "UPA Constraint" section for details.)
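Because the schema language itself cannot express the rule in the first limitation, such a rule is typically checked in application code after schema validation. The following is a minimal sketch of that idea using the JDK's DOM parser; the class name ScoreRuleCheck and the root element name result are hypothetical, and only the attribute "score" and the element "distinction" come from the example rule.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class ScoreRuleCheck {
    // Rule from the text: "distinction" must exist exactly when score > 80.
    static boolean satisfiesRule(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            int score = Integer.parseInt(root.getAttribute("score"));
            boolean hasDistinction =
                root.getElementsByTagName("distinction").getLength() > 0;
            return (score > 80) == hasDistinction;
        } catch (Exception e) {
            throw new IllegalStateException("instance could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(satisfiesRule("<result score='85'><distinction/></result>"));
        System.out.println(satisfiesRule("<result score='85'/>"));
    }
}
```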

In my previous articles, I discussed the concept of namespaces, which is essential to understand before you dive into XML Schema, and the datatypes supported in XML Schema, as well as the simpleType construct used to further constrain those datatypes.

In this article, I will explain the schema constructs used to declare, extend, and restrict the structure of XML. You will also learn about the model groups, particles, and other constraints provided by XML Schema.

Oracle XML Developer's Kit (XDK) includes a W3C-compliant XML Schema processor, as well as several utilities for tasks such as creating schema datatypes and restricting them programmatically using the APIs, parsing and validating the XML Schema structure itself, traversing the Abstract Data Model of an XML Schema, and so on. Check out the oracle.xml.parser.schema and oracle.xml.schemavalidator packages.

The Content and Model

Element Content

In an XML document, the content of an element is the content enclosed between its <opening> and </closing> tag. An element can have only four types of content: TextOnly, ElementOnly, Mixed, and Empty. Attributes declared on an element are not considered to contribute to the content of an element. They are just part of the element on which they are declared, and contribute to the structure of XML.

TextOnly

The content of an element is said to be TextOnly when that element has only character data (also simply called text data) between its <opening> and </closing> tag, or in other words, when that element has no child elements. For example:

<TextOnly>some character data</TextOnly>
ElementOnly

The content of an element is said to be ElementOnly, when that element has only child elements between its <opening> and </closing> tag, optionally separated by whitespaces (space, tab, newline, carriage return). These whitespaces are called ignorable whitespaces, and are often used for indenting the XML. Therefore the following:

ElementOnly content without whitespaces

<ElementOnly><child1 .../><child2 .../></ElementOnly>
is the same as:

ElementOnly content with whitespaces

<ElementOnly>
<child1 .../>
<child2 .../>
</ElementOnly>
Mixed

The content of an element is said to be Mixed when that element has character data interspersed with child elements between its <opening> and </closing> tag. (In other words, its content has both character data as well as child elements.) When the content is mixed, then so-called ignorable whitespaces are not ignorable anymore. Therefore, the following:

<Mixed><child1 .../>some character data<child1 .../></Mixed>
is different than:
<Mixed>
<child1 .../>
some character data
<child1 .../>
</Mixed>
Empty

The content of an element is said to be Empty when that element has absolutely nothing between the <opening> and </closing> tag, not even whitespaces. For example:

<Empty></Empty>
For brevity and clarity, an element with empty content can also be written as a single empty-element tag, as follows:
<Empty />
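The four kinds of element content described above can be told apart mechanically. Here is a rough sketch that classifies the content of a document's root element with the JDK's DOM parser. The class name ContentKindSketch is made up for illustration, and because no schema is consulted, the sketch simply assumes that whitespace between child elements is ignorable.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ContentKindSketch {
    // Returns "TextOnly", "ElementOnly", "Mixed", or "Empty" for the root element.
    static String classify(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            boolean hasElement = false;
            boolean hasText = false;
            boolean hasNonBlankText = false;
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                short type = child.getNodeType();
                if (type == Node.ELEMENT_NODE) {
                    hasElement = true;
                } else if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                    hasText = true;
                    if (!child.getNodeValue().trim().isEmpty()) {
                        hasNonBlankText = true;
                    }
                }
            }
            if (hasElement) {
                // Whitespace-only text between child elements is treated as ignorable.
                return hasNonBlankText ? "Mixed" : "ElementOnly";
            }
            // Without child elements, any character data (even whitespace) is text.
            return hasText ? "TextOnly" : "Empty";
        } catch (Exception e) {
            throw new IllegalStateException("instance could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("<Empty></Empty>"));
        System.out.println(classify("<Empty />"));
    }
}
```

Both spellings of the empty element produce a root with no child nodes, which is why the two forms are interchangeable.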
Content Models

In an XML grammar, one declares the content model of an element to specify the type of element content in the corresponding XML instance document. Therefore, a content model is the definition of the element content.

The figure below illustrates how to declare the content models in an XML Schema. Trace the paths in this figure starting from <schema>, to understand how to declare the content model for the four types of element content, with and without attribute declarations. Let's examine each one briefly.

Figure 1. Declare the content models in an XML Schema

TextOnly

In the illustration above, trace the path until simpleType-1 to declare an element with TextOnly content model:

<xsd:element name="TextOnly">
<xsd:simpleType>
<xsd:restriction base="xsd:string" />
</xsd:simpleType>
</xsd:element>

OR equivalent

<xsd:element name="TextOnly" type="xsd:string" />
The above schema declares an element named "TextOnly" (can be anything) with the TextOnly content model, whose content must be a string in the corresponding XML instance. When the content model of an element is TextOnly there is always a simpleType associated with it that indicates the datatype of that element. For example, in this case the datatype for element TextOnly is string. See the corresponding XML instance for this schema in the previous section.

As mentioned previously, attributes don't contribute to the element content; therefore, another example of an XML instance with a TextOnly content, and with attributes, is:

<TextOnly att="val">some character data</TextOnly>
Now trace the path in Figure 1 until simpleContent-3 to declare an element with TextOnly content model, and with attributes:
<xsd:element name="TextOnly">
<xsd:complexType>
<xsd:simpleContent>
<xsd:extension base="xsd:string">
<xsd:attribute name="att" type="xsd:string" use="required" />
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
</xsd:element>
The above schema declares an element named "TextOnly" with TextOnly content model whose content must be a string and which must have an attribute named "att" in the corresponding XML instance.
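To see the effect of the required attribute, the declaration above can be tried against a few instances. This is a sketch using the standard JAXP validation API rather than the Oracle XDK; the class name TextOnlyWithAttributeSketch and the test instances are assumptions for illustration, while the schema is the one just shown.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class TextOnlyWithAttributeSketch {
    // TextOnly content model with a required attribute, as in the article.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='TextOnly'><xsd:complexType><xsd:simpleContent>"
      + "<xsd:extension base='xsd:string'>"
      + "<xsd:attribute name='att' type='xsd:string' use='required'/>"
      + "</xsd:extension>"
      + "</xsd:simpleContent></xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<TextOnly att='val'>some character data</TextOnly>"));
        System.out.println(isValid("<TextOnly>some character data</TextOnly>"));
    }
}
```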

ElementOnly

Trace the path in Figure 1 until either one of sequence-5, choice-6, or all-7 to declare an element with ElementOnly content model:

<xsd:element name="ElementOnly">
<xsd:complexType>
<xsd:sequence> <!-- could have used choice or all instead -->
<xsd:element name="child1" type="xsd:string" />
<xsd:element name="child2" type="xsd:string" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
The above schema declares an element named "ElementOnly" with ElementOnly content model. The element "ElementOnly" must have the child elements "child1" and "child2" in the corresponding XML instance document. See the corresponding XML instance for this schema in the previous section.

Another XML instance with ElementOnly element content and with attributes looks like:

<ElementOnly att="val">
<child1 .../>
<child2 .../>
</ElementOnly>
To declare an element with ElementOnly content model and with attributes, the path in Figure 1 is the same as that for declaring the ElementOnly content model. The attributes are then declared within the complexType, as follows:
<xsd:element name="ElementOnly">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" />
<xsd:element name="child2" type="xsd:string" />
</xsd:sequence>
<xsd:attribute name="att" type="xsd:string" use="required" />
</xsd:complexType>
</xsd:element>
Mixed

Trace the path in Figure 1 until either one of sequence-5, choice-6, or all-7 to declare an element with Mixed content model, which is identical to declaring the ElementOnly content model, but this time set the mixed attribute on the complexType to true, as follows:

<xsd:element name="Mixed">
<xsd:complexType mixed="true">
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" />
<xsd:element name="child2" type="xsd:string" />
</xsd:sequence>
<xsd:attribute name="att" type="xsd:string" use="required" />
</xsd:complexType>
</xsd:element>
The corresponding XML instance for the above schema looks like
<Mixed att="val">
<child1 .../>
some character data
<child2 .../>
</Mixed>
Empty

Trace the path until complexType-2 to declare an element with Empty content model, with or without attributes:

<xsd:element name="EmptyContentModels">
<xsd:complexType>
<xsd:sequence>

<xsd:element name="Empty1">
<xsd:complexType />
</xsd:element>

<xsd:element name="Empty2">
<xsd:complexType>
<xsd:attribute name="att" type="xsd:string" use="required" />
</xsd:complexType>
</xsd:element>

</xsd:sequence>
</xsd:complexType>
</xsd:element>
The corresponding XML instance for the above schema looks like
<EmptyContentModels>
<Empty1 />
<Empty2 att="val" />
</EmptyContentModels>
Model Groups

When the content model of an element is declared to be ElementOnly (or Mixed), which means that the element has child elements, you can specify the order and occurrence of the child elements in more detail using model groups. A model group consists of particles; a particle can be an element declaration or yet another model group. A model group itself can have a cardinality, which can be refined using the minOccurs and maxOccurs attributes. These characteristics make model groups quite powerful.

The three model groups supported by XML Schema are:

  • Sequence - (a , b)* - means that the child elements declared within the sequence model group must occur in the corresponding XML-instance in the same order as defined in the schema. The cardinality of a sequence model group can range from 0 to unbounded. A sequence model group can further contain a sequence or a choice model group recursively.
  • Choice - (a | b)* - means that from the set of child elements declared within the choice model group exactly one element must occur in the corresponding XML-instance. The cardinality of a choice model group can range from 0 to unbounded. A choice model group can further contain a sequence or a choice model group recursively.
  • All - {a , b}? - means that the entire set of child elements declared within the all model group must occur in the corresponding XML-instance, but unlike the sequence model group, the order is not important. The child elements can therefore occur in any order. The cardinality of an all model group can only be either 0 or 1. An all model group can only contain element declarations and not any other model group.

These model groups can either be declared in-line or as a global declaration (immediate child of <schema> construct with a name for re-usability). A global model group must be declared within the <group> construct, which you can later refer to by its name. But unlike the in-line model groups, the minOccurs/maxOccurs attributes cannot be declared on the globally declared model groups. When required, you can use the minOccurs/maxOccurs attributes when referencing the globally declared model group. For example:

<xsd:group name="globalDecl">
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" />
<xsd:element name="child2" type="xsd:string" />
</xsd:sequence>
</xsd:group>
Subsequently, you can reference the globally declared model group using the group construct along with the minOccurs/maxOccurs attributes, if required, as follows:
<xsd:group ref="globalDecl" maxOccurs="unbounded" />
Here is a complex example for a much better understanding of model groups:
((a | b)* , c+)?

<xsd:element name="complexModelGroup">
<xsd:complexType>

<xsd:sequence minOccurs="0" maxOccurs="1">
<xsd:choice minOccurs="0" maxOccurs="unbounded">
<xsd:element name="a" type="xsd:string" />
<xsd:element name="b" type="xsd:string" />
</xsd:choice>
<xsd:element name="c" type="xsd:string" minOccurs="1" maxOccurs="unbounded" />
</xsd:sequence>

</xsd:complexType>
</xsd:element>
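To get a feel for what the content model ((a | b)* , c+)? accepts, it can help to run a few instances through a validator. The following sketch does that with the standard JAXP validation API; the class name ModelGroupSketch and the instance() helper are hypothetical, and the schema is the nested model group from the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ModelGroupSketch {
    // ((a | b)* , c+)? expressed as an optional sequence wrapping a repeated choice.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='complexModelGroup'><xsd:complexType>"
      + "<xsd:sequence minOccurs='0' maxOccurs='1'>"
      + "<xsd:choice minOccurs='0' maxOccurs='unbounded'>"
      + "<xsd:element name='a' type='xsd:string'/>"
      + "<xsd:element name='b' type='xsd:string'/>"
      + "</xsd:choice>"
      + "<xsd:element name='c' type='xsd:string' minOccurs='1' maxOccurs='unbounded'/>"
      + "</xsd:sequence>"
      + "</xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    // Builds an instance whose children are empty elements with the given names.
    static String instance(String... childNames) {
        StringBuilder xml = new StringBuilder("<complexModelGroup>");
        for (String name : childNames) {
            xml.append('<').append(name).append("/>");
        }
        return xml.append("</complexModelGroup>").toString();
    }

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(instance("a", "b", "a", "c", "c")));
        System.out.println(isValid(instance("a")));
    }
}
```

An instance with no children passes because the outer sequence is optional, while a lone "a" fails because once the sequence starts, at least one "c" must follow.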
The complexType story

You now have enough information to write a simple schema for an XML document. But many advanced concepts in XML Schema remain to be addressed.

complexType is another of the most powerful constructs in XML Schema. Apart from allowing you to declare all four content models with or without attributes, it lets you derive a new complexType by inheriting from an already declared complexType. The derived complexType can then either add more declarations to the ones inherited from the base complexType (using extension) or restrict the declarations from the base complexType (using restriction).

A complexType can be extended or restricted using either simpleContent or complexContent. A complexType with simpleContent declares a TextOnly content model, with or without attributes. A complexType with complexContent can be used to declare the remaining three content models—ElementOnly, Mixed, or Empty—with or without attributes.

Extending a complexType

simpleContent

Figure 2. A complexType with simpleContent can only be extended to add attributes.

A complexType with simpleContent can extend either a simpleType or a complexType with simpleContent. As illustrated in Figure 2, in the derived complexType, then, the only thing you are allowed to do is add attributes. For example:

<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<xsd:complexType name="DerivedType1">
<xsd:simpleContent>
<xsd:extension base="xsd:string">
<xsd:attribute name="att1" type="xsd:string" use="required" />
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>

<xsd:complexType name="DerivedType2">
<xsd:simpleContent>
<xsd:extension base="tns:DerivedType1">
<xsd:attribute name="att2" type="xsd:string" use="required" />
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>

<xsd:element name="SCExtension">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Derived1" type="tns:DerivedType1" />
<xsd:element name="Derived2" type="tns:DerivedType2" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>

</xsd:schema>
In the above schema:

  1. DerivedType1 extends the built-in simpleType string, and adds an attribute att1.
  2. DerivedType2 inherits attribute att1 from the base DerivedType1, which is a "complexType with simpleContent," and adds an attribute att2.

An XML instance corresponding to the above schema looks like:

<SCExtension xmlns="http://inheritance-ext-res"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://inheritance-ext-res CTSCExt.xsd">

<Derived1 att1="val">abc</Derived1>
<Derived2 att1="val" att2="val">def</Derived2>

</SCExtension>
complexContent

Figure 3. A complexType with complexContent can be used to extend the model group as well as add attributes.

A complexType with complexContent can extend either a complexType or a complexType with complexContent. As illustrated in Figure 3, in the derived complexType, then, you are allowed to add attributes, as well as extend the model group. For example:

<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<!-- (child1)+ -->
<xsd:complexType name="BaseType">
<xsd:sequence maxOccurs="unbounded">
<xsd:element name="child1" type="xsd:string" />
</xsd:sequence>
<xsd:attribute name="att1" type="xsd:string" use="required" />
</xsd:complexType>

<!-- ((child1)+ , (child2 | child3)) -->
<xsd:complexType name="DerivedType">
<xsd:complexContent>
<xsd:extension base="tns:BaseType">
<xsd:choice>
<xsd:element name="child2" type="xsd:string" />
<xsd:element name="child3" type="xsd:string" />
</xsd:choice>
<xsd:attribute name="att2" type="xsd:string" use="required" />
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

<xsd:element name="CCExtension">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Base" type="tns:BaseType" />
<xsd:element name="Derived" type="tns:DerivedType" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>

</xsd:schema>
In the above schema:

  1. The DerivedType inherits the sequence model group from the base complexType and adds a choice model group, making the final content model of the derived complexType ((child1)+ , (child2 | child3)).
  2. The DerivedType inherits attribute att1 from the BaseType, and adds attribute att2.

An XML instance corresponding to the above schema looks like:

<CCExtension xmlns="http://inheritance-ext-res"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://inheritance-ext-res CTCCExt.xsd">

<Base att1="val">
<child1>This is base</child1>
<child1>This is base</child1>
</Base>

<Derived att1="val" att2="val">
<child1>This is inherited from base</child1>
<child1>This is inherited from base</child1>
<child1>This is inherited from base</child1>
<child3>This is added in the derived</child3>
</Derived>

</CCExtension>
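The combined content model ((child1)+ , (child2 | child3)) can be exercised with a short validation sketch. It uses the standard JAXP API rather than the Oracle XDK. To keep it brief, the target namespace is dropped and DerivedType is exposed through a global element named Derived; both simplifications, and the class name ComplexContentExtensionSketch, are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ComplexContentExtensionSketch {
    // BaseType and DerivedType from the article, without a target namespace.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:complexType name='BaseType'>"
      + "<xsd:sequence maxOccurs='unbounded'>"
      + "<xsd:element name='child1' type='xsd:string'/>"
      + "</xsd:sequence>"
      + "<xsd:attribute name='att1' type='xsd:string' use='required'/>"
      + "</xsd:complexType>"
      + "<xsd:complexType name='DerivedType'><xsd:complexContent>"
      + "<xsd:extension base='BaseType'>"
      + "<xsd:choice>"
      + "<xsd:element name='child2' type='xsd:string'/>"
      + "<xsd:element name='child3' type='xsd:string'/>"
      + "</xsd:choice>"
      + "<xsd:attribute name='att2' type='xsd:string' use='required'/>"
      + "</xsd:extension>"
      + "</xsd:complexContent></xsd:complexType>"
      + "<xsd:element name='Derived' type='DerivedType'/>"
      + "</xsd:schema>";

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<Derived att1='v' att2='v'><child1>x</child1><child3>z</child3></Derived>"));
        System.out.println(isValid(
            "<Derived att1='v' att2='v'><child3>z</child3><child1>x</child1></Derived>"));
    }
}
```

The second instance is rejected because extension appends the new choice after the inherited sequence, so the inherited children must come first.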
Restricting a complexType

simpleContent

Figure 4. A complexType with simpleContent can be used to restrict the datatype and attributes.

A complexType with simpleContent can only restrict a complexType with simpleContent. As illustrated in Figure 4, in the derived complexType, then, you can restrict the simpleType of the base, as well as restrict the type and use (optional, mandatory, etc.) of the attributes from the base. For example:

<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<xsd:complexType name="BaseType">
<xsd:simpleContent>
<xsd:extension base="xsd:string">
<xsd:attribute name="att1" type="xsd:string" use="optional" />
<xsd:attribute name="att2" type="xsd:integer" use="optional" />
</xsd:extension>
</xsd:simpleContent>
</xsd:complexType>

<xsd:complexType name="DerivedType">
<xsd:simpleContent>
<xsd:restriction base="tns:BaseType">
<xsd:maxLength value="35" />
<xsd:attribute name="att1" use="prohibited" />

<xsd:attribute name="att2" use="required">
<xsd:simpleType>
<xsd:restriction base="xsd:integer">
<xsd:totalDigits value="2" />
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>

</xsd:restriction>
</xsd:simpleContent>
</xsd:complexType>

<xsd:element name="SCRestriction">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Base" type="tns:BaseType" />
<xsd:element name="Derived" type="tns:DerivedType" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>

</xsd:schema>
In the above schema:

  1. You restricted the simpleType content of the base (of type string) to a string with a maximum length of 35 in the derived.
  2. You blocked the attribute att1 from being inherited from base.
  3. You restricted the type of the attribute att2 to an integer of 2 digits, and made it mandatory from optional.

An XML instance corresponding to the above schema looks like:

<SCRestriction xmlns="http://inheritance-ext-res"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://inheritance-ext-res CTSCRes.xsd">

<Base att1="val">This is base type</Base>
<Derived att2="12">This is restricted in the derived</Derived>

</SCRestriction>
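Each of the three restrictions can be observed by validating instances that break them one at a time. The sketch below does so with the standard JAXP validation API; as a simplification it omits the target namespace and declares a global element named Derived of type DerivedType, and the class name SimpleContentRestrictionSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SimpleContentRestrictionSketch {
    // BaseType and DerivedType from the article, without a target namespace.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:complexType name='BaseType'><xsd:simpleContent>"
      + "<xsd:extension base='xsd:string'>"
      + "<xsd:attribute name='att1' type='xsd:string' use='optional'/>"
      + "<xsd:attribute name='att2' type='xsd:integer' use='optional'/>"
      + "</xsd:extension>"
      + "</xsd:simpleContent></xsd:complexType>"
      + "<xsd:complexType name='DerivedType'><xsd:simpleContent>"
      + "<xsd:restriction base='BaseType'>"
      + "<xsd:maxLength value='35'/>"
      + "<xsd:attribute name='att1' use='prohibited'/>"
      + "<xsd:attribute name='att2' use='required'>"
      + "<xsd:simpleType><xsd:restriction base='xsd:integer'>"
      + "<xsd:totalDigits value='2'/>"
      + "</xsd:restriction></xsd:simpleType>"
      + "</xsd:attribute>"
      + "</xsd:restriction>"
      + "</xsd:simpleContent></xsd:complexType>"
      + "<xsd:element name='Derived' type='DerivedType'/>"
      + "</xsd:schema>";

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<Derived att2='12'>This is restricted in the derived</Derived>"));
        System.out.println(isValid("<Derived att1='val' att2='12'>short</Derived>"));
    }
}
```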
complexContent

Figure 5. A complexType with complexContent can be used to restrict the model group as well as the attributes.

A complexType with complexContent can either restrict a complexType or a complexType with complexContent. As illustrated in Figure 5, in the derived complexType, then, you must repeat the entire content model from the base and restrict them as desired, if required. You can restrict the attributes the same way as you would do while restricting a simpleContent. For example:

<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<xsd:complexType name="BaseType">
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" maxOccurs="unbounded" />
<xsd:element name="child2" type="xsd:string"/>
</xsd:sequence>
<xsd:attribute name="att1" type="xsd:string" use="optional" />
</xsd:complexType>

<xsd:complexType name="DerivedType">
<xsd:complexContent>
<xsd:restriction base="tns:BaseType">
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" maxOccurs="4" />

<xsd:element name="child2">
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:maxLength value="35" />
</xsd:restriction>
</xsd:simpleType>
</xsd:element>

</xsd:sequence>
<xsd:attribute name="att1" type="xsd:string" use="prohibited" />
</xsd:restriction>
</xsd:complexContent>
</xsd:complexType>

<xsd:element name="CCRestriction">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Base" type="tns:BaseType" />
<xsd:element name="Derived" type="tns:DerivedType" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>

</xsd:schema>
In the above schema:

  1. You restricted the cardinality of child1 in the DerivedType, inherited from the BaseType, from unbounded to 4.
  2. You restricted the type of child2 in the DerivedType, inherited from the BaseType, to a string with a maximum length of 35.
  3. You prohibited the attribute att1 from being inherited from the BaseType.

An XML instance corresponding to the above schema looks like:

<CCRestriction xmlns="http://inheritance-ext-res"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://inheritance-ext-res CTCCRes.xsd">

<Base att1="val">
<child1>This is base type</child1>
<child2>This is base type</child2>
</Base>

<Derived>
<child1>This is restricted in the derived</child1>
<child2>This is restricted in the derived</child2>
</Derived>

</CCRestriction>
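The restricted cardinality and the prohibited attribute can be checked in the same way. This is a sketch with the standard JAXP validation API, again without a target namespace and with a global element named Derived of type DerivedType; those simplifications, the class name ComplexContentRestrictionSketch, and the derived() helper are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ComplexContentRestrictionSketch {
    // BaseType and DerivedType from the article, without a target namespace.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:complexType name='BaseType'><xsd:sequence>"
      + "<xsd:element name='child1' type='xsd:string' maxOccurs='unbounded'/>"
      + "<xsd:element name='child2' type='xsd:string'/>"
      + "</xsd:sequence>"
      + "<xsd:attribute name='att1' type='xsd:string' use='optional'/>"
      + "</xsd:complexType>"
      + "<xsd:complexType name='DerivedType'><xsd:complexContent>"
      + "<xsd:restriction base='BaseType'><xsd:sequence>"
      + "<xsd:element name='child1' type='xsd:string' maxOccurs='4'/>"
      + "<xsd:element name='child2'><xsd:simpleType>"
      + "<xsd:restriction base='xsd:string'><xsd:maxLength value='35'/></xsd:restriction>"
      + "</xsd:simpleType></xsd:element>"
      + "</xsd:sequence>"
      + "<xsd:attribute name='att1' type='xsd:string' use='prohibited'/>"
      + "</xsd:restriction>"
      + "</xsd:complexContent></xsd:complexType>"
      + "<xsd:element name='Derived' type='DerivedType'/>"
      + "</xsd:schema>";

    // Builds a Derived instance with the given number of child1 elements and one child2.
    static String derived(int child1Count, String attributes) {
        StringBuilder xml = new StringBuilder("<Derived" + attributes + ">");
        for (int i = 0; i < child1Count; i++) {
            xml.append("<child1>x</child1>");
        }
        return xml.append("<child2>y</child2></Derived>").toString();
    }

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(derived(4, "")));
        System.out.println(isValid(derived(5, "")));
    }
}
```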
Assembling Schemas

Imports, includes, and chameleon effects

Many Java projects involve multiple different classes and packages instead of a single, huge Java file because modularization makes the code easy to re-use, read, and maintain. Subsequently, you have to stick the necessary import into the classes before you can use them. Similarly, in XML Schema, you have to manage multiple different schemas from various different namespaces and you need to stick the necessary import in the schemas before you use them.

XML Schemas can be assembled using the <import/> and <include/> schema constructs, which must appear at the top of the schema, before any other declarations:

<schema>
<import namespace="foo" schemaLocation="bar.xsd" />
<include schemaLocation="baz.xsd" />
...
</schema>
Usually <import /> is used when the schema being imported has a targetNamespace different from that of the importing schema, while <include /> is used when the schema being included has the same targetNamespace or no targetNamespace declared.

Let's look at an example involving two schemas, A and B, with A referring to items declared in B.

Case I
When both the schemas have a targetNamespace and the targetNamespace of schema A (tnsA) is different from the targetNamespace of schema B (tnsB), then A must import B.

<import namespace="tnsB" schemaLocation="B.xsd" />
It is however an error for A to import B without specifying the namespace, as well as for A to include B.

Case II
When both the schemas have a targetNamespace and the targetNamespace of schema A (tnsAB) is the same as the targetNamespace of schema B (tnsAB), then A must include B.

<include schemaLocation="B.xsd" />
It is an error for A to import B.

Case III
When neither schema A nor schema B has a targetNamespace, A must include B.

<include schemaLocation="B.xsd" />
Case IV
When schema A has no targetNamespace, and schema B has a targetNamespace (tnsB), then, A must import B.
<import namespace="tnsB" schemaLocation="B.xsd" />
It is an error for A to include B because B has a targetNamespace.

Case V
When schema A has a targetNamespace (tnsA) and schema B has no targetNamespace, then...? Loudly please! A should include B. But what if I say that in this case, A should import B? Actually, in this case A can either import or include B, and both are legal, though the effects are different.

When A includes B, all the included items from B get the namespace of A. Such an include is known as a chameleon include.

When you don't want such a chameleon effect to take place, you must use an import without specifying the namespace. An import without the namespace attribute allows unqualified reference to components with no target namespace.

<import schemaLocation="B.xsd" />
Importing or including a schema multiple times is not an error, because the schema processors can detect such a scenario and not load an already loaded schema. Therefore, it is not an error if A.xsd imports B.xsd and C.xsd; and both B.xsd and C.xsd individually import A.xsd. Circular references are not errors either but are strongly discouraged.

By the way, a mere import like <import /> is legal as well. This approach simply allows unqualified reference to foreign components with no target namespace without giving any hints as to where to find them. It is up to the schema processor to either throw an error or look up the unknown items using some mechanism, and this behaviour may vary from one schema processor to another. A mere <include /> is, however, illegal.
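The chameleon include from Case V can be observed with a small experiment. The sketch below writes two hypothetical schema files, A.xsd (targetNamespace urn:tnsA) and B.xsd (no targetNamespace), into a temporary directory and loads A with the standard JAXP API. The file contents, the namespace URI, the element name item, and the class name ChameleonIncludeSketch are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ChameleonIncludeSketch {
    // B has no targetNamespace and declares one global element.
    static final String B_XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='item' type='xsd:string'/>"
      + "</xsd:schema>";

    // A has a targetNamespace and includes B, so B's items take on A's namespace.
    static final String A_XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:tnsA'>"
      + "<xsd:include schemaLocation='B.xsd'/>"
      + "</xsd:schema>";

    // Writes both files side by side so the relative schemaLocation resolves.
    static Schema loadA() {
        try {
            Path dir = Files.createTempDirectory("chameleon");
            Files.write(dir.resolve("B.xsd"), B_XSD.getBytes(StandardCharsets.UTF_8));
            Path a = dir.resolve("A.xsd");
            Files.write(a, A_XSD.getBytes(StandardCharsets.UTF_8));
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(a.toFile());
        } catch (IOException | SAXException e) {
            throw new IllegalStateException("schemas could not be written or loaded", e);
        }
    }

    // Returns true when the instance validates against schema A.
    static boolean isValid(String xml) {
        try {
            loadA().newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException rejected) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException("instance could not be read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<item xmlns='urn:tnsA'>x</item>"));
        System.out.println(isValid("<item>x</item>"));
    }
}
```

The element declared in B is only recognized in A's namespace; the unqualified form is rejected because, after the include, no declaration exists for an item element with no namespace.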

Rules of thumb:

  1. <include/> - is as good as saying that the <include/>d schema is defined in-line in the including schema.
  2. <import/> - is always used when <import/>ed schema has a targetNamespace, which is different than the targetNamespace of the importing schema.

Redefining Schemas

You may not always want to assemble schemas in their original forms. For example, you may want to modify the components being imported from the schema. In such cases, when we want to redefine a declaration without changing its name, we use the redefine component to do this, with the constraint that the schema which is to be redefined must either have (a) the same targetNamespace as the <redefine>ing schema document, or have (b) no targetNamespace at all, in which case the <redefine>d schema document is converted to the <redefine>ing schema document's targetNamespace.

For example:

actual.xsd
<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<xsd:complexType name="BaseType">
<xsd:sequence>
<xsd:element name="child1" type="xsd:string" />
</xsd:sequence>
<xsd:attribute name="att1" type="xsd:string" use="required" />
</xsd:complexType>

<xsd:complexType name="DerivedType">
<xsd:complexContent>
<xsd:extension base="tns:BaseType">
<xsd:choice>
<xsd:element name="child2" type="xsd:string" />
<xsd:element name="child3" type="xsd:string" />
</xsd:choice>
<xsd:attribute name="att2" type="xsd:string" use="required" />
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

</xsd:schema>
redefine.xsd
<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://inheritance-ext-res"
xmlns:tns="http://inheritance-ext-res"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">

<xsd:redefine schemaLocation="actual.xsd">
<xsd:complexType name="DerivedType">
<xsd:complexContent>
<xsd:extension base="tns:DerivedType">
<xsd:sequence>
<xsd:element name="child4" type="xsd:string" />
</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>
</xsd:redefine>

<xsd:element name="Redefine">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Base" type="tns:BaseType" />
<xsd:element name="Derived" type="tns:DerivedType" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>

</xsd:schema>
In the above schema:

  1. You redefined the DerivedType complexType by adding one more element to the content model, without changing its name.
  2. Because BaseType is not redefined in the redefining schema, it is inherited as is.

Note that the name of a type is not changed when redefining it. Therefore, redefined types use themselves as their base types.

In the above example, we redefine a complexType named DerivedType without changing its name. While redefining DerivedType, any reference to "DerivedType" (for example base="tns:DerivedType") is supposed to refer to the actual DerivedType. After the type is redefined, any reference to the DerivedType is supposed to refer to the redefined type.

An XML instance corresponding to the above-redefined schema looks like:

<Redefine xmlns="http://inheritance-ext-res"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://inheritance-ext-res redefine.xsd">

<Base att1="val">
<child1>This is base type</child1>
</Base>

<Derived att1="val" att2="val">
<child1>This is inherited from the base as is</child1>
<child2>This is added in the derived</child2>
<child4>This is added when redefining</child4>
</Derived>

</Redefine>
Constraints

Identity constraint

XML Schema allows you to enforce uniqueness constraints on the content of elements and attributes, which guarantee that the values of the specified elements or attributes are unique in the instance document. To enforce uniqueness, you must first identify the item whose value is to be checked for uniqueness (an ISBN number, for example). Then you must identify the set in which the values of those selected items should be checked for uniqueness (a set of books, for example).

XML Schema provides two constructs, unique and key, to enforce uniqueness constraints. Unique ensures that if the specified values are not null, then they must be unique in the defined set; key ensures that the specified values are never null and are unique in the defined set.

There is one more construct, keyref, which points to a key that is already defined. Keyref then ensures that the value of the specified item within keyref exists in the set of keys the keyref is pointing to.

All three constructs have the same syntax (all of them use a selector and fields) but different meanings. The selector is used to define the set in which uniqueness is to be enforced, and field (multiple fields are used to define a composite item) is used to define the item whose value is to be checked for uniqueness. The values of both selector and field are XPath expressions. XPath expressions do not respect default namespaces; therefore, if the elements or attributes are in a namespace, it is essential to make the XPath expressions namespace aware by explicitly using prefixes bound to the appropriate namespace. For example:

<?xml version="1.0" ?>
<xsd:schema targetNamespace="http://identity-constraint"
xmlns:tns="http://identity-constraint"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified"
attributeFormDefault="unqualified">


<xsd:complexType name="BookType">
<xsd:sequence>
<xsd:element name="title" type="xsd:string" />
<xsd:element name="half-isbn" type="xsd:string" />
<xsd:element name="other-half-isbn" type="xsd:float" />
</xsd:sequence>
</xsd:complexType>

<xsd:element name="Books">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="Book" type="tns:BookType" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>

<xsd:key name="isbn">
<xsd:selector xpath=".//tns:Book" />
<xsd:field xpath="tns:half-isbn" />
<xsd:field xpath="tns:other-half-isbn" />
</xsd:key>

</xsd:element>

</xsd:schema>
In the above schema, we declared a key named "isbn" that says, "The composite value (half-isbn + other-half-isbn) specified by field must be not null and unique in the set of books, as specified by the selector."
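The composite key can be seen at work by validating one document with distinct ISBN pairs and one with a repeated pair. The sketch below uses the standard JAXP validation API with the schema above; the class name IsbnKeySketch, the book() and books() helpers, and the sample values are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class IsbnKeySketch {
    static final String NS = "http://identity-constraint";

    // The Books schema from the article, with its composite key named "isbn".
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='" + NS + "' xmlns:tns='" + NS + "'"
      + " elementFormDefault='qualified'>"
      + "<xsd:complexType name='BookType'><xsd:sequence>"
      + "<xsd:element name='title' type='xsd:string'/>"
      + "<xsd:element name='half-isbn' type='xsd:string'/>"
      + "<xsd:element name='other-half-isbn' type='xsd:float'/>"
      + "</xsd:sequence></xsd:complexType>"
      + "<xsd:element name='Books'><xsd:complexType><xsd:sequence>"
      + "<xsd:element name='Book' type='tns:BookType' maxOccurs='unbounded'/>"
      + "</xsd:sequence></xsd:complexType>"
      + "<xsd:key name='isbn'>"
      + "<xsd:selector xpath='.//tns:Book'/>"
      + "<xsd:field xpath='tns:half-isbn'/>"
      + "<xsd:field xpath='tns:other-half-isbn'/>"
      + "</xsd:key>"
      + "</xsd:element></xsd:schema>";

    // Builds one Book element with the two halves of its ISBN.
    static String book(String halfIsbn, String otherHalfIsbn) {
        return "<Book><title>t</title><half-isbn>" + halfIsbn + "</half-isbn>"
            + "<other-half-isbn>" + otherHalfIsbn + "</other-half-isbn></Book>";
    }

    // Wraps the given Book elements in a namespace-qualified Books root.
    static String books(String... bookXml) {
        return "<Books xmlns='" + NS + "'>" + String.join("", bookXml) + "</Books>";
    }

    // Returns true when the instance validates against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            try {
                schema.newValidator().validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException rejected) {
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("schema could not be loaded or input not read", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(books(book("0201", "1"), book("0201", "2"))));
        System.out.println(isValid(books(book("0201", "1"), book("0201", "1"))));
    }
}
```

Two books may share one half of the ISBN as long as the pair as a whole differs, because the key is the combination of both fields.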

Unique Particle Attribution (UPA) Constraint

The UPA constraint requires that the content model of every element be specified in such a way that, while validating an XML instance, there is no ambiguity and the correct element declaration can always be determined. For example, the following schema violates the UPA constraint:

<xsd:element name="upa">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="a" minOccurs="0"/>
<xsd:element name="b" minOccurs="0"/>
<xsd:element name="a" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
...because in the corresponding XML-instance for the above schema:
<upa>
<a/>
</upa>
It cannot be determined which element declaration in the schema the element "a" in the XML instance corresponds to: the declaration for "a" that comes before the declaration for "b", or the declaration for "a" that comes after it. This restriction prevents you from writing an XML Schema for the type of XML instance you just saw. In this case, however, if you set the minOccurs of element "b" to anything greater than 0, then the UPA constraint is not violated.

The following, then, is a valid schema:

<xsd:element name="upa">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="a" minOccurs="0"/>
<xsd:element name="b" minOccurs="1"/>
<xsd:element name="a" minOccurs="0"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
...because in the corresponding XML-instance for the above schema:
<upa>
<a/>
<b/>
</upa>
It is quite clear that the element "a" in the XML instance is actually an instance of the element declaration for "a", which is before the element declaration for "b" in the schema.
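A schema processor that checks UPA reports the violation when the schema itself is loaded, before any instance is validated. The sketch below tries to compile both versions of the schema with the standard JAXP API. It assumes the JDK's bundled schema processor, which checks this constraint when building a schema as far as I know; other processors may report the problem differently. The class name UpaSketch is hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class UpaSketch {
    // Builds the "upa" schema from the article with the given minOccurs for element b.
    static String schemaWithMinOccursOfB(int minOccurs) {
        return "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:element name='upa'><xsd:complexType><xsd:sequence>"
            + "<xsd:element name='a' minOccurs='0'/>"
            + "<xsd:element name='b' minOccurs='" + minOccurs + "'/>"
            + "<xsd:element name='a' minOccurs='0'/>"
            + "</xsd:sequence></xsd:complexType></xsd:element>"
            + "</xsd:schema>";
    }

    // Returns true when the processor accepts the schema, false when it reports an error.
    static boolean compiles(String xsd) {
        try {
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
            return true;
        } catch (SAXException schemaError) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("b optional: " + compiles(schemaWithMinOccursOfB(0)));
        System.out.println("b required: " + compiles(schemaWithMinOccursOfB(1)));
    }
}
```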

Conclusion

Now that you have completed this series, you should understand:

  1. The concept of namespaces in XML and XML Schema
  2. The scalar datatypes supported in XML Schema, and how to further restrict them using simpleType
  3. The element content, content model, model groups, particles, extending and restricting a complexType, assembling schemas, identity constraint, and UPA, which allow you to define and constrain the structure of XML.

You should have a pretty good grasp of XML Schema by now.


Rahul Srivastava (rahuls@apache.org) is a senior member of the Oracle Application Server development team at Oracle and is presently working in the EAI space. He has contributed to the development of the Apache open-source Xerces2-J W3C-compliant validating XML parser, primarily in the area of W3C XML Schema. Rahul was also a contributor to JAXP and JSR-173 when working with Sun Microsystems as part of the Web services team.