Java: finding the validation mechanism for an arbitrary XML document

Unfortunately, there is no foolproof process for determining how to validate an arbitrary XML document. If you are receiving a document, you should not leave the choice of validation mechanism to a remote party (e.g. by downloading a DTD from the URI the document specifies). Doing so opens your application to, at the very least, a potential denial-of-service attack. A validation mechanism may not even be specified in the document: W3C XML Schema (XSD) does not require it; RELAX NG does not seem to support such a mechanism. Then there are some XML documents that just don't have a schema of any form.

Nevertheless, there are times when you need to inspect a document to find out what it is. Most commonly, support is required for multiple versions of a document, where the structure and validation mechanisms change over time.

Note: when talking about validation, this post is not referring to whether the XML is well formed or not. Any XML parser should be able to check the syntax. This is about external constraints imposed on the document structure via a schema, DTD, etc.

Validation schema hints

Validation information in an XML file should be taken as a hint, not an instruction.

Example XML document that specifies a DOCTYPE:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head> ...

Example XML file that provides a schema location hint:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">

Note that this is an equally valid way to express the above document:

<?xml version="1.0" encoding="UTF-8"?>
<web-app id="WebApp_ID" version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee">

There are a few things you can inspect to garner information on a document in order to determine how to validate it:

The DOCTYPE. If a document specifies a DTD, you've found its validation mechanism.
The Root Element. If the root element of a document is html, it is a good indication that it is an HTML file. This is generally not enough, though: there are numerous versions of HTML and you still need to pick the right DTD. (Let's pretend for a moment that all HTML files are well-formed XML.)
The Root Element's Namespace. If the XML file specifies a namespace, it is a good indicator of how to validate the document. Not all documents, particularly older ones, will specify a namespace.
Schema Location Hints. The presence of the XML Schema instance namespace http://www.w3.org/2001/XMLSchema-instance is a good indicator that an XML schema should be used. XML Schema provides two mechanisms to hint at the location of the XSD file: the schemaLocation and noNamespaceSchemaLocation attributes. An XML document that should be validated by schema may provide one, both, or neither of these attributes.
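The schemaLocation value is a whitespace-separated list of pairs: a namespace URI followed by a location hint for that namespace. A minimal sketch of splitting it up follows; the class and method names are made up for illustration and are not part of the project sources.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits an xsi:schemaLocation value into its pairs. */
final class SchemaLocationHints {

  private SchemaLocationHints() {
  }

  /**
   * @param schemaLocation the raw attribute value; may be null
   * @return namespace URI mapped to location hint, in document order
   */
  static Map<String, String> parse(String schemaLocation) {
    Map<String, String> hints = new LinkedHashMap<String, String>();
    if (schemaLocation == null || schemaLocation.trim().isEmpty()) {
      return hints;
    }
    // the value reads: namespace location namespace location ...
    String[] tokens = schemaLocation.trim().split("\\s+");
    if (tokens.length % 2 != 0) {
      throw new IllegalArgumentException(
          "Odd number of tokens in schemaLocation: " + schemaLocation);
    }
    for (int i = 0; i < tokens.length; i += 2) {
      hints.put(tokens[i], tokens[i + 1]);
    }
    return hints;
  }
}
```

The returned locations are still only hints. Treat them as lookup keys for schemas you already hold rather than as URIs to fetch.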

Extracting this information using Java

The standard Java API provides a number of packages for handling XML. Since all we want from the XML is "header" information, there is no need to parse the entire document. This makes the SAX and DOM parsers poor choices. The XML stream reader in StAX is a better fit.

Code fragment showing use of XMLStreamReader:

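      XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
      // the resolver is told about the DTD when the parser tries to load it
      xmlInputFactory.setProperty(XMLInputFactory.RESOLVER,
          new CustomEntityResolver(validationInfo));
      try {
        XMLStreamReader xmlStreamReader = xmlInputFactory
            .createXMLStreamReader(source);
        try {
          while (xmlStreamReader.hasNext()) {
            int eventType = xmlStreamReader.next();

            // stop reading on first element
            if (eventType == XMLStreamConstants.START_ELEMENT) {
              validationInfo.rootElementName = xmlStreamReader
                  .getLocalName();
              validationInfo.namespace = xmlStreamReader
                  .getNamespaceURI();
              // XSD_NAMESPACE_URI is the xsi namespace,
              // http://www.w3.org/2001/XMLSchema-instance
              validationInfo.schemaLocation = xmlStreamReader
                  .getAttributeValue(XSD_NAMESPACE_URI,
                      "schemaLocation");
              validationInfo.noNamespaceSchemaLocation = xmlStreamReader
                  .getAttributeValue(XSD_NAMESPACE_URI,
                      "noNamespaceSchemaLocation");

              return validationInfo;
            }
          }
          // a well-formed document always has a root element,
          // so this is not normally reached
          return validationInfo;
        } finally {
          xmlStreamReader.close();
        }
      } catch (XMLStreamException e) {
        // error handling is not shown in the original fragment;
        // rethrowing unchecked is an assumption
        throw new IllegalStateException(e);
      }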

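XMLStreamReader.next() iterates through the parsing events. There is a DTD event, but an implementation of XMLResolver can be used as an alternative. The parser hands the resolver the public and system identifiers of each external entity it wants to load, so the resolver can both report the entities and improve performance, for example by supplying its own content instead of letting the parser fetch the DTD.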
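The project's CustomEntityResolver is not reproduced in this post. As a rough, self-contained sketch of the idea, the hypothetical resolver below records the first public/system identifier pair it is asked for and returns an empty stream, so nothing is fetched. It assumes the parser consults the resolver for the external DTD subset, which the StAX implementation bundled with the JDK appears to do by default; other implementations or configurations may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLResolver;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch; not the CustomEntityResolver from the project. */
final class DtdHintDemo {

  /** Records the first entity request, which is assumed to be the DTD. */
  static final class RecordingResolver implements XMLResolver {
    String publicId;
    String systemId;

    @Override
    public Object resolveEntity(String publicID, String systemID,
        String baseURI, String namespace) {
      if (publicId == null && systemId == null) {
        publicId = publicID;
        systemId = systemID;
      }
      // an empty stream stands in for the DTD, so no download happens
      return new ByteArrayInputStream(new byte[0]);
    }
  }

  /** Returns { root element name, DTD public ID, DTD system ID }. */
  static String[] peek(String xml) {
    RecordingResolver resolver = new RecordingResolver();
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty(XMLInputFactory.RESOLVER, resolver);
    try {
      XMLStreamReader reader = factory
          .createXMLStreamReader(new StringReader(xml));
      try {
        String root = null;
        while (reader.hasNext()) {
          // stop reading on first element
          if (reader.next() == XMLStreamConstants.START_ELEMENT) {
            root = reader.getLocalName();
            break;
          }
        }
        return new String[] { root, resolver.publicId, resolver.systemId };
      } finally {
        reader.close();
      }
    } catch (XMLStreamException e) {
      // wrapping in an unchecked exception is a choice made for this sketch
      throw new IllegalStateException(e);
    }
  }
}
```

Because the DTD content is replaced with an empty stream, entities declared in the real DTD are unknown to the parser. That does not matter here, since reading stops at the root element.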

Sample output for three documents (XSD-based; DTD-based; no validation):

File: sample_xsd.xml
Root element: web-app
Namespace: http://java.sun.com/xml/ns/j2ee
DTD name: null
DTD URI: null
Schema:  http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd
No namespace schema: null

File: sample_dtd.xml
Root element: web-app
Namespace: null
DTD name: -//Sun Microsystems, Inc.//DTD Web Application 2.2//EN
DTD URI: http://java.sun.com/j2ee/dtds/web-app_2_2.dtd
Schema:  null
No namespace schema: null

File: build.xml
Root element: project
Namespace: null
DTD name: null
DTD URI: null
Schema:  null
No namespace schema: null

The first two examples above are for different generations of deployment descriptors for Java web applications. The last is an Ant build file, which does not support a static definition of its structure.
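One way to act on this output while still treating it as a hint is to use the extracted values only as keys into a table of schemas that ship with the application. The sketch below illustrates that idea; the class name and the /schemas/... resource paths are placeholders invented for the example, not part of the project.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: maps document hints to locally held schemas. */
final class ValidationChooser {

  // assumption: placeholder paths for files bundled with the application
  private static final Map<String, String> DTD_BY_PUBLIC_ID =
      new HashMap<String, String>();
  private static final Map<String, String> XSD_BY_QUALIFIED_ROOT =
      new HashMap<String, String>();

  static {
    DTD_BY_PUBLIC_ID.put(
        "-//Sun Microsystems, Inc.//DTD Web Application 2.2//EN",
        "/schemas/web-app_2_2.dtd");
    // keyed on namespace and root element, since one namespace
    // can cover more than one kind of document
    XSD_BY_QUALIFIED_ROOT.put(
        "{http://java.sun.com/xml/ns/j2ee}web-app",
        "/schemas/web-app_2_4.xsd");
  }

  private ValidationChooser() {
  }

  /** Returns a local resource path, or null if no schema is known. */
  static String choose(String dtdPublicId, String namespace,
      String rootElementName) {
    if (dtdPublicId != null && DTD_BY_PUBLIC_ID.containsKey(dtdPublicId)) {
      return DTD_BY_PUBLIC_ID.get(dtdPublicId);
    }
    if (namespace != null) {
      return XSD_BY_QUALIFIED_ROOT.get(
          "{" + namespace + "}" + rootElementName);
    }
    return null;
  }
}
```

With the three sample documents, this would pick the bundled XSD for sample_xsd.xml, the bundled DTD for sample_dtd.xml, and nothing for build.xml.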

Code

All the sources are available in a public Subversion repository.

Repository: http://illegalargumentexception.googlecode.com/svn/trunk/code/java/
License: MIT
Project: FindXmlValidationMechanism

Illegal Argument Exception

Saturday 21 February 2009