Following my post on XML parsing with JAXP (SAX and DOM APIs), this post presents simple examples of validating an XML stream with JAXP (Java API for XML Processing), a common interface for creating, parsing and manipulating XML documents using the standard SAX, DOM and XSLT APIs.

1. XML Validation
XML has become indispensable in Information Systems Architectures and J2EE. Used as a standard format for data exchange, standardized by the W3C, the XML document is present everywhere in applications, databases, and is at the heart of EAI exchanges.

For this reason, knowledge of XML parsing APIs such as DOM and SAX is often necessary when developing a J2EE application. Understanding the differences, strengths and weaknesses of these APIs is important to avoid the performance problems that can arise when using them.

To process XML documents, an application needs an XML parser to tokenize the XML stream and retrieve its data. An XML parser is the program that sits between the application and the XML documents: it reads an XML stream, ensures that it is well-formed, and may validate the document against a DTD or an XML Schema definition (XSD).

There are two standard APIs for parsing (and validating) XML documents:
1. SAX (Simple API for XML)
2. DOM (Document Object Model)

JAXP (Java API for XML Processing) provides a common interface for creating, parsing and manipulating XML documents using the standard SAX, DOM and XSLT APIs.

Well-formed and valid document
An XML document is well-formed if its structure meets the XML specification, i.e. it is syntactically correct. An XML document is valid if it is well-formed AND if its structure and data (elements and attributes) meet the rules defined in a definition document (DTD or XSD).
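
As a minimal sketch of the first half of that definition, the check below only tests well-formedness: validation is switched off, so no DTD or XSD is consulted. The class name WellFormedCheck and the sample strings are illustrative assumptions, not part of the examples of this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string is well-formed XML.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);    // no DTD validation, syntax check only
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler ignores warnings/errors and rethrows fatal errors.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                    // syntax error reported by the parser
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<people><person/></people>")); // true
        System.out.println(isWellFormed("<people><person></people>"));  // false: <person> is never closed
    }
}
```

A document accepted by this check is not necessarily valid: validity additionally requires a DTD or a schema, as shown in the rest of the article.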

In this article, we will study examples with the Document Type Definition (DTD) and XML Schema Definition (XSD).


2. Document Type Definition (DTD)
A Document Type Definition (DTD) describes the objects (such as elements, attributes, entities) of an XML document and the relationships between them. It specifies a set of constraints and establishes the trees that are acceptable in an XML document.

A DTD can be declared inside an XML document (i.e., inline), or referenced as an external file.
An inline DTD is wrapped in a DOCTYPE declaration, and has the following syntax:

<!DOCTYPE root-element [
   declarations
]>

A DTD can also be stored in an external file. An XML document can reference an external DTD via the following syntax:

<!DOCTYPE root-element SYSTEM "DTD-filename">
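
When the external DTD is not available at the location given in the DOCTYPE, a JAXP parser can be given an EntityResolver that supplies it from somewhere else. The sketch below serves a DTD from memory; the note.dtd name, its content and the class name are assumptions made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ExternalDtdSketch {

    // Hypothetical DTD content, served in place of a file named "note.dtd".
    static final String NOTE_DTD =
          "<!ELEMENT note (to, body)>\n"
        + "<!ELEMENT to (#PCDATA)>\n"
        + "<!ELEMENT body (#PCDATA)>\n";

    /** Validates the XML text against the DTD it references; returns the error messages. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enable DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve the SYSTEM identifier from memory; null means "use the default lookup".
            builder.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("note.dtd")
                    ? new InputSource(new StringReader(NOTE_DTD))
                    : null);
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // validation error: record it and continue
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            errors.add(String.valueOf(e.getMessage())); // fatal error or I/O problem
        }
        return errors;
    }

    public static void main(String[] args) {
        String header = "<!DOCTYPE note SYSTEM \"note.dtd\">";
        System.out.println(validate(header + "<note><to>A</to><body>B</body></note>"));
        System.out.println(validate(header + "<note><body>B</body></note>"));
    }
}
```

The first call should print an empty list and the second one a message about the content of note; the exact wording depends on the parser implementation.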

DTD Syntax
A DTD has its own syntax, different from XML syntax, which consists of declarations (for elements, attributes and so on) such as:

  • Document Type declaration:
    <!DOCTYPE ...>
  • Element declaration:
    <!ELEMENT element-name (element-content)>
    <!-- OR -->
    <!ELEMENT element-name category>
    

    Examples:

    <!ELEMENT title (#PCDATA)>              // contain parsed character data
    <!ELEMENT name (first_name, last_name)> // contain child elements
    <!ELEMENT person (title, name+, born, died?, nationality*)>
            // no indicator (exactly one), ? (zero or one), + (one or more), * (zero or more)
    
    <!ELEMENT linebreak EMPTY>   // an empty element
    <!ELEMENT message ANY>       // any combination of declared elements and text
    

    Category (or special content):
    o #PCDATA (Parsed Character Data): text that will be examined for entity references and tags;
    o EMPTY: empty element (for leaf elements only);
    o ANY: unrestricted content;

    Occurrence Indicators:
    o "+": one or more occurrences;
    o "*": zero or more occurrences;
    o "?": zero or exactly one occurrence;
    o No occurrence indicator: exactly one;

    Connector:
    o ",": indicate the sequence of the child elements
    o "|": choices (or) – choose only one of them

  • Attribute declaration: the attributes allowed in an element start-tag are declared in the DTD as follows:
    // Declaring "attribute" in DTD
    <!ATTLIST element-name
      attribute-1-name attribute-1-type default
      attribute-2-name attribute-2-type default
      ...
    >
    // default
    default-value|#REQUIRED|#IMPLIED|#FIXED value
    

    Examples:

    <!ATTLIST person ID CDATA #REQUIRED>
    <!ATTLIST trade action (buy|sell) #REQUIRED>  // enumeration type
    <!ATTLIST person 
        ID CDATA #REQUIRED
        SSNUM CDATA #IMPLIED
    >
    

    Attribute types:
    o CDATA (Character Data): an arbitrary text string (entity references in the value are still expanded, but the value is not constrained further).
    o ID: a unique identifier.
    o IDREF, IDREFS: reference(s) to ID value(s) defined in the document.
    o ENTITY, ENTITIES: name(s) of unparsed external entity(ies).
    o NMTOKEN, NMTOKENS: name token(s), i.e. word(s) not containing spaces.
    o Enumeration: list of NMTOKEN values separated by "|".

    Default:
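    o #REQUIRED: the attribute must be provided in the document.
    o #IMPLIED: the attribute is optional and has no default value (the application decides what its absence means).
    o #FIXED value: the attribute, if present, must have this value.
    o A literal default value, used when the attribute is absent.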

  • Entity declaration: An "entity" is a variable that defines replacement text or special characters; the entity reference, written &entity-name;, is replaced by the value of the variable wherever it is used. Entities can be declared inline or externally:
    // Inline "entity" declaration
    <!ENTITY entity-name "entity-value">
    // External "entity" declaration
    <!ENTITY entity-name SYSTEM "url">
    

    Examples:

    <!ENTITY author "Huseyin OZVEREN">  // In XML documents, entity referenced as &author;
    <!ENTITY mywebsite SYSTEM "http://www.javablog.fr">
    

  • Define Notation for an external entity:
    <!NOTATION ...>
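
The declarations above can be tried out with a validating JAXP parser. The following sketch assumes a made-up inline DTD that combines an element declaration, an enumerated attribute and an entity; the class name and the method signature are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DtdSyntaxSketch {

    // Illustrative inline DTD: one element, one enumerated attribute, one entity.
    static final String DOCTYPE =
          "<!DOCTYPE trade [\n"
        + "  <!ELEMENT trade (#PCDATA)>\n"
        + "  <!ATTLIST trade action (buy|sell) #REQUIRED>\n"
        + "  <!ENTITY author \"Huseyin OZVEREN\">\n"
        + "]>\n";

    /** Parses DOCTYPE + rootElement, records validation errors, returns the root text. */
    static String parse(String rootElement, List<String> errors) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // validate against the inline DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            Document doc = builder.parse(new InputSource(new StringReader(DOCTYPE + rootElement)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            errors.add(String.valueOf(e.getMessage()));
            return null;
        }
    }

    public static void main(String[] args) {
        List<String> errors = new ArrayList<>();
        // The entity reference &author; is replaced by its declared value.
        System.out.println(parse("<trade action=\"buy\">&author;</trade>", errors)); // Huseyin OZVEREN
        // "hold" is not in the enumeration (buy|sell): a validation error is recorded.
        parse("<trade action=\"hold\">x</trade>", errors);
        System.out.println(errors.size() + " validation error(s)");
    }
}
```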

Usage and Limitations of DTD
DTD defines the structure of XML documents, which could facilitate exchanges of documents between services. However, DTD has some limitations:

  • DTD has its own syntax (inherited from SGML DTD) and requires a dedicated processing tool. It does not use XML syntax, so it cannot be processed with an XML processor.
  • DTD does not support object-oriented concepts such as hierarchies and inheritance.
  • DTD's data types are limited to text strings; other data types such as number or date are not supported.
  • DTD does not support namespaces.
  • DTD's occurrence indicators are limited to 0, 1 and many; a specific number such as 8 cannot be expressed.


3. XML Schema Definition (XSD)
XML Schema, published by the W3C as a Recommendation in May 2001, is a description language that defines the structure and content types of an XML document. It overcomes the limitations of DTD and is meant to replace DTD for checking the validity of XML documents. In brief, an XML Schema:

  • is a well-formed XML document, which uses XML syntax;
  • is object-oriented and supports concepts like inheritance;
  • supports namespaces;
  • supports more data types;
  • offers more element occurrence indicators.

Note: XSD 1.1, the current version at the time of writing (September 2012), became a W3C Recommendation in April 2012.

So, the purpose of an XML Schema is to define the legal building blocks of an XML document, just like a DTD. The steps to follow in order to write an XSD document are:

  • Use of XML syntax
  • Defining rules:

    • define elements that can appear in a document
    • define attributes that can appear in a document
    • define which elements are child elements
    • define the order of child elements
    • define the number of child elements
    • define whether an element is empty or can include text
    • define data types for elements and attributes
    • define default and fixed values for elements and attributes

    Source: https://www.w3schools.com/xml/schema_intro.asp

  • Save the document with the extension “.xsd”
  • Reference the XSD document in the XML document like:
    <cars xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="car.xsd">
    

XSD Syntax
This article outlines XML Schema but is not a reference for the syntax of the language.
We assume the following schema declaration:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  • Elements:
    • The most common types are xs:string, xs:decimal, xs:integer, xs:boolean, xs:date, xs:time; for example:
      <xs:element name="age" type="xs:integer"/>
    • define a default value (default):
      <xs:element name="country" type="xs:string" default="USA"/>
    • define a fixed value (fixed):
      <xs:element name="country" type="xs:string" fixed="France"/>

  • Attributes:
    • only complex elements can have attributes;
    • the declaration of attributes with a default or fixed value is identical to that of elements:
      <xs:attribute name="firstname" type="xs:string" default="Huseyin"/>
      <xs:attribute name="lastname" type="xs:string" fixed="OZVEREN"/>
      
    • define a mandatory attribute (required):
      <xs:attribute name="email" type="xs:string" use="required"/>
    • define an optional attribute (optional):
      <xs:attribute name="email" type="xs:string" use="optional"/>

  • Restrictions
    It is possible

    • to place restrictions on attributes or elements;
    • to define a values range (minInclusive and maxInclusive):
      <xs:minInclusive value="minimum"/> <xs:maxInclusive value="maximum"/>
    • to define a list of values (enumeration):
      <xs:enumeration value="a_value"/>
    • to define a pattern (pattern):
      <xs:pattern value="[A-Z][A-Z][A-Z]"/> <xs:pattern value="([a-z])*"/>
    • to define a behaviour for the treatment of spaces:
      o spaces are kept (preserve):

      <xs:whiteSpace value="preserve"/>

      o LF, CR and TAB characters are replaced by spaces (replace):

      <xs:whiteSpace value="replace"/>

      o LF, CR and TAB characters are replaced by spaces, leading and trailing spaces are removed, and successive spaces are merged into a single space (collapse):

      <xs:whiteSpace value="collapse"/>
    • to define a length (length, minLength and maxLength):
      <xs:length value="8"/>
      <xs:minLength value="5"/>
      <xs:maxLength value="8"/>
      
    • to define restriction on decimals with fractionDigits and totalDigits

  • Complex elements
    • use of tag xs:complexType;
    • a complex element can be defined directly in the element itself, or by referencing a named complex type (which allows several elements to share the same complex type);
    • a complex type can extend another complex type (extension):
      <xs:extension base="basic_type">
    • a complex type can also restrict another (restriction):
      <xs:restriction base="xs:integer">
    • mix free text with tags (mixed) like:
      my web site <blogname>JAVA BLOG</blogname>
      <xs:complexType mixed="true">

  • Indicators
    • Indicators control how the elements are used;
    • Order indicators:
      o xs:all: the sub-elements may appear in any order;
      o xs:choice: only one of the sub-elements may appear;
      o xs:sequence: the sub-elements must appear in the specified order;
    • Occurrence indicators (how many times an element can appear):
      o maxOccurs: maximum number (default 1). For no upper limit: maxOccurs="unbounded";
      o minOccurs: minimum number (default 1);
    • Group indicators:
      o xs:group: groups elements logically;
      o xs:attributeGroup: groups attributes logically;

  • Extension
    • the xs:any element allows the document to contain additional elements after those explicitly defined:
      <xs:any minOccurs="0"/>
    • the xs:anyAttribute element allows attributes that are not specified in the schema;
    • the substitutionGroup attribute allows one element to be substituted for another, so that a schema can apply to XML documents whose tags do not carry the same name, for example:
      <name /> <nom />

      It is also possible to block the substitution.
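
To see some of these constructs at work, here is a small sketch that compiles a made-up schema (a pattern restriction, a value range and a bounded maxOccurs) with the javax.xml.validation API and checks a few documents against it. The order, code, item and quantity names are assumptions chosen for the illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class XsdRestrictionSketch {

    // Illustrative schema: pattern, minInclusive/maxInclusive and maxOccurs="3".
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"order\"><xs:complexType>"
        + "<xs:sequence>"
        + "<xs:element name=\"code\" type=\"codeType\"/>"
        + "<xs:element name=\"item\" type=\"xs:string\" minOccurs=\"1\" maxOccurs=\"3\"/>"
        + "</xs:sequence>"
        + "<xs:attribute name=\"quantity\" type=\"quantityType\" use=\"required\"/>"
        + "</xs:complexType></xs:element>"
        + "<xs:simpleType name=\"codeType\"><xs:restriction base=\"xs:string\">"
        + "<xs:pattern value=\"[A-Z][A-Z][A-Z]\"/>"
        + "</xs:restriction></xs:simpleType>"
        + "<xs:simpleType name=\"quantityType\"><xs:restriction base=\"xs:integer\">"
        + "<xs:minInclusive value=\"1\"/><xs:maxInclusive value=\"10\"/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:schema>";

    /** Returns true if the XML text is valid against the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // Without an ErrorHandler, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<order quantity=\"2\"><code>ABC</code><item>x</item></order>"));  // true
        System.out.println(isValid("<order quantity=\"2\"><code>abc</code><item>x</item></order>"));  // false: pattern
        System.out.println(isValid("<order quantity=\"11\"><code>ABC</code><item>x</item></order>")); // false: range
    }
}
```

The maxOccurs="3" bound is the kind of constraint that a DTD cannot express, since its occurrence indicators stop at "zero", "one" and "many".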

Usage and Limitations of XSD
XSD is a description language that defines the structure and content types of an XML document. It overcomes the limitations of DTD and is meant to replace DTD for checking the validity of XML documents.

XSD makes it possible to define standards (Internet languages such as XHTML, RSS, WSDL, etc.), to enforce data integrity, and to validate documents much more precisely than a DTD. However, an XSD is long to write for complex structures.


4. XSD vs DTD

The "old" DocType (DTD: Document Type Definition) defines a structure for an XML file; however, the XML Schema standard is intended to replace it for several reasons:

  • XML Schemas are XML documents; hence all the tools (validators, parsers, processors, …) as well as the scripts and languages (XSLT, XPath, …) that work on XML documents can also be used on XSD documents.
  • XML Schemas allow a much finer control of the document structure: order (or absence of order) of sub-elements, number of occurrences of an element, very precise control of the data types of elements and attributes (regular expressions on the data, or types with a strong semantic value such as the date type), …
  • XSD documents can easily reuse and interact with one another.

Here are some examples where XSD is more useful and better suited than DTD:

Example n°1: We need to exchange a date via an XML stream between two different systems, such as SAP (mm.dd.yyyy) and an RMI server (dd/mm/yyyy). These systems use incompatible date formats; however, with XSD, we can use the xs:date type, whose format is yyyy-mm-dd.
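
A possible way to handle the first example in Java is sketched below: each system converts its own format to the xs:date lexical form before writing the XML, and the receiver validates the element against a schema. The born element, the one-element schema and the method names are assumptions made for the illustration; note that in Java date patterns the month is written MM.

```java
import java.io.IOException;
import java.io.StringReader;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class XsdDateSketch {

    // Hypothetical one-element schema: <born> must contain an xs:date.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"born\" type=\"xs:date\"/>"
        + "</xs:schema>";

    /** Converts a date written with the given Java pattern to the xs:date form yyyy-MM-dd. */
    static String toXsdDate(String value, String pattern) {
        return LocalDate.parse(value, DateTimeFormatter.ofPattern(pattern)).toString();
    }

    /** Returns true if the value is accepted as the content of an xs:date element. */
    static boolean isXsdDate(String value) {
        try {
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator()
                .validate(new StreamSource(new StringReader("<born>" + value + "</born>")));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fromFirstSystem = toXsdDate("05.19.1925", "MM.dd.yyyy");
        String fromSecondSystem = toXsdDate("19/05/1925", "dd/MM/yyyy");
        System.out.println(fromFirstSystem);              // 1925-05-19
        System.out.println(fromSecondSystem);             // 1925-05-19
        System.out.println(isXsdDate(fromFirstSystem));   // true
        System.out.println(isXsdDate("19/05/1925"));      // false
    }
}
```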

Example n°2: We need to validate an email address; however, there is no standard type for the format of an email address. No panic! With XSD, we can define a new type "EmailAddress". We could define our own format, but collections of universally useful data types already exist for the W3C XML Schema language, such as the XML Schema Standard Type Library (XSSTL) at http://www.codesynthesis.com/projects/xsstl/.
A type that validates an email address with a pattern restriction (emailType) is defined in the people.xsd schema of section 5.



5. Examples
Here, we will validate XML Documents with DTD and XSD via JAXP APIs (SAX, DOM).

Many Java XML APIs provide mechanisms to validate XML documents. JAXP can be used with most of them, but configuration differences exist. This article shows some ways to configure the Java SAX and DOM APIs through JAXP in order to check and validate XML against a DTD or an XSD.

Error Handler
To report errors, it is necessary to provide an ErrorHandler to the underlying implementation.

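import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/**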
 * To report errors, it is necessary to provide an ErrorHandler to the underlying implementation. 
 * 
 * @author huseyin
 *
 */
public class MyErrorHandler implements ErrorHandler {
    public void warning(SAXParseException e) throws SAXException {
       System.out.print("Warning at line " + e.getLineNumber() + ": ");
       System.out.println(e.getMessage());
    }

    public void error(SAXParseException e) throws SAXException {
        System.out.print("Error at line " + e.getLineNumber() + ": ");
        System.out.println(e.getMessage());
    }

    public void fatalError(SAXParseException e) throws SAXException {
       System.out.print("Fatal error at line " + e.getLineNumber() + ": ");
       System.out.println(e.getMessage());
    }
}
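
As a variant, the handler can collect the messages instead of printing them, which is convenient when the caller needs the complete list of problems (in a unit test, for example). The sketch below is an assumption about how this could be done, not the handler used in the rest of this article; the class name and the check helper are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

final class CollectingErrorHandler implements ErrorHandler {

    private final List<String> messages = new ArrayList<>();

    @Override
    public void warning(SAXParseException e) {
        record("Warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        record("Error", e); // validation error: keep parsing to find the next ones
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXParseException {
        record("Fatal error", e);
        throw e; // the document is not well-formed: stop parsing
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
    }

    /** Parses the XML text with DTD validation enabled and returns every reported message. */
    static List<String> check(String xml) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // already recorded by fatalError
        } catch (Exception e) {
            handler.messages.add("Failure: " + e);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE people [<!ELEMENT people (person+)>"
            + "<!ELEMENT person (#PCDATA)><!ATTLIST person ID CDATA #REQUIRED>]>";
        // Both person elements lack the required ID attribute.
        check(dtd + "<people><person>a</person><person>b</person></people>")
            .forEach(System.out::println);
    }
}
```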

Input Document
Here is the input document people.xml used in the examples below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE people>
<people>
  <person ID="01245cdf45x">
    <title>MR</title>
    <name>Malcolm X</name>
    <name>Malik Shabazz</name>
    <name>Malcolm Little</name>
    <born>19 May 1925</born>
    <died>21 February 1965</died>
    <nationality>american</nationality>
    <email>malcolm.x@java.lu</email>
  </person>
  <person ID="012qsabc3456002">
    <title>MR</title>
    <name>Mahatma Gandhi</name>
    <born>2 October 1869</born>
    <died>30 January 1948</died>
    <nationality>Indian</nationality>
    <email>gandhi@javablog.fr</email>
  </person>
  <person ID="0457d7887897">
    <title>MR</title>
    <name>John F. Kennedy</name>
    <name>JFK</name>
    <name>Jack Kennedy</name>
    <born>29 May 1917</born>
    <died>22 November 1963</died>
    <nationality>american</nationality>
    <email>jfk@javablog.fr</email>
  </person>
</people>

XML Schema
Here is the XML Schema people.xsd used in the examples below to validate the input document (including the validation of email addresses):

<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" >
	<xs:element name="people">
		<xs:complexType>
			<xs:sequence>
				<xs:element ref="person" minOccurs="0" maxOccurs="unbounded"/>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
	<xs:element name="person">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="title" type="titleType" minOccurs="1" maxOccurs="1" default="MR" />
				<xs:element name="name" type="xs:string" minOccurs="1" maxOccurs="unbounded" />
				<xs:element name="born" type="xs:string"  minOccurs="1" maxOccurs="1" />
				<xs:element name="died" type="xs:string" minOccurs="0" maxOccurs="1" />
				<xs:element name="nationality" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
				<xs:element name="email" type="emailType" minOccurs="0" maxOccurs="unbounded" />
			</xs:sequence>
			<xs:attribute name="ID" type="xs:string" use="required" />
		</xs:complexType>
	</xs:element>
	<xs:simpleType name="titleType">
		<xs:restriction base="xs:string">
			<xs:enumeration value="MR" />
			<xs:enumeration value="MS" />
			<xs:enumeration value="MRS" />
		</xs:restriction>
	</xs:simpleType>
	<xs:simpleType name="emailType">
	    <xs:restriction base="xs:string">
	      <xs:pattern value="([a-zA-Z0-9_\-])([a-zA-Z0-9_\-\.]*)@(\[((25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}|((([a-zA-Z0-9\-]+)\.)+))([a-zA-Z]{2,}|(25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\])"/>
	    </xs:restriction>
  	</xs:simpleType>
</xs:schema>

DTD
Here is the DTD people.dtd used in the examples below to validate the input document:

  <!ELEMENT people (person+)>
  <!ELEMENT person (title, name+, born, died?, nationality*, email*)>
    <!ATTLIST person ID CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT born (#PCDATA)>
  <!ELEMENT died (#PCDATA)>
  <!ELEMENT nationality (#PCDATA)>
  <!ELEMENT email (#PCDATA)>

Checking Well-Formedness
…with standard DOM parser of Sun implementation:

String xmlFileName = "src/docs/people.xml";

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
	
// Disable the document validation as the document is being parsed.
factory.setValidating(false);
factory.setNamespaceAware(true);

DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler());

// Generates a Document object tree
Document xmlDocument = builder.parse(new InputSource(xmlFileName));

…with standard SAX parser of Sun implementation:

String xmlFileName = "src/docs/people.xml";

SAXParserFactory factory = SAXParserFactory.newInstance();
			
// Disable the document validation as the document is being parsed.
factory.setValidating(false);
factory.setNamespaceAware(true);

SAXParser parser = factory.newSAXParser();

XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new MyErrorHandler());

// Parse the XML stream
reader.parse(new InputSource(xmlFileName));

Validating XML with internal or external DTD
With internal DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!-- people-internalDTD.xml -->
<!DOCTYPE people [
  <!ELEMENT people (person+)>
  <!ELEMENT person (title, name+, born, died?, nationality*, email*)>
    <!ATTLIST person ID CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT born (#PCDATA)>
  <!ELEMENT died (#PCDATA)>
  <!ELEMENT nationality (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
<people>
//....
</people>

With external DTD:

<?xml version="1.0" encoding="UTF-8"?>
<!-- people-externalDTD.xml -->
<!DOCTYPE people SYSTEM "people.dtd">
<people>
//....
</people>

…with standard DOM parser of Sun implementation:

String xmlFileName = "src/docs/dtd/people-internalDTD.xml";
// OR
// String xmlFileName = "src/docs/dtd/people-externalDTD.xml";

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Enable the document validation as the document is being parsed.
factory.setValidating(true);
factory.setNamespaceAware(true);

DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler());
	   
// Generates a Document object tree
Document xmlDocument = builder.parse(new InputSource(xmlFileName));

…with standard SAX parser of Sun implementation:

String xmlFileName = "src/docs/dtd/people-internalDTD.xml";
// OR
// String xmlFileName = "src/docs/dtd/people-externalDTD.xml";

SAXParserFactory factory = SAXParserFactory.newInstance();
// Enable the document validation as the document is being parsed.
factory.setValidating(true);
factory.setNamespaceAware(true);

SAXParser parser = factory.newSAXParser();

XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new MyErrorHandler());
reader.parse(new InputSource(xmlFileName));

Validating XML with internal XSD

<?xml version="1.0" encoding="UTF-8"?>
<!-- people-XSD.xml -->
<!DOCTYPE people>
<people xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="people.xsd">
//....
</people>

…with standard DOM parser of Sun implementation:

String xmlFileNameWithXSD = "src/docs/xsd/people-XSD.xml";
String xsdFileName = "src/docs/xsd/people.xsd";

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();

// Enable the document validation as the document is being parsed.
factory.setValidating(true);
factory.setNamespaceAware(true);
factory.setAttribute("http://java.sun.com/xml/jaxp/properties/schemaLanguage", "http://www.w3.org/2001/XMLSchema");

DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler());

// Generates a Document object tree
Document document = builder.parse(new InputSource(xmlFileNameWithXSD));

…with standard SAX parser of Sun implementation:

String xmlFileNameWithXSD = "src/docs/xsd/people-XSD.xml";
String xsdFileName = "src/docs/xsd/people.xsd";

SAXParserFactory factory = SAXParserFactory.newInstance();
// Enable the document validation as the document is being parsed.
factory.setValidating(true);
factory.setNamespaceAware(true);

SAXParser parser = factory.newSAXParser();
parser.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage", "http://www.w3.org/2001/XMLSchema");
			
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new MyErrorHandler());
reader.parse(new InputSource(xmlFileNameWithXSD));

Validating XML with external XSD

<?xml version="1.0" encoding="UTF-8"?>
<!-- people.xml -->
<!DOCTYPE people>
<people>
//...
</people>

…with standard DOM parser of Sun implementation:

String xmlFileName = "src/docs/people.xml";
String xsdFileName = "src/docs/xsd/people.xsd";

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Keep DTD validation disabled: the document is validated against the Schema set on the factory below.
factory.setValidating(false);
factory.setNamespaceAware(true);
        	
SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(new StreamSource(new BufferedReader(new FileReader(xsdFileName)))));

DocumentBuilder builder = factory.newDocumentBuilder();
builder.setErrorHandler(new MyErrorHandler());

// Generates a Document object tree
Document document = builder.parse(new InputSource(xmlFileName));

…with standard SAX parser of Sun implementation:

String xmlFileName = "src/docs/people.xml";
String xsdFileName = "src/docs/xsd/people.xsd";

SAXParserFactory factory = SAXParserFactory.newInstance();
// Keep DTD validation disabled: the document is validated against the Schema set on the factory below.
factory.setValidating(false);
factory.setNamespaceAware(true);

SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(new StreamSource(new BufferedReader(new FileReader(xsdFileName)))));
			
SAXParser parser = factory.newSAXParser();			
XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(new MyErrorHandler());
reader.parse(new InputSource(xmlFileName));

…with the javax.xml.validation API, where the XSD schema is compiled and a Validator applies it to the XML stream:

String xmlFileName = "src/docs/people.xml";
String xsdFileName = "src/docs/xsd/people.xsd";

// Lookup a factory for the W3C XML Schema language
SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");

// Compile the schema.
Schema schema = factory.newSchema(new StreamSource(new BufferedReader(new FileReader(xsdFileName))));
	        
// Get a validator from the schema.
Validator validator = schema.newValidator();
	        
// Load the content of the document
String xml = getContentFromFilename(xmlFileName);
Source source = new StreamSource(new StringReader(xml));
	        
// Check the document
validator.validate(source);

//...

	/**
	 * Get the content of a file.
	 * @param filename path of the file to read
	 * @return the file content with "\n" line endings, or null if the file cannot be read
	 */
	public static String getContentFromFilename(String filename){
		StringBuilder sb = new StringBuilder();
		// try-with-resources closes the reader even when reading fails
		try (BufferedReader br = new BufferedReader(new FileReader(filename))) {
			String line;
			while ((line = br.readLine()) != null) {
				sb.append(line).append("\n");
			}
		} catch (IOException ex) {
			ex.printStackTrace();
			return null;
		}
		return sb.toString();
	}
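
The Validator is not limited to a StreamSource: a document that has already been parsed can be checked with a DOMSource, and an ErrorHandler set on the validator lets it report all the errors instead of stopping at the first one. The sketch below assumes a reduced, made-up version of the person schema; the class name and the XSD constant are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class DomValidationSketch {

    // Reduced, illustrative schema: a person with an enumerated title and a birth date text.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"person\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"title\" type=\"titleType\"/>"
        + "<xs:element name=\"born\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:simpleType name=\"titleType\"><xs:restriction base=\"xs:string\">"
        + "<xs:enumeration value=\"MR\"/><xs:enumeration value=\"MS\"/><xs:enumeration value=\"MRS\"/>"
        + "</xs:restriction></xs:simpleType>"
        + "</xs:schema>";

    /** Parses the XML text into a DOM tree, validates the tree and returns all error messages. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // schema validation of a DOM tree needs namespace-aware nodes
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

            Validator validator = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
            validator.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage()); // record and continue with the next error
                }
            });
            validator.validate(new DOMSource(doc));
        } catch (Exception e) {
            errors.add(String.valueOf(e.getMessage()));
        }
        return errors;
    }

    public static void main(String[] args) {
        // "DR" is not an allowed title and <born> is missing: several errors are reported.
        validate("<person><title>DR</title></person>").forEach(System.out::println);
    }
}
```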

Source: test_xml_validating.zip

That’s all!!!

Huseyin OZVEREN