JARV interface

Written by Kohsuke KAWAGUCHI

Table of Contents

  1. Introduction
  2. Architecture
  3. Using JARV
    1. Step 1: create VerifierFactory
    2. Step 2: compile a schema
    3. Step 3: create a verifier
    4. Step 4-1: perform validation
    5. Step 4-2: perform validation via SAX
  4. Advanced Topics
    1. Finding implementation at Run-time
    2. Fail-Fast Design
    3. Creating Verifier directly from VerifierFactory
    4. Thread Affinity
    5. Schema Language Auto Detection
  5. Examples
    1. Validating bunch of files
    2. Multi-threaded example
    3. DOM validation
    4. SAX validation

Introduction

MSV implements the JARV interface, which lets you use MSV in whatever way you like. JARV is an implementation-independent set of interfaces developed by the RELAX community, and several implementations that support it are available, including MSV.

Although it came from the RELAX community, JARV is not limited to RELAX; it can be used with any schema language that MSV supports. For information about JARV, see http://iso-relax.sourceforge.net/.

Architecture

JARV consists of three components: VerifierFactory, Schema, and Verifier.

The VerifierFactory interface is the main interface between the implementation and your application. It has a method to compile a schema into a Schema object. The Schema interface is the internal representation of the schema. This interface is thread-safe, so you can have multiple threads access one Schema object concurrently. Also, this interface has a method to create a new Verifier object. The Verifier interface represents a so-called "validator"; it has a schema object in it and it validates documents by using that schema.

Using JARV

Step 1: create VerifierFactory

The first step is to create an instance of VerifierFactory. To do that, simply create an instance of TheFactoryImpl:

VerifierFactory factory = new com.sun.msv.verifier.jarv.TheFactoryImpl();

JARV is also capable of finding an implementation that supports a particular schema language at run-time. To learn more about this discovery mechanism, see "Finding implementation at Run-time" under Advanced Topics.

Step 2: compile a schema

Once you have a factory, you can use it to compile a schema by calling its compileSchema method:

Schema schema = factory.compileSchema("http://www.example.org/test.xsd");

This method can accept many types of input. For example, you can pass InputSource, File, InputStream, etc.

Schema objects are thread-safe. So even if you have more than one thread, you only need one instance of Schema; you can share that one instance with as many threads as you want.

Step 3: create a verifier

A Schema is just a compiled schema, so it cannot do anything by itself. A Verifier is the object that performs the actual validation. To create a Verifier object, do as follows:

Verifier verifier = schema.newVerifier();

In this way, you can create a Verifier that checks documents against a particular schema.

A Verifier is not thread-safe, so you typically create one instance per validation (or per thread).

Step 4-1: perform validation

Verifier has several methods to validate documents. One way is to call the verify method, which accepts a DOM tree, File, URL, etc., and returns whether the document is valid. For example, to validate a DOM document, simply pass it as an argument:

if (verifier.verify(domDocument)) {
  // the document is valid
} else {
  // the document is invalid
}

This method only gives you a yes/no answer, but you can get more detailed error information by setting an error handler through the setErrorHandler method.

Just as a parser reports well-formedness errors through org.xml.sax.ErrorHandler, JARV implementations (like MSV) report validity errors through the same interface. In this way, you can get the error message, the line number that caused the error, and so on. For example, in the following code, a custom error handler is set to report error messages to the client.

verifier.setErrorHandler( new MyErrorHandler() );
try {
  if (verifier.verify(new File("abc.xml"))) {
    // the document is valid
  } else {
    // the execution will never reach here because
    // if the document is invalid, then an exception should be thrown.
  }
} catch( SAXParseException e ) {
  // if the document is invalid, then the execution will reach here
  // because the error handler throws an exception for an error.
}
...

class MyErrorHandler implements ErrorHandler {
  public void fatalError( SAXParseException e ) throws SAXException {
    error(e);
  }
  public void error( SAXParseException e ) throws SAXException {
    System.out.println(e);
    throw e;
  }
  public void warning( SAXParseException e ) {
    // ignore warnings
  }
}

If you throw an exception from the error handler, that exception will not be caught by the verify method; it propagates to the caller, so the validation is effectively aborted there. If you return from the error handler normally, MSV will try to recover from the error and find as many errors as possible.
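The recover-and-continue behavior can be used to gather every reported error in a single pass. The following is a minimal sketch that uses only the standard org.xml.sax types; the class name CollectingErrorHandler is hypothetical and is not part of JARV or MSV.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

// Hypothetical helper, not part of JARV or MSV: it records errors instead of
// throwing, so the validator is free to recover and keep reporting.
class CollectingErrorHandler implements ErrorHandler {
  private final List<String> messages = new ArrayList<>();

  @Override
  public void warning(SAXParseException e) {
    // ignore warnings, as MyErrorHandler does
  }

  @Override
  public void error(SAXParseException e) {
    // returning normally lets the validation continue
    messages.add("line " + e.getLineNumber() + ": " + e.getMessage());
  }

  @Override
  public void fatalError(SAXParseException e) {
    error(e);
  }

  // All error messages recorded so far, in the order they were reported.
  public List<String> getMessages() {
    return messages;
  }
}
```

An instance would be passed to verifier.setErrorHandler in place of MyErrorHandler. Since nothing is thrown, the verify method is then expected to return false for an invalid document, and getMessages() holds the details.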

Step 4-2: perform validation via SAX

Every JARV implementation supports validation via SAX2 in two ways.

The first one is a validator implemented as a ContentHandler, which can be obtained by calling the getVerifierHandler method. This content handler validates incoming SAX2 events, and you can obtain the validity through the isValid method. For example,

XMLReader reader = ... ; // get XML reader from somewhere
VerifierHandler handler = verifier.getVerifierHandler();
reader.setContentHandler(handler);
reader.parse("http://www.mydomain.com/some/file.xml");

if (handler.isValid()) {
  // the document is correct
} else {
  // the document is incorrect
}

The second one is a validator implemented as XMLFilter, which can be obtained by calling the getVerifierFilter method.

A verifier implemented as a filter, VerifierFilter, is particularly useful because you can plug it right in the middle of any SAX event pipeline.

Not only can you validate documents before you process them, you can also validate them after your application has processed them.

In the following example, a verifier filter is used to validate a document before your own handler processes it.

VerifierFilter filter = verifier.getVerifierFilter();
// create a new XML reader and set up the pipeline
filter.setParent(getNewXMLReader());
filter.setContentHandler( new MyApplicationHandler() );

// parse the document
filter.parse("http://www.mydomain.com/some/file.xml");
if (filter.isValid()) {
  // the parsed document was valid
} else {
  // invalid
}

SAX-based validation does not make much sense unless you set an error handler, because learning that the document was invalid after you have already processed it is too late.

To set an error handler, call the setErrorHandler method just as you did with the verify method.

VerifierFilter filter = verifier.getVerifierFilter();
verifier.setErrorHandler(new MyErrorHandler());
...
filter.parse(...);

In this way, you can abort the processing by throwing an exception in case of an error. If you are using VerifierFilter you can also set an error handler by calling the setErrorHandler method of the VerifierFilter interface.

MSV always runs in a fail-fast manner. So as long as you set an error handler that throws, it is guaranteed that your application will never see an incorrect document.

Advanced Topics

Finding implementation at Run-time

A simple, obvious way to create a VerifierFactory is to create a new instance of com.sun.msv.verifier.jarv.TheFactoryImpl.

The advantage of this approach is its "multi-schema" capability: the factory accepts a schema written in any of the supported languages, so you can switch schema languages without changing your code at all.

However, this approach has one drawback: it locks you into MSV, so you would need to change your code to use another JARV implementation.

For this reason, you may want to "discover" an implementation at run-time by calling the static newInstance method of the VerifierFactory class. To do that, you pass the name of the schema language you want to use. This method searches the class path for an implementation that supports the given schema language and returns its VerifierFactory.

VerifierFactory factory = VerifierFactory.newInstance(
  "http://relaxng.org/ns/structure/0.9");

Usually, the namespace URI of the schema language is used as the name. For the complete list, please consult the javadoc.

Fail-Fast Design

One of the problems with some validators (like the DTD validator in Xerces) is that they do not work in a fail-fast manner. This problem is unique to SAX.

What is "fail-fast"? A fail-fast validator is a validator that flags an error as soon as the error is found. A non-fail-fast validator may let some part of an invalid document slip through (it will flag the error at a later moment).

When you are using a non-fail-fast validator, you need to take extra care when writing your code, because it may be exposed to bad documents.

For example, consider the following simple DTD and a bad document:

<!ELEMENT root (a,b)*>
<!ELEMENT a    EMPTY>
<!ELEMENT b    EMPTY>

<root>
  <b/>  <!-- error -->
  <b/>
</root>

Surprisingly, in a typical non-fail-fast validator, the error will be signaled as late as the end-element event of the root element. So you have to make sure that your application behaves gracefully when it sees the wrong 'b'.

Typically, this defeats the purpose of validation, because you validate in order to protect your application code from unexpected inputs.

MSV is a fail-fast validator; so it will signal an error at the start-element event of the first 'b'. This guarantees that the application will never see a wrong document.

Note that some other JARV implementations may be non fail-fast validators.
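The effect of fail-fast behavior on the application can be illustrated without any validator library. The sketch below is a hand-written stand-in, not MSV code: the hypothetical FailFastPairFilter hard-codes the content model (a,b)* for the root element as a SAX filter and rejects a wrong element before the downstream handler sees it. The class names and the helper method are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-in for a fail-fast validator. It only knows the
// content model (a,b)* of <root>; it is not how MSV is implemented.
class FailFastPairFilter extends XMLFilterImpl {
  private boolean expectA = true; // which child of <root> is allowed next

  @Override
  public void startElement(String uri, String local, String qName, Attributes atts)
      throws SAXException {
    if (!qName.equals("root")) {
      String expected = expectA ? "a" : "b";
      if (!qName.equals(expected)) {
        // reject the element before the downstream handler sees it
        throw new SAXException("expected <" + expected + "> but found <" + qName + ">");
      }
      expectA = !expectA;
    }
    super.startElement(uri, local, qName, atts);
  }

  @Override
  public void endElement(String uri, String local, String qName) throws SAXException {
    if (qName.equals("root") && !expectA) {
      throw new SAXException("<root> ended while <b> was still expected");
    }
    super.endElement(uri, local, qName);
  }
}

class FailFastDemo {
  // Parses xml through the filter and returns the names of the elements that
  // reached the application handler before parsing finished or was aborted.
  static List<String> elementsSeenByApplication(String xml) {
    List<String> seen = new ArrayList<>();
    try {
      FailFastPairFilter filter = new FailFastPairFilter();
      filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
      filter.setContentHandler(new DefaultHandler() {
        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
          seen.add(qName);
        }
      });
      filter.parse(new InputSource(new StringReader(xml)));
    } catch (SAXException e) {
      // a validity (or well-formedness) error aborted the parse at this point
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return seen;
  }
}
```

For the bad document shown earlier, the application handler in this sketch receives only the root element; the first 'b' is rejected at its start-element event and never passed on.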

Creating Verifier directly from VerifierFactory

The VerifierFactory class has the newVerifier method as a short-cut. It is a short-cut in the sense that the following two code fragments have exactly the same meaning:

Verifier v = factory.compileSchema(x).newVerifier();

Verifier v = factory.newVerifier(x);

This is sometimes useful when you are using only one thread.

Thread Affinity

The VerifierFactory interface is not thread-safe. This basically means that you cannot use one object from two threads.

The Schema interface is thread-safe. So once you compile a schema file into a Schema object, it can be shared by multiple threads and accessed concurrently. This is useful at server-side, where multiple threads process client requests simultaneously.

The Verifier interface is again not thread-safe. Each thread needs its own copy of Verifier. Verifier objects are still re-usable, as you can use the same object to validate multiple documents one by one. What you cannot do is to validate multiple documents simultaneously.

The thread affinity of JARV is modeled on that of the TrAX API (the javax.xml.transform package). Familiarity with TrAX will help you understand JARV better.
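For readers who know TrAX, the parallel can be sketched with the JDK alone. In the example below, TransformerFactory plays the role of VerifierFactory (used by one thread), Templates plays the role of Schema (compiled once and shared), and Transformer plays the role of Verifier (one per task). This is TrAX code, not JARV code; the class name TraxAffinityDemo, the tiny stylesheet, and the pool size are assumptions chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TraxAffinityDemo {
  // A tiny stylesheet that outputs the name of the document element as text.
  static final String XSLT =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    + "<xsl:output method='text'/>"
    + "<xsl:template match='/'><xsl:value-of select='name(/*)'/></xsl:template>"
    + "</xsl:stylesheet>";

  static List<String> transformInParallel(List<String> documents) {
    try {
      // factory: used from a single thread (compare VerifierFactory)
      TransformerFactory factory = TransformerFactory.newInstance();
      // compiled form: thread-safe and shared by all workers (compare Schema)
      Templates templates = factory.newTemplates(new StreamSource(new StringReader(XSLT)));

      ExecutorService pool = Executors.newFixedThreadPool(4);
      try {
        List<Future<String>> futures = new ArrayList<>();
        for (String doc : documents) {
          futures.add(pool.submit(() -> {
            // worker object: created per task and never shared (compare Verifier)
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(doc)), new StreamResult(out));
            return out.toString();
          }));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
          results.add(f.get());
        }
        return results;
      } finally {
        pool.shutdown();
      }
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

With JARV the same shape would apply: compile the Schema once, hand it to the worker threads, and let each worker call newVerifier for itself.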

Schema Language Auto Detection

com.sun.msv.verifier.jarv.TheFactoryImpl automatically detects the schema language from the schema file. However, there is one important limitation. Currently, the detection of XML DTDs is based on the file extension. Specifically, if the schema name has a ".dtd" extension, it is treated as an XML DTD; otherwise it is treated as one of the other schema languages.

This causes a problem when you pass an InputStream to the compileSchema method. Since InputStreams do not have names, they are always treated as non-DTD schemas.

To avoid this problem, wrap the stream in an InputSource and call the setSystemId method to set the system id. The following example shows how to do that:

InputSource is = new InputSource(
  MyClass.class.getResourceAsStream("abc.dtd") );
is.setSystemId("abc.dtd");

verifierFactory.compileSchema(is);

This ugly limitation stems from the difficulty of reliably distinguishing XML DTDs, which are written in a non-XML syntax, from other schema languages, which are written in XML syntax.

Any input on this restriction is very welcome.

Examples

If you need an example that is not listed here, please let me know so that I can add it in the next release.

Validating bunch of files

The distribution should contain this example at examples/jarv/jarvDemo.java. It compiles a schema and obtains a verifier object, then uses the same verifier to validate multiple documents.

Multi-threaded example

The distribution should contain this example at examples/jarv/GrammarCacheDemo.java. This example first compiles a schema, then launches many threads and lets them share one schema object.

This example shows you how to use JARV in a multi-threaded environment and how you can cache a compiled schema in memory.

DOM validation

The following code shows how you can validate a DOM tree by using JARV.

import java.io.File;
import org.iso_relax.verifier.*;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

void f( Document dom, Document anotherDom ) throws Exception
{
  // create a VerifierFactory
  VerifierFactory factory = new com.sun.msv.verifier.jarv.TheFactoryImpl();

  // compile a RELAX schema (or whatever schema you like)
  Schema schema = factory.compileSchema( new File("foo.rxg") );

  // obtain a verifier
  Verifier verifier = schema.newVerifier();

  // check the validity of a DOM.
  if( verifier.verify(dom) ) {
    // the document is valid
  } else {
    // the document is not valid
  }

  // you can use the same verifier object to test multiple DOMs
  // as long as you don't use it concurrently.
  if( verifier.verify(anotherDom) ) {
    ...
  }

  // or you can pass an Element to validate that subtree.
  // (this cast assumes that the first child of the document element
  // is an element, not a text node or a comment.)
  Element e = (Element)dom.getDocumentElement().getFirstChild();
  if( verifier.verify(e) ) {
    ...
  }
}

Passing an Element is supported by MSV, but please be warned that other JARV implementations may not support this capability.
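When picking an Element to validate, keep in mind that the children of a DOM element can include whitespace text nodes and comments, so a child node is not always an Element. The following is a small sketch, using only the standard DOM and JAXP APIs, of one way to find the first element child; the class name DomSubtrees and both method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of JARV or MSV.
class DomSubtrees {
  // Returns the first child of parent that is an Element, or null if there is none.
  static Element firstChildElement(Element parent) {
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n instanceof Element) {
        return (Element) n;
      }
    }
    return null;
  }

  // Parses a string into a namespace-aware DOM tree.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

The element returned by firstChildElement(dom.getDocumentElement()) could then be passed to verifier.verify, after a null check, in an implementation that supports subtree validation.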

SAX validation

The following code shows how you can use JARV together with SAX.

import java.io.File;
import org.iso_relax.verifier.*;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

// parserFactory is assumed to be configured as namespace-aware
void f( javax.xml.parsers.SAXParserFactory parserFactory ) throws Exception
{
  // create a VerifierFactory with the default SAX parser
  VerifierFactory factory = new com.sun.msv.verifier.jarv.TheFactoryImpl();

  // compile a RELAX schema (or whatever schema you like)
  Schema schema = factory.compileSchema( new File("foo.rxg") );

  // obtain a verifier
  Verifier verifier = schema.newVerifier();

  // set an error handler
  // this error handler will throw an exception if there is an error
  verifier.setErrorHandler( com.sun.msv.verifier.util.ErrorHandlerImpl.theInstance );

  // get an XMLFilter
  VerifierFilter filter = verifier.getVerifierFilter();

  // set up the pipeline
  XMLReader reader = parserFactory.newSAXParser().getXMLReader();
  filter.setParent( reader );
  filter.setContentHandler( new MyContentHandler() );

  // parse the document
  try {
    filter.parse( "MyInstance.xml" );
    // if the execution reaches here, the document was valid and
    // there was nothing wrong.
  } catch( SAXException e ) {
    // error.

    // maybe the document is not well-formed, or it's not valid,
    // or there was some other reason.
  }
}