My hacking on XML binding to Java continued tonight. This time I explored JAXB and its associated compiler, XJC, in more detail.
If you remember, the problem is to parse a large XML schema (~40K) and convert it into appropriate Java classes (~235 classes !! one for each distinct element in the schema), together with a system to marshal / unmarshal / validate parts of XML instances to Java objects and back, such that parsing is as simple and fast as possible.
As I figured out, technologies like Castor Source Generator and JAXB exist for exactly this task.
So far so good, but any further down the water got murkier. For example, an immediate issue is that converting XML elements (and complex types, attributes etc., which should together be classified as XML Schema elements) directly to Java classes does not always work automatically. There are many reasons why human intervention might be required, and most of the problems involve how compilers choose names for the Java classes that map the XML. A simple and popular strategy would be to just use the same name as the XML Schema element. However, this leads to Java class name clashes across namespaces (especially if you want to put all the generated classes in one package). So the next strategy would be to prefix the namespace to the element name. In its simplest form this is unwieldy, as most namespace names are URLs and hence very long. It can be argued that it is somehow possible to keep the Java class name reasonably simple by using only a part of the namespace plus the element name (just using namespace prefixes does not work, as the same namespace can be referred to by different prefixes at different places in the schema). So just resolving the namespace issues while keeping the Java class names simple is a tricky enough problem to solve.
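To make the two naming strategies concrete, here is a minimal sketch in Java. It is not how XJC or Castor actually derive names; the class `NamespaceNaming`, its methods, the rule "use the last segment of the namespace URI" and the example.com namespaces are all assumptions made up for illustration.

```java
final class NamespaceNaming {

    // Upper-case the first character, as class names conventionally start with a capital.
    static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    // Hypothetical rule: the last alphanumeric segment of the namespace URI
    // stands in for the whole (long) namespace name.
    static String namespaceTag(String namespaceUri) {
        String[] parts = namespaceUri.split("[^A-Za-z0-9]+");
        return parts.length == 0 ? "" : parts[parts.length - 1];
    }

    // Strategy 1: reuse the element name as is. Clashes across namespaces.
    static String naiveClassName(String localName) {
        return capitalize(localName);
    }

    // Strategy 2: part of the namespace plus the element name.
    static String qualifiedClassName(String namespaceUri, String localName) {
        return capitalize(namespaceTag(namespaceUri)) + capitalize(localName);
    }

    public static void main(String[] args) {
        String orders = "http://example.com/schemas/orders";   // made-up namespace
        String billing = "http://example.com/schemas/billing"; // made-up namespace
        System.out.println(naiveClassName("address"));              // same name for both namespaces
        System.out.println(qualifiedClassName(orders, "address"));
        System.out.println(qualifiedClassName(billing, "address"));
    }
}
```

Note that even this rule is fragile: two namespaces that end in the same segment (say .../v1/orders and .../v2/orders) would still collide, which is part of why the problem is tricky.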
But there is more. Another kind of name clash can occur between an element name and a complexType name. This is a fairly regular occurrence, as many schema writers (for good reasons) first define a complexType, especially if it is really complex, and then define an element of that type instead of defining the type implicitly. The same can happen between an element and an attribute, and so on. A totally different kind of clash may occur if schema element names contain Java reserved words, which then need mangling (and the mangled name can itself clash with something else).
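The reserved-word case is easy to sketch. The following is only an illustration: the underscore-prefix rule, the class `ReservedWordMangling` and the sample names are assumptions, not what any particular compiler does, and the reserved-word list is deliberately a small subset.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReservedWordMangling {

    // A small subset of Java reserved words; the real list is longer.
    static final Set<String> RESERVED =
            Set.of("class", "default", "import", "new", "package", "switch");

    // Hypothetical mangling rule: prefix reserved words with an underscore.
    static String mangle(String schemaName) {
        return RESERVED.contains(schemaName) ? "_" + schemaName : schemaName;
    }

    // Returns the schema names whose mangled identifier was already taken
    // by an earlier name in the list.
    static List<String> clashes(List<String> schemaNames) {
        Set<String> used = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String name : schemaNames) {
            if (!used.add(mangle(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "class" is mangled to "_class", which collides with a schema name
        // that was perfectly legal on its own.
        System.out.println(clashes(List.of("class", "_class", "order")));
    }
}
```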
There are other examples, but the point is that these problems occur because in XML Schema each kind of element resides in its own symbol space and namespace. A symbol space is never larger than one namespace, and within that namespace it is restricted to a particular "kind" of element (say XML element, attribute or complex type) and sometimes also to a particular local scope within the larger schema. The job of an XML to Java compiler is to flatten out this whole hierarchy of "spaces" and create one flat package with flat names. Clearly, there are issues with how such flattening can be done, and AFAIK no compiler does a good job of it. The best I have seen (and it is still not good enough) is XJC, which tries to solve the problem by moving it outside the domain of the compiler and into a separate "binding file", which more or less defines the mapping from schema elements to Java classes. A good example of a binding file that solves some of these name collision issues is shipped with JAXB as a sample (unfortunately no direct link !!).
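The flattening problem can be modelled in a few lines. This is a rough sketch under my own assumptions: the `Component` class, the `Kind` enum and the naive "capitalised local name" rule are hypothetical and only serve to show where collisions come from once the symbol spaces are collapsed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SymbolSpaceFlattening {

    enum Kind { ELEMENT, COMPLEX_TYPE, ATTRIBUTE }

    // In the schema a name is only unique per (namespace, kind) pair.
    static final class Component {
        final String namespace;
        final Kind kind;
        final String localName;

        Component(String namespace, Kind kind, String localName) {
            this.namespace = namespace;
            this.kind = kind;
            this.localName = localName;
        }
    }

    // Naive flattening: drop namespace and kind, keep the capitalised local name.
    static String flatName(Component c) {
        String n = c.localName;
        return n.isEmpty() ? n : Character.toUpperCase(n.charAt(0)) + n.substring(1);
    }

    // Groups components by flat Java name and keeps only the names
    // claimed by more than one component.
    static Map<String, List<Component>> collisions(List<Component> components) {
        Map<String, List<Component>> byName = new LinkedHashMap<>();
        for (Component c : components) {
            byName.computeIfAbsent(flatName(c), k -> new ArrayList<>()).add(c);
        }
        byName.values().removeIf(group -> group.size() < 2);
        return byName;
    }

    public static void main(String[] args) {
        String ns = "http://example.com/schemas/orders"; // made-up namespace
        List<Component> schema = List.of(
                new Component(ns, Kind.COMPLEX_TYPE, "address"),
                new Component(ns, Kind.ELEMENT, "address"), // legal: different symbol space
                new Component(ns, Kind.ATTRIBUTE, "currency"));
        System.out.println(collisions(schema).keySet());
    }
}
```

A binding file is, in effect, a hand-written override table for the names such a check would flag.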
In general, .NET has done a much better job than Java of handling XML serialization, so I am curious to know how they tackle this issue with their xsd.exe. (Maybe I will figure that out when I have more time to kill !!)