XML is becoming a universal standard for the storage and exchange of structured data. One reason for its popularity is the existence of several formalisms for specifying the structure of XML documents. By permitting dynamic consistency checking to ensure that information being exchanged (between modules in an application, or nodes in a distributed system) has the expected structure, these schema languages significantly increase the robustness of complex XML-based information systems.
However, the exploitation of schema languages in current XML technologies falls far short of what is possible. In particular, schemas currently play little part in the static analysis of programs that operate on XML structures: they are not used for checking code for inconsistencies at compile time, or for optimization---in short, they are not used as types in the usual programming-language sense of the term. Taking advantage of this missed opportunity, and thereby improving both the robustness and the efficiency of XML-based information systems, is the overall goal of the Xtatic project.
The key technology for this project is regular expression types. Regular expression types are based on well-known constructions from automata theory (they are a mild generalization of nondeterministic tree automata). Their basic constructors (union, concatenation, repetition, etc.) are similar to those found in existing XML schema formalisms such as DTDs and XML-Schema. The difference from these schema languages is that, in Xtatic, XML trees become built-in values of the language, and static prediction of the shapes of trees that may appear at run time (as values of variables, parameters to methods, results of complex expressions, etc.) becomes part of the ordinary behavior of the typechecker.
Past work on regular expression types at the University of Pennsylvania has culminated in an experimental language design and implementation called XDuce. XDuce is a statically typed language for writing recursive tree tranformers---roughly, a statically typed analog of XSLT. Besides regular expression types, its main innovation is a powerful form of pattern matching based on regular expression types---a statically typed "tree grep" primitive. The implementation demonstrates efficient algorithms for subtyping and typechecking.
XDuce has made a significant impact in the XML world; in particular, it strongly influenced the type system of XQuery, the W3C standard query language for XML, as well as new schema languages such as TREX. Several papers describing the language and its implementation can be found via the Xduce home page.
We are now starting the next phase of the project---a redesign and re-implementation of XDuce along more ambitious lines, dubbed Xtatic. The main focus this phase will be inter-operability with established host languages such as C#.
Our goal is to make Xtatic look to the programmer like a mild extension of C#, adding just two things to the existing language. The first is a new family of primitive types, written XML<T>, where T is a regular expression type in the style of XDuce; the values belonging to such types will be XML trees whose shape is described by T. For example, the type XML<person[name[String],(email[String]|tel[String])*]> contains trees labeled person and containing a name followed by arbitrarily many email addresses and telephone numbers. (Since these descriptions can become large, we offer facilities for naming common ones and for interpreting existing DTDs and XML-Schema specifications as regular expression types.) Second, we provide XDuce-style regular expression pattern matching for destructing XML trees.
This scheme allows smooth inter-operation between Xtatic code and modules written in pure C# or in other languages whose compilers target the CLR. A pure C# module sees the whole family of types XML<T> as a single class XML and manipulates XML values as ordinary objects. (This is achieved by using a homogeneous compilation scheme that represents XML structures uniformly at run time.)
At present, we have a paper design for a core fragment of Xtatic that combines key features from C# and XDuce and shows how they interact. This design demonstrates the feasibility of the basic approach and highlights the technical issues that remain to be addressed. These include:
We plan to structure our effort in three phases: first, producing a interpreter for a simple version of Xtatic by extending the current XDuce implementation with core features of C#; second, re-implementing this prototype as a full-fledged compiler, targeting the CLR and including a high-performance pattern matcher; and third, addressing the more significant extensions mentioned above. We expect to complete the first phase withing a few months and the second in about a year; the timing and outcome of the third phase will depend on availability of resources. A final phase---integration with the production C# compiler and development tools---falls beyond the scope of the present effort but may be pursued in collaboration with external developers.