1. Introduction
The goal of XML is
to provide many of SGML's benefits not available in HTML and to provide
them in a language that is easier to learn and use than complete SGML.
These benefits include user-defined tags, nested elements, and
optional validation of document structure with respect to a Document Type Definition (DTD).
One important application of XML is electronic data interchange
(EDI) between two or more data sources on the Web. Electronic data is
primarily intended for computer, not human, consumption. For example,
search robots could automatically integrate information from related
sources that publish their data in XML format, e.g., stock quotes from
financial sites, sports scores from news sites; businesses could
publish data about their products and services, and potential customers
could compare and process this information automatically; and business
partners could exchange internal operational data between their
information systems on secure channels. New opportunities will arise
for third parties to add value by integrating, transforming, cleaning,
and aggregating XML data. In this paper, we focus on XML's application
to EDI. Specifically, we take a database view, as opposed to a document
view, of XML. We consider an XML document to be a database and a DTD to
be a database schema.
EDI applications require tools that support the following tasks:
- extraction of data from large XML documents,
- conversion of data between relational or object-oriented databases and XML data,
- transformation of data from one DTD to a different DTD, and/or
- integration of multiple XML data sources.
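Two of these tasks, conversion and extraction, can be sketched in a few lines of Python with the standard xml.etree.ElementTree module. This is only an illustration under our own assumptions: the person relation, its rows, and the helper names rows_to_xml and extract are hypothetical, and the code is imperative Python rather than a query language.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational data: the column names live apart from the tuples.
columns = ("name", "address")
rows = [("Smith", "Philadelphia"), ("Jones", "Seattle")]

def rows_to_xml(relation, columns, rows):
    """Conversion: wrap each tuple in an element named after the relation,
    with one child element per column."""
    root = ET.Element(relation + "s")
    for row in rows:
        item = ET.SubElement(root, relation)
        for column, value in zip(columns, row):
            ET.SubElement(item, column).text = value
    return root

def extract(root, tag):
    """Extraction: return the text of every descendant element with the
    given tag, in document order."""
    return [element.text for element in root.iter(tag)]

people = rows_to_xml("person", columns, rows)
xml_text = ET.tostring(people, encoding="unicode")
names = extract(people, "name")
```

Each such program hard-codes one conversion or one extraction; a query language lets the same tasks be stated declaratively instead.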
Data extraction, conversion, transformation, and integration are all well-understood database problems. Their solutions rely on a query language, either relational (SQL) or object-oriented (OQL). We present a query language for XML, called XML-QL, which we argue is suitable for performing the above tasks. XML-QL has the following features:
- It is declarative.
- It is "relationally complete"; in particular, it can express joins.
- It is simple enough that known database techniques for query optimization, cost estimation, and query rewriting could be extended to XML-QL.
- It can extract data from existing XML documents and construct new XML documents.
- It can support both ordered and unordered views on an XML document.
An initial draft of the query language is available as a W3C note: http://www.w3.org/TR/NOTE-xml-ql/.
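To make the join and construction features concrete, the following Python sketch joins two small XML sources on a shared element and builds a new XML document from the matches. It is not XML-QL: the documents, the tag names (quote, company, symbol, listing), and the function join_on_symbol are hypothetical, and the loop spells out by hand what a declarative query would leave to the query processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: two small XML sources that share a <symbol> element.
quotes = ET.fromstring(
    "<quotes>"
    "<quote><symbol>ABC</symbol><price>12.50</price></quote>"
    "<quote><symbol>XYZ</symbol><price>7.25</price></quote>"
    "</quotes>"
)
companies = ET.fromstring(
    "<companies>"
    "<company><symbol>ABC</symbol><name>Abc Corp</name></company>"
    "</companies>"
)

def join_on_symbol(quotes, companies):
    """Pair each quote with the company that has the same symbol and
    construct a new document from the matching pairs."""
    names = {c.findtext("symbol"): c.findtext("name")
             for c in companies.iter("company")}
    result = ET.Element("listings")
    for quote in quotes.iter("quote"):
        symbol = quote.findtext("symbol")
        if symbol in names:  # inner join: unmatched quotes are dropped
            listing = ET.SubElement(result, "listing")
            ET.SubElement(listing, "name").text = names[symbol]
            ET.SubElement(listing, "price").text = quote.findtext("price")
    return result

listings = join_on_symbol(quotes, companies)
```

Only the ABC quote has a matching company, so the constructed document contains a single listing element.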
One salient question is why we do not adapt SQL or OQL to query XML. The answer
is that XML data is fundamentally different from relational and
object-oriented data, and therefore neither SQL nor OQL is appropriate
for XML. The key distinction between data in XML and data in
traditional models is that XML is not rigidly structured. In the
relational and object-oriented models, every data instance has a schema, which is separate from and independent of the data. In XML, the schema exists with the data as tag names. For example, in the relational model, a schema might define the relation person with attribute names name and address, e.g., person(name, address). An instance of this schema would contain tuples such as ("Smith", "Philadelphia"). The relation and attribute names are separate from the data and are usually stored in a database catalog.
In XML, the schema information is stored with the data. Structured values are called elements. Element names, the counterpart of relational attribute names, are called tags; elements may also have attributes, whose values are always atomic. For instance, <person><name>Smith</name><address>Philadelphia</address></person>
is well-formed XML. Thus, XML data is self-describing and can naturally
model irregularities that cannot be modeled by relational or
object-oriented data. For example, data items may have missing elements
or multiple occurrences of the same element; elements may have atomic
values in some data items and structured values in others; and
collections of elements can have heterogeneous structure. Even XML data
that has an associated DTD is self-describing (the schema is always
stored with the data) and, except for restrictive forms of DTDs, may
have all the irregularities described above. Most importantly, this
flexibility is crucial for EDI applications.
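The irregularities listed above can be seen in a small document. The Python sketch below, which uses the standard xml.etree.ElementTree module, parses three person elements: one regular, one with a missing address, and one with two addresses of which one is structured and one is atomic. The data beyond the Smith example and the helper describe are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the second and third <person> deviate from the first.
doc = ET.fromstring(
    "<people>"
    "<person><name>Smith</name><address>Philadelphia</address></person>"
    "<person><name>Jones</name></person>"
    "<person><name>Lee</name>"
    "<address><city>Seattle</city><zip>98101</zip></address>"
    "<address>Portland</address></person>"
    "</people>"
)

def describe(person):
    """Classify each address of a person as 'atomic' (text only) or
    'structured' (has child elements)."""
    kinds = []
    for address in person.findall("address"):
        kinds.append("structured" if len(address) > 0 else "atomic")
    return person.findtext("name"), kinds

summary = [describe(p) for p in doc.findall("person")]
```

A single relation person(name, address) cannot hold all three items without extra conventions for the missing, repeated, and nested values, whereas the XML document carries them directly.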
Self-describing data has been considered recently in the database research community.
Researchers have found this data to be fundamentally different from
relational or object-oriented data, and have called it semistructured data.
Semistructured data is motivated by the problems of integrating
heterogeneous data sources and modeling sources such as biological
databases, Web data, and structured text documents, such as SGML and
XML. Research on semistructured data has addressed data models,
query-language design, query processing and optimization, schema
languages, and schema extraction. The key observation in this paper is
that XML data is an instance of semistructured data.
In designing XML-QL, we drew from other query languages for semistructured data [1, 2, 5]; tutorials describing some of the work on semistructured data can be found in [3] and [4].
XML-QL includes most features found in these languages, but it differs
from all of them in several important respects. Specifically, this
paper makes the following contributions:
- We propose a data model for XML data that extends the semistructured-data model with order. This extension is necessary for XML documents, which are ordered.
- We design a syntax for XML-QL that combines elements of the XML syntax with traditional query-language syntax.
- We propose a novel semantics for XML-QL to support order in the input and output data.
- We combine two powerful data-construction mechanisms, nested queries and Skolem functions, in a novel way.
- We illustrate that XML-QL can be used for the tasks it has been designed for, such as data extraction, transformation, and integration.
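For readers unfamiliar with Skolem functions as a data-construction mechanism, the following Python sketch shows the general idea under a common reading: a Skolem function returns a new object for each distinct argument tuple and the same object whenever the arguments repeat, which lets a query group flat data. The Skolem class, the input pairs, and the tag names are hypothetical, and the sketch does not reproduce XML-QL's syntax or semantics.

```python
import xml.etree.ElementTree as ET

class Skolem:
    """Maps each distinct argument tuple to one element under `parent`,
    creating the element on first use and reusing it afterwards."""

    def __init__(self, parent, tag):
        self.parent = parent
        self.tag = tag
        self.created = {}

    def __call__(self, *args):
        if args not in self.created:
            self.created[args] = ET.SubElement(self.parent, self.tag)
        return self.created[args]

# Hypothetical flat input: (author, title) pairs in document order.
pairs = [("Smith", "Title A"), ("Jones", "Title B"), ("Smith", "Title C")]

result = ET.Element("authors")
author_of = Skolem(result, "author")
for name, title in pairs:
    author = author_of(name)  # same name -> same <author> element
    if author.find("name") is None:
        ET.SubElement(author, "name").text = name
    ET.SubElement(author, "title").text = title
```

Because author_of("Smith") yields the same element both times, the two Smith titles end up grouped under one author element, and the elements appear in the order in which their keys were first seen.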