XML: Under the Hood
Sep/Oct 2002 Issue, posted Sep 1, 2002

Note: This article appeared in Intranet Professional, prior to its relaunch as Intranets (in 2004).


XML is the most important technological advance for information professionals since the Web. Extensible Markup Language (XML) increasingly provides the underlying technological infrastructure for many of the information systems and services used every day. For a profession that is founded on the collection of information and the provision of services to users who wish to use that information, it is hard to think of a more useful tool. This article briefly describes XML and how it can be used, then highlights particular uses of it in libraries.

Anyone who has coded an HTML document by hand (without a WYSIWYG editor), or used an early word processor that required formatting codes to be added to the text, is familiar with the basic premise of XML. XML is a method by which the meaning of a given bit of information is made explicit so that software can understand and process it. This is done with tags, which are named elements enclosed in angle brackets, such as "<title>." For example, a human being can look at the text "Jack London wrote The Call of the Wild" and easily understand the information carried by those words: a person by the name of "Jack London" created a written work with the title "The Call of the Wild." Software cannot easily understand these elements and relationships without explicit marking. There are a variety of ways to add this information using XML, but one way would be something like the following:

<?xml version="1.0" encoding="utf-8"?>
<book>
<title>The Call of the Wild</title>
<author>Jack London</author>
</book>

As the above example shows, each part of the information that is described separately is tagged individually, and hierarchy can make relationships explicit. Software can process this to extract the information that this item is a book with the title of "The Call of the Wild," authored by "Jack London." The first line simply specifies that this fragment is encoded in XML using UTF-8 (Unicode) character encoding.
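
As a minimal sketch of what "software can process this" might look like, the following Python fragment reads the book example with the standard library's xml.etree.ElementTree module and pulls out the title and author. The function and variable names are illustrative and not part of the original article.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0" encoding="utf-8"?>
<book>
  <title>The Call of the Wild</title>
  <author>Jack London</author>
</book>
"""


def describe_book(xml_text: str) -> str:
    """Return a one-line description built from the <title> and <author> elements."""
    # Parse from bytes so the encoding named in the XML declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    title = root.findtext("title")
    author = root.findtext("author")
    return f"{root.tag}: {title!r} by {author!r}"


if __name__ == "__main__":
    print(describe_book(BOOK_XML))
```

The root element's name supplies the fact that the item is a book, and the nested elements supply the title and author, which is the hierarchy the example relies on.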

Well-Formed XML
The example above is what is called "well-formed." Well-formed means that the XML follows certain basic tagging rules:

• An XML declaration, if present, is the first line (the declaration is optional in XML 1.0, but including it is recommended).
• There is exactly one root element (in the example above, the root element "book" encloses all other tags).
• All tags have a beginning and an end (an end tag has the same name as the beginning tag, but is preceded by a slash; or, if the tag does not wrap anything, the tagging syntax can be minimized; e.g., <br/> instead of <br></br>).
• All tags properly nest; that is, a tag that is begun within another tag ends before the tag that encloses it: <tag1>This is an <tag2>example</tag2> of proper nesting</tag1>.
• Any tag may have an attribute, but all attribute values are quoted: <name role="author"> ("role" is an attribute of the "name" tag, and "author" is the attribute value).

So long as the above rules are followed, one can make up whatever tags one wishes. Doing so is fine for applications that don't require sharing information, but if the XML needs to be shared with others, it is necessary to go one step further and create "valid" XML.
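
The well-formedness rules listed above can be checked mechanically, because an XML parser refuses input that breaks them. The sketch below assumes Python's standard xml.etree.ElementTree parser; the sample strings and the is_well_formed helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Illustrative samples, one per rule.
SAMPLES = {
    "proper nesting": "<tag1>This is an <tag2>example</tag2> of proper nesting</tag1>",
    "overlapping tags": "<tag1>This is an <tag2>example</tag1> of bad nesting</tag2>",
    "unquoted attribute": "<name role=author>Jack London</name>",
    "two root elements": "<book/><book/>",
    "minimized empty tag": "<p>line one<br/>line two</p>",
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        print(f"{label}: {is_well_formed(text)}")
```

Only the first and last samples pass. The parser makes no judgment about which tag names are used, which is why made-up tags are acceptable at this level.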

Valid XML
Valid XML adheres to rules set out for a particular set of XML tags. These rules are codified in either a Document Type Definition (DTD) or an XML schema definition. The rules specify not only which specific tags can be used but also such things as where the tags can be used (that is, within which other tags), which attributes and values are allowed, etc. By having a DTD or schema, XML processing software can check a given instance of XML for proper adherence to the rules and point out specific instances where those rules have been broken. This is very useful for error checking.
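
To give a feel for what a validating processor does, here is a rough sketch in Python. The standard-library parser used here is non-validating, so the RULES table is a hypothetical stand-in for a DTD or schema (it is not DTD syntax), and the validate function is an illustration rather than a real validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a DTD: for each element, the child
# elements it must contain (exactly once each) and the attributes it may carry.
RULES = {
    "book": {"children": ["title", "author"], "attributes": set()},
    "title": {"children": [], "attributes": set()},
    "author": {"children": [], "attributes": {"role"}},
}


def validate(element: ET.Element, rules: dict = RULES) -> list:
    """Return a list of rule violations found in element and its descendants."""
    rule = rules.get(element.tag)
    if rule is None:
        return [f"<{element.tag}> is not a permitted tag"]
    errors = []
    for name in element.attrib:
        if name not in rule["attributes"]:
            errors.append(f"<{element.tag}> may not have attribute '{name}'")
    found = [child.tag for child in element]
    for required in rule["children"]:
        if found.count(required) != 1:
            errors.append(f"<{element.tag}> needs exactly one <{required}>")
    for child in element:
        if child.tag not in rule["children"]:
            errors.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            errors.extend(validate(child, rules))
    return errors


if __name__ == "__main__":
    broken = ET.fromstring("<book><title>The Call of the Wild</title><publisher>X</publisher></book>")
    for problem in validate(broken):
        print(problem)
```

The sample document is well-formed but breaks two rules (a missing author and an unexpected publisher element), and each violation is reported individually, which is the error-checking benefit described above.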

Doing Useful Work with XML
Tagging information is the first step, but to do useful work with it there must be a method to process the information in various ways. The primary method for processing XML-encoded information is Extensible Stylesheet Language Transformations (XSLT). XSLT provides methods for selecting individual pieces of an XML-encoded file and performing a variety of transformations on that information (applying templates). For example, one may want to display the XML above in a standard Web browser, which will require translation into HTML. Here is an XSLT stylesheet that does just that:

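<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<!-- Root template: wrap the output in an HTML page and process the book -->
<xsl:template match="/">
<html><body><xsl:apply-templates select="book"/></body></html>
</xsl:template>
<!-- Book template: one labeled paragraph each for the title and the author -->
<xsl:template match="book">
<p>Title: <xsl:value-of select="title"/></p>
<p>Author: <xsl:value-of select="author"/></p>
</xsl:template>
</xsl:stylesheet>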

This XSLT stylesheet renders the XML as:

Title: The Call of the Wild
Author: Jack London

Note how the HTML markup can be mixed among XSLT instructions to create the kind of displays desired. XSLT has a very powerful set of functions, including standard programming constructs like looping, decision structures, and variables. The definitive work on XSLT is Michael Kay's XSLT Programmer's Reference (Wrox Press, 2nd ed., 2001). For a more detailed example of how XML and XSLT work together, including the raw XML and XSLT files, see Eric Lease Morgan's information on how he uses XML and XSLT for his Water Collection [http://infomotions.com/water/about/].
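
For readers without an XSLT processor at hand, the same title-and-author rendering can be approximated in a few lines of Python. This is a hand-written equivalent built on xml.etree.ElementTree, not XSLT itself (the Python standard library does not include an XSLT processor), and the function name and choice of HTML tags are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
  <title>The Call of the Wild</title>
  <author>Jack London</author>
</book>"""


def book_to_html(xml_text: str) -> str:
    """Build a small HTML page from a <book> document, one paragraph per field."""
    book = ET.fromstring(xml_text)
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    # Each label/element pair plays the role of one value-of selection.
    for label, tag in (("Title", "title"), ("Author", "author")):
        paragraph = ET.SubElement(body, "p")
        paragraph.text = f"{label}: {book.findtext(tag)}"
    return ET.tostring(html, encoding="unicode", method="html")


if __name__ == "__main__":
    print(book_to_html(BOOK_XML))
```

The trade-off is that the presentation logic now lives in program code, whereas a stylesheet keeps it in a separate file that can be swapped without touching the program.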

The Usefulness of XML
As with any technology, there are tasks for which XML is an excellent choice and others for which it is a poor one. In particular, it is sometimes difficult to decide when to use XML and when to use a database. There are no hard-and-fast rules, but some general rules of thumb apply. If the information is loosely structured (sometimes an element is present, sometimes not), of arbitrary length (e.g., full text such as journal articles or books), and will not grow to a very large set of individual items, then consider XML. However, if the information is highly structured, has fixed field lengths, and must scale to a very large set of individual records, consider a database solution.
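
A small sketch may make "loosely structured" concrete. In the hypothetical catalog below, one record has an author and a free-text summary and the other has neither; the element names and the "unknown" placeholder are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: the second book has no <author>, and the first
# carries a <summary> of arbitrary length.
CATALOG_XML = """
<catalog>
  <book>
    <title>The Call of the Wild</title>
    <author>Jack London</author>
    <summary>A domesticated dog is taken to the Yukon during the Klondike Gold Rush.</summary>
  </book>
  <book>
    <title>Beowulf</title>
  </book>
</catalog>
"""


def list_books(xml_text: str) -> list:
    """Return (title, author) pairs, substituting a placeholder for a missing author."""
    catalog = ET.fromstring(xml_text)
    return [
        (book.findtext("title"), book.findtext("author", default="unknown"))
        for book in catalog.findall("book")
    ]


if __name__ == "__main__":
    for title, author in list_books(CATALOG_XML):
        print(f"{title} ({author})")
```

Neither record needed a fixed set of fields or a maximum field length, which is the kind of flexibility that favors XML over a fixed-column table.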

How Libraries Use XML
XML is used in libraries today across all areas of the organization. In the book XML in Libraries (Neal-Schuman, 2002) [Editor's Note: edited by Roy Tennant], 13 library projects are described in which XML plays an essential part. Here are brief descriptions of a few of them.

The Lane Medical Library at Stanford University developed a set of tools for batch updating MARC records using XML. They extract records from the catalog, translate the MARC tags into XML, apply transformations using the XMLMARC software program they created, and then reinsert them into the library catalog. This allows all kinds of batch changes that are not well supported by the catalog system. See the XMLMARC Web site [http://xmlmarc.stanford.edu/] for more information.
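
The extract, transform, and reinsert cycle can be sketched roughly as follows. This is not the XMLMARC software or its record format: the record/field layout, the sample data, and the batch_replace helper are all hypothetical, with MARC tag 245 standing for the title statement and 650 for a topical subject heading.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of two extracted catalog records.
RECORDS_XML = """
<records>
  <record>
    <field tag="245">The Call of the Wild</field>
    <field tag="650">Dogs - Fiction</field>
  </record>
  <record>
    <field tag="245">White Fang</field>
    <field tag="650">Dogs - Fiction</field>
  </record>
</records>
"""


def batch_replace(xml_text: str, tag: str, old: str, new: str) -> tuple:
    """Replace old with new in every field carrying the given tag.

    Returns the updated XML text and the number of fields changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for field in root.iter("field"):
        if field.get("tag") == tag and field.text == old:
            field.text = new
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    updated, count = batch_replace(
        RECORDS_XML, "650", "Dogs - Fiction", "Dogs - Juvenile fiction"
    )
    print(f"{count} fields changed")
```

Once the records are XML, a change that touches every matching field is a short loop, and the updated text would then be converted back for loading into the catalog.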

At Oregon State University, Kyle Banerjee created a system that takes downloaded interlibrary loan (ILL) requests, queries the library catalog for additional information, and outputs an XML file with the combined information. He uses an XSLT stylesheet to create a printable file that is used to fetch the item from the stacks. By using XML and XSLT, he created an infrastructure that is responsive to different circumstances (e.g., the borrower's preferred delivery method) and can easily be adapted to changing requirements (e.g., the need to change where something is printed on the page).

The University of California, San Francisco, used XML to federate access to several collections of documents numbering in the millions of pages [http://www.legacy.library.ucsf.edu/]. The documents are an outcome of a court settlement between the National Association of Attorneys General and six major tobacco companies. Each company was required to make some of its documents publicly available, but each did so in its own way. The UCSF Library mapped the disparate metadata elements into a common set while retaining the often-richer set of elements that applied to a specific collection. As the underlying technology, XML supports both the metadata description and the structural markup required to knit the multiple page images of each document into a navigable whole.
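
Mapping disparate metadata into a common set while keeping collection-specific elements could look something like the sketch below. The collection names, element names, and crosswalk tables are invented for illustration and do not reflect the actual UCSF metadata.

```python
import xml.etree.ElementTree as ET

# Hypothetical crosswalk: source element name -> common element name,
# one table per collection.
CROSSWALKS = {
    "collection_a": {"doc_title": "title", "doc_date": "date"},
    "collection_b": {"heading": "title", "created": "date"},
}


def to_common(record_xml: str, collection: str) -> ET.Element:
    """Map a source record to the common element set, keeping unmapped elements."""
    source = ET.fromstring(record_xml)
    mapping = CROSSWALKS[collection]
    common = ET.Element("document", {"collection": collection})
    extras = ET.SubElement(common, "collection_specific")
    for child in source:
        if child.tag in mapping:
            ET.SubElement(common, mapping[child.tag]).text = child.text
        else:
            # Richer, collection-only elements are retained rather than dropped.
            extras.append(child)
    return common


if __name__ == "__main__":
    record = (
        "<record><heading>Memo</heading><created>1994-05-01</created>"
        "<bates_number>123</bates_number></record>"
    )
    print(ET.tostring(to_common(record, "collection_b"), encoding="unicode"))
```

Searches across all collections can then rely on the common title and date elements, while anything unique to one collection remains available under collection_specific.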

The eScholarship program at the California Digital Library uses XML to publish books in association with the University of California Press [http://escholarship.cdlib.org/ucpressbooks.html]. Using XML to encode the books provides a method by which they can easily migrate into new formats or structures as these become useful or important, while also creating highly usable versions using today's technology. For example, the books can be displayed to the user in several views, such as large print or even a user-defined format, while the underlying information itself remains unchanged.

The Bottom Line
These are but a few examples of how XML is helping libraries to perform existing tasks more effectively, support new kinds of collections and services, and provide a standard method for sharing and manipulating information of many different types. XML is not only finding its way into library projects but also into an incredible array of software products and online systems in the broader consumer market. XML is clearly here to stay for the simple reason that it is an effective tool for solving a number of problems that face a wide variety of organizations and individuals.
