Help Desk: XML
Collection: xml.ss (xml)
Due to the emergence of the Web and HTML, people have recently rediscovered the efficiency of using parenthesized forms for representing data. The development of XML, the eXtensible Markup Language, aims at standardizing the representation of data for storage on disks and transmission over the Web. It is quickly gaining popularity, and modern languages must provide interfaces for XML data. We will see that Scheme is a particularly good language for interacting with the XML world.
Roughly speaking, XML is a language for representing data. An XML expression is a fully parenthesized form of data. The major visual difference between an XML expression and an external S-expression is that there are many forms of "parentheses" in XML, not just parentheses, brackets, and braces. These parentheses are made up of words, called tags. For example, where we write
(1 2 3)
for the list of 1, 2, and 3, an XMLer might write
<parenthesis>1 2 3</parenthesis>
or
<paren>1 2 3</paren>
or something similar. The tokens <paren> and </paren> are
called the start tag and the end tag. We think of them as
parentheses.
With XML we can turn (almost) any sequence of characters into a pair of start and end tags. The pair of tags and everything in between is called an element. The sequence of characters for the tag is the name of the element. The rest is called the content. In addition to name and content, an XML element may also have attributes. For example,
<paren title="nat nums" date="oct 22, 2000"> 1 2 3 </paren>
has two attributes: title and date. The values of the
attributes are the two strings "nat nums" and "oct 22, 2000".
Figure 8 shows one of many ways of representing an S-expression for tracking grades with XML. A comparison shows how an XML data designer might use attributes. For example, the course title and the course semester are attributes of the <course> parentheses. Similarly, the name of the student is a <grades> attribute. Each grade is surrounded by an additional pair of parentheses.
Clearly the extensible markup language is a generalization of the language of S-expressions. The parentheses are named; each parenthesized element may have additional attributes. At the same time, it is also clear that we can naturally represent all forms of XML data expressions in an S-expression format. Indeed, there are many different ways of translating XML expressions into S-expressions.
The data definition in figure 9 defines PLT Scheme's choice of mapping XML into S-expressions. We refer to this subset of S-expressions as X-expressions. The figure also specifies a collection of functions that allows us to read XML, to convert XML into X-expressions, and to print X-expressions.
An X-expression is an S-expression that belongs to the following grammar:
Xexpr ::= string
       |  (symbol ((symbol string) ...) Xexpr ...)   an element
       |  (symbol Xexpr ...)                         an element without attributes
       |  symbol                                     a symbolic entity
       |  number                                     a numeric entity
       |  misc                                       see Help Desk
;; A Document is a structure.
;; An XML (element) represents an XML data expression.

;; document-element : Document -> XML
;; to extract the XML value in the element field of a document structure

;; read-xml : -> Document
;; to read a single XML expression from standard input

;; write-xml : Document -> Void
;; to print a Document as XML to the standard output

;; xml->xexpr : XML -> Xexpr
;; to convert an XML element into an X-expression

;; xexpr->xml : Xexpr -> XML
;; to convert an X-expression into an XML element

;; eliminate-whitespace : (listof Symbol) (Bool -> Bool) -> (XML -> XML)
;; to eliminate whitespace from XML elements that contain XML elements
Figure 9: Reading XML and X-expressions
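The functions in figure 9 can be tried without a file. The following is a minimal sketch, assuming a current Racket installation in which the library is loaded with (require xml); the name sample and the use of a string port in place of standard input are assumptions made only for this illustration, and xexpr->string is a convenience function of that library that does not appear in the figure.

```racket
#lang racket
(require xml)

;; Hypothetical input: the <paren> element from above with one attribute.
(define sample "<paren title=\"nat nums\">1 2 3</paren>")

;; read-xml reads from the current input port by default;
;; here it is given a string port instead.
(define doc (read-xml (open-input-string sample)))

;; Extract the element from the document and convert it into an X-expression.
(define x (xml->xexpr (document-element doc)))
x ; an element with one attribute and one string as content

;; Convert the X-expression back into an XML element and print it.
(write-xml/content (xexpr->xml x))
(newline)
```

The X-expression x has the shape (symbol ((symbol string) ...) Xexpr ...) from the grammar: the tag name, then the attribute list, then the content.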
With the functions in figure 9 we can read XML expressions
from files almost as easily as S-expressions. Reading an XML expression
yields a document from which we extract the element, which we can convert
into an X-expression. Consider the example in
figure 10. The first part of the figure is the textual
representation of an XML expression. Assume this text is stored in a file
called "sample.xml". Then the evaluation of the expression
(xml->xexpr (document-element (with-input-from-file "sample.xml" read-xml)))
yields the X-expression in the second part of figure 10.
<course title="Comp210">
  <grades name="Adam">
    <g>88</g>
  </grades>
  <grades name="Beth">
    <g>96</g>
  </grades>
  <grades name="Cath">
    <g>70</g>
  </grades>
  <grades name="Dave">
    <g>68</g>
  </grades>
  <grades name="Fawn">
    <g>99</g>
  </grades>
  <grades name="Gege">
    <g>100</g>
  </grades>
</course>

(course
 ((title "Comp210"))
 "\n  "
 (grades ((name "Adam")) "\n    " (g () "88") "\n  ")
 "\n  "
 (grades ((name "Beth")) "\n    " (g () "96") "\n  ")
 "\n  "
 (grades ((name "Cath")) "\n    " (g () "70") "\n  ")
 "\n  "
 (grades ((name "Dave")) "\n    " (g () "68") "\n  ")
 "\n  "
 (grades ((name "Fawn")) "\n    " (g () "99") "\n  ")
 "\n  "
 (grades ((name "Gege")) "\n    " (g () "100") "\n  ")
 "\n")
Figure 10: Reading XML: a first example
Figure 10 shows that read-xml preserves
whitespace (blanks, tabs, newlines) in the file and turns it into
strings. Although this whitespace preservation is important for
text-processing within XML elements, it is a nuisance for other
applications. This X-expression is clearly not what we want; it contains
every run of whitespace in the file as an additional string.
We can eliminate (most of) this useless whitespace with the
eliminate-whitespace function in the XML library. Here is a
simple use:
> (pretty-print
   (xml->xexpr
    ((eliminate-whitespace '(course grades) identity)
     (document-element
      (with-input-from-file "sample.xml" read-xml)))))
'(course
  ((title "Comp210"))
  (grades ((name "Adam")) (g () "88"))
  (grades ((name "Beth")) (g () "96"))
  (grades ((name "Cath")) (g () "70"))
  (grades ((name "Dave")) (g () "68"))
  (grades ((name "Fawn")) (g () "99"))
  (grades ((name "Gege")) (g () "100")))
In general, eliminate-whitespace consumes a list of XML tags
(symbols) and a function; for now we just use identity or
(lambda (x) x) for this second argument. The result is a function
that traverses an XML element and systematically eliminates whitespace
from those elements whose tags are included in the given list. Of course, the
function cannot eliminate whitespace from elements that must contain text.
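To see the effect on a small scale, here is a minimal sketch, assuming a current Racket installation with (require xml). The two-student document in sample and the names elem, raw, and clean are hypothetical and introduced only for this illustration; a string port stands in for the file.

```racket
#lang racket
(require xml)

;; Hypothetical two-student excerpt of the grades document.
(define sample
  (string-append
   "<course title=\"Comp210\">\n"
   "  <grades name=\"Adam\">\n"
   "    <g>88</g>\n"
   "  </grades>\n"
   "  <grades name=\"Beth\">\n"
   "    <g>96</g>\n"
   "  </grades>\n"
   "</course>"))

(define elem (document-element (read-xml (open-input-string sample))))

;; Plain conversion: every run of whitespace between tags survives as a string.
(define raw (xml->xexpr elem))

;; Remove whitespace-only strings from <course> and <grades> elements first.
;; <g> is not in the list, so its text content is left untouched.
(define clean
  (xml->xexpr ((eliminate-whitespace '(course grades) identity) elem)))

(pretty-print clean)
```

Only course and grades are listed because their content consists of other elements. Listing an element such as g, whose content is text, would ask the function to remove something it must keep.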
Exercises
Exercise 2.0.2.
Write a function that transforms the X-expressions for keeping track of
grades such that the 'g elements contain numbers, not strings. The
function should signal an error if any of the strings represents something
other than a number. Hint: Recall that string->number may produce
false.
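The hint can be checked directly. A minimal sketch, assuming a current Racket, where the false value prints as #f; the names numeric and non-numeric are introduced here only for illustration:

```racket
#lang racket

;; A string that represents a number converts to that number.
(define numeric (string->number "88"))

;; A string that does not represent a number yields false instead of an error.
(define non-numeric (string->number "eighty"))

numeric
non-numeric
```

Because string->number does not signal an error itself, the function in the exercise has to test the result and signal the error on its own.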