2  Input and Output: XML and X-expressions

Help Desk: XML

Collection: xml.ss (xml)

Due to the emergence of the Web and HTML, people have recently rediscovered the efficiency of representing data in parenthesized form. XML, the eXtensible Markup Language, aims to standardize the representation of data for storage on disk and transmission over the Web. It is quickly gaining popularity, and modern languages must provide interfaces for XML data. We will see that Scheme is a particularly good language for interacting with the XML world.

Roughly speaking, XML is a language for representing data. An XML expression is a fully parenthesized form of data. The major visual difference between an XML expression and an external S-expression is that there are many forms of "parentheses" in XML, not just parentheses, brackets, and braces. These parentheses are made up of words, called tags. For example, when we write

  (1 2 3)

for the list of 1, 2, and 3, an XMLer might write

  <parenthesis>1 2 3</parenthesis>

or

  <paren>1 2 3</paren>

or something similar. The tokens <paren> and </paren> are called the start tag and the end tag. We think of them as parentheses.

With XML we can turn (almost) any sequence of characters into a pair of start and end tags. The pair of tags and everything in between is called an element. The sequence of characters for the tag is the name of the element. The rest is called the content. In addition to name and content, an XML expression may also have attributes. For example,

  <paren title="nat nums" date="oct 22, 2000">
  1 2 3
  </paren>

has two attributes: title and date. The values of the attributes are the two strings "nat nums" and "oct 22, 2000".

(Comp210 "Fall 2001"
  (Adam 78 88 69)
  (Brad 88 87 86)
  (Cath 99 88 88)
  (Dave 77 78 77)
  (Fawn 90 89 81)
  (Gege 67 78 81))



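<course title="Comp210"
        semester="Fall 2001">
 <grades name="Adam">
  <g>78</g> <g>88</g> <g>69</g>
 </grades>
 <grades name="Brad">
  <g>88</g> <g>87</g> <g>86</g>
 </grades>
 <grades name="Cath">
  <g>99</g> <g>88</g> <g>88</g>
 </grades>
 <grades name="Dave">
  <g>77</g> <g>78</g> <g>77</g>
 </grades>
 <grades name="Fawn">
  <g>90</g> <g>89</g> <g>81</g>
 </grades>
 <grades name="Gege">
  <g>67</g> <g>78</g> <g>81</g>
 </grades>
</course>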

Figure 8:  An XML representation of a grade file

Figure 8 shows one of many ways of representing an S-expression for tracking grades with XML. A comparison shows how an XML data designer might use attributes. For example, the course title and the course semester are attributes of the <course> parentheses. Similarly, the name of the student is a <grades> attribute. Each grade is surrounded by an additional pair of parentheses.

Clearly the extensible markup language is a generalization of the language of S-expressions. The parentheses are named; each parenthesized element may have additional attributes. At the same time, it is also clear that we can naturally represent all forms of XML data expressions in an S-expression format. Indeed, there are many different ways of translating XML expressions into S-expressions.

The data definition in figure 9 defines PLT Scheme's choice of mapping XML into S-expressions. We refer to this subset of S-expressions as X-expressions. The figure also specifies a collection of functions that allows us to read XML, to convert XML into X-expressions, and to print X-expressions.

An X-expression is an S-expression that belongs to the following grammar:

  Xexpr ::= string
         |  (symbol ((symbol string) ...) Xexpr ...)   an element
         |  (symbol Xexpr ...)                         an element without attributes
         |  symbol                                     a symbolic entity such as &nbsp;
         |  number                                     a numeric entity
         |  misc                                       see Help Desk

;; A Document is a structure.
;; An XML (element) represents an XML data expression.

;; document-element : Document -> XML
;; to extract the XML value in the element field of a document structure

;; read-xml : -> Document
;; to read a single XML expression from standard input

;; write-xml : Document -> Void
;; to print a Document as XML to the standard output

;; xml->xexpr : XML -> Xexpr
;; to convert an XML element into an X-expression

;; xexpr->xml : Xexpr -> XML
;; to convert an X-expression into an XML element

;; eliminate-whitespace : (listof Symbol) (Bool -> Bool) -> (XML -> XML)
;; to eliminate whitespace from XML elements that contain XML elements

Figure 9:  Reading XML and X-expressions
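To make the grammar in figure 9 concrete, here is a minimal sketch that takes apart X-expressions representing elements. It is written for current Racket, the successor of PLT Scheme. The helper names (attribute-list?, has-attributes?, element-tag, element-attrs, element-body) are hypothetical and not part of the XML library; the sample data is the <paren> element from earlier in this section.

```racket
#lang racket
;; Hypothetical helpers (not from the XML library) for X-expressions
;; that represent elements, i.e. lists whose first item is a symbol.

;; attribute-list? : Any -> Boolean
;; is x a list of (symbol string) pairs, as in the grammar's element clause?
(define (attribute-list? x)
  (and (list? x)
       (andmap (lambda (a)
                 (and (list? a)
                      (= (length a) 2)
                      (symbol? (first a))
                      (string? (second a))))
               x)))

;; has-attributes? : Xexpr -> Boolean
;; does the element x carry an attribute list right after its tag?
(define (has-attributes? x)
  (and (pair? (rest x)) (attribute-list? (second x))))

;; element-tag : Xexpr -> Symbol
(define (element-tag x) (first x))

;; element-attrs : Xexpr -> (listof (list Symbol String))
(define (element-attrs x)
  (if (has-attributes? x) (second x) '()))

;; element-body : Xexpr -> (listof Xexpr)
(define (element-body x)
  (if (has-attributes? x) (rest (rest x)) (rest x)))

;; The <paren> element from the text, written as an X-expression.
(define nats
  '(paren ((title "nat nums") (date "oct 22, 2000")) "1 2 3"))

(element-tag nats)               ; paren
(element-attrs nats)             ; ((title "nat nums") (date "oct 22, 2000"))
(element-body nats)              ; ("1 2 3")
(element-body '(paren "1 2 3"))  ; ("1 2 3"), the form without attributes
```

The check in has-attributes? relies on the grammar: an X-expression that is a list always starts with a symbol, so a second item that is a list of two-item lists can only be the attribute list.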

With the functions in figure 9 we can read XML expressions from files almost as easily as S-expressions. Reading an XML expression yields a document from which we extract the element, which we can convert into an X-expression. Consider the example in figure 10. The left column is the textual representation of an XML expression. Assume this text is stored in a file called "sample.xml". Then the evaluation of the expression

  (xml->xexpr
    (document-element
      (with-input-from-file "sample.xml" read-xml)))

yields the X-expression in the right column of figure 10.

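<course title="Comp210">
  <grades name="Adam">
    <g>88</g>
  </grades>
  <grades name="Beth">
    <g>96</g>
  </grades>
  <grades name="Cath">
    <g>70</g>
  </grades>
  <grades name="Dave">
    <g>68</g>
  </grades>
  <grades name="Fawn">
    <g>99</g>
  </grades>
  <grades name="Gege">
    <g>100</g>
  </grades>
</course>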


(course ((title "Comp210")) "
  " (grades ((name "Adam")) "
    " (g () "88") "
  ") "
  " (grades ((name "Beth")) "
    " (g () "96") "
  ") "
  " (grades ((name "Cath")) "
    " (g () "70") "
  ") "
  " (grades ((name "Dave")) "
    " (g () "68") "
  ") "
  " (grades ((name "Fawn")) "
    " (g () "99") "
  ") "
  " (grades ((name "Gege")) "
    " (g () "100") "
  ") "
")

Figure 10:  Reading XML: a first example
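The reading steps can also be tried without a file. The following sketch assumes current Racket, where the library is loaded with (require xml) rather than through the xml.ss collection path. It reads from a string instead of "sample.xml"; the one-student sample string and the helper name string->raw-xexpr are made up for illustration.

```racket
#lang racket
(require xml)

;; A one-student stand-in for the contents of "sample.xml" (illustrative only).
(define sample
  (string-append
   "<course title=\"Comp210\">\n"
   "  <grades name=\"Adam\">\n"
   "    <g>88</g>\n"
   "  </grades>\n"
   "</course>\n"))

;; string->raw-xexpr : String -> Xexpr   (hypothetical helper)
;; to read one XML expression from str and convert its element
;; into an X-expression, keeping all whitespace
(define (string->raw-xexpr str)
  (xml->xexpr
   (document-element
    (with-input-from-string str read-xml))))

(define raw (string->raw-xexpr sample))

(first raw)           ; the tag: course
(second raw)          ; the attributes: ((title "Comp210"))
(filter string? raw)  ; the whitespace strings between the nested elements
```

Here with-input-from-string plays the role that with-input-from-file plays in the text: read-xml still reads from the current input port.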

Figure 10 shows that read-xml preserves whitespace (blanks, tabs, newlines) in the file and turns it into strings. Although preserving whitespace is important for text processing within XML elements, it is a nuisance for other applications. This X-expression is clearly not what we want: every run of whitespace in the file shows up as an additional string.

We can eliminate (most of) this useless whitespace with the eliminate-whitespace function in the XML library. Here is a simple use:

> (pretty-print
    (xml->xexpr
      ((eliminate-whitespace '(course grades) identity)
       (document-element
         (with-input-from-file "sample.xml" read-xml)))))

  (course
   ((title "Comp210"))
   (grades ((name "Adam")) (g () "88"))
   (grades ((name "Beth")) (g () "96"))
   (grades ((name "Cath")) (g () "70"))
   (grades ((name "Dave")) (g () "68"))
   (grades ((name "Fawn")) (g () "99"))
   (grades ((name "Gege")) (g () "100")))

In general, eliminate-whitespace consumes a list of XML tags (symbols) and a function; for now we just use identity or (lambda (x) x) for this second argument. The result is a function that traverses an XML element and systematically eliminates whitespace from those elements whose tags are included in the given list. Of course, the function cannot eliminate whitespace from elements that must contain text.
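The effect of the tag list can be seen in a small, self-contained sketch. It assumes current Racket with (require xml) and reads from a string rather than a file; the sample string and the helper name string->clean-xexpr are hypothetical.

```racket
#lang racket
(require xml)

;; A one-student stand-in for the contents of "sample.xml" (illustrative only).
(define sample
  (string-append
   "<course title=\"Comp210\">\n"
   "  <grades name=\"Adam\">\n"
   "    <g>88</g>\n"
   "  </grades>\n"
   "</course>\n"))

;; string->clean-xexpr : String (listof Symbol) -> Xexpr   (hypothetical helper)
;; to read one XML expression from str, drop whitespace from the
;; elements named in tags, and convert the result to an X-expression
(define (string->clean-xexpr str tags)
  (xml->xexpr
   ((eliminate-whitespace tags identity)
    (document-element
     (with-input-from-string str read-xml)))))

;; Whitespace is removed from both course and grades elements:
(string->clean-xexpr sample '(course grades))
;; (course ((title "Comp210")) (grades ((name "Adam")) (g () "88")))

;; With only course listed, the whitespace inside grades stays:
(string->clean-xexpr sample '(course))
```

Note that g is deliberately left out of the tag list, because a g element must contain text.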


Exercise 2.0.2.   Write a function that transforms the X-expressions for keeping track of grades such that the 'g elements contain numbers, not strings. The function should signal an error if any of the strings represents something other than a number. Hint: Recall that string->number may produce false.
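As a reminder of the behavior the hint refers to, here is a brief illustration in current Racket. It shows only what string->number produces; the traversal of the X-expression and the error signaling are left to the exercise.

```racket
#lang racket
;; string->number yields a number when the string denotes one ...
(string->number "88")      ; 88
(string->number "100")     ; 100
;; ... and false otherwise, so the result must be tested before use.
(string->number "eighty")  ; #f
```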

Exercise 2.0.3.