Thursday, November 02, 2006

how not to write xml

I came across Writing XML with Java the other day, and hoped that the author wouldn't advocate building XML through plain string concatenation. Unfortunately, they did.

Here's what the author said to justify this:
Eventually we’ll take up some alternatives to the direct string approach such as DOM and JDOM that do allow you to automatically maintain well-formedness and sometimes even validity. However, for many simple cases, these are vast overkill. It can be much simpler to just write a few strings onto an output stream.

I disagree: maintaining well-formedness should be the computer's job, not ours, because it is mindless work that is very easy to get wrong by hand. Well-formedness is something I shouldn't have to worry about: it should be trivial to get this right.

The author also said:

Making sure the output is correct is simply one part of testing and debugging your code.

Yes, making sure we're outputting the right thing is part of a good test suite. But there are some kinds of bugs that just don't deserve to make it out into the open. The most common error I've seen in HTML/XML generation is failing to properly quote and escape things. It's precisely because of this cavalier attitude toward generating structured data that we see such problems.

Here's the style of code they wrote (translation of Exercise 3.8):

(define (simple)
  (printf "<?xml version=\"1.0\"?>~n")
  (printf "<mathml:math xmlns:mathml=\"\">~n")
  ;; low and high are consecutive Fibonacci numbers
  (let loop ([i 1]
             [low 1]
             [high 1])
    (when (<= i 10)
      (printf "<mathml:mrow>~n")
      (printf " <mathml:mi>f(~a)</mathml:mi>~n" i)
      (printf " <mathml:mo>=</mathml:mo>~n")
      (printf " <mathml:mn>~a</mathml:mn>~n" low)
      (printf "</mathml:mrow>~n")
      (loop (add1 i) high (+ high low))))
  (printf "</mathml:math>~n"))

This style of building XML is simple, as the author notes, but it doesn't scale. As soon as we start dealing with XML documents that have interesting content, we suddenly have to start thinking about HTML injection issues, entity quotation, and keeping those darn tags balanced all the time.
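To make the escaping problem concrete, here is a minimal sketch in the same string-pasting style. It assumes a current Racket rather than the PLT Scheme used above, and naive-mi is a hypothetical helper of mine, not something from the book:

```racket
#lang racket

;; naive-mi is a hypothetical helper: it pastes the label straight into
;; the markup, just like the printf calls above, with no escaping.
(define (naive-mi label)
  (format "<mathml:mi>~a</mathml:mi>" label))

(define ok (naive-mi "f(1)"))
(define broken (naive-mi "a < b & c"))

(printf "~a~n" ok)      ; <mathml:mi>f(1)</mathml:mi>
(printf "~a~n" broken)  ; <mathml:mi>a < b & c</mathml:mi>
```

The second result has a bare < and & in its text, so it is no longer well-formed XML, and nothing in the code warned us.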

Writing code that treats the XML as real structure is not much harder than the above:

(require (lib "" "xml")
         (lib ""))

(define (simple-2)
  ;; Build one row as an X-expression instead of a string.
  (define (make-row i low)
    `(mathml:mrow (mathml:mi ,(format "f(~a)" i))
                  (mathml:mo "=")
                  (mathml:mn ,(format "~a" low))))
  ;; Collect the ten rows into a list.
  (define (make-rows)
    (let loop ([i 1]
               [low 1]
               [high 1])
      (cond
        [(<= i 10) (cons (make-row i low)
                         (loop (add1 i) high (+ high low)))]
        [else empty])))
  (printf "~a" (xexpr->string
                `(mathml:math
                  ((xmlns:mathml ""))
                  ,@(make-rows)))))

The difference here is that all the niggling issues from the first example --- balancing tags, properly quoting values --- don't apply at all. The XML library takes care of this busywork, as it should. The output is guaranteed to be well-formed.
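Here's a small sketch of the library doing that escaping for us. It assumes a current Racket, where (require xml) provides xexpr->string, and make-mi is a hypothetical helper for illustration:

```racket
#lang racket
(require xml)

;; make-mi is a hypothetical helper: it builds the element as an
;; X-expression, so the label is data rather than markup.
(define (make-mi label)
  `(mathml:mi ,label))

(define escaped (xexpr->string (make-mi "a < b & c")))

(printf "~a~n" escaped)  ; <mathml:mi>a &lt; b &amp; c</mathml:mi>
```

The awkward characters are escaped on the way out, and we never had to think about it.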

Not everything has to be treated as a string. We shouldn't be afraid to play with structured data.
