Illegal character 0x1FFFF

$ perl -le 'use warnings; my $x=chr(0x1FFFF)' 
Unicode character 0x1ffff is illegal at -e line 1.

XML supports UTF-8, so I can check that a string is valid UTF-8 and, if it is, use it in XML. Right? No!!!

There are some "non-illegal" characters that are perfectly valid in UTF-8 (or even in plain old ASCII) but are invalid in XML. The most obvious one is 0x00. Here is what the W3C XML 1.0 specification says:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
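
The Char production above translates almost directly into a Perl character class. What follows is only a minimal sketch of that idea, not the XML::Char implementation: the function name is_valid_xml_chars is made up for illustration, and it assumes the input is already a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not the XML::Char API): returns 1 if every
# character of $string matches the XML 1.0 Char production, else 0.
# Assumes $string holds decoded characters, not raw UTF-8 bytes.
sub is_valid_xml_chars {
    my ($string) = @_;
    return 0 unless defined $string;

    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    return $string =~ m{
        \A
        [\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]*
        \z
    }x ? 1 : 0;
}

print is_valid_xml_chars("plain text\n") ? "valid" : "invalid", "\n";   # valid
print is_valid_xml_chars("nul\x{0}byte") ? "valid" : "invalid", "\n";   # invalid
```

The empty string counts as valid here because it contains no forbidden characters. Whether that is the right answer for your data is a design choice.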

I spent some time playing with it and the result is XML::Char->valid(). The dev version of Data::asXML is using it now. If you want, have a look at the test suite and try to break it. :-)
