Born to code, rock ...

Friday, May 29, 2009

Skipping Invalid XML Characters with FilterReader

While integrating with a legacy system through XML files I ran into a strange fact. It turns out that, according to the XML 1.0 specification, some Unicode characters are not allowed in the content of an XML document.
Naturally, this legacy system produced such invalid characters, so we had no choice but to get rid of them one way or another.

A standard Java XML parser rejects such input with an exception whose message looks like:
"An invalid XML character (Unicode: 0xXXXX) was found in the element content of the document".
And since our application does not need these symbols, we decided just to skip 'em.

Here is a sample file that contains the START OF TEXT character (Unicode: 0x2) inside the invalid element:

<?xml version="1.0" encoding="UTF-8" ?>
<chars>
<valid>a</valid>
<invalid></invalid>
</chars>

Next is some simple Java code that shows the problem. It uses the XML streaming API (aka StAX).

package xmlchars;

import java.io.FileNotFoundException;
import java.io.FileReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlInvalidCharactersDemo {

    public static void main(String[] args) throws FileNotFoundException,
            XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new FileReader(
                        "resources/invalid-chars.xml"));
        while (reader.hasNext()) {
            reader.next();
        }
    }
}
This code just walks through the document and throws an XMLStreamException (reported as a ParseError) when it tries to read the text content of the invalid element.
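To reproduce the failure without a file on disk, the same loop can be pointed at an in-memory string. This is only a sketch: the class name InvalidCharProbe and the helper parseError are made up for illustration, and the exact wording of the message depends on the StAX implementation on the classpath.

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original project.
class InvalidCharProbe {

    /**
     * Walks through the whole document. Returns null if it parses cleanly,
     * otherwise the message of the exception thrown by the parser.
     */
    public static String parseError(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return null;
        } catch (XMLStreamException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // A document without restricted characters parses fine.
        System.out.println(parseError("<chars><valid>a</valid></chars>"));
        // U+0002 in element content makes the parser give up.
        System.out.println(
                parseError("<chars><invalid>\u0002</invalid></chars>"));
    }
}
```

The first call should print null, the second one the parser's complaint about the invalid character.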

How can we skip these ugly chars?
  1. One solution is to read the XML document into memory, remove all the nasty (restricted) chars, and then hand the result to the parser. But in this case we process the document twice, which is not what I want.
  2. The other solution that came to mind was to extend the java.io.FilterReader class. With it we can skip the unwanted characters, or escape or replace them, while the parser is reading.
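For comparison, the first option could look roughly like the sketch below, assuming the whole document already sits in a String. The class name XmlCharStripper and the method strip are hypothetical; the regular expression is simply the negation of the character ranges the XML 1.0 specification allows.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the in-memory approach (option 1).
class XmlCharStripper {

    // Matches every code point that is NOT in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD"
            + "\\x{10000}-\\x{10FFFF}]");

    /** Returns a copy of the input with all restricted characters removed. */
    public static String strip(String xml) {
        return INVALID_XML10.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<invalid>a\u0002b</invalid>"));
    }
}
```

It is short, but the whole document has to fit in memory and is scanned once by the regex and once more by the parser, which is exactly the drawback mentioned in the first item.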
I wrote a class implementing the second approach. Here it is:


package xmlchars;

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

import com.sun.org.apache.xerces.internal.util.XMLChar;

/**
 * {@link FilterReader} that skips characters which are invalid in XML 1.0.
 * The valid Unicode chars for XML version 1.0 according to
 * http://www.w3.org/TR/xml are
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
 * In other words: tab, line feed, carriage return and any Unicode character
 * from #x20 up, excluding the surrogate blocks, FFFE, and FFFF.
 *
 * @author tsachev
 *
 */
public class Xml10FilterReader extends FilterReader {

    /**
     * Creates a filter reader which skips invalid XML characters.
     *
     * @param in
     *            original reader
     */
    public Xml10FilterReader(Reader in) {
        super(in);
    }

    /**
     * {@link FilterReader#read()} forwards directly to the wrapped reader,
     * so the single-character read is routed through the filtering method
     * explicitly.
     */
    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    /**
     * The array-based overloads inherited from {@link Reader} delegate to
     * this method. <br />
     * To skip invalid characters this method shifts the valid chars to the
     * left and returns a count smaller than the one reported by the original
     * read method. So after the last valid character there may be some
     * unused chars in the buffer.
     *
     * @return Number of valid characters read, or <code>-1</code> if the end
     *         of the underlying reader was reached.
     */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int valid;
        do {
            int read = super.read(cbuf, off, len);
            /*
             * If the result is -1 we have reached the end of the reader.
             */
            if (read == -1) {
                return -1;
            }
            /*
             * pos is the index where the next valid char should be stored;
             * it falls behind readPos as soon as an invalid char is skipped.
             */
            int pos = off;
            for (int readPos = off; readPos < off + read; readPos++) {
                if (XMLChar.isValid(cbuf[readPos])) {
                    cbuf[pos++] = cbuf[readPos];
                }
            }
            valid = pos - off;
            /*
             * If the whole chunk was invalid, read again instead of
             * returning 0, which callers do not expect for len > 0.
             */
        } while (valid == 0);
        return valid;
    }

}
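The class above relies on com.sun.org.apache.xerces.internal.util.XMLChar, which is an internal JDK class and, as far as I know, is not accessible by default on Java 9 and later. Below is a sketch of a variant that carries its own check and shows how the filter is plugged into StAX. The names PortableXml10FilterReader, isAllowed and textOf are made up for this example. One deliberate assumption: surrogate chars are passed through untouched so that characters above #xFFFF survive, and it is left to the parser to complain about unpaired ones.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical variant of the filter without the internal Xerces dependency.
class PortableXml10FilterReader extends FilterReader {

    public PortableXml10FilterReader(Reader in) {
        super(in);
    }

    /** XML 1.0 check for one UTF-16 unit; surrogates are let through. */
    public static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c) // assumption: properly paired
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();
        } while (c != -1 && !isAllowed((char) c));
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int kept;
        do {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            kept = 0;
            // Compact the allowed chars to the front of the chunk.
            for (int i = off; i < off + n; i++) {
                if (isAllowed(cbuf[i])) {
                    cbuf[off + kept++] = cbuf[i];
                }
            }
        } while (kept == 0); // chunk had only invalid chars: read again
        return kept;
    }

    /** Usage example: collects all text content of a filtered document. */
    public static String textOf(String xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(
                            new PortableXml10FilterReader(new StringReader(xml)));
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                textOf("<chars><valid>a</valid><invalid>b\u0002c</invalid></chars>"));
    }
}
```

With the filter in place the parser never sees the 0x2 character, so the sample document is read to the end without an exception.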


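Note that this solution works with Readers (aka character streams) only. Yes, you can wrap an InputStream in a java.io.InputStreamReader, but then you choose the character encoding yourself instead of letting the parser detect it from the XML declaration, so expect encoding problems if the choice is wrong.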
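If the input arrives as bytes, one way to keep the encoding under control is to name the charset explicitly when creating the reader. The sketch below assumes the legacy files are known to be UTF-8; the class name ExplicitCharsetReader and its helpers are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing an explicit charset on the byte-to-char bridge.
class ExplicitCharsetReader {

    /** Opens a character stream over the bytes using the given charset. */
    public static Reader open(InputStream in, Charset charset) {
        // The result can then be wrapped in the filter reader shown earlier.
        return new InputStreamReader(in, charset);
    }

    /** Reads everything from the reader into a String. */
    public static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "<v>\u00e9</v>".getBytes(StandardCharsets.UTF_8);
        // Decoded with the right charset the text survives ...
        System.out.println(readAll(
                open(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)));
        // ... with the wrong one the non-ASCII character is mangled.
        System.out.println(readAll(
                open(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)));
    }
}
```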

So that's how I tricked the legacy system's XML content, which we cannot change and which does not follow the standard.

You can download the Eclipse project with the source here.

9 comments:

  1. Few months later this post is still useful :) It helped to solve an annoying xml issue ;)

  2. This is really helpful.. thanks!

  3. great fix! was doing exports from some nasty source files. this worked on the first pass. thumbs up making stax a bit easier to roll with

  4. Too bad that XML isn't character data, it's binary data that just happen to be human readable so using Readers is a very bad idea.

  5. Many thanks for posting this code. Useful little class.

  6. The data I'm parsing also contains the control char 0x2. I tried using this solution, but when I try to get the value inside a CDATA section using the getText method of XMLStreamReader, it fails.

  7. Nice and clean way of doing this.. Many thanks..

  8. This will handle individual chars, but I've found that certain applications (Microsoft ones) allow escaped invalid chars (e.g. as character references), in which case I came up with a solution like this:
    http://stackoverflow.com/questions/2897085/filtering-illegal-xml-characters-in-java
    Shefali, this works with StAX, so it should work for you.

    Cheers

  9. Thank you very much for the explanation..
