Naturally, this legacy system produced such invalid characters, so we had no choice but to get rid of them one way or another.
This is because a standard Java XML parser throws an exception with a message like:
"An invalid XML character (Unicode: 0xXXXX) was found in the element content of the document".
And since our application does not need these symbols, we decided to simply skip them.
Here is a sample file that contains the symbol START OF TEXT (Unicode: 0x2). The character is not printable; it sits inside the invalid element.
<?xml version="1.0" encoding="UTF-8" ?>
<chars>
    <valid>a</valid>
    <invalid></invalid>
</chars>
And next is some simple Java code that shows the problem. It uses the XML streaming API (aka StAX).
package xmlchars;

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlInvalidCharactersDemo {

    public static void main(String[] args) throws IOException, XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // FileReader decodes the file with the platform default charset.
        try (Reader file = new FileReader("resources/invalid-chars.xml")) {
            XMLStreamReader reader = inputFactory.createXMLStreamReader(file);
            // Walk through every parse event; next() fails on the invalid character.
            while (reader.hasNext()) {
                reader.next();
            }
        }
    }
}
This code just walks through the document and throws an XMLStreamException (reported as a ParseError by the JDK's parser) when it tries to read the text content of the invalid element.
How can we skip these ugly chars?
- One solution is to read the XML document into memory, remove all the nasty (restricted) chars, and then give the result to the parser. But in this case we go through the document twice, which is not what I want.
- The other solution that came to my mind was to extend the java.io.FilterReader class. With it we can skip the unwanted characters, or escape or replace them.
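For comparison, the first option in the list above could be sketched roughly like this. It is only an illustration and assumes the whole document fits in memory as a String; the class InMemoryCleanup and its methods are made-up names for this example, not part of the project.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class InMemoryCleanup {

    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** First pass: copy the document, leaving out every invalid code point. */
    static String stripInvalidXml10(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            int cp = xml.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    /** Second pass: parse the cleaned copy and collect its text content. */
    static String textOf(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(stripInvalidXml10(xml)));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                text.append(reader.getText());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String dirty = "<chars><valid>a</valid><invalid>\u0002</invalid></chars>";
        System.out.println(textOf(dirty)); // prints "a"
    }
}
```

The cost is visible in the two passes: the document is copied once for cleaning and then read again by the parser. The FilterReader from the second option avoids that: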
package xmlchars;

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

/**
 * {@link FilterReader} to skip invalid xml version 1.0 characters. Valid
 * Unicode chars for xml version 1.0 according to http://www.w3.org/TR/xml are
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. In
 * other words - any Unicode character, excluding the surrogate blocks, FFFE,
 * and FFFF.
 *
 * @author tsachev
 */
public class Xml10FilterReader extends FilterReader {

    /**
     * Creates filter reader which skips invalid xml characters.
     *
     * @param in original reader
     */
    public Xml10FilterReader(Reader in) {
        super(in);
    }

    /**
     * {@link FilterReader#read()} goes straight to the wrapped reader, so the
     * single-char read has to be routed through the filtering method as well.
     */
    @Override
    public int read() throws IOException {
        char[] one = new char[1];
        return read(one, 0, 1) == -1 ? -1 : one[0];
    }

    /**
     * The array-based overloads of {@link Reader} delegate to this method.
     * <br />
     * To skip invalid characters this method shifts only valid chars to the
     * left and returns a decreased value of the original read method. So after
     * the last valid character there will be some unused chars in the buffer.
     *
     * @return Number of read valid characters or <code>-1</code> if the end of
     *         the underlying reader was reached.
     */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int valid;
        do {
            int read = super.read(cbuf, off, len);
            /*
             * If read chars are -1 then we have reached the end of the reader.
             */
            if (read == -1) {
                return -1;
            }
            /*
             * off + valid is the index where the next valid char is moved to,
             * closing the gaps left by invalid characters.
             */
            valid = 0;
            for (int readPos = off; readPos < off + read; readPos++) {
                if (isValid(cbuf[readPos])) {
                    cbuf[off + valid++] = cbuf[readPos];
                }
            }
            /*
             * Read again if the whole chunk was invalid, so that callers never
             * get 0 for a non-empty request.
             */
        } while (valid == 0);
        /*
         * Number of read valid characters.
         */
        return valid;
    }

    /**
     * Stands in for the JDK-internal XMLChar.isValid, which is not accessible
     * on modular JDKs (Java 9 and later). Surrogate chars are rejected, so
     * characters above #xFFFF (surrogate pairs) are dropped as well.
     */
    private static boolean isValid(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }
}
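Note that this solution works with Readers (aka character streams) only. Yes, you can wrap an InputStream in java.io.InputStreamReader, but then the parser only sees already decoded characters and can no longer detect the encoding from the bytes, so you have to pass the right charset yourself. Otherwise expect encoding problems.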
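Here is a rough sketch of how the filter can be plugged in between an InputStream and the StAX parser with an explicit charset. To keep the example self-contained it carries its own condensed filter (SkippingReader) instead of Xml10FilterReader; FilteredParsingDemo, SkippingReader and textOf are names made up for this illustration, and UTF-8 is assumed because that is what the sample file declares.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class FilteredParsingDemo {

    /** Condensed stand-in for Xml10FilterReader (hypothetical name, BMP chars only). */
    static class SkippingReader extends FilterReader {
        SkippingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            char[] one = new char[1];
            return read(one, 0, 1) == -1 ? -1 : one[0];
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int kept;
            do {
                int n = super.read(cbuf, off, len);
                if (n == -1) {
                    return -1;
                }
                kept = 0;
                for (int i = off; i < off + n; i++) {
                    char c = cbuf[i];
                    boolean valid = c == 0x9 || c == 0xA || c == 0xD
                            || (c >= 0x20 && c <= 0xD7FF)
                            || (c >= 0xE000 && c <= 0xFFFD);
                    if (valid) {
                        cbuf[off + kept++] = c; // shift valid chars left
                    }
                }
            } while (kept == 0); // the whole chunk was invalid, read more
            return kept;
        }
    }

    /** Decodes the bytes as UTF-8, filters them, and collects the text content. */
    static String textOf(InputStream bytes) throws IOException, XMLStreamException {
        try (Reader filtered = new SkippingReader(
                new InputStreamReader(bytes, StandardCharsets.UTF_8))) {
            XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(filtered);
            StringBuilder text = new StringBuilder();
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(xml.getText());
                }
            }
            xml.close();
            return text.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<chars><valid>a</valid><invalid>\u0002</invalid></chars>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(textOf(new ByteArrayInputStream(doc))); // prints "a"
    }
}
```

With a real file, the ByteArrayInputStream would simply be replaced by a FileInputStream; the charset passed to InputStreamReader has to match the actual encoding of the file.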
So that's how I tricked the legacy system's XML content, which we cannot change and which does not follow the standards.
You can download the Eclipse project with the source here.