Born to code, rock ...

Friday, May 29, 2009

Skipping Invalid XML Characters with FilterReader

While integrating with a legacy system through XML files, I ran into a strange fact: according to the XML 1.0 specification, some Unicode characters are not allowed in the content of an XML document.
Naturally, this legacy system produced such invalid characters, so we had no choice but to get rid of them one way or another.

The reason is that a standard Java XML parser throws an exception with a message like:
"An invalid XML character (Unicode: 0xXXXX) was found in the element content of the document".
Since our application does not need these symbols, we decided to just skip them.
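
For reference, the rule can be sketched as a small predicate. This is only an illustration: the class name Xml10Chars is made up for this example, and the ranges are the ones listed for the Char production in the XML 1.0 specification (http://www.w3.org/TR/xml).

```java
// Sketch of the XML 1.0 "Char" production. Xml10Chars is a hypothetical
// helper name, not part of any library.
final class Xml10Chars {

    private Xml10Chars() {
    }

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isValid(int codePoint) {
        return codePoint == 0x9
                || codePoint == 0xA
                || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isValid('a'));    // true
        System.out.println(isValid(0x2));    // false (START OF TEXT)
        System.out.println(isValid(0xFFFE)); // false
    }
}
```

Note that tab, line feed and carriage return are the only control characters below 0x20 that pass the check.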

Here is a sample file that contains the character START OF TEXT (Unicode: 0x2) inside the <invalid> element. The character is not printable, so it does not show up below.

<?xml version="1.0" encoding="UTF-8" ?>
<chars>
<valid>a</valid>
<invalid></invalid>
</chars>

Next is some simple Java code that shows the problem. It uses the XML streaming API (StAX).

package xmlchars;

import java.io.FileNotFoundException;
import java.io.FileReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlInvalidCharactersDemo {

    public static void main(String[] args) throws FileNotFoundException,
            XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        XMLStreamReader reader = inputFactory
                .createXMLStreamReader(new FileReader(
                        "resources/invalid-chars.xml"));
        // Walk through every event; the parser fails on the invalid char.
        while (reader.hasNext()) {
            reader.next();
        }
    }
}
This code just walks through the document and fails with an XMLStreamException (reported as a ParseError) when it tries to read the text content of the <invalid> tag.
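
If you want to see where the parser gives up instead of letting the exception escape from main, something like the following sketch could help. The class and method names (InvalidCharLocator, findParseFailure) are invented for this example, and the document is fed from a String rather than from the sample file; the exact wording of the message depends on the StAX implementation in use.

```java
import java.io.StringReader;

import javax.xml.stream.Location;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reports where a StAX parser rejects a document.
final class InvalidCharLocator {

    private InvalidCharLocator() {
    }

    /** Returns a description of the parse failure, or null if parsing succeeds. */
    static String findParseFailure(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    reader.next();
                }
            } finally {
                reader.close();
            }
            return null;
        } catch (XMLStreamException e) {
            // The location may be missing, so guard against null.
            Location where = e.getLocation();
            String position = (where == null)
                    ? "unknown position"
                    : "line " + where.getLineNumber()
                            + ", column " + where.getColumnNumber();
            return position + ": " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String xml = "<chars><valid>a</valid><invalid>"
                + (char) 0x2 + "</invalid></chars>";
        System.out.println(findParseFailure(xml));
    }
}
```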

How can we skip these ugly characters?
  1. One solution is to read the XML document into memory, remove all the nasty (restricted) characters, and then give the result to the parser. But in this case we process the document twice, which is not what I want.
  2. The other solution that came to mind was to extend the java.io.FilterReader class. With this we can skip the unwanted characters, escape them, or replace them as they are read.
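
For comparison, the first option could look roughly like the sketch below. It assumes the whole document is already in memory as a String; InMemoryXmlCleaner and its methods are names made up for this illustration.

```java
// Sketch of option 1: clean the whole document in memory before parsing.
// InMemoryXmlCleaner is a hypothetical name used only for this example.
final class InMemoryXmlCleaner {

    private InMemoryXmlCleaner() {
    }

    /** Returns a copy of the text without characters that XML 1.0 forbids. */
    static String stripInvalid(String text) {
        StringBuilder clean = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            // Work on code points so that surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            if (isValidXml10(codePoint)) {
                clean.appendCodePoint(codePoint);
            }
            i += Character.charCount(codePoint);
        }
        return clean.toString();
    }

    private static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    public static void main(String[] args) {
        String dirty = "<invalid>" + (char) 0x2 + "</invalid>";
        System.out.println(stripInvalid(dirty)); // <invalid></invalid>
    }
}
```

It is simple, but the full document is held in memory and scanned once before the parser reads it again.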
I wrote a class implementing the second approach. Here it is:


package xmlchars;

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// JDK-internal copy of a Xerces helper, not public API; it may not be
// accessible on newer JDKs.
import com.sun.org.apache.xerces.internal.util.XMLChar;

/**
 * {@link FilterReader} that skips characters which are invalid in XML 1.0.
 * Valid Unicode characters for XML 1.0 according to http://www.w3.org/TR/xml
 * are #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
 * [#x10000-#x10FFFF]. In other words: any Unicode character, excluding most
 * control characters below #x20, the surrogate blocks, FFFE, and FFFF.
 * <p>
 * Each char is checked on its own, so surrogate halves (and therefore
 * characters above #xFFFF) are skipped as well.
 *
 * @author tsachev
 *
 */
public class Xml10FilterReader extends FilterReader {

    /**
     * Creates a filter reader which skips invalid XML characters.
     *
     * @param in
     *            original reader
     */
    public Xml10FilterReader(Reader in) {
        super(in);
    }

    /**
     * {@link FilterReader#read()} goes straight to the wrapped reader, so it
     * has to be routed through the filtering method explicitly.
     */
    @Override
    public int read() throws IOException {
        char[] single = new char[1];
        return read(single, 0, 1) == -1 ? -1 : single[0];
    }

    /**
     * {@link Reader#read(char[])} delegates to this method. <br />
     * To skip invalid characters this method shifts the valid chars to the
     * left and returns the reduced count, so after the last valid character
     * there may be some unused chars in the buffer.
     *
     * @return Number of valid characters read, or <code>-1</code> if the end
     *         of the underlying reader was reached.
     */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int valid = 0;
        // Read again if a whole chunk was invalid, so callers never get 0.
        while (valid == 0) {
            int read = super.read(cbuf, off, len);
            // -1 means we have reached the end of the reader.
            if (read == -1) {
                return -1;
            }
            // pos is the index where the next valid char should be stored.
            int pos = off;
            for (int readPos = off; readPos < off + read; readPos++) {
                if (XMLChar.isValid(cbuf[readPos])) {
                    // If there is a gap, move the char left to close it.
                    if (pos < readPos) {
                        cbuf[pos] = cbuf[readPos];
                    }
                    pos++;
                }
            }
            valid = pos - off;
        }
        return valid;
    }

}


Note that this solution works with Readers (character streams) only. Yes, you can wrap a byte stream in java.io.InputStreamReader, but expect encoding problems if the charset you pass does not match the actual encoding of the document.
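
To make the wiring concrete, here is a self-contained sketch that decodes bytes with an explicit charset, filters them, and hands the result to StAX. It assumes the input really is UTF-8. SkippingReader is a simplified stand-in for the filter above (it does not use XMLChar and, like the filter above, drops surrogate halves), and all names in it are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class FilteredStaxDemo {

    /** Simplified stand-in for the filter above (hypothetical name). */
    static final class SkippingReader extends FilterReader {

        SkippingReader(Reader in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            char[] single = new char[1];
            return read(single, 0, 1) == -1 ? -1 : single[0];
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            while (true) {
                int n = super.read(cbuf, off, len);
                if (n == -1) {
                    return -1;
                }
                int kept = 0;
                for (int i = off; i < off + n; i++) {
                    if (isValid(cbuf[i])) {
                        cbuf[off + kept++] = cbuf[i];
                    }
                }
                if (kept > 0) {
                    return kept;
                }
                // Nothing valid in this chunk: read the next one.
            }
        }

        private static boolean isValid(char c) {
            return c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD);
        }
    }

    /** Parses the stream as UTF-8 XML and returns all character data. */
    static String readAllText(InputStream bytes) throws XMLStreamException {
        // Assumption: the document is UTF-8; pass the real charset otherwise.
        Reader reader = new SkippingReader(
                new InputStreamReader(bytes, StandardCharsets.UTF_8));
        XMLStreamReader xml =
                XMLInputFactory.newInstance().createXMLStreamReader(reader);
        StringBuilder text = new StringBuilder();
        try {
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(xml.getText());
                }
            }
        } finally {
            xml.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<chars><valid>a</valid><invalid>"
                + (char) 0x2 + "b</invalid></chars>";
        byte[] utf8 = doc.getBytes(StandardCharsets.UTF_8);
        System.out.println(readAllText(new ByteArrayInputStream(utf8)));
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which is what FileReader does in the demo at the top of the post.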

So that's how I worked around the legacy system's XML content, which we cannot change and which does not follow the standard.

You can download the Eclipse project with the source here.

Wednesday, May 27, 2009

How to post source in blogger

I created this blog to share various things, but mostly to post interesting and useful source code. For this purpose, I had to find a clear and easy way to post source code here.
The strange thing is that Google (as you can see, I'm using Blogger) does not provide any meaningful default solution.

I looked at different solutions, and the coolest one was "Source Code Highlighting - In Blogger!". It uses the Vim editor and works great for me.

However, I needed to make some modifications.
The first one is to enable XHTML with :let use_xhtml=1 before executing the :TOhtml command. The second one is to remove all the <br /> tags, since Blogger's rich editor already makes the new lines for me. This can be done with :1,$s/<br \/>//g

Then you can paste the source into the reserved placeholders.
The <pre> tag is needed to protect your code from the Compose editor, which will damage it if <pre> is missing.

Here is the result from a simple java file.


package source.in.blogger;

import java.util.List;

/**
 * Some comment is here.
 */
@SuppressWarnings("unused")
public class BloggerSource {
    // Simple comment
    public static void main(String[] args) {
        System.out.printf("I'm source code%n");
    }
}



It's not extremely easy, but it works. Try it out and have fun.
I'll surely use it for my next few posts, until I find something more useful.