0

I have a Web Service written in Java. I want to send some strings as an XML document, but these strings may contain characters that are illegal in XML. Currently I replace all of them with ?, create the XML, and send it over the network (to the Silverlight app). But sometimes all I get is question marks! So I want to encode these strings before I send them and decode them afterwards, so that the client gets the exact strings. The strings are UTF-8 encoded. I'm using something like this to create the XML:

try{
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

    //root elements
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("SearchResults");
    rootElement.setAttribute("count", Integer.toString(total));
    doc.appendChild(rootElement);

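    for(int i = 0; i < results.size(); i++)
    {
        Result res = results.get(i);
        //searchRes was not declared in the snippet as posted;
        //one element per result is assumed, and the name "Result" is a guess
        Element searchRes = doc.createElement("Result");
        rootElement.appendChild(searchRes);

        //title
        Element title = doc.createElement("Title");
        title.appendChild(doc.createTextNode(res.title));
        searchRes.appendChild(title);

        //...
    }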
    //write the content into xml file
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    DOMSource source = new DOMSource(doc);
    StringWriter sw = new StringWriter();
    StreamResult result =  new StreamResult(sw);
    transformer.transform(source, result);
    String ret = sw.toString();
    return ret;
}
catch(ParserConfigurationException pce){
    pce.printStackTrace();
}catch(TransformerException tfe){
    tfe.printStackTrace();
}
return null;

Thank you.

PS: Some people said they didn't understand my question, so maybe I didn't phrase it well. Let me clarify with an example. Suppose I have an array of items.
Each item has 3 strings.
These strings are UTF-8 strings (from many languages).
I want to send these strings to the client via a Web Service in Java.
The client is a Silverlight app.
There I get the XML, parse it, and use LINQ to extract the data that the app needs.
When I try to escape the characters, the parser in Silverlight throws an exception saying that there's an illegal character in the source (XML) string. After debugging I found out that there actually IS an illegal character, but I don't know how to produce an XML string that is guaranteed to be legal.

Edit: Thank you all for your support. I REALLY appreciate it.
I solved my problem.
Turns out somewhere in my code I was producing an illegal character and appending it to my result string.
The question still remains: how can I produce a legal XML file even though I'm feeding it some illegal characters? I solved my problem by eliminating the illegal character before producing the XML, so I still wonder what I would do if I actually wanted to send it over. But since my problem is solved and there are plenty of answers here, future readers have a head start on this problem.
I didn't have the time to try them all, but I'm sure they will help.
There are lots of helpful answers, so I cannot single out one of them as my specific answer.
But I have to choose one of them.
I sincerely thank all of the responses.

11
  • Just encode the characters correctly in the first place. A good approach is using the &#-construction. Commented Apr 15, 2011 at 19:22
  • @Thorbjorn (sorry, not an EU keyboard) - that's escaping, not encoding, and it won't help with characters like 0x01, which are not permitted under XML 1.0. Commented Apr 15, 2011 at 20:49
  • @Alireza - I notice that you're converting the output to a String and then presumably writing it to a stream. A better approach (because it avoids possible encoding bugs) is to pass that stream directly to the transformer. Commented Apr 15, 2011 at 20:55
  • @Anon : In my Web Method, I return this string (ret in the code above) as a result. I didn't get what exactly you said sorry :D Commented Apr 15, 2011 at 21:38
  • To debug this, I suggest checking the original strings to see if they contain illegal characters before you convert them to XML. If they don't, then the problem is how you write the string to the output. Commented Apr 15, 2011 at 22:12
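A rough sketch of the check suggested in the last comment, for anyone who wants to try it. It tests each code point against the Char production of XML 1.0 (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names are made up for illustration and are not part of the question's code.

```java
final class XmlCharCheck {
    /** True if the code point is allowed by the XML 1.0 "Char" production. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first illegal character, or -1 if there is none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns the bare surrogate value for an unpaired
            // surrogate, which falls outside every legal range above
            int cp = s.codePointAt(i);
            if (!isLegalXml10(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Calling `XmlCharCheck.firstIllegalIndex(res.title)` just before `createTextNode` should show whether the bad character is already in the input or only appears later, when the string is written out.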

5 Answers 5

3

If you're sending non-character data (i.e. binary data for example) in your XML, you might encode them using Base64. But I'm not sure I've understood your question correctly.

Maybe you just forgot to encode your XML in UTF-8, using transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8")
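In case an example helps, here is a minimal sketch of the Base64 idea. It assumes Java 8 or later, where `java.util.Base64` is part of the JDK (on older versions a library such as Apache Commons Codec would be needed instead). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64Text {
    /** Encodes the text as UTF-8 bytes, then as Base64, which uses only XML-safe ASCII. */
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(): Base64 back to bytes, then UTF-8 bytes back to text. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }
}
```

The server would then call something like `doc.createTextNode(Base64Text.encode(res.title))`, and the client has to decode each value after parsing. On the .NET side that is presumably `Convert.FromBase64String` followed by `Encoding.UTF8.GetString`. The trade-off is that the XML is no longer human-readable and grows by about a third.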


5 Comments

+1. No other form of XML escaping will let you have characters like '\0' in XML.
Thanks. These are not binary data (they're some strings clipped from web pages) and I don't know how to encode in Base64. Could you provide me a little tutorial or an example?
One more thing, using transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); didn't help.
Alireza, take a look at Apache Commons Codec (commons.apache.org/codec).
You can use the BCodec encode method, or something like it.
0

Not sure I understand your question, but maybe you should wrap the data in a CDATA section so that it's not parsed by the XML parser. Here is the documentation from MSDN.

1 Comment

CDATA does not permit "illegal" characters. Here is the documentation from the W3C: w3.org/TR/xml/#dt-cdsection
0

Wrap the content with <![CDATA[ and ]]>.

More info here: http://www.w3schools.com/xml/xml_cdata.asp

2 Comments

CDATA is a good approach when you don't want the content to be parsed (that is the tag's original purpose). But since he is building the XML from scratch for someone else to consume, a more recommended (and just as simple) way would be to escape the strings.
CDATA won't allow you to use "illegal" characters (such as 0x01, SOH). It exists so that you can use characters that would normally need escaping, like <. But even then, it's not particularly useful.
0

From experience I would recommend escaping / unescaping the XML. Take a look at StringEscapeUtils from Apache Commons Lang.

2 Comments

I tried it like this: desc.appendChild(doc.createTextNode(StringEscapeUtils.escapeXml(res.description))); but in the silverlight part, when I use this: XDocument xmlStories = XDocument.Parse(xmlContent); I get an exception saying that there's an illegal character in the XML!
Characters like '\0' are illegal in XML. There is no way to escape them (short of custom encoding - see JB Nizet's answer for using Base64).
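A side note on the escaping attempt in the first comment: as far as I know, the DOM plus Transformer route from the question already escapes markup characters such as < and & when it serializes a text node, so running the string through an escaper first just escapes the ampersands a second time. Here is a small sketch to try this out, using only JDK classes. The class name and the `serialize` helper are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class DomEscapingDemo {
    /** Puts the text into a single Title element and returns the serialized XML. */
    static String serialize(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element title = doc.createElement("Title");
            title.appendChild(doc.createTextNode(text));
            doc.appendChild(title);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter sw = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(sw));
            return sw.toString();
        } catch (Exception e) {
            // wrapped so that callers do not have to handle checked exceptions
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // raw text: the serializer escapes the markup characters itself
        System.out.println(serialize("a < b & c"));
        // text that was escaped beforehand: the ampersand gets escaped again
        System.out.println(serialize("a &lt; b"));
    }
}
```

None of this helps with characters like 0x01, of course. Those have to be removed or encoded (for example with Base64) before the text node is created.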
0

You should try StringEscapeUtils from Apache Commons Lang.
