3

Can you help me parse XML that contains a nested <?xml version="1.0" encoding="utf-8"?> declaration? When I try to parse this XML, I get a parsing error.

<?xml version="1.0" encoding="utf-8"?>      
<soap>
            <soapenvBody>
                <serviceResponse>
                    <?xml version="1.0" encoding="UTF-8"?>
                    <data>
                        <respCode>0</respCode>
                    </data>
                </serviceResponse>
            </soapenvBody>
        </soap>  
4
  • There is no easy way to parse that, since it isn't valid XML. Seeing that it is a SOAP response, I wonder which service is sending it to you. Wouldn't it be better to ask whether they can fix the service (or to fix it yourself, if you have access)? Commented Aug 6, 2012 at 7:19
  • You can try to pre-process the stream, remove the invalid part (e.g. with a regular-expression replacement), and then parse it with a regular XML parser. I also think you might be able to parse it using a SAX parser. Commented Aug 6, 2012 at 7:21
  • I've seen this a few times in SOAP responses, i.e. a response within a response. If you can HTML-encode the inner response before you parse it, so that it becomes something like &lt;serviceResponse&gt; and so on, that is the way forward. Commented Aug 6, 2012 at 7:33
  • You are not trying to parse XML with nested XML declarations, because XML cannot contain nested XML declarations. Rather, you are trying to parse non-XML input. So you will need a non-XML parser. It would be better to persuade the supplier of these files to generate proper well-formed XML. Commented Aug 6, 2012 at 11:56

4 Answers 4

3

I don't think this is really a Java problem. A second XML declaration inside the XML body is simply illegal, so I don't think you'll get any XML parser to accept it. If you have control over the XML (it looks like you're generating it to store a response), you could try wrapping the inner XML document in a CDATA section:

<?xml version="1.0" encoding="utf-8"?>     
<soap>
    <soapenvBody>
        <serviceResponse>
          <![CDATA[
              <?xml version="1.0" encoding="UTF-8"?>
              <data>
                  <respCode>0</respCode>
              </data>
          ]]>
        </serviceResponse>
    </soapenvBody>
</soap>
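
With the CDATA wrapper in place, the inner document arrives as plain text inside serviceResponse and can be parsed in a second pass. The following is a minimal sketch using only the JDK's DOM parser; the class name CdataPayloadReader and its methods are made up for illustration. Note the trim() call: the inner XML declaration is only legal at the very start of the text being parsed, so the whitespace that follows the opening of the CDATA section has to go first.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; element names are taken from the sample above.
class CdataPayloadReader {

    /** Parses a well-formed XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Reads the CDATA payload of serviceResponse, parses it, and returns respCode. */
    static String innerRespCode(String outerXml) {
        Document outer = parse(outerXml);
        // getTextContent() returns the CDATA text, including surrounding whitespace
        String payload = outer.getElementsByTagName("serviceResponse")
                .item(0).getTextContent();
        // The declaration must be the first thing in the inner document, hence trim()
        Document inner = parse(payload.trim());
        return inner.getElementsByTagName("respCode").item(0).getTextContent();
    }
}
```

Calling CdataPayloadReader.innerRespCode(xml) on the wrapped sample should yield the text of respCode. The sketch assumes both elements are present; real code would check for a missing node before calling item(0).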

EDIT:

I'm thinking that you most likely don't want the extra XML declaration inside that response at all. Do you have control over the code that creates the response? My guess is that the XML snippet <data>...</data> is created as a separate DOM object and then the string is spliced in the middle of the response. Writing out the entire XML document object results in the XML declaration being included, but if you just grab the document root node object (<data>) and write that out as a string then it probably won't include the extra XML declaration that's causing you all this trouble.
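
If the response is built in Java, the idea of writing out only the root element can be sketched with the JDK's javax.xml.transform API. This is an illustration under assumptions, not the service's actual code: the class FragmentSerializer and its methods are hypothetical, and it assumes the inner payload is available as a DOM Document. Setting OutputKeys.OMIT_XML_DECLARATION makes the omission explicit instead of relying on the serializer's default behaviour.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of the original service.
class FragmentSerializer {

    /** Parses a well-formed XML string into a DOM document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    /** Serializes only the root element, with no XML declaration in front. */
    static String toFragment(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Explicitly suppress the <?xml ...?> line in the output
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc.getDocumentElement()), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }
}
```

The string returned by toFragment can then be spliced into the outer response without introducing a second declaration.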


2 Comments

Thanks for the reply. Actually, I do not have control over the XML response, and I know it is not valid XML. So I chose the XML preprocessing option: I selected only the inner XML, parsed it with a SAX parser, and was able to retrieve the data I wanted :). Thanks again.
I thought of a new solution after some work I did this afternoon, but since it's completely unrelated to my remarks here I decided to post it as a separate answer.
2

It occurred to me that a parser made for dealing with HTML might be able to do what you want. Since HTML tends to be a total mess compared to strict XML, HTML parsers are usually much more error-tolerant. A quick search turned up jsoup. I was able to pull the respCode from your sample XML above with roughly this code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

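String data = "your xml goes here";
// jsoup's HTML parser is lenient, so the nested XML declaration does not abort parsing
Document doc = Jsoup.parse(data);
String respCodeRaw = doc.select("respCode").first().text();
// parseInt returns a primitive int directly, avoiding the boxing done by Integer.valueOf
int respCode = Integer.parseInt(respCodeRaw);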

(I actually tested the library in the Clojure repl, but the code above should work!)

Comments

0

A construct that starts with <? is normally a processing instruction. <?xml ... ?> is the XML declaration, which may appear only at the very beginning of the XML content. It is not allowed in the XML body.

Why does your soap body contain this? Do you have the option of removing it?

1 Comment

Thanks for replying. I do not have control over the XML I am receiving, so I preprocessed the XML and then parsed it.
0

I did not find any parser in Java that can parse such embedded XML, since it is not valid XML, and I guess almost all parsers check that the XML is well-formed before processing it. So I chose to preprocess the XML: I selected the inner XML, parsed it with a SAX parser, and retrieved the values from it. Thanks for your replies, guys.
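
For readers who want to try the preprocessing route, here is a minimal sketch using only the JDK. It is not the exact code used above: instead of cutting out the inner document, it takes the variation suggested in the comments and strips every XML declaration with a regular expression, then reads one element with SAX. The class NestedDeclarationCleaner and its method names are made up for illustration, and the sketch assumes the input is already a decoded String, so dropping the encoding attribute loses nothing.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class NestedDeclarationCleaner {

    // Matches an XML declaration such as <?xml version="1.0" encoding="UTF-8"?>.
    // The required whitespace after "xml" leaves other processing instructions alone.
    private static final Pattern XML_DECL = Pattern.compile("<\\?xml\\s[^?]*\\?>");

    /** Removes every XML declaration, including the leading one (it is optional). */
    static String stripDeclarations(String raw) {
        return XML_DECL.matcher(raw).replaceAll("").trim();
    }

    /** Returns the text of the first element with the given name, or null if absent. */
    static String firstElementText(String xml, String elementName) {
        StringBuilder text = new StringBuilder();
        boolean[] state = new boolean[2]; // [0] = inside target element, [1] = finished
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                if (!state[1] && qName.equals(elementName)) state[0] = true;
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (state[0]) text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if (state[0] && qName.equals(elementName)) {
                    state[0] = false;
                    state[1] = true;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is still not well-formed XML", e);
        }
        return state[1] ? text.toString() : null;
    }
}
```

Typical use would be firstElementText(stripDeclarations(response), "respCode"). A regular expression is a blunt tool here: it would also remove a declaration that appears inside a CDATA section or a comment, so it only suits input shaped like the sample in the question.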

1 Comment

It's been a week since your original post, so you've probably already moved on from this, but if you're still interested in parsing without the preprocessing, you should look at my new answer about using jsoup.
