How to parse the full content of an XML tag in Java

Question

I have some kind of complex XML data structure. The structure contains different fragments like in the following example:

<data>
  <content-part-1>
   <h1>Hello <strong>World</strong>. This is some text.</h1>
   <h2>.....</h2>
  </content-part-1>
  ....
</data>

The h1 tag within the tag 'content-part-1' is of interest. I want to get the full content of the xml tag 'h1'.

In Java I used javax.xml.parsers.DocumentBuilder and tried something like this:

String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>"; 
// parse h1 tag..
DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
Node node = doc.importNode(doc.getDocumentElement(), true);
if (node != null && node.getNodeName().equals("h1")) {
    return node.getTextContent();
}

But the method 'getTextContent()' will return:

Hello World. This is some text.

The "strong" tag is dropped from the result, which is the documented behavior of getTextContent(): it returns only the concatenated text of the node and its descendants.

My question is: how can I extract the full content of a single XML node within an org.w3c.dom.Document without any further parsing of the node content?

That's a pretty unusual requirement. Please edit the question and explain why you believe this is a useful thing to do.
– kimbert, Commented Feb 3, 2020 at 17:34
I edited my question. I don't think the requirement is so exotic. But maybe the javax.xml.parsers.DocumentBuilder approach is the wrong one. It seems easier to parse the XML fragment manually with a regex...
– Ralph, Commented Feb 3, 2020 at 18:45

Daniil · Accepted Answer · 2020-02-05 13:02:44Z

2

Although the Java DOM parser can handle mixed content, in this particular case it may be more convenient to use the Jsoup library. With Jsoup, the code to extract the content of the h1 element looks like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String text = "<data>\n"
+ "  <content-part1>\n"
+ "   <h1>Hello <strong>World</strong>. This is some text.</h1>\n"
+ "   <h2></h2>\n"
+ "  </content-part1>\n"
+ "</data>";

Document doc = Jsoup.parse(text);

Elements h1Elements = doc.select("h1");

for (Element h1 : h1Elements) {
    System.out.println(h1.html());
}

Output in this case will be "Hello <strong>World</strong>. This is some text."
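
To illustrate the mixed-content point with the JDK alone: a DOM parser keeps h1 as a sequence of text and element child nodes, and getTextContent() only concatenates the text parts. The following is a minimal sketch, not part of the original answer; the class name MixedContentDemo and the method describeChildren are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class: lists the direct children of the root element.
final class MixedContentDemo {

    // Returns one entry per direct child: TEXT[...] for text nodes, ELEMENT[...] for elements.
    static List<String> describeChildren(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        List<String> children = new ArrayList<>();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                children.add("TEXT[" + child.getNodeValue() + "]");
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                children.add("ELEMENT[" + child.getNodeName() + "]");
            }
        }
        return children;
    }

    public static void main(String[] args) {
        // prints [TEXT[Hello ], ELEMENT[strong], TEXT[. This is some text.]]
        System.out.println(describeChildren("<h1>Hello <strong>World</strong>. This is some text.</h1>"));
    }
}
```

The strong element is still present in the tree, so the markup is not lost by parsing; it only disappears when the tree is flattened to text.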


1 Comment

Jonathan Hedley Over a year ago

Please make sure you specify to Jsoup that you want to use the XML parser. Otherwise it will default to HTML, and because of the tree parser, you may get unexpected results. Jsoup.parse(html, "", Parser.xmlParser());

y_ug · Answer · 2020-02-03 18:11:42Z

0

What you probably want is XML generation from some subnode of your document. So, with a slightly modified nodeToString from an earlier answer to a similar question, I can propose generating the text <h1>Hello <strong>World</strong>. This is some text.</h1>. Some extra effort might be needed to get rid of <h1> and </h1>.

package com.github.vtitov.test;

import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import java.io.StringReader;
import java.io.StringWriter;

import static org.hamcrest.MatcherAssert.*;
import static org.hamcrest.Matchers.*;


public class XmlTest {

    @Test
    public void buildXml() throws Exception {
        String my_content="<h1>Hello <strong>World</strong>. This is some text.</h1>";
        // parse h1 tag..
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = documentBuilder.parse(new InputSource(new StringReader(my_content)));
        Node node = doc.importNode(doc.getDocumentElement(), true);
        String h1Content = null;
        if (node != null && node.getNodeName().equals("h1")) {
            h1Content = nodeToString(node);
        }
        assertThat("h1", h1Content, equalTo("<h1>Hello <strong>World</strong>. This is some text.</h1>"));
    }

    private static String nodeToString(Node node) throws TransformerException {
        StringWriter sw = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.setOutputProperty(OutputKeys.INDENT, "no");
        t.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }
}
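
One possible way to get rid of the outer <h1> and </h1> is to serialize each child of the element instead of the element itself. The following is a sketch under that assumption, not part of the original answer; the class InnerXml and its methods parseRoot and innerXml are hypothetical names, and it relies on the JDK's identity Transformer accepting text nodes as a DOMSource.

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
final class InnerXml {

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Serializes every child of 'parent' in document order,
    // so the parent's own start and end tags are left out.
    static String innerXml(Node parent) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter sw = new StringWriter();
            for (Node child = parent.getFirstChild(); child != null; child = child.getNextSibling()) {
                t.transform(new DOMSource(child), new StreamResult(sw));
            }
            return sw.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    public static void main(String[] args) {
        Element h1 = parseRoot("<h1>Hello <strong>World</strong>. This is some text.</h1>");
        System.out.println(innerXml(h1));
    }
}
```

Because the serializer re-escapes text, characters such as & or < inside the content come back as entities, so the result stays a valid XML fragment rather than a byte-for-byte copy of the input.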


3 Comments

Ralph Over a year ago

This is close to a solution. But now I need a way to remove the h1 tag. It seems that using a DocumentBuilder is in general the wrong approach.

y_ug Over a year ago

I believe the DOM model isn't appropriate here, as you want to create non-XML from XML. It is probably worth using SAX or StAX.

kimbert Over a year ago

I think your approach is probably wrong. You should parse the HTML, modify the resulting DOM tree and then serialize the modified DOM tree as HTML. I understand that not all HTML is valid XML, but parse-modify-serialize is the way to go.
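
The StAX suggestion from y_ug's comment could look roughly like the sketch below. It is an illustration only, with deliberate simplifications: the class StaxInnerXml and its methods are hypothetical names, namespace prefixes, comments and processing instructions are ignored, and the markup is rebuilt from parser events rather than copied from the source text.

```java
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class illustrating the StAX idea from the comments.
final class StaxInnerXml {

    // Returns the rebuilt inner markup of the first element named 'tagName',
    // or null if no such element exists.
    static String innerXml(String xml, String tagName) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // 0 = outside the target element, 1 = directly inside it
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (depth == 0) {
                        if (reader.getLocalName().equals(tagName)) {
                            depth = 1; // entered the target; its own tag is not emitted
                        }
                    } else {
                        depth++;
                        out.append('<').append(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            out.append(' ').append(reader.getAttributeLocalName(i))
                               .append("=\"").append(escape(reader.getAttributeValue(i))).append('"');
                        }
                        out.append('>');
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (depth > 0) {
                        depth--;
                        if (depth == 0) {
                            return out.toString(); // left the target element
                        }
                        out.append("</").append(reader.getLocalName()).append('>');
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    if (depth > 0) {
                        out.append(escape(reader.getText()));
                    }
                    break;
                default:
                    break; // comments, processing instructions etc. are skipped
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not read XML", e);
        }
        return null;
    }

    // Re-escapes the characters that must not appear raw in XML text or attribute values.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String xml = "<data><content-part-1><h1>Hello <strong>World</strong>. This is some text.</h1>"
                + "<h2>x</h2></content-part-1></data>";
        System.out.println(innerXml(xml, "h1"));
    }
}
```

Compared with the DOM variants, this reads the document once and stops as soon as the target element closes, which may matter for large inputs.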
