0

I receive a large XML document and need to extract a few fields from it and return them. The problem is that the solutions I found for deserializing objects with Jackson were mostly 1-to-1 mappings or involved building a custom parser. My situation looks more or less like this:

XML

<a>
 <b>
   <c>val</c>
   <d x='val' z='val'><e>val</e><f>lot of irrelevant fields</f></d>
   <g>lot of irrelevant fields</g>
  <b>
<a>

and I'm interested only in the values of c, x, z and e, so recreating the entire structure in Java is a definite no-no. Implementing a custom parser also sounds like overkill. Is there a nicer solution, e.g. via annotations or something similar? I remember seeing a library some time ago that allowed this via annotations, but I'm now a bit restricted in terms of the libraries I can use.

3
  • You could build a minimal DTO and annotate the class with @JsonIgnoreProperties(ignoreUnknown = true), see baeldung.com/jackson-deserialize-json-unknown-properties Commented May 12, 2020 at 18:55
  • @MichaelKreutz this example is about JSON, while I'm parsing XML. Will that work? Do I need to replicate the nested structure in my DTO? It is slightly more complicated than in my example. Commented May 12, 2020 at 19:48
  • I did not try it out, but I think it should work for XML as well. You need to model the structure of the fields that you are interested in; all others you can omit. baeldung.com/jackson-xml-serialization-and-deserialization also uses @Json-prefixed annotations in combination with XML parsing... Commented May 12, 2020 at 19:55

2 Answers

1

The most obvious way is XPath, which ships with Java, so no extra libraries are needed. There are many ways to get what you want; here is a quick test I wrote:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XPathDemo {
    private static final String xmlString = "<a>\n" +
            " <b>\n" +
            "   <c>val</c>\n" +
            "   <d x=\"x-val\" z=\"z-val\"><e>e-val</e><f>lot of irrelevant fields</f></d>\n" +
            "   <g>lot of irrelevant fields</g>\n" +
            "  </b>\n" +
            "</a>";

    public static void main(String[] argv) throws IOException, SAXException, ParserConfigurationException, XPathExpressionException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document document = db.parse(new ByteArrayInputStream(xmlString.getBytes(StandardCharsets.UTF_8)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String c_value = (String) xpath.evaluate("/a/b/c/text()", document, XPathConstants.STRING);
        System.out.println( "value of c is \"" + c_value + "\"");

        String x_value = (String) xpath.evaluate("/a/b/d/@x", document, XPathConstants.STRING);
        System.out.println( "value of x is \"" + x_value + "\"");

        String z_value = (String) xpath.evaluate("/a/b/d/@z", document, XPathConstants.STRING);
        System.out.println( "value of z is \"" + z_value + "\"");

        String e_value = (String) xpath.evaluate("/a/b/d/e/text()", document, XPathConstants.STRING);
        System.out.println( "value of e is \"" + e_value + "\"");
    }
}

Output:

value of c is "val"
value of x is "x-val"
value of z is "z-val"
value of e is "e-val"

This is a super simple example. It gets harder when you have the same basic structure repeated many times. I'd read up on XPath syntax: it is very powerful, but getting exactly what you want can sometimes be a bit of a pain.
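For the repeated-structure case, a minimal sketch (not part of the test above) is to select every <b> as a node set and then evaluate relative expressions against each node. The class name RepeatedXPathDemo, the sample values and the "|"-joined output format are all made up for illustration; it assumes the same element layout as in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RepeatedXPathDemo {
    // Hypothetical input: same shape as the question, but with <b> repeated.
    static final String XML =
            "<a>"
            + "<b><c>c1</c><d x=\"x1\" z=\"z1\"><e>e1</e><f>ignored</f></d><g>ignored</g></b>"
            + "<b><c>c2</c><d x=\"x2\" z=\"z2\"><e>e2</e><f>ignored</f></d><g>ignored</g></b>"
            + "</a>";

    // Returns one "c|x|z|e" row per <b> element.
    static List<String> extract(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Select all <b> elements at once instead of a single string value.
        NodeList blocks = (NodeList) xpath.evaluate("/a/b", document, XPathConstants.NODESET);

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < blocks.getLength(); i++) {
            Node b = blocks.item(i);
            // Relative paths are evaluated with the current <b> as context node.
            String c = xpath.evaluate("c", b);
            String x = xpath.evaluate("d/@x", b);
            String z = xpath.evaluate("d/@z", b);
            String e = xpath.evaluate("d/e", b);
            rows.add(c + "|" + x + "|" + z + "|" + e);
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        extract(XML).forEach(System.out::println);
    }
}
```

The two-argument evaluate(expression, item) returns the string value directly, so no cast is needed for single values.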

There are a few caveats that you should know about:

  1. You need valid XML. What you posted is not (the closing tags for b and a are missing their slashes) and wouldn't parse.
  2. This will read the entire document into memory. That's fine if you have a few thousand lines. But if you've got a 10 GB document you may need another way.
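
For the very large document case, one option that is also included in Java is the StAX streaming API (javax.xml.stream), which reads the input event by event instead of building a DOM. The following is only a sketch under some assumptions: the class name StaxExtractDemo is made up, and it matches elements by local name alone (c, d, e) rather than by full path, which is only safe if those names do not occur elsewhere in the document.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxExtractDemo {
    // Same sample document as in the XPath test above.
    static final String XML =
            "<a>\n"
            + " <b>\n"
            + "   <c>val</c>\n"
            + "   <d x=\"x-val\" z=\"z-val\"><e>e-val</e><f>lot of irrelevant fields</f></d>\n"
            + "   <g>lot of irrelevant fields</g>\n"
            + " </b>\n"
            + "</a>";

    // Streams through the input and keeps only c, d/@x, d/@z and e.
    static Map<String, String> extract(Reader in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Map<String, String> values = new LinkedHashMap<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // skip text, end tags, comments, etc.
                }
                String name = reader.getLocalName();
                if (name.equals("c") || name.equals("e")) {
                    // getElementText() only works for text-only elements.
                    values.put(name, reader.getElementText());
                } else if (name.equals("d")) {
                    // Attributes are available while positioned on the start tag.
                    values.put("x", reader.getAttributeValue(null, "x"));
                    values.put("z", reader.getAttributeValue(null, "z"));
                }
                // Every other element (f, g, ...) is simply ignored.
            }
        } finally {
            reader.close();
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(new StringReader(XML)));
    }
}
```

In real use you would pass a reader or stream over the file instead of a StringReader, so only the current event is held in memory.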

0

You should look at the DSM library. It does exactly what you want.

https://github.com/mfatihercik/dsm

1 Comment

As I mentioned, I'm not interested in other libraries. I'm working in a very closed environment, so adding some random library is not a valid approach for me.
