I have an XML file like the one below, and I need to generate a .txt file containing the plain text of each <source> tag, one per line, using Java.

I read that I could use SAX to access the different elements, but I don't think that works in this case, where arbitrary tags can appear inside <source>, as in the example below.

What is the best approach to do this? Regex perhaps?

<?xml version="1.0" encoding="utf-8"?>
[...]
<source>
  <g id="_0">
    <g id="_1">First valid sentence</g>
  </g>
</source>
<source>Another valid string</source>

The output results.txt should be something like this:

First valid sentence
Another valid string
  • Edited. Sorry, I hadn't read the part about the random tags. I would treat the whole document as a string and try to extract the identifiers of the 'random' tags first. Commented Jun 30, 2015 at 14:53
  • Well with SAX you just wait for your start tag, turn on a flag, and then collect all the characters you see until you see the closing tag. Just ignore the start and end events for the inner tags. Commented Jun 30, 2015 at 14:59
  • @JPMoresmau so in your solution I would still need to use regex in order to discard the <g> tags (example above), if present. Right? Wouldn't it be easier to consider the whole XML as a string and apply regex, as Slow Trout suggests? Commented Jun 30, 2015 at 15:05
  • Well no, the SAX events will tell you "I'm starting the source tag", and that's your cue to start collecting text. Then you'll get a SAX event telling you "I'm starting a g tag", which doesn't change anything. After that you get some text, which you collect. When you see the event "close tag source", you stop collecting the text. Commented Jun 30, 2015 at 15:06
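The flag-based SAX approach described in these comments could look roughly like the sketch below. It uses only the JDK's built-in SAX parser. The class name `SourceTextExtractor` and the helper `extract` are illustrative, not from the thread, and the sketch assumes that <source> elements are never nested inside each other and that the parser is not namespace-aware (the JDK default).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SourceTextExtractor {

    // Hypothetical helper: returns the trimmed text of every <source> element.
    // Assumes <source> elements are not nested inside one another.
    public static List<String> extract(String xml) throws Exception {
        List<String> results = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private boolean inSource = false;                      // the "flag"
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("source".equals(qName)) {
                    inSource = true;       // start collecting
                    text.setLength(0);     // reset the buffer for this element
                }
                // Start events for inner tags such as <g> are simply ignored.
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inSource) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("source".equals(qName)) {
                    inSource = false;      // stop collecting
                    results.add(text.toString().trim());
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>"
                + "<source><g id=\"_0\"><g id=\"_1\">First valid sentence</g></g></source>"
                + "<source>Another valid string</source>"
                + "</root>";
        extract(xml).forEach(System.out::println);
    }
}
```

No regex is needed to discard the <g> tags: the parser reports them as separate events, and the handler only reacts to `source`.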

1 Answer

You can use the jOOX library to parse XML data. Its find() method returns all <source> elements, and getTextContent() then extracts the text of each one, for example:

import java.io.File;
import java.io.IOException;
import org.xml.sax.SAXException;
import static org.joox.JOOX.$;

public class Main {

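    public static void main(String[] args) throws SAXException, IOException {
        // Parse the XML file given as the first command-line argument
        $(new File(args[0]))
            // Select every <source> element in the document
            .find("source")
            // getTextContent() concatenates all descendant text, so inner tags such as <g> are skipped
            .forEach(elem -> System.out.println(elem.getTextContent().trim()));
    }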
}

This assumes a well-formed XML file, such as:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <source>
        <g id="_0">
            <g id="_1">First valid sentence</g>
        </g>
    </source>
    <source>Another valid string</source>
</root>

And it yields:

First valid sentence
Another valid string
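
If adding a dependency is not an option, roughly the same extraction can be sketched with the DOM parser that ships with the JDK. This variant also writes the lines to results.txt, as the question asks, instead of printing them. The class name `SourceToFile` and the helper `sourceTexts` are illustrative names, and the output path is hard-coded purely for the example.

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SourceToFile {

    // Hypothetical helper: parses the stream and returns the trimmed
    // text content of every <source> element, in document order.
    public static List<String> sourceTexts(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);
        NodeList nodes = doc.getElementsByTagName("source");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() joins the text of all descendants,
            // so nested tags such as <g> do not appear in the result.
            lines.add(nodes.item(i).getTextContent().trim());
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Writes one entry per line, UTF-8 encoded.
            Files.write(Paths.get("results.txt"), sourceTexts(in),
                    StandardCharsets.UTF_8);
        }
    }
}
```

Like the jOOX version, this loads the whole document into memory, which is fine for small files; for very large inputs a streaming parser such as SAX may be a better fit.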