I get the following XML which represents a news article:

<content>
   Some text blalalala
   <h2>Small subtitle</h2>
   Some more text blbla
   <ul class="list">
      <li>List item 1</li>
      <li>List item 2</li>
   </ul>
   <br />
   Even more freakin text
</content>

I know the format isn't ideal but for now I have to take it.

The Article should look like:

  • Some text blalalala
  • Small subtitle
  • List with items
  • Even more freakin text

I parse this XML with Jsoup. I can get the text directly inside the <content> tag with ownText(), but that returns one big String, so I have no idea where the other parts (such as the subtitle) sit relative to the text.

Would it be better to use an event-based parser for this (I hate them :( ), or is there a way to do something like doc.getTextUntilTagAppears("tagName")?

Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it is interrupted by an element.

I learned that I can get all the text within <content> with .textNodes(). That works great, but I still don't know where each text node belongs in my article (one at the top before the h2, another at the bottom).

2 Answers

Jsoup has a fantastic selector-based syntax; see the selector documentation.

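If you want the subtitle, first parse the document:

// Jsoup.parse(String) treats its argument as markup, not as a path, so pass a File (needs java.io.File; throws IOException)
Document doc = Jsoup.parse(new File("path-to-your-xml"), "UTF-8"); // get the document node

You know that the subtitle is in the h2 element: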

Element subtitle = doc.select("h2").first();  // first h2 element that appears

And if you like to have the list:

Elements listItems = doc.select("ul.list > li");
for(Element item: listItems)
    System.out.println(item.text());  // print list's items one after another

2 Comments

Whoever gave the -1, please explain the reason so I can improve.
Hi, thanks for your effort, but I know how to get the elements. I'll try to specify my question.
3

The mistake I made was going through the XML Element by Element, and Elements do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
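
Roughly, the Node-by-Node walk could look like the sketch below. It is not the exact code I use: the class name ArticleNodes, the split helper, and the "text:" / "subtitle:" / "list:" labels are made up for illustration, and I am assuming the markup is parsed with Jsoup's XML parser and that only h2 and ul need special handling.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

class ArticleNodes {

    // Hypothetical helper: returns the article parts in document order.
    static List<String> split(String xml) {
        // Parse as XML so Jsoup does not wrap the input in html/body.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.selectFirst("content");

        List<String> parts = new ArrayList<>();
        // childNodes() returns text nodes AND elements, in source order.
        for (Node node : content.childNodes()) {
            if (node instanceof TextNode) {
                String text = ((TextNode) node).text().trim();
                if (!text.isEmpty()) {            // skip whitespace between tags
                    parts.add("text: " + text);
                }
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("h2")) {
                    parts.add("subtitle: " + el.text());
                } else if (el.tagName().equals("ul")) {
                    List<String> items = new ArrayList<>();
                    for (Element li : el.children()) {
                        items.add(li.text());
                    }
                    parts.add("list: " + String.join(", ", items));
                }
                // other elements (e.g. br) are ignored in this sketch
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\n"
                + "   Some text blalalala\n"
                + "   <h2>Small subtitle</h2>\n"
                + "   Some more text blbla\n"
                + "   <ul class=\"list\">\n"
                + "      <li>List item 1</li>\n"
                + "      <li>List item 2</li>\n"
                + "   </ul>\n"
                + "   <br />\n"
                + "   Even more freakin text\n"
                + "</content>";
        for (String part : split(xml)) {
            System.out.println(part);
        }
    }
}
```

For the XML in the question this should give five parts in order: text, subtitle, text, list, text. Because every part keeps its position, each text node can be placed before or after the right element.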

1 Comment

Good that it's working for you now. Now that you've found the solution, please update the question and answer accordingly. And accept your own answer when it is eligible.
