Parsing XML with Jsoup

Question

I get the following XML which represents a news article:

<content>
   Some text blalalala
   <h2>Small subtitle</h2>
   Some more text blbla
   <ul class="list">
      <li>List item 1</li>
      <li>List item 2</li>
   </ul>
   <br />
   Even more freakin text
</content>

I know the format isn't ideal but for now I have to take it.

The Article should look like:

Some text blalalala
Small subtitle
List with items
Even more freakin text

I parse this XML with Jsoup. I can get the text directly inside the <content> tag with doc.ownText(), but then I have no idea where the other stuff (the subtitle, the list) sits relative to it; I only get one big String.

Would it be better to use an event-based parser for this (I hate them :() or is there a way to do something like doc.getTextUntilTagAppears("tagName")?

Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it is interrupted by an element.

I learned that I can get all the text within <content> with .textNodes(), which works great, but I still don't know where each text node belongs in my article (one at the top before the h2, the other one at the bottom).
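
To make the problem concrete, here is a minimal sketch of what ownText() and textNodes() give back. The class name OwnTextDemo, the helper directTexts(), and the shortened XML string are made up for illustration; only the Jsoup calls themselves (Jsoup.parse with Parser.xmlParser(), select, ownText, textNodes) come from the library.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

class OwnTextDemo {

    // Shortened stand-in for the article XML (an assumption for this sketch).
    static final String XML =
            "<content>Some text<h2>Small subtitle</h2>Even more text</content>";

    // Parse with the XML parser so no <html>/<body> wrapper is added.
    static Element content() {
        return Jsoup.parse(XML, "", Parser.xmlParser()).select("content").first();
    }

    // Text of each direct text node, trimmed. Child elements such as <h2>
    // are not part of this list, so their position is lost.
    static List<String> directTexts() {
        List<String> out = new ArrayList<>();
        for (TextNode t : content().textNodes()) {
            out.add(t.text().trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // ownText() joins the direct text nodes into a single String.
        System.out.println(content().ownText());
        // textNodes() keeps them separate, but says nothing about the <h2> between them.
        System.out.println(directTexts());
    }
}
```

Both calls skip the subtitle entirely, which is why neither one is enough to rebuild the article in order.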

zEro · Accepted Answer · 2013-07-11 11:12:02Z

9

Jsoup has a fantastic selector-based syntax; see the selector syntax section of the Jsoup documentation.

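If you want the subtitle, first parse the document:

Document doc = Jsoup.parse(new java.io.File("path-to-your-xml"), "UTF-8"); // read the file and get the document node; Jsoup.parse(String) would treat the argument as markup, not as a path

You know that the subtitle is in the h2 element: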

Element subtitle = doc.select("h2").first();  // first h2 element that appears

And if you like to have the list:

Elements listItems = doc.select("ul.list > li");
for(Element item: listItems)
    System.out.println(item.text());  // print list's items one after another


2 Comments

zEro Over a year ago

Whoever graciously gave the -1, please explain the reason so I can improve.

fweigl Over a year ago

Hi, thanks for your effort, but I know how to get the elements. I'll try to specify my question.

fweigl · Accepted Answer · 2013-07-11 12:27:50Z

3

The mistake I made was going through the XML by Elements, which do not include TextNodes. When I go through it Node by Node, I can check whether each Node is an Element or a TextNode and treat it accordingly.
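
A minimal sketch of that Node-by-Node walk, assuming the XML is already available as a String. The class name ArticleFlattener, the method flatten(), and the "- " prefix for list items are made up for illustration and are not from my real code; the Jsoup calls (childNodes, TextNode, Element, Parser.xmlParser()) are the library's own.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

class ArticleFlattener {

    // Hypothetical helper: returns the pieces of the article in document order.
    static List<String> flatten(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element content = doc.select("content").first();
        List<String> parts = new ArrayList<>();
        if (content == null) {
            return parts; // no <content> element found
        }
        // childNodes() returns text nodes and elements together, in order.
        for (Node child : content.childNodes()) {
            if (child instanceof TextNode) {
                String text = ((TextNode) child).text().trim();
                if (!text.isEmpty()) {
                    parts.add(text); // loose text between elements
                }
            } else if (child instanceof Element) {
                Element el = (Element) child;
                if (el.tagName().equals("ul")) {
                    for (Element li : el.select("li")) {
                        parts.add("- " + li.text()); // one entry per list item
                    }
                } else if (!el.text().isEmpty()) {
                    parts.add(el.text()); // e.g. the <h2> subtitle; empty <br /> is skipped
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>Some text<h2>Small subtitle</h2>Some more text"
                + "<ul class=\"list\"><li>List item 1</li><li>List item 2</li></ul>"
                + "<br />Even more text</content>";
        for (String part : flatten(xml)) {
            System.out.println(part);
        }
    }
}
```

Because the loop sees text nodes and elements in the same list, every piece of text keeps its position relative to the subtitle and the list.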


1 Comment

zEro Over a year ago

Good that it's working for you now. Now that you've found the solution, please update the question and answer accordingly. And accept your own answer when it is eligible.
