I get the following XML which represents a news article:
<content>
Some text blalalala
<h2>Small subtitle</h2>
Some more text blbla
<ul class="list">
<li>List item 1</li>
<li>List item 2</li>
</ul>
<br />
Even more freakin text
</content>
I know the format isn't ideal but for now I have to take it.
The article should be laid out like this:
- Some text blalalala
- Small subtitle
- List with items
- Even more freakin text
I parse this XML with Jsoup. I can get the text directly inside the <content> tag with doc.ownText(), but that returns one big String, so I have no idea where the other elements (such as the subtitle) sit relative to the text.
Would it be better to use an event-based parser for this (I hate them :(), or is there a way to do something like doc.getTextUntilTagAppears("tagName")?
Edit: For clarification, I know how to get the elements under <content>. My problem is getting the text within <content>, broken up every time it's interrupted by an element.
I learned that I can get all the text within <content> with .textNodes(). That works great, but I still don't know where each text node belongs in my article (one at the top before the h2, the other one at the bottom).
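To make the desired result concrete, here is a rough sketch of the ordering I'm after. Note that it uses the JDK's built-in DOM parser rather than Jsoup, so it is not what I'm actually running; the class name ArticleParts, the helper split() and the "text:" / tag-name prefixes are made up for illustration. It walks the direct children of <content> in document order and emits one entry per text run or element, skipping whitespace-only text and empty elements such as <br />.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ArticleParts {

    // Hypothetical helper: returns the direct children of the root element
    // in document order, as "text: ..." or "<tagName>: ..." entries.
    static List<String> split(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }

        NodeList children = doc.getDocumentElement().getChildNodes();
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                // A run of text sitting directly inside <content>
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    parts.add("text: " + text);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                // An element interrupting the text; collapse its inner whitespace
                String text = node.getTextContent().trim().replaceAll("\\s+", " ");
                if (!text.isEmpty()) {
                    parts.add(node.getNodeName() + ": " + text);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String xml = "<content>\n"
                + "Some text blalalala\n"
                + "<h2>Small subtitle</h2>\n"
                + "Some more text blbla\n"
                + "<ul class=\"list\">\n"
                + "<li>List item 1</li>\n"
                + "<li>List item 2</li>\n"
                + "</ul>\n"
                + "<br />\n"
                + "Even more freakin text\n"
                + "</content>";
        for (String part : split(xml)) {
            System.out.println(part);
        }
    }
}
```

What I'm looking for is the Jsoup equivalent of this in-order walk, so that each text run keeps its position relative to the h2 and the ul.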