1

I'm trying to parse an XML-formatted document with Jsoup, specifically the content of the paragraph tag in the example shown below.

...
<nitf:body.content>
     <p> Content would be here. </p>
</nitf:body.content>
...

There are multiple paragraph tags in the document, so I chose to use selector syntax to get inside the body.content tag and then the paragraph tag underneath it. My current attempt, which does not work, is:

// epochFileDoc is the name of the document with the code shown above.
Element tag_element = epochFileDoc.selectFirst("nitf|body.content > p");

I have tried a few different combinations of the selector syntax, including "nitf|content.body > p" and "nitf|body > p". None of the ones I have tried have worked.

How would I use selector-syntax in Jsoup to get the paragraph tag shown above?

EDIT: I see why content.body does not work in the selector syntax: the dot starts a class selector, so it looks for a nitf:content element with class="body". I'm still lost on how to get that element, though.

2
  • Can you use a different XML parser instead, e.g: one based on en.wikipedia.org/wiki/Document_Object_Model and en.wikipedia.org/wiki/XPath rather than something that only supports CSS selectors? A dot has a special meaning in CSS. Commented Jun 19, 2019 at 17:05
  • 1
    I would follow that suggestion if I could; I am required to use Jsoup for this software. I created a workaround for this issue which I'll post in a second, since the dot has a special meaning (like you said). Commented Jun 19, 2019 at 17:29

2 Answers

1

@dacmacho's explanation is correct and the workaround will do, if you can modify the data before parsing it.

There is now a less invasive solution: I've just pushed a pull request ( https://github.com/jhy/jsoup/pull/1442 ) to Jsoup that enables backslash escapes within the selector for element names and CSS identifiers.

So with that change, you'd simply use (note the backslash right before the dot, doubled here because it sits inside a Java string literal):

Element tag_element = epochFileDoc.selectFirst("nitf|body\\.content > p");
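
For anyone unsure what the selector engine actually receives: a minimal sketch, using only the standard library, that prints the selector string as it exists at runtime. The class name SelectorEscape is hypothetical and only for illustration; whether the escape is honored depends on running a Jsoup build that includes the pull request above.

```java
public class SelectorEscape {
    // Two backslashes in Java source produce one backslash in the actual string.
    public static final String SELECTOR = "nitf|body\\.content > p";

    public static void main(String[] args) {
        // Prints the selector exactly as the selector parser would see it.
        System.out.println(SELECTOR); // prints nitf|body\.content > p
    }
}
```

A single backslash before the dot in Java source would not compile, since \. is not a valid string escape.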

0

An unescaped selector such as nitf|body.content fails because a dot has a special meaning in CSS, which is the selector syntax Jsoup uses (as @Shlomi Fish said): it introduces a class selector. As a workaround, I replaced instances of nitf:body.content with nitf:body-content using the line below, where file is the string holding the XML:

file = file.replace("<nitf:body.", "<nitf:body-");

This allowed me to select the Element using:

Element tag_element = epochFileDoc.selectFirst("nitf|body-content > p");

A dedicated XML parser would be a better fit in cases like this, but if you are required to use Jsoup, as I am, or simply want to keep it, this workaround does the job.
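
One thing worth noting about the replace call above: the pattern "<nitf:body." only matches opening tags, so closing tags such as </nitf:body.content> keep the dotted name. The selection still worked in my case, but a variant that keeps opening and closing tags consistent might look like the sketch below. The class and method names (NitfTagRename, renameBodyTags) are hypothetical, and like the original one-liner it renames every tag whose name starts with nitf:body., not just body.content.

```java
public class NitfTagRename {
    // Hypothetical helper: swaps the dot for a hyphen in both opening
    // and closing tags whose name starts with "nitf:body.".
    public static String renameBodyTags(String xml) {
        return xml
                .replace("<nitf:body.", "<nitf:body-")    // opening tags
                .replace("</nitf:body.", "</nitf:body-"); // closing tags
    }

    public static void main(String[] args) {
        String file = "<nitf:body.content>\n"
                + "     <p> Content would be here. </p>\n"
                + "</nitf:body.content>";
        // The result can then be parsed and queried with "nitf|body-content > p".
        System.out.println(renameBodyTags(file));
    }
}
```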
