I am trying to parse multiple HTML documents so that I keep only the tags and discard all of their attributes and values. Can someone help, please?

For example: <img src="pic_trulli.jpg" alt="Italian Trulli">

changes to

<img>

Similarly, I want this to work for all the tags in an HTML document.

3 Answers

1

To remove the attributes of a single element you can use this:

element.attributes().asList()
        .stream().map(Attribute::getKey)
        .forEach(element::removeAttr);

To remove the attributes of all elements, apply the same code to every element returned by document.getAllElements():

Document document = Jsoup.parse("<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">");
document.getAllElements()
        .forEach(e -> e.attributes().asList()
                .stream().map(Attribute::getKey)
                .forEach(e::removeAttr));

The result will be this:

<html>
 <head></head>
 <body>
  <img>
 </body>
</html>
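
If your jsoup version provides clearAttributes() on elements, the stream can probably be dropped altogether. The following is only a minimal sketch under that assumption. The class name StripAttributes and the method stripAttributes are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StripAttributes {

    // Hypothetical helper: parses the HTML, clears the attributes of every
    // element, and returns the inner HTML of <body>.
    static String stripAttributes(String html) {
        Document document = Jsoup.parse(html);
        for (Element element : document.getAllElements()) {
            // Assumes clearAttributes() is available in the jsoup version in use
            element.clearAttributes();
        }
        return document.body().html();
    }

    public static void main(String[] args) {
        String html = "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">";
        System.out.println(stripAttributes(html));
    }
}
```

Returning body().html() instead of the whole document leaves out the <html>, <head> and <body> wrapper that jsoup adds when it parses a fragment.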

0

You can iterate over all elements in the document and then over each element's attributes, removing each attribute as you go.

Demo:

String html = "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">" +
        "<div class=\"foo\"><a href=\"pic_trulli.jpg\" alt=\"Italian Trulli\" non-standard></div>";
Document doc = Jsoup.parse(html);

System.out.println(doc);
for (Element el : doc.getAllElements()){
    for (Attribute atr : el.attributes().asList()){
        el.removeAttr(atr.getKey());
    }
}
System.out.println("-----");
System.out.println(doc);

Output:

<html>
 <head></head>
 <body>
  <img src="pic_trulli.jpg" alt="Italian Trulli">
  <div class="foo">
   <a href="pic_trulli.jpg" alt="Italian Trulli" non-standard></a>
  </div>
 </body>
</html>
-----
<html>
 <head></head>
 <body>
  <img>
  <div>
   <a></a>
  </div>
 </body>
</html>
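
Since the question mentions multiple documents, the loop above could be wrapped in a helper and applied to each one in turn. This is a rough sketch rather than a definitive implementation: AttributeStripper, strip and stripAll are hypothetical names, and the inputs are assumed to be HTML strings already loaded into memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AttributeStripper {

    // Removes every attribute from every element of an already parsed document.
    static Document strip(Document doc) {
        for (Element el : doc.getAllElements()) {
            // Iterate over the list returned by asList() and remove by key,
            // exactly as in the demo above
            for (Attribute attr : el.attributes().asList()) {
                el.removeAttr(attr.getKey());
            }
        }
        return doc;
    }

    // Parses each HTML string, strips its attributes, and collects the body HTML.
    static List<String> stripAll(List<String> htmlSources) {
        List<String> result = new ArrayList<>();
        for (String html : htmlSources) {
            Document doc = strip(Jsoup.parse(html));
            result.add(doc.body().html());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sources = Arrays.asList(
                "<img src=\"pic_trulli.jpg\" alt=\"Italian Trulli\">",
                "<div class=\"foo\"><a href=\"pic_trulli.jpg\">link</a></div>");
        stripAll(sources).forEach(System.out::println);
    }
}
```

If the documents live in files or behind URLs, the Jsoup.parse(html) call is the only line that would need to change.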

0

If your aim is to get a clean document structure, you also need to remove text and data nodes. Consider the following snippet.

Document document = Jsoup.connect("http://example.com").get();
document.getAllElements().forEach(element -> {
      element.attributes().asList().forEach(attr -> element.removeAttr(attr.getKey()));
      element.textNodes().forEach(Node::remove);
      element.dataNodes().forEach(Node::remove);
    });
System.out.println(document);

Output:

<!doctype html>
<html>
 <head>
  <title></title>
  <meta>
  <meta>
  <meta>
  <style></style>
 </head>
 <body>
  <div>
   <h1></h1>
   <p></p>
   <p><a></a></p>
  </div>
 </body>
</html>
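
To try the same idea without a network request, the snippet can be run against an inline string. This is just a sketch: the class name StructureOnly, the method structureOnly and the sample HTML are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class StructureOnly {

    // Hypothetical helper: keeps the element tree but drops attributes,
    // text nodes, and data nodes (such as the contents of <style>).
    static String structureOnly(String html) {
        Document document = Jsoup.parse(html);
        document.getAllElements().forEach(element -> {
            element.attributes().asList().forEach(attr -> element.removeAttr(attr.getKey()));
            element.textNodes().forEach(Node::remove);
            element.dataNodes().forEach(Node::remove);
        });
        return document.outerHtml();
    }

    public static void main(String[] args) {
        // Made-up sample input
        String html = "<html><head><title>Demo</title>"
                + "<style>p { color: red; }</style></head>"
                + "<body><div id=\"main\"><h1 class=\"big\">Hello</h1>"
                + "<p>Some <a href=\"#\">link</a> text</p></div></body></html>";
        System.out.println(structureOnly(html));
    }
}
```

Note that this only targets text and data nodes. Other node types, such as HTML comments, are left alone and would need separate handling if they should go as well.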
