
My task is to get the HTML structure of the document without data. From:

<html>
  <head>
    <title>Hello!</title>
  </head>
  <body id="uniq">
    <h1>Hello World!</h1>
  </body>
</html>

I want to get:

<html>
  <head>
    <title></title>
  </head>
  <body id="uniq">
    <h1></h1>
  </body>
</html>

There are a number of ways to extract data with Nokogiri, but I couldn't find a way to perform the reverse task.

UPDATE: The solution I found is a combination of the two answers I received:

require 'nokogiri'

doc = Nokogiri::HTML(open("test.html"))
# Visit every node under <html> and drop the text nodes, keeping the tags.
doc.at_css("html").traverse do |node|
  node.remove if node.text?
end
puts doc

The output is exactly the one I want.


2 Answers


It sounds like you want to remove all the text nodes. You can do this like so:

doc.xpath('//text()').remove
puts doc

1 Comment

Running doc = Nokogiri::HTML(open("trial.html")) and then puts doc.xpath('//text()').remove gives the following result: Hello! Hello world! That is the opposite of what I want.

Traverse the document. For each node, delete what you don't want. Then write out the document.

Remember that Nokogiri can modify the document in place; see the Nokogiri documentation.

2 Comments

Thanks, Larry. I read the page from a URL. Would you suggest writing the whole page source to a file and manipulating it from there?
You mean for loading the doc at the start? You can load directly from a URL into Nokogiri; see the documentation.
