Remove specific HTML elements in Ruby

Question

I have seen whitelist-based HTML sanitizers for Ruby, but I need the opposite: I need ONLY the links removed from a page that is being prepared for PDF conversion. I tried Sanitize, but it does not fit my needs, because I cannot predict which HTML elements the fetched page will use and so cannot add them all to the whitelist.

If my input was

<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>

I would want

Link!
<b>Bold Text</b>
<div>A div!</div>

to be the output.

Is there any 'blacklist-based sanitizer' for Ruby?

Alternatively, would it be sufficient to use a print stylesheet that removes the coloration and underline from all links?
– Phrogz, Commented Nov 10, 2012 at 5:13

Phrogz · Accepted Answer · 2012-11-10 14:23:47Z

3

Minor variation on the Tin Man's answer, still using Nokogiri:

require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html )
doc.css('a,blink,marquee').each do |el|
  el.replace( el.inner_html )
end
cleaned = doc.to_html

The two differences here are:

Using css instead of search makes it slightly more explicit which kind of selector is being used (though it makes no functional difference here); but more importantly,
Replacing with inner_html preserves any markup inside the link. For example, given the markup:
```
<a href="foo">Hi <b>Mom</b></a>!
```
then replacing with .content would produce:
```
Hi Mom!
```
whereas replacing with .inner_html produces:
```
Hi <b>Mom</b>!
```

edited Nov 10, 2012 at 14:23

answered Nov 10, 2012 at 5:20

Phrogz

the Tin Man · Accepted Answer · 2012-11-10 05:22:40Z

2

You want an HTML parser, such as Nokogiri. It lets you walk through the document, search for specific nodes ("tags") and do things to them:

require 'nokogiri'

html = '<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>
'

doc = Nokogiri.HTML(html)

doc.search('a').each do |a|
  a.replace(a.content)
end

puts doc.to_html

Which results in:

<html><body>Link!
<b>Bold Text</b>
<div>A div!</div>
</body></html>

Notice that Nokogiri did some fixups to the code, supplying the appropriate <html> and <body> tags. It doesn't have to; I could have told it to parse and return a document fragment instead, but usually we let it do its thing.

edited Nov 10, 2012 at 5:22

Phrogz

answered Nov 10, 2012 at 5:10

the Tin Man

4 Comments

Phrogz Over a year ago

Since the OP mentioned "elements" plural and "blacklist", you can do: doc.search('a,script,…') (for example) to select multiple element types to change at once.

the Tin Man Over a year ago

Good point, though the results of replacing <script> tags with their content will be "icky", to use the professional term.

Phrogz Over a year ago

You're right, that's a terrible alternative. I had trouble thinking of another element that might be unwanted in PDF.

the Tin Man Over a year ago

I'd like to see <blink> removed from PDFs. :-)

Nino van Hooff · Accepted Answer · 2015-05-11 11:39:34Z

1

Rails 4.2 can do this out of the box. For older versions, the gem 'rails-html-sanitizer' is required.

white list only the supplied tags and attributes

white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
white_list_sanitizer.sanitize(@article.body, tags: %w(table tr td), attributes: %w(id class style))

Or use the Loofah-based TargetScrubber:

Rails::Html::TargetScrubber

Where PermitScrubber picks out tags and attributes to permit in sanitization, Rails::Html::TargetScrubber targets them for removal.

scrubber = Rails::Html::TargetScrubber.new
scrubber.tags = ['img']

html_fragment = Loofah.fragment('<a><img/ ></a>')
html_fragment.scrub!(scrubber)
html_fragment.to_s # => "<a></a>"

Rails HTML sanitizer

edited May 11, 2015 at 11:39

answered May 11, 2015 at 10:55

Nino van Hooff

1 Comment

codener Over a year ago

Note that Rails::Html::TargetScrubber strips the element by replacing it with its contents (as required by OP). If you want to actually remove the whole element, you need to define a custom scrubber like this: Loofah::Scrubber.new do |node| node.remove if REMOVE_TAGS.include?(node.name) end

Lucas Chwe · Accepted Answer · 2017-01-18 18:15:51Z

0

html_without_links = remove_tags('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>', 'a')

You can call the method as shown above once the code below is defined, and you should get what you want.

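require 'nokogiri'
require 'active_support/core_ext/object/blank' # provides blank? outside Rails

# True when parsing the text as HTML changes it, i.e. it contains markup.
def is_html?(text)
  stripped_text = Nokogiri::HTML(text).text.strip
  !stripped_text.eql?(text)
end

# Replaces each element matching tag with its text content.
# Note: the final #text call returns plain text, so all other tags are dropped too.
def remove_tags(message_string, tag = nil)
  return message_string if message_string.blank? || tag.blank? || !is_html?(message_string)

  html_doc = Nokogiri.HTML(message_string)
  html_doc.search(tag).each do |a|
    a.replace(a.content)
  end

  html_doc.text
end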

answered Jan 18, 2017 at 18:15

Lucas Chwe
