I have seen whitelist-based HTML sanitizers for Ruby, but I need the opposite: I need ONLY links removed from a page that is being prepared for PDF conversion. I tried Sanitize, but it does not fit my needs, because it is too difficult to predict which HTML elements the fetched page will use and add them all to the whitelist.

If my input was

<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>

I would want

Link!
<b>Bold Text</b>
<div>A div!</div>

to be the output.

Is there any 'blacklist-based sanitizer' for Ruby?

1
  • Alternatively, would it be sufficient to use a print stylesheet that removes the coloration and underlining from all links? Commented Nov 10, 2012 at 5:13

4 Answers

3

Minor variation on the Tin Man's answer, still using Nokogiri:

require 'nokogiri' # gem install nokogiri
doc = Nokogiri.HTML( my_html )
doc.css('a,blink,marquee').each do |el|
  el.replace( el.inner_html )
end
cleaned = doc.to_html

The two differences here are:

  1. Using css over search to be slightly more specific about the selectors being used (though it offers no functional difference), but more importantly

  2. By replacing with inner_html we preserve possible markup inside the link. For example, given the markup:

    <p><a href="foo">Hi <b>Mom</b></a>!</p>
    

    then replacing with .content would produce:

    <p>Hi Mom!</p>
    

    whereas replacing with .inner_html produces:

    <p>Hi <b>Mom</b>!</p>
    

2

You want an HTML parser, such as Nokogiri. It lets you walk through the document, search for specific nodes ("tags"), and modify them:

require 'nokogiri'

html = '<a href="link">Link!</a>
<b>Bold Text</b>
<div>A div!</div>
'

doc = Nokogiri.HTML(html)

doc.search('a').each do |a|
  a.replace(a.content)
end

puts doc.to_html

Which results in:

<html><body>Link!
<b>Bold Text</b>
<div>A div!</div>
</body></html>

Notice that Nokogiri did some fixups to the code, supplying the appropriate <html> and <body> tags. It doesn't have to: I could have told it to parse and return a document fragment instead, but usually we let it do its thing.

4 Comments

Since the OP mentioned "elements" plural and "blacklist", you can do: doc.search('a,script,…') (for example) to select multiple element types to change at once.
Good point, though the results of replacing <script> tags with their content will be "icky", to use the professional term.
You're right, that's a terrible alternative. I had trouble thinking of another element that might be unwanted in PDF.
I'd like to see <blink> removed from PDFs. :-)
1

Rails 4.2 can do this out of the box. For older versions, the gem 'rails-html-sanitizer' is required.

Whitelist only the supplied tags and attributes:

white_list_sanitizer = Rails::Html::WhiteListSanitizer.new
white_list_sanitizer.sanitize(@article.body, tags: %w(table tr td), attributes: %w(id class style))

Or use the Loofah-based TargetScrubber that ships with the same gem:

Rails::Html::TargetScrubber

Where PermitScrubber picks out tags and attributes to permit in sanitization, Rails::Html::TargetScrubber targets them for removal.

scrubber = Rails::Html::TargetScrubber.new
scrubber.tags = ['img']

html_fragment = Loofah.fragment('<a><img/></a>')
html_fragment.scrub!(scrubber)
html_fragment.to_s # => "<a></a>"

Rails HTML sanitizer

1 Comment

Note that Rails::Html::TargetScrubber strips the element by replacing it with its contents (as required by OP). If you want to actually remove the whole element, you need to define a custom scrubber like this: Loofah::Scrubber.new do |node| node.remove if REMOVE_TAGS.include?(node.name) end
0
html_without_links = remove_tags('<a href="link">Link!</a><b>Bold Text</b><div>A div!</div>', 'a')

Call the method as shown above, with the definitions below in place, and you should get what you want.

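require 'nokogiri'

# True when parsing changes the string, i.e. the text contains markup.
def is_html?(text)
  Nokogiri::HTML(text).text.strip != text
end

# Replaces every element matching `tag` with its text content.
def remove_tags(message_string, tag = nil)
  # Plain-Ruby emptiness checks, so ActiveSupport's blank? is not needed.
  return message_string if message_string.to_s.strip.empty? || tag.to_s.strip.empty?
  return message_string unless is_html?(message_string)

  # Parse as a fragment so no <html>/<body> wrapper is added.
  html_doc = Nokogiri::HTML.fragment(message_string)
  html_doc.css(tag).each do |a|
    a.replace(a.content)
  end

  # Serialize as HTML; calling .text here would strip every remaining tag too.
  html_doc.to_html
end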
