Ruby - Strip all HTML tags from string with Regex

Question

I have the following string as an example

"<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"

And I would like to strip all HTML tags from it. I was using the following method which kind of worked

Nokogiri::HTML(CGI.unescapeHTML(@message_preview)).content

But it ultimately produced,

"Hello,my name is SameFarewell,Same"

When I wanted

"Hello, my name is Same Farewell, Same"

Notice the spaces: wherever there is a line break, I want a space in its place, rather than having the next piece of text follow immediately as the very next character in the string.

I was hoping to try to use gsub or regex but am kind of lost on how to make it happen.

I guess the easiest solution might be to replace all line breaks (<br>) with a space before you remove the HTML tags? Also trim multiple spaces down to a single one (in case of multiple line breaks).
– xander, Commented Dec 21, 2017 at 15:38
Actually, yeah, you are right. I ended up using @message_preview.gsub!(/<br>/, ' '). But I just realized I need to account for a whole host of HTML tags because of the keyboard options: bold, italic, underline, ol, ul, quotes, etc. So I need to find a way to include all of that in my gsub and then run Nokogiri.
– vin_Bin87, Commented Dec 21, 2017 at 15:46
@xander you guess wrong; with this approach, sooner or later you’ll find yourself implementing an HTML parser with regular expressions.
– Aleksei Matiushkin, Commented Dec 21, 2017 at 15:49
You are going about it the right way by using a parser like Nokogiri; doing this with a regex is a bad idea.
– engineersmnky, Commented Dec 21, 2017 at 16:14

Sagar Pandya · Accepted Answer · 2017-12-21 16:33:43Z

2

You can use split here, passing a regex that works for your example (s is your string).

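def wordy(s)
  s.split(/<.*?>/)      # split on anything that looks like a tag
   .map(&:strip)        # trim whitespace around each text fragment
   .reject(&:empty?)    # drop fragments that were only whitespace
   .join(' ')           # rejoin the fragments with single spaces
   .gsub(/\s,/, ',')    # remove the space left in front of a comma
end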

s = "<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
t = "<p>Hello <strong>Jim</strong>,</p><p> </p><p>This is <em>Charlie</em> and<u> I wanted to say</u></p><ol><li>hello</li><li>goodby</li></ol><p> </p><p>Farewell,</p><p>Lawrence</p>"

p wordy s
#"Hello, my name is Same Farewell, Same"

p wordy t
#"Hello Jim, This is Charlie and I wanted to say hello goodby Farewell, Lawrence"

edited Dec 21, 2017 at 16:33

answered Dec 21, 2017 at 15:56

Sagar Pandya


8 Comments

vin_Bin87 Over a year ago

Hey, this could work! I was using split before this issue. I just noticed, though, that I now need to account for a lot of other text options: <ol>, <ul>, bold, italics, underline, quotes, etc. Is it possible to include those in the regex example you provided? If so, would you mind amending it for me? This stuff is very confusing to me!

vin_Bin87 Over a year ago

I take it back, it kind of works. I have this new example string with what I discussed,

"<p>Hello <strong>Jim</strong>,</p><p> </p><p>This is <em>Charlie</em> and<u> I wanted to say</u></p><ol><li>hello</li><li>goodby</li></ol><p> </p><p>Farewell,</p><p>Lawrence</p>"

And using your method, I get this "Hello Jim , This is Charlie and I wanted to say hello goodby Farewell, Lawrence". How can I strip those extra spaces? Maybe do a check to make sure there is only a max of 1 space and no consecutive spaces?

vin_Bin87 Over a year ago

That's it! Thanks so much Sagar!

Sagar Pandya Over a year ago

@vin_Bin87 refactored answer again. I do agree with others that you should use a dedicated library rather than regex. These libraries are tried and tested whereas regex can break and not work for fringe examples.

vin_Bin87 Over a year ago

Hey, it's actually including this &nbsp; in the resulting output: "Hello  Jim, This is  Daniel ." Any idea how to fix this?
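
The stray characters mentioned in the last comment are most likely non-breaking spaces (the &nbsp; entity, or the U+00A0 character it stands for), which neither String#strip nor \s treats as whitespace by default. A minimal sketch, assuming that is the cause: wordy_nbsp is a hypothetical variant of the split-based method above, and the sample string is made up for illustration.

```ruby
# Hypothetical helper: a variant of the split-based approach that first
# turns non-breaking spaces into ordinary spaces.
def wordy_nbsp(html)
  html.gsub(/&nbsp;|\u00A0/i, ' ')  # entity form and the literal U+00A0 character
      .split(/<.*?>/)               # split on anything that looks like a tag
      .map(&:strip)
      .reject(&:empty?)
      .join(' ')
      .gsub(/\s+/, ' ')             # collapse runs of whitespace inside fragments
      .gsub(/ ([,.])/, '\1')        # drop the space left before a comma or period
end

# Made-up sample input, not taken from the question.
sample = "<p>Hello&nbsp;<strong>Jim</strong>,</p><p>This is&nbsp;<em>Daniel</em>.</p>"
puts wordy_nbsp(sample)
# prints: Hello Jim, This is Daniel.
```

Other named entities (for example &amp;) would still pass through untouched here, which is one more reason the parser-based answers tend to be more robust.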


Aleksei Matiushkin · Accepted Answer · 2017-12-21 15:46:29Z

2

Unfortunately, Nokogiri::XML::Node#traverse does not return an enumerator when no block is given, which is why we need this ugly hack of defining a local variable up front.

require 'cgi'
require 'nokogiri'

result, input = [], "<p>Hello,</p><p><br></p><p>my name is Same</p>" \
                    "<p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
# Visit every node of the parsed document and keep only the text nodes.
Nokogiri::HTML(CGI.unescapeHTML(input)).traverse do |e|
  result << e.text if e.text?
end
result.join(' ')
#⇒ "Hello, my name is Same Farewell, Same"

answered Dec 21, 2017 at 15:46

Aleksei Matiushkin


Viktor Ivliiev · Accepted Answer · 2020-02-06 10:24:19Z

0

My solution:

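# gsub rather than gsub!: gsub! returns nil when nothing was replaced,
# which would make the chained .strip raise NoMethodError.
description = description.gsub(/<("[^"]*"|'[^']*'|[^'">])*>/, ' ').strip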
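
To illustrate what the quoted-attribute alternatives in this pattern buy you, here is a small comparison against the simpler /<.*?>/ pattern. It is a sketch under stated assumptions: strip_tags and both constant names are hypothetical, and the input is a made-up fringe case with a > inside an attribute value.

```ruby
# Pattern from the answer: skips over quoted attribute values.
QUOTE_AWARE_TAG = /<("[^"]*"|'[^']*'|[^'">])*>/
# Simpler pattern: stops at the first ">" it sees.
NAIVE_TAG       = /<.*?>/

# Hypothetical helper: replace each match with a space, then tidy up.
def strip_tags(html, pattern)
  html.gsub(pattern, ' ').squeeze(' ').strip
end

# Made-up fringe case: a ">" inside a quoted attribute value.
html = %q(<a title="1 > 0" href="#">link</a> text)

puts strip_tags(html, NAIVE_TAG)
# prints: 0" href="#">link text
puts strip_tags(html, QUOTE_AWARE_TAG)
# prints: link text

# gsub! returns nil when there is nothing to replace, so chaining
# another call onto it is unsafe for tag-free input.
p 'no tags here'.dup.gsub!(QUOTE_AWARE_TAG, ' ')
# prints: nil
```

Even the quote-aware pattern knows nothing about comments, script blocks, or entities, so it remains a best-effort cleanup rather than a replacement for a parser.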

answered Feb 6, 2020 at 10:24

Viktor Ivliiev
