I have the following example string:

"<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"

I would like to strip all HTML tags from it. I was using the following method, which mostly worked:

Nokogiri::HTML(CGI.unescapeHTML(@message_preview)).content

But it produced:

"Hello,my name is SameFarewell,Same"

when I wanted:

"Hello, my name is Same Farewell, Same"

Notice the spaces: wherever there is a line break, I want a space in its place, rather than having the next piece of text follow immediately.

I was hoping to use gsub with a regex, but I'm lost on how to make it happen.

  • I guess the easiest solution might be to replace every line break <br> with a space before you remove the HTML tags!? Also trim multiple spaces to a single one (in case of multiple line breaks). Commented Dec 21, 2017 at 15:38
  • Actually, yeah. You are right. I ended up using @message_preview.gsub!(/<br>/, ' '). But I just realized I need to account for a whole host of HTML tags because of the keyboard options: bold, italic, underline, ol, ul, quotes, etc. So I need to find a way to include all of that in my gsub and then run Nokogiri. Commented Dec 21, 2017 at 15:46
  • @xander you guess wrong; with this approach, sooner or later you’ll find yourself implementing an HTML parser with regular expressions. Commented Dec 21, 2017 at 15:49
  • You are going about it the right way by using a parser like Nokogiri; doing this with a regex is a bad idea. Commented Dec 21, 2017 at 16:14
  • Why is that? What are the downsides to regex over nokogiri? Commented Dec 21, 2017 at 16:20
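
To make the concern in the comments above concrete, here is a minimal sketch of one way a tag-stripping regex can go wrong. The pattern and the sample markup are illustrative assumptions, not taken from the question; the point is only that `>` is legal inside a quoted attribute value, which a simple pattern has no way of knowing.

```ruby
# Illustrative pattern (an assumption for this sketch): lazily match
# from "<" up to the first ">" and treat that as one tag.
NAIVE_TAG = /<.*?>/

# Hypothetical markup where an attribute value contains ">".
html = '<a title="a > b">link</a>'

# The match for the opening tag stops at the ">" inside the attribute,
# so the rest of the attribute leaks into the output.
stripped = html.gsub(NAIVE_TAG, '')
p stripped
# => " b\">link"
```

A parser such as Nokogiri tokenizes attributes before handing back text, so it should report only `link` here. Comments, scripts, and entities raise similar problems, and each regex patch is a step toward writing a parser by hand.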

3 Answers

2

You can use split with a regex that matches the tags; this works for your examples (s is your string).

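def wordy(s)
  s.split(/<.*?>/)       # split on anything that looks like a tag
   .map(&:strip)         # trim whitespace around each fragment
   .reject(&:empty?)     # drop fragments that were only whitespace
   .join(' ')            # one space between fragments
   .gsub(/\s,/, ',')     # remove the space left before a comma
end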

s = "<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
t = "<p>Hello <strong>Jim</strong>,</p><p> </p><p>This is <em>Charlie</em> and<u> I wanted to say</u></p><ol><li>hello</li><li>goodby</li></ol><p> </p><p>Farewell,</p><p>Lawrence</p>"

p wordy(s)
# => "Hello, my name is Same Farewell, Same"

p wordy(t)
# => "Hello Jim, This is Charlie and I wanted to say hello goodby Farewell, Lawrence"

8 Comments

Hey, this could work! I was using split before this issue. I just noticed, though, that I now need to account for a lot of other text options: <ol>, <ul>, bold, italics, underline, quotes, etc. Is it possible to include those in the regex example you provided? If so, would you mind amending it for me? This stuff is very confusing to me!
I take it back, it kind of works. I have this new example string with what I discussed, "<p>Hello <strong>Jim</strong>,</p><p> </p><p>This is <em>Charlie</em> and<u> I wanted to say</u></p><ol><li>hello</li><li>goodby</li></ol><p> </p><p>Farewell,</p><p>Lawrence</p>" And using your method, I get this "Hello Jim , This is Charlie and I wanted to say hello goodby Farewell, Lawrence". How can I strip those extra spaces? Maybe do a check to make sure there is only a max of 1 space and no consecutive spaces?
That's it! Thanks so much Sagar!
@vin_Bin87 refactored answer again. I do agree with others that you should use a dedicated library rather than regex. These libraries are tried and tested whereas regex can break and not work for fringe examples.
Hey, it's actually including this &nbsp; in the resulting output..."Hello&nbsp; Jim, This is&nbsp; Daniel . Any idea how to fix this?
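
One possible way to handle the `&nbsp;` case raised in the last comment, as a sketch: replace the entity (and the literal non-breaking space character it stands for) with a plain space before splitting, then collapse runs of spaces. The name `wordy_nbsp` and the sample input are hypothetical, and only `&nbsp;` is handled; other entities would need their own treatment or a real HTML parser.

```ruby
# Hypothetical variant of the wordy method above that also copes
# with &nbsp; entities and literal non-breaking spaces (U+00A0).
def wordy_nbsp(s)
  s.gsub(/&nbsp;|\u00A0/, ' ') # turn non-breaking spaces into plain spaces
   .split(/<.*?>/)             # split on anything that looks like a tag
   .map(&:strip)
   .reject(&:empty?)
   .join(' ')
   .squeeze(' ')               # collapse runs of spaces into one
   .gsub(/\s,/, ',')           # remove the space left before a comma
end

# Assumed sample input, modeled on the output quoted in the comment.
u = "<p>Hello&nbsp; <strong>Jim</strong>,</p><p>This is&nbsp; Daniel</p>"

p wordy_nbsp(u)
# => "Hello Jim, This is Daniel"
```

Doing the replacement first matters: `strip` does not treat U+00A0 as whitespace, so a non-breaking space left in place would survive the later steps.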
2

Unfortunately, Nokogiri::XML::Node#traverse does not return an enumerator when no block is given, which is why we need the ugly hack of defining a local variable up front.

require 'cgi'      # provides CGI.unescapeHTML
require 'nokogiri'

result, input = [], "<p>Hello,</p><p><br></p><p>my name is Same</p>" \
                    "<p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
Nokogiri::HTML(CGI.unescapeHTML(input)).traverse do |e|
  result << e.text if e.text?
end
result.join(' ')
#⇒ "Hello, my name is Same Farewell, Same"


0

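My solution (using non-mutating gsub, because gsub! returns nil when nothing matches, which would make the chained strip fail):

description = description.gsub(/<("[^"]*"|'[^']*'|[^'">])*>/, ' ').strip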
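
A quick check of a quote-aware tag pattern like the one above against the question's sample string, as a sketch. It uses non-mutating `gsub` (since `gsub!` returns `nil` when nothing was replaced) and adds `squeeze(' ')`, an assumption added here, to collapse the runs of spaces left behind by adjacent tags. The helper name `strip_tags` is hypothetical.

```ruby
# Quote-aware tag pattern: a ">" inside a quoted attribute value
# does not end the tag.
TAG = /<("[^"]*"|'[^']*'|[^'">])*>/

# Hypothetical helper: replace each tag with a space, then tidy up.
def strip_tags(html)
  html.gsub(TAG, ' ').squeeze(' ').strip
end

s = "<p>Hello,</p><p><br></p><p>my name is Same</p><p><br></p><p><br></p><p>Farewell,</p><p>Same</p>"
p strip_tags(s)
# => "Hello, my name is Same Farewell, Same"

# An attribute value containing ">" is consumed as part of the tag.
p strip_tags('<a title="a > b">link</a>')
# => "link"

# Input without any tags passes through unchanged.
p strip_tags('no tags here')
# => "no tags here"
```

One limitation to be aware of: because every tag becomes a space, inline markup directly before punctuation (for example `<strong>Jim</strong>,`) leaves a space before the comma, so an extra cleanup step would still be needed for that case.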
