8

I have a text blob field in a MySQL column that contains HTML. I have to change some of the markup, so I figured I'd do it in a Ruby script. The language isn't essential, but an answer in Ruby would be nice. The markup looks like the following:

<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>

I need to change just the first <h5>foo</h5> block of each text to <h2>something_else</h2> while leaving the rest of the string alone.

I can't seem to get the proper PCRE-style regex working in Ruby.

  • I implore you to consider using an HTML parser instead of regex for HTML. As has been said many, many, many times before, regular expressions are incapable of accurately parsing HTML. Commented Apr 18, 2013 at 21:16
  • Specifically, I recommend using Nokogiri to load your HTML, manipulate it, and then emit the result. Commented Sep 26, 2014 at 19:32

3 Answers

31
# The regex literal syntax using %r{...} allows / in your regex without escaping
new_str = my_str.sub( %r{<h5>[^<]+</h5>}, '<h2>something_else</h2>' )

Using String#sub instead of String#gsub replaces only the first match. If you need to choose what 'foo' is dynamically, you can use string interpolation in regex literals:

new_str = my_str.sub( %r{<h5>#{searchstr}</h5>}, "<h2>#{replacestr}</h2>" )
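One caveat with interpolation: if the search text can contain regex metacharacters such as `.` or `(`, interpolating it directly changes what the pattern matches. A minimal sketch of one way to guard against that, assuming a hypothetical helper named `replace_first_heading` (not part of the original answer), escapes the text with `Regexp.escape` first:

```ruby
# Hypothetical helper: replaces the first <h5> whose text is exactly
# `search` with an <h2> containing `replacement`.
def replace_first_heading(html, search, replacement)
  # Regexp.escape turns metacharacters into literals, so "a.c" no longer
  # matches "abc".
  pattern = %r{<h5>#{Regexp.escape(search)}</h5>}
  # The block form inserts its result verbatim, so backslashes in
  # `replacement` are not interpreted as backreferences.
  html.sub(pattern) { "<h2>#{replacement}</h2>" }
end

html = "<h5>abc</h5>\n<h5>a.c</h5>"
puts replace_first_heading(html, "a.c", "something_else")
```

Without the escape, the pattern `a.c` would match the `abc` heading first, which is probably not what was intended.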

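Then again, if you know exactly what 'foo' is, you don't need a regex at all; a plain string pattern works:

new_str = my_str.sub( "<h5>#{searchstr}</h5>", "<h2>#{replacestr}</h2>" )

or even, modifying the string in place:

# Raises IndexError if the substring is not present in my_str
my_str[ "<h5>#{searchstr}</h5>" ] = "<h2>#{replacestr}</h2>"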

If you need to run code to figure out the replacement, you can use the block form of sub:

new_str = my_str.sub %r{<h5>([^<]+)</h5>} do |full_match|
  # The expression returned from this block will be used as the replacement string
  # $1 will be the matched content between the h5 tags.
  "<h2>#{replacestr}</h2>"
end
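
The comments in the block above mention `$1`; here is a minimal sketch of a block that actually uses the captured text to pick the replacement. The `RENAMES` lookup table and the helper name `promote_first_heading` are hypothetical and only serve the illustration:

```ruby
# Hypothetical mapping from old heading text to new heading text.
RENAMES = { "foo" => "something_else" }.freeze

# Turns the first <h5> into an <h2>, renaming its text when a mapping exists.
def promote_first_heading(html)
  html.sub(%r{<h5>([^<]+)</h5>}) do
    old_text = $1                                 # text between the h5 tags
    new_text = RENAMES.fetch(old_text, old_text)  # fall back to the old text
    "<h2>#{new_text}</h2>"
  end
end

puts promote_first_heading("<h5>foo</h5>\n<h5>bar</h5>")
```

Because this uses `sub`, later `<h5>` headings are left untouched even when their text also appears in the table.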

6

Whenever I have to parse or modify HTML or XML, I reach for a parser. I almost never bother with regex or substring matching unless it's an absolute no-brainer.

Here's how to do it using Nokogiri, without any regex:

text = <<EOT
<h5>foo</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>bar</h5>
  <table>
    <tbody>
    </tbody>
  </table>

<h5>meow</h5>
  <table>
    <tbody>
    </tbody>
  </table>
EOT

require 'nokogiri'

fragment = Nokogiri::HTML::DocumentFragment.parse(text)
print fragment.to_html

fragment.css('h5').select{ |n| n.text == 'foo' }.each do |n|
  n.name = 'h2'
  n.content = 'something_else'
end

print fragment.to_html

After parsing, this is what Nokogiri has returned from the fragment:

# >> <h5>foo</h5>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

This is the output after the modification:

# >> <h2>something_else</h2>
# >>   <table><tbody></tbody></table><h5>bar</h5>
# >>   <table><tbody></tbody></table><h5>meow</h5>
# >>   <table><tbody></tbody></table>

2

Use String#sub (rather than String#gsub, which replaces every match) with the regular expression <h5>[^<]+<\/h5>:

>> current = "<h5>foo</h5>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"
>> updated = current.sub(/<h5>[^<]+<\/h5>/){"<h2>something_else</h2>"}
=> "<h2>something_else</h2>\n  <table>\n    <tbody>\n    </tbody>\n  </table>"

Note that you can test Ruby regular expressions comfortably in your browser.
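
Since the question asks to change only the first heading, the choice between `sub` and `gsub` matters as soon as the text holds more than one `<h5>`. A small sketch comparing the two on a shortened version of the sample markup (the constant `HEADING` and the method names are illustrative, not from the answer):

```ruby
# Matches an <h5> element whose content contains no nested tags.
HEADING = %r{<h5>[^<]+</h5>}

# Replaces only the first matching heading.
def replace_first(html)
  html.sub(HEADING, "<h2>something_else</h2>")
end

# Replaces every matching heading.
def replace_all(html)
  html.gsub(HEADING, "<h2>something_else</h2>")
end

sample = "<h5>foo</h5>\n<h5>bar</h5>\n<h5>meow</h5>"
puts replace_first(sample)  # only the foo heading changes
puts replace_all(sample)    # all three headings change
```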
