How do I remove a substring from a string in Ruby?

Question

I have the following string, and I want to remove everything between the <EMAIL> and </EMAIL> tags, including the tags themselves:

"Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"

I use the following to remove it:

string = string.gsub(/<EMAIL>(.*)<\/EMAIL>/, '').strip

It does not work.

When I remove the \n from the string (I'd prefer not to because it makes formatting and inputting more limiting), then I get the following:

=> "Great, I will send you something at [email protected]."

In other words, it works when I remove that.

How do I change my gsub statement to accommodate \n, and why does the \n cause the failure?

Don't do that. Trying to use patterns to manipulate HTML or XML is a path to madness, so instead use a real parser like Nokogiri.
– the Tin Man, Commented Dec 18, 2014 at 20:44

the Tin Man · Accepted Answer · 2015-01-10 04:51:02Z

7

Your string spans multiple lines, and <EMAIL> and </EMAIL> are on different lines. By default, the metacharacter . matches any character except a newline, so (.*) cannot cross the line breaks and the regexp never matches.

You need to use the m (multiline) flag, which makes . match newlines as well:

s = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"
s.gsub(/<EMAIL>(.*)<\/EMAIL>/m, '').strip

This returns:

"Great, I will send you something at [email protected]."
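To see the difference in isolation, here is a minimal sketch using a short made-up string (the text is hypothetical; only the placement of the newlines matters). It runs the same pattern with and without the m flag:

```ruby
# Hypothetical sample text; only the newline placement matters here.
text = "before\n<EMAIL>first line\nsecond line</EMAIL>"

# Without /m, `.` stops at "\n", so nothing can match across the two lines
# and gsub returns the string unchanged.
without_m = text.gsub(/<EMAIL>(.*)<\/EMAIL>/, '').strip

# With /m, `.` also matches "\n", so the whole element is removed.
with_m = text.gsub(/<EMAIL>(.*)<\/EMAIL>/m, '').strip

puts without_m == text  # prints true
puts with_m             # prints before
```

Note that in Ruby the m flag only changes what . matches; ^ and $ already match at line boundaries without it.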

edited Jan 10, 2015 at 4:51

the Tin Man

answered Dec 18, 2014 at 17:04

SirDarius


2 Comments

Isaac Over a year ago

And if you want to keep the '\n', just remove the ".strip". You'll get "Great, I will send you something at [email protected].\n ".

SirDarius Over a year ago

@Isaac use backticks ( ` ) to delimit code fragments.

Community · Accepted Answer · 2017-05-23 11:57:52Z

What you're doing can work, but it's very fragile, and as a result is not recommended. Instead, use a parser like Nokogiri:

require 'nokogiri'

str = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"

Here's how to parse the document:

doc = Nokogiri::XML::DocumentFragment.parse(str)

If the string were a valid XML document, I could use a shorter call to parse it:

doc = Nokogiri::XML(str)

Now find and remove the tag and its contents:

doc.at('EMAIL').remove
puts doc.to_xml
# >> Great, I will send you something at [email protected].

at finds the first node matching the selector, here the <EMAIL> tag, and accepts either a CSS or an XPath selector. Related methods include search, which returns all matching nodes, and the selector-specific at_css, at_xpath, css and xpath.

XML/HTML parsers break the text down into nodes, making it easy to find things and manipulate them. The text can change, and as long as it's valid HTML or XML, properly written code will continue to work.

See the obligatory "RegEx match open tags except XHTML self-contained tags".

Regular expressions break down badly if there are embedded duplicate tags, something like:

<b>bold <i>italic <b>another bold</b></i></b>

Trying to strip the <b> tags with patterns only would be painful. It's more easily done with a parser.
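As a rough illustration of why, here is a small sketch with two made-up inputs (both strings are hypothetical). Neither the non-greedy nor the greedy form of the pattern copes with every arrangement of tags:

```ruby
# Hypothetical inputs for illustration.
nested   = "<b>bold <i>italic <b>another bold</b></i></b> tail"
siblings = "<b>one</b> keep <b>two</b>"

non_greedy = %r{<b>.*?</b>}m
greedy     = %r{<b>.*</b>}m

# Non-greedy stops at the first </b>, which belongs to the inner <b>,
# so the outer element's closing tags are left behind.
p nested.gsub(non_greedy, '')   # prints "</i></b> tail"

# Greedy runs to the last </b> in the string, so it also swallows
# the text between two separate elements.
p siblings.gsub(greedy, '')     # prints ""
```

A parser tracks which closing tag belongs to which opening tag, so it avoids both failure modes.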

If I was absolutely bound-and-determined to do it without using a parser, this would work:

foo = "Great, I will send you something at [email protected].\n <EMAIL><ADDRESS>asdf</ADDRESS><SUBJECT>sdfg</SUBJECT>\n <BODY>dfgh</BODY></EMAIL>" 
foo.gsub(%r#<EMAIL>.*?</EMAIL>#im, '').strip
# => "Great, I will send you something at [email protected]."

Or:

foo.gsub(%r#\s*<EMAIL>.*?</EMAIL>\s*#im, '')
# => "Great, I will send you something at [email protected]."

I prefer the first of these two because it's visually clearer.

Use the i flag to make the pattern case-insensitive, so it matches both <email> and <EMAIL>. Use the m flag so that . also matches line-ends. By default . does not match a newline, so a pattern built on .* cannot match across lines. The non-greedy .*? stops at the first </EMAIL> rather than the last.
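A quick way to check what each flag contributes is to drop one at a time. This is only a sketch, and the lowercase sample string is made up for the purpose:

```ruby
# Hypothetical input: lowercase tags, with a newline inside the element.
lower = "keep\n<email>line one\nline two</email>"

# Both flags: case is ignored and `.` crosses the newline, so it matches.
with_im = lower.gsub(%r#<EMAIL>.*?</EMAIL>#im, '').strip

# Only m: `.` crosses newlines, but <EMAIL> does not match <email>.
only_m = lower.gsub(%r#<EMAIL>.*?</EMAIL>#m, '')

# Only i: the tags match, but `.` cannot get past the "\n".
only_i = lower.gsub(%r#<EMAIL>.*?</EMAIL>#i, '')

puts with_im          # prints keep
puts only_m == lower  # prints true
puts only_i == lower  # prints true
```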

I'd prefer not to because it makes formatting and inputting more limiting

Sometimes it's easier to strip something like a trailing newline in the pattern, then re-add it later. If the choice is between maintaining a little Ruby code or a complicated pattern, I'd take the Ruby code. Patterns are powerful and I use them, but they're not the answer to everything.

I posted another question about parsing -- seems like you are the expert: stackoverflow.com/questions/27680007/…
I'm glad that helped. I'll look at your other question and see if I can help.
