I have the following string, and I want to remove everything between the <EMAIL> tags, including the tags themselves:

"Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>" 

I use the following to remove it:

string =  string.gsub(/<EMAIL>(.*)<\/EMAIL>/, '').strip

It does not work.

When I remove the \n from the string (I'd prefer not to, because it makes formatting and inputting more limiting), I get the following:

=> "Great, I will send you something at [email protected]."

In other words, it works when I remove that.

How do I change my gsub call to accommodate \n, and why does the \n cause the failure?

  • PS: using regex on markup is a bad habit to get into... Commented Dec 18, 2014 at 17:06
  • Don't do that. Trying to use patterns to manipulate HTML or XML is a path to madness, so instead use a real parser like Nokogiri. Commented Dec 18, 2014 at 20:44

2 Answers

Your string contains newlines, and <EMAIL> and </EMAIL> end up on different lines. The pattern never matches because .* cannot get from one tag to the other.

This is because, in the default mode, the metacharacter . stands for any character except a newline.
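
A minimal sketch of that behaviour, using a shortened made-up string (the address and tag contents are placeholders, not the asker's data). It only illustrates how . treats "\n" with and without the m flag:

```ruby
# Placeholder text: the address and tag contents are invented for illustration.
text = "Send it to user@example.com.\n  <EMAIL><BODY>Hi</BODY>\n  </EMAIL>"

# Without /m, "." stops at "\n", so the pattern cannot reach </EMAIL>.
without_m = text.gsub(/<EMAIL>.*<\/EMAIL>/, '')

# With /m, "." also matches "\n", so the whole block is removed.
with_m = text.gsub(/<EMAIL>.*<\/EMAIL>/m, '').strip

puts without_m == text  # true: nothing was removed
puts with_m             # Send it to user@example.com.
```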

You need to use the m (multiline) flag:

s = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"
s.gsub(/<EMAIL>(.*)<\/EMAIL>/m, '').strip

This returns:

"Great, I will send you something at [email protected]."

2 Comments

And if you want to keep the '\n', just remove the ".strip". You'll get "Great, I will send you something at [email protected].\n    " (the newline plus the indentation that preceded <EMAIL>).
@Isaac use backticks ( ` ) to delimit code fragments.

What you're doing can work, but it's very fragile, and as a result is not recommended. Instead, use a parser like Nokogiri:

require 'nokogiri'

str = "Great, I will send you something at [email protected].\n    <EMAIL><ADDRESS>[email protected]</ADDRESS><SUBJECT>Quick note on [email protected]</SUBJECT>\n      <BODY>Hi, just dropping you a quick note.</BODY></EMAIL>"

Here's how to parse the document:

doc = Nokogiri::XML::DocumentFragment.parse(str)

If the string were a well-formed XML document (a single root element with no text outside it), I could use a shorter method to parse it:

doc = Nokogiri::XML(str)

Now find and remove the tag and its contents:

doc.at('EMAIL').remove
puts doc.to_xml
# >> Great, I will send you something at [email protected].

at finds the first <EMAIL> element; here its argument is treated as a CSS selector. Related methods find all matching elements (search), or accept only CSS selectors (at_css, css) or only XPath selectors (at_xpath, xpath).

XML/HTML parsers break the text down into nodes, making it easy to find things and manipulate them. The text can change, and as long as it's valid HTML or XML, properly written code will continue to work.

See the obligatory "RegEx match open tags except XHTML self-contained tags".

Regular expressions break down badly if tags of the same name are nested, something like:

<b>bold <i>italic <b>another bold</b></i></b>

Trying to strip the <b> tags with patterns only would be painful. It's more easily done with a parser.
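
To make the problem concrete, here is a small sketch of what simple patterns do with that nested string. The sibling example is an extra made-up input, not something from the question:

```ruby
html = "<b>bold <i>italic <b>another bold</b></i></b>"

# The lazy pattern pairs the first <b> with the first </b>,
# which actually closes the inner tag, so stray closing tags are left over.
lazy = html.gsub(%r#<b>.*?</b>#, '')

# The greedy pattern pairs the first <b> with the last </b>.
# That happens to cover this string, but it overshoots on sibling tags.
greedy = html.gsub(%r#<b>.*</b>#, '')
siblings_result = "<b>one</b> keep <b>two</b>".gsub(%r#<b>.*</b>#, '')

p lazy             # "</i></b>"
p greedy           # ""
p siblings_result  # "" (the text " keep " between the tags is lost)
```

Neither quantifier tracks which closing tag belongs to which opening tag, which is exactly the bookkeeping a parser does for you.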

If I were absolutely bound and determined to do it without using a parser, this would work:

foo = "Great, I will send you something at [email protected].\n <EMAIL><ADDRESS>asdf</ADDRESS><SUBJECT>sdfg</SUBJECT>\n <BODY>dfgh</BODY></EMAIL>" 
foo.gsub(%r#<EMAIL>.*?</EMAIL>#im, '').strip
# => "Great, I will send you something at [email protected]."

Or:

foo.gsub(%r#\s*<EMAIL>.*?</EMAIL>\s*#im, '')
# => "Great, I will send you something at [email protected]."

I prefer the first of these two because it's visually clearer.

Use the i flag to make the pattern case-insensitive: it'll match both <email> and <EMAIL>. Use the m flag to let . match line ends as if they were normal characters. By default . stops at a newline, so a pattern built on .* cannot match across lines.
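
A short sketch of the flags in action, on a made-up string with two lower-case blocks. It also shows why the patterns above use the lazy .*? rather than .*, which matters as soon as there is more than one block:

```ruby
# Invented input: two lower-case blocks, with a line break inside the first one.
text = "Before <email>first\nblock</email> middle <email>second</email> after"

# No flags: the case differs and "." cannot cross "\n", so nothing matches.
no_flags = text.gsub(%r#<EMAIL>.*?</EMAIL>#, '')

# i + m with a lazy quantifier: each block is removed on its own.
with_im = text.gsub(%r#<EMAIL>.*?</EMAIL>#im, '')

# i + m with a greedy quantifier: everything from the first opening tag
# to the last closing tag is removed, including " middle ".
greedy = text.gsub(%r#<EMAIL>.*</EMAIL>#im, '')

p no_flags == text  # true
p with_im           # "Before  middle  after"
p greedy            # "Before  after"
```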

I'd prefer not to because it makes formatting and inputting more limiting

Sometimes it's easier to strip something like a trailing newline in the pattern, then re-add it later. If the choice is between maintaining a little Ruby code or a complicated pattern, I'd take the Ruby code. Patterns are powerful and I use them, but they're not the answer to everything.
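
As a rough sketch of that trade-off: keep the pattern simple and do the whitespace clean-up in plain Ruby. The helper name remove_email_block and the sample message are hypothetical, chosen only for illustration:

```ruby
# Hypothetical helper, not part of the original question or answers.
def remove_email_block(text)
  # Simple pattern: remove the block and nothing else.
  without_block = text.gsub(%r#<EMAIL>.*?</EMAIL>#im, '')
  # Tidy up in Ruby: drop trailing whitespace, then re-add one newline.
  without_block.rstrip + "\n"
end

message = "See you soon.\n    <EMAIL><BODY>Hi</BODY>\n</EMAIL>\n"
p remove_email_block(message)  # "See you soon.\n"
```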

2 Comments

I posted another question about parsing -- seems like you are the expert: stackoverflow.com/questions/27680007/…
I'm glad that helped. I'll look at your other question and see if I can help.
