I have this string:

string = "<p>para1</p><p>para2</p><p>para3</p>"

I want to split on the para2 text so that I get this:

["<p>para1</p>", "<p>para3</p>"]

The catch is that sometimes para2 might not be wrapped in p tags (and there might be optional spaces outside the p and inside it). I thought that this would do it:

string.split(/\s*(<p>)?\s*para2\s*(<\/p>)?\s*/)

But I get this:

["<p>para1</p>", "<p>", "</p>", "<p>para3</p>"]

The opening and closing p tags are not being absorbed into the match; they should be eliminated as part of the split. Ruby's regular expressions are greedy by default, so I thought they would get pulled in. And this seems to be confirmed if I do a gsub instead of a split:

string.gsub(/\s*(<p>)?\s*para2\s*(<\/p>)?\s*/, "XXX")
=> "<p>para1</p>XXX<p>para3</p>"

The tags are pulled in and removed here, but not in the split. Any ideas, anyone?

thanks, max

    Remember, you can never truly parse HTML with regex. If this string is in any way dependent on outside input, use an HTML parser like hpricot or nokogiri. Commented Jan 29, 2010 at 18:40

1 Answer

Replace your capturing groups (…) with non-capturing groups (?:…):

/\s*(?:<p>)?\s*para2\s*(?:<\/p>)?\s*/
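As a quick check, here is a minimal sketch that runs the non-capturing pattern against the string from the question. The two extra inputs (`bare` and `spaced`) are made-up variations for illustration, covering the case where para2 has no p tags and the case with optional whitespace.

```ruby
# Pattern from the answer: the optional <p> and </p> are matched but not captured.
pattern = /\s*(?:<p>)?\s*para2\s*(?:<\/p>)?\s*/

# The string from the question.
string = "<p>para1</p><p>para2</p><p>para3</p>"
wrapped_parts = string.split(pattern)

# Hypothetical variations: para2 without tags, and with spaces around and inside the tags.
bare   = "<p>para1</p> para2 <p>para3</p>"
spaced = "<p>para1</p> <p> para2 </p> <p>para3</p>"
bare_parts   = bare.split(pattern)
spaced_parts = spaced.split(pattern)

p wrapped_parts # => ["<p>para1</p>", "<p>para3</p>"]
p bare_parts    # => ["<p>para1</p>", "<p>para3</p>"]
p spaced_parts  # => ["<p>para1</p>", "<p>para3</p>"]
```

The match itself is the same as with the capturing version, which is why gsub behaved as expected; only the contents of the array returned by split change.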

3 Comments

This answer is correct. When you split by a regex with capturing groups, it puts the captures into the array, so you can do more complex scanning/splitting operations.
Nifty...didn't know we had that in Ruby!
Thanks Gumbo, that does the trick. I'd never even heard of non-capturing groups before, that's a really useful bit of knowledge.
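The behavior described in the first comment can be seen with a small example. The input string and variable names here are made up for illustration; the sketch only assumes the documented behavior of `String#split`, which includes captured groups in the result.

```ruby
input = "a,b;c"

# Without a capturing group, the delimiters are dropped.
plain = input.split(/[,;]/)

# With a capturing group, each captured delimiter is kept as its own element.
with_delims = input.split(/([,;])/)

p plain       # => ["a", "b", "c"]
p with_delims # => ["a", ",", "b", ";", "c"]
```

This is why the original pattern returned "<p>" and "</p>" as separate elements: each was the text captured by one of the two groups.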
