1

Here's a weird .NET regex problem that I can't figure out. I'm trying to unparse some HTML in my forum app. I haven't changed the code, but in certain environments, the regex simply never returns. I can reproduce it in the app:

line 66: https://github.com/POPWorldMedia/POPForums/blob/master/PopForums/Services/TextParsingService.cs

text = Regex.Replace(text, @"(<iframe )(\S+ )*(src=""http://www.youtube.com/embed/)(\S+)("")( *\S+)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

The input string it's choking on is:

<p>This is an <strong>important</strong> <em>preview</em> of a post.</p>[quote]<p>This is a quote.<br /></p>[/quote]<p><iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe></p><p>O look! YouTube!</p>

It will eventually time out here: http://regexlib.com/RETester.aspx

The host process, IIS in this case, goes to about 50% locally (one core, I assume) and never lets go or returns. I'm completely stumped. The same code is running on one of my sites in Azure and it doesn't choke there.

2
  • 3
    obligatory: stackoverflow.com/a/1732454/3312 Commented Oct 3, 2014 at 20:25
  • 1
    Yeah, I've seen that before, and I tend to reject it. This is a known and predictable pattern. Commented Oct 3, 2014 at 20:40

2 Answers 2

3

The (\S+ )* and ( *\S+)* parts cause a lot of backtracking.

Consider replacing them simply by .*. It is not 100% equivalent, but I think it should work with what I feel like you are trying to do.

text = Regex.Replace(text, @"(<iframe )(.)*(src=""http://www.youtube.com/embed/)(\S+)("")(.)*( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

You'll have other problems with that regex since it maches greedily. You might want to try this instead to make sure you don't have any problems if there are several iframe tags within your text.

text = Regex.Replace(text, @"(<iframe )(.)*?(src=""http://www.youtube.com/embed/)(\S+)("")(.)*?( */iframe>)", "http://www.youtube.com/watch?v=$4", RegexOptions.IgnoreCase);

As always, you should also consider using an HTML parser instead of regex for this kind of task.

Sign up to request clarification or add additional context in comments.

3 Comments

Yep, that did it! I knew it was a greed problem, but wasn't sure if I was doing it right. Thank you very much!
Why are you using (.)* and (.)*?? If you want to capture those bits, the quantifiers should be inside the group: (.*) and (.*?). But I don't think you need those groups at all.
I mindlessly replaced the contents of the capture groups with ., my bad. But indeed, they might not be needed at all.
1

Your only problem is this ( \ * \S+ )*.

Engines get particularly annoyed mixing (zero/many* with many+)* inside a zero/many group.
Singularize the many in that case and problem is solved. ie: ( _* _+)* to => ( _* _)*
These are the only places where it causes problems, especially when the many could match a lot
of different characters.

Always look to check that first, and don't get paranoid about backtracking.

 # @"(<iframe\ )(\S+\ )*(src=""http://www\.youtube\.com/embed/)(\S+)("")(\ *\S)*(\ */iframe>)"

 ( <iframe\  )                            # (1)
 ( \S+ \  )*                              # (2)
 (                                        # (3 start)
      src="http://www \. youtube \. com/embed/ 
 )                                        # (3 end)
 ( \S+ )                                  # (4)
 ( " )                                    # (5)
 ( \ * \S )*                              # (6)
 ( \ */iframe> )                          # (7)

Output:

 **  Grp 0 -  ( pos 119 , len 121 ) 
<iframe width="640" height="360" src="http://www.youtube.com/embed/Zey3WWThErw" frameborder="0" allowfullscreen></iframe>  
 **  Grp 1 -  ( pos 119 , len 8 ) 
<iframe   
 **  Grp 2 -  ( pos 139 , len 13 ) 
height="360"   
 **  Grp 3 -  ( pos 152 , len 34 ) 
src="http://www.youtube.com/embed/  
 **  Grp 4 -  ( pos 186 , len 11 ) 
Zey3WWThErw  
 **  Grp 5 -  ( pos 197 , len 1 ) 
"  
 **  Grp 6 -  ( pos 231 , len 1 ) 
<  
 **  Grp 7 -  ( pos 232 , len 8 ) 
/iframe>  

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.