Regex Match Failure Parsing HTML Nodes

Question

I have a string:

<graphic id="8374932">Translating Cowl (Inner/Outer Bondments</graphic>

And my pattern:

"<graphic id=\"(.*?)\">(.*?)</graphic>"

But it fails on the second group with the error "Not enough )'s." How can I prevent this?

It looks like you're trying to parse XML. Would you like help?
* Use LINQ to XML (recommended)
* Use System.Xml
* Use XPathDocument
– dtb, Commented Sep 18, 2011 at 16:21
Using an online regex tester, this works fine. Does the error come from the method that is given the value in .Groups[2]?
– Austin Salonen, Commented Sep 18, 2011 at 16:22
@Austin: Good point, especially since that is the only place where there actually is a missing )...
– Tim Pietzcker, Commented Sep 18, 2011 at 16:27
I tested your search expression. It seems to work fine. Group1 = "8374932", Group2 = "Translating Cowl (Inner/Outer Bondments".
– Olivier Jacot-Descombes, Commented Sep 18, 2011 at 16:27
Didn't you accidentally switch the input and pattern parameters?
– svick, Commented Sep 18, 2011 at 17:05

ΩmegaMan · Accepted Answer · 2016-03-19 14:53:54Z

EDIT: First off, if your goal is to parse HTML or XML with regex, I strongly advise against it. If your goal is to learn, or to surgically grab a single element node, then regex may (and I stress may) be a suitable tool. I am answering on the assumption that you are using the HTML pattern as a learning exercise.
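For comparison, here is a minimal sketch of the non-regex route using LINQ to XML (the option dtb's comment recommends). It assumes the input is a single well-formed XML element, as in the question, and is written as C# top-level statements; the variable names are only illustrative.

```csharp
using System;
using System.Xml.Linq;

// The element from the question, as a verbatim string ("" is an escaped quote).
string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";

// XElement.Parse throws XmlException if the input is not well-formed XML.
XElement graphic = XElement.Parse(data);

// The explicit conversion yields null when the attribute is missing.
var id = (string)graphic.Attribute("id");

// Value is the concatenated text content of the element.
var content = graphic.Value;

Console.WriteLine("ID: {0} Content: {1}", id, content);
```

The unmatched ( in the text is no problem here, because the parser treats it as ordinary character data.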

I believe you have swapped your data and your pattern, so the regex engine is trying to parse your data as a pattern and is failing on its unmatched (.
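The following sketch shows how that mix-up would produce the error, assuming the call in question is the static Regex.Match(input, pattern). The original calling code is not shown in the question, so this is a guess at its shape, written as C# top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";
string pattern = "<graphic id=\"(.*?)\">(.*?)</graphic>";

// Correct order: input first, pattern second.
Match ok = Regex.Match(data, pattern);
Console.WriteLine("Group 1: {0}", ok.Groups[1].Value);
Console.WriteLine("Group 2: {0}", ok.Groups[2].Value);

// Swapped order: the data is now compiled as a pattern, and its
// unmatched "(" makes pattern parsing throw an ArgumentException.
bool threw = false;
try
{
    Regex.Match(pattern, data);
}
catch (ArgumentException ex)
{
    threw = true;
    Console.WriteLine("Parse error: {0}", ex.Message);
}
```

The exact wording of the exception message varies between .NET versions, so it is printed rather than compared.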

I recommend the following:

Don't use .*? to get text. It tells the regex engine nothing about what the text may contain. Be more specific in your pattern.
Since you know that the text is enclosed in quotes or by >xxx<, use those characters as anchors.
Once the anchors are determined, extract the text between them.
Place the captured text into named capture groups.

How to get the text? Tell the regex engine to match everything that is not an anchor character by using a negated set: ^ means "not" when it is the first character inside [ ]. For example, ([^\"]+) matches one or more characters that are not a quote.
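As a small illustration of the negated set on its own, here is a sketch that pulls out each piece with a separate, minimal pattern. The two patterns are simplified for demonstration and are not meant as a general way to read markup.

```csharp
using System;
using System.Text.RegularExpressions;

string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";

// [^\"]+ : one or more characters that are not a double quote,
// anchored between the quotes that follow id=.
Match idMatch = Regex.Match(data, "id=\"([^\"]+)\"");

// [^<]+ : one or more characters that are not "<",
// anchored between the ">" of the opening tag and the next "<".
Match textMatch = Regex.Match(data, ">([^<]+)<");

Console.WriteLine(idMatch.Groups[1].Value);
Console.WriteLine(textMatch.Groups[1].Value);
```

Because each set stops at its anchor character, the ( inside the element text is matched like any other character.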

Change your pattern to the following, which demonstrates the above suggestions:

using System;
using System.Text.RegularExpressions;

string data = @"<graphic id=""8374932"">Translating Cowl (Inner/Outer Bondments</graphic>";

// \x22 is the hex escape for the double quote; it keeps the verbatim string easier to read.
string pattern = @"
(?:graphic\s+id=\x22)  # Match but don't capture (MBDC) the beginning of the element
(?<ID>[^\x22]+)        # Capture into ID everything that is not a quote
(?:\x22>)              # MBDC the closing quote and >
(?<Content>[^<]+)      # Capture into Content all text that is not <
(?:\</graphic)         # MBDC the start of the closing tag";

// IgnorePatternWhitespace lets us spread the pattern over several lines and add # comments;
// unescaped whitespace in the pattern is ignored.
var mt = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);

Console.WriteLine("ID: {0} Content: {1}", mt.Groups["ID"], mt.Groups["Content"]);

// Outputs:
// ID: 8374932 Content: Translating Cowl (Inner/Outer Bondments
