Regex - matching html element with child elements on multiple lines

Question

I have a simple piece of HTML code.

<tr>
OtherElement
</tr>
<tr>
HelloWorld
</tr>

I need to match the <tr></tr> element containing HelloWorld. I am using this regular expression, but it matches the first element as well.

<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>

I am using Node.js, so I cannot use lookbehind.

Do you really need (melius abundare quam deficere) to parse broken HTML with regexes? Oh, and where are the multiple lines of child elements?!
– Adriano Repetti, Commented Jan 8, 2016 at 16:56

Wiktor Stribiżew · Accepted Answer · 2016-01-09 00:34:20Z

1

I assume you receive the HTML fragment as a string. In that case, parse it with a DOM parser and keep only those tr elements that contain (rather than equal) the string HelloWorld. The tr tags have to be renamed to a custom tag name first, since the HTML parser drops tr elements that are not inside a table and parsing would otherwise fail.

var $txt = "<tr>\nOtherElement\n</tr>\n<tr>Initial text\nHelloWorld\nSome other text</tr>";
var $el = document.createElement( 'body' );
$el.innerHTML = $txt.replace(/<(\/?)tr\b([^<]*)>/g, "<$1tablerows$2>"); // normalize TR tags as tablerows tags
var $arr = [];
// Collect the text of every renamed row that contains the search string
[].forEach.call($el.getElementsByTagName("tablerows"), function(v,i,a) {
    if (v.innerText.indexOf("HelloWorld") > -1) {
        $arr.push(v.innerText);
    }
});
document.write(JSON.stringify($arr, null, 4));

A regex solution is nasty and fragile, but possible:

<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>

The regex uses the unroll-the-loop technique to match the closest subpatterns.

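<tr\b[^<]*> - matches an opening TR tag, with or without attributes
[^<]*(?:<(?!tr\b)[^<]*)* - matches any text that contains no <tr, up to the literal that follows
HelloWorld - the literal character sequence
[^<]*(?:<(?!\/tr>)[^<]*)* - matches any text up to, but not including, the closing </tr>
<\/tr> - matches the closing TR tag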
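
To see the pattern at work outside a browser, here is a minimal Node.js sketch. It assumes the input is the two-row sample from the question; the names html, rowWithHello and matchedRow are made up for this illustration.

```javascript
// Sample input from the question, each row spread over several lines.
const html = "<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>";

// The pattern from this answer, unchanged.
const rowWithHello = /<tr\b[^<]*>[^<]*(?:<(?!tr\b)[^<]*)*HelloWorld[^<]*(?:<(?!\/tr>)[^<]*)*<\/tr>/;

// match() returns null when nothing matches, so guard before indexing.
const found = html.match(rowWithHello);
const matchedRow = found ? found[0] : null;

// Prints the second row only: "<tr>\nHelloWorld\n</tr>"
console.log(JSON.stringify(matchedRow));
```

The attempt that starts at the first <tr fails because the negative lookahead (?!tr\b) refuses to step over the second opening tag, so the engine moves on and the match begins at the second row.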

answered Jan 9, 2016 at 0:34

Wiktor Stribiżew


2 Comments

JeFf Over a year ago

I am using Node.js so I cannot use a DOM parser, but your regex solution works like a charm. Thanks

Wiktor Stribiżew Over a year ago

Not sure if that answer is still relevant, but it says you can use the npm modules jsdom and htmlparser to create and parse a DOM in Node.js.

user663031 · Accepted Answer · 2016-01-08 17:22:12Z

1

Don't parse HTML with regexps. Instead, use DOM routines and properties:

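function find_hello_world() {
  var trs = document.querySelectorAll('tr');

  for (var i = 0; i < trs.length; i++) {
    // Trim first: the text in the question is surrounded by newlines.
    if (trs[i].textContent.trim() === "HelloWorld") return trs[i];
  }
  return null; // no matching row
}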

answered Jan 8, 2016 at 17:22

user663031

1 Comment

JeFf Over a year ago

I cannot use the DOM since I am not in the browser but in a Node.js environment.

Thriggle · Accepted Answer · 2016-01-08 17:33:08Z

1

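There's an error in your regular expression. This character class is too permissive: [\s\S]*? It matches any character, including newlines and tags, so a match that starts at the first <tr can run across that row's </tr> and into the next row.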

Try the following:

<tr>\s*HelloWorld\s*<\/tr>

\s* means 0 or more whitespace characters and nothing else.
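
The difference between the two patterns can be checked with a short Node.js sketch. The input is the sample from the question; the variable names are made up for this illustration.

```javascript
const html = "<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>";

// Pattern from the question: the lazy [\s\S]*? may still cross </tr>.
const tooPermissive = /<tr[\s\S]*?HelloWorld[\s\S]*?<\/tr>/;
// Pattern suggested above: only whitespace is allowed around the text.
const whitespaceOnly = /<tr>\s*HelloWorld\s*<\/tr>/;

const loose = html.match(tooPermissive)[0];
const strict = html.match(whitespaceOnly)[0];

console.log(loose === html);          // true: the match spans both rows
console.log(JSON.stringify(strict));  // "<tr>\nHelloWorld\n</tr>"

// Limitation: any markup or other text inside the row defeats the strict pattern.
console.log(whitespaceOnly.test("<tr><td>HelloWorld</td></tr>")); // false
```

The stricter pattern only fits rows whose entire content is HelloWorld plus whitespace, and whose opening tag has no attributes.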

And you may want to examine why you're using RegEx to parse HTML. This can be a useful approach for working with string snippets of known HTML, such as from a database, but in JavaScript you're probably better off using an XML parser or the DOM query selector methods.
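
For string snippets of known HTML, one alternative that avoids a single complicated pattern is to split the work into two steps: collect every row, then filter by content. The sketch below is only an illustration under stated assumptions: findRowsContaining is a hypothetical helper that does not come from any answer here, tables are not nested, and the search text does not also appear inside tag attributes.

```javascript
// Hypothetical helper, not taken from any answer in this thread.
function findRowsContaining(html, needle) {
  // Step 1: collect every <tr>...</tr> block; the lazy quantifier
  // stops at the first </tr>, so nested tables are not supported.
  const rows = html.match(/<tr\b[^>]*>[\s\S]*?<\/tr>/gi) || [];
  // Step 2: keep only the rows whose source contains the text.
  return rows.filter((row) => row.includes(needle));
}

const sample = "<tr>\nOtherElement\n</tr>\n<tr>\nHelloWorld\n</tr>";
const hits = findRowsContaining(sample, "HelloWorld");

console.log(hits.length);              // 1
console.log(JSON.stringify(hits[0]));  // "<tr>\nHelloWorld\n</tr>"
```

Because each row is isolated before the text check, the filter cannot spill from one row into the next, which is the problem the original pattern had.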

edited Jan 8, 2016 at 17:33

answered Jan 8, 2016 at 17:02

Thriggle

2 Comments

user663031 Over a year ago

How is [\s] different from \s?

Thriggle Over a year ago

@torazaburo it's not... That's what I get for modifying somebody else's RegEx instead of starting from scratch! Thanks for the correction, I've edited my answer.
