JavaScript: use a regex to replace content outside of HTML tags only

Question

I am trying to write a regular expression in JavaScript to replace strings that are outside of HTML tags, and to ignore the strings within HTML tags.

Here's my JavaScript code:

var content = 'Hi, my <span user="John">name</span> is &nbsp;John';
var user = 'John';
var regex = new RegExp('(&nbsp;)?' + user, 'g');
// Keep "&nbsp;John" as is; replace any other "John" with the image.
content = content.replace(regex, function($0, $1) {
    return $1 ? $0 : '<img src="images/user.png">';
});

My regex is "(&nbsp;)?John".

The pattern works the way I want, but it also matches inside tags (for example, in attribute values), which I don't want.

So the idea is to ignore everything between < and >, and also to ignore &nbsp;John.

Can it be done?

Look what I found in the related questions. You should probably try to create a DOM from your input string and then iterate over text nodes only.
– Martin Ender, Commented Jun 28, 2013 at 21:27
@m.buettner is right; regex is not the right tool to parse HTML. It's really easy to parse a string into DOM nodes if you use a JavaScript library, though. For example, jQuery has a great parse function: api.jquery.com/jQuery.parseHTML
– user428517, Commented Jun 28, 2013 at 21:40
I see you want to replace every occurrence of the word John with an image, except those that are inside attributes. Is that right? Or is it also required that a &nbsp; precedes the word John (like # is the hashtag marker for Twitter)?
– acdcjunior, Commented Jun 29, 2013 at 0:21
@acdcjunior he replaces all text instances that are not preceded by &nbsp;
– Martin Ender, Commented Jun 29, 2013 at 0:57
Filip, can you provide some desired output to match your sample text?
– Ro Yo Mi, Commented Jun 29, 2013 at 2:02
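
The DOM suggestion in the comments above could look roughly like the following. This is only a sketch under stated assumptions: `collectTextNodes` and `markUser` are hypothetical names, the tree only needs to expose `nodeType`, `childNodes` and `nodeValue`, and the replacement is plain marker text rather than an image element.

```javascript
// Node.TEXT_NODE has the value 3 in the DOM.
const TEXT_NODE = 3;

// Depth-first collection of text nodes under `root`.
// Works on any DOM-like tree that exposes nodeType and childNodes.
function collectTextNodes(root, found = []) {
  for (const child of Array.from(root.childNodes || [])) {
    if (child.nodeType === TEXT_NODE) {
      found.push(child);
    } else {
      collectTextNodes(child, found);
    }
  }
  return found;
}

// Hypothetical helper: rewrite `user` in text nodes only, leaving
// attributes untouched. After parsing, "&nbsp;" is the character U+00A0.
function markUser(root, user, marker) {
  const pattern = new RegExp('(\\u00A0)?' + user, 'g');
  for (const node of collectTextNodes(root)) {
    node.nodeValue = node.nodeValue.replace(pattern, (match, nbsp) =>
      nbsp ? match : marker
    );
  }
}
```

In a browser, `root` could be `new DOMParser().parseFromString(content, 'text/html').body`. Swapping in an actual `<img>` element instead of marker text would additionally require splitting the text node, which this sketch leaves out.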

Ro Yo Mi · Accepted Answer · 2013-06-29 14:33:06Z

2

Description

This regex will match John providing it is either at the start or end of the string and/or has white space on either side.

Regex to match John: (?:\s|&nbsp;|^)(John)(?=\s|\r|\n|$)

This regex builds on the previous one and also matches all HTML tags and plain-text URLs. The order of the alternatives is important: tags and URLs are matched first, so John is only captured when it is outside an HTML tag and not embedded in a URL.

Regex: https?:\/\/[^\s]*|<\/?\w+\b(?=\s|>)(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>])*>|\&nbsp;John|(John)

If you pass this last regex to your replace function, only the occurrences of John outside tags and URLs will be replaced.

Javascript Example

Working example: http://repl.it/J4T

Code

var content = "<span name=\"John\" funnytag:John>John John &nbsp;John DoeJohn JohnDoe Mr.JohnDoe http://cool.guy.john/LikesKittens</span>";
var rePattern = /https?:\/\/[^\s]*|<\/?\w+\b(?=\s|>)(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>])*>|\&nbsp;John|(John)/gi;

content.replace(rePattern, function(match, capture) {
    return capture ? "<img src=\"images/user.png\">" : match;
});

Output

<span name="John" funnytag:John><img src="images/user.png"> <img src="images/user.png"> &nbsp;John Doe<img src="images/user.png"> <img src="images/user.png">Doe Mr.<img src="images/user.png">Doe http://cool.guy.john/LikesKittens</span>
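
The same alternation trick can be wrapped in a small helper that takes the name as a parameter, as the question does with its `user` variable. This is a minimal sketch, not the code from this answer: `escapeRegExp` and `replaceOutsideTags` are hypothetical names, and the tag pattern is simplified to `<[^>]*>`, which assumes attribute values never contain `>`.

```javascript
// Escape regex metacharacters so the name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace `name` with `replacement` only in text outside tags and URLs.
// Alternatives are tried left to right, so URLs, tags and "&nbsp;name"
// are consumed first and returned unchanged; only the last one captures.
function replaceOutsideTags(html, name, replacement) {
  const safe = escapeRegExp(name);
  const pattern = new RegExp(
    'https?:\\/\\/\\S*|<[^>]*>|&nbsp;' + safe + '|(' + safe + ')',
    'g'
  );
  return html.replace(pattern, (match, captured) =>
    captured ? replacement : match
  );
}

const sample = 'Hi, my <span user="John">name</span> is &nbsp;John, not John';
console.log(replaceOutsideTags(sample, 'John', '<img src="images/user.png">'));
```

Escaping the name matters when it comes from user input, since a name such as `J.R.` would otherwise be read as a pattern.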

edited Jun 29, 2013 at 14:33

answered Jun 29, 2013 at 2:16

Ro Yo Mi


4 Comments

user472268 Over a year ago

Works almost great :). One problem is that John is replaced only if it is surrounded by spaces. E.g. <b>John or JohnDoe are not replaced. Another problem is that when I have " John" or &nbsp;John, the replacement image removes the space in front of John. And one more thing: "&nbsp;John" should remain as it is. The example I gave in the question replaced every "John" except the one preceded by &nbsp;, and the idea was to continue with the same pattern, plus ignoring John inside tags. I would fix it myself, but the expression is too complex for my knowledge.

Ro Yo Mi Over a year ago

Updated to cover your examples.

Iulius Curt Over a year ago

What is this sorcery?

WebWanderer Over a year ago

I can't get over how wonderful this is. It works perfectly. You're a wizard, Denomales. Thanks again!

Adi Inbar · Answer · 2013-06-28 23:39:30Z

0

If I understand correctly, you're saying that you want to replace anything matching the regex as long as it's not contained within a tag, i.e. John and optionally a preceding non-breaking space would be replaced with the return value of function($0,$1) unless it appears inside an HTML tag?

If so, you could add this look-behind assertion to the beginning of your regex: (?<!<[^>]*?). That tells the regex to match the pattern if reading backwards from the match it doesn't encounter a < before it encounters a >.

This would be your code:

var regex = new RegExp('(?<!<[^>]*?)(&nbsp;)?' + user,'g');
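
Lookbehind assertions were added to JavaScript in ES2018, after this answer was written, so the pattern above only runs in engines that implement them. A minimal sketch under that assumption follows; `replaceUserOutsideTags` is a hypothetical name, and the approach assumes no stray `<` or `>` appears in text or attribute values.

```javascript
// Requires an engine with ES2018 lookbehind support (e.g. current Node.js).
function replaceUserOutsideTags(content, user) {
  // (?<!<[^>]*) rejects a match when an unclosed "<" precedes it,
  // i.e. when the match position is inside a tag.
  const regex = new RegExp('(?<!<[^>]*)(&nbsp;)?' + user, 'g');
  return content.replace(regex, ($0, $1) =>
    $1 ? $0 : '<img src="images/user.png">'
  );
}

const content = 'Hi, my <span user="John">name</span> is &nbsp;John and John';
console.log(replaceUserOutsideTags(content, 'John'));
```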

answered Jun 28, 2013 at 23:39

Adi Inbar


1 Comment

Martin Ender Over a year ago

Javascript does not support lookbehinds
