I am trying to write a regular expression in JavaScript to replace strings that are outside of HTML tags, and to ignore the strings within HTML tags.

Here's my JavaScript code:

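var content = 'Hi, my <span user="John">name</span> is &nbsp;John';
var user = 'John';
var regex = new RegExp('(&nbsp;)?' + user, 'g');
// replace() returns a new string, so the result has to be assigned.
content = content.replace(regex, function ($0, $1) {
    // $1 is set when "&nbsp;" precedes the name: keep that match as it is.
    return $1 ? $0 : '<img src="images/user.png">';
});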

My regex is "(&nbsp;)?John".

The pattern works the way I want to, but it applies the matching to tag data, which I don't want.

So, the idea is to ignore everything inside a tag (between < and >), and also to leave &nbsp;John untouched.
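
To make the problem concrete, here is a small sketch that runs the pattern on the sample and compares the result with what is wanted. The `actual` and `wanted` names are only there for illustration.

```javascript
var content = 'Hi, my <span user="John">name</span> is &nbsp;John';
var regex = new RegExp('(&nbsp;)?' + 'John', 'g');

var actual = content.replace(regex, function ($0, $1) {
    return $1 ? $0 : '<img src="images/user.png">';
});

// The attribute value is rewritten, which breaks the tag:
// Hi, my <span user="<img src="images/user.png">">name</span> is &nbsp;John
console.log(actual);

// Wanted: neither John in this sample should be touched,
// so the string should come back unchanged.
var wanted = content;
console.log(actual === wanted); // false
```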

Can it be done?

  • Look what I found in the related questions. You should probably try to create a DOM from your input string and then iterate over text nodes only. Commented Jun 28, 2013 at 21:27
  • @m.buettner is right; regex is not the right tool to parse HTML. It's really easy to parse a string into DOM nodes if you use a JavaScript library, though. For example, jQuery has a great parse function: api.jquery.com/jQuery.parseHTML Commented Jun 28, 2013 at 21:40
  • I see you want to replace every occurrence of the word John with an image, except those that are inside attributes. Is that right? Or is it required that, in addition, a &nbsp; precedes the word John (like # is the hashtag for twitter)? Commented Jun 29, 2013 at 0:21
  • @acdcjunior he replaces all text-instances that are not preceded by &nbsp; Commented Jun 29, 2013 at 0:57
  • Filip, can you provide some desired output to match your sample text? Commented Jun 29, 2013 at 2:02
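
If building a DOM is not an option, a rough string-level approximation of "text nodes only" is to split the markup on tags and run the original replacement on the pieces in between. This is a different technique from the DOM walk suggested in the comments, and the function name `replaceInText` is made up for illustration. The sketch assumes that no `>` appears inside an attribute value and that `user` contains no regex metacharacters.

```javascript
function replaceInText(content, user) {
    var regex = new RegExp('(&nbsp;)?' + user, 'g');
    // split() with a capturing group keeps the tags in the result:
    // even indexes hold text, odd indexes hold the captured tags.
    var parts = content.split(/(<[^>]*>)/);
    for (var i = 0; i < parts.length; i += 2) {
        parts[i] = parts[i].replace(regex, function ($0, $1) {
            // Keep "&nbsp;John" as it is, replace a bare "John".
            return $1 ? $0 : '<img src="images/user.png">';
        });
    }
    return parts.join('');
}

console.log(replaceInText('John <i title="John">John</i> and &nbsp;John', 'John'));
```

Only the two bare occurrences in the text are replaced here; the `title` attribute and `&nbsp;John` are left alone.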

2 Answers

Description

This regex matches John provided that it is preceded by whitespace, &nbsp;, or the start of the string, and followed by whitespace or the end of the string.

Regex to match John: (?:\s|&nbsp;|^)(John)(?=\s|\r|\n|$)
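
A quick way to see what this first pattern does and does not match is to collect its matches on a short string. The sample text and variable names below are made up for illustration.

```javascript
var re = /(?:\s|&nbsp;|^)(John)(?=\s|\r|\n|$)/g;
var sample = 'John met JohnDoe and &nbsp;John';

var found = [];
var m;
while ((m = re.exec(sample)) !== null) {
    // m[0] is the whole match, including the leading delimiter;
    // m[1] is only the captured name.
    found.push(m[0]);
}

// Two matches: "John" at the start and "&nbsp;John" at the end.
// "JohnDoe" is skipped because no whitespace follows the name.
console.log(found);
```

Note that the whole match includes the leading whitespace or `&nbsp;`, so replacing the whole match also replaces that delimiter.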

This regex builds on the previous one and also matches all HTML tags and plain-text URLs. The order is important here: tags and URLs are tried first, so (John) will only match when John is outside an HTML tag and not embedded in a URL.

Regex: https?:\/\/[^\s]*|<\/?\w+\b(?=\s|>)(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>])*>|\&nbsp;John|(John)

If you take this last regex and pass it through your function, then only Johns outside the tags & urls will be replaced with a string.

Javascript Example

Working example: http://repl.it/J4T

Code

var content = "<span name=\"John\" funnytag:John>John John &nbsp;John DoeJohn JohnDoe Mr.JohnDoe http://cool.guy.john/LikesKittens</span>";
var rePattern = /https?:\/\/[^\s]*|<\/?\w+\b(?=\s|>)(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>])*>|\&nbsp;John|(John)/gi;

// Tags, URLs and "&nbsp;John" match without a capture and are returned unchanged;
// only a bare "John" fills the capture group and gets replaced.
var result = content.replace(rePattern, function(match, capture) {
    return capture ? "<img src=\"images/user.png\">" : match;
});
console.log(result);

Output

<span name="John" funnytag:John><img src="images/user.png"> <img src="images/user.png"> &nbsp;John Doe<img src="images/user.png"> <img src="images/user.png">Doe Mr.<img src="images/user.png">Doe http://cool.guy.john/LikesKittens</span>
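
Since the question builds its pattern from a `user` variable, the same idea can be wrapped in a function that takes the name as a parameter. This is only a sketch: `escapeRegExp`, `SKIP` and `replaceOutsideTags` are names invented here, and the tag and URL alternatives are copied unchanged from the pattern above.

```javascript
// Escape regex metacharacters so the name is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Alternatives that are matched first and left alone: plain-text URLs and HTML tags.
var SKIP = /https?:\/\/[^\s]*|<\/?\w+\b(?=\s|>)(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>])*>/.source;

function replaceOutsideTags(content, user, replacement) {
    var name = escapeRegExp(user);
    // "&nbsp;<name>" is matched without a capture, a bare name with one.
    var pattern = new RegExp(SKIP + '|&nbsp;' + name + '|(' + name + ')', 'gi');
    return content.replace(pattern, function (match, capture) {
        return capture ? replacement : match;
    });
}

var out = replaceOutsideTags(
    'Hi, my <span user="John">name</span> is &nbsp;John and John',
    'John',
    '<img src="images/user.png">'
);
// Hi, my <span user="John">name</span> is &nbsp;John and <img src="images/user.png">
console.log(out);
```

Returning the replacement from a callback, rather than passing it as a string, means any `$` in it is inserted literally.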

4 Comments

Works almost great :). One problem is that John is replaced only if it is surrounded with spaces. E.g. <b>John or JohnDoe are not replaced. Another problem is that when I have " John" or &nbsp;John, the replacement image removes the space in front of John. And one more thing, "&nbsp;John" should remain the same as it is. The example I gave in the question was replacing every "John" except the one that was starting with &nbsp; and the idea was to continue with the same pattern, plus ignoring John in tags. I would fix it by myself, but it is too complex expression for my knowledge.
Updated to cover your examples.
What is this sorcery?
I can't get over how wonderful this is. It works perfectly. You're a wizard, Denomales. Thanks again!

If I understand correctly, you're saying that you want to replace anything matching the regex as long as it's not contained within a tag, i.e. John and optionally a preceding non-breaking space would be replaced with the return value of function($0,$1) unless it appears inside an HTML tag?

If so, you could add this look-behind assertion to the beginning of your regex: (?<!<[^>]*?). That tells the regex to match only if, reading backwards from the match, it doesn't encounter a < before it encounters a >.

This would be your code:

var regex = new RegExp('(?<!<[^>]*?)(&nbsp;)?' + user,'g');
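
Lookbehind assertions were only added to JavaScript in ES2018, so engines from before that reject this pattern with a SyntaxError. In an engine that does support them, a run on a slightly extended version of the question's sample might look like this sketch (the extra " and John" at the end is added here for illustration):

```javascript
var content = 'Hi, my <span user="John">name</span> is &nbsp;John and John';
var user = 'John';
var regex = new RegExp('(?<!<[^>]*?)(&nbsp;)?' + user, 'g');

var result = content.replace(regex, function ($0, $1) {
    // Keep "&nbsp;John" as it is, replace a bare "John".
    return $1 ? $0 : '<img src="images/user.png">';
});

// Hi, my <span user="John">name</span> is &nbsp;John and <img src="images/user.png">
console.log(result);
```

The lookbehind treats any `>` as the end of a tag, so a `>` inside an attribute value would make it misjudge where the tag ends.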

1 Comment

Javascript does not support lookbehinds
