14

I'm trying to return the contents of any <script> tags in a body of text. I'm currently using the following expression, but it only captures the contents of the first tag and ignores any others after that.

Here's a sample of the html:

    <script type="text/javascript">
        alert('1');
    </script>

    <div>Test</div>

    <script type="text/javascript">
        alert('2');
    </script>

My regex looks like this:

    // scripttext contains the sample
    re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
    var scripts = re.exec(scripttext);

When I run this on IE6, it returns 2 matches: the first contains the full tag, the second contains alert('1').

When I run it on http://www.pagecolumn.com/tool/regtest.htm it gives me 2 results, each containing the script tags only.

3
  • Are you actually writing the regex in JavaScript? Can you include the matching code? Commented Sep 17, 2009 at 21:42
  • Using RegexBuddy 3.2.1, this works fine. It captures the content of both tags. Commented Sep 17, 2009 at 21:43
  • I'm using /gm. I modified the regexp slightly. It's now returning 2 results, each containing a script tag, but it includes the HTML. /<script\b[^>]*>([\s\S]*?)<\/script>/gm How do I return just the content? Commented Sep 17, 2009 at 21:47

6 Answers

47

The "problem" here is in how exec works. It matches only the first occurrence, but, when the g flag is set (as it is here), it stores the position just past that match in the regex's lastIndex property, and the next call resumes from there. To get all matches, simply apply the regex to the string repeatedly until it fails to match (this is a pretty common way to do it):

    var scripttext = ' <script type="text/javascript">\nalert(\'1\');\n</script>\n\n<div>Test</div>\n\n<script type="text/javascript">\nalert(\'2\');\n</script>';

    var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;

    var match;
    while (match = re.exec(scripttext)) {
      // full match is in match[0], whereas captured groups are in ...[1], ...[2], etc.
      console.log(match[1]);
    }
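
The loop above can be wrapped in a small helper that collects every first capture group into an array. This is only a minimal sketch: the function name `extractScriptContents` is hypothetical, and because it reuses the same regex it shares the same limits (for instance, a `</script>` inside a string literal will cut the match short).

```javascript
// Hypothetical helper (not part of the original answer): collects the
// contents of every <script> element found in an HTML string.
function extractScriptContents(html) {
  // A fresh regex per call, so lastIndex always starts at 0.
  var re = /<script\b[^>]*>([\s\S]*?)<\/script>/g;
  var contents = [];
  var match;
  // With the g flag, each exec() call resumes from re.lastIndex and
  // returns null once no further match exists.
  while ((match = re.exec(html)) !== null) {
    contents.push(match[1]);
  }
  return contents;
}

var sample = '<script type="text/javascript">\nalert(\'1\');\n</script>\n' +
  '<div>Test</div>\n' +
  '<script type="text/javascript">\nalert(\'2\');\n</script>';

var found = extractScriptContents(sample);
console.log(found.length);
console.log(found);
```

Creating the regex inside the function avoids a stale lastIndex carrying over from a previous call, which can happen when a single global regex object is shared.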
3 Comments

<script>alert('</script>. Damn it, foiled again!');</script>
@Svante what about it? :)
@kangax, @Svante means that your regular expression will fail on his example, because it has a string value with </script> inside it.
5

Don't use regular expressions for parsing HTML. HTML is not a regular language. Use the power of the DOM. This is much easier, because it is the right tool.

    var scripts = document.getElementsByTagName('script');
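
As a rough sketch of how the contents could then be read from that collection, assuming the markup is already part of a document and the browser supports `textContent`: the function name `getScriptContents` and its `root` parameter are hypothetical, and in a browser you would pass `document`.

```javascript
// Hypothetical helper: returns the text of each <script> element under root.
// root is assumed to be a Document or Element (e.g. the global document).
function getScriptContents(root) {
  var scripts = root.getElementsByTagName('script');
  var contents = [];
  // getElementsByTagName returns an array-like collection, so use an index loop.
  for (var i = 0; i < scripts.length; i++) {
    contents.push(scripts[i].textContent);
  }
  return contents;
}

// In a browser: var contents = getScriptContents(document);
```

External scripts loaded via a src attribute would show up here with empty text, so a caller may want to skip those.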

6 Comments

There are always reasons to want to manually parse DOM from strings. IE8 blows away script tags if you try to use innerHTML, for example. If I'm building an application using modularized widgets and HTML templates, this becomes a problem.
Sometimes you need to sanitize an HTML string before turning it into a DOM.
@YuvalA.: two possibilities: 1. It is invalid HTML; then you need a "tag soup parser". 2. It is valid HTML; then you need an HTML parser. In any case, you can use simple query syntax after parsing.
If you just want to remove scripts, you can use e.g. jQuery.parseHTML
@Svante, jQuery.parseHTML will not remove inline event handlers. I once made a Firefox extension that takes HTML strings from the Wikipedia API and creates DOM from them. The Mozilla guys kept rejecting it because of a lack of sanitization. An HTML parser will always first create a DOM structure from a string, and they simply did not allow turning a string into DOM before "cleaning" it...
4

Try using the global flag:

    document.body.innerHTML.match(/<script.*?>([\s\S]*?)<\/script>/gmi)

Edit: added the multiline and case-insensitive flags (for obvious reasons).
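
One caveat worth illustrating: with the g flag, `String.prototype.match` returns the full matches (tags included) and drops the capture groups. The sketch below runs against a plain string instead of `document.body.innerHTML` and uses `String.prototype.matchAll` (ES2020, so not available in older browsers) to get the captured contents. The variable names are illustrative only.

```javascript
var html = '<script>alert(1);</script><div>Test</div><SCRIPT>alert(2);</SCRIPT>';
var pattern = /<script.*?>([\s\S]*?)<\/script>/gi;

// match() with the g flag: an array of whole matches, no capture groups.
var fullMatches = html.match(pattern);

// matchAll() yields one match object per occurrence; index 1 is the first group.
var innerContents = Array.from(html.matchAll(pattern), function (m) {
  return m[1];
});

console.log(fullMatches);
console.log(innerContents);
```

In environments without matchAll, a while loop around exec gives the same captured groups.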

4 Comments

Or, if you are using a regex function, make sure it is configured to catch all matches. Some of them require multiple calls, an extra parameter, or a different function to be called.
@TheJacobTaylor That seems kind of vague. What regex function are you referring to other than new RegExp?
@Justin Johnson My comment was partially driven by the questions above about what language the regex was in. Since I was not sure, and they were getting one result, I thought they might have been impacted by calling the wrong function. In PHP, for example, preg_match and preg_match_all return the first match or all matches, respectively.
Ah, very well. I assumed JavaScript. I think it was tagged as such when I got to the question, not sure though.
1

The first group contains the content of the tags.

Edit: Don't you have to surround the regex statement with quotes? Like:

re = "/<script\b[^>]*>([\s\S]*?)<\/script>/gm";

1 Comment

No, you don't. In JavaScript, /.../ denotes a regular expression literal. You can build it from a string if you want, but then you have to be more explicit in its construction, because every backslash must be doubled inside the string. E.g.: /<script\b[^>]*>([\s\S]*?)<\/script>/g is equivalent to new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g")
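
To make the escaping point concrete: inside a string literal, backslashes are consumed by the string parser first, so each one has to be doubled before it reaches the RegExp constructor. A small sketch (variable names are illustrative):

```javascript
var literal = /<script\b[^>]*>([\s\S]*?)<\/script>/g;

// Backslashes doubled so the regex engine still sees \b, \s and \S.
var fromString = new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g");

// Single backslashes: "\b" becomes a backspace character and "\s" a plain "s",
// so this pattern no longer matches a script tag.
var broken = new RegExp("<script\b[^>]*>([\s\S]*?)<\/script>", "g");

var input = "<script type=\"text/javascript\">alert('1');</script>";
var literalContent = literal.exec(input)[1];
var fromStringContent = fromString.exec(input)[1];
var brokenMatches = broken.test(input);

console.log(literalContent);
console.log(fromStringContent);
console.log(brokenMatches);
```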
0

In .NET there's a submatch method, and in PHP there's preg_match_all, which should solve your problem. In JavaScript there isn't such a method, but you can write one yourself.

Test in http://www.pagecolumn.com/tool/regtest.htm

Selecting the $1elements method there will return what you want.

0

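Try this:

    var scriptElements = document.getElementsByTagName('script');
    for (var i = 0; i < scriptElements.length; i++) {
        var x = scriptElements[i];
        if (x && x.innerHTML) {
            // Example pattern only; substitute your own regex.
            var yourRegex = /http:\/\/.*\.com/g;
            var matches = yourRegex.exec(x.innerHTML);
            if (matches) {
                // your code
            }
        }
    }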

1 Comment

There is already an accepted answer to this question that accomplishes what is needed.
