How to get regex to match multiple script tags?

Question

I'm trying to return the contents of any <script> tags in a body of text. I'm currently using the following expression, but it only captures the contents of the first tag and ignores any others after that.

Here's a sample of the HTML:

    <script type="text/javascript">
        alert('1');
    </script>

    <div>Test</div>

    <script type="text/javascript">
        alert('2');
    </script>

My regex looks like this:

    // scripttext contains the sample
    re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;
    var scripts = re.exec(scripttext);

When I run this on IE6, it returns 2 matches: the first containing the full tag, the second containing alert('1').

When I run it on http://www.pagecolumn.com/tool/regtest.htm it gives me 2 results, each containing the script tags only.

Are you actually writing the regex in JavaScript? Can you include the matching code?
– cdm9002, Commented Sep 17, 2009 at 21:42
Using RegexBuddy 3.2.1, this works fine. It captures the content of both tags.
– Phoexo, Commented Sep 17, 2009 at 21:43
I'm using /gm. I modified the regexp slightly. It's now returning 2 results, each containing a script tag, but it includes the HTML. /<script\b[^>]*>([\s\S]*?)<\/script>/gm How do I return just the content?
– Geuis, Commented Sep 17, 2009 at 21:47

kangax · Accepted Answer · 2009-09-19 16:18:54Z

47

The "problem" here is in how exec works. It matches only the first occurrence, but when the regex has the g flag it stores the current index (i.e. the position where the next search starts) in the regex's lastIndex property. To get all matches, simply apply the regex to the string repeatedly until it fails to match (this is a pretty common way to do it):

var scripttext = ' <script type="text/javascript">\nalert(\'1\');\n</script>\n\n<div>Test</div>\n\n<script type="text/javascript">\nalert(\'2\');\n</script>';

var re = /<script\b[^>]*>([\s\S]*?)<\/script>/gm;

var match;
while (match = re.exec(scripttext)) {
  // full match is in match[0], whereas captured groups are in ...[1], ...[2], etc.
  console.log(match[1]);
}
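
For reuse, the same loop could be wrapped in a function. This is only a sketch: the name collectGroups is made up for illustration and is not part of the answer above. It assumes you want the first capture group of every match, and it forces the g flag on a copy of the regex, since without g exec restarts from index 0 every time and the loop would never end.

```javascript
// Hypothetical helper (not from the answer): returns the first capture
// group of every match of `re` in `text`.
function collectGroups(re, text) {
  // Work on a copy so the caller's lastIndex is untouched, and force the
  // g flag: without it, exec() would keep matching from index 0 forever.
  var flags = re.flags.indexOf('g') === -1 ? re.flags + 'g' : re.flags;
  var copy = new RegExp(re.source, flags);
  var results = [];
  var match;
  while ((match = copy.exec(text)) !== null) {
    results.push(match[1]);
    // Guard against zero-length matches, which do not advance lastIndex.
    if (match[0] === '') copy.lastIndex++;
  }
  return results;
}

var sample = '<script type="text/javascript">\nalert(\'1\');\n</script>\n' +
  '<div>Test</div>\n' +
  '<script type="text/javascript">\nalert(\'2\');\n</script>';

var bodies = collectGroups(/<script\b[^>]*>([\s\S]*?)<\/script>/, sample);
console.log(bodies.map(function (s) { return s.trim(); }));
```

Copying the regex is a design choice, not a requirement: it keeps the helper from leaving a stale lastIndex on a regex object the caller might reuse.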

answered Sep 19, 2009 at 16:18


3 Comments

Svante Over a year ago

<script>alert('</script>. Damn it, foiled again!');</script>

kangax Over a year ago

@Svante what about it? :)

Jekis Over a year ago

@kangax, @Svante means that your regular expression will fail on his example, because it has a string value with </script> inside.
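
To make the failure concrete, here is a small sketch that runs the regex from the accepted answer against Svante's string. The variable names are just for illustration.

```javascript
var re = /<script\b[^>]*>([\s\S]*?)<\/script>/g;
var tricky = "<script>alert('</script>. Damn it, foiled again!');</script>";

var first = re.exec(tricky);
// The lazy group stops at the first "</script>", which here sits inside
// the JavaScript string literal, so the captured body is cut short.
console.log(first[1]);

// Nothing after that point starts with "<script", so there is no second match.
var second = re.exec(tricky);
console.log(second);
```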

Svante · Answer · 2009-09-19 20:35:29Z

5

Don't use regular expressions for parsing HTML. HTML is not a regular language. Use the power of the DOM. This is much easier, because it is the right tool.

var scripts = document.getElementsByTagName('script');
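
That line only gives you the elements. A rough sketch of getting at their contents could look like the following; scriptContents is a hypothetical name, and it assumes a browser that supports textContent (very old IE versions do not).

```javascript
// Hypothetical helper: `doc` is any object with getElementsByTagName,
// for example the browser's global `document`.
function scriptContents(doc) {
  var nodes = doc.getElementsByTagName('script');
  var contents = [];
  for (var i = 0; i < nodes.length; i++) {
    // textContent holds the inline code; it is empty for external scripts.
    contents.push(nodes[i].textContent);
  }
  return contents;
}

// In a browser: scriptContents(document)
```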

answered Sep 19, 2009 at 20:35


6 Comments

user24950814234 Over a year ago

There's always reasons to want to manually parse dom from strings. IE8 blows away script tags if you try to use innerHTML, for example. If I'm building an application using modularized widgets and html templates, this becomes a problem.

Yuval A. Over a year ago

Sometimes you need to sanitize an HTML string before turning it into a DOM.

Svante Over a year ago

@YuvalA.: two possibilities: 1. It is invalid HTML; then you need a "tag soup parser". 2. It is valid HTML; then you need an HTML parser. In any case, you can use simple query syntax after parsing.

Svante Over a year ago

If you just want to remove scripts, you can use e. g. jQuery.parseHTML

Yuval A. Over a year ago

@Svante, jQuery.parseHTML will not remove inline event handlers. I once made a Firefox extension that takes HTML strings from the Wikipedia API and creates DOM from them. The Mozilla guys kept rejecting it because of a lack of sanitization. An HTML parser will always first create a DOM structure from a string, and they simply did not allow turning a string into DOM before "cleaning" it...


Justin Johnson · Answer · 2009-09-17 23:03:08Z

4

Try using the global flag:

document.body.innerHTML.match(/<script.*?>([\s\S]*?)<\/script>/gmi)

Edit: added multiple line and case insensitive flags (for obvious reasons).
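
One thing to watch with this approach: with the g flag, match returns the whole matches (tags included) and drops the capture groups. The sketch below shows that on a plain string instead of document.body.innerHTML, and then uses String.prototype.matchAll, which is a newer addition (ES2020) than this thread, to get the group contents.

```javascript
var html = '<script type="text/javascript">alert(\'1\');</script>' +
  '<div>Test</div>' +
  '<SCRIPT>alert(\'2\');</SCRIPT>';
var re = /<script.*?>([\s\S]*?)<\/script>/gi;

// With the g flag, match() returns whole matches only; groups are dropped.
var whole = html.match(re);
console.log(whole);

// matchAll() yields one match object per hit, capture groups included.
var inner = Array.from(html.matchAll(re), function (m) { return m[1]; });
console.log(inner);
```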

edited Sep 17, 2009 at 23:03

answered Sep 17, 2009 at 21:42


4 Comments

TheJacobTaylor Over a year ago

Or, if you are using a regex function, make sure it is configured to catch all matches. Some of them require multiple calls, or an extra parameter, or a different function to be called.

Justin Johnson Over a year ago

@TheJacobTaylor That seems kind of vague. What regex function are you referring to other than new RegExp?

TheJacobTaylor Over a year ago

@Justin Johnson My comment was partially driven by questions above about what language the regex was in. Since I was not sure, and they were getting one result, I thought they might have been impacted by calling the wrong function. In PHP, for example, preg_match and preg_match_all will return the first match or all matches, respectively.

Justin Johnson Over a year ago

Ah, very well. I assume JavaScript. I think it was tagged as such when I got to the question, not sure though.

Phoexo · Answer · 2009-09-17 22:10:44Z

1

The first group contains the content of the tags.

Edit: Don't you have to surround the regex statement with quotes? Like:

re = "/<script\b[^>]*>([\s\S]*?)<\/script>/gm";

edited Sep 17, 2009 at 22:10

answered Sep 17, 2009 at 21:53


1 Comment

Justin Johnson Over a year ago

No, you don't. In JavaScript, /.../ denotes a regular expression literal. You can build it from a string if you want, but then you have to be more explicit in its construction, because every backslash must be doubled inside the string. E.g.: /<script\b[^>]*>([\s\S]*?)<\/script>/g is equivalent to new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g")
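
One caveat when building the pattern from a string: the string literal processes backslashes before the regex engine ever sees them. The following sketch compares the literal, a string with doubled backslashes, and a string with single backslashes; the variable names are only for illustration.

```javascript
var literal = /<script\b[^>]*>([\s\S]*?)<\/script>/g;

// In a string, each regex backslash has to be doubled.
var built = new RegExp("<script\\b[^>]*>([\\s\\S]*?)<\\/script>", "g");

// With single backslashes the string literal consumes them first:
// "\b" becomes a backspace character and "\s" becomes a plain "s".
var broken = new RegExp("<script\b[^>]*>([\s\S]*?)<\/script>", "g");

var text = '<script type="text/javascript">alert(1);</script>';
var fromLiteral = literal.exec(text);
var fromBuilt = built.exec(text);
var fromBroken = broken.exec(text);

console.log(fromLiteral[1]);
console.log(fromBuilt[1]);
console.log(fromBroken); // no match: the pattern now expects a backspace after "<script"
```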

unigg · Answer · 2009-09-19 16:02:33Z

0

In .NET there's a submatch method, and in PHP there's preg_match_all, either of which should solve your problem. In JavaScript there isn't such a method, but you can write one yourself.

Test it at http://www.pagecolumn.com/tool/regtest.htm

Selecting the "$1elements" method there will return what you want.

answered Sep 19, 2009 at 16:02


tommy · Answer · 2012-06-04 18:03:16Z

0

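Try this:

var scripts = document.getElementsByTagName('script');
for (var i = 0; i < scripts.length; i++) {
    var x = scripts[i];
    if (x && x.innerHTML) {
        // Assumes the intent was "any characters" (.*), not literal dots (\.*)
        var yourRegex = /http:\/\/.*\.com/g;
        var matches = yourRegex.exec(x.innerHTML);
        if (matches) {
            // your code
        }
    }
}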

answered Jun 4, 2012 at 18:03


1 Comment

random_user_name Over a year ago

There is already an accepted answer to this question that accomplishes what is needed.
