Get javascript code from html file

Question

I am trying to extract the JavaScript code from an HTML file using C# and regular expressions. The code I use now is the following:

string js = Regex.Replace(code, @"^.*?\<script\s?.*?\>((.|\r\n)+?)\<\/script\>.*$", "$1", RegexOptions.Multiline);

But when I use this, I get the full HTML with the script tags stripped.

Can someone help me with this?

I now use the Html Agility Pack with the following code:

var hwObject = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(code);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    string js = script.InnerText;
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes.Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = TrimJs(js);
}

But only the last script tag gets replaced. The scripts before it just disappear.

If you're at liberty to make such a decision, I'd say you should use the HTML Agility Pack instead.
– Bobby, Commented Jun 24, 2011 at 13:05
From what I understand, you want to get only the script, and what you get is everything but the script?
– ub1k, Commented Jun 24, 2011 at 13:06
I get the HTML and the script, but the script tags have disappeared.
– Jerodev, Commented Jun 24, 2011 at 13:10

Ryan Gross · Accepted Answer · 2011-06-24 15:32:39Z

9

You should take a look at Html Agility Pack.

It is generally much easier to parse HTML using an XML-based parser than using regular expressions.

You could use something like this:

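using System.Linq;
using HtmlAgilityPack;

HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://www...");
// ToArray() takes a snapshot, so nodes can be changed while looping.
foreach (var script in htmldocObject.DocumentNode.Descendants("script").ToArray())
{
    string s = script.InnerText;
    // Modify s somehow
    // Scripts without a text child (for example <script src="...">) are skipped;
    // Single() would throw for them.
    var text = script.ChildNodes
                     .FirstOrDefault(d => d.NodeType == HtmlNodeType.Text) as HtmlTextNode;
    if (text == null) continue;
    text.Text = s;
}
htmldocObject.Save("file.htm");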

edited Jun 24, 2011 at 15:32

answered Jun 24, 2011 at 13:05

Ryan Gross


12 Comments

FishBasketGordo Over a year ago

This is a great answer. I feel compelled to say, in agreement with @Ryan Gross, that HTML is not a regular language, and using regular expressions to parse HTML is generally not a good idea.

Jerodev Over a year ago

This looks great, can I also replace the code between the script tags with something else?

Ryan Gross Over a year ago

The InnerText property is read-only, but I think you could try setting the Text property.

Jerodev Over a year ago

Is it possible to load a local file or a string? I have tried, but it won't load a local file.

Ryan Gross Over a year ago

HtmlDocument doc = new HtmlDocument(); doc.Load("file.htm");


Johny Skovdal · Answer · 2011-06-24 13:16:55Z

2

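You need to remove the leading "^.*?" and the trailing ".*$". With RegexOptions.Multiline, "^" and "$" anchor to individual lines and "." does not match a newline, so the pattern only covers the lines that contain the script tags. Replace then swaps that span for the captured group and leaves the rest of the HTML in place, which is why everything is included. There is also no reason to use Replace when you are looking for a substring. Just use the Regex.Match method, read the captured group, and you should be good to go.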
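As a rough illustration of the Regex.Match suggestion, here is a minimal sketch that collects every script body with Regex.Matches. The class name ScriptExtractor and the method Extract are hypothetical names chosen for this example. The pattern is also a simplified variant of the one in the question, not the original: it uses RegexOptions.Singleline so that "." can cross line breaks, and it assumes no attribute value contains a ">" character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original question or answer.
public static class ScriptExtractor
{
    // Singleline lets '.' match newlines, so (.*?) can span a multi-line script body.
    // IgnoreCase also accepts <SCRIPT> and </Script>.
    private static readonly Regex ScriptPattern = new Regex(
        @"<script\b[^>]*>(.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the body of every <script> element, in document order.
    // Empty scripts (for example <script src="..."></script>) yield an empty string.
    public static List<string> Extract(string html)
    {
        var bodies = new List<string>();
        foreach (Match m in ScriptPattern.Matches(html))
        {
            bodies.Add(m.Groups[1].Value);
        }
        return bodies;
    }
}
```

Calling ScriptExtractor.Extract(code) on the HTML string from the question would return one entry per script tag. A regex like this can still be fooled by script-like text inside comments or string literals, which is the usual argument for a real HTML parser.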

edited Jun 24, 2011 at 13:16

answered Jun 24, 2011 at 13:07

Johny Skovdal


2 Comments

Jerodev Over a year ago

Yes, but I want to replace the javascript later in the code. This was just to test if I can get the javascript code.

Johny Skovdal Over a year ago

Ok, well it might be because you have empty scripts on your page then. Try this: \<script.*?\>((.|\r\n)*?)\<\/script\>
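For the follow-up about replacing the JavaScript later while keeping the script tags, one possible approach is Regex.Replace with a match evaluator. This is only a sketch under assumptions: ScriptRewriter and RewriteScripts are hypothetical names, the transform argument stands in for whatever processing is wanted (such as the TrimJs function from the question), and the pattern is a simplified variant that relies on RegexOptions.Singleline instead of the (.|\r\n) alternation.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original thread.
public static class ScriptRewriter
{
    // Group 1: opening tag, group 2: script body, group 3: closing tag.
    // (.*?) also matches an empty body, so empty scripts are handled.
    private static readonly Regex ScriptPattern = new Regex(
        @"(<script\b[^>]*>)(.*?)(</script>)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Applies 'transform' to each script body and leaves the tags
    // and the surrounding HTML unchanged.
    public static string RewriteScripts(string html, Func<string, string> transform)
    {
        return ScriptPattern.Replace(
            html,
            m => m.Groups[1].Value + transform(m.Groups[2].Value) + m.Groups[3].Value);
    }
}
```

With the names from the question, a call might look like ScriptRewriter.RewriteScripts(code, TrimJs). Because the tags are captured and written back, they do not disappear from the result.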

Thaddee Tyl · Answer · 2011-06-24 13:04:25Z

0

Drop the .* (use the following regexp: \<script\s?.*?\>((.|\r\n)+?)\<\/script\>)

answered Jun 24, 2011 at 13:04

Thaddee Tyl
