3

I am trying to extract the JavaScript code from an HTML file using C# and regular expressions. The code I use now is the following:

string js = Regex.Replace(code, @"^.*?\<script\s?.*?\>((.|\r\n)+?)\<\/script\>.*$", "$1", RegexOptions.Multiline);

But when I use this, I get the full HTML code with the script tags stripped.

Can someone help me with this?


I now use the Html Agility Pack with the following code:

var hwObject = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(code);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    string js = script.InnerText;
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes.Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = TrimJs(js);
}

But only the last script tag gets replaced. The scripts before it just disappear.

  • If you're at liberty for such a decision, I'd say you should use the HTML Agility Pack instead. Commented Jun 24, 2011 at 13:05
  • From what I understand, you want to get only the script, and what you get is everything but the script? Commented Jun 24, 2011 at 13:06
  • I get the HTML and the script, but the script tags have disappeared. Commented Jun 24, 2011 at 13:10

3 Answers

9

You should take a look at Html Agility Pack.

It is generally much easier to parse HTML with a dedicated parser that builds a document tree than with regular expressions.

You could use something like this:

// Requires: using System.Linq; using HtmlAgilityPack;
HtmlWeb hwObject = new HtmlWeb();
HtmlDocument htmldocObject = hwObject.Load("http://www...");
foreach (var script in htmldocObject.DocumentNode.Descendants("script").ToArray())
{
    string s = script.InnerText;
    // Modify s somehow
    // Single() throws unless the script has exactly one text child
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes
                        .Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = s;
}
htmldocObject.Save("file.htm");

12 Comments

This is a great answer. I feel compelled to say, in agreement with @Ryan Gross, that HTML is not a regular language, and using regular expressions to parse HTML is generally not a good idea.
This looks great, can I also replace the code between the script tags with something else?
The InnerText property is read only, but I think you could try setting the Text property.
Is it possible to load a local file or a string? I have tried, but it won't load a local file.
HtmlDocument doc = new HtmlDocument(); doc.Load("file.htm");
2

You need to remove the "^.*?" and ".*$", as these are why everything is included. There is also no reason to use Replace when you are looking for a substring; just use the Regex.Match method and you should be good to go.
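
A minimal sketch of that idea, using Regex.Matches so that every script block is returned rather than only the first. The class and method names (ScriptExtractor, ExtractScripts) are made up for illustration, and the pattern is a simplified variant of the one in the question: it assumes RegexOptions.Singleline so that '.' also matches line breaks, and it would be confused by a '>' inside an attribute value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ScriptExtractor
{
    // Singleline lets '.' match newlines, so no (.|\r\n) alternation is needed.
    // IgnoreCase also accepts <SCRIPT> ... </SCRIPT>.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(.*?)</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the text between each pair of script tags, in document order.
    // Empty scripts (for example ones that only have a src attribute)
    // yield an empty string.
    public static List<string> ExtractScripts(string html)
    {
        var scripts = new List<string>();
        foreach (Match m in ScriptBlock.Matches(html))
        {
            scripts.Add(m.Groups[1].Value);
        }
        return scripts;
    }
}
```

Calling ScriptExtractor.ExtractScripts(code) on the HTML string from the question should then give one list entry per script tag.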

2 Comments

Yes, but I want to replace the javascript later in the code. This was just to test if I can get the javascript code.
Ok, well it might be because you have empty scripts on your page then. Try this: \<script.*?\>((.|\r\n)*?)\<\/script\>
0

Drop the .* (use the following regexp: \<script\s?.*?\>((.|\r\n)+?)\<\/script\>)
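
If the goal is to replace the JavaScript while keeping the script tags, one option is Regex.Replace with a MatchEvaluator that captures the opening tag, the body, and the closing tag separately. This is only a sketch under assumptions: ScriptRewriter and RewriteScripts are hypothetical names, and the transform parameter stands in for something like the TrimJs function from the question, assumed here to take and return a string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class ScriptRewriter
{
    // Group 1: opening tag, group 2: script body, group 3: closing tag.
    // Singleline lets '.' span line breaks; '*?' also allows empty scripts.
    private static readonly Regex ScriptBlock = new Regex(
        @"(<script\b[^>]*>)(.*?)(</script>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Applies 'transform' to each script body and leaves the tags
    // and all surrounding HTML untouched.
    public static string RewriteScripts(string html, Func<string, string> transform)
    {
        return ScriptBlock.Replace(html, m =>
            m.Groups[1].Value + transform(m.Groups[2].Value) + m.Groups[3].Value);
    }
}
```

For example, ScriptRewriter.RewriteScripts(code, s => s.Trim()) would strip leading and trailing whitespace from every script body. Like any regex approach, this can misfire on unusual markup, such as a literal "</script>" inside a JavaScript string.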
