I am trying to extract the JavaScript code from an HTML file using C# and regular expressions. The code I use now is the following:
string js = Regex.Replace(code, @"^.*?\<script\s?.*?\>((.|\r\n)+?)\<\/script\>.*$", "$1", RegexOptions.Multiline);
But when I use this, I get the full HTML back with only the script tags stripped, rather than just the script contents.
Can someone help me with this?
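For context on why the Replace call behaves this way: Regex.Replace only substitutes the part of the input that matched, and with RegexOptions.Multiline the ^ and $ anchors match at line boundaries while . still does not match a newline. The lines outside the script block are therefore never part of the match and are left in the output. A minimal sketch of one alternative is below, assuming the goal is simply to collect the body of every script element; ScriptExtractor and ExtractScripts are hypothetical names, not part of the original code, and a regex like this will not cope with every HTML edge case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ScriptExtractor
{
    // Hypothetical helper: returns the body of each <script>...</script> block.
    public static List<string> ExtractScripts(string html)
    {
        var scripts = new List<string>();

        // Singleline lets '.' match newlines, so a script body may span lines.
        // IgnoreCase also accepts <SCRIPT> tags.
        const string pattern = @"<script\b[^>]*>(.*?)</script>";
        var options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

        // Matches (rather than Replace) yields only the captured script bodies.
        foreach (Match m in Regex.Matches(html, pattern, options))
        {
            scripts.Add(m.Groups[1].Value);
        }

        return scripts;
    }
}
```

Calling ScriptExtractor.ExtractScripts(code) would then give one string per script element, leaving the original HTML untouched.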
I now use the Html Agility Pack with the following code:
var hwObject = new HtmlWeb();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(code);
foreach (var script in doc.DocumentNode.Descendants("script").ToArray())
{
    string js = script.InnerText;
    HtmlTextNode text = (HtmlTextNode)script.ChildNodes.Single(d => d.NodeType == HtmlNodeType.Text);
    text.Text = TrimJs(js);
}
But only the last script tag gets replaced. The scripts before it just disappear.