18
<html>
    <head>
        <script type="text/javascript" src="jquery.js"></script>
        <script type="text/javascript">
            if (window.self === window.top) { $.getScript("Wing.js"); }
        </script>
    </head>
</html>

Is there a way in C# to modify the above HTML file and convert it into this format:

<html>
    <head>
    </head>
</html>

Basically, my goal is to remove all the JavaScript from the HTML page. I don't know what the best way to modify the HTML files is. I want to do it programmatically, as there are hundreds of files that need to be modified.

Smihit, be very careful of the edge case (which, if you're lucky, you won't encounter) that I mention in my answer, where a <script> is embedded within a <script>, i.e. <script>var s = '<script></script>';</script>. This WILL cause pain, so look at the Agility Pack options or at least my proposal of <script(.+?)*</script>. Take care. Commented Oct 16, 2013 at 23:08

5 Answers

35

It can be done using regex:

Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>");
output = rRemScript.Replace(input, "");
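
Since the question mentions hundreds of files, here is a minimal sketch of applying the same pattern across a directory tree. The class name, the method names, and the assumption that the files end in .html and are UTF-8 text are all illustrative and not part of the answer; RegexOptions.IgnoreCase is also an addition so that <SCRIPT> is matched too.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Same pattern as in the answer, with IgnoreCase added (an assumption, not in the original).
    private static readonly Regex ScriptBlock = new Regex(
        @"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);

    // Returns the input with every <script>...</script> block removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, "");
    }

    // Rewrites every *.html file under rootDirectory in place (hypothetical layout).
    public static void StripDirectory(string rootDirectory)
    {
        foreach (string path in Directory.EnumerateFiles(
                     rootDirectory, "*.html", SearchOption.AllDirectories))
        {
            string original = File.ReadAllText(path);
            string stripped = RemoveScripts(original);
            // Only touch files that actually contained a script block.
            if (stripped != original)
            {
                File.WriteAllText(path, stripped);
            }
        }
    }
}
```

It is probably worth running this on a copy of the files first, since the rewrite happens in place.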

10 Comments

What's the problem? If there is a possibility of nested script tags, Replace can be called in a loop while Matches.Count > 0.
This works for the example given above. I agree that it is not the best way and that HTML Agility Pack should be used. But it works. Thanks for all the answers.
For any comrades out there who need to get rid of script tags in HTML, including everything in between and the tags themselves: Regex.Replace(htmlStr, @"<script[^>]*>.*?<\/script>", string.Empty, RegexOptions.Singleline)
It's worth reading through the XSS Filter Evasion Cheat Sheet to be aware of how many formatting possibilities there are for a working script tag.
8

May be worth a look: HTML Agility Pack

Edit: specific working code

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string sampleHtml = 
    "<html>" +
        "<head>" + 
                "<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
                "<script type=\"text/javascript\">" + 
                    "if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
                "</script>" +
        "</head>" +
    "</html>";
MemoryStream ms = new MemoryStream(Encoding.ASCII.GetBytes(sampleHtml));

doc.Load(ms);

// Note: this loop removes every child of the first <head> element, not only <script> elements.
List<HtmlNode> nodes = new List<HtmlNode>(doc.DocumentNode.Descendants("head"));
int childNodeCount = nodes[0].ChildNodes.Count;
for (int i = 0; i < childNodeCount; i++)
    nodes[0].ChildNodes.Remove(0);
Console.WriteLine(doc.DocumentNode.OuterHtml);
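
The comments on this answer point out that the loop empties <head> rather than removing only scripts, and that doc.LoadHtml can replace the MemoryStream. A minimal sketch along those lines, assuming the HtmlAgilityPack NuGet package is installed (the class and method names are made up for illustration):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package

public static class HapScriptRemover
{
    // Parses the markup, removes every <script> element wherever it appears,
    // and returns the remaining HTML.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Copy the matches to a list first so nodes are not removed
        // while Descendants is still enumerating the tree.
        foreach (HtmlNode script in doc.DocumentNode.Descendants("script").ToList())
        {
            script.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the search starts at the document node, this should also cover script tags that sit outside <head>.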

4 Comments

I agree, but perhaps you could be a little more specific in your answer?
What if the script tag is not in the head?
Then just replace the call to Descendants("head") with whatever tag it descends from. "html" would work if it's located outside head, I believe.
Two problems: the example does not remove the script tags, it removes all elements from head; and MemoryStream is not required, doc.LoadHtml(sampleHtml); is enough.
6

As others have said, I think HTML Agility Pack is the best route. I've used it to scrape pages and strip out loads of hard-to-handle corner cases. However, if a simple regex is your goal, you could try <script(.+?)*</script>. This is meant to remove nasty nested JavaScript as well as the normal stuff, i.e. the type referred to in the link (Regular Expression for Extracting Script Tags). Note that in .NET the . does not match a newline unless RegexOptions.Singleline is set, so multi-line script blocks need that option:

<html>
<head>
    <script type="text/javascript" src="jquery.js"></script>
    <script type="text/javascript">
        if (window.self === window.top) { $.getScript("Wing.js"); }
    </script>
    <script> // nested horror
    var s = "<script></script>";
    </script>
</head>
</html>

Usage:

Regex regxScriptRemoval = new Regex(@"<script(.+?)*</script>");
var newHtml = regxScriptRemoval.Replace(oldHtml, "");

return newHtml; // etc etc
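
To make the nested edge case concrete, here is a small sketch of what a lazy pattern does with it. It uses the lazy pattern from the first answer on this page, not the pattern proposed here, and the class and method names are only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NestedScriptDemo
{
    // Lazy pattern taken from the first answer on this page.
    private static readonly Regex LazyScript =
        new Regex(@"<script[^>]*>[\s\S]*?</script>");

    public static string RemoveScriptsLazy(string html)
    {
        return LazyScript.Replace(html, "");
    }

    public static void Main()
    {
        string nested = "<script>var s = '<script></script>';</script>";

        // The lazy match ends at the first </script>, which is the one
        // inside the string literal, so the tail of the block is left behind.
        Console.WriteLine(RemoveScriptsLazy(nested)); // prints: ';</script>
    }
}
```

Repeating the replacement until nothing matches does not help in this case, because the leftover text has no opening tag to match. That is one reason a parser is the more robust route.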

Comments

3

This may seem like a strange solution.

If you don't want to use any third party library to do it and don't need to actually remove the script code, just kind of disable it, you could do this:

html = Regex.Replace(html, @"<script[^>]*>", "<!--");
html = Regex.Replace(html, @"<\/script>", "-->");

This creates an HTML comment out of script tags.
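
As a rough sketch, the two replacements can be wrapped in a helper so the effect is easy to check on a sample string. The class and method names are hypothetical, and RegexOptions.IgnoreCase is an addition that the two lines in the answer do not use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptCommenter
{
    // Turns each <script ...> into "<!--" and each </script> into "-->",
    // leaving the script body in place inside an HTML comment.
    public static string CommentOutScripts(string html)
    {
        html = Regex.Replace(html, @"<script[^>]*>", "<!--", RegexOptions.IgnoreCase);
        return Regex.Replace(html, @"</script>", "-->", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        string sample = "<head><script>var a = 1;</script></head>";
        Console.WriteLine(CommentOutScripts(sample)); // prints: <head><!--var a = 1;--></head>
    }
}
```

One caveat: an HTML comment ends at the first "-->", so a script body that itself contains that sequence (for example the expression a-->b) would close the comment early and leave the rest of the script visible as text.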

Comments

2

Using regex:

string result = Regex.Replace(
    input, 
    @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n|\s)*?>", 
    string.Empty, 
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);

1 Comment

I tried this and it removed the script tag - and every other tag in my HTML. (I was left with just a blank string)
