How to remove <script> tags from an HTML page using C#?

Question

<html>
    <head>
        <script type="text/javascript" src="jquery.js"></script>
        <script type="text/javascript">
            if (window.self === window.top) { $.getScript("Wing.js"); }
        </script>
   </head>
</html>

Is there a way in C# to modify the above HTML file and convert it into this format:

<html>
    <head>
    </head>
</html>

Basically, my goal is to remove all the JavaScript from the HTML page. I don't know what the best way to modify the HTML files would be. I want to do it programmatically, as there are hundreds of files that need to be modified.

Smihit, be very careful of the edge case (which, if you're lucky, you won't encounter) that I mention in my answer, where you have an embedded <script> within a <script>, i.e. <script>var s = '<script></script>';</script>. This WILL cause pain, so look at the Agility Pack options or at least my proposal of <script(.+?)*</script>. Take care.
– jim tollan, Commented Oct 16, 2013 at 23:08

Jerry · Accepted Answer · 2015-05-05 17:36:57Z

35

It can be done using regex:

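// Requires: using System.Text.RegularExpressions;
// IgnoreCase so that <SCRIPT> ... </SCRIPT> is matched as well.
Regex rRemScript = new Regex(@"<script[^>]*>[\s\S]*?</script>", RegexOptions.IgnoreCase);
output = rRemScript.Replace(input, "");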
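
Since the question mentions hundreds of files, here is a minimal sketch of how the regex approach might be applied to a whole folder. The class name ScriptStripper, both method names, and the "*.html" filter are assumptions for illustration and not part of the answer. The pattern also tolerates whitespace before the closing > of </script>, which the pattern above does not. It shares the limits of any regex approach, such as script text that itself contains </script>.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class ScriptStripper
{
    // Lazy match from an opening <script ...> to the first </script>.
    // \b keeps tags such as <scripted> from matching; \s* accepts "</script >".
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>[\s\S]*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the HTML with every matched <script> block removed.
    public static string StripScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }

    // Hypothetical helper: rewrites every *.html file under rootDir in place
    // and returns how many files were actually changed.
    public static int StripScriptsInDirectory(string rootDir)
    {
        int changed = 0;
        foreach (string path in Directory.EnumerateFiles(
                     rootDir, "*.html", SearchOption.AllDirectories))
        {
            string original = File.ReadAllText(path);
            string cleaned = StripScripts(original);
            if (cleaned != original)
            {
                File.WriteAllText(path, cleaned);
                changed++;
            }
        }
        return changed;
    }
}
```

Calling ScriptStripper.StripScriptsInDirectory on a folder rewrites the files in place, so running it on a copy first seems prudent. File.ReadAllText and File.WriteAllText default to UTF-8; files in another encoding would need an explicit Encoding argument.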

answered Oct 16, 2013 at 22:13 · edited May 5, 2015 at 17:36

10 Comments

pax162 Over a year ago

:D stackoverflow.com/questions/4683046/…

Jerry Over a year ago

What's the problem? If there is a possibility of nested script tags, Replace can be repeated while Matches.Count > 0.

StackOverflowVeryHelpful Over a year ago

This works for the example given above. I agree that it is not the best way and HTML agility pack should be used. But it works. Thanks for all the answers

brk Over a year ago

For any comrades out there who need to get rid of script tags in HTML, including the tags themselves and everything in between: Regex.Replace(htmlStr, @"<script[^>]*>.*?<\/script>", string.Empty, RegexOptions.Singleline)

robertc Over a year ago

It's worth reading through the XSS Filter Evasion Cheat Sheet to be aware of how many formatting possibilities there are for a working script tag.

Uwe Keim · Accepted Answer · 2018-10-22 20:33:20Z

8

May be worth a look: HTML Agility Pack

Edit: specific working code

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using HtmlAgilityPack;

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
string sampleHtml =
    "<html>" +
        "<head>" +
            "<script type=\"text/javascript\" src=\"jquery.js\"></script>" +
            "<script type=\"text/javascript\">" +
                "if (window.self === window.top) { $.getScript(\"Wing.js\"); }" +
            "</script>" +
        "</head>" +
    "</html>";
MemoryStream ms = new MemoryStream(Encoding.ASCII.GetBytes(sampleHtml));

doc.Load(ms);

// Removes every child node of the first <head> element
// (all of its children, not only <script> elements).
List<HtmlNode> nodes = new List<HtmlNode>(doc.DocumentNode.Descendants("head"));
int childNodeCount = nodes[0].ChildNodes.Count;
for (int i = 0; i < childNodeCount; i++)
    nodes[0].ChildNodes.Remove(0);
Console.WriteLine(doc.DocumentNode.OuterHtml);

answered Oct 16, 2013 at 22:11 by gudatcomputers · edited Oct 22, 2018 at 20:33 by Uwe Keim

4 Comments

Oscar Mederos Over a year ago

I agree, but perhaps you could be a little more specific in your answer?

Krisztián Balla Over a year ago

What if the script tag is not in the head?

gudatcomputers Over a year ago

Then just replace the call to Descendants("head") with whatever tag it descends from. "html" would work if it's located outside head, I believe.

Kux Over a year ago

-! The example does not remove the script tags; it removes all elements from head. -! The MemoryStream is not required; doc.LoadHtml(sampleHtml); is enough.
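
Following up on the last two comments, a variant that targets only script elements, wherever they sit in the document, might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HapScriptRemover and the method name RemoveScripts are made up for illustration. It loads the string with LoadHtml instead of going through a MemoryStream.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HapScriptRemover
{
    // Parses the HTML, removes every <script> element, and returns the result.
    public static string RemoveScripts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parses the string directly

        // ToList() takes a snapshot first, so nodes are not removed
        // from the tree while Descendants() is still walking it.
        foreach (HtmlNode script in doc.DocumentNode.Descendants("script").ToList())
        {
            script.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because Descendants("script") searches the whole tree, the same call should cover scripts in body as well as in head. Other script-carrying markup, such as inline onclick attributes, is not touched by this sketch.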

Community · Accepted Answer · 2017-05-23 11:54:28Z

I think, as others have said, that HTML Agility Pack is the best route. I've used it to scrape pages and deal with loads of hard corner cases. However, if a simple regex is your goal, then maybe you could try <script(.+?)*</script>. This should remove nasty nested JavaScript as well as the normal stuff, i.e. the type referred to in the link (Regular Expression for Extracting Script Tags):

<html>
<head>
    <script type="text/javascript" src="jquery.js"></script>
    <script type="text/javascript">
        if (window.self === window.top) { $.getScript("Wing.js"); }
    </script>
    <script> // nested horror
    var s = "<script></script>";
    </script>
</head>
</html>

usage:

Regex regxScriptRemoval = new Regex(@"<script(.+?)*</script>");
var newHtml = regxScriptRemoval.Replace(oldHtml, "");

return newHtml; // etc etc
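
To make the trade-off concrete, here is a small comparison sketch. It is not the pattern from this answer: it substitutes a plain greedy .* for the nested (.+?)* quantifier, on the assumption that the two behave similarly here (the outer * is greedy) and because nested quantifiers may backtrack heavily on some inputs. RegexOptions.Singleline is added because . does not match newlines by default, which matters for scripts that span several lines. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptRegexComparison
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Lazy: stops at the FIRST </script> after each opening tag.
    public static string RemoveLazy(string html) =>
        Regex.Replace(html, @"<script\b[^>]*>.*?</script>", "", Options);

    // Greedy: runs from the first opening tag to the LAST </script> in the input.
    public static string RemoveGreedy(string html) =>
        Regex.Replace(html, @"<script\b[^>]*>.*</script>", "", Options);
}

// Example inputs:
//   embedded:  <script>var s = '<script></script>';</script>
//   twoBlocks: <script>a</script><p>keep</p><script>b</script>
//
// RemoveLazy(embedded)    leaves  ';</script>   (cut at the inner </script>)
// RemoveGreedy(embedded)  leaves  an empty string
// RemoveLazy(twoBlocks)   leaves  <p>keep</p>
// RemoveGreedy(twoBlocks) leaves  an empty string (the <p> is lost too)
```

So the greedy form copes with the embedded case but swallows any markup sitting between two script blocks, while the lazy form does the opposite. Neither is safe for both inputs, which is the argument for a real parser.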

Krisztián Balla · Accepted Answer · 2015-09-07 11:54:25Z

3

This may seem like a strange solution.

If you don't want to use any third party library to do it and don't need to actually remove the script code, just kind of disable it, you could do this:

html = Regex.Replace(html, @"<script[^>]*>", "<!--");
html = Regex.Replace(html, @"<\/script>", "-->");

This creates an HTML comment out of script tags.


ashuai · Accepted Answer · 2014-02-12 13:34:46Z

2

using regex:

string result = Regex.Replace(
    input, 
    @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n|\s)*?>", 
    string.Empty, 
    RegexOptions.Singleline | RegexOptions.IgnoreCase
);

1 Comment

Nigel Ellis Over a year ago

I tried this and it removed the script tag - and every other tag in my HTML. (I was left with just a blank string)
