How to remove all script tags from html file

Question

How do I remove all script tags from an HTML file using sed?

I tried the command below, but it doesn't work: it doesn't remove any script tags from test1.html.

$ sed -e 's/<script[.]+<\/script>//g' test1.html > test1_output.html

My goal is to turn test1.html into test1_output.html.

test1.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

        <script  type="text/javascript"> utmx( 'url', 'A/B' );</script>

        <script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>    
        <script type="text/javascript">
            window.exp_version = 'control';
        </script>        
    </body>
</html>

test1_output.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

    </body>
</html>

"Doesn't work". You should share with everyone how it doesn't work. What are the results or errors? Also, probably related, if not a duplicate: stackoverflow.com/q/19878056/1531971 (The info there can be expanded to this case, as well.) — user1531971
– user1531971, Commented Sep 28, 2018 at 16:30
@jdv By "doesn't work" I mean the command does nothing (and I don't know why); no error is raised.
– Simone Bonelli, Commented Oct 2, 2018 at 11:52
But how would we know that? "Doesn't work" could be wrong results, zero results, purple monkeys flying out of your USB port, who knows? The idea is to tell us what you want to do, show what you tried, and share the results. — user1531971
– user1531971, Commented Oct 2, 2018 at 14:27
@jdv Thank you, I hope the question is better written now.
– Simone Bonelli, Commented Oct 3, 2018 at 18:00
I never pass up a chance to share this: stackoverflow.com/a/1732454/1531971 — user1531971
– user1531971, Commented Oct 3, 2018 at 19:25

Jorge Valentini · Accepted Answer · 2018-09-28 20:53:15Z

8

If I understood your question correctly and you want to delete everything inside <script></script>, I think you have to split the sed command into parts (you can still write it as a one-liner by separating the parts with ;):

Using:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'

The first piece (s/<script>.*<\/script>//g) handles script elements that open and close on the same line;

The second piece (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is almost a quote of @akingokay's answer, except that I keep the lines on which the tags occur (in case they have other content before or after the tag). There is a great explanation of that in "Using sed to delete all lines between two matching patterns";

The last two pieces (s/<script>.*//g and s/.*<\/script>//g) take care of the lines that open a script tag without closing it, or close one without opening it.

Now if you have an index.html that has:

<html>
  <body>
        foo
        <script> console.log("bar) </script>
  <div id="something"></div>
        <script>
                // Multiple Lines script
                // Blah blah
        </script>
        foo <script> //Some
        console.log("script")</script> bar
  </body>
</html>

and you run this sed command, you will get:

cat index.html | sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
<html>
  <body>
    foo


        <div id="something"></div>




    foo 
 bar
  </body>

</html>

You will end up with a lot of blank lines, but the command should otherwise work as expected. Of course, you could easily remove those with sed as well.
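A minimal sketch of that cleanup step, assuming it is acceptable to drop every whitespace-only line, including blank lines that were in the original file on purpose. The function name strip_blank_lines is hypothetical, not part of sed:

```shell
# Hypothetical helper: drop lines that are empty or contain only whitespace.
strip_blank_lines() {
    sed '/^[[:space:]]*$/d' "$@"
}

# Example: the two whitespace-only lines between the tags are removed.
printf '<body>\n        \n\n  <div id="something"></div>\n</body>\n' | strip_blank_lines
```

It can be chained after the script-removal command by piping that command's output into strip_blank_lines.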

Hope it helps.

PS: I think that @l0b0 is right, and sed is not the right tool for this.

3 Comments

Simone Bonelli Over a year ago

Works, thanks. If the script tag has attributes, the closing angle bracket has to be removed from the pattern, right? $ cat test1.html | sed 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g' > test1_output.html

Jorge Valentini Over a year ago

that's right, I didn't consider that case but it is that simple. Regards

Christopher Over a year ago

If anyone just copies and pastes this: it will not work for <script type="foo">. It should work like this, though: 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g'
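The attribute-tolerant variant from these comments can be tried on a small input as sketched below. This assumes GNU sed (other implementations may require a ";" before each closing brace), and the wrapper name strip_scripts_sed is hypothetical:

```shell
# Hypothetical wrapper around the attribute-tolerant command from the comments.
# Assumes GNU sed; other sed implementations may need a ";" before each "}".
strip_scripts_sed() {
    sed 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g' "$@"
}

# Demo input: one single-line script with a src attribute, one multi-line script.
# The two <p> lines survive; the script lines become empty lines or are deleted.
printf '%s\n' \
    '<p>a</p>' \
    '<script src="x.js"></script>' \
    '<script type="text/javascript">' \
    '  var v = 1;' \
    '</script>' \
    '<p>b</p>' | strip_scripts_sed
```

One caveat of this approach: because .* is greedy, a line holding two script elements with other content between them loses that content as well.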

l0b0 · Answer · 2018-09-28 20:42:21Z

6

sed is the wrong tool for this:

Do not attempt this with sed, awk, grep, and so on (it leads to undesired results). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

Have a look at pup or xsltproc to process any HTML on the shell.
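For xsltproc, one possible approach is an identity transform plus an empty template for script elements. The following is only a sketch, assuming xsltproc (from libxslt) is installed. The helper name strip_scripts_xslt is hypothetical, and the HTML serializer may normalize the markup (for example by adding a Content-Type meta tag), so the result is not guaranteed to be the input minus its scripts byte for byte:

```shell
# Hypothetical helper: print the HTML file given as $1 without <script> elements.
# Assumes xsltproc is installed; --html makes it read the input with an HTML parser.
strip_scripts_xslt() {
    xsl_file=$(mktemp) || return 1
    cat > "$xsl_file" <<'EOF'
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" indent="no"/>
  <!-- Identity template: copy every node and attribute unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Empty template: script elements produce no output. -->
  <xsl:template match="script"/>
</xsl:stylesheet>
EOF
    xsltproc --html "$xsl_file" "$1"
    xslt_status=$?
    rm -f "$xsl_file"
    return "$xslt_status"
}
```

Usage would look like: strip_scripts_xslt test1.html > test1_output.html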

answered Sep 28, 2018 at 20:31; edited Sep 28, 2018 at 20:42

5 Comments

Steren Over a year ago

If pup is the right tool for that, would you be able to answer the question using pup?

l0b0 Over a year ago

Nope, because I don't know enough about it.

zpangwin Over a year ago

I agree about sed not being the best tool for this... but after reading pup's GitHub page for usage, it seems more focused on selecting elements. I didn't see anything listed for removing elements there or when running --help. There might be some use cases where selecting a specific element suffices, but there are others where actual removal is needed or desired. Anyway, I can already do some selections with xmllint --html --xpath <xpath>. Unfortunately, pup does not seem to be the HTML equivalent of jq or xmlstarlet. I would be happy to be proven wrong, but I don't think I am.

zpangwin Over a year ago

^ I think part of the problem is that HTML is just way more difficult to write tools for: parsing-wise, there's HTML markup, CSS, JS/JSON, CDATA blocks, etc. to handle; validation-wise, there are multiple different standards like old HTML, HTML4, XHTML, etc.; and let's be honest: most websites don't even try to write code that validates at all. Even if somebody wrote something that just processed the HTML tags as XML without choking on inline CDATA, JavaScript, and CSS, maybe just treating them as text elements, that would be a huge task.

zpangwin Over a year ago

^ Some other options that I thought of but have not explored yet: use Firefox + Selenium scripts to load a local file, make modifications, and save it; one of the options mentioned here (maybe one of these?); or select pieces with xmllint and just rebuild the whole thing from scratch. Worst case, if there's no good tool, I'd think perl -0777 -pe <regex> <file> would be at least slightly better than sed due to multi-line matching and more flexible regex.

Reino · Answer · 2021-10-17 10:10:58Z

4

As l0b0 already mentioned, it's a bad idea to process HTML with sed.
Besides pup and xsltproc, there's another tool called xidel that you can have a look at.

$ xidel -s test1.html -e 'x:replace-nodes(//body/script,())' --output-format=html

1 Comment

progonkpa Over a year ago

Tested, worked, and I agree this is the correct way to go about it.

Pierz · Answer · 2024-05-07 11:42:05Z

1

You could use htmlq; its -r option removes every node that matches the given selector, here script:

htmlq -r script -f test1.html > test1_output.html

Nihat Alpcan Onaran (edited by xhienne) · Answer · 2018-09-29 00:23:18Z

0

This will work:

sed 's/<script>//;s/<\/script>//' test1.html

This expression searches for <script> and </script> substrings inside the text and replaces them with nothing, so it is removed :)

answered Sep 28, 2018 at 19:56 by Nihat Alpcan Onaran; edited Sep 29, 2018 at 0:23 by xhienne

4 Comments

oguz ismail Over a year ago

s/// does nothing, what's the point of it ? and related to your explanation, what if script tag has attributes ???

xhienne Over a year ago

@oguzismail No, s/// does something: it replaces the content of the buffer that matches the previous regular expression with nothing. Since there is no previous regex, this is an error. The original content was s/<script>// and it was removed because it was not properly quoted.

oguz ismail Over a year ago

@xhienne didn't know that thanks. the answer is still wrong anyways

Nihat Alpcan Onaran Over a year ago

@oguzismail Actually it is not wrong, I thought Simone only wants to remove tags not attributes inside them.

akingokay (edited by Matthias) · Answer · 2018-09-29 14:09:33Z

0

You can test such utilities online for example on http://rextester.com/l/bash_online_compiler.

echo 'abc <script> def </script> xyz' | sed "/<script/,/<\/script>/d"

The output is = abc and xyz

answered Sep 28, 2018 at 17:08 by akingokay; edited Sep 29, 2018 at 14:09 by Matthias

1 Comment

Benjamin W. Over a year ago

That only works properly if <script> and </script> are on different lines, unlike your example input; it also assumes that nothing else is on these lines. And the output of your example is actually the empty string.
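The behaviour described in this comment can be checked with a short sketch. The function name delete_script_range is hypothetical; it only wraps the range expression from the answer:

```shell
# Hypothetical wrapper around the range-delete expression from the answer.
delete_script_range() {
    sed '/<script/,/<\/script>/d' "$@"
}

# Both tags on one line: the whole line is deleted, so nothing is printed.
printf 'abc <script> def </script> xyz\n' | delete_script_range

# Tags on their own lines: only "before" and "after" are printed.
printf 'before\n<script>\ndef\n</script>\nafter\n' | delete_script_range
```

In the one-line case sed looks for the closing pattern only on later lines, so with more input the deletion would continue until the next line containing </script> or the end of the input.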

Mike Slinn · Answer · 2021-03-29 12:43:50Z

0

I found that the answer from @JorgeValentini did not recognize script tags with src attributes. This version of the incantation addresses that problem (shown here editing the question's test1.html in place):

sed -i 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g' test1.html

Jijo John · Answer · 2022-11-11 05:02:21Z

0

You can use this regex:

<script\b[^>]*>[\s\S\n]*?<\/script\b[^>]*>\n

1 Comment

aymcg31 Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
