2

How do I remove all script tags in html file using sed?

I tried the command below, but it doesn't work: it doesn't remove any script tag from test1.html.

$ sed -e 's/<script[.]+<\/script>//g' test1.html > test1_output.html

My goal is to turn test1.html into test1_output.html.

test1.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

        <script  type="text/javascript"> utmx( 'url', 'A/B' );</script>

        <script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>    
        <script type="text/javascript">
            window.exp_version = 'control';
        </script>        
    </body>
</html>

test1_output.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

    </body>
</html>
5
  • 8
    "Doesn't work". You should share with everyone how it doesn't work. What are the results or errors? Also, probably related, if not a duplicate: stackoverflow.com/q/19878056/1531971 (The info there can be expanded to this case, as well.) Commented Sep 28, 2018 at 16:30
  • @jdv the command "doesn't work" because it does nothing (and I don't know why); no error is raised. Commented Oct 2, 2018 at 11:52
  • But how would we know that? "Doesn't work" could be wrong results, zero results, purple monkeys flying out of your USB port, who knows? The idea is to tell us what you want to do, show what you tried, and share the results. Commented Oct 2, 2018 at 14:27
  • tks @jdv Thank you, I hope the question is better written now Commented Oct 3, 2018 at 18:00
  • 1
    I never pass up a chance to share this: stackoverflow.com/a/1732454/1531971 Commented Oct 3, 2018 at 19:25

8 Answers

8

If I understood your question correctly, and you want to delete everything inside <script></script>, I think you have to split the sed command into parts (you can still write it as a one-liner using ;):

Using:

sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'

The first piece (s/<script>.*<\/script>//g) handles script elements that open and close on the same line;

The second piece (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is almost a quote of @akingokay's answer, except that I excluded the lines where the tags themselves occur (in case they have something before or after the tag). There is a great explanation of that in "Using sed to delete all lines between two matching patterns";

The last two (s/<script>.*//g and s/.*<\/script>//g) finally take care of the lines where a script element opens but does not close, or closes without having opened.

Now if you have an index.html that has:

<html>
  <body>
        foo
        <script> console.log("bar) </script>
  <div id="something"></div>
        <script>
                // Multiple Lines script
                // Blah blah
        </script>
        foo <script> //Some
        console.log("script")</script> bar
  </body>
</html>

and you run this sed command, you will get:

cat index.html | sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
<html>
  <body>
    foo


        <div id="something"></div>




    foo 
 bar
  </body>

</html>

You will end up with a lot of blank lines, but the command should work as expected. Of course, you could easily remove those with sed as well.
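
The cleanup mentioned here can be sketched as a second sed pass that deletes lines holding nothing but whitespace. The sample text and the variable names are made up for illustration, and the sketch assumes that no blank line of the original document needs to be preserved.

```shell
# Hypothetical sample: what might be left after the script elements were stripped.
stripped=$(printf '<body>\n    foo\n\n    \t\n    <div id="something"></div>\n\n</body>\n')

# Delete every line that is empty or contains only spaces/tabs.
cleaned=$(printf '%s\n' "$stripped" | sed '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

The same expression could be appended to the main command with another ;, at the cost of an even longer one-liner.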

Hope it helps.

PS: I think that @l0b0 is right, and this is not the correct tool.


3 Comments

It works, thanks. If the script tag has attributes, the closing angle bracket has to be dropped from the opening-tag pattern, right? $ cat test1.html | sed 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g' > test1_output.html
that's right, I didn't consider that case but it is that simple. Regards
For anyone who just copies and pastes this: it will not work for <script type="foo">. It should work like this, though: 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g'
6

sed is the wrong tool for this:

Do not attempt this with sed, awk, grep, and so on (it leads to undesired results). In many cases, your best option is to write in a language that has support for XML data. If you have to use a shell script, there are a few HTML- and XML-specific tools available to parse these files for you.

Have a look at pup or xsltproc to process any HTML on the shell.

5 Comments

If pup is the right tool for that. Would you be able to answer the question using pup?
Nope, because I don't know enough about it.
I agree about sed not being the best tool for this... but after reading pup's github page for usage, it seems more focused on selecting elements. Didn't see anything listed for removing elements there or when running --help. Might be some use-cases where selection of a specific element suffices but there are others where actual removal is needed / desired... anyway, I can already do some selections with xmllint --html --xpath <xpath>. unfortunately, pup does not seem to be the html equivalent of jq or xmlstarlet. would be happy to be proven wrong, but I don't think I am.
^ I think part of the problem is that html is just way more difficult to write tools for: there's html markup, css, js/json, CDATA blocks, etc to handle parsing-wise, there are multiple different standards like old html, html4, xhtml, etc to handle validation-wise, and let's be honest: most websites don't even try to write code that validates at all. even if somebody wrote something that just processed the html tags as xml without choking on inline CDATA, javascript, and css and maybe just treating them as text elements... that would be a huge task.
^ some other options that I thought of but have not explored yet: use firefox + selenium scripts to load a local file, make modifications and save it, one of the options mentioned here (maybe one of these?), or select pieces with xmllint and just rebuild the whole thing from scratch... worst case, if there's no good tool, I'd think perl -0777 -pe <regex> <file> would be at least slightly better than sed due to multi-line matching and more flexible regex
4

As l0b0 already mentioned, it's a bad idea to process HTML with sed.
Besides pup and xsltproc, there's another tool, called xidel, that you can have a look at.

$ xidel -s test1.html -e 'x:replace-nodes(//body/script,())' --output-format=html

See also this online xidelcgi demo.

1 Comment

Tested, worked, and I agree this is the correct way to go about it.
1

You could use htmlq, whose -r option removes the nodes matching a selector (here, script):

htmlq -r script -f test1.html > test1_output.html

Comments

0

This will work:

sed 's/<script>//;s/<\/script>//' test1.html

This expression searches for the <script> and </script> substrings in the text and replaces them with nothing, so they are removed :)
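
A quick check of what this command does, on made-up one-line inputs (the variable names are illustrative only): it removes just the tag markup, so the JavaScript between the tags stays in the output, and an opening tag with attributes is not matched at all.

```shell
# Hypothetical inputs: a plain script element and one with an attribute.
plain=$(printf '<p>a</p><script>x()</script>\n' | sed 's/<script>//;s/<\/script>//')
attributed=$(printf '<script type="text/javascript">y()</script>\n' | sed 's/<script>//;s/<\/script>//')

printf '%s\n' "$plain"        # tags are gone, but x() is still there
printf '%s\n' "$attributed"   # only the closing tag was removed
```

Whether this counts as a solution depends on whether the goal is to drop the tags alone or the whole element, which is what the comments below argue about.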

4 Comments

s/// does nothing, what's the point of it ? and related to your explanation, what if script tag has attributes ???
@oguzismail No, s/// does something: it replaces the content of the buffer that matches the previous regular expression with nothing. Since there is no previous regex, this is an error. The original content was s/<script>// and it was removed because it was not properly quoted.
@xhienne didn't know that thanks. the answer is still wrong anyways
@oguzismail Actually it is not wrong, I thought Simone only wants to remove tags not attributes inside them.
0

You can test such utilities online, for example at http://rextester.com/l/bash_online_compiler.

echo 'abc <script> def </script> xyz' | sed "/<script/,/<\/script>/d"

The output is = abc and xyz

1 Comment

That only works properly if <script> and </script> are on different lines, unlike your example input; it also assumes that nothing else is on these lines. And the output of your example is actually the empty string.
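
The behaviour described in this comment is easy to reproduce. In the sketch below, the variable names and the second, multi-line sample are assumptions added for illustration; the first command is the one from the answer.

```shell
# One-line example from the answer: the range starts on this line, so d deletes the entire line.
one_line=$(echo 'abc <script> def </script> xyz' | sed "/<script/,/<\/script>/d")

# Tags on lines of their own: only in this layout does the range delete behave as intended.
multi_line=$(printf 'abc\n<script>\ndef\n</script>\nxyz\n' | sed "/<script/,/<\/script>/d")

printf '[%s]\n' "$one_line"    # brackets make the empty result visible
printf '%s\n' "$multi_line"
```

A range address always works on whole lines, which is why text sharing a line with either tag is lost.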
0

I found that the answer from @JorgeValenti did not recognize script tags with src attributes. This version of the incantation addresses that problem:

sed -i 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g'
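
A small sanity check of this expression against input shaped like test1.html from the question. This is a sketch that assumes GNU sed; the variable name is made up, and -i is dropped so that the result goes to standard output instead of back into a file.

```shell
# Same expression as above, fed a sample shaped like the body of test1.html.
result=$(sed 's/<script.*<\/script>//g;/<script/,/<\/script>/{/<script/!{/<\/script>/!d}};s/<script.*//g;s/.*<\/script>//g' <<'EOF'
<body>
    <h1>My Website</h1>
    <script  type="text/javascript"> utmx( 'url', 'A/B' );</script>
    <script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>
    <script type="text/javascript">
        window.exp_version = 'control';
    </script>
</body>
EOF
)

# The script elements are gone; the lines they occupied remain as blank lines.
printf '%s\n' "$result"
```

One caveat: .* is greedy, so if two script elements share a line, the first substitution also removes whatever markup sits between them.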

Comments

0

You can use this regex:

<script\b[^>]*>[\s\S\n]*?</script\b[^>]*>\n

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
