
I have searched a lot but could not find a solution. I know how to remove all tags using sed, but I need to remove only those HTML tags that are empty or contain just tabs or spaces, and also to remove certain tags explicitly. For example:

<p></p>  or <p>    </p> 

I used the following command to remove all the HTML tags. It works properly, but I don't want to remove all tags.

sed -e 's/<[^>]*>//g' myfile.html

The same command is used here. Kindly help me out.

2 Answers


You could use the sed command below to remove only the empty tags.

sed 's/<[^\/][^<>]*> *<\/[^<>]*>//g' file

Or with Perl:

perl -pe 's/<([^<>]*)>\s*<\/\1>//g' file
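
As a rough sketch of how the two ideas above can be combined (require the closing tag name to match, as the Perl version does, and accept tabs as well as spaces), the following wraps a sed call in a shell function. The function name strip_empty_tags and the sample input are made up for illustration, and the sketch assumes a sed that accepts -E with backreferences, such as GNU sed. It also tolerates attributes on the opening tag, which the Perl command does not.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): delete elements whose
# body is empty or contains only spaces/tabs. Reads stdin, writes stdout.
strip_empty_tags() {
    # \1 is the captured tag name, so the closing tag must match the opening one.
    # ( [^<>]*)? allows optional attributes after the tag name.
    # [[:space:]]* covers both spaces and tabs inside the element.
    sed -E 's/<([a-zA-Z][a-zA-Z0-9]*)( [^<>]*)?>[[:space:]]*<\/\1>//g'
}

# Example: empty <p> elements go away, non-empty elements stay.
printf '<p></p>a<p> </p><b>keep</b>\n' | strip_empty_tags
# prints: a<b>keep</b>
```

Because sed works line by line, this only catches elements that open and close on the same line; self-closing tags such as <img src="someimage" /> are left untouched.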

3 Comments

Thanks! One more problem is that the tag doesn't always close like </>; some tags are written like this: <img src="someimage" />. Will this command still hold for these tags?
Then use this: sed -r 's/<[^\/][^<>]*> *<\/?[^<>]*\/?>//g' file
To delete empty tags with any attributes: tags may have any amount of whitespace or newlines in the body, and tags without a body (<tag/>) are not deleted. perl -0 -i -p -e "s/(<[[:space:]]*([a-zA-Z]*)[^>]*>)([[:space:]]*)(<\/\2>)([[:space:]]*)//g" "$path"
1
sed -r 's/<([a-zA-Z0-9]+)>[[:space:]]*<\/\1>//g' file
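
One case a single pass does not cover is nesting: removing an inner empty tag can leave its parent empty. A minimal sketch, assuming GNU sed, that repeats the substitution until nothing changes; the function name strip_empty_tags_nested is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep deleting empty elements until none remain on the
# line, so that <div><p></p></div> disappears entirely.
strip_empty_tags_nested() {
    # ':again' defines a label; 't again' jumps back to it whenever the
    # preceding s command made at least one substitution.
    sed -E -e ':again' \
           -e 's/<([a-zA-Z0-9]+)>[[:space:]]*<\/\1>//g' \
           -e 't again'
}

printf '<div><p> </p></div>x<b>y</b>\n' | strip_empty_tags_nested
# prints: x<b>y</b>
```

Like the one-liner, this works per line and only matches tags without attributes.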

1 Comment

This will not work now that you have changed your requirement! :-) Since @Avinash provided the answer, I left it there :-)
