I have messy HTML that looks like this:

<div id=":0.page.0" class="page-element" style="width: 1620px;">
 <div>
  <img src="viewer_files/viewer_004.png" class="page-image" style="width: 800px; height: 1131px; display: none;">
  <img src="viewer_files/viewer_005.png" class="page-image" style="width: 1600px;">
 </div>
</div>// this repeats 100+ times with different 'src' attributes

This is actually all on one line (I have split it across multiple lines for readability). I am trying to remove all <img> tags that have display: none; set in the inline CSS. Is it possible to use sed/awk or some other Unix command to achieve this? I think it would have been easy if this were a well-indented HTML document.

6 Answers

3

HTML and regexes are a notoriously bad match, so you probably want something that is HTML-aware. I'd probably go for something like TagSoup, but there are no doubt other options that are more shell-friendly, or suitable for any favourite scripting language you may have.

3

I would use either Twig or XMLStarlet for this kind of processing. They are a lot more reliable than sed/awk/grep. That said, since your pattern is regular and repeating, sed/awk/grep would work too.

1 Comment

+1 love xmlstarlet as much as I can love anything related to XML.
2
sed 's/<img.*display: none;[^>]>//g' file

1
sed -e "s/<img[^>]*display: none;[^>]*>//g" filein

A quick explanation of the sed command:

s stands for substitution, and the / characters are delimiters.

s takes three fields: the first is the pattern to search for, the second is its replacement, and the last holds options. g means global (replace every match on a line rather than only the first).

To edit the file in place: sed -i -e "..."
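As a minimal sketch of the substitution at work, the snippet below runs the command on a one-line sample. The sample string and the variable names `html` and `result` are made up for illustration, modelled on the markup in the question.

```shell
# Hypothetical one-line sample modelled on the markup in the question.
html='<div><img src="a.png" class="page-image" style="width: 800px; display: none;"><img src="b.png" class="page-image" style="width: 1600px;"></div>'

# s/PATTERN/REPLACEMENT/g with an empty replacement deletes every match.
# [^>]* cannot cross a ">", so each match stays inside a single <img> tag.
result=$(printf '%s\n' "$html" | sed -e 's/<img[^>]*display: none;[^>]*>//g')

printf '%s\n' "$result"
```

On this sample only the hidden image is removed; the second <img> and the surrounding <div> are left in place.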

4 Comments

should be display: *none\b instead of display: none;
@Pumbaa80 what is the difference?
Matches zero or more spaces, instead of exactly one.
also, it matches "display: none" and "display: none ;"
0

That would do it

sed -e "s@<img.*display: none;.*>@@g" FILENAME

4 Comments

Isn't the second .* going to match greedily?
it removed all the img tags :|
Well, it did work on original sample. But if greedy would be a problem we can always replace . with [^>]
Are you sure? I just tried it with your file. Worked like a charm.
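The disagreement in these comments probably comes down to whether the tags share a line. A small sketch, using a made-up one-line sample (the names `html`, `greedy` and `bounded` are illustrative only), compares the greedy pattern with the [^>] variant suggested above:

```shell
# Hypothetical sample: a hidden and a visible image on the same line.
html='<div><img src="a.png" style="display: none;"><img src="b.png" style="width: 1600px;"></div>'

# Greedy: the trailing .*> runs on to the last ">" of the line.
greedy=$(printf '%s\n' "$html" | sed -e 's@<img.*display: none;.*>@@g')

# Bounded: [^>]* stops at the first ">", so only one tag is matched.
bounded=$(printf '%s\n' "$html" | sed -e 's@<img[^>]*display: none;[^>]*>@@g')

printf 'greedy:  %s\n' "$greedy"
printf 'bounded: %s\n' "$bounded"
```

Here the greedy version leaves only `<div>`, while the bounded version keeps the visible image. When each <img> sits on its own line, as in the formatted sample in the question, both behave the same, which would explain why it "worked" for one commenter and not the other.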
0

Sed has several commands, but most people only learn the substitute command: "s". A useful command deletes every line that matches the restriction: "d".

sed -e "/<img[^>]*display: none;[^>]*>/d" File 

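Be careful: it deletes the entire line. Since the HTML in the question is all on one line, this would remove the whole document rather than just the hidden <img> tags.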
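A short sketch of how the d command behaves on single-line versus multi-line input. Both samples and all variable names are hypothetical, chosen only to illustrate the point:

```shell
# Hypothetical samples: the same markup on one line and split across lines.
one_line='<div><img src="a.png" style="display: none;"><img src="b.png"></div>'
multi_line=$(printf '%s\n' '<div>' '<img src="a.png" style="display: none;">' '<img src="b.png">' '</div>')

# d deletes every input line that matches the address pattern.
deleted_one=$(printf '%s\n' "$one_line" | sed -e '/<img[^>]*display: none;[^>]*>/d')
deleted_multi=$(printf '%s\n' "$multi_line" | sed -e '/<img[^>]*display: none;[^>]*>/d')

printf 'one line:   [%s]\n' "$deleted_one"
printf 'multi line:\n%s\n' "$deleted_multi"
```

With the one-line input nothing is left, whereas the multi-line input loses only the line holding the hidden image. The d approach is therefore only a fit once each tag is on its own line.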
