remove html tag if it contains some text inside

Question

If the content of a div matches some string, I want to remove the whole div. For example:

<div>
some text here
if this text is matched, remove whole div
some other text
</div>

I have to do this on many files, so I'm looking for a Linux command-line tool such as sed.

Thank you for looking into this.

Yeah, don't use regular expressions for HTML, it'll go badly: stackoverflow.com/a/1732454/928098
– Kristian Glass, commented Apr 30, 2012 at 1:21

anubhava · Accepted Answer · 2011-04-26 18:44:41Z

1

If I understood your question correctly, this can be done with a single sed command:

sed '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' file.txt

Testing

Let's say this is your file.txt:

a. no-div text

<DIV>

some text here
1. if this text is matched, remove whole DIV
some other text -- WILL MATCH
</div>

<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>

<div>
Some TEXT Here
4. if this text is matched, remove whole DIV
foo bar foo bar - WILL MATCH
</DIV>

c. no-div text

When I run the sed command above, it produces this output:

a. no-div text


<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>


c. no-div text

As the output above shows, wherever the pattern some text here matched between div tags, those div blocks were removed completely.

PS: I am doing a case-insensitive search here. If you don't need that behavior, please let me know; it only requires removing the I flags from the sed command above.
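
The one-liner relies on GNU sed extensions (the I flag and \b). Where those are unavailable, the same buffer-then-decide idea can be sketched in awk. This is only a sketch under assumptions: the function name remove_matching_divs is made up for illustration, div blocks are assumed not to be nested, and the word-boundary test is approximated with a character class rather than \b.

```shell
# Hypothetical helper: print the input, dropping every <div>...</div> block
# whose text contains the phrase "some text here" (case-insensitive).
remove_matching_divs() {
  awk '
    # Lowercase copy of the line so tag and phrase matching ignore case.
    { low = tolower($0) }
    # An opening tag starts a new buffer (assumes divs are not nested).
    !inside && low ~ /<div>/ { inside = 1; buf = ""; hit = 0 }
    inside {
      buf = buf $0 "\n"
      # Approximate word boundaries: the phrase must not touch a letter, digit or underscore.
      if (low ~ /(^|[^a-z0-9_])some text here([^a-z0-9_]|$)/) hit = 1
      # The closing tag ends the block: print the buffer only if nothing matched.
      if (low ~ /<\/div>/) {
        if (!hit) printf "%s", buf
        inside = 0
      }
      next
    }
    # Lines outside any div pass through unchanged.
    { print }
    # An unterminated block at end of input is kept as-is.
    END { if (inside) printf "%s", buf }
  ' "$@"
}
```

Called as remove_matching_divs file.txt, it should print the file with the matching blocks removed; like the sed command, it leaves the surrounding blank lines in place.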

edited Apr 26, 2011 at 18:44

answered Apr 23, 2011 at 6:15

anubhava


1 Comment

Steven You Over a year ago

Hi @anubhava, your code looks awesome, could you explain it a little bit? For example, the :A command.
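
A rough breakdown of the one-liner, written out as a multi-line script with one comment per command. GNU sed is assumed (for the I flag and \b), and the wrapper name remove_divs_sed is only for illustration:

```shell
# Hypothetical wrapper around the same commands as the one-liner.
remove_divs_sed() {
  sed '/<div>/I{
    # :A defines a label named A; "bA" below jumps back to it.
    :A
    # N appends the next input line to the pattern space.
    N
    # h copies the pattern space into the hold space.
    h
    # No closing tag yet: H appends to the hold space, bA loops to read more.
    /<\/div>/I!{
      H
      bA
    }
    # Closing tag seen: g copies the hold space back into the pattern space,
    # then d deletes the whole block if it contains the phrase as a whole word.
    /<\/div>/I{
      g
      /\bsome text here\b/Id
    }
  }' "$@"
}
```

Because h overwrites the hold space on every pass through the loop, the h, H and g commands do not appear to change the result here; the N and bA loop is what collects the block.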

drysdam · Accepted Answer · 2011-04-22 17:02:36Z

0

There's probably a better way to do this, but what I've done in the past is:

1) strip out newlines (because matching across lines is difficult at best and going backwards even worse)

2) parse

3) put newlines back in

tr '\n' '@' < /tmp/data | sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' | tr '@' '\n'

This assumes that "@" never appears in the file. Because of [^<]*, it also only matches divs that contain no other tags.
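
With GNU sed, the placeholder character can be avoided: the -z option makes sed split its input on NUL bytes instead of newlines, so an ordinary text file is handled as a single record. A sketch, assuming GNU sed and a file without NUL bytes; the function name strip_divs is illustrative:

```shell
# Hypothetical helper: same substitution as above, without the tr round trip.
strip_divs() {
  # -z: records end at NUL, so newlines are ordinary characters that [^<]* can cross.
  sed -z 's/<div>[^<]*some text here[^<]*<\/div>//g' "$@"
}
```

It would be called as strip_divs /tmp/data. The match is case-sensitive and, as with the tr version, divs that contain other tags are left alone.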

answered Apr 22, 2011 at 17:02

drysdam



jeff · Accepted Answer · 2011-04-24 15:12:15Z

You can use ed instead of sed. ed reads the entire file into memory and edits it in place (i.e. no backup copy is made).

htmlstr='
<see file.txt in answer by anubhava>
'
matchstr='[sS][oO][mM][eE]\ [tT][eE][xX][tT]\ [hH][eE][rR][eE]'
divstr='[dD][iI][vV]'
# for in-place file editing use "ed -s file" and replace ",p" with "w"
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-EOF | sed -e 's/^ *//' -e 's/ *$//' -e '/^ *#/d' | ed -s <(echo "$htmlstr")
  H
  # ?re?   The previous line containing the regular expression re.  (see man ed)
  # '[[:<:]]' and '[[:>:]]' match the null string at the beginning and end of a word respectively. (see man re_format)
  #,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?,/<\/${divstr}>/d
  ,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?+0,/<\/${divstr}>/+0d
  ,p
  q
EOF
