If any text inside a div matches a given string, I want to remove the whole div. For example:

<div>
some text here
if this text is matched, remove whole div
some other text
</div>

I have to do this on many files, so I'm looking for a Linux command-line tool such as sed.

Thank you for looking into this.

3 Answers

If I understood your question correctly, this can be done with a single sed command (GNU sed is assumed, since the I flag and \b are GNU extensions):

sed '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' file.txt
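
For readers wondering how the loop works, the same idea can be written out as a commented multi-line script. This is a simplified sketch rather than the exact one-liner above: it leaves out the hold-space commands (h, H, g), which this task does not appear to need, and it assumes GNU sed. The function name remove_matching_divs is hypothetical.

```shell
# Hypothetical wrapper around a simplified form of the one-liner (GNU sed assumed).
# Reads the files given as arguments, or standard input if there are none.
remove_matching_divs() {
  sed '
    # Run the block only on a line containing an opening div tag (I = ignore case).
    /<div>/I{
      # :A defines a label named A that the b command can jump to.
      :A
      # N appends the next input line to the pattern space.
      N
      # No closing tag collected yet: branch back to A and read another line.
      /<\/div>/I!bA
      # The whole block is now in the pattern space: delete it if the phrase occurs.
      /\bsome text here\b/Id
    }
  ' "$@"
}
```

Usage: remove_matching_divs file.txt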

Testing

Let's say this is your file.txt:

a. no-div text

<DIV>

some text here
1. if this text is matched, remove whole DIV
some other text -- WILL MATCH
</div>

<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>

<div>
Some TEXT Here
4. if this text is matched, remove whole DIV
foo bar foo bar - WILL MATCH
</DIV>

c. no-div text

Running the sed command above on this file gives the following output:

a. no-div text


<div>
awesome text here
2. if this text is matched, remove whole DIV
this will NOT be matched
</div>

b. no-div text

<Div>
another text here
3. if this text is matched, remove whole DIV
and this too will NOT be matched
</Div>


c. no-div text

As the output shows, every div block that contains the pattern some text here between its tags has been removed completely.

PS: The search here is case-insensitive. If you don't need that behavior, please let me know; I would just need to remove the I flags from the sed command above.
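
Since the question mentions many files, one possible way to apply the command to several files at once is the -i option of GNU sed, which edits files in place; with a suffix such as .bak it keeps a copy of each original. A minimal sketch, assuming GNU sed; the function name and the glob in the usage comment are hypothetical:

```shell
# Hypothetical helper: run the one-liner on every file given as an argument,
# editing each file in place and keeping the original as <name>.bak (GNU sed assumed).
remove_divs_in_place() {
  sed -i.bak '/<div>/I{:A;N;h;/<\/div>/I!{H;bA};/<\/div>/I{g;/\bsome text here\b/Id}}' "$@"
}

# Example call (the glob is only an illustration):
# remove_divs_in_place ./*.html
```

It is worth trying this on copies first and checking the .bak files before deleting them.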

1 Comment

Hi @anubhava, your code looks awesome. Could you explain it a little? For example, what does the :A command do?

There's probably a better way to do this, but what I've done in the past is:

1) replace newlines with a placeholder character (matching across lines is difficult at best, and going backwards is even worse)

2) delete the matching div with a single substitution

3) turn the placeholder back into newlines

tr "\n" "@" < /tmp/data | sed -e 's/<div>[^<]*some text here[^<]*<\/div>//g' | tr "@" "\n"

This assumes that "@" never appears in the file.
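
If no safe placeholder character is available, GNU sed can do the same thing without tr: with the -z option it separates records at NUL bytes instead of newlines, so a text file without NUL bytes is read as a single record. A minimal sketch, assuming GNU sed and a file that fits in memory; the function name is hypothetical, and like the pipeline above it only matches lowercase <div> tags with no other tags inside:

```shell
# Hypothetical helper: the same substitution, applied to the whole input as one record.
# -z makes GNU sed split records at NUL bytes rather than at newlines.
remove_divs_whole_file() {
  sed -z 's/<div>[^<]*some text here[^<]*<\/div>//g' "$@"
}
```

Usage: remove_divs_whole_file /tmp/data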

1 Comment

Yeah, don't use regular expressions for HTML, it'll go badly: stackoverflow.com/a/1732454/928098

You may use ed instead of sed. The ed command reads the entire file into memory and edits the file in place (i.e. no backup copy is kept).

htmlstr='
<see file.txt in answer by anubhava>
'
matchstr='[sS][oO][mM][eE]\ [tT][eE][xX][tT]\ [hH][eE][rR][eE]'
divstr='[dD][iI][vV]'
# for in-place file editing use "ed -s file" and replace ",p" with "w"
# cf. http://wiki.bash-hackers.org/howto/edit-ed
cat <<-EOF | sed -e 's/^ *//' -e 's/ *$//' -e '/^ *#/d' | ed -s <(echo "$htmlstr")
  H
  # ?re?   The previous line containing the regular expression re.  (see man ed)
  # '[[:<:]]' and '[[:>:]]' match the null string at the beginning and end of a word respectively. (see man re_format)
  #,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?,/<\/${divstr}>/d
  ,g/[[:<:]]${matchstr}[[:>:]]/?<${divstr}>?+0,/<\/${divstr}>/+0d
  ,p
  q
EOF
