I have a very basic HTML file called example.html (see below):

<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>

and I'd like to extract only the block below, without simply deleting the first and last 3 lines.

<div class="research">
    <p>Lorem ipsum...</p>
    <div class="two"></div>
    <div class="three"></div>
    <div class="four"></div>
</div>

I have tried with awk:

cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'

but something seems to be wrong.

I also tried with body tag (see below)

cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'

(result)

<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>

And it's working correctly.

What am I doing wrong?

Thanks in advance.

  • /^<div class="research">$/ doesn't work because <div isn't at the beginning of the line, and ^ matches the beginning of the line. Commented Aug 29, 2013 at 18:21
  • Yeah! You're right, but the final </div> is still a problem. So the question is: how do I select the text up to the matching closing div tag? Commented Aug 29, 2013 at 18:28
  • You need to count all the matching <div> and </div> tags. You can't do this with a simple first,last pattern, you have to write awk code to increment a counter when you see another <div>, and decrement it when you see a </div>. When the counter goes to 0, you've matched the first one. Commented Aug 29, 2013 at 18:29
  • As an aside, avoid the Useless Use of cat. Commented Aug 29, 2013 at 19:32
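
The counting approach suggested in the comments could look roughly like the sketch below. It is only an illustration, not a real HTML parser: it assumes the target tag is written exactly as <div class="research">, that tags are lowercase, and that no <div appears inside comments or attribute values. The variable names (inside, depth, line, result) are made up for this example.

```shell
# Recreate the question's example.html so the sketch is self-contained.
cat > example.html <<'EOF'
<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>
</div>
</body>
</html>
EOF

result=$(awk '
    # Start printing at the first line that opens the target div
    # (no ^ anchor, so leading whitespace does not matter).
    !inside && /<div class="research">/ { inside = 1 }
    inside {
        print
        line = $0                             # work on a copy of the line
        depth += gsub(/<div[ >]/, "", line)   # gsub returns the number of matches
        depth -= gsub(/<\/div>/, "", line)
        if (depth == 0) exit                  # the matching </div> was just printed
    }
' example.html)

printf '%s\n' "$result"
```

With the sample file this should print the block from <div class="research"> through its matching </div> and stop before the </div> that closes class="one". For anything beyond tidy, hand-written markup, an XML/HTML-aware tool is the safer choice.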

1 Answer

You cannot parse HTML with regular expressions. Assuming the HTML is well-formed XML, you can use xmlstarlet (the command is followed by its output):

xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html  
<div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>
1 Comment

+1 You cannot parse HTML with regular expressions. (I just like repeating that).
