Extract text between two strings in simple example.html file

Question

I have a very basic HTML file called example.html:

<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>
</html>

and I'd like to extract only a fragment like the one below, but not by simply removing the first and last three lines.

<div class="research">
    <p>Lorem ipsum...</p>
    <div class="two"></div>
    <div class="three"></div>
    <div class="four"></div>
</div>

I have tried with awk:

cat example.html | awk '/^<div\ class="research">$/,/^<\/div>$/ { print }'

but something seems to be wrong.

I also tried with the body tag:

cat example.html | awk '/^<body>$/,/^<\/body>$/ { print }'

(result)

<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>  
</div>
</body>

And that works correctly.

What am I doing wrong?

Thanks in advance.

/^<div class="research">$/ doesn't work because <div isn't at the beginning of the line (the line is indented), and ^ matches only at the beginning of the line.
– Barmar, Commented Aug 29, 2013 at 18:21
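
The point about ^ can be checked with a small sketch. It is an illustration, not part of the original comment: the two-line `sample` variable is made up for the demonstration, and `[ \t]*` is just one way to tolerate indentation. The strict pattern from the question matches only the flush-left line, while the relaxed pattern also matches the indented one.

```shell
# Hypothetical two-line sample: the same tag flush-left and indented
# (example.html only contains the indented form).
sample='<div class="research">
    <div class="research">'

# Pattern from the question: ^ requires the tag to start in column 1.
strict=$(printf '%s\n' "$sample" |
    awk '/^<div class="research">$/ { n++ } END { print n + 0 }')

# Relaxed pattern: optional spaces or tabs before and after the tag.
relaxed=$(printf '%s\n' "$sample" |
    awk '/^[ \t]*<div class="research">[ \t]*$/ { n++ } END { print n + 0 }')

echo "strict matches: $strict, relaxed matches: $relaxed"
```

Fixing the start pattern this way still leaves the end pattern stopping at the wrong </div>, which is the problem raised in the next comment.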
Yes, you're right, but the last </div> is still a problem. So the question is: how do I select the text up to the proper closing div tag?
– Egel, Commented Aug 29, 2013 at 18:28
You need to count all the matching <div> and </div> tags. You can't do this with a simple first,last pattern; you have to write awk code that increments a counter when it sees another <div> and decrements it when it sees a </div>. When the counter goes back to 0, you've found the </div> that closes the first <div>.
– Barmar, Commented Aug 29, 2013 at 18:29
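
The counter idea described above could be sketched roughly as follows. This is only an illustration under strong assumptions: the function name `extract_div` and the variables `html` and `result` are hypothetical, tags are assumed never to span lines, and the literal text <div is assumed not to appear in comments, scripts, or attribute values. It is still line-based text matching, not real HTML parsing.

```shell
# Hypothetical helper: print the first <div class="CLASS"> block from stdin,
# tracking nesting depth to find the matching </div>.
extract_div() {
    awk -v cls="$1" '
        # Start printing at the first line holding the wanted opening tag.
        !depth && index($0, "<div class=\"" cls "\">") { inside = 1 }
        inside {
            print
            # gsub with "&" leaves the line unchanged and returns the match count.
            depth += gsub(/<div[ \t>]/, "&")   # opening tags on this line
            depth -= gsub(/<\/div>/, "&")      # closing tags on this line
            if (depth == 0) exit               # outermost div is closed
        }
    '
}

# The example.html from the question, inlined so the sketch is self-contained.
html='<html>
<body>
<div class="one">
    <div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>
</div>
</body>
</html>'

result=$(printf '%s\n' "$html" | extract_div research)
printf '%s\n' "$result"
```

With a real file the call would be `extract_div research < example.html`. The printed block keeps the file's original indentation, since lines are copied through unchanged.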

Community · Accepted Answer · 2017-05-23 12:05:35Z

6

You cannot parse HTML with regular expressions. Assuming the HTML is valid XML, you can use:

xmlstarlet sel -t -c '//div[@class="research"]' -nl example.html

<div class="research">
        <div class="two">
            <p>Lorem ipsum...</p>
        </div>
        <div class="three">
            <p>Lorem ipsum...</p>
        </div>
        <div class="four">
            <p>Lorem ipsum...</p>
        </div>
    </div>

edited May 23, 2017 at 12:05

CommunityBot

answered Aug 29, 2013 at 18:56

glenn jackman

1 Comment

msw Over a year ago

+1 You cannot parse HTML with regular expressions. (I just like repeating that).
