I am new to Perl and am trying to read specific content inside the <div class="one"> element of an HTML file.

HTML file:

<div class="one">

    <div id="two">Donec eu libero sit amet quam egestas semper. Aenean ultricies mi vitae est. Mauris placerat eleifend leo.
    </div>

    <pre>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.
    </pre>

</div>

Perl Code:

use strict;
use warnings;

my $file = "content.html";

# Read the whole file into an array, one line per element.
# (Looping over the filehandle first would leave nothing to read afterwards.)
open(my $in, '<', $file) or die "Cannot open $file: $!";
chomp(my @contents = <$in>);
close($in);

# Check whether the content in the HTML file is in the right location.
# If the content is in the correct location (div class="one"),
# print the content in div two and three if they exist.

for my $line (@contents) {
    # class="one" on a line that does not start with <div is misplaced
    if ($line !~ m/^\s*<div/ && $line =~ m/class\s*=\s*"one"/) {
        print "content in wrong location\n";
    }
    elsif ($line =~ m/^\s*<div/ || $line =~ m/^\s*<pre/) {
        print "$line\n";
    }
}
  • That's not a "txt" file, it's an HTML file, and should be handled with an HTML parser. Down the "parse HTML with regex" road lies madness. Commented Apr 22, 2013 at 17:11
  • +1 on using a parser: search.cpan.org/dist/HTML-Parser/Parser.pm Commented Apr 22, 2013 at 17:13
  • @DavidO: It is a text file that happens to contain HTML. It has a MIME type of text/html. Commented Apr 22, 2013 at 17:16

1 Answer

I had some success using HTML::TreeBuilder, which is good at handling broken HTML.
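
As a rough illustration of that approach, here is a minimal sketch that assumes HTML::TreeBuilder is installed from CPAN. The helper name extract_sections and the inline sample HTML are made up for this example and are not from the question; the idea is to locate <div class="one"> with look_down and then walk its direct children, keeping only <div> and <pre> elements.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the original post): returns a list of
# [tag, text] pairs for the <div> and <pre> children of <div class="one">.
sub extract_sections {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In scalar context look_down returns the first matching element or undef.
    # The class test is an exact string match on the attribute value.
    my $outer = $tree->look_down(_tag => 'div', class => 'one');

    my @sections;
    if ($outer) {
        for my $child ($outer->content_list) {
            next unless ref $child;                      # skip plain text nodes
            next unless $child->tag =~ /^(?:div|pre)$/;  # keep only <div>/<pre>
            my $text = $child->as_text;
            $text =~ s/^\s+|\s+$//g;                     # trim surrounding whitespace
            push @sections, [ $child->tag, $text ];
        }
    }

    $tree->delete;    # free the parse tree
    return @sections;
}

# Sample input modelled on the question's file (shortened for the example).
my $html = <<'HTML';
<div class="one">
    <div id="two">Donec eu libero sit amet quam egestas semper.
    </div>
    <pre>Pellentesque habitant morbi tristique senectus.
    </pre>
</div>
HTML

for my $section (extract_sections($html)) {
    my ($tag, $text) = @$section;
    print "<$tag>: $text\n";
}
```

To read from content.html instead, HTML::TreeBuilder->new_from_file($file) could replace new_from_content. If extract_sections returns an empty list, the <div class="one"> wrapper was not found, which may be one way to detect the "content in wrong location" case.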
