How to parse multi-line HTML using regex in Perl

Question

I am trying to extract text from a multi-line string using Perl, but I am getting only the number of matches. Here is a sample of what I am parsing:

<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>

I am trying to get the content to be stored in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ",  @a+0,  ", elements: ";
print join (" ", @a);
print "\n";

but it returns the whole thing, not just the text inside the div. Can someone point out the error in my regex?

Marisa

If one of the answers solved your problem, please accept it so others can see that it was helpful.
– simbabque, Commented Jun 21, 2012 at 15:05

daxim · Accepted Answer · 2012-06-21 14:06:52Z

7

Using a robust HTML parser:

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
$w->find('div.content')->text

The expression returns: Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.

answered Jun 21, 2012 at 14:06

daxim


Sinan Ünür · Accepted Answer · 2012-06-21 14:13:46Z

Use something that is designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl

use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;

my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);
$tree->eof; # signal end of input so the tree is finalized

print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];

Notice that by not relying on homebrewed regular expression patterns, you gain the ability to deal with variations in the input, such as comments that contain tags and elements nested inside the div.

Output:

---
- '  Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
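To see why this matters, here is a minimal sketch that runs a regex over a shortened copy of the same input. The pattern is an assumption for illustration: a non-greedy variant of the one in the question, with a capture group added. It stops at the `</div>` inside the comment, so the span is never captured.

```perl
use strict;
use warnings;

# Shortened copy of the input above; note the "</div>" inside a comment.
my $doc = <<'EO_HTML';
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

# Non-greedy variant of the question's pattern, with a capture group.
my ($captured) = $doc =~ m/class="content">(.*?)<\/div>/s;

# The capture ends at the "</div>" inside the comment, so the span is lost.
my $stops_in_comment = $captured =~ /<!-- end \z/ ? 1 : 0;
my $has_span         = $captured =~ /Here I am/   ? 1 : 0;

print "stops inside comment: $stops_in_comment\n";   # prints 1
print "contains span text:   $has_span\n";           # prints 0
```

A greedy `.*` has the opposite problem: it runs to the last `</div>` in the document and drags the comment and span markup along with it.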

simbabque · Answer · 2012-06-21 13:32 (edited by Community 2017-05-23 12:20:04Z)

4

You are only matching the string; you are not capturing anything. If you want the text inside the div, add a capturing group around it:

if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1;   # $1 is only meaningful after a successful match
    print $text;
}

The text captured by the parentheses is stored in the $1 variable. If there are multiple instances of such a div[class=content], you need a loop like this:

use strict; use warnings;
use Data::Dumper;

my $html = qq~<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
<div id="content-ZAJ9E" class="content">
        I cant get enough!
</div>
~;

my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
  my $group = $1;     
  $group =~ s/^\s+//; # strip whitespace at the beginning
  $group =~ s/\s+$//; # and the end

  push @matches, $group;
}
print Dumper \@matches;
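As a side note to the loop above, the list assignment from the question also works once the pattern has a capture group. This is a minimal sketch, not the only way to do it. The ids "a" and "b" are made up for the example, and the `\s*` around the group is my addition to trim whitespace inside the match.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div id="a" class="content">
        Wow, I love the new top bar.
</div>
<div id="b" class="content">
        I still love it.
</div>
HTML

# No capture group: each element is the whole matched text, markup included.
my @whole = $html =~ m/class="content">.*?<\/div>/gs;

# With a capture group: each element is only what the parentheses matched.
my @inner = $html =~ m/class="content">\s*(.*?)\s*<\/div>/gs;

print "whole matches: ", scalar(@whole), "\n";
print "inner: [", join("] [", @inner), "]\n";
```

In list context with /g, a pattern without parentheses returns every full match, which is why the original code returned "the whole thing". With parentheses it returns the captured parts instead.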

I suggest you take a look at perlre and perlretut.

Some notes:

Always use strict and use warnings!
Try Data::Dumper; it is great for inspecting your variables while debugging.
Using regex for HTML parsing is not the best idea. If you are doing a lot of parsing, consider one of the modules available on CPAN, such as HTML::Parser, HTML::TreeBuilder::XPath, HTML::TokeParser::Simple, or Mojo::DOM, or search for alternatives on SO.

edited May 23, 2017 at 12:20

CommunityBot

answered Jun 21, 2012 at 13:32

simbabque

1 Comment

simbabque Over a year ago

Thanks for the fix Sinan, I was kind of in a hurry. :)
