1

I am trying to parse out part of a multiline string using Perl, but I am only getting the number of matches. Here is a sample of what I am parsing:

<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>

I am trying to get the content to be stored in a string using this code:

@a = ($html =~ m/class="content">.*<\/div>/gs);
print "array A, size: ",  @a+0,  ", elements: ";
print join (" ", @a);
print "\n";

but it returns the whole thing, not just the text inside the div. Can someone point out the error in my regex?

Marisa

If one of the answers solved your problem, please accept it so others can see that it was helpful. (Commented Jun 21, 2012 at 15:05)

3 Answers

7

Using a robust HTML parser:

use strictures;
use Web::Query qw();
my $w = Web::Query->new_from_html(<<'HTML');
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.
</div>
HTML
$w->find('div.content')->text

The final expression returns the div's text: Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look.


5

Use something that is designed to parse HTML, such as HTML::TreeBuilder::XPath:

#!/usr/bin/env perl

use strict; use warnings;
use 5.014;
use HTML::TreeBuilder::XPath;
use YAML;

my $doc =<<EO_HTML;
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->store_comments(1);
$tree->parse($doc);
$tree->eof;    # signal end of input so the tree is finalized

print Dump [ $tree->findvalues('//div[@class="content"]') ];
print Dump [ $tree->findvalues('//*[@class="extra"]') ];
print Dump [ $tree->findvalues('//comment()') ];

Notice that, by not relying on homebrewed regular expression patterns, you gain the ability to handle many variations in the input, such as nested elements and comments that contain tags.

Output:

---
- '  Wow, I love the new top bar, so much easier to navigate now :) Anywho, got a few other fixes I am working on as well. :) I hope you all like the new look. Here I am '
---
- Here I am
---
- ' begin <div> '
- ' end </div> '
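
To see why a hand-rolled pattern struggles with this input, here is a minimal sketch that uses only core Perl. It runs a non-greedy pattern against the same markup as the example above; the pattern is an assumption chosen for illustration, not something taken from this answer. Because the comment `<!-- end </div> -->` contains a literal `</div>`, the match stops inside the comment and never reaches the `<span>`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Same markup as in the example above.
my $doc = <<'EO_HTML';
<div id="content-ZAJ9E" class="content">
<!-- begin <div> -->
        Wow, I love the new top bar, so much easier to navigate now :)
        Anywho, got a few other fixes I am working on as well. :)
        I hope you all like the new look.
<!-- end </div> -->
<span class="extra">Here I am</span>
</div>
EO_HTML

# Illustrative pattern (an assumption for this sketch):
# capture everything up to the first "</div>".
my ($captured) = $doc =~ m{class="content">(.*?)</div>}s;

# The first "</div>" sits inside the comment, so the capture is cut short.
print "capture ends inside the comment\n" if $captured =~ /<!-- end \z/;
print "span was missed\n"                 if $captured !~ /Here I am/;
```

A parser does not have this problem because it tokenizes comments separately from element tags.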


4

You are only matching the string; you are not capturing anything from it. If you want the text inside the div, add a capture group:

if ($html =~ m/class="content">(.*)<\/div>/s) {
    my $text = $1;    # contents of the first capture group
    print $text;
}

The captured text is stored in the $1 variable. If there are multiple div elements with class="content", you need a loop like this:

use strict; use warnings;
use Data::Dumper;

my $html = qq~<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
<div id="content-ZAJ9E" class="content">
        I cant get enough!
</div>
~;

my @matches;
# *? makes it non-greedy so it will only match to the first </div>
while ($html =~ m/class="content">(.*?)<\/div>/gs){ 
  my $group = $1;     
  $group =~ s/^\s+//; # strip whitespace at the beginning
  $group =~ s/\s+$//; # and the end

  push @matches, $group;
}
print Dumper \@matches;
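
As a more compact variant, here is a sketch that relies on the fact that an `m//g` match in list context returns the captured groups of every match, so the array can be filled without an explicit loop. The `\s*` on either side of the capture group is an addition for this sketch that trims the surrounding whitespace; the sample text mirrors the example above.

```perl
use strict;
use warnings;

my $html = <<'HTML';
<div id="content-ZAJ9E" class="content">
        Wow, I love the new top bar.
</div>
<div id="content-ZAJ9E" class="content">
        I still love it.
</div>
HTML

# In list context, m//g returns the contents of the capture group for
# every match. The \s* on both sides (an addition for this sketch)
# keeps leading and trailing whitespace out of the capture.
my @matches = $html =~ m{class="content">\s*(.*?)\s*</div>}gs;

print "array size: ", scalar(@matches), "\n";
print "$_\n" for @matches;
```

This is the smallest change to the code in the question: the only real difference is the parentheses, which tell Perl which part of each match to return.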

I suggest you take a look at perlre and perlretut.
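
To make the greedy versus non-greedy distinction from the loop comment concrete, the following sketch runs both quantifiers against the same input. The two-div string is a made-up sample, not data from the question.

```perl
use strict;
use warnings;

# Hypothetical sample with two adjacent divs.
my $html = '<div class="content">one</div><div class="content">two</div>';

# Greedy: .* runs to the LAST </div>, swallowing the markup in between.
my ($greedy) = $html =~ m{class="content">(.*)</div>}s;

# Non-greedy: .*? stops at the FIRST </div>.
my ($lazy) = $html =~ m{class="content">(.*?)</div>}s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture spans both divs (`one</div><div class="content">two`), while the non-greedy capture is just `one`.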


Some notes:

1 Comment

Thanks for the fix Sinan, I was kind of in a hurry. :)
