How to grep string with regex in Perl?

Question

I am new to Perl and I want to write a simple script that fetches a web page with LWP::Simple's get() and then searches the result for a regex match. Here is my code:

$content = get("http://pl.wikipedia.org/wiki/$arg1");
my $result = grep(/en\.wikipedia\.org\/wiki\/[A-Za-z]+\"\s*title/, $content);
print $result;

When I print the result it is "1". How can I get the String which is hidden there: 'en.wikipedia.org/wiki/TextIWantToGet" title'?

Thanks in advance!

Put parentheses around your $result. In scalar context grep returns the number of matches; in list context it returns the elements that matched. E.g. my ($result) = grep...
– Chris Doyle, Commented Dec 25, 2014 at 16:49
That is not a solution to my problem yet. What I need this script to do is put TextIWantToGet into the $result variable.
– patrykf, Commented Dec 25, 2014 at 16:56

Gilles Quénot · Accepted Answer · 2014-12-25 21:08:28Z

6

What I would do, using your code as a base:

use strict; use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $content = $res->content;

# $1 is only meaningful if the match succeeded, so check before printing
if ($content =~ /en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/) {
    print "$1\n";
}

But parsing HTML with regexes is discouraged. Instead, go further and learn how to use HTML::TreeBuilder::XPath with XPath:

use strict; use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->content );

# Using XPath, search for links that have a 'title' attribute
# and an 'href' attribute containing 'en.wikipedia.org'
my $link = $tree->findvalue(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);
$link =~ s!.*/!!;
print "$link\n";

Just for fun, here is a concise version using WWW::Mechanize:

use strict; use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $m = WWW::Mechanize->new( autocheck => 1 );
$m->get("http://pl.wikipedia.org/wiki/$ARGV[0]");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $m->content );

print join "\n", map { s!.*/!!; $_ } $tree->findvalues(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);

edited Dec 25, 2014 at 21:08

answered Dec 25, 2014 at 16:55

Gilles Quénot


4 Comments

patrykf Over a year ago

Thanks for your answer. It does almost what I need. When $arg1 is "Rower" I get - en.wikipedia.org/wiki/Bicycle" title - as the result. What I want to get is "Bicycle".

Gilles Quénot Over a year ago

The last one does exactly what you want, using a proper solution.

patrykf Over a year ago

But it fails with "Can't locate HTML/TreeBuilder/XPath.pm in @INC (you may need to install the HTML::TreeBuilder::XPath module)" and I have no rights to install this module on the machine the script will be tested on.

Gilles Quénot Over a year ago

So use my first modified solution. For future reference, modules can also be installed as an ordinary user.
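One way to cope with a machine where the module may or may not be present is to load it at run time and fall back to the regex otherwise. The following is only a sketch under assumptions: english_name is a hypothetical helper, the sample HTML is invented for illustration, and the fallback pattern (with "_" added to the character class) is one possible choice rather than the answer's exact code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the English page name found in already-fetched HTML,
# or undef if no matching link is present.
sub english_name {
    my ($html) = @_;

    # Try the XPath route only if the module can actually be loaded.
    if (eval { require HTML::TreeBuilder::XPath; 1 }) {
        my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
        my ($href) = $tree->findvalues(
            '//a[@title]/@href[contains(., "en.wikipedia.org")]'
        );
        $tree->delete;    # free the parse tree
        if (defined $href) {
            (my $name = $href) =~ s!.*/!!;    # keep only the last path segment
            return $name;
        }
    }

    # Fallback: plain regex; a match in list context returns the capture group.
    my ($name) = $html =~ m{en\.wikipedia\.org/wiki/([A-Za-z_]+)"\s*title};
    return $name;
}

# Invented sample input standing in for the real page content.
my $sample = '<li><a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a></li>';
my $name   = english_name($sample);
print defined $name ? "$name\n" : "no English link found\n";
```

Because the module is loaded with require inside eval, the script still compiles and runs when HTML::TreeBuilder::XPath is missing; it simply takes the regex branch.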

Chris Doyle · Accepted Answer · 2014-12-25 17:18:42Z

2

You need to wrap $result in parentheses to force list context instead of scalar context. The Perl documentation for grep says:

"Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true."

So you need to use something like

my ($result) = grep(/en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/, $content);

However, it really depends on which part of the HTML you're actually interested in: the end of the URL, or the title of the page?

The pattern above matches whatever follows /wiki/ as long as it is made of upper- or lowercase letters A-Z; that is all that should be in $result.
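To make the context rules concrete, here is a small sketch that runs the same pattern three ways on a single string. The sample HTML and the variable names are assumptions made up for illustration. Note that grep in list context hands back the matching list element (the whole input string), whereas a plain match in list context hands back the capture group.

```perl
use strict;
use warnings;

# Invented stand-in for the fetched page; a real page is far longer.
my $content = '<li><a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a></li>';
my $re      = qr{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title};

# Scalar context: grep returns how many list elements matched.
my $count = grep { /$re/ } $content;

# List context: grep returns the matching elements themselves,
# which here means the entire $content string.
my ($element) = grep { /$re/ } $content;

# A match in list context returns the capture groups instead.
my ($name) = $content =~ $re;

print "count:   $count\n";
print "element: ", ($element eq $content ? "the whole input string" : $element), "\n";
print "name:    ", (defined $name ? $name : "(no match)"), "\n";
```

With this input the three results are 1, the full input string, and Bicycle respectively.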

edited Dec 25, 2014 at 17:18

answered Dec 25, 2014 at 16:56

Chris Doyle


8 Comments

patrykf Over a year ago

The result assigned to $result is the whole HTML of the page.

Chris Doyle Over a year ago

Can you show what the value of $content is? As mentioned in another answer, using regex to parse HTML is not the best solution. However, if it's what you need to do, I'm happy to help make it work for you. You can test regex code with various sites like perlfect.com/articles/regextutor.shtml

patrykf Over a year ago

The $content variable is the HTML you can get from a wget of pl.wikipedia.org/wiki/rower

patrykf Over a year ago

I want to use a Wikipedia page as a translator (yup, that's stupid, but anyway). Every Wikipedia page at a URL like pl.wikipedia.org/wiki/Rower (Rower is bicycle in Polish, btw) has links on the left to Wikipedia in other languages. So the regex I'm looking for is supposed to grab the link to the wiki page in another language, which looks like en.wikipedia.org/wiki/Bicycle, and this way I get the translation of Rower to Bicycle. That's basically what I want to do.

Chris Doyle Over a year ago

Then the above code should work; it should not pick up anything else, as it specifically looks only for A-Z. I would also suggest adding an underscore to your character class, as some wiki URLs contain underscores, like en.wikipedia.org/wiki/Network_switching_subsystem
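A minimal sketch of the underscore suggestion, run against two invented sample strings (the HTML fragments and variable names are assumptions, not taken from the real page):

```perl
use strict;
use warnings;

# Invented sample inputs; only the shape of the href matters here.
my @samples = (
    '<a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">',
    '<a href="//en.wikipedia.org/wiki/Network_switching_subsystem" title="NSS">',
);

# Same pattern as in the thread, with "_" added to the character class.
my $re = qr{en\.wikipedia\.org/wiki/([A-Za-z_]+)"\s*title};

my @names;
for my $html (@samples) {
    my ($name) = $html =~ $re;    # list context: returns the capture group
    push @names, $name if defined $name;
}

print "$_\n" for @names;
```

Page names containing digits, parentheses or percent-encoded characters would still not match this class, so it may need widening further depending on the titles involved.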
