I am new to Perl and want to write a simple script that fetches a web page via LWP::Simple's get() and then searches the get() result for a regex match. Here is my code:

$content = get("http://pl.wikipedia.org/wiki/$arg1");
my $result = grep(/en\.wikipedia\.org\/wiki\/[A-Za-z]+\"\s*title/, $content);
print $result;

When I print the result it is "1". How can I get the string that actually matched: 'en.wikipedia.org/wiki/TextIWantToGet" title'?

Thanks in advance!

  • Put parentheses around your $result. In scalar context grep returns the number of matches. In list context it returns the list elements that matched. E.g. my ($result) = grep... Commented Dec 25, 2014 at 16:49
  • Thanks for the answer. It returns the whole html now. Commented Dec 25, 2014 at 16:52
  • I will add it as an answer now if you want to accept it Commented Dec 25, 2014 at 16:53
  • It is not a solution to my problem yet. What I need this script to do is put TextIWantToGet into the $result variable. Commented Dec 25, 2014 at 16:56
  • Regexes are not the right tool to parse HTML. Commented Dec 25, 2014 at 17:21

2 Answers

What I would do, building on your code:

use strict; use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $content = $res->content;

# Capture the article name and print it only if the pattern matched
if ($content =~ /en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/) {
    print "$1\n";
}

But parsing HTML with regexes is discouraged. A better approach is to go further and learn how to use HTML::TreeBuilder::XPath:

use strict; use warnings;
use HTML::TreeBuilder::XPath;
use LWP::UserAgent;
use HTTP::Request;

my $arg1 = "Rower";

# Create a user agent object
my $ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => "http://pl.wikipedia.org/wiki/$arg1");

# Pass request to the user agent and get a response back
my $res = $ua->request($req);

# Check the outcome of the response
die $res->status_line, "\n" unless $res->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content( $res->content );

# Using XPath, searching for all links having a 'title' attribute
# and having a 'href' attribute matching 'en.wikipedia.org' 
my $link = $tree->findvalue(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);
$link =~ s!.*/!!;
print "$link\n";

Just for fun, here is a concise version using WWW::Mechanize:

use strict; use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $m = WWW::Mechanize->new( autocheck => 1 );
$m->get("http://pl.wikipedia.org/wiki/$ARGV[0]");
my $tree = HTML::TreeBuilder::XPath->new_from_content( $m->content );

print join "\n", map { s!.*/!!; $_ } $tree->findvalues(
    '//a[@title]/@href[contains(., "en.wikipedia.org")]'
);

4 Comments

Thanks for your answer. It does almost what I need. When $arg1 is "Rower" I get - en.wikipedia.org/wiki/Bicycle" title - as the result. What I want to get is "Bicycle".
The last one does exactly what you want, using a proper solution.
But it fails with "Can't locate HTML/TreeBuilder/XPath.pm in @INC (you may need to install the HTML::TreeBuilder::XPath module)" and I have no rights to install this module on the machine the script will be tested on.
So use my first modified solution. For the future, and for your own knowledge: modules can also be installed as a regular user.

You need to wrap $result in parentheses to force list context instead of scalar context. The Perl documentation for grep says:

"Evaluates the BLOCK or EXPR for each element of LIST (locally setting $_ to each element) and returns the list value consisting of those elements for which the expression evaluated to true. In scalar context, returns the number of times the expression was true."

So you need to use something like

my ($result) = grep(/en\.wikipedia\.org\/wiki\/([A-Za-z]+)\"\s*title/, $content);

However, it really depends on which part of the HTML you're actually interested in: the end of the URL, or the title of the page?

The regex above matches whatever follows /wiki/ as long as it consists of upper- or lowercase A-Z; the intent is that only that part ends up in $result.
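To make the context rules concrete, here is a minimal sketch that runs the same pattern three ways. The $content string is a made-up stand-in for the fetched page, not real Wikipedia markup, and the variable names are only illustrative. It suggests that grep in list context hands back the matching list element (here the whole string), while a plain match in list context hands back the capture group.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the page returned by get(); real markup may differ.
my $content = '<li><a href="//en.wikipedia.org/wiki/Bicycle" title="Bicycle">English</a></li>';

my $re = qr/en\.wikipedia\.org\/wiki\/([A-Za-z]+)"\s*title/;

# Scalar context: grep returns how many list elements matched.
my $count = grep { /$re/ } $content;

# List context: grep returns the matching elements themselves,
# so a one-element list gives back the whole string.
my ($element) = grep { /$re/ } $content;

# A match in list context returns the capture groups instead.
my ($name) = $content =~ $re;

print "count:   $count\n";
print "element: ", ($element eq $content ? "whole string" : "something else"), "\n";
print "name:    ", (defined $name ? $name : "(no match)"), "\n";
```

If only the captured part is wanted, the third form is probably the one to reach for; grep is more useful when $content is a list of lines rather than a single string.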

8 Comments

The result assigned to $result is the whole HTML of the page.
Can you show what the value of $content is? As mentioned in another answer, using regex to parse HTML is not the best solution. However, if it's what you need to do, I'm happy to help make it work for you. You can test regex code with various sites like perlfect.com/articles/regextutor.shtml
The $content variable is the HTML you can get from a wget of pl.wikipedia.org/wiki/rower
I want to use a Wikipedia page as a translator (yup, that's stupid, but anyway). Every Wikipedia page at a URL like pl.wikipedia.org/wiki/Rower (Rower is bicycle in Polish, btw) has links on the left to Wikipedia in other languages. So the regex I am looking for is supposed to grab the link to the wiki page in another language, which looks like en.wikipedia.org/wiki/Bicycle, and this way I get the translation of Rower to Bicycle. That's basically what I want to do.
Then the above code should work. It should not pick up anything else, as it specifically only looks for A-Z. I would also suggest adding an underscore to your character class, as some wiki URLs have underscores in them, like en.wikipedia.org/wiki/Network_switching_subsystem
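The underscore suggestion can be checked with a small sketch. The $html string below is a hypothetical sample link built from the URL mentioned in the comment, not markup copied from a real page. With the original character class the match is expected to fail on such a title, because the class stops at the first underscore and the closing quote never follows.

```perl
use strict;
use warnings;

# Hypothetical sample link; the markup on a real page may differ.
my $html = '<a href="//en.wikipedia.org/wiki/Network_switching_subsystem" title="x">English</a>';

# Original class: letters only. The match stops at the first underscore and fails.
my ($without) = $html =~ m{en\.wikipedia\.org/wiki/([A-Za-z]+)"\s*title};

# Class extended with an underscore, as suggested above.
my ($with) = $html =~ m{en\.wikipedia\.org/wiki/([A-Za-z_]+)"\s*title};

print "letters only:    ", (defined $without ? $without : "(no match)"), "\n";
print "with underscore: ", (defined $with    ? $with    : "(no match)"), "\n";
```

Article names can contain other characters as well (digits, parentheses, percent-encoded bytes), so the class may need widening further depending on the pages involved.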