I've been trying to work this out myself from the CPAN module documentation, but I'm fairly new and I'm confused by some of the terminology and sections in the different modules.

I'm trying to create the object in the code below, and get the absolute URL for relative links extracted from a website.

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;         
use Digest::MD5 qw(md5_hex);
use URI;

my $url = $ARGV[0];

if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);                         
}      

my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );

my $response = $ua->get( $url );  

my $content = $response->decoded_content();

my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);

while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
    $abs_links = $1;
    print "$abs_links\n";
    print "Digest for the above URL is " . md5_hex($abs_links) . "\n";             
}

The problem is that when I put that part outside the while loop (the three-line block preceding the loop), it does not work, whereas if I put the same part inside the while loop, it works fine. This version just gets the relative URLs from a given website, but instead of printing "http://..." it prints "//...".

The script that works fine for me is the following:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;            
use Digest::MD5 qw(md5_hex);
use URI::URL;

my $url = $ARGV[0];                            ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
    exit(0);                                   ## Program stops if not valid URL
}         

my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );

my $response = $ua->get( $url );               ## Get response, not content

my $content = $response->decoded_content();    ## Now let's get the content

while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {    ## All links
    my $links = $1;
    my $abs = new URI::URL "$links";
    my $abs_url = $abs->abs('http:', $links);
    print "$abs_url\n";
    print "Digest for the above URL is " . md5_hex($abs_url) . "\n";              
} 

Any ideas? Much appreciated.

  • This question is unclear. Which is the part that only works inside the loop? I'm guessing that you've posted the non-working version of the code, and the part you're talking about is the 3 lines before the loop, but I'm not sure. But I notice you set the variable $abs_links twice, and the first value it's set to is never used. I suppose it would behave differently if you put the my $abs_links = $links->abs($abs); inside the loop, after the $abs_links = $1, because it would use the other value of $abs_links. Is this what you've done? Commented Jul 2, 2017 at 10:16
  • URI->new($content) is wrong. You should be passing a URL. $links->abs('http:', $content) is completely wrong. It should be $links->abs($url). Commented Jul 2, 2017 at 15:52
  • "I'm confused with some terminology and sections within the different modules" Then you should say what it is that you don't understand. As it is you will probably get a working solution, but you will be no wiser and you probably won't understand the working code either. Commented Jul 2, 2017 at 17:24
  • Thanks everyone! ->David, I edited the post to make it clearer, my apologies for that. I tried that as well but that would not work as the bits seem unlinked between inside and outside the loop. You can see now the script that is working (just added). ->ikegami, thanks, it's clear now that using the $content variable there is totally wrong. ->Borodin, you are right. I get confused on how different modules work as I'm new and I need to get more familiar with it. Please find the script that works, I just added it. Thanks a lot! Next time I will make sure to add these doubts in my posts. Commented Jul 2, 2017 at 18:29
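
To illustrate the point made in the comments, here is a minimal sketch of resolving each extracted href against a base URL with the URI module. The base URL and the href list are made-up values for illustration; in the script the base would presumably be $url (or $response->base) and the hrefs would come from the page. Each href is only known inside the loop, which is why the URI object has to be built there rather than once from $content.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# Hypothetical values for illustration only.
my $base  = 'http://www.example.com/dir/page.html';
my @hrefs = ('/about', 'img/logo.png', '//cdn.example.com/lib.js', 'https://other.example.org/');

# Build one URI object per href and resolve it against the base URL.
my @abs_links = map { URI->new($_)->abs($base)->as_string } @hrefs;

print "$_\n" for @abs_links;
```

Hrefs that are already absolute pass through unchanged, so the same call should be safe for every link found on the page.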

2 Answers

I don't understand your code. There are a few weird bits:

  • [^\W] is the same as \w
  • The regex allows an optional - before and an optional / after .com, i.e. http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
  • URI->new($content) makes no sense: $content is random HTML, not a URI.
  • $links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
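
The regex point can be checked with a short sketch. The pattern is copied from the question; looks_like_dot_com is a hypothetical helper (not from the original post) that assumes the intent was "an http or https URL whose host ends in .com", and tests that by parsing the URL with URI instead of pattern-matching the raw string.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $re = qr{^https?://[^\W]+-?\.com/?}i;   # pattern copied from the question

# Hypothetical helper: parse the URL first, then test the scheme
# and the end of the host name.
sub looks_like_dot_com {
    my ($url) = @_;
    my $u = URI->new($url);
    return 0 unless ($u->scheme // '') =~ /\Ahttps?\z/;
    return ($u->host // '') =~ /\.com\z/i ? 1 : 0;
}

my (%by_regex, %by_host);
for my $candidate ('http://bitwise.complement.biz', 'http://cool-beans.com') {
    $by_regex{$candidate} = $candidate =~ $re ? 1 : 0;
    $by_host{$candidate}  = looks_like_dot_com($candidate);
    print "$candidate: regex=$by_regex{$candidate} host=$by_host{$candidate}\n";
}
```

The original pattern accepts the .biz URL and rejects the hyphenated .com one, while the host-based check does the opposite.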

Here's what I think you're trying to do:

#!/usr/bin/perl
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);

@ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];

my $ua = LWP::UserAgent->new(timeout => 10);

my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";

my $content = $response->decoded_content;
my $base = $response->base;

my @links;
my $p = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;
        if ($tag eq 'a' && $attrs{href}) {
            push @links, "$attrs{href}";  # stringify
        }
    },
    $base,
);

$p->parse($content);
$p->eof;

for my $link (@links) {
    print "$link\n";
    print "Digest for the above URL is " . md5_hex($link) . "\n";
}
  • I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
  • I make sure $ua->get($url) was successful before proceeding.
  • I get the base URL for absolutifying relative links from $response->base.
  • I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.
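
The last two points can be tried without a network connection. The sketch below feeds HTML::LinkExtor a small hand-written HTML string; $base is a made-up stand-in for $response->base, and the markup is invented for illustration. It includes a single-quoted href and an anchor with an extra attribute before href, two cases the regex in the question would likely mishandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical stand-ins for $response->base and $response->decoded_content.
my $base = 'http://www.example.com/dir/page.html';
my $html = <<'HTML';
<p><a href="/about">About</a> <a href='img/logo.png'>Logo</a>
<a class="x" href="//cdn.example.com/lib.js">CDN</a></p>
HTML

my @links;
my $p = HTML::LinkExtor->new(
    sub {
        my ($tag, %attrs) = @_;
        # With a base URL supplied, href arrives as an absolute URI object.
        push @links, "$attrs{href}" if $tag eq 'a' && $attrs{href};
    },
    $base,
);
$p->parse($html);
$p->eof;

print "$_\n" for @links;
```

All three links come back absolute because the parser resolves each one against the base passed as the second constructor argument.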

1 Comment

Just wow, melpomene. Thanks a lot for sharing this with me. I'm totally new to programming and have been "coding" for just a few weeks; this clarified some of my doubts and showed me a new module which is way easier to use than the URI || URI::URL ones. I just edited my post to make it clearer for everyone, and now I understand why passing the content to the URI module is not going to work. However, it did not work when I passed the $url variable or tried similar alternatives. I guess I still need to learn more and keep on trying :)

I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:

#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';

use WWW::Mechanize;         
use Digest::MD5 qw(md5_hex);
use URI;

my $url = $ARGV[0];

if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
    exit(0);                         
}      

my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );

$ua->get( $url );  

foreach my $link ($ua->links) {
  my $abs = $link->url_abs;    # absolute URL, resolved against the page's base
  say $abs;
  say "Digest for the above URL is " . md5_hex("$abs") . "\n";
}

That looks a lot simpler to me.

3 Comments

Thanks a lot for sharing this module with me, Dave. I knew this one, and I agree with you that this is the easiest way to convey the purpose of the exercise. The thing is, the person giving us exercises to practise with explicitly mentioned that the WWW::Mechanize module was somehow "primitive" and that we should avoid it for now, and as I'm quite new, I did not know what to say but to obey for now.
@Jestfer: I'm really not sure that "primitive" is the right word when comparing WWW::Mechanize to LWP::UserAgent :-)
and I agree again with you :) I'm not sure why he does not want us to use Mechanize... It would make our life easier. Thanks again for your comment.
