
So it seemed easy enough: use a series of nested loops to go through a ton of URLs sorted by year/month/day and download the XML files. As this is my first script, I started with the loop, something familiar in any language. I ran it just printing the constructed URLs, and it worked perfectly.

I then wrote the code to download the content and save it separately, and that also worked perfectly with a sample URL across multiple test cases.

But when I combined these two bits of code, it broke: the program just got stuck and did nothing at all.

So I ran the debugger, and as I stepped through, it got stuck on this one line:

warnings::register::import(/usr/share/perl/5.10/warnings/register.pm:25):25:vec($warnings::Bits{$k}, $warnings::LAST_BIT, 1) = 0;

If I just hit r to return from the subroutine, it works and continues to another point back down the call stack, where something similar happens, over and over, for some time. The stack trace:

warnings::register::import('warnings::register') called from file `/usr/lib/perl/5.10/Socket.pm' line 7
Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'Socket.pm' called from file `/usr/lib/perl/5.10/IO/Socket.pm' line 12
IO::Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'IO/Socket.pm' called from file `/usr/share/perl5/LWP/Simple.pm' line 158
LWP::Simple::_trivial_http_get('www.aDatabase.com', 80, '/sittings/1987/oct/20.xml') called from file `/usr/share/perl5/LWP/Simple.pm' line 136
LWP::Simple::_get('http://www.aDatabase.com/1987/oct/20.xml') called from file `xmlfetch.pl' line 28

As you can see, it is getting stuck inside this get($url) method, and I have no clue why. Here is my code:

#!/usr/bin/perl

use LWP::Simple;

$urlBase = 'http://www.aDatabase.com/subheading/';
$day=1;
$month=1;
@months=("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
$year=1987;
$nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";
    
while($year<=2006)
    {
    $month=1;
    while($month<=12)
        {
        $day=1;
        while($day<=31)
            {
            $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            $content = get($newUrl);
            if($content ne $nullXML)
                {
                $filename = "$year-$month-$day.xml";
                open(FILE, ">$filename");
                print FILE $content;
                close(FILE);
                }
            $day++;
            }
        $month++;
        }
    $year++;
    }

I am almost positive it is something tiny I just don't know, but Google has not turned up anything.

EDIT: It's official: it hangs for ages inside this get method, runs for several loops, then hangs again for a while. But it's still a problem. Why is this happening?

  • No idea why it's failing (your code looks OK), but please replace the 'while' loops with 'for' loops (e.g. "for ($year=1987; $year<=2006; $year++)" or "for $year (1987 .. 2006)"). Commented Jan 21, 2009 at 21:38
  • fair call. consider them replaced. bummer about this method though. Commented Jan 21, 2009 at 21:40
  • Please supply 1. output of perl -V, 2. version numbers for LWP::Simple, Socket, and IO::Socket Commented Jan 21, 2009 at 21:55
  • it's also possible that the remote site is rate limiting your requests... Commented Jan 21, 2009 at 21:58
  • I'll second Alnitak's suggestion (the target site is rate-limiting, and the function "hangs" until the request times out). Commented Jan 21, 2009 at 22:09

3 Answers


Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, since there's no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.

You should really

use strict;
use warnings;

in your code - this will help highlight any scoping issues you may have (I'd be surprised if that were the case, but there is a chance that part of the LWP code is messing around with your $urlBase or something). I think it should be enough to change the initial variable declarations (and $newUrl, $content, and $filename) to put 'my' in front to make your code strict.
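Roughly, the top of the script would then look something like this (a sketch only, with example.com standing in for the real host):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $urlBase = 'http://www.example.com/subheading/';
my @months  = ("list of months","jan","feb","mar","apr","may","jun",
               "jul","aug","sep","oct","nov","dec");
my $nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";

with my $newUrl, my $content, and my $filename likewise declared with 'my' at first use inside the loops.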

If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use on each loop, so that when it sticks you can try that URL in a browser and see what happens; alternatively, a packet sniffer (such as Wireshark) could give you some clues.
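For instance, something like this just before the fetch (reusing $newUrl from the question):

warn "fetching $newUrl\n";   # printed to STDERR even if the get() below then hangs
my $content = get($newUrl);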


3 Comments

This worked: adding those "use" statements and throwing a "my" in front of everything did it. Like I said, something tiny I didn't know. Many thanks, and sorry, I'm new to some conventions; I'll remember for the future.
@gnomed: example.com is more than a convention; it is reserved in RFC 2606.
@gnomed: look into Date::Format and Date::Parse; you can collapse all your date loops into a single loop, and simultaneously avoid dates like '2005-02-31'.
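As a rough sketch of that last idea, the core Time::Local module can also do the date validation: its timegm() croaks when the day is out of range for the given month, so an eval can skip impossible dates (@months as in the question):

use Time::Local qw(timegm);

for my $year (1987 .. 2006) {
    for my $month (1 .. 12) {
        for my $day (1 .. 31) {
            # timegm() dies on impossible dates such as 2005-02-31; skip those
            eval { timegm(0, 0, 0, $day, $month - 1, $year) } or next;
            # ... build "$year/$months[$month]/$day.xml" and fetch as before ...
        }
    }
}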

(2006 - 1986) * 12 * 31 is more than 7000. Requesting web pages without a pause is not nice.

Slightly more Perl-like version (code-style-wise):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw(get);    

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            # get() returns undef on failure, so check defined as well
            my $content = get($newUrl);
            if (defined $content && $content ne $nullXML) {
                my $filename = "$year-@{[$month+1]}-$day.xml";
                open my $fh, '>', $filename
                    or die "Can't open '$filename': $!";
                print $fh $content;
                # $fh implicitly closed at end of scope
            }
            sleep 1;    # pause between requests, as noted above
        }
    }
}

3 Comments

Quick heads-up: Perl subtly casts the min..max ranges to an array, then issues an iterator over it (at least in ActivePerl on Windows). Benchmark min..max against (my $i = min; $i < max; ++$i) and the former comes out about 10x slower (as of my last test). I've been slowly migrating all my scripts :)
Thanks for the tidy version; can't say I'm at that level yet, this is still my second day. But about the website requests: since getting it to work, I have changed my program to pause for a slightly more reasonable request rate.
@kyle: You are using an ancient Perl version. for $i ($min..$max) is faster and doesn't consume more memory than for ($i=$min; $i<=$max; ++$i).
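(A quick way to test this claim on your own perl, sketched with the core Benchmark module:)

use Benchmark qw(cmpthese);

cmpthese(-2, {
    range  => sub { my $x = 0; $x += $_ for 1 .. 1_000_000; },
    cstyle => sub { my $x = 0; for (my $i = 1; $i <= 1_000_000; ++$i) { $x += $i } },
});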

LWP::Simple has a getstore function that does most of the fetching-then-saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.
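For example, a sketch reusing $newUrl and $filename from the question (getstore() returns the HTTP status code, and LWP::Simple also exports is_success() to check it):

use LWP::Simple qw(getstore is_success);

my $status = getstore($newUrl, $filename);   # fetch straight to disk
warn "fetch of $newUrl failed: HTTP $status\n" unless is_success($status);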

Comments
