
So it seemed easy enough: use a series of nested loops to go through a ton of URLs sorted by year/month/day and download the XML files. As this is my first script, I started with the loop, something familiar in any language. I ran it just printing the constructed URLs, and it worked perfectly.

I then wrote the code to download the content and save it separately, and that also worked perfectly with a sample URL across multiple test cases.

But when I combined these two bits of code, it broke: the program just got stuck and did nothing at all.

So I ran the debugger, and as I stepped through, it got stuck on this one line:

warnings::register::import(/usr/share/perl/5.10/warnings/register.pm:25):25:vec($warnings::Bits{$k}, $warnings::LAST_BIT, 1) = 0;

If I just hit r to return from the subroutine, it works and continues to another point back down the call stack, where something similar happens, over and over, for some time. The stack trace:

warnings::register::import('warnings::register') called from file `/usr/lib/perl/5.10/Socket.pm' line 7
Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'Socket.pm' called from file `/usr/lib/perl/5.10/IO/Socket.pm' line 12
IO::Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7
eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7
require 'IO/Socket.pm' called from file `/usr/share/perl5/LWP/Simple.pm' line 158
LWP::Simple::_trivial_http_get('www.aDatabase.com', 80, '/sittings/1987/oct/20.xml') called from file `/usr/share/perl5/LWP/Simple.pm' line 136
LWP::Simple::_get('http://www.aDatabase.com/1987/oct/20.xml') called from file `xmlfetch.pl' line 28

As you can see, it is getting stuck inside this get($url) method, and I have no clue why. Here is my code:

#!/usr/bin/perl

use LWP::Simple;

$urlBase = 'http://www.aDatabase.com/subheading/';
$day=1;
$month=1;
@months=("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
$year=1987;
$nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";
    
while($year<=2006)
    {
    $month=1;
    while($month<=12)
        {
        $day=1;
        while($day<=31)
            {
            $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            $content = get($newUrl);
            if($content ne $nullXML)
                {
                $filename = "$year-$month-$day.xml";
                open(FILE, ">$filename");
                print FILE $content;
                close(FILE);
                }
            $day++;
            }
        $month++;
        }
    $year++;
    }

I am almost positive it is something tiny I just don't know, but Google has not turned up anything.

EDIT: It's official: it hangs for ages inside this get method, runs for several loops, then hangs again for a while. But it's still a problem. Why is this happening?

  • No idea why it's failing (your code looks OK), but please replace the 'while' loops with 'for' loops (e.g. "for ($year=1987; $year<=2006; $year++)" or "for $year (1987 .. 2006)"). Commented Jan 21, 2009 at 21:38
  • fair call. consider them replaced. bummer about this method though. Commented Jan 21, 2009 at 21:40
  • Please supply 1. output of perl -V, 2. version numbers for LWP::Simple, Socket, and IO::Socket Commented Jan 21, 2009 at 21:55
  • it's also possible that the remote site is rate limiting your requests... Commented Jan 21, 2009 at 21:58
  • I'll second Alnitak's suggestion (the target site is rate-limiting, and the function "hangs" until the request times out). Commented Jan 21, 2009 at 22:09

3 Answers


Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, since there's no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.

You should really

use strict;
use warnings;

in your code - this will help highlight any scoping issues you may have (I'd be surprised if that were the case, but there is a chance that part of the LWP code is messing around with your $urlBase or something). I think it should be enough to change the initial variable declarations (and $newUrl, $content, and $filename) to put 'my' in front to make your code strict.
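Roughly, the top of the script would then look something like this (a sketch only, with example.com standing in for the real host):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $urlBase = 'http://www.example.com/subheading/';
my @months  = ("list of months","jan","feb","mar","apr","may","jun",
               "jul","aug","sep","oct","nov","dec");
my $nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";

with my $newUrl, my $content, and my $filename likewise declared with 'my' at first use inside the loops.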

If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use on each loop, so that when it sticks you can try that URL in a browser and see what happens; alternatively, a packet sniffer (such as Wireshark) could give you some clues.
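For instance, something like this just before the fetch (reusing $newUrl from the question):

warn "fetching $newUrl\n";   # printed to STDERR even if the get() below then hangs
my $content = get($newUrl);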


3 Comments

This worked: adding those "use" statements and throwing a "my" in front of everything did it. Like I said, something tiny I didn't know. Many thanks, and sorry, I'm new to some conventions; I'll remember for the future.
@gnomed: example.com is more than a convention; it is reserved in RFC 2606.
@gnomed: look into Date::Format and Date::Parse; you can collapse all your date loops into a single loop, and simultaneously avoid dates like '2005-02-31'.
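As a rough sketch of that last idea, the core Time::Local module can also do the date validation: its timegm() croaks when the day is out of range for the given month, so an eval can skip impossible dates (@months as in the question):

use Time::Local qw(timegm);

for my $year (1987 .. 2006) {
    for my $month (1 .. 12) {
        for my $day (1 .. 31) {
            # timegm() dies on impossible dates such as 2005-02-31; skip those
            eval { timegm(0, 0, 0, $day, $month - 1, $year) } or next;
            # ... build "$year/$months[$month]/$day.xml" and fetch as before ...
        }
    }
}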

(2006 - 1986) * 12 * 31 is more than 7000. Requesting web pages without a pause is not nice.

Slightly more Perl-like version (code-style-wise):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw(get);    

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            # get() returns undef on failure, so check defined as well
            my $content = get($newUrl);
            if (defined $content && $content ne $nullXML) {
                my $filename = "$year-@{[$month+1]}-$day.xml";
                open my $fh, '>', $filename
                    or die "Can't open '$filename': $!";
                print $fh $content;
                # $fh implicitly closed at end of scope
            }
            sleep 1;    # pause between requests, as noted above
        }
    }
}

3 Comments

Quick heads-up: Perl subtly casts the min..max ranges to an array, then issues an iterator over it (at least in ActivePerl on Windows). Benchmark min..max against (my $i = min; $i < max; ++$i) and the former comes out about 10x slower (as of my last test). I've been slowly migrating all my scripts :)
Thanks for the tidy version; can't say I'm at that level yet, this is still my second day. But about the website requests: since getting it to work, I have changed my program to pause for a slightly more reasonable request rate.
@kyle: You are using an ancient Perl version. for $i ($min..$max) is faster and doesn't consume more memory than for ($i=$min; $i<=$max; ++$i).
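(A quick way to test this claim on your own perl, sketched with the core Benchmark module:)

use Benchmark qw(cmpthese);

cmpthese(-2, {
    range  => sub { my $x = 0; $x += $_ for 1 .. 1_000_000; },
    cstyle => sub { my $x = 0; for (my $i = 1; $i <= 1_000_000; ++$i) { $x += $i } },
});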

LWP::Simple has a getstore function that does most of the fetching-then-saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.
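For example, a sketch reusing $newUrl and $filename from the question (getstore() returns the HTTP status code, and LWP::Simple also exports is_success() to check it):

use LWP::Simple qw(getstore is_success);

my $status = getstore($newUrl, $filename);   # fetch straight to disk
warn "fetch of $newUrl failed: HTTP $status\n" unless is_success($status);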

Comments
