
I have a wget-like script which downloads a page and then retrieves all the files linked in IMG tags on that page.

Given the URL of the original page and the link extracted from an IMG tag on that page, I need to build the URL of the image file I want to retrieve. Currently I use a function I wrote:

sub build_url {
    my ( $base, $path ) = @_;

    # if the path is absolute just prepend the domain to it
    if ($path =~ /^\//) {
        ($base) = $base =~ /^(?:http:\/\/)?(\w+(?:\.\w+)+)/;
        return "$base$path";
    }

    my @base = split '/', $base;
    my @path = split '/', $path;

    # remove a trailing filename
    pop @base if $base =~ /[[:alnum:]]+\/[\w\d]+\.[\w]+$/;

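    # count the "../" segments in the path; the empty-list assignment
    # forces list context so we get a count rather than a true/false match
    my $relcount = () = $path =~ /\.\.\//g;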
    while ( $relcount-- ) {
        pop @base;
        shift @path;
    }
    return join '/', @base, @path;
}

The thing is, I'm surely not the first person to solve this problem. It is such a general problem that I assume there must be a better, more standard way of dealing with it, using either a core module or something from CPAN, though a core module is preferable. I was thinking about File::Spec but wasn't sure whether it has all the functionality I would need.


2 Answers

5

URI -- for building
HTML::TreeBuilder -- for parsing.
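
As a rough sketch of the URI suggestion: the module's `new_abs` constructor resolves a link against a base URL, which covers what `build_url` does by hand. The helper name `resolve_url` and the example.com URLs are made up for illustration, and note that URI comes from CPAN rather than core Perl.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve a link found in a page against that page's URL.
sub resolve_url {
    my ( $base, $link ) = @_;

    # new_abs returns a URI object; as_string turns it into a plain string
    return URI->new_abs( $link, $base )->as_string;
}

my $page = 'http://example.com/a/b/page.html';    # assumed example URL

# Absolute path, parent-relative path, sibling file, and a full URL
print resolve_url( $page, $_ ), "\n"
    for '/images/logo.png', '../img/x.png', 'pic.jpg',
        'http://other.example/y.gif';
```

This should print `http://example.com/images/logo.png`, `http://example.com/a/img/x.png`, `http://example.com/a/b/pic.jpg` and `http://other.example/y.gif`, one per line. Unlike the hand-written version, the scheme is kept and a link that is already a full URL is left unchanged.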


3 Comments

@eugene y: Thanks, have any suggestions for doing it using only core modules?
@Robert: paste the code from these modules into your script :-)
Ahh, the old "Use the Source Luke" comeback :-) I'll take a look.
1

It sounds like you might want something like my HTML::SimpleLinkExtor module. That's what I use for my wget-like script called webreaper.
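
A minimal sketch of how that might look, assuming (per my reading of the module's documentation) that passing a base URL to `new` makes the extracted links absolute and that `img` returns the links found in IMG tags. The helper name `image_urls` and the sample HTML are invented for the example.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical helper: return the absolute URLs of all IMG links in $html.
sub image_urls {
    my ( $base, $html ) = @_;

    # Assumption: giving new() a base URL resolves relative links against it
    my $extor = HTML::SimpleLinkExtor->new($base);
    $extor->parse($html);

    # Stringify in case the module hands back URI objects
    return map { "$_" } $extor->img;
}

my $html = '<html><body><img src="../img/x.png">'
         . '<a href="foo.html">foo</a></body></html>';

print "$_\n" for image_urls( 'http://example.com/a/page.html', $html );
```

This combines the parsing and URL-building steps, so a separate `build_url` function would not be needed.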
