
I have some input with a link and I want to open that link. For instance, I have an HTML file and want to find all links in the file and open their contents in an Excel spreadsheet.

  • Why oh why must each of your posts be formatted like that? Why? Commented May 27, 2009 at 11:50
  • Are you asking how to get a list of links from some HTML file? Or are you asking how to follow the links? Or are you asking how to get something into an Excel spreadsheet? Commented May 27, 2009 at 12:14
  • The way I read it he/she wants to scrape data from pages that are linked from a given page and put the results in Excel documents. Commented May 27, 2009 at 13:17
  • I want to open the links and read their contents from the HTML file. Commented May 28, 2009 at 8:00

4 Answers


It sounds like you want the linktractor script from my HTML::SimpleLinkExtor module.
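If you would rather call the module directly instead of the bundled linktractor script, a minimal sketch along these lines should work (the file name argument is just an example):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use HTML::SimpleLinkExtor;

# Parse a local HTML file and collect every link it contains.
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse_file( $ARGV[0] );

# links() returns the values of all link-carrying attributes;
# a() would restrict that to the href values of <a> tags.
print "$_\n" for $extor->links;
```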

You might also be interested in my webreaper script. I wrote that a long, long time ago to do something close to this same task. I don't really recommend it because other tools are much better now, but you can at least look at the code.

CPAN and Google are your friends. :)

Mojo::UserAgent is quite nice for this, too:

use Mojo::UserAgent;

print Mojo::UserAgent
    ->new
    ->get( $ARGV[0] )
    ->res
    ->dom->find( "a" )
    ->map( attr => "href" )
    ->join( "\n" );



That sounds like a job for WWW::Mechanize. It provides a fairly high level interface to fetching and studying web pages.

Once you've read the docs, I think you'll have a good idea how to go about it.
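As a starting point, a short sketch of the Mechanize approach might look like this (the URL comes from the command line here; adapt as needed):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use WWW::Mechanize;

# autocheck => 1 makes any failed request die with a useful message.
my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get( $ARGV[0] );    # e.g. http://example.com/

# links() returns WWW::Mechanize::Link objects; url() gives the raw href.
for my $link ( $mech->links ) {
    print $link->url, "\n";
}
```

From there, $mech->get can follow each URL in turn if you want the linked pages' contents as well.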


use WWW::Mechanize; my $mech = WWW::Mechanize->new( autocheck => 1 ); $mech->get( "google.com" ); print $mech->content; — I'm getting this error: Error GETing google.com: Can't connect to www.google.com:80 (connect: Unknown error). What is wrong?
google.com is special. It doesn't like robots. However, it sounds like you have a network issue if you can't even connect.

There is also Web::Query:

#!/usr/bin/env perl 

use 5.10.0;

use strict;
use warnings;

use Web::Query;

say for wq( shift )->find('a')->attr('href');

Or, from the cli:

$ perl -MWeb::Query -E'say for wq(shift)->find("a")->attr("href")' \
       http://techblog.babyl.ca



I've used URI::Find for this in the past (for when the file is not HTML).
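A rough sketch of that approach, assuming the input comes from a file named on the command line:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use URI::Find;

# Slurp the (possibly non-HTML) input file.
my $text = do {
    local $/;
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
    <$fh>;
};

# The callback runs once per URI found; returning the original
# text leaves the input unmodified.
my $finder = URI::Find->new( sub {
    my ( $uri, $orig_text ) = @_;
    print "$uri\n";
    return $orig_text;
} );
$finder->find( \$text );
```

This is handy for plain-text sources, since it spots bare URIs rather than relying on HTML markup.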

