Get value from HTML table with PERL

Question

I am trying to get the values of specific td cells from an existing HTML table. Can anyone help me with it?

The existing table's code is as below.

<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">[email protected]</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>

I tried the Perl script below, but it does not work.

$pp = get("http://www.domain.com/something_something");
$out[0]="/home/.../public_html/perl_output.txt";
($firstname) = ($str =~ /<td id="firstname" class="value">(.+?)<\/firstname/);
($surname) = ($str =~ /<td id="surname" class="value">(.+?)<\/surname/);
($email) = ($str =~ /<td id="email" class="value">(.+?)<\/email/);
($telephone) = ($str =~ /<td id="telephone" class="value">(.+?)<\/telephone/);

print "First Name: $firstname \n";
print "Last Name: $surname \n";
print "Email: $email \n";
print "Telephone: $telephone \n";

exit;

Can anyone guide me?

Community · Accepted Answer · 2017-05-23 11:45:01Z

4

This answer solves the problem described in the question, but not the actual problem OP has revealed in the comments.

Because Web::Scraper is for HTML documents, this is not going to work with the website that OP wants to scrape. It uses XML. See my other answer for a solution that deals with XML.

Don't try to parse HTML with regular expressions! Use an HTML parser instead.

For web scraping I prefer Web::Scraper. It does everything from fetching the page to parsing the content in a very simple DSL.

use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dumper;

my $people = scraper {
    # this will parse all tables and put the results into the key people
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT'; # grab those ids
        process '#surname',   last_name  => 'TEXT'; # and put them into
        process '#email',     email      => 'TEXT'; # a hashref with the
        process '#telephone', phone      => 'TEXT'; # 2nd arg as key
    };
    result 'people'; # only return the people key
};
my $res = $people->scrape( URI->new("http://www.domain.com/something_something") );

print Dumper $res;

__DATA__
$VAR1 = [
  {
    first_name => 'ALEXANDR',
    last_name => 'PUSHKIN',
    email => '[email protected]',
    phone => '+991122334455',
  }
]

If one of the fields, like email or firstname, occurs multiple times in one table, you can use an array reference for it. In that case the document's HTML would not be valid because of the duplicate ids, though. Use a different selector and hope it works.

 process '#email', 'email[]' => 'TEXT';

Now you'll get this kind of structure:

{
  email => [
   '[email protected]',
   '[email protected]',
  ],
}
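
Reading such a structure back is a plain loop over the array reference. Here is a minimal sketch using only core Perl, with a hard-coded hashref standing in for one scraped record. The variable name $person and both addresses are made-up placeholders, not values from the real page.

```perl
use strict;
use warnings;

# Stand-in for one scraped record; the addresses are made-up placeholders.
my $person = {
    email => [ 'first@example.com', 'second@example.com' ],
};

# Dereference the array reference and visit each value in order.
my @lines;
for my $address ( @{ $person->{email} } ) {
    push @lines, "Email: $address";
}

print "$_\n" for @lines;
```

The same loop works for any other key that was declared with the [] suffix.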

edited May 23, 2017 at 11:45

CommunityBot

answered Feb 18, 2016 at 14:00

simbabque

8 Comments

simbabque Over a year ago

Note: I haven't run this code because there was no real URL supplied and Web::Scraper doesn't play well with __DATA__.

user5934920 Over a year ago

Many thanks. How would the code look if there is more than one email address or telephone number? A foreach loop should be included somehow, shouldn't it?

Dave Cross Over a year ago

Give us an example of the HTML including multiple values.

simbabque Over a year ago

@esqeudero: Yes, we would need example data. It depends if it's normalized.

user5934920 Over a year ago

For example, I want to get the values for each published paper (article) from the existing metadata at the link (ejeps.com/index.php/ejeps/…). I need only these values, but there may be more than one author: #dc_title #dc_author #dc_affiliation #dc_email #dc_jel #dc_keywords #dc_description #dc_format #dc_source #dc_year #dc_volume #dc_issue #dc_pages #dc_pdfurl

Community · Accepted Answer · 2017-05-23 12:31:08Z

1

Since it came out that the document is actually XML, here is a solution that uses an XML parser to deal with it, and also takes into account multiple fields. XML::Twig is very useful for this, and it even lets us download the document.

use strict;
use warnings;
use XML::Twig;
use Data::Printer;

my @docs; # we will save the docs here
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;

            my $foo = {
                # grab all elements of type 'dc:author' inside our
                # element and call text_only on them
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };

            push @docs, $foo;
        }
    }
);

$twig->parseurl("http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc");

p @docs;

__END__

[
    [0]  {
        author   [
            [0] "Nazila Isgandarova"
        ],
        email    [
            [0] "[email protected]"
        ]
    },
    [1]  {
        author   [
            [0] "Mette Nordahl Grosen",
            [1] "Bezen Balamir Coskun"
        ],
        email    [
            [0] "[email protected]",
            [1] "[email protected]"
        ]
    },
# ...
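
To turn a structure like this into readable lines, one option is to pair each author with the email at the same index. This is only a sketch: it assumes the two lists are in the same order, which the feed does not guarantee. The @docs below is a hard-coded stand-in with the author names from the output above and placeholder addresses.

```perl
use strict;
use warnings;

# Stand-in for @docs as filled by the handler; the emails are placeholders.
my @docs = (
    {
        author => [ 'Mette Nordahl Grosen', 'Bezen Balamir Coskun' ],
        email  => [ 'first@example.com', 'second@example.com' ],
    },
);

my @pairs;
for my $doc (@docs) {
    my @authors = @{ $doc->{author} };
    my @emails  = @{ $doc->{email} };
    for my $i ( 0 .. $#authors ) {
        # An author without a matching email gets an empty string.
        my $email = defined $emails[$i] ? $emails[$i] : '';
        push @pairs, "$authors[$i] <$email>";
    }
}

print "$_\n" for @pairs;
```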

edited May 23, 2017 at 12:31

CommunityBot

answered Feb 18, 2016 at 16:00

simbabque

1 Comment

simbabque Over a year ago

Now I stole my own accepted answer. That's a first. :D

Helmut Wollmersdorfer · Accepted Answer · 2016-02-18 17:05:39Z

0

First, you really should use an XML parser.

Now to some possible reasons why the code does not work:

Your regular expressions expect an ending tag, e.g. </firstname, which does not exist in your HTML.
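
A minimal sketch of that point, using only core Perl and the first row of the table from the question held in a string ($html is a name chosen here for illustration). Note also that the script in the question fetches the page into $pp but matches against $str, so its patterns never see the downloaded content.

```perl
use strict;
use warnings;

# One row of the table from the question, held in a string for illustration.
my $html = '<tr><td class="key">FIRST NAME</td>'
         . '<td id="firstname" class="value">ALEXANDR</td></tr>';

# The pattern from the question waits for "</firstname", which never appears.
my ($broken) = $html =~ /<td id="firstname" class="value">(.+?)<\/firstname/;

# The cell is closed by "</td>", so anchoring on that tag matches.
my ($fixed) = $html =~ /<td id="firstname" class="value">(.+?)<\/td>/;

print defined $broken ? "broken: $broken\n" : "broken: no match\n";
print "fixed: $fixed\n";
```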

If the HTML is plain and reliable and you really want a regex, it would be better to write it like this:

m/<td
  [^>]+    # whitespace and any attributes before the id
  id="firstname"
  [^>]*    # any attributes after the id (may be none)
  >
  ([^<]+?) # the cell text: anything but '<'
  <
/xms;

This does not take into account the case insensitivity of HTML, the decoding of HTML entities, or other allowed quote characters.
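
As a rough sketch of how such a pattern could be applied to several cells at once, the loop below builds it per id and adds the /i flag so that upper-case tags and attribute names also match. It uses only core Perl and three rows of the table from the question. The %value hash and the loop are illustrative assumptions, and entities and other quote styles are still not handled.

```perl
use strict;
use warnings;

# Three rows of the table from the question, held in a string.
my $html = <<'HTML';
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table>
HTML

my %value;
for my $id (qw(firstname surname telephone)) {
    # \Q...\E quotes the id so it is matched literally.
    my ($text) = $html =~ m/<td
        [^>]+          # whitespace and any attributes before the id
        id="\Q$id\E"
        [^>]*          # any attributes after the id
        >
        ([^<]+?)       # the cell text
        <
    /xmsi;
    $value{$id} = $text;
}

print "$_: $value{$_}\n" for sort keys %value;
```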

answered Feb 18, 2016 at 17:05

Helmut Wollmersdorfer
