
I am trying to get the values of specific td cells from an existing HTML table. Can anyone help me with it?

The existing table's code is as below.

<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
<tr><td class="key">EMAIL</td><td id="email" class="value">[email protected]</td></tr>
<tr><td class="key">TELEPHONE</td><td id="telephone" class="value">+991122334455</td></tr>
</table> 

I tried the Perl script below, but it does not work.

$pp = get("http://www.domain.com/something_something");
$out[0]="/home/.../public_html/perl_output.txt";
($firstname) = ($str =~ /<td id="firstname" class="value">(.+?)<\/firstname/);
($surname) = ($str =~ /<td id="surname" class="value">(.+?)<\/surname/);
($email) = ($str =~ /<td id="email" class="value">(.+?)<\/email/);
($telephone) = ($str =~ /<td id="telephone" class="value">(.+?)<\/telephone/);

print "First Name: $firstname \n";
print "Last Name: $surname \n";
print "Email: $email \n";
print "Telephone: $telephone \n";

exit;

Can anyone guide me?

3 Answers


This answer solves the problem described in the question, but not the actual problem OP has revealed in the comments.

Because Web::Scraper is for HTML documents, this is not going to work with the website that OP wants to scrape. It uses XML. See my other answer for a solution that deals with XML.


Don't try to parse HTML with regular expressions! Use an HTML parser instead.

For web scraping I prefer Web::Scraper. It does everything from fetching the page to parsing the content in a very simple DSL.

use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dumper;

my $people = scraper {
    # this will parse all tables and put the results into the key people
    process 'table', 'people[]' => scraper {
        process '#firstname', first_name => 'TEXT'; # grab those ids
        process '#surname',   last_name  => 'TEXT'; # and put them into
        process '#email',     email      => 'TEXT'; # a hashref with the
        process '#telephone', phone      => 'TEXT'; # 2nd arg as key
    };
    result 'people'; # only return the people key
};
my $res = $people->scrape( URI->new("http://www.domain.com/something_something") );

print Dumper $res;

__DATA__
$VAR1 = [
  {
    first_name => 'ALEXANDR',
    last_name => 'PUSHKIN',
    email => '[email protected]',
    phone => '+991122334455',
  }
]

If one of the fields, like email or firstname, occurs multiple times in one table, you can use an array reference for it. In that case the document's HTML would not be valid, because ids must be unique. Use a different selector and pray it works.

 process '#email', 'email[]' => 'TEXT';

Now you'll get this kind of structure:

{
  email => [
   '[email protected]',
   '[email protected]',
  ],
}
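
Reading the values back out is then a plain loop over the array reference. The following is a minimal sketch, assuming a result shaped like the structure above. The hash is hard-coded in place of a real scrape, and the addresses are hypothetical placeholders.

```perl
use strict;
use warnings;

# Stand-in for the hashref the scraper returns; the addresses are
# hypothetical placeholders, not values from a real page.
my $person = {
    email => [ 'first@example.com', 'second@example.com' ],
};

# Build one output line per address; works for zero, one, or many entries.
my @lines;
for my $address ( @{ $person->{email} } ) {
    push @lines, "Email: $address";
}

print "$_\n" for @lines;
```

The same loop applies to any other key that was declared with the `[]` suffix.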

8 Comments

Note: I haven't run this code because there was no real URL supplied and Web::Scraper doesn't play well with __DATA__.
Many thanks. How would the code look if there is more than one email address or telephone number? Should a foreach loop be included somehow?
Give us an example of the HTML including multiple values.
@esqeudero: Yes, we would need example data. It depends if it's normalized.
For example, I want to get the values for each published paper (article) from the existing metadata at the link (ejeps.com/index.php/ejeps/…). I need only these values, but there may be more than one author: #dc_title #dc_author #dc_affiliation #dc_email #dc_jel #dc_keywords #dc_description #dc_format #dc_source #dc_year #dc_volume #dc_issue #dc_pages #dc_pdfurl

Since it came out that the document is actually XML, here is a solution that uses an XML parser to deal with it, and also takes into account multiple fields. XML::Twig is very useful for this, and it even lets us download the document.

use strict;
use warnings;
use XML::Twig;
use Data::Printer;

my @docs; # we will save the docs here
my $twig = XML::Twig->new(
    twig_handlers => {
        'oai_dc:dc' => sub {
            my ($t, $elt) = @_;

            my $foo = {
                # grab all elements of type 'dc:author' inside our
                # element and call text_only on them
                author => [ map { $_->text_only } $elt->descendants('dc:author') ],
                email => [ map { $_->text_only } $elt->descendants('dc:email') ],
            };

            push @docs, $foo;
        }
    }
);

$twig->parseurl("http://ejeps.com/index.php/ejeps/oai?verb=ListRecords&metadataPrefix=oai_dc");

p @docs;

__END__

[
    [0]  {
        author   [
            [0] "Nazila Isgandarova"
        ],
        email    [
            [0] "[email protected]"
        ]
    },
    [1]  {
        author   [
            [0] "Mette Nordahl Grosen",
            [1] "Bezen Balamir Coskun"
        ],
        email    [
            [0] "[email protected]",
            [1] "[email protected]"
        ]
    },
# ...
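
To turn that structure into one line per author, the two arrays can be walked by index. This is only a sketch: it assumes the n-th email belongs to the n-th author, which the feed does not guarantee. The `@docs` array is hard-coded as a stand-in for the twig result, with author names taken from the output above and hypothetical placeholder addresses.

```perl
use strict;
use warnings;

# Stand-in for the @docs array built by the twig handler. Author names come
# from the output above; the addresses are hypothetical placeholders.
my @docs = (
    {
        author => [ 'Nazila Isgandarova' ],
        email  => [ 'author1@example.com' ],
    },
    {
        author => [ 'Mette Nordahl Grosen', 'Bezen Balamir Coskun' ],
        email  => [ 'author2@example.com', 'author3@example.com' ],
    },
);

my @rows;
for my $doc (@docs) {
    my @authors = @{ $doc->{author} };
    my @emails  = @{ $doc->{email} };

    # Assumption: the n-th email belongs to the n-th author.
    for my $i ( 0 .. $#authors ) {
        my $email = defined $emails[$i] ? $emails[$i] : '(no email)';
        push @rows, "$authors[$i] <$email>";
    }
}

print "$_\n" for @rows;
```

If a record has fewer emails than authors, the remaining authors are listed with "(no email)" instead of raising a warning.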

1 Comment

Now I stole my own accepted answer. That's a first. :D

First, you really should use an XML parser.

Now to some possible reasons why the code does not work:

Your regular expressions expect an ending tag, e.g. </firstname, which does not exist in your HTML. The cells are closed with </td>.
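
Note also that the script in the question stores the page in $pp but matches against $str. A minimal sketch of the smallest possible fix, with two rows of the question's table inlined as a stand-in for the downloaded page; it still depends on the attribute order and quoting being exactly as shown:

```perl
use strict;
use warnings;

# Inline copy of two rows from the question, standing in for the page
# that get() would return.
my $str = <<'HTML';
<table>
<tr><td class="key">FIRST NAME</td><td id="firstname" class="value">ALEXANDR</td></tr>
<tr><td class="key">SURNAME NAME</td><td id="surname" class="value">PUSHKIN</td></tr>
</table>
HTML

# The closing tag is </td>, and the match runs against the variable that
# actually holds the HTML.
my ($firstname) = $str =~ /<td id="firstname" class="value">(.+?)<\/td>/;
my ($surname)   = $str =~ /<td id="surname" class="value">(.+?)<\/td>/;

print "First Name: $firstname\n";
print "Last Name: $surname\n";
```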

If the HTML is plain and reliable and you really want a regex, it would be better written like this:

m/<td    
  [^>]+    # anything but '>'
  id="firstname"
  [^>]*    # anything but '>' (may be empty if id is the last attribute)
  >
  ([^<]+?) # anything but '<'
  <
/xms;

This does not take into account the case insensitivity of HTML, the decoding of HTML entities, or other allowed quote characters.
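
Two of those gaps, tag case and quote style, can be narrowed with a slightly longer pattern. The following is a rough sketch only: `td_text_by_id` is a hypothetical helper name, entity decoding is still not handled, and the /i flag also makes the id comparison case-insensitive, which is looser than HTML requires.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text of the first <td> whose id matches,
# or undef if there is none. Handles upper- or lower-case tags and single
# or double quotes only; entities are returned undecoded.
sub td_text_by_id {
    my ($html, $id) = @_;
    my ($text) = $html =~ m{
        <td \b                        # opening tag, any case because of /i
        [^>]*                         # other attributes before the id
        \s id \s* = \s*
        (?: "\Q$id\E" | '\Q$id\E' )   # id in double or single quotes
        [^>]*                         # other attributes after the id
        >
        ([^<]*)                       # cell text up to the next tag
        <
    }xi;
    return $text;
}

# Sample rows: one in the question's style, one upper-case with single quotes.
my $html = q{<tr><TD CLASS='value' ID='firstname'>ALEXANDR</TD></tr>}
         . q{<tr><td id="surname" class="value">PUSHKIN</td></tr>};

print td_text_by_id($html, 'firstname'), "\n";
print td_text_by_id($html, 'surname'), "\n";
```

Anything beyond this, such as unquoted attribute values or nested markup inside the cell, is a good sign that a real parser is the better tool.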
