Trying to parse a text file for a string and print a value

Question

I'm trying to parse 40+ text files in a directory for the word "Phone:" and print the phone number that follows it. I'm a complete Perl novice, so any help is greatly appreciated. I had to comment out use strict or the script wouldn't run.

Here's my code:

#!/usr/bin/perl
#use strict;
use warnings;

my $DIR = "/Ask";
opendir $DIR, '.' or die "opendir .: $!\n";
my @files = grep /\.txt$/i, readdir $DIR;
closedir $DIR;

print "Got ", scalar @files, " files\n";

my %seen = ();
foreach my $file (@files) {
    open my $FILE, '<', $file or die "$file: $!\n";
    while (<$FILE>) {
        #print "test\n";
        if (/^phone\s*(.*)\r?$/i) {
            $seen{$1} = 1;
            foreach my $addr ( sort keys %seen ) {
                print "$addr\n";
            }
        }
    }
    close $FILE;
}

It sees the files but never seems to match the pattern or print any results. I can also easily convert the files to HTML and parse them that way.

Thanks for all of the assistance so far. Here are a few more questions that have come up and an example of the files that I'm parsing:

Here's an example of the short files I'm parsing:

Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd

I think the problem is that "phone:" isn't always at the start of the line. If I modify my files and put it there, everything works, but the script seems to have trouble finding it in the middle of a line. Ideas?

You may want to add a Perl tag to the question to get more pertinent viewers.
– Daryl Behrens, Commented Oct 14, 2014 at 19:50
Disabling strict is as good an idea as putting tape over a car's indicator lights. In both cases it only looks like it solves the problem.
– mpapec, Commented Oct 14, 2014 at 20:30

Miller · Accepted Answer · 2014-10-14 21:12:31Z

1

A few things:

Never comment out use strict;
Don't end your die messages with a newline; that tells die to omit the file name and line number from the message.
You're using %seen to make your phone numbers unique, so print the results after the line-reading loop rather than inside it. Additionally, declare %seen as a lexical inside the outer loop, or phone numbers from previous files will still be around.
If you aren't getting any results, then your regex is probably not matching. Perhaps the ^ anchor is too limiting.
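The effect of a trailing newline on die can be checked directly. This is a minimal sketch; the helper name die_message is hypothetical and only serves the illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: die with the given message inside eval
# and return the text that die actually produced.
sub die_message {
    my ($msg) = @_;
    eval { die $msg };
    return $@;
}

# With a trailing newline the message is passed through unchanged.
print "with newline:    ", die_message("open failed\n");

# Without one, Perl appends " at <script> line <N>." to the message.
print "without newline: ", die_message("open failed");
```

The second form is usually more helpful while debugging, because it shows where the failure happened.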

Here's some cleanup of your script:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $DIR = "/Ask";

my @files = do {
    opendir my $dh, '.' or die "opendir .: $!";
    grep /\.txt$/i, readdir $dh;
};

print "Got ", scalar @files, " files\n";

foreach my $file (@files) {
    open my $fh, '<', $file or die "$file: $!";

    my %seen;

    while (<$fh>) {
        if (/^phone\s*(.*)$/i) {
            $seen{$1} = 1;
        }
    }

    foreach my $addr ( sort keys %seen ) {
        print "$addr\n";
    }

    close $fh;
}
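For the sample line quoted in the question, where "phone:" appears in the middle of the line, one option is to drop the ^ anchor and capture only the number that follows the label. The sketch below assumes a number is a run of digits, spaces, and hyphens with an optional leading plus sign; the helper name extract_phones is hypothetical and not part of the script above:

```perl
use strict;
use warnings;

# Hypothetical helper: return every phone number that follows a
# "phone" label anywhere in the line (no ^ anchor).
sub extract_phones {
    my ($line) = @_;
    my @found;
    # \b stops "phone" from matching inside a longer word such as
    # "telephone"; the colon after the label is optional.
    while ($line =~ /\bphone\b\s*:?\s*(\+?\d[\d\s-]*\d)/ig) {
        push @found, $1;
    }
    return @found;
}

my $sample = 'Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474';
print "$_\n" for extract_phones($sample);   # prints 4083458886
```

Inside the file loop, each returned value could be stored as a key of %seen in place of $1.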

edited Oct 14, 2014 at 21:12

answered Oct 14, 2014 at 20:59

Miller


6 Comments

tlialin Over a year ago

Thanks to all so far. How about parsing HTML or RTF? The first script worked well; the second one gave me the following error: Bad symbol for dirhandle at parse2.pl line 10. Thanks, Tony

Miller Over a year ago

I had a typo at line 10. Corrected.

tlialin Over a year ago

Thanks. What about returning multiple strings like phone, address, zip?

Miller Over a year ago

Anything is possible. However, you've shared no information about the nature of your data, so it is literally impossible to advise you in more detail. If you want to parse additional fields, you'll just need to program that parsing.

tlialin Over a year ago

Here's an example of the short files I'm parsing: Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd. I think the problem is that "phone:" isn't always at the start of the line. If I modify my files and put it there, everything works, but the script seems to have trouble finding it in the middle of a line. Ideas?
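For a line shaped like this sample, several labelled fields can be pulled out in one pass by capturing each label and then everything up to the next label. This is only a sketch under assumptions: the label list (Toll-Free, Phone, Fax, Address) is taken from the single sample line, and the helper name extract_fields is hypothetical. Real files may need more labels or different separators.

```perl
use strict;
use warnings;

# Hypothetical helper: return a hash of label => value for one line.
# The label list is an assumption based on the one sample line.
sub extract_fields {
    my ($line) = @_;
    my $labels = qr/Toll-Free|Phone|Fax|Address/i;
    my %field;
    # Capture a label, then lazily take text until the next label
    # (optionally preceded by "." or ",") or the end of the line.
    while ($line =~ /\b($labels)\s*:\s*(.*?)\s*(?=[.,]?\s*\b$labels\s*:|$)/g) {
        $field{ lc $1 } = $2;
    }
    return %field;
}

my %f = extract_fields(
    'Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, '
  . 'phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd'
);
print "$_ => $f{$_}\n" for sort keys %f;
```

The keys are lower-cased so that "phone:" and "Phone:" end up in the same slot.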


Brandon Thompson · Accepted Answer · 2014-10-15 00:33:47Z

0

Try calling chomp() on each line to remove the trailing newline character "\n":

while (<$FILE>) {
    chomp;
    if (/^phone\s*(.*)\r?$/i) {
        $seen{$1} = 1;
        foreach my $addr ( sort keys %seen ) {
            print "$addr\n";
        }
    }
}

Alternatively, you can add the 's' modifier to your regular expression, which makes it treat the string as a single line and allows your ".*" to consume newline characters:

while (<$FILE>) {
    if (/^phone\s*(.*)\r?$/is) {
        $seen{$1} = 1;
        foreach my $addr ( sort keys %seen ) {
            print "$addr\n";
        }
    }
}
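How "$", chomp, and the 's' modifier interact can be checked on a single line. By default "$" can match just before a trailing newline and "." does not match "\n", so the sketch below compares what the question's regex captures with and without /s. The helper name capture_after_phone is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: apply the question's regex to one line,
# optionally with /s, and return the captured text (undef if no match).
sub capture_after_phone {
    my ($line, $use_s) = @_;
    my $re = $use_s ? qr/^phone\s*(.*)\r?$/is : qr/^phone\s*(.*)\r?$/i;
    return $line =~ $re ? $1 : undef;
}

my $line = "Phone: 4083458886\n";

# Without chomp and without /s: "$" matches just before the trailing
# newline, so the capture holds no "\n".
my $plain = capture_after_phone($line, 0);

# With /s: "." may match "\n", so the greedy (.*) swallows it.
my $dotall = capture_after_phone($line, 1);

printf "plain ends with newline: %s\n", ($plain  =~ /\n\z/ ? 'yes' : 'no');
printf "/s ends with newline: %s\n",    ($dotall =~ /\n\z/ ? 'yes' : 'no');
```

Both variants still require "phone" at the very start of the line, so neither matches a line where the label appears in the middle.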

answered Oct 15, 2014 at 0:33

Brandon Thompson

