I'm trying to parse 40+ text files in a directory for the string "Phone:" and print the phone number that follows it. I'm a complete Perl novice, so any help is greatly appreciated. I had to comment out use strict or the script wouldn't run.

Here's my code:

#!/usr/bin/perl
#use strict;
use warnings;

my $DIR = "/Ask";
opendir $DIR, '.' or die "opendir .: $!\n";
my @files = grep /\.txt$/i, readdir $DIR;
closedir $DIR;

print "Got ", scalar @files, " files\n";

my %seen = ();
foreach my $file (@files) {
    open my $FILE, '<', $file or die "$file: $!\n";
    while (<$FILE>) {
        #print "test\n";
        if (/^phone\s*(.*)\r?$/i) {
            $seen{$1} = 1;
            foreach my $addr ( sort keys %seen ) {
                print "$addr\n";
            }
        }
    }
    close $FILE;
}

It sees the files but never seems to match the pattern and print the results. I can also convert the files to HTML easily and parse them that way.

Thanks for all of the assistance so far. Here are a few more questions that have come up and an example of the files that I'm parsing:

Here's an example of the short files I'm parsing: Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd. I think the problem is that "phone:" isn't always at the start of the line. If I modify my files and put it there, everything works, but the script seems to have trouble finding it in the middle of a line. Ideas?

  • You may want to add a Perl tag to the question to get more pertinent viewers. Commented Oct 14, 2014 at 19:50
  • Do you need a : after /^phone in your regex? Commented Oct 14, 2014 at 20:02
  • Yep, change your regex to ^phone\s*:\s*(.*)\r?$ Commented Oct 14, 2014 at 20:15
  • You should also un-comment use strict; Commented Oct 14, 2014 at 20:21
  • Disabling strict is as good an idea as putting tape over a car's indicator lights. In both cases it only looks like it solves the problem. Commented Oct 14, 2014 at 20:30

2 Answers

A few things:

  • Never comment out use strict;

  • Don't end your die messages with a newline. A trailing newline tells die to omit the file name and line number from the message.

  • You're using %seen to make your phone numbers unique, so print its contents after the loop that reads the file, not inside it. Additionally, declare %seen lexically inside the per-file loop, or phone numbers from previous files will still be around.

  • If you aren't getting any results, then your regex is probably not matching. Perhaps the anchor is too limiting: ^
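The point about die and newlines can be checked with a minimal sketch. The variable names and message strings here are made up for illustration; eval is used only to catch the error so the script keeps running.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# No trailing newline: die appends " at <file> line <N>." to the message.
eval { die "no newline" };
my $without = $@;

# Trailing newline: die uses the message exactly as written.
eval { die "with newline\n" };
my $with = $@;

print "without: $without";
print "with:    $with";
```

The first message ends with the script name and line number, which is usually what you want while debugging an open or opendir failure.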

Here's some cleanup of your script:

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $DIR = "/Ask";

my @files = do {
    opendir my $dh, '.' or die "opendir .: $!";
    grep /\.txt$/i, readdir $dh;
};

print "Got ", scalar @files, " files\n";

foreach my $file (@files) {
    open my $fh, '<', $file or die "$file: $!";

    my %seen;

    while (<$fh>) {
        if (/^phone\s*(.*)$/i) {
            $seen{$1} = 1;
        }
    }

    foreach my $addr ( sort keys %seen ) {
        print "$addr\n";
    }

    close $fh;
}
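If "phone:" can appear in the middle of a line, as in the sample line quoted in the question, one option is to drop the ^ anchor and capture only the characters that look like part of a number. The following is a rough sketch under stated assumptions, not a definitive parser: extract_phones is a hypothetical helper, and the character class assumes numbers are made of digits, spaces, "+", "-", "(" and ")".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line taken from the question; real files may be formatted differently.
my $line = 'Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, '
         . 'phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd';

# Hypothetical helper: return every number that follows "phone:" in a string.
# \b stops "phone" from matching inside a longer word such as "telephone".
# The capture must end on a digit, so trailing spaces or dots are left out.
sub extract_phones {
    my ($text) = @_;
    my @phones;
    while ( $text =~ /\bphone\s*:\s*(\+?[\d\s()-]*\d)/gi ) {
        push @phones, $1;
    }
    return @phones;
}

print "$_\n" for extract_phones($line);
```

For the sample line this prints 4083458886. The same pattern could replace the anchored regex inside the while loop of the script above, with each captured value stored in %seen.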

6 Comments

Thanks to all so far. How about parsing HTML or RTF? The first script worked well; the second one gave me the following error: Bad symbol for dirhandle at parse2.pl line 10. Thanks, Tony
I had a typo at line 10. Corrected.
Thanks. What about returning multiple strings like phone, address, zip?
Anything is possible. However, you've shared no information about the nature of your data, so it is literally impossible to advise you in more detail. If you want to parse additional fields, you'll just need to program that parsing.
Here's an example of the short files I'm parsing- Agilent Technologies,Inc. Headquarters. Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474 Address: 5301 Stevens Creek Blvd - I think the problem I'm having is that the phone: isn't always at the start of the line. If I modify my files and put it there all works well but I think the script has problems finding it in the middle of a row. Ideas?

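You may want to chomp() each line to remove the newline character "\n" that accompanies each line. Note that $ already matches just before a trailing newline, so chomp mainly keeps the newline out of whatever you process afterwards: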

while (<$FILE>) {
    chomp;
    if (/^phone\s*(.*)\r?$/i) {
        $seen{$1} = 1;
        foreach my $addr ( sort keys %seen ) {
            print "$addr\n";
        }
    }
}

Alternatively, you can add the 's' modifier to your regular expression, which allows ".*" to consume newline characters (note that /s is Perl's single-line mode; the multi-line modifier is /m):

while (<$FILE>) {
    if (/^phone\s*(.*)\r?$/is) {
        $seen{$1} = 1;
        foreach my $addr ( sort keys %seen ) {
            print "$addr\n";
        }
    }
}
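How the trailing newline interacts with these patterns can be checked directly. This is a small sketch with assumed input: the first string is made up, and the second is shortened from the sample line in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "phone: 4083458886\n";    # made-up line, newline still attached

# No chomp, no /s: $ matches just before the trailing newline.
my ($plain) = $line =~ /^phone\s*(.*)\r?$/i;

# With /s: "." also matches "\n", so the capture keeps the newline.
my ($dotall) = $line =~ /^phone\s*(.*)\r?$/is;

# "phone:" in the middle of a line: the ^ anchor prevents any match,
# with or without chomp.
my $mid = "Toll-Free: +1 877-424-4536, phone: 4083458886.Fax: +1 408-345-8474\n";
my $mid_matches = $mid =~ /^phone\s*(.*)\r?$/i ? 1 : 0;

printf "plain:       [%s]\n", $plain;
printf "dotall:      [%s]\n", $dotall;
printf "mid matches: %d\n",   $mid_matches;
```

Because \s* matches nothing before the colon, both captures begin with ": ", which is why an explicit colon in the pattern, as suggested in the comments on the question, gives cleaner results.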
