1

I'm (obviously) new to Perl and am trying to create a simple script to clean up a large file on about 4.5 million records on a weekly basis. I want to completely remove the lines that match one of three patterns. The file looks like this:

D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

My script below works, but I get a blank line where a deletion occurs (substitution) rather than removing the line completely.

#!/usr/bin/perl

open( FH, "serial.txt" ) || die "Couldn't open file...\n";

while ( <FH> ) {
   $data .= $_;
}

$data =~ s/[A][F|G][(0-9)]{5}//g;
$data =~ s/[A-Z][0-9][0-9][0-9][0-9]//g;
$data =~ s/[0-9][0-9][0-9][0-9][A-Z]//g;

print $data;
close( FH );

My question is - with 4.5 million records, running this at least once a week, is this an efficient/fast way to accomplish what I want to do, or is there a more efficient way to do it? In addition, how can I remove the lines rather than substituting a blank line?

Thanks all. Stephen

3
  • 2
    About how to delete the lines - include \n at the end of your find regexes. About is it fast enough - it would definitely pass in a week timeframe, but you have to test and see if it satisfactory yourself :) Commented Aug 29, 2015 at 13:21
  • [(0-9)] also matches ( and ). Similarly, [F|G] also matches |. Commented Aug 29, 2015 at 15:16
  • Have you considered not using perl at all for this problem. grep -v 'regexp' will do the work better I think. See option -v in manual page of grep(1) utility. Grep is good on filtering lines of text. It has been developed with that target in mind. And is at least ten years older than perl. Commented Aug 31, 2015 at 8:02

4 Answers 4

3

@ndn's comment is correct. However, personally, rather than reading in the whole file, I'd process it line by line (I took the liberty to tidy up your regexes, too):

#!/usr/bin/perl -p
$_ = '' if /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

or

#!/usr/bin/perl -n
print unless /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

(In both cases, specify your input file on the command line. Read up the perlrun manual page on how the -p and -n options work.)

Sign up to request clarification or add additional context in comments.

2 Comments

All of these suggestions are great! Major progress, thank you. Only one issue with my regex - this is filtering FHM2547 and JDH1464 which I want to keep. I only want to delete lines that match exactly rather than a portion. Would I use ^ and $?
@StephenDundas Yep, anchor all the regexes with ^ and $. I'll edit my answer to incorporate.
3

At first pass, I'd make a list of pre-compiled patterns to test against each line. The problem is likely to change and I want to add and delete patterns without disturbing the meat of the code:

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if grep { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

The big improvement isn't the patterns though. It's checking things one line at a time and printing the lines I want to keep. I don't have the entire file in memory at the same time; it's only a line at a time.

There's a problem with this though. It works, but it checks every pattern every time. That might not mean much if very few lines will ever match or there are only a few patterns. If you think it might matter, using first from List::Util instead of grep can help since it only needs to find one match and stops when it finds it:

use List::Util qw(first);

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if first { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

Or, I might make one giant pattern. Regexp::Assemble can put them together (but so can you if you watch out for the alternation precedence):

use v5.10;

use Regexp::Assemble;

my @patterns = ( 
    '[A][FG][0-9]{5}',
    '[A-Z][0-9]{4}',
    '[0-9]{4}[A-Z]',
    );

my $grand_pattern = do {
    my $ra = Regexp::Assemble->new;
    $ra->add( $_ ) for @patterns;
    my $re = $ra->re;
    qr/ \A (?: $re ) \Z /x;
    };

say "Grand regex is $grand_pattern";

while( my $line = <DATA> ) {
    next if $line =~ $grand_pattern;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

The next step would be to take the patterns from the command line or a configuration file, but that's not so hard. The program shouldn't know the patterns at all. You'll have a much easier time changing the patterns if you don't have to change the code.

Comments

0

There's no need for multiple regex patterns. This will do what you need

perl -ne'print unless /^(?:[A][FG]\d{5}|[A-Z]\d{4}|\d{4}[A-Z])$/' serial.txt

output

ZDS97
FHM2547
JDH1464

Comments

0
 $data =~ s/[A-Z][0-9][0-9][0-9][0-9][\s\r\n]*//g;
 $data =~ s/[0-9][0-9][0-9][0-9][A-Z][\s\r\n]*//g;

From the question:

"how can I remove the lines rather than substituting a blank line?"

End of the each regex which we can have a linebreak/returns. And then regex will replacing the empty line. Hence I have added the [\s\r\n]* syntax and it will not replace the empty line.

1 Comment

Please add explanation

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.