Perl - Removing multiple lines from file with multiple regex

Question

I'm (obviously) new to Perl and am trying to create a simple script to clean up a large file of about 4.5 million records on a weekly basis. I want to completely remove the lines that match one of three patterns. The file looks like this:

D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

My script below works, but the substitution leaves a blank line where each deletion occurs rather than removing the line completely.

#!/usr/bin/perl

open( FH, "serial.txt" ) || die "Couldn't open file...\n";

while ( <FH> ) {
   $data .= $_;
}

$data =~ s/[A][F|G][(0-9)]{5}//g;
$data =~ s/[A-Z][0-9][0-9][0-9][0-9]//g;
$data =~ s/[0-9][0-9][0-9][0-9][A-Z]//g;

print $data;
close( FH );

My question is - with 4.5 million records, running this at least once a week, is this an efficient/fast way to accomplish what I want to do, or is there a more efficient way to do it? In addition, how can I remove the lines rather than substituting a blank line?

Thanks all. Stephen

To delete the lines, include \n at the end of your search regexes. As for speed, it will certainly finish within a week, but you'll have to test it and see whether it is satisfactory for you :)
– ndnenkov, Commented Aug 29, 2015 at 13:21
[(0-9)] also matches ( and ). Similarly, [F|G] also matches |.
– TLP, Commented Aug 29, 2015 at 15:16
Have you considered not using Perl at all for this problem? grep -v 'regexp' will do the job better, I think. See the -v option in the grep(1) manual page. grep is good at filtering lines of text; it was developed with that purpose in mind, and it is at least ten years older than Perl.
– Luis Colorado, Commented Aug 31, 2015 at 8:02
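
TLP's point about the character classes can be checked with a small sketch. The test strings 'A|01023' and 'AG(1)23' are made up for illustration and do not come from the asker's data; the "tight" pattern is one possible tidied form, not necessarily the one the asker should adopt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first pattern from the question (anchored here for a fair comparison)
# next to a tidied version without the stray |, ( and ) in the classes.
my $loose = qr/^[A][F|G][(0-9)]{5}$/;
my $tight = qr/^A[FG][0-9]{5}$/;

# Hypothetical inputs: only the first is a plausible serial number.
for my $s ( 'AG01023', 'A|01023', 'AG(1)23' ) {
    printf "%-8s loose=%-5s tight=%s\n", $s,
        ( $s =~ $loose ? 'match' : 'no' ),
        ( $s =~ $tight ? 'match' : 'no' );
}
```

Inside a bracketed class, | and parentheses are ordinary characters, so the loose pattern accepts all three strings while the tight one accepts only the first.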

C. K. Young · Accepted Answer · 2015-08-29 13:51:15Z

3

@ndn's comment is correct. However, personally, rather than reading in the whole file, I'd process it line by line (I took the liberty of tidying up your regexes, too):

#!/usr/bin/perl -p
$_ = '' if /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

or

#!/usr/bin/perl -n
print unless /^A[FG]\d{5}$/ || /^[A-Z]\d{4}$/ || /^\d{4}[A-Z]$/;

(In both cases, specify your input file on the command line. Read the perlrun manual page to see how the -p and -n options work.)
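
For readers unfamiliar with -n: perlrun describes it as wrapping the program in a while (<>) loop, and -p as doing the same while also printing $_ after each iteration. A rough sketch of the same filter written out by hand follows. The helper name keep_line is hypothetical, and the in-memory filehandle stands in for serial.txt so the example runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns true when a line matches none of the three patterns.
sub keep_line {
    my ($line) = @_;
    return $line !~ /^A[FG]\d{5}$/
        && $line !~ /^[A-Z]\d{4}$/
        && $line !~ /^\d{4}[A-Z]$/;
}

# Sample records from the question, exposed through an in-memory
# filehandle (an assumption made so the sketch needs no input file).
my $sample = join '', map { "$_\n" }
    qw(D0832 G2565 ZDS97 FHM2547 JDH1464 R2918 4918K AG01023 AG02997);
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# Roughly what -n does: read one line, run the body, repeat.
my @kept;
while ( my $line = <$fh> ) {
    push @kept, $line if keep_line($line);
}
close $fh;

print @kept;
```

Only one line is held in memory at a time, which is the main practical difference from the slurping script in the question.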

edited Aug 29, 2015 at 13:51

answered Aug 29, 2015 at 13:25

2 Comments

Stephen Dundas Over a year ago

All of these suggestions are great! Major progress, thank you. Only one issue with my regex - this is filtering FHM2547 and JDH1464 which I want to keep. I only want to delete lines that match exactly rather than a portion. Would I use ^ and $?

C. K. Young Over a year ago

@StephenDundas Yep, anchor all the regexes with ^ and $. I'll edit my answer to incorporate.
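
A quick way to see the effect of the anchors on the serial numbers mentioned in the comment above (a minimal sketch; the variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $unanchored = qr/[A-Z]\d{4}/;     # matches anywhere inside the line
my $anchored   = qr/^[A-Z]\d{4}$/;   # must match the whole line

for my $serial (qw(R2918 FHM2547 JDH1464)) {
    my $without = $serial =~ $unanchored ? 'delete' : 'keep';
    my $with    = $serial =~ $anchored   ? 'delete' : 'keep';
    print "$serial: unanchored=$without anchored=$with\n";
}
```

Without anchors, FHM2547 is caught because its tail M2547 satisfies the pattern; with ^ and $ the pattern has to cover the entire line.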

brian d foy · Accepted Answer · 2015-08-29 16:15:42Z

At first pass, I'd make a list of pre-compiled patterns to test against each line. The problem is likely to change and I want to add and delete patterns without disturbing the meat of the code:

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if grep { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

The big improvement isn't the patterns though. It's checking things one line at a time and printing the lines I want to keep. I don't have the entire file in memory at the same time; it's only a line at a time.

There's a problem with this though. It works, but it checks every pattern every time. That might not mean much if very few lines will ever match or there are only a few patterns. If you think it might matter, using first from List::Util instead of grep can help since it only needs to find one match and stops when it finds it:

use List::Util qw(first);

my @patterns = ( 
    qr/\A [A] [FG]  [0-9]{5} \Z/x,
    qr/\A [A-Z]     [0-9]{4} \Z/x,
    qr/\A [0-9]{4}  [A-Z]    \Z/x,
    );

while( my $line = <DATA> ) {
    next if first { $line =~ $_ } @patterns;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997

Or, I might make one giant pattern. Regexp::Assemble can put them together (but so can you if you watch out for the alternation precedence):

use v5.10;

use Regexp::Assemble;

my @patterns = ( 
    '[A][FG][0-9]{5}',
    '[A-Z][0-9]{4}',
    '[0-9]{4}[A-Z]',
    );

my $grand_pattern = do {
    my $ra = Regexp::Assemble->new;
    $ra->add( $_ ) for @patterns;
    my $re = $ra->re;
    qr/ \A (?: $re ) \Z /x;
    };

say "Grand regex is $grand_pattern";

while( my $line = <DATA> ) {
    next if $line =~ $grand_pattern;

    print $line;
    }

__END__
D0832
G2565
ZDS97
FHM2547
JDH1464
R2918
4918K
AG01023
AG02997
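
The remark about alternation precedence can be illustrated with a hand-built version of the combined pattern. This is only a sketch of the manual approach, assuming the same three pattern strings; it does not reproduce what Regexp::Assemble generates.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @patterns = ( 'A[FG][0-9]{5}', '[A-Z][0-9]{4}', '[0-9]{4}[A-Z]' );
my $alternation = join '|', @patterns;

# Without grouping, \A applies only to the first alternative and \Z only
# to the last, so the middle alternative can match anywhere in the line.
my $ungrouped = qr/\A$alternation\Z/;

# Wrapping the alternation in (?: ... ) applies both anchors to every branch.
my $grouped = qr/\A(?:$alternation)\Z/;

for my $serial (qw(R2918 FHM2547)) {
    printf "%-8s ungrouped=%-5s grouped=%s\n", $serial,
        ( $serial =~ $ungrouped ? 'match' : 'no' ),
        ( $serial =~ $grouped   ? 'match' : 'no' );
}
```

FHM2547 slips through the ungrouped version because the unanchored middle branch matches M2547.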

The next step would be to take the patterns from the command line or a configuration file, but that's not so hard. The program shouldn't know the patterns at all. You'll have a much easier time changing the patterns if you don't have to change the code.
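
One possible shape for that next step, as a sketch only: the file format (one pattern per line, with blank lines and # comments skipped), the helper name load_patterns, and the file name patterns.txt are all assumptions, not part of the answer above. An in-memory string stands in for the configuration file so the example is runnable as-is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read one pattern per line from a filehandle and compile each with
# whole-line anchors. Blank lines and lines starting with '#' are skipped.
sub load_patterns {
    my ($fh) = @_;
    my @patterns;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;
        push @patterns, qr/\A(?:$line)\Z/;
    }
    return @patterns;
}

# In real use this handle would come from a file such as a hypothetical
# patterns.txt; a string is used here to keep the sketch self-contained.
my $config = "# serial patterns\nA[FG][0-9]{5}\n[A-Z][0-9]{4}\n\n[0-9]{4}[A-Z]\n";
open my $cfg, '<', \$config or die "Cannot open in-memory config: $!";
my @patterns = load_patterns($cfg);
close $cfg;

for my $serial (qw(D0832 ZDS97 4918K)) {
    my $drop = grep { $serial =~ $_ } @patterns;
    print "$serial\n" unless $drop;
}
```

Because the patterns are compiled straight from the file, a malformed line would make qr// die at load time, which is probably the behavior you want for a weekly batch job, but that is a design choice rather than a requirement.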

Borodin · Accepted Answer · 2015-08-29 15:42:09Z

0

There's no need for multiple regex patterns. This will do what you need:

perl -ne'print unless /^(?:[A][FG]\d{5}|[A-Z]\d{4}|\d{4}[A-Z])$/' serial.txt

output

ZDS97
FHM2547
JDH1464

answered Aug 29, 2015 at 15:42

ssr1012 · Accepted Answer · 2015-08-31 12:45:00Z

0

 $data =~ s/[A-Z][0-9][0-9][0-9][0-9][\s\r\n]*//g;
 $data =~ s/[0-9][0-9][0-9][0-9][A-Z][\s\r\n]*//g;

From the question:

"how can I remove the lines rather than substituting a blank line?"

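Each matching record is followed by a line break. The original substitutions remove the record but leave that line break behind, which is why an empty line appears. Appending [\s\r\n]* to each regex makes the substitution consume the trailing line break as well, so no empty line is left.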
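
If you do keep the whole file in a single string, a variant that combines the line-break idea with the anchoring discussed in the other answers might look like the sketch below. It assumes Unix line endings and a final newline on the last record; the sample data is the list from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records from the question, joined into one string as the
# original script does after reading the file.
my $data = join '', map { "$_\n" }
    qw(D0832 G2565 ZDS97 FHM2547 JDH1464 R2918 4918K AG01023 AG02997);

# /m lets ^ match at the start of every line. The literal \n at the end
# removes the line break together with the record, so no blank line remains.
$data =~ s/^(?:A[FG][0-9]{5}|[A-Z][0-9]{4}|[0-9]{4}[A-Z])\n//mg;

print $data;
```

Matching a single \n rather than \s* avoids swallowing unrelated blank lines or leading whitespace on the following record.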

edited Aug 31, 2015 at 12:45

answered Aug 31, 2015 at 11:44

1 Comment

Daniel Cheung Over a year ago

Please add explanation
