Remove duplicates using regex

Question

Input:

OUT :abc123: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT :abc123 : : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT bcd111: : Succeeded.

I want to list only the hosts that have a matching "Warning" line.

Output:

abc123 
abc1234
bcd111

I tried the regex below, but it matches every occurrence, duplicates included.

([\w]+)\s+:\s+:\s+Warning

Is it possible to avoid duplicates using regex?

Probably better to iterate over the lines and populate a hash. (arco444, commented Oct 13, 2014 at 12:20)

choroba · Accepted Answer · 2014-10-13 12:21:46Z

3

When you hear "unique" in Perl, think "hash":

#!/usr/bin/perl
use warnings;
use strict;

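my %uniq;    # keys are the host names seen so far
while (<>) {
    # Capture the host name that precedes "Warning"; a hash key can exist only once.
    /:?(\S+?)[:\s]+Warning/ and $uniq{$1} = 1;
}

# Hash keys come back in no particular order.
print "$_\n" for keys %uniq;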

BTW, your input and regex don't lead to the output you indicated. I changed the regex, but I'm not sure your input sample is correct. Is the placement of colons really so wild?
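Since keys returns the hosts in no particular order, here is a minimal sketch of the same hash idea that keeps the hosts in first-seen order. The helper name warning_hosts is hypothetical, and the sample lines are abridged from the question's input.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: returns each host that has a Warning line,
# once, in the order the hosts first appear.
sub warning_hosts {
    my (%seen, @hosts);
    for my $line (@_) {
        # Same regex as in the answer; $1 holds the host name.
        next unless $line =~ /:?(\S+?)[:\s]+Warning/;
        # $seen{$1}++ is 0 (false) only the first time a host shows up.
        push @hosts, $1 unless $seen{$1}++;
    }
    return @hosts;
}

# Sample lines abridged from the question's input.
my @log = (
    'OUT :abc123: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be',
    'OUT :abc123 : : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT bcd111: : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be',
    'OUT bcd111: : Succeeded.',
);

print "$_\n" for warning_hosts(@log);
```

With these sample lines it should print abc123, abc1234 and bcd111, one per line and in that order.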


vks · Accepted Answer · 2014-10-13 12:37:21Z

1

OUT\s*:?([^:]*):(?=.*?\bWarning\b)(?:(?!OUT).)*(?!.*?\1[:\s]*Warning)

You can try this. See the demo and grab the capture group.

http://regex101.com/r/sK8oK9/12


anubhava · Accepted Answer · 2014-10-13 12:29:30Z

0

You can use this perl one-liner:

perl -lane 'if (/\bWarning\b/) { $F[1] =~ s/\W+//g; print $F[1] }' file
abc123
abc123
abc1234
abc1234
abc1234
bcd111
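As the output shows, the one-liner prints one host per Warning line, so the duplicates remain. Below is a rough sketch of the same field-splitting idea with a %seen filter added. The helper name unique_second_fields is hypothetical, and it assumes the host is always the second whitespace-separated field, as in the sample input.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: takes the second whitespace-separated field of every
# Warning line, strips non-word characters, and keeps only first occurrences.
sub unique_second_fields {
    my (%seen, @hosts);
    for my $line (@_) {
        next unless $line =~ /\bWarning\b/;
        my @F = split ' ', $line;          # roughly what the -a switch does
        next unless defined $F[1];
        (my $host = $F[1]) =~ s/\W+//g;    # drop the surrounding colons
        push @hosts, $host unless $seen{$host}++;
    }
    return @hosts;
}

# A few lines from the question's input.
my @log = (
    'OUT :abc123: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)',
    'OUT :abc123 : : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)',
    'OUT abc1234: : Warning: / filesystem 100% full',
    'OUT bcd111: : Succeeded.',
);

print "$_\n" for unique_second_fields(@log);
```

The filter `unless $seen{$host}++` is what removes the repeats: the post-increment returns 0 the first time a host is seen and a true value afterwards.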


alpha bravo · Accepted Answer · 2014-10-13 14:58:48Z

0

Use this pattern with the g and s flags:

OUT\s*:?([^:]+):\s*:\s*Warning(?!.*?\1\s*:\s*:\s*Warning)
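To try the pattern from Perl, a sketch along these lines should work. It slurps the input into one string, because the negative lookahead has to see the later lines. The helper name last_warning_hosts and the whitespace trimming are my additions, not part of the answer. Note that the pattern keeps the last Warning line for each host, and that the lookahead rescans the rest of the text for every candidate, so it scales worse than the hash approach.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Pattern from the answer; /s lets "." cross newlines inside the lookahead.
my $re = qr/OUT\s*:?([^:]+):\s*:\s*Warning(?!.*?\1\s*:\s*:\s*Warning)/s;

# Hypothetical helper: returns the captured host of every match.
sub last_warning_hosts {
    my ($text) = @_;
    my @hosts = $text =~ /$re/g;    # list context with /g returns all captures
    s/\s+\z// for @hosts;           # a capture may end in a space (e.g. "abc123 ")
    return @hosts;
}

# The question's input as one string.
my $log = <<'END';
OUT :abc123: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT :abc123 : : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/prodperim/installer/abc123.fw is older than it should be (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/prodperim/installer/abc123.fw.schedule is older than it should be (not updated for 36 hours)
OUT bcd111: : Succeeded.
END

print "$_\n" for last_warning_hosts($log);
```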


Community · Accepted Answer · 2017-05-23 12:11:42Z

This is more of a supplement/complement to @choroba's response above since he nailed it with "when you hear 'unique' think 'hash'". You should accept @choroba's answer :-)

Here I simplified the regex part of your question into a call to grep in order to focus on uniqueness, changed the data in your file a bit (so it could fit here) and saved it as dups.log:

# dups.log 
OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Succeeded.

This one-liner gives the output below (the order of the lines may vary, since hash keys are unordered):

perl -E '++$seen{$_} for grep{/Warning/} <>; print keys %seen' dups.log

OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)

This is pretty much the same output you'd get with uniq dups.log | grep Warning (uniq only collapses adjacent duplicates, which is all this file contains). It works because perl turns each line it reads into a hash key: the key is created the first time the line is seen, and its value is incremented (with ++$seen{$_}) each time the line shows up again. For perl, "same key" here means a duplicate line. Try printing values %seen, or using -MDDP and p %seen, to get a sense of what is going on.

To get your output, use @choroba's regex to add the capture (instead of the whole line) to the hash:

perl -nE '/:?(\S+?)[:\s]+Warning/ && ++$seen{$1} }{ say for keys %seen' dups.log

Just as with the whole-line method above, each captured host becomes a single hash key whose value is incremented with ++ on every further match, so in the end you get "unique" keys à la uniq in the %seen hash.

It's a neat perl trick you never forget :-)
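To see what the values in %seen hold, here is a small sketch that counts the Warning lines per host for the dups.log data above. The helper name warning_counts is hypothetical, and the regex is the one from @choroba's answer.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: maps each host to its number of Warning lines.
sub warning_counts {
    my %seen;
    for my $line (@_) {
        # One key per host; the value counts how often the host matched.
        ++$seen{$1} if $line =~ /:?(\S+?)[:\s]+Warning/;
    }
    return %seen;
}

# The dups.log lines from above.
my @log = split /\n/, <<'END';
OUT :abc123: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT :abc123: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: / filesystem 100% full
OUT abc1234: : Warning: /var/tmp/abc123.fw old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Warning: /var/tmp/abc123.fw.sched old (not updated for 36 hours)
OUT bcd111: : Succeeded.
END

my %count = warning_counts(@log);
printf "%-8s %d\n", $_, $count{$_} for sort keys %count;
```

For this data the counts should come out as 2 for abc123, 3 for abc1234 and 3 for bcd111. The keys alone are the "unique" list.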

References:

The SO question has some good explanations of the perl idiom for uniq using a hash as per @choroba.
This is touched on in perlfaq4 which describes the %seen{} hash trick.
Perlmaven shows how to make your own "home made" uniq using this approach.
...
