
I have this code that reads rows and stores them in %data. One row in DATA (the last row) is a duplicate, so I don’t want it added to %data. How do I check that the app_id and ci_name combination doesn’t already exist before pushing the row into %data? Something like

push .. unless {app_id already exists}

The code to modify:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %data;

while( <DATA> ) {
    chomp;
    next if /app_id/;
    my ($app_id,$ci_name,$app_name) = split /,/;
    push @{$data{$ci_name}}, {app_id => $app_id, app_name => $app_name };
}

print Dumper(\%data);

__DATA__
app_id,ci_name,app_name
1234,hosta7,Managed File Transfer
1235,hosta7,Patrtol
1236,hosta7,RELATIONAL DATA WAREHOUSE
1237,hosta7,Managed File Transfer
1238,hosta7,Initio Application
1239,hosta7,Data Warehouse Operations Infrastructure
2345,hostb,Tableou
2345,hostb,Tableou

3 Answers

3

You could temporarily use a HoH instead of a HoA.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;
while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    $data{$ci_name}{$app_id} //= { app_id => $app_id, app_name => $app_name };
}

# Convert HoH to HoA.
$data{$_} = [ values(%{ $data{$_} }) ]
   for keys(%data);

print Dumper(\%data);

The above keeps the first of the duplicates, and it doesn't preserve order. Change //= to = to keep the last of the duplicates. Read on for a solution that preserves order.
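
If a predictable order is enough (rather than the original input order), the conversion step could sort the keys before flattening. The following is only a sketch under the assumption that app_id values are numeric; hoh_to_sorted_hoa is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# Flatten a hash-of-hashes (ci_name => app_id => record) into a
# hash-of-arrays, ordering each host's records by numeric app_id.
# Assumption: app_id values are numeric.
sub hoh_to_sorted_hoa {
    my ($hoh) = @_;
    my %hoa;
    for my $ci_name (keys %$hoh) {
        my $by_id = $hoh->{$ci_name};
        $hoa{$ci_name} = [ map { $by_id->{$_} } sort { $a <=> $b } keys %$by_id ];
    }
    return \%hoa;
}

my %by_host = (
    hostb => {
        2345 => { app_id => 2345, app_name => 'Tableou' },
    },
    hosta7 => {
        1235 => { app_id => 1235, app_name => 'Patrtol' },
        1234 => { app_id => 1234, app_name => 'Managed File Transfer' },
    },
);

my $hoa = hoh_to_sorted_hoa(\%by_host);
print join(',', map { $_->{app_id} } @{ $hoa->{hosta7} }), "\n";   # 1234,1235
```

This gives a stable order, but not necessarily the order in which the rows were read.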


The following is a common way of removing duplicates while preserving order:

my %seen;
my @uniq = grep !$seen{$_}++, @values;
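
As a quick illustration of how that idiom behaves on its own, here is a small standalone sketch. The sub name uniq_in_order and the sample values are made up for the example.

```perl
use strict;
use warnings;

# Return the input values with duplicates removed, keeping the first
# occurrence of each value and the original order.
sub uniq_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @uniq = uniq_in_order(qw( b a b c a ));
print "@uniq\n";   # b a c
```

The post-increment returns the old count, so the test is true only the first time a value is seen.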

We can adapt that idiom to our needs.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;
my %seen;
while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    push @{ $data{$ci_name} }, { app_id => $app_id, app_name => $app_name }
       if !$seen{$ci_name}{$app_id}++;
}

print Dumper(\%data);

The above keeps the first of the duplicates, and it preserves order.


Both of these solutions run in O(N) time, whereas the previously posted solution (the one using grep) is O(N²) in the worst case, so these solutions scale much better. To be honest though, the previously posted solution is effectively O(N) in practice unless a single host has a lot of rows, since each check only scans the entries already stored for that host.


Note how I added <DATA> before the loop? It's far better than skipping all lines that contain app_id anywhere in the line!
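
To make the difference concrete, here is a small sketch that compares the two ways of dropping the header. The third row is a hypothetical one, invented only to show a data line that happens to contain the text app_id.

```perl
use strict;
use warnings;

my @rows = (
    'app_id,ci_name,app_name',
    '1234,hosta7,Managed File Transfer',
    '1240,hosta7,Legacy app_id Mapper',   # hypothetical row, not in the question
);

# Approach from the question: drop every line that mentions app_id.
my @by_regex = grep { !/app_id/ } @rows;

# Approach from this answer: drop only the first line.
my @by_position = @rows[ 1 .. $#rows ];

print scalar(@by_regex), "\n";      # 1
print scalar(@by_position), "\n";   # 2
```

The regex version silently loses the hypothetical row, while skipping by position keeps it.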


2 Comments

I can't use next if $seen{$app_id}++; because the same app_id may appear on multiple hosts, just not twice on the same host. For example, 2345,hostb,Tableou and 2345,hosta,Tableou are both valid rows.
That's not what your question said, so I fixed it. You can use $seen{$ci_name}{$app_id}++. See updated answer.
2

You can use grep() with a block that checks whether an existing entry's app_id equals the one to be inserted.

...
push @{$data{$ci_name}}, {app_id => $app_id, app_name => $app_name }
    unless grep { $_->{'app_id'} == $app_id } @{$data{$ci_name}};
...

2 Comments

and then List::MoreUtils::any is more efficient (it exits on the first match), which may add up here since the check runs for every single row. Also, you'd need to check whether the $ci_name key exists in the first place
(if the key doesn't exist and we still try to use it then it is created ... so then it does exist, even though no such data has been seen and there's nothing there. That may -- or may not -- be a problem later)
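
The two points raised in these comments could be combined roughly as follows. This is only a sketch: it uses any from the core List::Util module (available there since version 1.33) rather than List::MoreUtils, it compares with eq instead of == so non-numeric IDs would also work, and add_unless_seen is a hypothetical helper name.

```perl
use strict;
use warnings;
use List::Util qw( any );

# Push a record for $ci_name unless that host already has the same app_id.
# Returns 1 if the record was added, 0 if it was a duplicate.
sub add_unless_seen {
    my ($data, $ci_name, $app_id, $app_name) = @_;

    # "// []" supplies an empty list for an unknown host, so the check
    # itself does not create $data->{$ci_name}. any() stops at the
    # first match instead of scanning the whole list.
    return 0 if any { $_->{app_id} eq $app_id } @{ $data->{$ci_name} // [] };

    push @{ $data->{$ci_name} }, { app_id => $app_id, app_name => $app_name };
    return 1;
}

my %data;
add_unless_seen(\%data, 'hostb', 2345, 'Tableou');
add_unless_seen(\%data, 'hostb', 2345, 'Tableou');   # duplicate, skipped
add_unless_seen(\%data, 'hosta', 2345, 'Tableou');   # same id, other host
print scalar(@{ $data{hostb} }), "\n";   # 1
```

This is still a linear scan per row, so the %seen approach from the other answer remains the better fit for large inputs.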
0

If you want to keep all records (adapted from @ikegami):

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;

while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    push @{ $data{$ci_name}{$app_id} }, { app_id => $app_id, app_name => $app_name };
}

print Dumper(\%data);

But then it would be better to write:

$data{$ci_name}{$app_id}{$app_name}++;
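
A rough sketch of what that counting structure would hold, using rows from the question. The count_rows helper is a hypothetical name introduced for the example.

```perl
use strict;
use warnings;

# Count how many times each (ci_name, app_id, app_name) triple occurs.
sub count_rows {
    my %count;
    for my $line (@_) {
        my ($app_id, $ci_name, $app_name) = split /,/, $line;
        $count{$ci_name}{$app_id}{$app_name}++;
    }
    return \%count;
}

my $count = count_rows(
    '2345,hostb,Tableou',
    '2345,hostb,Tableou',
    '1234,hosta7,Managed File Transfer',
);
print $count->{hostb}{2345}{Tableou}, "\n";   # 2
```

Each distinct triple is stored once, and the value records how often it appeared in the input.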

1 Comment

This still keeps the duplicate, no? (It adds the same hashref twice to @{$data{hostb}{2345}} array ?) They specifically don't want them
