Perl hash of array how to check if the object exists before adding

Question

I have this code to take rows and place them into %data. One row in DATA (the last row) is a duplicate, so I don’t want it added to %data. How do I check that the app_id and ci_name combination doesn’t already exist before pushing the row into %data? Something like

push .. unless {app_id already exists}

The code to modify:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %data;

while( <DATA> ) {
    chomp;
    next if /app_id/;
    my ($app_id,$ci_name,$app_name) = split /,/;
    push @{$data{$ci_name}}, {app_id => $app_id, app_name => $app_name };
}

print Dumper(\%data);

__DATA__
app_id,ci_name,app_name
1234,hosta7,Managed File Transfer
1235,hosta7,Patrtol
1236,hosta7,RELATIONAL DATA WAREHOUSE
1237,hosta7,Managed File Transfer
1238,hosta7,Initio Application
1239,hosta7,Data Warehouse Operations Infrastructure
2345,hostb,Tableou
2345,hostb,Tableou

ikegami · Accepted Answer · 2019-11-10 12:17:42Z

3

You could temporarily use a HoH instead of a HoA.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;
while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    $data{$ci_name}{$app_id} //= { app_id => $app_id, app_name => $app_name };
}

# Convert HoH to HoA.
$data{$_} = [ values(%{ $data{$_} }) ]
   for keys(%data);

print Dumper(\%data);

The above keeps the first of the duplicates, and it doesn't preserve order. Change //= to = to keep the last of the duplicates. Read on for a solution that preserves order.

The following is a common way of removing duplicates while preserving order:

my %seen;
my @uniq = grep !$seen{$_}++, @values;
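
As a minimal sketch of how the idiom behaves in isolation (the sample values in @values are made up for illustration and are not from the question): the post-increment returns the old count, which is 0 the first time a value is seen, so the negation keeps only first occurrences.

```perl
use strict;
use warnings;

# Hypothetical sample values, not taken from the question.
my @values = qw(b a b c a);

my %seen;
# $seen{$_}++ yields the count *before* incrementing: 0 (false) on first
# sight, so !$seen{$_}++ is true only for the first occurrence of each value.
my @uniq = grep { !$seen{$_}++ } @values;

print "@uniq\n";   # prints: b a c
```

The order of the surviving elements is the order of their first appearance in the input.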

We can adapt that idiom to our needs.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;
my %seen;
while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    push @{ $data{$ci_name} }, { app_id => $app_id, app_name => $app_name }
       if !$seen{$ci_name}{$app_id}++;
}

print Dumper(\%data);

The above keeps the first of the duplicates, and it preserves order.

Both of these solutions run in O(N) time, whereas the previously posted solution is O(N²), so these solutions scale much better. To be honest though, the previously posted solution is effectively O(N) in practice unless there are a lot of duplicates.

Note how I added <DATA> before the loop? It's far better than skipping all lines that contain app_id anywhere in the line!
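
To illustrate why reading the header once is safer than a pattern match, here is a small sketch. It assumes a made-up row whose app_name happens to contain the text app_id, and it uses an in-memory filehandle in place of DATA so that it is self-contained.

```perl
use strict;
use warnings;

# Hypothetical input: the second line is real data that merely contains "app_id".
my $text = <<'END';
app_id,ci_name,app_name
1234,hosta7,app_id mapper
1235,hosta7,Patrol
END

open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $header = <$fh>;   # Read and discard exactly one line: the header.

my @kept;
while (my $line = <$fh>) {
    chomp $line;
    push @kept, $line;
}
close $fh;

print scalar(@kept), " rows kept\n";   # prints: 2 rows kept
```

With next if /app_id/ inside the loop instead, the "app_id mapper" row would be skipped as well and only one row would survive.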

edited Nov 10, 2019 at 12:17

answered Nov 10, 2019 at 5:27


2 Comments

Ravi M Over a year ago

I can't use next if $seen{$app_id}++; since the same app_id may be a valid row for multiple hosts but not for the same host. So for example 2345,hostb,Tableou and 2345,hosta,Tableou

ikegami Over a year ago

That's not what your question said, so I fixed it. You can use $seen{$ci_name}{$app_id}++. See updated answer.
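
A short sketch of the two-level key discussed in these comments, using made-up rows based on the example above: the same app_id is accepted on different hosts but rejected when repeated on the same host.

```perl
use strict;
use warnings;

# Hypothetical rows: app_id 2345 appears on two hosts, and twice on hostb.
my @rows = (
    '2345,hostb,Tableou',
    '2345,hosta,Tableou',
    '2345,hostb,Tableou',
);

my (%data, %seen);
for my $row (@rows) {
    my ($app_id, $ci_name, $app_name) = split /,/, $row;

    # Duplicates are tracked per host, so the key is (ci_name, app_id).
    next if $seen{$ci_name}{$app_id}++;

    push @{ $data{$ci_name} }, { app_id => $app_id, app_name => $app_name };
}

printf "%s: %d\n", $_, scalar @{ $data{$_} } for sort keys %data;
# prints:
# hosta: 1
# hostb: 1
```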

sticky bit · Accepted Answer · 2019-11-10 03:24:22Z

2

You can use grep() with a block that checks whether the app_id equals the one to be inserted.

...
push @{$data{$ci_name}}, {app_id => $app_id, app_name => $app_name }
    unless grep { $_->{'app_id'} == $app_id } @{$data{$ci_name}};
...

answered Nov 10, 2019 at 3:24


2 Comments

zdim Over a year ago

and then List::MoreUtils::any is more efficient (it exits on the first match), which may add up here since the check runs for every single row. Also, you'd need to check whether the $ci_name key exists in the first place

zdim Over a year ago

(if the key doesn't exist and we still try to use it then it is created ... so then it does exist, even though no such data has been seen and there's nothing there. That may -- or may not -- be a problem later)
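
One possible way to combine both points from these comments, as a sketch only: it uses any from the core List::Util module (available in List::Util 1.33 and later) rather than List::MoreUtils, and add_unique is a hypothetical helper name that does not appear in any of the answers. The lookup falls back to an empty array reference, so a host key is created only when a row is actually pushed.

```perl
use strict;
use warnings;
use List::Util qw(any);   # assumes List::Util 1.33+, bundled with Perl 5.20+

my %data;

# Hypothetical helper: returns 1 if the row was added, 0 if it was a duplicate.
sub add_unique {
    my ($data, $app_id, $ci_name, $app_name) = @_;

    # "|| []" reads the key without creating it when the host is new.
    my $rows = $data->{$ci_name} || [];

    # any() stops at the first match instead of scanning the whole list.
    return 0 if any { $_->{app_id} eq $app_id } @$rows;

    push @{ $data->{$ci_name} }, { app_id => $app_id, app_name => $app_name };
    return 1;
}

add_unique(\%data, '2345', 'hostb', 'Tableou');
add_unique(\%data, '2345', 'hostb', 'Tableou');   # duplicate, skipped

print scalar @{ $data{hostb} }, "\n";   # prints: 1
```

This is still a linear scan per insert, so the %seen approach in the other answer remains the better fit for large inputs.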

Helmut Wollmersdorfer · Accepted Answer · 2019-11-10 18:35:46Z

0

If you want to keep all records (adapted from @ikegami):

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

<DATA>;  # Skip header.

my %data;

while (<DATA>) {
    chomp;
    my ($app_id, $ci_name, $app_name) = split /,/;
    push @{ $data{$ci_name}{$app_id} }, { app_id => $app_id, app_name => $app_name };
}

print Dumper(\%data);

But then it would be better to code:

$data{$ci_name}{$app_id}{$app_name}++;
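
As a rough illustration of what that counting structure ends up holding (the rows are a made-up subset of the question's data, stored in an array so the sketch is self-contained): each distinct combination is stored once, and the value records how many times it appeared.

```perl
use strict;
use warnings;

# Hypothetical subset of the question's rows, including the duplicate.
my @rows = (
    '1234,hosta7,Managed File Transfer',
    '2345,hostb,Tableou',
    '2345,hostb,Tableou',
);

my %data;
for my $row (@rows) {
    my ($app_id, $ci_name, $app_name) = split /,/, $row;
    # Nested hashes are autovivified; the leaf value is an occurrence count.
    $data{$ci_name}{$app_id}{$app_name}++;
}

print $data{hostb}{2345}{Tableou}, "\n";   # prints: 2
```

Note that this yields a hash of hashes of hashes rather than the hash of arrays the question asks for.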

answered Nov 10, 2019 at 18:35


1 Comment

zdim Over a year ago

This still keeps the duplicate, no? (It adds the same hashref twice to @{$data{hostb}{2345}} array ?) They specifically don't want them
