How to remove duplicate values from hash of arrays with 2 references

Question

I have a hash of hashes of arrays. The keys to the hashes are $duration and $attr. I want to sort by duration in descending order ($b <=> $a) and remove only those duplicate values that share the same duration. In the snippet these should be the streams:

'h264/AVC, 1080p24 /1.001 (16:9)' & 'AC3, English, multi-channel, 48kHz' with duration '26', but not the duplicate values with $duration '2124' & '115'.

There are countless examples for removing duplicates, and I've tried everything I could find to adapt them to my needs, but with no success. What should my approach be? Thanks.

my ( %recordings_by_dur_attr ) = ();

push( @{ $recordings_by_dur_attr{ $duration }{ $attr } }, @stream );

print Data::Dumper->Dump( [\%recordings_by_dur_attr] );

Result:

$VAR1 = {
      '2124' => {
                  '00300.mpls, 00-35-24' => [
                                              '',
                                              'h264/AVC, 480i60 /1.001 (16:9)',
                                              'AC3, English, stereo, 48kHz'
                                            ]
                },
      '50' => {
                '00021.mpls, 00-00-50' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '6528' => {
                  '00800.mpls, 01-48-48' => [
                                              '',
                                              'Chapters, 18 chapters',
                                              'h264/AVC, 1080p24 /1.001 (16:9)',
                                              'DTS, Japanese, stereo, 48kHz',
                                              'DTS Master Audio, English, stereo, 48kHz',
                                              'DTS, French, stereo, 48kHz',
                                              'DTS, Italian, stereo, 48kHz',
                                              'DTS, German, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Portuguese, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Russian, stereo, 48kHz'
                                            ]
                },
      '26' => {
                '01103.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ],
                '01102.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ],
                '00011.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '115' => {
                 '00304.mpls, 00-01-55' => [
                                             '',
                                             'h264/AVC, 480i60 /1.001 (16:9)',
                                             'AC3, English, stereo, 48kHz'
                                           ]
               }
    };

Duplicate structure

 '',
'h264/AVC, 1080p24 /1.001 (16:9)',
'AC3, English, multi-channel, 48kHz'

Wanted result with removed duplicate structure:

$VAR1 = {
      '2124' => {
                  '00300.mpls, 00-35-24' => [
                                              '',
                                              'h264/AVC, 480i60 /1.001 (16:9)',
                                              'AC3, English, stereo, 48kHz'
                                            ]
                },
      '50' => {
                '00021.mpls, 00-00-50' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '6528' => {
                  '00800.mpls, 01-48-48' => [
                                              '',
                                              'Chapters, 18 chapters',
                                              'h264/AVC, 1080p24 /1.001 (16:9)',
                                              'DTS, Japanese, stereo, 48kHz',
                                              'DTS Master Audio, English, stereo, 48kHz',
                                              'DTS, French, stereo, 48kHz',
                                              'DTS, Italian, stereo, 48kHz',
                                              'DTS, German, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Portuguese, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Russian, stereo, 48kHz'
                                            ]
                },
      '26' => {
                '00011.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '115' => {
                 '00304.mpls, 00-01-55' => [
                                             '',
                                             'h264/AVC, 480i60 /1.001 (16:9)',
                                             'AC3, English, stereo, 48kHz'
                                           ]
               }
    };

Post processing

for my $duration ( sort { $b <=> $a } keys %recordings_by_dur_attr ) {
    for my $attr ( keys %{ $recordings_by_dur_attr{ $duration } } ) {

        # Remove duplicate structures

        my @stream = @{ $recordings_by_dur_attr{ $duration }{ $attr } };
        my ( $mpls, $hms ) = ( $attr =~ /(\d+\.mpls), (\d+-\d+-\d+)$/ );
        for ( my $i = 1; $i < @stream; $i++ ) {

            # Extract info from each stream

        }
    }
}

After I sort and remove the duplicates, I want to count and further process each stream (e.g. get the language) from each playlist and extract the mpls number and timestamp from $attr.
– theuserid01, Commented Mar 10, 2013 at 19:20
I apologize. Obviously I misunderstood your request. I posted what should be considered a duplicate structure: multiple streams. I'd like to compare those multiple streams across playlists and remove the duplicate ones that have equal $duration.
– theuserid01, Commented Mar 10, 2013 at 20:30
I want to remove those duplicate structures because they are identical playlists and in that sense they are redundant. My final goal is to automate a process for extracting the codec streams, which would be time-consuming if many such streams are present. In some scenarios I'm talking about 20-30 playlists.
– theuserid01, Commented Mar 10, 2013 at 21:11
btw, $hashref would be a very poor name for a variable that actually contains a hash reference, but to name a hash %hashref is just plain bad.
– ikegami, Commented Mar 10, 2013 at 21:48

ikegami · Accepted Answer · 2013-03-10 23:50:47Z

1

The expression $seen{$candidate}++ is useful for finding duplicates. When it returns true, $candidate has previously been seen. It is most often used as follows:

my @uniq = grep !$seen{$_}++, @list;
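
As a minimal, self-contained sketch of that idiom (the sample list and the names @dups and %seen_again are made up for illustration), the following shows both the usual form and the inverted form that keeps only the repeats:

```perl
use strict;
use warnings;

my @list = qw( a b a c b a );

# Keep the first occurrence of each value. The post-increment returns the
# old count, so the negation is true only the first time a value is seen.
my %seen;
my @uniq = grep { !$seen{$_}++ } @list;

# Inverted condition: keep every occurrence after the first.
my %seen_again;
my @dups = grep { $seen_again{$_}++ } @list;

print "uniq: @uniq\n";    # uniq: a b c
print "dups: @dups\n";    # dups: a b a
```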

Instead of building a list of keys of elements to keep, I inverted the condition to build a list of keys of elements to delete.

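# Build a string that is identical for identical arrays
sub id { pack 'N/(N/a*)', @{ $_[0] } }

for my $recordings_by_attr (values(%recordings_by_dur_attr)) {
    my %seen;
    # Delete every key whose array has already been seen under this duration
    delete @{$recordings_by_attr}{
        grep $seen{id($recordings_by_attr->{$_})}++,
            sort
                keys %$recordings_by_attr
    };
}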

The sort decides which of the duplicates to remove. If you don't care which, you can remove the sort.
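
For a quick check, here is a runnable sketch that applies the same loop to a reduced copy of the question's data. The @hd and @sd arrays and the trimmed hash are assumptions made for illustration; only id() and the delete loop come from the answer.

```perl
use strict;
use warnings;

# Serialize an array ref into a string that is identical for identical arrays.
sub id { pack 'N/(N/a*)', @{ $_[0] } }

# Illustrative stream lists, copied from the question's dump.
my @hd = ( '', 'h264/AVC, 1080p24 /1.001 (16:9)', 'AC3, English, multi-channel, 48kHz' );
my @sd = ( '', 'h264/AVC, 480i60 /1.001 (16:9)',  'AC3, English, stereo, 48kHz' );

my %recordings_by_dur_attr = (
    26 => {
        '01103.mpls, 00-00-26' => [@hd],
        '01102.mpls, 00-00-26' => [@hd],
        '00011.mpls, 00-00-26' => [@hd],
    },
    50   => { '00021.mpls, 00-00-50' => [@hd] },
    115  => { '00304.mpls, 00-01-55' => [@sd] },
    2124 => { '00300.mpls, 00-35-24' => [@sd] },
);

# %seen is reset per duration, so identical playlists are only removed
# when they share a duration.
for my $recordings_by_attr ( values %recordings_by_dur_attr ) {
    my %seen;
    delete @{$recordings_by_attr}{
        grep $seen{ id( $recordings_by_attr->{$_} ) }++,
            sort keys %$recordings_by_attr
    };
}

for my $duration ( sort { $b <=> $a } keys %recordings_by_dur_attr ) {
    print "$duration: $_\n" for sort keys %{ $recordings_by_dur_attr{$duration} };
}
```

With the sort in place, '00011.mpls, 00-00-26' is the entry that survives under duration 26, and the entry under duration 50 is kept even though its streams match, because it has a different duration.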

edited Mar 10, 2013 at 23:50

answered Mar 10, 2013 at 21:35


3 Comments

theuserid01 Over a year ago

Many, many thanks ikegami. I changed the bad naming to match your suggestion, and obviously this is the solution, but I keep getting an error: Can't use string $attr as an ARRAY ref while "strict refs" in use at sub id { pack 'N/(N/a*)', @$_ }. Am I missing something?

theuserid01 Over a year ago

Yessss!!! You made my day. Unlimited thanks. The missing piece in my puzzle was sub id { pack 'N/(N/a*)', @{ $_[0] } }, which is the key for the solution in my case, because I've already tried the grep !$seen{$_}++ many times. Thanks again.

ikegami Over a year ago

You need a function (in the mathematical sense) F where F(A) eq F(B) for all duplicates A,B; and where F(A) ne F(B) for all differing A,B. It doesn't matter what the value is, as long as it meets the above definition. grep $seen{ join "\n", @$_ }++, would do if the values in the arrays can't contain newlines, but I went with a solution that handles every string.
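
To illustrate the difference between the two candidate functions, here is a small sketch. The names id_join and id_pack and the sample arrays are hypothetical, chosen only to show a case where the newline-joined key collides and the packed key does not.

```perl
use strict;
use warnings;

# Two candidate key functions for an array ref.
sub id_join { join "\n", @{ $_[0] } }
sub id_pack { pack 'N/(N/a*)', @{ $_[0] } }

my @a = ("x\ny");       # one element that contains a newline
my @b = ( "x", "y" );   # two separate elements

# join produces "x\ny" for both arrays, so they look like duplicates.
print "join collides\n" if id_join( \@a ) eq id_join( \@b );

# pack stores the element count and each element's length, so the keys differ.
print "pack differs\n" if id_pack( \@a ) ne id_pack( \@b );
```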

Krishnachandra Sharma · Answer · 2013-03-10 19:34:59Z

0

Steps:

    1. Traverse the hash.
    2. If ref($hash->{$key}) eq "ARRAY"
       then
       1. my @temp = uniq(@{ $hash->{$key} });
       2. $hash->{$key} = \@temp;
    3. Else, if ref($hash->{$key}) eq "HASH"
       1. Traverse that nested hash the same way.
    4. Else
       1. next;
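
The steps above can be sketched roughly as follows. The function name dedupe_arrays and the sample data are hypothetical, and the sketch uses the %seen idiom in place of an imported uniq. This removes repeated values inside each array; it does not compare one playlist's array against another's.

```perl
use strict;
use warnings;

# Hypothetical helper: walks a nested hash and removes repeated values
# inside each array it finds (the first occurrence wins).
sub dedupe_arrays {
    my ($hash) = @_;
    for my $key ( keys %$hash ) {
        my $value = $hash->{$key};
        if ( ref($value) eq 'ARRAY' ) {
            my %seen;
            $hash->{$key} = [ grep { !$seen{$_}++ } @$value ];
        }
        elsif ( ref($value) eq 'HASH' ) {
            dedupe_arrays($value);    # descend into the nested hash
        }
        # Any other kind of value is left untouched.
    }
    return $hash;
}

# Illustrative data: one playlist with a repeated audio stream.
my %data = (
    6528 => {
        '00800.mpls, 01-48-48' => [
            'DTS, Spanish, stereo, 48kHz',
            'DTS, Portuguese, stereo, 48kHz',
            'DTS, Spanish, stereo, 48kHz',
        ],
    },
);

dedupe_arrays( \%data );
print "$_\n" for @{ $data{6528}{'00800.mpls, 01-48-48'} };
```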

edited Mar 10, 2013 at 19:34

answered Mar 10, 2013 at 19:29
