
I have a hash of hashes of arrays. The keys of the hashes are $duration and $attr. I want to sort the durations in descending order ($b <=> $a) and remove duplicate values, but only those that share the same duration. In the snippet below, these are the streams:

'h264/AVC, 1080p24 /1.001 (16:9)' & 'AC3, English, multi-channel, 48kHz' with duration '26', but not the duplicate values under the durations '2124' and '115'.

There are countless examples of removing duplicates, and I've tried everything I could find to adapt them to my needs, but with no success. What should my approach be? Thanks.

my ( %recordings_by_dur_attr ) = ();

push( @{ $recordings_by_dur_attr{ $duration }{ $attr } }, @stream );

print Data::Dumper->Dump( [\%recordings_by_dur_attr] );

Result:

$VAR1 = {
      '2124' => {
                  '00300.mpls, 00-35-24' => [
                                              '',
                                              'h264/AVC, 480i60 /1.001 (16:9)',
                                              'AC3, English, stereo, 48kHz'
                                            ]
                },
      '50' => {
                '00021.mpls, 00-00-50' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '6528' => {
                  '00800.mpls, 01-48-48' => [
                                              '',
                                              'Chapters, 18 chapters',
                                              'h264/AVC, 1080p24 /1.001 (16:9)',
                                              'DTS, Japanese, stereo, 48kHz',
                                              'DTS Master Audio, English, stereo, 48kHz',
                                              'DTS, French, stereo, 48kHz',
                                              'DTS, Italian, stereo, 48kHz',
                                              'DTS, German, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Portuguese, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Russian, stereo, 48kHz'
                                            ]
                },
      '26' => {
                '01103.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ],
                '01102.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ],
                '00011.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '115' => {
                 '00304.mpls, 00-01-55' => [
                                             '',
                                             'h264/AVC, 480i60 /1.001 (16:9)',
                                             'AC3, English, stereo, 48kHz'
                                           ]
               }
    };

Duplicate structure

 '',
'h264/AVC, 1080p24 /1.001 (16:9)',
'AC3, English, multi-channel, 48kHz'

Wanted result with removed duplicate structure:

$VAR1 = {
      '2124' => {
                  '00300.mpls, 00-35-24' => [
                                              '',
                                              'h264/AVC, 480i60 /1.001 (16:9)',
                                              'AC3, English, stereo, 48kHz'
                                            ]
                },
      '50' => {
                '00021.mpls, 00-00-50' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '6528' => {
                  '00800.mpls, 01-48-48' => [
                                              '',
                                              'Chapters, 18 chapters',
                                              'h264/AVC, 1080p24 /1.001 (16:9)',
                                              'DTS, Japanese, stereo, 48kHz',
                                              'DTS Master Audio, English, stereo, 48kHz',
                                              'DTS, French, stereo, 48kHz',
                                              'DTS, Italian, stereo, 48kHz',
                                              'DTS, German, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Portuguese, stereo, 48kHz',
                                              'DTS, Spanish, stereo, 48kHz',
                                              'DTS, Russian, stereo, 48kHz'
                                            ]
                },
      '26' => {
                '00011.mpls, 00-00-26' => [
                                            '',
                                            'h264/AVC, 1080p24 /1.001 (16:9)',
                                            'AC3, English, multi-channel, 48kHz'
                                          ]
              },
      '115' => {
                 '00304.mpls, 00-01-55' => [
                                             '',
                                             'h264/AVC, 480i60 /1.001 (16:9)',
                                             'AC3, English, stereo, 48kHz'
                                           ]
               }
    };

Post processing

for my $duration ( sort { $b <=> $a } keys %recordings_by_dur_attr ) {
    for my $attr ( keys %{ $recordings_by_dur_attr{ $duration } } ) {

        # Remove duplicate structures

        my @stream = @{ $recordings_by_dur_attr{ $duration }{ $attr } };
        my ( $mpls, $hms ) = ( $attr =~ /(\d+\.mpls), (\d+-\d+-\d+)$/ );
        for ( my $i = 1; $i < @stream; $i++ ) {

            # extract info from each stream

        }
    }
}
  • After I sort and remove the duplicates I want to count and additionally process each stream (e.g. get the language) from each playlist and extract mpls number and timestamp from the $attr. Commented Mar 10, 2013 at 19:20
  • I apologize. Obviously I misunderstood your request. What I posted is what should be considered a duplicate structure: multiple streams. I'd like to compare those multiple streams across the playlists and remove the duplicate ones that have equal $duration. Commented Mar 10, 2013 at 20:30
  • I want to remove those duplicate structures because they are identical playlists and in that sense they are redundant. My final goal is to automate a process for extracting the codec streams, which would be time consuming if many such streams are present. In some scenarios I'm talking about 20-30 playlists. Commented Mar 10, 2013 at 21:11
  • btw, $hashref would be a very poor name for a variable that actually contains a hash reference, but to name a hash %hashref is just plain bad. Commented Mar 10, 2013 at 21:48

2 Answers

1

The expression $seen{$candidate}++ is useful for finding duplicates. When it returns true, $candidate has previously been seen. It is most often used as follows:

my @uniq = grep !$seen{$_}++, @list;
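
As a minimal, self-contained sketch of that idiom (the sample list is made up for illustration and is not from the question), the same expression can pick out either the first occurrences or the repeats:

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the question.
my @list = qw( a b a c b );

# $seen{$_}++ returns the count *before* incrementing, so it is false
# the first time a value appears and true on every repeat.
my %seen;
my @uniq = grep { !$seen{$_}++ } @list;    # first occurrences, in order

%seen = ();
my @dups = grep { $seen{$_}++ } @list;     # only the repeated occurrences

print "@uniq\n";    # a b c
print "@dups\n";    # a b
```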

Instead of building a list of keys of elements to keep, I inverted the condition to build a list of keys of elements to delete.

sub id { pack 'N/(N/a*)', @{ $_[0] } }

for my $recordings_by_attr (values(%recordings_by_dur_attr)) {
    my %seen;
    delete @{$recordings_by_attr}{
        grep $seen{id($recordings_by_attr->{$_})}++,
            sort
                keys %$recordings_by_attr
    };
}

The sort decides which of the duplicates to remove. If you don't care which, you can remove the sort.
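
To see the whole thing run end to end, here is a rough sketch that applies the loop above to a trimmed-down copy of the question's data. The sample hash is shortened by hand for illustration; id() is taken from this answer unchanged.

```perl
use strict;
use warnings;

# id() from the answer: serializes an array of strings into one string
# that is identical only for identical arrays.
sub id { pack 'N/(N/a*)', @{ $_[0] } }

# Trimmed-down sample in the shape of the question's data.
my @hd = ( '', 'h264/AVC, 1080p24 /1.001 (16:9)', 'AC3, English, multi-channel, 48kHz' );
my @sd = ( '', 'h264/AVC, 480i60 /1.001 (16:9)',  'AC3, English, stereo, 48kHz' );

my %recordings_by_dur_attr = (
    2124 => { '00300.mpls, 00-35-24' => [@sd] },
    115  => { '00304.mpls, 00-01-55' => [@sd] },    # same streams, other duration
    26   => {
        '01103.mpls, 00-00-26' => [@hd],
        '01102.mpls, 00-00-26' => [@hd],
        '00011.mpls, 00-00-26' => [@hd],
    },
);

for my $recordings_by_attr ( values %recordings_by_dur_attr ) {
    my %seen;    # fresh per duration, so only same-duration duplicates match
    delete @{$recordings_by_attr}{
        grep { $seen{ id( $recordings_by_attr->{$_} ) }++ }
        sort keys %$recordings_by_attr
    };
}

# Print what is left, longest duration first.
for my $duration ( sort { $b <=> $a } keys %recordings_by_dur_attr ) {
    print "$duration: $_\n" for sort keys %{ $recordings_by_dur_attr{$duration} };
}
```

Because %seen is declared inside the loop, the identical stream lists under '2124' and '115' are both kept, while under '26' only the first key in sort order ('00011.mpls, 00-00-26') survives.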


3 Comments

Many, many thanks ikegami. I changed the bad naming to match your suggestion, and obviously this is the solution, but I keep getting an error: Can't use string $attr as an ARRAY ref while "strict refs" in use at sub id { pack 'N/(N/a*)', @$_ }. Am I missing something?
Yessss!!! You made my day. Unlimited thanks. The missing piece in my puzzle was sub id { pack 'N/(N/a*)', @{ $_[0] } }, which is the key for the solution in my case, because I've already tried the grep !$seen{$_}++ many times. Thanks again.
You need a function (in the mathematical sense) F where F(A) eq F(B) for all duplicates A,B; and where F(A) ne F(B) for all differing A,B. It doesn't matter what the value is, as long as it meets the above definition. grep $seen{ join "\n", @$_ }++, would do if the values in the arrays can't contain newlines, but I went with a solution that handles every string.
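
A small sketch of why the choice of F matters. The two arrays below are made-up values chosen to collide, and id_join and id_pack are hypothetical names for the two candidate functions discussed in the comment:

```perl
use strict;
use warnings;

sub id_join { join "\n", @{ $_[0] } }           # the simpler key from the comment
sub id_pack { pack 'N/(N/a*)', @{ $_[0] } }     # the key used in the answer

# Two different arrays that happen to join to the same string "a\nb\nc".
my @x = ( "a\nb", "c" );
my @y = ( "a",    "b\nc" );

print "join collides\n" if id_join( \@x ) eq id_join( \@y );
print "pack differs\n"  if id_pack( \@x ) ne id_pack( \@y );
```

The packed form stores a count and a length before each element, so element boundaries cannot be confused with element content.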
0

Steps:

    1. Traverse the hash.
    2. If ref $hash->{$key} eq "ARRAY"
       then
       1. my @temp = uniq(@{ $hash->{$key} });
       2. $hash->{$key} = \@temp;
       else, if the value is a hash reference,
       1. Traverse that hash in the same way.
    3. Else
       1. next;
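
The steps above could be sketched roughly as follows. This is only one reading of them: dedupe_arrays is a hypothetical name, the local uniq() is a stand-in for whichever uniq the steps have in mind, and the sample data is cut down from the question.

```perl
use strict;
use warnings;

# Stand-in for the uniq() the steps mention; keeps first occurrences.
sub uniq { my %seen; grep { !$seen{$_}++ } @_ }

# Recursively walk a hash of hashes and dedupe the elements of every array.
sub dedupe_arrays {
    my ($hash) = @_;
    for my $key ( keys %$hash ) {
        my $type = ref $hash->{$key};
        if ( $type eq 'ARRAY' ) {
            $hash->{$key} = [ uniq( @{ $hash->{$key} } ) ];
        }
        elsif ( $type eq 'HASH' ) {
            dedupe_arrays( $hash->{$key} );
        }
        # anything else (plain scalars) is left untouched
    }
    return $hash;
}

my %data = (
    6528 => {
        '00800.mpls, 01-48-48' => [
            '',
            'DTS, Spanish, stereo, 48kHz',
            'DTS, French, stereo, 48kHz',
            'DTS, Spanish, stereo, 48kHz',
        ],
    },
);

dedupe_arrays( \%data );
print "$_\n" for @{ $data{6528}{'00800.mpls, 01-48-48'} };
```

Note that this removes repeated strings inside each array; it does not by itself drop playlists whose whole stream list is identical.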
