
I have a tab-delimited file with 6 columns (only 2 are shown here for simplicity):

46_#1   A   
47_#1   B   
49_#1   C   
51_#1   D   
51_#1   E

I want to count duplicates in the first column (count only, no removal) and store the count in a new column, so the output should be:

46_#1   1  A    
47_#1   1  B    
49_#1   1  C    
51_#1   2  D    
51_#1   2  E

I have used the Linux command

uniq -c  file

but this compares the whole line (not just the first column), so then I tried

uniq -c -w5 file

But the width of the first column can vary, so a fixed `-w5` does not work.

Can anyone help please?

PS: I have a very big file (around 1 GB).

  • Are the duplicates always adjacent? Commented Jan 27, 2012 at 12:21
  • Oh! Sorry, I should have mentioned that. No, they can be far apart. Commented Jan 27, 2012 at 12:33

1 Answer


I don't like just providing complete solutions, but it seemed the easiest way to explain. This program reads through the file twice: first to accumulate the frequency information and then to output the modified data.

use strict;
use warnings;

@ARGV or die "No input file specified";

open my $fh, '<', $ARGV[0] or die "Unable to open input file: $!";

my %count;

# First pass: tally how many times each first-column key appears.
while (<$fh>) {
  next unless my ($key) = split;
  $count{$key}++;
}

# Rewind the filehandle and make a second pass, printing each line
# with the key's total count inserted after the key.
seek $fh, 0, 0;
while (<$fh>) {
  chomp;
  next unless my ($key, $rest) = split ' ', $_, 2;
  print "$key $count{$key} $rest\n";
}

1 Comment

Thanks a lot. It is perfectly fine.
