Count and sum up duplicate records in a file (UNIX)

Question

I am supposed to count the total number of duplicate records in a file.

I used

 sort $TEMP_FILE2 | uniq -d

to list all duplicate records, but without a count. My problem is that I do not know what script to use to sum up, or get the total of, those records.

This should be my output:

Total Data Count: xxx

Duplicate Data Count: xxx (Total duplicate records in a file)

Final Data Count: xxx

You could use awk. Without some sample data, it's unlikely for anyone to be able to help. – devnull, Commented Mar 20, 2014 at 2:37
This should work: awk '你吃饭了吗你多吃一点慢慢吃慢走我先走了' inputFile – jaypal singh, Commented Mar 20, 2014 at 3:21
@jaypal Does your awk expression really contain Chinese characters, or are my (or your) browser fonts messed up? – Digital Trauma, Commented Mar 20, 2014 at 4:34

Digital Trauma · Accepted Answer · 2014-03-20 04:32:21Z

I'll take a few guesses here, since it's not entirely clear what's needed. First I'll assume your file looks something like this:

apple
banana
pear
apple
pear
apple

I assume "Total Data Count" is simply the number of entries, i.e. the total number of lines in the file. wc -l is the tool for that:

$ echo "Total Data Count: $(wc -l < temp_file)"
Total Data Count: 6
$

Then "Duplicate Data Count" is one of two things:

If it is the count of all lines that belong to a duplicated record (5 = "apple", "apple", "apple", "pear", "pear" in my example), use uniq -dc to get the count for each duplicated line, then awk to sum them up:

$ echo "Duplicate Data Count: $(sort temp_file | uniq -dc | awk '{count+=$1} END {print count}')"
Duplicate Data Count: 5
$

If it is the number of distinct records that have duplicates, rather than the full count of all duplicated lines (2 = "apple", "pear" in my example), then wc -l of uniq -d should be sufficient:

$ echo "Duplicate Data Count: $(sort temp_file | uniq -d | wc -l)"
Duplicate Data Count: 2
$
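As a possible alternative to the sort and uniq pipelines, both interpretations can be computed in one awk pass that counts how often each whole line occurs. This is only a sketch: the function name dup_summary and the label text are made up for illustration, and it assumes one record per line, as in the sample file.

```shell
#!/bin/sh
# Hypothetical helper: report both "duplicate count" interpretations for file $1.
dup_summary() {
    awk '
        { seen[$0]++ }                       # tally each whole line
        END {
            for (rec in seen) {
                if (seen[rec] > 1) {
                    dup_lines += seen[rec]   # every occurrence of a repeated record
                    dup_records++            # each repeated record counted once
                }
            }
            # "+ 0" forces a numeric 0 when nothing is duplicated
            print "All occurrences of duplicated records: " (dup_lines + 0)
            print "Distinct records that are duplicated: " (dup_records + 0)
        }
    ' "$1"
}

# Recreate the sample data and run the helper on it.
printf '%s\n' apple banana pear apple pear apple > temp_file
dup_summary temp_file
```

On the sample file this should report 5 for the first line and 2 for the second, matching the two pipelines. Unlike uniq, this approach does not need sorted input, but it keeps every distinct line in memory.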

I'm assuming "Final Data Count" is the number of all records with duplicates removed (3 = "apple", "pear", "banana" in my example). Here we can just pipe plain uniq to wc -l:

$ echo "Final Data Count: $(sort temp_file | uniq | wc -l)"
Final Data Count: 3
$
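Putting the pieces together, the three lines of the requested output could be produced by a small script like the one below. It is a sketch under the same guesses as above: it takes "Duplicate Data Count" to mean all occurrences of duplicated records, and the function name report_counts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the three counts for the file given as $1.
report_counts() {
    file=$1
    # Total number of lines; tr strips the padding some wc implementations add.
    total=$(wc -l < "$file" | tr -d ' ')
    # Sum of occurrences over every record that appears more than once;
    # "+ 0" prints 0 instead of an empty string when there are no duplicates.
    dup=$(sort "$file" | uniq -dc | awk '{count += $1} END {print count + 0}')
    # Number of distinct records (sort -u is equivalent to sort | uniq).
    final=$(sort -u "$file" | wc -l | tr -d ' ')
    echo "Total Data Count: $total"
    echo "Duplicate Data Count: $dup"
    echo "Final Data Count: $final"
}

# Example run on the sample data from the start of this answer.
printf '%s\n' apple banana pear apple pear apple > temp_file
report_counts temp_file
```

If the other interpretation is wanted, the dup line can be swapped for the uniq -d | wc -l pipeline shown earlier.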

1 Comment

user3003001 Over a year ago

this works "Duplicate Data Count: $(sort temp_file | uniq -d | wc -l)" thanks :)
