I am supposed to count the total number of duplicate records in a file.

I used

 sort $TEMP_FILE2 | uniq -d

to list all duplicate records, but without a count. My problem is that I do not know what script to use to sum up those records and get the total.

This should be my output:

Total Data Count: xxx

Duplicate Data Count: xxx (Total duplicate records in a file)

Final Data Count: xxx

  • You could use awk. Without some sample data, it's unlikely for anyone to be able to help. Commented Mar 20, 2014 at 2:37
  • This should work: awk '你吃饭了吗你多吃一点慢慢吃慢走我先走了' inputFile Commented Mar 20, 2014 at 3:21
  • @jaypal Does your awk expression really contain Chinese characters, or are my (or your) browser fonts messed up? Commented Mar 20, 2014 at 4:34
  • @DigitalTrauma 对 which means, yes, that's correct! :P Commented Mar 20, 2014 at 4:49
  • @jaypal I think I missed that part of the awk manpage ;-) Commented Mar 20, 2014 at 4:52

1 Answer

I'll take a few guesses here, since it's not entirely clear what's needed. First I'll assume your file looks something like this:

apple
banana
pear
apple
banana
apple

I assume "Total Data Count" is simply the number of entries, i.e. the total number of lines in the file. wc -l is the tool for that:

$ echo "Total Data Count: $(wc -l < temp_file)"
Total Data Count: 6
$ 
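
One caveat with wc -l: it counts newline characters, so a final line with no trailing newline is not included. Here is a minimal sketch of the difference, using a scratch file from mktemp with made-up data (not from the question). Counting records with awk is one possible workaround if your file might end that way:

```shell
# Scratch file whose last line has no trailing newline (illustrative data).
tmp=$(mktemp)
printf 'apple\nbanana\npear' > "$tmp"

# wc -l counts newline characters, so only the two terminated lines count.
# tr strips the padding some wc implementations add around the number.
wc_count=$(wc -l < "$tmp" | tr -d '[:space:]')

# awk counts records, including the unterminated last one.
awk_count=$(awk 'END {print NR}' "$tmp")

echo "wc -l: $wc_count, awk NR: $awk_count"
rm -f "$tmp"
```

For this file wc -l reports 2 while awk reports 3.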

Then "Duplicate Data Count" is one of two things:

If it is the count of all records that are duplicated (5 = "apple", "apple", "apple", "banana", "banana" in my example), use uniq -dc to get a count for each duplicated line, then awk to sum them up:

$ echo "Duplicate Data Count: $(sort temp_file | uniq -dc | awk '{count+=$1} END {print count+0}')"
Duplicate Data Count: 5
$ 
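
The same total can probably be had from awk alone, without sorting first, which is roughly what the first comment hinted at. This is only a sketch: the array name seen and the scratch file are my own choices, and the data assumes "apple" three times, "banana" twice, and "pear" once.

```shell
# Scratch file with assumed sample data: apple x3, banana x2, pear x1.
tmp=$(mktemp)
printf '%s\n' apple banana pear apple banana apple > "$tmp"

# Tally every line, then add up the tallies of lines seen more than once.
# "n + 0" makes awk print 0 instead of an empty string when nothing repeats.
dup_total=$(awk '
    { seen[$0]++ }
    END {
        for (line in seen)
            if (seen[line] > 1) n += seen[line]
        print n + 0
    }' "$tmp")

echo "Duplicate Data Count: $dup_total"
rm -f "$tmp"
```

With that data this prints "Duplicate Data Count: 5". The awk version holds one array entry per distinct line in memory, so for very large files the sort-based pipeline may still be the safer choice.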

If it is the number of distinct records that occur more than once (not the full count of every duplicate occurrence) (2 = "apple", "banana" in my example), then wc -l of uniq -d should be sufficient:

$ echo "Duplicate Data Count: $(sort temp_file | uniq -d | wc -l)"
Duplicate Data Count: 2
$ 

I'm assuming "Final Data Count" is the number of all records with duplicates removed (3 = "apple", "pear", "banana" in my example). Here we can just pipe plain uniq to wc -l:

$ echo "Final Data Count: $(sort temp_file | uniq | wc -l)"
Final Data Count: 3
$ 
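
Putting the pieces together, a small function could print the whole report in the format from the question. Treat this as a sketch under my assumptions: report_counts is a hypothetical name, the sample data is made up, and "Duplicate Data Count" is taken to mean the number of distinct records that occur more than once.

```shell
#!/bin/sh
# report_counts FILE: hypothetical helper that prints the three counts.
report_counts() {
    file=$1
    # tr strips the padding some wc implementations add around the number.
    total=$(wc -l < "$file" | tr -d '[:space:]')
    dup=$(sort "$file" | uniq -d | wc -l | tr -d '[:space:]')
    final=$(sort "$file" | uniq | wc -l | tr -d '[:space:]')
    echo "Total Data Count: $total"
    echo "Duplicate Data Count: $dup"
    echo "Final Data Count: $final"
}

# Demo with assumed sample data: apple x3, banana x2, pear x1.
sample=$(mktemp)
printf '%s\n' apple banana pear apple banana apple > "$sample"
report_counts "$sample"
rm -f "$sample"
```

For the demo data this prints 6, 2 and 3. If the other reading of "Duplicate Data Count" is the one you want, swap the dup line for the uniq -dc pipeline shown earlier.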

1 Comment

this works "Duplicate Data Count: $(sort temp_file | uniq -d | wc -l)" thanks :)