Finding Duplicate rows based on a column in Unix File

Question

I have a file of about 1 million records. I need to extract the records where the same ID appears with a different FName or LName.

Input File

Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1
AP,[email protected],xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1

The result that I want to see

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Can any awk or sed command or script help? Thanks

Please add your own attempts, in the form of code, to your question; this is highly encouraged on SO. Thank you.
– RavinderSingh13, Commented Jan 13, 2021 at 10:23
Could you please explain why the lines with 567,Alex,Smith and 789,Allen,Prack are NOT present in the expected output, even though their first and last names are unique?
– RavinderSingh13, Commented Jan 13, 2021 at 11:33
Please put all information in your question, not spread out in comments where people could miss it. I thought abc1 and abc2 were your ID values.
– Ed Morton, Commented Jan 13, 2021 at 15:07

anubhava · Accepted Answer · 2021-01-13 14:51:43Z

2

You may try this awk:

awk 'BEGIN {FS=OFS=","} {id = $5; name = $6 FS $8} id in map && map[id] != name {if (!done[id]++) print rec[id]; print} {map[id] = name; rec[id] = $0}' file

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Or a bit more readable:

awk 'BEGIN {
   FS=OFS=","
}
{
   id = $5
   # name variable to store fname, lname
   name = $6 FS $8
}
# if this id is already a key in map and the name stored for it
# differs from the current name
id in map && map[id] != name {
   # print previous record if not already printed
   if (!done[id]++)
      print rec[id]
   # print current record
   print
}
{
   # store name by key as id in map array
   # and store full record by key as id in rec array
   map[id] = name
   rec[id] = $0
}' file

edited Jan 13, 2021 at 14:51

answered Jan 13, 2021 at 10:44

anubhava


7 Comments

Raghavendra Gupta Over a year ago

Thanks. Could you please explain this step: $1 in map && map[$1] != name { print $1, map[$1] ORS $0

Camusensei Over a year ago

This looks perfect to me, just add a | awk '!seen[$0]++' to remove duplicates as it may produce a lot otherwise

Raghavendra Gupta Over a year ago

@anubhava Thanks for your answer. Actually, I have other fields before and after ID and LName. What if the ID is the 5th field in the file, FName is the 6th and LName is the 8th? There are also many more fields. How can I print those while comparing on the 5th, 6th and 8th?

RavinderSingh13 Over a year ago

@RaghavendraGupta, IMHO, answers will always be based on the samples shown. If this answer works with the shown samples but not with your actual file, you need to post samples that are close to your actual file; otherwise it will be difficult to help or guide you on this one. So kindly edit your question with the proper details and let us know. Thank you.

anubhava Over a year ago

@RaghavendraGupta: Check my updated answer. Please provide all the relevant information upfront to get the best answers; just a suggestion for future questions.
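
To illustrate the suggestion in the comments above, here is a minimal sketch of piping the answer's one-liner through awk '!seen[$0]++'. The three sample rows (including the m1/m2 placeholder values) are invented for this example: the third row is an exact repeat of the first, which is the kind of input where the unfiltered command prints the same line twice.

```shell
# Made-up sample: row 3 is an exact copy of row 1.
input=$(mktemp)
cat > "$input" <<'EOF'
AP,m1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,m2,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,m1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
EOF

# The one-liner from the answer, kept in a variable so it can be run twice.
prog='BEGIN {FS=OFS=","} {id = $5; name = $6 FS $8} id in map && map[id] != name {if (!done[id]++) print rec[id]; print} {map[id] = name; rec[id] = $0}'

# Without the extra filter, the repeated Ram row is printed again.
raw=$(awk "$prog" "$input")

# With the filter, each distinct output line is printed only once.
deduped=$(awk "$prog" "$input" | awk '!seen[$0]++')

printf 'without filter: %s line(s)\n' "$(printf '%s\n' "$raw" | grep -c .)"
printf 'with filter:    %s line(s)\n' "$(printf '%s\n' "$deduped" | grep -c .)"
rm -f "$input"
```

The filter only removes lines that are identical in every column; rows that share an ID and name but differ elsewhere are still reported.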


Ed Morton · Answer · 2021-01-13 14:55:37Z

1

Using GNU awk for arrays of arrays:

$ awk -F, '
    { vals[$5][$6 FS $8] = $0 }
    END {
        for ( id in vals ) {
            if ( length(vals[id]) > 1 ) {
                for (name in vals[id]) {
                    print vals[id][name]
                }
            }
        }
    }
' file
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
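
If GNU awk is not available, a similar result can probably be had in POSIX awk by folding the ID and name into one composite key. The sketch below is an adaptation, not part of the answer above: the array names rec, names, order and idOf are hypothetical, and the mail1..mail6 values are placeholders invented for the sample. Like the gawk version it keeps the last record seen for each distinct ID/name pair, but it prints in first-seen order rather than for-in order.

```shell
# Made-up sample with the same column layout as the question.
input=$(mktemp)
cat > "$input" <<'EOF'
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,mail1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,mail2,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,mail3,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,mail4,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,mail5,xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1
AP,mail6,xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1
EOF

out=$(awk -F, '
{
    key = $5 SUBSEP $6 FS $8        # composite key: id + "FName,LName"
    if (!(key in rec)) {            # first time this id/name pair is seen
        names[$5]++                 # count distinct names per id
        order[++n] = key            # remember first-seen order
        idOf[n] = $5
    }
    rec[key] = $0                   # keep the latest record for the pair
}
END {
    for (i = 1; i <= n; i++)
        if (names[idOf[i]] > 1)     # id has more than one distinct name
            print rec[order[i]]
}' "$input")

printf '%s\n' "$out"
rm -f "$input"
```

The header row needs no special handling here because its "ID" value occurs only once and so never has more than one name.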

If your input file is sorted by "id", as in your sample input, this works with any awk and without storing the input file in memory:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR > 1 {
    id   = $5
    name = $6 FS $8

    if ( id == prevId ) {
        if ( name != prevName ) {
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }

    prevId   = id
    prevName = name
}

$ awk -f tst.awk file
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
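
If the file is not already grouped by ID, one option is to sort it on the fifth field before running tst.awk. The sketch below assumes the first line is a header (which tst.awk skips via NR > 1); the temporary directory, the file names inside it and the sample rows are made up for illustration.

```shell
work=$(mktemp -d)

# tst.awk as shown above.
cat > "$work/tst.awk" <<'EOF'
BEGIN { FS=OFS="," }
NR > 1 {
    id   = $5
    name = $6 FS $8
    if ( id == prevId ) {
        if ( name != prevName ) {
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }
    prevId   = id
    prevName = name
}
EOF

# Made-up sample in which the two rows for ID 123 are not adjacent.
cat > "$work/file" <<'EOF'
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,m1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,m3,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,m2,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
EOF

# Pass the header through untouched, sort the remaining rows on field 5
# so that equal IDs become adjacent, then run the script on the result.
out=$( { head -n 1 "$work/file"; tail -n +2 "$work/file" | sort -t, -k5,5; } \
       | awk -f "$work/tst.awk" )

printf '%s\n' "$out"
rm -rf "$work"
```

Sorting a million rows has its own cost, so whether this beats an in-memory approach depends on the data and the machine.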

edited Jan 13, 2021 at 14:55

answered Jan 13, 2021 at 14:33

Ed Morton


2 Comments

Raghavendra Gupta Over a year ago

First and last names that differ only in case (lower vs. upper) are being treated as different names. Is there anything we can do to ignore case?

Ed Morton Over a year ago

Again, please don't spread out requirements in comments; put it all in your question (and include a case to test that in your example). Having said that, just change name = $6 FS $8 to name = tolower($6 FS $8) to make it case-insensitive.
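
As a rough illustration of that change, the sketch below applies tolower() to the sorted-input script from this answer. The sample rows are invented: ID 123 carries the same name in two different cases, while ID 345 carries two genuinely different names, so only the 345 rows should be reported.

```shell
# Made-up sample, sorted by ID (field 5), header on the first line.
input=$(mktemp)
cat > "$input" <<'EOF'
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,m1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,m2,xyz2,abc2,123,RAM,,KUMAR,phn2,fax2,url1
AP,m3,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,m4,xyz1,abc1,345,Shyam,,Kumar,phn2,fax2,url1
EOF

out=$(awk -F, '
NR > 1 {
    id   = $5
    name = tolower($6 FS $8)     # fold case before comparing names
    if ( id == prevId ) {
        if ( name != prevName ) {
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }
    prevId   = id
    prevName = name
}' "$input")

printf '%s\n' "$out"
rm -f "$input"
```

Only the comparison key is lowercased; the records themselves are printed exactly as they appear in the file.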

F. Knorr · Answer · 2021-01-13 14:10:24Z

1

This one-liner should do the job:

awk -F "," '!a[$5] {a[$5]=$0} a[$5]!=$0{print a[$5]; print $0; a[$5]=$0}' input_file.txt

Output:

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Note that this compares the entire lines that share an ID, not just the name fields.
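
To make that caveat concrete, here is a small sketch on two invented rows that share an ID and a name but differ in other columns. It runs the one-liner above next to a name-only variant; the variant and its array names n, rec and done are illustrative assumptions, not part of this answer.

```shell
# Made-up sample: same ID, same FName/LName, different other columns.
input=$(mktemp)
cat > "$input" <<'EOF'
AP,m1,xyz1,abc1,345,Raman,,Kumar,phn1,fax1,url1
AP,m2,xyz2,abc2,345,Raman,,Kumar,phn2,fax2,url1
EOF

# Whole-line comparison, as in the one-liner above: the rows differ
# somewhere, so both are printed.
whole=$(awk -F "," '!a[$5] {a[$5]=$0} a[$5]!=$0{print a[$5]; print $0; a[$5]=$0}' "$input")

# Name-only comparison: remember "FName,LName" per ID instead of the line.
names=$(awk -F "," '
{ name = $6 FS $8 }
($5 in n) && n[$5] != name { if (!done[$5]++) print rec[$5]; print }
{ n[$5] = name; rec[$5] = $0 }' "$input")

printf 'whole-line: %s row(s)\n' "$(printf '%s' "$whole" | grep -c .)"
printf 'name-only:  %s row(s)\n' "$(printf '%s' "$names" | grep -c .)"
rm -f "$input"
```

Which behaviour is wanted depends on whether a change in any column, or only in the name, should count as a mismatch.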

edited Jan 13, 2021 at 14:10

answered Jan 13, 2021 at 10:40

F. Knorr


Raman Sailopal · Answer · 2021-01-13 10:43:47Z

0

awk -F, -v id="123" '$5 == id { map[NR]=$0 } END { for(i in map) { print map[i] } }' file

With awk, set the field separator to a comma and pass in a variable called id. When the fifth field (the ID column in the sample input) equals the passed id, the line is added to an array called map, indexed by record number. At the end, loop through the array and print the values; note that for (i in map) does not guarantee any particular order.

answered Jan 13, 2021 at 10:43

Raman Sailopal
