I have a file of about 1 million records. I need to extract the records whose ID occurs more than once in the file with a different FName or LName.

Input File

Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,[email protected],xyz1,abc1,567,Alex,,Smith,phn2,fax2,url1
AP,[email protected],xyz1,abc1,789,Allen,,Prack,phn2,fax2,url1

The result that I want to see

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Can an awk or sed command or script do this? Thanks.

  • Kindly add your efforts, in the form of code, to your question; this is highly encouraged on SO. Thank you. Commented Jan 13, 2021 at 10:23
  • Sure. Adding it. Commented Jan 13, 2021 at 10:24
  • Could you please explain why the lines 567,Alex,Smith and 789,Allen,Prack are NOT present in the expected output, even though their first and last names are unique? Commented Jan 13, 2021 at 11:33
  • They don't have their IDs duplicated in the file. Commented Jan 13, 2021 at 12:25
  • Please put all information in your question, not spread out in comments where people could miss it. I thought abc1 and abc2 were your ID values. Commented Jan 13, 2021 at 15:07

4 Answers

2

You may try this awk:

awk 'BEGIN {FS=OFS=","} {id = $5; name = $6 FS $8} id in map && map[id] != name {if (!done[id]++) print rec[id]; print} {map[id] = name; rec[id] = $0}' file

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Or a bit more readable:

awk 'BEGIN {
   FS=OFS=","
}
{
   id = $5
   # name variable to store fname, lname
   name = $6 FS $8
}
# if this id is already a key in map and the stored name
# differs from the current name
id in map && map[id] != name {
   # print previous record if not already printed
   if (!done[id]++)
      print rec[id]
   # print current record
   print
}
{
   # store name by key as id in map array
   # and store full record by key as id in rec array
   map[id] = name
   rec[id] = $0
}' file
7 Comments

Thanks. Could you please explain this step: $1 in map && map[$1] != name { print $1, map[$1] ORS $0
This looks perfect to me; just add | awk '!seen[$0]++' to remove duplicates, as it may produce a lot of them otherwise.
@anubhava Thanks for your answer. Actually I have other fields too, before and after ID and LName. What if the ID is the 5th field in the file, FName is the 6th and LName is the 8th? There are also many more fields. How can I print those while comparing on the 5th, 6th and 8th?
@RaghavendraGupta, IMHO, answers will always be based on the samples shown. If this answer works with the shown samples but not with your actual file, then you need to post samples that are close to your actual file; otherwise it will be difficult to help or guide you on this one. So kindly edit your question with proper details and let us know, thank you.
@RaghavendraGupta: Check my updated answer. Please provide all the relevant information upfront to get the best answers; just a suggestion for future questions.
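
The deduplication suggested in the comments above can be sketched as follows. This is only an illustration under assumptions: the file name dup_sample.csv and the e1/e2 values are made up, and the third data row deliberately repeats the first one so that the answer's command prints the same line twice before the second awk filters it out.

```shell
# Hypothetical sample in the question's layout; the last row repeats the first.
cat > dup_sample.csv <<'EOF'
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,e1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,e2,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,e1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
EOF

# The one-liner from this answer, followed by the whole-line dedup filter
# from the comment: !seen[$0]++ is true only the first time a line is seen.
deduped=$(awk 'BEGIN {FS=OFS=","} {id = $5; name = $6 FS $8} id in map && map[id] != name {if (!done[id]++) print rec[id]; print} {map[id] = name; rec[id] = $0}' dup_sample.csv |
  awk '!seen[$0]++')

printf '%s\n' "$deduped"
```

Without the second awk, the Ram row would appear twice here, because the name stored for ID 123 changes from Ram to Shyam and back to Ram.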
1

Using GNU awk for arrays of arrays:

$ awk -F, '
    { vals[$5][$6 FS $8] = $0 }
    END {
        for ( id in vals ) {
            if ( length(vals[id]) > 1 ) {
                for (name in vals[id]) {
                    print vals[id][name]
                }
            }
        }
    }
' file
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Or, if your input file is sorted by ID as shown in your sample input, then with any awk and without storing the input file in memory:

$ cat tst.awk
BEGIN { FS=OFS="," }
NR > 1 {
    id   = $5
    name = $6 FS $8

    if ( id == prevId ) {
        if ( name != prevName ) {
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }

    prevId   = id
    prevName = name
}

$ awk -f tst.awk file
AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

2 Comments

First names and last names that differ only in case (lower vs. upper) are treated as different names. Is there anything we can do to ignore case?
Again, please don't spread out requirements in comments; put it all in your question (and include a case to test that in your example). Having said that, just change name = $6 FS $8 to name = tolower($6 FS $8) to make it case-insensitive.
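
A minimal sketch of that tolower() change applied to the sorted-input script, assuming a hypothetical file sample.csv in the question's layout (the e1 to e4 values are made up). The two rows for ID 345 differ only in the case of the names, so they should not be reported.

```shell
# Hypothetical sample: ID 345 has the same name in different cases.
cat > sample.csv <<'EOF'
Col1,Col2,Col3,Col4,ID,FName,Col5,LName,Col6,Col7,Col8
AP,e1,xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,e2,xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1
AP,e3,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
AP,e4,xyz1,abc1,345,RAMAN,,kumar,phn2,fax2,url1
EOF

result=$(awk '
BEGIN { FS = OFS = "," }
NR > 1 {
    id   = $5
    name = tolower($6 FS $8)    # fold case before comparing names

    if ( id == prevId ) {
        if ( name != prevName ) {
            # print the first record of this id once, then the current one
            if ( firstRec != "" ) {
                print firstRec
                firstRec = ""
            }
            print
        }
    }
    else {
        firstRec = $0
    }

    prevId   = id
    prevName = name
}' sample.csv)

printf '%s\n' "$result"
```

Only the comparison key is lower-cased; the records themselves are printed unchanged.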
1

This one-liner should do the job:

awk -F "," '!a[$5] {a[$5]=$0} a[$5]!=$0{print a[$5]; print $0; a[$5]=$0}' input_file.txt

Output:

AP,[email protected],xyz1,abc1,123,Ram,,Kumar,phn1,fax1,url1
AP,[email protected],xyz2,abc2,123,Shyam,,Kumar,phn2,fax2,url1

Note that this compares entire lines for each ID, not just the FName and LName fields.
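
A small sketch of what that note means in practice, using hypothetical rows (the file name whole_line.csv and the e3/e4 values are made up). Both rows share ID 345 and have the same FName and LName, but they differ in other columns, so the one-liner still reports them.

```shell
# Hypothetical sample: same ID and same names, different Col2/Col6/Col7.
cat > whole_line.csv <<'EOF'
AP,e3,xyz1,abc1,345,Raman,,Kumar,phn1,fax1,url1
AP,e4,xyz1,abc1,345,Raman,,Kumar,phn2,fax2,url1
EOF

# Same command as above: a[$5] holds the last full line seen for that ID,
# so any difference anywhere in the line counts as a mismatch.
flagged=$(awk -F "," '!a[$5] {a[$5]=$0} a[$5]!=$0{print a[$5]; print $0; a[$5]=$0}' whole_line.csv)

printf '%s\n' "$flagged"
```

If only FName and LName should decide, the comparison would need to be restricted to $6 and $8.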

0
awk -F, -v id="123" '$5 == id { map[NR]=$0 } END { for(i in map) { print map[i] } }' file

With awk, set the field separator to a comma and pass in a variable called id. When the fifth field (the ID column) is equal to the passed id, add the line to an array called map, indexed by the record number and with the line as the value. At the end, loop through the array and print the values.
