0

I need help to improve my working code for combining multiple rows based on identical values for specified columns. Here is a sample data:

c-i1_pos-at1-v2 162a    AT1G01040.1 2   3422-3443   3433    1
c-i1_pos-at1-v2 162b    AT1G01040.1 2   3422-3443   3433    1
pare-i_226-v2-wt    162a    AT1G01040.1 2   3422-3443   3433    0
pare-i_226-v2-wt    162b    AT1G01040.1 2   3422-3443   3433    0
xrn4-pare-i_ath-227-v2-wt   827 AT1G02860.1 1   258-278 269 1
i2_lib2-v2  156a    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156b    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156c    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156d    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156e    AT1G03730.1 4   242-260 252 3

Basically if values in columns $3,$5 are identical, I want to combine rows for columns $2,$6 (or more), with the unique values of the remaining columns merged like this:

AT1G01040.1 3422-3443   3433    162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

Right now I am trying to do this in multiple steps, based on the answers here.

awk 'BEGIN{FS=OFS="\t"} {c=$2 FS $3 FS $5; if (c in a) a[c]=a[c]","$6; else a[c]=$6}END{for (k in a) print k,a[k]}'|awk '{p=$1 FS $2 FS $4; if (p in l) l[p]=l[p]","$3;else l[p]=$3}END{for (m in l) print m,l[m]}' <input.txt

Which gives:

AT1G01040.1 3422-3443   3433,3433   162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

I thought I should put the values in the remaining columns as arrays to get my desired output at one step, but I am struggling to figure out the correct context.

4
  • Ideally there should be only one 3433 in row 1, however the code I have up there prints 3433 twice..must be because I messed up something :/ Commented Oct 30, 2014 at 9:40
  • yes it works perfectly--thanks so much! Now I am going through the code to understand the correct context-- thanks again. Commented Oct 30, 2014 at 9:54
  • that will be great :) Commented Oct 30, 2014 at 9:57
  • @pasaima I have added an explanation. Hope it is clear now :) Commented Oct 30, 2014 at 10:00

1 Answer 1

2

How about something like

awk '{if ( $3 in a ) a[$3] = a[$3]","$2; else a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Will produce output as

AT1G03730.1 242-260 252 156a,156b,156c,156d,156e
AT1G02860.1 258-278 269 827
AT1G01040.1 3422-3443 3433 162a,162b,162a,162b

Explanation

a[$3] = $3" "$5" "$6" "$2 creates an array a indexed by the third field $3, the else part ensures that the array is created when the row is encounterd for the first time.

if ( $3 in a ) a[$3] = a[$3]","$2 if the third field $3 is already present in the array, append field two $2 to the array

END{for (i in a) print a[i]} END block is excecuted at the end of input. Prints the entire array giving the output

EDIT

A simpler version would be

awk '{( $3 in a ) ? a[$3] = a[$3]","$2 : a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Thank you Jotne for the suggestion.

Sign up to request clarification or add additional context in comments.

3 Comments

Some golfing of your code: awk '{a[$3]=($3 in a)?a[$3]","$2:$3FS$5FS$6FS$2} END {for (i in a) print a[i]}' file
You are welcome. You could also remove the parentheses, but this will make it harder to read, so best to leave it there.
@Jotne OP just started using awk, we dont want to make harder for him :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.