combining multiple column values in lines based on identical column

Question

I need help to improve my working code for combining multiple rows based on identical values for specified columns. Here is a sample data:

c-i1_pos-at1-v2 162a    AT1G01040.1 2   3422-3443   3433    1
c-i1_pos-at1-v2 162b    AT1G01040.1 2   3422-3443   3433    1
pare-i_226-v2-wt    162a    AT1G01040.1 2   3422-3443   3433    0
pare-i_226-v2-wt    162b    AT1G01040.1 2   3422-3443   3433    0
xrn4-pare-i_ath-227-v2-wt   827 AT1G02860.1 1   258-278 269 1
i2_lib2-v2  156a    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156b    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156c    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156d    AT1G03730.1 4   242-260 252 3
i2_lib2-v2  156e    AT1G03730.1 4   242-260 252 3

Basically if values in columns $3,$5 are identical, I want to combine rows for columns $2,$6 (or more), with the unique values of the remaining columns merged like this:

AT1G01040.1 3422-3443   3433    162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

Right now I am trying to do this in multiple steps, based on the answers here.

awk 'BEGIN{FS=OFS="\t"} {c=$2 FS $3 FS $5; if (c in a) a[c]=a[c]","$6; else a[c]=$6}END{for (k in a) print k,a[k]}'|awk '{p=$1 FS $2 FS $4; if (p in l) l[p]=l[p]","$3;else l[p]=$3}END{for (m in l) print m,l[m]}' <input.txt

Which gives:

AT1G01040.1 3422-3443   3433,3433   162a,162b
AT1G02860.1 258-278 269 827
AT1G03730.1 242-260 252 156a,156b,156c,156d,156e

I thought I should put the values in the remaining columns as arrays to get my desired output at one step, but I am struggling to figure out the correct context.

Ideally there should be only one 3433 in row 1, however the code I have up there prints 3433 twice..must be because I messed up something :/ — psaima
– psaima, Commented Oct 30, 2014 at 9:40
yes it works perfectly--thanks so much! Now I am going through the code to understand the correct context-- thanks again. — psaima
– psaima, Commented Oct 30, 2014 at 9:54
@pasaima I have added an explanation. Hope it is clear now :) — nu11p01n73R
– nu11p01n73R, Commented Oct 30, 2014 at 10:00

nu11p01n73R · Accepted Answer · 2014-10-30 11:17:58Z

2

How about something like

awk '{if ( $3 in a ) a[$3] = a[$3]","$2; else a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Will produce output as

AT1G03730.1 242-260 252 156a,156b,156c,156d,156e
AT1G02860.1 258-278 269 827
AT1G01040.1 3422-3443 3433 162a,162b,162a,162b

Explanation

a[$3] = $3" "$5" "$6" "$2 creates an array a indexed by the third field $3, the else part ensures that the array is created when the row is encounterd for the first time.

if ( $3 in a ) a[$3] = a[$3]","$2 if the third field $3 is already present in the array, append field two $2 to the array

END{for (i in a) print a[i]} END block is excecuted at the end of input. Prints the entire array giving the output

EDIT

A simpler version would be

awk '{( $3 in a ) ? a[$3] = a[$3]","$2 : a[$3] = $3" "$5" "$6" "$2} END{for (i in a) print a[i]}' inputFile

Thank you Jotne for the suggestion.

edited Oct 30, 2014 at 11:17

answered Oct 30, 2014 at 9:35

nu11p01n73R

26.8k3 gold badges42 silver badges52 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Jotne Over a year ago

Some golfing of your code: awk '{a[$3]=($3 in a)?a[$3]","$2:$3FS$5FS$6FS$2} END {for (i in a) print a[i]}' file

Jotne Over a year ago

You are welcome. You could also remove the parentheses, but this will make it harder to read, so best to leave it there.

nu11p01n73R Over a year ago

@Jotne OP just started using awk, we dont want to make harder for him :)

Collectives™ on Stack Overflow

combining multiple column values in lines based on identical column

1 Answer 1

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related