Sort array with multiple lines using another ordered array pattern in bash with awk

Question

This follows up on my previous post, which didn't fully answer my question. I would like to know how I can sort the lines of my array a by a tag code, using the order given by array b.

I have an array a that contains the following lines:

rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff

and another ordered array b that has the following lines:

NC
NG
NM
NP
NR
XM
XP
XR
WP

I would like to order the lines in array a so that the tag in column 5 follows the order of array b, to obtain the desired output:

rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff

The following command has been proposed in my previous post:

awk -v OFS='\t' '
FNR==NR{
  split($5,a,"_")
  array[a[1]]=$0
  next
}
($1 in array) {
  print array[$0]
  b[$1]
}
END{
  for(i in b){
    delete array[i]
  }
  for(j in array){
    print array[j]
  }
}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")

but it prints:

rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff

As you can see, some of the lines containing NM and NC are missing. Could you please tell me how to update this command to produce the desired output?

Thanks in advance.

When you say b is ordered, what do you mean? Your example appears to be sorted alphabetically, except that WP is not in the right place.
– jhnc, Commented Feb 11, 2019 at 3:47
Do you care about the order of b, or do you just want a and b to match?
– jhnc, Commented Feb 11, 2019 at 3:48
@jhnc What I meant is that array b follows a fixed order, not an alphabetical one. It just happens to look alphabetical here, but it could be otherwise. These are called associative arrays in bash; check this link. So the order of b must not change, and the lines of array a have to follow the order of array b. Thanks!
– user324810, Commented Feb 11, 2019 at 6:07

RavinderSingh13 · Accepted Answer · 2019-02-11 03:47:35Z

2

Could you please try the following? I have changed the solution a bit, because it was not clear earlier that you want to print ALL the lines for a given tag (for example, NC) from array a. The new logic keeps appending each line to the entry for its tag (NC, NM, and so on), and when that tag is read from array b, it prints all the lines stored for it (from array a).

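awk -v OFS='\t' '
FNR==NR{                      # first input: the lines of array a
  split($5,a,"_")             # a[1] is the tag, e.g. NC or NM
  array[a[1]]=(array[a[1]]?array[a[1]] ORS $0:$0)   # append lines sharing a tag
  next
}
($1 in array) {               # second input: the tags of array b, in order
  print array[$1]
  delete array[$1]
}
END{
  for(j in array){            # leftover lines; for-in order is unspecified
   if(array[j]){ print array[j] }
  }
}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")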

edited Feb 11, 2019 at 3:47

answered Feb 11, 2019 at 3:39

RavinderSingh13

1 Comment

user324810 Over a year ago

Thanks again for the answer! The code is clear once again! :)

karakfa · Accepted Answer · 2019-02-11 04:03:38Z

2

Here is a solution using awk with sort:

$ awk 'NR==FNR{a[$1]=NR; next}            # first input (b): tag -> rank
          {k=substr($5,1,2);              # first two characters of column 5
           print (k in a)?a[k]:99,NR "\t" $0}' <(printf '%s\n' "${b[@]}") <(printf '%s\n' "${a[@]}") |
  sort -k1,1n -k2,2n | cut -f2-

rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff
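
For readers wondering what the pipeline does: it can be read as a decorate-sort-undecorate pattern. The sketch below splits it into separate steps on made-up sample data, so the intermediate "decorated" lines can be inspected. The file names under `$tmp`, the variable names, and the sample lines are assumptions for illustration; temporary files stand in for the process substitutions so the snippet also runs in a plain POSIX shell.

```shell
# Hypothetical sample data: an order list and four input lines.
tmp=$(mktemp -d)
printf '%s\n' NC NM XM > "$tmp/order"
printf '%s\n' 'r1 p C 1 XM_3.1 s' 'r1 p C 2 ENST0001 s' \
              'r1 p C 3 NC_1.1 s' 'r1 p C 4 NC_2.1 s' > "$tmp/lines"

# Step 1 (decorate): prefix each line with "<rank> <line number><TAB>".
# Tags that are not in the order list get the fallback rank 99.
decorated=$(awk 'NR==FNR { rank[$1] = NR; next }
                 { k = substr($5, 1, 2)
                   r = (k in rank) ? rank[k] : 99
                   print r, NR "\t" $0 }' "$tmp/order" "$tmp/lines")

# Step 2 (sort): order by rank, then by line number to keep input order.
# Step 3 (undecorate): cut away everything up to the first TAB.
sorted=$(printf '%s\n' "$decorated" | sort -k1,1n -k2,2n | cut -f2-)

printf '%s\n' "$decorated" '--' "$sorted"
rm -rf "$tmp"
```

The fallback rank 99 only works while the order list has fewer than 99 entries; a longer list would need a larger sentinel.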

answered Feb 11, 2019 at 4:03

karakfa


2 Comments

user324810 Over a year ago

@karakfa thank you for the answer! Could you please explain what the awk does?

user324810 Over a year ago

I would also ask that you please keep this post, because other users may find it useful for their own questions, given how simple it is as a one-liner!

NeronLeVelu · Accepted Answer · 2019-02-11 10:56:56Z

1

You can try this awk. It is memory dependent (a problem with huge files), because it loads not only the dictionary but also the full file into a temporary array. It needs GNU awk for asort.

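awk 'FNR==NR{ Dct[$1] = Idx++; next }   # first file: tag -> rank (0-based)
   {
   Ctg = $5; sub( /_.*/, "", Ctg )      # tag = column 5 up to the first "_"
   Indice = ( Ctg in Dct ) ? Dct[Ctg] : Idx   # unknown tags rank last
   # zero-pad the rank so the string sort done by asort matches numeric order
   Lines[Ln++] = sprintf( "%05d", Indice ) " " $0
   }

   END {
     Cnt = asort( Lines )               # GNU awk; the result is indexed 1..Cnt
     for( Idx=1; Idx<=Cnt; Idx++) {
        Temp = Lines[Idx]
        sub( /^[^ ]* /, "", Temp)       # strip the rank prefix
        print Temp
        }
     }
   ' Array.B Array.A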

Same principle as @karakfa's answer, but using awk only.

answered Feb 11, 2019 at 10:56

NeronLeVelu
