3

Following up on my previous post, which did not fully answer my question: I would like to know how I can sort my array a, which contains multiple lines for a given tag code, by the order of the tag codes in array b.

I have an array a that contains the following lines:

rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff

and another ordered array b that has the following lines:

NC
NG
NM
NP
NR
XM
XP
XR
WP

I would like to order the lines in array a to match the order of array b on column 5, to obtain the desired output:

rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff

The following command has been proposed in my previous post:

awk -v OFS='\t' '
FNR==NR{
  split($5,a,"_")
  array[a[1]]=$0
  next
}
($1 in array) {
  print array[$0]
  b[$1]
}
END{
  for(i in b){
    delete array[i]
  }
  for(j in array){
    print array[j]
  }
}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")

but it prints:

rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff

As you can see, some of the lines containing NM and NC are missing. Could you please tell me how I can update this command to output the desired result?

Thanks in advance.

3
  • When you say b is ordered, what do you mean? Your example appears to be sorted alphabetically, except that WP is not in the right place. Commented Feb 11, 2019 at 3:47
  • Do you care about the order of b, or do you just want a and b to match? Commented Feb 11, 2019 at 3:48
  • @jhnc What I meant is that array b is sorted according to a fixed pattern, not alphabetically. It just happens to look this way, but it could be the other way around; these are called associative arrays in bash (check this link). So the order of b must not change, and the lines of array a have to match the order of array b. Thanks! Commented Feb 11, 2019 at 6:07

3 Answers

2

Could you please try the following? I have changed the solution a bit, because it was not clear before that you want to print ALL lines from array a for a given prefix, for example NC. The new logic keeps appending the lines for each prefix (such as NC) to a single entry, and when that prefix is found in array b, it prints all of its lines from array a.

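awk -v OFS='\t' '
FNR==NR{                        # first input: lines of array a
  split($5,a,"_")               # a[1] is the prefix of column 5, e.g. NC
  # append each line to its prefix group instead of overwriting it
  array[a[1]]=(array[a[1]]?array[a[1]] ORS $0:$0)
  next
}
($1 in array) {                 # second input: prefixes of array b, in order
  print array[$1]
  delete array[$1]              # printed groups are removed
}
END{                            # whatever is left matched no prefix in b
  for(j in array){
   if(array[j]){ print array[j] }
  }
}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")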

1 Comment

Thanks again for the answer! The code is clear again! :)
2

Here is a solution using awk with sort:

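$ awk 'NR==FNR{a[$1]=NR; next}          # first input (b): prefix -> rank
          {k=substr($5,1,2);            # two-letter prefix of column 5
           # unknown prefixes get rank 99 (assumes b has fewer than 99 entries)
           print ((k in a)?a[k]:99), NR "\t" $0}' <(printf '%s\n' "${b[@]}") <(printf '%s\n' "${a[@]}") |
  sort -k1,1n -k2,2n | cut -f2-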

rs6605071   chr1:962943 C   339451  NC_006462594.2  stuff
rs6605071   chr1:962943 C   334324  NC_006462632.2  stuff
rs6605071   chr1:962943 C   84069   NM_001160184.1  stuff
rs6605071   chr1:962943 C   84333   NM_004353462.1  stuff
rs6605071   chr1:962943 C   339451  XM_006710600.3  stuff
rs6605071   chr1:962943 C   339451  XR_001737138.1  stuff
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214 stuff
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660 stuff
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144 stuff
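
The pipeline above follows a decorate, sort, undecorate pattern, and it can be easier to follow when each stage is run on its own. The sketch below is illustrative only: the sample files, the temporary directory and the function name decorate are assumptions made for this example, it numbers lines with FNR, and it uses explicit numeric sort keys so that ties on rank keep the input order.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Made-up inputs: b.txt holds the prefix order, a.txt the lines to reorder.
printf '%s\n' NC NM XM > "$tmp/b.txt"
printf '%s\n' 'x 1 C 10 NM_2.1 s' 'x 1 C 20 ENST0001 s' 'x 1 C 30 NC_3.1 s' > "$tmp/a.txt"

# Stage 1 (decorate): prefix each line with "<rank> <line number><TAB>".
decorate() {
  awk 'NR == FNR { rank[$1] = NR; next }      # first file: prefix -> rank
       { k = substr($5, 1, 2)                 # two-letter prefix of column 5
         r = (k in rank) ? rank[k] : 99       # unknown prefixes sort last
         print r " " FNR "\t" $0 }' "$tmp/b.txt" "$tmp/a.txt"
}

decorate                                      # decorated lines, input order
decorate | sort -k1,1n -k2,2n                 # stage 2: sort by rank, then line number
decorate | sort -k1,1n -k2,2n | cut -f2-      # stage 3: drop the decoration
```

The tab between the decoration and the original line is what lets cut -f2- remove the sort keys again without touching the line itself.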

2 Comments

@karakfa Thank you for the answer! Could you please explain what the awk command does?
Please also keep this answer, because other users may find it useful: it is a simple one-liner that may answer their questions too!
1

You can try this awk script. It is memory-dependent, which is a problem on huge files, because it loads not only the dictionary but also the whole file into a temporary array. It needs GNU awk for asort.

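awk 'FNR==NR{ Dct[$1] = Idx++; next }         # Array.B: prefix -> rank (0, 1, 2, ...)
   {
   Ctg = $5; sub( /_.*/, "", Ctg )
   Indice = ( Ctg in Dct ) ? Dct[Ctg] : Idx   # unknown prefixes rank last
   # zero-padded rank and line number, so the string sort of asort is numeric
   Ln++
   Lines[Ln] = sprintf( "%09d %09d ", Indice, Ln ) $0
   }

   END {
     asort( Lines )                           # gawk: result is indexed from 1 to Ln
     for( i=1; i<=Ln; i++) {
        Temp = Lines[i]
        sub( /^[^ ]* [^ ]* /, "", Temp)       # strip the two sort keys
        print Temp
        }
     }
   ' Array.B Array.A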

Same principle as @karakfa's answer, but using only awk.
