
I have an array a that has the following lines:

rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   339451  XM_006710600.3

and another ordered array b that has the following lines:

NC
NG
NM
NP
NR
XM
XP
XR
WP

I would like to order the lines in array a so that the prefix of column 5 (the part before the underscore) follows the order of array b, to obtain the desired output:

rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   339451  XM_006710600.3
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660

I tried the following command, which is meant to split column 5, but it prints blank lines:

awk -F '\t' -v OFS='\t' 'FNR==NR{split(a[$5],t,"_"); t[1]=$0;next}
{print a[$1]}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")

Could you please tell me why my command is not working? Would a partial match by regex work?

EDIT 1: array a can contain several lines with the same code from array b:

rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   334324  NC_006462632.2
rs6605071   chr1:962943 C   84333   NM_004353462.1
rs6605071   chr1:962943 C   339451  XM_006710600.3

Expected output:

rs6605071   chr1:962943 C   334324  NC_006462632.2
rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   84333   NM_004353462.1
rs6605071   chr1:962943 C   339451  XM_006710600.3
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660
rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144

EDIT 2: Since the answer provided by RavinderSingh13 below did not fully answer my question, I am re-asking how to perform this task with AWK.

Thanks in advance.

  • Why did you put the lines where $5 does not contain a _ at the end of the desired output? Commented Feb 7, 2019 at 18:54
  • Can there be multiple NC_ lines in array a ? Commented Feb 7, 2019 at 19:20
  • @hek2mgl I used $5 because I thought it would take the 5th column, which contains strings like XR_001737138.1, and split them on _ ... Commented Feb 7, 2019 at 20:34
  • @anubhava yes there can be multiple NC lines in a Commented Feb 7, 2019 at 20:34
  • @Law Read my question carefully Commented Feb 7, 2019 at 20:41

1 Answer

I am assuming that you want to print the lines of array a whose column 5 prefix matches an entry of array b, in the order of array b, and then print the remaining non-matching lines of array a. If that is the case, the following may help.

Creating arrays here:

declare -a a=("rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   339451  XM_006710600.3")
declare -a b=("NC
NG
NM
NP
NR
XM
XP
XR
WP")

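Now run the following code:

awk -v OFS='\t' '
FNR==NR{                  # first input: the lines of array a
  split($5,a,"_")         # a[1] is the part of column 5 before "_"
  array[a[1]]=$0          # keyed by prefix; only the last line per prefix is kept
  next
}
($1 in array) {           # second input: the prefixes of array b, in order
  print array[$1]
  b[$1]                   # remember that this prefix was printed
}
END{
  for(i in b){            # drop the lines that were already printed
    delete array[i]
  }
  for(j in array){        # print the rest; for-in order is unspecified in awk
    print array[j]
  }
}' <(printf '%s\n' "${a[@]}") <(printf '%s\n' "${b[@]}")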

Output will be as follows.

rs6605071   chr1:962943 C   339451  NC_006462594.2
rs6605071   chr1:962943 C   84069   NM_001160184.1
rs6605071   chr1:962943 C   339451  XM_006710600.3
rs6605071   chr1:962943 C   339451  XR_001737138.1
rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214
rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660
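
On why the command in the question prints blank lines: `split(a[$5],t,"_")` splits the (empty) array element `a[$5]` rather than the field `$5`, and `t[1]=$0` only overwrites a temporary. Nothing is ever stored in `a` under a prefix such as `NC`, so `print a[$1]` prints an empty string for every line of b. In addition, with `-F '\t'` the field `$5` is empty if the lines are actually separated by spaces.

For the case raised in EDIT 1 and in the comments, where several lines share one prefix, a minimal sketch is to read b first, record the position of each prefix, and append every line of a to a bucket for that position. The names `rank`, `lines`, and `sort_by_prefix` are illustrative, and I am assuming that lines within one prefix may stay in their input order and that non-matching lines go last. The expected output in EDIT 1 lists the two NC lines in a different order, so an extra sort key would be needed if that order matters.

```shell
#!/usr/bin/env bash
# Sample data from EDIT 1, one array element per line.
a=(
  "rs6605071   chr1:962943 C   ENSG00000188976 ENST00000487214"
  "rs6605071   chr1:962943 C   ENSG00000187961 ENST00000622660"
  "rs6605071   chr1:962943 C   84069   NM_001160184.1"
  "rs6605071   chr1:962943 C   339451  NC_006462594.2"
  "rs6605071   chr1:962943 C   ENSG00000135234 ENST00000624144"
  "rs6605071   chr1:962943 C   339451  XR_001737138.1"
  "rs6605071   chr1:962943 C   334324  NC_006462632.2"
  "rs6605071   chr1:962943 C   84333   NM_004353462.1"
  "rs6605071   chr1:962943 C   339451  XM_006710600.3"
)
b=(NC NG NM NP NR XM XP XR WP)

# Hypothetical helper: prints the lines of a grouped in the order of b.
sort_by_prefix() {
  awk '
  FNR == NR {              # first input: prefixes from array b
    rank[$1] = FNR         # position of each prefix in b
    n = FNR                # number of prefixes read so far
    next
  }
  {                        # second input: lines from array a
    split($5, t, "_")      # t[1] is the text before the first "_"
    r = (t[1] in rank) ? rank[t[1]] : n + 1   # unmatched lines go last
    lines[r] = lines[r] $0 ORS                # append, so duplicates are kept
  }
  END {
    for (r = 1; r <= n + 1; r++)   # numeric loop gives a defined order
      printf "%s", lines[r]
  }' <(printf '%s\n' "${b[@]}") <(printf '%s\n' "${a[@]}")
}

sort_by_prefix
```

Default field splitting is used here, so the same code works whether the columns are separated by tabs or by runs of spaces. The numeric loop in `END` avoids `for (k in array)`, whose iteration order awk does not define.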

4 Comments

@RavinderSingh13 when using this code, it outputs the correct order but only one line per code, for example only one NM line from array a. I only gave a small subset, but there can be multiple lines with the same code from array b, like so: rs6605071 chr1:962943 C 84069 NM_001160184.1 and rs6605071 chr1:962943 C 84062 NM_004420144.5
I am riding a bike right now. If your actual input file is not the same as the samples shown, then please add correct samples. I will look into it once I am at a computer.
@RavinderSingh13 yes I will edit my post, it's slightly different since I only wanted to keep it as simple as possible
Bump on this post to have some feedback on the newly edited code ... sorry
