I have two files, and I want to replace values in fileA with values from fileB whenever a match exists.

The idea is to process fileA line by line and check whether the "gene_id" value (column #3) appears in column #1 of fileB.

In the first line of fileA, the value is found in fileB, so "id1.2" (column #3 of fileA) is replaced by "ND1" (column #3 of fileB). In the second line of fileA, the value is not found in fileB, so the line is left unchanged.

An added difficulty is that the two files do not use exactly the same key: fileA carries a ".2" suffix, and only the part before it has to match the key in fileB (e.g. "id1" in fileB vs. "id1.2" in fileA).

Original files:

> cat fileA.txt
chr1    gene_id "id1.2";
chr1    gene_id "id2.2";

> cat fileB.txt
id1 protein_coding  ND1 MT  

Desired output (take the value in column #3 of fileB and, if there is a match, put it in column #3 of fileA):

> cat fileA.txt
chr1    gene_id "ND1";
chr1    gene_id "id2.2";

I tried something inspired by this post, but it doesn't work (I'm not sure I really understand this awk line, as it's the first time I've used this syntax):

awk -F ' ' 'NR==FNR{a[$1]=$3;next}{$3=a[$3];}1' fileB.txt fileA.txt

Any help would be more than welcome.

  • Not clear. Please explain how to get your expected output and add more details to your post. Commented Sep 9, 2019 at 16:32
  • I added some more explanations. Is it better? Commented Sep 9, 2019 at 16:41
  • Thanks for adding more info, but it is still not clear how you are comparing the values. The values from file1 and file2 don't seem to be equal. Commented Sep 9, 2019 at 16:49
  • And please edit your question to include the code that you thought should work. Then we can help correct your understanding of how these tools work. Good luck. Commented Sep 9, 2019 at 16:52
  • I simplified the question and added some code and some more explanations. I hope it's better now. Commented Sep 9, 2019 at 17:18

2 Answers

3

Could you please try the following? It is based on your samples only, so change the column numbers to match your real input files.

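awk -v s1="\"" '
# First file (fileB.txt): map the gene ID in column 1 to the name in column 3.
FNR==NR{
   a[$1]=$3
   next
}
# Second file (fileA.txt): strip the quotes, the semicolon and the version suffix.
{
   val=$3
   gsub(/"|;|\..*/,"",val)
}
# Replace column 3 only when the stripped ID is a key of the array.
(val in a){
   $3=s1 a[val] s1";"
}
1
' fileB.txt fileA.txt |
   column -t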
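
The one-liner in the question fails for two reasons: column 3 of fileA is "id1.2"; (quotes, suffix and semicolon included), which is never a key of the array built from fileB, and $3=a[$3] assigns unconditionally, so every unmatched line ends up with an empty column 3. As a small sketch of what the clean-up step does, the snippet below prints the key before and after stripping, using a pattern equivalent to the one in the gsub call above. The variable names line, raw_key and clean_key are only illustrative.

```shell
# Show the lookup key awk sees in fileA before and after the clean-up.
# Assumption: the sample line is copied from fileA.txt in the question.
line='chr1    gene_id "id1.2";'

# Column 3 exactly as awk splits it with the default field separator.
raw_key=$(printf '%s\n' "$line" | awk '{ print $3 }')

# Same column after removing the quotes, the semicolon and everything from the first dot.
clean_key=$(printf '%s\n' "$line" | awk '{ val = $3; gsub(/"|;|\..*/, "", val); print val }')

printf 'raw:   %s\n' "$raw_key"
printf 'clean: %s\n' "$clean_key"
```

Only the cleaned key can match column 1 of fileB, which is why the script tests (val in a) before touching column 3.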

0

A few months later, I worked out another option that is a bit easier to follow for people who are not used to awk. I'm sharing it here in case it helps someone:

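BEGIN {
    FS = "\t"
    # Load fileB into geneTable: gene ID (column 1) -> gene name (column 3).
    # Comparing with > 0 ends the loop if fileB cannot be read.
    while ((getline < fileB) > 0) {
        geneTable[$1] = $3
    }
    close(fileB)
}
{
    # Take the text that follows gene_id, then keep the part before the first dot.
    split($0, parts, "gene_id \"")
    split(parts[2], geneID, ".")

    # Rewrite the gene_id field only when the ID is known.
    if (geneID[1] in geneTable) {
        $2 = "gene_id \"" geneTable[geneID[1]] "\";"
    }
    print $0
}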

It is easiest to store this script in a separate file, called cmd.awk here. To run the script:

awk -v fileB="fileB.txt" -f cmd.awk fileA.txt | column -t
  • The BEGIN block reads fileB.txt and stores its column 1 -> column 3 mapping in the array geneTable. Because of FS="\t", both files are expected to be tab-separated.
  • The two split calls extract the ID that follows "gene_id" in fileA.txt, without the ".2" suffix.
  • The if block replaces the value in fileA.txt when the ID is found in geneTable (i.e. in fileB.txt).
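
To check these three steps against the sample data from the question, a throwaway run could look like the sketch below. It assumes that both files are tab-separated, writes them to a temporary directory, and inlines the script instead of using cmd.awk. The names tmp, result and parts are only illustrative, and OFS is set to a tab here so that the output keeps the input separator.

```shell
# Throwaway check of the three steps on the sample data from the question.
# Assumption: both files are tab-separated, as FS="\t" in the script requires.
tmp=$(mktemp -d)
printf 'chr1\tgene_id "id1.2";\nchr1\tgene_id "id2.2";\n' > "$tmp/fileA.txt"
printf 'id1\tprotein_coding\tND1\tMT\n' > "$tmp/fileB.txt"

result=$(awk -v fileB="$tmp/fileB.txt" '
BEGIN {
    FS = OFS = "\t"
    # Step 1: load fileB into the lookup table (ID -> gene name).
    while ((getline < fileB) > 0) geneTable[$1] = $3
    close(fileB)
}
{
    # Step 2: isolate the ID that follows gene_id and precedes the first dot.
    split($0, parts, "gene_id \"")
    split(parts[2], geneID, ".")
    # Step 3: replace the field only when the ID is present in fileB.
    if (geneID[1] in geneTable) $2 = "gene_id \"" geneTable[geneID[1]] "\";"
    print
}' "$tmp/fileA.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The first line should come back with "ND1" in place of "id1.2", while the second line, whose ID is not in fileB, should be printed unchanged.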
