
I have a FASTA file that looks like this:

>header1  
ATGC....  
>header2  
ATGC...

My list file looks like this:

organism1  
organism2

It contains the organism names that I want to replace the headers with.

I tried a for loop around a sed command, as follows:

for i in `cat list7b`; do sed "s/^>/$i/g" sequence.fa; done

but it didn't work. Please tell me how I can achieve this.

The result file should look like this:

>organism1  
ATGC...  
>organism2  
ATGC....

That is, >header1 is replaced with >organism1, and so on.

  1. Header lines are distinguished from the ATGC sequence lines by their first character: a header always starts with > (the greater-than sign), whereas a sequence line never does.
  2. The header lines should be replaced in order of appearance, i.e. the first header* line is replaced with the first line from the list file, the second header* line with the second line, and so on.

If possible, please also explain the logic. Thanks in advance.

  • Please edit your question and explain how you distinguish the header1, header2 etc. lines from the ATGC.... lines. I assume the two lines organism1 and organism2 are your file list7b. How do you define which organism* line shall replace which header* line? By a common trailing number, e.g. header 1 -> organism 1 etc.? Or by the order of appearance, i.e. first header* replaced with the first line from the file, 2nd header* with the 2nd line etc.? Commented Mar 13, 2020 at 13:46
  • @Bodo Thank you for your quick response. I have re-edited the question and I hope this would help you to understand the problem. Please feel free to ask if the edit is not sufficient, your time and efforts for helping me are highly appreciated. Commented Mar 13, 2020 at 14:56

1 Answer


With awk this is easy to do in a single pass.

Assuming your FASTA file is named sequence.fa and your organism list file is named list7b, as in the question, you can use:

awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' list7b sequence.fa > output.fa

Explanation:

NR == FNR is a condition that is true only while awk reads the first file. NR is the total number of records read so far and FNR is the number of records read from the current file, so the two are equal only until the second file starts.
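To see the two counters side by side, here is a small sketch. The file names first.txt and second.txt and their contents are made up for illustration and are not part of the question.

```shell
# Hypothetical demo files in a scratch directory (not from the question).
demo_dir=$(mktemp -d)
printf '%s\n' 'a' 'b' > "$demo_dir/first.txt"
printf '%s\n' 'c' 'd' 'e' > "$demo_dir/second.txt"

# Print NR (running total), FNR (per-file count) and the line itself.
result=$(awk '{ print NR, FNR, $0 }' "$demo_dir/first.txt" "$demo_dir/second.txt")
printf '%s\n' "$result"
# Prints:
# 1 1 a
# 2 2 b
# 3 1 c
# 4 2 d
# 5 3 e
```

NR and FNR match only on the lines of first.txt; FNR restarts at 1 when second.txt begins, so NR == FNR becomes false from there on.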

{ o[n++] = $0; next } stores the input line in array o, counts the entries in n, and skips all further processing of that line, so after the first file o contains all your organism lines.

The next part is executed for the remaining file(s).

/^>/ && i < n is true for lines that start with >, as long as i is less than the number of elements n that were put into array o. Once the list runs out, any remaining headers are left unchanged.

{ $0 = ">" o[i++] } replaces the current line with > followed by the array element (i.e. a line from the first file) and increments the index i to the next element.

1 is an "always true" condition with the implicit default action { print } to print the current line for every input line.
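As a quick check, the command can be tried on throwaway copies of the two files. This is only a sketch: the sequence lines ATGC and GGCC are placeholder data, and the files are written to a temporary directory rather than next to your real ones.

```shell
# Scratch directory so existing files are not overwritten.
demo_dir=$(mktemp -d)

# Placeholder inputs modelled on the question (sequence content is invented).
printf '%s\n' '>header1' 'ATGC' '>header2' 'GGCC' > "$demo_dir/sequence.fa"
printf '%s\n' 'organism1' 'organism2' > "$demo_dir/list7b"

# Same awk program as above: list file first, FASTA file second.
awk 'NR == FNR { o[n++] = $0; next } /^>/ && i < n { $0 = ">" o[i++] } 1' \
    "$demo_dir/list7b" "$demo_dir/sequence.fa" > "$demo_dir/output.fa"

cat "$demo_dir/output.fa"
# Prints:
# >organism1
# ATGC
# >organism2
# GGCC
```

The order of the two file arguments matters: the list file has to come first, because the NR == FNR test treats whichever file is read first as the list of names.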


1 Comment

Thank you very much @bodo. I appreciate your help and your precious time; it helped me a lot.
