6

I want to replace every header line (those starting with >) with >{filename} in all *.fasta files in my directory, and then concatenate the files.

Contents of my directory:

speciesA.fasta
speciesB.fasta
speciesC.fasta

Example file, speciesA.fasta:

>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

My desired output (shown only for speciesA.fasta):

>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL

This is my code:

for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done

but all I get is

>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF

[and so on ...]

Where did I make a mistake?

2 Answers

6

The bash loop is superfluous. Try:

awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta

This approach is safe even if the file names contain special or regex-active characters.

How it works

  • /^>/ {print ">" substr(FILENAME, 1, length(FILENAME)-6); next}

    For any line that begins with >, the commands in curly braces are executed. The first command prints > followed by the file name minus its last 6 characters (the .fasta extension). The second command, next, skips the remaining awk commands for the current line and starts over with the next input line.

  • 1

    This is awk's cryptic shorthand for print-the-line.
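The same idea can be sketched without hard-coding the length of the extension. This is only an illustration: the function name relabel_fasta is hypothetical, and it assumes the label should be the file name with any leading directory and the last extension removed.

```shell
# Hypothetical helper: relabel every FASTA header with its file's base name.
relabel_fasta() {
    awk '
        FNR == 1 {                      # first line of each input file
            name = FILENAME
            sub(/^.*\//, "", name)      # drop any leading directory
            sub(/\.[^.]*$/, "", name)   # drop the last extension
        }
        /^>/ { print ">" name; next }   # header line: print the new label
        { print }                       # any other line: print unchanged
    ' "$@"
}
```

Calling relabel_fasta *.fasta prints the relabelled records of all files one after another, just like the one-liner above.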

Example

Let's consider a directory with two (identical) test files:

$ cat speciesA.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL
$ cat speciesB.fasta
>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

The output of our command is:

$ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL
>speciesB
MJSUNDKFJSKFJSKFJ
>speciesB
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesB
KSDAFJLASDJFKLAJFL

The output has the substitutions and concatenates all the input files.
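To keep the result, the output can be redirected to a file. A minimal sketch follows; the function name combine_fasta and the output name combined.fa are assumptions for illustration. Giving the output a name that does not match *.fasta avoids it being picked up as an input on a later run.

```shell
# Hypothetical helper: write the relabelled, concatenated records to a file.
# Usage: combine_fasta OUTPUT INPUT...
combine_fasta() {
    out=$1      # first argument: output file
    shift       # remaining arguments: input .fasta files
    awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' "$@" > "$out"
}

# Example call from inside the directory holding the files
# (combined.fa is an assumed name):
#   combine_fasta combined.fa *.fasta
```

Because the label is cut from FILENAME by length, this sketch expects bare file names ending in .fasta, as in the example above.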

2

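In sed you need to use double quotes for variable expansion. Inside single quotes, the shell passes $var to sed as literal text. The assignment also has to be a command of its own rather than one side of a pipe, because a variable set inside a pipeline is lost once that part of the pipeline ends. Note that -i edits each file in place, so the files still have to be concatenated afterwards.

for file in *.fasta
do
    sed -i "s/^>.*/>${file%%.*}/" "$file"
done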
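The quoting difference can be seen in isolation with a small experiment. This is a sketch only; the variable names single and double are made up for the demonstration, and the input line is taken from the question.

```shell
var=speciesA

# Single quotes: the shell does not expand $var, so sed receives it literally.
single=$(printf '>protein1 description\n' | sed 's/^>.*/>$var/')

# Double quotes: the shell expands $var before sed runs.
double=$(printf '>protein1 description\n' | sed "s/^>.*/>$var/")

printf '%s\n%s\n' "$single" "$double"
# prints:
# >$var
# >speciesA
```

The first result is the same >$var that appears in the question's output.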

1 Comment

For some reason I had to modify this to work in zsh and retain the ">":

for file in *.fasta; do tag=">"${file%%.*}; sed -i "s/>.*/$tag/" "$file"; done
