bash loop to replace middle of string after a certain character

Question

I have 120 files (genomes.faa) that all have a header line before each sequence:

>GENOME1_00001 HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1_00002 HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1_00003 HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

etc.

I am trying to remove the "_0000X " after the name and replace it with a "|"

>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

etc.

I have tried doing this:

for file in *.faa
do
sed -r 's/_.*$/|/g' $file > $file.1
done

This does not keep the 'HYPOTHETICAL PROTEIN A' afterwards, resulting in

>ERR1156171|
MMRQSVQTVLP

instead of

>ERR1156171|HYPOTHETICAL PROTEIN A
MMRQSVQTVLP

Any help is appreciated!

.* matches everything up to the end of the line; you want [0-9]* or [^ ]* instead.
– pLumo, Commented Jul 20, 2022 at 13:09
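
The difference the comment describes can be checked on a single header line. This is only a sketch: the sample header is copied from the question, and the variable names are made up for illustration.

```shell
#!/bin/bash
# Sample header taken from the question.
header='>GENOME1_00001 HYPOTHETICAL PROTEIN A'

# Greedy: _.*$ matches from the first underscore to the end of the line,
# so the description is swallowed along with the number.
greedy=$(printf '%s\n' "$header" | sed -E 's/_.*$/|/')

# Restricted: _[0-9]+ followed by one space matches only the numeric tag.
restricted=$(printf '%s\n' "$header" | sed -E 's/_[0-9]+ /|/')

printf '%s\n' "$greedy" "$restricted"
```

The first printed line has lost the description; the second keeps it.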

Sotto Voce (edited by ilkkachu) · Accepted Answer · 2022-07-20 15:01:00Z

I think you were very close to a working command. This worked for me on the few examples you gave:

sed -E 's/_[0-9]+ /|/' "$file" > "$file.1"

I changed the match expression from _.* to _[0-9]+ to limit the match to only the underscore, numeric digits, and space character.
I removed the $ because that matches at the end of the line, not the end of the first word.
I changed the end of the substitute command from /g to / because your examples have only one place in each line that needs editing, rather than multiple places.
Also, prefer -E to -r for extended regular expressions, since -E is supported by more versions of sed, and quote the variable expansions in case any filenames contain whitespace or special characters.
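
Put back into the loop from the question, the command might look like the sketch below. The function name rewrite_headers and the throwaway demo directory are assumptions made for illustration, and the /^>/ address is an optional extra (not part of the answer) that limits the substitution to header lines.

```shell
#!/bin/bash
# rewrite_headers DIR (hypothetical helper): for every DIR/*.faa,
# write an edited copy to DIR/NAME.faa.1, as the question's loop does.
rewrite_headers() {
    local dir=$1 file
    for file in "$dir"/*.faa; do
        [ -e "$file" ] || continue   # the glob matched nothing
        # /^>/ restricts the edit to header lines (an added assumption).
        sed -E '/^>/ s/_[0-9]+ /|/' "$file" > "$file.1"
    done
}

# Demo on a temporary directory with one small sample file.
tmp=$(mktemp -d)
printf '%s\n' '>GENOME1_00001 HYPOTHETICAL PROTEIN A' 'NQFTIAQSQVGLEDALLDL' > "$tmp/genome1.faa"
rewrite_headers "$tmp"
cat "$tmp/genome1.faa.1"
```

The originals are left untouched, so the .1 copies can be inspected before anything is renamed over them.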

edited Jul 20, 2022 at 15:01 by ilkkachu
answered Jul 20, 2022 at 13:14 by Sotto Voce

Thank you for this really helpful comment! I understand everything you did and it has worked!
– Goodolgab, Commented Jul 21, 2022 at 8:36

Timur Shtatland · Accepted Answer · 2022-07-20 16:16:07Z

Use this Perl one-liner:

perl -pe 's{^(>\S+?)(_\d+)?\s+(.*)}{$1|$3}' "$file" > "$file.1"

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.

(...) : capture group, which can be referred to later as $1, $2, etc.
\S+? : one or more non-whitespace characters, non-greedy.
(_\d+)? : optional matched group that consists of underscore followed by 1 or more digits.
\s+ : 1 or more whitespace characters.
(.*) : any character, repeated 0 or more times.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)
perldoc perlre: Perl regular expressions (regexes): Quantifiers; Character Classes and other Special Escapes; Assertions; Capture groups
perldoc perlrequick: Perl regular expressions quick start

terdon · Accepted Answer · 2022-07-20 18:40:37Z

Here's a simple perl one-liner. On lines that begin with a > followed by one or more non-whitespace characters (\S), it finds the last _ in that first word, then removes the non-whitespace characters after that _ along with any whitespace that follows them:

$ perl -pe 's/^(>\S+)_\S+\s*/$1|/' file
>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

You can do the same basic thing with GNU sed as well:

$ sed -E 's/^(>\S+)_\S+\s*/\1|/' file
>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

And with any sed:

$ sed 's/^\(>[^[:blank:]]*\)_[^[:blank:]]*[[:blank:]]*/\1|/' file
>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL
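
One detail worth checking is what happens when the genome name itself contains an underscore. Because the first bracket expression is greedy, the match backtracks to the last underscore in the first word, so only the final tag is removed. A small sketch using the portable sed form: the function name strip_locus_tag and the second sample header are assumptions for illustration.

```shell
#!/bin/bash
# strip_locus_tag (hypothetical name): filter FASTA text from stdin,
# replacing the trailing _TAG of the first header word with "|".
strip_locus_tag() {
    sed 's/^\(>[^[:blank:]]*\)_[^[:blank:]]*[[:blank:]]*/\1|/'
}

one=$(printf '%s\n' '>GENOME1_00001 HYPOTHETICAL PROTEIN A' | strip_locus_tag)
two=$(printf '%s\n' '>GENOME_1_00001 HYPOTHETICAL PROTEIN A' | strip_locus_tag)
seq=$(printf '%s\n' 'NQFTIAQSQVGLEDALLDL' | strip_locus_tag)

printf '%s\n' "$one" "$two" "$seq"
```

Sequence lines do not start with >, so they pass through unchanged.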

jubilatious1 · Accepted Answer · 2022-07-21 11:59:40Z

Using Raku (formerly known as Perl_6)

raku -pe 's/^ (\>\S+) _ \S+\s* /$0|/;'

Above is very similar to the nice Perl answer posted by @terdon.

Note, you could try deleting the _00001 sequence (or similar digits) directly. The version below uses Raku's <(…)> capture markers. The whole pattern must still match, but everything outside the markers is dropped from the match before replacement, so only the _ \d+ \s+ part is replaced by | in the right half of the substitution operator:

raku -pe 's/^ \>\S+ <(_ \d+ \s+)> /|/;'

Sample Input:

>GENOME1_00001 HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1_00002 HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1_00003 HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

Sample Output:

>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

https://raku.org

Peter Rottengatter · Accepted Answer · 2022-07-21 17:22:24Z

I don't know why everybody gives code of some other language when you have specifically asked for bash.

Use bash's built-in parameter expansion for this; it is much faster than calling an external program like sed for every filename. For only a few names this does not matter much, but it can add up for a large number of files.

The code

#!/bin/bash

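# The sample strings stand in for the header lines.
for file in "GENOME1_00001 HYPOTHETICAL PROTEIN A" "GENOME1_00002 HYPOTHETICAL PROTEIN B" "GENOME1_00003 HYPOTHETICAL PROTEIN C"
do
    printf '%s' "$file"
    # ${file%_*} drops the last "_" and everything after it;
    # ${file##*EIN } keeps only the text after the last "EIN ".
    new_name="${file%_*}|HYPOTHETICAL PROTEIN ${file##*EIN }"
    echo " -> ${new_name}"
done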

which calls no external tools, yields the output

GENOME1_00001 HYPOTHETICAL PROTEIN A -> GENOME1|HYPOTHETICAL PROTEIN A
GENOME1_00002 HYPOTHETICAL PROTEIN B -> GENOME1|HYPOTHETICAL PROTEIN B
GENOME1_00003 HYPOTHETICAL PROTEIN C -> GENOME1|HYPOTHETICAL PROTEIN C

as you asked for.
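
The expansion above relies on the literal text "HYPOTHETICAL PROTEIN". A variant that splits on the first space instead might look like this sketch: rewrite_header is a hypothetical name, the second sample string is invented, and the code assumes every input contains at least one space.

```shell
#!/bin/bash
# rewrite_header (hypothetical helper): turn "NAME_TAG description"
# into "NAME|description" using only parameter expansion.
# Assumes the argument contains at least one space.
rewrite_header() {
    local header=$1
    local id=${header%% *}     # first word, e.g. ">GENOME1_00001"
    local desc=${header#* }    # everything after the first space
    printf '%s|%s\n' "${id%_*}" "$desc"   # ${id%_*} drops the last _TAG
}

rewrite_header '>GENOME1_00001 HYPOTHETICAL PROTEIN A'
rewrite_header 'GENOME_1_00003 putative kinase'
```

Splitting on the first space keeps whatever description follows, whether or not it mentions a hypothetical protein.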

As explained in the comment, I was assuming the '>' at the beginning of the line was some kind of prompt, and only those lines are to be converted. IMHO it's fairly trivial to modify the code to accommodate Sotto Voce's objection, but then again, maybe, it's not. Here is a version that deals with all lines, as Sotto Voce requests. Note I have converted the input data to a here-document, and, like before, for efficiency no external tools are called.

#!/bin/bash

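# IFS= and -r keep leading whitespace and backslashes intact.
while IFS= read -r line
do
    # Only header lines of the form ">GENOME1_..." are rewritten.
    if [ "${line%%GENOME1_*}" = ">" ]; then
        line="${line%_*}|HYPOTHETICAL PROTEIN ${line##*EIN }"
    fi
    printf '%s\n' "${line}"
done << etc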
>GENOME1_00001 HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1_00002 HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1_00003 HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL

etc

This is the output:

>GENOME1|HYPOTHETICAL PROTEIN A
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN B
NQFTIAQSQVGLEDALLDL

>GENOME1|HYPOTHETICAL PROTEIN C
NQFTIAQSQVGLEDALLDL
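
The loop above is tied to the literal strings "GENOME1_" and "HYPOTHETICAL PROTEIN". A sketch of a more general pure-bash filter follows; rewrite_fasta is a hypothetical name and the second record in the demo input is invented for illustration.

```shell
#!/bin/bash
# rewrite_fasta (hypothetical helper): read FASTA text on stdin and
# rewrite ">NAME_TAG description" headers to ">NAME|description".
rewrite_fasta() {
    local line id
    # "|| [ -n "$line" ]" also handles a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line == '>'*' '* ]]; then       # header with a description
            id=${line%% *}                     # first word of the header
            if [[ $id == *_* ]]; then          # it carries a _TAG
                line="${id%_*}|${line#* }"
            fi
        fi
        printf '%s\n' "$line"
    done
}

result=$(printf '%s\n' \
    '>GENOME1_00001 HYPOTHETICAL PROTEIN A' \
    'NQFTIAQSQVGLEDALLDL' \
    '' \
    '>GENOME2_00417 putative kinase' \
    'MMRQSVQTVLP' | rewrite_fasta)

printf '%s\n' "$result"
```

For real files it could be called as rewrite_fasta < "$file" > "$file.1" inside the question's for loop. Note that a shell read loop is usually slower than a single sed call on large files, so this form mainly suits small inputs or cases where external tools are unavailable.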

The reason I used sed is that it's the tool the original poster used in the question, and only a minor adjustment to the regexp was needed.
– Sotto Voce, Commented Jul 21, 2022 at 15:58
BTW, your code doesn't pass through the other lines in the question's example files: the lines without _00001 and the empty lines.
– Sotto Voce, Commented Jul 21, 2022 at 16:03
To be honest, I had not clearly understood what these lines meant. I actually thought they represented the contents of the files whose names the poster wanted to change.
– Peter Rottengatter, Commented Jul 21, 2022 at 16:25
I also prefer to do most of my text matching and modifications in bash rather than call external commands. I like your solution. Cheers!
– Sotto Voce, Commented Jul 21, 2022 at 18:23
The OP posted hypothetical protein sequences in FASTA format. Here's FASTA format as accepted by the BLAST program: blast.ncbi.nlm.nih.gov/…
– jubilatious1, Commented Oct 18, 2022 at 2:29
