Somehow I can't wrap my head around this. I have the following string:

>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA

I would like to use sed to remove the string between the 1st and 2nd occurrence of a space. Hence, in this case, PSBA_LEMMI should be removed. The string between the first two spaces does not contain any special characters.

So far I tried the following:

sed 's/\s.*\s/\s/'

But this removes everything up to the last space, resulting in: >sp.A9L976 TESTgene=psbA. I thought that by leaving out the g flag, sed would only match the first occurrence of the string. I also tried:

sed 's/(?<=\s).*(?=\s)//'

But this did not match / remove anything. Can someone help me out here? What am I missing?

2 Comments

Using awk, this is just awk '{$2 = ""} 1' file (commented Sep 22, 2021 at 14:47)
That is very elegant! Thx a lot! (commented Sep 23, 2021 at 8:15)

3 Answers

You can use

sed -E 's/\s+\S+\s+/ /'
sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /'

The two patterns are equivalent: each matches one or more whitespace characters, one or more non-whitespace characters, and then one or more whitespace characters. The difference is portability: \s and \S are GNU sed extensions, while the bracket-expression form is plain POSIX ERE.

Note that you cannot use \s as a whitespace char in the replacement part. \s is a regex pattern, and regex is used in the LHS (left-hand side) to search for whitespaces. So, a literal space is required to replace with a space.
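
As a quick check, the bracket-expression form can be run against the header from the question. This is only a minimal sketch; the variable names header and result are made up for illustration.

```shell
# Sample header taken from the question.
header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Without the g flag, only the first match on the line is replaced, so only
# the run " PSBA_LEMMI " collapses into a single literal space.
result=$(printf '%s\n' "$header" | sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /')

printf '%s\n' "$result"
```

The replacement side is a plain space character, which is why the rest of the line keeps its original spacing.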

Since you can also use an awk solution you may use

awk '{$2=""}1' file

Here, the lines ("records") are split into "fields" with whitespace (it is the default field separator), and the second field ($2) value is cleared with {$2 = ""} and the 1 forces awk to output the result (calling the default print command).
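
One side effect worth knowing: assigning to a field makes awk rebuild the record with the output field separator, so the emptied field leaves two adjacent spaces behind, and any other runs of whitespace on the line are collapsed to single spaces. The sketch below shows this, together with one possible way to drop the field without the leftover gap. The loop variant and the variable names are illustrations, not part of the answer above.

```shell
header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Clearing $2 keeps the separators on both sides of the now-empty field,
# so two spaces remain between the first and third fields.
cleared=$(printf '%s\n' "$header" | awk '{$2 = ""} 1')

# Alternative: rebuild the line from field 1 and fields 3..NF only.
rebuilt=$(printf '%s\n' "$header" | awk '{out = $1; for (i = 3; i <= NF; i++) out = out " " $i; print out}')

printf '%s\n' "$cleared" "$rebuilt"
```

For FASTA headers the extra space is usually harmless, but it matters if the output is later compared byte for byte.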

1 Comment

Great explanation. Thank you very much. I accepted your answer as it solved my problem via sed and provided me with useful insights.

To edit the header of the fasta file as you specify, use this Perl one-liner:

echo '>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA' | perl -lpe 's{^(>\S+\s+)\S+\s+}{$1}'

Prints:

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA

Note that it changes the fasta headers only, keeping the sequence intact even in the relatively rare cases when the sequence has whitespace. This is important in bioinformatics applications:

printf '%s\n' '>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA' 'ACTG ACTG ACTG' | perl -pe 's{^(>\S+\s+)\S+\s+}{$1}'

Prints:

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
ACTG ACTG ACTG

To edit the file in place:

perl -i.bak -lpe 's{^(>\S+\s+)\S+\s+}{$1}' in_file.fasta

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.

Here,
^ : beginning of the line.
> : literal "greater than" character, which marks the beginning of the header in fasta format specifications.
\S+ : 1 or more non-whitespace characters.
\s+ : 1 or more whitespace characters.
$1 : 1st captured pattern. Capture occurs using parentheses: (...).
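
The same header-only restriction can be sketched in sed by putting an address in front of the substitution. This is a rough equivalent offered as an assumption, not something from the answer itself; the variable names input and fixed are hypothetical.

```shell
# Two-line FASTA-like input: a header and a sequence line containing spaces.
input='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
ACTG ACTG ACTG'

# The /^>/ address applies the substitution to header lines only;
# \1 puts back the captured ">id " prefix, so only the second word is dropped.
fixed=$(printf '%s\n' "$input" | sed -E '/^>/ s/^(>[^[:space:]]+[[:space:]]+)[^[:space:]]+[[:space:]]+/\1/')

printf '%s\n' "$fixed"
```

Sequence lines never match the address, so their whitespace is left untouched.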

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

2 Comments

In such scenarios, the (.*) in the regex and $2 in the replacement pattern can be both removed. We needn't touch the text after the second "word".
@WiktorStribiżew Thank you for the suggestion to remove the unnecessary (.*) and $2. Updated the answer.

You can try this sed

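sed 's/\([^ ]*\) [^ ]* \(.*\)/\1 \2/' input_file

This uses grouping to keep the text before the first space and the text after the second space, dropping whatever lies between them. Note that [^ ] is used rather than [^\s]: inside a bracket expression, \s is not a whitespace class, so [^\s] would exclude the backslash and the letter s instead of spaces.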

Output

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
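
A caveat on bracket expressions, shown as a small sketch: in sed, \s inside [...] is not treated as a whitespace class, so [^\s] means "anything except a backslash or the letter s" and still matches a space. Negating a literal space is a portable way to say "not a space". The variable names below are only for illustration.

```shell
# [^\s] excludes the backslash and the letter s; the spaces are replaced too.
demo=$(printf 'a s b\n' | sed 's/[^\s]/X/g')

# [^ ] excludes only the space character, which is what is wanted here.
header='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'
result=$(printf '%s\n' "$header" | sed 's/^\([^ ]*\) [^ ]* /\1 /')

printf '%s\n' "$demo" "$result"
```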

1 Comment

Thx a lot for your help!
