Remove string between two space characters with sed

Question

Somehow I can't wrap my head around this. I have the following string:

>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA

I would like to use sed to remove the string between the 1st and 2nd occurrence of a space. Hence, in this case, PSBA_LEMMI should be removed. The string between the first two spaces does not contain any special characters.

So far I tried the following:

sed 's/\s.*\s/\s/'

But this removes everything up to the last space, resulting in: >sp.A9L976 TESTgene=psbA. I thought that by leaving out the global flag g, sed would only match the first occurrence of the string. I also tried:

sed 's/(?<=\s).*(?=\s)//'

But this did not match / remove anything. Can someone help me out here? What am I missing?

Using awk this is just awk '{$2 = ""} 1' file

– anubhava, Commented Sep 22, 2021 at 14:47
That is very elegant! Thx a lot!

– han5000, Commented Sep 23, 2021 at 8:15

Wiktor Stribiżew · Accepted Answer · 2021-09-23 08:19:18Z

2

You can use

sed -E 's/\s+\S+\s+/ /'
sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /'

The two patterns are equivalent: each matches one or more whitespace characters, then one or more non-whitespace characters, then one or more whitespace characters. The \s and \S shorthands are a GNU sed extension, while the bracket expressions in the second command are plain POSIX ERE.

Note that you cannot use \s as a whitespace char in the replacement part. \s is a regex pattern, and regex is used in the LHS (left-hand side) to search for whitespaces. So, a literal space is required to replace with a space.
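As a rough sketch of why the first attempt in the question removes too much, the commands below compare a greedy .* with the non-whitespace pattern on the sample header. The variable names (line, greedy, fixed) are only illustrative. Note also that sed works with POSIX BRE/ERE, which have no lookaround syntax such as (?<=...), so the second attempt in the question cannot behave as intended.

```shell
line='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Greedy .* extends to the LAST whitespace on the line,
# so everything between the first and last space is replaced.
greedy=$(printf '%s\n' "$line" | sed 's/[[:space:]].*[[:space:]]/ /')

# Limiting the middle part to non-whitespace stops the match at the second space.
# The replacement is a literal space character.
fixed=$(printf '%s\n' "$line" | sed -E 's/[[:space:]]+[^[:space:]]+[[:space:]]+/ /')

printf '%s\n' "$greedy" "$fixed"
```

Without the g flag, sed replaces only the first match on each line, but that single match is still as long as the pattern allows.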

Since you can also use an awk solution you may use

awk '{$2=""}1' file

Here, the lines ("records") are split into "fields" with whitespace (it is the default field separator), and the second field ($2) value is cleared with {$2 = ""} and the 1 forces awk to output the result (calling the default print command).
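One detail that may matter: clearing $2 leaves an empty field in place, so the rebuilt record has two spaces between the first and third fields. The sketch below shows this and one possible way to collapse the gap with sub(). The variable names are illustrative, and the sub() step is an addition, not part of the command above.

```shell
line='>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA'

# Clearing $2 rebuilds the record with single-space separators,
# leaving an empty field (two consecutive spaces) where $2 was.
cleared=$(printf '%s\n' "$line" | awk '{$2=""}1')

# After the rebuild, the only double space is the one left by the empty field,
# so replacing the first double space with one space removes the gap.
dropped=$(printf '%s\n' "$line" | awk '{$2=""; sub(/  /, " ")}1')

printf '%s\n' "$cleared" "$dropped"
```

Because awk rebuilds the record when a field is assigned, any runs of whitespace elsewhere on the line are also collapsed to single spaces.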

edited Sep 23, 2021 at 8:19

answered Sep 22, 2021 at 14:28

1 Comment

han5000 Over a year ago

Great explanation. Thank you very much. I accepted your answer as it solved my problem via sed and provided me with useful insights.

Timur Shtatland · Answer · 2021-09-23 15:35:08Z

2

To edit the header of the fasta file as you specify, use this Perl one-liner:

echo '>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA' | perl -lpe 's{^(>\S+\s+)\S+\s+}{$1}'

Prints:

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA

Note that it changes the fasta headers only, keeping the sequence intact even in the relatively rare cases when the sequence has whitespace. This is important in bioinformatics applications:

printf '%s\n' '>sp.A9L976 PSBA_LEMMI Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA' 'ACTG ACTG ACTG' | perl -pe 's{^(>\S+\s+)\S+\s+}{$1}'

Prints:

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
ACTG ACTG ACTG

To edit the file in place:

perl -i.bak -lpe 's{^(>\S+\s+)\S+\s+}{$1}' in_file.fasta

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.

Here,
^ : beginning of the line.
> : literal "greater than" character, which marks the beginning of the header in fasta format specifications.
\S+ : 1 or more non-whitespace characters.
\s+ : 1 or more whitespace characters.
$1 : 1st captured pattern. Capture occurs using parentheses: (...).

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

edited Sep 23, 2021 at 15:35

answered Sep 22, 2021 at 14:43

2 Comments

Wiktor Stribiżew Over a year ago

In such scenarios, the (.*) in the regex and $2 in the replacement pattern can be both removed. We needn't touch the text after the second "word".

Timur Shtatland Over a year ago

@WiktorStribiżew Thank you for the suggestion to remove the unnecessary (.*) and $2. Updated the answer.

sseLtaH · Answer · 2021-09-22 14:38:05Z

1

You can try this sed command:

sed 's/\([^ ]*\) [^ ]* \(.*\)/\1 \2/' input_file

This uses capture groups to keep the text before the first space and the text after the second space, dropping the field in between.

Output

>sp.A9L976 Photosystem II protein D1 organism=Lemna minor taxid=4472 gene=psbA
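A caveat when writing grouping patterns like this: inside a bracket expression, sed treats \s as a literal backslash and the letter s, so [^\s] excludes the letter s rather than whitespace. The following is a more defensive sketch, not the answer's own command. It uses POSIX character classes so that tabs also count as separators, and a /^>/ address so that only header lines are edited. The input with a tab and a lowercase s in the second field is invented for illustration.

```shell
# Hypothetical two-line input: a header whose fields are separated by a tab and a space,
# followed by a sequence line that itself contains spaces.
input=$(printf '>sp.A9L976\tPsba_test Photosystem II\nACTG ACTG ACTG\n')

# /^>/ limits the substitution to header lines.
# Group 1 keeps the first field; the second field and the whitespace around it
# are replaced by a single literal space.
result=$(printf '%s\n' "$input" |
  sed -E '/^>/ s/^([^[:space:]]+)[[:space:]]+[^[:space:]]+[[:space:]]+/\1 /')

printf '%s\n' "$result"
```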

answered Sep 22, 2021 at 14:38

1 Comment

han5000 Over a year ago

Thx a lot for your help!
