Remove a string character between 2 special characters in the headers of a fastq file

Question

I have a fastq file containing several sequences with headers such as :

tail SRR11149706_1.fastq 

@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

I would like to remove the number that comes before the "/", as well as the "/" itself. The number of digits is variable. The result should be :

@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

Nothing I have tried worked.

Edit : I first thought of removing everything from the space up to and including the first /, but I do need to keep the space.

The title of your question says between 2 special characters, but you speak about only one being /
– bruno, Commented Jul 4 at 20:40
Sorry I originally meant between the blank space and the /, but I realized I do need the blank space...
– CaroZ, Commented Jul 4 at 22:27
@CaroZ A FASTQ file should have base qualities as well, 4 lines per record, and no blank lines between records. Are you sure this is the right format?
– Timur Shtatland, Commented Jul 4 at 22:32

Mark Setchell · Accepted Answer · 2025-07-05 09:16:55Z

With sed, supposing you want to modify SRR11149706_1.fastq :

sed -E -e "s=[0-9]+/==" -i SRR11149706_1.fastq

Example of execution on my Pi 5 (Debian bookworm)

bruno@raspberrypi:/tmp $ cat SRR11149706_1.fastq
@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
bruno@raspberrypi:/tmp $ 
bruno@raspberrypi:/tmp $ sed -E -e "s=[0-9]+/==" -i SRR11149706_1.fastq
bruno@raspberrypi:/tmp $ 
bruno@raspberrypi:/tmp $ cat SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
bruno@raspberrypi:/tmp $

If you do not want to modify SRR11149706_1.fastq, remove the option -i and maybe redirect the output into the expected result file.

Above I supposed there is only one occurrence of a number followed by / to remove per line. If you want to remove all the occurrences per line :

sed -E -e "s=[0-9]+/==g" -i SRR11149706_1.fastq

The title of your question mentions two special characters, but the body only mentions /.

If the number/ must be removed only on lines starting with @ :

sed -E -e "/^@/ s=[0-9]+/==" -i SRR11149706_1.fastq

Of course, replace @ with @SRR11149706 (or whatever prefix applies) if needed.

As before, add g to remove all occurrences of number/ on each selected line rather than just the first one.
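To make the difference between the plain substitution and the g variant concrete, here is a minimal Python sketch of the same logic. It is only an illustration, not a replacement for the sed commands: the function name strip_number and its every parameter are made up for this example, and it assumes header lines are exactly the lines that start with @.

```python
import re

# Same pattern as in the sed commands: one or more digits followed by '/'.
NUMBER_SLASH = re.compile(r"[0-9]+/")

def strip_number(line, every=False):
    """Roughly mimic sed '/^@/ s=[0-9]+/==' (or the 'g' variant when every=True)."""
    if not line.startswith("@"):
        return line  # lines not starting with '@' pass through unchanged
    # In re.sub, count=1 replaces only the first match, count=0 replaces all.
    return NUMBER_SLASH.sub("", line, count=0 if every else 1)

print(strip_number("@SRR11149706.16630586 16630586/1"))  # @SRR11149706.16630586 1
print(strip_number("@id 12/34/5"))                       # @id 34/5
print(strip_number("@id 12/34/5", every=True))           # @id 5
print(strip_number("ACGT 12/3"))                         # ACGT 12/3
```

The second and third calls show why g matters only when a header contains more than one number/ pair.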

potong · Accepted Answer · 2025-07-05 08:35:29Z

3

This might work for you (GNU sed):

sed -E '\#^@[^/]*/.$#s#\S+/##' file

Look for a line that starts with an @, contains a single /, and has exactly one character after that /.

Then remove the run of non-space characters immediately before the /, along with the / itself.

N.B. The address is written as \#...# instead of the normal /.../, which allows the / to appear unescaped in the search regex. Of course the / could have been escaped, but perhaps this is more elegant than /^@[^/]*\/.$/, as the subsequent substitution also uses the same # delimiter.
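For readers less used to sed addresses, the two-step logic (first check the shape of the line, then substitute) can be sketched in Python as below. This is an illustrative translation under the assumption that lines are passed without their trailing newline; the names LINE_SHAPE, TOKEN_SLASH and strip_guarded are invented for the example.

```python
import re

# Address part: starts with '@', contains exactly one '/', one character after it.
LINE_SHAPE = re.compile(r"^@[^/]*/.$")
# Substitution part: a run of non-space characters ending in '/'.
TOKEN_SLASH = re.compile(r"\S+/")

def strip_guarded(line):
    """Apply the substitution only to lines that match the address regex."""
    if LINE_SHAPE.search(line):
        return TOKEN_SLASH.sub("", line, count=1)
    return line

print(strip_guarded("@SRR11149706.16630587 16630587/1"))  # @SRR11149706.16630587 1
print(strip_guarded("@read 7/12"))                        # unchanged: two characters follow the '/'
print(strip_guarded("CAAAGCACCAGGTGCAG"))                 # unchanged: does not start with '@'
```

The guard is what keeps lines with a different layout from being modified by accident.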

edited Jul 5 at 8:35

answered Jul 5 at 8:30

1 Comment

CaroZ Jul 5 at 10:19

Thank you very much for the thorough explanation !

jsbueno · Accepted Answer · 2025-07-04 20:58:01Z

2

Sorry, I don't really know AWK (and I got dizzy from the 5 pages of info awk :-) ).

But that can also be achieved with a Python one-liner, although it is a bit more verbose: reading from stdin (other than line by line) and regexps both require an import, and regexps are not special-cased in the language, so they need some quoting.

After adding these, it simply works and you can type this at the shell:

 cat input.fastq| python -c 'import sys,re; print(re.sub(r"^(@[A-Z 0-9 .]+\s)(\d+/)(.*)", r"\1\3", sys.stdin.read(), flags=re.MULTILINE), end="")' >output.fastq

What I am doing here: I am using Python's re.sub which, where there is no match, simply returns the input unchanged. For matching lines, it breaks the line into three groups and then replaces them with the first and the last combined, dropping the second group, which holds the digits and the / you want to remove.
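The same three-group idea can also be written as a short script that works line by line, which avoids loading a large FASTQ file into memory at once. This is a sketch, assuming header IDs contain only uppercase letters, digits and dots as in the sample; the generator name fix_headers is hypothetical.

```python
import re

# Group 1: '@' plus the ID and the space; group 2: digits and '/'; group 3: the rest.
HEADER = re.compile(r"^(@[A-Z0-9.]+\s)(\d+/)(.*)")

def fix_headers(lines):
    """Yield each line, with group 2 dropped from lines that match HEADER."""
    for line in lines:
        # Lines that do not match are returned unchanged by re.sub.
        yield HEADER.sub(r"\1\3", line)

sample = [
    "@SRR11149706.16630586 16630586/1\n",
    "CCCAACAACAACAACAGCAACC\n",
]
print("".join(fix_headers(sample)), end="")
```

To use it on a real file, the open file object can be passed directly as lines, since iterating over a file yields one line at a time.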

5 Comments

CaroZ Jul 5 at 10:29

By the way why 0-9 ? Thank you !

Ed Morton Jul 5 at 18:17

If the 5 pages of awk info made you dizzy then the massive volumes of python documentation must be quite a trip :-).

jsbueno Jul 7 at 13:31

yes, since I've been following their growth for 25+ years now. :-) I am not saying awk is worse, and it is certainly a more concise tool for this job - if one knows awk. Which has been just the fourth person who stepped in to answer the problem, I also recognize perl and sed are better suited for inplace replace from the shell. But me? I more often have a Python REPL as my shell.

RARE Kpop Manifesto Sep 12 at 1:51

Sure... if you think having to import sys every time just to access things piped in from /dev/stdin is a time-saver for ya then be my guest (not to mention its strictness in indentation makes for very awkward shell one-liners)

RARE Kpop Manifesto Sep 12 at 1:57

and python's ternary, unlike any other on this planet and also the rest of Milky Way, that reads like - dinner I'll eat, if I'm hungry, otherwise go to bed instead of if I'm hungry I'll eat dinner otherwise I'll go to bed

Timur Shtatland · Accepted Answer · 2025-07-04 23:10:14Z

Use this Perl one-liner:

perl -pe 's{\s+\d+/}{ }' infile.fastq > outfile.fastq

or modify the file in-place:

perl -i.bak -pe 's{\s+\d+/}{ }' infile.fastq

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.

s{PATTERN}{REPLACEMENT} : Replace regex PATTERN with REPLACEMENT.

\s+\d+/ : 1 or more whitespace characters, followed by 1 or more digits, followed by a literal /.
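For comparison, the in-place edit with a backup copy (the -i.bak behaviour) can be approximated in Python with the standard fileinput module. This is a rough sketch rather than an exact equivalent of the Perl command; the function name rewrite_in_place is made up, and it assumes the file is plain text in the default encoding.

```python
import fileinput
import re

# Same pattern as the Perl one-liner: whitespace, digits, then a literal '/'.
HEADER_NUMBER = re.compile(r"\s+\d+/")

def rewrite_in_place(path, backup_ext=".bak"):
    """Rewrite path in place, keeping the original as path + backup_ext."""
    with fileinput.input(files=[path], inplace=True, backup=backup_ext) as stream:
        for line in stream:
            # With inplace=True, print() writes into the file being rewritten.
            print(HEADER_NUMBER.sub(" ", line, count=1), end="")
```

Calling rewrite_in_place("infile.fastq") would leave the original content in infile.fastq.bak. Because the pattern requires whitespace before the digits, lines without any inner whitespace (such as sequence lines) are left untouched.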

1 Comment

CaroZ Jul 5 at 10:29

I think they are. I have so many different solutions to try now, thank you !

Ed Morton · Accepted Answer · 2025-07-05 18:15:23Z

2

Using any sed:

$ sed 's:[0-9]*/::' SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

or any awk:

$ awk '{sub("[0-9]+/","")} 1' SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

Alexey Melezhik · Accepted Answer · 2025-07-08 18:55:37Z

1

You can use Raku/Sparrow for that, it's quite simple, given input data inside data.txt file:

task.bash

cat data.txt

task.check

~regexp: (\S+) \s+ (\d+) "/" (.*)
 
code: <<OK
!raku
for captures-full()<> -> $c {
  replace(
    "data.txt",
    $c<index>,
    $c<data>[0] ~ " " ~ $c<data>[2],
  );
}
OK

Test

s6 --task-run  .
21:45:31 :: [sparrowtask] - run sparrow task .
21:45:31 :: [sparrowtask] - run [.], thing: .
[task run: task.bash - .]
[task stdout]
21:45:31 :: @SRR11149706.16630586 16630586/1
21:45:31 :: CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA
21:45:31 :: 
21:45:31 :: @SRR11149706.16630587 16630587/1
21:45:31 :: CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
[task check]
stdout match <(\S+) \s+ (\d+) "/" (.*)> True

edited Jul 8 at 18:55

answered Jul 8 at 18:49

RARE Kpop Manifesto · Accepted Answer · 2025-08-20 07:46:26Z

1

awk half-liner, using the result of the regex match as an exponent :

echo '
@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA' |

awk '(NF != 2)^(/^@/) || NF = NF' FS=' [0-9]+[/]'

@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
