2

I have a FASTQ file containing several sequences with headers such as:

tail SRR11149706_1.fastq 

@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

I would like to remove the digits that come before the "/", as well as the "/" itself. The number of digits varies. The result should be:

@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

Nothing I have tried worked.

Edit: I originally thought I would remove everything between the space and the first /, inclusive, but I do need to keep the space.

  • The title of your question says between 2 special characters, but you only mention one, the /. Commented Jul 4 at 20:40
  • Sorry, I originally meant between the blank space and the /, but I realized I do need the blank space... Commented Jul 4 at 22:27
  • @CaroZ A FASTQ file should have base qualities as well, 4 lines per record, and no blank lines between records. Are you sure this is the right format? Commented Jul 4 at 22:32
  • awk '/^@/{sub(" [0-9]+/"," ")} 1' Commented Jul 5 at 1:46
  • Please show what you tried and explain why it was wrong. Commented Jul 5 at 5:15

8 Answers

4

With sed, assuming you want to modify SRR11149706_1.fastq in place:

sed -E -e "s=[0-9]+/==" -i SRR11149706_1.fastq

Example of execution on my Pi 5 (Debian bookworm)

bruno@raspberrypi:/tmp $ cat SRR11149706_1.fastq
@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
bruno@raspberrypi:/tmp $ 
bruno@raspberrypi:/tmp $ sed -E -e "s=[0-9]+/==" -i SRR11149706_1.fastq
bruno@raspberrypi:/tmp $ 
bruno@raspberrypi:/tmp $ cat SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
bruno@raspberrypi:/tmp $ 

If you do not want to modify SRR11149706_1.fastq, drop the -i option and redirect the output to the file that should hold the result.


Above I assumed there is only one occurrence of a number followed by / to remove per line. If you want to remove all the occurrences on each line:

sed -E -e "s=[0-9]+/==g" -i SRR11149706_1.fastq

The title of your question mentions two special characters, but the question itself only mentions /.

If the number/ must be removed only on lines starting with @:

sed -E -e "/^@/ s=[0-9]+/==" -i SRR11149706_1.fastq

Of course, replace @ with @SRR11149706 if needed, and so on.

As before, add g to remove all occurrences of number/ on each selected line rather than just the first one.
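
One caveat with selecting lines by a leading @: in a standard four-line FASTQ record the quality line can also begin with @ (in the usual Phred+33 encoding it stands for a score of 31), so a record-aware pass is safer. Below is a minimal Python sketch of that idea. It assumes four-line records with no blank lines between them, which is not what the excerpt in the question shows, and the name strip_read_numbers is made up for illustration.

```python
import re
from typing import Iterable, Iterator

# One space, a run of digits, and the slash that follows them.
READ_NUMBER = re.compile(r" [0-9]+/")

def strip_read_numbers(lines: Iterable[str]) -> Iterator[str]:
    """Edit only the first line of each four-line record (assumed layout)."""
    for index, line in enumerate(lines):
        if index % 4 == 0:
            # Keep the space, drop the digits and the slash (first match only).
            line = READ_NUMBER.sub(" ", line, count=1)
        yield line

record = [
    "@SRR11149706.16630586 16630586/1\n",
    "CCCAACAACAAC\n",
    "+\n",
    "@5/IIIIIIIII\n",  # made-up quality line that happens to start with "@"
]
print("".join(strip_read_numbers(record)), end="")
```

The sed command restricted to /^@/ would turn the made-up quality line above into @IIIIIIIII, whereas the positional version leaves it alone.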


3

This might work for you (GNU sed):

sed -E '\#^@[^/]*/.$#s#\S+/##' file

Look for a line that starts with an @ and ends with a / before the last character.

Then remove the non-space characters before the /, along with the / itself.

N.B. \#...# replaces the usual /.../ address delimiters, which allows the / to appear unescaped in the search regex. Of course the / could have been escaped, but this is perhaps more elegant than /^@[^/]*\/.$/, and the subsequent substitution uses the same # delimiters.

1 Comment

Thank you very much for the thorough explanation !
2

Sorry, I don't really know AWK (and I got dizzy from the 5 pages of info awk :-) )

But this can also be achieved with a Python one-liner, although a slightly more verbose one: reading from stdin (other than line by line) and regular expressions both need an import, and regexps have no special syntax in the language, so they require some quoting.

After adding these, it simply works and you can type this at the shell:

 cat input.fastq | python -c 'import sys,re; print(re.sub(r"^(@[A-Z 0-9 .]+\s)(\d+\/)(.*)", r"\1\3", sys.stdin.read(), flags=re.MULTILINE), end="")' >output.fastq

What I am doing here: I am using Python's re.sub, which leaves any line that does not match unchanged. For matching lines, it breaks the line into three sub-groups and replaces them with the first and the last combined, dropping the second group, which holds the digits and the slash you want to remove. Passing end="" keeps print from appending an extra newline to the output.
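
The same three-group substitution can also be written as a short script that streams the input line by line instead of reading it all into memory, which may matter for large FASTQ files. This is only a sketch: the names HEADER, fix_header and rewrite are illustrative, and matching the read name with \S+ is an assumption about what the headers look like.

```python
import io
import re
from typing import TextIO

# Group 1: "@name" plus one whitespace character; group 2: digits and "/";
# group 3: the rest of the header line.
HEADER = re.compile(r"^(@\S+\s)(\d+/)(.*)$")

def fix_header(line: str) -> str:
    """Drop group 2; lines that do not match are returned unchanged."""
    return HEADER.sub(r"\1\3", line)

def rewrite(src: TextIO, dst: TextIO) -> None:
    """Copy src to dst, fixing header lines one at a time."""
    for line in src:
        dst.write(fix_header(line))

# Small in-memory demonstration; real use would pass sys.stdin and sys.stdout.
demo = io.StringIO("@SRR11149706.16630586 16630586/1\nCCCAACAACAAC\n")
out = io.StringIO()
rewrite(demo, out)
print(out.getvalue(), end="")
```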

5 Comments

By the way, why 0-9? Thank you!
If the 5 pages of awk info made you dizzy then the massive volumes of python documentation must be quite a trip :-).
Yes, since I've been following their growth for 25+ years now. :-) I am not saying awk is worse, and it is certainly a more concise tool for this job, if one knows awk, which only the fourth person to answer did. I also recognize perl and sed are better suited for in-place replacement from the shell. But me? I more often have a Python REPL as my shell.
Sure… if you think having to import sys every time just to access things piped in from /dev/stdin is a time-saver for you, then be my guest (not to mention its strictness about indentation makes for very awkward shell one-liners)
and python's ternary, unlike any other on this planet and also the rest of the Milky Way, reads like "dinner I'll eat, if I'm hungry, otherwise go to bed" instead of "if I'm hungry I'll eat dinner, otherwise I'll go to bed"
2

Use this Perl one-liner:

perl -pe 's{\s+\d+/}{ }' infile.fastq > outfile.fastq

or modify the file in-place:

perl -i.bak -pe 's{\s+\d+/}{ }' infile.fastq

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak. If you want to skip writing a backup file, just use -i and skip the extension.

s{PATTERN}{REPLACEMENT} : Replace regex PATTERN with REPLACEMENT.

\s+\d+/ : 1 or more whitespace characters, followed by 1 or more digits, followed by a literal /.


2

I would use GNU AWK for this task in the following way. Let the content of file.txt be

@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

then

awk '/\//{sub(/^[0-9]+\//,"",$NF)}{print}' file.txt

gives output

@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

Explanation: for each line containing a slash (note that it must be escaped, as otherwise it would be mistaken for the regular expression terminator), I replace one or more leading digits followed by a slash with an empty string in the last field. I print every line. Disclaimer: this solution assumes your fields are separated by exactly one SPACE character.

(tested in GNU Awk 5.3.1)
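
The same last-field idea can be sketched in Python. This is an approximation rather than a translation of the awk program: it splits on the last single space and leaves the rest of the line untouched. The function name strip_last_field is made up for illustration.

```python
def strip_last_field(line: str) -> str:
    """Remove leading 'digits/' from the last space-separated field."""
    body = line.rstrip("\n")
    newline = line[len(body):]            # preserve the original line ending
    head, sep, last = body.rpartition(" ")
    digits, slash, rest = last.partition("/")
    # Only act when the field really starts with ASCII digits and a slash.
    if slash and digits.isascii() and digits.isdigit():
        last = rest
    return head + sep + last + newline

for sample in ("@SRR11149706.16630586 16630586/1\n", "CCCAACAACAAC\n"):
    print(strip_last_field(sample), end="")
```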

1 Comment

I think they are. I have so many different solutions to try now, thank you !
2

Using any sed:

$ sed 's:[0-9]*/::' SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA

or any awk:

$ awk '{sub("[0-9]+/","")} 1' SRR11149706_1.fastq
@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA


1

You can use Raku/Sparrow for that; it's quite simple. Given the input data in a data.txt file:

task.bash

cat data.txt

task.check

~regexp: (\S+) \s+ (\d+) "/" (.*)
 
code: <<OK
!raku
for captures-full()<> -> $c {
  replace(
    "data.txt",
    $c<index>,
    $c<data>[0] ~ " " ~ $c<data>[2],
  );
}
OK

Test

s6 --task-run  .
21:45:31 :: [sparrowtask] - run sparrow task .
21:45:31 :: [sparrowtask] - run [.], thing: .
[task run: task.bash - .]
[task stdout]
21:45:31 :: @SRR11149706.16630586 16630586/1
21:45:31 :: CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA
21:45:31 :: 
21:45:31 :: @SRR11149706.16630587 16630587/1
21:45:31 :: CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
[task check]
stdout match <(\S+) \s+ (\d+) "/" (.*)> True


1

awk half-liner, using the result of the regex match as an exponent:

echo '
@SRR11149706.16630586 16630586/1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 16630587/1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA' |
awk '(NF != 2)^(/^@/) || NF = NF' FS=' [0-9]+[/]'

@SRR11149706.16630586 1
CCCAACAACAACAACAGCAACCTCCTCACGCCAACGCCGATCCCGCCGCTGTTTTCCAA

@SRR11149706.16630587 1
CAAAGCACCAGGTGCAGTGCACCTTGTCCGTCGGTCTGAATATCTGCTCTCTGTTCTCCA
