1

I have a SAM file with an RX: field containing 12 bases separated in the middle by a - i.e. RX:Z:CTGTGC-TCGTAA

I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file as the read names contain them, like 1713704_EP0004-T

Have mostly been trying tr, but this is just removing all hyphens from the file.:

tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

input is a large SAM file of >10,000,000 lines like this:

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGC-TCGTAA

Desired output (i.e. last field)

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGCTCGTAA

How do I solve this problem?

16
  • 1
    Posting some sample lines would be useful. Commented May 1, 2019 at 14:48
  • 1
    Specifically, it's not clear what a "read name" is and where it occurs in an input line. Commented May 1, 2019 at 14:54
  • 1
    Please post a few lines from the file and post a sample output Commented May 1, 2019 at 14:54
  • 1
    @sjsam Done, hope that's clear Commented May 1, 2019 at 15:00
  • 1
    @sjsam The last field is what I want changed RX:Z:CTGTGC-TCGTAA. My issue is i can't just delete all '-' from the file because there are hyphens in the read name 1902336-103-016_C1D1_1E-T:34 i.e. the first field Commented May 1, 2019 at 15:04

4 Answers 4

4

awk

awk '{sub(/-/,"",$NF)}1' file

is what you need.

Explanation

  • From this it is clear that you're concerned only about the last field.
  • NF is the total number of fields that a record contains, hence $NF is the last field.
  • sub(/-/,"",$NF) replaces the - in the last field with an empty string, making the change persistent.

GNU sed

For this same reason,

sed -Ei 's/^(.*)-/\1/' file

will work. It has an added advantage that it can perform an inplace edit.

Explanation

  • The -E option enables the extended regular expression engine.
  • The (.*) is a greedy search that will match any character(.) any number of times(*). For the fact that is greedy it will match anything up to the last hyphen.
  • The () makes sed remember what was matched.
  • In the substitution part, we put just the matched part \1 (1 because we having only one pair of parenthesis, note that you can have as many as you like) without the hyphen, thus effectively removing it from the last field where it should occur.

Note : The GNU awk support -i inplace, but I'm not sure from which version on.

Sign up to request clarification or add additional context in comments.

17 Comments

@PaulHodges : No, a 1 simply prints a record. Being idiomatic :-)
Just surprised me to see it at the end of the code block, but then I guess technically it's at the beginning of the implied {print}, eh?
You don't want to do this on all lines of a 10M file
wrt [sed] has an added advantage that it can perform an inplace edit. - you're using GNU sed to get -i and you can use GNU awk to get the equivalent -i inplace. That particular awk command would change the white space in the rest of the output which may not be desirable though.
@EdMorton : Thank you Updating the answer. I have gawk 4.02 which doesn't have the -i option in the manual. Not sure when it was introduced.
|
2

I've solved this problem using pysam which is faster, safer and requires less disk space as a sam file is not required. It's not perfect, I'm still learning python and have used pysam for half a day.

import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)

inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print ('Processed', n, 'reads:\n',
       w, 'UMIs written.\n',
       str(int((w / n) * 100)) + '% of UMIs fixed')

1 Comment

A tool for the job is always a charm to work with !
1

The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.

Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).

Working with columns, while possible, is hard with sed. Instead, here’s how to do this in awk:

awk -F '[[:space:]]*' '{
    for (i = 12; i <= NF; i++) {
        if ($i ~ /^RX:Z:/) gsub("-", "", $i)
    }
}
1' file.sam

And here’s a roughly equivalent solution as a Perl “one-liner”:

perl -ape '
    for (@F[11..(scalar @F)]) {
        s/-//g if /^RX:Z:/;
    }
    $_ = join("\t", @F);
' file.sam

To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).

Comments

0

This pattern is on many records that you want to edit, and is always at the end of the line? If so -

sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

1 Comment

It won’t always be at the end of the line, nor will the number of letters before/after the dash be fixed.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.