Remove character from the middle of a string

Question

I have a SAM file with an RX: field containing 12 bases separated in the middle by a - i.e. RX:Z:CTGTGC-TCGTAA

I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file as the read names contain them, like 1713704_EP0004-T

Have mostly been trying tr, but this is just removing all hyphens from the file.:

tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

input is a large SAM file of >10,000,000 lines like this:

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGC-TCGTAA

Desired output (i.e. last field)

1902336-103-016_C1D1_1E-T:34    99  chr1    131341  36  146M    =   131376  182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN  MC:Z:147M   MD:Z:83T62cD:i:4    cE:f:0  PG:Z:bwa    RG:Z:A  MI:Z:34 NM:i:1  cM:i:3  MQ:i:36 UQ:i:45 AS:i:141    XS:i:136    RX:Z:CTGTGCTCGTAA

How do I solve this problem?

Specifically, it's not clear what a "read name" is and where it occurs in an input line. — Benjamin W.
– Benjamin W., Commented May 1, 2019 at 14:54
Please post a few lines from the file and post a sample output — sjsam
– sjsam, Commented May 1, 2019 at 14:54
@sjsam The last field is what I want changed RX:Z:CTGTGC-TCGTAA. My issue is i can't just delete all '-' from the file because there are hyphens in the read name 1902336-103-016_C1D1_1E-T:34 i.e. the first field — lgallagher
– lgallagher, Commented May 1, 2019 at 15:04

sjsam · Accepted Answer · 2019-05-01 19:25:52Z

4

awk

awk '{sub(/-/,"",$NF)}1' file

is what you need.

Explanation

From this it is clear that you're concerned only about the last field.
NF is the total number of fields that a record contains, hence $NF is the last field.
sub(/-/,"",$NF) replaces the - in the last field with an empty string, making the change persistent.

GNU sed

For this same reason,

sed -Ei 's/^(.*)-/\1/' file

will work. It has an added advantage that it can perform an inplace edit.

Explanation

The -E option enables the extended regular expression engine.
The (.*) is a greedy search that will match any character(.) any number of times(*). For the fact that is greedy it will match anything up to the last hyphen.
The () makes sed remember what was matched.
In the substitution part, we put just the matched part \1 (1 because we having only one pair of parenthesis, note that you can have as many as you like) without the hyphen, thus effectively removing it from the last field where it should occur.

_{Note : The GNU awk support -i inplace, but I'm not sure from which version on.}

edited May 1, 2019 at 19:25

answered May 1, 2019 at 15:06

sjsam

22k6 gold badges63 silver badges114 bronze badges

Sign up to request clarification or add additional context in comments.

17 Comments

sjsam Over a year ago

@PaulHodges : No, a 1 simply prints a record. Being idiomatic :-)

Paul Hodges Over a year ago

Just surprised me to see it at the end of the code block, but then I guess technically it's at the beginning of the implied {print}, eh?

Inian Over a year ago

You don't want to do this on all lines of a 10M file

Ed Morton Over a year ago

wrt [sed] has an added advantage that it can perform an inplace edit. - you're using GNU sed to get -i and you can use GNU awk to get the equivalent -i inplace. That particular awk command would change the white space in the rest of the output which may not be desirable though.

sjsam Over a year ago

@EdMorton : Thank you Updating the answer. I have gawk 4.02 which doesn't have the -i option in the manual. Not sure when it was introduced.

|

lgallagher · Accepted Answer · 2019-05-10 14:51:15Z

2

I've solved this problem using pysam which is faster, safer and requires less disk space as a sam file is not required. It's not perfect, I'm still learning python and have used pysam for half a day.

import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)

inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print ('Processed', n, 'reads:\n',
       w, 'UMIs written.\n',
       str(int((w / n) * 100)) + '% of UMIs fixed')

edited May 10, 2019 at 14:51

answered May 10, 2019 at 12:55

lgallagher

516 bronze badges

1 Comment

sjsam Over a year ago

A tool for the job is always a charm to work with !

Konrad Rudolph · Accepted Answer · 2019-05-09 13:09:00Z

The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.

Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).

Working with columns, while possible, is hard with sed. Instead, here’s how to do this in awk:

awk -F '[[:space:]]*' '{
    for (i = 12; i <= NF; i++) {
        if ($i ~ /^RX:Z:/) gsub("-", "", $i)
    }
}
1' file.sam

And here’s a roughly equivalent solution as a Perl “one-liner”:

perl -ape '
    for (@F[11..(scalar @F)]) {
        s/-//g if /^RX:Z:/;
    }
    $_ = join("\t", @F);
' file.sam

To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).

Paul Hodges · Accepted Answer · 2019-05-01 15:07:50Z

0

This pattern is on many records that you want to edit, and is always at the end of the line? If so -

sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

answered May 1, 2019 at 15:07

Paul Hodges

16k1 gold badge24 silver badges43 bronze badges

1 Comment

Konrad Rudolph Over a year ago

It won’t always be at the end of the line, nor will the number of letters before/after the dash be fixed.

Collectives™ on Stack Overflow

Remove character from the middle of a string

4 Answers 4

17 Comments

1 Comment

Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

17 Comments

1 Comment

Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related