0

I have input data with three tab-separated columns, like this:

  a  mrna_185598_SGL 463
  b  mrna_9210_DLT   463
  c  mrna_9210_IND   463
  d  mrna_9210_INS   463
  e  mrna_9210_SGL   463

How can I use sed/awk to turn it into four-column data that looks like this:

a  mrna_185598 SGL   463
b  mrna_9210   DLT   463
c  mrna_9210   IND   463
d  mrna_9210   INS   463
e  mrna_9210   SGL   463

Essentially, I want to split the original "mrna" string into two parts at its last underscore.

7 Answers

2

Something like this:

awk 'BEGIN{FS=OFS="\t"}{split($2,a,"_"); $2=a[1]"_"a[2]"\t"a[3] }1'  file

Output:

# ./shell.sh
a       mrna_185598     SGL     463
b       mrna_9210       DLT     463
c       mrna_9210       IND     463
d       mrna_9210       INS     463
e       mrna_9210       SGL     463

Use nawk on Solaris.
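
The split() version above assumes the second column contains exactly two underscores. As a minimal sketch, assuming you want to split at the last underscore however many there are, match() can locate it instead. The function name split_col2 is only illustrative:

```shell
# Hypothetical helper: split column 2 of tab-separated input at its last underscore.
split_col2() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        # /_[^_]*$/ can only match starting at the last underscore;
        # match() stores that position in RSTART (0 if there is no underscore).
        if (match($2, /_[^_]*$/))
            $2 = substr($2, 1, RSTART - 1) OFS substr($2, RSTART + 1)
        print
    }' "$@"
}

# Example: reads stdin here; a file name can be passed as an argument instead.
printf 'a\tmrna_185598_SGL\t463\n' | split_col2
```

Lines whose second column has no underscore are printed unchanged.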

And if you have bash:

while IFS=$'\t' read -r a b c
do
    front=${b%_*}
    back=${b##*_}
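    # Pass the data as arguments, not as the format string, so a stray % or \ in the input is printed literally
    printf '%s\t%s\t%s\t%s\n' "$a" "$front" "$back" "$c"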
done <"file"
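
The two parameter expansions in the loop do the actual splitting. Here is a small sketch of them in isolation, assuming bash; the function name split_last_underscore is made up for illustration:

```shell
# Hypothetical helper: print the parts before and after the last underscore, tab-separated.
split_last_underscore() {
    local s=$1
    # ${s%_*}  strips the shortest suffix matching "_*": the last "_" and everything after it.
    # ${s##*_} strips the longest prefix matching "*_": everything up to and including the last "_".
    printf '%s\t%s\n' "${s%_*}" "${s##*_}"
}

split_last_underscore "mrna_185598_SGL"
```

Because both expansions key on the last underscore, this works no matter how many underscores come before it.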

2

gawk:

{
  print $1 "\t" gensub(/_/, "\t", 2, $2) "\t" $3
}


1

You don't need sed; you can use tr instead:

cat *FILENAME* | tr '_[:upper:]{3}\t' '\t[:lower:]{3}\t' >> *FILEOUT*

cat FILENAME prints the file, which is then piped ('|') to tr (translate). The intent is to replace an underscore that is followed by three uppercase characters and a tab with a tab, then append the result to FILEOUT. Note, however, that tr maps individual characters from one set to another rather than matching patterns, so it cannot restrict the replacement to that one underscore.

1 Comment

Useless use of cat. Pass the file to tr instead: tr 'blah' 'blah' < file >> fileout. And did you test your command properly?
1
$ cat test.txt
  a  mrna_185598_SGL 463
  b  mrna_9210_DLT   463
  c  mrna_9210_IND   463
  d  mrna_9210_INS   463
  e  mrna_9210_SGL   463

$ cat test.txt | sed -E 's/(\S+)_(\S+)\s+(\S+)$/\1\t\2\t\3/'
  a  mrna_185598    SGL 463
  b  mrna_9210  DLT 463
  c  mrna_9210  IND 463
  d  mrna_9210  INS 463
  e  mrna_9210  SGL 463

1 Comment

Useless use of cat. Pass the file name to sed instead: sed 'options' filename
1

Provided the lines don't look too different from what you've posted:

sed -E 's/mrna_([0-9]+)_/mrna_\1\t/'


1
gawk '{$1=$1; $0=gensub(/_/,"\t",2);print}' file

a mrna_185598   SGL 463
b mrna_9210 DLT 463
c mrna_9210 IND 463
d mrna_9210 INS 463
e mrna_9210 SGL 463


0

This might work for you (GNU sed):

sed 's/_/\t/2' file

This replaces the second occurrence of _ on each line with a tab.
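
Treating \t as a tab in the replacement is a GNU extension. As a rough sketch for other sed implementations, assuming a POSIX shell, you can put a literal tab into a variable first. The function name split_second_underscore is only illustrative:

```shell
# Hypothetical helper: replace the second underscore on each line with a literal tab.
split_second_underscore() {
    tab=$(printf '\t')          # a literal tab character
    sed "s/_/${tab}/2" "$@"     # the numeric flag 2 selects the second match on the line
}

# Example: reads stdin here; a file name can be passed as an argument instead.
printf 'a\tmrna_185598_SGL\t463\n' | split_second_underscore
```

This counts underscores across the whole line, so it assumes the first column contains none.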
