
Maybe this is not the best title, but it's hard to convey my intention in a short title.

I have a line like this:

2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP;CIPOS=-100,55;CIEND=-56,100;IMPRECISE;DBVARID=esv7540;VALIDATED;VALMETHOD=CGH;SVMETHOD=RP

Basically I would like to convert it into:

2 118610455 118610566

So the main problem is extracting 118610566 from the 8th column.

I know how to get this number:

c=`cat line|awk '{print $8}'|sed 's/;/\t/g'|awk '{print $1}'|sed 's/END=//g'`

but my question is how I can then use this variable inside another awk command:

what_i_want=`cat line|awk '{print $1"\t"$2"\t"$c}'`

thx

3 Answers

3

Maybe this can help:

[jaypal:~/Temp] cat tmp
2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP;CIPOS=-100,55;CIEND=-56,100;IMPRECISE;DBVARID=esv7540;VALIDATED;VALMETHOD=CGH;SVMETHOD=RP

[jaypal:~/Temp] var=$(awk -v FS="[ ;=]" '{print $1,$4,$24}' tmp)

[jaypal:~/Temp] echo $var
2 118610455 118610566

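FS is awk's built-in field separator variable. Its default is a single space, which awk treats specially: fields are split on runs of spaces and tabs. Since your line has more than one kind of delimiter, setting FS to a character class splits the line on each of them. The character class defined here matches a space, a semicolon or an equals sign. Because every single space now counts as a separator, the runs of spaces in your line produce empty fields, which is why 118610455 ends up as $4 and the END value as $24.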
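A variation on the same idea, offered only as a sketch and not as part of the original answer: appending + to the character class makes each run of delimiters count as one separator, so the fields are numbered consecutively and the END value becomes $9. The temporary file and the shortened sample line are assumptions made for the demo; a tab is included in the class in case the real file is tab-separated.

```shell
# Shortened copy of the sample line, written to a temporary file for the demo.
tmpfile=$(mktemp)
printf '%s\n' '2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP' > "$tmpfile"

# The trailing "+" collapses each run of spaces, tabs, ";" or "=" into ONE
# separator, so no empty fields appear: END is $8 and its value is $9.
result=$(awk -F'[ \t;=]+' '{print $1, $2, $9}' "$tmpfile")
echo "$result"   # prints: 2 118610455 118610566

rm -f "$tmpfile"
```

The field number still depends on END= being the first entry of the 8th column, just as $24 does in the version above.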

It might feel a little odd, but I use the following as a debugging tool for identifying columns whenever I have to parse a line with more than one delimiter. This is what I got from your line:

[jaypal:~/Temp] awk -v FS="[ ;=]" '{for(i=1;i<=NF;i++) print "$"i" is "$i}' tmp
$1 is 2
$2 is 
$3 is 
$4 is 118610455
$5 is 
$6 is 
$7 is P2_PM_2_5034
$8 is 
$9 is 
$10 is 
$11 is T
$12 is 
$13 is 
$14 is <DUP:TANDEM>
$15 is 
$16 is 
$17 is 
$18 is 40
$19 is 
$20 is .
$21 is 
$22 is 
$23 is END
$24 is 118610566
$25 is SVLEN
$26 is 110
$27 is SVTYPE
$28 is TDUP
$29 is CIPOS
$30 is -100,55
$31 is CIEND
$32 is -56,100
$33 is IMPRECISE
$34 is DBVARID
$35 is esv7540
$36 is VALIDATED
$37 is VALMETHOD
$38 is CGH
$39 is SVMETHOD
$40 is RP

You can also use awk's built-in substr function, provided END= is the first entry of the 8th column and its value is 9 digits long:

[jaypal:~/Temp] awk '{print $1,$2,$8=substr($8,5,9)}' tmp
2 118610455 118610566
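
If the position of END= or the length of its value can vary, a slightly longer sketch is to split the 8th column on ; and pick the entry that starts with END=. The sample line below is a hypothetical variation of the one in the question (END= moved to second place) to show that the position no longer matters; the array name info is likewise just an illustration.

```shell
# Hypothetical variation of the sample line: END= is NOT the first entry here.
line='2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   SVLEN=110;END=118610566;SVTYPE=TDUP;CIEND=-56,100'

result=$(printf '%s\n' "$line" | awk '{
    n = split($8, info, ";")                 # break column 8 into KEY=VALUE entries
    for (i = 1; i <= n; i++) {
        if (info[i] ~ /^END=/) {             # anchored, so CIEND= does not match
            print $1, $2, substr(info[i], 5) # drop the 4-character "END=" prefix
            break
        }
    }
}')
echo "$result"   # prints: 2 118610455 118610566
```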

1 Comment

Thanks, but can you explain FS="[ ;=]" a bit? I don't know why 118610455 becomes the 4th column.
1

With a little string manipulation you can get it in one go.

what_i_want=$(awk '{sub(/^END=/,"",$8); sub(/;.*$/,"",$8); print $1,$2,$8}' line)

Some explanation:

sub(a,b,c) searches for the first match of pattern a in variable c and replaces it with b, storing the modified string back into c. Patterns are written within //.

^ is the beginning of the string, $ is the end, . is any single character, and * means zero or more of the preceding item. So in our case:

sub(/^END=/,"",$8); matches END= at the beginning (^) of the string and replaces it with "", nothing, essentially deleting it.

sub(/;.*$/,"",$8); takes everything (.*) from ; to the end ($) and deletes it. Note that awk, like most regex engines, returns the leftmost, longest match: the match starts at the first ; in the string and * is greedy, so it extends as far as it can. That is how we know everything from the first ; onward is removed.

And all we are left with is the number you want.
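
As a side note on the literal question (how to get the shell variable c into awk): inside single quotes the shell never expands $c, and awk itself reads $c as "the field whose number is stored in the awk variable c". Since c is unset inside awk, that evaluates to $0, the whole line. One common way around this, sketched here under the assumption that the line sits in a file (a temporary one created just for the demo), is to pass the value with awk's -v option and refer to it as plain c:

```shell
# Temporary file standing in for the "line" file from the question.
line_file=$(mktemp)
printf '%s\n' '2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP' > "$line_file"

# Step 1: extract the END value into a shell variable.
c=$(awk '{print $8}' "$line_file" | sed 's/;.*//; s/^END=//')

# Step 2: hand it to awk with -v; inside the script it is the awk variable c
# (no dollar sign, because $c would mean "field number c").
what_i_want=$(awk -v c="$c" '{print $1 "\t" $2 "\t" c}' "$line_file")
echo "$what_i_want"

rm -f "$line_file"
```

This prints the three values separated by tabs. The single awk call above remains the shorter route; this only shows how the two-step version can be made to work.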

1 Comment

Thanks, works well... but can you explain a bit about sub(/;.*/,"",$8)? I understand it truncates the part after ;, right? But I don't understand what . and * mean here.
0

If your "columns" are always separated by whitespace, then you don't need subshells or awk at all. You can do this directly in the shell (the <<< here-string is a bash feature):

[ghoti@pc ~]$ read one two three four five junk <<< "2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP;CIPOS=-100,55;CIEND=-56,100;IMPRECISE;DBVARID=esv7540;VALIDATED;VALMETHOD=CGH;SVMETHOD=RP"
[ghoti@pc ~]$ echo "$five"
<DUP:TANDEM>
[ghoti@pc ~]$ echo "$junk"
40 . END=118610566;SVLEN=110;SVTYPE=TDUP;CIPOS=-100,55;CIEND=-56,100;IMPRECISE;DBVARID=esv7540;VALIDATED;VALMETHOD=CGH;SVMETHOD=RP

The last variable specified on your read line gets "everything else".

Also, if you're handling multiple lines like this, you can run it in a loop:

while read -r one two three four five junk; do
  echo "$one - $two - $five"
done < /path/to/inputfile

Salt to taste.
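
To get all the way to the three-column output from the question with this approach, one possible sketch is to read eight named columns and trim the eighth with parameter expansion. The variable names (chrom, pos, info and so on) and the temporary input file are assumptions made for illustration, and the trimming relies on END= being the first entry of the 8th column.

```shell
# Temporary input file holding the sample line (stand-in for the real file).
input=$(mktemp)
printf '%s\n' '2   118610455   P2_PM_2_5034    T   <DUP:TANDEM>    40  .   END=118610566;SVLEN=110;SVTYPE=TDUP' > "$input"

result=$(
    while read -r chrom pos id ref alt qual filter info rest; do
        end=${info#END=}    # strip the leading "END="
        end=${end%%;*}      # strip everything from the first ";" onward
        printf '%s %s %s\n' "$chrom" "$pos" "$end"
    done < "$input"
)
echo "$result"   # prints: 2 118610455 118610566

rm -f "$input"
```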
