I have a text file with lines like this:

Sequences (1:4) Aligned. Score:  4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...

I want to be able to extract the values into a new tab delimited text file:

1 4 4
100 3011 77
12 345 100

(like this but with tabs instead of spaces)

Can anyone suggest anything? Some combination of sed or cut maybe?

3 Answers

3

You can use Perl:

cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'

Or, to save to file:

cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt

A little explanation:

The regex here has the form:

s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/

How to match = .*?(\d+):(\d+).*?(\d+)

How to replace = $1\t$2\t$3

In our case, we used the following tokens to declare how we want to match the string:

  • .*? - match any character ('.') zero or more times ('*'), but as few times as possible ('?'), so matching stops as soon as the next token in the regex (\d in our case) can match.

  • \d+:\d+ - match at least one digit, followed by a colon and at least one more digit

  • .*? - same as above

  • \d+ - match at least one digit

Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses will be known as '$1', the second as '$2', etc. In our case:

.*?(\d+):(\d+).*?(\d+)
     $1    $2      $3

Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):

$1\t$2\t$3

5 Comments

Brilliant! Thank you - if you have a minute would you mind breaking it down so I can see how it works? Thanks again!
@squiguy, true, but I like it this way - it keeps my cursor closer to the regex which I tend to correct often, so it helps me save a second a day ;)
@AmyEllison, sure, I've added explanation. Hope it helps.
@kamituel That is really helpful. But I notice the output is not tab separated. It would be no big deal to do in text editor/spreadsheet program but my actual file is over 100 million lines long so can't open in excel/libre office etc. Can you adjust your perl code? thanks!
@AmyEllison - just use \t instead of space. \t is a symbol for tab. I'll update the code in a sec.
2

You could use sed:

sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile

Here's a BSD sed compatible version:

sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile

Both of the above solutions leave a trailing tab in the output. To remove it, append a second substitution: s/\t$// for the GNU version or s/'$'\t''$// for the BSD version.
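
As a rough sketch of how the cleanup can be combined with the main substitution, the version below keeps a literal tab in a shell variable so that no `\t` escape is needed inside sed. The names `tab` and `extract_numbers` are hypothetical, and the second expression strips any run of trailing tabs rather than exactly one, which is a small deviation from the commands above.

```shell
# A literal tab character, stored in a variable so sed never sees a \t escape.
tab=$(printf '\t')

extract_numbers() {
    # First expression: drop non-digits and put a tab after each run of digits.
    # Second expression: remove any tab(s) left at the end of the line.
    sed -e "s/[^0-9]*\([0-9]*\)/\1${tab}/g" -e "s/${tab}*\$//" "$@"
}

# Reads stdin when no file name is given.
printf '%s\n' 'Sequences (100:3011) Aligned. Score: 77' | extract_numbers
```

With a file, the call would be `extract_numbers infile > outfile`, where `outfile` is whatever name you choose.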

If you know there will always be 3 numbers per line, you could go with grep:

<infile grep -o '[0-9]\+' | paste - - -

Output in all cases:

1       4       4       
100     3011    77      
12      345     100     
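
The grep and paste approach silently shifts the columns if any line has more or fewer than three numbers, so a quick sanity check on the result may be worthwhile for a very large file. A minimal sketch, assuming the result was saved to a file; the helper name `count_bad_lines` is hypothetical:

```shell
count_bad_lines() {
    # Count lines that do not consist of exactly three tab-separated fields.
    awk -F '\t' 'NF != 3 { bad++ } END { print bad + 0 }' "$@"
}

# Example: one well-formed line followed by a line with only two fields.
printf '1\t4\t4\n100\t3011\n' | count_bad_lines
```

Lines that still carry the trailing tab from the sed versions would be counted as well, since the tab creates an empty fourth field.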

1

My solution using sed (\t in the replacement requires GNU sed):

sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1\t\2\t\3/' file.txt
