Extract certain text from each line of text file using UNIX or perl

Question

I have a text file with lines like this:

Sequences (1:4) Aligned. Score:  4
Sequences (100:3011) Aligned. Score: 77
Sequences (12:345) Aligned. Score: 100
...

I want to be able to extract the values into a new tab delimited text file:

1 4 4
100 3011 77
12 345 100

(like this but with tabs instead of spaces)

Can anyone suggest anything? Some combination of sed or cut maybe?

kamituel · Accepted Answer · 2013-03-07 21:23:17Z

3

You can use Perl:

cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/'

Or, to save to file:

cat data.txt | perl -pe 's/.*?(\d+):(\d+).*?(\d+)/$1\t$2\t$3/' > data2.txt

A little explanation:

The regex here has the form:

s/RULES_HOW_TO_MATCH/HOW_TO_REPLACE/

How to match = .*?(\d+):(\d+).*?(\d+)

How to replace = $1\t$2\t$3

In our case, we used the following tokens to declare how we want to match the string:

.*? - match any character ('.') as few times as possible ('*?' is the non-greedy form of '*'), stopping as soon as the rest of the regex can match (it starts with \d in our case).
\d+:\d+ - match one or more digits, followed by a colon and one or more digits
.*? - same as above
\d+ - match one or more digits

Additionally, if some token in the regex is in parentheses, it means "save it so I can reference it later". The first pair of parentheses will be known as '$1', the second as '$2', etc. In our case:

.*?(\d+):(\d+).*?(\d+)
     $1    $2      $3

Finally, we're taking $1, $2, $3 and printing them out separated by tab (\t):

$1\t$2\t$3

edited Mar 7, 2013 at 21:23

answered Mar 7, 2013 at 21:00

kamituel

5 Comments

Amy Ellison Over a year ago

Brilliant! Thank you - if you have a minute would you mind breaking it down so I can see how it works? Thanks again!

kamituel Over a year ago

@squiguy, true, but I like it this way - it keeps my cursor closer to the regex which I tend to correct often, so it helps me save a second a day ;)

kamituel Over a year ago

@AmyEllison, sure, I've added explanation. Hope it helps.

Amy Ellison Over a year ago

@kamituel That is really helpful. But I notice the output is not tab separated. It would be no big deal to do in text editor/spreadsheet program but my actual file is over 100 million lines long so can't open in excel/libre office etc. Can you adjust your perl code? thanks!

kamituel Over a year ago

@AmyEllison - just use \t instead of space. \t is a symbol for tab. I'll update the code in a sec.

Thor · Accepted Answer · 2013-03-07 21:29:30Z

2

You could use sed:

sed 's/[^0-9]*\([0-9]*\)/\1\t/g' infile

Here's a BSD sed compatible version:

sed 's/[^0-9]*\([0-9]*\)/\1'$'\t''/g' infile

The above solutions leave a trailing tab in the output; append s/\t$// or s/'$'\t''$// respectively to the sed script to remove it.
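As a sketch of how the substitution and the trailing-tab cleanup could be combined in one command, the snippet below keeps a literal tab in a shell variable, so the script does not depend on sed understanding `\t`. The `extract_numbers` helper and the `tab` variable are hypothetical names, and the cleanup strips any run of trailing tabs rather than exactly one, which is a small deviation from the commands above.

```shell
# A literal tab, so the same script should work with both GNU and BSD sed.
tab=$(printf '\t')

# Hypothetical helper: turn each run of digits into a tab-terminated field,
# then strip the tab(s) left at the end of the line.
extract_numbers() {
  sed -e "s/[^0-9]*\([0-9]*\)/\1${tab}/g" -e "s/${tab}*\$//"
}

printf '%s\n' 'Sequences (1:4) Aligned. Score:  4' \
              'Sequences (100:3011) Aligned. Score: 77' | extract_numbers
```

To run it on a file, redirect the input, for example `extract_numbers < infile > outfile`.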

If you know there will always be 3 numbers per line, you could go with grep:

<infile grep -o '[0-9]\+' | paste - - -

Output in all cases:

1       4       4       
100     3011    77      
12      345     100
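The grep variant relies on every line contributing exactly three numbers, since `paste - - -` simply regroups its input three lines at a time. The sketch below wraps it in a hypothetical `numbers_to_rows` helper (using `grep -oE '[0-9]+'`, which should behave the same as the `\+` form) and then feeds it a made-up line with a fourth number to show how the rows drift.

```shell
# Hypothetical helper: grep prints one number per line, paste regroups them in threes.
numbers_to_rows() {
  grep -oE '[0-9]+' | paste - - -
}

# Well-formed input: three numbers per line, so rows line up with input lines.
printf '%s\n' 'Sequences (1:4) Aligned. Score:  4' \
              'Sequences (100:3011) Aligned. Score: 77' | numbers_to_rows

# Made-up input with an extra number on the first line: every later row is shifted,
# because paste only counts fields and knows nothing about the original lines.
printf '%s\n' 'Sequences (1:4) Aligned. Score:  4 of 9' \
              'Sequences (100:3011) Aligned. Score: 77' | numbers_to_rows
```

If the input might contain stray numbers, the per-line sed or perl substitutions are the safer choice.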

edited Mar 7, 2013 at 21:29

answered Mar 7, 2013 at 21:02

Thor

user529758 · Accepted Answer · 2013-03-07 21:04:03Z

1

My solution using sed:

sed 's/[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*/\1\t\2\t\3/' file.txt

answered Mar 7, 2013 at 21:04

user529758
