I'm trying to match the first and second occurrences of a string in Perl. The first few lines of input (contained in @intersect) are:

          'gi|112807938|emb|CU075707.1|_Xenopus_tropicalis_finished_cDNA,_clone_TNeu129d01  C1:TCONS_00039972(XLOC_025068),_12.9045:32.0354,_Change:1.3118,_p:0.00025,_q:0.50752  C2:TCONS_00045925(XLOC_029835),_10.3694:43.8379,_Change:2.07985,_p:0.0004,_q:0.333824',
          'gi|115528274|gb|BC124894.1|_Xenopus_laevis_islet-1,_mRNA_(cDNA_clone_MGC:154537_IMAGE:8320777),_complete_cds C1:TCONS_00080221(XLOC_049570),_17.9027:40.8136,_Change:1.18887,_p:0.00535,_q:0.998852  C2:TCONS_00092192(XLOC_059015),_17.8995:35.5534,_Change:0.990066,_p:0.0355,_q:0.998513',
          'gi|118404233|ref|NM_001078963.1|_Xenopus_(Silurana)_tropicalis_pancreatic_lipase-related_protein_2_(pnliprp2),_mRNA  C1:TCONS_00031955(XLOC_019851),_0.944706:5.88717,_Change:2.63964,_p:0.01915,_q:0.998852 C2:TCONS_00036655(XLOC_023660),_2.31819:11.556,_Change:2.31757,_p:0.0358,_q:0.998513',

The information I'm trying to extract is the 'Change:[value]' for both C1 and C2 (which are separated by tabs), using the following:

#!/usr/bin/perl -w
use strict; 
use File::Slurp;
use Data::Dumper;
$Data::Dumper::Sortkeys = 1;

my @log_change;
foreach (@intersect) {
    chomp;
    my @condition1_match = ($_ =~ /(C1:).*Change:(-?\d+\.\d+)/g);
    my @condition2_match = ($_ =~ /(C2:).*Change:(-?\d+\.\d+)/g);
    push @log_change, "@condition1_match\t@condition2_match";
}

print Dumper (\@log_change);

Prints:

      'C1: 2.07985    C2: 2.07985',
      'C1: 0.990066    C2: 0.990066',
      'C1: 2.31757    C2: 2.31757',

i.e. the same value for C1 and C2. It's clear that my loop stores the value for C2 in both @condition1_match and @condition2_match.

My question is: how can I specify that I want the first occurrence of 'Change:[value]' to be pushed onto @condition1_match and the second onto @condition2_match?

1 Answer

Your regexes match as much as possible because .* is greedy: it runs to the end of the line and then backtracks only as far as the last Change:, which belongs to C2. That is why both arrays end up with the C2 value. To fix this, make the quantifier lazy (non-greedy) by adding a question mark ? after it.

my @condition1_match = ($_ =~ /(C1:).*?Change:(-?\d+\.\d+)/g);
                                  #   ^
my @condition2_match = ($_ =~ /(C2:).*?Change:(-?\d+\.\d+)/g);
                                  #   ^

That way, the regex matches as few characters as possible after C1: (or C2:) before it 'sees' the first Change:(-?\d+\.\d+).
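To see the difference side by side, here is a minimal sketch that runs the greedy and the lazy pattern against one line. The sample string is a shortened version of the first input record (an assumption for brevity; the real lines carry more fields), and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Shortened, tab-separated sample line; values come from the first input record.
my $line = "clone_TNeu129d01\tC1:TCONS_00039972,_Change:1.3118,_p:0.00025\tC2:TCONS_00045925,_Change:2.07985,_p:0.0004";

# Greedy: .* runs to the end of the line, then backtracks to the LAST "Change:".
my ($greedy) = $line =~ /C1:.*Change:(-?\d+\.\d+)/;

# Lazy: .*? grows one character at a time and stops at the FIRST "Change:".
my ($lazy) = $line =~ /C1:.*?Change:(-?\d+\.\d+)/;

print "greedy: $greedy\n";   # greedy: 2.07985
print "lazy:   $lazy\n";     # lazy:   1.3118
```

The /g modifier is left out here because each pattern can match only once per line anyway: C1: appears a single time.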

You can use an online regex tester to check exactly what your pattern matches, for example this site.
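A quick local check is also possible. In list context, a /g match returns the capture from every match, so printing the result shows what a pattern picks up. The sketch below is one possible alternative rather than the approach from the question: it collects every Change value in a single match and assumes the C1 field always precedes the C2 field. The sample line and the names @changes, $c1_change and $c2_change are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical shortened record; real lines carry more fields.
my $line = "some_id\tC1:TCONS_A,_Change:1.18887,_q:0.99\tC2:TCONS_B,_Change:0.990066,_q:0.99";

# In list context, /g returns the capture from every match, left to right.
my @changes = $line =~ /Change:(-?\d+\.\d+)/g;

# First occurrence belongs to C1, second to C2 (assumes that field order).
my ($c1_change, $c2_change) = @changes;

print "C1: $c1_change\tC2: $c2_change\n";
```

Because no .* is involved, greediness is not an issue here. The trade-off is that the pattern no longer checks which condition each value belongs to.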

1 Comment

@Nick You're welcome! I added some more content and a site to help you whenever you work with regexes. ^^