Using REGEX to match a string without the first token repeated in the string. (PART 2)

Question

Thanks to @cool_me5000 for helping me out with an overly simplified version of this question here: PERL: Using REGEX to match a string without the first token repeated in the string. (ABC, not AAA ABC)

Here is the adjusted question:

I am trying to use a regular expression to match the FIRST instance where ATE is followed by CAT without another ATE in between ATE and CAT. I want to match "ATE BAT CAT". Note that in this text string there are other instances following the first ATE/CAT combination that could also fit the ATE/CAT pattern (specifically, note the "ATE DOG CAT" near the end of the string). Here is the text:

$TEXT = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

I first tried:

@finds1=$TEXT=~m/((ATE).*?(CAT))/;
$result = $finds1[0];
print "result = $result\n";

This prints the following:

result = ATE ATE ATE ATE BAT CAT

When what I want is just:

result = ATE BAT CAT

Note that I am trying to create a regular expression that could be used where the text between ATE and CAT (BAT in this example) could be any string of characters. For example: ATE DOG CAT, ATE FAT GET HAT JOT KIN CAT, ATE YAK ULE INN OLD KOC JOG HUG GOT TAL CAT.

I next tried to use a negative lookahead combined with a conditional (if-then-else) expression. Here is the code:

@finds1=$TEXT=~m/(ATE(?(?!.*?ATE.*?CAT).*?CAT|Z{100}))/;
$result = $finds1[0];
print "result = $result\n";

The first part of the regex, (ATE, tells Perl to find an occurrence of ATE. Once found, Perl processes the if-then-else construct, whose condition is that no instance of .*?ATE.*?CAT follows the ATE. If none is found, Perl looks for .*?CAT; if at least one is found, it searches for 100 instances of Z (my way of getting Perl to move on, since neither this text nor the text I'm trying to parse contains 100 Zs).

This returns:

result = ATE DOG CAT
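To see why the conditional attempt lands on the last pair, it can help to check the condition at each ATE separately. The following is only a rough sketch: the loop and the names `$rest`, `$start`, and `@passing` are introduced here for illustration and are not part of the original code. It assumes the text contains no newlines, so a plain substring test behaves like the lookahead.

```perl
use strict;
use warnings;

my $TEXT = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

# For every ATE, test the same condition as (?!.*?ATE.*?CAT):
# "no later ATE that is itself followed by CAT".
my @passing;
while ($TEXT =~ /ATE/g) {
    my $start = $-[0];                      # offset where this ATE begins
    my $rest  = substr($TEXT, pos($TEXT));  # everything after this ATE
    push @passing, $start if $rest !~ /ATE.*?CAT/;
}

print "condition holds at offsets: @passing\n";
```

The condition fails at every early ATE because the later "ATE DOG CAT" always satisfies the lookahead, so only the ATE of that last pair (and the trailing ATE, which has no CAT after it) gets through.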

I have considered using a positive look-behind after identifying CAT for the first time. However, as I mentioned above, the number of characters between ATE and CAT in the first ATE/CAT combination without an ATE in between is variable. As far as I know, Perl can't do variable-length look-behinds.

Any help or direction you could provide would be GREATLY appreciated!!

Thanks in advance!

ikegami · Accepted Answer · 2012-07-04 01:55:19Z

For the earlier question, the solution was:

my ($first) = $text =~ /(A[^AC]*C)/;

We used the negation of A|C then, so that means we need to use the negation of ATE|CAT here.

Something everyone should know is that (?:(?!STRING).) is to (?:STRING) as [^CHAR] is to CHAR. (?:(?!PAT).) also works with some more complex patterns, including the one above.
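As a small illustration of that analogy (a sketch with made-up sample strings, not taken from the question), the first match below stops before a single character and the second stops before a whole string:

```perl
use strict;
use warnings;

# Character-level negation: capture everything before the first "C".
my ($chars) = "xxCyy" =~ /^([^C]*)/;

# String-level negation: capture everything before the first "CAT".
# Each "." is consumed only if "CAT" does not start at that position,
# so the lone "CA" in the middle is allowed through.
my ($strings) = "xx CA yy CAT zz" =~ /^((?:(?!CAT).)*)/s;

print "chars   = '$chars'\n";
print "strings = '$strings'\n";
```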

So we get:

my ($first) = $text =~ /(ATE (?:(?!ATE|CAT).)* CAT)/sx;
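For reference, here is a minimal runnable sketch applying that pattern to the string from the question. The variable name `$text` follows the answer's snippet rather than the question's `$TEXT`.

```perl
use strict;
use warnings;

my $text = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

# /x lets the pattern contain spaces for readability;
# /s lets "." match newlines too, in case the real input spans lines.
my ($first) = $text =~ /(ATE (?:(?!ATE|CAT).)* CAT)/sx;

print "result = $first\n";   # prints: result = ATE BAT CAT
```

Note that, as written, the pattern treats ATE and CAT as plain substrings, so a word such as LATER would count as containing ATE. If whole words are required, wrapping each in \b is one possible adjustment.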

Explanation:

You don't want "CAT" or "ATE" between "ATE" and "CAT", so

   +---------------- You don't want CAT or ATE starting here.
   |+--------------- You don't want CAT or ATE starting here.
   ||--+------------ You don't want CAT or ATE starting here.
   ||   +----------- You don't want CAT or ATE starting here.
   ||   |+---------- You don't want CAT or ATE starting here.
   ||   ||
   vv   vv
ATE??...??CAT

So that would be

/
   ATE
   (?! CAT|ATE ) .
   (?! CAT|ATE ) .
   ...
   (?! CAT|ATE ) .
   (?! CAT|ATE ) .
   CAT
/x

The repetition is handled using *.
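As a quick check of that equivalence, the sketch below writes out the guarded dots by hand for this particular input (" BAT " is five characters, so five repetitions) and compares the result with the starred form. The names `$expanded` and `$starred` are chosen here for illustration.

```perl
use strict;
use warnings;

my $text = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

# Hand-expanded form: exactly five guarded characters between ATE and CAT.
my ($expanded) = $text =~ /
   (
      ATE
      (?! CAT|ATE ) .
      (?! CAT|ATE ) .
      (?! CAT|ATE ) .
      (?! CAT|ATE ) .
      (?! CAT|ATE ) .
      CAT
   )
/sx;

# General form: any number of guarded characters.
my ($starred) = $text =~ /(ATE (?:(?! CAT|ATE ).)* CAT)/sx;

print "expanded = $expanded\n";
print "starred  = $starred\n";
```

The hand-expanded version only works when the gap is exactly five characters long, which is why the starred form is the one to use in practice.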

edited Jul 4, 2012 at 1:55

answered Jul 4, 2012 at 1:24

ikegami

2 Comments

Stone Mason Over a year ago

perldoc.perl.org/perlre.html#Extended-Patterns scroll to "Look-Around Assertions" if you would like to know how this works :)

user1500158 Over a year ago

This is great! Thanks for introducing me to Non-Capturing Groups!
