
Thanks to @cool_me5000 for helping me out with an overly simplified version of this question here: PERL: Using REGEX to match a string without the first token repeated in the string. (ABC, not AAA ABC)

Here is the adjusted question:

I am trying to use a regular expression to match the FIRST instance where ATE is followed by CAT without another ATE in between. In the text below, I want to match "ATE BAT CAT". Note that later in the string there are other ATE/CAT combinations that would also qualify (specifically "ATE DOG CAT" near the end of the string). Here is the text:

$TEXT = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

I first tried:

@finds1=$TEXT=~m/((ATE).*?(CAT))/;
$result = $finds1[0];
print "result = $result\n";

This prints the following:

result = ATE ATE ATE ATE BAT CAT

When what I want is just:

result = ATE BAT CAT

Note that I am trying to create a regular expression that works when the text between ATE and CAT (BAT above) is any string of characters. For example: ATE DOG CAT, ATE FAT GET HAT JOT KIN CAT, ATE YAK ULE INN OLD KOC JOG HUG GOT TAL CAT.

I next tried a negative lookahead combined with a conditional (if-then-else) expression. Here is the code:

@finds1=$TEXT=~m/(ATE(?(?!.*?ATE.*?CAT).*?CAT|Z{100}))/;
$result = $finds1[0];
print "result = $result\n";

The first part of the regex, (ATE, tells Perl to find an occurrence of ATE. Once found, Perl processes the conditional, whose condition is that there is no instance of .*?ATE.*?CAT following the ATE. If none is found, Perl looks for .*?CAT; if at least one is found, it searches for 100 instances of Z (my way of getting Perl to move on, since neither this text nor the text I'm trying to parse contains 100 Zs).

This returns:

result = ATE DOG CAT

I have considered using a positive look-behind after identifying CAT for the first time. However, as mentioned above, the number of characters between ATE and CAT in the first ATE/CAT combination without another ATE in between is variable. As far as I know, Perl can't do variable-length look-behinds.

Any help or direction you could provide would be GREATLY appreciated!!

Thanks in advance!

1 Answer

For the earlier question, the solution was:

my ($first) = $text =~ /(A[^AC]*C)/;

We used the negation of A|C then, so that means we need to use the negation of ATE|CAT here.

Something everyone should know is that (?:(?!STRING).) is to (?:STRING) as [^CHAR] is to CHAR: it matches a single character, but only at a position where STRING does not begin. (?:(?!PAT).) also works with some more complex patterns, including the ATE|CAT alternation above.
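
To make the analogy concrete, here is a minimal sketch. The sample strings are made up for illustration and are not from the question. The character class stops before the first C, while the lookahead version stops before the first CAT (and happily walks over the lone "CA"):

```perl
use strict;
use warnings;

# Single-character case: [^C]* consumes everything up to the first "C".
my ($chars) = "xxCyyC" =~ /^([^C]*)/;

# Multi-character case: (?:(?!CAT).)* consumes one character at a time,
# but only while "CAT" does not start at the current position.
my ($string) = "xxCAyyCATzz" =~ /^((?:(?!CAT).)*)/s;

print "chars  = $chars\n";    # prints "chars  = xx"
print "string = $string\n";   # prints "string = xxCAyy"
```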

So we get:

my ($first) = $text =~ /(ATE (?:(?!ATE|CAT).)* CAT)/sx;
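
As a quick check, here is that line dropped into a complete script using the string from the question (named $text here rather than $TEXT):

```perl
use strict;
use warnings;

my $text = "ATE ATE ATE ATE BAT CAT ATE DOG EGG ATE FOR GIN ATE DOG CAT ATE";

# /x allows whitespace inside the pattern for readability;
# /s lets "." match newlines too, in case the real text spans lines.
my ($first) = $text =~ /(ATE (?:(?!ATE|CAT).)* CAT)/sx;

print "result = $first\n";   # prints "result = ATE BAT CAT"
```

The first three ATEs are rejected because another ATE begins before any CAT is reached, so the match starts at the fourth one.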

Explanation:

You don't want "CAT" or "ATE" between "ATE" and "CAT", so

   +---------------- You don't want CAT or ATE starting here.
   |+--------------- You don't want CAT or ATE starting here.
   ||--+------------ You don't want CAT or ATE starting here.
   ||   +----------- You don't want CAT or ATE starting here.
   ||   |+---------- You don't want CAT or ATE starting here.
   ||   ||
   vv   vv
ATE??...??CAT

So that would be

/
   ATE
   (?! CAT|ATE ) .
   (?! CAT|ATE ) .
   ...
   (?! CAT|ATE ) .
   (?! CAT|ATE ) .
   CAT
/x

The repetition is handled using *, which gives (?:(?!ATE|CAT).)* as in the pattern above.
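
If the delimiters themselves vary, the same pattern can be built from variables. This is only a sketch: first_span is a hypothetical helper name, not something from the question, and quotemeta is used on the assumption that the delimiters are literal strings rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first span that starts with $open and
# ends with $close with neither delimiter in between, or undef if none.
sub first_span {
    my ($text, $open, $close) = @_;

    # Escape regex metacharacters so the delimiters are matched literally.
    my $o = quotemeta $open;
    my $c = quotemeta $close;

    my ($span) = $text =~ /(${o}(?:(?!${o}|${c}).)*${c})/s;
    return $span;
}

# The middle part can be any length, as in the examples from the question.
print first_span("ATE ATE DOG CAT", "ATE", "CAT"), "\n";
print first_span("ATE ATE FAT GET HAT JOT KIN CAT ATE DOG CAT", "ATE", "CAT"), "\n";
```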

2 Comments

perldoc.perl.org/perlre.html#Extended-Patterns scroll to "Look-Around Assertions" if you would like to know how this works :)
This is great! Thanks for introducing me to Non-Capturing Groups!
