0

I am learning Perl. I want to find all occurences of 3 keywords in this order: keyword1, keyword2 and keyword3 in a text. keyword1 and keyword3 are optional. It can have up to 6 words between keywords. This is the code in Perl:

#!/usr/bin/perl
$reg="(keyword1)*\W*(?:\w+\W+){0,6}?(keyword2)\W*(?:\w+\W+){0,6}?(keyword3)*";
$content="some words before keyword1 optional word here then keyword2 again optional words then keyword3 others words after.";
while ($content=~m/$reg/g) {
    print "$&\n";
} 

I want to extract only the substring keyword1 optional word here then keyword2 again optional words then keyword3 but I got keyword2. Thank you.

6
  • Your spec is far from clear. Do you have an actual use case or is this just a random exercise? Commented Dec 7, 2016 at 18:48
  • 5
    Always use use strict; use warnings qw( all );! It should catch a problem: \w and \W in a double-quoted string literal. Escape the slash or switch to using qr//. Commented Dec 7, 2016 at 19:13
  • 2
    qr/\b (?: k1 \W+ (?: \w+ \W+ ){0,6}? )? k2 (?: \W+ (?: \w+ \W+ ){0,6}? k3 )? \b/x Commented Dec 7, 2016 at 19:21
  • @ikegami This works perfectly!!!. Thank you very very very much. Commented Dec 7, 2016 at 21:34
  • @melpomene, this is an exercise. Thank for the comment. Commented Dec 7, 2016 at 21:43

1 Answer 1

2

First of all, "\w" produces the string w, and "\W" produces the string W.

$ perl -wE'say "\w\W"'
Unrecognized escape \w passed through at -e line 1.
Unrecognized escape \W passed through at -e line 1.
wW

You need to escape the backslash ("\\W") or use qr// (qr/\W/).


I'm pretty sure there are other problems with the pattern. I'm going to start from scratch.

Assumes k1 and k3 are both independently optional, you want:

qr/
    (?: \b k1 \W+
        (?: \w+ \W+ ){0,6}?
    )?

    \b k2 \b

    (?: 
        (?: \W+ \w+ ){0,6}?
        \W+ k3 \b
    )?
/x

The word boundaries (\b) are there to ensure that we don't match abck2def or abck1 k2 k3def.


The above is inefficient.

Take for example the following regex pattern:

(?: x y )? x z

It can match the following strings:

xyxz
xz

Notice how both start with x? That means a better pattern (i.e. one that performs less backtracking) would be

x (?: y x )? z    

There are a couple of instances of this anti-pattern in the above answer. So let's refactor.

qr/
    \b
    (?: k1 \W+ (?: \w+ \W+ ){0,6}? \b )?
    k2 \b
    (?: \W+ (?: \w+ \W+ ){0,6}? k3 \b  )?
/x

Now we have something efficient.


In the above pattern, notice that the second \b is redundant. So let's get rid of it.

If we add a \b to the very end, the third and fourth \b become redundant.

After applying those simplifications, we get:

qr/
    \b
    (?: k1 \W+ (?: \w+ \W+ ){0,6}? )?
    k2
    (?: \W+ (?: \w+ \W+ ){0,6}? k3 )?
    \b
/x

Personally, I strongly dislike the non-greediness modifier as anything but a optimization. Furthermore, the use of two of them is normally a giant red flag that there is a bug in the pattern. For example, the pattern can match k1 k1 k2, but that may not be desirable.

To eliminate them, we need to ensure the first \w+ doesn't match k1 or k2. This can be achieved by replacing

\b \w+ \b

with

(?! \b k1 \b ) (?! \b k2 \b ) \b \w+ \b

Again, we factor out common prefixes to get:

\b (?! (?: k2 | k3 ) \b ) \w+ \b

Similarly, we need to ensure that the second \w+ doesn't match k2 or k3.

With these changes, we get:

qr/
    \b
    (?: k1 \W+ (?: (?! (?: k1 | k2 ) \b ) \w+ \W+ ){0,6} )?
    k2
    (?: \W+ (?: (?! (?: k2 | k3 ) \b ) \w+ \W+ ){0,6} k3 )?
    \b
/x

Complicated? yes. A better solution might start by breaking down the stream into word and non-word tokens. The advantage of this is that we don't have to worry about boundaries anymore.

my @tokens = split(/(\W+)/, $content, -1);

Then, the array is checked for the pattern. Since the regex engine is particular adept at doing this, we can leverage it as follows:

my $tokens =
   join '',
      map {
         ($_ % 2) ? "W"
         : $words[$_] eq "k1" ? 1
         : $words[$_] eq "k2" ? 2
         : $words[$_] eq "k3" ? 3
         : "w"                      # Non-key word
      }
         0..$#tokens;

while ($tokens =~ /(?: 1 W (?: w W ){0,6} )? 2 (?: W (?: w W ){0,6} 3 )?/xg) {
   say join('', @tokens[ $-[0] .. $+[0] - 1 ]);
}

Given the that @tokens will always be of the form word, non-word, word, non-word, etc, we can also use the following:

my $words =
   join '',
      map {
         ($_ % 2) ? ""              # We just want to look at the words
         : $words[$_] eq "k1" ? 1
         : $words[$_] eq "k2" ? 2
         : $words[$_] eq "k3" ? 3
         : "w"                      # Non-key word
      }
         0..$#tokens;

while ($words =~ /(?: 1 w{0,6} )? 2 (?: w{0,6} 3 )?/xg) {
   say join('', @tokens[ $-[0] * 2 .. ( $+[0] - 1 ) * 2 ]);
}
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you again for the detailed answer!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.