First of all, "\w" produces the string w, and "\W" produces the string W.
$ perl -wE'say "\w\W"'
Unrecognized escape \w passed through at -e line 1.
Unrecognized escape \W passed through at -e line 1.
wW
You need to escape the backslash ("\\W") or use qr// (qr/\W/).
I'm pretty sure there are other problems with the pattern. I'm going to start from scratch.
Assumes k1 and k3 are both independently optional, you want:
qr/
(?: \b k1 \W+
(?: \w+ \W+ ){0,6}?
)?
\b k2 \b
(?:
(?: \W+ \w+ ){0,6}?
\W+ k3 \b
)?
/x
The word boundaries (\b) are there to ensure that we don't match abck2def or abck1 k2 k3def.
The above is inefficient.
Take for example the following regex pattern:
(?: x y )? x z
It can match the following strings:
xyxz
xz
Notice how both start with x? That means a better pattern (i.e. one that performs less backtracking) would be
x (?: y x )? z
There are a couple of instances of this anti-pattern in the above answer. So let's refactor.
qr/
\b
(?: k1 \W+ (?: \w+ \W+ ){0,6}? \b )?
k2 \b
(?: \W+ (?: \w+ \W+ ){0,6}? k3 \b )?
/x
Now we have something efficient.
In the above pattern, notice that the second \b is redundant. So let's get rid of it.
If we add a \b to the very end, the third and fourth \b become redundant.
After applying those simplifications, we get:
qr/
\b
(?: k1 \W+ (?: \w+ \W+ ){0,6}? )?
k2
(?: \W+ (?: \w+ \W+ ){0,6}? k3 )?
\b
/x
Personally, I strongly dislike the non-greediness modifier as anything but a optimization. Furthermore, the use of two of them is normally a giant red flag that there is a bug in the pattern. For example, the pattern can match k1 k1 k2, but that may not be desirable.
To eliminate them, we need to ensure the first \w+ doesn't match k1 or k2. This can be achieved by replacing
\b \w+ \b
with
(?! \b k1 \b ) (?! \b k2 \b ) \b \w+ \b
Again, we factor out common prefixes to get:
\b (?! (?: k2 | k3 ) \b ) \w+ \b
Similarly, we need to ensure that the second \w+ doesn't match k2 or k3.
With these changes, we get:
qr/
\b
(?: k1 \W+ (?: (?! (?: k1 | k2 ) \b ) \w+ \W+ ){0,6} )?
k2
(?: \W+ (?: (?! (?: k2 | k3 ) \b ) \w+ \W+ ){0,6} k3 )?
\b
/x
Complicated? yes. A better solution might start by breaking down the stream into word and non-word tokens. The advantage of this is that we don't have to worry about boundaries anymore.
my @tokens = split(/(\W+)/, $content, -1);
Then, the array is checked for the pattern. Since the regex engine is particular adept at doing this, we can leverage it as follows:
my $tokens =
join '',
map {
($_ % 2) ? "W"
: $words[$_] eq "k1" ? 1
: $words[$_] eq "k2" ? 2
: $words[$_] eq "k3" ? 3
: "w" # Non-key word
}
0..$#tokens;
while ($tokens =~ /(?: 1 W (?: w W ){0,6} )? 2 (?: W (?: w W ){0,6} 3 )?/xg) {
say join('', @tokens[ $-[0] .. $+[0] - 1 ]);
}
Given the that @tokens will always be of the form word, non-word, word, non-word, etc, we can also use the following:
my $words =
join '',
map {
($_ % 2) ? "" # We just want to look at the words
: $words[$_] eq "k1" ? 1
: $words[$_] eq "k2" ? 2
: $words[$_] eq "k3" ? 3
: "w" # Non-key word
}
0..$#tokens;
while ($words =~ /(?: 1 w{0,6} )? 2 (?: w{0,6} 3 )?/xg) {
say join('', @tokens[ $-[0] * 2 .. ( $+[0] - 1 ) * 2 ]);
}
use strict; use warnings qw( all );! It should catch a problem:\wand\Win a double-quoted string literal. Escape the slash or switch to usingqr//.qr/\b (?: k1 \W+ (?: \w+ \W+ ){0,6}? )? k2 (?: \W+ (?: \w+ \W+ ){0,6}? k3 )? \b/x