2

This Perl script doesn't apply the substitution to $text_str as expected:

my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '@RunWith(JUnit4.class)\\n\\1 {';
eval '$text_str =~ ' . "s#$find#$replace#mg";
say "$text_str";

Output (wrong):

(JUnit4.class)
public class ZipUtilTest {}

This revised Perl script (with the '@' in $replace escaped) substitutes as expected:

my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '@RunWith(JUnit4.class)\\n\\1 {';
$replace =~ s/@/\\@/g;  # Escape '@' to avoid Perl @var interpolation
eval '$text_str =~ ' . "s#$find#$replace#mg";
say "$text_str";

Output (correct):

@RunWith(JUnit4.class)
public class ZipUtilTest {}

It looks like '@RunWith' in the 'replace' pattern is treated as a Perl @variable and interpolated to an empty string.

Is there a better way to handle this than escaping the '@' character in patterns? If we have to do this, do any other '@'-like characters need to be escaped?

  • TL;DR \@RunWith(JUnit4.class)\\n\\1 Commented Feb 14, 2023 at 18:18
  • What is your ultimate goal here? Are you trying to support replacement strings that can contain not just special characters such as newlines but also escape sequences such as \n, but not variables? Are those replacement strings trusted? Commented Feb 14, 2023 at 18:43
  • @amon. That's right. The replacement strings are valid regex patterns and '@' is just a regular character and shouldn't be considered a Perl variable. Commented Feb 14, 2023 at 19:21

3 Answers

3

You can use a positive lookahead to match the { without consuming it, so the replacement does not have to put it back. The replacement string then does not need to contain the $1 either.

When building a regex, it's better to use the regex quoting operator qr{} than strings; it will quote like a regex, not a string. This can avoid subtle bugs.
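One such subtle bug can be sketched as follows. This is only an illustration; the variable names and the sample text are made up for it. In a double-quoted string, \b is a backspace character, while inside qr// it remains a word-boundary assertion:

```perl
use strict;
use warnings;
use feature 'say';

my $as_string = "\bcat\b";    # double quotes turn \b into a backspace character
my $as_regex  = qr/\bcat\b/;  # qr// keeps \b as a word-boundary assertion

my $text = 'a cat sat';
my $string_hit = $text =~ /$as_string/ ? 1 : 0;
my $regex_hit  = $text =~ $as_regex    ? 1 : 0;

say "string pattern matched: $string_hit";
say "qr pattern matched: $regex_hit";
```

The string version looks for literal backspace characters around "cat" and so does not match the sample text, whereas the qr// version does.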

use v5.10;

my $text_str = 'public class ZipUtilTest extends TestCase {}';

# Use a positive lookahead to match, but not consume, the {
# Quote as regex to avoid subtle quoting issues.
my $find = qr'^(public class \w+) extends TestCase(?=\s*\{)';

# Use double-quotes to interpolate the \n, but escape the \@.
my $replace = "\@RunWith(JUnit4.class)\n";

# Add the $1 to the end of the replacement.
$text_str =~ s{$find}{$replace$1};

say $text_str;

2

It seems you want to load language-agnostic search and replace patterns from a configuration file, and then apply them via a Perl script.

If that is your goal, then using eval is not appropriate since Perl has syntax that you do not want to support, as you found out.

It is not reasonable to try to work around those Perl-specific parts by trying to escape them, since that can get rather complex. For example, you considered escaping occurrences of @ as they can introduce an array name, but what if that character already is backslash-escaped? Handling this properly would require an almost complete re-implementation of Perl's string literal syntax, which doesn't sound like fun.
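A small sketch of that pitfall, assuming a hypothetical config value in which the '@' was already escaped by hand. The naive escaping from the question then adds a second backslash, and the '@' is interpolated again. strict is deliberately left off here so that the stray array compiles inside the eval:

```perl
use feature 'say';

# Hypothetical config value in which the author already escaped the '@'.
my $replace = '\@RunWith(JUnit4.class)';

# Naive escaping, as in the question: prefix every '@' with a backslash.
(my $naive = $replace) =~ s/@/\\@/g;
say $naive;    # now two backslashes precede the '@'

# In the eval'ed s///, the doubled backslash is a literal backslash and
# '@RunWith' is once again interpolated as an (empty) array.
my $text = 'X';
eval '$text =~ ' . "s#X#$naive#";
say $text;     # a backslash followed by (JUnit4.class)
```

A correct escaper would have to know whether each '@' is already preceded by an odd number of backslashes, which is the start of re-implementing Perl's string syntax.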

What I would do is to define a replacement string syntax of our own, so that we're completely independent from Perl's syntax.

For example, we might define our replacement string syntax to be entirely verbatim, except that we support certain backslash-escapes. Let's say that the syntax '\' DIGIT such as \1 replaces a capture, and that the usual backslash escapes are supported (\b \t \n \v \f \r \" \' \\ \x0A), which is the common subset of JavaScript string literals, Python 3 string literals, and Perl escapes, minus octal escapes. Note that these languages do not agree on a syntax for Unicode characters.

We can implement an interpreter for this string replacement language as follows: we parse the replacement string into an array of opcodes, alternating a literal string with the number of a capture. For example, the replacement pattern abc\1def would be parsed into ['abc', 1, 'def']:

sub parse_replacement_pattern {
  my ($pattern) = @_;
  my @ops = ('');  # init with empty string

  # use m//gc style parsing which lets us anchor patterns at the current "pos"
  pos($pattern) = 0;
  while (pos $pattern < length $pattern) {
    if ($pattern =~ /\G([^\\]+)/gc) {
      $ops[-1] .= $1;
    }
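    elsif ($pattern =~ /\G\\([btnvfr"'\\])/gc) {
      # the basic single-character escapes, looked up in a table
      my %basic = (b => "\b", t => "\t", n => "\n", v => "\x0B", f => "\f",
                   r => "\r", '"' => '"', "'" => "'", '\\' => '\\');
      $ops[-1] .= $basic{$1};
    }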
    elsif ($pattern =~ /\G\\x([0-9a-fA-F]{2})/gc) {
      $ops[-1] .= chr hex $1;  # convert the two hex digits to a code point first
    }
    elsif ($pattern =~ /\G\\([1-9])/gc) {
      push @ops, $1, '';  # add replacement opcode + empty string
    }
    else {
      die "invalid syntax";
    }
  }

  return \@ops;
}

We can apply such a replacement pattern by looping through the operations, appending the literal string or the capture contents as appropriate.

sub apply_replacement_pattern {
  my ($ops) = @_;
  my $output = '';
  my $is_capture = 0;

  for my $op (@$ops) {
    if ($is_capture) {
      # we know that $op must be the number of a capture buffer
      # (@{^CAPTURE} requires Perl 5.26 or later)
      $output .= ${^CAPTURE}[$op - 1];  # like eval "\$$op"
    }
    else {
      # we know that $op must be a literal string
      $output .= $op;
    }
    $is_capture = !$is_capture;
  }

  return $output;
}

We can now use these functions in your test case:

my $text_str ='public class ZipUtilTest extends TestCase {}';
my $find = '^(public class \\w+) extends TestCase \\{';
my $replace = '@RunWith(JUnit4.class)\\n\\1 {';

my $replace_ops = parse_replacement_pattern($replace);
$text_str =~ s{$find}{apply_replacement_pattern($replace_ops)}mge;
say $text_str;

This produces the expected output

@RunWith(JUnit4.class)
public class ZipUtilTest {}

2

Here is a version that works: use $1 directly in the replacement side, not in a pre-made variable for it. It saves us some hassle.

use warnings;
use strict;
use feature 'say';

my $text_str = 'public class ZipUtilTest extends TestCase {}';
#say $text_str;

my $re = '^(public class \w+) extends TestCase \{';
#say $re;

my $replace =  "\@RunWith(JUnit4.class)\n";
#say $replace;

$text_str =~ s/$re/${replace}$1 {/;

say $text_str;

Update with comments

The variables for the pattern and the replacement string are read from a configuration file. Then the "hassle" I mention becomes more serious.

If $1 is to be prepared in the replacement-string variable, it must be a mere string (the characters $ and 1) there, while it needs to become a variable, and be evaluated, in the regex.

That means the variable must be eval-ed (or regex run with /ee), and that is the problem with the string form of eval -- input from outside: eval will evaluate (run) anything, any code. We don't need malicious action regarding text-to-become-code in config files, just consider typos.
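A minimal sketch of both points, using made-up strings; the $side_effect flag is hypothetical and exists only to make the danger visible. A replacement held in a variable is inserted as plain text unless it is evaluated with /ee, and /ee will run whatever code the string contains:

```perl
use strict;
use warnings;
use feature 'say';

our $side_effect = 0;   # hypothetical flag, only here to show that code ran

# The replacement arrives as plain text, so "$1" is just two characters.
my $replace = '"<$1>"';

(my $plain  = 'abc') =~ s/(b)/$replace/;    # interpolated once, not evaluated
(my $evaled = 'abc') =~ s/(b)/$replace/ee;  # evaluated as Perl code
say $plain;     # a"<$1>"c
say $evaled;    # a<b>c

# The same mechanism runs any code that ends up in the string.
my $hostile = '$main::side_effect = 1; "x"';
(my $victim = 'abc') =~ s/(b)/$hostile/ee;
say $victim;        # axc
say $side_effect;   # 1
```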

As for nicely escaping (only) what need be escaped, one can prepare for that, a hash for example:

my %esc_char = ( at => '\@' );  # etc

and use this when composing the variable with the replacement string.
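As a rough sketch of that idea, reusing the pattern and the eval-based substitution from the question (so it still carries the risks of string eval discussed above):

```perl
use strict;
use feature 'say';

my %esc_char = ( at => '\@' );  # etc

# Compose the replacement from the pre-escaped pieces.
my $replace = $esc_char{at} . 'RunWith(JUnit4.class)\n\1 {';
say $replace;

my $text_str = 'public class ZipUtilTest extends TestCase {}';
my $find     = '^(public class \w+) extends TestCase \{';

# Same eval-based substitution as in the question.
eval '$text_str =~ ' . "s#$find#$replace#mg";
say $text_str;
```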

If both the pattern and replacement must come from config files and must be non-specific to Perl, as a comment says, then I am not sure how to improve the code offered in the question. Except that it should be heavily protected against running (accidentally, say) bad code.

12 Comments

@JoeSmith It's not about @, that's relatively easy to handle nicely -- it's about that $1! Because that we want to be a variable in the replacement side, with what was captured. But if you are putting it in beforehand (in $replace), then at that time it can't be a variable. Then you have to go to eval (or /ee) and that's just way worse right off the bat. So ... may $1 be entered in the regex itself? Or must that, too, be in a replacement string loaded from config file?
@JoeSmith As for handling @, one can prepare a list of escaped characters (like my %esc_repl = ( at => '\@' ); etc), and use that when composing replacement strings. And all else remains unescaped.
"I know perl prefers $1" -- right, and it's been years that it prefers it strongly. I'd suggest to always use that. The old \1 now looks too close to other things
@JoeSmith Oh. If that config need be used by languages other than Perl as well then that's different. You can still process the replacement string once it's read into Perl, to escape @ (and whatever else), like you do. But if you must stay with \1 form then the problem is that an outdated form is being used in Perl. If this must be usable by other languages (is that so?) then I don't see how to improve it. What are the languages? Are \\n and \\1 really OK? That whole design seems potentially shaky, that arbitrary languages need be able to use the same, non-trivial regex?
@JoeSmith "how to update the code in question" -- If these (pattern + replacement) must be completely set in config files, which must not be Perl specific, then that should be stated in the question. I'd definitely want to know what other languages may get involved with them.