2

I'm cleaning some text directly in my query, and rather than using nested replace functions, I found this bit of code that uses perl to perform multiple replacements at once: multi-replace with perl

CREATE FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[]) 
RETURNS text 
AS $BODY$ 
  my ($string, $orig, $repl) = @_;
  my %subs;

  if (@$orig != @$repl) {
     elog(ERROR, "array sizes mismatch");
  } 
  if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
     elog(ERROR, "array dimensions mismatch");
  } 

  @subs{@$orig} = @$repl;
  
  my $re = join "|",
     sort { (length($b) <=> length($a)) } keys %subs;
  $re = qr/($re)/;

  $string =~ s/$re/$subs{$1}/g;

  return $string;
$BODY$ language plperl strict immutable;

Example query:

select

name as original_name, 
multi_replace(name, '{-,&,LLC$}', '{_,and,business}') as cleaned_name

from some_table

The function finds the pattern LLC at the end of the name string but removes it instead of replacing it with "business."

How can I make this work as intended?

3 Answers 3

2

When the strings in @$orig are to be matched literally, I'd actually use this:

my ($string, $orig, $repl) = @_;

# Argument checks here.

my %subs; @subs{ @$orig } = @$repl;

my $pat =
   join "|",
      map quotemeta,
         sort { length($b) <=> length($a) }
            @$orig;

return $string =~ s/$re/$subs{$&}/gr;

In particular, map quotemeta, was missing.

(By the way, the sort line isn't needed if you ensure that xy comes before x in @$orig when you want to replace both x(?!y) and xy.)


But you want the strings in @$orig to be treated as regex patterns. For that, you can use the following:

# IMPORTANT! Only provide strings from trusted sources in
# `@$orig` as it allows execution of arbitrary Perl code.

my ($string, $orig, $repl) = @_;

# Argument checks here.

my $re =
   join "|",
      map "(?:$orig->[$_])(?{ $_ })",
         0..$#$orig;

{
   use re qw( eval );
   $re = qr/$re/;
}

return $string =~ s/$re/$repl->[$^R]/gr;

However, in your environment, I have doubts about the availability of use re qw( eval ); and (?{ }), so the above may be an unviable solution for you.

my ($string, $orig, $repl) = @_;

# Argument checks here.

my $re =
   join "|",
      map "(?<_$_>$orig->[$_])",
         0..$#$orig;

$re = qr/$re/;

return
   $string =~ s{$re}{
      my ( $n ) =
         map substr( $_, 1 ),
            grep { $-{$_} && defined( $-{$_}[0] ) }
               grep { /^_\d+\z/aa }
                  keys( %- );

      $repl->[$n]
   }egr;
Sign up to request clarification or add additional context in comments.

1 Comment

Added to my answer.
1

While the regexp tests for LLC$ with the special meaning of the $, what gets captured into $1 is just the string LLC and so it doesn't find the look-up value to replace.

If the only thing you care about is $, then you could fix it by changing the map-building lines to:

 @subs{map {my $t=$_; $t=~s/\$$//; $t} @$orig} = @$repl;

 my $re = join "|",
    sort { (length($b) <=> length($a)) } @$orig;

But it will be very hard to make it work more generally for every possible feature of regex.

Comments

1

The purpose of this plperl function in the linked blog post is to find/replace strings, not regular expressions. LLC being found with LLC$ as a search term does not happen in the original code, as the search terms go through quotemeta before being included into $re (as also sugggested in ikegami's answer)

The effect of removing the quotemeta transformation is that LLC at the end of a string is matched, but since as a key it's not found in $subs (because the key there isLLC$), then it's getting replaced by an empty string.

So how to make this work with regular expressions in the orig parameter?

The solution proposed by @ikegami does not seem usable from plperl, as it complains with this error: Unable to load re.pm into plperl.

I thought of an alternative implementation without the (?{ code }) feature: each match from the main alternation regexp can be rechecked against each regexp in orig, in a code block run with /ge. On the first match, the corresponding string in repl is selected as the replacement. Code:

CREATE or replace FUNCTION pg_temp.multi_replace(string text, orig text[], repl text[]) 
RETURNS text AS
$BODY$
  my ($string, $orig, $repl) = @_;
  my %subs;

  if (@$orig != @$repl) {
     elog(ERROR, "array sizes mismatch");
  } 
  if (ref @$orig[0] eq 'ARRAY' || ref @$repl[0] eq 'ARRAY') {
     elog(ERROR, "array dimensions mismatch");
  } 

  @subs{@$orig} = @$repl;
  
  my $re = join "|", keys %subs;
  $re = qr/($re)/;

  # on each match, recheck the match individually against each regexp
  # to find the corresponding replacement string
  $string =~ s/$re/{ my $r; foreach (@$orig) { if ($1 =~ $_) {$r=$subs{$_}; last;} } $r;}/ge;

  return $string;
$BODY$ language plperl strict immutable;

Test

=> select pg_temp.multi_replace(
    'bar foo - bar & LLC',
    '{^bar,-,&,LLC$}',
    '{baz,_,and,business}'
   );

       multi_replace        
----------------------------
 baz foo _ bar and business

1 Comment

You should precompile the patterns in @$orig: @$orig = map qr/$_/, @$orig;. Still, this could perform quite poorly for long lists. I've added my answer to add a solution that doesn't suffer from this problem in a sec.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.