1

I need to implement a preg_replace to fix some warnings that I get in a huge number of scripts.

My goal is to replace statements like...

$variable[key] = "WhatElse";
$result = $wso->RSLA("7050", $vegalot, "600", "WFID_OK_WEB","1300", $_POST[username]);
if ($result[ECD] != 0) {
if ($line=="AAAA" && in_array(substr($wso->lot,0,7),$lot_aaaa_list) && $lot[wafer][25]) {

... with the same statements, where the bare CONSTANTS are replaced by quoted ARRAY KEYS ...

$variable['key'] = "WhatElse";
$result = $wso->RSLA("7050", $vegalot, "600", "WFID_OK_WEB","1300", $_POST['username']);
if ($result['ECD'] != 0) {
if ($line=="AAAA" && in_array(substr($wso->lot,0,7),$lot_aaaa_list) && $lot[wafer][25]) {

but excluding cases where the array variable appears inside a string, i.e. ...

$output = "<input name='variable[key]' has to be preserved as it is.";
$output = 'Even this string variable[key] has to be preserved as it is.';

...because they would otherwise be turned into the following, which is not what I want:

$output = "<input name='variable['key']' has to be preserved as it is.";
$output = 'Even this string variable['key'] has to be preserved as it is.';

Every statement is identified by a preg_match_all call and then replaced with str_replace:

preg_match_all('/(\[(\w*)\])/', $str, $matches, PREG_SET_ORDER, 0);
$replace_str = $str;
$local_changeflag = false;
foreach($matches as $m) {
    if (!$m[2]) continue;
    if (is_numeric($m[2])) continue;
    $replace_str = str_replace($m[1], "['" . $m[2] . "']", $replace_str);
    $local_changeflag = true;
}

Do you have any suggestion to better solve such issue that I have?

3
  • Try like this demo to skip quoted parts (not sure if the idea is good at all). Commented Dec 22, 2021 at 11:01
  • Or, this one, if you only want to match valid identifiers inside square brackets ('/(["\'])(?:(?=(\\\\?))\\2.)*?\\1(*SKIP)(*F)|(\[(?:[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)])/'). Commented Dec 22, 2021 at 12:03
  • Shouldn't wafer in $lot[wafer] become quoted as well? Commented Dec 22, 2021 at 12:55

3 Answers

2

If you want to wrap any valid identifier inside square brackets in single quotes, you can use preg_replace directly:

$regex = '/(["\'])(?:(?=(\\\\?))\2.)*?\1(*SKIP)(*F)|\[([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)]/s';
$output = preg_replace($regex, "['\$3']", $text);

See the regex demo. Details:

  • (["'])(?:(?=(\\?))\2.)*?\1 - matches a string between single or double quotation marks (contains two capturing groups)
  • (*SKIP)(*F) - discards the matched text and fails the match starting a new search from the failure location
  • | - or
  • \[ - [ char
  • ([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*) - Group 3: a letter, underscore, or any char from the \x7f-\xff range and then any alphanumeric, underscore or any char from the \x7f-\xff range
  • ] - a ] char.
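The skip-then-match idea is easier to see on a smaller pattern. The following is a minimal sketch, not part of the original regex: it assumes a simplified rule where double-quoted parts contain no escapes, and it collects numbers that sit outside those parts.

```php
<?php
// Illustration only: simplified "string" pattern without escape handling.
// First alternative: consume a double-quoted part, then (*SKIP)(*F) throws it
// away and resumes the search right after it.
// Second alternative: the thing we actually want to match (here: digits).
preg_match_all('/"[^"]*"(*SKIP)(*F)|\d+/', 'a 12 "34" 56', $m);

print_r($m[0]); // the 34 inside the quotes is not reported
```

The same shape is used above: the first alternative swallows quoted strings, the second one matches the bracketed identifier.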

See the PHP demo:

$regex = '/(["\'])(?:(?=(\\\\?))\2.)*?\1(*SKIP)(*F)|\[([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)]/s';
$str = '$output = "<input name=\'variable[key]\' has to be preserved as it is.";
$output = \'Even this string variable[key] has to be preserved as it is.\';

$variable[key] = "WhatElse";
$result = $wso->RSLA("7050", $vegalot, "600", "WFID_OK_WEB","1300", $_POST[username]);
if ($result[ECD] != 0) {
if ($line=="AAAA" && in_array(substr($wso->lot,0,7),$lot_aaaa_list) && $lot[wafer][25]) {';
echo preg_replace($regex, "['\$3']", $str);

Output:

$output = "<input name='variable[key]' has to be preserved as it is.";
$output = 'Even this string variable[key] has to be preserved as it is.';

$variable['key'] = "WhatElse";
$result = $wso->RSLA("7050", $vegalot, "600", "WFID_OK_WEB","1300", $_POST['username']);
if ($result['ECD'] != 0) {
if ($line=="AAAA" && in_array(substr($wso->lot,0,7),$lot_aaaa_list) && $lot['wafer'][25]) {
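If this has to run over many scripts, it may help to know how many replacements were made per file. Below is a rough sketch under that assumption; quoteBareKeys() is a hypothetical helper name, and it only relies on the optional $limit and $count parameters of preg_replace.

```php
<?php
// Hypothetical helper (not from the answer above): applies the same regex and
// reports the number of replacements through $count.
function quoteBareKeys(string $code, ?int &$count = null): string
{
    $regex = '/(["\'])(?:(?=(\\\\?))\2.)*?\1(*SKIP)(*F)|\[([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)]/s';
    $result = preg_replace($regex, "['\$3']", $code, -1, $count);

    // preg_replace() returns null on error; fall back to the untouched input.
    return $result ?? $code;
}

$fixed = quoteBareKeys('$a[key] = "b[c]"; $d[e][7] = 1;', $n);
echo $fixed, "\n"; // keys outside the string are quoted, "b[c]" is left alone
echo $n, "\n";     // number of replacements
```

Keep in mind that the pattern only knows about quotes, not about PHP comments or heredocs, so an apostrophe inside a comment (for example `// don't`) can be taken for the start of a string. It is worth diffing the result before overwriting anything.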

2

[I know this isn't regexp, but since you asked for 'suggestion to better solve such issue', here are my 2 cents]

How about simply parsing the code ;):

$source = file_get_contents('/tmp/test.php'); // Change this
$tokens = token_get_all($source);

$history = [];
foreach ($tokens as $token) {
    if (is_string($token)) { // simple 1-character token       
        array_push($history, str_replace(["\r", "\n"], '', $token));
        $history = array_slice($history, -2);

        echo $token;
    } else {
        list($id, $text) = $token;

        switch ($id) {
            case T_STRING:
                if ($history == [T_VARIABLE, '[']) {
                    // Token sequence is [T_VARIABLE, '[', T_STRING]
                    echo "'$text'";
                }
                else {
                    echo $text;
                }
                break;

            default:
                // anything else -> output "as is"
                echo $text;
                break;
        }

        array_push($history, $id);
        $history = array_slice($history, -2);
    }
}

Of course, the $source needs to be changed to whatever suits you. token_get_all() then loads the PHP code and parses it into a list of tokens. That list is then processed item by item and possibly changed before being output again, according to our needs.

1-char tokens such as [ and ] (for example, both brackets in $myVariable[1] become tokens of their own) are a special case that has to be handled in the loop, because they arrive as plain strings. Otherwise $token is an array with an ID for the type of token and the token itself.

"Unfortunately" T_STRING is kind of a general case, so to pinpoint only the strings being used as constants in array indexing we store the 2 items preceding the current in $history. ("$myVariable" and "[")
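To see what the loop is working with, it can help to dump the token stream for a tiny snippet. This is only an illustration; the one-line input is made up for the example.

```php
<?php
// Dump the token stream of a one-line snippet (illustration only).
$lines = [];
foreach (token_get_all('<?php $a[key] = "x";') as $token) {
    if (is_string($token)) {
        // 1-character tokens such as '[' arrive as plain strings
        $lines[] = 'char ' . $token;
    } else {
        // everything else is an array: [id, text, line number]
        $lines[] = token_name($token[0]) . ' ' . json_encode($token[1]);
    }
}
echo implode("\n", $lines), "\n";
```

The bare key shows up as T_STRING right after the '[' character token, while the quoted "x" is a single T_CONSTANT_ENCAPSED_STRING token. That is why anything inside a string literal is never touched by this approach.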

..and..that's it, really. The code is read from a file, processed and output to stdout. Everything but the "constants as array index" case should be output as is.

If you like I can rewrite it as a function or something. The above should be kind of the general solution, though.
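For what it's worth, a function form could look roughly like this. It is a sketch with the same rule as the loop above (a T_STRING directly after T_VARIABLE and '['); the name quoteConstantKeys() is made up, and it returns the rewritten source instead of echoing it.

```php
<?php
// Hypothetical function form of the loop above.
function quoteConstantKeys(string $source): string
{
    $out = '';
    $history = [];
    foreach (token_get_all($source) as $token) {
        if (is_string($token)) {
            // simple 1-character token
            $out .= $token;
            $history[] = $token;
        } else {
            [$id, $text] = $token;
            // Token sequence is [T_VARIABLE, '[', T_STRING]
            $isBareKey = $id === T_STRING && $history === [T_VARIABLE, '['];
            $out .= $isBareKey ? "'$text'" : $text;
            $history[] = $id;
        }
        // only the two preceding tokens are ever inspected
        $history = array_slice($history, -2);
    }
    return $out;
}

echo quoteConstantKeys('<?php $a[key] = "b[c]"; $d[7] = FOO;');
```

Returning a string makes it easy to compare against the original and write the file back only when something changed.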

Edit Version 2, support for $myObject->myProp[key]:

<?php
$source = file_get_contents('/tmp/test.php'); // Change this
$tokens = token_get_all($source);

//print_r($tokens); exit();

$history = [];
foreach ($tokens as $token) {
    if (is_string($token)) { // simple 1-character token       
        array_push($history, str_replace(["\r", "\n"], '', $token));

        echo $token;
    } else {
        list($id, $text) = $token;

        switch ($id) {
            case T_STRING:
                if (array_slice($history, -2) == [T_VARIABLE, '[']) {
                    // Token sequence is [T_VARIABLE, '[', T_STRING]
                    echo "'$text'";
                }
                else if (array_slice($history, -4) == [T_VARIABLE, T_OBJECT_OPERATOR, T_STRING, '[']) {
                    echo "'$text'";
                }
                else {
                    echo $text;
                }
                break;

            default:
                // anything else -> output "as is"
                echo $text;
                break;
        }

        array_push($history, $id);
    }

    // This has to be at least as large as the largest chunk
    // checked anywhere above
    $history = array_slice($history, -5); 
}

As can be seen, the tough part about introducing more cases is that $history won't be as uniform anymore. At first I thought about fetching things directly from $tokens, but they aren't sanitized, so I stuck to $history. It's possible that the "pruning line" at the bottom isn't needed; it's just there to limit memory usage. Maybe it's cleaner to skip $history, sanitize all $tokens items before the foreach() and then fetch things directly from it (adding the index to the foreach(), of course). I think I feel version 3 coming up ;-j..
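The "sanitize first, then look back by index" idea could be sketched like this. It is not a drop-in replacement, just one way it might look; normalizeTokens() and quoteKeysByIndex() are hypothetical names, and it assumes PHP 7.4+ for the arrow functions. It checks the same two patterns as Version 2.

```php
<?php
// Sketch only; normalizeTokens() and quoteKeysByIndex() are hypothetical names.
function normalizeTokens(string $source): array
{
    // Turn every token into [id, text]; 1-character tokens use their own
    // text as a pseudo-id, which cannot collide with the integer T_* ids.
    return array_map(
        fn($t) => is_string($t) ? [$t, $t] : [$t[0], $t[1]],
        token_get_all($source)
    );
}

function quoteKeysByIndex(string $source): string
{
    $tokens = normalizeTokens($source);
    $out = '';
    foreach ($tokens as $i => [$id, $text]) {
        // id of the token $back positions earlier, or null before the start
        $prev = fn(int $back) => $tokens[$i - $back][0] ?? null;

        $isKey = $id === T_STRING && $prev(1) === '[' && (
            $prev(2) === T_VARIABLE
            || [$prev(4), $prev(3), $prev(2)] === [T_VARIABLE, T_OBJECT_OPERATOR, T_STRING]
        );
        $out .= $isKey ? "'$text'" : $text;
    }
    return $out;
}

echo quoteKeysByIndex('<?php $v->prop[three] = $a[key];');
```

With the index at hand there is nothing to prune, and adding another pattern is just one more comparison against $prev().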

Edit Version 3: This should be as simple as it gets. Simply look for brackets with unquoted strings inside.

$source = file_get_contents('/tmp/test.php'); // Change this
$tokens = token_get_all($source);

$history = [];
foreach ($tokens as $token) {
    if (is_string($token)) { // simple 1-character token       
        array_push($history, str_replace(["\r", "\n"], '', $token));

        echo $token;
    } else {
        list($id, $text) = $token;

        switch ($id) {
            case T_STRING:
                if (array_slice($history, -1) == ['[']) {
                    echo "'$text'";
                }
                else {
                    echo $text;
                }
                break;

            default:
                // anything else -> output "as is"
                echo $text;
                break;
        }

        array_push($history, $id);
    }
}
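One case where looking only at the preceding '[' is probably too simple is a function call used as the index, e.g. $a[strlen($s)]: strlen is a T_STRING right after '[' and would get quoted. A possible variant, assuming the name must also be followed directly by ']', is sketched below; quoteBracketedNames() is a made-up name.

```php
<?php
// Hypothetical variant of Version 3: quote a T_STRING only when it sits
// exactly between '[' and ']', so calls like $a[strlen($s)] stay untouched.
function quoteBracketedNames(string $source): string
{
    $tokens = token_get_all($source);
    $out = '';
    foreach ($tokens as $i => $token) {
        if (is_string($token)) { // simple 1-character token
            $out .= $token;
            continue;
        }
        [$id, $text] = $token;
        $before = $tokens[$i - 1] ?? null;
        $after  = $tokens[$i + 1] ?? null;
        $out .= ($id === T_STRING && $before === '[' && $after === ']')
            ? "'$text'"
            : $text;
    }
    return $out;
}

echo quoteBracketedNames('<?php $a[one][two] = $b[strlen($s)] + $c[7];');
```

This still cannot tell a genuinely defined constant from a forgotten pair of quotes, and a short array literal such as [FOO] would be quoted too, so the output should be reviewed rather than trusted blindly.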

Test input(/tmp/test.php):

<?php
$variable[key] = "WhatElse";
$result = $wso->RSLA("7050", $vegalot, "600", "WFID_OK_WEB","1300", $_POST[username]);
if ($result[ECD] != 0) {
if ($line=="AAAA" && in_array(substr($wso->lot,0,7),$lot_aaaa_list) && $lot[wafer][25]) {
    $object->method[key];

    $variable[test] = 'one';
    $variable[one][two] = 'three';
    $variable->property[three]['four'] = 5;

3 Comments

Great solution, but it does not work if 'test.php' would contain '$object->method[key]' as "key" is not automatically enclosed by quotes.
You're right, object/class properties aren't supported. I made a version 2 with that added. The thing I'm interested in(well, curious about) here is how this code would look if it grows, how is maintainability affected. I thought the advantage here vs regexp would be readability. But perhaps that's not the case, I'm not sure..
Oh, heck.. Version 3 added. It was interesting to consider the cases you wrote about. Version 3 simply looks for brackets with unquoted strings within. I'm not sure there's no case where that criterion is too simple.
0

I solved the issue with a double loop:

<?php
$script = file("/tmp/test.php");
/*
 * Search any row containing a php variable $(...)
 */
$scanstr = preg_grep("/\\$([\w\-\>]+)(\[.*\])/", $script);
foreach($scanstr as $k => $str) {
   unset($matchvar, $match);
   /* Get php variable */
   preg_match_all('/\$(\w|-|>|\[|\]|\'|")+/', $str, $matchvar, PREG_SET_ORDER, 0);
   /* Get array key name */
   preg_match_all('/(\[(\w+)\])/', $matchvar[0][0], $match, PREG_SET_ORDER, 0);
   $replace_str = $str;
   foreach($match as $m) {
      /*
       * if key is not defined or a number, then skip conversion
       */
      if (!$m[2]) continue;
      if (is_numeric($m[2])) continue;
      $r = str_replace($m[1], "['" . $m[2] . "']", $matchvar[0][0]);
      $replace_str = str_replace($matchvar[0][0], $r, $replace_str);
      $matchvar[0][0] = $r;
      $local_changeflag = true;
   }
}
?>

It works for any of the following cases:

 $variable[test] = 'one';
 $variable[one][two] = 'three';
 $variable->property[three]['four'] = 5;

I know, it's not very clean ;)

1 Comment

Too many preg_ calls. See Wiktor's approach.
