I want to tokenize formatting strings (very roughly like printf) and I think I am only missing a small bit:

  • %[number][one letter ctYymd] shall become a token²
  • $1...$10 shall become a token
  • all else (normal text) becomes a token.

I got quite far in the regExp simulator. This looks like it should do:

²update: now using # instead of %. (Less troubles with windows command line parameters)

[Image: screenshot from the regex simulator]

It's not scary if you focus on the three parts, connected by pipes (as either-or), so basically it's just three alternatives. Since I want to match from start to end, I wrapped things in /^...$/ and surrounded them with a non-capturing group (?:...) that may repeat one or more times:

$exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/'; 

Still my source doesn't deliver:

$exp = '/^(?:(%\\d*[ctYymd]+)|([^$%]+)|(\\$\\d))+$/';
echo "expression: $exp \n";

$tests = [
        '###%04d_Ball0n%02d$1',
        '%03d_Ball0n%02x$1%03d_Ball0n%02d$1',
        '%3d_Ball0n%02d',
    ];

foreach ( $tests as $test )
{
    echo "teststring: $test\n";
    if( preg_match( $exp, $test, $tokens) )
    {
        array_shift($tokens);
        foreach ( $tokens as $token )
            echo "\t\t'$token'\n";
    }
    else
        echo "not valid.";
} // foreach

I get results, but the matches are out of order. The first %[number][letter] token never shows up, and other tokens appear twice:

expression: /^((%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/ 
teststring: ###%04d_Ball0n%02d$1
        '$1'
        '%02d'
        '_Ball0n'
        '$1'
teststring: %03d_Ball0n%02x$1%03d_Ball0n%02d$1
not valid.teststring: %3d_Ball0n%02d
        '%02d'
        '%02d'
        '_Ball0n'
teststring: %d_foobardoo
        '_foobardoo'
        '%d'
        '_foobardoo'
teststring: Ball0n%02dHamburg%d
        '%d'
        '%d'
        'Hamburg'
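
A note on the output above: the printed expression has a capturing outer group, `((...)|...)+`, unlike the `(?:...)` in the code listing. In PCRE, a capturing group inside a repetition keeps only the text from its last iteration, which would account for both the doubled entries and the apparent reordering. A minimal sketch that reproduces this with one of the test strings (the variable names are illustrative, not from the original code):

```php
<?php
// The expression as printed in the output, with a capturing outer group.
$printed = '/^((%\d*[ctYymd]+)|([^$%]+)|(\$\d))+$/';
$subject = '%3d_Ball0n%02d';

preg_match($printed, $subject, $m);
array_shift($m); // drop the full match, as in the question's loop

// Group 1 (outer)  : last iteration of the repetition -> '%02d'
// Group 2 (%-token): last %-token seen                -> '%02d'
// Group 3 (text)   : last text run seen               -> '_Ball0n'
// Group 4 ($-token): never took part, so it is omitted from the array
print_r($m);

// Dropping the anchors and the outer repetition, and letting
// preg_match_all walk the string, yields every token in order.
preg_match_all('/%\d*[ctYymd]+|\$\d+|[^$%]+/', $subject, $all);
print_r($all[0]); // '%3d', '_Ball0n', '%02d'
```

The second test string is reported as not valid because `%02x` fits none of the three alternatives: `x` is not in `[ctYymd]`, and `%` is excluded from the plain-text class.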

1 Answer

Solution (edited by OP): I use two slight variations of the following pattern, differing only in how it is wrapped: the first for validation, the second for tokenizing:

#\d*[ctYymd]+|\$\d+|[^#\$]+

Code:

$core = '#\d*[ctYymd]+|\$\d+|[^#\$]+';
$expValidate = '/^('.$core.')+$/m';
$expTokenize = '/('.$core.')/m';

$tests = [
        '#3d-',
        '#3d-ABC',
        '***#04d_Ball0n#02d$1',
        '#03d_Ball0n#02x$AwrongDollar',
        '#3d_Ball0n#02d',
        'Badstring#02xWrongLetterX'
    ];

foreach ( $tests as $test )
{
    echo "teststring: [$test]\n";

    if( ! preg_match_all( $expValidate, $test) )
    {
        echo "not valid.\n";
        continue;
    }
    if( preg_match_all( $expTokenize, $test, $tokens) ) {
        foreach ( $tokens[0] as $token )
            echo "\t\t'$token'\n";
    }

} // foreach

Output:

teststring: [#3d-]
        '#3d'
        '-'
teststring: [#3d-ABC]
        '#3d'
        '-ABC'
teststring: [***#04d_Ball0n#02d$1]
        '***'
        '#04d'
        '_Ball0n'
        '#02d'
        '$1'
teststring: [#03d_Ball0n#02x$AwrongDollar]
not valid.
teststring: [#3d_Ball0n#02d]
        '#3d'
        '_Ball0n'
        '#02d'
teststring: [Badstring#02xWrongLetterX]
not valid.
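
As a possible alternative to running two expressions, validation and tokenizing can be folded into one pass by anchoring each match with `\G`, which only matches where the previous match ended. If the matched tokens do not add up to the whole string, the remainder starts at the first invalid character. This is a sketch under assumptions: the function name `tokenizeFormat` and the choice of exception are hypothetical, and unlike the `+` in `$expValidate` it accepts an empty string (returning no tokens).

```php
<?php
/**
 * Hypothetical helper: tokenize a format string in a single pass.
 * Returns the list of tokens, or throws with the byte offset of the
 * first character that none of the three token rules accepts.
 */
function tokenizeFormat(string $format): array
{
    // \G pins every match to the end of the previous one,
    // so preg_match_all cannot silently skip invalid input.
    $pattern = '/\G(?:#\d*[ctYymd]+|\$\d+|[^#$]+)/';

    preg_match_all($pattern, $format, $matches);
    $tokens = $matches[0];

    // Everything matched so far is contiguous from offset 0.
    $consumed = strlen(implode('', $tokens));
    if ($consumed !== strlen($format)) {
        throw new InvalidArgumentException(
            "Invalid format string at offset $consumed: '"
            . substr($format, $consumed) . "'"
        );
    }

    return $tokens;
}
```

Usage: `tokenizeFormat('***#04d_Ball0n#02d$1')` returns the same five tokens as listed in the output above, while `tokenizeFormat('Badstring#02xWrongLetterX')` throws and reports offset 9, where `#02x` begins.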
4 Comments

Won-der-ful! Thank you so much! To analyse my mistakes: I should have used preg_match_all rather than preg_match, which was simply the wrong function here. It also saves me from the non-capturing outer group.
Only thing I am missing: it doesn't yell at me on erroneous strings, it just skips the bad part, e.g. Badstring%02xWrongLetterX.
Your lookahead (? did not solve it for me. A wrong expression led to a single matching block (which is no sure indication of a wrong format string). I came up with two minor variations and edited your post (hope that's ok). I also replaced % with # for unrelated reasons.