I am looking to parse a token file that looks something like the one below to grab the token name/value pairs. The token/value/nesting relationships are already defined, so I can't change the way the token files are made. It would seem that a context-free grammar might be the best way to go, but I've no experience writing or implementing one. Is it possible to do it with regex? I've not had any luck with the nested multiline tokens (like Master1, Servant2).

;token1 = I am a top level single line token  
;token2 {  
    I am a top level  
    multiline line token  
}  

master1 {  
;servant1 = I am Master1, Servant1 single line token  
;servant2 {  
    I am Master1, Servant2.   
    A mulit line token.  
}  
;servant3 = I am Master1, Servant3  
}  
master2 {  
;servant1 = I am Master2, Servant1  
;servant2 {  
    I am Master2, Servant2  
A mulit line token.  
}  
;servant3 = I am Master2, Servant3  
}

2 Answers

PHP has a function to tokenize strings:

strtok splits a string (str) into smaller strings (tokens), with each token being delimited by any character from the delimiter argument (token). That is, if you have a string like "This is an example string", you could tokenize it into its individual words by using the space character as the delimiter.
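
As a minimal sketch of the description above, the loop below splits the example sentence into words. The wrapper name tokenizeWords is made up for illustration; only strtok itself is a PHP built-in.

```php
<?php
// Hypothetical wrapper: split a sentence into words with strtok(),
// using a single space as the delimiter.
function tokenizeWords(string $text): array
{
    $words = [];
    // The first call passes the string; later calls pass only the delimiter
    // and continue from where the previous call stopped.
    $word = strtok($text, ' ');
    while ($word !== false) {
        $words[] = $word;
        $word = strtok(' ');
    }
    return $words;
}

print_r(tokenizeWords('This is an example string'));
```

Note that this only splits on delimiter characters; on its own it does not track the nesting in the token file from the question.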

1 Comment

+1. Most people recommend split, but as that function is based on regular expressions it's much slower, and there are a few things you can do with strtok that you can't with explode etc.
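
One difference of the kind this comment may have in mind, sketched under the assumption that it refers to delimiter handling: strtok treats its second argument as a set of delimiter characters and skips empty tokens, while explode splits on one literal delimiter string and keeps empty pieces. The helper name strtokAll is hypothetical.

```php
<?php
// Hypothetical helper: collect every token strtok() yields for a delimiter set.
function strtokAll(string $text, string $delimiters): array
{
    $tokens = [];
    for ($t = strtok($text, $delimiters); $t !== false; $t = strtok($delimiters)) {
        $tokens[] = $t;
    }
    return $tokens;
}

$input = 'a,,b;c';

// Either ',' or ';' ends a token, and the empty token between ",," is skipped.
$viaStrtok = strtokAll($input, ',;');

// Splits only on the literal ',' and keeps the empty piece between ",,".
$viaExplode = explode(',', $input);

var_dump($viaStrtok, $viaExplode);
```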

Here's a reasonably simple line-walking parser. I originally tried to write a regex for it, but the missing leading ; on the multi-line master tokens made that much harder (with the ; present, a regex is reasonably easy to write), so I gave up and wrote this:

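// Walks the token file line by line and returns a nested array of name => value.
// Groups ("name {" ... "}") are buffered and parsed recursively; a group that
// holds only plain text lines comes back as a multi-line string.
function getTokens($string) {
    $string = trim($string);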
    $lines = explode("\n", $string);
    $data = array();
    $key = '';
    $open = 0;
    $buffer = '';
    foreach ($lines as $line) {
        $line = trim($line);
        // Skip blank lines only; empty() would also drop a line containing just "0".
        if ($line === '') {
            continue;
        } elseif (strpos($line, '}') === 0) {
            $open--;
            if ($open == 0) {
                $data[$key] = getTokens($buffer);
                $buffer = '';
            } elseif ($open < 0) {
                throw new Exception('Unmatched }');
            } else {
                $buffer .= "\n" . $line;
            }
        } elseif ($open > 0) {
            if (strpos($line, '{') !== false) {
                $open++;
            }
            $buffer .= "\n" . $line;
        } elseif ($line[0] == ';') {
            if (strpos($line, "=") !== false) {
                list ($key, $value) = explode("=", $line, 2);
                $key = trim(substr($key, 1));
                $value = trim($value);
                $data[$key] = $value;
            } elseif (strpos($line, "{") !== false) {
                $open++;
                list ($key, $value) = explode("{", $line, 2);
                $key = trim(substr($key, 1));
            } else {
                throw new Exception('Unmatched token ;');
            }
        } elseif (strpos($line, '{') !== false) {
            $open++;
            list ($key, $value) = explode("{", $line, 2);
            $key = trim($key);
        } else {
            $buffer .= "\n" . $line;
        }
    }
    if ($open > 0) {
        throw new Exception('Unmatched {');
    } elseif (empty($data) && !empty($buffer)) {
        return trim($buffer);
    }
    return $data;
}

When I feed it your string as input, I get:

Array(
    "token1" => "I am a top level single line token",
    "token2" => "I am a top level
                    multiline line token",
    "master1" => Array(
        "servant1" => "I am Master1, Servant1 single line token",
        "servant2" => "I am Master1, Servant2.
                            A mulit line token.",
        "servant3" => "I am Master1, Servant3",
    ),
    "master2" => Array(
        "servant1" => "I am Master2, Servant1",
        "servant2" => "I am Master2, Servant2
                            A mulit line token.",
        "servant3" => "I am Master2, Servant3",
    ),
)
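
Since the result is a nested array, consuming it usually means walking it recursively. As a rough sketch, the helper below flattens such an array into "master.servant" style paths. The name flattenTokens, the dot separator, and the trimmed-down sample array are assumptions for illustration, not part of the parser above.

```php
<?php
// Hypothetical helper: flatten a nested token array into "parent.child" => value pairs.
function flattenTokens(array $tokens, string $prefix = ''): array
{
    $flat = [];
    foreach ($tokens as $name => $value) {
        $path = $prefix === '' ? (string) $name : $prefix . '.' . $name;
        if (is_array($value)) {
            // A nested group (e.g. master1): recurse with the extended path.
            $flat += flattenTokens($value, $path);
        } else {
            $flat[$path] = $value;
        }
    }
    return $flat;
}

// A shortened stand-in with the same shape as the parser output.
$parsed = [
    'token1'  => 'I am a top level single line token',
    'master1' => [
        'servant1' => 'I am Master1, Servant1 single line token',
        'servant3' => 'I am Master1, Servant3',
    ],
];

foreach (flattenTokens($parsed) as $path => $value) {
    echo $path, ' = ', $value, PHP_EOL;
}
```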

1 Comment

Thanks! That is perfect. I was starting to go down that same road. The lack of leading semicolon is what was killing me too!
