Building a language with good quality code requires understanding the various approaches to building operators, evaluators and tokenizers.
Let's go through the concepts from simplest to most complicated, building a calculator as we go.
Operators
This section is concerned with how we evaluate each individual operator.
Normally an operator function can be passed arguments and return the result, like add(1, 2).
Inline Operators
The simplest approach is to define all the operators inline in a 'switch'.
Since Python didn't get a dedicated switch statement until 3.10's match, we can instead use if/elif/else statements.
if op == "+":
result = lhs + rhs
elif op == "-":
result = lhs - rhs
elif op == "*":
result = lhs * rhs
elif op == "/":
result = lhs / rhs
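For reference, on Python 3.10+ the same dispatch can be written with a match statement:

match op:
    case "+":
        result = lhs + rhs
    case "-":
        result = lhs - rhs
    case "*":
        result = lhs * rhs
    case "/":
        result = lhs / rhs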
Dictionary Operators
We can store the operators in an object, giving us the ability to pass the Operators to the Evaluator and/or the Tokenizer.
Since the operators are strings we can build a dictionary mapping operator strings to the corresponding operator function.
When building the dictionary we can use lambda functions, as they let us define each operator inline inside the dictionary.
However, lambda functions can only contain one expression in Python.
If we need more than one expression then we will have to use 'full-fat' functions.
I have made addition a 'full-fat' function, so we have an example of both approaches.
def add(lhs, rhs):
return lhs + rhs
operators = {
"+": add,
"-": lambda lhs, rhs: lhs - rhs,
"*": lambda lhs, rhs: lhs * rhs,
"/": lambda lhs, rhs: lhs / rhs,
}
Class Operators
Rather than defining the operators in a dictionary we can define the operators on a class.
We can then build the operators dictionary somewhat automatically from the class.
We do however have to use valid Python names for the operators.
class CalculatorOperators:
@staticmethod
def add(lhs, rhs):
return lhs + rhs
@staticmethod
def sub(lhs, rhs):
return lhs - rhs
@staticmethod
def mul(lhs, rhs):
return lhs * rhs
@staticmethod
def div(lhs, rhs):
return lhs / rhs
We can then iterate through the class by iterating through __dict__.
To only get our functions we can filter out any dunder values.
def get_operators(obj):
return {
name: getattr(obj, name)
for name in obj.__dict__
if not (name.startswith("__") and name.endswith("__"))
}
print(get_operators(CalculatorOperators))
{'add': <function CalculatorOperators.add at 0x7ff171f61f30>, ...}
Metaclass Operators
We can automatically call get_operators by having a base class with an __init_subclass__ method or a custom metaclass.
Either can run when the class is defined and transparently build operators.
class Operators:
def __init_subclass__(cls, /, **kwargs):
cls.OPERATORS = get_operators(cls)
class CalculatorOperators(Operators):
@staticmethod
def add(lhs, rhs):
return lhs + rhs
...
print(CalculatorOperators.OPERATORS)
{'add': <function CalculatorOperators.add at 0x7ff171f623b0>, ...}
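For completeness, here's a minimal sketch of the custom metaclass alternative; it builds the same OPERATORS dictionary at class-definition time:

class OperatorsMeta(type):
    def __new__(mcls, name, bases, namespace, **kwargs):
        cls = super().__new__(mcls, name, bases, namespace, **kwargs)
        # get_operators runs before the assignment, so OPERATORS
        # doesn't end up containing itself.
        cls.OPERATORS = get_operators(cls)
        return cls

class CalculatorOperators(metaclass=OperatorsMeta):
    @staticmethod
    def add(lhs, rhs):
        return lhs + rhs
    ...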
Decorated Class Operators
We can customize the key to use by defining a decorator function.
The decorator function can assign the desired name to an attribute on the function.
We'd need to make a little change to how we build operators: use the function's name attribute value if available.
def with_name(name=None):
def inner(fn):
fn.name = name
return fn
return inner
class CalculatorEvaluator:
@with_name("+")
@staticmethod
def add(lhs, rhs):
return lhs + rhs
...
operators = {
getattr(fn, "name", name): fn
for name, fn in CalculatorEvaluator.__dict__.items()
if not (name.startswith("__") and name.endswith("__"))
}
print(operators)
{'+': <staticmethod(<function CalculatorEvaluator.add at 0x7ff171f62560>)>, ...}
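On Python 3.10+ staticmethod objects are callable directly, so the dictionary works like the earlier ones:

print(operators["+"](1, 2))
3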
Comparison
Inline Operators
The simplest approach.
However with larger operator sets or complicated logic for the operators the code quickly becomes unmaintainable.
The approach is good when building a very simple evaluator, like one for BrainFuck.
Dictionary Operators
A very common approach.
Googling "Python switch" is bound to turn up results saying to use a dictionary.
The approach has a complexity improvement over Inline Operators, going from \$O(n)\$ to \$O(1)\$ operator lookup.
I, however, don't find the approach to be much more maintainable than Inline Operators, as it is prone to code fragmentation, with the functions and the operators dictionary getting out of sync.
The approach is good when:
- all the operators can be defined as lambdas and documenting the operators isn't needed; or,
- the operators must be fragmented and can't be contained within a single class/namespace. For example a menu screen, as the user can enter the same view in two different ways.
Class Operators
The most advanced approach of the three, requiring an understanding of __dict__ (and potentially __slots__).
After learning the little bit of metaprogramming required, I find this approach to be the simplest.
We can throw all our operators in one class, document the class like any other, and have the best approach for our use case.
Whilst we currently need to use Decorated Class Operators, we can normally use the simpler Class Operators approach when using the best approach to defining our own language (a parser library, as we'll see later).
The approach seems to be the standard approach ranging from single operator evaluators like cmd to parser libraries like TatSu.
For the rest of the answer operators is the following:
operators = {
"+": lambda lhs, rhs: lhs + rhs,
"-": lambda lhs, rhs: lhs - rhs,
"*": lambda lhs, rhs: lhs * rhs,
"/": lambda lhs, rhs: lhs / rhs,
}
Evaluator
Now we've gone through the operators we can move onto the evaluators.
The evaluators are generally pretty simple, like the operators.
Single Operator Evaluator
We need to evaluate an operator and some values.
To do so we just get the operator from operators and pass the values in.
print(operators["+"](1, 2))
3
Stream Evaluator
Stream evaluators typically use a stack, and so work best on postfix input.
In Python the stack can just be a list.
Then, whenever we have a non-operator token, we add the token to the stack.
Whenever we have an operator token, we pop the operands from the stack and pass the operands to the operator's function.
We then add the result of the operator to the stack.
After processing all tokens we pop the top value of the stack.
def evaluate__stream(tokens, operators):
stack = []
for token in tokens:
if token in operators:
b = stack.pop()
a = stack.pop()
token = operators[token](a, b)
stack.append(token)
return stack.pop()
evaluate__stream([1, 2, 3, "*", "+"], operators)
7
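To make the stack behaviour concrete, here's the trace for that example:

token | stack afterwards
  1   | [1]
  2   | [1, 2]
  3   | [1, 2, 3]
  *   | [1, 6]    # popped 2 and 3, pushed 2 * 3
  +   | [7]       # popped 1 and 6, pushed 1 + 6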
Tree Evaluator
Tree evaluators interact with a tree.
To have a tree the input must be nodes of some sort.
We can build a simple Node class to store the operator and children.
class Node:
def __init__(self, operator, *children):
self.operator = operator
self.children = children
def __repr__(self):
values = [self.operator, *self.children]
return f"Node({', '.join(map(repr, values))})"
def __str__(self): # Only supports binary operators
op = self.operator
lhs = self.children[0]
rhs = self.children[1]
return f"({lhs} {op} {rhs})"
ast = Node("+", 1, Node("*", 2, 3))
print(repr(ast))
print(ast)
Node('+', 1, Node('*', 2, 3))
(1 + (2 * 3))
We then perform a depth-first walk of the tree, evaluating children before their parents.
Otherwise we would perform the operation 1 + Node("*", 2, 3), resulting in an error.
We can build a simple tree evaluator using recursion.
def evaluate__tree(node, operators):
if not isinstance(node, Node):
return node
children = [
evaluate__tree(child, operators)
for child in node.children
]
return operators[node.operator](*children)
print(evaluate__tree(ast, operators))
7
Comparison
Single Operator Evaluator
Very simple and allows simple control flow.
The approach is good for menus or terminal commands, like with cmd.
Stream Evaluator
Good at implementing postfix evaluators.
However, most languages written for humans aren't purely postfix;
- binary operators typically are infix. And,
- functions are typically prefix.
The approach is good for postfix languages, or languages easily converted to postfix.
Tree Evaluator
Trees are typically easier to understand than a postfix stream.
As such languages typically build and evaluate an Abstract Syntax Tree (AST).
An AST is also easier to mutate as we don't need to partially evaluate the stream to find the children.
The approach is good for most languages.
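For example, here's a sketch of one such mutation, constant folding, using the Node class and operators from above:

def fold_constants(node, operators):
    # Leaf values are already folded.
    if not isinstance(node, Node):
        return node
    children = [fold_constants(child, operators) for child in node.children]
    # Collapse the node when every child folded down to a plain number.
    if all(isinstance(child, (int, float)) for child in children):
        return operators[node.operator](*children)
    return Node(node.operator, *children)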
The Evaluator to pick is largely dictated by how the tokens are made.
Using a Tree Evaluator with the Single Operator Tokenizer doesn't make sense, as the Evaluator is completely overkill for the Tokenizer.
Additionally the Single Operator Evaluator can only work with the Single Operator Tokenizer.
Tokenizer
Now we've got the easy stuff out of the way, let's focus on the hard part: converting a string into a bunch of tokens.
Single Operator Tokenizer
The simplest approach is to get the user to provide the individual tokens.
We can do so by just calling input a bunch, effectively sidestepping building a tokenizer.
lhs = int(input("lhs> "))
op = input("op > ")
rhs = int(input("rhs> "))
print(operators[op](lhs, rhs))
lhs> 1
op > +
rhs> 2
3
Simple Tokenizer (postfix)
The section is based on another one of my answers.
Tokenizers can be pretty complicated to build correctly as a beginner.
In the past I've gotten caught up trying to make the tokenizer do everything, so my code became complicated as I missed the forest for the trees.
Let's use TDD to build the tokenizer:
We'll handle only numbers.
However, only single-character numbers, which we'll convert to an int.
def tokenize__simple(text):
for char in text:
if char in "0123456789":
yield int(char)
>>> list(tokenize__simple("123"))
[1, 2, 3]
We'll handle one number of any size.
We'll build up a 'token' stack.
We'll use a list to ensure \$O(1)\$ time when adding new characters, as appending to a Python string takes \$O(n)\$ time.
def tokenize__simple(text):
token = []
for char in text:
if char in "0123456789":
token.append(char)
if token:
yield int("".join(token))
>>> list(tokenize__simple("1 2 3"))
[123]
>>> list(tokenize__simple(""))
[]
We'll handle spaces.
We just need to copy the if token code into the loop, and reset token after yielding.
def tokenize__simple(text):
token = []
for char in text:
if char in "0123456789":
token.append(char)
elif char in " ":
if token:
yield int("".join(token))
token = []
if token:
yield int("".join(token))
>>> list(tokenize__simple("1 2 3"))
[1, 2, 3]
We'll handle operators.
This is basically the same as spaces, except we also yield the operator.
def tokenize__simple(text):
token = []
for char in text:
if char in "0123456789":
token.append(char)
elif char in " +-*/":
if token:
yield int("".join(token))
token = []
if char != " ":
yield char
if token:
yield int("".join(token))
>>> list(tokenize__simple("1+ 2*3"))
[1, "+", 2, "*", 3]
From here we can tokenize numbers in postfix notation and pass the values to the Stream Evaluator we wrote earlier.
evaluate__stream(tokenize__simple("1 2 3 * +"), operators)
7
Simple Tokenizer + Transformation (Infix Stream)
Performing transformations on the tokens afterwards is fairly common.
For example, we can use Dijkstra's Shunting-Yard algorithm to convert from infix to postfix.
We can then pass the tokens to the Stream Evaluator for evaluation.
def to_postfix(tokens, precedences):
operators = list()
for token in tokens:
if isinstance(token, int):
yield token
else:
try:
precedence = precedences[token]
while precedence <= precedences[operators[-1]]:
yield operators.pop()
except LookupError:
pass
operators.append(token)
yield from operators[::-1]
precedences = {"+": 0, "-": 0, "*": 1, "/": 1}
print(evaluate__stream(to_postfix(tokenize__simple("1 + 2 * 3"), precedences), operators))
7
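We can inspect the intermediate postfix stream by listing the generator first:

>>> list(to_postfix(tokenize__simple("1 + 2 * 3"), precedences))
[1, 2, 3, '*', '+']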
Multiple Tokenizers (Infix Graph)
To build an infix graph we can build specialized parsers for each part of the grammar.
The approach allows us to build more complex grammars.
Whilst completely overkill for a calculator with 4 operators, the idea is transferable to more complex systems.
First we'll get whitespace working.
We don't actually need the whitespace, so we'll throw any runs away.
The for loop advances stream, consuming the whole run of consecutive whitespace.
WHITESPACE = " "
def whitespace(char, stream):
if char in WHITESPACE:
for char in stream:
if char not in WHITESPACE:
break
return char, None
>>> it = iter(" 12")
>>> whitespace(next(it), it)
('1', None)
>>> next(it)
'2'
Next we want to get the numbers as integers.
The solution is a lot like whitespace; however we'll store the digit characters in chars and return chars converted to an integer.
NUMBERS = "0123456789"
def number(char, stream):
chars = []
if char in NUMBERS:
chars.append(char)
for char in stream:
if char not in NUMBERS:
break
chars.append(char)
return char, int("".join(chars))
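Like whitespace, we can check number returns the lookahead character alongside the parsed integer:

>>> it = iter("12+3")
>>> number(next(it), it)
('+', 12)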
Next we'll say numbers can be surrounded by whitespace.
We'll discard whatever the whitespace is.
Since a number must appear on either side of an operator, this lets us put whitespace anywhere.
def spaced_number(char, stream):
char, _ = whitespace(char, stream)
char, num = number(char, stream)
char, _ = whitespace(char, stream)
return char, num
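Again we can check the whitespace on both sides is consumed:

>>> it = iter(" 12 + 3")
>>> spaced_number(next(it), it)
('+', 12)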
Next we'll handle multiplication and division.
Handling multiplication and division 'closer' to number allows us to implement BODMAS.
To get the code to work we get a spaced_number, then build a tree (more like a linked list) whilst we keep encountering multiplication or division.
def mul_div(char, stream):
char, number = spaced_number(char, stream)
node = number
while char in "*/":
op = char
char, number = spaced_number(next(stream, None), stream)
node = Node(op, node, number)
return char, node
Then we do the same for addition and subtraction as we did for mul and div; however we call mul_div rather than spaced_number.
def add_sub(char, stream):
char, number = mul_div(char, stream)
node = number
while char in "+-":
op = char
char, number = mul_div(next(stream, None), stream)
node = Node(op, node, number)
return char, node
Finally we hide the char and stream implementation details away by making the tokenize__tree function.
def tokenize__tree(text):
it = iter(text)
_, value = add_sub(next(it, None), it)
return value
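Before evaluating, we can check the tree builds with the correct precedence:

>>> print(tokenize__tree("1 + 2 * 3"))
(1 + (2 * 3))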
print(evaluate__tree(tokenize__tree("1 + 2 * 3"), operators))
7
Comparison
Single Operator Tokenizer
Very simple and allows simple control flow without needing a complicated tokenizer.
The approach is good for menus or terminal commands, like with cmd.
Simple Tokenizer
Good at implementing postfix evaluators and simple grammars.
The approach can also work with infix or prefix grammars, if you can think up an algorithm like Shunting-Yard for your use-case.
The approach is good for simple grammars where using *BNF is not an option.
Multiple Tokenizers
The approach is very powerful, but also very verbose.
The approach is good for complex grammars where using *BNF is not an option.
Like writing a library to parse *BNF.
If a tokenizer isn't needed then use the Single Operator Tokenizer.
Otherwise using *BNF is the best approach.
If *BNF isn't an option then use Multiple Tokenizers.
Most grammars seem simple as a beginner, so picking the option for complex grammars will allow the code room to grow if the grammar is more complex than anticipated.
*BNF
BNF
BNF is a very simple language for describing grammars.
We can write the complex Multiple Tokenizers in a couple of lines of BNF.
Whilst BNF is a standard structure, many libraries use a bespoke flavour of the language.
First let's define the two constants.
To define <NUMBER> we need to allow any of the digit characters, so we'll use | to say match either the left hand side or the right hand side.
<WHITESPACE> ::= " "
<NUMBER> ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
Next let's define whitespace.
We'll need to use the 'and' operator; "a" "b" matches ab.
Then we'll need to use recursion to allow any amount of whitespace.
<whitespace> ::= <WHITESPACE> <whitespace> | ""
number is defined almost exactly the same way as whitespace.
The difference is the minimum match (the right hand side) must be <NUMBER> not "" to match at least one number.
<number> ::= <NUMBER> <number> | <NUMBER>
From here the rest is effectively the same.
<spaced_number> ::= <whitespace> <number> <whitespace>
<mul_div> ::= <spaced_number> ("*" | "/") <mul_div> | <spaced_number>
<add_sub> ::= <mul_div> ("+" | "-") <add_sub> | <mul_div>
BNF, the notoriously ugly and verbose flavour, is far easier to read than the Multiple Tokenizers code we wrote.
Use a Parser Library
I've used TatSu once or twice before and found the library to be easy to use.
We can write the equivalent of the entire Tokenizer, Evaluator and Operators in not much code.
Let's convert the BNF to TatSu's flavour.
TatSu supports regex, so we can define spaced_number in one line.
spaced_number = / *\d+ */;
TatSu also seems to ignore whitespace, so neither of the / */ parts is needed.
number = /\d+/;
From here the rest can be effectively the same.
mul_div = number ("*" | "/") mul_div | number;
add_sub = mul_div ("+" | "-") add_sub | mul_div;
TatSu also provides a {}* repetition syntax, like regex's *, so we can avoid the recursive definitions.
mul_div = number {("*" | "/") number}*;
add_sub = mul_div {("+" | "-") mul_div}*;
However, rather than focusing on brevity, we should focus on defining clear tokens we can interact with from the Operators class we'll provide to TatSu.
So we need to make addition, subtraction, multiplication and division stand alone rules.
mul = number "*" mul_div;
div = number "/" mul_div;
add = mul_div "+" add_sub;
sub = mul_div "-" add_sub;
We can then bind the values to names by using name:exp syntax.
mul = lhs:number "*" rhs:mul_div;
div = lhs:number "/" rhs:mul_div;
add = lhs:mul_div "+" rhs:add_sub;
sub = lhs:mul_div "-" rhs:add_sub;
Now we can write the entire calculator in a very small amount of code.
import tatsu
GRAMMAR = """
root = add_sub $;
number = /\d+/;
mul_div = mul | div | number;
mul = lhs:number "*" rhs:mul_div;
div = lhs:number "/" rhs:mul_div;
add_sub = add | sub | mul_div;
add = lhs:mul_div "+" rhs:add_sub;
sub = lhs:mul_div "-" rhs:add_sub;
"""
class CalculatorOperators:
def number(self, ast):
return int(ast)
def mul(self, ast):
return ast.lhs * ast.rhs
def div(self, ast):
return ast.lhs / ast.rhs
def add(self, ast):
return ast.lhs + ast.rhs
def sub(self, ast):
return ast.lhs - ast.rhs
def evaluate(text):
parser = tatsu.compile(GRAMMAR)
return parser.parse(text, semantics=CalculatorOperators())
print(evaluate("1 + 2 * 3"))
7
The code is now ridiculously maintainable.
We've got the power of Multiple Tokenizers by writing a Class Operators and a very simple grammar.
We saw how much code and effort went into writing even the Simple Tokenizer, with fairly complex code and algorithms to get infix working.
Vanilla Python isn't the right tool for writing a language.
A library with a *BNF parser will normally be far better.
We can also easily change the code to support brackets/parentheses.
The change would be to add a stage between number and mul_div; we wouldn't even need to add a new operator.
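As a rough, untested sketch of the idea (treat the exact rules and bindings as assumptions rather than verified TatSu):

mul_div = mul | div | group;
mul = lhs:group "*" rhs:mul_div;
div = lhs:group "/" rhs:mul_div;
group = "(" inner:add_sub ")" | inner:number;

With a matching semantic method on the Operators class that just unwraps the binding:

def group(self, ast):
    return ast.inner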
Op's Code
Design
You've made the same mistake I have in the past.
By making a Simple Tokenizer you seem to have missed the forest for the trees.
We can see you've gone "how can I get x to work", but you've not looked back over the entire code and gone, "Oh, the code is the same!"
Look here:
elif join:
if isset:
try:
dct[variable] = dct[variable2] + dct[i]
except:
raise Exception("ERROR IN FILE", sys.argv[1], "VARIABLE NOT FOUND IN ITERATION", iteration)
break
join = isset = isready = isgo = False
# check if the variable is "@" change the data
if variable == "@":
RESET = True
break
elif isready:
variable2 = i
isset = True
else:
variable = i
isready = True
elif Reverse:
# this is equivalent to pseudocode var = var2.reverse()
if isready:
try:
dct[variable] = dct[i][::-1]
except:
raise Exception("ERROR IN FILE", sys.argv[1], "VARIABLE NOT FOUND IN ITERATION", iteration)
break
isready = False
Revert = False
if variable == "@":
RESET = True
break
else:
variable = i
isready = True
# ...
elif i == "+":
join = True
elif i == "/":
Reverse = True
You've written effectively the same code twice.
If we split the Tokenizer away from the Evaluator and Operators then we could drastically simplify the code.
Whilst we could probably get away with a Stream Evaluator, we'd be duplicating code in the Tokenizer and the Evaluator.
Since your language is a prefix language, to tokenize correctly we'd have to know the number of operands to consume from the input; then to evaluate correctly we'd need to know the same number of operands again.
Using the Tree Evaluator instead removes the duplication by storing the operands in the children variable.
I'm using an iterator, like in Multiple Tokenizers, to make the code that gets the variables simpler.
By advancing the iterator _text we don't need any complicated join, Reverse, isset or isready variables.
import itertools
ARGS = {
"+": 3,
"/": 2,
}
def tokenize(text):
_text = iter(text)
nodes = []
for char in _text:
if char in ARGS:
            nodes.append(Node(char, *itertools.islice(_text, ARGS[char])))
# handle other tokens
return Node("root", nodes)
However I'd still strongly recommend using *BNF if at all possible.
Even if you can't, using TatSu to make a mostly working version of your language would probably help you figure out how to build Multiple Tokenizers.
I really think you need to start over with TatSu.
Post another question and then see where to go from there.
Correctness
Can you find bugs?
Honestly your code needs a rewrite.
Looking for bugs at the moment would be a waste of time.
Additionally, languages are incredibly fragile code bases.
Asking us to hunt bugs when you have very lackluster test coverage is also a waste.
You may fix the bug, but without testing infrastructure in place, when you fix another bug you may just reintroduce the first.
You need tests to prevent regressions.
I'd recommend installing PyTest and using TDD to rebuild your code.
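As a sketch of what those regression tests could look like (calculator and evaluate are hypothetical stand-ins for your real entry point):

# test_calculator.py -- run with `pytest`
import pytest

from calculator import evaluate  # hypothetical entry point

@pytest.mark.parametrize("text, expected", [
    ("1 + 2", 3),
    ("1 + 2 * 3", 7),  # multiplication binds tighter than addition
    ("2 * 3 - 1", 5),
])
def test_evaluate(text, expected):
    assert evaluate(text) == expected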