
In my grammar I validate boolean expressions that look something like this:

((foo == true) && (bar != false) || (qux == norf))

I obtain the string from ANTLR4's context object by calling getText():

def enterBody(self, ctx):
    expression = ctx.condition.getText() # condition here being shorthand for a grammar rule (`condition=expr`)

However, the string is returned as a single piece (no spaces between the individual tokens) and I have no way of knowing what each token is:

((foo==true)&&(bar!=false)||(qux==norf))

Ideally, I would like it stored in a list in the following format:

['(', '(', 'foo', '==', 'true', ')', '&&', '(', 'bar', '!=', 'false', ')', '||', '(', 'qux', '==', 'norf', ')', ')']

The ANTLR4 Python documentation is rather sparse and I'm not sure if there's a method that accomplishes this.


1 Answer


The Python runtime is very similar to the Java runtime, so you can look at the Java documentation; most likely the same method exists in Python. Or browse the source code, it is pretty easy to read.

You're asking for a flat list of strings, but the whole idea of a parser is to avoid that, so it is most likely not what you actually need. Make sure you understand the parse tree and how listeners work: basically, you should work with the tree and not with a flat list. What you are probably looking for is ParserRuleContext.getChildren(). You can use it to access all child nodes:

def enterBody(self, ctx):
    print(list(ctx.getChildren()))
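
If you really do want a flat token list from inside the listener, you can walk the tree yourself: recurse into every child context and collect the text of the terminal (leaf) nodes. A minimal sketch, reusing the condition label from the question; the flatten helper name is made up:

from antlr4.tree.Tree import TerminalNode

def flatten(ctx):
    # Collect the text of every leaf node, left to right
    tokens = []
    for child in ctx.getChildren():
        if isinstance(child, TerminalNode):
            tokens.append(child.getText())
        else:
            tokens.extend(flatten(child))
    return tokens

def enterBody(self, ctx):
    print(flatten(ctx.condition))
    # e.g. ['(', '(', 'foo', '==', 'true', ')', '&&', ...]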

Even more likely, you want to access a specific type of child node for some action. Take a look at the parser ANTLR generated for you: you will find a bunch of *Context classes that contain methods for accessing every type of subnode. For example, the ctx parameter of the enterBody() method is an instance of BodyContext, and you can use its methods to access child nodes of a specific type.
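
For instance, assuming the rule from the question labels its expression as condition=expr, the generated BodyContext would expose both the label and an expr() accessor; the exact names depend on your grammar, so treat this as a sketch:

def enterBody(self, ctx):
    # ctx is a BodyContext: labeled elements become attributes,
    # and referenced sub-rules get accessor methods
    cond = ctx.condition   # the expr sub-rule via its label
    same = ctx.expr()      # the same child via the generated accessor
    print(cond.getText(), same.getText())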

UPD: If your grammar only defines a boolean expression and you use it only for validation and tokenization, you don't need the parser at all. Just use the lexer to get a list of all tokens:

import antlr4
from BooleanLexer import BooleanLexer  # the lexer module ANTLR generated from the grammar

input_stream = antlr4.FileStream('input.txt')

# Instantiate and run the generated lexer
lexer = BooleanLexer(input_stream)
tokens = antlr4.CommonTokenStream(lexer)

# Read all tokens until EOF
tokens.fill()

# Print tokens as text (EOF is stripped from the end)
print([token.text for token in tokens.tokens][:-1])
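
If you also need to know what kind of token each piece of text is (for example, to feed a shunting-yard parser), every Token carries a type, and the generated lexer has a symbolicNames list you can index with it. A small sketch under that assumption; tokens without a symbolic name in the grammar show up as <INVALID>:

# Pair each token's text with its symbolic type name (EOF dropped)
for token in tokens.tokens[:-1]:
    print(token.text, BooleanLexer.symbolicNames[token.type])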

3 Comments

I wrote a parser for boolean expressions using the shunting-yard algorithm, so I simply want to tokenise boolean expressions in order to feed them into said algorithm. From that I generate an expression tree in the form of a dictionary. Should I just do this with getChildren and then recursively call it on every sub-element until I get the expected result?
In case you have a grammar only for the boolean expression, see the updated answer. If the boolean expression grammar is part of a larger grammar, you need to recursively call getChildren() to collect only the relevant tokens.
Thanks! I guess I'll use a mixed approach: I'll use ANTLR for everything up to that point and then parse it manually (just because I find it easier that way). I was hoping to get away from using getChildren, as I feel like I'll have to keep calling it on every element's sub-elements, etc. And I've realised that I want to extract not just tokens but some parser rules too, so I guess the token stream will not work.
