
In my grammar I validate boolean expressions that look something like this:

((foo == true) && (bar != false) || (qux == norf))

I obtain the string from ANTLR4's context object by calling getText():

def enterBody(self, ctx):
    expression = ctx.condition.getText() # condition here being shorthand for a grammar rule (`condition=expr`)

However, the string is returned as a single piece (no spaces between the individual tokens) and I have no way of knowing what each token is:

((foo==true)&&(bar!=false)||(qux==norf))

Ideally, I would like it stored in a list in the following format:

['(', '(', 'foo', '==', 'true', ')', '&&', '(', 'bar', '!=', 'false', ')', '||', '(', 'qux', '==', 'norf', ')', ')']

The ANTLR4 Python documentation is rather sparse and I'm not sure if there's a method that accomplishes this.


1 Answer


The Python runtime is very similar to the Java runtime, so you can look at the Java documentation; most likely the same method exists in Python. Or browse the source code, it is pretty easy to read.

You're asking for a flat list of strings, but the whole idea of a parser is to avoid that, so it is most likely not what you actually need. Make sure you understand the parse tree and how listeners work: basically, you should work with the tree and not with a flat list. What you are probably looking for is ParserRuleContext.getChildren(). You can use it to access all child nodes:

def enterBody(self, ctx):
    print(list(ctx.getChildren()))
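
If you really do want a flat token list from inside the listener, you can walk the tree yourself: recurse into every child context and collect the text of the terminal (leaf) nodes. A minimal sketch, reusing the condition label from the question; the flatten helper name is made up:

from antlr4.tree.Tree import TerminalNode

def flatten(ctx):
    # Collect the text of every leaf node, left to right
    tokens = []
    for child in ctx.getChildren():
        if isinstance(child, TerminalNode):
            tokens.append(child.getText())
        else:
            tokens.extend(flatten(child))
    return tokens

def enterBody(self, ctx):
    print(flatten(ctx.condition))
    # e.g. ['(', '(', 'foo', '==', 'true', ')', '&&', ...]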

Even more likely, you want to access a specific type of child node for some action. Take a look at the parser ANTLR generated for you: you will find a bunch of *Context classes that contain methods for accessing every type of subnode. For example, the ctx parameter of the enterBody() method is an instance of BodyContext, and you can use its methods to access child nodes of a specific type.
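
For instance, assuming the rule from the question labels its expression as condition=expr, the generated BodyContext would expose both the label and an expr() accessor; the exact names depend on your grammar, so treat this as a sketch:

def enterBody(self, ctx):
    # ctx is a BodyContext: labeled elements become attributes,
    # and referenced sub-rules get accessor methods
    cond = ctx.condition   # the expr sub-rule via its label
    same = ctx.expr()      # the same child via the generated accessor
    print(cond.getText(), same.getText())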

UPD: If your grammar only defines a boolean expression and you use it only for validation and tokenization, you don't need the parser at all. Just use the lexer to get a list of all tokens:

import antlr4
from BooleanLexer import BooleanLexer  # the lexer module ANTLR generated from the grammar

input_stream = antlr4.FileStream('input.txt')

# Instantiate and run the generated lexer
lexer = BooleanLexer(input_stream)
tokens = antlr4.CommonTokenStream(lexer)

# Read all tokens until EOF
tokens.fill()

# Print tokens as text (EOF is stripped from the end)
print([token.text for token in tokens.tokens][:-1])
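
If you also need to know what kind of token each piece of text is (for example, to feed a shunting-yard parser), every Token carries a type, and the generated lexer has a symbolicNames list you can index with it. A small sketch under that assumption; tokens without a symbolic name in the grammar show up as <INVALID>:

# Pair each token's text with its symbolic type name (EOF dropped)
for token in tokens.tokens[:-1]:
    print(token.text, BooleanLexer.symbolicNames[token.type])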

3 Comments

I wrote a parser for boolean expressions using the shunting-yard algorithm, so I simply want to tokenise boolean expressions in order to feed them into said algorithm. From that I generate an expression tree in the form of a dictionary. Should I just do this with getChildren and then recursively call it on every sub-element until I get the expected result?
In case you have a grammar only for the boolean expression, see the updated answer. If the boolean expression grammar is part of a larger grammar, you need to recursively call getChildren() to collect only the relevant tokens.
Thanks! I guess I'll use a mixed approach: I'll use ANTLR for everything up to that point and then parse it manually (just because I find it easier that way). I was hoping to get away from using getChildren, as I feel like I'll have to keep calling it on every element's sub-elements, etc. And I've realised that I want to extract not just tokens but some parser rules too, so I guess the token stream will not work.
