Antlr to parse python setup file

Question

I have a Java program that has to parse a Python setup.py file to extract info from it. I sort of have something working, but I hit a wall. I am starting with a simple, stripped-down file first; once I get that running, I will worry about filtering out the noise so it can handle an actual file.

So here's my grammar:

grammar SetupPy ;

file_input: (NEWLINE | setupDeclaration)* EOF;

setupDeclaration : 'setup' '(' method ')';
method : setupRequires testRequires;
setupRequires : 'setup_requires' '=' '[' LISTVAL* ']' COMMA;
testRequires : 'tests_require' '=' '[' LISTVAL* ']' COMMA;

WS: [ \t\n\r]+ -> skip ;
COMMA : ',' -> skip ;
LISTVAL : SHORT_STRING ;

UNKNOWN_CHAR
 : .
 ;

fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;

/// stringescapeseq ::=  "\" <any source character>
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' NEWLINE
;

fragment SPACES
 : [ \t]+
 ;

NEWLINE
 : ( {atStartOfInput()}?   SPACES
   | ( '\r'? '\n' | '\r' | '\f' ) SPACES?
   )
   {
     String newLine = getText().replaceAll("[^\r\n\f]+", "");
     String spaces = getText().replaceAll("[\r\n\f]+", "");
     int next = _input.LA(1);
     if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') {
       // If we're inside a list or on a blank line, ignore all indents,
       // dedents and line breaks.
       skip();
     }
     else {
       emit(commonToken(NEWLINE, newLine));
       int indent = getIndentationCount(spaces);
       int previous = indents.isEmpty() ? 0 : indents.peek();
       if (indent == previous) {
         // skip indents of the same size as the present indent-size
         skip();
       }
       else if (indent > previous) {
         indents.push(indent);
         emit(commonToken(Python3Parser.INDENT, spaces));
       }
       else {
         // Possibly emit more than 1 DEDENT token.
         while(!indents.isEmpty() && indents.peek() > indent) {
           this.emit(createDedent());
           indents.pop();
         }
       }
     }
   }
 ;

And my current test file (like I said, stripping the noise from a normal file is the next step):

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 'unittest2'],
)

Where I am stuck is how to tell ANTLR that setup_requires and tests_require contain arrays. I want the values of those arrays, no matter whether someone used single quotes, double quotes, one value per line, or any combination of the above. I don't have a clue how to pull that off. Can I get some help please? Maybe an example or two?

Things to note:

No, I can't use Jython and just run the file.
Regex isn't an option due to the huge variations in developer styles for this file.

And of course, after this issue I still need to figure out how to strip the noise from a normal file. I tried using the Python3 grammar to do this, but being a novice at ANTLR, it blew me away. I couldn't figure out how to write the rules to pull the values, so I decided to try a far simpler grammar, and quickly hit another wall.

Edit: here is an actual setup.py file that it will eventually have to parse, keeping in mind that setup_requires and tests_require may or may not be there and may or may not be in that order.

# -*- coding: utf-8 -*-
from __future__ import with_statement

from setuptools import setup


def get_version(fname='mccabe.py'):
    with open(fname) as f:
        for line in f:
            if line.startswith('__version__'):
                return eval(line.split('=')[-1])


def get_long_description():
    descr = []
    for fname in ('README.rst',):
        with open(fname) as f:
            descr.append(f.read())
    return '\n\n'.join(descr)


setup(
    name='mccabe',
    version=get_version(),
    description="McCabe checker, plugin for flake8",
    long_description=get_long_description(),
    keywords='flake8 mccabe',
    author='Tarek Ziade',
    author_email='[email protected]',
    maintainer='Ian Cordasco',
    maintainer_email='[email protected]',
    url='https://github.com/pycqa/mccabe',
    license='Expat license',
    py_modules=['mccabe'],
    zip_safe=False,
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points={
        'flake8.extension': [
            'C90 = mccabe:McCabeChecker',
        ],
    },
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Topic :: Software Development :: Quality Assurance',
    ],
)

While trying to debug and simplify, I realized I don't need to find the method, just the values. So I'm playing with this grammar:

grammar SetupPy ;

file_input: (ignore setupRequires ignore | ignore testRequires ignore )* EOF;

setupRequires : 'setup_requires' '=' '[' dependencyValue* (',' dependencyValue)* ']';
testRequires : 'tests_require' '=' '[' dependencyValue* (',' dependencyValue)* ']';

dependencyValue: LISTVAL;

ignore : UNKNOWN_CHAR? ;

LISTVAL: SHORT_STRING;
UNKNOWN_CHAR: . -> channel(HIDDEN);

fragment SHORT_STRING: '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
| '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"';

fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\'
;

Works great for the simple one, and even handles the out-of-order issue. But it doesn't work on the full file; it gets hung up on the

def get_version(fname='mccabe.py'):

equals sign in that line.

I finally got around to this. Unfortunately it breaks down with an actual file: it picks up the import statement and goes all wacky. I did post an example of an actual file it would have to parse. I'm going to play with this a bit longer before I give up and go with a less elegant way of pulling this off. I'm running out of time on it.
– scphantm, Commented Jul 17, 2017 at 1:32
Yes, that's quite a bit more to parse, but your UNKNOWN_CHAR symbol is problematic. Pretty much everything that is not an implied lexer token binds strongly to that rule.
– TomServo, Commented Jul 17, 2017 at 2:39
Yeah, I pulled that out and pulled out the ignores, and same thing. It looks like it's not trying to parse the group "'setup_requires' '=' '['"; it looks like it's trying to grab the tokens individually, and when it hits the = it says hey, we can't find the rest.
– scphantm, Commented Jul 17, 2017 at 15:05
I've been experimenting and I'm right. It is finding the individual tokens, not the pattern of tokens. How do I get it to match a sequence of tokens, not the tokens individually?
– scphantm, Commented Jul 17, 2017 at 16:44

TomServo · Accepted Answer · 2017-07-16 21:45:01Z

I've examined your grammar and simplified it quite a bit. I took out all the Python-esque whitespace handling and just treated whitespace as whitespace. This grammar parses the input below, which covers what you asked for in the question: one item per line, single and double quotes, etc.

setup(
    setup_requires=['pytest-runner'],
    tests_require=['pytest', 
    'unittest2', 
    "test_3" ],
)

And here's the much simplified grammar:

grammar SetupPy ;
setupDeclaration : 'setup' '(' method ')' EOF;
method : setupRequires testRequires  ;
setupRequires : 'setup_requires' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
testRequires : 'tests_require' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ;
WS: [ \t\n\r]+ -> skip ;
LISTVAL : SHORT_STRING ;
fragment SHORT_STRING
 : '\'' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f'] )* '\''
 | '"' ( STRING_ESCAPE_SEQ | ~[\\\r\n\f"] )* '"'
 ;
fragment STRING_ESCAPE_SEQ
: '\\' .
| '\\' 
;

Oh and here's the parser-lexer output showing the correct assignment of tokens:

[@0,0:4='setup',<'setup'>,1:0]
[@1,5:5='(',<'('>,1:5]
[@2,12:25='setup_requires',<'setup_requires'>,2:4]
[@3,26:26='=',<'='>,2:18]
[@4,27:27='[',<'['>,2:19]
[@5,28:42=''pytest-runner'',<LISTVAL>,2:20]
[@6,43:43=']',<']'>,2:35]
[@7,44:44=',',<','>,2:36]
[@8,51:63='tests_require',<'tests_require'>,3:4]
[@9,64:64='=',<'='>,3:17]
[@10,65:65='[',<'['>,3:18]
[@11,66:73=''pytest'',<LISTVAL>,3:19]
[@12,74:74=',',<','>,3:27]
[@13,79:89=''unittest2'',<LISTVAL>,4:1]
[@14,90:90=',',<','>,4:12]
[@15,95:102='"test_3"',<LISTVAL>,5:1]
[@16,104:104=']',<']'>,5:10]
[@17,105:105=',',<','>,5:11]
[@18,108:108=')',<')'>,6:0]
[@19,109:108='<EOF>',<EOF>,6:1]
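
One detail visible in the dump above: the text of each LISTVAL token still carries its surrounding quotes. Here is a minimal Java sketch of stripping them, assuming only the escapes \', \" and \\ need resolving (any other backslash sequence is kept verbatim). The class and method names (ListValText, unquote) are made up for illustration and are not part of ANTLR.

```java
final class ListValText {

    /**
     * Strips the surrounding quotes from a LISTVAL token's text.
     * Assumption: only \' \" and \\ are resolved; other escapes stay as written.
     */
    static String unquote(String tokenText) {
        int last = tokenText.length() - 1;
        if (last < 1) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        char quote = tokenText.charAt(0);
        if ((quote != '\'' && quote != '"') || tokenText.charAt(last) != quote) {
            throw new IllegalArgumentException("not a quoted string: " + tokenText);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < last; i++) {
            char c = tokenText.charAt(i);
            if (c == '\\' && i + 1 < last) {
                char next = tokenText.charAt(i + 1);
                if (next == '\'' || next == '"' || next == '\\') {
                    out.append(next); // drop the backslash, keep the escaped character
                    i++;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Token texts copied from the dump above.
        System.out.println(unquote("'pytest-runner'"));
        System.out.println(unquote("\"test_3\""));
    }
}
```

In a generated listener, the string passed in would typically be the text of a LISTVAL terminal node from the rule context, but that wiring depends on the generated classes and is not shown here.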

Now you should be able to follow a simple ANTLR Visitor or Listener pattern to grab up your LISTVAL tokens and do your thing with them. I hope this meets your needs. It certainly parses your test input well, and more.
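
For the full setup.py from the question, where setup( ... ) is surrounded by imports and function definitions, a possible fallback is a small hand-written scanner. To be clear, this is not ANTLR and not the grammar above; it is a plain-Java sketch of the same idea (find the sequence name, '=', '[' and collect string literals until ']'), with everything else skipped. It assumes the lists contain plain string literals, and it does not treat triple-quoted strings specially. RequiresScanner and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RequiresScanner {

    /** Returns the raw (still quoted) string literals of every `key = [ ... ]` list found. */
    static Map<String, List<String>> scan(String src, List<String> keys) {
        Map<String, List<String>> found = new LinkedHashMap<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '#') {
                i = endOfLine(src, i);               // comment: ignore to end of line
            } else if (c == '\'' || c == '"') {
                i = endOfString(src, i);             // unrelated string: skip it whole
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < src.length() && Character.isJavaIdentifierPart(src.charAt(i))) i++;
                String name = src.substring(start, i);
                if (!keys.contains(name)) continue;
                int j = skipSpaces(src, i);
                if (j >= src.length() || src.charAt(j) != '=') continue;
                j = skipSpaces(src, j + 1);
                if (j >= src.length() || src.charAt(j) != '[') continue;
                List<String> values = new ArrayList<>();
                i = readList(src, j + 1, values);    // the name, '=', '[' sequence matched
                found.put(name, values);
            } else {
                i++;                                 // punctuation, digits, whitespace
            }
        }
        return found;
    }

    /** Collects string literals up to the closing ']' and returns the index after it. */
    private static int readList(String src, int i, List<String> values) {
        while (i < src.length() && src.charAt(i) != ']') {
            char c = src.charAt(i);
            if (c == '\'' || c == '"') {
                int end = endOfString(src, i);
                values.add(src.substring(i, end));
                i = end;
            } else if (c == '#') {
                i = endOfLine(src, i);
            } else {
                i++;                                 // commas, whitespace, newlines
            }
        }
        return Math.min(i + 1, src.length());
    }

    /** Index just past the closing quote of the string that opens at {@code open}. */
    private static int endOfString(String src, int open) {
        char quote = src.charAt(open);
        int i = open + 1;
        while (i < src.length() && src.charAt(i) != quote) {
            i += (src.charAt(i) == '\\') ? 2 : 1;    // a backslash escapes the next char
        }
        return Math.min(i + 1, src.length());
    }

    private static int endOfLine(String src, int i) {
        while (i < src.length() && src.charAt(i) != '\n') i++;
        return i;
    }

    private static int skipSpaces(String src, int i) {
        while (i < src.length() && Character.isWhitespace(src.charAt(i))) i++;
        return i;
    }

    public static void main(String[] args) {
        String src = "def get_version(fname='mccabe.py'):\n    pass\n"
                + "setup(\n    name='mccabe',\n    setup_requires=['pytest-runner'],\n"
                + "    tests_require=['pytest',\n        \"unittest2\"],\n)\n";
        System.out.println(scan(src, Arrays.asList("setup_requires", "tests_require")));
    }
}
```

Because keys are looked up by name, the two lists may appear in either order or be missing entirely; a missing key is simply absent from the returned map.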

And perhaps an upvote on this one as well? Thanks, we both know how hard rep is to come by in these slow tags. :)
