Parse Python code for specific PEP8 issues

Question

I am aware that libraries exist for parsing Python code. However, to learn how they detect errors, I'm creating a script that checks a file for only six PEP8 errors, just for reference.

This is how my current six PEP8 functions look (each one appends an issue to the issues list when it finds one):

"""
[S001] Line is longer than 79 characters
[S002] Indentation is not a multiple of four
[S003] Unnecessary semicolon after a statement (note, semicolons are admissible in comments)
[S004] At least two spaces before inline comments required
[S005] TODO found (only in comments; the case does not matter)
[S006] More than two blank lines used before this line (must be output for the first non-empty line)
"""
def S001(self, ln_num: int, line: str):
    if len(line) > 79:
        self.issues.append(f"Line {ln_num}: S001 Too Long")


def S002(self, ln_num: int, line: str):
    indentation_length = len(line) - len(line.lstrip())
    if indentation_length % 4 != 0:
        self.issues.append(f"Line {ln_num}: S002 Indentation is not a multiple of four")


def S003(self, ln_num: int, line: str):
    regex1 = re.compile("(.*)((;(\s)*#)|(;$))")
    regex2 = re.compile("#.*;")
    if regex1.search(line) and not regex2.search(line):
        self.issues.append(f"Line {ln_num}: S003 Unnecessary semicolon")


def S004(self, ln_num: int, line: str):
    regex = re.compile("(([^ ]{2})|(\s[^ ])|([^ ]\s))#")
    if regex.search(line):
        self.issues.append(f"Line {ln_num}: S004 At least two spaces before inline comments required")


def S005(self, ln_num: int, line: str):
    regex =  re.compile("#(.*)todo", flags=re.IGNORECASE)
    if regex.search(line):
        self.issues.append(f"Line {ln_num}: S005 TODO found")


def S006(self, ln_num: int, line: str):
    if self.code[ln_num-4:ln_num-1] == ['', '', ''] and line != "":
        self.issues.append(f"Line {ln_num}: S006 More than two blank lines used before this line")

Testcases:

""" Test case 1 """
print('What\'s your name?') # reading an input
name = input();
print(f'Hello, {name}');  # here is an obvious comment: this prints greeting with a name


very_big_number = 11_000_000_000_000_000_000_000_000_000_000_000_000_000_000_000
print(very_big_number)



def some_fun():
    print('NO TODO HERE;;')
    pass; # Todo something
""" END """

""" Test Case 2 """
print('hello')
print('hello');
print('hello');;;
print('hello');  # hello
# hello hello hello;
greeting = 'hello;'
print('hello')  # ;
""" END """

""" Test Case 3 """
    print('hello')
    print('hello')  # TODO
    print('hello')  # TODO # TODO
    # todo
    # TODO just do it
    print('todo')
    print('TODO TODO')
    todo()
    todo = 'todo'
""" END """

""" Test Case 4 """
print("hello")


print("bye")



print("check")
""" END """

""" Test Case 5 """
print('hello!')
# just a comment
print('hello!')  #
print('hello!')  # hello

print('hello!') # hello
print('hello!')# hello
""" END """

Testcase 1 Expected Output:

Line 1: S004 At least two spaces before inline comment required
Line 2: S003 Unnecessary semicolon
Line 3: S001 Too long
Line 3: S003 Unnecessary semicolon
Line 6: S001 Too long
Line 11: S006 More than two blank lines used before this line
Line 13: S003 Unnecessary semicolon
Line 13: S004 At least two spaces before inline comment required
Line 13: S005 TODO found

I am aware that my code is not optimal and doesn't handle every edge case, but I want an idea of how real checkers parse for errors properly. I would like improvements or better ideas on how to parse for errors, since I personally don't like my solutions.

You could start by running your code through a style checker like pep8online.com; it suggests 8-10 things you should fix.
– user985366, Commented Sep 19, 2020 at 1:36

Jörg W Mittag · Accepted Answer · 2020-09-19 11:48:37Z

When I copy and paste your code into my editor, it immediately greets me with 50 errors and warnings. To be fair, some of these are duplicates, because I have multiple linters configured. However, well over 20 of those are real. And ironically, at least 6 of those would have been reported by your own code!

PEP8 violations

Maximum line length
Blank lines (after the module docstring before the first method)
Single space around operators
Function names should be lower_snake_case

Missing import

You are missing the import for re.

Missing definitions

You are missing the definitions for issues and code.

Invalid escape sequence

\ is the escape character in string literals, therefore \s is interpreted as an escape sequence, but it is not a valid escape sequence. If you want to have a literal backslash in your string, you need to escape it with another backslash: \\s.
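
A raw string literal is a common alternative to doubling the backslash. The short sketch below is not part of the original code; it only shows that both spellings produce the same pattern text. The sample line passed to search is made up for illustration.

```python
import re

# Doubling the backslash and using a raw string literal give the same text.
escaped = "(.*)((;(\\s)*#)|(;$))"
raw = r"(.*)((;(\s)*#)|(;$))"
assert escaped == raw

# Hypothetical sample line: a statement that ends in a stray semicolon.
semicolon_pattern = re.compile(raw)
print(bool(semicolon_pattern.search("name = input();")))  # True
```

Raw strings tend to be easier to read for regular expressions, because the pattern appears exactly as the re module will see it.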

Missing docstrings

All of your functions are missing docstrings.

Naming

regex, regex1, and regex2 are not very descriptive names.

Bugs

S001 should use 72 as the maximum line length for docstrings and comments, and only use 79 for code. Also, there is a problem with PEP8 itself: it limits the line length to 79/72 characters, but it should instead limit the line length to 79/72 columns. There are characters that take up 2 columns, and there are characters that take up 0 columns. Your code uses characters, as specified by PEP8, so it is correct as far as PEP8 is concerned, but columns would make more sense.
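
To illustrate the difference between characters and columns, here is a rough sketch based on the standard unicodedata module. display_width is a hypothetical helper, and the rule it applies (two columns for East Asian wide and fullwidth characters, zero for combining marks) is only an approximation of what an editor or terminal does; tabs and control characters are not handled.

```python
import unicodedata


def display_width(line: str) -> int:
    """Approximate the number of columns a line occupies on screen."""
    width = 0
    for ch in line:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1
    return width


print(len("日本語"), display_width("日本語"))    # 3 6
print(len("e\u0301"), display_width("e\u0301"))  # 2 1
```

A length check written as display_width(line) > 79 would then count columns, whereas len(line) > 79 counts characters as PEP8 literally specifies.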

S003 will incorrectly flag this:

"""
;
"""

S004 will incorrectly flag this:

" #"

S005 will incorrectly flag this:

"# TODO"

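The three false positives above share one cause: a regex applied to a raw line cannot tell code from string contents. One possible way around that, sketched here under my own assumptions rather than taken from any particular linter, is to let the standard tokenize module split the source into tokens first. find_issues is a hypothetical name, and only S003 and S005 are covered.

```python
import io
import tokenize


def find_issues(source: str) -> list:
    """Report S003 and S005 by looking at tokens instead of raw lines."""
    issues = []
    previous = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        row = tok.start[0]
        # A semicolon is unnecessary when it is an operator token and the
        # next token is a comment or the end of the logical line.
        ends_line = tok.type in (tokenize.COMMENT, tokenize.NEWLINE)
        if previous is not None and ends_line:
            if previous.type == tokenize.OP and previous.string == ";":
                issues.append(f"Line {row}: S003 Unnecessary semicolon")
        # Only real comments are COMMENT tokens; "# TODO" inside a string
        # literal belongs to a STRING token and is never inspected here.
        if tok.type == tokenize.COMMENT and "todo" in tok.string.lower():
            issues.append(f"Line {row}: S005 TODO found")
        previous = tok
    return issues


sample = (
    'text = """\n'
    ';\n'
    '"""\n'
    'label = " #"\n'
    'note = "# TODO"\n'
    'pass;  # todo later\n'
)
print(find_issues(sample))
# ['Line 6: S003 Unnecessary semicolon', 'Line 6: S005 TODO found']
```

Each token also carries its start column, so the S004 spacing check could be built on the same loop. Note that tokenize can raise tokenize.TokenError on incomplete input, such as an unterminated multi-line string, so a real checker would need to handle that.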
FMc · Accepted Answer · 2020-09-20 15:16:07Z

You have a useful review on several details already, so I'll focus on overall design. In particular, what good is the larger class doing you? All of your PEP8 checkers have access to self and they mutate it when errors occur. It feels like the detailed, algorithmically grubby code of the checkers knows far too much. What can be done to simplify their role? One approach is to turn them into pure functions with a more generic behavior.

As a basic demonstration, let's use your original code as the dog food in a revised checking strategy:

import re
from textwrap import dedent
from collections import namedtuple

# We'll use your own code in the demo.
YOUR_CODE = dedent(r'''
    def S001(self, ln_num: int, line: str):
        if len(line) > 79:
            self.issues.append(f"Line {ln_num}: S001 Too Long")


    def S002(self, ln_num: int, line: str):
        indentation_length = len(line) - len(line.lstrip())
        if indentation_length % 4 != 0:
            self.issues.append(f"Line {ln_num}: S002 Indentation is not a multiple of four")


    def S003(self, ln_num: int, line: str):
        regex1 = re.compile("(.*)((;(\s)*#)|(;$))")
        regex2 = re.compile("#.*;")
        if regex1.search(line) and not regex2.search(line):
            self.issues.append(f"Line {ln_num}: S003 Unnecessary semicolon")


    def S004(self, ln_num: int, line: str):
        regex = re.compile("(([^ ]{2})|(\s[^ ])|([^ ]\s))#")
        if regex.search(line):
            self.issues.append(f"Line {ln_num}: S004 At least two spaces before inline comments required")


    def S005(self, ln_num: int, line: str):
        regex =  re.compile("#(.*)todo", flags=re.IGNORECASE)
        if regex.search(line):
            self.issues.append(f"Line {ln_num}: S005 TODO found")


    def S006(self, ln_num: int, line: str):
        if self.code[ln_num-4:ln_num-1] == ['', '', ''] and line != "":
            self.issues.append(f"Line {ln_num}: S006 More than two blank lines used before this line")
''')

What do all checkers need to do their work? Based on your current checks, they need a line number, a line, and sometimes all of the lines (for contextual checks). So let's define a simple data object to hold that context. All checkers will receive one of those objects (a CodeLine) and return True if there is an error. Here's what the checkers would look like with nothing more than those changes.

CodeLine = namedtuple('CodeLine', 'ln_num line lines')

def S001(c: CodeLine):
    return len(c.line) > 79

def S002(c: CodeLine):
    indentation_length = len(c.line) - len(c.line.lstrip())
    return indentation_length % 4 != 0

def S003(c: CodeLine):
    # Raw strings keep \s from being read as a string escape sequence.
    regex1 = re.compile(r"(.*)((;(\s)*#)|(;$))")
    regex2 = re.compile("#.*;")
    return bool(regex1.search(c.line) and not regex2.search(c.line))

def S004(c: CodeLine):
    regex = re.compile(r"(([^ ]{2})|(\s[^ ])|([^ ]\s))#")
    return bool(regex.search(c.line))

def S005(c: CodeLine):
    regex = re.compile("#(.*)todo", flags=re.IGNORECASE)
    return bool(regex.search(c.line))

def S006(c: CodeLine):
    return c.lines[c.ln_num-4:c.ln_num-1] == ['', '', ''] and c.line != ""

Is that change better? It certainly narrows the role and simplifies the code of the checkers. Since they are likely to contain most of the program's complexity, this feels like a good move. Also, these functions are much easier to test and experiment with in debugging situations. They don't need as much bootstrapping to get going: just feed them a lightweight CodeLine. One drawback is that we no longer see the error descriptions in the checkers -- and they do provide useful cues for developers working on the code. One way to remedy that would be to have the checkers return a (FAILED, DESCRIPTION) tuple. Another is just to duplicate the descriptions in docstrings or code comments (PEP8 messages don't change very frequently, so managing that duplication is unlikely to be a significant problem). In any case, you can adjust as desired if you are interested in this general strategy. For now, we'll just define a data structure for the error descriptions.

ERR_DESCS = {
    S001: 'Too Long',
    S002: 'Indentation is not a multiple of four',
    S003: 'Unnecessary semicolon',
    S004: 'At least two spaces before inline comments required',
    S005: 'TODO found',
    S006: 'More than two blank lines used before this line',
}
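
As a rough sketch of the first alternative mentioned above (not the approach the rest of this answer uses), each checker could return a (FAILED, DESCRIPTION) tuple, which keeps the message next to the logic. CHECKERS and run_checks are hypothetical names introduced only for this illustration, only two checkers are shown, and line numbers are assumed to start at 1.

```python
from collections import namedtuple

CodeLine = namedtuple('CodeLine', 'ln_num line lines')


def S001(c: CodeLine):
    return len(c.line) > 79, 'Too Long'


def S002(c: CodeLine):
    indentation_length = len(c.line) - len(c.line.lstrip())
    return indentation_length % 4 != 0, 'Indentation is not a multiple of four'


# Hypothetical registry of checkers; order determines report order per line.
CHECKERS = [S001, S002]


def run_checks(lines):
    """Run every checker on every line and collect formatted messages."""
    messages = []
    for ln_num, line in enumerate(lines, start=1):
        c = CodeLine(ln_num, line, lines)
        for check in CHECKERS:
            failed, desc = check(c)
            if failed:
                messages.append(f'Line {ln_num}: {check.__name__} {desc}')
    return messages


print(run_checks(['x = 1', '  y = 2']))
# ['Line 2: S002 Indentation is not a multiple of four']
```

The trade-off is that every checker now builds a description even when the check passes, which is harmless here but slightly less minimal than returning a plain boolean.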

Finally, some code to run the demo.

def main():
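    # Load up the code. Drop the blank first and last lines left by the
    # triple-quoted string, and number lines from 1 so that the slice
    # arithmetic in S006 looks at the three lines directly above.
    lines = YOUR_CODE.strip('\n').split('\n')
    code_lines = [
        CodeLine(i, line, lines)
        for i, line in enumerate(lines, start=1)
    ]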

    # Run all checks.
    errors = [
        error_message(c, f.__name__, desc, f(c))
        for c in code_lines
        for f, desc in ERR_DESCS.items()
    ]

    # Throw out the non-errors and print.
    errors = list(filter(None, errors))
    for e in errors:
        print(e)

# Helper to convert a CodeLine to a contextual error message.
def error_message(c: CodeLine, name: str, desc: str, failed: bool):
    if failed:
        return f'Line {c.ln_num}: {name} {desc}'
    else:
        return None

main()
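
To illustrate the earlier point about testability: since each checker only needs a CodeLine, it can be exercised without any class or file. This is a minimal sketch; the sample lines are made up, and ln_num is assumed to count from 1, as in the question.

```python
from collections import namedtuple

CodeLine = namedtuple('CodeLine', 'ln_num line lines')


def S001(c: CodeLine):
    return len(c.line) > 79


def S006(c: CodeLine):
    return c.lines[c.ln_num-4:c.ln_num-1] == ['', '', ''] and c.line != ""


# Hypothetical input: three blank lines between two statements.
lines = ['print("bye")', '', '', '', 'print("check")']

assert S006(CodeLine(5, lines[4], lines))      # three blank lines above line 5
assert not S006(CodeLine(1, lines[0], lines))  # nothing above line 1
assert S001(CodeLine(1, 'x' * 80, []))         # 80 characters is too long
assert not S001(CodeLine(1, 'x' * 79, []))     # 79 characters is allowed
print('all checks behaved as expected')
```

The same assertions could be moved into a test framework unchanged, since none of them depends on shared state.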
