I need to quickly write (or borrow) something, in any language, that automatically filters tons of Python source code to remove comments. The goal is to make the code on the target platform more compact (and, as an aside, to make reverse engineering ever so slightly more difficult). I must absolutely not modify the code's behavior, and can live with a few leftover comments. My input and output should be a .py text file, assumed to be valid Python 2.x (assume: restricted to ASCII, I'll take care of UTF8).

Strictly speaking, I do not need to remove comments of the kind defined by

A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line.

because the Python tokenizer already does that for me, and in the end the code is distributed as .pyc. Too bad, because I clearly see how to do that cleanly (the only slightly tricky part is the convoluted syntax of string literals in Python).

My problem is that a cursory look at the Python source code I have to filter shows that it includes loads of comments that are not introduced by #, but simply are string literals that perform no useful task. These are definitely kept in the .pyc tokenized file. They are all over the place, I'm told, to facilitate automatic generation of documentation, and editing. Many of these string literals that really are comments are embedded in function definitions, like:

def OnForceStatusChoice(self,event):
    """Action when a status is selected"""
    self.ExecutionPanel.SetFocus()

On the other hand there are loads of string literals that are useful text, including English text to be displayed to the user, and initialization of tables. That makes it hopeless to automatically and safely recognize those string literals that really are comments from the value of the string literal.

From my sampling, most string literals that really are comments seem to be introduced by """ (with few enough exceptions that I perhaps could live with that), but I understand enough python to know that I can't safely remove all these string literals.

Can I safely (or with some stated and reasonable assumption on coding style) assume that

  1. If the first thing in a .py file, ignoring # comments, is a string literal, it can be removed, recursively? If yes, can this rule be made more powerful by ignoring (and keeping) other things beside # comments?
  2. Any string literal starting on the leftmost column of any line can be removed?
  3. Any string literal starting after something syntactically matching a function definition (like the above def) can be removed? If yes, how do I precisely define syntactically matching a function definition?

Please answer like I can't tell python from a random collection of bytes, which is not far from reality.

  • Your premise is flawed: this isn't really going to contribute to compactness or obfuscation. Also, code obfuscators are totally a thing - you must have performed zero research. Commented May 16, 2014 at 17:27
  • @Marcin: The question explains that I know removing # comments from .py files does nothing, and I want to remove other comments, which I'm told are (or at least include) docstrings. Removing (or emptying) docstrings definitely contributes to compactness (and, slightly, to obfuscation), and I aim at doing that. Commented May 16, 2014 at 18:00
  • I'm sorry, but you're still achieving next to nothing, and did no research. Commented May 16, 2014 at 18:08

1 Answer

What you are calling comments are really docstrings:

A string literal appearing as the first statement in the function body is transformed into the function’s __doc__ attribute and therefore the function’s docstring.

and from the glossary:

A string literal which appears as the first expression in a class, function or module. While ignored when the suite is executed, it is recognized by the compiler and put into the __doc__ attribute of the enclosing class, function or module. Since it is available via introspection, it is the canonical place for documentation of the object.
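
To make the distinction concrete, here is a minimal sketch of which string literals become docstrings and which are ordinary values. The function names are hypothetical and chosen only for illustration; the first one borrows the docstring from the question.

```python
def on_force_status_choice(event):
    """Action when a status is selected"""
    # The string above is the first statement of the body, so it is a docstring.
    return event


def status_message():
    # This string is assigned to a name, so it is ordinary data, not a docstring.
    message = """Status changed"""
    return message


print(on_force_status_choice.__doc__)  # the docstring text
print(status_message.__doc__)          # None: the function has no docstring
```

Only a string in the first-statement position is stored in `__doc__`; the same triple-quoted syntax used anywhere else is a normal value that the program may depend on.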

Compile the project to .pyo files by using the -OO command line switch:

-O
Turn on basic optimizations. This changes the filename extension for compiled (bytecode) files from .pyc to .pyo. See also PYTHONOPTIMIZE.

-OO
Discard docstrings in addition to the -O optimizations.

You can compile all files in your project using the compileall module as a command line utility:

python -OO -m compileall path/to/project/
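
As a rough way to see what -OO does without compiling a whole project, the sketch below compiles the same source at two optimization levels and compares the results. It assumes Python 3, where the built-in compile() accepts an optimize argument (2 corresponds to -OO); on Python 2.x the command line switch above is the route to take. The helper name `load` and the sample function are made up for this illustration.

```python
SOURCE = '''
def on_force_status_choice(event):
    """Action when a status is selected"""
    return event
'''


def load(source, optimize):
    """Compile source at the given optimization level and return its namespace."""
    namespace = {}
    code = compile(source, "<demo>", "exec", optimize=optimize)
    exec(code, namespace)
    return namespace


plain = load(SOURCE, 0)      # like running without -O
stripped = load(SOURCE, 2)   # like running with -OO

print(plain["on_force_status_choice"].__doc__)     # docstring is present
print(stripped["on_force_status_choice"].__doc__)  # None: docstring discarded
```

The function behaves the same in both cases; only the `__doc__` attribute differs.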

However, Python bytecode is trivial to decompile. Removing the docstrings is not going to buy you much.

If you need something more specialized still, you'll have to learn how to use the ast module to parse Python code into a parse tree, manipulate that tree (e.g. remove all docstrings), then write out transformed Python code. See Parse a .py file, read the AST, modify it, then write back the modified source code for some pointers in that direction.
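
A minimal sketch of that parse, modify, write-out approach follows. It is an illustration under stated assumptions rather than a finished tool: it assumes Python 3.9 or later (for ast.unparse), so it does not run on the 2.x interpreter the question targets, and the function name `strip_docstrings` is hypothetical. It removes only strings in docstring position and leaves every other string literal alone.

```python
import ast

# Node types whose first body statement can be a docstring.
DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def strip_docstrings(source):
    """Return source code with module, class, and function docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, DOC_OWNERS) or not node.body:
            continue
        first = node.body[0]
        is_docstring = (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        )
        if not is_docstring:
            continue
        if len(node.body) == 1:
            # A body must not be empty, so replace a lone docstring with pass.
            node.body[0] = ast.Pass()
        else:
            del node.body[0]
    # Regenerating source from the tree also drops # comments and
    # the original formatting.
    return ast.unparse(tree)


example = '''
def OnForceStatusChoice(self, event):
    """Action when a status is selected"""
    label = "Status"
    return label
'''
print(strip_docstrings(example))
```

Because the output is regenerated from the tree, it will not match the original layout, so it is worth running the project's test suite against the transformed files.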

8 Comments

@fgrieu: .pyo files are really just .pyc files but use a new extension to reflect that they no longer hold assert statements and have __debug__ set to False; you can rename them if you so wish.
@fgrieu: the strings at the top of .py files are docstrings. A .py file is a module.
@fgrieu: Your test suite should be preventing the disasters. Asserts are nice for developers, but test suites are better.
@fgrieu: I gave you the other option that you have; parse, modify parse tree, write out modified code.
@fgrieu: if some_constraint_not_met: raise AssertionError('Error message'). Use that instead of assert statements and it'll never get stripped.
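
A short sketch of the pattern in the last comment, with a hypothetical function and message: an explicit raise is an ordinary statement, so it is kept when running with -O or -OO, whereas an assert statement would be removed.

```python
def checked_ratio(total, count):
    # Explicit check: unlike "assert count > 0", this is never stripped by -O/-OO.
    if count <= 0:
        raise AssertionError('count must be positive')
    return total / float(count)


print(checked_ratio(6, 3))
```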