AJNeufeld

Python 3.10

Assuming you're up to the latest version of Python, there is a new feature: Structural Pattern Matching. With it, you can match compound structures.

For example, parsing of the top-level blocks could look like:

        # Top-level block (non-indented line)

        words = line.split(maxsplit=3)
        match words:

            case ['table', name]:
                block_data = {"name": name}
                json_parsed['table'].append(block_data)
                # remainder of table parsing ...

            case ['define', name, '=', definition]:
                json_parsed['define'].append({name: definition})

            case [('object' | 'obj'), name]:
                block_data = {"name": name}
                json_parsed['object'].append(block_data)
                # remainder of object parsing ...

            case [('object' | 'obj'), name, ':', type_alias]:
                raise NotImplementedError("Type aliases don't seem to be supported yet")

            case [('info' | 'region') as block_type, name]:
                block_data = {"name": name}
                json_parsed[block_type].append(block_data)
                # remainder of block parsing ...

            case _:
                print("Warning: Unrecognized block:", line)

Note the use of Or-patterns to treat both "object" and "obj" as object definitions, per the ADL definition.

Also, from the sample file, there is the block:

object vetoleptons : Union(vetoelectrons, vetomuons)

which does not appear to be properly supported in your code yet.


 

Indentation

The ADL syntax is composed of blocks with indented lines. Using line = line.strip() removes all leading and trailing white-space, destroying the block-level formatting. You are left relying on region, table, etc. being block-level keywords, with the hope that in the future a region will never have a table keyword inside it, or vice-versa.

Instead:

    for line in f:

        # Remove comments
        if '#' in line:
            line = line[:line.index('#')]

        if not line or line.isspace():
            continue    # White-space only line
        elif line.startswith((' ', '\t')):
            ...         # Indented line: should be inside a block
        else:
            ...         # Outdented line: should be start of a block

End-of-block processing

When you detect the end of a block, or start a new block while a previous block is still uncommitted, you append the previous block and clear the collection area ...

            json_parsed[block_type].append(block_data)
            block_data = {}
            parsing_block = False

When you start a new block, you remember the block_type, and set the parsing_block to True.

You could make block_type do double duty. Set block_type to None when you are not parsing a block, and you could remove the parsing_block flag.
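
A minimal sketch of that double duty, run over a toy input (the names follow the code under review; the real parsing branches are elided):

```python
json_parsed = {"table": [], "object": []}

block_type = None          # None means "not currently parsing a block"
block_data = None

for line in ["table mytable", "", "object jets", ""]:
    if line and not line.startswith((' ', '\t')):
        # Start of a new block: first word is the type, rest is the name.
        block_type, name = line.split(maxsplit=1)
        block_data = {"name": name}
    elif not line.strip() and block_type is not None:
        # End of block: block_type itself replaces the parsing_block flag.
        json_parsed[block_type].append(block_data)
        block_type = None
        block_data = None
```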

A possibly better approach could be to immediately store the block in the output structure when you start a block ...

        if line.startswith(block_names):    # Based on current code
            block = line.split(maxsplit=1)
            block_data = {"name": block[1]}
            json_parsed[block[0]].append(block_data)

Then, when it is time to do the end-of-block processing, you're already done! Though block_data is empty when it is added, you fill in the data in subsequent steps, as you are already doing.

Using block_data = None when you finished one block and haven't started the next may be useful, as it will prevent accidental corruption of your completed block.

This has eliminated both block_type and parsing_block variables.

table_k_v

This is very fragile code:

table_k_v = "# val    err-     err+     xmin     xmax"
...
                if block_type == "table":
                    if line == table_k_v:
                        parsing_table = True
                        ...
                    ...

If the table layout changes ever so slightly, perhaps adding another significant figure to the data, the comment may be adjusted without a second thought, and your code will fail to detect the table.

A slightly better approach might be to split the line on spaces, and compare that to a pre-split constant:

TABLE_K_V = "#  val  err-  err+  xmin  xmax".split()
...
                if block_type == "table":
                    if line.split() == TABLE_K_V:
                        parsing_table = True
                        ...
                    ...

You could throw in a .casefold() in case someone decides # Val Err- Err+ XMin XMax looks better in the comment line.
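
For example, both spellings then compare equal:

```python
TABLE_K_V = "# val err- err+ xmin xmax".split()

line = "# Val    Err-     Err+     XMin     XMax"
line.casefold().split() == TABLE_K_V    # True
```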

However, this is still fragile. It is only a comment line, and other modifications like err- to -Err would be very difficult to account for. Since we cannot rely on it, removing the comments completely (as shown above) makes sense.

In the table section, if the line consists of 5 numbers, it is a table row. If the line consists of an identifier and a value, it is a key-value pair. If it is anything else, it is likely an error.
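
Sketched as a helper (classify_table_line is a hypothetical name, not something in the code under review, and it assumes the line has already been stripped of comments and surrounding white-space):

```python
def classify_table_line(line):
    """Classify a stripped, comment-free line inside a table block."""
    words = line.split()
    if len(words) == 5:
        try:
            return ("row", [float(w) for w in words])
        except ValueError:
            pass                    # five words, but not all numeric
    if len(words) == 2:
        return ("keyword", words[0], words[1])
    raise ValueError(f"Unrecognized table line: {line!r}")
```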

Global Constants

block_names = ("object", "region", "info", "table ")
keywords = ("define", "select", "reject", "take", "using", "sort", "weight", "histo")
table_k_v = "# val    err-     err+     xmin     xmax"

These should be defined in UPPER_CASE, to indicate they are constants.

Suspicious value

In block_names = ("object", "region", "info", "table "), the final value has a trailing space, none of the other values do. This is suspicious.

Reading the code, I see the test line.startswith(block_names), and reviewing the sample file, I see the word tabletype, so I'm guessing you've used "table " with the trailing space to avoid accidentally matching tabletype. However, this is confusing and fragile. A new term like "regional" could accidentally match "region", and a change in white-space, such as using a tab character (\t) instead of a space, would prevent the correct match in "table\tbtagdeepCSVmedium".
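
Comparing the first whitespace-delimited word exactly avoids both problems. A sketch (starts_block is a hypothetical helper name):

```python
BLOCK_NAMES = {"object", "region", "info", "table"}   # no trailing space

def starts_block(line):
    """True if a non-indented line begins a new block."""
    words = line.split(maxsplit=1)
    # Exact comparison: "table\tfoo" matches, "tabletype x" does not.
    return bool(words) and words[0] in BLOCK_NAMES
```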

Close resources

When opening resources, use a with statement to ensure the resource is closed. You never explicitly call f.close() in your code, so the file may be held open for much longer than you intend; the with-syntax closes it for you, with no explicit f.close() needed.

    # Read the ADL file
    with open(adl_file, 'r') as f:
        ...

Multiple ways to parse blank lines

        if line in ['\n', '\r\n']:
            if parsing_block:
                json_parsed[block_type].append(block_data)
                block_data = {}
                parsing_block = False
                parsing_table = False
            else:
                continue

        line = line.strip()
        if not line:
            continue

On a blank line, if you are in a parsing block you do some work and then proceed to line = line.strip(); if not, you continue the loop. After that blank-line check, you strip white-space from the start and end, and if that results in an empty line, you continue without checking whether you are in a parsing block.

Uh ... so two lines which are visually the same are handled differently depending on whether they contain spaces or tabs?

Plus you later have to handle "when new block starts without an empty line" anyway.

It would be simpler to strip whitespace, and then check for empty/blank lines, and rely on the existing "when new block starts without an empty line" code:

        line = line.strip()
        if not line:
            continue

Comments

        if line.startswith("#"):
            continue

This happens before stripping off leading white-space, so indented comments would still be processed.

From the ADL definition you linked to, comments can also appear at end of lines:

keyword3 value3 # comment about value3

As long as a # character can't appear embedded inside a quoted string, it would simplify later parsing if these were removed. Since removing a comment can turn a non-blank line into a blank line, let's strip off comments first, then check for blank lines:

        if '#' in line:
            line = line[:line.index('#')]

        line = line.strip()
        if not line:
            continue

Splitting on Spaces

                        values = list(filter(None, line.split(" ")))

It took me a while to figure out what you were trying to do here, but I finally got it. You are splitting a line on the space character, which gives you a bunch of empty strings when terms are separated by multiple spaces. The filter(None, ...) produces an iterator that drops those falsy empty strings, which you then turn into a list.

This does practically the same thing:

                        values = line.split()

It is better, actually, since it will treat any combination of one or more white-space characters as one separator, allowing the tables to use tabs to line things up nicely.
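
A quick comparison, with a tab thrown in:

```python
line = "obj1    0.5\t0.1"

list(filter(None, line.split(" ")))   # ['obj1', '0.5\t0.1'] -- tab not split!
line.split()                          # ['obj1', '0.5', '0.1']
```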

Last element of list

This bugged me ... keyword_value[-1] ... until I realized what you were doing with .split(" "). Using .split() fixes that issue, and you can safely use keyword_value[1] to reference the second item in the line, even if multiple spaces exist between the first and second items.

Exception Handling

    for line in f:
        ...

        try:
            ...
                else:
                    raise ValueError('Parsing of Keyword {} is not supported.'.format(keyword_value[0]))
        except Exception:
            block_data = {}
            parsing_block = False
            parsing_table = False

Catching any Exception is bad. If you make an error, like using keyword_val instead of keyword_value, or using a bad list index, the exception (NameError, IndexError) is caught, and your code will blank out the current parsing and muddle on without ever telling you what happened. Yikes!

You are explicitly raising a ValueError. You should explicitly catch that, and only that.

However, you've only got one raise statement, and it is immediately followed by the except clause. You could eliminate the entire try...except, and move the cleanup code into the else: branch.

You might want to print out the unrecognized keywords too. You might think you've coded for them all, but you might not have.
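
A sketch of the reshaped else: branch, with toy values standing in for the names from the code under review (the real keyword handling is elided):

```python
keyword_value = ["frobnicate", "42"]   # a made-up unrecognized keyword line
block_data = {"name": "jets"}
parsing_block = parsing_table = True

if keyword_value[0] in ("select", "reject", "take"):
    ...                                # existing keyword handling goes here
else:
    # No raise, no try/except: report the problem and reset in place.
    print("Warning: unsupported keyword:", keyword_value[0])
    block_data = {}
    parsing_block = False
    parsing_table = False
```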

