How to exclude a group of characters in Python

Question

I want to write a script that returns the numbers with a power of 1, that is, numbers without an exponent. The user's input mixes terms raised to a power with plain numbers. What I want is described below:

input = "+2**5+3+4**8-7"
Output = "3,-7"

I tried the regex re.findall(r"[+-]?[0-9]+[^[*][*][2]]", input), but it doesn't work. Thanks in advance :)

Martijn Pieters · Accepted Answer · 2019-09-23 09:51:42Z

4

You need negative look-around assertions, plus boundary anchors:

r'(?<!\*\*)-?\b\d+\b(?!\*\*)'

The (?<!...) syntax only matches at positions where the text before it doesn't match the pattern. Similarly, the (?!...) syntax does the same for the text that follows. Together they ensure you only match numbers that are not exponents (they don't follow **) and that don't have an exponent (they are not followed by **).

The \b boundary anchor only matches at the start or end of a string, and anywhere there’s a word character followed by a non-word character or vice versa (so in between \w\W or \W\w, where \w happily includes digits but not arithmetic characters):

>>> import re
>>> input = "+2**5+3+4**8-7"
>>> re.findall(r'(?<!\*\*)-?\b\d+\b(?!\*\*)', input)
['3', '-7']

Note that I used \d to match digits, and removed the + from the pattern, since you don't want that in your expected output.

You can play with the expression in the online regex101 demo; e.g. you can try it with numbers > 10 and using a single * for multiplication.
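As a rough sketch of that experiment, the snippet below runs the same pattern on an illustrative expression (the string in expression is a made-up example, not from the question) that mixes multi-digit numbers, a single * for multiplication, and ** exponents:

```python
import re

# Pattern from the answer: skip numbers that follow ** or are followed by **.
pattern = re.compile(r'(?<!\*\*)-?\b\d+\b(?!\*\*)')

# Hypothetical test expression with multi-digit numbers and a single *.
expression = "12*3+10**2-45+6**12"

result = pattern.findall(expression)
print(result)  # ['12', '3', '-45']
```

Both operands of the single * are kept, while 10, 2, 6 and 12 are dropped because each one either follows ** or is followed by it. The \b anchors are what stop the pattern from matching just the 1 of 10.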

If you must support negative exponents, then the above won't suffice, because in ...**-42 the 42 still matches: the digits are directly preceded by -, not by **. In that case an extra negative look-behind after the -? that disallows **- is needed:

r'(?<!\*\*)-?(?<!\*\*-)\b\d+\b(?!\*\*)'

(Thanks to Casimir et Hippolyte for pointing this out and suggesting a solution for it.)
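To make the difference concrete, here is a small comparison of the two patterns on an input with a negative exponent. The names simple and strict and the sample string are illustrative assumptions, not part of the original answer:

```python
import re

# First pattern: only checks for ** directly before and after the number.
simple = re.compile(r'(?<!\*\*)-?\b\d+\b(?!\*\*)')
# Second pattern: additionally rejects digits preceded by **-.
strict = re.compile(r'(?<!\*\*)-?(?<!\*\*-)\b\d+\b(?!\*\*)')

# Hypothetical input containing a negative exponent.
expression = "2**-42+3-7**2+5"

simple_result = simple.findall(expression)
strict_result = strict.findall(expression)
print(simple_result)  # ['42', '3', '5']
print(strict_result)  # ['3', '5']
```

The first pattern lets the exponent 42 through because the character right before it is -, while the second look-behind closes that gap.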

However, at this point I’d suggest you switch to just parsing the expression into an abstract syntax tree and then walking the tree to extract the operands that are not part of an exponent:

import ast

class NumberExtractor(ast.NodeVisitor):
    def __init__(self):
        self.reset()

    def reset(self):
        self.numbers = []

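    def _handle_number(self, node):
        # Return the numeric value of a literal node, or None otherwise.
        # ast.Constant covers numeric literals on Python 3.8+; ast.Num is
        # deprecated since 3.8, so it is looked up defensively.
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float, complex)):
                return node.value
        elif isinstance(node, getattr(ast, 'Num', ())):
            return node.n

    def visit_UnaryOp(self, node):
        if isinstance(node.op, (ast.UAdd, ast.USub)):
            operand = self._handle_number(node.operand)
            if operand is None:
                # not a plain number, e.g. -(3 + 4): keep walking
                self.generic_visit(node)
            elif isinstance(node.op, ast.UAdd):
                self.numbers.append(+operand)
            else:
                self.numbers.append(-operand)
        else:
            self.generic_visit(node)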

    def visit_Constant(self, node):
        if isinstance(node.value, (int, float, complex)):
            self.numbers.append(node.value)

    def visit_Num(self, node):
        self.numbers.append(node.n)

    def visit_BinOp(self, node):
        if isinstance(node.op, ast.Pow):
            return  # ignore exponentiation
        self.generic_visit(node)  # process the rest

def extract(expression):
    try:
        tree = ast.parse(expression, mode='eval')
    except SyntaxError:
        return []
    extractor = NumberExtractor()
    extractor.visit(tree)
    return extractor.numbers

This extracts just the numbers; subtraction won’t produce a negative number:

>>> input = "+2**5+3+4**8-7"
>>> extract(input)
[3, 7]

Moreover, it can handle arbitrary amounts of whitespace, and much more complex expressions than a regex could ever handle:

>>> extract("(10 + 15) * 41 ** (11 + 19 * 17) - 42")
[10, 15, 42]
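To see why 7 comes out positive and why the leading +2**5 is skipped entirely, it can help to look at the tree the parser builds. This is only an inspection sketch (the variable names top and first_term are illustrative), assuming Python 3.8 or newer:

```python
import ast

tree = ast.parse("+2**5+3+4**8-7", mode="eval")
top = tree.body                  # outermost operation: (...) - 7
first_term = top.left.left.left  # the leading +2**5

print(type(top.op).__name__)                 # Sub
print(type(first_term).__name__)             # UnaryOp
print(type(first_term.operand.op).__name__)  # Pow
print(ast.dump(tree))                        # full tree, for inspection
```

The trailing -7 is a binary subtraction with the constant 7 on its right, not a unary minus, so the visitor records 7. The leading + is a unary operator wrapped around the whole 2**5 power, which the visitor ignores.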

edited Sep 23, 2019 at 9:51

answered Sep 22, 2019 at 10:15

Martijn Pieters


10 Comments

Casimir et Hippolyte Over a year ago

**-12: I'm afraid two lookbehinds are needed for this case (regex101.com/r/qZy32F/1)

Martijn Pieters Over a year ago

@CasimiretHippolyte ah yes, that’s unfortunate. I’ll ponder it for a bit; limited time at the moment.

Code Maniac Over a year ago

@CasimiretHippolyte one alternative, I think, is using alternation: have a pattern that takes care of the start and end parts of the string. Regex Demo

Martijn Pieters Over a year ago

@CodeManiac using the Python parser and traversing the AST is definitely a better option there :-)

Casimir et Hippolyte Over a year ago

You can also use a simple and stupid solution: remove all numbers with an exponent first.
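A minimal sketch of that two-step idea, assuming integer operands only; the function name numbers_without_exponent and both patterns are illustrative choices, not taken from the comment:

```python
import re

def numbers_without_exponent(expression):
    # Step 1: drop every "base**exponent" term, including optional signs.
    stripped = re.sub(r'[+-]?\d+\*\*[+-]?\d+', '', expression)
    # Step 2: whatever numbers remain have no exponent.
    return re.findall(r'-?\d+', stripped)

print(numbers_without_exponent("+2**5+3+4**8-7"))  # ['3', '-7']
```

This handles negative exponents such as 2**-42 without any look-behind, but it is not robust against chained powers like 2**3**4, where a leftover exponent would survive the first step.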


Toto · Accepted Answer · 2019-09-22 10:48:40Z

3

re.findall(r"(?<!\*\*)(?<!\*\*[+-])[+-]?\b[0-9]++(?!\*\*)", input)

(?!\*\*) is a negative lookahead that makes sure the digits are not followed by two * characters.

Before Python 3.11, re doesn't support possessive quantifiers, so there you have to use the regex module from PyPI.
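If installing a third-party module is not an option, one possible variant for the standard re module replaces the possessive ++ with a plain + followed by \b. This is a sketch of an equivalent for this kind of input, not the exact pattern above; the names pattern and expression are illustrative:

```python
import re

# The trailing \b stops the engine from backtracking into the digits,
# which is the job the possessive ++ does in the original pattern.
pattern = re.compile(r'(?<!\*\*)(?<!\*\*[+-])[+-]?\b[0-9]+\b(?!\*\*)')

expression = "+2**5+3+4**8-7"
result = pattern.findall(expression)
print(result)  # ['+3', '-7']
```

Unlike the accepted answer's pattern, this one keeps a leading + in the match, because the sign is matched with [+-]? rather than -?.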

Demo

edited Sep 22, 2019 at 10:48

answered Sep 22, 2019 at 10:15

Toto


4 Comments

Casimir et Hippolyte Over a year ago

Python re module doesn't have possessive quantifiers.

Toto Over a year ago

@CasimiretHippolyte: I wasn't aware of that, I've just tested with regex101.

Casimir et Hippolyte Over a year ago

Note that even with possessive quantifier support, your pattern will match **-12 since the ? isn't possessive too.

Toto Over a year ago

@CasimiretHippolyte: Fixed with two lookbehinds and a word boundary.

han solo · Accepted Answer · 2019-09-22 11:27:01Z

2

You could write a parser and check whatever you need. I know it is a bit long, but fun :)

$ cat lexer.py
import re
from collections import namedtuple

tokens = [
    r'(?P<TIMES>\*)',
    r'(?P<POW>(\+|-)?\d+\*\*\d+)',
    r'(?P<NUM>(\+|-)?\d+)'
    ]

master_re = re.compile('|'.join(tokens))
Token = namedtuple('Token', ['type','value'])
def tokenize(text):
    scan = master_re.scanner(text)
    return (Token(m.lastgroup, m.group())
            for m in iter(scan.match, None))

x = '+2**5+3+4**8-7'

required = []
for tok in tokenize(x):
    if tok.type == 'POW':
        coeff, exp = tok.value.split('**')
        if exp == '1':
            required.append(coeff)
    elif tok.type == 'NUM':
        required.append(tok.value)

print(required)

Output:

$ python lexer.py
['+3', '-7']

edited Sep 22, 2019 at 11:27

answered Sep 22, 2019 at 11:08

han solo


Maninder · Accepted Answer · 2019-09-22 14:03:37Z

0

You can try this simple regex (note that it only handles single-digit numbers that are preceded by a sign):

re.findall(r'[-\+]\d(?!\*\*)', search_data)

answered Sep 22, 2019 at 14:03

Maninder
