
I'm trying to extract data from sentences such as:

"monthly payment of 525 and 5000 drive off"

using Python's regex search function, re.search().

My regex string for the down payment is as follows:

match1 = "(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*" + \
         "(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"

My problem is that it matches the wrong numerical value as the down payment: it picks up both 525 and 5000.

How can I improve my regex string such that it only matches an element if another element is successfully matched as well?

In this case, for example, both 5000 and drive-off matched, so we can extract 5000 as down_payment. But 525 is not followed by any down payment keyword, so it should not be considered at all.


  • Remove the final asterisk to make the right-hand side context obligatory. Check regex101.com/r/bm8EHE/1. What was the point of making it optional? Commented Jan 4, 2017 at 7:25
  • You can write some conditions based on group values. Commented Jan 4, 2017 at 7:27
  • Please clarify the requirements or let me know if my suggestion works for you. Commented Jan 4, 2017 at 7:34
  • @WiktorStribiżew I'm very new to regex and I misunderstood how it works. I thought it grabs parts of patterns too... That extra asterisk was a typo; I should use a + or remove it instead. Thank you. Commented Jan 4, 2017 at 7:37
  • So, the first numbered capturing group ((|\$|dollars*|money)*) can still be missing, right? And the last one is obligatory. Commented Jan 4, 2017 at 7:38

1 Answer


The point is that you want to match a sequence of patterns. For the trailing patterns to be taken into account, they cannot all be optional. Note that \s*, (|\$|dollars*|money)*, \s*, and (down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)* can each match an empty string, so \d+ alone is enough to produce a match.
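To see this in practice, here is a minimal sketch that runs the question's pattern against the sample sentence. It assumes Python 3 and writes the pattern as raw strings (the question used plain strings); the names ORIGINAL, sentence, and first_amount are only illustrative.

```python
import re

# The question's pattern, unchanged apart from the raw-string prefixes
# (an assumption made here so the backslashes are passed to re as-is).
ORIGINAL = (
    r"(?P<down_payment>\d+)\s*(|\$|dollars*|money)*\s*"
    r"(down|drive(\s|-)*off|due\s*at\s*signing|drive\s*-*\s*off)*"
)

sentence = "monthly payment of 525 and 5000 drive off"

# Everything after \d+ may match the empty string, so re.search() stops at
# the first run of digits it finds, whether or not a keyword follows it.
first = re.search(ORIGINAL, sentence)
first_amount = first.group("down_payment")
print(first_amount)  # 525
```

The search succeeds on 525 because nothing after the digits is required to be present.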

I suggest removing the final * quantifier to match exactly one occurrence of the pattern:

(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)

See the regex demo

Also note that I contracted the (\s|-) group into a character class [\s-], since you only alternate single-character patterns, and turned (|\$|dollars*|money)* into a non-capturing optional group (?:\$|dollars*|money)? that matches just 1 or 0 occurrences of $, dollar (followed by any number of s characters), or money.
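As a rough illustration of how the suggested pattern behaves with re.search(), the sketch below wraps it in a small helper. It assumes Python 3; FIXED and find_down_payment are hypothetical names introduced only for this example.

```python
import re

# The suggested pattern: the keyword group is no longer optional, so a number
# only matches when a down-payment keyword follows it.
FIXED = re.compile(
    r"(?P<down_payment>\d+)\s*(?:\$|dollars*|money)?\s*"
    r"(down|drive[\s-]*off|due\s*at\s*signing|drive\s*-*\s*off)"
)


def find_down_payment(text):
    """Hypothetical helper: return the matched amount as a string, or None."""
    match = FIXED.search(text)
    return match.group("down_payment") if match else None


print(find_down_payment("monthly payment of 525 and 5000 drive off"))  # 5000
print(find_down_payment("monthly payment of 525"))  # None
```

Because the match can fail now, check the result of search() for None before calling group().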
