How can I extract a numeric expression from a string when it may or may not contain an underscore or a hyphen? For example, 2016-03, 2016_03, or simply 201603.

Sample strings:

s = 'Total revenue for 2016-03 is 3000 €'  # Output 2016-03
s = 'Total revenue for 2016_03 is 3000 €'  # Output 2016_03
s = 'Total revenue for 201603 is 3000 €'   # Output 201603

There are 6 digits, and if either - or _ is present, the total length is 7. There is no other number in the entire string.

I don't know how to use if-else in a regex so that it can include the logic of length 6 or 7. For simple strings like 201603, I am able to do it:

import re
print(re.findall(r'\d{6}', 'Total revenue for 201603 is 3000 €'))
['201603']

print(re.findall(r'\d{6}', 'Total revenue for 2016-03 is 3000 €'))
[]

Note: I am looking for a solution where theoretically _ or - could be anywhere in between the 6 length number. Like 123-456 or 123456 or 12345-6 and so on.

  • You can try (?<=^Total revenue for )(\d+[-_]?\d+) Commented Oct 7, 2019 at 12:41
  • \d{6} matches exactly 6 digits in a row, so a - or _ in between breaks the match. Commented Oct 7, 2019 at 12:41
  • I suspect that in your case, whitespace boundaries will work, i.e. r'(?<!\S)(?=\d+[_-]\d+)[\d_-]{6,7}(?!\S)'. Probably, it will be simpler to split with whitespace and then test with ^(?=.{6,7}$)\d+[-_]\d+$ Commented Oct 7, 2019 at 12:41
  • @WiktorStribiżew It doesn't work for the case where there is no - or _ Total revenue for 201603 is 3000 € Commented Oct 7, 2019 at 12:46
  • I missed the ? again, sorry, I edited the comment above and added the link to the regex demo. I do not like this pattern since there are too many checks involved. Probably, the (?!\S) is still better at the end: r'(?<!\S)(?=\d+(?:[_-]\d+)?)[\d_-]{6,7}(?!\S)' or even doubled: r'(?<!\S)(?=\d+(?:[_-]\d+)?(?!\S))[\d_-]{6,7}(?!\S)'. Too much redundancy. I would combine a regex with some code. Commented Oct 7, 2019 at 13:04

4 Answers 4

2

There are two approaches: a more readable one that splits the string first and takes the first item matching the required pattern, and a less readable one that uses a single regex.

See the Python demo:

import re
s = 'Total revenue for 201603 is 3000 €'
rx = re.compile(r'^(?=\d+(?:[_-]\d+)?$)[\d_-]{6,7}$')
res = [x for x in s.split() if rx.search(x)]
if res:
    print(res[0])

# Pure regex approach:
rx = re.compile(r'(?<!\S)(?=\d+(?:[_-]\d+)?(?!\S))[\d_-]{6,7}(?!\S)')
res = rx.search(s)
if res:
    print(res.group())

So, in the first approach, the string is split on whitespace, the ^(?=\d+(?:[_-]\d+)?$)[\d_-]{6,7}$ pattern is applied to each item, and if there are any matches, the first one is returned. The pattern matches:

  • ^ - start of string
  • (?=\d+(?:[_-]\d+)?$) - a positive lookahead that makes sure there are 1+ digits, then optionally a _ or - followed by 1+ digits, up to the end of string,
  • [\d_-]{6,7} - matches 6 or 7 characters, each of which is a digit, - or _
  • $ - end of string.

The second approach involves regex only: the ^ anchor is replaced with (?<!\S) and $ is replaced with (?!\S), which act as whitespace boundaries. (?<!\S) is a negative lookbehind that requires a whitespace or the start of string right before the current position, and (?!\S) is a negative lookahead that requires a whitespace or the end of string right after the current position.
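As a side note, [\d_-]{6,7} checks the total length, not the digit count, so it also accepts tokens such as 1234567 or 12-345. If exactly six digits are required, one option is to combine a simple regex with a digit count in code. The following is only a minimal sketch under that assumption; find_code and TOKEN_RX are hypothetical names that do not come from the answer above.

```python
import re

# Hypothetical helper (not from the answer): digits with at most one
# interior '-' or '_' separator.
TOKEN_RX = re.compile(r'\d+(?:[-_]\d+)?')

def find_code(text, digits=6):
    """Return the first whitespace-separated token with exactly `digits`
    digits and at most one interior separator, or None if there is none."""
    for token in text.split():
        if TOKEN_RX.fullmatch(token):
            # Count only the digits, so the separator position does not matter.
            if sum(ch.isdigit() for ch in token) == digits:
                return token
    return None

for s in ('Total revenue for 2016-03 is 3000 €',
          'Total revenue for 2016_03 is 3000 €',
          'Total revenue for 201603 is 3000 €',
          'Total revenue for 12345-6 is 3000 €'):
    print(find_code(s))
# 2016-03
# 2016_03
# 201603
# 12345-6
```

Counting digits separately keeps the regex short and leaves the separator free to sit anywhere between the digits.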

You can use a positive lookbehind if you're sure your required value always follows a standard pattern:

(?<=^Total revenue for )\d+[-_]?\d+
  • (?<=^Total revenue for ) - The match must be preceded by Total revenue for; ^ anchors that prefix to the start of the string
  • \d+ - Match one or more digits
  • [-_]? - Match - or _ (optional)
  • \d+ - Match one or more digits

Regex Demo


Or you can extend the above regex as follows if you're not sure about the format of the required value:

(?<=^Total revenue for )(?=\d+[-_]?\d+)[\d_-]{6,7}(?!\S)
  • (?=\d+[-_]?\d+) - To ensure one or more digits, followed by an optional - or _, followed by one or more digits
  • [\d_-]{6,7} - To match digit or - or _, 6 or 7 times
  • (?!\S) - Should not be followed by a non space character

Regex Demo
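To show where the pattern goes in Python, here is a short sketch that passes the second regex to re.compile and re.search. The extract helper is a hypothetical name used only for illustration, and the fixed prefix Total revenue for is assumed to be present exactly as in the sample strings.

```python
import re

# Second pattern from this answer; it only matches right after the fixed prefix.
rx = re.compile(r'(?<=^Total revenue for )(?=\d+[-_]?\d+)[\d_-]{6,7}(?!\S)')

def extract(s):
    """Hypothetical helper: return the matched value, or None if absent."""
    m = rx.search(s)
    return m.group() if m else None

print(extract('Total revenue for 2016-03 is 3000 €'))  # 2016-03
print(extract('Total revenue for 2016_03 is 3000 €'))  # 2016_03
print(extract('Total revenue for 201603 is 3000 €'))   # 201603
print(extract('Revenue for 2016-03 is 3000 €'))        # None (prefix differs)
```

The same pattern string can be passed as the first argument to re.findall; since it has no capturing groups, findall returns the full matches.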

4 Comments

Though your code works, I am not used to the demo link you sent. What would be the first argument in re.findall() - (?=\d+[-_]?\d+)[\d_-]{6,7}\b??
@cph_sto yes, you should add the pattern there. Also, if you go to the link, there's an option on the left side, code generator, where you can see the sample code
Though, I accepted @Wiktor's answer. Thank you so much for your efforts and time. Very appreciated. I think I have learnt a lot here :)
@cph_sto always happy to help :) no problem, even I would have selected his approach, because of the smart use of (?!\S)

This should do it fairly simply:

print(re.findall(r'\d{4}[-_]?\d{2}', 'Total revenue for 201603 is 3000 €'))
# ['201603']

Specifically, this is "Four digits, followed by either zero or one occurrences of either '-' or '_', followed by two more digits". If there isn't a hyphen or underscore, the four-digits and two-digits just end up the same as asking for six-digits.

This does capture the hyphen or underscore if it's there, though, so one thing you can do is just filter it out:

nums = re.findall(r'\d{4}[-_]?\d{2}', 'Total revenue for 2016-03 is 3000 €')
# nums = ['2016-03']
nums = [num.replace('-', '').replace('_', '') for num in nums]
# nums = ['201603']

Note that this is the solution that least disturbs your original regex, and it'll search for this pattern of "four digits followed by maybe a separator and then two digits" anywhere in the string, including inside longer runs of digits. If you want to restrict this to just the string you're trying to look for, ignoring similar ones, you may need to make the regex more specific. See also the re documentation.
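One possible way to make the regex more specific is to add whitespace boundaries around it, so it no longer matches inside a longer number. This is only a sketch; the sample string and the names loose and strict are made up for illustration.

```python
import re

# Original pattern from this answer: matches anywhere, even inside longer numbers.
loose = re.compile(r'\d{4}[-_]?\d{2}')
# Assumed stricter variant: the match must be delimited by whitespace
# or by the start/end of the string.
strict = re.compile(r'(?<!\S)\d{4}[-_]?\d{2}(?!\S)')

s = 'Order 12345678 booked for 2016-03'  # made-up sample input
print(loose.findall(s))   # ['123456', '2016-03']
print(strict.findall(s))  # ['2016-03']
```

The separator is still fixed after the fourth digit here; only the boundaries change.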

1 Comment

I want a generic solution, where the value could be like 03-2016 or 03_2016 or simply 032016. Your answer assumes that the separator will be at the 5th position. Theoretically speaking, the solution should capture cases like 123-456.

Your regex is as follows: it starts with whitespace, then a sequence of digits, and ends with whitespace. Note that \d* matches zero or more digits; \d+ would require at least one. It comes to this:

\s(\d*)\s

Check it here: https://regex101.com/r/V4NzLj/1
