I've just learned regex in Python 3 and was trying to solve a problem. The problem goes something like this:

You are given a string whose first part is a float or integer and whose second part is a substring. You must split the number from the substring and return both as a list. The substring contains only the letters a-z and A-Z. The number can be negative. For example:

  1. Input: 2.5ax
    Output: ['2.5', 'ax']
  2. Input: -5bcf
    Output: ['-5', 'bcf']
  3. Input: -69.67Gh
    Output: ['-69.67', 'Gh']

and so on.

I made several attempts with regex to solve the problem.

1st attempt:

import re
i=input()
print(re.findall(r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$',i))

For the input -2.55xy, the expected output was ['-2.55','xy'], but the actual output was:

[('-2.55', '.55'), ('', '')]

2nd attempt: My second attempt was similar to the first, with a small change:

import re
i=input()
print(re.findall(r'^(-?(\d+\.\d+)|\d+)|[a-zA-Z]+$',i))

For the same input -2.55xy, the output was:

[('-2.55', '2.55'), ('', '')]

3rd attempt: My next attempt was:

import re
i=input()
print(re.findall(r'^-?[1-9.]+|[a-z|A-Z]+$',i))

This matched the expected output for -2.55xy and for the sample examples. But an input such as 2..5 is also accepted as a float.

4th attempt:

import re
i=input()
value=re.findall(r"[a-zA-Z]+",i)
print([i.replace(value[0],""),value[0]])

This also produces the expected output, but it has the same problem as the 3rd attempt. It also doesn't look like an efficient way to do it.

Conclusion: I don't understand why my 1st and 2nd attempts don't work. The output is a list of tuples, probably because of the groups, but I don't know the exact reason or how to fix it. Maybe I've misunderstood how the pattern works. Also, why doesn't the substring show up in the output? In the end, I'd like to know what the mistake in my code is and how I can write better, more efficient code to solve the problem. Thank you.

2 Answers

The alternation | matches either the whole left part or the whole right part, so in your patterns ^ applies only to the number and $ only to the letters.
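
To make this concrete, here is a small sketch that runs the pattern from the question's first attempt through re.finditer, so both the full match and the capture groups are visible. The variable names are only for illustration. The letters are matched by the right-hand alternative, but that alternative contains no capture group, which would explain the empty tuple reported by re.findall.

```python
import re

# Pattern copied from the question's first attempt.
pattern = r'^(-?\d+(\.\d+)?)|[a-zA-Z]+$'

results = []
for m in re.finditer(pattern, "-2.55xy"):
    # group(0) is the full matched text; groups() holds the capture groups,
    # which is what re.findall reports when a pattern contains groups.
    results.append((m.group(0), m.groups(default="")))
    print(repr(m.group(0)), m.groups(default=""))

# Prints:
# '-2.55' ('-2.55', '.55')
# 'xy' ('', '')
```

The second match does find 'xy', but since re.findall only returns group values when groups are present, it shows up as ('', '').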

Since the letters a-zA-Z always come after the number, you don't need the alternation |; you can use 2 capture groups to get both parts in that order.

re.findall then returns a list of tuples holding the capture group values.

(-?\d+(?:\.\d+)?)([a-zA-Z]+)

Explanation

  • ( Capture group 1
    • -?\d+ Match an optional - and 1+ digits
    • (?:\.\d+)? Optionally match . and 1+ digits using a non capture group (so it is not output separately by re.findall)
  • ) Close group 1
  • ( Capture group 2
    • [a-zA-Z]+ Match 1+ times a char a-z or A-Z
  • ) Close group 2
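
As a quick illustration of what the non capture group changes, the sketch below runs the same pattern twice on one sample string, once with a capturing inner group and once with (?: ... ). The variable names are made up for this example.

```python
import re

s = "-69.67Gh"

# Inner group captures: re.findall reports three groups per match.
capturing = re.findall(r"(-?\d+(\.\d+)?)([a-zA-Z]+)", s)

# Inner group is non-capturing: only the two outer groups are reported.
non_capturing = re.findall(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)", s)

print(capturing)      # [('-69.67', '.67', 'Gh')]
print(non_capturing)  # [('-69.67', 'Gh')]
```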

import re

strings = [
    "2.5ax",
    "-5bcf",
    "-69.67Gh",
]

pattern = r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)"
for s in strings:
    print(re.findall(pattern, s))

Output

[('2.5', 'ax')]
[('-5', 'bcf')]
[('-69.67', 'Gh')]
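
If a plain list is wanted, as in the question, and malformed input such as 2..5ax should be rejected, one possible variation is to use re.fullmatch with the same pattern. This is only a sketch: the function name split_number_and_letters and the choice to return None on bad input are assumptions, not part of the original answer.

```python
import re

# Same pattern as above, compiled once for reuse.
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)([a-zA-Z]+)")

def split_number_and_letters(text):
    """Return [number, letters], or None if text does not fit the format."""
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    m = PATTERN.fullmatch(text)
    return list(m.groups()) if m else None

print(split_number_and_letters("-2.55xy"))  # ['-2.55', 'xy']
print(split_number_and_letters("2..5ax"))   # None
```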

2 Comments

OK, I understand how the capture groups work, but can you please explain how the '?:' expression works in your pattern? Does it somehow stop (\.\d+) from being a capture group?
@SamsilArefeen I have added an explanation of the pattern. (?: opens a non capture group, so you can still make that whole part optional; since re.findall returns capture group values, making it non-capturing keeps that part out of the result.

Lookahead and lookbehind with re.split can sometimes simplify things.

  • (?<=\d) look behind
  • (?=[a-zA-Z]) look ahead

That is, split at the position between the digit and the letter.

import re

strings = [
    "2.5ax",
    "-5bcf",
    "-69.67Gh",
]

for s in strings:
    print(re.split(r'(?<=\d)(?=[a-zA-Z])', s))

Output

['2.5', 'ax']
['-5', 'bcf']
['-69.67', 'Gh']
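
One caveat worth checking: the lookaround split only finds the boundary between a digit and a letter, so it does not validate the number itself. The sketch below, with an illustrative variable name boundary, shows what happens with the 2..5 case from the question. Note also that splitting on an empty match with re.split needs Python 3.7 or later.

```python
import re

# Zero-width boundary: a digit behind, a letter ahead.
boundary = r'(?<=\d)(?=[a-zA-Z])'

malformed = re.split(boundary, "2..5ax")
no_number = re.split(boundary, "abc")

print(malformed)  # ['2..5', 'ax']
print(no_number)  # ['abc']
```

So if invalid numbers must be rejected, the result still needs a separate check.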
