Split this string using regular expression - python

Question

Input string
---------------
South Africa 109/0 
Australia 100
Sri Lanka 111
Sri Lanka 331/4

Expected Output
---------------
['South Africa', '109', '0']
['Australia', '100']
['Sri Lanka', '111']
['Sri Lanka', '331', '4']

I tried several regexes but couldn't figure out how to write the correct one. A space delimiter doesn't help here, because country names may or may not contain spaces (South Africa, India). Thanks in advance.

kennytm · Accepted Answer · 2012-09-13 09:23:41Z

2

We could use the regex:

r'(\D+)\s(\d+)(?:/(\d+))?'

("a lot of non-digits, followed by a space, followed by a lot of digits, and then optionally followed by a slash and then a lot of digits.")

This will return, e.g.

>>> [re.match(r'(\D+)\s(\d+)(?:/(\d+))?', x).groups() 
...  for x in ['South Africa 109/0', 
...            'Australia 100',
...            'Sri Lanka 111',
...            'Sri Lanka 331/4']]
[('South Africa', '109', '0'), 
 ('Australia', '100', None), 
 ('Sri Lanka', '111', None), 
 ('Sri Lanka', '331', '4')]

Notice the Nones for the lines without a slash; you may need to filter them out manually.
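One way to drop the Nones is sketched below. The helper name `split_score` is hypothetical, and returning None for a line that does not match is an assumption about how you want bad input handled.

```python
import re

# Pattern from the answer: name, score, optional "/wickets".
PATTERN = re.compile(r'(\D+)\s(\d+)(?:/(\d+))?')

def split_score(line):
    """Return the matched groups as a list, skipping groups that did not take part."""
    match = PATTERN.match(line)
    if match is None:
        return None  # the line is not of the form "<name> <digits>[/<digits>]"
    return [group for group in match.groups() if group is not None]

lines = ['South Africa 109/0', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
for line in lines:
    print(split_score(line))
```

This prints lists of two or three items, matching the expected output in the question.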


2 Comments

Pierre GM Over a year ago

Shouldn't you use [\w\s] instead of \D in order to fail on 'Au$tralia' ?

kennytm Over a year ago

@PierreGM: What if OP wants Bishop's Stortford and Xi'an to succeed? And maybe Áŭ$t®å£ià is really considered valid.
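The trade-off discussed in these comments can be checked with a small sketch. The names `permissive` and `strict` are made up for illustration; `strict` is simply the answer's pattern with `[\w\s]` swapped in for `\D`.

```python
import re

permissive = re.compile(r'(\D+)\s(\d+)(?:/(\d+))?')   # name is any run of non-digits
strict = re.compile(r'([\w\s]+)\s(\d+)(?:/(\d+))?')   # name is word characters and whitespace only

samples = ["Xi'an 50", "Bishop's Stortford 12", "Au$tralia 100", "Sri Lanka 111"]
for text in samples:
    # Print whether each pattern accepts the line.
    print(text, bool(permissive.match(text)), bool(strict.match(text)))
```

The permissive pattern accepts all four samples. The strict one accepts only 'Sri Lanka 111', because an apostrophe or a dollar sign is neither a word character nor whitespace.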

xdazz · Accepted Answer · 2012-09-13 09:26:19Z

1

Try:

import re
re.split(r"(?<=[a-zA-Z])\s+(?=\d)|(?<=\d)\s+(?=[a-zA-Z])|/", "South Africa 109/0")
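As a rough sketch of what this kind of split returns, here is a reduced form of the pattern that keeps only the two alternatives the sample input needs: whitespace between a letter and a digit, or a slash. The helper name `split_entry` is hypothetical, and stripping the line first is an assumption, added because the first input line ends with a space.

```python
import re

# Split on whitespace that sits between an ASCII letter and a digit, or on a slash.
SEPARATOR = re.compile(r"(?<=[a-zA-Z])\s+(?=\d)|/")

def split_entry(line):
    # strip() removes the trailing space seen in the question's first line
    return SEPARATOR.split(line.strip())

for line in ['South Africa 109/0 ', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']:
    print(split_entry(line))
```

Because the lookbehind only accepts ASCII letters, a name ending in an accented letter or punctuation would not be separated from its score.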


Pierre GM · Accepted Answer · 2012-09-13 09:18:52Z

0

re.compile(r"^([\w\s]+)\s(\d+)/?(\d+)?")

gives you the three groups. We can decompose it:

A group of word characters and whitespace ([\w\s]+) at the beginning of the line (^)
a space
a group of digits, at least one (\d+)
a / or not
a group of digits (potentially None)
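The decomposition above can be exercised with a short sketch. Note that `\w` also matches digits, so it is regex backtracking that keeps the score out of the first group. The variable names here are illustrative only.

```python
import re

pattern = re.compile(r"^([\w\s]+)\s(\d+)/?(\d+)?")

lines = ['South Africa 109/0', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
# Collect the three groups for every line; the third is None when there is no slash part.
results = [pattern.match(line).groups() for line in lines]
for groups in results:
    print(groups)
```

For 'Australia 100' this gives ('Australia', '100', None): the first group holds only the name, and the last group is None rather than an empty string.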


2 Comments

Nikola Malešević Over a year ago

This outputs Australia 100 and Sri Lanka 111 in the first group.

Pierre GM Over a year ago

No, that gives you an empty group at the end, just like @KennyTM's version.

Nikola Malešević · Accepted Answer · 2012-09-13 09:33:19Z

This is the regex you need:

import re

# inputText is the multi-line input string, one "country score" entry per line
for match in re.finditer(r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$", inputText):
    country = match.group("Country")
    number1 = match.group("Number1")
    number2 = match.group("Number2")

You can see the results here.

And here's the explanation of the pattern:

# ^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference with name “Country” «(?P<Country>.*?)»
#    Match any single character that is not a line break character «.*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the regular expression below and capture its match into backreference with name “Number1” «(?P<Number1>\d+)»
#    Match a single digit 0..9 «\d+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the character “/” literally «/?»
#    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the regular expression below and capture its match into backreference with name “Number2” «(?P<Number2>\d*?)»
#    Match a single digit 0..9 «\d*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»
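A minimal sketch of running this pattern over the whole input follows. The variable names `input_text` and `rows` are assumptions, and dropping an empty Number2 is one possible way to reach the two-item lists the question asks for.

```python
import re

PATTERN = re.compile(
    r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$"
)

# Sample input from the question; the first line keeps its trailing space.
input_text = "South Africa 109/0 \nAustralia 100\nSri Lanka 111\nSri Lanka 331/4"

rows = []
for match in PATTERN.finditer(input_text):
    row = [match.group("Country"), match.group("Number1")]
    # Number2 is an empty string (not None) when the line has no "/digits" part.
    if match.group("Number2"):
        row.append(match.group("Number2"))
    rows.append(row)

for row in rows:
    print(row)
```

Unlike the optional-group patterns in the other answers, Number2 always takes part in the match here, so the check is for an empty string rather than None.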

This still gives you three groups on "Australia 101", and your last group is '' rather than None, as in @KennyTM's solution and mine.

Jon Clements · Accepted Answer · 2012-09-13 09:52:34Z

0

You've got the answers with regex, but I suggest also considering the available builtin str methods (for this use case anyway):

s = 'South Africa 109/0'
country, numbers = s.rsplit(' ', 1)
# ('South Africa', '109/0')
new_list = [country] + numbers.split('/')
# ['South Africa', '109', '0']
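Wrapped in a function and applied to every input line, the same idea might look like this. The name `split_score` is hypothetical, and the `strip()` call is an assumption, added because the first line in the question ends with a space, which would otherwise end up on the wrong side of `rsplit`.

```python
def split_score(line):
    # Split once from the right: everything before the last space is the country.
    country, numbers = line.strip().rsplit(' ', 1)
    # "109/0" -> ['109', '0'], "100" -> ['100']
    return [country] + numbers.split('/')

lines = ['South Africa 109/0 ', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
for line in lines:
    print(split_score(line))
```

A line with no space at all would raise ValueError at the unpacking step, so this sketch assumes well-formed input.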

