Regex - Match a string up to a digit or a specific string

Question

I am working in Python and have a list of country names that I would like to clean. Most countries are already written the way I want them to be. However, some country names have a one- or two-digit number attached, or have text in brackets appended. Here's a sample of that list:

Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8

The part that I want to capture would look like this:

Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia

The best solution that I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, this leaves country names that are followed by a text in parentheses with a trailing white space.

This means I would like to match the entire country name unless there is a digit or a white space followed by an open bracket, in which case the match should stop before the digit or before the white space preceding the (.

I know that I could probably solve this in two steps but I am also reasonably sure that it should be possible to define a pattern that can do it in one step. Since I am anyway in the process of getting familiar with regex, I thought this would be a nice thing to know.
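To make the trailing-whitespace problem concrete, here is a minimal sketch that runs the pattern from the question against two of the sample names. The variable names are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"^[a-zA-Z\s,ô'ç-]+")

samples = ["Australia1", "Bolivia (Plurinational State of)"]

# \s is part of the character class, so the space before "(" is consumed too.
matches = [pattern.match(s).group() for s in samples]
print(matches)  # ['Australia', 'Bolivia ']
```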

The fourth bird · Accepted Answer · 2021-12-26 17:53:07Z

1

The pattern can be written as matching one or more chars that are not digits, parentheses or whitespace chars. That part can then be optionally repeated, each repetition preceded by a single space.

^[^\d\s()]+(?: [^\d\s()]+)*

^ Start of string
[^\d\s()]+ Match 1+ times any char except a digit, whitespace char or parenthesis using a negated character class
(?: Non capture group to repeat as a whole part
 [^\d\s()]+ Match a single space, followed by the same negated character class as above
)* Close the non capture group and optionally repeat it
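As a rough sketch of how this pattern could be applied in Python, the snippet below runs it over the sample list from the question. The two-digit case 'Indonesia82' is an extra test value added here for illustration.

```python
import re

# Pattern from this answer: runs of "allowed" chars, joined by single spaces.
pattern = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")

countries = [
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
    "Indonesia82",  # extra two-digit case, not in the original sample
]

# match() anchors at the start; it would return None for a name that
# begins with a digit, whitespace or a parenthesis (not the case here).
cleaned = [pattern.match(c).group() for c in countries]
print(cleaned)
```

Because a space is only consumed when another allowed run follows it, the match for the Bolivia entry ends before the space in front of the opening parenthesis.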

Cary Swoveland · Accepted Answer · 2021-12-26 23:31:03Z

1

I suggest you simply replace the substrings you don't want with empty strings, using the regular expression

\d+$| +\(.*\)

with the multiline flag set, causing $ to match at the end of each line rather than only at the end of the string.

The expression matches one or more digits at the end of a line or one or more spaces followed by a string that is enclosed in matching parentheses.
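A minimal sketch of this substitution approach in Python, assuming the names arrive as one newline-separated string (that input shape is an assumption here). If each name is processed as a separate string, the same pattern should work without the flag.

```python
import re

# Pattern from this answer; re.MULTILINE lets $ match at each line end.
pattern = re.compile(r"\d+$| +\(.*\)", re.MULTILINE)

text = "\n".join([
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
])

# Remove the unwanted parts, then split back into individual names.
cleaned = pattern.sub("", text).splitlines()
print(cleaned)
```

Since . does not match a newline by default, the parenthesised part can never run past the end of its own line.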

Tiankai Liu · Accepted Answer · 2021-12-26 14:35:10Z

0

I think you can try ^([^\d \n]| +[^\d (\n])+ or, if you can guarantee your input doesn't contain double spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+. (The ^ character inside [] excludes the characters that follow it; see https://regexone.com/lesson/excluding_characters)

Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
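The difference between the two variants can be sketched as follows. The inputs with a trailing space and with a double space are made-up test values, not part of the question's sample.

```python
import re

# Both variants from this answer.
general = re.compile(r"^([^\d \n]| +[^\d (\n])+")
simpler = re.compile(r"^([^\d \n]| [^\d(\n])+")

def clean(pattern, name):
    # group(0) is the whole match, regardless of the capturing group.
    return pattern.match(name).group(0)

print(clean(general, "Indonesia82"))                       # Indonesia
print(clean(general, "Bolivia (Plurinational State of)"))  # Bolivia
print(clean(general, "Argentina "))                        # Argentina (trailing space dropped)

# With a double space before "(", only the general variant stops in time.
print(clean(general, "Bolivia  (x)"))  # Bolivia
print(clean(simpler, "Bolivia  (x)"))  # Bolivia  (x)
```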

3 Comments

Secco Over a year ago

Thanks, this works as well. In my particular case the \n is not necessary because I actually deal with elements in a list that are evaluated individually but it doesn't hurt either.

Tiankai Liu Over a year ago

Thanks. I don't have enough reputation to comment on the other answer, but I'm not sure ^(.+(?=\d| \()|.+) works if there are two-digit numbers, like 'Indonesia82'

Secco Over a year ago

Yeah, you are right, it does not. I should have included such an example in my sample list as well. I did not find a quick way to improve Ron's pattern so that it would work with any number size. Hence, I will probably use your pattern. Thanks again!
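The two-digit failure discussed in these comments can be reproduced with the short sketch below. The lazy variant is not from the thread; it is one possible adjustment, shown under the assumption that each name is matched as a separate string.

```python
import re

# Pattern quoted in the comment above (greedy .+ with a lookahead).
greedy = re.compile(r"(.+(?=\d| \()|.+)")

# Hypothetical alternative: a lazy .+? that stops at the FIRST digit,
# the first " (", or the end of the string.
lazy = re.compile(r".+?(?=\d| \(|$)")

name = "Indonesia82"
greedy_result = greedy.search(name).group()
lazy_result = lazy.search(name).group()

print(greedy_result)  # Indonesia8  (greedy .+ only backs off one char)
print(lazy_result)    # Indonesia
```

The greedy version backtracks from the end of the string and stops as soon as the lookahead succeeds, which happens right before the last digit.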

Ron Serruya · Accepted Answer · 2021-12-26 14:28:08Z

0

This should do the trick. You can test the regex here: https://regex101.com/r/dupn18/1

In [1]: import re

In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')

In [3]: data = """Argentina
   ...: Australia1
   ...: Bolivia (Plurinational State of)
   ...: China, Hong Kong Special Administrative Region
   ...: Côte d'Ivoire
   ...: Curaçao
   ...: Guinea-Bissau
   ...: Indonesia8""".splitlines()

In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
 'Australia',
 'Bolivia',
 'China, Hong Kong Special Administrative Region',
 "Côte d'Ivoire",
 'Curaçao',
 'Guinea-Bissau',
 'Indonesia']

1 Comment

Secco Over a year ago

Thanks! That works quite nicely and I think I understood the pattern.
