I am working in Python and have a list of country names that I would like to clean. Most names are already written the way I want them. However, some have a one- or two-digit number attached, and others have text in parentheses appended. Here's a sample of that list:

Argentina
Australia1
Bolivia (Plurinational State of)
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia8

The part that I want to capture would look like this:

Argentina
Australia
Bolivia
China, Hong Kong Special Administrative Region
Côte d'Ivoire
Curaçao
Guinea-Bissau
Indonesia

The best solution I was able to come up with is ^[a-zA-Z\s,ô'ç-]+. However, for country names that are followed by text in parentheses, the match ends with a trailing whitespace character.

In other words, I would like to match the entire country name unless it contains a digit or a whitespace character followed by an opening parenthesis. In that case the match should stop before the digit, or before the whitespace that precedes the (.

I know that I could probably solve this in two steps, but I am reasonably sure that a single pattern can do it in one. Since I am getting familiar with regex anyway, I thought this would be a nice thing to know.

4 Answers

1

The pattern can be written as one or more characters other than digits, parentheses, or whitespace, optionally followed by further runs of such characters, each preceded by a single space.

^[^\d\s()]+(?: [^\d\s()]+)*
  • ^ Start of string
  • [^\d\s()]+ Match one or more characters other than a digit, a whitespace character, or a parenthesis, using a negated character class
  • (?: Open a non-capturing group so that the following part can be repeated as a whole
    • A single space followed by [^\d\s()]+, the same match as above
  • )* Close the non-capturing group and repeat it zero or more times
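
A minimal sketch of how this pattern might be applied in Python, assuming each country name is a separate string in a list. The names NAME_RE and clean_name are made up for illustration and do not appear in the question.

```python
import re

# Pattern from the answer above: runs of characters that are not digits,
# whitespace, or parentheses, joined by single spaces.
NAME_RE = re.compile(r"^[^\d\s()]+(?: [^\d\s()]+)*")


def clean_name(raw: str) -> str:
    """Return the leading country name, or the input unchanged if nothing matches."""
    match = NAME_RE.match(raw)
    return match.group(0) if match else raw


countries = [
    "Argentina",
    "Australia1",
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Côte d'Ivoire",
    "Curaçao",
    "Guinea-Bissau",
    "Indonesia8",
]

cleaned = [clean_name(c) for c in countries]
print(cleaned)
```

Because the match stops at the first digit, a two-digit suffix such as in 'Indonesia82' should be dropped as well.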


1

I suggest you simply replace the parts you don't want with empty strings, using the regular expression

\d+$| +\(.*\)

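with the multiline flag set, so that $ matches at the end of each line rather than only at the end of the string. The flag matters only when all names sit in one multi-line string; it makes no difference when each name is processed as a separate string.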


The expression matches either one or more digits at the end of a line, or one or more spaces followed by text enclosed in parentheses. Because .* is greedy, the second alternative runs up to the last ) on the line.
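
The substitution approach can be sketched in Python with re.sub, shown here both on one multi-line string and on a list of separate strings. The variable names and the shortened sample are assumptions for illustration only.

```python
import re

# Pattern from the answer above; re.MULTILINE lets $ match at every line end.
SUFFIX_RE = re.compile(r"\d+$| +\(.*\)", flags=re.MULTILINE)

# Case 1: all names in a single multi-line string.
text = "Argentina\nAustralia1\nBolivia (Plurinational State of)\nIndonesia82"
cleaned_text = SUFFIX_RE.sub("", text)

# Case 2: each name is its own string (the flag is harmless here).
cleaned_list = [SUFFIX_RE.sub("", name) for name in text.splitlines()]

print(cleaned_text)
print(cleaned_list)
```

Unlike the matching approaches, this removes the unwanted suffix and keeps everything else, so a name without a number or parentheses passes through unchanged.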


0

You can try ^([^\d \n]| +[^\d (\n])+ or, if you can guarantee that your input contains no double spaces, the slightly simpler ^([^\d \n]| [^\d(\n])+. (A ^ at the start of a [] character class negates it, so the class matches any character except the ones listed; see https://regexone.com/lesson/excluding_characters)

Technically, the regex I've given omits trailing spaces, but for your application it doesn't sound like that would be a bad thing.
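
A short sketch of the first pattern in Python, assuming the names are processed one string at a time. The helper name leading_name and the 'Peru  ' input with trailing spaces are hypothetical. The code reads group(0) because the capturing group is repeated, so group(1) would hold only its last repetition.

```python
import re

# First pattern from the answer above, unchanged.
ANSWER_RE = re.compile(r"^([^\d \n]| +[^\d (\n])+")


def leading_name(raw: str) -> str:
    """Return the whole match (group 0), or the input unchanged if nothing matches."""
    match = ANSWER_RE.match(raw)
    return match.group(0) if match else raw


samples = [
    "Bolivia (Plurinational State of)",
    "China, Hong Kong Special Administrative Region",
    "Indonesia82",
    "Peru  ",  # hypothetical input with trailing spaces
]

results = [leading_name(s) for s in samples]
print(results)
```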

3 Comments

Thanks, this works as well. In my particular case the \n is not necessary because I actually deal with elements in a list that are evaluated individually but it doesn't hurt either.
Thanks. I don't have enough reputation to comment on the other answer, but I'm not sure ^(.+(?=\d| \()|.+) works if there are two-digit numbers, like 'Indonesia82'
Yeah, you are right, it does not. I should have included such an example in my sample list as well. I did not find a quick way to improve Ron's pattern so that it would work with any number size. Hence, I will probably use your pattern. Thanks again!

0

You can test the regex here https://regex101.com/r/dupn18/1
This should do the trick

In [1]: import re

In [2]: pattern = re.compile(r'(.+(?=\d| \()|.+)')

In [3]: data = """Argentina
   ...: Australia1
   ...: Bolivia (Plurinational State of)
   ...: China, Hong Kong Special Administrative Region
   ...: Côte d'Ivoire
   ...: Curaçao
   ...: Guinea-Bissau
   ...: Indonesia8""".splitlines()

In [4]: [pattern.search(country).group() for country in data]
Out[4]:
['Argentina',
 'Australia',
 'Bolivia',
 'China, Hong Kong Special Administrative Region',
 "Côte d'Ivoire",
 'Curaçao',
 'Guinea-Bissau',
 'Indonesia']
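
A comment under another answer points out that this pattern keeps one digit when the suffix has two, as in 'Indonesia82'. The sketch below reproduces that behaviour and compares it with a lazy variant. The lazy pattern is an illustrative assumption, not part of the original answer, and the names GREEDY_RE, LAZY_RE and first_match are made up.

```python
import re

# Pattern from the answer above: greedy .+ backtracks only until the
# lookahead sees ONE digit, so an earlier digit stays in the match.
GREEDY_RE = re.compile(r"(.+(?=\d| \()|.+)")

# Assumed variant: lazy .+? stops at the first digit, the first " (",
# or the end of the string, whichever comes first.
LAZY_RE = re.compile(r"^.+?(?=\d| \(|$)")


def first_match(pattern: re.Pattern, raw: str) -> str:
    """Return the first match of pattern in raw, or raw itself if there is none."""
    match = pattern.search(raw)
    return match.group() if match else raw


print(first_match(GREEDY_RE, "Indonesia82"))
print(first_match(LAZY_RE, "Indonesia82"))
print(first_match(LAZY_RE, "Bolivia (Plurinational State of)"))
```

The lazy quantifier grows the match one character at a time, so it ends before the first digit regardless of how many digits follow.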

1 Comment

Thanks! That works quite nicely and I think I understood the pattern.
