I have a complex task that I want to accomplish: given a string, I want to classify its words into particular categories.

s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
d = function(s)
print(d)
      {"age": "63 years",
       "gender": "female 35%",
       "race": "White 68%, Black 5%, Asian 19%, other 8%"}

I must note that not all strings are in the same format. There is a finite set of categories overall (age, gender, race, region), but some strings contain only 1 or 2 of the 4 categories.

Here are some other toy strings:

s2 = 'Age 71 years, male 64%'
s3 = '''Age 64 years, female 21%,
Race or ethnicity: White 66%, Black 5%, Asian 18%, other 11%
Region: N. America 7%, Latin America 17%, W. Europe or other 24%, central Europe 33%, Asia-Pacific 18%'''

As you can see there are some patterns:

  • age is not preceded by any ':'.
  • gender is documented as either female or male.
  • race and region are followed by ':'.

I am interested in collecting all the information corresponding to each category, as seen in my toy example with the race category.

What I need:

  1. Writing the RegEx pattern with the appropriate capturing groups to obtain the results.
  2. Transform the matches to a dictionary: I have seen a solution using the .groupdict() method to do so.
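
For reference, .groupdict() returns a dict that maps each named group (?P<name>...) to the text it matched. A minimal sketch of that mechanism, assuming a hypothetical pattern that covers only the age and gender parts of the toy strings (the name PATTERN and the pattern itself are illustrative, not a complete solution):

```python
import re

# Hypothetical pattern for illustration only: two named groups, "age" and "gender".
# The gender part is optional, so strings with only an age still match.
PATTERN = re.compile(
    r'Age\s+(?P<age>\d+\s+years?)'
    r'(?:,\s*(?P<gender>(?:fe)?male\s+\d+\s*%))?',
    re.IGNORECASE,
)

match = PATTERN.search('Age 71 years, male 64%')
# groupdict() maps each group name to its matched text (None if the group did not take part)
d = match.groupdict() if match else {}
print(d)  # {'age': '71 years', 'gender': 'male 64%'}
```

Extending one pattern like this to race and region quickly becomes hard to read, which is part of why I am stuck.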

I have a problem writing the regex pattern that will return the aforementioned groups.

I have seen this interesting solution for a similar problem: python regex: create dictionary from string. But I have trouble applying it to mine.

  • It seems like you are looking for a general solution based on one example string. Are all strings in this exact form? Commented Oct 24, 2020 at 18:40
  • Please clarify on which condition the separation takes place. (For age, must the word "age" appear? Are the only possible genders male and female? Is the order always as in the example?) Second, have you tried something? Share it. Commented Oct 24, 2020 at 18:40
  • Hi both of you @MarkMeyer and @YossiLevi! I updated my question to incorporate your questions. Thanks in advance! Commented Oct 24, 2020 at 18:52
  • You can take the first pattern, Age.\d{1,}.\w{1,}, find it and remove it from the string, then process the remaining substring: find female 35% using ^\w+\s{1,}\d{1,}%, remove it, and so on. Commented Oct 24, 2020 at 19:01
  • To find the first one you can use ^\w+\s{0,}\d{1,}\s{0,}\w+, remove it from string and do same ^\w+\s{1,}\d{1,}% on substring and so on. Commented Oct 24, 2020 at 19:07

1 Answer

Instead of looking for one golden regex that handles every case, you could pass your input string through a set of regexes, each trying to extract one of the categories you mentioned in the question. Something like:

import re

d = {}  # s is the input string from the question

# re.search rather than re.match: re.match only matches at the start of the string
ageMatch = re.search(r'\bAge\s+(\d+)\s+years?', s, re.I)
if ageMatch:
    d['age'] = ageMatch.group(1) + ' years'

genderMatch = re.search(r'\b(male|female)\s+(\d+)\s*%', s, re.I)
if genderMatch:
    d['gender'] = genderMatch.group(1) + ' ' + genderMatch.group(2) + '%'
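
The same idea can be extended to all four categories. The following is only a sketch under assumptions taken from the sample strings: the race and region labels start with "Race" or "Region" and end with ':', and each value runs until a ';' or the end of the line. The names PATTERNS and classify are made up for this example.

```python
import re

# One pattern per category; labels and formats are assumed from the sample strings.
PATTERNS = {
    'age': re.compile(r'\bAge\s+(\d+\s+years?)', re.I),
    'gender': re.compile(r'\b((?:fe)?male\s+\d+\s*%)', re.I),
    # The label may contain extra words ("race or ethnic group") before the ':'.
    'race': re.compile(r'\bRace[^:\n]*:\s*([^\n;]+)', re.I),
    'region': re.compile(r'\bRegion[^:\n]*:\s*([^\n;]+)', re.I),
}

def classify(s):
    """Return a dict with one entry for each category found in s."""
    result = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(s)
        if m:  # categories that are absent are simply left out
            result[name] = m.group(1).strip()
    return result

s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
print(classify(s))
# {'age': '63 years', 'gender': 'female 35%', 'race': 'White 68%, Black 5%, Asian 19%, other 8%'}
```

Because each category has its own pattern, a string that contains only one or two categories still works, and a new label variant only requires changing the one pattern it affects.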
    
1 Comment

Thank you for the tip! That is actually a great alternative!!
