I have a complex task that I want to accomplish: from a string, I want to be able to classify words into particular categories.
s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
d = function(s)
print(d)
{"age": "63 years",
"gender": "female 35%",
"race": "White 68%, Black 5%, Asian 19%, other 8%"}
I must note that not all strings are in the same format, but there is a finite set of categories overall (age, gender, race, region); some strings have only 1 or 2 of the 4 categories.
Here are some other toy strings:
s2 = 'Age 71 years, male 64%'
s3 = '''Age 64 years, female 21%,
Race or ethnicity: White 66%, Black 5%, Asian 18%, other 11%
Region: N. America 7%, Latin America 17%, W. Europe or other 24%, central Europe 33%, Asia-Pacific 18%'''
As you can see there are some patterns:
- age is not preceded by any ':'.
- gender is documented as either female or male.
- race and region are followed by ':'.
I am interested in collecting all the information corresponding to each category, as seen in my toy example with the race category.
What I need:
- Writing the RegEx pattern with the appropriate capturing groups to obtain the results.
- Transform the matches to a dictionary: I have seen a solution using the .groupdict() method to do so.
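For illustration, here is a minimal sketch of how the two points above could fit together: one verbose pattern with an optional named group per category, and .groupdict() to build the dictionary. The names PATTERN and parse_demographics are hypothetical, and the pattern assumes the categories always appear in the order age, gender, race, region, as in the toy strings. It has not been checked against other formats.

```python
import re

# Hypothetical pattern: every category is an optional named group, in the
# order seen in the toy strings (age, gender, race, region).
PATTERN = re.compile(
    r"""
    \s*
    (?:Age\s+(?P<age>\d+\s+\w+))?                # "Age 63 years" -> "63 years"
    [\s,;]*
    (?P<gender>(?:female|male)\s+\d+%)?          # "female 35%" or "male 64%"
    [\s,;]*
    (?:Race[^:]*:\s*(?P<race>.*?))?              # text after "Race ...:"
    [\s,;]*
    (?:Region[^:]*:\s*(?P<region>.*?))?          # text after "Region:"
    \s*$
    """,
    re.VERBOSE | re.IGNORECASE | re.DOTALL,
)


def parse_demographics(text):
    """Return a dict with only the categories found in text."""
    match = PATTERN.match(text)
    if match is None:
        return {}
    # groupdict() maps every named group to its match (None when absent).
    return {k: v for k, v in match.groupdict().items() if v is not None}


s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
s2 = 'Age 71 years, male 64%'

print(parse_demographics(s))
print(parse_demographics(s2))  # {'age': '71 years', 'gender': 'male 64%'}
```

The lazy `.*?` in the race group stops as soon as the rest of the pattern (an optional region part, then the end of the string) can match, which is what keeps the region text out of the race value.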
I have a problem writing the regex pattern that will return the aforementioned groups.
I have seen this interesting solution for a similar problem: python regex: create dictionary from string. But I have trouble applying it to mine.
Age.\d{1,}.\w{1,}, discover it, remove from the string, then process the substring without age to get the first pattern, then discover female 35% using ^\w+\s{1,}\d{1,}%, remove and so on.
^\w+\s{0,}\d{1,}\s{0,}\w+, remove it from string and do same ^\w+\s{1,}\d{1,}% on substring and so on.
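The "match, remove, repeat" idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the names ITEMS, STEPS and parse_stepwise are hypothetical, the patterns are stricter variants written for the toy strings rather than the exact expressions quoted above, and the categories are assumed to appear in a fixed order.

```python
import re

# One "label NN%" item, repeated with commas, e.g. "White 68%, Black 5%".
ITEMS = r"[^,;:%\n]+\s\d+%(?:,\s*[^,;:%\n]+\s\d+%)*"

# Ordered (category, pattern) pairs; each pattern is tried at the start of
# whatever is left of the string (hypothetical patterns for the toy data).
STEPS = [
    ("age", re.compile(r"Age\s+(\d+\s+\w+)", re.IGNORECASE)),
    ("gender", re.compile(r"((?:female|male)\s+\d+%)", re.IGNORECASE)),
    ("race", re.compile(r"Race[^:]*:\s*(" + ITEMS + r")", re.IGNORECASE)),
    ("region", re.compile(r"Region[^:]*:\s*(" + ITEMS + r")", re.IGNORECASE)),
]


def parse_stepwise(text):
    """Consume the string category by category, removing each match."""
    result = {}
    rest = text
    for name, pattern in STEPS:
        rest = rest.lstrip(" ,;\n\t")   # drop separators left by the last step
        match = pattern.match(rest)     # match() only looks at the start
        if match:
            result[name] = match.group(1)
            rest = rest[match.end():]   # remove the matched part
    return result


s = 'Age 63 years, female 35%; race or ethnic group: White 68%, Black 5%, Asian 19%, other 8%'
print(parse_stepwise(s))
```

A category that is missing from the string is simply skipped, so strings with only 1 or 2 of the 4 categories yield a smaller dictionary.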