Regex to match repeating groups in python

Question

I have a countries.txt file that contains the following sample text:

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n

Each country can have up to 20 attributes. For simplicity, I have only included Country and CapitalCity. I need a regex that works in Python and returns the following for the sample data above:

a) n matches, in the above case n=3
b) Each match should have m groups, in this case m=2: Country and CapitalCity

I have read https://www.regular-expressions.info/captureall.html but cannot get it to work for my use case.

I have tried this

(\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n\\n)+

here https://regex101.com/r/cujIDd/1 but it does not give me the Country.
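For reference, a minimal sketch of how this pattern behaves in Python's re module. It assumes the file holds literal backslash-n characters (as confirmed in the comments) and that a real newline separates the records; the name variant is hypothetical and only illustrates that a repeated capture group keeps its last iteration.

```python
import re

# Raw strings keep each \n as two literal characters (backslash + n).
# The real newline between records is an assumption about the file layout.
text = (
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n' "\n"
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n' "\n"
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n'
)

# The pattern from the question: every attribute must be followed by \n\n,
# so [Country "..."] (followed by a single \n) can never be part of a match.
attempt = re.compile(r'(\[([A-Za-z]+)\s"([^"]*)"\]\\n\\n)+')
found = [(m.group(2), m.group(3)) for m in attempt.finditer(text)]
print(found)

# Hypothetical variant that lets the group repeat over both attributes.
# The match now spans the whole record, but groups 2 and 3 still hold
# only the values from the last iteration of the repeated group.
variant = re.compile(r'(\[([A-Za-z]+)\s"([^"]*)"\](?:\\n)+)+')
first = variant.search(text)
print(first.group(0))
print(first.group(2), first.group(3))
```

In both cases only CapitalCity shows up in the groups, which matches the behaviour seen on regex101.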

EDIT: Expected input and output

Example 1: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
match 2: Country: Uganda
         CapitalCity: Kampala
match 3: Country: Tanzania
         CapitalCity: Dodoma

Example 2: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
         President: Kenyatta
match 2: Country: Uganda
         CapitalCity: Kampala
         President: Museveni
match 3: Country: Tanzania
         CapitalCity: Dodoma
         President: Magufuli

Example 3: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n[Continent "Africa"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n[Continent "Africa"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n[Continent "Africa"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
         President: Kenyatta
         Continent: Africa
match 2: Country: Uganda
         CapitalCity: Kampala
         President: Museveni
         Continent: Africa
match 3: Country: Tanzania
         CapitalCity: Dodoma
         President: Magufuli
         Continent: Africa

You get the flow

You've included the double \n in your capture group. correction
– Omar Einea, Commented Feb 17, 2018 at 10:27
1) Can you explain what "n matches" means here? 2) Does your input file literally contain the string \n? 3) added complete expected output to question, it will give better understanding of your question
– Sundeep, Commented Feb 17, 2018 at 11:44
@Sundeep 1) n is the number of countries, in this case 3. 2) Yes, it does.
– Telewa, Commented Feb 17, 2018 at 12:02

l'L'l · Accepted Answer · 2018-02-17 10:49:08Z

1

You could probably use something similar to the following:

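import re

with open("countries.txt") as f:
    test_str = f.read()  # contents of the file from the question

regex = r"^[^\"]*\"(\w+)\"[^\"]+\"(\w+)\"[^\"].*"
subst = "\\1, \\2"

# count and flags are passed by keyword; passing them positionally is deprecated since Python 3.13
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)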

Output:

Kenya, Nairobi
Uganda, Kampala
Tanzania, Dodoma

Example:

https://regex101.com/r/cujIDd/6

answered Feb 17, 2018 at 10:49


2 Comments

Telewa Over a year ago

Thanks @l'L'l . However that would not satisfy the requirement for this input text: [Country "Kenya"]\n[Capital "Nairobi"]\n[President "Kenyatta"]\n\n [Country "Uganda"]\n[Capital "Kampala"]\n[President "Museveni"]\n\n [Country "Tanzania"]\n[Capital "Dodoma"]\n[President "Magufuli"]\n\n Remember these attributes can be up to 20.

Telewa Over a year ago

@usr2564301 The question says that the number of attributes can be up to 20. The answer provided captures only 2, which means it will not work if there are 20 attributes.

The fourth bird · Accepted Answer · 2018-02-17 12:58:13Z

1

You could remove the outer repeating group ()+ and make the second \\n optional (?:\\n)?:

See the regex in use on regex101.com

\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n(?:\\n)?
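A minimal sketch of this pattern in Python, assuming the file contains literal backslash-n characters and one record per line (the names text, attr_re and pairs are illustrative only). It shows that each attribute becomes its own match, so the result is a flat list of name/value pairs and not one match per country.

```python
import re

# Raw strings keep each \n as two literal characters (backslash + n).
# The real newline between records is an assumption about the file layout.
text = (
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n' "\n"
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n' "\n"
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n'
)

# One attribute per match; the second literal \n is optional.
attr_re = re.compile(r'\[([A-Za-z]+)\s"([^"]*)"\]\\n(?:\\n)?')

# With two capture groups, findall returns a list of (name, value) tuples.
pairs = attr_re.findall(text)
print(len(pairs))   # one entry per attribute, across all countries
print(pairs[:3])    # the three attributes of the first country
```

Here the three countries with three attributes each give nine matches, so grouping the pairs back into countries needs a further step.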

If you want to capture only the first 2 attributes, you could use ^ and $ anchors:

^\[([A-Za-z]+)\s*\"([^\"]+)\"\]\\n\[([A-Za-z]+)\s*\"([^\"]+)\"\].*$

See the regex in use on regex101.com
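A short sketch of the anchored pattern in Python, under the same assumptions (literal backslash-n characters, one record per line). re.MULTILINE is needed so that ^ and $ match at each line; the variable names are illustrative.

```python
import re

# Raw strings keep each \n as two literal characters (backslash + n).
# The real newline between records is an assumption about the file layout.
text = (
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n' "\n"
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n' "\n"
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n'
)

# Captures the first two attributes of each line; .*$ consumes the rest.
first_two = re.compile(
    r'^\[([A-Za-z]+)\s*"([^"]+)"\]\\n\[([A-Za-z]+)\s*"([^"]+)"\].*$',
    re.MULTILINE,
)

rows = first_two.findall(text)
for name1, value1, name2, value2 in rows:
    print(f"{name1}: {value1}, {name2}: {value2}")
```

This gives one match per line, but any attribute after the second one (President here) is swallowed by .* and not captured.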

edited Feb 17, 2018 at 12:58

answered Feb 17, 2018 at 11:58


11 Comments

The fourth bird Over a year ago

@Telewa I have updated my answer and separated the link and the regex

Telewa Over a year ago

Thanks. Can it have only 3 matches - those are only 3 countries?

The fourth bird Over a year ago

I took those from your example data. Not sure what you mean, but I think you can add as many as you want. Example

Telewa Over a year ago

What I mean is: per match, all attributes should be captured. So for the example I provided, there should be 3 matches, with two attributes captured in each. If we add more attributes and keep the countries as they are, we should still have 3 matches, with the extra attributes captured per match as well.

Telewa Over a year ago

Yes! Absolutely! Now the only thing it didn't match is the "President" attribute in the second data set. All attributes should be captured. There can be between 1 and m attributes.
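One possible way to get n matches with all m attributes each is a two-step approach, since Python's re keeps only the last iteration of a repeated capture group. The sketch below is not taken from either answer: record_re and attr_re are hypothetical names, and it assumes literal backslash-n characters with a real newline between records.

```python
import re

# Raw strings keep each \n as two literal characters (backslash + n).
# The real newline between records is an assumption about the file layout.
text = (
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n' "\n"
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n' "\n"
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n'
)

# Step 1: one match per record = one or more attributes, each followed by
# a literal \n, with one extra literal \n closing the record.
record_re = re.compile(r'(?:\[[A-Za-z]+\s"[^"]*"\]\\n)+\\n')
# Step 2: pull every [Name "value"] pair out of a single record.
attr_re = re.compile(r'\[([A-Za-z]+)\s"([^"]*)"\]')

matches = [dict(attr_re.findall(rec.group(0))) for rec in record_re.finditer(text)]

print("matches:", len(matches))
for i, attrs in enumerate(matches, start=1):
    for name, value in attrs.items():
        print(f"match {i}: {name}: {value}")
```

The number of attributes per record is not fixed in either pattern, so the same code should handle 1 to 20 attributes as long as every record ends with the double \n.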
