
I have a countries.txt file that contains the following sample text:

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n

Each country can have up to 20 attributes. For simplicity, I have only included Country and CapitalCity. I need a regex that works in Python and returns the following for the sample data above:

a) n matches, in the above case n=3
b) Each match should have m groups, in this case m=2: Country and CapitalCity

I have read https://www.regular-expressions.info/captureall.html but cannot get it to work for my use case.

I have tried this regex:

(\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n\\n)+

at https://regex101.com/r/cujIDd/1, but it does not give me the Country.
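For reference, a minimal Python reproduction of this attempt might look like the sketch below. It assumes the records sit on separate lines and that each \n in the file is a literal backslash followed by n; the names sample, pattern and attempt are only for illustration.

```python
import re

# Sample data as it appears in countries.txt. Raw strings keep each \n as a
# literal backslash + "n"; records are assumed to be on separate lines.
sample = "\n".join([
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n',
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n',
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n',
])

# The regex from the attempt above, unchanged.
pattern = re.compile(r'(\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n\\n)+')

# Collect the attribute name and value captured by each match.
attempt = [(m.group(2), m.group(3)) for m in pattern.finditer(sample)]
print(attempt)
```

Only the CapitalCity entries match, because the pattern requires every bracketed attribute to be followed by \n\n, while Country is followed by a single \n.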

EDIT: Expected input and output

Example 1: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
match 2: Country: Uganda
         CapitalCity: Kampala
match 3: Country: Tanzania
         CapitalCity: Dodoma

Example 2: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
         President: Kenyatta
match 2: Country: Uganda
         CapitalCity: Kampala
         President: Museveni
match 3: Country: Tanzania
         CapitalCity: Dodoma
         President: Magufuli

Example 3: input

[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n[Continent "Africa"]\n\n
[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n[Continent "Africa"]\n\n
[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n[Continent "Africa"]\n\n

expected output

matches: 3
match 1: Country: Kenya
         CapitalCity: Nairobi
         President: Kenyatta
         Continent: Africa
match 2: Country: Uganda
         CapitalCity: Kampala
         President: Museveni
         Continent: Africa
match 3: Country: Tanzania
         CapitalCity: Dodoma
         President: Magufuli
         Continent: Africa

You get the idea.

  • You've included the double \n in your capture group. correction Commented Feb 17, 2018 at 10:27
  • Thanks @OmarEinea. However that gives me 6 matches. Commented Feb 17, 2018 at 10:30
  • 1) Can you explain what n matches means here? 2) Does your input file literally contain the string \n? 3) Please add the complete expected output to the question; it will give a better understanding of your question. Commented Feb 17, 2018 at 11:44
  • @Sundeep 1) n is the number of countries, in this case 3. 2) Yes, it does. Commented Feb 17, 2018 at 12:02

2 Answers


You could probably use something similar to the following:

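import re

# Read the file from the question; its \n sequences are literal backslash + n
with open("countries.txt") as f:
    test_str = f.read()

regex = r"^[^\"]*\"(\w+)\"[^\"]+\"(\w+)\"[^\"].*"
subst = "\\1, \\2"

# Pass flags by keyword; positional count/flags are deprecated in newer Python
result = re.sub(regex, subst, test_str, flags=re.MULTILINE)

if result:
    print(result)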

Output:

Kenya, Nairobi
Uganda, Kampala
Tanzania, Dodoma

Example:

https://regex101.com/r/cujIDd/6


2 Comments

Thanks @l'L'l. However, that would not satisfy the requirement for this input text: [Country "Kenya"]\n[Capital "Nairobi"]\n[President "Kenyatta"]\n\n [Country "Uganda"]\n[Capital "Kampala"]\n[President "Museveni"]\n\n [Country "Tanzania"]\n[Capital "Dodoma"]\n[President "Magufuli"]\n\n Remember, there can be up to 20 of these attributes.
@usr2564301 The question indicates that the number of attributes can be up to 20. In the answer provided, only 2 are captured, which means that if there are 20 attributes, it will not work.

You could remove the outer repeating group ()+ and make the second \\n optional (?:\\n)?:

See the regex in use on regex101.com

\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n(?:\\n)?
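As a rough check in Python, re.findall with this pattern returns one (name, value) tuple per attribute rather than one match per country. The sketch below assumes the records sit on separate lines and that each \n in the data is a literal backslash followed by n; the names sample, attr_re and pairs are hypothetical.

```python
import re

# Sample data from the question. Raw strings keep each \n as a literal
# backslash + "n"; records are assumed to be on separate lines.
sample = "\n".join([
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n\n',
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n\n',
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n\n',
])

# The pattern above: one attribute, followed by one or two literal \n.
attr_re = re.compile(r'\[([A-Za-z]+)\s\"([^\"]*)\"\]\\n(?:\\n)?')

# findall returns a list of (name, value) tuples, one per attribute.
pairs = attr_re.findall(sample)
print(len(pairs))
for name, value in pairs:
    print(f"{name}: {value}")
```

For the three-country sample this yields six pairs, so the pairs would still need to be grouped per country afterwards.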

If you want to capture only the first 2 attributes, you could use ^ and $ anchors:

^\[([A-Za-z]+)\s*\"([^\"]+)\"\]\\n\[([A-Za-z]+)\s*\"([^\"]+)\"\].*$

See the regex in use on regex101.com
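If every attribute of a record is needed, one possible alternative (not one of the regexes above, only a sketch) is to work in two passes, since Python's re module keeps only the last repetition of a repeated capturing group. The first pass matches a whole record and the second extracts the name/value pairs from it. The names record_re, attr_re and records are hypothetical, and the data is Example 2 from the question, assuming literal \n sequences and one record per line.

```python
import re

# Example 2 from the question. Raw strings keep each \n as a literal
# backslash + "n"; records are assumed to be on separate lines.
sample = "\n".join([
    r'[Country "Kenya"]\n[CapitalCity "Nairobi"]\n[President "Kenyatta"]\n\n',
    r'[Country "Uganda"]\n[CapitalCity "Kampala"]\n[President "Museveni"]\n\n',
    r'[Country "Tanzania"]\n[CapitalCity "Dodoma"]\n[President "Magufuli"]\n\n',
])

# Pass 1: a record is one or more [Name "value"]\n attributes
# followed by one more literal \n.
record_re = re.compile(r'(?:\[[A-Za-z]+\s*"[^"]*"\]\\n)+\\n')
# Pass 2: pull every name/value pair out of a single record.
attr_re = re.compile(r'\[([A-Za-z]+)\s*"([^"]*)"\]')

# One dict per record; dicts keep the attributes in file order.
records = [dict(attr_re.findall(m.group(0))) for m in record_re.finditer(sample)]

print("matches:", len(records))
for number, record in enumerate(records, start=1):
    for name, value in record.items():
        print(f"match {number}: {name}: {value}")
```

The number of records stays at three however many attributes each country has, because the attribute count only affects the second pass.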

11 Comments

@Telewa I have updated my answer and separated the link and the regex
Thanks. Can it have only 3 matches - those are only 3 countries?
I took those from your example data. Not sure what you mean, but I think you can add as many as you want. Example
What I mean is that, per match, all attributes should be captured. So for the example I provided, there should be 3 matches, with two attributes captured in each. If we increase the number of attributes and keep the countries as they are, we should still have 3 matches, with the extra attributes captured per match as well.
Yes! Absolutely! Now the only problem is that it didn't match the attribute "President" in the second data set. All attributes should be captured. The number of attributes can be anywhere between 1 and m.