2

country_names.txt is a file with multiple lines, each line containing a European country and an Asian country. Read in each line of text until you reach a line with the country names.

Example line inside text file: <td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>

How do I use ONLY ONE regular expression to extract a European country and an Asian country from any line that contains two countries? After extracting the countries, store the European country in a list of European country names and store the Asian country in a list of Asian country names.

When all the lines have been read in, print a count of how many European countries and Asian countries have been read in.

Currently, this is what I have:

import re

with open('country_names.txt') as infile:
    for line in infile:
        countries = re.findall("", "", infile) # regex code inside ""s in parenthesis
        european_countries = countries.group(1)
        asian_countries = countries.group(2)
1
  • You are going to have issues with countries = re.findall("", "", infile). You may want to use single quotes surrounding your expression, like '", "'. Commented Dec 3, 2019 at 18:03

3 Answers

3

To do this with only one regex, use ^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>. You can try it out here: https://regex101.com/r/q9XHDD/1

When running it on your example you'll get:

>>> re.findall(r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*", "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>")
[('England', 'Japan')]
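
To make the pattern easier to read, here is a sketch of the same expression compiled with re.VERBOSE so that each part can carry a comment. The name ROW_PATTERN and the group names european and asian are my own additions for illustration; they are not part of the expression above.

```python
import re

# Same expression, spread out with re.VERBOSE so each piece can be commented.
# ROW_PATTERN and the group names are illustrative additions.
ROW_PATTERN = re.compile(r"""
    ^<td\s*>                    # first opening tag, optional space before '>'
    (?P<european>[a-zA-Z]+)     # letters only, so numeric cells cannot match
    </td\s*>
    .*                          # skip the cells in between (greedy)
    <td\s*>
    (?P<asian>[a-zA-Z]+)        # last letters-only cell on the line
    </td\s*>
""", re.VERBOSE)

sample = "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>"
match = ROW_PATTERN.match(sample)
if match:
    print(match.group("european"), match.group("asian"))
```

Because the character class is [a-zA-Z]+, a name containing a space or a hyphen would not match; widen the class if your file has such names.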

My suggestion is to use re.match rather than re.findall, in which case your code should be:

import re

# Raw string so that \s reaches the regex engine unchanged
regex = r"^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*"
eu_countries = []
as_countries = []
with open('country_names.txt') as infile:
    for line in infile:
        match = re.match(regex, line)
        if match:  # only lines that contain two country cells
            eu_countries.append(match.group(1))
            as_countries.append(match.group(2))
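
The question also asks for a count once all lines have been read. Here is a minimal sketch of the whole flow, assuming the file has the format shown in the question. The function name collect_countries is hypothetical, and the in-memory sample lines (with dummy numbers) stand in for country_names.txt so that the snippet runs on its own.

```python
import re

REGEX = re.compile(r"^<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>")

def collect_countries(lines):
    """Return (european, asian) lists from an iterable of text lines."""
    european, asian = [], []
    for line in lines:
        match = REGEX.match(line)
        if match:  # lines without two country cells are skipped
            european.append(match.group(1))
            asian.append(match.group(2))
    return european, asian

# Hypothetical sample data standing in for country_names.txt
sample_lines = [
    "<table>\n",
    "<td >England</td> <td>1.0</td> <td >Japan</td> <td>2.0</td></tr>\n",
    "<td >France</td> <td>1.0</td> <td >India</td> <td>2.0</td></tr>\n",
]

european, asian = collect_countries(sample_lines)
print(len(european), "European countries read in")
print(len(asian), "Asian countries read in")
```

With the real file, an open file object can be passed directly, for example collect_countries(infile) inside the with block, since iterating over a file yields its lines.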

1

You can use this regex to pull out the countries: <\s*(td)[^>]*>(\w*)<\s*/\s*(td)>. It selects the tags whose text consists only of word characters. The numeric cells in your example are skipped because they contain a ".", which \w does not match.

This returns a list of tuples [('td', 'England', 'td'), ('td', 'Japan', 'td')]

I then map over the list and select the 2nd element of each tuple, which is the country.

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
countries = list(map(lambda x: x[1], re.findall(regex, line)))
print(countries)  # ['England', 'Japan']

One thing to note is you need to use line instead of infile in the loop.

So to put it together:

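import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'
european_countries = []
asian_countries = []

with open('country_names.txt') as infile:
    for line in infile:
        # findall returns 3-tuples; the middle element is the cell text
        countries = list(map(lambda x: x[1], re.findall(regex, line)))
        if len(countries) == 2:  # skip lines without exactly two matching cells
            european_countries.append(countries[0])
            asian_countries.append(countries[1])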

Please note that this will not work if there are other <td> tags with text in them, and the code depends on the order of the countries within each line. Still, it is a quick solution to your problem.
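
To illustrate that caveat, here is a small sketch. The row with an extra <td>Total</td> cell is invented for the demonstration, and cell_words is a hypothetical helper name.

```python
import re

regex = r'<\s*(td)[^>]*>(\w*)<\s*/\s*(td)>'

def cell_words(line):
    """Return the text of every <td> cell made up only of word characters."""
    return [groups[1] for groups in re.findall(regex, line)]

good = "<td >England</td> <td>55.98</td> <td >Japan</td> <td>126.8</td></tr>"
# Invented row with an extra text cell, to show the failure mode
extra = "<td>Total</td> <td >England</td> <td>55.98</td> <td >Japan</td></tr>"

print(cell_words(good))   # ['England', 'Japan']
print(cell_words(extra))  # ['Total', 'England', 'Japan']
```

In the second case the first element is no longer the European country, so checking that exactly two names were found before appending is a cheap safeguard. Note also that \w matches digits, so a cell holding a whole number without a decimal point would be picked up as well.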


0
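# Assumes each line looks like "England, Japan", not the <td> markup shown in the question
with open('country_names.txt', 'r') as f:
    lines = f.readlines()
e_countries = []
a_countries = []
for entry in lines:
    parts = entry.strip().split(', ')  # strip() drops the trailing newline
    if len(parts) == 2:  # skip lines that do not hold exactly two names
        e_countries.append(parts[0])
        a_countries.append(parts[1])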
