I'm trying to find exact matches of place names, which are listed in a country.txt file, inside a descriptions file.

Here is an example of country.txt:

Pic de Font Blanca
Roc Mélé
Pic des Langounelles
Pic de les Abelletes
Estany de les Abelletes
Port Vieux de la Coume d’Ose
Port de la Cabanette
Port Dret
Costa de Xurius
Font de la Xona

And here is description.csv, the descriptions file.

The descriptions file contains the titles and descriptions of articles. What I am trying to do is find exact occurrences of the place names from country.txt in the descriptions file.

code.py

import csv
import time
import re

allCities = open('country.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('description.csv') as descriptions,open('desc_place7---' + str(timestr) + '.csv', 'w', newline='', encoding='utf-8') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in sorted(allCities, key=len, reverse=True))

    for eachRow in descriptions_reader:
        title = eachRow['row']
        description = eachRow['desc']
        citiesFound = set()
        found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
        citiesFound.update(found)
        if len(citiesFound)==0:
            output_writer.writerow({'title': title, 'description': description, 'place': " - "})

        else:
            output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
        line += 1
        print(line)

expected output: output

But because country.txt (185.94 MB) is a large file, my code can't run to completion; it makes my laptop freeze. Is there a good way to handle this? I think the pattern line also hurts performance, but I need a regex to match exact words.

  • Hi, why are you sorting allCities? Commented May 8, 2020 at 5:19
  • Which of these files is the smallest? Commented May 8, 2020 at 5:19
  • @DavidDr90 If there are potential matches like "New York" and "New York City", the longer candidate must appear first in the pattern. Commented May 8, 2020 at 5:22
  • @MushifAliNawaz The descriptions.csv file. Commented May 8, 2020 at 5:23
  • @drowsyone So first find all the "New York" candidates and then sort them. Don't sort a ~190 MB file. Commented May 8, 2020 at 5:24
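
To illustrate the ordering point about "New York" and "New York City": Python's `re` tries alternatives from left to right and stops at the first one that matches, so a shorter name listed first hides a longer one that starts with it. A minimal sketch (the helper `build_pattern` and the sample sentence are made up for illustration):

```python
import re

def build_pattern(names, longest_first):
    """Join whole-word patterns for each name into one alternation."""
    ordered = sorted(names, key=len, reverse=True) if longest_first else list(names)
    return "|".join(r"\b{}\b".format(re.escape(name)) for name in ordered)

names = ["New York", "New York City"]
text = "I moved to New York City last year."

# Shorter name first: the regex engine stops at "New York".
unsorted_hits = re.findall(build_pattern(names, longest_first=False), text)

# Longest name first: the full "New York City" is found.
sorted_hits = re.findall(build_pattern(names, longest_first=True), text)

print(unsorted_hits)  # ['New York']
print(sorted_hits)    # ['New York City']
```

The sort is only needed for names that are combined into the same alternation, which is why sorting a handful of candidates is much cheaper than sorting the whole file.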

1 Answer


Here is a first implementation for your problem; you will need to adapt and improve it for your specific needs.

First save all your descriptions to a pandas DataFrame like this:

import pandas as pd
descriptions = pd.read_csv('description.csv')

Then, do not read all of the country file's lines into memory. Instead, read the country file line by line and look for matches in the descriptions data. Use the following:

import csv
import re
import time

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('country.txt', encoding="utf8") as cities_file, open('desc_place7---' + timestr + '.csv', 'w', newline='', encoding='utf-8') as output:
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    # Stream the place names one at a time instead of loading the whole file.
    for city in cities_file:
        city = city.strip()
        if not city:
            continue  # a blank line would yield a pattern that matches at every word boundary
        pattern = re.compile(r'\b{}\b'.format(re.escape(city)), re.IGNORECASE)
        # Note: this writes one output row per (place, description) pair.
        for index, row in descriptions.iterrows():
            title = row['row']
            description = row['desc']
            citiesFound = set(pattern.findall(description))
            if len(citiesFound) == 0:
                output_writer.writerow({'title': title, 'description': description, 'place': " - "})
            else:
                output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
            line += 1
            print(line)
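
If you would rather get one output row per description, with all matching places joined, a different technique may scale better than running a regex per place name. The sketch below is not the code above: it swaps the regex for a word n-gram index built from the descriptions (the smaller file), then streams the place names and looks each one up. The names `normalize`, `build_ngram_index`, `find_places`, and the `max_words` limit are all hypothetical, and the matching is only an approximation of `\b...\b`: it is case-insensitive, ignores punctuation differences, and never matches a place name longer than `max_words` words.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")

def normalize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

def build_ngram_index(descriptions, max_words):
    """Map every run of 1..max_words consecutive words to the rows containing it."""
    index = defaultdict(set)
    for row_id, text in enumerate(descriptions):
        tokens = normalize(text)
        for start in range(len(tokens)):
            for size in range(1, max_words + 1):
                if start + size > len(tokens):
                    break
                index[" ".join(tokens[start:start + size])].add(row_id)
    return index

def find_places(descriptions, place_lines, max_words=6):
    """Return {row index: set of place names found in that description}.

    place_lines can be any iterable of lines, including an open file,
    so the large place-name file is never held in memory at once.
    """
    index = build_ngram_index(descriptions, max_words)
    found = defaultdict(set)
    for line in place_lines:
        name = line.strip()
        key = " ".join(normalize(name))
        if not key:
            continue  # skip blank lines
        for row_id in index.get(key, ()):
            found[row_id].add(name)
    return dict(found)

# Small in-memory example (sample sentences are invented for illustration).
sample_descriptions = [
    "We hiked from Port Dret to the Pic de Font Blanca.",
    "Nothing to see here.",
    "The portal dret is closed.",
]
sample_places = ["Pic de Font Blanca\n", "Port Dret\n", "Roc Mélé\n", "\n"]
result = find_places(sample_descriptions, sample_places)
print(result.get(0))
```

With the real files, you would pass the `desc` column as `descriptions` and `open('country.txt', encoding='utf8')` as `place_lines`, then write a single row per description using `" , ".join(...)` on the set found for that row, or `" - "` when the row is absent from the result. Memory use depends on the descriptions file and `max_words`, not on the size of country.txt.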

4 Comments

Hey, thank you. It works, but what if I want each description to be written only once, rather than once for every place name?
I don't want to iterate that way. I mean, how can I join all the place names found for each article row, instead of iterating over every row for each place name?
Hey, I think your code still can't handle the large file.
