How to find place name in large file with regex python

Question

I'm trying to find exact matches for place names, which are listed in a country.txt file, inside a descriptions file.

Here is a sample of country.txt:

Pic de Font Blanca
Roc Mélé
Pic des Langounelles
Pic de les Abelletes
Estany de les Abelletes
Port Vieux de la Coume d’Ose
Port de la Cabanette
Port Dret
Costa de Xurius
Font de la Xona

The descriptions file is description.csv.

It contains the title and description of each article. What I am trying to do is find exact, whole-word occurrences of the place names from country.txt in those descriptions.

code.py

import csv
import time
import re

allCities = open('country.txt', encoding="utf8").readlines()
timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

with open('description.csv') as descriptions,open('desc_place7---' + str(timestr) + '.csv', 'w', newline='', encoding='utf-8') as output:
    descriptions_reader = csv.DictReader(descriptions)
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line=0
    pattern = r'|'.join(r'\b{}\b'.format(re.escape(city.strip())) for city in sorted(allCities, key=len, reverse=True))

    for eachRow in descriptions_reader:
        title = eachRow['row']
        description = eachRow['desc']
        citiesFound = set()
        found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
        citiesFound.update(found)
        if len(citiesFound)==0:
            output_writer.writerow({'title': title, 'description': description, 'place': " - "})

        else:
            output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
        line += 1
        print(line)

expected output: output

But country.txt is a large file (185.94 MB), so my code never finishes; it makes my laptop freeze. Is there a good way to handle this? I think the pattern line also hurts performance, but I still need a regex to match exact words.

@DavidDr90 If there are potential matches like "New York" and "New York City", the longer candidate must appear first in the pattern.
– drowsyone, Commented May 8, 2020 at 5:22
@drowsyone So first find all the "New York" candidates and then sort them. Don't sort the ~190MB file.
– DavidDr90, Commented May 8, 2020 at 5:24
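
The ordering point raised in these comments can be checked with a small sketch. Python's `re` alternation tries alternatives left to right and takes the first one that matches, so the shorter name hides the longer one when it is listed first. The sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from the question's data.
text = "I moved to New York City last year."

# Shorter alternative listed first: it matches and the longer one is never tried.
short_first = re.compile(r"\bNew York\b|\bNew York City\b")
# Longer alternative listed first: the full name is captured.
long_first = re.compile(r"\bNew York City\b|\bNew York\b")

print(short_first.findall(text))  # ['New York']
print(long_first.findall(text))   # ['New York City']
```

This is why the question's code sorts the names by length in descending order before joining them with `|`.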

DavidDr90 · Accepted Answer · 2020-05-08 07:01:13Z

Here is a first implementation for your problem; you will need to adapt and improve it for your specific needs.

First, load all your descriptions into a pandas DataFrame like this:

import pandas as pd
descriptions = pd.read_csv('description.csv')

Then, do not read all of the file's lines into memory. Instead, read the country file line by line and look for matches in the descriptions data:

import csv
import re
import time

timestr = time.strftime("%Y-%m-%d-(%H-%M-%S)")

# `descriptions` is the DataFrame loaded with pd.read_csv above.
with open('country.txt', encoding="utf8") as cities_file, \
        open('desc_place7---' + timestr + '.csv', 'w', newline='', encoding='utf-8') as output:
    fieldnames = ['title', 'description', 'place']
    output_writer = csv.DictWriter(output, delimiter='|', fieldnames=fieldnames)
    output_writer.writeheader()
    line = 0
    # Stream the place names one at a time instead of loading the whole file.
    for city in cities_file:
        if not city.strip():
            continue  # skip blank lines, which would produce an empty pattern
        pattern = r'\b{}\b'.format(re.escape(city.strip()))
        # Note: this writes one output row per (place, description) pair.
        for index, row in descriptions.iterrows():
            title = row['row']
            description = row['desc']
            citiesFound = set()
            found = re.findall(pattern, description, re.IGNORECASE | re.MULTILINE)
            citiesFound.update(found)
            if len(citiesFound) == 0:
                output_writer.writerow({'title': title, 'description': description, 'place': " - "})
            else:
                output_writer.writerow({'title': title, 'description': description, 'place': " , ".join(citiesFound)})
            line += 1
            print(line)
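
One possible refinement, offered as a sketch rather than a tested solution for a 186 MB file: stream the place names in batches, compile one alternation pattern per batch, and collect the matches in one set per description so that each description can be written once at the end. The helper names `iter_batches` and `find_places` and the `batch_size` default are hypothetical and not part of the original code.

```python
import re
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of up to batch_size stripped, non-empty lines."""
    names = (ln.strip() for ln in lines)
    names = (n for n in names if n)
    while True:
        batch = list(islice(names, batch_size))
        if not batch:
            return
        yield batch


def find_places(place_lines, descriptions, batch_size=1000):
    """Return one set of matched place names per description."""
    found = [set() for _ in descriptions]
    for batch in iter_batches(place_lines, batch_size):
        # Longest names first, so a longer name wins over its own prefix
        # when both fall into the same batch.
        batch.sort(key=len, reverse=True)
        pattern = re.compile(
            "|".join(r"\b{}\b".format(re.escape(name)) for name in batch),
            re.IGNORECASE,
        )
        for i, text in enumerate(descriptions):
            found[i].update(m.group(0) for m in pattern.finditer(text))
    return found


# Small in-memory stand-ins for country.txt and the 'desc' column.
places = ["Port Dret\n", "\n", "Font de la Xona\n", "Port de la Cabanette\n"]
descs = [
    "We hiked from Port Dret to Font de la Xona.",
    "Nothing here.",
    "Portable Port Dret",
]
result = find_places(places, descs, batch_size=2)
```

With the question's files, `place_lines` would be the open `country.txt` handle and `descriptions` the list of `desc` values; row `i` could then be written once with `" , ".join(sorted(result[i])) or " - "` as the place field. Only one batch of names is held in memory at a time, but every description is still scanned once per batch, so the batch size trades memory against the number of passes. Names that land in different batches are matched independently, so both "New York" and "New York City" may be reported for the same text.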

edited May 8, 2020 at 7:01

answered May 8, 2020 at 6:17

DavidDr90


4 Comments

drowsyone Over a year ago

Hey, thank you. It works, but what if I want each description to appear only once, rather than being repeated for every place name?

DavidDr90 Over a year ago

@drowsyone See the answers to "How to filter rows in pandas by regex" or "Searching matching string pattern from dataframe column in python pandas"; pandas will do the iteration for you.

drowsyone Over a year ago

I don't want to iterate. I mean, how do I join all the place names found for each article row, instead of iterating over all rows for each place name?

drowsyone Over a year ago

Hey, I think your code still can't handle the large file.
