
I am currently performing a reverse-geocoding operation as follows:

import json
import time

from shapely.geometry import shape, Point

# District boundaries as GeoJSON; the file is also kept at
# https://raw.githubusercontent.com/Thevesh/Display/master/districts.json
with open('districts.json') as f:
    districts = json.load(f)

def reverse_geocode(lon, lat):
    point = Point(lon, lat)  # shapely uses (lon, lat) order
    # Linear scan: test the point against every district polygon
    for feature in districts['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            return [feature['properties']['ADM1_EN'], feature['properties']['ADM2_EN']]
    return ['', '']

start_time = time.time()
for i in range(1000):
    test = reverse_geocode(103, 3)
print('----- Code ran in ' + "{:.3f}".format(time.time() - start_time) + ' seconds -----')

This takes about 13 seconds to reverse geocode 1000 points, which is fine.

However, I'm going to need to reverse-geocode 10 million coordinate pairs for a task, which means it would take about 130,000 seconds (roughly 1.5 days) assuming linear scaling. Not fine.

The obvious inefficiency in this algorithm is that it iterates through the entire set of polygons each and every time it classifies a point, which is a giant waste of time.

How can I improve this code? To get through 10 million pairs in an acceptable time, I need to handle roughly 1,000 pairs per second.
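For reference, one direction I'm considering (a rough, untested sketch, not a drop-in replacement) is to load the same GeoJSON into GeoPandas and let a spatial join do the candidate filtering through its R-tree index. ADM1_EN/ADM2_EN are the same property names as above, and the predicate= keyword assumes GeoPandas 0.10 or newer (older releases call it op=):

import geopandas as gpd

# Load the districts once; sjoin builds an R-tree spatial index, so each
# point is only tested against the few polygons whose bounding boxes
# contain it, rather than the whole feature list.
districts_gdf = gpd.read_file('districts.json')

def reverse_geocode_many(lons, lats):
    points = gpd.GeoDataFrame(
        geometry=gpd.points_from_xy(lons, lats),
        crs=districts_gdf.crs,
    )
    joined = gpd.sjoin(
        points,
        districts_gdf[['ADM1_EN', 'ADM2_EN', 'geometry']],
        how='left',
        predicate='within',
    )
    return joined[['ADM1_EN', 'ADM2_EN']].fillna('')

# e.g. reverse_geocode_many([103], [3]) should give the same district as above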

  • The first obvious solution is to run it with multiprocessing; did you try that? (Commented Mar 3, 2021 at 16:54)

1 Answer


I arrived at this algorithm using parallelism.

If possible, let me know whether it was useful for your purpose. Keep in mind that it is an amateur algorithm and will need adjustments.

import concurrent.futures
import json
import time

from shapely.geometry import shape, Point

with open('districts.json') as f:
    districts = json.load(f)

def reverse_geocode(lon: float, lat: float) -> list:
    point = Point(lon, lat)  # shapely uses (lon, lat) order
    for feature in districts['features']:
        polygon = shape(feature['geometry'])
        if polygon.contains(point):
            return [feature['properties']['ADM1_EN'], feature['properties']['ADM2_EN']]
    return ['', '']

if __name__ == '__main__':
    time_start = time.time()

    # Spread the 1000 lookups across worker processes
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for _ in range(1000):
            executor.submit(reverse_geocode, 103, 3)

    time_end = time.time()
    print(f'\nfinished in {round(time_end - time_start, 2)} seconds')
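One note on the sketch above: submit() fires off the work, but the results are never read back. If the district names are actually needed, a variant using executor.map (same reverse_geocode as above; coords is a hypothetical list of (lon, lat) pairs) collects them in input order:

if __name__ == '__main__':
    coords = [(103, 3)] * 1000  # hypothetical list of (lon, lat) pairs

    with concurrent.futures.ProcessPoolExecutor() as executor:
        # map() returns results in input order; chunksize batches the many
        # small tasks so inter-process overhead doesn't dominate.
        results = list(executor.map(
            reverse_geocode,
            [lon for lon, _ in coords],
            [lat for _, lat in coords],
            chunksize=100,
        ))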

1 Comment

This cuts the time in half, which is awesome, but I think the fundamental problem is still that we're iterating through the whole list. I'm going to try an approach that sorts both the list of points to search and the features to search through.
