Context:
I’m using a polygon layer as a makeshift gazetteer to find approximate addresses for point locations. My current workflow uses geopandas to:
- Load the polygon dataset into memory
- Use the built-in spatial index to perform point-in-polygon lookups, using polygons.sindex.query(point, predicate="within")[0] (simply taking the first result)
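For reference, a minimal sketch of that lookup. The two toy boxes and the `address` column are made up for illustration; in the real app the frame is read from the GPKG. It also guards against the case where no polygon contains the point, which a bare `[0]` would turn into an IndexError.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy stand-in for the real layer (hypothetical data and column name).
polygons = gpd.GeoDataFrame(
    {"address": ["Example Street 1", "Example Street 2"]},
    geometry=[box(0, 0, 1, 1), box(2, 2, 3, 3)],
)

def lookup(polygons, x, y):
    """Return the row of the first polygon containing (x, y), or None."""
    point = Point(x, y)
    # query() returns integer positions (not index labels) of the
    # polygons that the point lies within.
    hits = polygons.sindex.query(point, predicate="within")
    if len(hits) == 0:
        return None  # the point falls outside the coverage area
    return polygons.iloc[hits[0]]
```

With this layout the whole GeoDataFrame (geometries and attributes) sits in RAM next to the index, which is where the memory goes.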
Problem:
Despite having simplified the polygon geometries, the dataset still consumes significant RAM once loaded. Ideally I’d keep it in memory, since this backs an API endpoint and latency matters, but I realise I might have to implement some kind of on-disk spatial index to reduce the memory footprint.
I’m aware I could always throw more RAM at it, but that’s maybe not a sustainable solution.
Question:
Is there any good way of converting the data into an on-disk spatial index or something similar? The aim is to:
- Reduce memory footprint
- Maintain reasonable performance for point-in-polygon lookups
- Work well with GeoPandas or similar libraries, in Python
I’m open to alternative approaches; I’m really fishing for ideas here, as I’ve run out of my own and my googling karma has not been favourable so far. I have a SpatiaLite database in the same project, but cannot use non-file-based databases. I’m also happy to experiment with different file formats (GPKG right now) or specialised libraries.
A bit more context, still:
- It’s a Flask app
- The polygon layer has about 2,000,000 features that correspond to the Voronoi polygons around house numbers in a limited area of coverage
- I experimented with sequential lookup of (1) county, (2) municipality, (3) street name, (4) house number, using the details already found as a filter for each next step (this keeps the user experience from tanking completely when re-reading the data from disk on every lookup, but I/O is not great)
Any suggestions that could point me in the right direction?
-- Edit: I want to add that I did save the GPKG, including spatial indices, in QGIS. If there is any way of keeping only the index in memory and reading the details on the fly, that would be a good way to go.
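One direction this could take, as a rough and untested-at-scale sketch: since a GPKG is an SQLite file, the R-tree that QGIS wrote can be queried directly with the standard library, leaving both index and data on disk. The table name `addresses` and geometry column `geom` are assumptions; GeoPackage names the index table `rtree_<table>_<geometry column>`. Note that this only yields bounding-box candidates, so an exact point-in-polygon test on the fetched geometries would still be needed afterwards.

```python
import sqlite3

def bbox_candidates(conn, x, y, table="addresses", geom_col="geom"):
    """Return ids of features whose bounding box contains (x, y)."""
    # Table and column names are assumed and must be trusted values,
    # because they are interpolated into the SQL text.
    rtree = f'"rtree_{table}_{geom_col}"'
    sql = (
        f"SELECT id FROM {rtree} "
        "WHERE minx <= ? AND maxx >= ? AND miny <= ? AND maxy >= ?"
    )
    return [row[0] for row in conn.execute(sql, (x, x, y, y))]

def fetch_rows(conn, ids, table="addresses"):
    """Read only the candidate rows from the feature table."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    # The R-tree id matches the feature table's integer primary key,
    # which SQLite also exposes as rowid.
    sql = f'SELECT * FROM "{table}" WHERE rowid IN ({placeholders})'
    return conn.execute(sql, list(ids)).fetchall()

# Hypothetical usage against a file on disk, opened read-only:
# conn = sqlite3.connect("file:polygons.gpkg?mode=ro", uri=True)
# rows = fetch_rows(conn, bbox_candidates(conn, 24.94, 60.17))
```

Because Voronoi cells have overlapping bounding boxes, a point will usually match several candidates, and each candidate geometry then needs an exact containment check.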