
Context:

I’m using a polygon layer as a makeshift gazetteer to find approximate addresses for point locations. My current workflow uses geopandas to:

  • Load the polygon dataset into memory
  • Use the built-in spatial index to perform point-in-polygon lookups (polygons.sindex.query(point, predicate="within")[0], simply taking the first result)
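
For reference, the in-memory lookup described above looks roughly like this (a minimal sketch; the tiny GeoDataFrame and the "address" column are hypothetical stand-ins for the real layer):

```python
import geopandas as gpd
import shapely

# Tiny stand-in for the real ~2M-feature polygon layer (hypothetical data).
polygons = gpd.GeoDataFrame(
    {"address": ["Main St 1", "Main St 2"]},
    geometry=[shapely.box(0, 0, 1, 1), shapely.box(1, 0, 2, 1)],
)

point = shapely.Point(0.5, 0.5)
# Indices of polygons the point falls within; simply take the first hit.
hits = polygons.sindex.query(point, predicate="within")
address = polygons.iloc[hits[0]]["address"] if len(hits) else None
```

The whole layer (and its STRtree index) lives in RAM here, which is exactly the memory cost in question.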

Problem:

Despite having simplified the polygon geometries, the dataset still consumes significant RAM once loaded into memory. I'd like to keep it in memory, because this backs an API endpoint and latency is an issue, but I realise I might have to implement some kind of on-disk spatial index in order to reduce the memory footprint.

I’m aware I could always throw more RAM at it, but that’s maybe not a sustainable solution.

Question:

Is there any good way of converting the data into an on-disk spatial index or something similar? The aim is to:

  • Reduce the memory footprint
  • Maintain reasonable performance for point-in-polygon lookups
  • Work well with GeoPandas or similar Python libraries

I’m open to alternative approaches; I’m really fishing for ideas here, as I’ve run out of ideas of my own and my googling karma has not been favourable so far. I have a SpatiaLite database in the same project, but cannot use anything other than file-based databases. I’m also happy to experiment with different file formats (GPKG right now) or specialised libraries.

A bit more context, still:

  • It’s a Flask app
  • The polygon layer has about 2,000,000 features: Voronoi polygons around house numbers in a limited coverage area
  • I experimented with sequential lookup of (1) county, (2) municipality, (3) street name, (4) house number, using the already-found details as a filter for each subsequent step (this keeps the user experience from tanking completely when re-reading the data from disk on every lookup, but I/O is still not great)

Any suggestions that could point me in the right direction?

-- Edit: I want to add that I did save the GPKG including spatial indices in QGIS. If there is any way of keeping only the index in memory and reading the details on the fly, that would be a good way to go.

  • Point-in-poly performance is linked to vertex count/complexity. You need to dice high-vertex geometries into multiple parts. Commented Oct 9 at 14:13
  • Thanks for the suggestion! Vertex count does not seem to be a real problem with this data set:

    ```
    >>> df.geometry.count_coordinates().describe()
    count    2.081421e+06
    mean     6.999601e+00
    std      1.558734e+00
    min      4.000000e+00
    25%      6.000000e+00
    50%      7.000000e+00
    75%      8.000000e+00
    max      3.800000e+01
    dtype: float64
    ```

    (Also, it’s probably the total number of vertices across all polygons that really counts, isn’t it?) Commented Oct 9 at 14:23
  • To re-iterate: I’m not necessarily after query performance, but would like to balance it against memory usage Commented Oct 9 at 14:41
  • Memory is 1000 times faster than disk. I've gotten memory-level performance out of disk by optimizing feature size (e.g. cutting US, Canada, Russia, China, and Australia into 15-30 degree tiles for a global country encoder), but the same optimization in memory was microseconds per query. If you need all the data, and you need it to be accurate, you can't work around having sufficient RAM. Commented Oct 9 at 16:32
  • For heavy geoprocessing workloads I suggest PostgreSQL with PostGIS and spatial indexing, it offers the best performance for speed and resource efficiency. Commented Oct 9 at 19:53

1 Answer


Apparently your data is already in a geopackage file, so you already have your spatial index on disk...

Normally the code sample below should be reasonably fast, as it will use the spatial index in the geopackage file. Giving more cache memory to SQLite, as shown below, can also help (reference).

Edit: added rows=1 as well based on comment by @christoph

import os
import geopandas as gpd
import shapely

# Optional: give plenty of cache memory to sqlite (value in MB; must be a str)
os.environ["OGR_SQLITE_CACHE"] = "128"

path = ...  # path to the GeoPackage file
# mask= filters via the geopackage's on-disk spatial index;
# rows=1 stops after the first matching feature
gdf = gpd.read_file(path, mask=shapely.Point(1, 2), rows=1)
  • Thanks Pieter! This is exactly the kind of hint I was searching for. It had not occurred to me that I can use a point as a mask for read_file(); I also had not thought of checking whether I can increase the SQLite cache memory that easily. Commented Oct 10 at 10:02
  • Could you edit your answer to put quotation marks around the env variable value (they have to be str)? (for future reference) Thanks! Commented Oct 10 at 10:52
  • To confirm, still, that this is a really fantastic answer: I used pyperf to evaluate this code against my previous, RAM-cached approach. Looking up 100 randomly distributed points took 100 ms as opposed to almost 19 s. Commented Oct 10 at 14:53
  • Worth noting, also: I added rows=1 to read_file(), which further sped up the query Commented Oct 10 at 14:55
