
Context:

I’m using a polygon layer as a makeshift gazetteer to find approximate addresses for point locations. My current workflow uses geopandas to:

  • Load the polygon dataset into memory
  • Use the built-in spatial index to perform point-in-polygon lookups (polygons.sindex.query(point, predicate="within")[0], simply taking the first result)
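
For reference, the in-memory lookup described above looks roughly like this (a minimal sketch; the tiny GeoDataFrame and the "address" column are hypothetical stand-ins for the real layer):

```python
import geopandas as gpd
import shapely

# Tiny stand-in for the real ~2M-feature polygon layer (hypothetical data).
polygons = gpd.GeoDataFrame(
    {"address": ["Main St 1", "Main St 2"]},
    geometry=[shapely.box(0, 0, 1, 1), shapely.box(1, 0, 2, 1)],
)

point = shapely.Point(0.5, 0.5)
# Indices of polygons the point falls within; simply take the first hit.
hits = polygons.sindex.query(point, predicate="within")
address = polygons.iloc[hits[0]]["address"] if len(hits) else None
```

The whole layer (and its STRtree index) lives in RAM here, which is exactly the memory cost in question.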

Problem:

Despite having simplified the polygon geometries, the dataset still consumes significant RAM once loaded into memory. I'd like to keep it in memory, because this backs an API endpoint and latency is an issue, but I realise I might have to implement some kind of on-disk spatial index in order to reduce the memory footprint.

I’m aware I could always throw more RAM at it, but that’s maybe not a sustainable solution.

Question:

Is there any good way of converting the data into an on-disk spatial index or something similar? The aim is to:

  • Reduce the memory footprint
  • Maintain reasonable performance for point-in-polygon lookups
  • Work well with GeoPandas or similar Python libraries

I’m open to alternative approaches; I’m really fishing for ideas here, as I’ve run out of ideas of my own and my googling karma has not been favourable so far. I have a SpatiaLite database in the same project, but cannot use anything other than file-based databases. I’m also happy to experiment with different file formats (GPKG right now) or specialised libraries.

A bit more context, still:

  • It’s a Flask app
  • The polygon layer has about 2,000,000 features: Voronoi polygons around house numbers in a limited coverage area
  • I experimented with sequential lookup of (1) county, (2) municipality, (3) street name, (4) house number, using the already-found details as a filter for each subsequent step (this keeps the user experience from tanking completely when re-reading the data from disk on every lookup, but I/O is still not great)

Any suggestions that could point me in the right direction?

-- Edit: I want to add that I did save the GPKG including spatial indices in QGIS. If there is any way of keeping only the index in memory and reading the details on the fly, that would be a good way to go.

  • Point-in-poly performance is linked to vertex count/complexity. You need to dice high-vertex geometries into multiple parts. Commented Oct 9 at 14:13
  • Thanks for the suggestion! Vertex count does not seem to be a real problem with this data set:

    ```
    >>> df.geometry.count_coordinates().describe()
    count    2.081421e+06
    mean     6.999601e+00
    std      1.558734e+00
    min      4.000000e+00
    25%      6.000000e+00
    50%      7.000000e+00
    75%      8.000000e+00
    max      3.800000e+01
    dtype: float64
    ```

    (Also, it’s probably the total number of vertices across all polygons that really counts, isn’t it?) Commented Oct 9 at 14:23
  • To re-iterate: I’m not necessarily after query performance, but would like to balance it against memory usage Commented Oct 9 at 14:41
  • Memory is 1000 times faster than disk. I've gotten memory-level performance out of disk by optimizing feature size (e.g. cutting US, Canada, Russia, China, and Australia into 15-30 degree tiles for a global country encoder), but the same optimization in memory was microseconds per query. If you need all the data, and you need it to be accurate, you can't work around having sufficient RAM. Commented Oct 9 at 16:32
  • For heavy geoprocessing workloads I suggest PostgreSQL with PostGIS and spatial indexing, it offers the best performance for speed and resource efficiency. Commented Oct 9 at 19:53

1 Answer


Apparently your data is already in a geopackage file, so you already have your spatial index on disk...

Normally the code sample below should be reasonably fast, as it will use the spatial index in the geopackage file. Giving more cache memory to SQLite, as shown below, can also help (reference).

Edit: added rows=1 as well based on comment by @christoph

import os
import geopandas as gpd
import shapely

# Optional: give plenty of cache memory to sqlite (value in MB; must be a str)
os.environ["OGR_SQLITE_CACHE"] = "128"

path = ...  # path to the GeoPackage file
# mask= filters via the geopackage's on-disk spatial index;
# rows=1 stops after the first matching feature
gdf = gpd.read_file(path, mask=shapely.Point(1, 2), rows=1)
  • Thanks Pieter! This is exactly the kind of hint I was searching for. It had not occurred to me that I can use a point as a mask for read_file(); I also had not thought of checking whether I can increase the SQLite cache memory that easily. Commented Oct 10 at 10:02
  • Could you edit your answer to put quotation marks around the env variable value (they have to be str)? (for future reference) Thanks! Commented Oct 10 at 10:52
  • To confirm, still, that this is a really fantastic answer: I used pyperf to evaluate this code against my previous, RAM-cached approach. Looking up 100 randomly distributed points took 100 ms as opposed to almost 19 s. Commented Oct 10 at 14:53
  • Worth noting, also: I added rows=1 to read_file(), which further sped up the query Commented Oct 10 at 14:55
