
Firstly, I admit that my experience with spatial functions is very minimal. I have a table in MySQL with 20 fields and 23549187 records containing geographical data. One of the fields is 'point', which is of the POINT data type and has a spatial index on it. I have a query that selects all points within a polygon, which looks like this:

select * from `table_name` where ST_CONTAINS(ST_GEOMFROMTEXT('POLYGON((151.186 -23.497,151.207 -23.505,151.178 -23.496,151.174 -23.49800000000001,151.176 -23.496,151.179 -23.49500000000002,151.186 -23.497))'), `point`)

This works well when the polygon is small. However, as the polygon gets massive, execution time gets really slow; the slowest query so far ran for 15 minutes. Adding the index really helped bring it down to 15 minutes, which otherwise would have taken close to an hour. Is there anything I can do here for further improvement? This query will be run by a PHP script that runs as a daemon, and I am worried that these slow queries will bring the MySQL server down.

All suggestions to make it better are welcome. Thanks.

EDIT:

show create table;

CREATE TABLE `table_name` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `lat` float(12,6) DEFAULT NULL,
  `long` float(12,6) DEFAULT NULL,
  `point` point NOT NULL,
  PRIMARY KEY (`id`),
  KEY `lat` (`lat`,`long`),
  SPATIAL KEY `sp_index` (`point`)
) ENGINE=MyISAM AUTO_INCREMENT=47222773 DEFAULT CHARSET=utf8mb4

There are a few more fields that I am not supposed to disclose here, however the filter won

Explain sql output for the slow query:

+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows     | Extra       |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+
|  1 | SIMPLE      | table_name | ALL  | NULL          | NULL | NULL    | NULL | 23549187 | Using where |
+----+-------------+------------+------+---------------+------+---------+------+----------+-------------+

Explain sql output for a query with a smaller polygon:

+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+
| id | select_type | table      | type  | possible_keys | key      | key_len | ref  | rows | Extra       |
+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+
|  1 | SIMPLE      | table_name | range | sp_index      | sp_index | 34      | NULL |    1 | Using where |
+----+-------------+------------+-------+---------------+----------+---------+------+------+-------------+

It looks like the query with the biggest polygon does not use the index.

  • You might have better luck moving this to dba.stackexchange.com or possibly gis.stackexchange.com Commented Nov 7, 2018 at 2:14
  • By massive, do you mean big in size (a big bounding box, like covering the whole earth), and/or a complicated shape with thousands of points, and/or one containing a lot of (or all of) the points? MySQL seems to have decided on the latter; can you verify that this is the case (e.g. that you get a large portion of your 23 million points in your result set)? Commented Nov 7, 2018 at 7:55
  • The slow one or both? The more important question is how many points are in it, and specifically how many points are in the bounding box of the polygon (if you make a square around the polygon, how many points would be in it - that is the part the index provides). There might not be a simple way to speed up your query, but it would help to have a general idea of the polygon's shape (see bishop's answer): if your polygon e.g. describes a circle, you can get rid of 60% of your comparisons by using an inner square (easy for the index), and then union the remaining 4 circle parts to it. Commented Nov 9, 2018 at 12:50
  • These are some large numbers. Checking perhaps 500k points (those in the vicinity of your polygon) against a 6k-vertex polygon will take significant time. I am not sure you can optimize your query from a general viewpoint; you may need to incorporate other factors about your situation. E.g.: if your polygon is a 6k-vertex polygon of Florida, and you want to know which of your customers is inside it, you could store/precalculate that someone lives in Miami, and then you only need to check whether Miami is in Florida to get 30% of your results in one comparison. Commented Nov 11, 2018 at 12:04
  • Or maybe you can simplify your polygon: you do not need 4k points to describe the coastline of Key West in detail, as there is no reasonable expectation that someone actually lives in the Gulf of Mexico, so 10 points instead of 4k might suffice. There likely is such a clustering or reasoning for your problem; the hard task might be to find it. You would need to add details about your problem, and/or you may want to describe it on gis.stackexchange.com - maybe they have an idea. Commented Nov 11, 2018 at 12:04
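The simplification idea in the last comment is typically done with the Ramer-Douglas-Peucker algorithm, which drops vertices that lie within a tolerance of the chord between segment endpoints. A minimal Python sketch (not from the question; `epsilon` is a hypothetical tolerance in the same degree units as the coordinates):

```python
import math

def _perp_dist(pt, a, b):
    # Perpendicular distance from pt to the line through a and b.
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy)
    if norm == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / norm

def simplify(points, epsilon):
    """Ramer-Douglas-Peucker: drop vertices closer than epsilon
    to the chord between the first and last point."""
    if len(points) < 3:
        return list(points)
    # Find the vertex farthest from the chord.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    # Recurse on each half and join, dropping the duplicated split point.
    left = simplify(points[:idx + 1], epsilon)
    right = simplify(points[idx:], epsilon)
    return left[:-1] + right
```

A simplified polygon with far fewer vertices makes each ST_CONTAINS test cheaper, at the cost of slightly fuzzy boundaries.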

1 Answer


MySQL uses R-Trees for indexing spatial data. Like B-Tree indexes, these are optimal for queries targeting a small subset of the total rows. As your bounding polygon gets larger, the number of possible matches increases and, at some point, the optimizer decides it is more efficient to switch to a full table scan. That appears to be the scenario here, and I see three options:

First, try adding a LIMIT to your query. Normally, MySQL ignores the index if the optimizer concludes fewer I/O seeks would occur in a full table scan. But, with B-Tree indexes at least, MySQL will short-circuit that logic and always perform the index dive when LIMIT is present. I hypothesize R-Tree indexes have similar short-circuiting.

Second, and similar in spirit to the first, try forcing MySQL to use the index. This tells MySQL to treat the table scan as more expensive than its estimate suggests. Understand that the optimizer only has heuristics and doesn't really know how "expensive" things are beyond what its internal statistics conclude. We humans have intuition, which sometimes - sometimes - knows better.

select * from `table_name` force index (`sp_index`) where ST_CONTAINS(ST_GEOMFROMTEXT('POLYGON((151.186 -23.497,151.207 -23.505,151.178 -23.496,151.174 -23.49800000000001,151.176 -23.496,151.179 -23.49500000000002,151.186 -23.497))'), `point`)

Finally, if those don't work, then what you need to do is break up your bounding polygon into smaller polygons. For example, if your bounding polygon is a square 500km per side, break it up into 4 squares 250km on each side, or 16 squares 125km per side, etc. Then UNION all of these together. The index will be used on each one, and the cumulative result may be faster. (Note it's important to UNION them together: MySQL cannot apply multiple range scans on a spatial query.)
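The splitting idea above can be sketched in Python for the simplest case, a rectangular bounding area cut into an n x n grid of cells, with one SELECT per cell glued together by UNION. This is an illustrative sketch, not the answerer's code: the table and column names come from the question, but `split_bbox`, `cell_to_wkt`, and `union_query` are hypothetical helpers. For a non-rectangular polygon you would intersect each cell with the original shape (e.g. in a GIS library), or keep the original ST_CONTAINS test ANDed into each branch so the small rectangle merely lets the index prune.

```python
def split_bbox(min_x, min_y, max_x, max_y, n):
    """Split the bounding box into an n x n grid of cells."""
    dx = (max_x - min_x) / n
    dy = (max_y - min_y) / n
    cells = []
    for i in range(n):
        for j in range(n):
            x0, y0 = min_x + i * dx, min_y + j * dy
            cells.append((x0, y0, x0 + dx, y0 + dy))
    return cells

def cell_to_wkt(x0, y0, x1, y1):
    """WKT polygon for one rectangular cell (closed ring)."""
    return (f"POLYGON(({x0} {y0},{x1} {y0},{x1} {y1},"
            f"{x0} {y1},{x0} {y0}))")

def union_query(cells, table="table_name", col="point"):
    """One SELECT per cell, combined with UNION so each branch
    works on a small polygon the spatial index can handle."""
    selects = [
        f"SELECT * FROM `{table}` "
        f"WHERE ST_CONTAINS(ST_GEOMFROMTEXT('{cell_to_wkt(*c)}'), `{col}`)"
        for c in cells
    ]
    return "\nUNION\n".join(selects)
```

Run EXPLAIN on one branch to confirm each sub-query actually hits `sp_index` before committing to a grid size.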


2 Comments

That was really informative. I did try forcing the index, but it made no difference and in fact delayed execution. I will try to find a way to break up the polygon and union the pieces. Since it is not always a square, that will not be easy.
@RohithMohan There are algorithms for partitioning polygons. The basic idea is to divide the polygon into halves of near-equal area, then repeat that process on each half as many times as necessary.
