I believe PostgreSQL can handle my query faster, but every attempt to modify it makes it slower!

I have 2 tables:

  • statistics (id, field1, [...], field10)
  • geometry (id, geom)

I created indexes on:

  • statistics.id
  • geometry.id
  • geometry (st_x(st_centroid(st_transform(geom, 2154))), st_y(st_centroid(st_transform(geom, 2154))))

Here is the query:

EXPLAIN ANALYZE SELECT
    statistics.*,
    st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x,
    st_y(st_centroid(st_transform(geometry.geom, 2154))) AS y
FROM statistics
JOIN geometry ON statistics.id = geometry.id
WHERE statistics.id NOT LIKE '97%';

Here is the result:

Hash Join  (cost=1294.66..5158.10 rows=36593 width=342) (actual time=20.788..1085.257 rows=36552 loops=1)
  Hash Cond: (geometry.id = (statistics.id)::text)
  ->  Seq Scan on geometry  (cost=0.00..2445.46 rows=36593 width=279) (actual time=0.010..25.271 rows=36597 loops=1)
        Filter: (id !~~ '97%'::text)
  ->  Hash  (cost=835.96..835.96 rows=36696 width=69) (actual time=19.892..19.892 rows=36696 loops=1)
        Buckets: 4096  Batches: 1  Memory Usage: 3780kB
        ->  Seq Scan on statistics  (cost=0.00..835.96 rows=36696 width=69) (actual time=0.005..6.871 rows=36696 loops=1)
Planning time: 0.401 ms
Execution time: 1088.612 ms

The most expensive operation is the Hash Join. How would you reorganise the query to get a better result?

Below are the schemas of the two tables:

CREATE TABLE "statistics" (
    "REG" integer,
    "DEP" character varying(10),
    "COM" character varying(50),
    "D03" integer,
    "D04" integer,
    "D05" integer,
    "D06" integer,
    "D07" integer,
    "D08" integer,
    "D09" integer,
    "D10" integer,
    "D11" integer,
    "D12" integer,
    "D13" integer,
    "id" text
);

CREATE TABLE geometry (  
    id text NOT NULL,
    id_geo numeric(10,0),
    cm_code character varying(3),
    name character varying(50),
    status character varying(20),
    lat integer,
    long integer,
    lat_centroid integer,
    long_centroid integer,
    z_ smallint,
    area numeric(10,0),
    population double precision,
    code_ct character varying(2),
    code_r character varying(1),
    code_dp character varying(2),
    name_dp character varying(30),
    code_rg character varying(2),
    geom geometry(MultiPolygon,4326),
    x real,
    y real
);

There are about 40,000 rows in each table.

The indexes have been created as follows:

CREATE INDEX statistics_id_idx ON public.statistics USING btree (id COLLATE pg_catalog."default");
CREATE INDEX geometry_geom_idx ON public.geometry USING gist (geom);
CREATE INDEX geometry_id_gin2 ON public.geometry  USING gin (id COLLATE pg_catalog."default" gin_trgm_ops);
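
For completeness, the expression index on the centroid coordinates mentioned in the list at the top was created along these lines (a reconstruction; the exact index name may differ):

-- Expression index over the projected centroid coordinates
-- (requires the PostGIS functions to be IMMUTABLE, which they are).
CREATE INDEX geometry_centroid_xy_idx ON public.geometry
    (st_x(st_centroid(st_transform(geom, 2154))),
     st_y(st_centroid(st_transform(geom, 2154))));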

For information, I have tried different index types (btree and gin) on geometry.id and statistics.id.

  • Size of each table? The only problems I see are the LIKE, and that geometry.id may be a different type from statistics.id. I don't understand your geometry index for st_x; that won't help this SELECT. Commented Aug 26, 2015 at 4:12
  • Each table has 36,582 rows. Commented Aug 26, 2015 at 5:37
  • The tables don't have a primary key, so a hash join is the best you can get. (Also: using a text field for an id might be sub-optimal.) Commented Aug 26, 2015 at 10:20
  • @wildplasser there is a primary key on the id of each table; however, I still get the hash join. Commented Aug 26, 2015 at 12:32
  • How could I know that? You did not put it in your table definitions. BTW: for small tables there is nothing wrong with a hash join. Commented Aug 26, 2015 at 12:40

1 Answer

I don't see any problem with your query.

Things to check

  • (geometry.id = (statistics.id)::text): are both fields the same data type?
  • WHERE statistics.id NOT LIKE '97%': LIKE '%me' will never use an index, but LIKE 'me%' possibly can. Why doesn't yours use an index?
  • st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x is a function call, and that takes time: it has to transform the coordinates and then extract a value. You are better off calculating that value once and storing it in a field (see the sketch after this list).
  • Your geometry index doesn't have any impact on this query, because you are calculating a value, not searching for something.
  • If you want to perform geographic searches, that isn't the right index either. But we can talk about that later.
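
A minimal sketch of that precalculation, assuming the existing real columns x and y in the geometry schema above are meant to hold these values:

-- One-off backfill: compute the projected centroid coordinates once,
-- so the query can read plain columns instead of recomputing
-- st_transform/st_centroid for every row.
UPDATE geometry
SET x = st_x(st_centroid(st_transform(geom, 2154))),
    y = st_y(st_centroid(st_transform(geom, 2154)));

Note the stored values go stale if geom ever changes; rerun the UPDATE (or maintain the columns with a trigger) in that case.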

Things to try

First, the WHERE ... LIKE on its own:

SELECT *
FROM statistics
WHERE statistics.id not like '97%';

Then just the join:

SELECT statistics.*,
       geometry.geom
FROM statistics
JOIN geometry ON statistics.id = geometry.id 

Then the join + st_x:

SELECT statistics.*,
       st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x,
       st_y(st_centroid(st_transform(geometry.geom, 2154))) AS y
FROM statistics
JOIN geometry ON statistics.id = geometry.id 

Then create the precalculated x and y columns in the geometry table and select them instead:

SELECT statistics.*,
       geometry.x,
       geometry.y
FROM statistics
JOIN geometry ON statistics.id = geometry.id

Then try join + st_x + WHERE ... LIKE, and join + geometry.x/y + WHERE ... LIKE (the latter is sketched below).

Compare the times between the steps to see which part takes the longest.
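
For reference, the last combination (join + precalculated columns + the WHERE ... LIKE filter) would look like this:

-- Final form: no per-row PostGIS calls, just a join and a filter.
SELECT statistics.*,
       geometry.x,
       geometry.y
FROM statistics
JOIN geometry ON statistics.id = geometry.id
WHERE statistics.id NOT LIKE '97%';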


Comments

Thanks for your feedback. Will try what you suggested @Juan Carlos Oropeza.
Transforming the coordinates and storing them in the geometry table (x and y) did the trick! I divided the time by 10; the query now runs in 100 ms! Thanks @Juan Carlos Oropeza.
However, when I run the query for real it still takes 5 seconds, which is better than 6... but I expected a better result.
Then there is still something wrong; it doesn't make sense for a query to be that slow. I run queries on 10 million rows in ms. Can you edit your post with the table schema? Unless you have slow drives or a bad memory setup.
OK @Juan Carlos Oropeza. So it is not quite right that pg spends that much time.