I have a table containing about 1 million records. When I run select * from table, the query times out, and I can see it sitting in the state IO: DataFileRead. When I run select * from table where id>0 and id<=2147483647, where id is the primary key, it returns all the data in a couple of seconds.

Should I always include a where clause, even when returning all records?

Table schema

CREATE TABLE table
(
    id integer NOT NULL GENERATED BY DEFAULT AS IDENTITY ( INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 2147483647 CACHE 1 ),
    batch_id integer,
    area_id integer,
    asset_group text COLLATE pg_catalog."default",
    asset_id text COLLATE pg_catalog."default",
    parent_id text COLLATE pg_catalog."default",
    reference_key text COLLATE pg_catalog."default",
    maintainer_code text COLLATE pg_catalog."default",
    type_code text COLLATE pg_catalog."default",
    super_type_code text COLLATE pg_catalog."default"
)

The primary key is an integer. If I specify the whole integer range, the data comes back quickly, but without a where clause the query takes about an hour. Even if I list column names, for example select id,type_code from table, it is very slow compared to select id,type_code from table where id>0 and id<=2147483647

Below is the execution plan without using where:

 Seq Scan on table  (cost=0.00..6894676.46 rows=630746 width=379) (actual time=2590902.656..4068047.762 rows=792777 loops=1)
Planning Time: 0.095 ms
Execution Time: 4068076.818 ms

And when using where:

Bitmap Heap Scan on table  (cost=597265.81..1252327.52 rows=630747 width=379) (actual time=72.493..211.108 rows=792777 loops=1)
  Recheck Cond: ((id > 0) AND (id < 2147483647))
  Heap Blocks: exact=30533
  ->  Bitmap Index Scan on pk_information_model_entry  (cost=0.00..597108.12 rows=630747 width=0) (actual time=64.017..64.017 rows=792777 loops=1)
        Index Cond: ((id > 0) AND (id < 2147483647))
Planning Time: 8.594 ms
Execution Time: 233.207 ms

I'm aware that using an index can improve performance, but why does adding a where clause make such a difference?

  • Why are you using * instead of column names? It is not best practice. What are the column data types in the table, and how many columns are there? Commented Aug 17, 2022 at 8:08
  • Run select count(*) - count(case when id>0 and id<=2147483647 then 1 end) as diff from table to verify that you do indeed select all rows when using the where clause. Commented Aug 17, 2022 at 8:31
  • I did, and it's returning all the data. Commented Aug 17, 2022 at 11:35
  • What is the maximum length of all the text columns? Commented Aug 17, 2022 at 12:03
  • It varies from 3 to 20, and it should have been varchar. That was the client's decision, not mine; I'm trying to work out why using where returns faster. I also created a copy of the table, and without where the copy returns in 13 seconds, not one hour. Commented Aug 17, 2022 at 12:23

1 Answer

Your table seems to be massively bloated (full of completely empty pages). Using the index lets PostgreSQL skip reading those pages. You could fix it with a VACUUM FULL of the table, or with something like pg_squeeze.
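
The plan figures in the question appear consistent with this. As a rough check, the sequential scan's cost estimate can be turned back into the number of pages the planner believes the table has. This sketch assumes the default planner settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01) and the default 8 kB block size, none of which the question confirms, and my_table is a placeholder for the real table name:

```sql
-- With no filter, a seq scan's total cost is roughly
--   pages * seq_page_cost + rows * cpu_tuple_cost.
-- Assumption: defaults of seq_page_cost = 1.0 and cpu_tuple_cost = 0.01.
SELECT 6894676.46 - 630746 * 0.01 AS estimated_pages;   -- 6888369.00

-- Assumption: default 8 kB (8192-byte) block size.
-- 30533 is the "Heap Blocks: exact" figure from the bitmap scan plan.
SELECT pg_size_pretty(6888369::bigint * 8192) AS estimated_table_size,
       pg_size_pretty(30533::bigint * 8192)   AS pages_holding_rows;

-- Measure the heap directly rather than trusting the estimate.
-- my_table is a placeholder name.
SELECT pg_size_pretty(pg_relation_size('my_table')) AS heap_size;

-- Rewrite the table into a compact new file.
VACUUM (FULL, VERBOSE, ANALYZE) my_table;
```

Under those assumptions the table would occupy about 52.5 GiB on disk while only about 239 MiB of pages hold rows. Note that VACUUM FULL takes an ACCESS EXCLUSIVE lock for the duration of the rewrite and needs enough free disk space for the new copy of the table.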

You might also want to investigate how it got that way in the first place, so you can prevent it from recurring.
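
As a possible starting point for that investigation, the statistics views show how much update and delete traffic the table has seen, whether (auto)vacuum has been running on it, and whether old transactions are still open. These are common suspects to look at, not a diagnosis of this particular case, and my_table is again a placeholder name:

```sql
-- Write activity, dead tuples and vacuum history for the table.
-- my_table is a placeholder name.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       n_tup_upd,
       n_tup_del,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'my_table';

-- Oldest open transactions; long-lived ones can prevent vacuum
-- from cleaning up dead rows.
SELECT pid, state, xact_start, backend_xmin
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```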

To reduce planning time, PostgreSQL doesn't consider using an index unless it "might possibly be useful". But just overcoming extreme bloat is not considered to be "possibly useful", which is why it only uses the index after you introduce a dummy WHERE clause which references the column.
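
To see this choice without waiting for the slow query, plain EXPLAIN prints the plan without executing the statement. A minimal sketch, with my_table as a placeholder name:

```sql
-- No reference to an indexed column: expect a Seq Scan.
EXPLAIN SELECT * FROM my_table;

-- The dummy range condition on id makes the primary key index a candidate.
EXPLAIN SELECT * FROM my_table WHERE id > 0 AND id <= 2147483647;
```

Using EXPLAIN (ANALYZE, BUFFERS) instead also reports how many pages each plan touches, but it runs the query, so the first statement would take as long as the original select.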
