Efficient incremental inserts in PostgreSQL

Question

I use a database to represent a list of files, along with some metadata associated with each of them. I need to update this list regularly, adding only the new files and deleting the files that no longer exist (I must not touch the existing rows in the table, as I would lose the associated metadata).

My current queries take only seconds with around 10,000 files, but about an hour with my current table of 150,000 files.

After some research on the Internet, I ended up with the following process:

Populate a table "newfiles" with the results of the scan
DELETE FROM files WHERE path NOT IN (SELECT path FROM newfiles);
INSERT INTO files (SELECT * FROM newfiles WHERE path NOT IN (SELECT path FROM files));

I also have these indexes:

CREATE INDEX "files_path" ON "files" ("path");
CREATE INDEX "files_path_like" ON "files" ("path" varchar_pattern_ops);
CREATE INDEX "newfiles_path" ON "newfiles" ("path");
CREATE INDEX "newfiles_path_like" ON "newfiles" ("path" varchar_pattern_ops);

(I mostly use these indexes for searching the database; my application includes a file search engine.)

Each of these two queries takes more than one hour when I have 150,000 files. How can I optimize this?

Thank you.

A sometimes-viable option is to add new partitions: create a new table that INHERITS a parent table, add an appropriate constraint, populate it, create indexes on it. This only works when your new data can be clearly partitioned on a single constraint.
– Craig Ringer, Commented Apr 8, 2013 at 10:52
This sounds more like a memory or disk IO issue. 150K rows isn't a huge amount - maybe you just need to allocate more memory to postgres? Even then, how big is the table? It shouldn't take an hour to read all this data from disk.
– AngerClown, Commented Apr 8, 2013 at 12:32

Daniel Vérité · Accepted Answer · 2013-04-08 11:27:02Z

Try NOT EXISTS instead of NOT IN, as in:

DELETE FROM files WHERE NOT EXISTS
  (SELECT 1 FROM newfiles WHERE newfiles.path=files.path);
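
The same rewrite applies to the INSERT from the question. This is only a sketch: it assumes, as the original SELECT * implies, that newfiles and files have the same column layout, and the aliases n and f are introduced here for readability.

```sql
-- Insert only the paths that are not already present in files.
-- Assumption: newfiles and files have identical column layouts,
-- as the SELECT * in the question implies.
INSERT INTO files
SELECT n.*
FROM newfiles AS n
WHERE NOT EXISTS
  (SELECT 1 FROM files AS f WHERE f.path = n.path);
```

The likely reason this helps: NOT EXISTS can be planned as an anti-join, whereas NOT IN has to honor NULL semantics and is generally not planned that way. When the NOT IN subquery result does not fit in work_mem, the planner may end up rescanning it for every outer row, which would be consistent with the slowdown seen between 10,000 and 150,000 files.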

Also, if newfiles is populated from scratch each time, make sure to run ANALYZE newfiles before issuing any query that uses it, so that the optimizer can work with accurate statistics.
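
As a rough sketch of where ANALYZE fits in the refresh cycle (the TRUNCATE and the loading step are assumptions, since the question does not say how newfiles is filled):

```sql
-- Assumption: newfiles is emptied and refilled on every scan.
TRUNCATE newfiles;

-- ... repopulate newfiles here (for example with COPY or batched INSERTs) ...

-- Refresh planner statistics for the freshly loaded table. Autovacuum
-- may not have analyzed it yet by the time the next query runs.
ANALYZE newfiles;
```

The DELETE and INSERT should then run after the ANALYZE, so that both are planned against current row counts.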

If that doesn't solve it, run EXPLAIN or EXPLAIN ANALYZE on your queries to get the execution plans, and append them to the question.
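
One thing to keep in mind is that EXPLAIN ANALYZE actually executes the statement, unlike plain EXPLAIN, which only shows the estimated plan. To capture a real plan for the DELETE without losing rows, one option is to wrap it in a transaction and roll it back, for example:

```sql
BEGIN;

-- ANALYZE runs the statement and reports actual row counts and timings;
-- BUFFERS adds buffer usage, which helps spot I/O-bound plans.
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM files WHERE NOT EXISTS
  (SELECT 1 FROM newfiles WHERE newfiles.path = files.path);

-- Undo the delete; only the plan output is kept.
ROLLBACK;
```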

2 Comments

alphatiger Over a year ago

Sorry, I completely forgot I had asked about this here... -_-

alphatiger Over a year ago

That helped very much; each query now takes less than one second. After trying both options, it was using NOT EXISTS instead of NOT IN that helped. Thank you very much!
