I use a database to represent a list of files and some metadata associated with each of them. I need to update this list regularly, adding only the new files and deleting files that no longer exist (I must not touch the existing rows in the table, as I would lose the associated metadata).

My current queries take only seconds when I have around 10,000 files, but take one hour with my current 150,000-file table.

After some research on the Internet, I arrived at the following process:

  1. Populate a table "newfiles" with the results of the scan
  2. DELETE FROM files WHERE path NOT IN (SELECT path FROM newfiles);
  3. INSERT INTO files (SELECT * FROM newfiles WHERE path NOT IN (SELECT path FROM files));

I also have indexes:

CREATE INDEX "files_path" ON "files" ("path");
CREATE INDEX "files_path_like" ON "files" ("path" varchar_pattern_ops);
CREATE INDEX "files_path" ON "newfiles" ("path");
CREATE INDEX "files_path_like" ON "newfiles" ("path" varchar_pattern_ops);

(I mostly use these indexes for searching in the database; my application has a file search engine.)

Each of these two queries takes more than one hour when I have 150,000 files. How can I optimize that?

Thank you.

  • A sometimes-viable option is to add new partitions: create a new table that INHERITS a parent table, add an appropriate constraint, populate it, create indexes on it. This only works when your new data can be clearly partitioned on a single constraint. Commented Apr 8, 2013 at 10:52
  • This sounds more like a memory or disk I/O issue. 150K rows isn't a huge amount; maybe you just need to allocate more memory to PostgreSQL? Even then, how big is the table? It shouldn't take an hour to read all this data from disk. Commented Apr 8, 2013 at 12:32

1 Answer

Try NOT EXISTS instead of NOT IN, as in:

DELETE FROM files WHERE NOT EXISTS
  (SELECT 1 FROM newfiles WHERE newfiles.path=files.path);
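
The same rewrite should apply to the INSERT in step 3. Here is a minimal, self-contained sketch of both steps. It is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it can run anywhere, and the "note" column and the sample paths are made up, not taken from the question. It shows that the logic is correct, not how fast it will be on your server.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "note" is a hypothetical metadata column, used only for this demo.
    CREATE TABLE files (path TEXT PRIMARY KEY, note TEXT);
    CREATE TABLE newfiles (path TEXT, note TEXT);
    INSERT INTO files VALUES ('/a', 'keep me'), ('/b', 'stale');
    INSERT INTO newfiles VALUES ('/a', NULL), ('/c', NULL);
""")

# Step 2: delete rows whose path no longer appears in the scan.
conn.execute("""
    DELETE FROM files WHERE NOT EXISTS
      (SELECT 1 FROM newfiles WHERE newfiles.path = files.path)
""")

# Step 3: insert only the paths that are not in files yet.
conn.execute("""
    INSERT INTO files (path, note)
    SELECT path, note FROM newfiles
    WHERE NOT EXISTS
      (SELECT 1 FROM files WHERE files.path = newfiles.path)
""")
conn.commit()

rows = conn.execute("SELECT path, note FROM files ORDER BY path").fetchall()
print(rows)  # [('/a', 'keep me'), ('/c', None)]
```

The row for '/a' keeps its metadata because it is never touched, '/b' is removed, and '/c' is added.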

Also, if newfiles is populated from scratch each time, make sure that you ANALYZE newfiles before issuing any query that uses it, so that the optimizer can work with good statistics.

If that doesn't solve it, run EXPLAIN or EXPLAIN ANALYZE on your queries to get the execution plan, and append it to the question.
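
A side note on why NOT IN and NOT EXISTS are not simply two spellings of the same thing: they behave differently when the subquery can return NULL, and as far as I know that is a large part of why PostgreSQL can plan NOT EXISTS as an anti-join but generally cannot do the same for NOT IN. The sketch below shows only the semantic difference. Again it uses Python's sqlite3 module as a stand-in for PostgreSQL, with made-up sample data; the NULL handling shown here is standard SQL behaviour.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (path TEXT);
    CREATE TABLE newfiles (path TEXT);
    INSERT INTO files VALUES ('/a'), ('/b');
    -- One NULL path in the scan results is enough to change the outcome.
    INSERT INTO newfiles VALUES ('/a'), (NULL);
""")

# NOT IN: the comparison '/b' = NULL is unknown, so the row is not returned.
not_in = conn.execute(
    "SELECT path FROM files WHERE path NOT IN (SELECT path FROM newfiles)"
).fetchall()

# NOT EXISTS: no row of newfiles matches '/b', so the row is returned.
not_exists = conn.execute(
    "SELECT path FROM files WHERE NOT EXISTS "
    "(SELECT 1 FROM newfiles WHERE newfiles.path = files.path)"
).fetchall()

print(not_in)      # []
print(not_exists)  # [('/b',)]
```

If path is declared NOT NULL in both tables, the two forms return the same rows, and only the execution plan differs.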

2 Comments

Sorry, I completely forgot that I had asked about this here... -_-
That helped very much; it now takes less than one second for each query. After trying both options, it was using NOT EXISTS instead of NOT IN that helped. Thank you very much!
