PostgreSQL nested query performing slowly

Question

I have these three tables:

create table words (id integer, word text, freq integer);
create table sentences (id integer, sentence text);
create table index (wordId integer, sentenceId integer, position integer);

The index table is an inverted index that records which word occurs in which sentence. Furthermore, I have an index on id for both the words and sentences tables.

This query determines in which sentences a given word occurs and returns the first match:

select S.sentence from sentences S, words W, index I
where W.word = '#erhoehungen' and W.id = I.wordId and S.id = I.sentenceId
limit 1;

But when I want to retrieve a sentence where two words occur together like:

select S.sentence from sentences S, words W, index I
where W.word = '#dreikampf' and I.wordId = W.id and S.id = I.sentenceId and
S.id in (
    select S.id from sentences S, words W, index I
    where W.word = 'bruederle' and W.id = I.wordId and S.id = I.sentenceId
)
limit 1;

This query is much slower. Is there any trick to speed it up? Here is what I have done so far:

increased shared_buffers to 32MB
increased work_mem to 15MB
ran analyze on all tables
as mentioned, created indexes on words.id and sentences.id

Regards.

Edit:

Here is the output of the explain analyze query statement: http://pastebin.com/t2M5w4na

These three create statements are actually my original create statements. Should I add primary keys to the sentences and words tables and reference them as foreign keys in the index table? But what primary key should I use for the index table? sentenceId and wordId together are not unique, and even if I add position, which denotes the position of the word in the sentence, the combination is still not unique.

Updated to:

create table words (id integer, word text, freq integer, primary key(id));
create table sentences (id integer, sentence text, primary key(id));
create table index (wordId integer, sentenceId integer, position integer, foreign key(wordId) references words(id), foreign key(sentenceId) references sentences(id));

Edit your question, and paste the output of explain analyze your_query, where "your_query" represents your troublesome SELECT statement. Also, actual CREATE TABLE statements can help a lot.
– Mike Sherrill 'Cat Recall', Commented Oct 27, 2013 at 22:38
Your table index (terrible name, BTW) needs at least a primary key. {sentenceid, position} is the obvious choice. Having one or two compound indexes on {sentenceid,wordid} and/or {wordid,sentenceid} would probably help, too.
– wildplasser, Commented Oct 28, 2013 at 0:41
Plus: you will need a UNIQUE constraint or index for the natural key of the words table: the word itself. Off the record: RDBMS and NLP are a bad match. You could take a look at other storage methods (for Postgres: hstore, or GiST indexes for full-text search).
– wildplasser, Commented Oct 28, 2013 at 0:49
The key value pair {sentenceid, position} is not unique, because some sentences are duplicated. Thanks for the information about the other storage methods.
– user2715478, Commented Oct 28, 2013 at 15:26
Why would you want to allow a duplicate sentence? Without extra (key) columns, a duplicate sentence is meaningless.
– joop, Commented Oct 28, 2013 at 16:23

Mosty Mostacho · Accepted Answer · 2013-10-27 22:41:44Z

1

I guess this should be more efficient:

SELECT s.id, s.sentence FROM words w
JOIN INDEX i ON w.id = i.wordId
JOIN sentences s ON i.sentenceId = s.id
WHERE w.word IN ('#dreikampf', 'bruederle')
GROUP BY s.id, s.sentence
HAVING COUNT(*) >= 2

Just make sure the number of items in the IN clause matches the number in the HAVING clause.
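One caveat with COUNT(*): a word that occurs twice in the same sentence also produces two rows. Below is a minimal sketch of that case, using Python's sqlite3 module as a stand-in for PostgreSQL. The table name inv_w and the sample rows are assumptions made up for illustration, not data from the question.

```python
import sqlite3

# SQLite stands in for PostgreSQL; inv_w is a hypothetical name for the
# inverted index table, chosen to avoid the keyword "index".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, freq INTEGER);
    CREATE TABLE sentences (id INTEGER PRIMARY KEY, sentence TEXT);
    CREATE TABLE inv_w (wordId INTEGER, sentenceId INTEGER, position INTEGER);

    INSERT INTO words VALUES (1, '#dreikampf', 3), (2, 'bruederle', 1);
    -- Sentence 1 contains both words; sentence 2 contains '#dreikampf' twice.
    INSERT INTO sentences VALUES (1, 'both words'), (2, 'one word twice');
    INSERT INTO inv_w VALUES (1, 1, 0), (2, 1, 1), (1, 2, 0), (1, 2, 3);
""")

QUERY = """
    SELECT s.id FROM words w
    JOIN inv_w i ON w.id = i.wordId
    JOIN sentences s ON i.sentenceId = s.id
    WHERE w.word IN ('#dreikampf', 'bruederle')
    GROUP BY s.id, s.sentence
    HAVING {condition}
    ORDER BY s.id
"""

def matching_ids(condition):
    """Run the grouped query with the given HAVING condition."""
    return [row[0] for row in conn.execute(QUERY.format(condition=condition))]

count_star = matching_ids("COUNT(*) >= 2")                 # counts index rows
count_distinct = matching_ids("COUNT(DISTINCT w.id) = 2")  # counts distinct words

print(count_star)      # [1, 2]: sentence 2 is a false positive
print(count_distinct)  # [1]
```

Counting distinct word ids keeps the rule above intact: the number in HAVING still equals the number of words in the IN list.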

Fiddle here.


4 Comments

Mosty Mostacho Over a year ago

Also, with this solution you don't need to add more SQL code if you want to add more words; you just change the parameters :)

user2715478 Over a year ago

Thank you very much. It's much faster than my solution, but still in seconds range. Maybe it's because of the size of the tables: words(255715 rows), sentences(5085623 rows) and index(61029790 rows).

Mosty Mostacho Over a year ago

61 MM? That's a big number :) The next level of performance would be working on indexes, I guess. But you should probably ask that question on Database Administrators.

user2715478 Over a year ago

Thanks for the link. Maybe I could also try MySQL with MyISAM as the storage engine, because it uses less security stuff than PostgreSQL. But I have no experience with that.

alexius · Accepted Answer · answered 2013-10-28 02:15Z, edited 2017-05-23 12:28:06Z

0

It looks like you don't have indexes on the columns wordId and sentenceId. Please create them, and the query should run much faster.

CREATE INDEX idx_index_wordId ON index USING btree (wordId);
CREATE INDEX idx_index_sentenceId ON index USING btree (sentenceId);
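To check that such an index is actually picked up, you can compare the query plan before and after creating it. Here is a small sketch using Python's sqlite3 module as a stand-in; on PostgreSQL you would run EXPLAIN ANALYZE instead. The table name inv_w, the index name, and the generated rows are made up for illustration, and plan text differs between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inv_w (wordId INTEGER, sentenceId INTEGER, position INTEGER)"
)
# Hypothetical data: 1000 index rows spread over 50 word ids.
conn.executemany(
    "INSERT INTO inv_w VALUES (?, ?, ?)",
    [(n % 50, n, 0) for n in range(1000)],
)

def plan(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    # The last column of each row is the human-readable description.
    return " | ".join(row[-1] for row in rows)

lookup = "SELECT sentenceId FROM inv_w WHERE wordId = 7"

plan_before = plan(lookup)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_inv_w_wordId ON inv_w (wordId)")
plan_after = plan(lookup)   # the plan should now name the new index

print("before:", plan_before)
print("after: ", plan_after)
```

If the plan after creating the index still shows a full scan on the real tables, that is a hint to re-run analyze or to look at the column types and the query shape.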

Using the SQL keyword index as a table name is not a good idea: you may need to quote it in some cases. You should probably also add an id column to the index table and make it the primary key.

Please use Mosty Mostacho's query and show its explain analyze output after you create the indexes. Maybe it can run even faster.

Update:

Please try this new query:

select S.sentence from sentences S where S.id in
(select sentenceId from index I where 
I.wordId in (select id from words where word IN ('#dreikampf', 'bruederle'))
group by I.sentenceId
having count(distinct I.wordId) = 2
limit 1)

edited May 23, 2017 at 12:28 by Community Bot
answered Oct 28, 2013 at 2:15 by alexius

5 Comments

user2715478 Over a year ago

I added indexes on both id columns and renamed the index table to inv_w. Here is the output of explain analyze: pastebin.com/veVds6KP Still in the seconds range. I'm only interested in the first match, so maybe I can use a cursor? This query retrieves all solutions.

alexius Over a year ago

Please also create this index: CREATE INDEX idx_words_word ON words USING btree (word); and add LIMIT 1 to the end of the query for fetching only one row.

alexius Over a year ago

I also updated my answer – please try new query. It should work faster and more correctly (handle cases when 2 identical words are in one sentence).

user2715478 Over a year ago

The query with '#dreikampf' and 'bruederle' needs ~900 ms. But with the words 'bruederle' and 'punto-vergleich' it takes 12629 ms :/

alexius Over a year ago

Please create one more index: CREATE INDEX idx_index_wordId_sentenceId ON index USING btree (wordId, sentenceId);. For some reason Postgres chose a bad plan; this should help.
