
I have these three tables:

  1. create table words (id integer, word text, freq integer);
  2. create table sentences (id integer, sentence text);
  3. create table index (wordId integer, sentenceId integer, position integer);

The index table is an inverted index that records which word occurs in which sentence. Furthermore, I have an index on the id column of both the words and the sentences table.

This query determines in which sentences a given word occurs and returns the first match:

select S.sentence from sentences S, words W, index I
where W.word = '#erhoehungen' and W.id = I.wordId and S.id = I.sentenceId
limit 1;

But when I want to retrieve a sentence where two words occur together like:

select S.sentence from sentences S, words W, index I
where W.word = '#dreikampf' and I.wordId = W.id and S.id = I.sentenceId and
S.id in (
    select S.id from sentences S, words W, index I
    where W.word = 'bruederle' and W.id = I.wordId and S.id = I.sentenceId
)
limit 1;

This query is much slower. Is there any trick to speed it up? Here is what I have done so far:

  • increased shared_buffers to 32MB
  • increased work_mem to 15MB
  • ran analyze on all tables
  • as mentioned, created indexes on words.id and sentences.id

Regards.

Edit:

Here is the output of the explain analyze query statement: http://pastebin.com/t2M5w4na

These three create statements are actually my original create statements. Should I add a primary key to the tables sentences and words and reference these as foreign keys in the index table? But what primary key should I use for the index table? sentenceId and wordId together are not unique, and even if I add position, which denotes the position of the word in the sentence, the combination is still not unique.

updated to:

  1. create table words (id integer, word text, freq integer, primary key(id));
  2. create table sentences (id integer, sentence text, primary key(id));
  3. create table index (wordId integer, sentenceId integer, position integer, foreign key(wordId) references words(id), foreign key(sentenceId) references sentences(id));
  • Edit your question, and paste the output of explain analyze your_query, where "your_query" represents your troublesome SELECT statement. Also, actual CREATE TABLE statements can help a lot. Commented Oct 27, 2013 at 22:38
  • Your table index (terrible name, BTW) needs at least a primary key. {sentenceid, position} is the obvious choice. Having one or two compound indexes on {sentenceid,wordid} and/or {wordid,sentenceid} would probably help, too. Commented Oct 28, 2013 at 0:41
  • Plus: you will need a UNIQUE constraint or index for the natural key of the words table: the word itself. off-record: RDBMS and nlp are a bad match. You could take a look at other storage methods (for Postgres: hstore, or GIST indexes for full-text search) Commented Oct 28, 2013 at 0:49
  • The key value pair {sentenceid, position} is not unique, because some sentences are duplicated. Thanks for the information about the other storage methods. Commented Oct 28, 2013 at 15:26
  • Why would you want to allow a duplicate sentence? Without extra (key) columns, a duplicate sentence is meaningless. Commented Oct 28, 2013 at 16:23
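
The constraint and compound indexes suggested in the comments above could look roughly like this. This is only a sketch: the constraint and index names are made up for illustration, and the UNIQUE constraint assumes that words.word currently contains no duplicates.

```sql
-- Natural key of the words table. This statement fails if the table
-- already contains the same word twice.
ALTER TABLE words ADD CONSTRAINT words_word_key UNIQUE (word);

-- Compound indexes on the inverted index, one per lookup direction.
-- The table name is quoted because "index" is also an SQL keyword.
CREATE INDEX idx_index_word_sentence ON "index" (wordId, sentenceId);
CREATE INDEX idx_index_sentence_word ON "index" (sentenceId, wordId);
```

Building these on a table with tens of millions of rows will probably take a while, so it may be worth running analyze again afterwards and comparing the explain analyze output before and after.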

2 Answers

1

I guess this should be more efficient:

SELECT s.id, s.sentence FROM words w
JOIN INDEX i ON w.id = i.wordId
JOIN sentences s ON i.sentenceId = s.id
WHERE w.word IN ('#dreikampf', 'bruederle')
GROUP BY s.id, s.sentence
HAVING COUNT(*) >= 2

Just make sure the number of items in the IN clause matches the count used in the HAVING clause.
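
One way to keep the two numbers from drifting apart is to write the word list once and derive the required count from it. The following is a hedged variant of the query above, not something tested against the real data; the CTE name wanted is purely illustrative, and the table name is quoted because index is also an SQL keyword.

```sql
-- The search words are listed exactly once.
WITH wanted(word) AS (
    VALUES ('#dreikampf'), ('bruederle')
)
SELECT s.id, s.sentence
FROM wanted t
JOIN words w     ON w.word = t.word
JOIN "index" i   ON i.wordId = w.id
JOIN sentences s ON s.id = i.sentenceId
GROUP BY s.id, s.sentence
-- Count distinct search words, so a word that occurs twice in one
-- sentence is not mistaken for two different matches.
HAVING COUNT(DISTINCT t.word) = (SELECT COUNT(*) FROM wanted)
LIMIT 1;
```

Adding a third word then only means adding one more row to the VALUES list.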

Fiddle here.


4 Comments

Also you don't need to add more SQL code in this solution if you want to add more words but rather change the parameters :)
Thank you very much. It's much faster than my solution, but still in seconds range. Maybe it's because of the size of the tables: words(255715 rows), sentences(5085623 rows) and index(61029790 rows).
61 MM? That's a big number :) The next level of performance would be working on indexes, I guess. But you should probably ask that question on Database Administrators.
Thanks for the link. Maybe I could also try mysql with MyISAM as storage engine, because it uses less security stuff than postgreSQL - But I have no experience with that.
0

It looks like you don't have indexes on the columns wordId and sentenceId. Please create them and the query will run much faster.

CREATE INDEX idx_index_wordId ON index USING btree (wordId);
CREATE INDEX idx_index_sentenceId ON index USING btree (sentenceId);

Using the reserved word index as a table name is not a good idea: you may need to escape it in some cases. You should probably also add an id column to the index table and make it the primary key.
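
As a rough sketch of both suggestions: the table can either be quoted wherever it is used, or renamed once. The new name inv_w below is just an example, and the surrogate key column is an assumption about how the id would be added.

```sql
-- Option 1: keep the name and quote it wherever it appears.
SELECT count(*) FROM "index";

-- Option 2: rename the table once so that no quoting is needed.
ALTER TABLE "index" RENAME TO inv_w;

-- Add a surrogate key column and make it the primary key.
-- bigserial leaves plenty of headroom for a table of this size.
ALTER TABLE inv_w ADD COLUMN id bigserial PRIMARY KEY;
```

On a large table the last statement has to fill in a value for every existing row, so expect it to run for some time.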

Please use Mosty Mostacho's query and show its explain analyze output after you create the indexes. Maybe it can run even faster.

Update:

Please try this new query:

select S.sentence from sentences S where S.id in
(select sentenceId from index I where 
I.wordId in (select id from words where word IN ('#dreikampf', 'bruederle'))
group by I.sentenceId
having count(distinct I.wordId) = 2
limit 1)

5 Comments

Added indexes on both ids and renamed the index table to inv_w. Here is the output of explain analyze: pastebin.com/veVds6KP Still in the seconds range. I'm only interested in the first / one match, so maybe I can use a cursor? Because this query retrieves all solutions.
Please also create this index: CREATE INDEX idx_words_word ON words USING btree (word); and add LIMIT 1 to the end of the query for fetching only one row.
I also updated my answer – please try new query. It should work faster and more correctly (handle cases when 2 identical words are in one sentence).
The query with '#dreikampf' and 'bruederle' needs ~ 900 ms. But with the words 'bruederle' and 'punto-vergleich' it takes 12629 ms :/
Please create one more index: CREATE INDEX idx_index_wordId_sentenceId ON index USING btree (wordId, sentenceId);. For some reason Postgres chose a bad plan; this should help.
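
Since only one matching sentence is needed, another formulation that may be worth trying is a self-join on the inverted index, which needs no GROUP BY and so gives the planner a chance to stop at the first hit. This is an untested sketch, not a measured improvement. It assumes the table has been renamed to inv_w as mentioned in the comments, that an index on (wordId, sentenceId) exists, and that each word appears only once in words (otherwise the scalar subqueries raise an error).

```sql
SELECT s.sentence
FROM inv_w a
JOIN inv_w b     ON b.sentenceId = a.sentenceId  -- same sentence, second word
JOIN sentences s ON s.id = a.sentenceId
WHERE a.wordId = (SELECT id FROM words WHERE word = 'bruederle')
  AND b.wordId = (SELECT id FROM words WHERE word = 'punto-vergleich')
LIMIT 1;
```

Prefixing the statement with explain analyze shows whether the planner actually uses the compound index and how the timing compares with the grouped query.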
