
Let's say I have a set of docs. Each doc is an unordered bag of strings:

{a, b, b, d}, {a, b}, {j, k, d, a}, ....

Is it possible to use GIN to find all docs that are similar to a given doc X? Similarity would be measured by cosine or Euclidean distance.

I know PostgreSQL provides trigram search, which is very similar to what I want, except that I want to use my own vectors instead of trigrams.

Something like SELECT * FROM docs WHERE content LIKE {a, b, c}.

INSERT INTO docs (content) VALUES ('{i,j,k}');
INSERT INTO docs (content) VALUES ('{a}');
INSERT INTO docs (content) VALUES ('{b,c}');
...

-- Somehow build a GIN index over the docs.content field

-- Pseudocode: LIKE stands for "is similar to" here
SELECT * FROM docs WHERE content LIKE '{a,b,c}';

Is it possible to do something like that with GIN?

If it helps, a bag of numbers could be used instead of a bag of strings.

  • Actually, now that I read your question more carefully (obviously, after replying) I am not 100% sure about what you want to do... can you specify? Commented Sep 30, 2017 at 1:40
  • As a similarity measure pretty much anything could be used - cosine, euclidean, etc. That makes the question completely random. Please specify the kind of similarity you need. Commented Sep 30, 2017 at 3:30
  • @ErwinBrandstetter thanks for reply, similarity - cosine or euclidean distance. Commented Oct 3, 2017 at 21:15
  • @giorgiga thanks for reply, similarity - cosine or euclidean distance. Commented Oct 3, 2017 at 21:15
  • @AlexeyPetrushin the reason I was asking is that full-text search support in Postgres (which does make use of GIN/GiST) could be what you are looking for, but I can't really tell since you don't explain what you are implementing. In that case, see postgresql.org/docs/current/static/textsearch.html (and then §12.9 about indexes) Commented Oct 5, 2017 at 15:18

1 Answer

You can use GIN indexes to check if an array contains another array:

create table docs(content text[]);

insert into docs values ('{a,b}'),('{a,b,c}'),('{a,b,c,d}'), ('{a,c,d}'),('{a,b,d}');

create index on docs using gin(content);

select content from docs where content @> '{b,c}'; -- this can use the index
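
Containment only matches docs that hold every element of the query. For something closer to a similarity search, here is a rough sketch (my own assumption about what is wanted, not a built-in feature): use the overlap operator `&&`, which the default GIN operator class for arrays also supports, to fetch candidates that share at least one element, then rank them by a set-based cosine similarity, |A ∩ B| / sqrt(|A| · |B|), computed over distinct elements. The aliases `q`, `s`, `shared`, `n_doc` and `n_query` are made up for the example, and duplicates inside a bag are ignored.

```sql
-- Sketch: candidates via && (overlap), ranked by set-based cosine similarity.
-- Multiplicities are ignored: each array is treated as a set of distinct elements.
select d.content,
       s.shared / sqrt(s.n_doc::float8 * s.n_query) as cosine_sim
from docs d
cross join (select '{a,b,c}'::text[] as x) q          -- the query doc X
cross join lateral (
    select
        -- number of distinct elements present in both arrays
        (select count(*)
           from (select unnest(d.content) intersect select unnest(q.x)) as i) as shared,
        -- number of distinct elements in the stored doc
        (select count(distinct t.e) from unnest(d.content) as t(e)) as n_doc,
        -- number of distinct elements in the query doc
        (select count(distinct t.e) from unnest(q.x) as t(e)) as n_query
) s
where d.content && q.x                                 -- at least one shared element
order by cosine_sim desc;
```

Whether the planner actually picks the index for the `&&` filter depends on table size and statistics; on a five-row table it will most likely just scan the table. The ranking step always runs per candidate row, so the index only helps narrow down the candidates.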

Caveat emptor!

The @> operator may not work the way one expects: it treats arrays a bit like they were sets...

select '{a}'  ::text[] @> '{a,a}'::text[]; -- true!
select '{a,b}'::text[] @> '{b,a}'::text[]; -- true!
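
If multiplicities matter (true bag semantics), one possible workaround, sketched here under that assumption, is to keep `@>` as a prefilter and add a recheck that compares per-element counts. The aliases `q`, `t`, `e` and `n` are illustrative only.

```sql
-- Sketch: bag containment = set containment (@>) plus a count recheck.
select content
from docs
where content @> '{a,a}'                 -- prefilter with set semantics
  and not exists (                       -- recheck: no query element may occur
      select 1                           -- more often in the query than in the doc
      from (select t.e, count(*) as n
              from unnest('{a,a}'::text[]) as t(e)
             group by t.e) q
      where q.n > (select count(*)
                     from unnest(content) as t(e)
                    where t.e = q.e)
  );
```

With the sample rows above this returns nothing, because no doc contains `a` twice, whereas the plain `@>` query would return every row.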

