In my project, many documents have identical embeddings but differ in metadata, which can affect retrieval through filtering: there are 400K documents but only 22040 distinct embeddings. This happens because there are multiple versions or editions of the same document, which share one embedding but carry different metadata values.

I am considering three strategies to address this issue:
a) Insert all documents with matching embeddings but varying metadata into the database and apply filters as needed.
b) Preprocess the documents to create an index that eliminates embedding duplicates in the vector database. Each document in this index would link to all original documents with the same embedding. Metadata filtering would then be conducted outside the vector database. (I assume this would make the vector database search faster in my situation, but it would add extra complexity outside the vector database.)
c) Combine the metadata of all documents that share the same embedding into a list (i.e. metadata=[{"x":1, "y":1000, "id":"a"}, {"x":3, "y":2000, "id":"b"}...]) and write a more complex filtering function that returns true if any element of the metadata list satisfies a condition. I am afraid that this may not be expressible as a filtering function (would the vector database support it?), and I am unsure about the performance impact.
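
To make options (b) and (c) concrete, here is a minimal plain-Python sketch of what I have in mind. It is not tied to any vector database: the toy `docs` data and the helper names `group_by_embedding` and `any_match` are made up for illustration. It groups documents by embedding (the preprocessing in b) and then keeps an embedding if any of its metadata entries satisfies a condition (the filter in c).

```python
from collections import defaultdict

# Hypothetical toy data: (doc_id, embedding, metadata).
docs = [
    ("a", (0.1, 0.9), {"x": 1, "y": 1000}),
    ("b", (0.1, 0.9), {"x": 3, "y": 2000}),
    ("c", (0.7, 0.2), {"x": 1, "y": 3000}),
]

def group_by_embedding(docs):
    """Option (b): one entry per unique embedding, linking to all its documents."""
    groups = defaultdict(list)
    for doc_id, embedding, metadata in docs:
        groups[tuple(embedding)].append({"id": doc_id, **metadata})
    return dict(groups)

def any_match(metadata_list, predicate):
    """Option (c): true if at least one metadata entry satisfies the predicate."""
    return any(predicate(m) for m in metadata_list)

groups = group_by_embedding(docs)

# Keep only the unique embeddings that have at least one document with x == 3.
survivors = [
    emb for emb, metas in groups.items()
    if any_match(metas, lambda m: m["x"] == 3)
]
```

Whether a given vector database can evaluate something like `any_match` inside its own filter language is exactly what I am unsure about.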

Could you recommend a specific approach for this situation?

1 Answer

  • Create one document type containing the embedding field ("embedding"), and another for the unique metadata ("metadata"), so that you have 22040 and 400k documents of these types, respectively.

  • Use Vespa's parent-child feature to import the vector field into the "metadata" type (the "embedding" being the parent). This ensures the vectors are not duplicated in memory.

  • Search by exact nearest neighbor. You cannot create an index on the imported vector field, but with these numbers brute force is just fine.
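
The layout described in the steps above can be modeled in plain Python to show why it avoids duplicating vectors. This is only a conceptual sketch, not Vespa schema or query syntax: the dictionaries, the `embedding_ref` field name, and the `exact_nearest` helper are all hypothetical. Each unique vector is stored once as a "parent", each metadata document holds a reference to its parent, and the search filters the metadata documents and then ranks them by brute-force distance.

```python
import math

# Parent "embedding" documents: each unique vector is stored exactly once.
# (Hypothetical toy data.)
embeddings = {
    "e1": (1.0, 0.0),
    "e2": (0.0, 1.0),
}

# Child "metadata" documents: they reference a parent instead of copying its vector.
metadata_docs = [
    {"id": "a", "embedding_ref": "e1", "x": 1},
    {"id": "b", "embedding_ref": "e1", "x": 3},
    {"id": "c", "embedding_ref": "e2", "x": 3},
]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def exact_nearest(query, k, predicate):
    """Filter the metadata documents, then rank them by exact (brute-force) distance."""
    candidates = [d for d in metadata_docs if predicate(d)]
    candidates.sort(key=lambda d: euclidean(query, embeddings[d["embedding_ref"]]))
    return [d["id"] for d in candidates[:k]]

# The two closest documents with x == 3.
hits = exact_nearest((0.9, 0.1), k=2, predicate=lambda d: d["x"] == 3)
```

The filter runs over all metadata documents while the vectors live only in the smaller parent collection, which mirrors the memory saving the parent-child approach is meant to give.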

Here's an example doing this:

1 Comment

Thanks, this is an option worth exploring. It seems to me that Vespa is ahead of other vector databases in functional features (e.g. the possibility to rerank documents, hybrid search, and this parent-child feature). Happy to stand corrected though.
