In my project, I have many documents whose embeddings are identical but whose metadata differs, and that metadata may influence retrieval through filtering (400K documents but only 22,040 distinct embeddings). The duplication arises because there are multiple editions of the same document: each edition produces the same embedding but carries different metadata values.
I am considering three strategies to address this issue:
a) Insert all documents with matching embeddings but varying metadata into the database and apply filters as needed.
b) Preprocess the documents to build an index that eliminates embedding duplicates in the vector database. Each entry in this index would link to all original documents sharing that embedding, and metadata filtering would then be performed outside the vector database. I assume this would make the vector search faster in my situation, but it would add extra complexity outside the vector database.
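To make option (b) concrete, here is a minimal sketch of the deduplication index I have in mind. It assumes duplicates are bit-identical (not merely close), and the document shape (`"embedding"` / `"metadata"` keys) is hypothetical, to be adapted to the real schema:

```python
import hashlib
from collections import defaultdict

def embedding_key(embedding):
    # Hash the embedding's string form so exact duplicates collapse to one
    # key (assumes duplicate embeddings are bit-identical, not merely close).
    return hashlib.sha1(str(embedding).encode()).hexdigest()

def build_dedup_index(documents):
    """Group documents by embedding.

    Returns (unique, index) where `unique` maps each key to one
    representative embedding (what actually goes into the vector DB)
    and `index` maps that key to every metadata variant.
    """
    index = defaultdict(list)   # embedding key -> all metadata variants
    unique = {}                 # embedding key -> representative embedding
    for doc in documents:
        key = embedding_key(doc["embedding"])
        index[key].append(doc["metadata"])
        unique.setdefault(key, doc["embedding"])
    return unique, index

def post_filter(hit_keys, index, predicate):
    """Expand vector-search hits back to original documents and apply
    the metadata filter outside the vector database."""
    return [meta
            for key in hit_keys
            for meta in index.get(key, [])
            if predicate(meta)]
```

With 400K documents collapsing to ~22K vectors, the search side shrinks by roughly 18x, at the cost of maintaining `index` and running `post_filter` yourself.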
c) Compose the metadata of all documents that share the same embedding into a list (i.e. metadata=[{"x": 1, "y": 1000, "id": "a"}, {"x": 3, "y": 2000, "id": "b"}, ...]) and write a complex filtering function that returns true if any element of the metadata list satisfies a condition. I am afraid this may not be expressible as a filtering function (would the vector database support it?), and I am also concerned about the impact on performance.
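A sketch of what option (c) would look like. The field names are the hypothetical ones from the example above; note that most vector databases will not run an arbitrary "any element satisfies f" predicate, though some support array-membership operators if the per-version metadata is flattened into per-field arrays (the operator name varies by engine):

```python
def compose_metadata(variants):
    """Flatten a list of per-version metadata dicts into per-field arrays,
    e.g. [{"x": 1, "id": "a"}, {"x": 3, "id": "b"}]
      -> {"x": [1, 3], "id": ["a", "b"]}.
    Some engines can then express "any version has x == 1" as an
    array-membership filter on the flattened field."""
    composed = {}
    for variant in variants:
        for field, value in variant.items():
            composed.setdefault(field, []).append(value)
    return composed

def any_variant_matches(variants, condition):
    # Client-side fallback when the DB's filter DSL cannot express
    # the "any element satisfies the condition" check.
    return any(condition(v) for v in variants)
```

The flattened form loses the pairing between fields (it cannot express "some version with x == 1 AND y == 1000"), which is one reason the client-side fallback may still be needed.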
Could you recommend a specific approach for this situation?