In my project, I have many documents whose embeddings are identical but whose metadata differs, and that metadata may influence retrieval through filtering (400K documents but only 22,040 distinct embeddings). The duplication arises because there are multiple editions of the same document: each edition produces the same embedding but carries different metadata values.
I am considering three strategies to address this issue:
a) Insert all documents with matching embeddings but varying metadata into the database and apply filters as needed.
b) Preprocess the documents to build an index that eliminates embedding duplicates in the vector database. Each entry in this index would link to all original documents sharing that embedding, and metadata filtering would then be performed outside the vector database. I assume this would make the vector search faster in my situation, but it would add extra complexity outside the vector database.
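To make option (b) concrete, here is a minimal sketch of the deduplication index I have in mind. It assumes duplicates are bit-identical (not merely close), and the document shape (`"embedding"` / `"metadata"` keys) is hypothetical, to be adapted to the real schema:

```python
import hashlib
from collections import defaultdict

def embedding_key(embedding):
    # Hash the embedding's string form so exact duplicates collapse to one
    # key (assumes duplicate embeddings are bit-identical, not merely close).
    return hashlib.sha1(str(embedding).encode()).hexdigest()

def build_dedup_index(documents):
    """Group documents by embedding.

    Returns (unique, index) where `unique` maps each key to one
    representative embedding (what actually goes into the vector DB)
    and `index` maps that key to every metadata variant.
    """
    index = defaultdict(list)   # embedding key -> all metadata variants
    unique = {}                 # embedding key -> representative embedding
    for doc in documents:
        key = embedding_key(doc["embedding"])
        index[key].append(doc["metadata"])
        unique.setdefault(key, doc["embedding"])
    return unique, index

def post_filter(hit_keys, index, predicate):
    """Expand vector-search hits back to original documents and apply
    the metadata filter outside the vector database."""
    return [meta
            for key in hit_keys
            for meta in index.get(key, [])
            if predicate(meta)]
```

With 400K documents collapsing to ~22K vectors, the search side shrinks by roughly 18x, at the cost of maintaining `index` and running `post_filter` yourself.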
c) Compose the metadata of all documents that share the same embedding into a list (i.e. metadata=[{"x": 1, "y": 1000, "id": "a"}, {"x": 3, "y": 2000, "id": "b"}, ...]) and write a complex filtering function that returns true if any element of the metadata list satisfies a condition. I am afraid this may not be expressible as a filtering function (would the vector database support it?), and I am also concerned about the impact on performance.
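A sketch of what option (c) would look like. The field names are the hypothetical ones from the example above; note that most vector databases will not run an arbitrary "any element satisfies f" predicate, though some support array-membership operators if the per-version metadata is flattened into per-field arrays (the operator name varies by engine):

```python
def compose_metadata(variants):
    """Flatten a list of per-version metadata dicts into per-field arrays,
    e.g. [{"x": 1, "id": "a"}, {"x": 3, "id": "b"}]
      -> {"x": [1, 3], "id": ["a", "b"]}.
    Some engines can then express "any version has x == 1" as an
    array-membership filter on the flattened field."""
    composed = {}
    for variant in variants:
        for field, value in variant.items():
            composed.setdefault(field, []).append(value)
    return composed

def any_variant_matches(variants, condition):
    # Client-side fallback when the DB's filter DSL cannot express
    # the "any element satisfies the condition" check.
    return any(condition(v) for v in variants)
```

The flattened form loses the pairing between fields (it cannot express "some version with x == 1 AND y == 1000"), which is one reason the client-side fallback may still be needed.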
Could you recommend a specific approach for this situation?