AstraDBVectorStore add_documents() returns exception "'dict' object has no attribute 'page_content'"

Question

def store_embeddings_in_astradb(embeddings,text_chunks, metadata):

    vstore = AstraDBVectorStore(
        collection_name="test",
        embedding=embedding_model,
        token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
        api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
    )
    print("after Vstore")

    # Create documents with page content, embeddings, and metadata
    documents = [
        {
            "page_content": chunk,
            "metadata": metadata
        }
        for chunk in text_chunks
    ]
    for doc in documents:
        print(f"Document structure: {doc}")
    print("after documents")

    # Add documents to AstraDB vector store
    inserted_ids = vstore.add_documents(documents)
    return inserted_ids

# List of PDF files to process
pdf_files = ["WhatYouNeedToKnowAboutWOMENSHEALTH.pdf", "Womens-Health-Book.pdf"]

# Initialize embedding model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Process each PDF file
for pdf_file in pdf_files:
    if not os.path.isfile(pdf_file):
        raise ValueError(f"PDF file '{pdf_file}' not found.")

    print(f"Processing file: {pdf_file}")

    # Extract text from PDF
    text = extract_text_from_pdf(pdf_file)

    # Split text into chunks
    text_chunks = split_text_into_chunks(text)

    # Embed text chunks
    embeddings = embed_text_chunks(text_chunks, embedding_model)

    # Extract metadata
    metadata = extract_metadata(pdf_file)

    # Store embeddings in AstraDB
    try:
        inserted_ids = store_embeddings_in_astradb(embeddings,text_chunks, metadata)
        print(f"Inserted {len(inserted_ids)} embeddings from '{pdf_file}' into AstraDB.")
    except Exception as e:
        print(f"Failed to insert embeddings for '{pdf_file}': {e}")

This is the code I am using to convert text chunks into embeddings and then store them in Astra DB. At insertion time I get the error 'dict' object has no attribute 'page_content'. How can I resolve it?

Please post a minimal reproducible example (including the full error trace).
– desertnaut, Commented Jul 18, 2024 at 12:21

Stefano L · Accepted Answer · 2024-07-24 09:36:31Z

I agree with all the remarks in Erick's answer: the LangChain vector store class takes care of computing the embeddings, and the vector store instance should be created only once, because instantiation has some overhead. Sharing a single vector store across calls gives a substantial performance gain.

Now to the core of the problem: the code above mixes LangChain abstractions with bare-bones Python structures (dictionaries). Since you are using the LangChain vector store (an AstraDBVectorStore instance), you should pass the add_documents method a list of the corresponding LangChain document abstraction, Document, instead of dictionaries. Please add the following import and replace the documents = ... statement as follows:

from langchain_core.documents import Document

[...]

# replace the `documents = ...` part with:

    documents = [
        Document(
            page_content=chunk,
            metadata=metadata,
        )
        for chunk in text_chunks
    ]

[...]

This should now work as intended.
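To see why the dictionaries trigger this exact message, here is a minimal sketch that needs no LangChain install. FakeDocument and collect_texts are hypothetical stand-ins, not LangChain's actual code; the only assumption is that add_documents reads doc.page_content as an attribute, which is what the error message implies.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_texts(documents):
    # Attribute access: works on objects that have .page_content, fails on dicts.
    return [doc.page_content for doc in documents]


chunks = ["first chunk", "second chunk"]
metadata = {"source": "example.pdf"}  # illustrative metadata

# Dictionaries have keys, not attributes, so attribute access raises.
as_dicts = [{"page_content": c, "metadata": metadata} for c in chunks]
try:
    collect_texts(as_dicts)
    dict_error = None
except AttributeError as exc:
    dict_error = str(exc)

# Document-like objects expose page_content as an attribute, so this works.
as_docs = [FakeDocument(page_content=c, metadata=metadata) for c in chunks]
texts = collect_texts(as_docs)

print(dict_error)
print(texts)
```

The dictionary keys were named correctly all along; what differs is key lookup (doc["page_content"]) versus attribute lookup (doc.page_content).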

Side note:

If you don't feel like creating Documents for the sole transient purpose of passing them to the vector store's add_documents method, keep in mind that you can also call add_texts and pass two parallel lists, one of texts and one of metadata dicts, directly to the vector store:

vstore.add_texts(["text 1", "text 2", ...], metadatas=[{...}, {...}, ...])

(The above also supports a handy third list argument, ids=..., if you want to impose your own string IDs on the documents. That helps when you re-run the insertion, since it allows you to avoid storing duplicate entries in the vector store.)
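One way to get IDs that stay the same across re-runs is to derive them from the content itself. The sketch below is only an illustration: the helper build_add_texts_args and the hashing scheme (source name, chunk position, chunk text) are assumptions, not something the library prescribes. It prepares the three parallel lists; the actual add_texts call is left as a comment because it needs a live vector store.

```python
import hashlib


def build_add_texts_args(text_chunks, metadata, source):
    """Build parallel lists of texts, metadata dicts, and deterministic string IDs.

    Hypothetical helper: the ID is a SHA-256 digest of the source name,
    the chunk position, and the chunk text.
    """
    texts = list(text_chunks)
    # One metadata dict per text; copied so later edits do not affect all entries.
    metadatas = [dict(metadata) for _ in texts]
    ids = [
        hashlib.sha256(f"{source}:{i}:{text}".encode("utf-8")).hexdigest()
        for i, text in enumerate(texts)
    ]
    return texts, metadatas, ids


texts, metadatas, ids = build_add_texts_args(
    ["text 1", "text 2"], {"source": "example.pdf"}, "example.pdf"
)

# With a real vector store instance this would be passed on as:
# vstore.add_texts(texts, metadatas=metadatas, ids=ids)
```

Because the IDs depend only on the input, processing the same file twice produces the same IDs instead of fresh ones.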

Good answer! I was going to say the same thing about using the Document class.

Erick Ramirez · Accepted Answer · 2024-07-24 09:18:56Z

I'm struggling to understand your code but I suspect the issue is that the variable scope is incorrect. If you include a minimal code sample plus steps to replicate the problem, I'd be happy to help you troubleshoot it.

As a side note, I would suggest not creating the AstraDBVectorStore object in a function because it is not necessary. You should only instantiate it once and share it for the life of your application.

Also, when you call AstraDBVectorStore.add_documents(), it automatically generates an embedding for each document and then stores it in Astra DB, so there is no need to call embed_text_chunks() separately. In fact, I can't see the embeddings variable being used anywhere. Cheers!
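The structure suggested above might look roughly like the following sketch. RecordingStore is a hypothetical stand-in that only records what it receives, so the example runs without Astra DB; in real code it would be the single AstraDBVectorStore instance. The function store_chunks and the file names are illustrative as well.

```python
class RecordingStore:
    """Hypothetical stand-in for a vector store; it only records its input."""

    instances = 0  # counts how many stores were ever created

    def __init__(self):
        RecordingStore.instances += 1
        self.rows = []

    def add_texts(self, texts, metadatas=None):
        metadatas = metadatas or [{} for _ in texts]
        start = len(self.rows)
        self.rows.extend(zip(texts, metadatas))
        # Return one string ID per stored text.
        return [str(i) for i in range(start, len(self.rows))]


def store_chunks(vstore, text_chunks, metadata):
    # The store is passed in rather than created here, and no embeddings
    # are computed by hand: that job is left to the vector store.
    texts = list(text_chunks)
    return vstore.add_texts(texts, metadatas=[dict(metadata) for _ in texts])


vstore = RecordingStore()  # created once, before the loop over files

files = {"a.pdf": ["a1", "a2"], "b.pdf": ["b1"]}  # illustrative chunks per file
inserted = {
    name: store_chunks(vstore, chunks, {"source": name})
    for name, chunks in files.items()
}
```

Passing the store as an argument keeps the per-file function cheap and makes it easy to swap in a test double like the one above.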
