1

Not a coding question, but a documentation omission that is nowhere mentioned online at this point. When using the Langchain CSVLoader, which column is being vectorized via the OpenAI embeddings I am using?

I ask because viewing this code below, I vectorized a sample CSV, did searches (on Pinecone) and consistently received back DISsimilar responses. How do know which column Langchain is actually identifying to vectorize?

loader = CSVLoader(file_path=file, metadata_columns=['col2', 'col3', 'col4','col5'])
langchain_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(langchain_docs)
for doc in docs:
    doc.metadata.pop('source')
    doc.metadata.pop('row')
my_index = pc_store.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)

I am assuming the CSVLoader is then identifying col1 to vectorize. But, searches of Pinecone are terrible, leading me to think some other column is being vectorized.

1 Answer 1

1

You can check docs variable, this is Document objects of list that contain content and metadata property.

Vectorized use Document's content and for a more detailed content you can refer to langchain csv_loader.py source code (line 98).

content = "\n".join(
                f"{k.strip()}: {v.strip() if v is not None else v}"
                for k, v in row.items()
                if k not in self.metadata_columns
            )
metadata = {"source": source, "row": i}
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.