Langchain CSVLoader

Question

This is less a coding question than a gap in the documentation that I cannot find addressed anywhere online. When using the LangChain CSVLoader, which column is vectorized by the OpenAI embeddings I am using?

I ask because, with the code below, I vectorized a sample CSV, ran searches (on Pinecone), and consistently got back DISsimilar results. How do I know which column LangChain actually selects to vectorize?

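# Import paths depend on the installed LangChain version
from langchain_community.document_loaders import CSVLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# file, embeddings, pc_store and PINECONE_INDEX_NAME are defined elsewhere
loader = CSVLoader(file_path=file, metadata_columns=['col2', 'col3', 'col4', 'col5'])
langchain_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(langchain_docs)
# Drop the loader's default metadata before upserting
for doc in docs:
    doc.metadata.pop('source')
    doc.metadata.pop('row')
my_index = pc_store.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)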

I am assuming that CSVLoader then picks col1 to vectorize. But the Pinecone search results are terrible, which leads me to think some other column is being vectorized.

chifu lin · Accepted Answer · 2024-03-05 01:13:13Z

You can inspect the docs variable: it is a list of Document objects, each with a page_content and a metadata property.

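The text that gets vectorized is each Document's page_content, which is built from every column not listed in metadata_columns, one "column: value" line per column. For the details, see the LangChain csv_loader.py source code (line 98):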

content = "\n".join(
    f"{k.strip()}: {v.strip() if v is not None else v}"
    for k, v in row.items()
    if k not in self.metadata_columns
)
metadata = {"source": source, "row": i}
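
To see what this means for a given CSV without calling the loader, the joining logic can be reproduced with the standard library. This is only a sketch: build_row_content is a hypothetical helper that mirrors the snippet above, and the sample columns and values are made up for illustration.

```python
import csv
import io


def build_row_content(row, metadata_columns):
    """Mirror the quoted snippet: join every non-metadata column as 'name: value'."""
    return "\n".join(
        f"{k.strip()}: {v.strip() if v is not None else v}"
        for k, v in row.items()
        if k not in metadata_columns
    )


# Hypothetical sample data standing in for the asker's CSV.
sample_csv = io.StringIO(
    "col1,col2,col3\n"
    "a red apple,fruit,42\n"
)
row = next(csv.DictReader(sample_csv))

# Every other column is declared as metadata, so only col1 ends up in the text.
only_col1 = build_row_content(row, metadata_columns=["col2", "col3"])
print(only_col1)

# col3 is left out of metadata_columns, so it becomes part of the text as well.
col1_and_col3 = build_row_content(row, metadata_columns=["col2"])
print(col1_and_col3)
```

Two things stand out. Any column missing from metadata_columns (including one whose header differs slightly from the name you passed) is embedded along with col1, and the column name itself is part of the embedded text ("col1: ..."), so printing page_content for a few documents is a quick way to confirm what is actually being vectorized.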

