
I am tasked with building a production-level RAG application over CSV files. Possible approaches:

  • Embedding --> vector DB --> take user query --> similarity or hybrid search --> LLM --> result
  • CSV to pandas DataFrame --> ask LLM for Python code to answer the user prompt --> run the query on the df --> give the result to the LLM for analysis --> result
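A minimal sketch of the second pipeline, with both LLM calls stubbed out so it runs standalone. The function names, the hard-coded "generated" pandas code, and the `exec`-based execution are assumptions for illustration only; in production the two `llm_*` functions would call your real model and the generated code would need sandboxing.

```python
import pandas as pd

def llm_generate_pandas_code(question: str, columns: list) -> str:
    # Stand-in for an LLM call that turns the question + schema into a
    # pandas expression. A real prompt would also include dtypes, df.head(), etc.
    return "result = df.loc[df['region'] == 'EU', 'revenue'].sum()"

def llm_analyze(question: str, result) -> str:
    # Stand-in for the second LLM call that explains the raw query result.
    return f"Answer to {question!r}: {result}"

def answer(df: pd.DataFrame, question: str) -> str:
    code = llm_generate_pandas_code(question, list(df.columns))
    scope = {"df": df}
    exec(code, {}, scope)  # never exec untrusted model output unsandboxed!
    return llm_analyze(question, scope["result"])

df = pd.DataFrame({"region": ["EU", "US", "EU"], "revenue": [10, 20, 5]})
print(answer(df, "Total EU revenue?"))
```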

The first approach gives vague answers, since it applies an unstructured-data technique to structured data; the second is doing very well, but I have doubts about its scalability. I need suggestions.

1 Answer


It depends on how many files you have in production and how large they are. Can you describe your use case a bit more?

  • How many CSV files do you expect to have?
  • Are the files related to each other, like tables in a database?

One issue you are going to run into, even with approach #2, is: how do you know which CSV file to load into the dataframe?

One approach that worked well for me, though I haven't used it in a production environment, was:

  1. During indexing, instead of loading the whole CSV into the vector DB, use an LLM to summarize the CSV file. Index the summary, and make sure the file path is included in the document metadata.
  2. At retrieval time, you should get good hits thanks to the table summaries. Find the source CSV from the document's metadata and load that into the dataframe.
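A minimal sketch of these two steps, with a stub summarizer and a naive keyword-overlap retriever standing in for the LLM and the vector store. All function names are illustrative assumptions; in production `summarize_csv` would be an LLM call and `retrieve` a similarity search over embedded summaries.

```python
import re
from pathlib import Path

import pandas as pd

def summarize_csv(path: str) -> str:
    # Stand-in for an LLM-written summary; column names + row count are
    # often enough retrieval signal for small tables.
    df = pd.read_csv(path)
    return f"Table with columns {', '.join(df.columns)}; {len(df)} rows."

def build_index(csv_paths):
    # Index the *summary*, not the rows, and keep the file path in metadata.
    return [{"text": summarize_csv(p), "metadata": {"source": p}} for p in csv_paths]

def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(index, query: str) -> str:
    # Naive keyword-overlap score; in production this is your vector
    # similarity (or hybrid) search.
    best = max(index, key=lambda d: len(tokens(query) & tokens(d["text"])))
    return best["metadata"]["source"]

# Demo with two tiny files written to disk.
Path("sales.csv").write_text("region,revenue\nEU,10\nUS,20\n")
Path("staff.csv").write_text("name,department\nAna,engineering\n")
index = build_index(["sales.csv", "staff.csv"])
source = retrieve(index, "what is the revenue by region?")
df = pd.read_csv(source)  # load only the matching CSV into a dataframe
```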

Obviously this approach might get expensive if you have tons of CSV files.

There are a few other methods, like Chain-of-Table and the Mix Self-Consistency approach, that I have read about but not implemented.


3 Comments

That's great. The CSV files will have approximately 200 to 300 rows, and we may have around 10 to 20 of them, at least for now. We are getting the CSV files from an Oracle endpoint that is managed by other teams; I am tasked with building the RAG end. We also have Pinecone under our umbrella.
Oh, 200-300 rows. You can actually load the whole thing into the context window; most LLMs do a good job of answering questions if you give them a whole CSV. Have you tried just putting the entire CSV in the prompt and asking it questions? The question still remains how you find the right CSV file among the 20-30 you have... you might still have to store the summaries in your vector DB, but instead of loading the CSV file into a dataframe, maybe put the whole file into the prompt.
Since this could grow in production, I was thinking of putting the whole CSV into the LLM, and falling back to the pandas approach if we hit the token limit. Do you think this is doable?
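The fallback discussed in the comments could be sketched as a simple router. The 4-characters-per-token estimate and the budget value below are rough assumptions; in practice you would count tokens with your model's actual tokenizer.

```python
# Route a CSV either into the prompt directly (if it fits) or to the
# pandas code-generation pipeline (if it doesn't).
TOKEN_BUDGET = 8000  # illustrative context budget, not a real model limit

def route(csv_text: str) -> str:
    # ~4 characters per token is a common rough heuristic for English text.
    approx_tokens = len(csv_text) // 4
    return "full_csv_in_prompt" if approx_tokens <= TOKEN_BUDGET else "pandas_codegen"
```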
